This is a frequently asked question from users who are new to distributed systems and high-velocity, internet-scale environments.
Why is counting partitions in a table with COUNT() not a good idea in Cassandra?
One of the issues we constantly come across is users trying to perform a
COUNT() in Cassandra. More often than not, the question they ask is "why does it timeout?". Why indeed?
If you're a DBA for a relational database (RDBMS), performing a count of records in a table is something you do regularly. The answer to the question is simple enough, but it requires understanding the fundamentals of distributed architectures and how Cassandra works. Let us begin…
A star is born
Cassandra uses a log-structured merge-tree (LSM tree) for storing data. Briefly, it means that writes (INSERT, UPDATE and DELETE statements) are written to disk at some interval in a sequential manner, like a log file. These files, known as SSTables, are immutable, meaning they do not change once written to disk, thus avoiding a read-before-write and making writes very fast.
Consider a simple table with multiple columns. In the normal course of a day, a new partition is inserted into the table with values for 2 columns. At some point, the memtable gets flushed to disk written as an SSTable. This new inserted partition is the equivalent of one record.
Some time later, another column in the same partition is inserted into the table. This too counts as one record. After a few seconds, another column in the same partition is inserted then the memtable is flushed to disk. Since these 2 fragments for the same partition were in the memtable at the same time, they are merged together before getting written to an SSTable thus only "counting" as one record.
Later that day, one of the existing columns for the same partition gets updated with a new value. Since Cassandra does not perform a read before it writes to disk, Cassandra does not know whether the column for the partition already exists in other SSTables so the update is just inserted as another record which eventually makes it to another SSTable (yes, updates in Cassandra are just inserts under-the-hood aka "upserts").
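The sequence above can be sketched in CQL. The table, key and values here are hypothetical, purely to illustrate how one logical partition ends up as three records across three SSTables:

```sql
-- Hypothetical table for the example (id is the partition key)
CREATE TABLE users_by_id (
    id    uuid PRIMARY KEY,
    name  text,
    email text,
    city  text
);

-- Day 1: new partition with values for 2 columns
-- -> flushed as record 1 in SSTable 1
INSERT INTO users_by_id (id, name)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'alice');

-- Later: two more column writes for the SAME partition land in the same
-- memtable, so they merge and flush together -> record 2 in SSTable 2
INSERT INTO users_by_id (id, email)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'alice@example.com');
INSERT INTO users_by_id (id, city)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Sydney');

-- Later still: an UPDATE is just another insert (an "upsert") with no
-- read-before-write -> record 3 in SSTable 3
UPDATE users_by_id
   SET city = 'Melbourne'
 WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;
```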
In this example, 3 records exist in 3 different SSTables which are in fact fragments of the same single partition. But Cassandra does not know that these 3 records are for the same partition until it reads that partition. And really, this is what an unbounded query (a SELECT COUNT(*) with no WHERE clause) has to do:
- it reads ALL the partitions
- across ALL the SSTables (and memtables)
- across ALL the nodes in the Cassandra ring
in order to come up with the result.
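In CQL terms (the table name and key are hypothetical), the difference between an unbounded and a bounded query looks like this:

```sql
-- Unbounded: no WHERE clause, so every partition in every SSTable on
-- every node must be read -- this is the query that times out
SELECT COUNT(*) FROM users_by_id;

-- Bounded: restricted to a single partition key, so only that partition's
-- fragments are read; the cost stays constant as the cluster grows
SELECT COUNT(*) FROM users_by_id
 WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;
```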
On a small table in a small cluster, sure it’s not a big deal running a count query. But imagine if you had hundreds of SSTables worth over 500GB per node on a cluster with dozens [if not hundreds] of nodes. See how such an innocent query becomes so expensive? It has to perform a full table scan across your cluster which is why your query will timeout before it has the chance to complete — it does not scale.
A galaxy full of stars
To complicate matters, consider that while the count is in progress, partitions are constantly being created and updated — this is big data after all and you’re using Cassandra because you have a scale problem. Even if you were lucky enough to get a result, is that count valid? You might as well be counting stars. Because by the time you’ve counted half the stars in the night sky given enough time, chances are your count is obsolete because the universe is not static — new stars are born and old stars die all the time. Similarly, your table is not static — new partitions are created, existing partitions are updated and even some are deleted.
"Can I use a counter?", you ask. Maybe. If your use case is such that you only write a partition once and never overwrite it again, ever. If your use case involves overwrites then you have no way of working out whether it is the first write or not so your counter becomes useless. You might come up with some sort of algorithm and store a count somehow but in all likelihood, your "counter" solution won’t scale since it will probably involve (a) a read-before-write, (b) an additional write for every operation, or (c) both.
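For completeness, here is what the counter approach might look like. The schema below is a hypothetical sketch, and it only stays accurate if every partition is written exactly once:

```sql
-- Counter tables may only contain counter columns besides the primary key
CREATE TABLE row_counts (
    table_name text PRIMARY KEY,
    row_count  counter
);

-- Incremented alongside each FIRST insert of a partition; with overwrites
-- (upserts) you cannot tell a first write from a rewrite, so the count drifts
UPDATE row_counts
   SET row_count = row_count + 1
 WHERE table_name = 'users_by_id';
```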
If you must
You can instead get an estimate from nodetool tablestats, but that's exactly what you would get — an estimate. It suffers from the same problem: when Cassandra flushes the memtable to disk, it knows how many keys are in that SSTable and updates the table stats accordingly, but it has no way of knowing which of those keys are fragments of the same partition in other SSTables. While we're here, be aware that deletes are also inserts under-the-hood, i.e. a tombstone is inserted into the table, so it is counted in tablestats too. The count only gets decremented after the tombstones are compacted away once gc_grace_seconds has elapsed.
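For example (the keyspace and table names are placeholders), the estimate appears as "Number of partitions (estimate)" in the output:

```shell
# Prints per-table statistics, including the partition count estimate
nodetool tablestats my_keyspace.users_by_id
```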
Finally, if you really must do a count, do it with DSE Analytics or DSE Search, but those are for another blog post. I'll get around to it sometime after I finish counting stars, maybe…

Note: I've oversimplified how the table stats get updated to make this easy to digest for the intended audience. In reality, algorithms such as HyperLogLog are used to improve the accuracy of the aggregated values.
2 years after I first published this blogpost, along came DataStax Bulk Loader (aka DSBulk). It is a tool for efficiently loading and unloading data from Apache Cassandra though that is not the extent of its abilities.
DSBulk is the recommended tool for loading or unloading data in CSV or JSON format, and it performs up to 4x faster than the cqlsh COPY command. It also has a nice feature for counting data in large tables in a distributed manner.
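A minimal invocation looks like this (the keyspace and table names are placeholders). dsbulk count reads the table token range by token range, spreading the work across the nodes instead of funnelling everything through a single coordinator:

```shell
# Counts rows in the table in a distributed fashion
dsbulk count -k my_keyspace -t users_by_id
```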
And yes, DataStax made it freely available to open-source Apache Cassandra users. For details, see Counting data in tables with DSBulk. Cheers!