How to efficiently run range-based queries in Cassandra


I wanted to use Cassandra in a project, but it’s important that I’m able to do a few ranged queries (for example, 12345 <= time < 67890 ).

Unfortunately, Cassandra’s design seems to preclude this sort of query, except in two cases (and then only for number or date fields): if the queried column has a secondary index, or if it is a clustering column. (Am I right about this? I couldn’t be sure from the documentation; it looks like a secondary index would only allow the = operator.)

My main question is: is there a way to efficiently run queries BETWEEN two numbers? Maybe on clustering columns? And if so, how many of these columns can accept such queries (i.e., can I have 3 columns that support ranged queries without a major performance hit)?

Related:
https://stackoverflow.com/questions/11348158/cassandra-query-with-where-clause-containing-greather-or-lesser-than-and
https://stackoverflow.com/questions/29692738/how-do-secondary-indexes-work-in-cassandra
https://stackoverflow.com/questions/24949676/difference-between-partition-key-composite-key-and-clustering-key-in-cassandra/24953331#24953331


Method 1

My main question is: is there a way to efficiently run queries BETWEEN two numbers?

Yes, but it depends on how you build the PRIMARY KEY. Let’s say that I have a table of users, and I want to be able to query them by age and state of residence. I could build the table with a PK like this:

PRIMARY KEY (state,age,user_id)
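
For context, here is a minimal sketch of what the full table definition might look like. The answer only gives the key, so the column types below are assumptions inferred from the sample output further down.

```cql
-- Sketch only: column types are assumed, not taken from the original answer.
CREATE TABLE users_by_state_and_age (
    state   text,   -- partition key: restricted with = in the WHERE clause
    age     int,    -- first clustering column: rows are sorted by age within a state
    user_id uuid,   -- second clustering column: keeps each row unique
    name    text,
    PRIMARY KEY (state, age, user_id)
);
```

Because rows inside one state partition are stored in age order, a range on age reads one contiguous slice of that partition.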

Then, a query like this works:

> SELECT * FROM users_by_state_and_age
  WHERE state='MN'
      AND age >= 40
      AND age < 50;

 state | age | user_id                              | name
-------+-----+--------------------------------------+---------
    MN |  44 | 2176e0b2-313e-472a-879b-7cd2c404846a |  Jessie
    MN |  46 | 9cd1fa2d-ea7e-417f-a3fc-bf96d77b1aba |   Aaron
    MN |  46 | e6d28709-6e8f-4455-b158-3cc5c8f58b5c | Coriene

The thing with Cassandra is that it can support a query like this, as long as you design the table to support it.

Note: In this use case, I’d probably want an additional partition key column, as partitions keyed by state alone would likely grow unbounded and eventually cause problems with partition size.
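
One way the note above could be applied is sketched here. The `city` column, the table name, and the value 'Minneapolis' are hypothetical, invented only to illustrate splitting each state into smaller partitions; the answer does not say which extra column it would use.

```cql
-- Hypothetical: `city` is an invented column used only to split each state
-- into smaller partitions. Column types are assumed.
CREATE TABLE users_by_state_city_and_age (
    state   text,
    city    text,
    age     int,
    user_id uuid,
    name    text,
    PRIMARY KEY ((state, city), age, user_id)
);

-- Both partition key columns are restricted with =; age can still take a range
-- because it remains a clustering column.
SELECT * FROM users_by_state_city_and_age
 WHERE state = 'MN'
   AND city = 'Minneapolis'
   AND age >= 40
   AND age < 50;
```

The trade-off is that every query now has to supply the extra partition key column as well.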

Edit

If one instead uses PRIMARY KEY ((state, age), user_id) to solve the problem mentioned in your note, would one still be able to run range queries on the age column?

No. age in this case is part of the partition key. Cassandra needs the complete partition key to compute a hash and determine where in the cluster (which node) the data is stored. For a range query on even a partial partition key, Cassandra would have to compute hashes for all range values and would end up sending requests to multiple nodes (which you don’t want).
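
A rough sketch of that situation, assuming the same columns as before and a hypothetical table name. The IN form relies on my understanding that recent Cassandra versions accept IN on a partition key column; treat it as an illustration rather than a recommendation.

```cql
-- Sketch only: table name and column types are assumptions.
CREATE TABLE users_by_state_age (
    state   text,
    age     int,
    user_id uuid,
    name    text,
    PRIMARY KEY ((state, age), user_id)
);

-- Not accepted as an ordinary query: age is part of the partition key,
-- so a range on it cannot be hashed to a single partition.
-- SELECT * FROM users_by_state_age
--  WHERE state = 'MN' AND age >= 40 AND age < 50;

-- Listing every value reads ten separate partitions, one per (state, age) pair,
-- which may live on different nodes.
SELECT * FROM users_by_state_age
 WHERE state = 'MN'
   AND age IN (40, 41, 42, 43, 44, 45, 46, 47, 48, 49);
```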

Could secondary indexes be used for range queries?

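No, that doesn’t work. A regular secondary index only serves equality (=) lookups on the indexed column, which matches your reading of the documentation.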

And what are the costs of secondary indexes, is it just the space they take?

Secondary indexes are essentially lookup tables which tell Cassandra which node is responsible for the indexed data. So instead of reading data for a table from one node, it’s reading index data from one node and likely being redirected to another node for the rows themselves. As a result, a secondary index query will take at least twice as long.

You’ll also take a hit at write-time, as the index is kept in-sync with the table.
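
For reference, a minimal sketch of what a secondary index looks like in CQL. It assumes the users_by_state_and_age table from the answer has a text `name` column; the index name is made up for the example.

```cql
-- Assumes users_by_state_and_age has a text column `name`;
-- users_name_idx is a hypothetical index name.
CREATE INDEX users_name_idx ON users_by_state_and_age (name);

-- An equality lookup on the indexed column is the kind of query the index serves.
SELECT * FROM users_by_state_and_age
 WHERE name = 'Aaron';
```

Every insert or update to the table also has to update this index, which is the write-time cost mentioned above.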

Somewhat related: https://stackoverflow.com/questions/29692738/how-do-secondary-indexes-work-in-cassandra/


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
