MongoDB – query performance when sharding by _id

All we need is an easy explanation of the problem, so here it is.

When we shard a collection by the MongoDB’s _id field, we could achieve the most uniform distribution there is but the docs indicate query performance doesn’t scale well due to the scatter gather pattern used.

In Elasticsearch, similar behaviour occurs except the queries on those shards are executed concurrently. Is that the case with MongoDB as well? Or does Mongos scatter the queries to each of the shards and gather the results in sequence and that is why it’s inefficient? I couldn’t any detailed information anywhere on how this works internally

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

See mongos Broadcast Operations:

mongos instances broadcast queries to all shards for the collection unless the mongos can determine which shard or subset of shards stores this data.

After the mongos receives responses from all shards, it merges the data and returns the result document. The performance of a broadcast operation depends on the overall load of the cluster, as well as variables like network latency, individual shard load, and number of documents returned per shard. Whenever possible, favor operations that result in targeted operation over those that result in a broadcast operation.

Looks like, mongos queries the shards concurrently. At least term "broadcast" is a clear indication for that. The response time of your query is mainly determined by the slowest response from all the shards I would assume.

Method 2

The answer is yes.

The scattered query is what you get when you use _id as the sharding key. The query is sent to all shards and result is "gathered" from them.
As here is said

If a query does not include the shard key, the mongos must direct the
query to all shards in the cluster. These scatter gather queries can
be inefficient. On larger clusters, scatter gather queries are
unfeasible for routine operations.

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply