MongoDB shards


MongoDB shards

A shard is a single MongoDB instance that holds a subset of the sharded data. Shards can be deployed as replica sets to increase availability and provide redundancy. The combination of multiple shards creates a complete data set.

What is sharding?

Sharding is the process of distributing data across multiple hosts. In MongoDB, sharding is achieved by splitting large data sets into small data sets across multiple MongoDB instances.

How does sharding work?

The underlying hardware becomes the main limitation when dealing with high throughput applications or very large databases. High query rates can stress disk drives' CPU, RAM, and I/O capacity, resulting in a poor end-user experience. There are two types of scaling methods.

  • Vertical scaling
  • Horizontal Scaling

Vertical scaling is the traditional way of increasing the hardware capabilities of a single server. The process involves upgrading the CPU, RAM, and storage capacity. However, technological limitations and cost constraints often challenge upgrading a single server.

Horizontal scaling

This method divides the dataset into multiple servers and distributes the database load among each server instance. Distributing the load reduces the strain on the required hardware resources and provides redundancy in case of failure.

However, horizontal scaling increases the complexity of underlying architecture. MongoDB supports horizontal scaling through sharding.

MongoDB sharding works by creating a cluster of MongoDB instances consisting of at least three servers. That means sharded clusters are comprised of three main components:

  • The shard
  • Mongos
  • Config servers

Shard

A shard is a single MongoDB instance that holds a subset of the sharded data. Shards can be deployed as replica sets to increase availability and provide redundancy. The combination of multiple shards creates a complete data set. For example, a 2 TB data set can be broken down into four shards containing 500 GB of data from the original data set.

Mongos

Mongos act as the query router providing a stable interface between the application and the sharded cluster. This MongoDB instance is responsible for routing the client requests to the correct shard.

Config Servers

Configuration servers store the metadata and the configuration settings for the whole cluster.

Components

  • The application communicates with the routers (mongos) about the query to be executed.
  • The mongos instance consults the config servers to check which shard contains the required data set to send the query to that shard.
  • Finally, the result of the query will be returned to the application.
  • It’s important to remember that the config servers also work as replica sets.

Shard Keys

When sharding a MongoDB collection, a shard key gets created as one of the initial steps. The “shard key” is used to distribute the MongoDB collection’s documents across all the shards. The key consists of a single field or multiple fields in every document. The sharded key is immutable and cannot be changed after sharding, and a sharded collection only contains a single shard key.

The shard key can directly impact the performance of the cluster. Hence can lead to bottlenecks in applications associated with the cluster. To mitigate this, before sharding the collection, the shard key must be created based on:

  • The schema of the data set
  • How the data set is queried

Chunks

Chunks are subsets of shared data. MongoDB separates sharded data into chunks distributed across the shards in the shared cluster. Based on the shard key, each chunk has an inclusive lower and exclusive upper range. A balancer specific for each cluster handles the chunk distribution.

The balancer runs as a background job and distributes the chunks needed to achieve an even balance of chunks across all shards. This process is called even chuck distribution.

Sharding benefits & limitations

Benefits

  • In traditional replication scenarios, the primary node handles the bulk of write operations. At the same time, the secondary servers are limited to read-only operations or maintaining the backup of the data set. However, as sharding utilizes shards with replica sets, all queries are distributed among all the nodes in the cluster.
  • As each shard consists of a subset of the complete data set, adding additional shards will increase the cluster's storage capacity without complex hardware restructuring.
  • Replication requires vertical scaling when handling large data sets, and this requirement can lead to hardware limitations and prohibitive costs compared to the horizontal scaling approach. But, because MongoDB utilizes horizontal scaling, the workload is distributed. When the need arises, additional servers can be added to a cluster.
  • In sharding, both read and write performance directly correlates to the number of server nodes in the cluster. This process provides a quick method to increase the cluster’s performance by simply adding additional nodes.
  • A sharded cluster can continue to operate even if single or multiple shards are unavailable. While the data on those shards are unavailable, the client application can still access all the other available shards within the cluster without any downtime. In production environments, all individual shards deploy as replica sets, further increasing the availability of the cluster.

Limitations

  • Sharding requires careful planning and maintenance to maintain a sharded cluster—because of the complexity involved.
  • When you shard a MongoDB collection, there is no way to unshard the sharded collection.
  • The shard key directly impacts the overall performance of the underlying cluster, as it is used to identify all the documents within the collections.
  • There are some operational restrictions within a MongoDB sharded environment. For example, the geoSearch command is not supported within a sharded environment.
  • In an instance where a shard key or a prefix of compound shard key is not present, Mongo will perform a broadcast operation that queries all the shards in the cluster, which can result in long-running query tasks.