Sharding is a mechanism widely used in today's most popular database systems: an effective way to divide, categorize, and organize data into manageable chunks. In some databases, such as MySQL, the use of sharding is an architectural decision, while in others, such as MongoDB, sharding is a fully supported, native feature. Used well, it can play a key role in scaling a system to meet organizational or business demands.
Sharding can be a complicated concept, not only because of how specific systems treat and support it, but because it can be tricky to determine the correct situations in which you should use it. In MongoDB, there's real risk in misusing sharding — once you've set up your shards, it can be very difficult to undo them. It's therefore key to understand the principles of sharding before actually applying them to your system. In this post, we'll look at how sharding is defined and should be thought about, specifically in MongoDB. In a future post, we'll dig more into actual how-tos and sharding processes.
What is Sharding?
Sharding is essentially a methodology for dividing your database into smaller, more manageable parts. As the name "sharding" implies, it lets you break off subsets of data, known as shards, so the system can allocate and deploy its hardware resources as efficiently as possible. The MongoDB Manual defines sharding as "a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations." In MongoDB, there are many different conceptual schemas you can use to categorize, divide, and define shards. Some common examples of logic you might use to organize shards are application functionality, user location, or a mathematical function applied to a key.
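To make the "mathematical function" option concrete, here is a minimal sketch in Python. It is only an illustration of the general idea, not how MongoDB assigns documents internally; the helper name `shard_for` and the sample user names are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (hypothetical helper)."""
    # hashlib gives the same digest on every run, unlike Python's built-in
    # hash(), which is randomized per process for strings.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Assumed sample data: four user names spread over three shards.
users = ["alice", "bob", "carol", "dave"]
placement = {user: shard_for(user, 3) for user in users}
```

Because the function is deterministic, the same key always lands on the same shard, which is what lets a system find the data again later without searching every shard.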
In some cases, a system will even support multiple shards on a single piece of hardware, making it easier for that component to smartly distribute its resources — disk, CPU, memory utilization, or some combination — depending on which of its shards requires them. To this end, sharding can help you with resource optimization.
For large data sets, sharding lets you distribute data among multiple servers and can make it easier to manage that data than it would be if you were storing it on a single machine. Similarly, effective use of sharding can give you higher throughput than you'd see otherwise; by splitting reads and writes across multiple servers, you scale your workload horizontally rather than vertically, and avoid putting too much pressure on any one part of your system. This is where sharding and replication differ: replication copies the same data to multiple servers, so if you're replicating but not sharding, every write still goes through a single primary, and that one machine can throttle your throughput.
Sharding Basics in MongoDB
The concepts and goals of sharding in MongoDB aren't fundamentally different from those in other databases, but MongoDB sharding does have a handful of unique characteristics, which we'll take a quick look at here.
First of all, structurally, MongoDB handles sharding natively, on a per-collection basis. It requires that servers be assigned specialized roles, which define how those servers behave in a sharded environment. A server's role can be one of the following:
- Config server: Deployed as a replica set, config servers track which shards contain which parts of a sharded collection.
- Mongos (router): These are individual instances that do not store data locally. Instead, they route each query to the relevant shards, using state cached from the config servers.
- Shard server: These are the MongoDB instances that actually store collection data. In older releases a shard could be a standalone instance or a replica set; since MongoDB 3.6, shards must be deployed as replica sets, which was always the recommended setup for production.
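The division of labor between the three roles above can be sketched as a toy model in Python. This is not MongoDB code: the class names, the numeric key ranges, and the sample documents are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ShardServer:
    """Stores the actual documents (toy stand-in for a shard)."""
    name: str
    docs: list = field(default_factory=list)


@dataclass
class ConfigServer:
    """Holds metadata only: (lower inclusive, upper exclusive, shard name)."""
    ranges: list


class Router:
    """Stores no documents; routes requests using cached metadata."""

    def __init__(self, config, shards):
        self._ranges = list(config.ranges)  # cached copy of the config state
        self._shards = {s.name: s for s in shards}

    def _target(self, key):
        for low, high, name in self._ranges:
            if low <= key < high:
                return self._shards[name]
        raise KeyError(f"no shard owns key {key!r}")

    def insert(self, key, doc):
        self._target(key).docs.append((key, doc))

    def find(self, key):
        return [d for k, d in self._target(key).docs if k == key]


# Assumed layout: keys 0-99 on shardA, keys 100-199 on shardB.
shard_a = ShardServer("shardA")
shard_b = ShardServer("shardB")
config = ConfigServer(ranges=[(0, 100, "shardA"), (100, 200, "shardB")])
router = Router(config, [shard_a, shard_b])
router.insert(42, {"user": "ann"})
router.insert(150, {"user": "raj"})
```

Note that the client only ever talks to the router, and the router never holds the documents itself. It only decides which shard should receive each request.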
When a MongoDB collection is sharded, data is distributed based on an identifying property called a shard key. A shard key is a field (or combination of fields) in the collection's documents that MongoDB uses to decide which shard holds each document, either by ranges of the key's values or by a hash of them. For example, if you wanted to organize users by the names of the states in which they live, you could begin by assigning all users whose states start with the letters A-C to shard 1 and all users whose states start with the letters D-F to shard 2.
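The state-name example can be sketched in a few lines of Python. This is only an illustration of range-based placement, not MongoDB's implementation; the function name is hypothetical, and the third shard for every letter after F is an assumption added so that every state has a home.

```python
from bisect import bisect_right

# Exclusive upper bounds on the first letter of the state name.
# [A, D) -> shard1, [D, G) -> shard2, everything else -> shard3 (assumed).
BOUNDARIES = ["D", "G"]
SHARDS = ["shard1", "shard2", "shard3"]


def shard_for_state(state: str) -> str:
    """Pick a shard from the first letter of the state (hypothetical helper)."""
    first = state[:1].upper()
    # bisect_right counts how many boundaries are <= first, which is
    # exactly the index of the range that contains it.
    return SHARDS[bisect_right(BOUNDARIES, first)]
```

With these boundaries, `shard_for_state("Colorado")` returns `"shard1"` and `shard_for_state("Florida")` returns `"shard2"`. A range scheme like this keeps neighboring keys together, but it can load shards unevenly if many users live in states that fall in the same range.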
In a future article, we'll take a closer look at the actual methods for defining shard keys and other processes of sharding in MongoDB. For now, remember the following:
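- Un-sharding a collection is hard! For most of MongoDB's history a sharded collection could not be un-sharded at all, and only recent releases (MongoDB 8.0 and later) provide a command for it. Be sure you want to proceed before sharding.
- Changing the shard key for a sharded collection is costly! Older versions do not allow it at all; MongoDB 4.4 added the ability to refine a shard key by appending fields, and MongoDB 5.0 added resharding, which redistributes the collection's data.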
- Backing up a sharded cluster is more complicated than backing up a non-sharded deployment.
Ultimately, setting up sharding in MongoDB isn't overly complex, and its basic features and effects are much the same as in other databases. However, sharding is powerful, and it can backfire if used improperly or if the user doesn't have a clear idea of what they are trying to achieve. It's important to understand how MongoDB defines sharding, at least in broad terms, before you apply it.