Tudip
02 December 2017
MongoDB is the next-generation database that lets you do things you could never do before. It is a leading non-relational database management system and a prominent member of the NoSQL movement. Rather than using the tables and fixed schemas of a relational database management system (RDBMS), MongoDB uses key-value storage in the collection of documents. It also supports a number of options for horizontal scaling in large, production environments.
MongoDB is a NoSQL document database system that scales well horizontally and implements data storage through a key-value system. Learning MongoDB Sharding, Replication and Clusters is very critical as it’s a popular choice for web applications and websites these days. MongoDB is easy to implement and access programmatically.
What is MongoDB Sharding?
MongoDB achieves scaling through a technique known as “sharding”. It is the process of writing data across different servers to distribute the read and write load and data storage requirements. Sharding is the process of storing data records across multiple machines and it is MongoDB’s approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations.
What is MongoDB Replication?
Unlike relational database servers, scaling NoSQL databases to meet increased demand on your application is fairly simple – you drop in a new server, make a couple of config changes, and it connects to your existing servers, enlarging the cluster. All existing databases and collections are automatically replicated and synced with the other member nodes. A replication cluster works well when the entire data volume of your database(s) is able to fit onto a single server. Each server in your replication cluster will host a full copy of your databases. Replica Sets are a great way to replicate MongoDB data across multiple servers and have the database automatically failover in case of server failure. Read workloads can be scaled by having clients directly connect to secondary instances. Note that master/slave MongoDB replication is not the same thing as a Replica Set, and does not have automatic failover.
MongoDB Clusters:
MongoDB Cluster contains three separate components
1. Config Server (mongod) Config servers are used to store the metadata that links requested data with the shard that contains it. It organizes the data so that information can be retrieved reliably and consistently. We use only one config server for testing purpose.
2. Query Routers (mongos) These machines are responsible for communicating to the config servers to figure out where the requested data is stored. It then accesses and returns the data from the appropriate shard(s). Each query router runs the “mongos” command.
3. Shard Servers (mongod) Shards are responsible for the actual data storage operations. In production environments, a single shard is usually composed of a replica set instead of a single machine. This is to ensure that data will still be accessible in the event that a primary shard server goes offline. In MongoDB Cluster, the data is divided into these shards equally (depending upon the configuration setting of balancer)
4. Replicas and Replica Set MongoDB handles replication through an implementation called “replication sets”. Replication sets in their basic form are somewhat similar to nodes in a master-slave configuration. A single primary member is used as the base for applying changes to secondary members.
CLUSTER
|–> Config Servers |–> Query Routers |–> Shard Servers (Replica Set)
|–> Replicas
How to setup a Sharded Cluster for highly available distributed datasets?
Initialize Config Servers and Configure Query Router You can initialize the config server by setting up their role as configsvr and configdb respectively. You can start the mongod and mongos server just by appending the –configsvr and –configdb in the command.
Start Replication and Add Members From the mongo shell, initiate the replica set:
rs.initiate()
This command initiates a replica set with the current host as its only member. This is confirmed by the output, which should resemble the following:
{ “info2” : “no configuration specified. Using a default configuration for the set”, “me” : “192.0.2.1:27017”, “ok” : 1 }
While still connected to the mongo shell, add the other hosts to the replica set:
rs.add("mongo-repl-2") rs.add("mongo-repl-3")
If you configured other members for your replica set, add them using the same command and the hostnames you set in your /etc/hosts file. Once all members have been added, check the configuration of your replica set:
rs.conf()
This will display a replica set configuration object with information about each member as well as some metadata about the replica set. If you need to do so in the future, you can also check the status of your replica set:
rs.status()
This shows the state, uptime, and other data about the set. If your replica set is configured properly, the data should be present on your secondary members as well as the primary. To test this, connect to the mongo shell with the administrative user on one of your secondary members, and run:
db.getMongo().setSlaveOk()
This command enables secondary member to read operations on a per-connection basis, so be sure to disconnect before you deploy your replica set into production. By default, read queries are not allowed on secondary members to avoid problems with your application retrieving stale data. This can become an issue when your database is undergoing more complex queries at a higher load, but because of the relatively simple test data we wrote, this is not an issue here.
Add Shards to the Cluster From the mongos interface, add each shard individually:
sh.addShard( "mongo-shard-1:27017" ) sh.addShard( "mongo-shard-2:27017" )
These steps can all be done from a single mongos connection; you don’t need to log into each shard individually and make the connection to add a new shard. If you’re using more than two shards, you can use this format to add more shards as well. Be sure to modify the hostnames in the above command if appropriate. Optionally, if you configured replica sets for each shard instead of single servers, you can add them at this stage with a similar command:
sh.addShard( "rs0/mongo-repl-1:27017,mongo-repl-2:27017,mongo-repl-3:27017" )
In this format, rs0 is the name of the replica set for the first shard, mongo-repl-1 is the name of the first host in the shard (using port 27017), and so on. You’ll need to run the above command separately for each individual replica set.
Enable Sharding at Database Level
Enable sharding on the new database:
sh.enableSharding("exampleDB")
Sharding Strategy Before we enable sharding for a collection, we’ll need to decide on a sharding strategy. When data is distributed among the shards, MongoDB needs a way to sort it and know which data is on which shard. To do this, it uses a shard key, a designated field in your documents that is used by the mongos query router know where a given piece of data is stored. The two most common sharding strategies are range-based and hash-based.
- Range-based sharding
- Hash-based sharding
Enable Sharding at Collection Level Create a new collection called exampleCollection and hash its _id key. The _id key is already created by default as a basic index for new documents:
db.exampleCollection.ensureIndex( { _id : "hashed" } )
Finally, shard the collection:
sh.shardCollection( "exampleDB.exampleCollection", { "_id" : "hashed" } )
Test Your Cluster Check your data distribution:
db.exampleCollection.getShardDistribution()
You can also get more information on MongoDB official docs. At Tudip, we have setup all the scripts to for performing MongoDB Sharding, Replication, Clusters and various operations.