MySQL is one of the most widely used open source relational databases. In recent times, there has been a lot of interest in NoSQL databases that can handle Big Data. Among NoSQL databases, MongoDB (from "humongous"), a cross-platform, document-oriented open source database, is very popular in the development community. Migrating between a relational database and a NoSQL database is tricky and time-consuming. This blog shares our experience of migrating a MySQL-based solution to a MongoDB-based solution in a Drupal environment.
Our product collects a lot of user preference data, including various statistics. Among these, the server snapshot accounts for the largest data volume, because it records details of every running process. Over time this results in millions of rows in the database, even for short durations. Despite this huge amount of data, response time was never a problem for us; we were able to fine-tune our queries and servers to ensure that results were returned quickly. The majority of operations on our database are writes, with a minimal number of reads.
The problem we encountered was that, with MySQL, we were unable to index the data (via SOLR indexing) in real time so that the indexed information (user preferences) could be picked up in real time. In our Drupal/MySQL setup, the data had to pass through various transformation queues, which took more than 2 minutes to get the complete user data saved in the database and another couple of minutes to get it indexed in SOLR.
Reasons for choosing MongoDB
The decision to choose MongoDB was influenced by the following factors:
- Very easy to set up.
- PHP module available.
- Automatic sharding.
- Good documentation.
Our Implementation details
Switching from a relational database to a NoSQL database is not straightforward, even though it is a one-time activity. The need for migration has to be well understood, and you should move forward only if NoSQL is apt for that specific context. Such an implementation generally takes time, but it is not difficult. However, one needs to be aware of, and ready to face, challenges along the journey. Some of these are very specific to MongoDB, but the majority apply generally to all NoSQL databases.
Schema-less design
A schema-less design using key-value pairs means things are much more flexible for future changes, but it also means that every document stores its field names, which is redundant from a typical relational-database storage perspective. We used relatively long and descriptive column names in MySQL, such as field_session_id_value or field_category_id. This works fine when the number of rows is small, as the extra storage amounts to only a few bytes per row, but with 1 million rows, each carrying perhaps 100 bytes of field names, disk space gets eaten up unnecessarily. As we use Drupal, and it has its own way of storing field names, we have not changed much in Drupal itself, but we made sure to use shorter names for the custom tables we had created, in order to save space.
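As a rough illustration, here is a back-of-the-envelope estimate of that overhead. The long names are the ones from our MySQL schema mentioned above; the short replacements `sid`/`cid` are hypothetical, chosen just for the arithmetic:

```python
# Estimate the disk overhead of storing descriptive field names in every
# document. BSON repeats each field name in every document, so the saving
# per document is roughly the difference in name lengths.
long_names = ["field_session_id_value", "field_category_id"]
short_names = ["sid", "cid"]  # hypothetical short replacements

rows = 1_000_000  # roughly our data volume

per_doc_saving = sum(len(l) - len(s) for l, s in zip(long_names, short_names))
total_saving_mb = rows * per_doc_saving / (1024 * 1024)

print(f"bytes saved per document: {per_doc_saving}")
print(f"approx. saving over {rows:,} documents: {total_saving_mb:.1f} MB")
```

Two fields alone save about 31 MB per million documents; multiply that by dozens of fields per document and the waste becomes significant.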
MongoDB stores data in flat files using its own binary storage format. This means that data storage is very compact and efficient, which is ideal for high data volumes. However, it allocates a set of files per database and pre-allocates those files on the file system for speed; each data file is pre-allocated to a given size. (This is done to prevent file-system fragmentation, among other reasons.) The first file for a database is .0, then .1, and so on: .0 will be 64 MB, .1 will be 128 MB, etc., up to 2 GB. Once the files reach 2 GB in size, each successive file is also 2 GB.
This was a problem for us because MongoDB frequently pre-allocated files that the data would almost never need to "overflow" into, or would overflow into only by a tiny amount. This is particularly the case with free accounts, where we clear out data after a month. Such pre-allocation caused large amounts of disk space to be used up.
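The doubling scheme described above can be sketched as a small function, which makes the waste easy to see for a given number of files:

```python
# Sketch of MongoDB's (MMAPv1-era) data-file pre-allocation: the first file
# is 64 MB, each subsequent file doubles in size up to a 2 GB cap, and every
# file after that is 2 GB.
def prealloc_sizes_mb(n_files):
    sizes, size = [], 64
    for _ in range(n_files):
        sizes.append(size)
        size = min(size * 2, 2048)
    return sizes

print(prealloc_sizes_mb(7))       # [64, 128, 256, 512, 1024, 2048, 2048]
print(sum(prealloc_sizes_mb(7)))  # total MB reserved on disk
```

A database that has just spilled into its seventh file has roughly 6 GB reserved on disk, regardless of how little of the last file it actually uses.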
You can run repairDatabase on a database to rebuild it, de-fragmenting the associated storage in the process.
repairDatabase requires free disk space equal to the size of your current data set plus 2 gigabytes. If the volume that holds the database path lacks sufficient space, you can mount a separate volume and use that for the repair.
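In practice this looks something like the following. The paths are examples, and the --repairpath option applies to the older MMAPv1 releases we were running (it was removed in later MongoDB versions):

```shell
# From the mongo shell: rebuild and compact the current database.
# mongo mydb --eval "db.repairDatabase()"

# Or repair offline with mongod, writing the rebuilt files to a separately
# mounted volume that has enough free space:
mongod --dbpath /var/lib/mongodb --repair --repairpath /mnt/repair
```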
There are also other options, like --smallfiles and --noprealloc (deprecated as of MongoDB 2.6), that can be used to address the storage problem, but both have a performance impact.
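For reference, these are startup flags on mongod (the data path is an example). As far as we understand, --smallfiles starts data files at 16 MB and caps them at 512 MB, trading more files for less wasted space:

```shell
# 2.x-era flags; --noprealloc was deprecated in MongoDB 2.6.
mongod --dbpath /var/lib/mongodb --smallfiles --noprealloc
```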
Addressing unexpected locking and blocking issues in MongoDB
In MongoDB, removing rows locks the entire database. Adding indexes does the same. When we imported our data, this caused problems, because large data sets caused the locks to persist for long periods until the indexing had completed. This is not a problem when you first create the "collection" (a table in MySQL terms), because there are only a few (or no) rows, but creating indexes later will cause problems. Previously in MySQL we would delete rows using a wide-ranging WHERE clause, e.g. to delete rows by date range or server ID. Now in MongoDB, we iterate through all the rows to be deleted and delete them individually. This process is relatively slower, but it prevents the database-wide locking issue.
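The pattern is sketched below against a plain in-memory list of documents so it runs without a server; the field names are illustrative. With the real driver, the inner step would be a single-document remove by _id (e.g. `collection.remove({"_id": _id})` with the 2.x-era PHP or Python drivers) instead of one broad remove by date range:

```python
# Delete matching documents one at a time rather than in one bulk pass,
# so no single long-lived, database-wide delete operation is issued.
snapshots = [
    {"_id": i, "server_id": i % 3, "day": i % 30} for i in range(100)
]

def delete_individually(docs, predicate):
    """Collect the ids of matching documents, then delete them one by one."""
    doomed = [d["_id"] for d in docs if predicate(d)]
    for _id in doomed:
        # one small delete per document; with a real driver this would be
        # a remove keyed on _id
        docs[:] = [d for d in docs if d["_id"] != _id]
    return len(doomed)

removed = delete_individually(snapshots, lambda d: d["server_id"] == 1)
print(f"removed {removed} documents, {len(snapshots)} remain")
```

Each individual delete is cheap and releases the lock between documents, which is why the overall operation is slower but far less disruptive.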
Addressing database corruption issues
In MySQL, or any other RDBMS, if a database (more likely a few tables) gets corrupted, it can easily be repaired individually. In MongoDB, however, you have to repair at the database level. There is a command to do this, but it reads all the data and re-writes it to a new set of files. This means that not only is all the data checked and some disk space freed up as the files are compacted, but also that the entire database is locked and blocked for the duration of the operation. With our database being around 100 GB, this operation took several hours.
However, corruption will generally occur only if the database process is abruptly terminated while it is in the middle of an operation. This situation can easily be avoided by restricting access to these processes to administrative users only.
The main reason for moving to MongoDB was not response time; however, it turned out that in many cases query execution is significantly faster than in MySQL. This is because MongoDB keeps as much data as possible in RAM, and thus it behaves as if the data were cached during a query even when it is not.
We do not have precise statistics, but in some cases we observed that a cached query takes around 5 ms and a non-cached one around 25-100 ms, depending on the query. Indexes help speed up queries in many cases, but our operations are write-intensive, and indexes on constantly changing data can slow things down.
MongoDB supports integration with applications written in PHP, Python, Java, Ruby, C++, Perl, etc. We used the native C PHP module for MongoDB to get performance improvements.
Community / Commercial Support
MongoDB is an open source NoSQL database with strong development community support, so we can be confident that development, and with it bug fixing, will continue. Indeed, the mailing list has been a very useful source of help to us during the migration. The documentation is good, although some things are still unclear or undocumented. Also, getting responses from the developers on the mailing list within hours is very helpful.
Even though NoSQL databases are gaining acceptance, MongoDB is relatively new compared to the likes of MySQL, so there is a limited number of experienced people in the community. Hence, we also took out a support contract with compose.io for guaranteed 24/7 phone and e-mail coverage, to get help quickly if required.
The migration from MySQL to MongoDB solved our problem and was a great learning exercise for us. We learnt from, and addressed, the issues that appeared during the migration. Overall, our product's need to index data in real time is addressed very well by this migration. As more applications adopt MongoDB as their primary data store, we look forward to greater community support in the future, which may remove the need to buy a support contract.