Fri, Jun 14, 2013

NoSQL document storage woes

 

Notes from investigating NoSQL solutions, namely MongoDB and CouchDB, for a relatively small document storage requirement.

Scenario

  • A web app
  • Durable storage and retrieval of dynamic schemaless JSON documents (average 25 KB in size)
  • No need of searching or querying data within the documents
  • Hundreds of thousands of documents, but not quite millions
  • Low frequency inserts and updates (a few hundred a day); rare deletions
  • Single server deployment for the foreseeable future

MongoDB

Mongo is supremely fast, has good drivers for Python, PHP, Ruby et al., and is very flexible with queries. Everything looked good, except for one thing.

Excessive disk usage, due to preallocation. While this is by design and is good for write-heavy workloads, it doesn’t make much sense for a tiny single-server setup. Mongo consumed 6+ GB of disk space to store some 300,000 documents worth 1.5 GB (after a batch insert with no updates or deletions).

In addition, Mongo doesn’t free up space previously occupied by deleted documents. The only way to reclaim wasted space after lots of deletions is to run db.repairDatabase() or the per-collection compact command. However, these operations are extremely resource intensive and time consuming. They also lock Mongo, leaving it unusable whilst they run. Even in the test setup, running db.repairDatabase() on the 6 GB database took about 15 minutes. Imagine running it on a database that is 60 GB in size.

Moreover, it is not possible to predict future disk usage in any sensible manner. If 1.5 GB worth of documents consumed 6 GB of space straight after a batch insert, there is no way of predicting how much space 2 GB worth of documents could consume after thousands of updates and deletions.

While the noprealloc option disables the preallocation behaviour, it comes at a great cost to performance. With preallocation disabled, inserting the same 300,000 documents stalled for some 30-60 seconds between batches of inserts whenever Mongo had to create new database files on demand (“post” allocation).

Verdict: No go for the scenario in question due to excessive disk usage.

CouchDB

Looks sweet and simple. Unlike Mongo, the disk usage is very conservative, by orders of magnitude in fact. CouchDB works over HTTP, with REST-like queries. That means it doesn’t need drivers. One can just use any HTTP library and start communicating. One potential downside is that CouchDB is slow compared to Mongo, but that shouldn’t make much of a difference for the scenario in question.
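To illustrate the "no drivers" point, here is a minimal sketch that talks to CouchDB with nothing but the Python standard library. The server address, the database name "documents" and the helper names are assumptions made up for this example, and authentication is left out.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5984"  # assumption: local CouchDB on its default port
DB_NAME = "documents"               # hypothetical database name


def doc_url(doc_id):
    """URL of a single document; the id is percent-encoded to be safe."""
    return f"{BASE_URL}/{DB_NAME}/{urllib.parse.quote(doc_id, safe='')}"


def build_put_request(doc_id, doc):
    """Build (but do not send) the request that stores `doc` under `doc_id`."""
    return urllib.request.Request(
        doc_url(doc_id),
        data=json.dumps(doc).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_get_request(doc_id):
    """Build (but do not send) the request that fetches a document by id."""
    return urllib.request.Request(doc_url(doc_id), method="GET")


def send(request):
    """Send a request to a running CouchDB and decode the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, send(build_put_request("invoice-1", {"total": 42})) would store a document and send(build_get_request("invoice-1")) would read it back. Updating an existing document additionally requires its current _rev value in the body.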

However, the problem lies elsewhere. CouchDB does not support queries in the traditional sense. There is no concept of an ad hoc query. Instead, queries have to be written as views. Views are JavaScript functions containing query logic that is applied to every single document in the database (the map-reduce way). Views have to be pre-constructed and saved in the database. Granted, once they are created, they are fast, but the creation process is slow and tedious. It is impossible to jump off the couch and run an ad hoc query like age > 50, for instance. Everything has to be constructed and saved as views. Not flexible.
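For a sense of what that involves, here is a rough sketch of the design document one would have to save just to get the equivalent of age > 50. The design document id "_design/people", the view name and the "age" field are hypothetical; only the overall shape (a "views" object holding a JavaScript "map" function as a string) follows CouchDB conventions.

```python
import json

# The JavaScript map function is stored as a plain string inside the
# design document; CouchDB runs it against every document in the database.
MAP_AGE_OVER_50 = """
function (doc) {
  if (doc.age > 50) {
    emit(doc.age, null);
  }
}
""".strip()


def build_design_doc(view_name, map_source):
    """Return a design document (as a dict) containing a single view."""
    return {
        "_id": "_design/people",  # hypothetical design document id
        "language": "javascript",
        "views": {view_name: {"map": map_source}},
    }


design_doc = build_design_doc("older_than_50", MAP_AGE_OVER_50)
print(json.dumps(design_doc, indent=2))
```

This document would be saved with a PUT to /documents/_design/people and the results read with a GET on /documents/_design/people/_view/older_than_50, assuming a database called "documents". All of that ceremony replaces what is a one-line query in Mongo.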

In the end …

CouchDB was chosen for its simplicity, portability, and conservative disk usage compared to Mongo. For the particular scenario, the inability to query the documents is not an issue.

Update: While CouchDB’s disk usage appears conservative initially, revisions (updates and deletions) quickly churn through the disk. This is because CouchDB stores revisions for every update and delete operation for the purpose of replication and conflict resolution. A workaround is to enable compaction, and reduce the revision limit (revs_limit) from the default 1000. Be warned, the lower this number, the higher the risk of conflicts in replication. If you are not using replication, it is not an issue, and you can set it to something like 10.
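Both knobs are reachable over plain HTTP. The sketch below builds the two requests: one that lowers the revision limit through the database's _revs_limit endpoint and one that triggers compaction through _compact. The server address and the database name "documents" are assumptions, and the admin credentials these calls normally need are omitted.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5984"  # assumption: local CouchDB on its default port
DB_NAME = "documents"               # hypothetical database name


def build_revs_limit_request(limit):
    """Request that sets how many revisions CouchDB tracks per document."""
    return urllib.request.Request(
        f"{BASE_URL}/{DB_NAME}/_revs_limit",
        data=json.dumps(limit).encode("utf-8"),  # the body is a bare integer
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_compact_request():
    """Request that starts compaction of the database file."""
    return urllib.request.Request(
        f"{BASE_URL}/{DB_NAME}/_compact",
        data=b"",
        method="POST",
        # CouchDB expects a JSON content type here even with an empty body.
        headers={"Content-Type": "application/json"},
    )
```

Passing either request to urllib.request.urlopen sends it. Compaction runs in the background, so the call returns before the disk space is actually reclaimed.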

Update, May 2015: CouchDB has been in production for close to two years and now stores close to two million documents. The disk usage is just 3 GB (although it needs routine compaction). There is also continuous replication to a second instance elsewhere, which is trivial to set up.
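As a rough idea of how little is involved, continuous replication can be requested with a single POST to the _replicate endpoint. The hostnames and database name below are placeholders, not the actual production setup, and credentials are omitted.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5984"  # assumption: the primary CouchDB instance


def build_replication_request(source, target):
    """Build the request asking CouchDB to replicate `source` into `target`."""
    body = {"source": source, "target": target, "continuous": True}
    return urllib.request.Request(
        f"{BASE_URL}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical names: a local "documents" database pushed to a second server.
request = build_replication_request(
    "documents", "http://replica.example.com:5984/documents"
)
```

A replication started this way does not survive a server restart. Storing the same JSON body as a document in the _replicator database is the persistent alternative.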

Recommended read: Comparison of various NoSQL systems
