25. March 2012 12:13
A couple of weeks back, while at SxSW, I attended an excellent presentation about NOSQL databases by Gary Dusbabek of Rackspace: NoSQL Databases: Breaking the Relational Headlock. The following post summarizes some of the key points and compares the various technologies. He didn’t go over CouchDB, Couchbase or Membase, so I’ll add my own notes about those offerings as well, since I have personally used each.
The Problems with RDBMS
The major problems with traditional Relational Database Management Systems are the inability to scale linearly, a Single Point of Failure (SPoF), the lack of built-in sharding features, and the need to de-normalize data to make it easier to use. Typically, to deal with scale, you add processors, memory, disk space, etc. to build a bigger box capable of handling increased volume or throughput. This is normally referred to as “vertical scaling”. Unfortunately, vertical scaling is not cost efficient; the cost of CPU and memory increases disproportionately to performance. It is cheaper and more efficient to cluster inexpensive hardware, which is “horizontal scaling”. Additionally, RDBMS performance tends to suffer when transactions are introduced to ensure data correctness, consistency and isolation.
Considerations for Choosing
When choosing a NOSQL solution the following considerations must be evaluated:
- Fault tolerance – what is an acceptable level?
- Recoverability – volatile (in-memory, fast) or persisted (slower, but durable)?
- Replication – fully distributed or master/slave?
- Access – polyglot drivers? Do they all offer consistent functionality?
- Hooks – before/after command execution (sprocs and triggers)?
- Distribution mode – sharding strategy?
- Data model
  - Key/Value pairs?
  - Data structures?
- Transactional semantics? BASE vs ACID?
- Read vs Write throughput – where are your scaling issues? What are the usage patterns of your data?
- Deployment, Management, Administration – how to add or remove nodes without affecting clients?
What NOSQL Offers
All that being said, NOSQL solutions are not necessarily a replacement for an RDBMS, but a complement to handle issues of scalability and complexity. An example usage would be as the Q (the query side) in CQS: store a master copy in a fully normalized form in the RDBMS and then push a de-normalized form into a NOSQL solution for scaling reads. Additionally, by virtue of being schema-less, development is typically easier and faster.
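The read-side idea can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the RDBMS, a plain dict stands in for the NOSQL store, and the table names, key format and `publish_customer` helper are all hypothetical.

```python
import json
import sqlite3

# Write side: normalized master copy in a relational store (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO orders (id, customer_id, item) VALUES (?, ?, ?)",
    [(10, 1, "keyboard"), (11, 1, "monitor")],
)
conn.commit()

# Read side: a plain dict stands in for a key/value or document store (assumption).
read_store = {}


def publish_customer(customer_id):
    """Join the normalized rows and push one de-normalized JSON document."""
    name = conn.execute(
        "SELECT name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()[0]
    items = [
        row[0]
        for row in conn.execute(
            "SELECT item FROM orders WHERE customer_id = ? ORDER BY id",
            (customer_id,),
        )
    ]
    read_store[f"customer:{customer_id}"] = json.dumps(
        {"name": name, "orders": items}
    )


publish_customer(1)

# Reads are now a single key lookup with no joins.
doc = json.loads(read_store["customer:1"])
```

The trade-off is that the read copy is only as fresh as the last publish, so something has to call the publish step whenever the master copy changes.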
Some NOSQL Databases
The following is a non-exhaustive overview of NOSQL databases:
MongoDB
- Master/slave replication – master is a SPoF
- Gives failover and reliability, but not consistency
- Only master receives writes
- Document-oriented, thus naturally denormalized – stored natively as BSON
- Flexible schema
- Programmer friendly
- Many language drivers – C#, Java, PHP, Ruby, Python et al
- Atomic on a single document for writes
- Allows for complex queries – by ranges and multiple criteria for instance
- Not good for DW/data analytics
- Blocking offline compaction
- SPoF – the master dies, everything dies
Redis
- Master/slave replication
- Good for real-time stat tracking
- Very fast – in memory database
- Volatility – in memory database – potential for data loss
- Like Memcached, but with data structures: lists, sets, hashtables
- RAM limitations – the whole dataset must fit in memory, although it can also be persisted to disk
- Good when the entire dataset can fit in memory
Riak
- Fully distributed – shared nothing – no SPoF
- Relationships via links
- Map/Reduce framework
- Completely schema-less – keys and buckets
- Scales linearly
- Tunable consistency - can adjust for read vs write optimization etc
- Pre and Post commit hooks
- Pluggable backend storage
  - Bitcask – all keys are kept in memory
  - InnoDB – for when everything won’t fit in memory
  - Memcached-like in-memory backend
- REST API
- Dynamic clustering via “vnodes” similar to Membase/Couchbase vbuckets – when a node is added or removed the data is automatically re-indexed
- Data is stored unsorted
- Written in Erlang
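Tunable consistency in Dynamo-style stores such as Riak is usually described with three numbers: N replicas per key, R replicas that must answer a read, and W replicas that must acknowledge a write. A minimal sketch of the commonly cited rule that reads see the latest write when R + W > N follows; the function names are my own and the brute-force check is only there to show why the rule holds.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """Brute force: does every possible read set share a replica with every write set?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def overlap_rule(n, r, w):
    """The shortcut: read and write sets must overlap when R + W > N."""
    return r + w > n


# Compare the shortcut with the brute-force check for every setting with N = 3.
N = 3
results = {
    (r, w): (overlap_rule(N, r, w), quorums_overlap(N, r, w))
    for r in range(1, N + 1)
    for w in range(1, N + 1)
}
```

Lower R favours fast reads, lower W favours fast writes, and settings where R + W is at most N trade consistency for latency.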
Cassandra
- Has a query language called CQL – SQL-like syntax
- Dynamo-based distribution system with a BigTable-like data model
- Allows for range queries, but prone to “hotspots” – uneven distribution of key/value pairs
- Data center “rack aware”
- Hadoop integration provided by datastax.com
- Configurable caching – like a super-fast Memcached
- Some schema semantics – hybrid columnar and row-based storage system
- Keeps sort order of data, but can be changed on the fly
- When growing the cluster “hotspots” may occur – uneven distribution of keys and values
HBase
- Part of the Hadoop suite of tools: HBase, HDFS, Sqoop, Hive, etc.
- Versioned cells – you can query data as it existed at a particular point of time
- Easy Hadoop integration by default
- Hadoop NameNode is a SPoF – the Secondary NameNode only checkpoints metadata and is not a hot standby
- Schema maintenance requires downtime
- Complicated balancing – HBase region servers then HDFS
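The “hotspot” trade-off mentioned above comes from how keys are assigned to nodes. The sketch below contrasts an order-preserving (range) partitioner with a hash partitioner. The node count, split points and key format are made up for illustration and do not reflect how any particular database configures its partitioner.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NODES = 4

# Hypothetical split points: node 0 holds keys below "g", node 1 below "n",
# node 2 below "t", and node 3 holds everything else.
SPLIT_POINTS = ["g", "n", "t"]


def range_partition(key):
    """Order-preserving placement: neighbouring keys land on the same node."""
    return bisect_right(SPLIT_POINTS, key)


def hash_partition(key):
    """Hash placement: keys are scattered, so key order is lost across nodes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


# Keys that share a prefix, as is common with natural identifiers.
keys = [f"user:{i:04d}" for i in range(1000)]

by_range = Counter(range_partition(k) for k in keys)
by_hash = Counter(hash_partition(k) for k in keys)
```

With range placement every one of these keys falls on a single node, which makes a range scan cheap but concentrates the load there. With hash placement the load spreads out, but a range scan has to ask every node.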
Couchbase – not covered in session
- Fast, in-memory database due to Memcached interface integration
- Provides Map/Reduce framework for creating different views of the data you wish to display
- Stores data as JSON documents via Key/Value pairs
- Combines the best attributes of Memcached (caching), Membase (administration and scaling) and CouchDB (Map/Reduce)
- No SPoF – fully replicated data
- When a node is added data is automatically rebalanced and replicated across the cluster!
- Depending upon bucket type, data can be persisted to disk or stored in-memory
- Can easily support multi-tenancy via buckets – just create a bucket for each client
- Written in Erlang – newer 2.0 version has more C/C++ for performance reasons
- Product keeps changing…first it was Membase, then they added Memcached, and now CouchDB functionality – moving target for long-term NOSQL deployment
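The automatic rebalancing works because keys are never mapped directly to servers: a key hashes to one of a fixed number of vbuckets, and only the vbucket-to-node map changes when the cluster grows. The sketch below is a simplified illustration of that idea, not Couchbase's actual algorithm. It ignores replicas, handles only adding nodes, and the `build_map` and `rebalance` helpers are hypothetical.

```python
import zlib
from collections import Counter

NUM_VBUCKETS = 1024


def vbucket_for(key):
    """A key always hashes to the same vbucket, whatever the cluster size."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def build_map(nodes):
    """Initial vbucket-to-node map, assigned round-robin (a simplification)."""
    return [nodes[i % len(nodes)] for i in range(NUM_VBUCKETS)]


def rebalance(vbucket_map, nodes):
    """Move just enough vbuckets from over-loaded nodes onto newly added ones.

    Assumes every current owner is still present in `nodes` (add-only).
    """
    target = NUM_VBUCKETS // len(nodes)
    counts = {node: 0 for node in nodes}
    for owner in vbucket_map:
        counts[owner] += 1
    needy = [node for node in nodes if counts[node] < target]
    new_map = list(vbucket_map)
    for i, owner in enumerate(new_map):
        if not needy:
            break
        if counts[owner] > target:
            dest = needy[0]
            new_map[i] = dest
            counts[owner] -= 1
            counts[dest] += 1
            if counts[dest] >= target:
                needy.pop(0)
    return new_map


old_map = build_map(["node-a", "node-b", "node-c"])
new_map = rebalance(old_map, ["node-a", "node-b", "node-c", "node-d"])

# Only the vbuckets whose owner changed need their data copied.
moved = sum(1 for before, after in zip(old_map, new_map) if before != after)
per_node = Counter(new_map)

# Clients look up the vbucket first, then the current owner of that vbucket.
owner_of_key = new_map[vbucket_for("user:42")]
```

Because the key-to-vbucket step never changes, clients only need a fresh copy of the map after a rebalance.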
Next Up: Details…
This is just a cursory overview of several NOSQL databases. I’ll be evaluating each one in detail in the coming weeks to get a better feel for where each solution fits in a given scenario. From what I can see, some are more specific in the scope of problems they address, while others are more general-purpose tools that cover a range of scenarios.