30. August 2013 19:16
On September 13th 2013, leading NOSQL database provider Couchbase will be invading San Francisco for its annual community conference, Couchbase [SF] 2013! I’ll be there as well, participating in a Couchbase Cluster – smaller interactive and discussion driven sessions – representing the .NET Client SDK.
The event will host three different tracks: developer, operations and administration, with speakers from Couchbase and from customers of Couchbase who have a wealth of experience and various use-cases to share. There will also be smaller interactive sessions that go over advanced topics, new offerings such as Mobile Couchbase, and the various Couchbase Labs projects that are available for free on GitHub or via NuGet (if you are a .NET developer).
So if you’re in San Francisco on September 13th or are willing to travel a bit, come by and join in on the fun!
25. March 2012 12:13
A couple of weeks back, while at SxSW, I attended an excellent presentation about NOSQL databases by Gary Dusbabek of Rackspace: NoSQL Databases: Breaking the Relational Headlock. The following post summarizes some of the key points and provides a comparison of the various technologies. He didn’t go over CouchDb, Couchbase or Membase, so I’ll add my own notes about those offerings as well, since I have personally used each.
The Problems with RDBMS
The major problems with traditional Relational Database Management Systems are the inability to scale linearly, a Single Point of Failure (SPoF), the lack of sharding features, and the need to de-normalize data to make it easier to use. Typically, to deal with scale, you would add processors, memory, disk space etc. to build a bigger box capable of handling increased volume or throughput. This is normally referred to as “vertical scaling”. Unfortunately, vertical scaling is not cost efficient; the cost of CPU and memory increases disproportionately to performance – it’s cheaper and more efficient to cluster cheaper hardware, which is known as “horizontal scaling”. Additionally, RDBMS performance tends to suffer when transactions are introduced to ensure data correctness, consistency and isolation.
Considerations for Choosing
When choosing a NOSQL solution the following considerations must be evaluated:
- Fault tolerance – what is an acceptable level?
- Recoverability – volatility (in-memory/fast) or persisted (slower, but less volatile)
- Replication – fully distributed or master/slave?
- Access – polyglot drivers? Do they all offer consistent functionality?
- Hooks – before/after command execution (sprocs and triggers)?
- Distribution mode – sharding strategy?
- Data model
  - Key/Value pairs?
  - Data structures?
- Transactional semantics? BASE vs ACID?
- Read vs Write throughput – where are your scaling issues? What are the usage patterns of your data?
- Deployment, Management, Administration – how to add or remove nodes without affecting clients?
What NOSQL Offers
All that being said, NOSQL solutions are not necessarily a replacement for RDBMS, but a complement to handle issues of scalability and complexity. An example usage would be as the Q in CQS…store a master copy in a fully normalized form in RDBMS and then push a de-normalized form into a NOSQL solution for scaling reads. Additionally, by virtue of being schema-less, development is typically easier and faster.
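To make the CQS idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: an in-memory SQLite database stands in for the RDBMS, a plain dict stands in for the NOSQL store, and the table names, key format and helper functions (`place_order`, `publish_customer`, `get_customer`) are made up.

```python
import json
import sqlite3

# Write side: a normalized relational schema (in-memory SQLite stands in for the RDBMS).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item TEXT NOT NULL
    );
""")

# Read side: a plain dict stands in for a key/value or document store (hypothetical).
read_store = {}


def publish_customer(customer_id):
    """Join the normalized rows and push one de-normalized document to the read store."""
    row = db.execute(
        "SELECT name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    if row is None:
        raise KeyError(customer_id)
    items = [
        r[0]
        for r in db.execute(
            "SELECT item FROM orders WHERE customer_id = ? ORDER BY id",
            (customer_id,),
        )
    ]
    read_store[f"customer::{customer_id}"] = json.dumps(
        {"name": row[0], "orders": items}
    )


def place_order(customer_id, item):
    """Command: write to the master copy, then refresh the read model."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO orders (customer_id, item) VALUES (?, ?)",
            (customer_id, item),
        )
    publish_customer(customer_id)


def get_customer(customer_id):
    """Query: a single key lookup, no joins."""
    return json.loads(read_store[f"customer::{customer_id}"])


with db:
    db.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
place_order(1, "keyboard")
place_order(1, "monitor")
print(get_customer(1))
```

The point of the split is that reads never touch the relational side: they cost one key lookup, and the read store can be scaled out independently of the master copy.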
Some NOSQL Databases
The following is a non-exhaustive overview of NOSQL databases:
MongoDB
- Master/slave replication – master is a SPoF
- Gives failover and reliability, but not consistency
- Only master receives writes
- Document orientated, thus naturally denormalized – stored natively as BSON
- Flexible schema
- Programmer friendly
- Many language drivers – C#, Java, PHP, Ruby, Python et al
- Atomic on a single document for writes
- Allows for complex queries – by ranges and multiple criteria for instance
- Not good for DW/data analytics
- Blocking offline compaction
- SPoF – the master dies, everything dies
Redis
- Master/Slave replication
- Good for real-time stat tracking
- Very fast – in memory database
- Volatility – in memory database – potential for data loss
- Like Memcached, but with data structures: lists, sets, hashtables
- RAM limitations – whole set fits in memory, but also allows for offline storage
- Good when the entire dataset can fit in memory
Riak
- Fully distributed – shared nothing – no SPoF
- Relationships via links
- Map/Reduce framework
- Completely schema-less – keys and buckets
- Scales linearly
- Tunable consistency - can adjust for read vs write optimization etc
- Pre and Post commit hooks
- Pluggable backend storage
- Bitcask – all keys are held in memory
- InnoDB – for when everything won’t fit in memory
- Memcached-like in memory
- REST API
- Dynamic clustering via “vnodes” similar to Membase/Couchbase vbuckets – when a node is added or removed the data is automatically re-indexed
- Data is stored unsorted
- Written in Erlang
Cassandra
- Has a query language called CQL – SQL like syntax
- Dynamo based distribution system – BigTable like
- Allows for range queries, but prone to “hotspots” – uneven distribution of key/value pairs
- Data center “rack aware”
- Hadoop integration provided by datastax.com
- Configurable caching – like a super-fast Memcached
HBase
- Some schema semantics – hybrid columnar and row based storage system
- Keeps sort order of data, but can be changed on the fly
- When growing the cluster “hotspots” may occur – uneven distribution of keys and values
- Part of the Hadoop suite of tools: HBase, HDFS, Sqoop, Hive, etc
- Versioned cells – you can query data as it existed at a particular point of time
- Easy Hadoop integration by default
- Hadoop NameNode is a SPoF – Secondary NameNode provides some redundancy
- Schema maintenance requires downtime
- Complicated balancing – HBase region servers then HDFS
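The “versioned cells” bullet above is easier to picture with a toy example. The sketch below is not HBase’s API or storage format; it is a hypothetical in-memory `VersionedTable` in Python that keeps every write with its timestamp, so a read can ask for the value as of a given point in time.

```python
import bisect
from collections import defaultdict


class VersionedTable:
    """Toy table where every write keeps its timestamp instead of overwriting."""

    def __init__(self):
        # (row, column) -> parallel lists kept sorted by timestamp
        self._stamps = defaultdict(list)
        self._values = defaultdict(list)

    def put(self, row, column, value, timestamp):
        cell = (row, column)
        # Insert in timestamp order so reads can binary-search.
        i = bisect.bisect_right(self._stamps[cell], timestamp)
        self._stamps[cell].insert(i, timestamp)
        self._values[cell].insert(i, value)

    def get(self, row, column, as_of=None):
        """Return the newest value written at or before as_of (latest if None)."""
        cell = (row, column)
        stamps = self._stamps.get(cell, [])
        if not stamps:
            return None
        if as_of is None:
            return self._values[cell][-1]
        i = bisect.bisect_right(stamps, as_of)
        return self._values[cell][i - 1] if i else None


table = VersionedTable()
table.put("user1", "email", "ada@old.example", timestamp=100)
table.put("user1", "email", "ada@new.example", timestamp=200)

print(table.get("user1", "email"))             # latest version
print(table.get("user1", "email", as_of=150))  # value as it existed at t=150
```

A real store would also cap how many versions it keeps per cell; that part is left out here.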
Couchbase – not covered in session
- Fast, in-memory database due to Memcached interface integration
- Provides Map/Reduce framework for creating different views of the data you wish to display
- Stores data as JSON documents via Key/Value pairs
- Combines the best attributes of Memcached (caching), Membase (administration and scaling) and CouchDb (mapreduce)
- No SPoF – fully replicated data
- When a node is added data is automatically rebalanced and replicated across the cluster!
- Depending upon bucket type, data can be persisted to disk or stored in-memory
- Can easily support multi-tenancy via buckets – just create a bucket for each client
- Written in Erlang – newer 2.0 version has more C/C++ for performance reasons
- Product keeps changing…first it was Membase, then they added Memcached, and now CouchDb functionality – moving target for long-term NOSQL deployment
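The automatic rebalancing mentioned above relies on a level of indirection: keys hash to a fixed set of vbuckets, and only the vbucket-to-node map changes when the cluster grows. Below is a simplified Python sketch of that idea. It is not Couchbase’s actual algorithm; the vbucket count, the plain CRC32-modulo hash and the `build_map` and `rebalance` helpers are assumptions for illustration, and replicas are ignored.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed fixed count; it never changes as nodes come and go


def vbucket_for(key):
    """Hash a key to a vbucket. Depends only on the key, never on the node list."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def build_map(nodes):
    """Initial layout: hand out vbuckets to nodes round-robin."""
    return [nodes[i % len(nodes)] for i in range(NUM_VBUCKETS)]


def rebalance(vbucket_map, nodes):
    """Move just enough vbuckets for a near-equal share per node.

    Returns the new map and the number of vbuckets that changed owner.
    """
    quota, extra = divmod(len(vbucket_map), len(nodes))
    # The first `extra` nodes are allowed one more vbucket than the rest.
    limit = {n: quota + (1 if i < extra else 0) for i, n in enumerate(nodes)}
    owned = {n: 0 for n in nodes}
    new_map = list(vbucket_map)
    homeless = []
    for vb, owner in enumerate(vbucket_map):
        if owner in owned and owned[owner] < limit[owner]:
            owned[owner] += 1    # stays where it is
        else:
            homeless.append(vb)  # owner was removed or is over its share
    # Hand the homeless vbuckets to whichever nodes are below their limit.
    takers = (n for n in nodes for _ in range(limit[n] - owned[n]))
    for vb, node in zip(homeless, takers):
        new_map[vb] = node
    return new_map, len(homeless)


def node_for(key, vbucket_map):
    """Client-side lookup: key -> vbucket -> node."""
    return vbucket_map[vbucket_for(key)]


old_map = build_map(["node-a", "node-b", "node-c"])
new_map, moved = rebalance(old_map, ["node-a", "node-b", "node-c", "node-d"])
print(f"{moved} of {NUM_VBUCKETS} vbuckets moved")
print(node_for("user::42", old_map), "->", node_for("user::42", new_map))
```

Because a key’s vbucket never changes, adding a fourth node only moves roughly a quarter of the vbuckets, and clients just need the updated map rather than a new hashing scheme.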
Next Up: Details…
This is just a cursory overview of several NOSQL databases. I’ll be evaluating each one in detail in the coming weeks to get a better feel for where each solution fits given a particular scenario. From what I can see, some are more specific in the scope of problem sets that they satisfy, while others are more general purpose tools that satisfy a range of scenarios.
10. January 2012 22:37
I came across the following press release (a bit old) and liked what I read. Specifically, that Couchbase was working on UnQL support with MS:
“Couchbase unveiled and released to the public domain the UnQL query
language, (UNstructured Query Language). Jointly developed with
Microsoft and SQLite, UnQL is designed to provide a common query
language for NoSQL developers and help drive widespread adoption of
NoSQL technology. Each company has committed to delivering product
support for UnQL in 2012.”
Adopting UnQL and partnering with MS puts Couchbase in an awesome position to develop a Linq (IQueryable) implementation of UnQL. If this happens, then querying a NOSQL database or an RDBMS (or anything else) will be unified from the CLR perspective.
For instance, the following Linq query in the CLR (C# syntax):
var query = from f in Context.Foo
            select f;
could emit UnQL if Context is NOSQL, or SQL if it is an RDBMS…genius. If only Java had something like IQueryable <sigh>.
It also looks like Couchbase is dumping the CouchDb HTTP REST API for the binary Memcached protocol, which should be a big win from a performance perspective (sorry CouchDb users). Membase already uses the protocol, so it’s just a matter of switching the HTTP REST API for UnQL.
Another development in Couchbase is that CouchDb has been forked. The good news is that it’s still going to be open-source:
“As J. Chris Anderson notes in the comments, Couchbase is completely open source and Apache licensed:
Everything Couchbase does is open source, we have 2 github pages that are very active:
Probably the most fun place to jump into development is the code review: http://review.couchbase.org/
Let me clarify, if you like Apache CouchDB, stick with it. I'm working on something I think you'll like a lot better. If not, well, there's still Apache CouchDB.”
While possibly a bit traumatic for CouchDb aficionados, this should be a huge win for Couchbase fans and for companies investing in Couch as a stable NOSQL solution.