Azure DocumentDB is Now Cosmos DB: A Global Scale, Flexible NoSQL Database

One of the big announcements at Microsoft Build 2017 is Cosmos DB, a new NoSQL database service from Azure that aims to provide a globally scaled, high-performance NoSQL data store.

It’s fundamentally a replacement for DocumentDB; or, to be more charitable, an expansion of DocumentDB that adds new features. Effectively, DocumentDB is now Cosmos DB, with many of the same features and managed in many of the same ways.

But the documentation now refers to Cosmos DB, not DocumentDB, and you can no longer provision a DocumentDB account. Instead, you provision a Cosmos DB account. However, you can still query Cosmos DB using the DocumentDB APIs, which are available for the .NET Framework, .NET Core, Java, Node.js, Python and Xamarin.

(Note: The current Azure certification exams do not reflect this change in DocumentDB. If you’re taking one of our certification courses, you’ll want to continue to focus on DocumentDB as the name of Azure’s NoSQL-as-a-service offering, and on its former features. When the exams are revised, probably later this year, we will update our courses to adopt this change.)

Cosmos DB bridges four data storage technologies. The first is DocumentDB, of course; and because DocumentDB supported MongoDB queries, so does Cosmos DB.

Additionally, Cosmos DB supports the structures and queries used in Azure Table storage. Unlike DocumentDB, Azure Table storage has not been absorbed by Cosmos DB. You can still create general-purpose Storage accounts that contain Table storage, probably because Table storage is used to hold on to metrics and monitoring data, and retooling that to use a global NoSQL store like Cosmos DB would be expensive and pointless.

A diagram of the Cosmos DB Service, from https://docs.microsoft.com/en-us/azure/cosmos-db/introduction

Graph Database Support

The real big news here is that Cosmos DB supports graph databases via Gremlin.

A graph database is different from a traditional database in that it’s meant to map relationships between entities, rather than record specific properties. That is, a graph database is really more about tagging and categorization than about the actual facets of an entity.

Consider IMDB. It contains information about movies, actors, directors, etc.; I can get a quality score for The Shawshank Redemption or see what movies Frank Darabont has directed or find out what awards James Whitmore has won, for example.

Those are facets of each of those objects and are the purview of relational databases.

But I can also, through roundabout means, use IMDB to see how many Academy Awards have been won by everyone with a speaking role in The Shawshank Redemption, and how many of those awards were after 1994, which could give me an idea of how much of a career boost that movie was to people. Or I could map the number of films it takes to satisfy the Six Degrees of Kevin Bacon. (Answer: One; Rohn Thomas, who was in both The Shawshank Redemption and Telling Lies in America.)

That’s what a graph database does: Categorize information so that we can create queries for those categories and see the connections between entities.
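To make the idea of querying connections concrete, here is a minimal sketch in plain Python, not Gremlin and not the Cosmos DB API. It stores a handful of person-to-film relationships as edges and walks them breadth-first to find the chain from The Shawshank Redemption to Kevin Bacon. The names `EDGES`, `build_graph` and `shortest_path` are illustrative assumptions, and the relationship labels are invented for the example.

```python
from collections import deque

# (person, relationship, film) edges. The labels are illustrative only.
EDGES = [
    ("Frank Darabont", "directed", "The Shawshank Redemption"),
    ("James Whitmore", "acted_in", "The Shawshank Redemption"),
    ("Rohn Thomas", "acted_in", "The Shawshank Redemption"),
    ("Rohn Thomas", "acted_in", "Telling Lies in America"),
    ("Kevin Bacon", "acted_in", "Telling Lies in America"),
]


def build_graph(edges):
    """Build an undirected adjacency map linking people and films."""
    graph = {}
    for person, _relationship, film in edges:
        graph.setdefault(person, set()).add(film)
        graph.setdefault(film, set()).add(person)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = build_graph(EDGES)
path = shortest_path(graph, "The Shawshank Redemption", "Kevin Bacon")
# Every second node on the path is a film; the people strictly between
# the starting film and Kevin Bacon are the connecting links.
links = [node for node in path[1:-1] if node != "Telling Lies in America"]
print(" -> ".join(path))
print("Connecting people:", links)
```

A graph database does this kind of traversal natively and at scale; the sketch only shows why the question is about edges between entities rather than properties of any one entity.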

This service had been pretty much absent from Azure, which seems astounding, especially given the importance of artificial intelligence and big data to Microsoft’s vision. (Of course, you can deploy anything you want on a virtual machine, including a graph database; but we’re talking about scale and performance here, something difficult to do on infrastructure.)

Data Models

As a result, Cosmos DB has four basic means of data storage:

Graph, which we just described

Key-value pairs: The most basic model of NoSQL data storage. Think pure JSON. The type of data stored is not considered important, and each record/entity can either possess a given key-value pair or not. For example, I might have a store that holds on to people, but I don’t have email addresses for all those people. In this kind of store, I would simply not provide a key-value pair for email for those people. This causes some problems when retrieving properties; I have to be prepared to handle both nulls and malformed/unanticipated data.

Column family: A column family is the more traditional means of relational data storage. Each record has a defined number of properties, or columns, and the values in those columns are of a given type. This provides me with consistent entities to query, but it also makes it more difficult to create and update records, since my entities might be missing information.

Documents: In this model of NoSQL data storage, each “column” of a record or entity is actually another record, or document. This allows me to store very complicated objects very efficiently, because each property of the entity either may or may not exist, and if it exists, it does not have to structurally match the other, similarly keyed entities of other records. This gives me some of the benefits of relational databases, with considerably better performance for individual record reads and writes; but I can get wildly unanticipated query results if I’m not careful, and those unexpected results can be difficult to handle gracefully.
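The defensive reading that the key-value and document models call for can be sketched in a few lines of Python. This is an illustration using ordinary dictionaries, not a Cosmos DB client; the sample records and the helper names `email_for` and `customer_city` are hypothetical.

```python
# Key-value style records: the second person simply has no "email" key.
people = [
    {"id": "1", "name": "Ada", "email": "ada@example.com"},
    {"id": "2", "name": "Grace"},
]


def email_for(record):
    """Return the email if present and plausibly well formed, else None."""
    value = record.get("email")
    if isinstance(value, str) and "@" in value:
        return value
    return None


# Document style records: a property may itself be a nested document, and
# the same key need not have the same structure from record to record.
orders = [
    {"id": "a", "customer": {"name": "Ada", "address": {"city": "London"}}},
    {"id": "b", "customer": "Grace"},
]


def customer_city(order):
    """Walk the nested document, tolerating missing or differently shaped parts."""
    customer = order.get("customer")
    address = customer.get("address") if isinstance(customer, dict) else None
    return address.get("city") if isinstance(address, dict) else None


print([email_for(p) for p in people])
print([customer_city(o) for o in orders])
```

The checks are the price of the flexibility: every read has to decide what a missing key or an unexpected shape means for the application.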

Scaling and Availability

As with DocumentDB, Cosmos DB records are distributed across partitions, and those partitions are (usually) replicated globally. Depending on the application, you either query a central endpoint and let Azure automatically resolve which regional replica answers the request, or you call a specific region directly.

While replication is turnkey, you can define the regions where you want partitions replicated and also geo-fence against replicating to places where data sovereignty is an issue. Failover is automatic, although you can set priority orders for each region to become primary.

Through this replication, Azure expects latency and throughput to be exceptional, especially on a global scale. If you’re in enough regions and program things correctly, the replica with the lowest latency for a given request should serve that request immediately.
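As a rough illustration of the two ideas above, lowest-latency routing and failover by priority, consider the sketch below. It is not how Azure implements either one; the region names, the latency figures and the functions `pick_region` and `next_primary` are all assumptions made up for the example.

```python
def pick_region(latencies_ms, healthy):
    """Return the healthy region with the lowest measured latency."""
    candidates = [region for region in latencies_ms if region in healthy]
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=lambda region: latencies_ms[region])


def next_primary(failover_priority, healthy):
    """Return the first healthy region in the configured priority order."""
    for region in failover_priority:
        if region in healthy:
            return region
    raise RuntimeError("no healthy region available")


# Hypothetical measurements from one client, in milliseconds.
latencies_ms = {"West US": 12, "North Europe": 85, "Southeast Asia": 170}
failover_priority = ["West US", "North Europe", "Southeast Asia"]

all_up = {"West US", "North Europe", "Southeast Asia"}
west_us_down = {"North Europe", "Southeast Asia"}

print(pick_region(latencies_ms, all_up))
print(next_primary(failover_priority, west_us_down))
```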

As with all NoSQL stores, performance is achieved in part through a sliding consistency. That is, all record modifications – be they creates, updates, or deletes – are not immediately reflected across all replica partitions.

Instead, you determine a consistency target, and Cosmos DB meets it:

Strong: The most recent record is always read. Writes are committed as soon as a quorum of all replicated partitions can accept the write. Because of the strictness of this consistency level, global distribution of partition replicas is not possible.

Bounded staleness: Records are brought to consistency after a given time period or a number of write operations. Reads will return the most recent record within the staleness window.

Session: Record consistency is maintained for the specific client during her current session. She always sees her own changes, in order, but other clients may not see those changes until some time later.

Consistent prefix: Records will eventually become consistent, but the user may not read the newest record nor be certain that his write is the last write. However, the user can be certain that as records are written, he will read them in order. In other words, if there are three writes – A, B, and C – to a record, the user will see either version A; or version A, then version B; or version A, then version B, then version C. The user would not see versions B or C without first seeing version A, and the user would not see version C without first seeing version A, then version B.

Eventual: The user is not guaranteed to read the latest version of a record, nor will the user see his own writes necessarily. This is the best performing consistency, but it’s least reliable in terms of which record versions the user sees. For example, the user might update a record, but see different values when asking for that record after the update.
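The consistent prefix rule, as described above, can be expressed as a small check: whatever versions a reader has seen so far must match the beginning of the write order, with nothing skipped or reordered. The function name `is_consistent_prefix` is an assumption for illustration, and the check models the description in this article rather than Cosmos DB's internals.

```python
def is_consistent_prefix(write_order, observed):
    """True if the observed versions are exactly the first writes, in order."""
    observed = list(observed)
    return observed == list(write_order[:len(observed)])


writes = ["A", "B", "C"]

print(is_consistent_prefix(writes, ["A"]))            # allowed
print(is_consistent_prefix(writes, ["A", "B"]))       # allowed
print(is_consistent_prefix(writes, ["A", "B", "C"]))  # allowed
print(is_consistent_prefix(writes, ["B"]))            # B without A first
print(is_consistent_prefix(writes, ["A", "C"]))       # C without B first
```

Under eventual consistency, by contrast, no such check would be expected to pass for every reader.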

Highly Available, High Performance

Cosmos DB has four facets to its service-level agreement, or SLA: availability, or that the underlying service is available; consistency, or when reads and writes will be reflected across all nodes, based on the consistency model you choose; latency, or the speed at which reads, and to a lesser degree, writes, are transacted; and throughput, or the number of reads and writes that can be handled at once.

All four measures are guaranteed at 99.99%, but the definitions for some things, such as latency and consistency, are technical. Short answer: Azure promises Cosmos DB will be very fast and very error-free.
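For a feel of what 99.99% means in practice, here is a back-of-the-envelope calculation for the availability facet only. It assumes the percentage is measured as a share of wall-clock time over a 30-day month, which is a simplification; as noted, the actual SLA definitions are more technical. The function name `allowed_shortfall_minutes` is hypothetical.

```python
from fractions import Fraction


def allowed_shortfall_minutes(sla_percent, days=30):
    """Minutes in the period that may fall outside the SLA.

    sla_percent is passed as a string so Fraction can parse it exactly.
    """
    total_minutes = days * 24 * 60
    shortfall = 1 - Fraction(sla_percent) / 100
    return float(total_minutes * shortfall)


print(allowed_shortfall_minutes("99.99"))            # 30-day month
print(allowed_shortfall_minutes("99.99", days=365))  # full year
```

Under that assumption, the guarantee leaves room for a little over four minutes of unavailability in a 30-day month.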

Doug Vanderweide

Here to teach you all things Azure. MCSE: Cloud Platform and Infrastructure, MCSD: Azure Solutions Architect and Microsoft Certified Trainer. I'm a .NET and LAMP stack developer with 20+ years' experience. Follow me on Twitter @dougvdotcom
