Analytics 101: Choosing the right database
When you’re new to the analytics ecosystem, it’s easy to get overwhelmed by all of your options for getting up and running. This is especially true when it comes to picking the right foundational technology: the database.
Choosing properly: four rules of the road
When it comes to finding the right database for the job, there are a few good rules of thumb.
Start with the end in mind
Unpredictable queries are almost always the root cause of database performance problems. If no one queried your database, then your performance metrics would be great. To make unpredictable queries predictable you need to take as much on-the-fly computation out of the system as you can.
To do that, you’ve got to anticipate what your users are going to ask for and have it ready (or as close to ready as possible) ahead of time. If you’re consistently able to have those answers ready, then consider a database with semantics to match.
Choose the right data model
Each database is built for its own unique type of workload. Its authors have made intentional trade-offs to make their database good at some things while sacrificing flexibility or performance in other categories.
For example, Riak was built to get large chunks of data in and out very quickly. It isn’t necessary for Riak to understand the semantics of the data in order to do that, so it isn’t optimized to do so. That’s why Riak isn’t a good choice if you need to do aggregations like SUM, MIN, MAX, and so on at query time.
When you start with the end in mind and have a good notion of how your data is going to be accessed, you should be able to pick a database that best matches that pattern.
Remember that disks are fast—and memory is faster
Networks are getting faster every day. Even SAN-based disks like EBS are a significantly wider bottleneck than they used to be. That said, the key to building performant services involving disk access is to read from the disk in a predictable way.
Consider, for example, that you’re using Postgres. If your query pattern calls for aggregating a few hundred rows that are neatly organized on pages, you’ll get (mostly) consistent performance. If the number of rows that you need to aggregate is unbounded, or if you’re not predictably accessing pages, you’ll have a lot of problems.
That said, memory is much faster than disk—but it’s also really, really expensive. If your data set is small (or you’re willing to pay a premium to buy enough memory for larger data sets), then serving your data out of memory will give you significant performance improvements.
Consider both reads and writes
Data isn’t actually useful unless it’s in a database. Too often, engineers forget about the process of loading data and end up with performance bottlenecks because the database is so busy servicing writes that it can’t service reads.
The good news is that you can predict writes much better than you can reads. In most cases, you’re better off issuing writes in large, infrequent batches than trying to do many small, frequent writes. Another effective strategy is to use a queue to buffer writes as a mechanism for throttling throughput.
In extreme scenarios, it might be worthwhile to have two database clusters: one for writing to, and one for reading from. The clusters’ built-in replication can act as a buffer or throttling mechanism.
Databases to consider
Now that you know the key considerations for selecting a database, here are a few categories and vendors to get you started on your way.
General purpose databases
These are mostly relational databases, with the exception of MongoDB. All of these databases work well serving small- to large-sized data sets in a high-throughput environment with multiple query patterns. They also all support some type of secondary indexing, which makes it easy to add new methods for accessing your data quickly.
It’s worth noting that this type of database can be problematic for very large data sets, or for workloads distributed across multiple machines (again, with the exception of MongoDB, which natively supports clustering—but in a somewhat weird way).
Distributed table-oriented databases
These databases are much like their relational cousins, but forego certain features in order to make it possible to scale out across a cluster. Each has its own set of tradeoffs and features, so be sure to investigate them individually.
These databases are best suited for when you know exactly what data you want to look up. Even though some of them support secondary indexes, you should consider them in the context of being key-value only.
The main considerations in this category will be level of scale and read consistency. Redis is highly consistent, but can’t handle very large data sets. Alternately, Riak can scale out to whatever your ops team can handle—but sometimes it’ll take a bit before you can read what you just wrote.
Massively parallel processing databases
Traditionally, MPP databases have been deployed when in-house analyst teams need the ability to slice-and-dice (very, very large) data sets at will. Concurrent reads to these types of databases were very limited.
However, offerings like Impala, Redshift, and Presto are starting to change this. Their features that make it possible to use these technologies in user-facing applications.
Similarly, these tools have traditionally only been made available to enterprises willing to sign long contracts with huge price tags. But new candidates like Redshift and BigQuery are upsetting this model by making technology available to anyone with a credit card.
Choosing a database is just the first step
Once you’re up and running with the right database, it’s time to see the results of your analytics. That’s where Reflect comes in, making it possible to build beautiful and engaging data visualization interfaces in minutes. Want to see it in action? Request access today.