TL;DR: Consume only the data you need, pipe it through well-understood components, and weigh your options carefully, even under duress.
Last week we built our first full-fledged public-facing application using the Reflect data visualization platform: the GitHub Report Card. The Report Card lets any GitHub user visualize their 2016 GitHub activity across both public and private repos, including commits by month and day of the week, top collaborators, preferred languages, and much more.
The end result was a web interface that tells a rich user-specific narrative using just a few simple visualization components. Report Cards are accessible only to you as a GitHub user (provided that you authorize our application), so we can’t link to any generated Report Cards. But here’s a screenshot just to provide a taste:
We’re not inclined toward false modesty here at Reflect, so we’ll come out and say it: we thought the Report Card was pretty slick. But we were still overwhelmed by the response: over 5,000 GitHub users generated a Report Card within 24 hours of its release, and a huge number of them tweeted their results.
To display the total number of GitHub users who have generated a report, we built this handy Reflect visualization:
There were lots of high-fives at the Reflect office on launch day, but also a sense of relief that things went as well as they did. A few days before the launch, it became clear that backend issues in how we were building the Report Cards were sure to produce a sub-standard experience for users. So we put our heads together and came up with a better solution.
Here’s how it all went down.
The Report Cards look simple enough—just a handful of widgets and some fancy CSS, right?—but building the Report Card is a lot more complex than it may appear. The tricky part is that although GitHub has a very good API, a lot of the metrics that we provide in the Report Card don’t directly correspond to any specific API resource or endpoint.
In other words, it’s easy enough to, say, fetch the URL for a GitHub user’s profile picture and then display that image in the Report Card. But figuring out things like who you collaborated with the most and how many total lines of code you committed requires a complex pipeline that can:
The core problem with the pipeline at first was simply that we underestimated the complexity of steps 1 and 2, and several days before the launch it became clear that the Report Cards were taking far too long to generate (sometimes several minutes, whereas our goal was more in the 15-20 second range).
So I used my CTO clout to gather a few of our engineers together and came up with a better solution. The main shortcoming of the initial approach, it turns out, was that we were ingesting far more data than we needed to from the GitHub API on a per-user basis. So we trimmed it down to only the data we needed to tell the story that we wanted to tell for each user:
So what data did we stop ingesting? Per-commit data. Initially we wanted to provide a range of hyper-granular data about commits, and this required us to pull in huge amounts of data for each user. But we were never able to turn this data into intuitive visualizations that made immediate, unambiguous sense. So we scrapped those visualizations and opted for a simple additions/deletions metric for lines of code, and this was information that we could get much more directly via the GitHub API.
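To make the additions/deletions idea concrete, here is a minimal Go sketch. The post doesn't say which API resource supplied these numbers, so this assumes GitHub's contributor statistics endpoint (`GET /repos/{owner}/{repo}/stats/contributors`), which returns weekly `a` (additions), `d` (deletions), and `c` (commits) counts per author. The `lineTotals` function, the sample payload, and the `octocat`/`hubot` logins are illustrative, not Reflect's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// contributorStats mirrors one element of the array returned by GitHub's
// contributor statistics endpoint (assumed here; see the note above).
type contributorStats struct {
	Author struct {
		Login string `json:"login"`
	} `json:"author"`
	Weeks []struct {
		Start     int64 `json:"w"` // Unix timestamp of the start of the week
		Additions int   `json:"a"`
		Deletions int   `json:"d"`
		Commits   int   `json:"c"`
	} `json:"weeks"`
}

// samplePayload is made-up data in the endpoint's shape. The first week
// for octocat starts in December 2015 and should be excluded from 2016.
const samplePayload = `[
  {"author": {"login": "octocat"}, "total": 9, "weeks": [
    {"w": 1451174400, "a": 100, "d": 50, "c": 3},
    {"w": 1451779200, "a": 120, "d": 30, "c": 4},
    {"w": 1452384000, "a": 80,  "d": 45, "c": 2}
  ]},
  {"author": {"login": "hubot"}, "total": 9, "weeks": [
    {"w": 1451779200, "a": 999, "d": 999, "c": 9}
  ]}
]`

// lineTotals sums additions and deletions for one login across all weeks
// that start within the given calendar year (UTC).
func lineTotals(payload []byte, login string, year int) (additions, deletions int, err error) {
	var stats []contributorStats
	if err = json.Unmarshal(payload, &stats); err != nil {
		return 0, 0, err
	}
	for _, s := range stats {
		if s.Author.Login != login {
			continue
		}
		for _, w := range s.Weeks {
			if time.Unix(w.Start, 0).UTC().Year() != year {
				continue
			}
			additions += w.Additions
			deletions += w.Deletions
		}
	}
	return additions, deletions, nil
}

func main() {
	a, d, err := lineTotals([]byte(samplePayload), "octocat", 2016)
	if err != nil {
		panic(err)
	}
	fmt.Printf("octocat 2016: +%d / -%d lines\n", a, d)
}
```

The point of the sketch is the shape of the trade-off: one pre-aggregated response per repository replaces a walk over every individual commit.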
Tightening up the input end of the data funnel yielded some big gains right away, but we still needed to trim many more seconds off of Report Card generation time. Initially, all three steps in the pipeline were performed by a single running process that showed signs of strain from the start. This served as our next target for optimization.
Instead of relying on a single running process to ingest, transform, and store Report Card data, we decided to distribute these tasks across an array of components. We settled on a task-queue-based system that would let us ingest, transform, and store in a truly asynchronous (and, it turns out, much more efficient) fashion.
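The difference between one process doing everything and a set of queue-fed workers can be sketched in-process with Go channels. This is only an analogy: buffered channels stand in for queues and goroutines stand in for workers, and nothing here calls AWS. The `job` type, the `stage` helper, the worker counts, and the placeholder ingest/transform functions are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// job is a hypothetical unit of work: one user's Report Card data.
type job struct {
	User    string
	Payload string
}

// stage starts `workers` goroutines that read jobs from in, apply fn, and
// send the results on the returned channel. The output channel is closed
// once the input is drained and every worker has finished.
func stage(in <-chan job, workers int, fn func(job) job) <-chan job {
	out := make(chan job, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range in {
				out <- fn(j)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// runPipeline pushes one job per user through ingest, transform, and store.
func runPipeline(users []string) map[string]string {
	queue := make(chan job, len(users))
	for _, u := range users {
		queue <- job{User: u}
	}
	close(queue)

	ingested := stage(queue, 4, func(j job) job {
		j.Payload = "raw:" + j.User // placeholder for fetching API data
		return j
	})
	transformed := stage(ingested, 4, func(j job) job {
		j.Payload = strings.ToUpper(j.Payload) // placeholder transform
		return j
	})

	// Store step: a single consumer, so the map needs no lock.
	store := make(map[string]string)
	for j := range transformed {
		store[j.User] = j.Payload
	}
	return store
}

func main() {
	store := runPipeline([]string{"alice", "bob", "carol"})
	users := make([]string, 0, len(store))
	for u := range store {
		users = append(users, u)
	}
	sort.Strings(users) // map iteration order is random; sort for stable output
	for _, u := range users {
		fmt.Println(u, "=>", store[u])
	}
}
```

Because each stage only reads from one queue and writes to the next, a slow ingest for one user no longer blocks transforms or stores for everyone else, and each stage can be given more workers independently.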
Architecturally, we wanted to worry as little as possible about infrastructure, so we decided to use a handful of AWS (Amazon Web Services) services that our team is well acquainted with:
| Service | Role in the pipeline |
|---|---|
| SQS (Simple Queue Service) | Job queuing |
| Lambda | Distributing work (mostly in the form of consuming SQS queues) |
| DynamoDB | Tracking the state of jobs through the pipeline |
| Kinesis | Pumping data into Redshift |
| API Gateway | A central interface to control the pipeline |
Although the pipeline involves a number of components, the basic data flow in our improved pipeline is pretty simple:
Virtually all of Reflect’s backend is written in Go, and we use it whenever we can. But AWS Lambda doesn’t offer native support for Go. Fortunately, there’s a library called Apex that enables you to run Go processes on Lambda. We’re currently running six Lambda functions using Apex, and we highly recommend it.
We were pretty happy with this setup in general, but Redshift really stood out as the surprise Most Valuable Player amongst them. It was the part of the system that we were the most worried about because we hadn’t had great luck with concurrent reads in the past. So our strategy was to adopt a wide cluster (currently eight nodes) and to use each user’s GitHub username as the distribution key.
Using usernames as distribution keys means that all the data for each user is colocated on a single Redshift node. Even better, each user’s dataset is fairly small by Redshift standards (under 5,000 records). Keeping user datasets both small and colocated has made reads much faster than they otherwise would’ve been and kept load on the leader node low.
Overall, Redshift performed much better than our past experience had led us to expect.
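Why a username distribution key colocates a user's rows can be shown with a toy model in Go. This is a sketch of hash-based placement in general, not of Redshift's internals: Redshift uses its own hash function and actually places rows on slices within nodes, so FNV-1a and the `nodeFor`, `distribute`, and `nodesTouched` helpers below are stand-ins. The only number taken from the post is the eight-node cluster width.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numNodes = 8 // cluster width mentioned in the post

// row is a hypothetical Report Card record keyed by GitHub username.
type row struct {
	Username string
	Metric   string
}

// nodeFor maps a distribution key to a node index. FNV-1a is a stand-in
// for whatever hash the database uses internally.
func nodeFor(username string, nodes int) int {
	h := fnv.New32a()
	h.Write([]byte(username))
	return int(h.Sum32() % uint32(nodes))
}

// distribute places every row on the node chosen by its username.
func distribute(rows []row, nodes int) map[int][]row {
	placement := make(map[int][]row)
	for _, r := range rows {
		n := nodeFor(r.Username, nodes)
		placement[n] = append(placement[n], r)
	}
	return placement
}

// nodesTouched reports how many distinct nodes hold rows for username,
// which is how many nodes a per-user read would have to visit.
func nodesTouched(placement map[int][]row, username string) int {
	count := 0
	for _, rows := range placement {
		for _, r := range rows {
			if r.Username == username {
				count++
				break
			}
		}
	}
	return count
}

func main() {
	rows := []row{
		{"alice", "commits_by_month"},
		{"bob", "commits_by_month"},
		{"alice", "top_collaborators"},
		{"carol", "languages"},
		{"bob", "languages"},
		{"alice", "lines_changed"},
	}
	placement := distribute(rows, numNodes)
	for _, u := range []string{"alice", "bob", "carol"} {
		fmt.Printf("%s: node %d, nodes touched by a read: %d\n",
			u, nodeFor(u, numNodes), nodesTouched(placement, u))
	}
}
```

Since the same key always hashes to the same node, a query filtered on one username only needs to visit one node, which is consistent with the fast reads and low leader-node load described above.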
Despite going from a sub-standard data pipeline to a really solid one in a very short period of time, there were some issues that we never quite resolved that you should be aware of in case you have a use case like ours:
Building the Report Card was a fantastic learning experience for our team. In a very short time we acquired tons of invaluable information about the usability of our platform and we are already at work re-designing elements of Reflect in light of our experience.
The lessons we learned about specific tools are too numerous to mention here, so we’ll boil them down to a few bullet points: