What we learned visualizing 5,000 GitHub accounts in 24 hours
TL;DR: Consume only the data you need, pipe it through well-understood components, and weigh your options carefully, even under duress.
Last week we built our first full-fledged public-facing application using the Reflect data visualization platform: the GitHub Report Card. The Report Card enables any GitHub user to visualize their 2016 GitHub activity across both public and private repos, including commits by month and day of the week, top collaborators, preferred languages, and much more.
The end result was a web interface that tells a rich user-specific narrative using just a few simple visualization components. Report Cards are accessible only to you as a GitHub user (provided that you authorize our application), so we can’t link to any generated Report Cards. But here’s a screenshot just to provide a taste:
We’re not inclined toward false modesty here at Reflect so we’ll come out and say it: we thought the Report Card was pretty slick. But we were still overwhelmed by the response: over 5,000 GitHub users generated a Report Card within 24 hours of it being released, and a huge number of them tweeted their results.
To display the total number of GitHub users who have generated a report, we built this handy Reflect visualization:
There were lots of high-fives at the Reflect office on launch day, but also a sense of relief that things went as well as they did. A few days before the launch, it became clear that backend issues with how we were building the Report Cards were sure to produce a sub-standard experience for users. So we put our heads together and came up with a better solution.
Here’s how it all went down.
The Report Card pipeline
The Report Cards look simple enough—just a handful of widgets and some fancy CSS, right?—but building the Report Card is a lot more complex than it may appear. The tricky part is that although GitHub has a very good API, a lot of the metrics that we provide in the Report Card don’t directly correspond to any specific API resource or endpoint.
In other words, it’s easy enough to, say, fetch the URL for a GitHub user’s profile picture and then display that image in the Report Card. But figuring out things like who you collaborated with the most and how many total lines of code you committed requires a complex pipeline that can:
- perform a variety of operations against GitHub’s API for each user in a way that can gracefully handle not just failure but also API operations that produce paginated results and/or require multiple API calls (i.e. calls that return an HTTP 202 to acknowledge the request but then require you to retry the call until background processing is complete);
- normalize the output of all of those API operations; and finally
- store the end result in a database so that the Reflect API can generate visualizations for the user.
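The retry-on-202 behavior described in the first step can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: `fetchFunc`, `pollUntilReady`, and `errStillProcessing` are hypothetical names, the fetch function stands in for a real GitHub API call, and pagination is left out to keep the sketch short.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// fetchFunc performs one API call and reports the HTTP status and body.
// It is a hypothetical stand-in for a real GitHub client call.
type fetchFunc func() (status int, body []byte, err error)

// errStillProcessing is returned when every attempt answered 202 Accepted.
var errStillProcessing = errors.New("still processing after max attempts")

// pollUntilReady calls fetch until it stops returning 202 Accepted,
// doubling the wait between attempts. Any status other than 200 or 202
// is treated as a failure.
func pollUntilReady(fetch fetchFunc, maxAttempts int, initialWait time.Duration) ([]byte, error) {
	wait := initialWait
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status, body, err := fetch()
		if err != nil {
			return nil, err
		}
		switch status {
		case http.StatusOK:
			return body, nil
		case http.StatusAccepted:
			// The result is still being computed in the background:
			// back off, then ask again (unless this was the last attempt).
			if attempt < maxAttempts {
				time.Sleep(wait)
				wait *= 2
			}
		default:
			return nil, fmt.Errorf("unexpected status %d", status)
		}
	}
	return nil, errStillProcessing
}

func main() {
	calls := 0
	// Fake endpoint: answers 202 twice, then 200 with a payload.
	fake := func() (int, []byte, error) {
		calls++
		if calls < 3 {
			return http.StatusAccepted, nil, nil
		}
		return http.StatusOK, []byte(`[]`), nil
	}
	body, err := pollUntilReady(fake, 5, 10*time.Millisecond)
	fmt.Println(string(body), err, calls)
}
```

Capping the number of attempts matters here: a call that never leaves the 202 state should surface as a failed job rather than hold a worker forever.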
Initial problems with the pipeline
The core problem with the pipeline at first was simply that we underestimated the complexity of the first two steps (crawling the API and normalizing its output). Several days before the launch, it became clear that the Report Cards were taking far too long to generate: sometimes several minutes, whereas our goal was more in the 15-20 second range.
So I used my CTO clout to gather a few of our engineers together and came up with a better solution. The main shortcoming of the initial approach, it turns out, was that we were ingesting far more data than we needed to from the GitHub API on a per-user basis. So we trimmed it down to only the data we needed to tell the story that we wanted to tell for each user:
- all commits by the user across all repos that we could get access to
- stats for the repos (changes per week per repo)
- primary language of the repo, stargazers, watchers, subscribers, etc.
So what data did we stop ingesting? Per-commit data. Initially we wanted to provide a range of hyper-granular data about commits, and this required us to pull in huge amounts of data for each user. But we were never able to turn this data into intuitive visualizations that made immediate, unambiguous sense. So we scrapped those visualizations and opted for a simple additions/deletions metric for lines of code, and this was information that we could get much more directly via the GitHub API.
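As a rough illustration of the simpler additions/deletions metric, the sketch below sums one user's weekly line counts for a single year. It assumes a payload shaped like the response of GitHub's contributor statistics endpoint (an `author.login` plus a `weeks` array with `w`, `a`, `d`, and `c` fields, where `w` is a Unix timestamp); `lineTotals` and the sample data are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// contributorStats mirrors the subset of fields this sketch assumes the
// contributor statistics payload contains for each contributor.
type contributorStats struct {
	Author struct {
		Login string `json:"login"`
	} `json:"author"`
	Weeks []struct {
		Start     int64 `json:"w"` // start of the week, Unix seconds
		Additions int   `json:"a"`
		Deletions int   `json:"d"`
		Commits   int   `json:"c"`
	} `json:"weeks"`
}

// lineTotals sums additions and deletions for one login over the weeks
// that start in the given calendar year (UTC).
func lineTotals(payload []byte, login string, year int) (additions, deletions int, err error) {
	var stats []contributorStats
	if err = json.Unmarshal(payload, &stats); err != nil {
		return 0, 0, err
	}
	for _, s := range stats {
		if s.Author.Login != login {
			continue
		}
		for _, w := range s.Weeks {
			if time.Unix(w.Start, 0).UTC().Year() == year {
				additions += w.Additions
				deletions += w.Deletions
			}
		}
	}
	return additions, deletions, nil
}

func main() {
	// Made-up sample: one week starting in late 2015, two weeks in 2016.
	sample := []byte(`[{"author":{"login":"octocat"},"weeks":[
		{"w":1451174400,"a":100,"d":50,"c":3},
		{"w":1451779200,"a":20,"d":5,"c":2},
		{"w":1452384000,"a":7,"d":1,"c":1}]}]`)
	a, d, err := lineTotals(sample, "octocat", 2016)
	fmt.Println(a, d, err)
}
```

One request per repo yields every contributor's weekly totals, which is why this metric is so much cheaper than walking individual commits.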
Tightening up the input end of the data funnel yielded some big gains right away, but we still needed to trim many more seconds off of Report Card generation time. Initially, all three steps in the pipeline were performed by a single running process that showed signs of strain from the start. This served as our next target for optimization.
Get in line, buddy: from a single process to a task queue
Instead of relying on a single running process to ingest, transform, and store Report Card data, we decided to distribute these tasks across an array of components. We settled on a task-queue-based system that would enable us to ingest, transform, and store data in a truly asynchronous (and, it turns out, much more efficient) fashion.
Architecturally, we wanted to worry as little as possible about infrastructure, so we decided to use a handful of AWS (Amazon Web Services) services that our team is well acquainted with:
| Service | Role |
|---|---|
| SQS (Simple Queue Service) | Job queuing |
| Lambda | Distributing work (mostly in the form of consuming SQS queues) |
| DynamoDB | Tracking the state of jobs through the pipeline |
| Kinesis Firehose | Pumping data into Redshift |
| API Gateway | A central interface to control the pipeline |
Although the pipeline involves a number of components, the basic data flow in our improved pipeline is pretty simple:
- The GitHub user authenticates themselves with GitHub via a Node.js app running on Heroku.
- When authentication succeeds, the web app enqueues a Report Card generation task on SQS.
- An EC2 instance runs a process that listens for SQS jobs; when a new job arrives, it’s dispatched to a set of AWS Lambda functions that crawl the GitHub API and transform the data in various ways.
- Those functions dump the transformed data into a Kinesis Firehose, which updates DynamoDB (which we used to track the state of generation jobs) and loads the data into Redshift.
- Our process running on EC2 then waits for an interval that we chose to give Firehose time to finish, and then tells the user that the scrape is complete and that the Report Card is ready.
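The flow above can be sketched with in-memory stand-ins, which is handy for reasoning about job states before any AWS wiring exists. Everything here is hypothetical: a channel plays the role of the SQS queue, a map plays the DynamoDB state table, and the `crawler` and `loader` functions stand in for the Lambda functions and the Firehose-to-Redshift load. The state names are illustrative, and the fixed wait at the end of the real flow is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// jobState is a hypothetical set of states for a Report Card job.
type jobState string

const (
	stateCrawling jobState = "crawling"
	stateLoading  jobState = "loading"
	stateReady    jobState = "ready"
	stateFailed   jobState = "failed"
)

// stateStore stands in for the DynamoDB table that tracks job state.
type stateStore map[string]jobState

// crawler stands in for the Lambda functions that crawl and transform data.
type crawler func(username string) ([]string, error)

// loader stands in for the Firehose delivery into Redshift.
type loader func(records []string) error

// errCrawlFailed is a sample error used by the fake crawler below.
var errCrawlFailed = errors.New("crawl failed")

// runPipeline drains the queue, pushing each job through crawl and load
// and recording every state transition in the store.
func runPipeline(queue <-chan string, crawl crawler, load loader, states stateStore) {
	for username := range queue {
		states[username] = stateCrawling
		records, err := crawl(username)
		if err != nil {
			states[username] = stateFailed
			continue
		}
		states[username] = stateLoading
		if err := load(records); err != nil {
			states[username] = stateFailed
			continue
		}
		states[username] = stateReady
	}
}

func main() {
	queue := make(chan string, 2)
	queue <- "alice"
	queue <- "bob"
	close(queue)

	crawl := func(u string) ([]string, error) {
		if u == "bob" {
			return nil, errCrawlFailed
		}
		return []string{u + ":commit-stats"}, nil
	}
	load := func(records []string) error { return nil }

	states := stateStore{}
	runPipeline(queue, crawl, load, states)
	fmt.Println(states)
}
```

Recording a state at every transition is what lets a separate process answer "is this Report Card ready yet?" without knowing anything about the workers.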
AWS Lambda functions in Go? You betcha
Virtually all of Reflect’s backend is written in Go, and we use it whenever we can. But AWS Lambda doesn’t offer native support for Go. Fortunately, there’s a library called Apex that enables you to run Go processes on Lambda. We’re currently running six Lambda functions using Apex, and we highly recommend it.
Redshift: the Report Card pipeline’s MVP
We were pretty happy with this setup in general, but Redshift really stood out as the surprise Most Valuable Player among the components. It was the part of the system that we were the most worried about because we hadn’t had great luck with concurrent reads in the past. So our strategy was to adopt a wide cluster (currently eight nodes) and to use each user’s GitHub username as the distribution key.
Using usernames as distribution keys means that all the data for each user is colocated on a single Redshift node. Even better, each user’s dataset is fairly small by Redshift standards (under 5,000 records). Keeping user datasets both small and colocated has made reads much faster than they otherwise would’ve been and kept load on the leader node low.
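To see why a distribution key colocates a user's rows, here is a small illustration of hash-based placement. It is only a sketch of the general idea: FNV-1a is used for convenience and is not Redshift's internal hash function, and `nodeFor`, `placeRows`, and the `row` type are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// row is a made-up record keyed by GitHub username.
type row struct {
	Username string
	Repo     string
}

// nodeFor maps a distribution key to one of `nodes` buckets by hashing it.
// FNV-1a is for illustration only; Redshift's own hash differs.
func nodeFor(distKey string, nodes int) int {
	h := fnv.New32a()
	h.Write([]byte(distKey)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(nodes))
}

// placeRows groups rows by the node their username hashes to.
func placeRows(rows []row, nodes int) map[int][]row {
	placed := make(map[int][]row)
	for _, r := range rows {
		n := nodeFor(r.Username, nodes)
		placed[n] = append(placed[n], r)
	}
	return placed
}

func main() {
	rows := []row{
		{"octocat", "hello-world"},
		{"hubot", "scripts"},
		{"octocat", "spoon-knife"},
	}
	for node, rs := range placeRows(rows, 8) {
		fmt.Println("node", node, rs)
	}
}
```

Because the node depends only on the username, a query filtered to one user touches a single bucket, no matter how many rows that user has.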
Overall, Redshift performed much better than our past experience had led us to expect.
Great progress, yet issues remain
Despite going from a sub-standard data pipeline to a really solid one in a very short period of time, there were some issues that we never quite resolved that you should be aware of in case you have a use case like ours:
- Load on the GitHub API was a concern from the beginning. In the end, the API kept up well, with one exception: the statistics generation API lagged a bit more than we would have liked. The good news is that the people at GitHub were amazingly responsive throughout the entire process. Redshift was the MVP but GitHub support gets a major high five. If you’re making heavy use of the GitHub API, get in touch with the team ahead of time, discuss your use case, and work with them as partners.
- AWS’s Kinesis Firehose was downright blazing at times and unacceptably laggy at others. Consistent Firehose performance was a nut that we never quite cracked. If the Report Card project were our core product rather than an example application, we might have opted to use a solution like Kafka instead. If you’re considering Kinesis Firehose, run plenty of exploratory tests to ensure that it will suit your needs.
- The influx of users occasionally came at a very burst-like tempo. Our SQS work queue frequently got backed up, which led to longer report generation times. We intentionally throttled concurrent report generation to keep the GitHub API happy, but if given more time we would have explored ways to better handle these bursts.
- Some Report Card generation jobs simply failed, for a wide variety of reasons. We dealt with these failures by retrying the jobs manually, but this is clearly a less-than-ideal approach. If given more time, we would’ve designed our task queue system in such a way that failed jobs would be placed back on the queue and retried automatically a specified number of times.
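The last two issues (throttled concurrency and automatic retries) could be approached along these lines. This is a sketch of what we might have built, not something we shipped: `processAll` and `errGeneration` are hypothetical, and each retry round stands in for placing failed jobs back on the queue.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errGeneration is a sample error for a failed Report Card job.
var errGeneration = errors.New("generation failed")

// processAll runs handler once per username with at most maxConcurrent
// jobs in flight (maxConcurrent must be at least 1). Jobs that fail are
// retried in later rounds until they have been tried maxAttempts times.
// It returns the usernames that still failed on their final attempt.
func processAll(usernames []string, maxConcurrent, maxAttempts int, handler func(string) error) []string {
	pending := usernames
	for attempt := 1; attempt <= maxAttempts && len(pending) > 0; attempt++ {
		var (
			mu     sync.Mutex
			wg     sync.WaitGroup
			failed []string
		)
		sem := make(chan struct{}, maxConcurrent) // caps in-flight jobs
		for _, u := range pending {
			wg.Add(1)
			sem <- struct{}{} // blocks while maxConcurrent jobs are running
			go func(u string) {
				defer wg.Done()
				defer func() { <-sem }()
				if err := handler(u); err != nil {
					mu.Lock()
					failed = append(failed, u)
					mu.Unlock()
				}
			}(u)
		}
		wg.Wait()
		pending = failed // only the failures go around again
	}
	return pending
}

func main() {
	var mu sync.Mutex
	seen := map[string]int{}
	handler := func(u string) error {
		mu.Lock()
		seen[u]++
		n := seen[u]
		mu.Unlock()
		// "broken" always fails; "flaky" fails only on its first try.
		if u == "broken" || (u == "flaky" && n == 1) {
			return errGeneration
		}
		return nil
	}
	dead := processAll([]string{"alice", "flaky", "broken"}, 2, 3, handler)
	fmt.Println("gave up on:", dead)
}
```

In a real deployment the jobs that exhaust their attempts would more likely go to a dead-letter queue for inspection than be returned in a slice.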
Building the Report Card was a fantastic learning experience for our team. In a very short time we acquired tons of invaluable information about the usability of our platform and we are already at work re-designing elements of Reflect in light of our experience.
The lessons we learned about specific tools are too numerous to mention here, so we’ll boil them down to a few general takeaways:
- Don’t ingest more data than you need to tell the story you want to tell. In the process of refining that story, we were able to streamline our data funnel. No matter which tools you use, when you’re designing a pipeline for data visualization, let yourself be guided by the narrative(s) you want to tell.
- Don’t delegate too much work to single processes that are liable to become bottlenecks. That might mean building a queue-based system or something else. But if you do spread computationally intensive tasks across components, make sure that those components are known quantities. If one of those components is a question mark (as Kinesis Firehose was for us), approach with caution.
- When you dogfood your product, really dogfood it. Build something that is substantial enough that you push the limits of your own platform. This will help you not only to squash the inevitable bugs and glitches but also to expose usability issues that would never be apparent to you otherwise.