Data is the foundation of everything we do here at Ellevation. All of our products, as different as they are, rest on retrieving, interpreting, and displaying all the different kinds of data we receive from our partner districts. And there’s a lot of it: student data, teacher data, test scores, course grades, term schedules – plus anything and everything else we or the districts we serve might need.
I’ve been working with data at Ellevation since long before we ever had a formal engineering team dedicated to our data ingestion pipeline (one might call me a founding member!). In that time I’ve learned a) just how important it is for us to have a formal engineering team dedicated to our data ingestion pipeline, and b) just how interesting – and gnarly – data problems can get working in this space.
Where We Started
A few years ago, the Data Ingestion team had a goal: migrate one of our major data ingestion systems away from the siloed, opaque process it was using, and bring it in line with the more standard pattern the rest of ingestion used. The plan was to build the new system alongside the old, and then cut over in advance of Back-to-School. As one of the people most familiar with both patterns, I was asked to orchestrate.
Here’s the thing, though: we had tried to do another major migration the year before, and it had – to put it mildly – totally and completely blown up in our faces.
The outcome of that project had left behind a delicate (and ominous) landscape that I was determined to adapt to and learn from. I knew that in order to lead our new project to a safe and successful conclusion, the work would be equal parts technical and personal: all the work of designing, building, and testing our new system, plus all the work of drumming up trust, understanding, and interest (yes!) from the rest of the organization in what we were doing.
Today, we’ll start technical, and I’ll come back another time to share with you the personal – including how we (somehow) managed to get the entire company excited about tracking subtle data inconsistencies. For real!
Where We Came From
First, the ill-fated history: a sister team of ours had taken on (what we assumed would be) a straightforward, if not simple, project. We wanted to migrate our SFTP endpoint – where we receive uploads of data from our partner districts – from being backed by a specific machine to being backed by Amazon’s S3 storage service. The idea was to make our upload processes more stable, more secure, and more accessible to the team. In order to minimize potential disruption to our users, we planned our cutover for the summer, during the lull in the school year when our traffic is typically the lowest. That is to say, before Back-to-School.
A brief detour: what’s the deal with Back-to-School? It’s not an uncommon question from engineers new to EdTech (including me, back when I first joined Ellevation). All industries have their patterns, their busy seasons and their slow seasons, and EdTech is no different; our busiest time of year is in August & September, right when students across the US are returning to school after summer break. During that time, a flood of new data comes in from our partner districts: registrations and transfers and enrollments all take place, and teachers configure and start using new products in their classrooms.
It made sense, at the time, for us to want to finish our SFTP migration in the lull before Back-to-School. It seemed reasonable to save a change of that scale for when traffic and usage of the product would be at its lowest. If something went wrong, the disruption to our partner districts would be minimal, and we would have time and leeway to diagnose, plan, and administer any necessary fixes.
We prepped and tested; all looked good. We implemented the changes and tested again; golden. We were ready for the coming wave of Back-to-School in August.
– Well, we thought we were.
In wanting to take advantage of the summer to minimize disruption to our users, we overlooked a crucial factor: the period before Back-to-School wasn’t just a lull in activity (i.e., traffic to and use of our product), it was also a lull in the flow of data. In fact, many districts simply stop sending us data altogether during the summer, because there’s so little going on in their schools. All the testing and prep we had done in the lull before Back-to-School, we had done with the few districts that were still sending us some amount of data, or with test data we created ourselves.
But, as many backend engineers know, live data is always more complicated than the most complicated test data you can come up with. As soon as real-life use cases started to stream in, our seemingly well-tested system fell apart. We had to scramble to patch up a sudden bloom of new errors, and we faced an onslaught of complaints from users who couldn’t complete their most important workflows during their most critical time of year (e.g., many EL administrators are required by law to collect and record data about their EL students at the start of the year… and they couldn’t do so if our system hadn’t successfully ingested their data). Our partner-facing teams (customer support and customer success), who work every day with our partners, felt rightfully betrayed; our mistake had made their lives even more difficult during what was already their busiest time of year.
We had prepared our metaphorical sand castle for the ebb, but the flow had washed us out to sea.
Fast-forward to my team’s new project, which had the same level of scale, the same deadly combination of high complexity and low visibility, and a newly burnt-out, low-trust partner-facing organization watching our every move.
Cool. No pressure. No problem.
Where We (Actually) Started
Our team started this project in January of 2021, and wanted it to be fully complete – built, tested, and shipped, with all districts migrated to the new system – before Christmas. (You’ll notice that “Christmas” is in a timeframe correctly categorized as “after Back-to-School”; we’ll get to that.) By May we had our first rough cut of a system that could do all the things we needed it to do: receive an input of data from some location (in our case, typically a raw .CSV or .TXT file, plus a few loads from external integrations), map and transform it using existing rules that had been defined in the old process, and load it into our (test) databases. The question then became: is it doing all these things correctly, or not?
(I can feel the objections bubbling up from here, in my office, in the past. Don’t worry, the definition of “correct” is something we’ll come back to.)
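Setting “correct” aside for a moment, the pipeline itself boils down to three stages: receive, transform, load. Here’s a minimal sketch at that level of abstraction; the function names, the mapping-rule format, and the db.insert_many helper are hypothetical stand-ins for illustration, not our actual code.

```python
import csv
from typing import Callable

# Hypothetical shapes only; the real rules and database layer are more involved.

def load_rows(path: str) -> list[dict]:
    """Receive: read a raw .CSV/.TXT export from a district into row dicts."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))

def apply_rules(row: dict, rules: dict[str, Callable[[dict], object]]) -> dict:
    """Map and transform: build each target column using the rules defined in the old process."""
    return {target_col: rule(row) for target_col, rule in rules.items()}

def ingest(path: str, rules: dict[str, Callable[[dict], object]], db) -> None:
    """Load: write the transformed rows into the (test) database."""
    transformed = [apply_rules(row, rules) for row in load_rows(path)]
    db.insert_many("students", transformed)  # hypothetical DB helper
```

The interesting part, of course, lives in the rules, which the old process had already defined; the new system’s job was to apply them the same way.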
We had learned our lesson about shipping critical pieces of our ingestion system during the summer. We needed to figure out how to minimize both the potential for errors in this new segment we had built and the disruption to our partners during a critical use period (i.e., Back-to-School). But as we transitioned out of the school year and into the summer term, the same phenomenon emerged: districts began sending us less and less “normal” data, and in many cases stopped sending us data at all.
We were determined not to be handicapped by the time of year. Instead, we turned our thinking around: how can we best use the summer to our advantage?
Our system wasn’t without its bugs and edge cases, as all first (and second, and third, and fourth, and nth) cuts are. At this point we had only been testing with procedurally-generated data we created ourselves – a couple hundred thousand rows of very nice, reasonable, regular data. In the short term we knew we needed to test against more realistic data if we wanted to catch the full scope of our bugs, and in the long term we knew we were going to have to find a way to evaluate if our final output was “correct” relative to the input.
Which brings us back to the question that’s been hanging over the past few paragraphs: what do we mean by “correct” data?
Where We Ended Up
It was critical to this project that we keep the ultimate goal front and center: a seamless, silent transition from an existing process to a new one. That meant, in some cases (others on my team might argue many cases), we had to close our eyes and grit our teeth against what we considered mistakes, anti-patterns, or otherwise “bad” data. The goal was not to pursue our platonic ideal of what the ingest system should look like – that’s for a different project, and a different blog post. The goal was to ingest the data exactly the same as the old system did, warts and all. Which was, unfortunately, almost as hard to pull off.
We needed some kind of test harness, and the reprieve of summer (both from the load on the system, and our users’ immediate need for the data) had given us a few precious weeks to build one. We determined that our harness needed to:
1. Simulate real-world conditions as closely as possible.
2. Respect our obligations to student privacy (i.e., we couldn’t be constantly downloading district data directly to our individual machines).
3. Be able to compare the output from the existing system to the output from the new system.
4. Give us metrics by which we could easily track how well our new system was performing relative to the existing one.
We called it our “beta pathway.” We tweaked the production code such that, once it had fully completed a job in the existing system, it would then repeat the same job with the same input and transformations via the new code as well. The new code would then write its version of the final ingested data to a set of secondary tables in the database, disconnected from anything in the app. By the end, we would have two sets of identical tables that ideally (eventually) would contain identical data.
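In code, the hook amounted to something like the sketch below. This is a simplified, hypothetical rendering: the two pipeline entry points are passed in as plain callables, and the _beta suffix stands in for our actual secondary tables.

```python
import logging
from typing import Callable

logger = logging.getLogger("beta_pathway")

# RunFn stands in for either pipeline's entry point: it takes a job and the
# table family it should write to. Both names are hypothetical.
RunFn = Callable[[object, str], None]

def process_district_upload(job, run_existing: RunFn, run_beta: RunFn) -> None:
    """Run the existing (production) ingest, then shadow the same job
    through the new code path into disconnected secondary tables."""
    # 1. The existing system does its normal work and remains authoritative;
    #    this is what the app, and our partner districts, actually see.
    run_existing(job, "students")

    # 2. Repeat the same job, same input, same transformations, via the new
    #    code. Its output lands in *_beta tables that nothing in the app
    #    reads, so a bug here can never reach users.
    try:
        run_beta(job, "students_beta")
    except Exception:
        # A beta failure is recorded for the next morning's audit,
        # never surfaced to users.
        logger.exception("beta pathway failed for job %s", job)
```

The asymmetry is the point: the existing path is allowed to fail loudly, while the beta path is only ever allowed to fail quietly.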
That’s #1 and #2 off our list. But just having the data wasn’t enough; we needed a way to extract usable information out of it on a regular basis. Which districts had discrepancies in their data, and which didn’t? What kinds of discrepancies were they, and were there consistent patterns of similar discrepancies across many districts? And, crucially, if we made a change, did it resolve those discrepancies in a detectable, expected way?
We iterated on a set of SQL scripts that could provide both a full diff of the tables and a bird’s-eye overview of the differences therein. The latter we loaded into a long-running Google Sheet that visualized the data for us, in order to expose trends and track progress; the former we exported into Amazon Athena for easy querying, so we could use it to trace the sources of the trends we saw. Since our districts imported data every night, we set up a scheduled Jenkins job that kicked this audit process off every morning once the import was complete, so we could start each day with a fresh look at how the day before had impacted our progress.
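The heart of those scripts was a comparison of each production table with its beta twin. Below is a simplified sketch of that comparison, assuming a Postgres-style database; the table and column names (students, students_beta, row_hash) are hypothetical, and the real scripts were more numerous and more detailed.

```python
# Simplified sketch of the nightly audit query. Table and column names are
# hypothetical; row_hash stands in for however you fingerprint a row.
AUDIT_SQL = """
SELECT
    COALESCE(p.district_id, b.district_id)            AS district_id,
    COUNT(*) FILTER (WHERE b.student_id IS NULL)       AS missing_in_beta,
    COUNT(*) FILTER (WHERE p.student_id IS NULL)       AS extra_in_beta,
    COUNT(*) FILTER (WHERE p.student_id IS NOT NULL
                       AND b.student_id IS NOT NULL
                       AND p.row_hash <> b.row_hash)    AS mismatched_rows
FROM students p
FULL OUTER JOIN students_beta b
    ON  p.district_id = b.district_id
    AND p.student_id  = b.student_id
GROUP BY COALESCE(p.district_id, b.district_id)
ORDER BY district_id;
"""

def district_summary(connection) -> list[dict]:
    """One summary row per district; all-zero counts means a FULL MATCH."""
    with connection.cursor() as cur:
        cur.execute(AUDIT_SQL)
        columns = [col[0] for col in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
```

The per-district counts fed the Google Sheet; the full row-level diff behind them went to Athena for drill-down.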
It took time – and about thirteen different scripts – before we settled into a system that best suited our needs, but it worked. By the end of the summer, our beta pathway had flushed out the most egregious of bugs, errors, and discrepancies in our new system.
Where We Were Going
We were ready for the real test: Back-to-School, and the flood of “normal” data that we knew was coming. Having resolved most of the loud errors in the code, we turned our attention to the silent ones: data discrepancies. Missing rows, extra rows, improperly transformed columns, “bad” data we’d inadvertently fixed, values we didn’t realize we’d needed to derive – the discrepancies could be anything, and each time another district started sending us their data for the new school year, we found more categories of them.
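When the summary flagged a district, the drill-down usually came down to comparing a production row with its beta counterpart column by column, something like this hypothetical helper:

```python
def diff_row(prod_row: dict, beta_row: dict, ignore: frozenset = frozenset()) -> dict:
    """Return {column: (production_value, beta_value)} for every column where
    the two versions of the same row disagree. Hypothetical helper, for illustration."""
    diffs = {}
    for col in prod_row.keys() | beta_row.keys():
        if col in ignore:
            continue
        prod_val, beta_val = prod_row.get(col), beta_row.get(col)
        if prod_val != beta_val:
            diffs[col] = (prod_val, beta_val)
    return diffs

# Hypothetical example: a trailing space the old system preserved but the new
# code trimmed would show up as {"grade_level": ("KG ", "KG")} -- "bad" data
# we had inadvertently fixed, and therefore a discrepancy we had to put back.
```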
But the system worked. Slowly but surely, we weeded out issues with our processing. Day by day, the number in the FULL MATCH column of our spreadsheet (that is, districts whose production and beta tables were deemed identical by our set of SQL scripts) crept up. We grew more and more confident that we could do a safe, clean switchover, with users none the wiser.
Our next hurdle – which will need a post all its own – was convincing the rest of the company of that confidence, too.