Operation Data Liberation
I’ve had the opportunity recently to talk to people in several different city governments that are facing a common challenge — how to liberate operational data from a legacy system.
This is a challenge that lots of city governments face, and it strikes me that there are some common lessons that can be derived from cities that have gone down this road already for those that are still trying to figure out the right approach.
The following suggestions are crafted from my own experience as a municipal government official charged with making data more widely available, and those of people in similar positions that I’ve had a chance to speak with.
I hope this helps someone in a city government that is currently facing this challenge.
The Problem
You’ve got operational data in a legacy system that you want to make more easily available to users inside of (and outside of) your city government.
This may be a part of a larger data or analytics strategy that is spinning up, or a one-off request from the Mayor or other political leader for a specific project. Regardless of how it landed in your lap, the job is yours now.
You’ve got to figure out the best way to provide access to data in a legacy system in a way that is both useful for others and maintainable by you.
Some Hurdles You’ll Face
You probably won’t have additional resources allocated to support this project, and if the system in question is a bespoke solution created by a vendor they may want to charge you a handsome sum to do the work for you.
If this is one of the first projects you’re undertaking to get operational data out of a legacy system for other purposes, you may not have a lot of the policy and technology infrastructure in place to support what you want to do. In a way this is a good thing — you have a greenfield opportunity to help shape policy and set up infrastructure that can be used to support a broader data strategy for your city. On the other hand, you’ll probably have to figure out what you need to get the job done on your own. As issues or questions arise, It will be your job to figure out how to address them.
Finally, you’ll have to make some definite choices technology wise. The good news is there are lots of different options for liberating data in legacy systems, and many are low cost and leverage open source software. However, different solutions may require different levels of technical expertise — some or all of which may be absent from your current IT environment. You’ll need to figure out how to introduce some new tools to your environment and — hopefully — keep them aligned with the skill sets of your staff and consistent with established IT policies.
With these challenges in mind, here is a description of one approach that cities are using to take data from a legacy system and make it more easily usable by people inside and outside city government.
Extract, Transform, Load
The first step in liberating data from a legacy system is to make a copy of it.
This seems pretty straightforward but it’s worth emphasizing as you consider how to get started. The use of your data for analytics and other purposes should occur outside of the environment your legacy system is currently running in.
Your goal here is to make a separate copy of your production data that is stored outside of the production system that is using it. You want this copy to be as close to identical to your production data as possible (with the caveat that there there may be data in your legacy system that is not appropriate to share, or that is unable to be shred because of existing policy). And since regular usage of your production system will cause your copy to become stale over time, you want to make the process of generating a copy repeatable at regular (preferable short) intervals.
This sequence of steps is generally referred to in the world of data management as Extract, Transform, Load (ETL). It’s a common need and there are a number of tools and approaches that can be used to do it well. That said, it can be difficult to cut through the noise around ETL, sift through the competing options, and select an approach or solution that is affordable and maintainable. Additionally, for governments that have a heterogeneous technology environment with lots of different legacy systems, more than one ETL approach may be required.
Evaluating the ETL lifecycle in reverse can be a helpful way work through the issues:
Load
Where are you going to put your data once you get it out of your legacy system?
A relational database platform (RDBMS) of some flavor probably makes the most sense. You can use an RDBMS with a whole bunch of commercially available reporting tools, and an increasing number of cloud-based and open source tools. You can also use it to build an API (more on this below).
Already got a database environment and licenses? You’re probably good to go. If not, think about where you’re going to house your data (on a machine racked in your data center, or somewhere in the cloud) and select a database solution that is a good fit for your current IT environment and employee’s skill sets.
Transform
The amount of transformation you’ll need to do on your data before loading it will depend on the system it is coming from. If your legacy data is already in an RDBMS you might be able to skip over this step and simply replicate from your production database to a backup.
At a minimum, particularly if your data is in an older legacy system, you may need to extract it into an intermediate format (e.g., CSV) and alter it before loading it into another database. This is almost always a pretty solid approach — there are lots of tools available for managing and manipulating these kinds of files, and for loading them into an RDBMS. These tools are scriptable and can be automated, making them easy to drop into your ETL toolchain.
If extracts from your legacy system are initially going to be infrequent (once a month or once a week) then you may be able to work through this process manually at first. As extracts become more frequent, you can start automating it incrementally, with the goal of having regular, frequent extracts occurring in a completed automated way.
Extract
This is the part of the ETL process that you don’t get to choose. Your data is in the system that it’s in — your challenge is to get it out of that system in a way that won’t disrupt its current operation and that is maintainable by you and/or other city IT staff.
If your data is already in an RDBMS, then there are undoubtedly utilities available to replicate or backup data (hopefully you are already using them). You should evaluate these options to decide which is best given your current scenario.
It your data is in an older system like an AS/400 or a mainframe system, you may need to create a custom solution for extracting data. If you have vendors supporting these systems, you can inquire about potential approaches and pricing. But keep in mind — these systems are pretty common in municipal government. Your counterparts in other local governments face this same challenge and some have already figured out how to do it efficiently. Communicating with other government IT officials might be one of your most effective mechanism for extracting data from older legacy systems.
Once you’ve identified an ETL approach that works for you, you will have a copy of your production data in a separate system that is now ready to use. Depending on who wants to use this data, and how they want to use it, you’ll have several options for what to do next.
Making Use of Your Data
Replicating production data so that people in operation and policy offices can run reports is a pretty common use case. If this was the driver for liberating your data from a legacy system, then most of what remains is simply pointing your reporting tool of choice at your newly created database.
But if you want to enable other uses for your data — e.g., using it to support a custom dashboard or application — or if you’re not sure what all of the uses for your data will be — you have some more decisions to make.
One option that you have is to put your data in an open data platform. If your data liberation work is part of a larger effort to make data more accessible to external users, you may already have one of these (or you may soon be getting one). If not, then you can still use an open data platform to house your data for use in internal reporting, analytics and application development. There are lots of commercial and open source options that can be run both on premise and in the cloud.
Another option is to create an API for your data. This will allow you to tailor the way that users can access the data in a way that most effectively matches their use case. This option will, however, require you to make some technology choices.
Every modern programming language has frameworks for building APIs, and odds are that you’ve already got some APIs running in your current IT environment (even if you didn’t build them). If your IT organization is a Microsoft shop (which is pretty common in municipal government), there are lots of great options for you to use to build an API. In addition, Microsoft’s Azure platform provides some nice options for hosting your data or your API application, as well as for managing APIs.
Evaluate the options that most closely align with your current environment and that are a good fit for the talents of your existing IT staff, because odds are they are the ones that will have to manage it going forward.
What Comes Next?
Whether you’ve put your data in an open data portal or created a standalone API, you now have an interface that you can manage that allows people to access your data. There are a few things you’ll need to think about to manage it effectively in the long run:
- Who gets to use your API? Will they be required to authenticate to use it, or provide a key with calls to your API?
- Will you open up access to your API to a select group of users (e.g., those on your internal network, or at a specific IP range) or will it be available more broadly, for anyone to use?
- How will you manage traffic to your API to prevent abuse or overuse by specific users?
- How will you monitor your API to ensure it is up and operating properly?
- Will you log requests to your API and do regular diagnostics?
Beyond general management of your API, (whether you realize it or not) you’re also going to need to provide some support for users of your API:
- Will you provide examples to users for how they can make calls and get data from your API?
- Who do users contact if they are having a problem using your API?
- Where do users go to ask questions, or learn more about how to use your API?
- Will users (internal or external) be able to suggest changes or enhancements to your API?
That’s a lot to consider for a project that you may not have explicitly chosen to undertake in the first place.
If you’re working with a vendor that has provided an open data portal, many of the things listed above may be in place already. And your vendor will likely have experience working with other customers — when in doubt ask your vendor to provide you with examples of how other customers are addressing a common issue.
If you’ve built a custom API, here are a few tips to you get started thinking about the issues listed above:
Lend users a hand
Provide users with enough information to get them started using your API without having the have a human help them. Your IT staff will thank you.
Invest some time in creating user facing materials with examples of how to make calls to your API, and a list of frequently asked questions. The more information you provide up front that makes it easy for users to start using your API the less phone calls and emails you’ll need to respond to.
Also, set up a way for users to report issues — a shared email box that is checked regularly is a good way to get started.
Set up a way to monitor the uptime of your API so you’ll know when it’s down. This will allow you to get it back up quickly and also communicate to users that you are aware of the problem (again, resulting in fewer phone calls and emails).
Treat your API like a shared resource
You should treat an API like a shared resource because that’s exactly what it is — improperly managed, one consumers overuse of this resource can crowd out the legitimate use of others.
Requiring authentication for users is one way to keep track of who is using your API, and issuing API keys is another option. Keep in mind, though, that if users can’t sign up quickly and easily without human intervention, then your phone will start to ring (or your inbox will start to fill up).
You should strongly consider a way to rate limit usage of your API — there are a number of commercial and open source API tools that can help you do this. You should also give thought to a caching strategy to improve the performance of your API — this includes not only setting up a proxy cache but also sending the appropriate caching headers with your API responses.
Plan for the future
As your API gets used, the people using it will form opinions about what they like and what they don’t like. Give them a way to provide this input to you in a way that you can refer back to. This will help you prioritize changes and enhancements to your API going forward and ensure you are directing your finite resources to the things your users want most.
You’ll also want to think about things like versioning your API to allow you to gracefully deprecate features and add new ones.
Data APIs are increasingly common in government, and there are lots and lots of guides that can help you work through these issues.
The challenge of getting operational data from a legacy system and using it to support an internal analytics project, or an external open data effort are increasingly common in municipal government. You’ll face challenges in getting started, but you can take comfort in knowing that lots of other local government IT leaders are traveling this same road. Sharing your experience and asking others to share theirs is an effective way of working through these issues.
Don’t let the fact that you don’t know everything keep you from doing the things you can do right now.