TL;DR: Old, deprecated code/infrastructure is a challenge that every engineer will come across. Remedy what you can and remember that some extra effort can go a long way. It can uncover issues that, when addressed, will save you in the future.
Part of the challenge of software development is maintaining legacy code and infrastructure. When you ignore or neglect these, issues start to pop up and your reliability suffers, causing pain for your customers. The trick here is to actively steward each project. But establishing strong ownership is tough in practice, and without it, systems and code go stale as teams move on to new features and roles.
“Postman” is the internal name for a critical service at Blameless that manages requests for all customer instances. It’s a tiny Go program that connects to a database to redirect requests to the right cluster. It’s one of the oldest services at Blameless, and it’s also the least updated and least understood.
Although it has had incredible uptime, largely because of that lack of change, it ran on outdated, non-standard infrastructure from days past. We learned this when Let's Encrypt finally deprecated the ACME API endpoint that kube-lego was using to provision TLS certificates. Going into the project, our intent was simply to resume TLS certificate renewal on this cluster, but we gained much more value by putting in a bit more effort and migrating to a brand new cluster.
Typically we try to make small incremental changes and focus on only one variable at a time. After considering how critical Postman was to all of our operations, we decided it was less risky to create a new cluster in a working state than to change an existing, outdated one. What began as a certificate fix turned into a full migration.
After modifying our CI pipeline to point to a new environment, we were able to clean up some of the stale code that was still using dep for dependency management. Migrating that to Go modules helped simplify our repository and will hopefully make maintenance easier in the future. From there we were able to test our deployment in dev like we typically do and promote to production. “Production” here actually refers to a new environment, separate from our legacy environment that was still serving traffic. This meant there was little risk in pushing updates here, as DNS was still pointing at our existing Postman cluster.
After testing extensively and copying over valid cert data from the existing cluster, we were able to do a DNS cutover to enable the new cluster to accept traffic from production clients. Because the certificate was still valid, we were able to cut over without any perceived downtime for our customers. Our team was able to do this confidently because of the precautions taken, and it was awesome to see the traffic flow in our monitoring tools. :) See below for a stacked bar graph of our Postman redirect traffic (in purple) showing up in real time in its new home.
Postman Redirect Traffic
Our new environment no longer uses the deprecated kube-lego but rather its successor, cert-manager. We already use cert-manager for our customer instance deployments, so it was pretty simple to migrate the configuration to be compatible. That resolved our original problem directly, but we gained more value than we initially expected:
- This new cluster runs on a standardized VPC.
- Reworking the CI deployments allowed us to deploy Postman in multiple regions, reducing latency and improving redundancy for our single point of failure.
- All Kubernetes clusters are now managed with Terraform, so none of them is a one-off, which also allows for easier administration and observability into Postman.
- We no longer need to support Helm v2, as this was our last legacy deployment using it.
Ideally this project would have been planned well in advance by a dedicated team of stewards that owned this service and its maintenance. After this experience, our team decided that ownership would fall to the SRE team, because of the experience we gained during the migration and the nature of the application.
The most important takeaway is that we learned about this service and uncovered some dark debt that, if neglected, would have kept decaying and caused problems as we continue to grow. Sometimes the better option is to invest a bit more work to create a lot more value. As SREs, it’s critical that we understand our systems and help foster a sense of ownership throughout the team. By making ourselves key stakeholders, we’re prepared to deliver reliability to our end users. We now serve as an example for the rest of the organization, showcasing a working knowledge of Postman and more confidence in taking on stewardship.