It’s been a while since the last time I wrote. Don’t get fooled by the publishing date of my previous post; in reality, I haven’t written anything in the last four months. I’m not sure if the short, dark, and chilling winter days had anything to do with it, but my motivation to create was gone. Most of the time I didn’t want to be anywhere near a computer, and getting shit done was an excruciating battle with myself.
Even though this past winter season was pretty uneventful, it offered a moment of introspection and reflection, from which I learned that the true motivation for all of my projects comes from an intrinsic desire to build great systems. In a sense, coding is my art, and I write code for the love of beauty and the fear of death. So I won’t justify Cloud Adventures as a means to riches, but as an expression of my will that is intrinsically valuable to me.
Enough of that unrelated story about winter depression; now that I’m back at the keyboard, I’ve been on a quest to get Cloud Adventures’ platform back on track. Specifically, I’ve been working with OpenTofu and Terragrunt to build a reliable, secure, tested, immutable, and scalable infrastructure. Before we dive into the low-level details, let me define some of the common problems platform projects face as companies grow, technologies change, and teams evolve.
The evolution of a platform
Once upon a time, a manager at a company or startup figured out the value of using the cloud and decided it was time to run some of the company’s projects in a public cloud. Whoever was responsible for the setup opened an AWS account and created a couple of users who were assigned the role of administrators. Services started to get deployed in the cloud, and when changes were required, team members would log in to the AWS console to update the configuration, or redeploy some code to servers.
After a while, customers began experiencing outages. To improve user satisfaction, the company started to invest in the reliability of its systems by hiring “Bob” (I’m naming him after Bob the Builder), someone with experience in networking and the cloud. Bob knew that the best practice was to provision infrastructure with OpenTofu, manage fleet configuration with Ansible, make deployments with ArgoCD, and monitor services with Prometheus. Tools may vary, but that should be a somewhat accurate representation. My focus today is on the evolution of the infrastructure provisioned with OpenTofu, so let’s talk about how that plays out over time.
The story of the OpenTofu module begins with a single monolithic state that describes the resources of one AWS account. At this point, Bob has managed to create an abstraction over the cloud, but as the module grows, changes begin to take a long time to run (a big state forces OpenTofu to make a lot of API calls), and team members still prefer to make changes directly in the console, a practice that creates constant infrastructure drift. Even though drift is happening, no one wants to update the OpenTofu module because “it works in production” and running tofu apply could break the environment. Infrastructure feature development slows down and system complexity increases.
Now that one big module has slowed the team down, engineers test different solutions to control the size of the state and keep the OpenTofu code clean. After some of these experiments are completed, what once was a single module with one entrypoint and one state file has become a collection of small and diverse modules. These modules are littered with scattered hard-coded variables, and dependencies among resources are not easy to identify by looking at the code. No one knows how every part of the infrastructure has been provisioned or how to update it, so everyone makes changes directly in the AWS console to avoid missing the project’s deadlines.
Manually changing the infrastructure and compromising on drift works for a while, but soon productivity hits rock bottom. Deadlines are missed left and right. Service deployments fail, and the only infrastructure still provisioned with OpenTofu is handled by a few seasoned employees who use over-complex --target expressions or symlinked files (don’t even ask what this means, you don’t want to know). Nevertheless, business is booming, and out of nowhere, just to add some more pressure on the overworked platform team, management requests the adoption of a multi-{account, region, cloud} strategy: they want to improve workload security, enforce delimited access for external auditors, improve latency for customers across the globe, and leverage some awesome cloud features offered by a different cloud provider.
At the same time, Bob feels overworked, wants a change of pace, and decides to quit. No one else on the team has a complete understanding of the evolution of the project, the lessons learned along the way, or how all the infrastructure modules fit together. Yet the decision to go multi-{account, region, cloud} has been made. The infrastructure team starts to run experiments to achieve the goal, reduce failures, and recover some of the speed they’ve lost along the way.
The experimentation phase leads to the development of a custom tool that dynamically configures providers and the OpenTofu backend (or state config) for each environment and region. Unfortunately, this approach fails to incorporate the lessons from past experiments and results in the repetition of some of the same errors.
Management is tired of the delays and failures of the infrastructure and decides it is time to step in. They hire an experienced platform engineering manager called “Scoop” (yes, Bob the Builder’s best friend), who knows how to get the project back on track.
Scoop makes amendments to the team’s processes; some restrictions are put in place, and over time, managing the infrastructure stops being such a big pain. Speed picks back up, confidence in the platform increases, and failures reach an all-time low. Change is never easy, but Scoop’s good leadership has saved a once-struggling cloud project.
Key takeaways from the evolution of a platform
The most important lesson from this story is that managing infrastructure for complex systems is hard, even if you are using IaC tools. The good news is that we are programmers: we have principles for dealing with complexity, and even if it is hard, we know it is possible. To help us understand how to manage complex infrastructure while maintaining speed and control over failures, let’s do a post-mortem analysis.
Summary
Changes to the infrastructure managed with IaC became a burden. No one felt comfortable applying changes because the execution plan included out-of-scope updates (there was a lot of drift), apply commands took several minutes to complete, resource dependencies were difficult to understand, and the system was prone to errors.
Impact
- Customer impact: infrastructure updates were the root cause of a high percentage of the errors experienced by customers. The application experienced downtime (during deployments), so users were unable to use the company’s services.
- Developer experience (platform team) impact: the unmanaged complexity of the IaC modules, poor software engineering practices, and the lack of documentation made changes to the code base nearly impossible.
- Developer experience (application teams) impact: team members experienced frequent application bugs due to infrastructure failures and constantly needed to involve the platform team in day-to-day tasks like service deployments.
Root Cause
- Infrastructure code did not have an effective software development lifecycle.
Contributing factors
- Human access to the cloud provider’s console was not restricted to read-only actions.
- Overly complex differences and constraints across environments.
- Lack of testing environments and automated test pipelines for infrastructure code.
- Poor abstractions over modules (a particularly critical issue for the networking layer and for resources shared across environments).
Action plan for Cloud Adventures
In this section I’m going to state the restrictions I’ll be applying to my infrastructure project. I hope this post is a somewhat accurate description of some of the issues you will encounter as your cloud environment grows. If you’ve experienced some of these incidents in the past, I recommend you read through the restrictions below, as I strongly believe that enforcing even a few of them in your development process will help you reduce infrastructure bugs.
Restrictions - A short checklist to mitigate infrastructure bugs
- Create ONE shared, version-controlled modules repository. Main modules will independently reference pinned versions of this repository (see the first sketch after this list).
- Modules should use and require named providers. Prefer specific names over a generic “aws” (see the provider sketch after this list).
- Restrict access to the cloud provider’s web console to read-only; automation should be in charge of applying changes.
- Encourage experimentation by creating sandbox accounts for every engineer. These are the only accounts where engineers should have full console access.
- Keep main modules large enough (but not too large) to represent abstractions that matter to the business.
- Tag resources so that it is easy to find their declaration in the code base (the provider sketch after this list shows one way to do this).
- Prefer data sources or Terragrunt dependencies over hard-coded variables (see the Terragrunt sketch after this list).
- Add automation to lint, statically analyze, test, and document OpenTofu modules.
- Modules and live infrastructure should have E2E tests. These tests should provision infrastructure in accounts that reflect the current state of the target environment (a minimal tofu test sketch appears after this list).
- Don’t let anyone on the team forget they are writing infrastructure CODE. Enforce coding best practices as you would for any other project in the organization.
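To make the first restriction concrete, here is a minimal sketch of how a main module could pin a version of the shared modules repository. The repository URL, module path, tag, and input names are placeholders I made up for illustration, not something from a real project.

```hcl
# Hypothetical main module referencing a pinned version of the shared modules repo.
module "network" {
  # "//vpc" selects a subdirectory of the repo; "?ref=v1.4.0" pins a git tag.
  source = "git::https://github.com/cloud-adventures/infra-modules.git//vpc?ref=v1.4.0"

  # Inputs come from variables (or Terragrunt), not hard-coded literals.
  cidr_block  = var.cidr_block
  environment = var.environment
}
```

Bumping the `ref` in each main module independently is what lets you roll a module change out environment by environment instead of everywhere at once.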
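For named providers and traceable tags, the provider sketch below shows one possible wiring. The alias `network_us_east_1`, the alias `network` required by the module, and the tag keys are all assumptions; adapt them to your own naming conventions.

```hcl
# Inside a shared module: declare the named provider configuration it requires.
terraform {
  required_providers {
    aws = {
      source                = "hashicorp/aws"
      version               = ">= 5.0"
      configuration_aliases = [aws.network]
    }
  }
}
```

```hcl
# In the main module: configure the named provider, and use default_tags so every
# resource it creates points back to the module that declared it.
provider "aws" {
  alias  = "network_us_east_1"
  region = "us-east-1"

  default_tags {
    tags = {
      "managed-by" = "opentofu"
      "module"     = "infra-modules//vpc" # placeholder: path of the declaring module
    }
  }
}

module "network" {
  source = "git::https://github.com/cloud-adventures/infra-modules.git//vpc?ref=v1.4.0"

  providers = {
    aws.network = aws.network_us_east_1
  }
}
```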
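On the Terragrunt side, this is a sketch of a dependency block replacing a hard-coded VPC id. The relative path, output names, and mock values are assumptions about your directory layout and the VPC unit’s outputs.

```hcl
# terragrunt.hcl for a unit that consumes outputs from the VPC unit.
dependency "vpc" {
  config_path = "../vpc"

  # Mock outputs let plan/validate run before the VPC actually exists.
  mock_outputs = {
    vpc_id             = "vpc-00000000"
    private_subnet_ids = ["subnet-00000000"]
  }
}

inputs = {
  vpc_id             = dependency.vpc.outputs.vpc_id
  private_subnet_ids = dependency.vpc.outputs.private_subnet_ids
}
```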
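Finally, as one possible starting point for the testing restrictions, OpenTofu’s native test framework can exercise a module end to end against a sandbox account. The file name, variable values, and the `aws_vpc.this` resource address are placeholders for whatever your module actually declares.

```hcl
# tests/vpc.tftest.hcl -- run with `tofu test` from the module directory.
run "creates_vpc_with_expected_cidr" {
  # `apply` provisions real resources; point the provider at a sandbox account.
  command = apply

  variables {
    cidr_block  = "10.0.0.0/16"
    environment = "sandbox"
  }

  assert {
    condition     = aws_vpc.this.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR block does not match the requested value"
  }
}
```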
Finally, if you like this type of post, I encourage you to subscribe to my newsletter, and if you want to talk more about infrastructure, shoot me a message on LinkedIn.