After a rocky start, Terraform has become a great success story where I am. We've got both infrastructure and development teams using it, sharing both learnings and resources such as modules and code. A consequence of that is I spend quite a lot of my time helping people get started with IaC and Terraform. In doing so I've seen a number of common pitfalls, all of which are avoidable. You can save yourself a lot of grief by knowing what these are and making a deliberate effort to avoid them.
Pitfall 1: Still doing things manually
This is the most important pitfall, and yet it's the hardest one to avoid. This is because the temptation to do it wrong is just too big, and I haven't yet met a team who learned this lesson before their first big Terraform state file blow-up.
What happens is the team start happily Terraforming away, get a few resources running, then hit their first challenge - typically something like security groups or IAM. At that point they say the dreaded words: "we'll just do this one thing manually". And so they do just that one thing manually. What this does is introduce two problems. Firstly, your process has gone from a reliable "run Terraform" to a far more error-prone "run Terraform, then do this manual thing". Secondly, you find your manual changes getting overwritten the next time someone runs Terraform or, worse, Terraform is unable to reconcile your manual changes with its own state file.
This is where you end up with a process that is, "run this partial apply in Terraform, then change this manually, then open the state file in Notepad and put in your changes, then run this partial apply, then wonder why it's all broken, then realise the runbook doesn't work if you added new resources to Terraform, then…" It's arduous, it's error prone, you spend half your day trying to fix corrupted state files and you don't really know whether your Terraform configuration actually matches the infrastructure (although given the plan says "38 to destroy, 156 to add, 309 to modify" it's a safe assumption it doesn't).
Yes, making security groups and IAM policies work in Terraform is harder, as you end up with a longer REPL cycle. Similarly, adding that one environment variable you need is fiddly, and it's three steps to apply it compared to one in the console. Get over that hump. If you don't trust yourself not to make manual changes, make a read-only account for using the AWS console (or similar) and then a separate access-all-areas API key to use with Terraform. You'll thank yourself when "just that one" environment variable needs to be applied consistently to 20 resources instead of just one.
Pitfall 2: Terraform as second-class code
I keep seeing teams treat their Terraform configuration as second-class code. One of the important things about IaC is that it's infrastructure as code - and that means first-class code that you treat with the same rigour as any of your Java, C#, Go or whatever else. Terraform's finicky interpolation syntax and lack of good control flow statements make this even more important; if you don't put in the effort to structure your configuration and keep it readable then you're going to end up with something that's very hard to maintain and share between developers.
You've got plenty of built-in abstraction elements like modules, variables and simple loops (via the count attribute) - use them.
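As a minimal sketch of what that looks like, here are two variables and a count loop in place of copy-pasted resource blocks. The variable names, instance type and tag format are made up for illustration, and the syntax assumes Terraform 0.12 or later (older versions wrap references in "${...}").

```hcl
# Hypothetical variables: how many workers to create and which image to use.
variable "worker_count" {
  type    = number
  default = 2
}

variable "worker_ami" {
  type = string
}

# count creates worker_count copies of this resource;
# count.index numbers them from 0.
resource "aws_instance" "worker" {
  count         = var.worker_count
  ami           = var.worker_ami
  instance_type = "t3.micro" # placeholder value

  tags = {
    Name = "worker-${count.index}"
  }
}
```

Changing the number of workers is now a one-line variable change rather than another pasted block.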
Pitfall 3: Overly specific modules
This is a bit of a special case of the second pitfall, but it's one worth calling out. Terraform gives you modules as a way to abstract away details from your configuration, enforce standards such as tagging or standardised VPC layouts, and provide a foundation for sharing code between teams. In order to fulfil this promise, your modules need to be generic - you should be able to re-use them across multiple different resources and even projects only by changing the input variables.
What I see a lot from teams starting their journey is something more like, "this is a module which creates a lambda function called Bob in the application tier of our VPC, assuming you created the app tier with a subnet of 192.168.2.0/24". And then later in the week they add, "this is a module which creates a lambda function called Alice in the application tier…" and so on until someone points out the app tier got changed manually outside of Terraform and actually it's somewhere in the 10.0.0.0/8 range.
Remember: a module needs to be generic, and it shouldn't make assumptions about things outside of its influence. There are lots of things you may want to keep common between your lambda functions - basic permissions necessary for them to operate, logging policies, tags, whether they are created inside a VPC… but specific details such as name, subnet or which security groups they're a member of are things to pass in as variables. A simple thing to keep in mind when editing modules is to ask yourself, "will this change make the module more capable, or merely less generic?" If the latter, it's probably time to rethink your change. If your change would require changing one module to two similar ones that do the same thing, it's definitely time to rethink your change.
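Here is a rough sketch of what a generic lambda module might look like under those rules. The file layout, variable names, runtime and handler are all assumptions for illustration; the point is only that the common bits (tags, VPC placement) live in the module while names, subnets and security groups come in as variables.

```hcl
# modules/lambda/variables.tf (hypothetical layout)
variable "name" { type = string }
variable "role_arn" { type = string }
variable "package_path" { type = string }
variable "subnet_ids" { type = list(string) }
variable "security_group_ids" { type = list(string) }

# modules/lambda/main.tf
resource "aws_lambda_function" "this" {
  function_name = var.name
  role          = var.role_arn
  filename      = var.package_path
  handler       = "index.handler" # placeholder
  runtime       = "python3.12"    # placeholder

  # The module decides that functions run inside a VPC,
  # but not which subnets or security groups they use.
  vpc_config {
    subnet_ids         = var.subnet_ids
    security_group_ids = var.security_group_ids
  }

  # A standard every function gets, enforced in one place.
  tags = {
    ManagedBy = "terraform"
  }
}

# Root configuration: Bob and Alice share one module and differ
# only in inputs. The var.* values are assumed to be declared in
# the root configuration.
module "bob" {
  source             = "./modules/lambda"
  name               = "bob"
  role_arn           = var.lambda_role_arn
  package_path       = "build/bob.zip"
  subnet_ids         = var.app_subnet_ids
  security_group_ids = var.app_security_group_ids
}

module "alice" {
  source             = "./modules/lambda"
  name               = "alice"
  role_arn           = var.lambda_role_arn
  package_path       = "build/alice.zip"
  subnet_ids         = var.app_subnet_ids
  security_group_ids = var.app_security_group_ids
}
```

If the app tier moves to a different range, nothing in the module changes; only the subnet IDs passed in do.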
Pitfall 4: Not reading the documentation
Terraform has some pretty good documentation. It can be terse in parts, but I think I've only once been in the situation where I needed to do something and I couldn't work it out from the documentation available. Which makes it unfortunate that I often find myself having the following conversation:
"I need to get the ID of foo and somehow set the bar to it, but I've no idea how?"
"Did you check the Terraform docs for
"Ah… um…" (half-heartedly scrolls up and down the page without really reading anything) "..what about this input variable?"
The Terraform docs for resources follow a common pattern. "Example Usage" will give you an example of the resource being used in context (i.e. the example may have other resources in it). This is a useful guide, but it may not necessarily apply in your situation; you can't just copy it without understanding what it does. "Argument Reference" tells you the variables you can put in to configure the resource. "Attributes Reference" tells you what outputs come out of the resource - these are the things you can use elsewhere with an interpolation.
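To make that concrete, here is a small sketch using two AWS resources I've picked purely as an example. The id of the VPC is an exported attribute, so the subnet can refer to it directly. The resource names and CIDR blocks are made up, and the syntax assumes Terraform 0.12 or later (older versions write the reference as "${aws_vpc.main.id}").

```hcl
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # an argument: something you put in
}

resource "aws_subnet" "app" {
  # id is an attribute exported by aws_vpc: something that comes out
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# Attributes can also be surfaced as outputs for use elsewhere.
output "app_subnet_id" {
  value = aws_subnet.app.id
}
```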
Sometimes the Attributes Reference will contain the magic phrase, "all of the argument attributes are also exported as result attributes". This means you can use any of the arguments you supplied when creating the resource as an attribute. Be aware that this doesn't apply to all resources - read the docs!
Pitfall 5: One codebase per environment
This one happens quite a bit. You start out building your dev environment. Then you decide to create another environment - a QA or staging or something like that. So you move all of your existing Terraform code into a directory called "dev", copy it to a directory called "qa", and then start changing all the cluster sizes and other bits to match. You add a few new resources to support things you didn't realise you'd need - meanwhile someone else on your team adds a bunch of different resources to the dev environment, and before you know it you've got two wildly divergent codebases.
Then someone asks, "this works on dev but not on QA, what's the difference between the environments?" Have fun with your diff tool.
Alternatively, if you're sensible you start out by creating a workspace in Terraform for your dev environment, and a dev.tfvars file. Whenever you do something that feels like it might be unique to the dev environment, you add a variable, stick it in dev.tfvars, and carry on. When you create QA (or whatever), you create a workspace for it, create a qa.tfvars and spend a little while customising your variables. Whenever you need to do something environment-specific, you work out how to do it with a variable and a bit of interpolation logic.
This way, you not only avoid having unnecessarily divergent environments, but also get a concise .tfvars file which tells you, for each environment, exactly what's unique about it. Often these can be surprisingly small: I have one product where all that's in my .tfvars file is a list of the instance types to use for each service, CIDR blocks for each of the subnets and a list of the IP addresses allowed to access that environment. Unsurprisingly, I see very few (if any) "well, it was working in preprod…" problems on that product.
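For illustration, a dev.tfvars along those lines might look something like the sketch below. The variable names and values are invented, and each one is assumed to be declared as a variable in the shared configuration; qa.tfvars would set the same variables to different values.

```hcl
# dev.tfvars: everything that is unique to the dev environment.
#
# Typical usage (one workspace per environment):
#   terraform workspace new dev        # first time only
#   terraform workspace select dev
#   terraform apply -var-file=dev.tfvars

# Instance type to use for each service (hypothetical services).
instance_types = {
  api    = "t3.small"
  worker = "t3.medium"
}

# CIDR block for each subnet.
subnet_cidrs = {
  web = "10.10.1.0/24"
  app = "10.10.2.0/24"
}

# IP addresses allowed to access this environment
# (documentation-range placeholder).
allowed_ips = ["203.0.113.10/32"]
```

Answering "what's the difference between dev and QA?" then becomes a diff of two short files.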
Pitfall 6: Not knowing when to be declarative
Sometimes you can automate too far. Consider a Terraform configuration where you've got a bunch of AWS Lambda functions as part of the infrastructure. In this situation, Terraform needs to manage which version of the code goes on each environment. Firstly, this means we need to have explicit version numbers on our code artefacts - we don't want to grab latest.zip and find we've accidentally deployed someone's experimental branch to production.
However, even with semantic versioning on code artefacts, one of the teams I work with was getting extremely animated about the fact they had a 1.0.23 sitting in their .tfvars file. We were at risk of spending a lot of time trying to automate grabbing the latest version, without having to update the variable. Thing is, the variable serves a purpose. Remember the previous pitfall: your .tfvars file defines your environment. Therefore, you need it to be deterministic - which means being explicit about your version numbers. If I run terraform today and get 1.0.23, but tomorrow I get 1.0.25, that's a very bad place to be in because I don't know what version of the code is going to end up on my environment.
There is a time to start automating the update of version numbers, and that's when you start plugging Terraform into your deployment pipeline. But that should be performed by variable substitution in your deployment tooling, and there should still be a record in the deployment tool of what got deployed where. And the reason to do this is that a deployment tool is a better place to have your declarative statement about what versions of which code sit on each environment, not because having an explicit version in .tfvars is bad.
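A sketch of how an explicit version can flow through the configuration is below. The function name, variable names and artefact key layout are all hypothetical, and it assumes the artefacts are published to an S3 bucket by the build.

```hcl
# Set explicitly per environment, e.g. in qa.tfvars:
#   orders_lambda_version = "1.0.23"
variable "orders_lambda_version" {
  type        = string
  description = "Exact artefact version to deploy to this environment"
}

variable "artefact_bucket" {
  type = string
}

variable "lambda_role_arn" {
  type = string
}

resource "aws_lambda_function" "orders" {
  function_name = "orders"
  role          = var.lambda_role_arn
  handler       = "index.handler" # placeholder
  runtime       = "python3.12"    # placeholder

  # The version is part of the artefact key, so the same inputs
  # always deploy the same code. There is no latest.zip.
  s3_bucket = var.artefact_bucket
  s3_key    = "orders/orders-${var.orders_lambda_version}.zip"
}

# Once a deployment tool owns the version, it can substitute the
# value at apply time instead, for example:
#   terraform apply -var-file=qa.tfvars -var="orders_lambda_version=1.0.25"
```

Either way the version is stated somewhere you can read it back later, which is the property that matters.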
Pitfall 7: Not thinking about the pipeline
It's easy to get started with Terraform. This is both a blessing and a curse: it means it's also easy to end up with state files on your local filesystem that require checking in and out of source control, manually copying code artefacts about so you can deploy them, and having no good solution for secrets management and key storage. A few weeks later and you're having that awkward conversation which goes, "so I think Greg checked in the state file, but then I'm not sure if I got Shauna's latest code from the CI server, which means we might have just reverted to 0.4.16 on production… er, oops?"
You don't need anything sophisticated to solve a lot of these problems - an appropriately secured and encrypted S3 bucket to store your state files and deploy code artefacts to will get you a long way, and if you're on AWS it's possible to get your secrets out of KMS without needing to set up a Vault server.
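As a rough starting point, remote state in S3 only needs a backend block like the one below. The bucket name, key and region are placeholders, and the bucket itself is assumed to exist already with access suitably locked down.

```hcl
terraform {
  backend "s3" {
    bucket  = "example-terraform-state"     # hypothetical bucket
    key     = "myproduct/terraform.tfstate" # hypothetical key
    region  = "eu-west-1"                   # placeholder region
    encrypt = true                          # encrypt state at rest
  }
}
```

After adding this, running terraform init will offer to copy any existing local state into the bucket, and nobody needs to check a state file in or out of source control again.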
Terraform is great. It's not perfect, but I've hardly ever seen a tool come along and open up discussions between dev and ops teams in the same way, and the fact it does what it does pretty damn well is a bonus. But, as with any tool, there are things you've got to be careful of and things that force you to change the way you work. I've seen teams lose a lot of time, build unreliable environments and have an overall bad experience with Terraform because they never paid attention to the above pitfalls, and I've seen teams get proficient at an unbelievable rate because they listened to all the above before writing a single line of configuration. You know which of those teams you want to be, right?