Measuring the impact of AI coding tools
There is little doubt about AI's promise to transform software development. But how can engineering teams measure the ROI of these tools in a way that aligns with their priorities?

Every software engineering team is looking at AI tools for coding. In a Stack Overflow survey, 76% of developers reported that they are either using or planning to use AI tools for development. However, with the pace of change in this space, many are already disillusioned with the tools they adopted during the early hype. There is little doubt about AI's promise to transform software development. But how can engineering teams measure the ROI of these tools in a way that aligns with their priorities? These are just some of the questions we are getting from folks considering AI coding tools, so we want to share a few things we have learned from our work with some of the most sophisticated software engineering teams in the world.
Before measuring return on investment (ROI), we must think about what the investment is supposed to accomplish. If the goal of any software development organization is to efficiently build and operate software that delivers value, then any tooling investment should help achieve this goal. Two questions follow directly:

- Is the tool helping developers build software efficiently? (Productivity)
- Is the software being built delivering value? (Impact)
Software engineering is, by design, an innovative pursuit. Thinking of developer workflows as factory operations may be convenient, but it is flawed. Whether software developers are happy using their tools and want to come back to them repeatedly is also an important factor in determining whether a tool has been successful in a given setup. Therefore, the third important question in determining the value of a tool is:

- Do developers enjoy using the tool and keep coming back to it? (Satisfaction)
Any ROI measurement should encompass productivity, impact, and satisfaction.
Traditionally, much of the conversation around developer productivity has centered on the outer loop of software development: the set of activities around planning, building, integrating, deploying, and monitoring. Traditional productivity measurement methods like DORA (DevOps Research and Assessment) fall short because of this limited scope; they focus mainly on deployment and operations efficiency, not on coding. DORA metrics can also be gamed, and they don't account for team makeup (size, skills, tenure) or project complexity.
The inner loop of software development, the part that deals with reading, understanding, and writing code, hasn't traditionally received much attention from developer productivity experts. This diagram from a developer's thoughts on developer productivity is too good not to reproduce here:

[Diagram: the inner and outer loops of software development]
Despite the lack of established methods for measuring inner-loop productivity, the inner loop is the area of software development that has seen the most transformation from the AI revolution, because LLMs are especially good at analyzing and understanding patterns within source code. Leaving the inner loop out of measurement is a significant miss.
Another popular framework, SPACE (Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow), attempts to take a more "holistic approach" to developer productivity. However, too many of the metrics are qualitative and subjective, making it hard to measure and quantify impact consistently.
Not all engineering teams have the same priorities or level of maturity, especially when it comes to the ability to measure things or the consistency and repeatability of processes. We have seen teams at every point along this spectrum.
However, from our direct work with many such customers, we have learned that most engineering teams want to keep track of the following productivity metrics, each covered in the sections below:

- Time saved
- Code output
- Alignment with intended outcomes
Any of these metrics, taken together with other inputs like team size and average developer compensation, can provide reasonable estimates of the corresponding monetary impact. However, this write-up doesn't go into those details, understanding full well that all engineering teams are different.
The "time saved" metric is the easiest starting point for evaluating the ROI of AI coding assistants but also the most basic because it lacks the context of your business, teams, and workflows. Every AI coding assistant on the market has established their own flavor of time savings metric based on the tool's usage. It is easy to estimate how much time a developer would have spent simply "typing" the code that was generated by the tool. That essentially is the time saved. You can also look at these savings at a more granular level based on the type of interaction that produced the code and analyze where the savings are coming from:
This time savings can easily be measured for an individual and, when aggregated across a team, can provide team-level savings as well.
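To make this concrete, here is a minimal sketch of the calculation in Python. The event format, field names, and typing-speed baseline are all illustrative assumptions; real AI assistants surface these numbers in their own analytics dashboards.

```python
from collections import defaultdict

# Hypothetical usage events exported from an assistant's analytics.
# Field names and values are illustrative assumptions.
events = [
    {"user": "alice", "interaction": "autocomplete", "accepted_chars": 1200},
    {"user": "alice", "interaction": "chat",         "accepted_chars": 3400},
    {"user": "bob",   "interaction": "autocomplete", "accepted_chars": 800},
]

# Assumed typing baseline: ~50 words/min * 5 chars/word = 250 chars/min.
CHARS_PER_MINUTE = 250

per_user = defaultdict(float)         # minutes saved per developer
per_interaction = defaultdict(float)  # minutes saved per interaction type

for e in events:
    minutes = e["accepted_chars"] / CHARS_PER_MINUTE
    per_user[e["user"]] += minutes
    per_interaction[e["interaction"]] += minutes

print(f"Team time saved: {sum(per_user.values()):.1f} min")
for kind, minutes in sorted(per_interaction.items()):
    print(f"  {kind}: {minutes:.1f} min")
```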
This measurement is very basic because it has the potential to both underestimate and overestimate the amount of savings: it underestimates because typing is only a small part of a developer's work (time spent searching, reading, and reasoning about code also shrinks), and it overestimates because generated code still has to be reviewed, tested, and sometimes reworked before it ships.
Regardless, it is a good starting point because it is easy to understand. The team at 1Password found that using Sourcegraph saved every developer 7 hours per month.
Sometimes, if the right thing to do—like writing unit tests—is too hard, people won't do it. Cody's value is that it lets developers be their best selves while relieving the burden of repetitive tasks.
— James Griffin-Allwood, Staff Developer at 1Password
The second set of metrics relates to the "output" produced. In this case, the output is code. How many lines of code were inserted into the editor? How much of the inserted code actually persisted? Just like the time-savings metrics, code output can be broken down by interaction type (autocompletes, chats, prompts, etc.) and measured at both the individual and team level.
These are useful metrics to think about, and they go one step further than time savings alone. However, even here there is potential for incorrect conclusions: more code is not necessarily better (verbose or redundant output adds maintenance burden), raw line counts have long been a poor proxy for productivity, and even code that persists may still be of low quality. A simple sketch of the persistence calculation follows.
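As with time saved, the data model here is an assumption: hypothetical completion records with an inserted-character count and a count of characters still present after a retention window you define (for example, 30 days).

```python
# Hypothetical completion records; field names and the retention
# window (e.g. code still present after 30 days) are assumptions.
completions = [
    {"user": "alice", "inserted_chars": 500, "retained_chars": 430},
    {"user": "bob",   "inserted_chars": 300, "retained_chars": 120},
]

inserted = sum(c["inserted_chars"] for c in completions)
retained = sum(c["retained_chars"] for c in completions)
persistence = retained / inserted if inserted else 0.0

print(f"Code inserted: {inserted} chars")
print(f"Code persisted: {persistence:.0%}")
```

A low persistence rate is a prompt to dig into why generated code is being discarded, not a verdict on its own.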
The next set of metrics are those aligned with the intended outcomes of the individual, team, or organization. How does an individual think of impact? What does success look like for them in their current project? Similarly, what is the intended outcome of a team project? By measuring the utility of a tool in the context of a business outcome (individual or team), its true value can be understood. While the actual "business outcome" could be different for different people, the examples we have seen include completing projects faster, taking on previously shelved work such as legacy migrations, and cutting the time it takes to understand unfamiliar code.
These are harder to measure because most organizations do not have the luxury of comparing how a project would go with or without a tool. A better approach is to choose a smaller representative project and run a tight pilot or A/B test with a small number of developers who can perform the type of tasks that they might perform in the actual project—both with and without the tool.
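The pilot's results can then be summarized with simple descriptive statistics. Here is a minimal sketch using placeholder task-completion times rather than real data; with samples this small, treat the output as directional rather than conclusive.

```python
from statistics import mean, stdev

# Placeholder task-completion times in hours from a hypothetical
# pilot; these are not real measurements.
with_tool = [4.5, 6.0, 3.8, 5.2, 4.1]
without_tool = [7.2, 8.5, 6.1, 9.0, 7.8]

speedup = 1 - mean(with_tool) / mean(without_tool)
print(f"With tool:    {mean(with_tool):.1f}h (sd {stdev(with_tool):.1f})")
print(f"Without tool: {mean(without_tool):.1f}h (sd {stdev(without_tool):.1f})")
print(f"Observed speedup: {speedup:.0%}")
```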
An underappreciated (but potentially revolutionary) aspect of using AI in coding is that it has unlocked many projects that in the past were considered too risky or too cumbersome to even take on. Legacy code, too many dependencies, lack of documentation, or lack of relevant skills in the current team have often dissuaded teams from taking on potentially impactful projects. Now, many such projects are not only being initiated but also successfully completed. Such code transformations can sometimes produce even more business impact than adding a new product feature because they provide the foundation for more people to be able to innovate in this codebase.
Leidos often takes on customer projects related to the modernization and migration of legacy code. For example, migrating from Oracle to PostgreSQL once took a full sprint, if not longer. Cody got them 80% to 90% of the way there within minutes.
I really can't express how blown away I am with Cody. I can't go back to... whatever it was like before. I use Cody every day, all day long. No matter what I write, Cody helps improve it and it goes way beyond coding in some specific language, you can make Cody explain things to you every step of the way. Really, this is it, the future of coding.
— Leidos Engineer
Almost all engineers and engineering leaders report that "developer satisfaction" is important to them.
Most developers believe a flow experience involves spending time coding (and not information gathering). In other words, the IDE is the flow zone. Engineers at Qualtrics report having to leave their IDE to find information on the web 28% less often when using Cody, and they can understand code 25% faster.
One of the most daunting things as a junior engineer is working on a large, existing codebase. There is always a ton of domain knowledge about that code that's restricted to the people who wrote it, no matter how well it's documented. There are always nuances that only the code authors know. But if developers know how to prompt Cody, Cody can find context and explain the code to them.
— Brendan Doyle, Senior Software Engineer at Qualtrics
Developer satisfaction, often based on self-reported data, is qualitative. However, it is still useful to run internal surveys among the tool's actual users to keep track of developer sentiment, discover qualitative insights about actual use case scenarios, and identify training needs for your teams.
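Even qualitative survey data can be tallied consistently over time. Here is a minimal sketch, assuming 1-to-5 satisfaction ratings collected in an internal survey (the scale and the "favorable" threshold of 4 or above are assumptions):

```python
from collections import Counter

# Hypothetical 1-5 satisfaction ratings from an internal survey.
responses = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]

average = sum(responses) / len(responses)
favorable = sum(r >= 4 for r in responses) / len(responses)

print(f"Average rating: {average:.1f}/5")
print(f"Favorable (4 or 5): {favorable:.0%}")
print("Distribution:", Counter(responses))
```

Run the same tally survey after survey, and the trend will show whether sentiment improves as teams get better at using the tool.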
Measuring the ROI of AI coding assistants is not trivial. You need to consider what your real business objectives are, whether the data gathering for ROI calculations is through the tool or your own custom measurements, and how mature your team operations are. However, given the promise of AI coding assistants, several smart engineering leaders are willing to dive deep and find out for themselves. If you happen to be one of them, we would love to hear from you and help you with your exploration. Please feel free to sign up for our enterprise trial, in which our team of experts will guide you through the measurement process.
