Decomposing a massive Rails monolith with Kirsten Westeinde, software development manager, Shopify

What's it like to deconstruct one of the largest Rails codebases (3 million lines of code, 500,000+ lifetime commits, 40,000 files) on the planet? And why didn't Shopify follow the standard path to microservices, but instead choose to modularize their monolith?

In this episode, Kirsten Westeinde, software development manager at Shopify, describes how her team led the charge in refactoring and re-architecting Shopify's massive codebase, sharing the winding path they took to make this massive change and the way they tackled both the technical and human sides of this challenge.

Show Notes

Ruby on Rails: https://rubyonrails.org/

React: https://reactjs.org/

React Native: https://reactnative.dev/

Go: https://golang.org/

Python: https://www.python.org/

Scala: https://www.scala-lang.org/

Kirsten Westeinde: https://www.kirstenwesteinde.com/

(Blog post) Deconstructing the Monolith: Designing Software that Maximizes Developer Productivity by Kirsten Westeinde: https://shopify.engineering/deconstructing-monolith-designing-software-maximizes-developer-productivity

(2019 Shopify Unite conference talk) Deconstructing the Monolith: https://www.youtube.com/watch?v=ISYKx8sa53g

(Blog post) Under Deconstruction: The State of Shopify’s Monolith by Philip Müller: https://shopify.engineering/shopify-monolith

Wedge: Deprecated internal Shopify tool that was built to score engineering teams on how well they are doing on their componentization journey.

Packwerk: https://github.com/Shopify/packwerk

Shipit presents Packwerk: https://www.youtube.com/watch?v=olEA157z7kU

(Blog post) Enforcing Modularity in Rails App with Packwerk by Maple Ong: https://shopify.engineering/enforcing-modularity-rails-apps-packwerk

Zeitwerk: https://github.com/fxn/zeitwerk

Martin Fowler’s design stamina hypothesis: https://martinfowler.com/bliki/DesignStaminaHypothesis.html

Shopify Fulfillment Network (SFN): https://www.shopify.com/fulfillment

Kafka: https://kafka.apache.org/

Transcript

Beyang Liu: All right. I'm here with Kirsten Westeinde. Kirsten is a software development manager at Shopify where I understand, Kirsten, you've been involved in a years-long effort to decompose Shopify's massive Rails monolith, or perhaps like multiple monoliths now, into smaller, modular components. Is that right?

Kirsten Westeinde: That's right.

Beyang Liu: Cool. Thanks so much for taking the time. Sounds like a massive effort and I'll bet that Shopify engineering has been very busy over the past year. So I really appreciate you taking the time to come chat with us and share things that you've learned along the way.

Kirsten Westeinde: Absolutely. Thanks for having me.

Beyang Liu: So I think the focal point of our conversation is really going to be this massive decomposition effort. It's going to be very fascinating to hear about because I think a lot of engineering teams try to undertake projects like this, especially as the company grows and some succeed, but a lot end in failure. And so I think you have a lot to share in terms of how you can pull this off successfully. But before we dive into all of that, we have kind of a tradition on the podcast of first asking people how they got into programming. Right now you're this software development manager who's in charge of this very complex project. You have a lot of responsibility, but at some point you had your first contact with code. So, for you, what was that? And how did that start your journey into this world of bits and ones and zeros that we live in?

Kirsten Westeinde: Absolutely. Yeah. If you had asked me when I was a kid, I definitely would not have thought that this would have been where I ended up. I actually didn't even really know what programming was before I went into university, but I did end up studying engineering just because of my love for math and science. And the school that I went to ended up doing a general first year where we basically took classes from all of the different engineering tracks to kind of learn a bit about them and decide where we wanted to go. And so in my first year I took a programming class, we learned C++ and it turns out that I really enjoyed it and I was really good at it. And the more that I looked into it, it seemed like a really interesting field to be in. So I had originally thought I would be a civil engineer, but I'm very glad that I ended up going the direction I did.

Beyang Liu: C++—that's kind of like diving off the deep end as your first programming language.

Kirsten Westeinde: Yeah. It's always funny to hear what people learn in school. It's rarely what we end up using in the field I find.

Beyang Liu: Okay, cool. So from your start in university, what was kind of like the rough journey from there to joining Shopify?

Kirsten Westeinde: I actually joined Shopify as an intern while I was still in school. So I did two internships with other companies, but I joined Shopify in 2013 out of the Ottawa office as an intern. And then I ended up doing three different internships with Shopify before joining full-time after I graduated. And through those internships, I really used the time to try out different things and try to understand where my interests lay. So I did some web development using Rails, I did some mobile development, and I worked on some internal tools and external products. So I've really had my hands in a lot of different places at Shopify over the years.

Beyang Liu: Awesome. So you've been there since 2013, which means you've kind of had a front-row seat to this whole rocket ship trajectory that the company has been on.

Kirsten Westeinde: Yeah, it's been crazy. When I joined, I think we had 150 people working for the company and we were a private company. We had only a handful of merchants. We had one small office and fast forward to today, we're an international company. We have... It's hard to keep track of it, but I think like 7,000+ employees at this time. We're a public company. So much has changed. I like to tell people I haven't worked at the same company for eight years, I've actually worked at like eight different companies because it really has been changing that quickly.

Beyang Liu: Yeah, that's crazy. And you know, I bet most of our listeners don't need any introduction to Shopify, but you know, for the one or two people who've been living under a rock for the past seven years. Chances are... If you bought anything on the internet, chances are you use Shopify, right? Because you power a lot of the kind of small businesses and independent operators that sell things online right?

Kirsten Westeinde: Yeah, that's correct. We actually don't force the Shopify brand on our merchant sites. So a lot of people buy things from Shopify stores without actually knowing that they have. But basically if you bought something from Amazon and it was enjoyable or—sorry—from anything other than Amazon and it was an enjoyable experience, that was probably a Shopify store.

Beyang Liu: Yeah. And actually for me personally, increasingly I find myself buying more and more things outside of Amazon because there's like... A lot of times the things that you can get in these kinds of independent stores are just like higher quality or it just feels closer to the seller. And so thank you for enabling all that. So diving into this large-scale refactoring and re-architecture project, I thought maybe we could start off by giving people an overview of what Shopify looks like as an engineering organization and as a codebase. So in broad strokes, what are the major teams and what are the major parts of the code, and what does it all kind of do as an engineering system?

Kirsten Westeinde: Definitely. So we talk about the core Shopify product as the core codebase. And so that's basically what powers merchants to manage their orders, to manage shipping, etc. So all of the core pieces of the product live in one Rails codebase. And it's actually one of the largest, probably the largest Rails codebase on the planet because we got started with Rails like right when Rails became a thing, and it's been under continuous development since at least 2006. And the last time I checked, we were pushing like 3 million lines of code, more than 500,000 lifetime commits and about 40,000 files. So it's massive. And in terms of an engineering org, I think we're at about 1,500 or so engineers and we're actually aiming to hire another 2,021 this year, so it'll be continuing on this massive pace of growth. And if I had to guess, I'd guess probably about half of the engineering team works daily in the core codebase and the other half works on applications outside of core.

Beyang Liu: That's crazy, that's so much code. Does it... Have you gotten to the point where Git operations start to slow down at all? Like if you were to clone the entire codebase to your local machine... Like do people even do that? Or how long does that take?

Kirsten Westeinde: Yeah, people do it, it works. The test suite definitely can get slow, but development itself is not a problem. We actually... We always joke, like we host Kylie Jenner's shop on Shopify and it's like one of the largest stores. And we joke that like we are the Kylie to so many other systems that we use. So we're hosted on Google Cloud and like we are by far the largest product hosted on Google Cloud and so we're often stretching the boundaries of their systems. And similarly, like with Rails itself, we are often the ones that find those bugs that you only find when you're operating at this scale. And so what we do is we really make choices about what technologies we want to use and then really double down on them. So we actually have a full team at Shopify who are Rails core contributors, and we try to actually just evolve the tools that we use to be better and to be better able to support us and others.

Beyang Liu: What languages are in use? So with Rails, I assume Ruby is a big language, are there any others that are big parts of the stack?

Kirsten Westeinde: Yeah. Shopify has always been very opinionated about technology choices. So essentially any problem that can be solved with Ruby on Rails should be, there are some exceptions. If applications have really, really high performance constraints or need parallelization, sometimes we use Go, the front ends are in React. And for mobile development, we've just standardized around React Native going forward as well. And then in data land, we have some Python and Scala and in infrastructure we have Go and Lua as well.

Beyang Liu: So you wrote this blog post about how you were in charge of this project to kind of decompose Shopify’s main Ruby on Rails monolith into smaller, more modular components, not microservices, but different modularized components, we'll get into that later. Tell us at a high level, what was the mandate of that project and how did you come to be involved in that.

Kirsten Westeinde: Yeah. So in, I think it was around early 2017, it was becoming very clear that our core codebase was reaching a tipping point where it was becoming more and more challenging to build functionality into it because of all of the interdependencies. So projects were taking longer, adding things that used to be simple was causing test failures and big headaches. And also it was really... We were finding it really challenging to onboard new people into the codebase because there was just so much to learn. And so we knew we had all of these problems, but we didn't want to get into solutioning without really understanding what the problems were. So we actually sent out a large survey to all the engineers working in the core codebase to hear their feedback about what the pain points were before deciding what to do.

Kirsten Westeinde: And it was that survey that actually led us to start this project. We originally called it a very clever name of, "Break core up into multiple pieces." But that name eventually evolved into componentization where essentially we wanted to make this one large codebase feel like many smaller codebases. And so we wanted to separate the code by business domain basically because that is the code that more often needs to change together. And so what that would mean is if we were hiring someone onto say the shipping team, they should be able to just understand the shipping component without having to understand all of the other elements of the core codebase. So at a high level, that was the goal.

Beyang Liu: Got it. So if I were a buzzword-driven engineering manager and you came to me and said, "Let's decompose this monolith into different independent components," I would say, "Well, that's obvious, you know, we live in 2021 and the obvious solution to decomposing a monolith is to break it down into these things called microservices." But I understand that that's not the approach that you ended up taking. So can you talk about why you didn't follow the naive path of like, let's write a bunch of microservices? What did you end up with instead? What is this modular monolith that you talk about?

Kirsten Westeinde: Yeah. I mean, I think for one thing we had been developing on Shopify core for so long that pausing everything and building out net new microservices just wasn't feasible for us. Whereas the direction we ended up going was making incremental changes in the existing codebase while teams continued to develop and add features to it. So that was a big reason, but the other thing is that unless we wanted to start from scratch, to be able to start extracting some of these pieces out into microservices, we would need to have them well encapsulated and have boundaries in place, which is what we ended up doing. But we ended up deciding that we wanted to stop there because monoliths have a lot of pros, a lot of benefits.

Kirsten Westeinde: Having only one codebase means you only have to do all of your gem bumps and your security fixes in one place. It also means you only have to manage one deployment infrastructure and one test pipeline. There's also a lot less surface area in terms of places that you could be attacked, and I think that makes the system a lot more resilient. And we found it helpful to keep all of our data in one place, in one database and allow the different components to have access to the data that they needed instead of trying to synchronize it across many different systems.

Beyang Liu: Yeah.

Kirsten Westeinde: And lastly, with microservices, the communication between the different components has to happen over the network, which introduces a whole bunch of complexity around API version management and backwards compatibility, as well as just dealing with the network in general. It can be slow. There's lots of ways it can fail. So those were all the things that we liked about the monolith and didn't want to just give those all up. And we kind of felt like moving to microservices was more of a topology change whereas what we wanted was an architecture change. There was nothing necessarily wrong with our monolith, it was just that there were no well-defined boundaries between different things.

Beyang Liu: So you had this principled, thoughtful approach of "We have developer pain, let's address that directly, but we're not going to involve it with changing the topology of production, introducing all these unnecessary network boundaries because that introduces its own complexity and its own baggage" in a way.

Kirsten Westeinde: Yeah, exactly. Our monolith had gotten us so far. We, I think, had learned so much about what is good about it, that it gave us pause to really ask the question and really think through whether microservices were right for us. And the answer at the time was no.

Kirsten Westeinde: That being said, we are believers of service-oriented architecture, so we do have other applications outside of the core codebase. However, everything related to the core functionalities of commerce, we keep in one place. But there have been some examples where we actually have extracted components out of the core codebase because they weren't related to the core functionality of commerce and we pulled them out into their own services, but that was exponentially easier to do post componentization.

Beyang Liu: So what did this modularization of the monolith actually entail? I assume you started with... My knowledge of Ruby on Rails is fairly limited, but I understand the standard directory layout is: you have a model directory, you have a view directory, you have a controller directory. Was that the structure of the codebase at the start of the project?

Kirsten Westeinde: Yeah, that's correct. So before we started this, we basically just had a Rails app out of the box, so you're exactly right. The files were organized by software concepts, but the first thing that we wanted to do was restructure our codebase to be organized by real-world concepts like orders, shipping, inventory, billing. And what we ended up doing was we actually audited every single Ruby class in the codebase, which was like 6,000 at the time, in a spreadsheet and manually labeled which component we thought that they belonged in based on the initial list of components we had. And then we built up one massive PR with automated scripts from the spreadsheet to basically just create those top-level folders. And then within the orders folder it would still be models, views, controllers, but we would just move all of the specific models, views and controllers into that folder. So we did that all in one big PR, which was kind of terrifying, but ended up being fine. We have good test coverage. So it was all good.
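The spreadsheet-driven move described above can be pictured with a minimal sketch. This is an illustration, not Shopify's script: the `components/` root folder, the two-column row format, and the file names are all assumptions made for the example. It only prints the planned moves rather than touching any files.

```ruby
# Hypothetical export of the labeling spreadsheet: "path,component" per row.
LABELS = <<~ROWS
  app/models/order.rb,orders
  app/controllers/orders_controller.rb,orders
  app/models/shipping_rate.rb,shipping
ROWS

# Assumed name for the new top-level folder that holds all components.
COMPONENT_ROOT = "components"

# Keep the Rails layer (models, views, controllers) but nest it under the
# component: app/models/order.rb -> components/orders/app/models/order.rb
def planned_moves(rows)
  rows.each_line.map do |line|
    path, component = line.strip.split(",")
    [path, File.join(COMPONENT_ROOT, component, path)]
  end
end

planned_moves(LABELS).each do |from, to|
  puts "git mv #{from} #{to}"
end
```

Because the mapping is plain data, the whole move can be regenerated and reviewed as often as needed before it is applied in one commit.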

Beyang Liu: What was the size of that diff? It must have been like...

Kirsten Westeinde: It was every file, every file was touched.

Beyang Liu: Wow.

Kirsten Westeinde: Yeah. So that was the first step.

Beyang Liu: Why wasn't that sufficient? The straw man is like, "Okay, if we put everything in its own directories and now each team gets to work in their own directory and that's done. They're like separate codebases." Why was that not just the end of the project?

Kirsten Westeinde: Well, all that had achieved was moving files around, right? Like in Ruby on Rails, everything is automatically globally accessible. So just because all the code for orders was within an orders folder, it could still use whatever it wanted from any of the other folders. So we still had all the same problems with fragility of the test suite and really if I was working in the orders component, I still had to understand the taxes component and the shipping component because of how tightly coupled they were and how there was no clearly defined API for them to communicate between the folders. Nothing had structurally changed about how the code was interacting.

Kirsten Westeinde: So it was a great first step in the sense that I now knew if I'm looking for something orders related, I can find it. And that the top-level component folders had explicit owners, which was actually a huge win. But beyond that, we still hadn't solved a lot of the existing problems.

Beyang Liu: You mentioned something in your blog post about how one of the initial pain points was as the codebase grew, the amount of code that a given developer would have to understand to get a thing done was like something like O of N, it grew linearly with the size of the codebase. And that obviously is untenable as you grow, because it just means your productivity approaches asymptotically down to zero. So how'd you go from O of N to sub linear? What were the changes that you had to make in terms of codebase structure to enable that?

Kirsten Westeinde: Yeah, definitely. So once we moved all the code into their separate folders, the next ask was for each component to define a public API. And ideally that public API should be easy enough to understand that I can understand what it's doing without having to go under the hood and look into the component. And so that would allow me to get to know one component really well. We also asked each component to explicitly say which other components they depended on. So say I was joining the orders team, I could really deeply get to know the orders component. And then as my next step, I could look at what the dependencies listed out for the orders component are, and maybe just start getting to know their public APIs so I know what methods are available to me from those components without needing to understand all the nuances of how they get things done behind the public API.
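As a rough illustration of a component with a public API and an explicit dependency list, here is a plain-Ruby sketch. The module names, the `quote_cents` method, the `DEPENDENCIES` constant, and the pricing numbers are all invented for the example and are not Shopify's code.

```ruby
module Shipping
  # Implementation detail: other components are not meant to use this.
  class RateCalculator
    def cheapest_rate_cents(weight_grams)
      500 + (weight_grams / 100) * 25
    end
  end
  private_constant :RateCalculator

  # Public API: the one entry point other components should call.
  def self.quote_cents(weight_grams:)
    RateCalculator.new.cheapest_rate_cents(weight_grams)
  end
end

module Orders
  # Hypothetical explicit declaration of what this component depends on.
  DEPENDENCIES = [:Shipping].freeze

  def self.total_cents(subtotal_cents:, weight_grams:)
    subtotal_cents + Shipping.quote_cents(weight_grams: weight_grams)
  end
end

puts Orders.total_cents(subtotal_cents: 2_000, weight_grams: 450) # prints 2600
```

Someone joining the orders team can read `Shipping.quote_cents` and stop there; `private_constant` makes Ruby raise a `NameError` if code outside `Shipping` references `Shipping::RateCalculator` directly.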

Beyang Liu: Yeah. What about the human side of this kind of change? In the organizations that we've worked with that have attempted something like this, there's the technical side, which involves making the proper changes to the code, in this case, introducing a bunch of interfaces and making those interface boundaries well-defined, but then there's also this human, semi-political aspect to it where you're going to each of these engineering teams and each of them has their own goals and objectives that they're working toward, business goals, and you're coming to them and essentially asking them to add work onto their plate to satisfy this change.

Beyang Liu: Can you talk about that element of it? Did you have to do a lot of horse trading or, "Do me a favor and I'll do you a favor?" What does that look like from your perspective?

Kirsten Westeinde: Yeah, it's interesting. I've actually talked about this project at a few different conferences and the two questions I always get are, "How did you get buy-in from the business to invest in this? And how did you get buy-in from the individual teams to invest in this?" I guess there's no really easy answer or perfect way to do it and it's going to be so nuanced depending on the organization that you're within. But one of Shopify's core cultural values that we're kind of taught from the beginning is build for the long term. And it's really driven home that we're trying to build this business for a hundred years, not for 10 years. And so with that kind of mindset, it is easier to get buy-in for these short-term pain for long-term gain type projects.

Kirsten Westeinde: I think as well, the developers that had been working in the core codebase had all been feeling this pain. And so they were eager for opportunities to make it better and easier to develop within. So it wasn't always easy. And to be honest with you, different components are at completely different points in their componentization journey based on how well their team has been able to prioritize this type of work. One thing that we did was we built this tool called Wedge, a lot of complexity under the hood, but it basically gave components scores as to how well they were doing on their componentization journey. And we made it like a little fun competition between teams.

Beyang Liu: You gamified it.

Kirsten Westeinde: Exactly. Yeah. As a technical initiative that the whole engineering org was working towards.

Beyang Liu: That's really interesting. So you kind of built a tool that condensed all these complex refactorings into something that you can measure, like a numerical score and that made it easier to track and also verify. How did you do that? How do you convert adding an interface or defining a good interface or eliminating a cyclical dependency or something into a number that is meaningful that the engineering teams don't shake their heads at and that also is useful for tracking it at a top level.

Kirsten Westeinde: Yeah. So, Wedge is a really interesting tool. What we did was we used it to hook into Ruby trace points during our test suite run to get a full call graph of all the calls that were being made as our code was being executed. And then we sorted the callers and callees by component and basically pulled out all of the calls that were happening across component boundaries and then sent them to Wedge. And we'd also send some other data like code analysis and Active Record associations and inheritance and such. And then Wedge basically just went through everything that was sent to it and determined which of these things were okay, and which ones were violating. So if a cross-component call was being made and there was no dependency explicitly declared between those two components, it would be violating. And if it was calling anything other than the public interface, it would be violating.

Kirsten Westeinde: We actually found though that it really is hard to get a number that is right. Especially with as much complexity as there is in call graph logging. So in the end we actually ended up canning Wedge because we didn't find it to be a useful feedback cycle. And also mostly because we had to run the full test suite to get this feedback and that takes way too long for it to be helpful. So we ended up going in a different direction. But in the early days it was really helpful to, like you said, gamify it.
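The runtime approach described for Wedge can be approximated in a few lines with Ruby's built-in `TracePoint`. This is only a sketch of the general technique, not Wedge itself: the two toy components, the `DECLARED_DEPENDENCIES` and `PUBLIC_API` tables, and the rule of treating the top-level namespace as the component are assumptions for illustration.

```ruby
require "set"

module Shipping
  def self.quote(weight_grams) # intended public entry point
    internal_rate(weight_grams)
  end

  def self.internal_rate(weight_grams)
    500 + weight_grams
  end
end

module Orders
  def self.place(weight_grams)
    Shipping.quote(weight_grams)         # goes through the public API
  end

  def self.place_fast(weight_grams)
    Shipping.internal_rate(weight_grams) # reaches past the public API
  end
end

# Assumption: a component is identified by its top-level namespace.
def component_of(receiver)
  mod = receiver.is_a?(Module) ? receiver : receiver.class
  mod.name.to_s.split("::").first
end

# Runs the block while tracing Ruby method calls, and returns the set of
# [caller_component, callee_component, method] calls that cross a boundary.
def record_cross_component_calls
  calls = Set.new
  stack = []
  trace = TracePoint.new(:call, :return) do |tp|
    if tp.event == :call
      callee = component_of(tp.self)
      caller_component = stack.last
      if caller_component && caller_component != callee
        calls << [caller_component, callee, tp.method_id]
      end
      stack.push(callee)
    else
      stack.pop
    end
  end
  trace.enable { yield }
  calls
end

# Hypothetical declarations a component owner would maintain.
DECLARED_DEPENDENCIES = { "Orders" => ["Shipping"] }.freeze
PUBLIC_API = { "Shipping" => [:quote] }.freeze

def violation?(caller_component, callee, method_id)
  undeclared = !DECLARED_DEPENDENCIES.fetch(caller_component, []).include?(callee)
  non_public = !PUBLIC_API.fetch(callee, []).include?(method_id)
  undeclared || non_public
end

calls = record_cross_component_calls do
  Orders.place(100)
  Orders.place_fast(100)
end

calls.each do |caller_component, callee, method_id|
  status = violation?(caller_component, callee, method_id) ? "violation" : "ok"
  puts "#{caller_component} -> #{callee}.#{method_id}: #{status}"
end
```

The sketch also shows why the feedback loop is slow: an edge is only recorded if some test actually executes that call, so the whole suite has to run before the picture is complete.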

Beyang Liu: Yeah. That makes a lot of sense. As you learn more about what was useful and productive, what kind of stepped into the place that Wedge occupied? Was there another tool that you used or was there some different set of criteria that you used to gauge overall progress and success?

Kirsten Westeinde: Yeah. We ended up building this tool called Packwerk, which actually as of September of last year is open source. And it's a really cool tool that basically analyzes static constant references. So we found that there's less ambiguity in static references and because these references are always explicitly introduced by developers, it's more actionable to highlight them, and it's much faster. So we're able to run a full analysis on our largest codebase in a few minutes, which means that we can actually put it in as part of our pull request workflow, which is a way more helpful feedback loop for developers.

Beyang Liu: That's really interesting. So Wedge was this dynamic tool. It kind of built up this call graph or reference graph from runtime by observing what actually got called and then Packwerk was building up that graph, but statically based on imports and references in source code. Is that right?

Kirsten Westeinde: Yeah, that's correct. Packwerk uses the same assumptions as Zeitwerk, which is the Rails code loader. And the way that Rails is able to make constants globally accessible is it basically infers the file location based on the constant's name. And we did the exact same thing with Packwerk.
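The naming convention mentioned here, where a constant's name implies its file location, can be sketched in plain Ruby. This is a simplified illustration and not how Packwerk or Zeitwerk are implemented: the `underscore` helper is a hand-rolled stand-in, and the file listing, autoload roots, and `components/` layout are assumptions.

```ruby
# Simplified CamelCase -> snake_case conversion for one constant segment.
def underscore(name)
  name.gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2')
      .gsub(/([a-z\d])([A-Z])/, '\1_\2')
      .downcase
end

# "Shipping::RateTable" -> "shipping/rate_table.rb"
def expected_file(constant_name)
  constant_name.split("::").map { |part| underscore(part) }.join("/") + ".rb"
end

# Hypothetical file listing; a real tool would walk the autoload paths on disk.
FILES = [
  "components/orders/app/models/order.rb",
  "components/shipping/app/models/shipping/rate_table.rb"
].freeze

AUTOLOAD_ROOTS = [
  "components/orders/app/models",
  "components/shipping/app/models"
].freeze

# Finds which component's folder contains the file the constant should live in.
def component_defining(constant_name)
  relative = expected_file(constant_name)
  AUTOLOAD_ROOTS.each do |root|
    return root.split("/")[1] if FILES.include?(File.join(root, relative))
  end
  nil
end

# True when a file in one component names a constant owned by another.
def cross_component_reference?(referencing_file, constant_name)
  from = referencing_file.split("/")[1]
  to = component_defining(constant_name)
  !to.nil? && from != to
end

puts expected_file("Shipping::RateTable")
puts component_defining("Shipping::RateTable")
puts cross_component_reference?("components/orders/app/models/order.rb", "Shipping::RateTable")
```

Nothing here executes application code, which is why this style of check can run in minutes on a pull request instead of waiting for a full test run.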

Beyang Liu: So the programming languages part of my brain is saying, "But wait a minute, Ruby is a dynamically typed language, not all types are known at compile time." But were you able to solve that problem in Packwerk? Like inferring types or was it some element of best effort? Like, this is good enough.

Kirsten Westeinde: Yeah. A hundred percent the latter. We understand that it's an imperfect solution and there's a lot of crazy stuff that can happen in Rails that will not be detected through this, but it's good enough essentially. And it does catch a lot of things. So we decided that the fact that it can happen so much faster and provide that faster feedback loop is a trade-off we're willing to accept for the fact that yes, maybe some things will slide under the radar.

Beyang Liu: I guess it's important to recognize that there's humans in the loop in this process. So in some sense, it'd be a much harder problem to completely automate it because then you'd have to get everything precise. But as long as there's humans in the loop, they can step in and say, "Well, we recognize this tool's a little bit fuzzy, as long as the signal to noise ratio is decently high, we can work with this."

Kirsten Westeinde: Absolutely, and the magic of this tool is not only within the tool itself, but it's within the conversations that it sparks, because if on my PR I get told, this is a cross-component violation, and maybe I don't know what that is, or maybe I'm a more junior programmer, and I don't have an idea of how to do inversion of control, for example, there's been a lot more conversations around software architecture across Shopify because of this tool.

Kirsten Westeinde: So, what that means is sometimes maybe the tool won't catch cross-component boundaries, but because it's now becoming a part of our engineering culture, maybe someone will in a PR review. Building that muscle of good software design really is the overarching goal of this whole thing. So, yes, the tool's not perfect, but it's been really valuable for us.

Beyang Liu: You just mentioned inversion of control, and I think that along with some other things you called out in your blog post were good rules of thumb or good general principles that you may have discovered along the way of doing this. Can you talk about some of those principles or maybe empirical patterns, in terms of if someone were to attempt this on a Rails codebase in general? What are the tactical things, the tricks that you can apply to make the codebase more decomposed and modular?

Kirsten Westeinde: Yeah. I think the first thing I would say is that if someone's going to be trying this on a Rails codebase, they'll probably be in a similar situation that we are, or that we were, which is basically, if you look at the dependency graph that has just naturally happened, it's crazy. Every component was basically depending on every other component. There's a ton of cyclic dependency. Rails lends itself naturally to have high dependency.

Kirsten Westeinde: So, I'll say that first. Even just being able to visualize that dependency graph allows you to reason at a much higher level about, does it make sense for this thing to depend on this thing? Sometimes it does, and other times it doesn't, but when it doesn't, that's when you start to use some of these tactical approaches for decoupling those things.

Kirsten Westeinde: I mentioned inversion of control, which basically is saying ... So, I'll give an example. I work on a product called the Shopify Fulfillment Network, where, basically, some merchants, we host their products on their behalf. When orders come in, we'll choose the most optimal warehouse to fulfill it out of to get it to the buyer as quickly as possible.

Kirsten Westeinde: SFN, that's the acronym for the Shopify Fulfillment Network. It needs to know when an order comes in on a merchant store. One way to do that could be for the orders component to make a call to a method in SFN and say, "New order has been created," and then SFN responds with, like, "Great. We got it."

Kirsten Westeinde: That's a hard dependency from the orders component to SFN, whereas if we want to get rid of that dependency, we can actually just use something like a publish-subscribe mechanism. There's a bunch of different ways to do that, but at Shopify, we use Kafka events a lot. Basically, the orders component can just say, like, "An order happened," and then SFN can choose to subscribe to that and do its processing without needing to necessarily reply, and that has now broken that dependency between the two.

Beyang Liu: It's like, at run time, you want this one thing to provide behavior to this other thing that is going to run, but at compile time, you don't actually want an import going from that thing to this other thing, because that adds a hard dependency, and it means that someone working on the first thing will have to go through the second thing and understand how it behaves. There's not this interface boundary that says, "Okay, don't worry about the stuff behind this curtain."

Kirsten Westeinde: Yeah. That's exactly right, and had there been a hard dependency, like, say, from the orders component to SFN, then the orders component would have to care every time SFN changed. Now, having done this inversion of control, it does not at all. It just has to fire its event.
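The orders and SFN example can be sketched with a tiny in-process event bus. The `EventBus` class, the `order_created` topic name, and the module names are invented for illustration; Shopify is described as using Kafka events for this, and the bus here is only a stand-in that keeps the example self-contained.

```ruby
# Minimal in-process publish-subscribe bus (a stand-in for a real event system).
class EventBus
  def initialize
    @subscribers = Hash.new { |hash, topic| hash[topic] = [] }
  end

  def subscribe(topic, &handler)
    @subscribers[topic] << handler
  end

  def publish(topic, payload)
    @subscribers[topic].each { |handler| handler.call(payload) }
  end
end

module Orders
  # Orders only announces that something happened. It never names SFN.
  def self.create(bus, order_id)
    bus.publish("order_created", { order_id: order_id })
    order_id
  end
end

module FulfillmentNetwork
  def self.received
    @received ||= []
  end

  # SFN opts in to the event, so the dependency points from SFN to the event.
  def self.subscribe_to(bus)
    bus.subscribe("order_created") { |event| received << event[:order_id] }
  end
end

bus = EventBus.new
FulfillmentNetwork.subscribe_to(bus)
Orders.create(bus, 42)
puts FulfillmentNetwork.received.inspect # prints [42]
```

`Orders` would keep working unchanged if `FulfillmentNetwork` were renamed, rewritten, or removed, which is the practical meaning of the dependency being broken.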

Beyang Liu: Another thing that you talked about in the post was this kind of loose coupling, but high cohesion pattern that got adopted. Do you mind explaining what that means?

Kirsten Westeinde: Absolutely, yeah. This was one of the core principles of this project that we were striving towards. Coupling is basically the level of dependency between modules, or classes, or in this case, components. So, you want that to be low, because you want your dependency graph to be as light as possible, but you want the cohesion to be high. The cohesion is basically the measure of how much elements within a certain encapsulation belong together. So, whether you're looking at a module level, a class level, or the package level.

Kirsten Westeinde: One thing to note is that it's really hard to get the component interface right when the classes and modules within it have interfaces that are not well-defined. It really starts with good software design from the bottom up. That makes it a lot easier to get the component interfaces right.

Kirsten Westeinde: Then, on the point of cohesion, there’s two different types of cohesion that we think about. The first is functional cohesion, so that basically means code that performs the same task lives together. You can think of service objects as being functionally cohesive, whereas data or informational cohesion is code that operates on the same object living together.

Kirsten Westeinde: Again, Rails really lends itself to data or informational cohesion, because of the way that Active Record models are built up to interact with one database table. We often add a lot of methods relating to that object on its model, even though they might be parts of completely different tasks. I'm sure different languages lend themselves to different ones, but we found ourselves way over-skewed in the direction of data and informational cohesion.
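
The two kinds of cohesion can be contrasted in a small Ruby sketch. Everything here is hypothetical: `Order` is a plain struct standing in for an Active Record model, and the tax rate and shipping prices are made-up values for illustration.

```ruby
# Plain struct standing in for an Active Record model (hypothetical fields).
Order = Struct.new(:subtotal_cents, :weight_grams, keyword_init: true)

# Data (informational) cohesion: unrelated tasks live on the object they touch.
class FatOrder < Order
  def tax_cents
    (subtotal_cents * 0.10).round
  end

  def shipping_cents
    weight_grams > 1000 ? 1500 : 500
  end
end

# Functional cohesion: each service object holds the code for one task.
class TaxCalculator
  RATE = 0.10 # assumed flat rate, for illustration only

  def call(order)
    (order.subtotal_cents * RATE).round
  end
end

class ShippingQuote
  def call(order)
    order.weight_grams > 1000 ? 1500 : 500
  end
end

order = Order.new(subtotal_cents: 2000, weight_grams: 1200)
puts TaxCalculator.new.call(order)
puts ShippingQuote.new.call(order)
```

Both styles compute the same numbers. The difference is where the code lives: `FatOrder` grows with every task that touches an order, while each service object can sit in the component that owns its task.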

Beyang Liu: Where is this project today? You mentioned it started in 2017, and it's been a multi-years-long effort. Are things still being componentized in the main Rails codebase, or at this point, has it been componentized, and it's more or less in a done state, you just have to maintain the status quo?

Kirsten Westeinde: Yeah, I wish I could tell you we were in a done state, but the reality is that we're not. All of the code has been moved into components, and all of the components do have publicly defined entry points, but the reality was that we were starting from a place where a lot of these calls were already violating the rules. So, one of the nice features that Packwerk built was that you could essentially declare bankruptcy, and start from a certain point, and only start tracking violations beyond that. We have a list of deprecated references, and the rule is basically that you can't add any new violations.

Beyang Liu: Got it.

Kirsten Westeinde: We know that there are some existing ones, but no new ones can be added. Over time, that deprecated references list basically gives us a tech debt to-do list to work off of. Like I said, different components are in different shape, but we haven't yet gotten to the point where all of those violations are gone. Only once all of those violations are gone will we be able to actually enforce those boundaries.
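
The "no new violations" rule can be sketched as a comparison against a recorded baseline. This is an illustration of the idea only and does not use Packwerk's API or file format; the `Violation` struct, the helper names, and the sample component names are all assumptions.

```ruby
require "set"

# A reference from one component into another's private code (hypothetical shape).
Violation = Struct.new(:from, :to, :constant)

# Returns only the violations that are absent from the recorded baseline.
def new_violations(current, baseline)
  known = baseline.to_set
  current.reject { |violation| known.include?(violation) }
end

# Baseline entries that no longer occur: tech debt that has been paid off.
def resolved_violations(current, baseline)
  found = current.to_set
  baseline.reject { |violation| found.include?(violation) }
end

baseline = [
  Violation.new("orders", "sfn", "Sfn::Allocator"),
  Violation.new("orders", "billing", "Billing::Invoice"),
]
current = [
  Violation.new("orders", "sfn", "Sfn::Allocator"),
  Violation.new("checkout", "orders", "Orders::Internal"),
]

added = new_violations(current, baseline)
puts "new: #{added.map(&:constant).inspect}"
puts "resolved: #{resolved_violations(current, baseline).map(&:constant).inspect}"
```

A check like this would fail a change whenever `added` is non-empty, while the shrinking baseline doubles as the tech debt to-do list mentioned above.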

Beyang Liu: Fascinating. You have this list of tech debt to-do items. In a given iteration cycle, how do you decide how many of those to tackle versus how many, I guess, more product-oriented things to tackle? Because that's something that our teams at Sourcegraph, I assume, like probably any software team struggles with. What is the right balance of tech debt versus new feature development to take on?

Kirsten Westeinde: Yeah. That's a hard question.

Beyang Liu: Can you put a price on a tech debt item? It's almost an impossible exercise.

Kirsten Westeinde: Yeah, no, it's really a hard question. Honestly, the answer's going to be different team to team at Shopify. One of the patterns that I've seen in my years at Shopify is that we build a lot of features. We move really quickly leading up to Black Friday and Cyber Monday, because that's when our merchants make the vast majority of their sales for the year. So, it's really important that we've given them the tools that they need to be successful during that time.

Kirsten Westeinde: Come the holiday season, we actually tend to cool down a bit on feature development, just because we don't want to be breaking things when our merchants are having their most important sales of the year. So, often, that feature cooldown can be a good time to tackle some of these larger units of tech debt. That said, I'm not saying that we just leave tech debt for one point in the year. We definitely try to find pauses between projects, or even just parallel tracks of work, where some people are working on more technical debt, and some people are working on product features. It's definitely a hard balance, but we have been slowly chipping away at it.

Beyang Liu: I understand that the Shopify Fulfillment Network, which you mentioned earlier, that's a separate codebase. It also is or was a Rails monolith. You're currently tackling a similar project in that codebase, but adjusting some things based on what you've learned from this first major...

Kirsten Westeinde: Decomposition.

Beyang Liu: ... decomposition componentization effort.

Kirsten Westeinde: Exactly.

Beyang Liu: Tell me about round two. What's different this time in the sequel?

Kirsten Westeinde: Yeah. We actually took a pretty different philosophical approach. What we did in Core was we moved all of the code into the components, and then slowly, over time, started trying to break some of those dependencies, whereas what we've done in the Shopify Fulfillment Network was we've introduced separate components. We've actually used Rails engines, which are like the one modularity feature that comes out of the box with Rails, and that can allow us to enforce those boundaries a little bit more strictly. Then, over time, piece by piece, moved bits of code into it that are always respecting the boundaries.

Kirsten Westeinde: So, we flipped it on its head that way, whereas we still do have the main app controllers, models, etc., but the hard rule is we can't add any more code into the app folder. All new code goes into component folders. Over time, we're pulling code from the app folder out into its, basically, correct component, so that anything that's in a component is respecting the component boundaries, whereas that is not the case in Core.
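
The "no more code in the app folder" rule could be checked mechanically along these lines. This is a hypothetical sketch, not a description of Shopify's tooling: the `new_legacy_files` helper, the `components/` folder name, and the sample paths are assumptions.

```ruby
# Flags files that a change adds under the legacy app/ folder (hypothetical check).
def new_legacy_files(files_before, files_after, legacy_prefix: "app/")
  (files_after - files_before).select { |path| path.start_with?(legacy_prefix) }
end

before = ["app/models/order.rb", "components/inventory/app/models/stock.rb"]
after = before + [
  "app/models/shipment.rb",
  "components/routing/app/models/route.rb",
]

offenders = new_legacy_files(before, after)
puts offenders.inspect
```

Existing files under `app/` are tolerated, so the legacy folder can only shrink as code is pulled out into components.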

Beyang Liu: Interesting, so, with Core, the approach was move all the files around into their appropriate directories, and then, over time, reduce the linkedness or monolithic-ness of each component, whereas with Fulfillment, you basically said, keep the existing monolith as is, but each new feature has to be added to something that's external in a Rails engine. Then, we'll have some connection between the two, so that the functionality makes it into the application, but we always preserve this, I guess, invariant that all of the new components are modularized. Is that about right?

Kirsten Westeinde: Yeah, that's right. One of the main benefits of that is that with so many developers working in a codebase, people tend to want to follow the patterns that they see existing. In Core, there is a mix of good patterns and bad patterns, because some stuff has been adjusted to break that interdependency, and some stuff hasn't. So, it's hard to know which pattern to follow, which way is the right way, when there are both examples, whereas in SFN, anything that has been componentized is following the patterns that we're wanting to strive for, so there's more good examples to be able to learn from, and demonstrate to other developers, and follow.

Beyang Liu: I get all the learnings here. This is really fantastic insight. One thing I'd love to hear about is like, what is the process on the ground by which you arrive at these insights? Like, for this new project, we don't want to move a bunch of stuff in the existing codebase around, but we want to start with something new, or moving from the dynamic dependency call graph to the static version, what are those conversations like? Is it you and a bunch of other engineers or engineering managers in a room, whiteboarding, or someone coming up with a bright idea, and slowly convincing other people? Are there any, I guess, war stories that come to mind when you reach these insights?

Kirsten Westeinde: Yeah. It's always a conclusion that we come to over time. In the example of Wedge, it was working well for a while, but over time, the problems with it became louder, and louder, and louder until we couldn't ignore them. So with that one, it was really, as our test suite was getting larger and slower, it was really just that it wasn't a helpful feedback loop to have to wait for that entire test suite to run to be able to get that insight. And so we knew there was a problem with that. But then the question is, what's the solution?

Kirsten Westeinde: So we have a few different avenues for brainstorming these ideas. One is actually a team whose mandate is architecture patterns, and they're the ones that built Packwerk and provided a lot of the patterns to follow. And so sometimes it's just projects that get resourced, and we do explore and prototype phases. And so being armed with the learnings from Wedge, it led us down the road to Packwerk.

Kirsten Westeinde: But the other thing is we have a software architecture guild at Shopify, which is basically anyone from the company who is interested in software architecture. We have a Slack channel and a bi-weekly meetup. And that's where a lot of these conversations come to a head. Because I think there were a lot of learnings from Core, but then for example, the developers who were starting to build out SFN may not have had access to that frontline information.

Kirsten Westeinde: And so we did a lot of pairing actually as a way to share this knowledge and have a lot of discussions with the architecture guild. Because Shopify is very, I guess each team has a lot of autonomy. So the SFN team had the autonomy to decide what the right solution for it was, but definitely wanted to leverage the learnings that were present from Core. So that's the type of discussion that we would have at an architecture guild. Or we also do technical design reviews for any big tech projects we're kicking off. So that would be another. We have templates for those, and we have filled those templates with some prompting and questions to try to make people think about modularization as part of the upfront design.

Beyang Liu: Interesting. Tell me more about this guild. So what is a guild exactly? Is it just a collection of individuals who have a common technical interest area? Or how is it organized? Is there a process associated with it? How does it fit into the overall org structure of the engineering team?

Kirsten Westeinde: Yeah. Actually it doesn't fit into the org structure of the engineering team. It's composed of people from across everywhere in the engineering organization. It's completely opt-in, and it grew organically from people who were curious to be having conversations like this. I think one big learning actually that my coworker Philip called out in his blog post was that he wished that we had started this sooner in the componentization process, so that more minds could be involved and have buy-in, basically, to apply the approach that we were aligning upon in their different areas of the company.

Kirsten Westeinde: But it really is just people that want to nerd out about software architecture, and people come with presentations every other week. Sometimes we'll just share the architecture of one of our systems. Sometimes we'll have more pointed discussions about certain software design strategies. It's pretty organic, but people like it.

Beyang Liu: That's awesome. So the guild formed after the start of the original project.

Kirsten Westeinde: Yeah. That's right.

Beyang Liu: At what stage do you think it makes sense to form a guild? I think that we run into the same issue, and I think a lot of other engineering organizations run into this issue, where you have your org chart and you try your best to produce an org chart that reflects the needs of the product and how knowledge and information needs to flow. But no chart is ever going to be perfect, and there's all these areas of expertise that don't neatly fit into that. And I guess the question in my mind has always been, at what point do you start to need these guild structures? Is it at 100 engineers? Is it at 1,000 engineers? Or what, in your mind, is the right size?

Kirsten Westeinde: Yeah. We've actually experimented with a few different things. The guild is more of a meetup. We less drive change through the guild, I would say, other than grassroots movements. But we also have what we call a technical leadership team, which is on a rotational basis. Technical leaders from across the company will be on that team. And anytime we're doing a tech design review, you'll get paired up with someone from the technical leadership team. You can bring them your gnarliest problems, and it's made up of people from all different parts of the org with all different perspectives on the problem.

Kirsten Westeinde: So it's a way to try to, I guess, make sure that we're leveraging the learnings that are available at the company level. And then there's also what we were chatting about before, I forget what it's... the tools and patterns team. And their actual mandate and what they work on day in and day out is building out some of these tools and patterns.

Kirsten Westeinde: And I think I would guess that team spun up probably in 2018 just based on this project having started in 2017. And yeah, we were probably 500 to 600 engineers around then. It's going to be different for every organization, and I think it depends on how much your organization already has it baked into their DNA to be thinking about these things, to be thinking about good software design, technical initiatives, etc., or not. Sometimes it might make sense to start earlier if that's not as organically present.

Kirsten Westeinde: There's trade-offs to each approach. One of the challenges that the tools and patterns team has is that they're just building tools and patterns, but they aren't necessarily using those tools and patterns to build a product. So they need to make sure that they are in touch with the people that are using those tools and patterns and actually solving the right problems. And so we do that through pairing and internal documentation presentations, etc. But we really try to make sure that the problems that need to be solved are being solved, and we're not just adding friction to developers' lives because we think it's nicer.

Beyang Liu: Yeah. That makes a ton of sense. Looking back on both projects that you've been involved in with decomposition, to what extent are the patterns that you discovered, do you think, specific to those particular codebases? And to what extent do you feel like there's some general principles here that would be widely applicable? That's one question. A more proximal question would be the way that you're approaching modularization on the fulfillment network codebase. Do you wish that you had done that with the main codebase initially, knowing what you know now? Or do you think that each codebase found its proper approach?

Kirsten Westeinde: Yeah. It's a challenging question because they both have their trade-offs. So I think that you just need to know in your situation which trade-offs you're willing to accept. For the Shopify Core case, what we were really struggling with and really wanting to optimize for was that we had a lot of code that didn't have ownership. And so having moved the code into folders that have explicit owners is a big win for that. Whereas in SFN, we still have a lot of code living in the app component that doesn't have as clear owners, but for us that's okay because it's a smaller group of people working on it. And so it's easier to share the load across our team, which is smaller. Whereas for Core, it was really important that we get it right in terms of code ownership.

Kirsten Westeinde: So I think you really have to ask yourself, what is the intermediate state that you're most comfortable with? Because these things take a long time. And so I think Philip has an amazing quote in his blog post. He says, "My experience tells me that a temporary incomplete state will at least last longer than you expect. So choose an approach based on which intermediate state is most useful for your situation." And it's 100% true in this case. Both of those codebases are in intermediate states that are correct for them, and I think you just need to think about what that means for your codebase.

Beyang Liu: Yeah. That's very fair. Somewhere out there there's a person listening who wants to propose a big refactoring re-architecture project, but you often run into resistance from Product, from other stakeholders in the business because the value of such a project is not always clear. What advice would you have for people in that position, just in the very early stages of thinking through starting such a project and getting organizational buy-in?

Kirsten Westeinde: Yeah. One thing that I think about often as it relates to this is, I don't know if you're familiar with Martin Fowler's design stamina hypothesis, but it basically says that in the beginning—

Beyang Liu: I read that in your post.

Kirsten Westeinde: Yeah, I love it. I talk about it all the time. It basically says that in the beginning, no design is actually the best design, and you'll be able to move more quickly without design. But at some point you're going to reach a point where not having a design is going to slow you down more and more and more, and adding incremental functionality just gets harder and harder.

Kirsten Westeinde: So the first question that I would ask is, are we past that point? He calls it the design payoff line. Are we really at a point where we need this design? Because honestly, it's worse to have a bad design than to have no design, in my opinion. So you want to make sure that you have enough information and have built enough features to understand what design your codebase needs. So that's the first thing.

Kirsten Westeinde: But if you think that you are in that position, I always find that the business and the product managers are very motivated by data. So the more data that you can capture, the better. So I don't know if your organization uses story points, but say you could point to a three-point issue that used to take half a day and now takes a day and a half. Anything that you can point to to show that you are getting slower is a really good starting point.

Kirsten Westeinde: Often they will have been feeling similar things. It probably won't come out of left field to them. They might feel like projects are taking longer than they should and are harder than they should. So the more you socialize the idea of, "X would be easier if we did this" and really just keep being the squeaky wheel, that can be really helpful.

Kirsten Westeinde: The other thing too is we would absolutely never have gotten buy-in to stop feature development and do this. So if you can do it in an incremental way that you can chip away at over time, it'll be much more likely that you can get buy-in. And if you're in the lucky situation where your company has enough engineers to be working on parallel tracks of work, then if some people can still be building features while some people do this, it's going to be a lot easier of a pill to swallow for the business.

Beyang Liu: Yeah. That makes a lot of sense. Kirsten, thanks so much for taking the time today. If there's people listening to this and they're interested in learning more about this effort or more about projects and learnings that you've undertaken, where should they go on the internet to find out more and learn more?

Kirsten Westeinde: Well, Shopify has an engineering blog where we've published a few different blog posts on this topic. So I would definitely start with checking those out. We also have a new YouTube series, ShipIt Presents, about some of the engineering efforts at Shopify, and there's an episode that speaks about this as well. So check those out. And there's lots of other interesting non-componentization-related stuff on the Shopify engineering blog. So I would say give that a look.

Beyang Liu: All right. My guest today has been Kirsten Westeinde. Kirsten, thanks for being on the show.

Kirsten Westeinde: Thanks for having me. It was fun.