Episode 8: Rijnard van Tonder, creator of Comby
Rijnard van Tonder, Beyang Liu
Rijnard van Tonder is the creator of Comby, a pattern-matching syntax and command-line tool that offers a more expressive and more user-friendly alternative to regular expressions for many common patterns in code.
Rijnard earned his PhD from Carnegie Mellon University in 2019. In this podcast, we chat about the state of the art in static analysis and automated bug-fixing, new tools made in industry like Pyre and SapFix, and what place machine learning has in the world of developer tools.
Show Notes
Rijnard van Tonder: https://twitter.com/rvtond, https://rijnard.com
Comby pattern matching syntax: https://comby.dev
Comby Gitter channel: https://gitter.im/comby-tools/community
Program synthesis: https://en.wikipedia.org/wiki/Program_synthesis
Tree sitter: https://github.com/tree-sitter/tree-sitter
Codemod, from Facebook: https://github.com/facebook/codemod
Automated bug fixing and program repair: https://en.wikipedia.org/wiki/Automatic_bug_fixing
Chomsky's hierarchy (regular, context-free, and Turing-complete languages): https://en.wikipedia.org/wiki/Chomsky_hierarchy
Regular expressions, or regex (pronounced "REG-ex" or "REJ-ex"): https://en.wikipedia.org/wiki/Regular_expression
Holes in Comby (https://comby.dev/docs/syntax-reference) vs. named capturing groups in regular expressions (https://www.regular-expressions.info/named.html)
Rascal: https://www.rascal-mpl.org
Spoofax Language Workbench: http://www.metaborg.org/en/latest
Infer static analyzer, from Facebook: https://fbinfer.com
Rice's theorem: https://en.wikipedia.org/wiki/Rice%27s_theorem
SapFix, Sapienz, Mark Harman from Facebook (note: in the recording, we mixed up Sapienz with SapFix. SapFix is the end-user tool, Sapienz is an underlying technology which can be used to surface issues to SapFix): https://engineering.fb.com/developer-tools/finding-and-fixing-software-bugs-automatically-with-sapfix-and-sapienz/, https://engineering.fb.com/developer-tools/sapienz-intelligent-automated-software-testing-at-scale, https://research.fb.com/blog/2019/05/spotlight-session-with-mark-harman
Pyre, from Facebook: https://pyre-check.org
Transcript
Beyang Liu: All right, I'm here with Rijnard van Tonder, an engineer at Sourcegraph who created the Comby pattern matching syntax, which is an alternative to regular expressions, but much easier to use, and designed and optimized for use in code. Rijnard, welcome to the show.
Rijnard van Tonder: Hi Beyang. Yeah, thanks for having me on.
Beyang: So I guess before we dive into all the technical stuff, I thought it'd be interesting to go over the backstory of how we met each other. Do you remember how we first met?
Rijnard: Yeah. I think you might have to fill in some gaps here. I remember my side of things. Basically, I was in grad school at CMU and working on research problems, taking classes, but at the same time curious to explore other things as you do in grad school. At one point, code search came up on my radar and I started playing with it. I thought, code search, this is cool, but also I was poking at security holes trying to find some fun things. There wasn't much. I think the basic stuff was covered. But at some point, I found you could have an information leak for public and private repos, and that's when I reached out. I think you were the person who replied to that email, I think.
Beyang: Yeah. I think you sent it to security at Sourcegraph, which is our catch-all for any security bug report. Yeah, I did reply to that.
Rijnard: Yeah. I think that kicked off like, "Oh, do you want to talk about what Sourcegraph does? Do you have any feedback about the product?" and so on. I remember hopping on a call with a couple of folks back then. I thought it was interesting. You're tackling a hard problem where there was a bit of a gap, right? The reason I came across Sourcegraph's stuff is code search that works a little bit more ubiquitously than anything else out there.
Beyang: Yeah. I remember that, because I remember looking you up when you reported the bug. It was a very nice bug report. It was for an issue that was indeed an issue. It was well described and it was a good report. I looked you up and saw that you did research into software engineering, developer tools, static analysis, that general area. Afterwards we were email buddies for a while. Then, I think at some point you came out to California because you were doing an internship at Facebook or something like that. You dropped by the office.
Rijnard: Yeah, that's right. That was a couple of years after. I think that was at least-
Beyang: Yeah, that's right.
Rijnard: ... two years after we had the email exchange. Yeah, I was just there and I was like, "Oh, I remember that company, Sourcegraph. So I'm going to go pay a visit." Yeah, at the time, I think Sourcegraph was doing a lot of things in the LSP space, right? So, Language Server Protocol stuff, and I was also becoming more aware of those efforts. And so that tied into some of the things that I ended up working on during the Facebook internship there. Working on Python and surfacing Python-related code intelligence or code editor functionality.
Beyang: Yeah, at least from my point of view, it was all very serendipitous. I knew very early on that I would love to work with you someday. Just so happy that you ended up joining the team, and bringing all your thoughts and insights into what's on the research frontier to what we're doing at Sourcegraph. Because I think there's a lot of interesting product features and capabilities that can be built on top of that, which I suspect we'll get into over the course of this conversation.
Rijnard: Yeah. So thinking about it now, right, what I remember as well is everybody I talked to at Sourcegraph was very keen to get feedback, right, on not just the product or the ideas that Sourcegraph was working on, but also just like, "What are your thoughts on having, for example, code search or code intelligence that's ubiquitous?" And these sorts of things. There was never a dead end to the conversation with anyone I've talked to at Sourcegraph. When we initially talked, it was always like, "What do you think of this? What do you think?" That stood out to me.
Rijnard: I think that basically I identified a deep desire to go after those goals, go after the things that the company is setting out to do, which is this idea of code search everywhere, code intelligence everywhere, every editor, every interface. I think, it was also a question of, I guess, mutual interest and really commitment to this idea. That's why I ended up here.
Beyang: Awesome. I guess, before we get into research that you've done and Comby in particular, I always like to kick off the show by asking people how they got into programming originally. So, what was your backstory?
Rijnard: I think I got into my first real programming experience during high school. So in my final two years in high school I learned to code in Java. I still remember, I think, my first programming experience was looking at a textbook and there was this program in the textbook. It's like, "Copy this, right, character-for-character into your editor, and then type it out." I'm like, "I can do this." Right. And we're going to run this thing. I'm copying. Then after I think I've copied the whole thing in the book I click run or compile, it was compile, right?
Rijnard: This thing just tells me all of these syntax errors happening everywhere. I'm like, "What? How did I make this mistake? I looked very carefully at what the book was telling me to do." And it's like, "You missed a semi-colon, you missed a brace or something." I was so surprised back then because I was like, "Wow, it's so easy to make a mistake during this." Right.
Beyang: Yeah.
Rijnard: So that's one of the first recollections I have of programming. From there I became more interested in understanding the whole theory around it. Not so much computer theory, but just general ideas like algorithms, data structures, that sort of thing. Then, in university, I focused on learning more of computer science. I think it was only at university that I got exposed to Linux for the first time, and becoming a command line power user and all that kind of thing happened over the course of my varsity years. I can go into detail, but that's basically the early beginnings for me.
Beyang: Yeah, I think that makes sense. It's, I think, a little bit similar to how I got into programming. My first programming language was also Java in a high school environment.
Rijnard: My condolences.
Beyang: There are, I think, worse first experiences.
Rijnard: [crosstalk 00:09:18]. I'm just poking fun.
Beyang: Yeah. Where along the line of your studies did you decide that you wanted to get a PhD, or was that something that you always felt that you wanted to do?
Rijnard: Definitely not something that I had my heart set on doing. I think my story aligns with a lot of other graduate students or people who end up in academia, at least for computer science. You try and shoot for a technical job, maybe you just failed too many technical interviews, and you're like, "Okay, I'm going to go to grad school now." I definitely have a couple of data points of people who shared that experience. To some extent I am one of them. I think I did a couple of Google interviews and it didn't work out. So I was like, "Well, let's try this other avenue," right? You send out applications for various things. So I'm like, "I'm open to this whole PhD idea." That ended up working out. So, I went for it.
Beyang: Yeah. I got rejected from Facebook myself, so I can definitely sympathize with getting rejected from tech companies.
Rijnard: Yeah.
Beyang: Cool. How did you pick your research area? Was that something that developed over the course of your undergraduate career or was this some other influence or inspiration?
Rijnard: The reasoning to my mind was, I'm only interested in going through a whole research experience or doing academic research if it's a topic that I really care about, right? Because all the advice out there is, "Only do this if you really are committed to the idea or you're really interested or passionate, if you will." I don't think that's a prerequisite for going into research and my perspective on that has changed a bit afterwards. Certainly, going in it was a question of, "I know that I want to focus on these areas. This is interesting to me.
Rijnard: I'm not going to put myself through a lot of pain to learn things that I'm not interested in." The stuff that stood out to me back then was, software security research, but also just automated techniques, automated program analysis for finding bugs, for example. That was the biggest piece of attraction, right, was can we use tools to automatically find bugs? Bugs, and security bugs specifically, are super interesting because they have such severe consequences. It's amazing what ends up happening if you exploit one of these super severe bugs in an important system.
Rijnard: And so for me, it was a question of, "How far can we push automated techniques to find these really complex, really interesting bugs?" Then from there it changed a bit of direction, but that interest has always been there.
Beyang: Yeah. That makes sense. I guess the general category I put all that under is like, it's almost like metaprogramming, like programming tools that write programs or help you debug programs.
Rijnard: Yeah, there's definitely a lot of, I think, meta to it for sure. People love throwing that term around. I think metaprogramming as a term is pretty loaded. So I won't try to put too much definition to that. I know people who are in the program synthesis community, right, programs generating programs to do things. This is all also very meta, but also quite different from writing programs to find bugs in other programs. But definitely related.
Beyang: Yeah. So one of the things that you covered in your PhD research was this new pattern matching syntax called Comby. Can you tell us a bit about what Comby is and what was the motivation for creating it?
Rijnard: Comby is a tool I wrote that essentially fit the bill for the research problem I was tackling at the time. I didn't really find other tools that could do quite the thing that I wanted out of a code transformation tool. Basically, that came down to I wanted to selectively change parts of a program, parts of program syntax, without having to compile the whole program. Without needing to know the whole program scope or calls across files or anything like that. Just give me one file, even if this file contains some malformed syntax, or it contains macros or things like this. I didn't need a full compilation or even a full parse of this thing. All I wanted to be able to do was change an if statement, right, or a for loop or things like this.
Rijnard: And to make it more concrete, what ended up happening was I wanted to change something in Java and C. It was a simple if statement. The conventional way that I would have had to go was, find tools that could change a particular if statement matching whatever conditions I was interested in changing, in these separate languages. Now there are tools out there where you can express a transformation that covers multiple languages. They were a little bit more heavyweight than I was willing to entertain.
Rijnard: So I ended up going, "Can I just find some simple Java and some simple C thing? Maybe Clang can do this for me, and some Java parser and things." I ended up picking out some Java parser project, as well. Then I wanted to change this particular if statement, but I couldn't do it without parsing the whole program. I couldn't even parse a single file, and so then I ended up going to this Slack or a Gitter channel, I don't quite remember. I asked them, "Can I change this one thing just without doing anything super complex?" And the response was like, "Oh, why would you want to do that?
Rijnard: Why not just parse the whole thing and do the thing that you want to do?" I was like, "That's fine, because at the end of the day, these tools are made for particular reasons. The thing I wanted wasn't aligned with this, right? So I wanted to change these fragments." Comby is an answer to that, which is basically: parse a program to a general parse tree that preserves some of the fundamental syntax that corresponds to a tree structure, right, which is typically like braces and parentheses and these things.
Rijnard: But you also have to take into account things like comments and strings, because if you have parentheses or braces in a comment or a string, you don't want that to be considered part of the tree or the syntax that you're trying to parse. So it's not just a question of how you're going to detect a couple important characters. It's like you actually have to parse the whole thing. The good news is that notion generalizes over a lot of languages, right? Because at the end of the day, trees are useful in programs. It turns out this is a useful representation that basically all programs use.
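To make the point about comments and strings concrete, here is a minimal Python sketch. It is an illustration only and not Comby's implementation: the function name `nesting_tree` is hypothetical, and it assumes C-like syntax with double-quoted strings and `//` line comments. It builds a tree of balanced delimiters and ignores any delimiter that appears inside a string or comment.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def nesting_tree(source):
    """Return the delimiter nesting of `source` as a list of (opener, children).

    Assumes double-quoted strings with backslash escapes and // line comments
    (an assumption made for this sketch, not something any real tool promises).
    """
    root = []
    stack = [(None, root)]  # each entry: (closer we are waiting for, children)
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch == '"':
            # Skip the whole string literal so delimiters inside it are ignored.
            i += 1
            while i < n and source[i] != '"':
                i += 2 if source[i] == "\\" else 1
            i += 1  # step past the closing quote
        elif source.startswith("//", i):
            # Skip to the end of the line comment.
            end = source.find("\n", i)
            i = n if end == -1 else end
        elif ch in PAIRS:
            children = []
            stack[-1][1].append((ch, children))
            stack.append((PAIRS[ch], children))
            i += 1
        elif ch in CLOSERS:
            if stack[-1][0] != ch:
                raise ValueError(f"unbalanced {ch!r} at offset {i}")
            stack.pop()
            i += 1
        else:
            i += 1
    if len(stack) != 1:
        raise ValueError("unclosed delimiter")
    return root


code = 'if (x[0] == ")") { f(); } // }'
print(nesting_tree(code))
```

The `")"` inside the string and the `}` inside the comment do not affect the tree, which is the behavior described above. Supporting another language in this sketch would mostly mean changing the delimiter, string, and comment rules.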
Rijnard: If we can kind of approximate that in some way, then it's easy to manipulate. Now, there are tons of tools out there, right, that go along this thinking. I can name a lot off the top of my head nowadays. There's like tree-sitter. There's all various kinds of [crosstalk 00:17:24]-
Beyang: Codemod.
Rijnard: Yeah, and so the way I see it is, there's a lot of tooling in this space that addresses parts of the problems and parts of the design space of code transformation. It's a matter of picking the right tool for the job. And if that tool doesn't exist, you have to engineer it.
Rijnard: So Comby is my response to engineering for this purpose-fit role of manipulating syntax at smallish scale. At the scale of simple lint checks or linter transformations, auto-fixes, this kind of thing. And at the end of the day, I needed that to integrate with a more sophisticated approach of program repair or what others might call automated program fixing or automated bug fixing. So that's the context.
Beyang: That makes sense. So just to lay out the spectrum of pattern matching and replacement tools and syntaxes. I guess, on the far end of the spectrum, there's the full-on compiler-based, static analysis-based tools. That Java library that you were mentioning, tools that actually hook into javac or whatever the compiler is, and they compile the code. They build up the AST, and then in order to express your transformation, you probably have to express it in terms of programmatic changes to the AST or something like that, right?
Rijnard: Right.
Beyang: So that's like one end of the spectrum. Then on the other end of the spectrum, I would say probably things like regular expressions or tools that make use of regular expressions. Do you think that's an accurate bookending of the spectrum?
Rijnard: I think that's a good way to look at it. To my mind, there are multiple spectrums, let's say. The way I think of categorizing the vast tooling landscape of things that match and change code, it's more of a 2D landscape where you have various axes, right? And so along these axes, you have data points. So the one that you point out is a very good one to think about in terms of expressivity, right? What is the set of transformations that I can express? So regular expressions are on one end of this.
Rijnard: You can map some of this notion of expressivity to a more formal way of thinking in terms of Chomsky's hierarchy and that sort of thing. I don't think you have to go that far, because we're, at the end of the day, talking about tools here. And so the tools correspond, in some sense, to Chomsky's hierarchy. But I found that trying to fit tooling into that formal notion isn't very good because there's a lot of overlapping and fuzziness.
Beyang: Chomsky's hierarchy is this hierarchy of languages from your regular languages all the way up to Turing-complete languages? Is that-
Rijnard: Right. It's a way to describe distinct capabilities in computation, right?
Beyang: Okay.
Rijnard: Something that is expressive at a context-free level can also recognize regular languages, and so on. So it's just a breakdown, basically, of different levels of recognizing languages, but it corresponds to the ability to compute, right?
Beyang: Yep. Makes sense.
Rijnard: As far as tooling goes, regular expression tooling like grep has certain operators that break out of the formal notion of regular expressions. Right. So in terms of what you were saying about the spectrum of tooling, yeah, there's this idea of expressivity, right? So there are regular expressions, and then compilers are at least able to recognize context-free constructs, which I think is like balanced parentheses. But typically, you need a context-sensitive notion to recognize things like typedefs in C, for example. So this is like you need to know about things that were defined by the point that they were used in order to interpret or parse something correctly.
Rijnard: So there are all these notions and tools take that into account to varying degrees. If you use an actual compiler tool, it knows about all of these things, right? That's a useful way to think of it in terms of expressivity and that translates into performance, as well, right? Regular expressions can be extremely fast and compiling can be comparatively slow. So that's one way. The other way that I think of it is in terms of how you specify things, right? How you express what you can match or transform. And this is something where with Comby, it was like, "I want something simple and lightweight.
Rijnard: I don't want to go and write a program or a script or a Clang plugin. This is just too much effort for what I'm trying to achieve." And similarly, with regular expressions, there are some great things about them. But a lot of people also gripe about the syntax. And so there's a lot to take into account that is not about expressivity or performance. It's also how you describe what you're trying to do. And I think that's really where a lot of design choices and tooling choices come into play, as far as how important it is for a tool to succeed or be effective in the domain that it's intended for.
Beyang: So, Comby, if I understand correctly, what you're saying is it's this syntax and this tool that hits a sweet spot in this two-dimensional space. It's expressive enough that you can describe basically 98% of what you want to do in everyday find-and-replace instances in code. But at the same time, it has this ease of use or developer ergonomics aspect to it that makes it a lot easier to write patterns in it than something like regex, which as far as I know, every developer that uses regex has a love-hate relationship with.
Rijnard: Right. Yeah. The goal is definitely to make it lightweight. To make it as declarative as possible. So, I've made the choice of not allowing metasyntax or escape characters or things like this just as a usability thing. But over time, I've also come to recognize that it's very useful in this world to match lexical constructs. Just things that regular expressions are good at, and why reinvent another language, especially one that's so popular that people are already, to some extent, familiar with?
Rijnard: Comby is now advancing to the state where you can optionally embed regular expression syntax into it to match those things. Things like certain character classes. So there are definitely parts of the syntax that you can improve on, and certainly I've strived to make it declarative. So it's not so much dependent on writing a program or expressing exactly, "Oh, this is a function that I want to match with," or "This is a variable." None of that is really incorporated. It's all purely syntactic and purely reliant on just the coarse syntactic structures that certain languages have.
Rijnard: So if you're in Erlang and you have certain keywords that delineate blocks, or Ruby, then it would recognize those things. So it's language-specific at some level, but not at the semantic level where you necessarily know that, "Oh, I want to match a function block." So basically, there's a way of identifying syntax that corresponds to the underlying language construct that you might be interested in.
Beyang: Yeah. It's like it handles the balanced delimiter patterns that are fairly common in code, like string quotes, parens, brackets, that sort of thing.
Rijnard: Right.
Beyang: It handles those really well. Whereas, those are always super annoying to express in regex because regular expressions have no concept of memory, so they can't keep track of how many nested layers deep you are. Whereas, Comby understands these at a foundational level and the syntax makes it easy to-
Rijnard: Yeah.
Beyang: Because I've tried it myself and it's like if you want to switch the order of arguments in a function call, for instance, it's just, I guess, maybe you can get to the notion of holes. But you express these holes, which in the syntax are just like colon, bracket, and then a number. And then that becomes an argument. You can type it out. It looks like actual code, the pattern that you wrote. And then that's a pattern. Then you express the replacement pattern, just referencing the numbers that you entered into those holes later. It's super-intuitive, I think.
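As an illustration of the swap-arguments example, a match template such as `swap(:[1], :[2])` with a rewrite template `swap(:[2], :[1])` uses Comby's `:[name]` hole syntax. The Python below is a toy approximation written for this transcript, not Comby's actual algorithm: the helper names are hypothetical, holes must be separated by literal text, and it ignores strings, comments, and whitespace flexibility. What it does show is the key idea that a hole only stops at a point where the delimiters it has consumed are balanced.

```python
import re

HOLE = re.compile(r":\[(\w+)\]")
OPENERS, CLOSERS = "([{", ")]}"


def split_template(template):
    """Split 'f(:[a], :[b])' into [('lit', 'f('), ('hole', 'a'), ...]."""
    parts, last = [], 0
    for m in HOLE.finditer(template):
        if m.start() > last:
            parts.append(("lit", template[last:m.start()]))
        elif parts:
            raise ValueError("adjacent holes are not supported in this sketch")
        parts.append(("hole", m.group(1)))
        last = m.end()
    if last < len(template):
        parts.append(("lit", template[last:]))
    return parts


def match_at(parts, source, pos):
    """Match `parts` at source[pos:]; return (end, bindings) or None."""
    bindings = {}
    for k, (kind, value) in enumerate(parts):
        if kind == "lit":
            if not source.startswith(value, pos):
                return None
            pos += len(value)
            continue
        # A hole extends to the next literal, but only at nesting depth 0.
        stop = parts[k + 1][1] if k + 1 < len(parts) else None
        depth, p = 0, pos
        while True:
            if depth == 0 and stop is not None and source.startswith(stop, p):
                break
            if p == len(source):
                if stop is None and depth == 0:
                    break
                return None
            ch = source[p]
            if ch in OPENERS:
                depth += 1
            elif ch in CLOSERS:
                if depth == 0:
                    if stop is None:
                        break
                    return None
                depth -= 1
            p += 1
        bindings[value] = source[pos:p]
        pos = p
    return pos, bindings


def rewrite(match_template, rewrite_template, source):
    """Replace every match of match_template in source using rewrite_template."""
    parts = split_template(match_template)
    out, pos = [], 0
    while pos < len(source):
        result = match_at(parts, source, pos)
        if result is None or result[0] == pos:
            out.append(source[pos])
            pos += 1
            continue
        end, bindings = result
        out.append(HOLE.sub(lambda m: bindings[m.group(1)], rewrite_template))
        pos = end
    return "".join(out)


src = "x = swap(f(a, b), c[0]);"
print(rewrite("swap(:[1], :[2])", "swap(:[2], :[1])", src))
```

Because the first hole will not stop at the comma inside `f(a, b)`, the nested call is moved as a unit. A naive lazy regular expression such as `swap\((.*?), (.*?)\)` would instead split at that first inner comma.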
Rijnard: Yeah. You can compare [inaudible 00:27:18] syntax. You can compare this just to named identifiers or named groups in regular expressions. People rarely end up, I think, in practice using named identifiers for regular expressions because they're not using them in the context of changing code, right? It's just, "I just want to match this thing. Maybe I'll group some expressions, but I'm not going to attach a name to it." I think that also limits readability, right? If I wrote something and I'm going to match a group of something, maybe it corresponds to a telephone number, I give that to you [inaudible 00:27:47].
Rijnard: And I'm like, "Here's the regex." Maybe you don't even know that this group expression corresponds to the numbers that you care about. So, with Comby, you tend to explicitly match that identifier to the text that the hole contextually corresponds to. And so you can reference that when you're rewriting the code, which is like your swap arguments example. It's like, "Okay, now, you already have the identifiers that you can use to swap, arguments one and two." Yeah, as you said, it's a way to identify and rewrite structural pieces of your code.
Rijnard: So maybe you want to identify calls inside of a loop or inside double for loops, nested for loops, and there are certain calls you want to match or check. So one example is, it's funny, so you can find code, right, where you compile regular expressions, right? So you can use Comby, for example, to find where you're maybe compiling and recompiling a regular expression inside a loop, and that's redundant. Typically, what you want to do is you would want to compile the regular expression once outside of the loop, if it doesn't change, so the loop doesn't affect the regular expression.
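For comparison, here is a small Python sketch of the telephone number case with and without named groups. The phone number format and the group names `area`, `exchange`, and `line` are assumptions chosen for illustration; `(?P<name>...)` and `\g<name>` are the standard named-group forms in Python's `re` module.

```python
import re

# Unnamed groups: the reader has to work out what groups 1, 2, and 3 mean.
unnamed = re.compile(r"\((\d{3})\) (\d{3})-(\d{4})")

# Named groups: each part of the match says what it is.
named = re.compile(r"\((?P<area>\d{3})\) (?P<exchange>\d{3})-(?P<line>\d{4})")

text = "Call (412) 555-0123 today"

match = named.search(text)
print(match.group("area"))

# The names can be reused when rewriting, much like holes in a rewrite template.
rewritten = named.sub(r"\g<area>-\g<exchange>-\g<line>", text)
print(rewritten)
```

The named version reads closer to a Comby template in spirit: the identifier is attached to the text it captures, and the same identifier is what you reference in the replacement.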
Rijnard: And then just run that regular expression on whatever's inside the loop. So there are certainly cases like that, that I found in Java or Go or pick your language, right? This is an easy mistake to make. So that's a very concrete and accessible example of how you might use it where regular expressions are just going to be hard to identify. Yeah.
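The redundant-compilation mistake looks roughly like this in Python. This is a sketch with hypothetical function names and a made-up `TODO` pattern. Note that Python's `re` module caches recently compiled patterns, so the cost is smaller here than in languages such as Java or Go, but the shape of the fix is the same.

```python
import re


def count_todos_recompiling(lines):
    """Compiles the same pattern on every iteration (the redundant version)."""
    total = 0
    for line in lines:
        pattern = re.compile(r"\bTODO\b")  # does not depend on the loop variable
        if pattern.search(line):
            total += 1
    return total


# Compiled once, outside the loop, because the pattern never changes.
TODO_PATTERN = re.compile(r"\bTODO\b")


def count_todos(lines):
    """Same result, with the compilation hoisted out of the loop."""
    total = 0
    for line in lines:
        if TODO_PATTERN.search(line):
            total += 1
    return total


lines = ["x = 1  # TODO: rename", "y = 2", "# TODO remove", "TODOS = []"]
print(count_todos_recompiling(lines), count_todos(lines))
```

Finding the first form automatically means matching a call nested inside a loop body, which is the kind of structural pattern that is awkward to express with a plain regular expression.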
Beyang: We should mention to folks listening that Comby is not just a syntax. It's actually a command-line tool. And if you go to comby.dev, you can look up the documentation and actually just download the tool. I think it supports macOS, Linux. I don't know if it's on Windows yet, but [crosstalk 00:29:45]-
Rijnard: Everything's on Windows nowadays with the Windows-
Beyang: Oh, yeah, subsystem.
Rijnard: Yeah. That's the recommended way to go, for sure. Yeah. I'm always happy to help out. I'm helping at the Gitter channel or just [crosstalk 00:30:01].
Beyang: Yeah. There's a Gitter channel and you're super-responsive on it, I noticed.
Rijnard: Yeah. It's a valuable way for me to also collect feedback and understand what people want to use the tool for.
Beyang: What sort of people have reached out so far, and what sort of use cases have you seen in the wild?
Rijnard: It's interesting. I think there's a wide array of things that people want to do once you tell them, "Hey, you can change your code in certain ways and it can be more sophisticated than regular expressions." And I think what's interesting is that you end up with a lot of interesting and unexpected ideas. So one example of just usage is people wanting a way to review code change by code change. So I want to go through these because they're not all going to match exactly the thing I'm interested in, or maybe it changes some test code that I don't want to have affected.
Rijnard: And so, I think Codemod, the Facebook variety of Codemod, the actual tool on GitHub pioneered that idea, which is this iterative review patch or code change. And that's something I integrated after the fact, right? In response to [crosstalk 00:31:20] flow.
Beyang: This is the interactive mode. So you can go through and it'll actually show you each place that the pattern matched and ask you, "Do you want to change this? Yes or no?"
Rijnard: Yes, exactly.
Beyang: Cool.
Rijnard: That was something that people wanted just to integrate into their workflow. What I've seen other people want to do is changing things inside of HTML tags, [inaudible 00:31:47] names, this sort of thing crops up often. And that's again, another place where that pointed out to me that to some extent, you want to support regular expressions to let people match their arbitrary classes of characters, but within angled-bracket tags or within strings, for example.
Beyang: How much do you find yourself reaching for Comby in day-to-day pattern matching find-replace tasks, versus something like grep or sed, or like the find and replace in your editor?
Rijnard: It's an interesting question. Right now, I'm most drawn to using Comby for higher-level applications rather than, "Oh, I want to change a couple of things in my code." I've done it a couple of times, working with Sourcegraph. And so, it's popped up a couple of times where it's like, "Oh, I want to change ..." particularly things in tests. For example, you have a bunch of test cases and these test cases are very similar. Maybe you have 20, or 30, or 40 test cases. And each of these test cases is like a record or a struct.
Rijnard: You have a field and a value, and maybe you want to rename those fields or you want to add something to the structure or whatever, and then it can be useful because then you can just match on each independent struct and then add some field, and maybe it's dependent on some other inputs. So I recall a couple of times needing that, maybe, I don't know, maybe once a week, or once every couple of weeks I run into a case like that.
Rijnard: So not super often, but I don't end up using sed for things much either. But the kind of appeal I see right now for using Comby is more in the sense of manipulating code for other purposes. So one is mutation testing. So for the unfamiliar, mutation testing is this idea of transforming your program, and then running a test suite. And then, if your test suite doesn't detect anything bad, you've essentially created a mutant program that, if you've changed it in some way that you wanted to fail, but it doesn't, then that's a gap in your test suite, right?
Rijnard: And so, you can imagine that easily changing programs in that scenario is very useful. There's a lot of research in that space, like academic research and a lot of tooling associated with that research as well. And usually it does come down to a more heavyweight tool, or maybe you're just like, "Hey, I'm just going to focus on Java or C to evaluate some interesting new approach." So with Comby, that could be something that you could do at the most general level, right? Not restricted to any particular language as far as the tooling goes.
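A minimal sketch of the mutation testing idea described above, in Python. Everything here is hypothetical and chosen for illustration: the `is_adult` function, the tiny test suite, and the three textual mutations. Real mutation testing tools apply syntax-aware transformations rather than plain string replacement.

```python
ORIGINAL = "def is_adult(age):\n    return age >= 18\n"

# Each mutation is a small textual change: (old text, new text).
MUTATIONS = [(">=", ">"), (">=", "<"), ("18", "19")]


def run_tests(namespace):
    """A deliberately weak test suite: returns True if every check passes."""
    is_adult = namespace["is_adult"]
    try:
        assert is_adult(30) is True
        assert is_adult(5) is False
    except AssertionError:
        return False
    return True


def mutants(source):
    """Yield (change, mutated source) for each mutation that applies."""
    for old, new in MUTATIONS:
        if old in source:
            yield (old, new), source.replace(old, new, 1)


def surviving_mutants(source):
    """Return the changes that the test suite failed to detect."""
    survivors = []
    for change, mutated in mutants(source):
        namespace = {}
        exec(mutated, namespace)  # define the mutated function
        if run_tests(namespace):
            survivors.append(change)
    return survivors


print(surviving_mutants(ORIGINAL))
```

In this sketch the mutants that change `>=` to `>` and `18` to `19` survive, because the suite never checks the boundary value 18. That surviving mutant is the gap in the test suite the passage describes.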
Beyang: I myself have tried to use some of those more heavyweight refactoring tools that actually hook into the AST. And my experience is that there's quite a learning curve, because you have to learn the ins and outs of the AST in order to ensure that you can actually express the pattern that you're trying to describe. And it's almost ... There's a greater mental barrier, because when I think about the pattern that I want to match, oftentimes it's at a textual or syntactic level. Like, "Oh, switch these arguments," or, "Oh, add another argument at the end of this function," or, "move this struct ... rename this struct field," or something like that.
Beyang: And I'm not really thinking about which AST node specifically I have to modify, and which layer in the tree that might occur, and what are all the other hidden AST nodes that I don't even know about, but are implicitly constructed when the language gets parsed.
Rijnard: Yeah. That's part of it, right? Clang has an excellent query matching, or way to match C or C++ syntax with a query language. So they recognize that writing a plugin that does a programmatic thing using a visitor interface is kind of heavyweight sometimes, so, "We're going to introduce this query language." But in order to use this query language, you have to be familiar with all of the grammar constructs to make sure that you're matching on the specific thing in the C++ grammar that you're actually interested in. And a lot of the time, you might not even know that the lvalue of an assignment corresponds to the left-hand side. It's very heavyweight.
Rijnard: Whereas what, I would say, a declarative technique allows you to do is kind of, what you see is what you get, right? And that's the aim here. Now, of course, there are many tools again out there that do declarative transformation and matching, especially on the academic side. So you can find tools like Rascal and the Spoofax Language Workbench that cover a lot of languages, and you can define all of these declarative ways to match your syntax.
Rijnard: But at the end of the day, you either have to dig up the thing that you want, or learn a little bit more context about those tools. And with Comby, it's this thing that basically I wanted to just brew install, and then run the thing that I care about and not pay too much attention to: is this a fully-fledged parser? Did someone go and make the effort to turn this into a declarative way to specify things for Java, or whatnot?
Beyang: Yeah. I want to return to an earlier use case. We've talked a bit about how this can make the lives of people who are doing serious refactors easier. So people who would use more of a heavyweight tool if Comby didn't exist. But I want to return to that everyday use case, the regex replacement use case. Because you said you don't use it every day. I actually find myself using it more and more often. And I do wonder whether ... In my view, there's this hurdle that any new everyday tool has to overcome, which is kind of the familiarity hurdle, right?
Beyang: Those of us who have already invested many hours of our lives into learning regex syntax and all its special variants, that's a thing that we already have, and now along comes a new syntax and it's like, "Oh, do I really want to learn this? The thing that I have, it's not perfect, but it works reasonably well." But I do wonder, because I know a lot of other programmers, both new and some fairly experienced, who actually have avoided regex so far. And it's something that's, like, scary to them.
Beyang: And I almost feel that Comby would be, in many ways, a better starting pattern-matching syntax because it's just more intuitive to use. You can parse it with your eyeballs. You don't have to go and upload it to one of those regex visualizers to figure out what the heck is going on. Do you think that Comby will evolve into a tool like that, especially for beginners, where they use it in more of an everyday setting?
Rijnard: I am very happy to see that happen, but it's not ... I wouldn't say that's an end goal. At least to the extent of, I'm not pushing for the tool to be adopted by, let's say, novice or unfamiliar programmers. It's more a question of, the tool is designed to do something more complex than regex allows, in a simpler way. And I think you can absolutely hone in on, how do I make it minimally simple for people to do X, as far as matching or transforming code goes?
Rijnard: And I think that would ... At some point you're going to come up to ... Your tool's going to sacrifice something in the interest of simplicity. So the thing that's top of mind for me in designing syntax for Comby is, make it dead easy to match code and make things correspond to code. But at the same time, there've been more and more examples where, once you start using this tool, you want something more out of it. You want to maybe ... Here's a concrete example, right? It's like, "Hey, I want to sort my list of imports in my program alphabetically," right?
Rijnard: And so, Comby can make it super easy for you to match all the import statements within some import group, a parenthesized import group in [inaudible 00:41:10], for example. But it doesn't have any native capability to go and say, "Okay, I have a list or a set of lines that were matched and now I want to sort these lexicographically," right? That capability doesn't exist, and you could pipe that stuff into another program and then rewrite, and that's something you could do.
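As a rough illustration of the "pipe it into another program" step described here, the sketch below sorts the lines of one parenthesized import group. It does not use Comby; the function name `sort_import_group` and the Go-style input are assumptions made for the example, and it only handles a single group with no blank lines or comments inside it.

```python
import re


def sort_import_group(source: str) -> str:
    """Sort the lines inside the first `import ( ... )` group lexicographically.

    Hypothetical helper for illustration: it assumes the group opens with
    `import (` on its own line and closes with `)` on its own line.
    """
    match = re.search(r"import \(\n(.*?)\n\)", source, flags=re.DOTALL)
    if match is None:
        return source  # nothing to sort
    lines = match.group(1).split("\n")
    # Compare on the stripped text so indentation does not affect the order.
    sorted_body = "\n".join(sorted(lines, key=str.strip))
    return source[: match.start(1)] + sorted_body + source[match.end(1):]


if __name__ == "__main__":
    example = 'package main\n\nimport (\n\t"os"\n\t"fmt"\n\t"bufio"\n)\n'
    print(sort_import_group(example))
```

A plain regex is enough for this step only because an import group does not nest; the matching of nested code structures is the part the conversation attributes to Comby.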
Rijnard: But the point here is, at the end of the day, it's good at doing one thing really well, which is match syntax that corresponds to code structures. And then, whatever comes on top of that, I'd love to support things and branch out capabilities, but you have to draw the line somewhere. And I think the same thing happens essentially when you say, "Hey, I want to make this thing super accessible, right? To people who don't know regular expressions." And I don't know what the shape of that solution looks like, but certainly some of the designs and decisions around the syntax could advocate for a solution like that.
Beyang: Yeah. That makes sense. I want to take a step back and chat about some of the other research that you've done. But before we move on to that, I think it would be remiss of us if we were not to mention that the Comby syntax is actually supported in Sourcegraph, currently. So if folks want to try that out, you just go to sourcegraph.com and in the search bar, it's the bracket icon, it's called structural search. And so, if folks want to try that out, we'd love to hear feedback on that.
Rijnard: Yeah. And I'm looking to put out more examples of that as well. So yeah, go find interesting things. Try to give us feedback. And the benefit there is that you don't even have to clone your repos. We have repos up there and you can search, for example, some of the more popular GitHub repositories, Java, Gitter, Python, all these things. So yeah, it works out of the box.
Beyang: Actually, just a quick question on that, in implementing that syntax for Sourcegraph, did you have to do anything special to get it to scale?
Rijnard: Well, we're doing some fancy things, leveraging the fact that we have indexed a lot of source code already and can look up whether a file contains strings that would match at all. So there are various kinds of optimizations that I worked on, and they tend to identify that sort of thing, right? And it's not any different than whipping out, like, ripgrep and then finding files that you want to search and then piping that into Comby.
Rijnard: But the benefit is on sourcegraph.com, of course, that we do trigram indexing, which is even faster than ripgrep. So that's why it works really well out of the box.
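The prefiltering idea mentioned here can be sketched with a toy trigram index. This is a minimal illustration under stated assumptions, not Sourcegraph's implementation: the in-memory `files` dictionary and the function names are hypothetical, and the query is assumed to be a plain literal string taken from the pattern.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to the set of file names whose content contains it."""
    index = defaultdict(set)
    for name, content in files.items():
        for gram in trigrams(content):
            index[gram].add(name)
    return index


def candidate_files(index, files, literal):
    """Return files that contain `literal`, using the index to skip most files.

    A file can only contain the literal if it contains every trigram of the
    literal, so intersecting posting lists gives a superset of the answer.
    The final substring check removes the false positives.
    """
    candidates = set(files)
    for gram in trigrams(literal):
        candidates &= index.get(gram, set())
    return {name for name in candidates if literal in files[name]}


if __name__ == "__main__":
    files = {
        "a.go": "if err != nil { return err }",
        "b.py": "print('hello')",
        "c.go": "func main() { fmt.Println() }",
    }
    index = build_index(files)
    print(candidate_files(index, files, "err != nil"))
```

Only the surviving candidates would then need to be handed to the more expensive structural matcher.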
Beyang: Cool. But so, with respect to the other research that you've done, my understanding is that ... You wrote Comby because you were having trouble with pattern matching and replacement because of another aspect of your research. Can you tell us about that, I guess, initial motivating problem, or maybe just about your thesis in general and what were the themes there?
Rijnard: Yeah. So my high-level focus was to take existing tools that we have out there that are fairly popular and used in practice to find bugs, right? So to find bugs that are a little bit more sophisticated at a semantic level, right? So not your, "Oh, I'm missing a semicolon," or "I have some lint check that's failing," but something more in the line of, oh, you forgot to close this file resource. And maybe you've opened a file or a socket in one function, and then three function calls deep you're done with this thing, or you've returned up the stack, and you forget to close it, right?
Rijnard: So from a program analysis perspective, this gets pretty complex because now you have to consider calls across multiple functions, different contexts in which a function is called, whether any logic inside of if conditionals or for loops changes whether you close the file later or not, because you don't want false positives. You don't want the analyzer telling you, "Hey, you didn't close this file," but then you did and the analyzer's too dumb to know the difference, right? So that's the flavor of maybe the bug that you want to fix.
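To make the flavor of bug concrete, here is a small hypothetical sketch of a resource that is closed on the normal path but leaked when an error surfaces from a callee. The `Connection` class is a stand-in that only records whether `close()` was called; none of these names come from the conversation or from any real analyzer.

```python
class Connection:
    """Stand-in for a file or socket; it only tracks whether close() ran."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def parse(payload):
    """Callee that can fail, two calls below the code that owns the resource."""
    if not payload:
        raise ValueError("empty payload")
    return payload.upper()


def handle(conn, payload):
    return parse(payload)


def serve_leaky(payload, opened):
    conn = Connection()
    opened.append(conn)  # recorded so a caller can inspect it afterwards
    try:
        result = handle(conn, payload)
    except ValueError:
        return None  # bug: this path returns without closing conn
    conn.close()
    return result


def serve_fixed(payload, opened):
    conn = Connection()
    opened.append(conn)
    try:
        return handle(conn, payload)
    except ValueError:
        return None
    finally:
        conn.close()  # runs on every path, including the error path


if __name__ == "__main__":
    opened = []
    serve_leaky("", opened)
    print("leaky version closed the connection:", opened[0].closed)
```

An analyzer has to follow the call into `handle` and `parse` to see that the error path exists, and it must not report `serve_fixed`, where the close happens on every path.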
Rijnard: And so, I went and dove in and said, "Okay, given that an analysis ran, it knows things, it has reasoned that there's this bug that happened. How can we go about fixing that automatically? So we want to make some change to the program and fix this bug in a way that we can have a reasonable amount of confidence that it is fixed." And so, there's a lot of detail to that, and I won't necessarily go into all of it, but basically you leverage the fact that a static analysis ...
Rijnard: And so, in this case, I used a tool called Infer, which is also open source on GitHub and that's maintained by Facebook. And I think it's a great tool to have out there. And it finds bugs like resource leaks and memory leaks in C programs and so on. And so, I was using that as one tool as a basis of trying to automatically fix the bugs it reports. And so, at the end of the day, you can use some of the things that the analysis inferred about your code to also inform a fix, right?
Rijnard: You can, for example, identify places in the code that also close the file on some condition, and you can identify that in the analysis output. But once you have that, right? It's like, "Okay, I'm ready to make a change. I'm ready to change the program." And that means you have to make a syntactic change, right? At some level, this stuff has to translate back into [inaudible 00:47:43] syntactically change the program. Okay. So you're like, "I'm ready for this. I just need to add this close socket inside this if condition. This is what I need to do."
Rijnard: And it's like, "Okay, how can I match on the body of this if condition reliably, so I can just insert the ...?" And this is where I ran into this problem, which is like, "Okay, I know everything that I need to know, what I want to do. It's just not easy to just do this one thing, right? I just got to match this if condition's body inside these braces, but I can't because there are too many parentheses in the if condition. There are too many braces inside that confuse regular expressions."
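The difficulty described here can be reproduced in a few lines. The sketch below contrasts a naive lazy regex with a small delimiter-counting scanner; it is an illustration of the general idea of balanced matching, not Comby's implementation, and the helper names are hypothetical. It also ignores string literals and comments, which a real tool has to account for.

```python
import re


def find_closing(text, open_index, open_ch="{", close_ch="}"):
    """Return the index of the delimiter that closes the one at `open_index`."""
    depth = 0
    for i in range(open_index, len(text)):
        if text[i] == open_ch:
            depth += 1
        elif text[i] == close_ch:
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced delimiters")


def if_body(text):
    """Return the body of the first `if (...) {...}` using balanced matching."""
    cond_open = text.index("if (") + 3  # index of the opening parenthesis
    cond_close = find_closing(text, cond_open, "(", ")")
    body_open = text.index("{", cond_close)
    body_close = find_closing(text, body_open)
    return text[body_open + 1:body_close]


def if_body_regex(text):
    """Naive attempt: the lazy `.*?` stops at the first closing brace."""
    return re.search(r"if \((.*?)\) \{(.*?)\}", text).group(2)


if __name__ == "__main__":
    source = "if (ok(a, (b))) { for (;;) { step(); } done(); }"
    print(repr(if_body_regex(source)))
    print(repr(if_body(source)))
```

On this input the regex version cuts the body off at the brace that closes the inner `for` loop, while the counting version returns the whole body, which is where an inserted statement would need to go.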
Rijnard: I tried regular expressions at first just like, "Maybe I can get around this. Maybe I can just ... So it's a research problem. This isn't part of some novel or significant contribution. I just want to do this change," right? And so, this is where I got-
Beyang: You thought finding the issue would be the hard part, and then actually making the change would be easy, but it turns out making the change was a lot more annoying?
Rijnard: Yeah, absolutely. And so, it became a thorn in my side, right? And it was at that point where I ended up using, I think, a specific Java thing. But I came out of that thinking, "This is ridiculous. I'm not doing that again. And this is actively hindering other things I want to do in my research, so I'm going to go build this tool." And that's how it sprung out. And yeah, it was built around that purpose.
Beyang: That motivating problem is really interesting because the magical ... The thing that I've used that is probably closest to that would be FindBugs in the Java world. But when you get suggestions like that, where it finds actual bugs in your code and flags them to you, that is almost like ... The first time you see that, it's like a wow moment. You're like, "Wow, this is magical." It knows what's going on in the code.
Beyang: And I think these days you see more and more ... stronger claims being made about what actually can be done. I think you and I have chatted at length about ... machine learning for instance is a topic, or a buzzword that's thrown about a lot these days. And I've seen it used a number of times in the specific domain of code transformations and automatically writing programs which admittedly would take it a step beyond even just the semantic analysis and transformation.
Beyang: We've chatted about this at some length. I'd be curious to ... if you could share your thoughts with the podcast audience about the intersection of machine learning, and language analysis, and programming automation.
Rijnard: Yeah. I'll scope it a little bit more to say, I'll only comment on, I think, the interesting aspects around, can we use, for example, machine learning to automatically generate or fix programs? I think I'm more familiar with the effort around automatically trying to fix programs using, for example, machine learning. And I think there's also a distinction between applying machine learning in industry scenarios versus academic efforts around machine learning.
Rijnard: So to me, it's a matter of, well, there are many approaches that say, well, programs are just data and we're going to feed it into some supervised learning algorithm. We're going to get some outputs and we can use that, right? And show that the result is useful for some context. And you can absolutely do that for something like automated program repair or automated bug fixing. But I think the challenge here is that, if you treat it like a black box, right? A, the stuff you're going to get out is going to be a function of the fidelity of things that you fed into it.
Rijnard: And if you're just treating it as text, then there's only ... There's an extent to how well you can do when you get the outputs. And so, despite the tendency to just throw things at a machine learning algorithm, which happens, right? In research. And I think it's not totally unreasonable to report on those results and say, "Hey, we've observed this and it's ... Or maybe, we're not really sure," right? But I think it's important to caveat that and say, "Well, programs are a lot more structured. That's why they're so interesting." And we have a wealth of history on program analysis and research that really dives into, "What's the complexity? What's the underlying complexity of the thing we're dealing with? This problem that we're trying to solve?"
Rijnard: And it's already been proven at a formal level, any sufficiently interesting property is impossible to detect in general, right? And so, this corresponds to Rice's theorem, and I don't know. If you're interested in that, you can go dive into that. But the point is like, neither machine learning nor anything else can actually solve this at a very general level, but we can do pretty dang good if we reason about what we can and cannot do in certain contexts, right? And what approximations we're going to make.
Rijnard: And so, I think machine learning as a tool to solve a problem is very coarse, unless you encode all of the stuff that we know. And so, some of these things that we know about programs are: they have structure, they have a grammar, they have various attributes and semantics around certain buggy properties. And so, I think the research space is split between those two lines of thought, right? It's like, do we come from the perspective of, okay, we're treating this as a program, with a foundation that's logical, that we're able to reason about and model maybe in a discrete sense, versus, okay, we have a very powerful inference engine, but it is based on elements that we're not going to reason about, or necessarily incorporate, or explain how we arrived at a certain result.
Rijnard: And so, my hope is, these two things converge and get closer to each other, but they're a bit detached. And so, of these two camps, I fall very much into the one that studies and researches programs as a structured or a logical concept. And then, maybe on top of that, implement various ways to do something like machine learning, or other AI approaches to reveal interesting properties or ways of fixing programs, right?
Beyang: Yeah. Are there any efforts, either in industry or academia that you're aware of, that in your view are taking the right approach of synthesizing these two worlds?
Rijnard: I don't know so much about synthesis. I think, well, I don't know of any good examples right now, at least, that I can point to. I will say that, in terms of where we are going with automated reasoning and things like being able to generate programs automatically, or fix programs automatically, are we going to put software developers out of jobs, or are they going to ... At some point, we're not going to need certain types of software developers or engineers?
Rijnard: In terms of that, I think it's clear that we're not quite there, and I don't see us getting there anytime soon, automating away any substantive engineering ability. What we are doing is getting closer to removing the tedium around certain bug fixes and reasoning about some pieces of code. And also, just engineering oversight. So it's like, you're coding, you're just trying to implement some new interface to call out to some other code ... or removing a feature flag or something like this, and then bugs crop up.
Rijnard: And it's like, "I don't want to deal with this right now." Or you're maybe not thinking about that right now, or it's not in your mind right now, and then an analysis tool picks that up, right? And that's really where the value is, and that's really where I see the tooling and the automated nature of software analysis going: it's really to make us more effective at working on the core of the problem, right? And the core of our day jobs.
Rijnard: We don't want to deal with all the tedium and the, do I have to check this thing, whether it's null again?
Beyang: Yeah. The annoying parts of the job.
Rijnard: Exactly.
Beyang: We want to focus on the creative and fun parts.
Rijnard: Yeah. So as far as successful efforts in industry, I think, since I was at Facebook and I follow along the research that's adjacent to some of their stuff, I think they're really doing well at doing automated program repair, or automated bug fixing for these more sophisticated classes of bugs that I mentioned. And so, they're looking at things like null dereferences for Java and so on, and automatically fixing them, incorporating that into their CI.
Rijnard: And you can go look up on the internet, there's a project called Sapienz. Mark Harman and his team are behind that. They do some pretty interesting stuff where it's really at this frontier of, "Okay, we're at the point where we can change programs and that results in mostly reliable, or at least feasible fixes to bugs that we are willing to show to developers and say, hey, does this fix your issue?" And it becomes this kind of push button approach to automating these fixes.
Rijnard: And that's really what I also went after in my research. It's like, can we achieve this idea of push button program repair, where we have enough confidence about a bug fix that we're willing to show it to developers and say, "Hey, we actually tested this," right? "This fix stopped this test from failing," or "this fix stops this analysis from reporting a bug. And we have high confidence that this is an actual issue."
Rijnard: And so, I'm excited to see that expand, and I think it's a question of, when? It's going to happen eventually, where we see more of this tooling crop up. And it's very much going to be, I think, a function of integrating with a developer's workflow, whether that's CI, or in their editor, or a tool that they use to search or review code, that's really going to be the interface to this sort of interaction, where you interact with an automated tool and what it did. There's going to have to be a human in the loop for a lot of it, and I see that factoring into, especially things like your CI workflow and editors.
Beyang: Yeah. You mentioned that you worked at Facebook for a bit. I believe you also did stints at MSR and Google, if I'm not mistaken.
Rijnard: Mm-hmm (affirmative).
Beyang: Of those three ... We only have a short time left, but are there any projects that you think are worth calling out there that were in particular, really interesting to you in the space of developer tools?
Rijnard: Yes. My focus became more specialized, I think, as I did more of these internships, right? So I think the most recent and most focused to my interest and speciality was at Facebook, right? So I worked on Pyre, which is a static type checker for Python. And it's really interesting to me that over time, Facebook has gone and been very much investing in tooling, right? For software quality, things like bug detection, type-checking for Python and this sort of thing.
Rijnard: And my impression is, Facebook is at the forefront of doing this. Google or Microsoft for sure, they invest in their tooling and a lot of it's great, but I think a lot of talented researchers and engineers are working on these tools that have really impressive results. So finding anything from super-sophisticated bugs in Hack that could lead to Facebook servers being compromised, or just daily developer, "Hey, I missed this null dereference check."
Rijnard: They've really covered a broad space of it. I think Amazon is also now ramping up a lot of their dev tools and program analysis efforts. But certainly, it's been interesting to see Facebook as a social media company at the end of the day, really investing so heavily into software quality dev tools. And so, that to me just says, if it isn't already, it will be a ubiquitous concern for any software company, right?
Beyang: Yeah.
Rijnard: As you grow, as you understand the complexity of what you're doing, and what your software ... and the activities that your engineers are engaged in, it becomes a critical piece that has to integrate with what you're doing. And if you're a company that isn't aware of that yet, right? It's a bit of a blind spot. And I think this problem becomes more important as you grow, obviously the size of your company matters, right?
Rijnard: And so, that's really where the value of these automated tools, like automated bug fixes, automated bug finding come into play as soon as you reach a scale where you want to cover a lot of ground, and a lot of the complexity of millions, billions of lines of code.
Beyang: If folks listening want to learn more about Comby or any of the research that you've done, how would you recommend they go about doing that?
Rijnard: Now, the easiest way is probably to just find me on Twitter, or just DM me on Twitter, and then we can fire off an email exchange if that's something that you're open to, or you want to chat more in depth. So you can find me on Twitter at RVTonder ... R-V-T-E ... how do you do that? At R-V-T-O-N-D.
Beyang: Cool. And we'll put that in the show notes as well.
Rijnard: Sure.
Beyang: My guest today has been Rijnard van Tonder. Rijnard, thanks for being on the show.
Rijnard: Thanks Beyang.