How Sourcegraph maps the Python universe: Introduction


When we first started building Sourcegraph, we solicited feedback about the idea of building a semantically aware code-search engine that worked for both dynamically and statically typed languages. A lot of people warned us it’d be extremely difficult (and they were right). To do it, they said, you’d have to do large-scale type inference for dynamic languages, and that’s undecidable. Today Sourcegraph has indexed nearly 10,000 Python repositories and 10,000,000 Python symbols, and can show you rich information like usage examples of a function, people who use your code, and annotated source code with jump-to-definition links.

image
image

Here’s how we’re doing it.

In order to get all that information, Sourcegraph needs to resolve function, class, and package names whenever they are referenced in code. Getting this information means we do many of the things that a compiler or interpreter would—resolve dependencies, lex and parse the code, construct an AST, build symbol tables—almost everything, except, of course, emit assembly or bytecode.

In some cases, however, this isn’t enough. In order to resolve things like method calls and field names, we need to know the type of the receiver object. Take the below examples, for instance. In the first case, we need to know that the app variable refers to an instance of the Flask class in order to look up the field jinja_env and link to its definition. We could simply search the database for any field named “jinja_env” and in this case, the name might be unique enough to resolve to the correct field. But consider the other case: a usage of the size method of the FTP class in the ftplib standard library module. In order to correctly resolve this reference to ftplib.FTP.size and not some other size method, we need to determine the type of self.client.

image
image

In dynamically typed languages like Python, this information is not explicitly available in code and might only get decided at runtime (hence, undecidable by simple reduction to the halting problem). For both performance and security reasons, we want to avoid running the code. So instead, we use the static analysis techniques provided by the awesome open-source library PySonar.

There are a handful of open-source libraries that do some form of static analysis on Python. PySonar and its successor PySonar2 are by far the most advanced Python static analysis libraries for our purposes. Conceived and implemented by the brilliant Yin Wang, PySonar uses some pretty awesome techniques for resolving types statically. Yin wrote a great blog post describing these techniques.

Though general type inference is undecidable, in most cases Python code follows predictable patterns that are tractable to static inference. PySonar also uses interprocedural analysis (techniques that traverse function boundaries). That means that by having a global view of the world’s Python repositories, Sourcegraph can do a better job of type inference than any local system ever could. In practice, it often works quite well.

PySonar is a crucial component in our system, but the challenges don’t end there. Sourcegraph also must resolve dependencies to the correct versions of release binaries and source code, and it must globally resolve cross-repository references and function calls. We’ll describe how we tackle these problems in a later post.

The output of all this is a graph of the Python universe. With it, we can efficiently compute all the info you see on Sourcegraph and help you program better by finding usage examples, seeing how your code depends on and is depended on by others, exploring link-annotated source, and effectively making use of open-source libraries.