GraphQL at Twitter
@beyang
Tom Ashworth (@tgvashworth) is a software engineer on the core services team at Twitter.
Tom's going to give us a whirlwind tour of how GraphQL fits into the architecture at Twitter. We're going to cover:
- The Twitter GraphQL API and how it fits into Twitter's microservices architecture
- Subscriptions architecture
- Twitter's schema and plan for GraphQL in the long term
API
Twitter started experimenting with GraphQL two years ago:
- Dec 2016: TweetDeck
- Feb 2017: Twitter Lite
- Sept 2017: Android & iOS
Twitter is famously a microservices company. There's a service for everything, and often multiple new services for every new feature. Tom counted the number of deployment configuration files in their monorepo and they numbered over 2000.
Before GraphQL, the backend architecture at Twitter consisted, at a high level, of 2 types of services:
- HTTP services
- Thrift services
Thrift is a Remote Procedure Call system that's used extensively at Twitter, Facebook, and many other companies.
Each service exposes an asynchronous, typed interface, so communication between microservices already uses types. They aren't GraphQL types, but they are types.
GraphQL fits into this picture as an HTTP and Thrift service alongside the others. It uses many data sources, including existing HTTP endpoints and other Thrift services. Remember, communication between all of Twitter's microservices is already typed using Thrift, so the entire system already resembles a GraphQL-native pattern. The GraphQL community isn't the first to find a service-to-service type system useful.
Twitter uses Scala, so they use the open-source Sangria library.
Let's dive into some of the challenges of implementing GraphQL at Twitter. One challenge is monitoring. In HTTP services, errors are easy to detect from the HTTP status codes. However, GraphQL is capable of partial successes: GraphQL requests almost always succeed with a 200, even if no data was returned.
So instead of HTTP error codes, they track the number of exceptions generated per GraphQL query. This becomes their success rate measure, so dashboards and alerting revolve around this metric.
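To make the metric concrete, here is a minimal sketch of counting exceptions per query. The `QueryMetrics` class, the query name `TweetDetail`, and the use of the response's `errors` list as the exception count are all assumptions for illustration, not Twitter's actual monitoring code.

```python
from collections import defaultdict


class QueryMetrics:
    """Hypothetical helper: counts executions and exceptions per named query."""

    def __init__(self):
        self.executions = defaultdict(int)
        self.exceptions = defaultdict(int)

    def record(self, query_name, response):
        # A GraphQL response can carry partial data plus an "errors" list,
        # even though the HTTP status is 200.
        self.executions[query_name] += 1
        self.exceptions[query_name] += len(response.get("errors", []))

    def exceptions_per_execution(self, query_name):
        runs = self.executions[query_name]
        return self.exceptions[query_name] / runs if runs else 0.0


metrics = QueryMetrics()
# One fully successful response and one partial success with a single error.
metrics.record("TweetDetail", {"data": {"tweet": {"id": "1"}}})
metrics.record(
    "TweetDetail",
    {"data": {"tweet": None}, "errors": [{"message": "timeout"}]},
)
rate = metrics.exceptions_per_execution("TweetDetail")
```

A dashboard or alert could then watch `rate` per query instead of the share of non-200 responses.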
They also use GraphQL-specific defenses. Imagine a pathologically nested query that asks for followers of followers of followers, and so on.
They have two defenses: complexity and depth.
First, they measure query complexity. They assign a "score" (some point value) to each field, and calculate the total cost of a query. The total cost is the complexity of a query, and is calculated before execution. They then limit queries to some maximum complexity, and reject queries that are too costly.
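As a rough sketch of the scoring idea, the example below models a query as nested dictionaries (field name mapped to its sub-selection) and sums a per-field cost before anything is executed. The field names, the point values in `FIELD_COSTS`, and the limit of 25 are hypothetical; the talk does not give Twitter's actual scores.

```python
# Hypothetical point values per field; unknown fields get DEFAULT_COST.
FIELD_COSTS = {"user": 1, "followers": 10, "name": 0}
DEFAULT_COST = 1
MAX_COMPLEXITY = 25  # assumed limit, for illustration only


def complexity(selection):
    """Total cost of a query: each field's score plus the cost of its children."""
    total = 0
    for field, sub_selection in selection.items():
        total += FIELD_COSTS.get(field, DEFAULT_COST) + complexity(sub_selection)
    return total


def check_complexity(selection, limit=MAX_COMPLEXITY):
    """Reject a query before execution if it is too costly."""
    cost = complexity(selection)
    if cost > limit:
        raise ValueError(f"query complexity {cost} exceeds limit {limit}")
    return cost


cheap = {"user": {"name": {}}}
nested = {"user": {"followers": {"followers": {"followers": {"name": {}}}}}}
```

Here `check_complexity(cheap)` passes, while `check_complexity(nested)` raises because three levels of `followers` push the total over the limit.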
Twitter also uses query depth measurement, which simply calculates the height of a query. For example, they could reject queries that go deeper than 10 fields.
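Depth can be sketched in the same spirit. This example again models a query as nested dictionaries, which is an assumption made for brevity rather than a real GraphQL AST, and uses the limit of 10 from the example above.

```python
MAX_DEPTH = 10  # the example limit mentioned above


def depth(selection):
    """Height of a query: number of nested field levels."""
    if not selection:
        return 0
    return 1 + max(depth(sub_selection) for sub_selection in selection.values())


def is_allowed(selection, max_depth=MAX_DEPTH):
    return depth(selection) <= max_depth


shallow = {"user": {"followers": {"name": {}}}}

# Build followers-of-followers-of-... ten levels deep around a leaf field.
deep = {"name": {}}
for _ in range(10):
    deep = {"followers": deep}
```

With these inputs, `shallow` is allowed and `deep` is rejected, since it is 11 fields deep.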
They don't allow arbitrary queries. All queries must be uploaded and stored ahead of time, and are exchanged for a unique key.
These are called stored operations or persisted queries. This protects against attackers who explore the GraphQL API or run introspection queries against it to find out what data is available or to look for vulnerabilities.
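A minimal sketch of the stored-operations idea follows. The `OperationStore` class is hypothetical, and deriving the key from a SHA-256 hash of the query text is an assumption; the talk only says that each stored query is exchanged for a unique key.

```python
import hashlib


class OperationStore:
    """Hypothetical registry of queries uploaded ahead of time."""

    def __init__(self):
        self._operations = {}

    def register(self, query_text):
        # Assumption: the unique key is a content hash of the query text.
        key = hashlib.sha256(query_text.encode("utf-8")).hexdigest()
        self._operations[key] = query_text
        return key

    def resolve(self, key):
        # At request time clients send only the key; unknown keys are rejected,
        # so arbitrary or introspection queries never reach execution.
        try:
            return self._operations[key]
        except KeyError:
            raise LookupError(f"unknown operation key: {key}") from None


store = OperationStore()
query = "query Viewer { viewer { name } }"
key = store.register(query)
```

A client would then send `key` (plus variables) instead of the query text, and the server would call `store.resolve(key)` before executing.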
Lastly on the topic of the Twitter API, Tom will cover how they handle authentication in GraphQL. The Twitter Frontend (TFE) is a layer between all internal Twitter services and the outside world. It functions as a reverse proxy, API gateway, and router. It talks to the GraphQL API (in addition to the other HTTP API services).
Imagine a request to api.twitter.com that is authenticated using a cookie header. TFE is responsible for user authentication. Because validating a cookie is fairly involved, this is done in the TFE layer, which then passes on a bundle of Twitter Identity Assertion headers to GraphQL.
The GraphQL API then passes the bundle through to the other services, so each service doesn't need to worry about authentication. The lower-level services are responsible for authorization (permissions) checks. Thus, the GraphQL API server does not need to concern itself with any authn or authz checks.
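The pass-through can be illustrated with a small sketch. The header prefix `x-identity-`, the header names, and the `build_downstream_request` helper are invented for this example; the real Twitter Identity Assertion header names are not given in the talk.

```python
# Hypothetical prefix for the identity assertion headers set by TFE.
IDENTITY_HEADER_PREFIX = "x-identity-"


def identity_headers(incoming_headers):
    """Pick out the identity bundle; it is forwarded as-is, never inspected."""
    return {
        name: value
        for name, value in incoming_headers.items()
        if name.lower().startswith(IDENTITY_HEADER_PREFIX)
    }


def build_downstream_request(service, incoming_headers, payload):
    """Describe the call the GraphQL API would make to a lower-level service."""
    return {
        "service": service,
        "headers": identity_headers(incoming_headers),
        "body": payload,
    }


incoming = {
    "X-Identity-User": "12345",      # set by TFE after validating the cookie
    "X-Identity-Signature": "abc",   # hypothetical assertion signature
    "Accept": "application/json",
}
request = build_downstream_request("tweet-service", incoming, {"tweet_id": "1"})
```

The GraphQL layer only copies the bundle along, which is why it needs no authentication or authorization logic of its own.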
Subscriptions
Tom will structure his discussion of subscriptions around Laney Kuenzel's description of GraphQL subscriptions: "Clients subscribe to an event with a GraphQL query and receive payloads."
Clients subscribe to an event occurring in the backend with a GraphQL query, and receive payloads. The payloads represent the result of running the query against the event.
Twitter's subscriptions system divides functionality in a similar way to the quote:
- Event production
- Query execution
- Payload delivery
They've been deliberate in creating boundaries between the three parts, so that they can work on each component independently and hide some of the complexity of the implementation.
"Clients subscribe to an event": event production
Twitter has an extensive pub/sub system called EventBus, with thousands of different events that flow through it. They wanted to give clients access to this stream of data with GraphQL subscriptions.
EventBus is a distributed queue system. Event producers are services that publish events to EventBus streams. Subscribers are services that subscribe to EventBus to receive events.
To expose this data to GraphQL, they built a "Subscriptions" service that subscribes to EventBus streams. This Subscriptions service is separate from the GraphQL API.
"...with a GraphQL query": query execution
Now that there's a service that picks up events from EventBus, the events need to be hooked up to queries. In their Subscriptions design, Twitter extended the GraphQL API to execute GraphQL queries against an event that is supplied as part of the request.
This means that the API receives both a query and an event, and uses the event as the basis for the query execution. This is similar to a Thrift interface, so they are using GraphQL over both Thrift and HTTP.
The request is a JSON bundle containing the query and the event, and the API runs the query against the event. This keeps the GraphQL API concerned only with executing GraphQL queries.
So when an event arrives on an EventBus stream, the Subscriptions service calls out to the GraphQL API service for execution. At this point, the Subscriptions service picks up events from EventBus and executes queries against them using the GraphQL API.
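The division of labor described here can be sketched as follows. The `execute` function is a toy stand-in for the GraphQL API (it just selects fields from the event), queries are modeled as nested dictionaries, and the stream name and event shape are hypothetical.

```python
def execute(selection, event):
    """Stand-in for the GraphQL API: run a query using the event as its root."""
    result = {}
    for field, sub_selection in selection.items():
        value = event.get(field)
        if sub_selection and isinstance(value, dict):
            value = execute(sub_selection, value)
        result[field] = value
    return result


class SubscriptionsService:
    """Hypothetical service that pairs incoming events with subscribed queries."""

    def __init__(self, execute_fn):
        self._execute = execute_fn
        self._queries_by_stream = {}

    def subscribe(self, stream, selection):
        self._queries_by_stream.setdefault(stream, []).append(selection)

    def on_event(self, stream, event):
        # For each subscribed query, send query + event to the API for execution.
        return [
            self._execute(selection, event)
            for selection in self._queries_by_stream.get(stream, [])
        ]


service = SubscriptionsService(execute)
service.subscribe("tweet_engagement", {"tweet": {"id": {}, "likes": {}}})
event = {"tweet": {"id": "1", "text": "hi", "likes": 3}}
payloads = service.on_event("tweet_engagement", event)
```

Note that the Subscriptions service never interprets the query itself; it only hands the query and the event to the executor.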
"...and receive payloads": payload delivery
Next, the resulting payloads need to be delivered. Streamed event delivery at scale is a hard problem, so the team at Twitter didn't try to reinvent the wheel. Instead, they decided to build event delivery for GraphQL subscriptions on top of an existing Twitter technology called Live Pipeline, a pub/sub system over a streaming HTTP connection. When you see typing indicators in your DMs or see the number of retweets live update, that's Live Pipeline at work.
In the Live Pipeline model, clients listen for events on a specific "topic". Event producers can push events onto a topic by publishing them to Live Pipeline. Then the event is delivered to all clients subscribed to the event's topic.
To deliver GraphQL subscription payloads, a unique Live Pipeline topic is used per GraphQL subscription. Combining a bunch of data about the subscription creates a unique topic, and the client is informed of the topic so they can subscribe to it using Live Pipeline.
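One way to picture the topic derivation is hashing the subscription's identifying data into a stable string. Which fields are combined (`user_id`, `operation_key`, `variables`) and the `/graphql/subscriptions/...` topic format are assumptions for this sketch; the talk only says that "a bunch of data about the subscription" is combined.

```python
import hashlib
import json


def topic_for(user_id, operation_key, variables):
    """Derive a deterministic topic name from data about the subscription."""
    # sort_keys makes the result independent of dictionary ordering.
    payload = json.dumps(
        {"user": user_id, "operation": operation_key, "variables": variables},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"/graphql/subscriptions/{digest[:16]}"


topic_a = topic_for("12345", "TweetEngagement", {"tweet_id": "1", "live": True})
topic_b = topic_for("12345", "TweetEngagement", {"live": True, "tweet_id": "1"})
topic_c = topic_for("12345", "TweetEngagement", {"tweet_id": "2", "live": True})
```

The same subscription always maps to the same topic (`topic_a` equals `topic_b`), while a different set of variables yields a different one (`topic_c`).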
When a client initiates a GraphQL subscription query, the API returns a status message containing a Live Pipeline topic, as part of a GraphQL union type field at the root of the subscription.
The client subscribes to this topic and gets the result of the query.
Putting it all together:
- The client makes a subscription query to the GraphQL API
- The API contacts an event producer to notify that there's a new subscriber
- It then returns a Live Pipeline topic, which the client subscribes to
- Once the event producer starts making events, the Subscriptions service will pick those up and execute the original queries using the API
- After the API has executed the query, the payload is published to the client through Live Pipeline.
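The steps above can be sketched as a small in-memory simulation. Every class here is a hypothetical stand-in (`LivePipeline`, `GraphQLApi`, `SubscriptionsService`), queries are reduced to a list of field names, and notifying the event producer is simplified to recording the subscription in a registry.

```python
from collections import defaultdict


class LivePipeline:
    """Toy topic-based pub/sub standing in for Live Pipeline."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, topic, callback):
        self._listeners[topic].append(callback)

    def publish(self, topic, payload):
        for callback in self._listeners[topic]:
            callback(payload)


class GraphQLApi:
    """Toy API: registers subscriptions and executes queries against events."""

    def __init__(self):
        self.subscriptions = {}  # topic -> (stream, fields)
        self._next_id = 0

    def subscribe(self, stream, fields):
        self._next_id += 1
        topic = f"/subscriptions/{self._next_id}"
        self.subscriptions[topic] = (stream, fields)  # stands in for steps 2-3
        return {"topic": topic}

    def execute(self, fields, event):
        return {field: event.get(field) for field in fields}


class SubscriptionsService:
    """Picks up events, has the API execute queries, publishes the payloads."""

    def __init__(self, api, pipeline):
        self._api = api
        self._pipeline = pipeline

    def on_event(self, stream, event):
        for topic, (sub_stream, fields) in self._api.subscriptions.items():
            if sub_stream == stream:
                self._pipeline.publish(topic, self._api.execute(fields, event))


pipeline = LivePipeline()
api = GraphQLApi()
service = SubscriptionsService(api, pipeline)

received = []
status = api.subscribe("tweet_engagement", ["tweet_id", "retweets"])
pipeline.subscribe(status["topic"], received.append)
service.on_event("tweet_engagement", {"tweet_id": "1", "retweets": 42, "internal": "x"})
```

After the event is processed, `received` holds only the fields the client asked for, delivered over the topic returned by the initial subscription request.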
This is the general overview of Twitter's GraphQL subscriptions system. This system is built to deliver thousands of events per second to thousands of clients.
Schema
Lastly, Tom will tell us about the vision for GraphQL at Twitter.
One of the main goals is to help teams at Twitter build great products and features more quickly and easily, and GraphQL helps push this forward by making new data easily available without anyone manually editing schema code.
Twitter uses Strato to power this. Before learning more about Strato, a caveat: Strato is a technology internal to Twitter and there are no plans to open source it.
Strato is a kind of virtual database. A virtual database, also called a federated database, brings together multiple data sources so they can be queried and mutated uniformly. This sounds a lot like GraphQL.
Philosophically, GraphQL and Strato are very similar. Strato has many of the same benefits of GraphQL, but isn't focused specifically on building client applications.
Strato has existed at Twitter much longer than GraphQL, and is used to power the GraphQL API. It integrates tightly with Twitter's Thrift services, and in combination with the GraphQL API, gives them end-to-end types from database to the client. It is similar in spirit to many of the other technologies presented at this conference that stitch together multiple data sources into a single source of truth. This approach is extremely powerful.
For example, the Strato team has made it possible to add data to the Strato Catalog with a simple config file, and made deploys automatic. This gives engineers much more power and flexibility by reducing the coordination needed between teams. For instance, a Strato column that stores the birthdays of Twitter users in a database is just a config.
Twitter has built Strato-powered GraphQL, which allows data in the Strato Catalog to automatically appear in the GraphQL schema with a simple config change.
So, why does Tom find GraphQL exciting? While GraphQL is far from perfect, he sees it as an enabling technology, meaning it can "drive radical change in the capabilities of a user or culture." As a result, GraphQL expands the adjacent possible. Because it's so different from what we currently have, it opens up new avenues of exploration that weren't previously accessible.