GraphQL is a data query and manipulation language for APIs, and a runtime for fulfilling queries with existing data.
At GitLab we want to adopt GraphQL to make it easier for the wider community to interact with GitLab in a reliable way, but also to advance our own product by modeling communication between backend and frontend components using GraphQL.
We’ve recently increased the pace of adoption by defining quarterly OKRs related to the GraphQL migration. As a result, we spent more time on GraphQL development, which helped to surface the need to improve the tooling we use to extend the new API.
This document describes the work needed to build a stable foundation that will support our development efforts and large-scale usage of the GraphQL API.
The retrospective on our progress surfaced a few opportunities to streamline our GraphQL development efforts and to reduce the risk of performance degradations and outages that could result from gaps in the essential mechanisms needed to make the GraphQL API observable and operable at scale.
Alongside small improvements to the GraphQL engine itself, we want to build a comprehensive monitoring dashboard that will enable team members to make sense of what is happening inside our GraphQL API. We want to make it possible to define SLOs, triage breached SLIs, and zoom into relevant details using Grafana and Elastic. We want to see historical data and predict future usage.
It is an opportunity to learn from our experience of evolving the REST API for scale and to apply this knowledge to the GraphQL development efforts. We can do that by building query-to-feature correlation mechanisms, adding scalable state synchronization support, and aligning GraphQL with other architectural initiatives being executed in parallel, like the support for direct uploads.
GraphQL should be secure by default. We can avoid common security mistakes by building mechanisms that will help us to enforce OWASP GraphQL recommendations that are relevant to us.
Understanding the needs of the wider community will also allow us to plan deprecation policies better and to design parity between the GraphQL and REST APIs that suits those needs.
Being able to see how GraphQL performs in a production environment is a prerequisite for improving performance and reliability of that service.
We do not yet have tools that would let us answer how GraphQL performs and which bottlenecks we should optimize. This, combined with the pace of GraphQL adoption and the scale at which we expect it to operate, creates a risk of an increased rate of production incidents that will be difficult to resolve.
We want to build a comprehensive Grafana dashboard that focuses on delivering insights into how the GraphQL endpoint performs, while still giving team members the ability to zoom in on details. We want to improve logging so that GraphQL queries can be better correlated with features using Elastic, and to index them in a way that allows performance problems to be detected early.
- Build a comprehensive Grafana dashboard for GraphQL
- Build GraphQL query-to-feature correlation mechanisms
- Improve logging of GraphQL queries in Elastic
- Redesign error handling on the frontend to surface warnings
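
As a rough illustration of query-to-feature correlation, the sketch below builds a structured log entry that carries a feature category and a stable query fingerprint, which is the kind of record that could be indexed in Elastic. The operation names, the feature categories, the field names, and the fingerprinting scheme are all assumptions made for this example, not an existing GitLab format.

```python
import hashlib
import json
import re

# Hypothetical mapping from GraphQL operation name to the owning feature.
FEATURE_BY_OPERATION = {
    "getPipelineStatus": "continuous_integration",
    "getIssueList": "team_planning",
}


def normalize_query(query: str) -> str:
    """Collapse whitespace so formatting differences do not change the fingerprint."""
    return re.sub(r"\s+", " ", query).strip()


def query_fingerprint(query: str) -> str:
    """Return a short, stable identifier for a normalized query string."""
    digest = hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()
    return digest[:16]


def build_log_entry(operation_name: str, query: str, duration_s: float) -> dict:
    """Build one structured log record correlating a query with a feature."""
    return {
        "operation_name": operation_name,
        # Operations without a known owner are labeled "unknown" (assumption).
        "feature_category": FEATURE_BY_OPERATION.get(operation_name, "unknown"),
        "query_fingerprint": query_fingerprint(query),
        "duration_s": duration_s,
    }


if __name__ == "__main__":
    query = "query getIssueList {\n  project { issues { nodes { title } } }\n}"
    print(json.dumps(build_log_entry("getIssueList", query, 0.123), sort_keys=True))
```

Grouping log records by `feature_category` and `query_fingerprint` would make it possible to attribute slow queries to a feature without storing every query variant separately.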
Our GraphQL API will evolve with time. GraphQL has been designed to make such evolution easier: GraphQL APIs are easier to extend because of how composable GraphQL is. This is also the reason why versioning GraphQL APIs is considered unnecessary. Instead of versioning the API, we want to mark some fields as deprecated, but we need a way to measure the usage of deprecated fields and types, and to visualize it in a way that is easy to understand. We might also want to detect usage of deprecated fields and notify users that we plan to remove them.
- Define a data-informed deprecation policy that will serve our users better
- Build a dashboard showing usage frequency of deprecated GraphQL fields
- Build mechanisms required to send deprecated field usage in Service Ping
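
A minimal sketch of how deprecated field usage could be aggregated before it is shown on a dashboard or reported. It assumes that the list of fields resolved by each query is already available as `Type.field` strings; the deprecated field names below are hypothetical.

```python
from collections import Counter

# Hypothetical deprecated fields, written as "Type.field".
DEPRECATED_FIELDS = {"Project.legacyAvatar", "Issue.legacyWeight"}


def count_deprecated_usage(resolved_fields_per_query):
    """Count, per deprecated field, the number of queries that used it.

    `resolved_fields_per_query` is an iterable where each item lists the
    "Type.field" names resolved by a single query.
    """
    usage = Counter()
    for fields in resolved_fields_per_query:
        # A field used several times in one query is counted once for that query.
        for field in set(fields) & DEPRECATED_FIELDS:
            usage[field] += 1
    return usage


if __name__ == "__main__":
    queries = [
        ["Project.name", "Project.legacyAvatar"],
        ["Issue.title", "Issue.legacyWeight", "Issue.legacyWeight"],
        ["Project.legacyAvatar"],
    ]
    print(dict(count_deprecated_usage(queries)))
```

Counting queries rather than individual field resolutions is a design choice made for this sketch: it answers how many requests would be affected by removing a field.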
GraphQL is not the only thing we work on, but it cuts across the entire application. It is being used to expose data collected and processed in almost every part of our product. This makes it tightly coupled with our monolithic codebase.
We need to ensure that the way we use GraphQL is consistent with other mechanisms we’ve designed to improve performance and reliability of GitLab.
We have extensive experience with evolving our REST API. We want to apply this knowledge to GraphQL and make it performant and secure by default.
- Design direct uploads for GraphQL
- Build GraphQL query depth and complexity histograms
- Visualize the number of GraphQL queries reaching limits
- Add support for GraphQL ETags for existing features
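
To make the idea of a query depth histogram concrete, here is a minimal sketch. It assumes a query's selection set has already been parsed into nested dictionaries, where leaf fields map to `None`. The bucket boundaries and label names are arbitrary values chosen for illustration, not existing limits.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical upper bounds (inclusive) of the histogram buckets.
BUCKET_BOUNDS = [2, 5, 10, 15]


def selection_depth(selection: dict) -> int:
    """Return the nesting depth of a selection set given as nested dicts."""
    if not selection:
        return 0
    return 1 + max(
        selection_depth(child) if isinstance(child, dict) else 0
        for child in selection.values()
    )


def bucket_label(depth: int) -> str:
    """Map a depth to the label of the first bucket whose bound is >= depth."""
    index = bisect_left(BUCKET_BOUNDS, depth)
    if index == len(BUCKET_BOUNDS):
        return "+Inf"
    return f"le_{BUCKET_BOUNDS[index]}"


def depth_histogram(depths) -> Counter:
    """Count how many queries fall into each depth bucket."""
    return Counter(bucket_label(depth) for depth in depths)


if __name__ == "__main__":
    query = {"project": {"issues": {"nodes": {"title": None, "author": {"name": None}}}}}
    print(selection_depth(query))
    print(dict(depth_histogram([1, 2, 4, 7, 20])))
```

The same bucketing approach could be applied to a complexity score, and the `+Inf` bucket gives a simple way to see how many queries exceed the largest configured bound.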
We do not plan to deprecate our REST API. It is a simple way to interact with GitLab, and GraphQL might never become a full replacement for a traditional REST API. The two APIs will need to coexist. We will need to remove duplication between them to make their codebases maintainable. This symbiosis, however, is not only a technical challenge we need to resolve on the backend. Users might want to use the two APIs interchangeably or even at the same time. Exposing a common scheme for resource identifiers is a prerequisite for making the two APIs interoperable.
- Make GraphQL and REST API interoperable
- Design common resource identifiers for both APIs
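
One possible shape for a common identifier scheme is a URI-style global ID, similar in form to Rails GlobalID (`gid://app/Model/id`). The sketch below is only an illustration of the idea under that assumption: the helper names and the model-to-collection mapping are hypothetical, and the final scheme may look different.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class ResourceId(NamedTuple):
    app: str
    model: str
    local_id: int


# Hypothetical mapping from model name to a REST collection path segment.
REST_COLLECTIONS = {"Project": "projects", "User": "users"}


def to_global_id(model: str, local_id: int, app: str = "gitlab") -> str:
    """Build a URI-style identifier that a GraphQL API could expose."""
    return f"gid://{app}/{model}/{local_id}"


def parse_global_id(gid: str) -> ResourceId:
    """Split a global ID into its parts, rejecting malformed input."""
    parsed = urlparse(gid)
    parts = parsed.path.strip("/").split("/")
    if parsed.scheme != "gid" or not parsed.netloc:
        raise ValueError(f"not a global ID: {gid!r}")
    if len(parts) != 2 or not parts[1].isdigit():
        raise ValueError(f"malformed global ID: {gid!r}")
    return ResourceId(parsed.netloc, parts[0], int(parts[1]))


def to_rest_path(gid: str) -> str:
    """Translate a global ID into the REST path of the same resource."""
    resource = parse_global_id(gid)
    collection = REST_COLLECTIONS[resource.model]
    return f"/api/v4/{collection}/{resource.local_id}"


if __name__ == "__main__":
    gid = to_global_id("Project", 42)
    print(gid, "->", to_rest_path(gid))
```

With a mapping like this, a client could take an identifier returned by one API and address the same resource through the other.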
One of the most important goals related to GraphQL adoption at GitLab is using it to model interactions between GitLab backend and frontend components. This is an ongoing process that has already surfaced the need to build better state synchronization mechanisms and to hook into existing ones.
- Design a scalable state synchronization mechanism
- Evaluate state synchronization through pub/sub and websockets
- Build generic support for GraphQL feature correlation and feature ETags
- Redesign frontend code responsible for managing shared global state
- GraphQL API architecture