This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. As with all projects, the items mentioned on this page are subject to change or delay. The development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Status: proposed
Authors: @nhxnguyen
Coach: @grzesiek
DRIs: @dorrino, @nhxnguyen
Owning Stage: devops data_stores
Created: 2023-02-02

ClickHouse Usage at GitLab

Summary

ClickHouse is an open-source, column-oriented database management system. It can efficiently filter, aggregate, and sum across large numbers of rows. In FY23, GitLab selected ClickHouse as its standard data store for features with big data and insert-heavy requirements, such as Observability and Analytics. This blueprint is a product of the ClickHouse working group. It serves as a high-level blueprint for ClickHouse adoption at GitLab and references other blueprints that address specific ClickHouse-related technical challenges.

Motivation

In FY23-Q2, the Monitor:Observability team developed and shipped a ClickHouse data platform to store and query data for Error Tracking and other observability features. Other teams have also begun to incorporate ClickHouse into their current or planned architectures. Given the growing interest in ClickHouse across product development teams, it is important to have a cohesive strategy for developing features using ClickHouse. This will allow teams to more efficiently leverage ClickHouse and ensure that we can maintain and support this functionality effectively for SaaS and self-managed customers.

Goals

As ClickHouse has already been selected for use at GitLab, our main goal now is to ensure successful adoption of ClickHouse across GitLab. It is helpful to break down this goal according to the different phases of the product development workflow.

  1. Plan: Make it easy for development teams to understand if ClickHouse is the right fit for their feature.
  2. Develop and Test: Give teams the best practices and frameworks to develop ClickHouse-backed features.
  3. Launch: Support ClickHouse-backed features for SaaS and self-managed.
  4. Improve: Successfully scale our usage of ClickHouse.

Non-Goals

Proposals

The following are links to proposals in the form of blueprints that address technical challenges to using ClickHouse across a wide variety of features.

  1. Scalable data ingestion pipeline.
    • How do we ingest large volumes of data from GitLab into ClickHouse either directly or by replicating existing data?
  2. Supporting ClickHouse for self-managed installations.
    • For which use-cases and scales does it make sense to run ClickHouse for self-managed and what are the associated costs?
    • How can we best support self-managed installation of ClickHouse for different types/sizes of environments?
    • Consider using the Opstrace ClickHouse operator as the basis for a canonical distribution.
    • Consider exposing the ClickHouse backend as GitLab Plus to combine the benefits of a self-managed instance with those of a GitLab-managed database.
    • Should we develop abstractions for querying and data ingestion to avoid requiring ClickHouse for small-scale installations?
  3. Abstraction layer for features to leverage either ClickHouse or PostgreSQL.
    • What are the benefits and tradeoffs? For example, how would this impact our automated migration and query testing?
  4. Security recommendations and secure defaults for ClickHouse usage.

Note that we are still formulating proposals and will update the blueprint accordingly.
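
To make the abstraction-layer question in proposal 3 more concrete, the following is a minimal sketch, not a proposed design. It assumes a hypothetical `AnalyticsBackend` interface with one method that builds a "rows per day" query, plus a hypothetical `backend_for` selector; none of these names exist in GitLab. It uses the ClickHouse function `toStartOfDay` and the PostgreSQL function `date_trunc` to show that the same feature query needs a separate SQL text per store, which is one reason an abstraction layer affects query testing.

```python
from abc import ABC, abstractmethod


class AnalyticsBackend(ABC):
    """Hypothetical interface for illustration; not an existing GitLab class."""

    @abstractmethod
    def daily_count_sql(self, table: str, ts_column: str) -> str:
        """Return SQL that counts rows per day.

        The table and column names are interpolated directly, so they must
        come from trusted code, never from user input.
        """


class ClickHouseBackend(AnalyticsBackend):
    def daily_count_sql(self, table: str, ts_column: str) -> str:
        # toStartOfDay() is ClickHouse's function for truncating to the day.
        return (
            f"SELECT toStartOfDay({ts_column}) AS day, count() AS total "
            f"FROM {table} GROUP BY day ORDER BY day"
        )


class PostgresBackend(AnalyticsBackend):
    def daily_count_sql(self, table: str, ts_column: str) -> str:
        # date_trunc('day', ...) is the PostgreSQL counterpart.
        return (
            f"SELECT date_trunc('day', {ts_column}) AS day, count(*) AS total "
            f"FROM {table} GROUP BY day ORDER BY day"
        )


def backend_for(clickhouse_enabled: bool) -> AnalyticsBackend:
    """Pick a backend; the boolean stands in for real instance configuration."""
    return ClickHouseBackend() if clickhouse_enabled else PostgresBackend()


if __name__ == "__main__":
    # "error_events" and "created_at" are made-up names for the example.
    for enabled in (True, False):
        print(backend_for(enabled).daily_count_sql("error_events", "created_at"))
```

In a sketch like this, every feature query exists in two dialects, so each would need its own test coverage; that cost is part of the tradeoff the proposal asks about.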

Best Practices

Best practices and guidelines for developing performant and scalable features using ClickHouse are located in the ClickHouse developer documentation.