ClickHouse integration guidelines

  • Tier: Free, Premium, Ultimate
  • Offering: GitLab.com, GitLab Self-Managed, GitLab Dedicated
  • Status: Beta on GitLab Self-Managed and GitLab Dedicated

For more information on plans for ClickHouse support for GitLab Self-Managed, see epic 51.

For more information about ClickHouse support for GitLab Dedicated, see ClickHouse for GitLab Dedicated.

ClickHouse is an open-source column-oriented database management system. It can efficiently filter, aggregate, and query across large data sets.

ClickHouse is a secondary data store for GitLab. Only specific data is stored in ClickHouse, to support advanced analytical features such as GitLab Duo, SDLC trends, and CI analytics.

You should use ClickHouse Cloud to connect ClickHouse to GitLab.

Alternatively, you can bring your own ClickHouse. For more information, see ClickHouse recommendations for GitLab Self-Managed.

Supported ClickHouse versions

First GitLab version | ClickHouse versions | Comment
17.7.0 | 23.x (24.x, 25.x) | To use ClickHouse 24.x or 25.x, see the workaround section.
18.1.0 | 23.x, 24.x, 25.x |
18.5.0 | 23.x, 24.x, 25.x | Experimental support for the Replicated database engine.

ClickHouse Cloud is supported. Compatibility is generally ensured with the latest major GitLab release and newer versions.

Set up ClickHouse

To set up ClickHouse with GitLab:

  1. Run a ClickHouse cluster and configure the database.
  2. Configure the GitLab connection to ClickHouse.
  3. Run the ClickHouse migrations.

Run and configure ClickHouse

When you run ClickHouse on a hosted server, several factors affect resource consumption: the number of builds that run on your instance each month, the selected hardware, the data center that hosts ClickHouse, and more. Regardless, the cost should not be significant.

To create the necessary user and database objects:

  1. Generate a secure password and save it.

  2. Sign in to the ClickHouse SQL console.

  3. Execute the following commands. Replace PASSWORD_HERE with the generated password.

    CREATE DATABASE gitlab_clickhouse_main_production;
    CREATE USER gitlab IDENTIFIED WITH sha256_password BY 'PASSWORD_HERE';
    CREATE ROLE gitlab_app;
    GRANT SELECT, INSERT, ALTER, CREATE, UPDATE, DROP, TRUNCATE, OPTIMIZE ON gitlab_clickhouse_main_production.* TO gitlab_app;
    GRANT SELECT ON information_schema.* TO gitlab_app;
    GRANT gitlab_app TO gitlab;
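To confirm that the objects were created as expected, you can inspect the grants for the new user and role. This is an optional sanity check; the output format depends on your ClickHouse version:

```sql
-- Verify that the gitlab user holds the gitlab_app role
SHOW GRANTS FOR gitlab;

-- Verify the privileges attached to the role
SHOW GRANTS FOR gitlab_app;
```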

Configure the GitLab connection to ClickHouse

To provide GitLab with ClickHouse credentials:

For Linux package (Omnibus) installations:

  1. Edit /etc/gitlab/gitlab.rb:

    gitlab_rails['clickhouse_databases']['main']['database'] = 'gitlab_clickhouse_main_production'
    gitlab_rails['clickhouse_databases']['main']['url'] = 'https://example.com/path'
    gitlab_rails['clickhouse_databases']['main']['username'] = 'gitlab'
    gitlab_rails['clickhouse_databases']['main']['password'] = 'PASSWORD_HERE' # replace with the actual password
  2. Save the file and reconfigure GitLab:

    sudo gitlab-ctl reconfigure

For Helm chart (Kubernetes) installations:

  1. Save the ClickHouse password as a Kubernetes Secret:

    kubectl create secret generic gitlab-clickhouse-password --from-literal="main_password=PASSWORD_HERE"
  2. Export the Helm values:

    helm get values gitlab > gitlab_values.yaml
  3. Edit gitlab_values.yaml:

    global:
      clickhouse:
        enabled: true
        main:
          username: default
          password:
            secret: gitlab-clickhouse-password
            key: main_password
          database: gitlab_clickhouse_main_production
          url: 'http://example.com'
  4. Save the file and apply the new values:

    helm upgrade -f gitlab_values.yaml gitlab gitlab/gitlab

To verify that your connection is set up successfully:

  1. Sign in to the Rails console.

  2. Execute the following command:

    ClickHouse::Client.select('SELECT 1', :main)

    If successful, the command returns [{"1"=>1}].
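As an alternative to the Rails console, you can check connectivity directly against the ClickHouse HTTP interface. This is a sketch; the host name, port, database, and password are placeholders taken from the configuration above:

```shell
# Run SELECT 1 over HTTPS as the gitlab user (8443 is the default HTTPS port)
curl --user 'gitlab:PASSWORD_HERE' \
  --data-binary 'SELECT 1' \
  'https://clickhouse.example.com:8443/?database=gitlab_clickhouse_main_production'
# A healthy setup prints: 1
```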

Run ClickHouse migrations

To create the required database objects:

For Linux package (Omnibus) installations, execute:

sudo gitlab-rake gitlab:clickhouse:migrate

For Helm chart (Kubernetes) installations, migrations run automatically through the GitLab-Migrations chart. Alternatively, you can run them manually by executing the following command in the Toolbox pod:

gitlab-rake gitlab:clickhouse:migrate

Enable ClickHouse for Analytics

Now that your GitLab instance is connected to ClickHouse, enable ClickHouse for Analytics to turn on the features that use it.

Replicated database engine

For a multi-node, high-availability setup, GitLab supports the Replicated database engine in ClickHouse.

When configuring the database, you must run the statements with the ON CLUSTER clause so they are applied on every node. In the following example, replace CLUSTER_NAME_HERE with your cluster's name:

CREATE DATABASE gitlab_clickhouse_main_production ON CLUSTER CLUSTER_NAME_HERE ENGINE = Replicated('/clickhouse/databases/{cluster}/gitlab_clickhouse_main_production', '{shard}', '{replica}');
CREATE USER gitlab ON CLUSTER CLUSTER_NAME_HERE IDENTIFIED WITH sha256_password BY 'PASSWORD_HERE';
CREATE ROLE gitlab_app ON CLUSTER CLUSTER_NAME_HERE;
GRANT ON CLUSTER CLUSTER_NAME_HERE SELECT, INSERT, ALTER, CREATE, UPDATE, DROP, TRUNCATE, OPTIMIZE ON gitlab_clickhouse_main_production.* TO gitlab_app;
GRANT ON CLUSTER CLUSTER_NAME_HERE SELECT ON information_schema.* TO gitlab_app;
GRANT ON CLUSTER CLUSTER_NAME_HERE gitlab_app TO gitlab;
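After running the statements, you can confirm that the cluster and the replicated database are visible on each node. These system tables are standard ClickHouse; the cluster name in the output should match the one you configured:

```sql
-- List configured clusters and their nodes
SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters;

-- Confirm the database uses the Replicated engine
SELECT name, engine
FROM system.databases
WHERE name = 'gitlab_clickhouse_main_production';
```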

Load balancer considerations

The GitLab application communicates with the ClickHouse cluster through the HTTP/HTTPS interface. Consider using an HTTP proxy, such as chproxy, to load-balance requests across the ClickHouse cluster.
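For illustration, a minimal chproxy configuration that spreads GitLab's requests across three nodes might look like the following sketch. The host names, ports, cluster name, and user names are assumptions; check the chproxy documentation for the full schema:

```yaml
server:
  http:
    listen_addr: ":9090"          # address GitLab connects to
users:
  - name: "gitlab"                # user GitLab authenticates as against chproxy
    to_cluster: "gitlab-ch"
    to_user: "gitlab"
clusters:
  - name: "gitlab-ch"
    nodes: ["ch1.example.com:8123", "ch2.example.com:8123", "ch3.example.com:8123"]
    users:
      - name: "gitlab"            # user chproxy uses against ClickHouse
        password: "PASSWORD_HERE"
```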

Backup and restore

You should perform a full backup before upgrading the GitLab application. ClickHouse data is not covered by the GitLab backup tooling.

Your backup and restore strategy depends on your deployment.

ClickHouse Cloud

ClickHouse Cloud automatically:

  • Manages backups and restores.
  • Creates and retains daily backups.

You do not have to do any additional configuration.

For more information, see ClickHouse Cloud backups.

ClickHouse for GitLab Self-Managed

If you manage your own ClickHouse instance, you should take regular backups to ensure data safety. Taking a full backup duplicates all data every time, but it is the easiest approach to restore from.
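One way to take a full backup with built-in ClickHouse tooling (available in ClickHouse 23.x and later) is the BACKUP statement. The disk name and file name below are examples; the backups disk must be declared in your server configuration beforehand:

```sql
-- Full backup of the GitLab database to a preconfigured 'backups' disk
BACKUP DATABASE gitlab_clickhouse_main_production
TO Disk('backups', 'gitlab_ch_backup.zip');

-- Restore from the same backup
RESTORE DATABASE gitlab_clickhouse_main_production
FROM Disk('backups', 'gitlab_ch_backup.zip');
```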

Alternatively, use clickhouse-backup. This is a third-party tool that provides similar functionality with additional features like scheduling and remote storage management.

Monitoring

To ensure the stability of the GitLab integration, you should monitor the health and performance of your ClickHouse cluster.

ClickHouse Cloud

ClickHouse Cloud provides a native Prometheus integration that exposes metrics through a secure API endpoint.

After generating the API credentials, you can configure a collector, such as a Prometheus deployment, to scrape metrics from ClickHouse Cloud.
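A Prometheus scrape job for ClickHouse Cloud might look like the following sketch. The endpoint path, organization ID, and key values are placeholders; use the endpoint and API key from your ClickHouse Cloud console:

```yaml
scrape_configs:
  - job_name: clickhouse-cloud
    scheme: https
    metrics_path: /v1/organizations/YOUR_ORG_ID/prometheus   # placeholder organization ID
    basic_auth:
      username: YOUR_KEY_ID       # ClickHouse Cloud API key ID
      password: YOUR_KEY_SECRET   # ClickHouse Cloud API key secret
    static_configs:
      - targets: ["api.clickhouse.cloud"]
```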

ClickHouse for GitLab Self-Managed

ClickHouse can expose metrics in Prometheus format. To enable this:

  1. Configure the prometheus section in your config.xml to expose metrics on a dedicated port (default is 9363).

     <prometheus>
         <endpoint>/metrics</endpoint>
         <port>9363</port>
         <metrics>true</metrics>
         <events>true</events>
         <asynchronous_metrics>true</asynchronous_metrics>
     </prometheus>
  2. Configure Prometheus or a similar compatible server to scrape http://<clickhouse-host>:9363/metrics.
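The matching Prometheus scrape configuration for step 2 could look like this (the host name is a placeholder):

```yaml
scrape_configs:
  - job_name: clickhouse
    static_configs:
      - targets: ["clickhouse.example.com:9363"]   # the port configured in <prometheus>
```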

Metrics to monitor

You should set up alerts for the following metrics to detect issues that may impact GitLab features:

Metric name | Description | Alert threshold (recommendation)
ClickHouseMetrics_Query | Number of queries currently executing. A sudden spike might indicate a performance bottleneck. | Baseline deviation (for example, > 100)
ClickHouseProfileEvents_FailedSelectQuery | Number of failed SELECT queries. | Baseline deviation (for example, > 50)
ClickHouseProfileEvents_FailedInsertQuery | Number of failed INSERT queries. | Baseline deviation (for example, > 10)
ClickHouseAsyncMetrics_ReadonlyReplica | Indicates that a replica has gone into read-only mode (often due to ZooKeeper connection loss). | > 0 (take immediate action)
ClickHouseProfileEvents_NetworkErrors | Network errors (connection resets or timeouts). Frequent errors might cause GitLab background jobs to fail. | Rate > 0
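As an example, a Prometheus alerting rule for the read-only replica metric could look like the following. The exact metric name depends on your exporter and prefix configuration, so verify it against your /metrics endpoint:

```yaml
groups:
  - name: clickhouse
    rules:
      - alert: ClickHouseReadonlyReplica
        expr: ClickHouseAsyncMetrics_ReadonlyReplica > 0   # assumed metric name; verify against /metrics
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "A ClickHouse replica is in read-only mode"
```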

Liveness check

If ClickHouse is available behind a load balancer, you can use the HTTP /ping endpoint to check for liveness. The expected response is Ok with HTTP Code 200.
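For example, a liveness check against the HTTP interface (the host is a placeholder; 8123 is the default plain-HTTP port):

```shell
# Returns "Ok." with HTTP status 200 when the server is alive
curl -sS http://clickhouse.example.com:8123/ping
```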

Security and auditing

To ensure the security of your data and its auditability, use the following security practices.

Network security

  • TLS encryption: Configure ClickHouse servers to use TLS so that connections are encrypted and certificates can be validated.

    When configuring the connection URL in GitLab, you should use the https:// protocol (for example, https://clickhouse.example.com:8443) to specify this.

  • IP Allow lists: Restrict access to the ClickHouse port (default 8443 or 9440) to only the GitLab application nodes and other authorized networks.

Audit logging

The GitLab application does not maintain a separate audit log for individual ClickHouse queries. To satisfy specific requirements about data access (who queried what, and when), you can enable query logging on the ClickHouse side.

ClickHouse Cloud

In ClickHouse Cloud, query logging is enabled by default. You can access these logs by querying the system.query_log table.

ClickHouse for GitLab Self-Managed

For self-managed instances, ensure the query_log configuration parameter is enabled in your server configuration:

  1. Verify that the query_log section exists in your config.xml or users.xml:

     <query_log>
         <database>system</database>
         <table>query_log</table>
         <partition_by>toYYYYMM(event_date)</partition_by>
         <flush_interval_milliseconds>7500</flush_interval_milliseconds>
         <ttl>event_date + INTERVAL 30 DAY</ttl>  <!-- Keep only 30 days -->
     </query_log>
  2. Once enabled, all executed queries are recorded in the system.query_log table, providing an audit trail.
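For example, to review what the GitLab user has executed recently, you can query the log table directly (these are standard system.query_log columns):

```sql
SELECT event_time, user, query_duration_ms, query
FROM system.query_log
WHERE user = 'gitlab'
  AND type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 100;
```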

System requirements

The recommended system requirements change depending on the number of users.

Deployment decision matrix quick reference

Users | Primary recommendation | Comparable AWS ARM instance | Comparable GCP ARM instance | Deployment type
1K | ClickHouse Cloud Basic | - | - | Managed
2K | ClickHouse Cloud Basic | m8g.xlarge | c4a-standard-4 | Managed or Single Node
3K | ClickHouse Cloud Scale | m8g.2xlarge | c4a-standard-8 | Managed or Single Node
5K | ClickHouse Cloud Scale | m8g.4xlarge | c4a-standard-16 | Managed or Single Node
10K | ClickHouse Cloud Scale | m8g.4xlarge | c4a-standard-16 | Managed or Single Node/HA
25K | ClickHouse for GitLab Self-Managed or ClickHouse Cloud Scale | m8g.8xlarge or 3 × m8g.4xlarge | c4a-standard-32 or 3 × c4a-standard-16 | Managed or Single Node/HA
50K | ClickHouse for GitLab Self-Managed high availability (HA) or ClickHouse Cloud Scale | m8g.4xlarge | c4a-standard-16 | Managed or HA Cluster

1K Users

Recommendation: ClickHouse Cloud Basic, as it provides good cost efficiency with no operational complexity.

2K Users

Recommendation: ClickHouse Cloud Basic, as it offers the best value with no operational complexity.

Alternative recommendation for ClickHouse for GitLab Self-Managed deployment:

  • AWS: m8g.xlarge (4 vCPU, 16 GB)
  • GCP: c4a-standard-4 or n4-standard-4 (4 vCPU, 16 GB)
  • Storage: 20 GB with low-medium performance tier

3K Users

Recommendation: ClickHouse Cloud Scale

Alternative recommendation for ClickHouse for GitLab Self-Managed deployment:

  • AWS: m8g.2xlarge (8 vCPU, 32 GB)
  • GCP: c4a-standard-8 or n4-standard-8 (8 vCPU, 32 GB)
  • Storage: 100 GB with medium performance tier

Note: HA deployments are not cost-effective at this scale.

5K Users

Recommendation: ClickHouse Cloud Scale

Alternative recommendation for ClickHouse for GitLab Self-Managed deployment:

  • AWS: m8g.4xlarge (16 vCPU, 64 GB)
  • GCP: c4a-standard-16 or n4-standard-16 (16 vCPU, 64 GB)
  • Storage: 100 GB with high performance tier
  • Deployment: Single node recommended

10K Users

Recommendation: ClickHouse Cloud Scale

Alternative recommendation for ClickHouse for GitLab Self-Managed deployment:

  • AWS: m8g.4xlarge (16 vCPU, 64 GB)
  • GCP: c4a-standard-16 or n4-standard-16 (16 vCPU, 64 GB)
  • Storage: 200 GB with high performance tier
  • HA Option: 3-node cluster becomes viable for critical workloads

25K Users

Recommendation: ClickHouse Cloud Scale or ClickHouse for GitLab Self-Managed. Both options are economically feasible at this scale.

Recommendations for ClickHouse for GitLab Self-Managed deployment:

  • Single Node:

    • AWS: m8g.8xlarge (32 vCPU, 128 GB)
    • GCP: c4a-standard-32 or n4-standard-32 (32 vCPU, 128 GB)
  • HA Deployment:

    • AWS: 3 × m8g.4xlarge (16 vCPU, 64 GB each)
    • GCP: 3 × c4a-standard-16 or 3 × n4-standard-16 (16 vCPU, 64 GB each)
  • Storage: 400 GB per node with high performance tier.

50K Users

Recommendation: ClickHouse for GitLab Self-Managed HA or ClickHouse Cloud Scale. The self-managed option is slightly more cost-effective at this scale.

Recommendations for ClickHouse for GitLab Self-Managed deployment:

  • Single Node:

    • AWS: m8g.8xlarge (32 vCPU, 128 GB)
    • GCP: c4a-standard-32 or n4-standard-32 (32 vCPU, 128 GB)
  • HA Deployment (Preferred):

    • AWS: 3 × m8g.4xlarge (16 vCPU, 64 GB each)
    • GCP: 3 × c4a-standard-16 or 3 × n4-standard-16 (16 vCPU, 64 GB each)
  • Storage: 1000 GB per node with high performance tier.

HA considerations for ClickHouse for GitLab Self-Managed deployment

An HA setup becomes cost-effective only at 10,000 users or above.

  • Minimum: Three ClickHouse nodes for quorum.
  • ClickHouse Keeper: Three nodes for coordination (can be co-located or separate).
  • Load balancer: Recommended for distributing queries.
  • Network: Low-latency connectivity between nodes is critical.

Glossary

  • Cluster: A collection of nodes (servers) that work together to store and process data.
  • MergeTree: A table engine designed for high data ingest rates and large data volumes. It is the core storage engine in ClickHouse, providing columnar storage, custom partitioning, sparse primary indexes, and support for background data merges.
  • Part: A physical file on disk that stores a portion of the table's data. A part is different from a partition, which is a logical division of a table's data created with a partition key.
  • Replica: A copy of the data stored in a ClickHouse database. You can have any number of replicas of the same data for redundancy and reliability. Replicas are used in conjunction with the ReplicatedMergeTree table engine, which enables ClickHouse to keep multiple copies of data in sync across different servers.
  • Shard: A subset of data. ClickHouse always has at least one shard for your data. If you do not split the data across multiple servers, your data is stored in one shard. Sharding data across multiple servers can be used to divide the load if you exceed the capacity of a single server.
  • TTL: Time To Live (TTL) is a ClickHouse feature that automatically moves, deletes, or rolls up columns/rows after a certain time period. This allows you to manage storage more efficiently because you can delete, move, or archive the data that you no longer need to access frequently.
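To illustrate the TTL feature, a hypothetical table that drops rows after 30 days could be declared as follows (the table and columns are examples, not part of the GitLab schema):

```sql
-- Example only: rows older than 30 days are deleted during background merges
CREATE TABLE events
(
    ts DateTime,
    payload String
)
ENGINE = MergeTree
ORDER BY ts
TTL ts + INTERVAL 30 DAY DELETE;
```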

Troubleshooting

Database schema migrations on GitLab 18.0.0 and earlier

On GitLab 18.0.0 and earlier, running database schema migrations for ClickHouse may fail for ClickHouse 24.x and 25.x with the following error message:

Code: 344. DB::Exception: Projection is fully supported in ReplacingMergeTree with deduplicate_merge_projection_mode = throw. Use 'drop' or 'rebuild' option of deduplicate_merge_projection_mode

Until all migrations have run, the ClickHouse integration does not work.

To work around this issue and run the migrations:

  1. Sign in to the Rails console.

  2. Execute the following command:

    ClickHouse::Client.execute("INSERT INTO schema_migrations (version) VALUES ('20231114142100'), ('20240115162101')", :main)
  3. Migrate the database again:

    sudo gitlab-rake gitlab:clickhouse:migrate

This time, the database migrations should finish successfully.