Tiered Storages in ClickHouse
Quoting from the MergeTree documentation:
MergeTree family table engines can store data on multiple block devices. For example, it can be useful when the data of a certain table are implicitly split into “hot” and “cold”. The most recent data is regularly requested but requires only a small amount of space. On the contrary, the fat-tailed historical data is requested rarely.
When used with remote storage backends such as Amazon S3, this makes a very efficient storage scheme. It allows for storage policies, which allows data to be on local disks for a period of time and then move it to object storage.
An example configuration can look like this:
<storage_configuration>
<disks>
<fast_ssd>
<path>/mnt/fast_ssd/clickhouse/</path>
</fast_ssd>
<gcs>
<support_batch_delete>false</support_batch_delete>
<type>s3</type>
<endpoint>https://storage.googleapis.com/${BUCKET_NAME}/${ROOT_FOLDER}/</endpoint>
<access_key_id>${SERVICE_ACCOUNT_HMAC_KEY}</access_key_id>
<secret_access_key>${SERVICE_ACCOUNT_HMAC_SECRET}</secret_access_key>
<metadata_path>/var/lib/clickhouse/disks/gcs/</metadata_path>
</gcs>
...
</disks>
...
<policies>
<move_from_local_disks_to_gcs> <!-- policy name -->
<volumes>
<hot> <!-- volume name -->
<disk>fast_ssd</disk> <!-- disk name -->
</hot>
<cold>
<disk>gcs</disk>
</cold>
</volumes>
<move_factor>0.2</move_factor>
<!-- The move factor determines when to move data from hot volume to cold.
See ClickHouse docs for more details. -->
</moving_from_ssd_to_hdd>
....
</storage_configuration>
In this storage policy, two volumes are defined hot
and cold
. After the hot
volume is filled with occupancy of disk_size * move_factor
, the data is being moved to Google Cloud Storage (GCS).
If this storage policy is not the default, create tables by attaching the storage policies. For example:
CREATE TABLE key_value_table (
event_date Date,
key String,
value String,
) ENGINE = MergeTree
ORDER BY (key)
PARTITION BY toYYYYMM(event_date)
SETTINGS storage_policy = 'move_from_local_disks_to_gcs'
This approach is possible with Table TTLs, which are also available with MergeTree table engine.
The ClickHouse documentation shows this feature in detail, in the example of implementing a hot - warm - cold architecture.
You can take a similar approach for the example shown above. First, adjust the storage policy:
<storage_configuration>
...
<policies>
<local_disk_and_gcs> <!-- policy name -->
<volumes>
<hot> <!-- volume name -->
<disk>fast_ssd</disk> <!-- disk name -->
</hot>
<cold>
<disk>gcs</disk>
</cold>
</volumes>
</local_disk_and_gcs>
....
</storage_configuration>
Then create the table as:
CREATE TABLE another_key_value_table (
event_date Date,
key String,
value String,
) ENGINE = MergeTree
ORDER BY (key)
PARTITION BY toYYYYMM(event_date)
TTL
event_date TO VOLUME 'hot',
event_date + INTERVAL 1 YEAR TO VOLUME 'cold'
SETTINGS storage_policy = 'local_disk_and_gcs';
This creates the table so that data older than 1 year (evaluated against the
event_date
column) is moved to GCS. Such a storage policy can be helpful for append-only
tables (like audit events) where only the most recent data is accessed frequently.
You can drop the data altogether, which can be a regulatory requirement.
We don’t mention modifying TTLs in this guide, but that is possible as well. See ClickHouse documentation for modifying TTL for details.