ClickHouse Table Engine Overview

Background

Table engines play a key role in ClickHouse to determine:

This section describes MergeTree and Distributed engines, which are the most important and frequently used ClickHouse table engines.

MergeTree Family

Engines of the MergeTree family are the most universal and functional table engines for high-load tasks. They have the following key features:

When data is written, a table with this type of engine divides data into different folders based on the partitioning key. Each column of data in the folder is an independent file. A file that records serialized index sorting is created. This structure reduces the volume of data to be retrieved during data reading, greatly improving query efficiency.

Replicated*MergeTree Engines

All engines of the MergeTree family in ClickHouse prefixed with Replicated become MergeTree engines that support replicas.

Replicated series engines use ZooKeeper to synchronize data. When a replicated table is created, all replicas of the same shard are synchronized based on the information registered with ZooKeeper.

Template for creating a Replicated engine:
ENGINE = Replicated*MergeTree('Storage path in ZooKeeper','Replica name', ...)

Two parameters need to be specified for a Replicated engine:

For details about the example, see Creating a ClickHouse Table.

Distributed Engine

The Distributed engine does not store any data. It serves as a transparent proxy for data shards and can automatically transmit data to each node in the cluster. Distributed tables need to work with other local data tables. Distributed tables distribute received read and write tasks to each local table where data is stored.
Figure 1 Working principle of the Distributed engine

Template for creating a Distributed engine:

ENGINE = Distributed(cluster_name, database_name, table_name, [sharding_key])

Parameters of a distributed table are described as follows:

Example:

-- Create a ReplicatedMergeTree local table named test.
CREATE TABLE default.test ON CLUSTER default_cluster_1
(
    `EventDate` DateTime, 
    `id` UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/test', '{replica}')
PARTITION BY toYYYYMM(EventDate)
ORDER BY id

-- Create a distributed table named test_all based on the local table test.
CREATE TABLE default.test_all ON CLUSTER default_cluster_1
(
    `EventDate` DateTime, 
    `id` UInt64
)
ENGINE = Distributed(default_cluster_1, default, test, rand())

Rules for creating a distributed table: