Multi-Node ClickHouse Cluster with Docker and Zookeeper
In this blog post, we’ll walk you through the process of setting up a highly available, multi-node ClickHouse cluster on AWS EC2 instances using Docker and Zookeeper. Additionally, we’ll configure ClickHouse to use S3 storage, allowing us to store data in Amazon S3 for scalability and durability.
ClickHouse is an open-source, columnar database management system designed for analytical workloads, providing extremely high performance and scalability for big data applications. Let’s dive into setting up a multi-node cluster.
Cluster Architecture
We’ll set up:
- Three EC2 instances (nodes) for our ClickHouse cluster.
- A Zookeeper instance on each node for cluster coordination.
- ClickHouse on each node with an S3-compatible storage backend.
Prerequisites
Before we begin, make sure you have the following prerequisites:
- AWS Account: You'll need an AWS account to launch EC2 instances (the same approach works on Azure or GCP with equivalent VMs).
- Docker Installed: Docker should be installed on each EC2 instance. (We'll cover the installation steps below.)
Step 1: Launch EC2 Instances
1. Go to the AWS EC2 Console.
2. Choose an Ubuntu Amazon Machine Image (AMI).
3. Configure security groups:
   - Allow inbound SSH (port 22) to connect to the instances.
   - Allow inbound traffic on ports 2181 (Zookeeper) and 9000 (ClickHouse native protocol) between the nodes (see the example CLI commands after this list).
4. Generate a key pair for SSH access and launch the instances.
5. Take note of each instance's private IP address, as we'll need them for configuring the cluster.
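If you prefer the command line, the same inbound rules can be added with the AWS CLI. This is a minimal sketch: the security group ID sg-0123456789abcdef0 and the SSH CIDR are placeholders, and it assumes all three nodes share the same security group.

# SSH from your workstation (replace the CIDR with your own IP)
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 22 --cidr 203.0.113.10/32
# Zookeeper and ClickHouse native protocol between the cluster nodes
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 2181 --source-group sg-0123456789abcdef0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 9000 --source-group sg-0123456789abcdef0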
Step 2: Install Docker on Each Instance
SSH into each EC2 instance and install Docker.
# Install Docker using the official convenience script
sudo apt update
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh && rm -f get-docker.sh

# Allow the current user to run Docker without sudo
sudo groupadd -f docker
sudo usermod -aG docker $USER
newgrp docker
Log out and log back in to apply Docker group permissions.
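To confirm the installation before moving on, run a quick smoke test on each instance:

docker --version
docker run --rm hello-world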
Step 3: Prepare Directories for Configuration
On each instance, create directories to store configuration files and data:
# Directories for ClickHouse config, data, and logs, plus a data directory for Zookeeper
mkdir -p ~/clickhouse/{data,logs,config}
mkdir -p ~/zookeeper-data
vim ~/clickhouse/config/config.xml
We’ll use these directories to store ClickHouse configuration, Zookeeper data, and ClickHouse data.
Step 4: Install Zookeeper Containers
Zookeeper is required to manage the coordination between ClickHouse nodes in a cluster.
- Run a Zookeeper container on each node:
docker pull zookeeper:3.8.4
docker run -d --name zookeeper \
-v ~/zookeeper-data:/data \
-p 2181:2181 \
zookeeper:3.8.4
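The command above starts Zookeeper in standalone mode on each node, which works as long as ClickHouse points at a single Zookeeper host (as in the configuration below). If you would rather run the three Zookeepers as a proper ensemble, the official image supports this through environment variables; the following is a sketch for node 1, assuming private IPs 10.0.0.1, 10.0.0.2, and 10.0.0.3 (placeholders) and that ports 2888 and 3888 are open between the nodes:

docker run -d --name zookeeper \
  -v ~/zookeeper-data:/data \
  -p 2181:2181 -p 2888:2888 -p 3888:3888 \
  -e ZOO_MY_ID=1 \
  -e ZOO_SERVERS="server.1=0.0.0.0:2888:3888;2181 server.2=10.0.0.2:2888:3888;2181 server.3=10.0.0.3:2888:3888;2181" \
  zookeeper:3.8.4

On nodes 2 and 3, change ZOO_MY_ID and use 0.0.0.0 for that node's own server entry.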
Step 5: Configure ClickHouse Cluster
We’ll set up ClickHouse to use Zookeeper for coordination.
Update ClickHouse Configuration Files:
- Download the default config.xml and add the following sections to it:
<remote_servers>
    <Cluster_Name> <!-- this tag is the cluster name used in ON CLUSTER queries -->
        <shard>
            <replica>
                <host>node1.clickhouse.com</host> <!-- or the node's private IP -->
                <port>9000</port>
            </replica>
            <replica>
                <host>node2.clickhouse.com</host> <!-- or the node's private IP -->
                <port>9000</port>
            </replica>
            <replica>
                <host>node3.clickhouse.com</host> <!-- or the node's private IP -->
                <port>9000</port>
            </replica>
            <internal_replication>true</internal_replication>
        </shard>
    </Cluster_Name>
</remote_servers>

<zookeeper>
    <node>
        <host>zookeeper.node1.com</host> <!-- host running the Zookeeper container -->
        <port>2181</port>
    </node>
</zookeeper>
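If the three Zookeepers form an ensemble (see the note in Step 4), you can list all of them so ClickHouse can fail over between them; a sketch assuming the same hostnames:

<zookeeper>
    <node>
        <host>zookeeper.node1.com</host>
        <port>2181</port>
    </node>
    <node>
        <host>zookeeper.node2.com</host>
        <port>2181</port>
    </node>
    <node>
        <host>zookeeper.node3.com</host>
        <port>2181</port>
    </node>
</zookeeper>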
Macros for Identifying the Shard and Replica on Each Node
ClickHouse Node 1 config.xml:
<macros>
    <shard>01</shard>
    <replica>01</replica>
</macros>
ClickHouse Node 2 config.xml:
<macros>
    <shard>01</shard>
    <replica>02</replica>
</macros>
ClickHouse Node 3 config.xml:
<macros>
    <shard>01</shard>
    <replica>03</replica>
</macros>
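Depending on the base config.xml you start from, ClickHouse may only listen on localhost by default; since the replicas need to reach each other on port 9000 over the private network, you may also need a listen_host entry. A minimal snippet, assuming access is already restricted by the security group:

<listen_host>0.0.0.0</listen_host>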
Step 6: Deploy ClickHouse Containers on Each Node
Now we’ll deploy ClickHouse on each instance, mounting the configuration file we just created.
docker pull clickhouse/clickhouse-server:24.3.6
docker run -d --name clickhouse-server \
--ulimit nofile=262144:262144 \
--volume=$HOME/clickhouse/data:/var/lib/clickhouse \
--volume=$HOME/clickhouse/logs:/var/log/clickhouse-server \
--volume=$HOME/clickhouse/config/config.xml:/etc/clickhouse-server/config.xml \
--network=host \
--cap-add=SYS_NICE \
--cap-add=NET_ADMIN \
--cap-add=IPC_LOCK \
--cap-add=SYS_PTRACE \
clickhouse/clickhouse-server:24.3.6
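Before configuring anything else, check that the container started cleanly on each node:

docker ps --filter name=clickhouse-server
docker logs --tail 20 clickhouse-server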
Step 7: Verify Cluster Setup
1. Connect to the ClickHouse cluster using clickhouse-client. You can install the client on the host (this requires the ClickHouse apt repository) or simply run it inside the container with docker exec -it clickhouse-server clickhouse-client.
sudo apt-get install -y clickhouse-client
clickhouse-client --host <any-node-IP>
2. To view cluster nodes:
SELECT * FROM system.clusters;
3. To check the Zookeeper connection:
SELECT * FROM system.zookeeper WHERE path IN ('/', '/clickhouse');
4. To check macros:
SELECT * FROM system.macros;
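As an extra sanity check, you can ask every replica to identify itself; this assumes the cluster is named Cluster_Name as in config.xml:

SELECT hostName(), version() FROM clusterAllReplicas('Cluster_Name', system.one);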
Step 8: Sample database and table creation
Create Database
CREATE DATABASE my_db ON CLUSTER Cluster_Name;
SHOW DATABASES;
Create Table
CREATE TABLE my_db.my_first_table ON CLUSTER Cluster_Name (
user_id UInt32,
message String,
timestamp DateTime,
metric Float32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/my_first_table/{shard}', '{replica}')
PRIMARY KEY (user_id, timestamp);
Create Distributed Table
CREATE TABLE my_db.my_distributed_table ON CLUSTER Cluster_Name
AS my_db.my_first_table
ENGINE = Distributed (Cluster_Name, my_db, my_first_table, rand());
Insert Data into Distributed Table
INSERT INTO my_db.my_distributed_table (user_id, message, timestamp, metric) VALUES
(101, 'Hello, Clickhouse!', now(), -1.0),
(102, 'Insert a lot of rows per batch', yesterday(), 1.41421),
(102, 'Sort your data based on your commonly-used queries', today(), 2.718),
(101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159);
Verify Data Insertion
SELECT * FROM my_db.my_first_table;
SELECT formatReadableQuantity(count()) FROM my_db.my_distributed_table;
SELECT formatReadableQuantity(count()) FROM my_db.my_first_table;
Note: You can verify that each host has the same data by running these queries against each node.
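For example, a quick loop from any machine that can reach the nodes (the IPs below are placeholders for your instances' private addresses):

for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
  echo "Row count on $host:"
  clickhouse-client --host "$host" --query "SELECT count() FROM my_db.my_first_table"
done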
Step 9: AWS S3 Storage Support
ClickHouse supports S3-backed MergeTree storage: you define an S3 disk and a storage policy in the server configuration, and tables using that policy keep their data in S3 while metadata stays on the local disk.
Additional Configuration in config.xml
<storage_configuration>
    <disks>
        <s3_disk>
            <type>s3</type>
            <endpoint>https://clickhouse-bucket-name.s3.us-east-2.amazonaws.com/clickhouse</endpoint>
            <access_key_id>your_access_key_id</access_key_id>
            <secret_access_key>your_secret_access_key</secret_access_key>
            <metadata_path>/var/lib/clickhouse/disks/s3_disk/</metadata_path>
        </s3_disk>
        <s3_cache>
            <type>cache</type>
            <disk>s3_disk</disk>
            <path>/var/lib/clickhouse/disks/s3_cache/</path>
            <max_size>1Gi</max_size>
        </s3_cache>
    </disks>
    <policies>
        <s3_main>
            <volumes>
                <main>
                    <disk>s3_disk</disk>
                </main>
            </volumes>
        </s3_main>
    </policies>
</storage_configuration>
You can also replace access_key_id and secret_access_key with the following, which tells ClickHouse to obtain credentials from environment variables or the Amazon EC2 instance metadata (for example, an attached IAM role):
<use_environment_credentials>true</use_environment_credentials>
Restart ClickHouse with the new configuration.
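Restarting the container is enough to pick up the change; for example, using the container name from Step 6:

docker restart clickhouse-server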
Then connect again with clickhouse-client.
Create Table with S3 Storage
CREATE TABLE my_s3_table (
`id` UInt64,
`column1` String
) ENGINE = MergeTree
ORDER BY id
SETTINGS storage_policy = 's3_main';
Insert Data into S3-backed table
INSERT INTO my_s3_table (id, column1) VALUES (1, 'abc'), (2, 'xyz');
In the AWS console, if your data was successfully inserted into S3, you should see that ClickHouse has created new objects in your specified bucket.
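You can also confirm from inside ClickHouse which disk the new parts landed on, since system.parts exposes a disk_name column:

SELECT name, disk_name, rows FROM system.parts WHERE table = 'my_s3_table' AND active;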
Conclusion
Congratulations! You have successfully set up a multi-node ClickHouse cluster on AWS EC2 using Docker, Zookeeper, and Amazon S3 for storage. This setup provides high performance, scalability, and durability, making it ideal for big data and analytics workloads. With this configuration, you can easily expand your cluster and take advantage of S3 storage for cost-effective, long-term storage.
If you have any questions or feedback, feel free to comment.
About The Author
Suraj Solanki
Senior DevOps Engineer
LinkedIn: https://www.linkedin.com/in/suraj-solanki
Topmate: https://topmate.io/suraj_solanki