Data Compression in ClickHouse [Translation]

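Original: https://www.altinity.com/blog/2017/11/21/compression-in-clickhouse
Altinity is an overseas company that provides ClickHouse consulting and services. Its leadership is made up of ClickHouse developers and experts from Percona. A beta version of Altinity's ClickHouse cloud service is now live.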

Overview

It might not be obvious at first, but ClickHouse supports two different compression methods: LZ4 and ZSTD.

Both methods are evaluated here: https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/
In short, LZ4 is faster but gives a lower compression ratio than ZSTD. ZSTD is slower than LZ4, but it is often faster than traditional Zlib and compresses better, so it can be considered a replacement for Zlib compression.
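The speed-versus-ratio trade-off described above can be illustrated with a small sketch. Python's standard library has not traditionally shipped LZ4 or ZSTD bindings, so this example uses zlib at two compression levels as a stand-in. The sample data and the helper name `measure` are hypothetical, and the numbers it prints say nothing about LZ4 or ZSTD themselves.

```python
import time
import zlib


def measure(data: bytes, level: int) -> tuple[int, float]:
    """Compress `data` with zlib at `level`; return (compressed size, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must restore the input exactly.
    assert zlib.decompress(compressed) == data
    return len(compressed), elapsed


# Hypothetical sample: repetitive, log-like rows that compress well.
sample = b"".join(
    b"2017-11-21 order=%d revenue=%d\n" % (i, i * 7 % 1000) for i in range(50_000)
)

for level in (1, 9):  # 1 = fastest setting, 9 = strongest compression setting
    size, seconds = measure(sample, level)
    print(f"level {level}: ratio {len(sample) / size:.2f}, {seconds:.3f} s")
```

The same shape of measurement (compressed size and elapsed time per method) is what the comparison in the rest of this article reports for real ClickHouse data.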

Benchmark setup

To get some real numbers from ClickHouse, let's compare one table compressed with both methods.
For this, we take the lineorder table from the benchmark described in https://www.altinity.com/blog/2017/6/16/clickhouse-in-a-general-analytical-workload-based-on-star-schema-benchmark
At scale factor 1000, the uncompressed size of the lineorder table is 680G.

Size comparison

Now let's load this table into ClickHouse. With the default compression (LZ4), we have:
184G lineorderlz4
And with ZSTD:
135G lineorderzstd
That is about 27% of the uncompressed size with LZ4 and about 20% with ZSTD.

Here we should mention how to make ClickHouse use ZSTD. To do so, add the following lines to the config:
<compression incl="clickhouse_compression">
    <case>
        <method>zstd</method>
    </case>
</compression>
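Before restarting a server with an edited config, it can help to confirm that the fragment is well-formed and names the intended method. The following is a minimal sketch using Python's standard XML parser. The function name `compression_methods` is hypothetical, and the check covers only the fragment shown here, not how ClickHouse itself reads or merges its configuration.

```python
import xml.etree.ElementTree as ET

# The fragment from the article, reproduced as-is.
CONFIG_FRAGMENT = """
<compression incl="clickhouse_compression">
    <case>
        <method>zstd</method>
    </case>
</compression>
"""


def compression_methods(xml_text: str) -> list[str]:
    """Return the <method> value of every <case> under <compression>."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    if root.tag != "compression":
        raise ValueError(f"expected <compression>, got <{root.tag}>")
    return [
        case.findtext("method", default="").strip()
        for case in root.findall("case")
    ]


methods = compression_methods(CONFIG_FRAGMENT)
print(methods)  # ['zstd']
```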

Compression ratio comparison

The compression ratios for this table are:

Method  Compression ratio
LZ4     3.7
ZSTD    5.0
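The ratios follow directly from the sizes reported earlier (680G uncompressed, 184G with LZ4, 135G with ZSTD). A short sketch of the arithmetic, where the constant and function names are illustrative only:

```python
UNCOMPRESSED_GB = 680  # lineorder at scale 1000, as stated in the article
ON_DISK_GB = {"LZ4": 184, "ZSTD": 135}  # sizes reported after loading


def compression_ratio(uncompressed: float, compressed: float) -> float:
    """Ratio of original size to compressed size (higher is better)."""
    return uncompressed / compressed


for method, size in ON_DISK_GB.items():
    ratio = compression_ratio(UNCOMPRESSED_GB, size)
    share = size / UNCOMPRESSED_GB
    print(f"{method}: ratio {ratio:.1f}, {share:.0%} of the original size")
```

This prints a ratio of 3.7 (27% of the original size) for LZ4 and 5.0 (20%) for ZSTD.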

What about performance? To find out, let's run the following query:
SELECT toYear(LO_ORDERDATE) AS yod, sum(LO_REVENUE)
FROM lineorder
GROUP BY yod;

We execute this query twice: first as a "cold" run (no data is cached), and then as a "hot" run, when some data is already cached in OS memory after the first run.

The query results for LZ4 compression:

Cold run:
7 rows in set. Elapsed: 19.131 sec. Processed 6.00 billion rows, 36.00 GB (313.63 million rows/s., 1.88 GB/s.)
Hot run:
7 rows in set. Elapsed: 4.531 sec. Processed 6.00 billion rows, 36.00 GB (1.32 billion rows/s., 7.95 GB/s.)

And for ZSTD compression:

Cold run:
7 rows in set. Elapsed: 20.990 sec. Processed 6.00 billion rows, 36.00 GB (285.85 million rows/s., 1.72 GB/s.)
Hot run:
7 rows in set. Elapsed: 7.965 sec. Processed 6.00 billion rows, 36.00 GB (753.26 million rows/s., 4.52 GB/s.)

There is practically no difference in cold-run times, because I/O time dominates decompression time. In hot runs LZ4 is much faster: there are far fewer I/O operations, so decompression performance becomes the major factor.
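The reported throughput figures and the gap between the two methods can be re-derived from the elapsed times alone. A small sketch of that arithmetic, using only the numbers printed in the query output above; the names `ELAPSED` and `throughput` are illustrative:

```python
ROWS = 6.00e9      # rows processed, as reported in the query output
GIGABYTES = 36.00  # data processed, as reported in the query output

# Elapsed seconds for each run, copied from the results above.
ELAPSED = {
    ("LZ4", "cold"): 19.131,
    ("LZ4", "hot"): 4.531,
    ("ZSTD", "cold"): 20.990,
    ("ZSTD", "hot"): 7.965,
}


def throughput(seconds: float) -> tuple[float, float]:
    """Return (million rows per second, GB per second)."""
    return ROWS / seconds / 1e6, GIGABYTES / seconds


for (method, run), seconds in ELAPSED.items():
    mrows, gbps = throughput(seconds)
    print(f"{method} {run}: {mrows:.2f} million rows/s, {gbps:.2f} GB/s")

# Relative cost of ZSTD compared with LZ4 for each kind of run.
for run in ("cold", "hot"):
    slowdown = ELAPSED[("ZSTD", run)] / ELAPSED[("LZ4", run)]
    print(f"{run}: ZSTD takes {slowdown:.2f}x the LZ4 time")
```

With these numbers, ZSTD takes roughly 10% longer than LZ4 on the cold run and roughly 76% longer on the hot run, which matches the observation that decompression cost only shows up once I/O is out of the way.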

Conclusion

ClickHouse offers two compression methods, LZ4 and ZSTD, so you can choose whichever suits your case.
With LZ4, the default, you may get better execution times at the cost of worse compression, so the data takes more space on storage.

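Translator's notes

  • We have been using ClickHouse inside our company (Sina) for some time. Efficient SQL execution aside, its storage footprint is another very pleasing aspect.
  • We use high-capacity servers (yes, the low-spec machines used as Hadoop nodes), where a single machine easily holds tens of terabytes. Combined with ClickHouse's excellent compression, storing one to two years of log data is no problem at all.
  • We have never changed the compression algorithm; we simply use the default, LZ4.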
