ClickBench Benchmarks Pros and Cons
The Clickbench benchmarks are based on a dataset with 99,537,185 rows and XX columns. The hits.parquet
file is 14.8 GB in storage.
They contain queries like the following:
- Select an average from a column:
SELECT AVG("UserID") FROM hits;
- Aggregation then sort:
SELECT "UserID", COUNT(*) FROM hits GROUP BY "UserID" ORDER BY COUNT(*) DESC LIMIT 10;
- Pattern matching:
SELECT * FROM hits WHERE "URL" LIKE \'%google%\' ORDER BY "EventTime" LIMIT 10;
The benchmarks are displayed as a cumulative runtime of all the queries.
Here's an example of the comparison they present for DataFusion vs. Daft:
XXX
And here's the comparison for a select few queries:
XXX
Let's turn our attention to the wonderful benefits of the Clickbench benchmarks.
Benefits of the ClickBench benchmarks
The Clickbench benchmarks are easy to understand, cover a variety of different query types, and are useful for end users and query engine maintainers. They make it relatively easy to spot queries where a given engine underperforms.
Covers lots of different engines
The Clickbench benchmarks cover many engines, including Daft, Polars, and TODO (add more).
Some engines are missing, notably TODO.
Good variety of queries
XXX
Easily accessible dataset
The Clickbench dataset can be downloaded in the project README.
This is a significant improvement compared to the h2o dataset, which requires running an R program, which can be challenging if you aren't familiar with the R programming language.
Data in Parquet files
The Clickbench dataset is distributed in different file formats, including Parquet, which is ideal for benchmarks. Parquet is columnar, allowing for column pruning, has a well-defined schema, so engines don't need to infer column names or types, and is binary, making storage files on disk relatively small.
File formats like CSV can cause complications when comparing two engines.
Cover hot and cold query times
The Clickbench queries include query runtimes when data is already loaded in memory and when data is on disk and not loaded yet.
This provides a good representation of runtimes for databases when data is already loaded, as well as for enterprises that run ad hoc queries on data stored on disk.
Different machine sizes are used
XXX
Disadvantages of the Clickbench benchmarks
The Clickbench benchmarks are excellent, but don't test important types of queries and have many flaws.
Let's review the limitations of Clickbench so that you can interpret the results.
No join queries
The Clickbench queries only cover single-table queries and don't consider joins or compound queries with joins.
Joins are often the biggest bottleneck for a query, so this is a significant omission.
Only one dataset size
All the Clickhouse benchmarks are run on the table with XX rows.
The benchmarks aren't run on datasets with different scale factors. It'd be interesting to see the same benchmarks run on the same table with XX rows, XX rows, etc.
This could stress test engines that have issues when the load stresses the limits of the hardware for a machine.
One slow query has a significant impact on results
The query times for all 42 queries are presented in total, which means that a single slow-running query can greatly impact the overall result. Ranking the queries based on their ranking for all queries would be another reliable way to present the results.
Not useful for specialized query engines
Some of the query engines in Clickbench are specialized for a use case and aren't optimized general-purpose engines:
- Time is for time series analyses
- XX is for OLTP
- XX is for XX
It's better to benchmark specialized query engines with specialized query patterns. For example, a spatial query engine should be benchmarked using spatial queries and compared with other spatial engines.
Bad for distributed query engines
Properly testing a distributed engine requires large datasets spread across many files and a computation cluster with several machines.
The Clickbench dataset is small, and the queries can be run on a single laptop with only 16 GB of memory. It's way too small of a dataset to test a distributed engine.
The Clickbench dataset could be 10 times larger and that would give an interesting 147.8 GB dataset to test a distributed engine.
Some metadata-only queries
The first query can be computed from the Parquet metadata alone: "SELECT COUNT(*) FROM hits;"
.
The query engine can fetch the row count from the metadata of every row group in the file and compute the sum.
It's essential to verify that engines utilize metadata to run queries faster when possible, but it's not particularly interesting after that has been confirmed.
Lots of quick queries
A large portion of the queries (around 26) can be run in under a second on a MacBook M3 with 16 GB of RAM.
Ten queries finish between 1 and 3 seconds.
The remaining six queries take more than 3 seconds to run.
The queries are skewed to lightweight workflows that can run very quickly.
Conclusion
Querybench is an excellent benchmarking analysis and a significant contribution to the data community.
It's fantastic that Clickhouse has open-sourced the dataset and the code. It's also amazing that they maintain the benchmarks and made a nice user interface.