How to write good data engine benchmarks

This page shows how to write high-quality data engine benchmarks.

The data industry benefits from better benchmarks.

Properties of good benchmarks

accessible dataset

Users should be able to easily download or generate datasets to replicate the queries on their machine.
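One way to make a dataset easy to replicate is to ship a small generator script with a fixed random seed, so every user produces identical rows. The sketch below uses only the Python standard library. The function name `generate_dataset`, the column names, and the value ranges are hypothetical choices for illustration, not part of any existing benchmark.

```python
import csv
import random
from pathlib import Path


def generate_dataset(path, num_rows, seed=42):
    """Write a synthetic CSV file; the same seed always yields the same rows.

    The schema (id, group, value) is an illustrative assumption.
    """
    rng = random.Random(seed)  # local generator, so global random state is untouched
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "group", "value"])
        for i in range(num_rows):
            group = f"g{rng.randint(0, 9)}"
            value = rng.randint(0, 1_000_000)
            writer.writerow([i, group, value])
    return Path(path)


if __name__ == "__main__":
    generate_dataset("benchmark_data.csv", num_rows=1_000)
```

Publishing the seed and row count alongside the results lets readers regenerate the exact input instead of downloading a large file.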

list all software versions

List the software version of every engine you use, along with any other relevant information, such as engine-specific settings.

For example, you should specify that a given benchmark uses Polars v1.31.0 with the Polars streaming engine.
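Version information can be captured programmatically so it is never out of date. The following is a minimal sketch that uses the standard library's `importlib.metadata` to look up installed package versions. The helper name `collect_versions` and the package list are assumptions for illustration. Engine settings such as the streaming engine are not visible from package metadata and still need to be recorded by hand.

```python
import json
import platform
import sys
from importlib import metadata


def collect_versions(packages):
    """Return the Python version, the platform string, and package versions."""
    versions = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly rather than failing the whole run.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Hypothetical package list; replace it with the engines you benchmark.
    print(json.dumps(collect_versions(["polars", "pandas", "duckdb"]), indent=2))
```

Saving this output next to the benchmark results ties each runtime to the exact software that produced it.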

TODO

list all hardware specs

TODO

open source benchmarking code

TODO

don't use methodologies that clearly favor one engine without disclosing them

TODO

Benchmarking is hard

It's difficult to build accurate benchmarks. Runtimes depend on the hardware, software versions, and data setup.
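Because runtimes vary from run to run even on the same machine, it helps to repeat each query and report a summary rather than a single measurement. The sketch below is one possible timing harness built on `time.perf_counter`. The function name `time_query`, the default repeat count, and the warmup step are assumptions for illustration, not the method used by any particular benchmark.

```python
import statistics
import time


def time_query(fn, repeats=5, warmup=1):
    """Run fn several times and summarize the wall-clock runtimes in seconds."""
    for _ in range(warmup):
        fn()  # untimed runs, so one-time setup costs do not skew the results
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
        "runs": timings,
    }


if __name__ == "__main__":
    # Placeholder workload; swap in the query you want to measure.
    summary = time_query(lambda: sum(range(1_000_000)))
    print(summary)
```

Reporting the individual runs alongside the summary lets readers judge how noisy the measurements are. Whether warmup runs are appropriate depends on what you want to measure, so disclose that choice as well.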

Accurate benchmarks are even harder when comparing different technologies. Some frameworks perform better with particular file sizes and file formats. This benchmarking analysis tries to give a fair representation of the range of outcomes that are possible given the most impactful inputs.

The benchmarks presented in this repo should not be interpreted as definitive results. They're runtimes for specific data tasks, on one type of hardware, with a specific set of software versions. The code isn't necessarily optimized (we accept community contributions to restructure code).

The data community should find these benchmarks valuable, caveats aside.