How to write good data engine benchmarks
This page shows how to write high-quality data engine benchmarks.
The data industry benefits from better benchmarks.
Properties of good benchmarks
accessible dataset
Users should be able to easily download or generate datasets to replicate the queries on their machine.
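One way to make a dataset reproducible is to generate it from a fixed random seed, so that anyone running the script gets identical rows. The following is a minimal sketch using only the Python standard library. The function name `generate_dataset`, the column names, and the row count are illustrative assumptions, not part of any existing benchmark.

```python
import csv
import random
from pathlib import Path


def generate_dataset(path, num_rows, seed=42):
    """Write a synthetic CSV; a fixed seed makes every run produce the same rows."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    path = Path(path)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "category", "value"])  # hypothetical schema
        for i in range(num_rows):
            writer.writerow([i, rng.choice("abcde"), round(rng.uniform(0, 1000), 3)])
    return path


if __name__ == "__main__":
    generate_dataset("benchmark_data.csv", num_rows=1_000)
```

Publishing the generator script together with the seed and row count lets readers rebuild the exact input instead of downloading a large file.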
list all software versions
List the version of every engine you use, along with any other relevant configuration.
For example, you should specify that a given benchmark uses Polars v1.31.0 with the Polars streaming engine.
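Version reporting can be partly automated. The sketch below reads installed package versions with the standard library's `importlib.metadata`. The helper name `collect_versions` and the layout of `run_info` are assumptions made for illustration. Engine settings, such as which execution engine was used, do not appear in a version number, so they are recorded by hand here.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Return the Python version plus the installed version of each named package."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # The engine_settings entry is filled in manually; its keys are illustrative.
    run_info = {
        "versions": collect_versions(["polars"]),
        "engine_settings": {"polars": "streaming engine"},
    }
    print(run_info)
```

Saving this dictionary next to the benchmark results keeps the numbers tied to the environment that produced them.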
TODO
list all hardware specs
TODO
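Part of a hardware description can be captured automatically. This is a rough sketch using Python's standard `platform` and `os` modules. The helper name `collect_hardware_info` is hypothetical, and the output is incomplete: the standard library has no portable call for total memory or disk type, so those details, and the cloud instance type if applicable, still need to be written down manually.

```python
import os
import platform


def collect_hardware_info():
    """Gather basic machine details that the standard library can report."""
    return {
        "system": platform.system(),        # e.g. "Linux" or "Darwin"
        "release": platform.release(),
        "machine": platform.machine(),      # CPU architecture, e.g. "x86_64"
        "processor": platform.processor(),  # may be an empty string on some systems
        "logical_cpus": os.cpu_count(),     # may be None if undetermined
    }


if __name__ == "__main__":
    for key, value in collect_hardware_info().items():
        print(f"{key}: {value}")
```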
open source benchmarking code
TODO
don't use any methodologies that clearly favor one engine without disclosing them
TODO
Benchmarking is hard
It's difficult to build accurate benchmarks. Runtimes depend on the hardware, software versions, and data setup.
Accurate benchmarks are even harder when comparing different technologies. Some frameworks perform better with particular file sizes and file formats. This benchmarking analysis tries to give a fair representation of the range of outcomes that are possible given the most impactful inputs.
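Because a single measurement can be misleading, one option is to run each task several times and report the spread rather than one number. The following is a minimal sketch, not the harness used for the benchmarks in this repo. The function name `time_task` and the placeholder workload are assumptions for illustration.

```python
import statistics
import time


def time_task(task, repeats=5):
    """Run task() several times and summarize the wall-clock runtimes in seconds."""
    runtimes = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        runtimes.append(time.perf_counter() - start)
    return {
        "min": min(runtimes),
        "median": statistics.median(runtimes),
        "max": max(runtimes),
        "runs": runtimes,
    }


if __name__ == "__main__":
    # Placeholder workload; a real benchmark would call an engine query here.
    print(time_task(lambda: sum(range(1_000_000))))
```

The same helper can be called once per file size and file format to show how results vary across inputs.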
The benchmarks presented in this repo should not be interpreted as definitive results. They're runtimes for specific data tasks, on one type of hardware, with a specific set of software versions. The code isn't necessarily optimized, and we accept community contributions that restructure it.
The data community should find these benchmarks valuable, caveats aside.