Welcome to PostHog's ClickHouse manual.
About this manual
PostHog uses ClickHouse to power our data analytics tooling and we've learned a lot about it over the years. The goal of this manual is to share that knowledge externally and raise the average level of ClickHouse understanding for people starting work with ClickHouse.
If you have extensive ClickHouse experience and want to contribute thoughts or tips of your own, please do so by opening a PR or issue on GitHub!
Consider this manual a companion to other great resources out there:
- Designing Data-Intensive Applications
  - The chapters "Transaction Processing or Analytics" and "Column-Oriented Storage" are recommended reading for people new to the concepts
- ClickHouse Docs and Knowledge Base
- Altinity's ClickHouse Knowledge Base
- Tinybird's curated ClickHouse Knowledge Base
Why ClickHouse
In 2020, we had just launched PostHog and were getting great early traction, but we were struggling to scale.
To solve this problem we looked at a wide range of OLAP solutions, including Pinot, Presto, Druid, TimescaleDB, CitusDB, and ClickHouse. Some of our team had used these tools before at other companies, such as Uber where Pinot and Presto are both used extensively.
While assessing each tool, we looked at three main factors:
- Speed: Our users want results in real-time, so our new database needed to scale well and give fast results. Ideally, it wouldn’t be too expensive either.
- Complexity: PostHog users can self-host and install our product themselves, so we didn’t want it to be too complicated for users to manage or deploy. We didn’t want users to have to install an entire Hadoop stack, for example.
- Query interface: We like standardised tools. We eliminated tools such as Druid because, while it does have a SQL wrapper around it, it’s not exactly SQL. That can get messy.
ClickHouse was a good fit for all of these factors, so we started a more thorough investigation. We read up on benchmarks and researched the experience of companies such as Cloudflare, which uses ClickHouse to process 6 million requests per second. Eventually, we set up a test cluster to run our own benchmarks.
ClickHouse repeatedly performed an order of magnitude better than the other tools we considered. We also discovered other perks, such as the fact that it is column-oriented and written in C++. We found these to be the key benefits of ClickHouse:
- Compression: ClickHouse has excellent compression and the size-on-disk was incredible. ClickHouse even beat out serialization formats such as ORC and Parquet.
- Process from disk: Some OLAP solutions, like Presto, require data to live in memory. That’s fast, but you need to have a lot of memory for big datasets. ClickHouse processes from disk, which is better for smaller instances too.
- Real-time data updates: ClickHouse processes data as it arrives, so there’s no need to pre-aggregate data. It’s faster for us and for our users.
Eventually, we decided we knew enough to proceed and so we spun our test cluster out into an actual production cluster. It’s just part of how we like to bias for speed.
Now, ClickHouse powers all of our analytics features and we're happy with the path taken.
However, knowledge of how to build on it and maintain it is more important than ever, which brings us to this manual.