One of the daily benefits of building a co-pilot for workload optimization is talking to data engineers about how they approach optimization in their organizations. One topic that comes up continuously is optimizing table clustering keys (often just called cluster keys). Last week, a large enterprise explained that they had spent an entire week of an engineer's time researching how to optimize the clustering keys of a single dataset, all because the payoff is so magical: boosting query performance, optimizing data access, and reducing costs all at the same time. If you pair clustering key optimization with warehouse optimization and bin-packed queries, you can see something truly magical (but I will leave that for a future blog post). In this blog post, we will take the mystery out of clustering keys by explaining what they are, why they are crucial for optimizing Snowflake databases, what your optimization options are, and how Bluesky approaches the problem for customers. So let’s dig right in.
Snowflake made giant leaps and bounds in querying large datasets efficiently by changing how data is logically and physically stored: into micro-partitions. Micro-partitions are small, immutable, compressed pieces of a larger table, and they make queries more efficient because, assuming the clustering key is well chosen, only a small slice of the data has to be scanned. The clustering key is a design feature that dictates how data is organized into these micro-partitions: it specifies the columns by which the data should be sorted and co-located, improving data locality and query processing because Snowflake can prune, or skip, micro-partitions that are irrelevant to a query.
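As a minimal sketch, a clustering key can be declared when a table is created or added afterwards with ALTER TABLE; the sales table and its columns below are hypothetical:

```sql
-- Hypothetical sales table, clustered on the sale date so that rows
-- with similar dates land in the same micro-partitions.
CREATE TABLE sales (
    sale_id   NUMBER,
    sale_date DATE,
    region    VARCHAR,
    amount    NUMBER(12, 2)
)
CLUSTER BY (sale_date);

-- An existing table can also be given (or re-assigned) a clustering key.
ALTER TABLE sales CLUSTER BY (sale_date);
```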
Clustering typically works well when a table is first created and loaded, but as data is inserted and updated over time, the table can drift out of order on various dimensions, resulting in unnecessary partition scans that slow down queries.
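One way to spot that drift is with Snowflake's built-in clustering functions, shown here against the hypothetical sales table from above:

```sql
-- JSON summary of clustering health for the given columns, including
-- average overlap and average depth across micro-partitions.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date)');

-- A single-number health check: higher average depth generally means
-- more micro-partitions must be scanned per filtered value.
SELECT SYSTEM$CLUSTERING_DEPTH('sales', '(sale_date)');
```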
Optimizing table clustering keys in Snowflake has multiple benefits, all of which feed directly into query optimization and performance: faster queries, more effective data access through partition pruning, and lower costs.
Selecting an appropriate clustering key is a critical decision that impacts the performance of your Snowflake database. When done correctly, Snowflake can use the cluster key to prune, or ignore, partitions that are not relevant to a query, significantly reducing the amount of data that needs to be scanned and improving query performance. Selecting an ideal cluster key is a non-trivial task: it requires knowledge of the table structure, the data within it, how that data changes over time, and the query patterns against the table.
For example, date fields are common cluster keys, as many users query tables looking for data from a specific period, such as sales from the past month. But what if the most common queries instead ask which region sales occurred in? In that case, the date field may be an inappropriate cluster key: querying by region against a date-clustered table would force all partitions to be scanned, increasing costs and reducing performance.
Let’s assume the most common queries against our sales data table filter by the sale's geographic region instead of the date. Clustering by the region column could then be a better choice than the date, as it would allow Snowflake to prune more partitions. In the example above, I am only interested in East-region customers, so there is no need to scan any other micro-partitions if the cluster key is set correctly. This yields the lowest cost and the highest performance.
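Continuing the hypothetical sales table sketch, re-clustering on region lets the East-only query touch just the partitions that actually contain East rows:

```sql
-- Re-cluster the table on region to match the dominant query pattern.
ALTER TABLE sales CLUSTER BY (region);

-- With rows co-located by region, Snowflake can prune every
-- micro-partition that contains no 'EAST' rows.
SELECT SUM(amount) AS east_sales
FROM sales
WHERE region = 'EAST';
```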
The goal is to choose cluster keys for your datasets that create intelligent partitions for your query workloads, considering factors such as the ones below.
The best way to choose a clustering key is by considering the real-world reader and writer queries hitting the table, the cardinality of the data, and the usage patterns across time. You should also consider how much it would cost to recluster the table initially and what that cost would be over time. Finally, you want to decide how many columns to include in the clustering key. The most effective way to do all of that is to run a simulation over these variables and determine the cost and benefit of each option, continuously. That’s where Bluesky comes in.
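Before we get to that, if you want a rough feel for the cost side yourself, Snowflake's ACCOUNT_USAGE views are a starting point. The query below is a sketch, not a full simulation: it sums the credits the automatic clustering service has consumed per table over the last 30 days.

```sql
-- Approximate ongoing recluster cost: credits used by automatic
-- clustering, per table, over the last 30 days.
SELECT table_name,
       SUM(credits_used) AS clustering_credits
FROM snowflake.account_usage.automatic_clustering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY table_name
ORDER BY clustering_credits DESC;
```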
The big win for customers is that Bluesky has context, and the platform shines at scale, reducing the effort of analyzing your own data in order to optimize it. Bluesky scans all workloads via the metadata customers provide, finds reader and writer patterns, and weighs the reader/writer tradeoffs. That data is fed into our proprietary algorithms to produce a recommendation that is 100% customized to your environment. These optimizations are delivered to you without effort on your part, and our recommendations are simple and easy to implement. Think: build more, figure this stuff out less. But enough talking; let's see it in action.
Table clustering findings live under Storage in the left-hand navigation. From there, you can see whether you have any clustering key optimizations. In the case below, I have two critical optimizations, which, when clicked, filter the list of optimizations for quick access.
When clicking on a finding, users get a quick overview of the finding, the savings opportunity, the recommendation, the risks, and the effort involved. This information helps users determine whether they would like to execute the finding. Let’s dig into each section:
There is a lot of fun stuff to digest about Snowflake's table clustering keys, but one thing is for sure: they are a vital tool for optimizing query performance and overall data warehouse efficiency. The good news is that by choosing and maintaining the right clustering key, you can significantly reduce query execution times, enhance data locality, and make the most of Snowflake's powerful architecture. You can try to manage this yourself, but it is unfortunately very time-consuming and complex; platforms such as Bluesky are here to help you unlock peak database performance.
Until next time, build more and stay optimized, my friends!