Bluesky’s mission is to help customers on their journey to make better use of their data cloud. That broad statement can be broken down into several aspects that constitute a “better use”. Consider the following:
Cost optimization is about how your costs change over time, how they adhere to a budget and how different teams use their resources. An increase in cost does not necessarily mean an unoptimized use of resources. And a decrease in costs does not necessarily mean a more efficient workload.
Resource utilization, or efficiency, is about measuring the effective use of resources. In other words, it measures the spend and the “waste” in different categories of usage. This is what we’re going to focus on here.
There are three other objectives of a “better use” policy: performance, code quality and development velocity. These are core pillars of the Bluesky product and are important to achieving a healthy environment. Each requires a deep dive, so we will cover them in separate blog posts.
The Efficiency Index has been a Bluesky feature almost since our GA. In this new release of the Index, we are vastly expanding its scope and usability by making it trackable over time and by providing more insight into what moves it. We’re focusing on the following 3 aspects:
The implementation of the efficiency index relies on the following 5 components:
Now, let's dig into each one in more detail.
Warehouse utilization can be thought of in two ways:
For this version of the efficiency index, we have chosen to rely on idle/busy utilization measurement for the following reasons:
We are also analyzing data on partial utilization, which will help us establish what counts as effective full utilization of a warehouse. We will ship the updated calculation soon, for free, to all of our customers.
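To make the idle/busy idea concrete, here is a minimal Python sketch. It assumes that a warehouse counts as busy whenever at least one query is running and idle otherwise, and that utilization is busy time divided by the time the warehouse was running. The function names and the interval format are hypothetical and chosen for illustration; this is not Bluesky's actual calculation.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) of a query, in seconds


def busy_seconds(queries: List[Interval]) -> float:
    """Time covered by at least one query (the union of all intervals)."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(queries):
        if current_end is None or start > current_end:
            # A gap before this query: close the previous busy stretch.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # An overlapping query extends the current busy stretch.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def idle_busy_utilization(running_seconds: float, queries: List[Interval]) -> float:
    """Share of the warehouse's running time spent serving queries."""
    if running_seconds <= 0:
        return 0.0
    return min(busy_seconds(queries) / running_seconds, 1.0)


# One hour of uptime with three queries, two of which overlap.
utilization = idle_busy_utilization(3600, [(0, 600), (300, 900), (1800, 2700)])
```

Under these assumptions, overlapping queries are counted once, so concurrency does not push the figure above 100%. A partial-utilization measure would instead have to account for how much of the warehouse each query actually uses.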
Let’s look at a warehouse utilization chart:
Warehouse utilization
We can see utilization fluctuating between weekdays and weekends. We also see a slight deterioration in utilization over the last 2 months. Such insights can help inform what actions to take: drill down into the efficiency of specific warehouses and tune parameters such as min/max clusters, warehouse size and auto-suspend threshold. Bluesky will automatically give proactive recommendations for each of these if it detects an opportunity for optimization.
Unused tables cause cost efficiency to decrease in two ways. First, the actual bytes stored on the blob store (S3, GCS etc.) carry a monthly cost. Second, often more pronounced but flying under the radar, is the cost of maintaining such tables using ETL pipelines, despite the fact that they are not actually being read by consumers.
We define an unused table as:
The results are broken into two index components, namely Unused tables - Storage and Unused tables - Compute.
The following charts show examples of both index components:
Unused tables - Storage
The reader may notice that while the daily cost of overall storage (gray bars) is fluctuating, as tables are created and destroyed as part of pipelines, the actual waste (red bars) moves more slowly as the set of “unused tables” is relatively fixed and has a more predictable daily cost.
Unused tables - Compute
In the case of compute costs for unused tables, the trend can be more volatile, as compute-heavy events on the underlying tables fluctuate with the execution of ETL jobs.
Backup bytes consist of Fail-safe bytes, which ensure the resilience of the data in case of a temporary outage, and time travel bytes, which record the change history of tables and allow you to restore a table to an older point in time when necessary. Active bytes are the bytes used to store the latest version of the table.
While a naive calculation can show you the total ratio of active bytes vs. backup bytes, it would be wrong to consider all backup bytes as “waste”, since they serve the purpose of potentially restoring data and therefore bring value.
For this calculation, we consider backup bytes to be “wasteful” in the following conditions:
If the table has been created more than twice in the last 10 days, this signals that using backup bytes to restore the table or to time travel is highly unlikely, as the table probably participates in an ETL process that reconstructs it rather than reverting to an older version.
If the table was created but also deleted in the last 10 days, it signals that it’s likely a temporary solution during development/testing which is not likely to get restored.
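The two conditions can be expressed as a small classifier. The Python sketch below reads the first condition as "created more than twice in the last 10 days" and the second as "created and dropped in the last 10 days". The `TableHistory` record and its fields are hypothetical, invented for illustration, and do not reflect how Bluesky stores table metadata.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

WINDOW_DAYS = 10  # look-back window taken from the conditions above


@dataclass
class TableHistory:  # hypothetical record, for illustration only
    name: str
    backup_bytes: int              # Fail-safe plus time travel bytes
    created_on: List[date]         # every (re)creation date of the table
    dropped_on: Optional[date] = None


def is_backup_wasteful(table: TableHistory, today: date) -> bool:
    window_start = today - timedelta(days=WINDOW_DAYS)
    recent_creations = [d for d in table.created_on if d >= window_start]
    # Condition 1: the table is rebuilt repeatedly, so a restore is unlikely.
    recreated_often = len(recent_creations) > 2
    # Condition 2: the table was created and dropped inside the window.
    short_lived = (
        bool(recent_creations)
        and table.dropped_on is not None
        and table.dropped_on >= window_start
    )
    return recreated_often or short_lived


today = date(2024, 1, 20)
tables = [
    TableHistory("etl_staging", 500, [date(2024, 1, 12), date(2024, 1, 15), date(2024, 1, 18)]),
    TableHistory("dev_scratch", 200, [date(2024, 1, 14)], dropped_on=date(2024, 1, 16)),
    TableHistory("customers", 900, [date(2023, 6, 1)]),
]
wasted_backup_bytes = sum(t.backup_bytes for t in tables if is_backup_wasteful(t, today))
```

In this made-up example only the long-lived `customers` table keeps its backup bytes out of the waste figure, since a restore of that table remains plausible.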
The following chart illustrates the tracking of backup bytes efficiency and waste over time.
Backup bytes efficiency
Failing queries are unproductive: they consume credits without performing any useful work. Under normal circumstances a small percentage of queries is expected to fail as part of development and testing, but a high failure rate can be a significant drag on cost efficiency and often flies under the radar. For example, a recurring query that has accumulated a meaningful cost over time may be failing without anyone noticing. This can be due to an updated configuration such as a statement timeout, or a change in the data model that causes it to fail (e.g. select * from a table that had its schema changed, therefore breaking downstream pipelines).
We recently uncovered a failing query that a customer had missed and that was on track to cost them $327,500 over the year!
The failing queries efficiency index tracks all failing queries and aggregates their cost to calculate the rate of successful queries.
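As a rough illustration, the Python sketch below computes a cost-weighted success rate: the share of query spend that went to queries that completed successfully. The cost-weighted reading and the `(cost, succeeded)` input format are assumptions made for this example; the exact formula Bluesky uses may differ.

```python
from typing import Iterable, Tuple


def failing_queries_efficiency(queries: Iterable[Tuple[float, bool]]) -> float:
    """Share of query cost spent on successful queries.

    Each item is (cost_in_credits, succeeded). Hypothetical input format.
    """
    total_cost = 0.0
    failed_cost = 0.0
    for cost, succeeded in queries:
        total_cost += cost
        if not succeeded:
            failed_cost += cost
    if total_cost == 0:
        return 1.0  # nothing was spent, so nothing was wasted
    return (total_cost - failed_cost) / total_cost


# Three queries: 10 credits in total, 1 of which was spent on a failure.
efficiency = failing_queries_efficiency([(6.0, True), (3.0, True), (1.0, False)])
```

Weighting by cost rather than by query count means one expensive failing query moves the index more than many cheap ones, which matches the goal of surfacing failures that matter financially.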
This index can help track and suppress the failure rate over time by drilling down to the specific queries that are failing, for triage. It can also be sensitive, and has been seen to alert on significant spikes in failures within as little as a day. Note that Bluesky also has proactive findings that would alert on some failing queries based on their cost and other parameters.
Failing queries efficiency
To emphasize the principle of actionable insights, let’s look at the following scatter plot of the many runs of this query:
As the chart shows, the query fails consistently when its runtime reaches 400s. Bluesky will fire an alert via email, Slack and/or PagerDuty, and it will show you a notification at the top of the UI, along with an explanation:
The overall efficiency takes all 5 components and calculates a weighted average efficiency score based on the reported waste in credit cost terms.
Consider the following table as an example:
Note that the overall index efficiency of 82.6% is calculated as the accumulated waste vs. the total spend, which may not be the accumulation of all other index components, as some of them may overlap.
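The effect of overlap can be sketched in Python. The example assumes that each component reports its waste per cost item (for instance a query or a table-day) and that an item flagged by several components is counted once, at its largest reported waste. The data layout, names and numbers are hypothetical and are not taken from the example table.

```python
from typing import Dict

# Hypothetical layout: cost item id -> credits reported as wasted for it.
ComponentWaste = Dict[str, float]


def overall_efficiency(total_spend: float, components: Dict[str, ComponentWaste]) -> float:
    """1 - (deduplicated waste / total spend)."""
    deduplicated: Dict[str, float] = {}
    for waste_by_item in components.values():
        for item, credits in waste_by_item.items():
            # Count an item flagged by several components only once.
            deduplicated[item] = max(deduplicated.get(item, 0.0), credits)
    if total_spend <= 0:
        return 1.0
    return 1.0 - sum(deduplicated.values()) / total_spend


components = {
    "warehouse_utilization": {"wh_idle": 100.0},
    "unused_tables_compute": {"q1": 40.0, "q2": 20.0},
    "failing_queries": {"q2": 20.0, "q3": 30.0},  # q2 overlaps with the row above
}
naive_waste = sum(sum(c.values()) for c in components.values())  # double-counts q2
index = overall_efficiency(1000.0, components)
```

Here the components add up to 210 wasted credits, but only 190 are distinct, so the overall index is higher than a simple sum of component waste would suggest.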
Interested in seeing what your efficiency index looks like?
Sign up for Bluesky at https://www.getbluesky.io/contact
1 Such tables may have READ activity related to their maintenance, but not for serving consumers.