December 6, 2023

Optimizing Data Cloud Efficiency: Introducing Bluesky's Enhanced Efficiency Index v2

Yoav Derazon

Overview

Bluesky’s mission is to help customers on their journey to make better use of these data cloud. That broad statement can be broken down to several aspects which constitutes a “better use”. Consider the following:

Cost
Resource utilization / efficiency
Performance
Code quality
Development velocity

‍

Cost optimization is about how your costs change over time, how they adhere to a budget and how different teams use their resources. An increase in cost does not necessarily mean an unoptimized use of resources. And a decrease in costs does not necessarily mean a more efficient workload.

‍

Resource utilization, or efficiency is about measuring the effective use of resources. In other words, it measures the spend and the “waste” in different categories of usage. This is what we’re going to focus on here.

‍

Let's talk about the other three objectives of a “better use” policy: performance, code quality and development velocity. These are core pillars of the Bluesky product and are important to achieving a healthy environment. They each require a deep dive so we will cover them in separate blog posts.

Efficiency index v2

The Efficiency index has a Bluesky feature almost since our GA. In this new release of the Index, we are vastly expanding its scope and usability by making it trackable over time and by providing more insights into what moves it. We’re focusing on the following 3 aspects:

Explainable. If there is a drop in efficiency, our customers should be able to easily trace the root cause.
Actionable. Root causes of lower efficiency are interesting, but less valuable if there is nothing you can do about them. We are focusing on providing our customers with solutions to their inefficiencies that are based on their specific queries.
Per scale. Ideally, the index should be normalized for 0-100%. That means that a customer running with an ideal optimization process can asymptotically reach 100%.

‍

The implementation of efficiency index relies on the following 5 components:

Table with Lato Font and Margins

Index component	Description
Warehouse utilization	Warehouse utilization accounts for the waste in credits when the warehouse is idle and not processing any queries.
Unused tables: Storage	This component accounts for the cost of storing tables that are classified as unused. The definition of “unused” includes tables that are not accessed (SELECT) for more than 35 days but have pipelines that keep updating them.
Unused tables: Compute	This component accounts for the cost of maintaining data (e.g., UPDATE/INSERT/DELETE) in tables that are classified as unused.
Backup bytes	This accounts for the cost of keeping backup bytes (time travel & failsafe) in storage compared to the total cost of storing all data. Note that backup bytes, in general, are helpful in achieving resilience and rollback, and not all backup bytes are considered “wasteful.” Our calculation takes into account such backup bytes that are not likely to be used for time travel or recovery (more on that later)
Failing Queries	This tracks the waste in credits due to queries failing to complete execution. The calculation takes into account ALL failing queries, unlike the proactive findings, which will only report on recurrent queries that meet a certain threshold in spend/waste.

Now, let's dig into each one in more detail.

Warehouse utilization

Warehouse utilization can be thought of in two ways:

Measuring the extent to which a warehouse is active and serving actual workloads (a.k.a active & busy), as opposed to when the warehouse is active (and therefore accruing cost), but not serving any workloads (a.k.a active+idle).
Measuring the amount of concurrent workloads being served by a warehouse compared to the “maximum load” the warehouse in its current configuration setting can actually serve. This implies that utilization in a given point in time is not binary (i.e. busy/idle) but rather measured on a continuum.

‍

For this version of the efficiency index, we have chosen to rely on idle/busy utilization measurement for the following reasons:

It is an easier and more practical KPI to address (actionable).
It is easier to explain and rationalize (explainable), as there is no dispute over whether a warehouse is doing actual work vs. no work at any given time.
It is easier to provide as a scaled KPI (per scale), as driving warehouses to continuous utilization will result in a maximum of 100% utilization.

‍

We are analyzing data on partial utilization which will help us establish what is an effective FULL utilization of a warehouse. We will ship the updated calculation soon, for free, to all of our customers.

‍

Let’s look at a warehouse utilization chart:

‍

Warehouse utilization

‍

We can see the fluctuating utilization as a function of weekdays vs weekends. We also see a slight deterioration of utilization over the course of the last 2 months. Such insights can help inform what actions to take by drilling down into the specific warehouses’ efficiencies and addressing parameter tuning such as min/max clusters, warehouse size, auto-suspend threshold. Bluesky will automatically give proactive recommendations for each of these if it detects an opportunity for optimization.

‍

Unused tables

Unused tables cause cost efficiency to decrease in two ways. First, the actual bytes stored on the blob store (S3, GCS etc.) carry a monthly cost. Second, often more pronounced but flying under the radar, is the cost of maintaining such tables using ETL pipelines, despite the fact that they are not actually being read by consumers.

‍

We define an unused table as:

Any base table that has not been accessed for READ in the trailing 35 days or more
AND
Tables are are continuously maintained by adding/removing/updating records

‍

The results are broken into two index components, namely Unused tables - Storage and Unused tables - Compute.

‍

The following charts show examples of both index components

‍

Unused tables - Storage

‍

The reader may notice that while the daily cost of overall storage (gray bars) is fluctuating, as tables are created and destroyed as part of pipelines, the actual waste (red bars) moves more slowly as the set of “unused tables” is relatively fixed and has a more predictable daily cost.

‍

Unused tables - Compute

‍

In the case of compute costs for unused table, the trend can be more volatile, as compute-heavy events on the underlying used tables may fluctuate per the execution of ETL jobs

Backup bytes

Backup bytes consist of failsafe bytes ensuring the resilience of the data in case of a temporary outage, as well as time travel bytes which record the change history of tables which allow for you to restore to an older point in time when necessary. Active bytes are bytes used to store the active table in its latest version.

While a naive calculation can show you the total ratio of active bytes vs. backup bytes, it will be wrong to consider all backup bytes as “waste”, since they serve the purpose of potentially restoring data and therefore bring value.

For this calculation, we consider backup bytes to be “wasteful” in the following conditions:

The table has been created more than twice in the last trailing 10 days

The table was created (at least once) in the last 10 days but has also been deleted in the last 10 days

‍

If the table has not been created more than twice in the last 10 days, this signals that the use of backup bytes to restore the table or to time travel is highly unlikely, as the table probably participates in an ETL process that can reconstruct the table, rather than revert to an older version.

If the table was created but also deleted in the last 10 days, it signals that it’s likely a temporary solution during development/testing which is not likely to get restored.

‍

The following chart illustrates the tracking of backup bytes efficiency, and waste over time.

‍

Backup bytes efficiency

‍

Failing queries

Failing queries are non productive, as they consume credits and do not perform any useful work. While under normal circumstances, a small percentage of queries is expected to fail as part of development and testing, a high level of failure can be a significant drag on cost efficiency and often hides under the radar. For example, a recurrent query that accumulated a meaningful cost over time may be failing without notice. This can be due to updated configuration such as statement timeout, or change in the data model that causes it to fail (e.g. select * from a table that had its schema changed, therefore breaking downstream pipelines).

‍

We recently uncovered a failing query which a customer had missed which was on track to cost them $327,500 over the year!

‍

The failing queries efficiency index tracks all failing queries and aggregates their cost to calculate the rate of successful queries.

‍

This index can help track and suppress the failure rate over time by drilling down to the specific queries that are failing, for triage. It can also be sensitive, and has been seen to alert on significant spikes in failures within as little as a day. Note that Bluesky also has proactive findings that would alert on some failing queries based on their cost and other parameters.

‍

Failing queries efficiency

‍

To emphasis the principle of actionable insights, let’s look at the following scatter plot of the many runs of this query:

‍

‍

As can be seen from the chart, the query fails consistently when it reaches 400s. Bluesky will fire an alert in email, Slack and/or PagerDuty and it will show you a notification on the top of the UI, along with an explanation:

‍

Efficiency index v2 - putting it all together

The overall efficiency takes all 5 components and calculates a weighted average efficiency score based on the reported waste in credit cost terms.

‍

Consider the following table as an example:

Styled Table

Index component	Index value	Wasted credits
Warehouse utilization	61.4%	$35,640
Unused tables: Storage	69.1%	$1,452
Unused tables: Compute	100.0%	$0
Backup bytes	78.3%	$44
Failing Queries	99.8%	$132
Overall index	82.6%	$37,268

‍

Note that the overall index efficiency of 82.6% is calculated as the accumulated waste vs. the total spend, which may not be the accumulation of all other index components, as some of them may overlap.

‍

Interested in seeing what your efficiency index looks like?

¹Such tables may have READ activity related to their maintenance, but not for serving consumers