Introducing Bluesky Control - Proactive Governance Enhancements

Chadd Kenney

Data engineers are the unsung heroes of organizations, helping everyone navigate the data jungle and making sure data is ready to drive crucial business decisions. But no data pro wants to spend all day watching out for bad behavior such as workload inefficiencies, especially as the organization scales into more and more use cases. Building custom dashboards and homegrown infrastructure to monitor this kind of behavior is possible, but the results are not exactly user-friendly, and building them is not the best use of time. So, wouldn't it be cool if you could just get a heads-up in real time when things go wonky, complete with suggested remedies?

At Bluesky, we're all about making data engineers shine like superheroes, without all the extra work. We have enhanced our proactive governance solution, Bluesky Control, with a ton of new alerts for things like runaway queries, daily cost spikes, and even weekly cost rollercoasters (both the grand total and your warehouse expenses). Oh, and don't forget about those sneaky query anomalies that drive up costs if they are not resolved quickly.

Bluesky is all about staying ahead of the curve and keeping those cost spikes in check. As you would expect, Bluesky runs on Snowflake. When we onboard customers, we ingest a ton of metadata, fire up our analytics pipelines, and churn through the data, looking for new and unique findings to help customers. But even our team can drive up costs due to mistakes.  

For example, back on 8/29, we rolled out a few new product features, and BAM!  There was a massive spike to the tune of 32,049% compared to normal. We kind of saw it coming due to our growth but didn't expect it to blow up like it did. Come to find out, much of the cost was due to an exploding join that got out of control. If we didn't have Bluesky on our side, that would've been a real head-scratcher, and likely, we would have missed it.

Let’s replay the scene and see how Bluesky saved the day in a matter of minutes.

Setting the Scene:

Imagine you are just hanging out, minding your business, when you get an alert via Slack. You can’t believe it: there's a massive cost spike, a whopping 32,049% increase! Now, not all cost spikes are created equal. Sometimes a cost spike is a good sign, such as when you are onboarding more customers and delivering more value to them. Or it may reflect a welcome change to the business, where each additional dollar spent results in ten dollars earned. But regardless of the cause, it's definitely a "stop and check this out" moment. I clicked on the notification and saw all the details of this particular cost spike from our virtual warehouse, APP_WH.

Examining the Cost Spike Notification:

The notification showed a significant spike, with these details: “The warehouse APP_WH spent a total of $1,036 on this day, which was a 32,049% increase from the typical value. Typically, daily spend is around $3.” It's a crazy spike, but the question is, why? To help diagnose the issue, let’s look at the top 10 offender query signatures for that particular day.
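For the curious, here is the back-of-the-napkin math behind a number like that. This is only a sketch in Python, not how Bluesky computes it; the function names are made up for illustration. Working backward from the alert, a 32,049% increase that lands on $1,036 implies a typical daily spend of roughly $3.22, which lines up with the "around $3" in the notification.

```python
def percent_increase(current: float, baseline: float) -> float:
    """Percentage increase of `current` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


def implied_baseline(current: float, pct_increase: float) -> float:
    """Invert percent_increase: the baseline that yields `pct_increase`."""
    return current / (1.0 + pct_increase / 100.0)


# Figures taken from the alert text: $1,036 spent, 32,049% increase.
typical = implied_baseline(1036.0, 32049.0)
print(f"implied typical daily spend: ${typical:.2f}")
print(f"round trip: {percent_increase(1036.0, typical):,.0f}% increase")
```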

  

Query signatures group queries that are 99.999% the same but might have a different predicate value in the WHERE clause. A query signature lets you see all the runs of that query during a particular time frame. On that day, there was a query signature that cost us $1,054 in a single day. It is a new query, so it's time to dig in.
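If you are wondering what grouping "nearly identical" queries might look like, here is one common way to approximate it: replace the literal values with placeholders, tidy up whitespace and casing, and hash what is left. To be clear, this is a simplified sketch under my own assumptions, not Bluesky's actual signature algorithm, and `query_signature` is a hypothetical helper.

```python
import hashlib
import re


def query_signature(sql: str) -> str:
    """Hypothetical signature: hash of the query with literals masked."""
    # Mask single-quoted string literals ('' is an escaped quote).
    text = re.sub(r"'(?:[^']|'')*'", "?", sql)
    # Mask standalone numeric literals (digits inside identifiers are kept).
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "?", text)
    # Collapse whitespace and ignore letter case.
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


q1 = "SELECT * FROM orders WHERE customer_id = 42 AND region = 'EMEA'"
q2 = "select *  from orders where customer_id = 7 and region = 'APAC'"
q3 = "SELECT * FROM customers WHERE customer_id = 42"

print(query_signature(q1) == query_signature(q2))  # same shape, different values
print(query_signature(q1) == query_signature(q3))  # different table
```

Two runs that differ only in their predicate values end up with the same signature, so their cost and runtime can be rolled up and compared side by side.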

Digging into the Query:

After clicking the query signature, I can dig into it, review all the metrics, and even compare it to other runs of the same query signature. For example, below is a scatter plot to showcase the massive difference in execution time. You can click on each of the dots to compare each and determine why that particular run took so long.  

Understanding the Finding and Resolution:

The good news is that Bluesky provides findings for queries to ensure you are always in the know, and what do you know, it's an exploding join! Bluesky helpfully notes that “This query pattern has a Join operator which produces significantly (often by orders of magnitude) more rows (1772348674400 rows) than it consumes (5605635 rows). It's joining tables without providing a join condition (resulting in a Cartesian product), or providing a condition where records from one table match multiple records from another table.” In other words, the join takes in about 5.6 million rows and spits out roughly 1.77 trillion (seriously, 1,772,348,674,400 rows). That is a crazy number of rows.
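To put that finding in perspective, you can compute the join's fan-out, meaning rows produced divided by rows consumed. The snippet below is a minimal sketch using the two row counts quoted in the finding; the helper names and the 10x cutoff are assumptions for illustration, not Bluesky's detection logic.

```python
import math


def join_fanout(rows_consumed: int, rows_produced: int) -> float:
    """Rows produced by a join per row it consumed."""
    if rows_consumed <= 0:
        raise ValueError("rows_consumed must be positive")
    return rows_produced / rows_consumed


def is_exploding_join(rows_consumed: int, rows_produced: int,
                      max_fanout: float = 10.0) -> bool:
    """Flag a join whose fan-out exceeds `max_fanout` (an assumed cutoff)."""
    return join_fanout(rows_consumed, rows_produced) > max_fanout


# Row counts quoted in the finding above.
fanout = join_fanout(5_605_635, 1_772_348_674_400)
print(f"fan-out: about {fanout:,.0f}x")
print(f"orders of magnitude: {math.log10(fanout):.1f}")
print(f"exploding join: {is_exploding_join(5_605_635, 1_772_348_674_400)}")
```

A healthy join on a proper key usually has a fan-out close to 1, so a ratio in the hundreds of thousands is a strong hint that the join condition is missing or matches far too many rows.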

Comment and Resolve:

Next, I commented on the finding inside Bluesky to notify the developers about the issue. They receive a notification that directs them to the query analytics, findings, and recommendations. The best part? After Bluesky surfaced this exploding join and its negative impact, our developer quickly spotted the error in their code and fixed it within 5 minutes. Without Bluesky, it might have taken the eng team multiple weeks, if not months, to find and fix the issue, hurting both cost efficiency and product quality (since a cross-join is usually a bug). And the icing on the cake? Making this simple change could save $7,601 a year.

Now, while this may not be a huge deal, just think about it – what if I never got that alert about this hiccup? What if this, along with a bunch of other weird stuff, just slipped under the radar? It's not just a waste of resources; it could also be messing with other super-important queries that folks are running on the same virtual warehouse.

Configuring Bluesky Control:

Creating these useful alerts with Bluesky Control is easy and fast. You create a new monitor, pick the type of monitor you need (like daily warehouse cost spike), choose the warehouses you want to keep an eye on (or just go all-in and select them all), set your alert threshold (by percentage or by a specific dollar amount), and decide how you want to get those heads-up notifications (email, Slack, or PagerDuty, take your pick). Once you hit that save button, you're all set and covered. Bluesky Control is an excellent complement to Snowflake’s new budget monitor feature. You get to enjoy peace of mind knowing your workloads won’t break the bank, along with fine-grained alerting configurations in Bluesky.
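To make those choices concrete, here is how the pieces of a monitor fit together if you think of it as data. This is purely an illustrative sketch: Bluesky Control is configured through its UI as described above, and the `CostSpikeMonitor` class, its fields, and the 200% threshold are all hypothetical, not a Bluesky API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CostSpikeMonitor:
    """Hypothetical model of a daily warehouse cost spike monitor."""
    name: str
    warehouses: List[str]                   # empty list means "all warehouses"
    pct_threshold: Optional[float] = None   # alert at this % increase or more
    abs_threshold: Optional[float] = None   # alert at this daily spend or more
    channels: List[str] = field(default_factory=lambda: ["slack"])

    def watches(self, warehouse: str) -> bool:
        return not self.warehouses or warehouse in self.warehouses

    def should_alert(self, warehouse: str, spend: float,
                     typical_spend: float) -> bool:
        if not self.watches(warehouse):
            return False
        if self.abs_threshold is not None and spend >= self.abs_threshold:
            return True
        if self.pct_threshold is not None and typical_spend > 0:
            increase = (spend - typical_spend) / typical_spend * 100.0
            return increase >= self.pct_threshold
        return False


monitor = CostSpikeMonitor(
    name="daily warehouse cost spike",
    warehouses=["APP_WH"],
    pct_threshold=200.0,                    # assumed value for illustration
    channels=["slack", "email"],
)
print(monitor.should_alert("APP_WH", spend=1036.0, typical_spend=3.22))
print(monitor.should_alert("OTHER_WH", spend=1036.0, typical_spend=3.22))
```

With a percentage threshold, the APP_WH spike from earlier trips the alert, while a warehouse that is not on the watch list is ignored.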

Next Steps:

The good news is that all Bluesky customers get this new proactive governance, and we ship a bunch of other cool features every week. We will detail those later. We are all about helping customers get the most out of their Snowflake investment. Stay tuned for more info, and sign up for our newsletter to keep up with all the new features. We have a ton of excellent webinars and events coming up, so also check out our upcoming events page for more information. Lastly, don't forget to treat yourself to a quick demo of how fast you can spot and fix issues, just like the one you saw above.

Demo:

Until next time, build more and optimize less, my friends.