The Best Optimization Resources Available

Prodan Statev

Post: Get to Know the Bluesky Sales Team

Bluesky is excited to introduce two new members of the team. Prodan Statev and Luis Leon recently joined Bluesky, bringing decades of combined experience in the data space, most recently at Snowflake and dbt Labs respectively. They are both excited to help build the most ambitious copilot for Snowflake and to help people get the most out of it.

What got you interested in working with data?

PS: I was in middle school and loved hanging out in the small computer repair shop down the street. The two people working there mostly fixed desktop PCs, which at the time meant defragmenting hard drives and performing clean installs of Windows to remove all of the malware people had downloaded. I figured I could be doing the same! When I finally convinced one of my mom’s colleagues to let me fix his slow PC, I quickly figured out I knew absolutely nothing about computers. So I brought the computer to the repair shop, had them do their magic (aka reinstall the OS), and then strapped the PC on my back and walked it to the happy customer, who quickly gave me another PC to fix. I sensed a business opportunity and struck a deal with the small repair shop: I would bring them computers to fix, which they promised to turn around quickly, and I would charge the clients I acquired slightly more for the privilege of fast service and hand delivery. Eventually, I was hired to work at the computer repair shop and was tasked with tracking the inventory. There, I worked with my first piece of enterprise software: SharePoint.

LL: One of my first jobs out of college was working for a small start-up that did data collection and analysis on the quality of acute surgical care, which is a fancy way of saying we tried to use data to help physicians improve their performance and help patients get better treatment. It was really powerful to see how my analytic skills could help professionals with far more education and training than I had, and how they in turn used this information to save lives. One of the jobs I often got drafted into was helping our head of IT manage and upgrade our data center. I remember very vividly the days-long process we would go through every time we got a large new data set: the data would arrive on CD-ROMs, the director of IT and I would drive out to the data center and physically slot more RAM into the racks, then call back to the office to make sure everything worked and we had enough storage and memory to actually process all the new data, all the while worrying that I would accidentally discharge static electricity and fry everything we had built. While it was an interesting learning experience, none of this work directly helped the mission of the business or its customers. So when I found there was a better way of doing this work, I was hooked.

Tell us about your experience working with the Modern Data Stack.

PS: I built a startup in college which at one point was going to do massive amounts of analytics on every user interaction with a piece of media. We quickly pivoted, but it gave me a taste of working on what I thought of at the time as big data. After college, I had to find “real work” (per US immigration regulations), so I joined the then-small Snowflake. I worked on the sales side, eventually co-creating Snowflake’s Startup Program, which helped application builders create the next generation of startups built on the platform.

I’m very proud of what we created there, because we had to offer something more compelling than the Startup Programs at AWS, Azure, and GCP, which were much larger and offered a ton of free credits. Along the way, I worked with some of Snowflake’s biggest customers, helping them use the system better, grow their business faster, and support the most demanding use cases.

LL: dbt and Snowflake are tools that fundamentally change what is possible for an organization to do with data. By empowering any user who can write SQL SELECT statements to create, test, and deploy data pipelines, dbt lets individuals and organizations react to and transform data faster and more easily than ever before. Pairing it with Snowflake has truly revolutionized how organizations use data: small teams of business users can now accomplish in a matter of days or hours what used to take full data engineering departments weeks.

When I worked with organizations adopting dbt for the first time, it was often because existing systems and processes had broken down. In many cases, data velocity and reliability had gotten so bad that business users couldn't trust the data, or they just found alternative ways of getting it (Excel downloads, anyone?). Sometimes a key member of the data engineering team had moved on to a new role or a new company, taking vast amounts of institutional knowledge with them. And in some cases, mighty data teams of one simply needed better tools to keep up. Regardless of the specifics, there was always a compelling catastrophe, generally the type a business would want to throw money at to make it go away.

The good news is these are amazing tools, and I've seen hundreds of organizations of every size overcome these problems, transforming the way they deal with data and, ultimately, do business.

What’s the most underrated problem when it comes to adopting the modern data stack?

PS: At Snowflake, I would frequently get asked “how can we maximize our investment in Snowflake?” The short answer involved lots of curated content from consultants, engineers, and practitioners that I had collected over years of working with different organizations. For example:

  • How to troubleshoot individual queries: https://medium.com/snowflake/unleashing-the-power-of-query-optimization-in-snowflake-e0af413dd441
  • When and how to use Search Optimization: https://medium.com/snowflake/snowflake-performance-search-optimization-service-part-1-8d42b4a96b14
  • Using resource monitors to control costs: https://medium.com/snowflake/best-practices-to-optimize-resource-optimization-in-snowflake-837218c6db59
  • CI/CD in Snowflake: https://medium.com/snowflake/building-ci-cd-pipelines-for-data-applications-at-snowflake-702f398ec7c1
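To give a flavor of what the resource-monitor advice boils down to, here is a minimal sketch in Snowflake SQL; the monitor name, warehouse name, and credit quota are illustrative assumptions rather than values from those articles:

```sql
-- Illustrative example: cap monthly credit spend on one warehouse
-- and get a warning before hitting the cap.
CREATE RESOURCE MONITOR analytics_monthly_monitor
  WITH CREDIT_QUOTA = 100             -- credits allowed per month (assumed value)
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY    -- early warning to account admins
           ON 100 PERCENT DO SUSPEND; -- suspend the warehouse at the cap

-- Attach the monitor to the warehouse it should govern (name is hypothetical).
ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = analytics_monthly_monitor;
```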

While people liked reading these articles, they weren’t relieved. I got a sense of overwhelm from them, which only increased as their pipelines became more complex. Even the most studious of my customers found that optimizing Snowflake required weighing multiple competing trade-offs and constant tinkering and monitoring.

What customers were asking for was something more useful than a bunch of links and hand-waving generalities. They wanted something specific to each workload. Something that could weigh the cost and benefit of each change, down to the query level. Something that could give them personalized recommendations and explain the trade-offs of each change. As you might have guessed already, that tool is Bluesky. But back then I didn’t know that, so I would resort to giving them the list of links above.

LL: A pattern I would continually see across organizations was that once these initial fires had been put out, I would start to hear a common set of questions:

  • How can I monitor these dbt jobs for performance?
  • How can I speed up our pipelines?
  • How can I manage the Snowflake costs associated with all these new workloads?

I would point folks to the dbt and Snowflake best practices guide (https://www.snowflake.com/wp-content/uploads/2021/10/Best-Practices-for-Optimizing-Your-dbt-and-Snowflake-Deployment.pdf), which advises sizing up your Snowflake warehouse (Small to Medium, Large to XL) or converting the materialization type of your dbt models (view to table, table to incrementally loaded table) “when performance becomes an issue.”
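To make that second lever concrete, here is a rough sketch of what converting a dbt model from a view to an incrementally loaded table can look like; the model, source, and column names are hypothetical and not taken from the guide:

```sql
-- models/fct_orders.sql (hypothetical model)
-- Previously materialized as a view (materialized='view').
-- As incremental, each run only processes rows added since the last build.
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select
    order_id,
    customer_id,
    order_total,
    updated_at
from {{ source('shop', 'orders') }}  -- hypothetical source

{% if is_incremental() %}
  -- On incremental runs, only scan rows newer than what is
  -- already in the existing target table.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```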

Many resources are available from Snowflake (https://docs.snowflake.com/en/user-guide/resource-monitors), dbt (https://docs.getdbt.com/docs/deploy/run-visibility#model-timing), and Select (https://hub.getdbt.com/get-select/dbt_snowflake_monitoring/latest/) to help you report and track granular usage, as well as tutorials (https://docs.getdbt.com/blog/how-we-shaved-90-minutes-off-model) on how folks have used these tools to find and fix issues. These tools are great, and I’ve helped many organizations use them to improve performance by more than 10x. But while they all provide plenty of information, they leave the specifics of what to change, and when, to you, the user, which often amounts to little more than a game of guess-and-check. What’s worse, the process is manual and has to be repeated as new workloads are brought into the warehouse, data volumes grow, and business needs change.
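For a sense of what that granular tracking looks like in practice, here is a sketch of a first-pass query against Snowflake's built-in SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view to surface tuning candidates; the seven-day window and row limit are arbitrary illustrative choices:

```sql
-- Surface the heaviest recent queries as candidates for tuning.
select
    query_text,
    warehouse_name,
    total_elapsed_time / 1000 as elapsed_seconds,  -- column is in milliseconds
    bytes_spilled_to_local_storage,  -- spilling hints at an undersized warehouse
    partitions_scanned,
    partitions_total                 -- scanned vs. total hints at pruning problems
from snowflake.account_usage.query_history
where start_time >= dateadd('day', -7, current_timestamp())
  and execution_status = 'SUCCESS'
order by total_elapsed_time desc
limit 20;
```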

Why work on Bluesky and not another startup?

PS: Number one, the amazing team we have assembled. Seriously, check out our about page; it’s like the nerdy version of the PayPal mafia. And number two, Bluesky answers a clear need, and it is a joy to use. You just feel good knowing that this powerful beast of a system is behaving optimally and is being looked after by an advanced tool designed by Snowflake experts. And at the end of the day, it just works. It saves you time, it saves you money, it makes you more productive, and it makes you your CFO’s best friend. I want to work on products that do those things for customers.

LL: That is very well said. As someone who has spent a career working in data and helping people find data-driven solutions to their problems, it always frustrated me that the best tuning advice I could give to the people I worked with was to change settings, configurations, and materializations “when performance becomes an issue.” It never felt like a satisfying answer as a data practitioner. Bluesky gives people real, actionable suggestions based on the data and usage in their warehouse. I’m a big believer in practicing what you preach, so bringing a data-driven approach to managing your data warehouse is, for me, truly a beautiful thing.

Do you relate to these stories? Maybe recognize your organization's struggles? Interested in learning more about how Bluesky can help? Contact us at https://www.getbluesky.io/contact to speak with Prodan and Luis today!