Junaid is a rising senior at UC Berkeley and one of our first interns at Bluesky. During the first six weeks of his internship, he took on an independent project to investigate and enhance the use of regular expressions in query text processing. With daily 15-minute mentorship sessions from Zheng Shao, Bluesky’s co-founder and CTO, Junaid completed the project end-to-end, from investigation and data analysis to productionization.
Here is Junaid’s story about his internship experience at Bluesky. Bluesky is actively hiring strong software engineers as interns and full-time employees. Please apply via Bluesky Jobs if you are interested.
During my internship with Bluesky this summer, I had the opportunity to work with the engineering team on projects revolving around SQL, databases, and query analysis. Through this experience I became more proficient with Python, SQL, and regex. But most importantly, I gained a far better understanding of what it means to develop production-level code. The startup culture at Bluesky meant that even as an intern I felt like I was contributing to projects at the core of the company. And even though the internship was remote, frequent check-ins with the engineering team, particularly my mentor Zheng Shao, allowed me to build a strong connection with the Bluesky team and learn a lot from them.
My first task was to extract time range data from SQL queries using regex. This project allowed me to quickly gain familiarity with Bluesky’s work, which revolved around optimizing Snowflake query costs. The goal was to determine what percentage of queries with similar structures but vastly different costs owed that difference to their time range values.
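To give a concrete flavor of the approach, here is a minimal sketch in Snowflake SQL, assuming a hypothetical query_history table with a query_text column; the production logic handled many more time formats and SQL time functions:

```sql
-- Hypothetical sketch (not the production extraction logic): pull the first
-- two date/timestamp literals out of each query's text and treat them as the
-- bounds of its time range. query_history and query_text are stand-in names.
SELECT
    query_text,
    REGEXP_SUBSTR(query_text, '\\d{4}-\\d{2}-\\d{2}( \\d{2}:\\d{2}:\\d{2})?', 1, 1) AS range_start,
    REGEXP_SUBSTR(query_text, '\\d{4}-\\d{2}-\\d{2}( \\d{2}:\\d{2}:\\d{2})?', 1, 2) AS range_end
FROM query_history
WHERE REGEXP_COUNT(query_text, '\\d{4}-\\d{2}-\\d{2}') >= 2;
```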
Through this first task, I gained experience with regex, common SQL query structures, and SQL time functions.
In order to use the extracted time range data for analysis, it was important to have a method to group queries with similar structures together. Bluesky already had one core method for generating a query signature, where each signature corresponded to a cluster of structurally similar queries.
But to allow for more flexible analysis, having multiple definitions of the query signature would be essential. I was responsible for implementing a series of six additional query signature (QS) definitions, each normalizing more elements of the query than the one before, so that progressively more queries map to the same signature. For example, the final query signature I implemented, QS5, normalized everything that was not a SQL keyword (e.g. SELECT, INSERT, FROM, WHERE), while the first, QS0, normalized nothing. On the 70,000,000 queries the series was tested against, QS5 generated 99.8% fewer distinct signatures than QS0.
The query signature series was likewise built with regex to match the desired query elements. The table below shows how each subsequent query signature generated far fewer distinct signatures through broader, more generalized groupings.
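As a simplified sketch of how such a series can be built (these are not the actual QS definitions), each level stacks another REGEXP_REPLACE on top of the previous one:

```sql
-- Simplified sketch of two signature levels (not the actual QS definitions).
SELECT
    -- Level 1: normalize string and number literals to '?'.
    REGEXP_REPLACE(
        REGEXP_REPLACE(query_text, '\'[^\']*\'', '?'),  -- string literals
        '\\b\\d+\\b', '?')                              -- number literals
    AS signature_v1,
    -- Level 2: additionally collapse IN-lists, so (1, 2, 3) and (4, 5)
    -- map to the same signature.
    REGEXP_REPLACE(
        REGEXP_REPLACE(
            REGEXP_REPLACE(query_text, '\'[^\']*\'', '?'),
            '\\b\\d+\\b', '?'),
        '\\(\\s*\\?(\\s*,\\s*\\?)*\\s*\\)', '(?)')      -- (?, ?, ?) -> (?)
    AS signature_v2
FROM query_history;
```

Each additional level trades precision for coverage: signatures get coarser, and more queries collapse into each bucket, which is exactly the effect visible in the table.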
One of the major challenges I encountered while implementing the series of query signatures was Snowflake's lack of support for regex lookaround operators. To accurately match specific query elements, like a SELECT list, lookahead assertions were essential. To enable lookaround, I turned to SQL UDFs (User-Defined Functions): since JavaScript's regex engine supports lookaround operators, I created a JavaScript UDF to match specific query elements with regex and return the modified query signatures.
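Sketched below is roughly what such a UDF looks like, with a hypothetical name and a deliberately simplified pattern; the real UDF handled more query elements. The lookahead (?=\bFROM\b) is the part Snowflake's native regex functions cannot express:

```sql
-- Simplified sketch of a JavaScript UDF with a lookahead assertion
-- (hypothetical name; the production version was more involved).
CREATE OR REPLACE FUNCTION normalize_select_list(query_text STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
    // The lookahead (?=\bFROM\b) anchors the match at FROM without
    // consuming it, so the keyword survives in the output.
    var pattern = /\bSELECT\b[\s\S]*?(?=\bFROM\b)/gi;
    return QUERY_TEXT.replace(pattern, 'SELECT ? ');
$$;
```

With this in place, calling normalize_select_list(query_text) turns any SELECT list into the placeholder SELECT ? while leaving the rest of the query intact.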
It quickly became apparent that the JavaScript UDF performed far more slowly than the native Snowflake function. To understand the tradeoff between performance and lookaround support, I created a Jupyter Notebook that ran the regex operations over 100,000 queries and measured the difference in performance between the UDF and the native Snowflake function using the Python datetime library. The UDF turned out to be 70% slower than the native function.
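The measurement itself lived in the notebook, but the comparison boils down to running the same workload both ways. A hypothetical in-warehouse version (sample_100k_queries is a stand-in name) would look like this, with the elapsed times then read from Snowflake's query history:

```sql
-- Run the same normalization over the sample twice: once with the native
-- regex function, once with the JavaScript UDF, and compare elapsed times.
SELECT COUNT(DISTINCT REGEXP_REPLACE(query_text, '\\b\\d+\\b', '?'))
FROM sample_100k_queries;  -- native Snowflake function

SELECT COUNT(DISTINCT normalize_select_list(query_text))
FROM sample_100k_queries;  -- JavaScript UDF
```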
After developing the query signature series and the time data extraction model, I was responsible for implementing the model in dbt and creating unit tests for it. dbt was our tool of choice because it makes it easy to transform data with simple SELECT statements, build modular code, and automate testing.
To take full advantage of dbt, I implemented the model as an incremental model to save compute costs. The most challenging aspect of the migration to dbt was unit testing. My final implementation used dbt macros, seeds, tags, and command-line variables. To test the model, I created a test input table and a test output table. A macro ensured that when a unit test was running, the model would read from the test input table instead of the source table. I then used the dbt-utils package to compare the model's output on the test input against the test output table.
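A stripped-down version of that wiring might look like the following; the model, macro, seed, and column names are all hypothetical, and the real model carried the full signature and time-extraction logic:

```sql
-- models/query_signatures.sql: stripped-down incremental model (hypothetical).
-- select_input() resolves to the test seed during unit tests and to the
-- real source table otherwise.
{{ config(materialized='incremental', unique_key='query_id') }}

SELECT
    query_id,
    query_text,
    start_time
FROM {{ select_input() }}
{% if is_incremental() %}
  -- Only process rows newer than those already in the target table.
  WHERE start_time > (SELECT MAX(start_time) FROM {{ this }})
{% endif %}
```

```sql
-- macros/select_input.sql: swap the model's input via a command-line variable.
{% macro select_input() %}
    {% if var('unit_test', false) %}
        {{ ref('test_input_queries') }}  {# seed loaded for the unit test #}
    {% else %}
        {{ source('snowflake', 'query_history') }}
    {% endif %}
{% endmacro %}
```

Running dbt run --vars 'unit_test: true' then points the model at the seed, after which the dbt-utils equality test compares the model's output against the expected table.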
One major refinement that could still be made to my dbt unit testing setup is support for comparing columns of type ARRAY. The built-in equality test in the dbt-utils package cannot compare ARRAY columns, so properly testing a model with ARRAY columns would require building a custom equality test in dbt.
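One workaround, sketched below with hypothetical model and column names, is to serialize the ARRAY columns to text in a thin staging model and run the standard equality test over that; a true fix would bake the cast into a custom equality test's own SQL:

```sql
-- models/query_signatures_serialized.sql: hypothetical staging model that
-- serializes the ARRAY column so the standard equality test can compare it
-- as plain text.
SELECT
    query_id,
    signature,
    TO_JSON(extracted_time_ranges) AS extracted_time_ranges_json
FROM {{ ref('query_signatures') }}
```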
Overall, my internship at Bluesky proved to be an incredibly valuable experience, and I’m excited to watch the company build out the next generation of AI-powered big data infrastructure. As Mingsheng says, “Above the clouds, the sky is always blue.”
We’re hiring. If you are looking to make an impact and are inspired to build the future of AI-driven data infrastructure, take a look at our job openings.