Databricks Serverless SQL + Python = An analyst’s dream come true! ❤️

Sathish Gangichetty
Jun 17, 2022


We’ve been told that 2022 is the year when a true economic contraction will start, and we’re already seeing some of that. If this is true, it gives us an opportunity to ask: “What can we do to help reduce the cost burden on our orgs?” Maybe we can look at what we do every day and unlock opportunities for cost optimization. After all, necessity is the mother of… yeah, right… first-principles thinking. Using this guiding principle, we’ll examine how Databricks SQL + Serverless Endpoints + Python can deliver significant benefits over traditional cloud data warehousing vendors.

Every day, data analysts use a variety of tools to ingest data into their cloud data warehouse and then generate reports for end users. Sometimes these reports are consumed via slide decks; in other cases, they end up as part of a BI dashboard. This cycle of ingest-analyze-consume is at the heart of what a data analyst typically does. Granted, in most cases there may not be a need to ingest, since well-qualified ingest jobs are handled by the data engineering team. However, the need to ingest arbitrary files and data still stands.

These could be custom files downloaded from a CRM system, data from some arbitrary form that has to be processed, or even data stitched together from two disparate operational systems using a reporting tool or Excel. In any case, this grunt work still happens, and it is “hands-on-the-keyboard” work that someone does. It’s the data analyst’s responsibility to load these files into a data warehouse, analyze the data, and make it available for consumption. In the past, cloud data warehouses (CDWs) offered a decent leg up and cost savings via their serverless offerings. But the trade-offs have been lopsided, and, to their credit, there wasn’t a viable, cost-effective alternative until the lakehouse paradigm emerged. This gives us room to analyze the downsides of a traditional CDW. The traditional CDW model creates significant organizational risk and overhead while boxing out options that don’t fit the vendor’s way of functioning and thinking. In other words, choosing them has significant downstream consequences.

Let’s take a look at some of these risks played out through the lens of a traditional cloud data warehouse provider and the promises they offer:

  1. They ingest your data into their warehouse, serve it up in a format that’s opaque to you, and re-cluster it for you periodically based on your consumption patterns. (proprietary risk)
  2. They give you ways to access the data, but you always have to go through their tools. If they go lights out, so do you. You have to trust that they never go lights out. And usually, they can make this pain go away if you give them more money. (proprietary risk)
  3. They may or may not enable other workloads you truly care about down the road, because those don’t align with their interests. If you care deeply about such workloads, you may need to buy tools from other vendors they get along with. (not open)
  4. If 1, 2, and 3 don’t excite you, you can park your data on object storage in the cloud of your choice and use their tools to operate on that data, but recognize that they may not perform as efficiently or effectively. At the end of the day this will increase your compute costs, but hey, that was your own doing! You should’ve listened to them, because they know what they’re doing. (flexibility risk)

These are very expensive, forward-looking decisions that are usually thrust upon an unsuspecting analyst who has little to no understanding of how things play out at scale. Usually, one ignores these problems until the risks manifest themselves, either via spiking costs or when you want to do something that isn’t in the vendor’s “wheelhouse”.

Also, the risk of tech debt rarely registers with analysts. They usually don’t realize the far-reaching implications of these choices. But perhaps they should. If all the analyst wants to do is refresh a report or run some basic analytics workload in a cost-effective manner, this is definitely an area to double-click on in the current economic climate. As we mentioned earlier, this is an area where you want to put on your first-principles hat. What worked yesterday may not work today. In fact, the only thing we can’t do away with in this setup is the need to land your data in the cloud. There are simple tools that let you do that (s3fs, for example). Couple this with the Delta project, Databricks Serverless SQL, and its Python connector, and we might just have the best bet in town for a serverless offering. To understand why, let’s play this out a little more with an example.

We have some data on our local machine (it could have come from an on-prem database or some file). For our example, we’ll use the Titanic dataset. Our task is to analyze it for some simple reporting using the Databricks Serverless SQL offering and make the data available for use across teams. So, we’ll first upload the data to s3 (the object store we’ll use in this example) and then query it on Databricks SQL.

Before we upload the data into s3:

1. We can quickly read from our CSV file and perform any transformations or EDA needed using the pandas API on Spark, as sketched below.
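
A minimal sketch of this step, assuming pyspark (3.2+) is installed locally, the file is saved as titanic.csv, and the columns follow the standard Titanic CSV (Pclass, Age, Survived, and so on):

```python
# Sketch: read the raw CSV with the pandas API on Spark and do some light EDA.
# File path and column names are assumptions based on the standard Titanic dataset.
import pyspark.pandas as ps

psdf = ps.read_csv("titanic.csv")

# pandas-style exploration and cleanup, executed on Spark under the hood
print(psdf.shape)
print(psdf.isnull().sum())                              # missing values per column
psdf["Age"] = psdf["Age"].fillna(psdf["Age"].mean())    # simple imputation
print(psdf.groupby("Pclass")["Survived"].mean())        # survival rate by class
```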

2. Write to Delta locally by passing a few simple config options, as shown below. This is well documented here.
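
The exact configs from the post’s screenshot aren’t reproduced here, but the Delta Lake docs describe a common pattern for enabling Delta on a local Spark session; a rough sketch (in practice you’d build this configured session before the read in step 1) looks like this:

```python
# Sketch: enable Delta Lake on a local Spark session (requires the delta-spark pip package),
# then write the pandas-on-Spark dataframe from step 1 out as a local Delta table.
# The output path is illustrative.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("titanic-local")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

psdf.to_spark().write.format("delta").mode("overwrite").save("./delta/titanic")
```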

3. Now that we have our data in Delta format, we can quickly upload it using s3fs, as shown below. It’s worth mentioning that Delta is an open-source project that offers a robust table format. In fact, as you’ve seen in the steps above, there hasn’t been any vendor-managed tool in the mix for working with the Delta format. Additionally, Delta is simply Parquet with a transaction log attached to it; there’s no opaque, vendor-managed black magic. At this stage, we also want to make sure we have a storage credential in Databricks Unity Catalog and an external table pointing to that storage location.
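
As a rough sketch, with the bucket name and prefix made up for illustration, the upload is only a few lines; credentials are picked up from the usual AWS environment:

```python
# Sketch: push the local Delta table (Parquet files plus the _delta_log folder) to s3.
# "my-demo-bucket/landing/titanic_delta" is a placeholder destination.
import s3fs

fs = s3fs.S3FileSystem()   # uses AWS credentials from env vars / config files
fs.put("./delta/titanic", "my-demo-bucket/landing/titanic_delta", recursive=True)
```

On the Databricks side, once a storage credential and external location exist in Unity Catalog, the external table can be declared with something like `CREATE TABLE main.demo.titanic USING DELTA LOCATION 's3://my-demo-bucket/landing/titanic_delta'` (catalog, schema, and path here are placeholders).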

4. Next, we can quickly fire up a serverless endpoint on Databricks SQL and start querying away, as sketched below.
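
The original post shows this step as a screenshot; a hedged approximation of a start/stop helper against the Databricks SQL Warehouses REST API (called “SQL endpoints” at the time this was written), with the host, token, and warehouse ID as placeholders:

```python
# Sketch: toggle a Databricks SQL serverless endpoint (warehouse) via the REST API.
# DATABRICKS_HOST, DATABRICKS_TOKEN, and WAREHOUSE_ID are placeholders you supply.
# Depending on API version, the older /api/2.0/sql/endpoints/... path may apply instead.
import os
import requests

HOST = os.environ["DATABRICKS_HOST"]     # e.g. "https://<workspace>.cloud.databricks.com"
TOKEN = os.environ["DATABRICKS_TOKEN"]   # personal access token
WAREHOUSE_ID = "1234567890abcdef"        # placeholder warehouse/endpoint id

def toggle_endpoint(action: str) -> None:
    """action is either 'start' or 'stop'."""
    resp = requests.post(
        f"{HOST}/api/2.0/sql/warehouses/{WAREHOUSE_ID}/{action}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()

toggle_endpoint("start")   # spin up the serverless endpoint before querying
```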

5. Once our task is complete, we can toggle the endpoint back to a stopped state using the same helper, effectively shutting off the serverless SQL endpoint! The full flow, sketched after this list, does the following:

a. Fire up a Databricks SQL (dbsql) serverless endpoint

b. Connect to the endpoint

c. Run a query on the active endpoint

d. Collect the results in a pandas dataframe

e. Stop the serverless endpoint
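
Putting a through e together, here’s a minimal sketch using the databricks-sql-connector package, reusing the helper and placeholders from the previous sketch (the HTTP path, table, and query are illustrative):

```python
# Sketch: start the endpoint, query it with the Databricks SQL connector,
# collect the results into pandas, then stop the endpoint.
import pandas as pd
from databricks import sql

toggle_endpoint("start")                                 # (a) fire up the dbsql endpoint

with sql.connect(                                        # (b) connect to the endpoint
    server_hostname=HOST.replace("https://", ""),
    http_path="/sql/1.0/warehouses/1234567890abcdef",    # placeholder HTTP path
    access_token=TOKEN,
) as conn:
    with conn.cursor() as cursor:
        cursor.execute(                                  # (c) run a query
            "SELECT Pclass, avg(Survived) AS survival_rate "
            "FROM main.demo.titanic GROUP BY Pclass"
        )
        rows = cursor.fetchall()
        pdf = pd.DataFrame(                              # (d) results as a pandas dataframe
            rows, columns=[c[0] for c in cursor.description]
        )

print(pdf)
toggle_endpoint("stop")                                  # (e) stop the serverless endpoint
```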

Voilà! Just like that, we’ve demonstrated our entire ingest-analyze-consume cycle!

With this, we’ve accomplished some significant things.

  1. We no longer need to choose a proprietary format. With Delta, you get an open-source table format built on Parquet. No sleight of hand: a fully transparent, open-source format! (Open Source First)
  2. You can always access this data. You own your data, and your access to it isn’t predicated on Databricks service uptimes. (Transparent access to data)
  3. Building on top of open-source formats brings significant benefits. You are no longer at the beck and call of the vendor’s roadmap! Their problem is not your problem. (Vendor-neutral execution of your priorities)
  4. You can do away with wasteful ingestion that duplicates data. External tables are well supported and easily tuned to perform optimally. (Control without loss of performance)

To wrap up, the entire demonstration runs from the command line in roughly 30 seconds start to finish, with about 80% of that time spent on the local machine.

Mapping back to where we started, it’s clear that in addition to providing significant cost savings, this approach delivers a much better deal, one that works to your benefit: a user-centric solution where you truly control your usage without getting boxed in, both right now and in the future. This delivers on the promise of the lakehouse, where you can exploit the data in your lake with the data warehousing capabilities that DBSQL provides AND bring in unstructured data as well to deliver AI-first apps. No duplication, no lock-in, no my-way-or-the-highway. Should you be interested in trying this out, definitely give Databricks SQL a spin here!

— — —

Connect with me on LinkedIn (P.S: I work at Databricks)

