Breaking Up With the Data Warehouse
Let me take you back to 2012. The Mayans had predicted the end of the world, Queen Elizabeth II was celebrating her Diamond Jubilee, President Barack Obama won re-election, and The Avengers had their first team-up on the big screen. Amid all that noise, though, you might have missed perhaps the biggest moment of them all: the release of Amazon Redshift.
With the initial release of Amazon Redshift, we entered a new world of the data warehouse: one where data could be both stored and computed against entirely in the cloud, with storage and compute each able to scale up and down separately. Gone were the days of the 1990s, when an engineer had to estimate future needs and ran into trouble the moment they exhausted their storage capacity or the compute power necessary to run internal processes. And while Redshift may have been the first, we soon saw others rise, including Google BigQuery, Azure Synapse, Databricks Delta Lake, and of course Snowflake.
Today a data engineering team has a wide selection of choices when it comes to cloud-native warehousing, but the separation of storage and compute has largely stopped at the vendor boundary. If a data team decided to move their legacy on-premises Hadoop estate to Snowflake, for example, they would be forced to adopt both Snowflake’s storage layer and its query engine. That would be fine if companies moved all of their data to a single vendor, but the truth is we live in a hybrid, multi-cloud world where analytical data is spread across multiple systems residing across clouds and on-prem. Because the storage and compute engines of these warehouses are so tightly coupled, it has been impossible to run Snowflake’s query engine against data residing inside Azure, or to use BigQuery BI Engine to run computation over data sitting inside Databricks. That is, until recently.
Just as storage and compute have been separated inside each individual vendor, we are beginning to see an emerging pattern of the two being disconnected entirely. New technologies allow analytical data to reside across systems while still being queried with best-of-breed solutions from different vendors. Open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi, which let the same data be queried by many different engines, are being rapidly adopted by newer vendors and existing warehouse solutions alike. The table format of choice can sit on top of new open-source object storage systems that live outside the cloud hyperscalers. And there has been an eruption of query engines, both vendor-controlled and open source, adopted by large enterprises looking for best-of-breed tooling without having to lift and shift their current storage architectures.
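To make this pattern concrete, here is a minimal sketch of one such mix-and-match: an open-source query engine (DuckDB) reading Parquet files held in S3-compatible object storage that sits outside the hyperscalers, such as MinIO. The endpoint, bucket, credentials, table layout, and column names below are all hypothetical, not taken from any real deployment.

```python
# A minimal sketch: pointing an open-source engine (DuckDB) at data held in
# S3-compatible object storage outside the hyperscalers (e.g., an on-prem MinIO).
# Endpoint, bucket, credentials, and columns are hypothetical.
import duckdb

con = duckdb.connect()

# The httpfs extension lets DuckDB read directly from S3-compatible storage.
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("""
    SET s3_endpoint = 'minio.internal:9000';   -- hypothetical on-prem endpoint
    SET s3_access_key_id = 'ACCESS_KEY';
    SET s3_secret_access_key = 'SECRET_KEY';
    SET s3_use_ssl = false;
    SET s3_url_style = 'path';                 -- MinIO typically uses path-style URLs
""")

# Compute runs locally; the data never has to move into a vendor's warehouse.
result = con.execute("""
    SELECT customer_region, SUM(order_total) AS revenue
    FROM read_parquet('s3://analytics/orders/*.parquet')
    GROUP BY customer_region
    ORDER BY revenue DESC
""").fetchdf()

print(result)
```

The same files could just as easily be handed to a different engine tomorrow, because nothing about the storage layer assumes which engine will do the reading.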
I believe we are still in the very early innings of this further disaggregation of storage and compute. Modern data teams want to pick the best querying capability for the scale and complexity of the job at hand, from the power of their own laptop all the way up to scale-out, massively parallel jobs. By decoupling where data resides (storage) from how it is acted on (compute), data engineers will be able to balance price against performance as they see fit. Through the dissociation of storage and compute providers, data professionals will have more choice than ever when designing and implementing their data warehouse or lake, with an emphasis on best-of-breed rather than a single provider.
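As a hedged illustration of that laptop-to-cluster range, the sketch below runs the same aggregation over the same hypothetical object-storage path with two different engines: DuckDB for a quick local pass, and PySpark when the job outgrows a single machine. Paths, column names, and application naming are assumptions, and the storage credentials and cluster configuration each engine would need are omitted.

```python
# A sketch of matching compute to job size while the storage location stays put.
# Paths and columns are hypothetical; both engines read the same Parquet files.
import duckdb

# Small or exploratory job: a single laptop process is often enough.
# (Assumes S3-compatible access is configured, as in the earlier DuckDB sketch.)
laptop_df = duckdb.sql("""
    SELECT event_date, COUNT(*) AS events
    FROM read_parquet('s3://analytics/events/*.parquet')
    GROUP BY event_date
""").df()

# Large job: hand the same storage location to a scale-out engine instead.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("scale-out-events").getOrCreate()
cluster_df = (
    spark.read.parquet("s3a://analytics/events/")
         .groupBy("event_date")
         .agg(F.count("*").alias("events"))
)
cluster_df.write.mode("overwrite").parquet("s3a://analytics/summaries/events_by_day/")
```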