WTF is a Data Lakehouse?

David Regalado
2 min read · Nov 10, 2021

I see a lot of people getting hooked on the Data Lakehouse paradigm. Well, well! But do you really understand what it is? According to Databricks, which coined the term, these are its key features:

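👉Transaction support. The keyword here is concurrency: can many pipelines read and write the same data at the same time, typically using SQL, without stepping on each other? Does your data lake support that?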

👉Schema enforcement and governance. Here you must understand what a transaction log is and how it works. You will thank me later. 😉

👉BI support. Why should I have data in my data lake and at the same time in the data warehouse? Isn’t that redundant? Doesn’t it increase costs? Keeping them in sync is also overhead.

👉Storage is decoupled from compute. Nothing new under the sun: it is what GCP, AWS, Azure, or any other cloud provider offers. Why does it matter? If you have ever run an on-premises cluster for processing power, you will appreciate that today you can rent those machines just long enough to process the data and tear them down when finished, so that you stop paying for them. Storage is billed separately. In other words, the process ends (goodbye, cluster) but the data persists.

👉Openness. If your team uses R/Python but has difficulty accessing your warehouse data directly, using more standard storage formats alleviates that pain. Say hello to CSV, Parquet, Avro, JSON, and many more.

👉Support for diverse data types, ranging from unstructured to structured data. Big Data is not just about the volume of data, right? Now that it is possible to analyze images, video, audio, semi-structured data, and text, we can finally talk about the variety of data.

👉Support for diverse workloads. It’s okay to do BI in your Data Warehouse. Would it be too greedy to do machine learning there as well?
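If the transaction log still feels abstract, here is a toy sketch of the idea in plain Python. It is not how Databricks or any real table format implements it; `ToyTable`, `SCHEMA`, and the `_log` folder name are all made up for illustration. The point is only to show how an ordered log of commit files can give you schema enforcement and a basic form of concurrency control at the same time.

```python
import json
import os
import tempfile

# Hypothetical schema for this sketch: column name -> expected Python type.
SCHEMA = {"user_id": int, "country": str}


class ToyTable:
    """A toy table whose state is defined by an ordered log of JSON commit files."""

    def __init__(self, path):
        self.log_dir = os.path.join(path, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)  # -1 means "empty table"

    def commit(self, rows, read_version):
        # Schema enforcement: reject the whole batch if any row does not match.
        for row in rows:
            if set(row) != set(SCHEMA):
                raise ValueError(f"unexpected columns: {sorted(row)}")
            for column, expected in SCHEMA.items():
                if not isinstance(row[column], expected):
                    raise ValueError(f"bad type for {column!r}: {row[column]!r}")
        # Optimistic concurrency: open mode "x" fails if another writer
        # has already created the file for this version number.
        commit_file = os.path.join(self.log_dir, f"{read_version + 1:08d}.json")
        try:
            with open(commit_file, "x") as handle:
                json.dump(rows, handle)
        except FileExistsError:
            raise RuntimeError("conflict: table changed since it was read") from None
        return read_version + 1

    def read(self):
        # Replay the log in order to rebuild the current table contents.
        rows = []
        for name in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, name)) as handle:
                rows.extend(json.load(handle))
        return rows


table = ToyTable(tempfile.mkdtemp())
seen = table.latest_version()  # two writers both read the empty table
table.commit([{"user_id": 1, "country": "PE"}], read_version=seen)
try:
    # The second writer still thinks the table is empty, so it loses.
    table.commit([{"user_id": 2, "country": "MX"}], read_version=seen)
except RuntimeError as error:
    print(error)
```

The losing writer would normally re-read the table and retry. A batch with a wrong column or type never reaches the log at all, which is the "enforcement" part.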

Too many concepts to grasp?

Let’s start again but from the basics.

Data Lakehouse = Data Lake + Data Warehouse

Data Lakehouse explained with memes. Is it easier to understand? Credits: This guy
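If memes are not your thing, the same equation can be sketched in a few lines of Python. Take it as an illustration under generous assumptions, not as a real lakehouse: a temporary folder plays the role of cheap storage, a CSV file is the open format, and an in-memory SQLite database stands in for the rented SQL engine. A real engine would query the files in place instead of loading them first, and the `orders` data and `total_by_country` helper are invented for the example.

```python
import csv
import os
import sqlite3
import tempfile

# "Lake" side: cheap storage holding a file in an open format (CSV here).
lake_dir = tempfile.mkdtemp()
orders_path = os.path.join(lake_dir, "orders.csv")
with open(orders_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["country", "amount"])
    writer.writerows([["PE", 10.0], ["MX", 7.5], ["PE", 2.5]])


def total_by_country(path):
    """Spin up a throwaway SQL engine, query the file, then tear the engine down."""
    engine = sqlite3.connect(":memory:")  # "warehouse" side: compute that speaks SQL
    try:
        engine.execute("CREATE TABLE orders (country TEXT, amount REAL)")
        with open(path, newline="") as handle:
            rows = [(r["country"], float(r["amount"])) for r in csv.DictReader(handle)]
        engine.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        query = "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
        return engine.execute(query).fetchall()
    finally:
        engine.close()  # goodbye, cluster; the file in lake_dir is untouched


totals = total_by_country(orders_path)
print(totals)
```

The engine lives only for the duration of the call, while `orders.csv` stays where it was, readable by anything that understands CSV.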

For more information, I recommend the following:

Thanks for reading! Do you want more?

Hit the like button 50 times and something wonderful will happen.

  • 👉Follow me for more nerdy talks!
  • 👉Follow Data Engineering LATAM for more content related to Data Engineering, Data Science and Data Management.

David Regalado

Founder @Data Engineering Latam community, the largest and coolest data community in Latin America ;) Passionate about all things data! beacons.ai/davidregalado