Data Lake vs. Data Warehouse

by Jonathan Ringvald in Articles

A data lake keeps raw data handy in a repository until it's ready for use. Data warehouses are static and store data in files and folders, while a data lake is for more "fluid" use cases. Data lakes can be a major help for organizations that want to explore and scale answers to machine learning problems.

Data lakes store data in its purest form and cater to multiple stakeholders. Unlike data warehouses, they can also be used to package data in a way that non-technical people can consume.


What's the difference between a Data Lake and a Data Warehouse?

Data warehouses are a straightforward solution: you generally collect data from a source, run it through ETL (extract, transform, load), and load it in. Then you access that data with BI and analytics tools.
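To make the warehouse flow concrete, here is a minimal sketch of an ETL step in Python. It is an illustration under stated assumptions, not a production pipeline: an in-memory SQLite database stands in for the warehouse, and the `orders` table, the sample CSV, and every function name are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source export; a real one would come from an operational system.
SOURCE_CSV = """order_id,customer,amount
1,Alice,19.99
2,bob,5.00
3,alice ,12.50
"""


def extract(text):
    """Extract: read the source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean and cast every field to the type the schema expects."""
    return [
        (int(r["order_id"]), r["customer"].strip().lower(), float(r["amount"]))
        for r in rows
    ]


def load(conn, records):
    """Load: write the cleaned records into a table with a fixed schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def revenue_by_customer(conn):
    """The kind of query a BI or analytics tool would run against the warehouse."""
    return conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    ).fetchall()


# SQLite in memory is only a stand-in for a real warehouse (assumption).
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
print(revenue_by_customer(conn))
```

The point of the sketch is the ordering: the data is cleaned and forced into a schema before it is stored, so anything the schema does not anticipate is lost at load time.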

Data lakes ingest data by batch or stream, and you can then choose whether or not to process or transform that data. A huge advantage of the data lake architecture is that you have access to the raw data, which allows all kinds of users to process and transform it for their specific use cases.
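The lake-style flow can be sketched the other way around: land the data untouched first, and apply structure only when someone reads it. The example below is a rough illustration that uses a local temporary directory as a stand-in for lake storage. The `raw/<source>/` layout, the file names, and the helper functions are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def ingest(lake_root, source, name, payload):
    """Land the payload unchanged under raw/<source>/; no schema is enforced."""
    target = Path(lake_root) / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_events(lake_root, source):
    """Schema-on-read: parse the JSON lines only when a consumer asks for them."""
    events = []
    folder = Path(lake_root) / "raw" / source
    for path in sorted(folder.glob("*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                events.append(json.loads(line))
    return events


def clicks_per_user(events):
    """One consumer's own transformation of the raw events."""
    counts = {}
    for event in events:
        if event.get("type") == "click":
            counts[event["user"]] = counts.get(event["user"], 0) + 1
    return counts


# A temporary directory stands in for lake storage (assumption).
lake = tempfile.mkdtemp()

# Two batches with slightly different fields; both are accepted as-is.
ingest(lake, "web", "2024-01-01.jsonl",
       b'{"user": "alice", "type": "click"}\n{"user": "bob", "type": "view"}\n')
ingest(lake, "web", "2024-01-02.jsonl",
       b'{"user": "alice", "type": "click", "page": "/pricing"}\n')

# Non-tabular files can sit in the same lake; the bytes here are a placeholder.
ingest(lake, "scans", "invoice.pdf", b"%PDF-1.4 placeholder")

print(clicks_per_user(read_events(lake, "web")))
```

Because the raw files are kept, a different user could later read the same events and compute something else entirely, such as page views per day, without re-ingesting anything.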


Convinced yet? Here are some other key advantages to developing a data lake:

  • Flexibility: Data can be structured or unstructured. You can throw all kinds of data in there, from PDFs to CSVs, PNGs and Excel sheets.
  • Speed: Queries against a consolidated data lake can be much faster.
  • Security: Cybersecurity layers help ensure the authenticity and quality of the data.
  • Vast: Store all the data, with a practically endless “memory” limit.

 

More differences between a Data Lake and a Data Warehouse

Here's a great video from IBM explaining Data Lakes in more detail: