The main differences between a Data Lake and a Data Warehouse lie in their structures and purposes. A Data Lake is like a vast, unstructured reservoir for storing raw data, while a Data Warehouse is more structured and tailored for processing and analyzing refined data. For a deeper dive into the value of Data Warehouses in business intelligence, check out this insightful article: https://www.linkedin.com/pulse/what-enterprise-data-warehouse-its-value-business-intelligence-nhogf/. It breaks down the crucial role Data Warehouses play in leveraging business insights.
Those are just terms used by marketing. 30 years ago, BI required a pre-aggregated database designed as cubes. Dimensions and hierarchies were pre-defined and we used to call those cubes data warehouses. An example of this old-school BI is SAP BW. This was required to provide speed. Along the time, the term data warehouse started to be used by vendors of transactional systems. So, you could hear terms like Tax data warehouse (for packages that deal with tax info) and Client data warehouse (for CRMs and packages that deal with customer info).
We don´t need cubes to provide speed anymore. Data Lake is a new term related to big data that no longer requires rigid structures as cubes to provide speed. We now divide data lakes into HOT, WARM and COLD depending on the information availability needs. In the old days, data warehouse meant structured data but nowadays, you may have non-structured data as well and use the Hadoop ecosystem to store them with low cost and security and that is all part of the data lake.
Most companies use only structured data and they call the single repository of data as data lakes after data is treated, cleaned and shaped into standards (like field and table naming standards). Some others have layers of data lakes (RAW data, Cleansed Data, etc.....). Having different layers is very useful for data governance since the volume and the speed of change affects the company´s policies.
Many of the comparisons of data lake and data warehouse that you see (such as the one below from Talend) are based on an out-of-date or dumbed-down idea of the data warehouse.
The more advanced data warehouse engines:
- support a wide range of data types and formats
- can access external data (e.g., in object storage) that has never been ingested
- support data scientists as well as business users (e.g., with an ability to run Python, R, SAS routines and data science libraries on data in place in parallel in the data warehouse)
- support operational query on live, rapidly changing data
While also providing capabilities and services never provided on data lakes or their cloud-based equivalents. Data warehouses, properly operated and housing data that is properly curated, are much more efficient, cost-effective and performant for data that is intensively shared and widely used.
Data lakes are good repositories for data that is more lightly or locally used and does not merit the level of curation usually desired in a data warehouse.
I assume the "Data Warehouse" term here is in the traditional/conventional meaning.
In such a case, the main differences are:
1) the traditional data warehouse only hosts structured or semi-structured data, while Data Lake hosts all kinds of data including unstructured data like video and audio.
2) Data Warehouse is a "House" meaning it is pre-built then to housing the fit-the-purpose data, while Data Lake is after data gets in, the build "house or camp or tent" to fit the "on-fly" demand, in such a way the Data Lake's "house" could be scaled unlimited (theoretically).
3) the Data Warehouse is ETL populated, while Data Lake is ELT fulfilled.
A DWH is, like already said below a collection of in presence used data, well structured, quality proven, in the best case "The single point of truth." Overall the biggest part will be structured data, nevertheless, it is also possible to store semi-structured data. Normally most parts in the data warehouse provide access to data in high performance. Approaches like DataVault allow high flexibility concerning changes in data structure but normally require an additional layer for performant analysis. When it comes to data where single data is only valid as part of a stream and you want to gain patterns out of it, you should think about a data lake. Same if you have a lot of pictures, videos, or information coming from social media. Data Lakes allow you to store "unstructured" data in a simple and performant way. Normally access is easy and Data Scientists will love it. Nevertheless, it is important to ensure the quality of the data, as you often will use external sources, where you do not have a supervisor. We from mip suggest a cooperation of both systems, which may be integrated or connected via interfaces. Interesting intellectual approaches in this space are Data Fabric or Data Mesh.
The relationship between them is depending on how one implements the Data Lake.
The Data Lake can include the Data Warehouse in it, or the Data Warehouse can be a supplement to the data lake to meet the traditional reporting and ad hoc analysis needs. An example is the Databricks' Lakehouse implementation.
In the answer below, I meant to add: there are some types of data where it does not add a lot of value to put the data in the data warehouse. An example would be a large audio or video file.
In this case, it makes sense to store the raw data objects in a data lake or similar platform and, if it would be valuable to integrate with other data, store the extracted features in the data warehouse.
What is a Cloud Data Warehouse? A cloud data warehouse (often abbreviated as DW or DWH) is a repository that consolidates data from various sources (including internet of things devices, relationship databases, and other data systems) and stores it in a public cloud. It is built for high-performance analytics, scale, and ease of use. Because a data warehouse is dedicated to storing and analyzing data in order to reveal patterns, trends, and correlations it enables businesses to launch...
The main differences between a Data Lake and a Data Warehouse lie in their structures and purposes. A Data Lake is like a vast, unstructured reservoir for storing raw data, while a Data Warehouse is more structured and tailored for processing and analyzing refined data. For a deeper dive into the value of Data Warehouses in business intelligence, check out this insightful article: https://www.linkedin.com/pulse/what-enterprise-data-warehouse-its-value-business-intelligence-nhogf/. It breaks down the crucial role Data Warehouses play in leveraging business insights.
Those are just terms used by marketing. 30 years ago, BI required a pre-aggregated database designed as cubes. Dimensions and hierarchies were pre-defined and we used to call those cubes data warehouses. An example of this old-school BI is SAP BW. This was required to provide speed. Along the time, the term data warehouse started to be used by vendors of transactional systems. So, you could hear terms like Tax data warehouse (for packages that deal with tax info) and Client data warehouse (for CRMs and packages that deal with customer info).
We don´t need cubes to provide speed anymore. Data Lake is a new term related to big data that no longer requires rigid structures as cubes to provide speed. We now divide data lakes into HOT, WARM and COLD depending on the information availability needs. In the old days, data warehouse meant structured data but nowadays, you may have non-structured data as well and use the Hadoop ecosystem to store them with low cost and security and that is all part of the data lake.
Most companies use only structured data and they call the single repository of data as data lakes after data is treated, cleaned and shaped into standards (like field and table naming standards). Some others have layers of data lakes (RAW data, Cleansed Data, etc.....). Having different layers is very useful for data governance since the volume and the speed of change affects the company´s policies.
All the best.
Many of the comparisons of data lake and data warehouse that you see (such as the one below from Talend) are based on an out-of-date or dumbed-down idea of the data warehouse.
The more advanced data warehouse engines:
- support a wide range of data types and formats
- can access external data (e.g., in object storage) that has never been ingested
- support data scientists as well as business users (e.g., with an ability to run Python, R, SAS routines and data science libraries on data in place in parallel in the data warehouse)
- support operational query on live, rapidly changing data
While also providing capabilities and services never provided on data lakes or their cloud-based equivalents. Data warehouses, properly operated and housing data that is properly curated, are much more efficient, cost-effective and performant for data that is intensively shared and widely used.
Data lakes are good repositories for data that is more lightly or locally used and does not merit the level of curation usually desired in a data warehouse.
I assume the "Data Warehouse" term here is in the traditional/conventional meaning.
In such a case, the main differences are:
1) the traditional data warehouse only hosts structured or semi-structured data, while Data Lake hosts all kinds of data including unstructured data like video and audio.
2) Data Warehouse is a "House" meaning it is pre-built then to housing the fit-the-purpose data, while Data Lake is after data gets in, the build "house or camp or tent" to fit the "on-fly" demand, in such a way the Data Lake's "house" could be scaled unlimited (theoretically).
3) the Data Warehouse is ETL populated, while Data Lake is ELT fulfilled.
A DWH is, like already said below a collection of in presence used data, well structured, quality proven, in the best case "The single point of truth." Overall the biggest part will be structured data, nevertheless, it is also possible to store semi-structured data. Normally most parts in the data warehouse provide access to data in high performance. Approaches like DataVault allow high flexibility concerning changes in data structure but normally require an additional layer for performant analysis. When it comes to data where single data is only valid as part of a stream and you want to gain patterns out of it, you should think about a data lake. Same if you have a lot of pictures, videos, or information coming from social media. Data Lakes allow you to store "unstructured" data in a simple and performant way. Normally access is easy and Data Scientists will love it. Nevertheless, it is important to ensure the quality of the data, as you often will use external sources, where you do not have a supervisor. We from mip suggest a cooperation of both systems, which may be integrated or connected via interfaces. Interesting intellectual approaches in this space are Data Fabric or Data Mesh.
The relationship between them is depending on how one implements the Data Lake.
The Data Lake can include the Data Warehouse in it, or the Data Warehouse can be a supplement to the data lake to meet the traditional reporting and ad hoc analysis needs. An example is the Databricks' Lakehouse implementation.
In the answer below, I meant to add: there are some types of data where it does not add a lot of value to put the data in the data warehouse. An example would be a large audio or video file.
In this case, it makes sense to store the raw data objects in a data lake or similar platform and, if it would be valuable to integrate with other data, store the extracted features in the data warehouse.
Hi @Evgeny Belenky - great question.
Here is the best answer crafted by Talend
Data Lake
Data Warehouse
Data Structure
Raw
Processed
Purpose of Data
Not Yet Determined
Currently In Use
Users
Data Scientists
Business Professionals
Accessibility
Highly accessible and quick to update
More complicated and costly to make change
Please read more here https://www.talend.com/resourc...