Hi community,
I am a solution architect for a global tech company with over ten years of experience.
What are the benefits of having separate layers or a dedicated schema for each layer in ETL?
One dedicated layer for staging, one for type 1 persistent tables, and a dedicated schema/layer for tables at a dimensional model?
Why it is not recommended to have all tables in one schema?
Thanks!
I appreciate your help.
I have over 15 years of ETL experience on "real world" projects & am now teaching grad courses in the Business Intelligence life cycle.
It does not make any sense to separate the ETL process into "layers". I do recommend that "all" tables do show in a star schema diagram so that it can be understood which must be handled early in TL (transformation & Load) steps so that rows are in place when needed to be joined to foreign keys in records that are handled afterward. Dimension table rows do persist longer than those identified as fact tables.
Here are some of the advantages of managing data in different layers:
1. Provides logical separation of data between different layers
2. For any maintenance of each layer such as backup or recovery or data model change apply, etc. are layer dependent
3. From data security perspective, only authorized resource can work in their respective layer
4. Space allocation for each layer can be done independently
On a practical point, it gives freedom to work each layer independently and putting them together will be a project nightmare.
Traditional ETL would usually use a dedicated database (or even database server) where you'll load & transform your raw data before ingesting it into the final destination. This would allow checking data before its final destination.
Data transformation pipeline in DW with the arrival of Cloud Data Warehouse like snowflake has changed the landscape. The DW has also become a data lake where all raw data is stored. Using a transformation tool like DBT, you could build your fact and dimension tables, therefore are able to grab data from RAW and send them to its final destination.
For your raw data, it does make sense to separate the sources into different schemas.
You can also separate your final destination into different schemas too.
One for Finance, one for Product, one for Marketing. This is all you can grant at the schema level for each role. Much easier to manage for permissions.
Check out getdbt.com
The main reason is security and governance.
Most of the time, you are required to perform different actions on the data. Cleansing and adapting it to naming standards is pretty common and this could happen in different steps.
Having different schemas helps to prevent unwanted mistakes.
From a business perspective, it is recommended to extract data from the source system only once. In large organisations, there may be several fields in various tables which will not be required for reporting immediately. So while all data is extracted from the source system to the Persistent Storage Area in the target system, one could store this data in a staging layer in the target system for current / future use. The staging layer would typically contain all data extracted or could be filtered and transformed as required. The Persistent storage area is typically cleared within 15 days.The Staging layer in the target system also serves as a backup of all data in case your source system is down.
The advantage in building another layer above the staging layer will be to transform data further and load it in a form which will make sense for business. An example would be where one creates a value field as quantity × rate or any other formula such as computation of a discount from other fields. Only fields required for reporting could be stored in this permanent layer.
Further layers for data targets as required could be created depending on Performance and reporting considerations and the Reporting tool one uses.
The above multi layer architecture described was for older installations.
Newer setups have features which allow one to construct a view based on a single/multiple tables and report directly.
Thank you Gouri !!!