RajneeshShukla - PeerSpot reviewer
Solution Architect at a tech vendor with 10,001+ employees

What are the benefits of having separate layers or a dedicated schema for each layer in ETL?

Hi community,

I am a solution architect for a global tech company with over ten years of experience.

What are the benefits of having separate layers or a dedicated schema for each layer in ETL? 

One dedicated layer for staging, one for type 1 persistent tables, and a dedicated schema/layer for tables at a dimensional model? 

Why is it not recommended to have all tables in one schema?


I appreciate your help. 

6 Answers
Certified Adjunct Faculty, School of Engineering and Computing at a university with 1,001-5,000 employees
Real User
Top 5
Sep 12, 2021

I have over 15 years of ETL experience on real-world projects and am now teaching graduate courses in the business intelligence life cycle.

It does not make sense to separate the ETL process itself into "layers". I do recommend that all tables appear in a star schema diagram, so it is clear which tables must be handled early in the T and L (transform and load) steps; their rows need to be in place so that records handled afterward can join to them via foreign keys. Dimension table rows persist longer than fact table rows.
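The load-ordering point above can be sketched in a few lines. This is an illustrative example, not from the answer: the table and column names (`dim_product`, `fact_sales`) are invented, and SQLite stands in for a real warehouse. With foreign-key checks on, a fact row only loads if its dimension row landed first.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # enforce FK checks in SQLite
con.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    qty INTEGER)""")

# Load the dimension row first...
con.execute("INSERT INTO dim_product VALUES (1, 'widget')")
# ...so the fact row's foreign key resolves.
con.execute("INSERT INTO fact_sales VALUES (100, 1, 5)")

# A fact row that references a dimension row not yet loaded is rejected:
fk_rejected = False
try:
    con.execute("INSERT INTO fact_sales VALUES (101, 99, 2)")
except sqlite3.IntegrityError:
    fk_rejected = True
```

Reversing the order (facts before dimensions) would make every fact insert fail the same way, which is exactly why dimension loads come early in the T and L steps.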

Data Sleek - PeerSpot reviewer
Owner at a consultancy with 1-10 employees
Real User
Top 5
Oct 1, 2021

Traditional ETL usually uses a dedicated database (or even a dedicated database server) where you load and transform your raw data before ingesting it into the final destination. This lets you validate the data before it reaches that destination.

The arrival of cloud data warehouses like Snowflake has changed the data transformation landscape. The DW has also become a data lake where all raw data is stored. Using a transformation tool like dbt, you can build your fact and dimension tables, grabbing data from the RAW schema and sending it to its final destination.

For your raw data, it makes sense to separate the sources into different schemas.
You can also separate your final destination into different schemas:
one for Finance, one for Product, one for Marketing. You can then grant access at the schema level for each role, which is much easier to manage than per-table permissions.
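A minimal sketch of that permission model, with invented schema and role names (the `GRANT USAGE ON SCHEMA` form follows Snowflake-style syntax, but the idea is warehouse-agnostic): one grant per schema per role replaces dozens of per-table grants.

```python
# Hypothetical role-to-schema mapping; none of these names come from a real system.
ROLE_SCHEMAS = {
    "finance_analyst": ["FINANCE"],
    "marketing_analyst": ["MARKETING"],
    "data_engineer": ["RAW", "FINANCE", "PRODUCT", "MARKETING"],
}

def grants_for(role: str) -> list:
    """One schema-level grant per role, instead of a grant per table."""
    return [
        f"GRANT USAGE ON SCHEMA {schema} TO ROLE {role};"
        for schema in ROLE_SCHEMAS[role]
    ]

statements = grants_for("finance_analyst")
```

If everything lived in one schema, you would instead be granting (and auditing) access table by table for every role.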

Check out getdbt.com

Djalma Gomes, Pmp, Mba - PeerSpot reviewer
Managing Partner at Data Pine
Top 5
Sep 14, 2021

The main reason is security and governance. 

Most of the time, you are required to perform different actions on the data. Cleansing it and adapting it to naming standards is pretty common, and this can happen in different steps.

Having different schemas helps to prevent unwanted mistakes. 

Asst. Senior IT Manager at a retailer with 5,001-10,000 employees
Real User
Sep 13, 2021

From a business perspective, it is recommended to extract data from the source system only once. In large organisations, there may be several fields in various tables that are not required for reporting immediately. So while all data is extracted from the source system to the persistent storage area in the target system, one could also store this data in a staging layer in the target system for current or future use. The staging layer would typically contain all extracted data, or it could be filtered and transformed as required. The persistent storage area is typically cleared within 15 days. The staging layer in the target system also serves as a backup of all data in case your source system is down.

The advantage of building another layer above the staging layer is that data can be transformed further and loaded in a form that makes sense for the business. An example would be creating a value field as quantity × rate, or any other formula, such as computing a discount from other fields. Only the fields required for reporting would be stored in this permanent layer.
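That staging-to-business step can be sketched as a simple row transformation. The field names (`quantity`, `rate`, `discount_pct`, `source_batch`) are hypothetical: the staging row keeps everything extracted, while the business row keeps only the reporting fields plus the derived value.

```python
def to_business_row(staging_row: dict) -> dict:
    """Derive the business-facing value = quantity * rate, less any discount;
    drop staging-only fields like source_batch."""
    qty = staging_row["quantity"]
    rate = staging_row["rate"]
    discount = staging_row.get("discount_pct", 0) / 100
    return {
        "order_id": staging_row["order_id"],
        "value": qty * rate * (1 - discount),
    }

staging = [
    {"order_id": 1, "quantity": 10, "rate": 2.5, "source_batch": "b1"},
    {"order_id": 2, "quantity": 4, "rate": 5.0, "discount_pct": 10, "source_batch": "b1"},
]
business = [to_business_row(r) for r in staging]
```

The point of the extra layer is that reports read `value` directly instead of every report re-implementing the formula against raw staging fields.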

Further layers for data targets could be created as required, depending on performance and reporting considerations and on the reporting tool one uses.

The multi-layer architecture described above applies to older installations.

Newer setups have features that allow one to construct a view based on a single table or multiple tables and report directly.
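A minimal sketch of that view-based approach, using SQLite as a stand-in for a modern warehouse (the table and view names are invented): the report queries a `CREATE VIEW` over the base tables, with no additional physical layer materialised.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, product_id INTEGER, quantity INTEGER, rate REAL);
CREATE TABLE products (product_id INTEGER, name TEXT);
INSERT INTO orders VALUES (1, 10, 3, 2.0), (2, 10, 1, 2.0);
INSERT INTO products VALUES (10, 'widget');

-- Reporting view over multiple tables; no extra storage layer is built.
CREATE VIEW v_sales AS
SELECT p.name, SUM(o.quantity * o.rate) AS total_value
FROM orders o JOIN products p USING (product_id)
GROUP BY p.name;
""")

report = con.execute("SELECT * FROM v_sales").fetchall()
```

The trade-off versus a physical reporting layer is that the join and aggregation run at query time, which is why the answer above notes performance as a consideration.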

Gouri Mishra - PeerSpot reviewer
Teradata FSLDM Consultant at TIK IT Solutions
Real User
May 13, 2020

Here are some of the advantages of managing data in different layers:
1. It provides a logical separation of data between the different layers
2. Maintenance of each layer, such as backup, recovery, or applying data model changes, can be performed per layer
3. From a data security perspective, only authorized resources can work in their respective layer
4. Space allocation for each layer can be done independently

On a practical note, it gives you the freedom to work on each layer independently; lumping everything together in one schema would be a project nightmare.

RajneeshShukla - PeerSpot reviewer
Solution Architect at a tech vendor with 10,001+ employees
Real User
May 20, 2020

Thank you Gouri !!!
