ETL Design Process & Best Practices

Introduction

ETL stands for Extract Transform and Load. Typical an ETL tool is used to extract huge volumes of data from various sources and transform the data depending on business needs and load into a different destination. In the modern business world the data has been stored in multiple locations and in many incompatible formats. The business data might be stored in different formats such as Excel, plain text, comma separated, XML and in individual databases of various business systems used etc. Handling all this business information efficiently is a great challenge and the ETL tool plays an important role in solving this problem.

Extract, Transform and Load

There are three steps involved in an ETL process

Extract– The first step in the ETL process is extracting the data from various sources. The source is usually flat file, XML, any RDBMS etc…

Transform – Once the data has been extracted the next step is to transform the data into a desired structure. The data transformation step may include filtering unwanted data, sorting, aggregating, joining data, data cleaning, data validation based on the business need.

Load– The last step involves the transformed data being loaded into a destination target, which might be a database or a data warehouse.

There are many challenges involved in designing an ETL solution. Following some best practices would ensure a successful design and implementation of the ETL solution.

Analyzing Source Data

This is the first step of the ETL development. It is always wiser to spend more time on understanding the different sources and types during the requirement gathering and analyzing phase. Understand what kind of data and volume of data we are going to process.

Mapping of each column source and destination must be decided.
Data types of source and destination needs to be considered.
Identify complex task in your project and find the solution
Use Staging table for analysis then you can move in the actual table

Fixing Data Issues

Users are frequently facing data issues in the source files. It will be a pain to identify the exact issue. Hence it is important that there should be a strategy to identify the error and fix them for the next run.

Add data validation task and if there’s any issue you can move them in a separate table/file.
Communicate to source Partner experts to fix such issues if it is repeated.
Add autocorrect task (lookup) if any known issues such as spell mistake, invalid date, email id etc.

Validation

As part of the ETL solution, validation and testing are very important to ensure the ETL solution is working as per the requirement. You can create multiple test cases and apply them to validate. Execute the same test cases periodically with new sources and update them if anything is missed.

Validate all business logic before loading it into actual table/file.
Create negative scenario test cases to validate the ETL process
Test with huge volume data in order to rule out any performance issues.
Keep your test cases update to date.
Ensure the configured emails are received by the respective end users.

Optimizing the ETL Solution

After you have completed the basic functionality of your ETL solution you should optimize it for memory consumption and performance of the ETL solution as a whole. Basic database performance techniques can be applied. Make the runtime of each ETL step as short as possible. Perform the Performance testing in different environments and for different sizes of data.

Ensure that the Hardware is capable to handle the ETL.
Drop indexes while loading and re-create them after load
Disable all triggers in the destination table and handle them in another step.
Use parallel process wherever possible.
Capture each task running time and compare them periodically.
Disable check and foreign key constraint to load faster.

Error Handling, Logging and Alerting

Identify a best error handling mechanism for your ETL solution and a Logging system. The error handling mechanism should capture the ETL project name, task name, error number, error description. Logging should be saved in a table or file about each step of execution time, success/failure and error description. This information will be helpful to analyze the issue and fix them quickly.

Log all errors in a file/table for your reference
Ignore errors that do not have an impact on the business logic but do store/log those errors. If the error has business logic impacts, stop the ETL process and fix the issue.
Have an alerting mechanism in place. Send Error message as an Email to the end user and support team.

Point of Failure Recovery

There is always a possibility of unexpected failure that could eventually happen. A typical ETL solution will have many data sources that sometime might run into few dozens or hundreds and there should always be a way to identify the state of the ETL process at the time when a failure occurs. Enable point of failure recovery during the large amount of data load. It helps to start the process again from where it got failed.