Extract Transform Load
Extract, transform, and load (ETL) is a process in data warehousing that involves
- extracting data from outside sources,
- transforming it to fit business needs, and ultimately
- Loading it into the data warehouse.
ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database.
The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as IMS or other data structures such as VSAM or ISAM. Extraction converts the data into a format for transformation processing.
The transform phase applies a series of rules or functions to the extracted data to derive the data to be loaded. Some data sources will require very little manipulation of data. However, in other cases any combination of the following transformations types may be required:
- Selecting only certain columns to load (or if you prefer, null columns not to load)
- Translating coded values (e.g. If the source system stores M for male and F for female but the warehouse stores 1 for male and 2 for female)
- Encoding free-form values (e.g. Mapping “Male” and “M” and “Mr” onto 1)
- Deriving a new calculated value (e.g. sale amount = qty * unit price)
- Joining together data from multiple sources (e.g. look up, merge, etc)
- Summarizing multiple rows of data (e.g. total sales for each region)
- Generating surrogate key values
- Transposing or Pivoting (turning multiple columns into multiple rows or vice versa)
The load phase loads the data into the data warehouse. Depending on the requirements of the organization, this process ranges widely. Some data warehouses merely overwrite old information with new data. More complex systems can maintain a history and audit trail of all changes to the data.