Extract Transform Load

Extract Transform Load

Extract, transform, and load (ETL) is a process in data warehousing that involves

  • extracting data from outside sources,
  • transforming it to fit business needs, and ultimately
  • Loading it into the data warehouse.

ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database.

Extract

The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as IMS or other data structures such as VSAM or ISAM. Extraction converts the data into a format for transformation processing.

Transform

The transform phase applies a series of rules or functions to the extracted data to derive the data to be loaded. Some data sources will require very little manipulation of data. However, in other cases any combination of the following transformations types may be required:

  • Selecting only certain columns to load (or if you prefer, null columns not to load)
  • Translating coded values (e.g. If the source system stores M for male and F for female but the warehouse stores 1 for male and 2 for female)
  • Encoding free-form values (e.g. Mapping “Male” and “M” and “Mr” onto 1)
  • Deriving a new calculated value (e.g. sale amount = qty * unit price)
  • Joining together data from multiple sources (e.g. look up, merge, etc)
  • Summarizing multiple rows of data (e.g. total sales for each region)
  • Generating surrogate key values
  • Transposing or Pivoting (turning multiple columns into multiple rows or vice versa)

Load

The load phase loads the data into the data warehouse. Depending on the requirements of the organization, this process ranges widely. Some data warehouses merely overwrite old information with new data. More complex systems can maintain a history and audit trail of all changes to the data.

Extract Transform Load

Extract Transform Load

Extract, transform, and load (ETL) is a process in data warehousing that involves

  • extracting data from outside sources,
  • transforming it to fit business needs, and ultimately
  • Loading it into the data warehouse.

ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database.

Extract

The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as IMS or other data structures such as VSAM or ISAM. Extraction converts the data into a format for transformation processing.

Transform

The transform phase applies a series of rules or functions to the extracted data to derive the data to be loaded. Some data sources will require very little manipulation of data. However, in other cases any combination of the following transformations types may be required:

  • Selecting only certain columns to load (or if you prefer, null columns not to load)
  • Translating coded values (e.g. If the source system stores M for male and F for female but the warehouse stores 1 for male and 2 for female)
  • Encoding free-form values (e.g. Mapping “Male” and “M” and “Mr” onto 1)
  • Deriving a new calculated value (e.g. sale amount = qty * unit price)
  • Joining together data from multiple sources (e.g. look up, merge, etc)
  • Summarizing multiple rows of data (e.g. total sales for each region)
  • Generating surrogate key values
  • Transposing or Pivoting (turning multiple columns into multiple rows or vice versa)

Load

The load phase loads the data into the data warehouse. Depending on the requirements of the organization, this process ranges widely. Some data warehouses merely overwrite old information with new data. More complex systems can maintain a history and audit trail of all changes to the data.

If You need more visit our site – www.iguddy.com

 

Data Warehouse Concepts

If You need more visit our site – www.iguddy.com

Data Warehouse Concepts

OLTP:

  • Online Transaction Processing.
  • Its fully transactional based
  • Transaction should be made as soon as possible
  • It’s a client server model
  • It handles with current data.
  • The SQL statements retrieves only small amount of data.
  • Data’s are in the form on “Normalized data”
  • Mostly used SQL statements are INSERT,UPDATE,DELETE
  • Heterogeneous data
  • Examples: Online Reservation System, Withdraw money from ATM and etc.,

OLAP :

  • Online Analytical Processing
  • Otherwise called as “Decision Support System”
  • Deformalized data
  • Mainly focuses on database
  • Its fully analysis and report oriented
  • Main users are Management and business people
  • It handles Historical Data
  • Only Bulk Load
  • Homogeneous data

Difference between OLTP and OLAP

                          OLTP                        OLAP 
Online Transaction Processing Online Analytical Processing
OLTPs are the original source of the data. OLAP data comes from the various OLTP Databases
It is transaction oriented. It is Analytical and report oriented.
Normalized data. De normalized data.
Handles current data Handles  historical data
Operations like insert, update, delete can be performed. Since it is mainly used for bulk loading of data we mainly perform the insert operation.Update and delete operations are performed in some cases where it is necessary.
Select query is used to retrieve only minimal amount of data. Select query is used to retrieve bulk volume of data.
Used by large number of end users. Used by minimal number of users.
Consist of heterogeneous data. Consist of homogenous data.

Tasks Performed When data coming from Live to data warehouse

  • DE normalizing the data.
  • Data cleansing.
  • Data trimming
  • Removing duplicate records.
  • Decoding the data.
  • Calculating the data.
  • Data validation if required
  • Null Value replacement.
  • Data type conversion.

Life Cycle of Data Warehousing Project:

  • Project Planning
    • Budget Allocation.
    • Internal resource allocation
    • BRS (Business Requirement Study) which involves
    • Dimensional modeling
      • This involves conversion of OLTP tables to OLAP tables.
      • This is performed by Dimension modeler.
      • Physical Design
      • Application Development
      • Deployment and training
      • Maintenance

Teams involved in Data warehousing:

  • Data Dimension modeler.
  • DBA
  • Report Developer
  • Report Admin
  • ETL developer
  • ETL admin
  • System Admin.

Dataware House

  • Managing Historical data and helps the management to take right decisions at right time
  • A DW is Subject oriented, nonvolatile and integrated collection of data to support management

Data mart

  • Data mart is a subset of data warehouse

DWH Design Approach

  • Top Down Approach

  • Bottom Up  Approach

Dimension Table

  • De normalized data
  • Must have one primary key
  • Contains less number of records than fact table
  • It contains large number of columns

Fact Table

  • Contains foreign key to all the dimension table
  • Contains large number of data than the dimension table

Dimension model type

  • Star Schema
  • Snow flake schema

Star Schema

  • Fact table is surrounded by many dimension table

 

If You need more visit our site – www.iguddy.com