Friday, July 30, 2010

An introduction to ETL: Extract, transform and load for data warehouses

Building a data warehouse in a large and complex organisation is a complex task. The data being sourced usually resides in a number of transactional systems. These systems are built to manage transactions, to capture sales or purchases, add and maintain client information or to perform complex calculations on the data stored.

A data warehouse serves quite a different purpose. A data warehouse is built to supply management information. It provides a helicopter view of what is going on in the business. It provides a picture of a huge volume of information that is derived from the business. Often the data warehouse is used to populate multidimensional cubes known as OLAP - OnLine Analytical Processing - to allow for slicing, dicing and data mining.

The tables that comprise a data warehouse are built very differently from those used in the transactional systems. Transactional databases are built to be able to manage the access and manipulation of a large variety data for a single transaction. The design is usually relational using third of fourth 'normal form'. Typically there are many tables that can be linked using a series of keys. The problem of using this data to provide meaningful information is compounded by the fact that many organisations use many, sometimes hundreds of systems to manage their data.

The design of a data warehouse is quite different. One of the most popular methods is the 'star schema' as proposed by Ralph Kimball. The star schema consists of a fact table that contains all of the measurable information and a number of dimension tables that determine the levels at which the measures may be accessed.

A fact table may provide details of sales. The facts may include number of sales, value of sales and cost of sales. The dimensions could include product, organisational structure and region. Typically, you would use the dimensions to access the facts. You may use one or more dimension. You could use the regional dimension to obtain city, region and national sales results. You could add the product dimension to ascertain in which regions each product performs best.

A data warehouse allows for the mass extraction of information quickly and easily using aggregate or summary information.

One of the biggest tasks in building a data warehouse is the ETL - Extract, Transform and Load step. The design of a data warehouse (or a data mart) will determine what information is required.

Ralph Kimball identifies 38 processes that are required for the ETL





SOURCE:http://www.helium.com/items/1675374-an-introduction-to-etl-extract-transform-and-load-for-data-warehouses

No comments:

Post a Comment