Wednesday, September 2, 2015

Why A Star Schema?

Description: 
In interviewing some candidates for some projects, I got really frustrated when nobody could tell me “why a star schema is good for reporting”. It got even more frustrating when I couldn’t find a good / concise article on the internet. So, I wrote something up.

To understand the star schema, first let's talk about its opposite, 3rd Normal Form.

What is 3rd Normal Form? 
A table is in 3rd Normal Form if, all data represented relates to the key of that table. In the following example where the key is [Tournament|Year], the information [Winner] is unique. However, the [Winner Data of Birth] is independent of the [Tournament|Year], so this table is not considered to be in 3rd Normal Form.

Tournament
Year
Winner
Winner Date of Birth
Indiana Invitational
1998
Al Fredrickson
21 July 1975
Cleveland Open
1999
Bob Albertson
28 September 1968
Des Moines Masters
1999
Al Fredrickson
21 July 1975
Indiana Invitational
1999
Chip Masterson
14 March 1977

A 3rd Normal form representation would look like this.  These structures are ideal for OLTP (On-Line Transaction Processing), i.e. maintain the data through an application.  This is because each table is atomic and relatively small both in terms of rows and columns.   In the case of any large tables, they are typically inserted into and do not require scan operations.

Tournament
Year
Winner
Indiana Invitational
1998
Al Fredrickson
Cleveland Open
1999
Bob Albertson
Des Moines Masters
1999
Al Fredrickson
Indiana Invitational
1999
Chip Masterson


Winner
Date of Birth
Chip Masterson
14 March 1977
Al Fredrickson
21 July 1975
Bob Albertson
28 September 1968

A typical data warehouse fact table, may look like the following.   This is similar to the starting table but is even further de-normalized.  The winner birth date has been parsed out to separate the Month and Year.  The purpose of de-normalization is to make available on the fact table all questions that an analytical user may ask.  For example, if I wanted to query and get the count of all tournament winners who were born in the month of September, the DB operation would be to scan and sum this one table on the Winner Birth Month looking for all records with a value of “September”.   That’s less costly than parsing the Winner Date of Birth field and less costly than joining to a winner table which has that information. Decisions about which fields to de-normalize on the fact table should be driven by the business questions which are going to be asked.

Tournament
Tournament Year
Winner
Winner Birth Month
Winner Birth Year
Winner Date of Birth
Indiana Invitational
1998
Al Fredrickson
July
1975
21 July 1975
Cleveland Open
1999
Bob Albertson
September
1968
28 September 1968
Des Moines Masters
1999
Al Fredrickson
July
1975
21 July 1975
Indiana Invitational
1999
Chip Masterson
March
1977
14 March 1977

A few common optimization for a data warehousing world include:

1) Replacing the value in the fact table with a numeric surrogate key to a lookup table. The necessary join does not lead to a performance hit because the database will quickly find the surrogate keys it needs from the small dimension table and keep those in memory while scanning the fact table.  Furthermore, because the fact table contains numerical keys, each row takes less space making it faster to scan. 

2) Ensuring there’s a proper index on the filter keys.   If the surrogate key above is indexed, it is even easier to find resulting in less scanning of the fact table. As a result, you may gave the following data model which looks like a star:

No comments:

Post a Comment