Rajiv's Code: Why A Star Schema?

Description:
In interviewing some candidates for some projects, I got really frustrated when nobody could tell me “why a star schema is good for reporting”. It got even more frustrating when I couldn’t find a good / concise article on the internet. So, I wrote something up.

To understand the star schema, first let's talk about its opposite, 3rd Normal Form.

What is 3rd Normal Form?
A table is in 3rd Normal Form if, all data represented relates to the key of that table. In the following example where the key is [Tournament|Year], the information [Winner] is unique. However, the [Winner Data of Birth] is independent of the [Tournament|Year], so this table is not considered to be in 3rd Normal Form.

Tournament	Year	Winner	Winner Date of Birth
Indiana Invitational	1998	Al Fredrickson	21 July 1975
Cleveland Open	1999	Bob Albertson	28 September 1968
Des Moines Masters	1999	Al Fredrickson	21 July 1975
Indiana Invitational	1999	Chip Masterson	14 March 1977

A 3^rd Normal form representation would look like this. These structures are ideal for OLTP (On-Line Transaction Processing), i.e. maintain the data through an application. This is because each table is atomic and relatively small both in terms of rows and columns. In the case of any large tables, they are typically inserted into and do not require scan operations.

Tournament	Year	Winner
Indiana Invitational	1998	Al Fredrickson
Cleveland Open	1999	Bob Albertson
Des Moines Masters	1999	Al Fredrickson
Indiana Invitational	1999	Chip Masterson

Winner	Date of Birth
Chip Masterson	14 March 1977
Al Fredrickson	21 July 1975
Bob Albertson	28 September 1968

A typical data warehouse fact table, may look like the following. This is similar to the starting table but is even further de-normalized. The winner birth date has been parsed out to separate the Month and Year. The purpose of de-normalization is to make available on the fact table all questions that an analytical user may ask. For example, if I wanted to query and get the count of all tournament winners who were born in the month of September, the DB operation would be to scan and sum this one table on the Winner Birth Month looking for all records with a value of “September”. That’s less costly than parsing the Winner Date of Birth field and less costly than joining to a winner table which has that information. Decisions about which fields to de-normalize on the fact table should be driven by the business questions which are going to be asked.

Tournament	Tournament Year	Winner	Winner Birth Month	Winner Birth Year	Winner Date of Birth
Indiana Invitational	1998	Al Fredrickson	July	1975	21 July 1975
Cleveland Open	1999	Bob Albertson	September	1968	28 September 1968
Des Moines Masters	1999	Al Fredrickson	July	1975	21 July 1975
Indiana Invitational	1999	Chip Masterson	March	1977	14 March 1977

A few common optimization for a data warehousing world include:

1) Replacing the value in the fact table with a numeric surrogate key to a lookup table. The necessary join does not lead to a performance hit because the database will quickly find the surrogate keys it needs from the small dimension table and keep those in memory while scanning the fact table. Furthermore, because the fact table contains numerical keys, each row takes less space making it faster to scan.

2) Ensuring there’s a proper index on the filter keys. If the surrogate key above is indexed, it is even easier to find resulting in less scanning of the fact table. As a result, you may gave the following data model which looks like a star:

Rajiv's Code

Wednesday, September 2, 2015

Why A Star Schema?

No comments:

Post a Comment