Description:
In interviewing some candidates for some projects, I got really frustrated when nobody could tell me “why a star schema is good for reporting”. It got even more frustrating when I couldn’t find a good / concise article on the internet. So, I wrote something up.
To understand the star schema, first let's talk about its opposite, 3rd Normal Form.
What is 3rd Normal Form?
A table is in 3rd Normal Form if, all data represented relates to the key of that table. In the following example where the key is [Tournament|Year], the information [Winner] is unique. However, the [Winner Data of Birth] is independent of the [Tournament|Year], so this table is not considered to be in 3rd Normal Form.
In interviewing some candidates for some projects, I got really frustrated when nobody could tell me “why a star schema is good for reporting”. It got even more frustrating when I couldn’t find a good / concise article on the internet. So, I wrote something up.
To understand the star schema, first let's talk about its opposite, 3rd Normal Form.
What is 3rd Normal Form?
A table is in 3rd Normal Form if, all data represented relates to the key of that table. In the following example where the key is [Tournament|Year], the information [Winner] is unique. However, the [Winner Data of Birth] is independent of the [Tournament|Year], so this table is not considered to be in 3rd Normal Form.
Tournament
|
Year
|
Winner
|
Winner Date of Birth
|
Indiana Invitational
|
1998
|
Al Fredrickson
|
21 July 1975
|
Cleveland Open
|
1999
|
Bob Albertson
|
28 September 1968
|
Des Moines Masters
|
1999
|
Al Fredrickson
|
21 July 1975
|
Indiana Invitational
|
1999
|
Chip Masterson
|
14 March 1977
|
A 3rd
Normal form representation would look like this. These structures are ideal for OLTP (On-Line
Transaction Processing), i.e. maintain the data through an application. This is because each table is atomic and
relatively small both in terms of rows and columns. In the case of any large tables, they are
typically inserted into and do not require scan operations.
Tournament
|
Year
|
Winner
|
Indiana Invitational
|
1998
|
Al Fredrickson
|
Cleveland Open
|
1999
|
Bob Albertson
|
Des Moines Masters
|
1999
|
Al Fredrickson
|
Indiana Invitational
|
1999
|
Chip Masterson
|
Winner
|
Date of Birth
|
Chip Masterson
|
14 March 1977
|
Al Fredrickson
|
21 July 1975
|
Bob Albertson
|
28 September 1968
|
A typical data warehouse fact table, may look like the
following. This is similar to the starting
table but is even further de-normalized.
The winner birth date has been parsed out to separate the Month and
Year. The purpose of de-normalization is
to make available on the fact table all questions that an analytical user may
ask. For example, if I wanted to query
and get the count of all tournament winners who were born in the month of
September, the DB operation would be to scan and sum this one table on the
Winner Birth Month looking for all records with a value of “September”. That’s less costly than parsing the Winner
Date of Birth field and less costly than joining to a winner table which has
that information. Decisions about which fields to de-normalize on the fact
table should be driven by the business questions which are going to be asked.
Tournament
|
Tournament Year
|
Winner
|
Winner Birth Month
|
Winner Birth Year
|
Winner Date of Birth
|
Indiana Invitational
|
1998
|
Al Fredrickson
|
July
|
1975
|
21 July 1975
|
Cleveland Open
|
1999
|
Bob Albertson
|
September
|
1968
|
28 September 1968
|
Des Moines Masters
|
1999
|
Al Fredrickson
|
July
|
1975
|
21 July 1975
|
Indiana Invitational
|
1999
|
Chip Masterson
|
March
|
1977
|
14 March 1977
|
A few common optimization for a data warehousing world
include:
1) Replacing the value in the fact table with a
numeric surrogate key to a lookup table. The necessary join does not lead to a
performance hit because the database will quickly find the surrogate keys it
needs from the small dimension table and keep those in memory while scanning
the fact table. Furthermore, because the
fact table contains numerical keys, each row takes less space making it faster
to scan.
2) Ensuring there’s a proper index on the filter
keys. If the surrogate key above is indexed, it is
even easier to find resulting in less scanning of the fact table.
As a result, you may gave the following data model which
looks like a star:
No comments:
Post a Comment