Thursday, September 3, 2015

Simple Data Profiling (in Teradata)

My work often requires that I analyze flat files to understand the data, the relationships, the cardinality, the unique keys, etc.  To do this effectively, I always:

  1. Load the data into a relational DB so that I can run queries and test theories.
  2. Profile the data to get a sense of the likely values, the frequency of nulls, etc.
Below are two queries you can run in Teradata.  The results of the queries are additional SQL statements which you can then run on the data to learn more about it.  These scripts can easily be modified for other RDBMSs (Oracle, Netezza, DB2, etc.).  With this simple automation, I can profile a data file in less than 5 minutes without the need for any additional software or licenses.

If running through Teradata Studio Express, it will prompt you for the following:

dbname - the database the table is in
tname - the name of the table

The output will be two data sets:

  1. One row per column containing the following fields:
    1. table_name - name of the table
    2. seq - the position of the column in the table; it does not start at 1, but it is sequential
    3. column_name - name of the column
    4. #_of_records - the number of rows in the table
    5. distinct_values - the number of distinct values in the table for this column
    6. min_value - the smallest value in the table for this column
    7. max_value - the largest value in the table for this column
    8. min_length - the length of the shortest value in this column
    9. max_length - the length of the longest value in this column
    10. num_nulls - the number of nulls in the table for this column
    11. num_emptys - the number of empty strings in the table for this column
  2. Up to 100 rows per column containing the following fields for the top occurring values:
    1. table_name - name of the table
    2. seq - the position of the column in the table; it does not start at 1, but it is sequential
    3. column_name - name of the column
    4. occurrences - the number of times this value occurs in the data
    5. val - the value of the column

SELECT 'select '''|| trim(TableName) || ''' table_name,
  cast('''||ColumnId||''' as integer) as seq,
  cast('''||TRIM(ColumnName)||''' as varchar(100)) column_name,
  count(*) as #_OF_RECORDS,
  count(distinct "'||TRIM(ColumnName)||'") as DISTINCT_VALUES,
  cast(min("'||TRIM(ColumnName)||'") as varchar(255)) as MIN_VALUE,
  cast(max("'||TRIM(ColumnName)||'") as varchar(255)) as MAX_VALUE,
  min(character_length(cast("'||TRIM(ColumnName)||'" as varchar(2000)))) as MIN_LENGTH,
  max(character_length(cast("'||TRIM(ColumnName)||'" as varchar(2000)))) as MAX_LENGTH,
  sum(case when "'||TRIM(ColumnName)||'" IS NULL then 1 else 0 end) AS NUM_NULLS,
  sum(case when TRIM(cast("' || Trim(ColumnName) || '" as varchar(255))) = '''' then 1 else 0 end) AS NUM_EMPTYS
from ' || trim(DatabaseName) || '.' || trim(TableName) ||'
union all' q
FROM DBC.COLUMNS
WHERE DatabaseName = ?\dbname
AND TableName = ?\tname;
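For example, assuming a hypothetical table mydb.customer with a column first_name (names made up for illustration; seq comes from the ColumnId stored in DBC.COLUMNS, shown here as 1026), the first query emits one statement like this per column.  Every generated statement ends in union all, so delete the trailing union all from the last one before executing the combined script.

select 'customer' table_name,
  cast('1026' as integer) as seq,
  cast('first_name' as varchar(100)) column_name,
  count(*) as #_OF_RECORDS,
  count(distinct "first_name") as DISTINCT_VALUES,
  cast(min("first_name") as varchar(255)) as MIN_VALUE,
  cast(max("first_name") as varchar(255)) as MAX_VALUE,
  min(character_length(cast("first_name" as varchar(2000)))) as MIN_LENGTH,
  max(character_length(cast("first_name" as varchar(2000)))) as MAX_LENGTH,
  sum(case when "first_name" IS NULL then 1 else 0 end) AS NUM_NULLS,
  sum(case when TRIM(cast("first_name" as varchar(255))) = '' then 1 else 0 end) AS NUM_EMPTYS
from mydb.customer
union all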


SELECT 'select * from (select top 100 '''|| trim(TableName) || ''' table_name,
  cast('''||ColumnId||''' as integer) as seq,
  cast('''||TRIM(ColumnName)||''' as varchar(100)) column_name,
  count("' ||TRIM(ColumnName)|| '") occurences,' || 
  case when columnType = 'CV' and columnLength > 100 then
    '"' || TRIM(ColumnName) || '"' || ' val' else
    'cast("' ||TRIM(ColumnName)|| '" as varchar(100)) val' end || '
from ' || trim(DatabaseName) || '.' || trim(TableName) || '
group by 5 order by 4 desc) "' || trim(ColumnName) || '"
union all' q
FROM DBC.COLUMNS
WHERE DatabaseName = ?\dbname
AND TableName = ?\tname;
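For the same hypothetical column, the second query generates one top-100 statement per column, ordered by the number of occurrences; as before, drop the trailing union all from the last generated statement before running the combined script.

select * from (select top 100 'customer' table_name,
  cast('1026' as integer) as seq,
  cast('first_name' as varchar(100)) column_name,
  count("first_name") occurrences, cast("first_name" as varchar(100)) val
from mydb.customer
group by 5 order by 4 desc) "first_name"
union all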

Wednesday, September 2, 2015

Sankey Visualization in MicroStrategy 10

I spent a little bit of time last week integrating the sankey visualization into MicroStrategy 10.  To deploy locally, just unzip the D3Flow.zip into: MicroStrategy\Web ASPx\plugins.  You’ll also need to put the sankey.js file into: MicroStrategy\Web ASPx\javascript\D3.  


This was based on cobbling together a couple of things:

Why A Star Schema?

Description: 
While interviewing candidates for some projects, I got really frustrated when nobody could tell me “why a star schema is good for reporting.” It got even more frustrating when I couldn’t find a good, concise article on the internet. So, I wrote something up.

To understand the star schema, first let's talk about its opposite, 3rd Normal Form.

What is 3rd Normal Form? 
A table is in 3rd Normal Form if every non-key column depends only on the key of that table. In the following example, where the key is [Tournament|Year], the [Winner] is determined by that key. However, the [Winner Date of Birth] depends on [Winner] rather than on [Tournament|Year], so this table is not considered to be in 3rd Normal Form.

Tournament           | Year | Winner         | Winner Date of Birth
Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975
Cleveland Open       | 1999 | Bob Albertson  | 28 September 1968
Des Moines Masters   | 1999 | Al Fredrickson | 21 July 1975
Indiana Invitational | 1999 | Chip Masterson | 14 March 1977

A 3rd Normal Form representation would look like the two tables below (a rough DDL sketch follows them).  These structures are ideal for OLTP (On-Line Transaction Processing), i.e. maintaining the data through an application, because each table is atomic and relatively small, both in terms of rows and columns.  Any large tables are typically only inserted into and do not require scan operations.

Tournament           | Year | Winner
Indiana Invitational | 1998 | Al Fredrickson
Cleveland Open       | 1999 | Bob Albertson
Des Moines Masters   | 1999 | Al Fredrickson
Indiana Invitational | 1999 | Chip Masterson


Winner         | Date of Birth
Chip Masterson | 14 March 1977
Al Fredrickson | 21 July 1975
Bob Albertson  | 28 September 1968
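As a rough sketch only (the table names and data types here are my own illustration, not part of the original example), the 3NF split above could be declared like this:

-- Each non-key column now depends only on its table's key:
-- the winner depends on (tournament, year); the date of birth depends only on the winner.
CREATE TABLE tournament_result (
  tournament      VARCHAR(100) NOT NULL,
  tournament_year INTEGER      NOT NULL,
  winner          VARCHAR(100) NOT NULL,
  PRIMARY KEY (tournament, tournament_year)
);

CREATE TABLE winner (
  winner        VARCHAR(100) NOT NULL,
  date_of_birth DATE         NOT NULL,
  PRIMARY KEY (winner)
);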

A typical data warehouse fact table may look like the following.   This is similar to the starting table but is even further de-normalized: the winner's birth date has been parsed out into separate Month and Year columns.  The purpose of de-normalization is to make available on the fact table the attributes needed to answer the questions an analytical user may ask.  For example, if I wanted to count all tournament winners who were born in the month of September, the DB operation would be to scan this one table on the Winner Birth Month column, counting all records with a value of “September” (a SQL sketch of this query follows the table below).   That is less costly than parsing the Winner Date of Birth field, and less costly than joining to a winner table which has that information. Decisions about which fields to de-normalize onto the fact table should be driven by the business questions which are going to be asked.

Tournament           | Tournament Year | Winner         | Winner Birth Month | Winner Birth Year | Winner Date of Birth
Indiana Invitational | 1998            | Al Fredrickson | July               | 1975              | 21 July 1975
Cleveland Open       | 1999            | Bob Albertson  | September          | 1968              | 28 September 1968
Des Moines Masters   | 1999            | Al Fredrickson | July               | 1975              | 21 July 1975
Indiana Invitational | 1999            | Chip Masterson | March              | 1977              | 14 March 1977
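For the “born in September” question described above, a minimal sketch (assuming the fact table is named tournament_fact, a name I made up for illustration) is a single scan with no join and no date parsing:

-- One pass over the de-normalized fact table; filter on the pre-split month column.
SELECT COUNT(*) AS winners_born_in_september
FROM tournament_fact
WHERE winner_birth_month = 'September';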

A few common optimizations for the data warehousing world include:

1) Replacing the value in the fact table with a numeric surrogate key that points to a lookup table. The necessary join does not lead to a performance hit, because the database will quickly find the surrogate keys it needs from the small dimension table and keep them in memory while scanning the fact table.  Furthermore, because the fact table contains numeric keys, each row takes less space, making it faster to scan.

2) Ensuring there’s a proper index on the filter keys.   If the surrogate key above is indexed, it is even easier to find, resulting in less scanning of the fact table. As a result, you may have the following data model, which looks like a star (a rough SQL sketch is below):
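Here is a rough SQL sketch of that star shape, again with made-up names: a small winner dimension holds the descriptive attributes, while the fact table carries only the surrogate key.

-- Small dimension table: one row per winner, keyed by a numeric surrogate.
CREATE TABLE winner_dim (
  winner_key    INTEGER      NOT NULL PRIMARY KEY,
  winner_name   VARCHAR(100) NOT NULL,
  birth_month   VARCHAR(9),
  birth_year    INTEGER,
  date_of_birth DATE
);

-- The fact table from above, with the de-normalized winner columns replaced by a
-- numeric surrogate key; numeric keys keep each row narrow and fast to scan.
CREATE TABLE tournament_fact (
  tournament      VARCHAR(100) NOT NULL,
  tournament_year INTEGER      NOT NULL,
  winner_key      INTEGER      NOT NULL
);

-- Star join: resolve the filter on the small dimension, then scan the fact table once.
SELECT COUNT(*) AS winners_born_in_september
FROM tournament_fact f
JOIN winner_dim w
  ON w.winner_key = f.winner_key
WHERE w.birth_month = 'September';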

JSON Flattener

Description:

I couldn't find something that would convert a JSON file to a flat CSV. While there are some examples out there which only traverse the first level of the tree, this one traverses the whole tree and repeats the parent values for the child elements.

Technology:

The program I wrote is 100% HTML + JavaScript.

Code:

 
Demo:



Data: