Rajiv's Code: 2015

Thursday, September 3, 2015

Simple Data Profiling (in Teradata)

My work often requires that I analyze flat files to understand the data, its relationships, cardinality, unique keys, etc. To do this effectively, I always:

Load the data into a relational DB so that I can run queries and test theories.
Profile the data to get a sense of the likely values, the frequency of nulls, etc.

Below are two queries which you can run in Teradata. The results of the queries are additional SQL statements which you can run on the data to learn more about it. These scripts can easily be modified for other RDBMSs (Oracle, Netezza, DB2, etc.). With this simple automation, I can profile a data file in less than 5 minutes without the need for any additional software or licenses.

If you run the queries through Teradata Studio Express, it will prompt you for the following:

dbname - the database the table is in
tname - the name of the table

The output will be two data sets:

One row per column containing the following fields:

table_name - name of the table
seq - sequence of the column in the table; this does not start with 1, but it is sequential
column_name - name of the column
#_of_records - the number of rows in the table
distinct_values - the number of distinct values in the table for this column
min_value - the smallest value in the table for this column
max_value - the largest value in the table for this column
min_length - the size of the value with the smallest size
max_length - the size of the value with the largest size
num_nulls - the number of nulls in the table for this column
num_emptys - the number of empty strings in the table for this column

Up to 100 rows per column containing the following fields for the top occurring values:

table_name - name of the table
seq - sequence of the column in the table; this does not start with 1, but it is sequential
column_name - name of the column
occurrences - the number of times this value occurs in the data
val - the value of the column
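To make the two data sets concrete, here is a minimal Python sketch that computes the same fields against an in-memory SQLite table. It is not the Teradata script: the `players` table, its rows, and the helper names `column_names`, `profile_table`, and `top_values` are assumptions made up for illustration, and `seq` simply starts at 1 here.

```python
import sqlite3

def column_names(conn, table):
    # PRAGMA table_info yields (cid, name, type, notnull, default, pk) per column.
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]

def profile_table(conn, table):
    """First data set: one summary row per column."""
    rows = []
    for seq, col in enumerate(column_names(conn, table), start=1):
        sql = f'''
            SELECT COUNT(*),
                   COUNT(DISTINCT "{col}"),
                   MIN("{col}"), MAX("{col}"),
                   MIN(LENGTH(CAST("{col}" AS TEXT))),
                   MAX(LENGTH(CAST("{col}" AS TEXT))),
                   SUM(CASE WHEN "{col}" IS NULL THEN 1 ELSE 0 END),
                   SUM(CASE WHEN TRIM(CAST("{col}" AS TEXT)) = '' THEN 1 ELSE 0 END)
            FROM "{table}"'''
        rows.append((table, seq, col) + tuple(conn.execute(sql).fetchone()))
    return rows

def top_values(conn, table, limit=100):
    """Second data set: up to `limit` most frequent values per column."""
    out = []
    for seq, col in enumerate(column_names(conn, table), start=1):
        sql = f'''
            SELECT "{col}", COUNT("{col}") AS occurrences
            FROM "{table}"
            GROUP BY "{col}"
            ORDER BY occurrences DESC
            LIMIT ?'''
        for val, occurrences in conn.execute(sql, (limit,)):
            out.append((table, seq, col, occurrences, val))
    return out

# Hypothetical sample data, including an empty string and a null.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("Al", "Indy"), ("Bob", ""), ("Al", None), ("Chip", "Ohio")],
)
profile = profile_table(conn, "players")
top = top_values(conn, "players")
```

Unlike the Teradata version, this sketch runs one query per column directly instead of generating a single statement joined with `union all`.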

SELECT 'select '''|| trim(TableName) || ''' table_name,
cast('''||ColumnId||''' as integer) as seq,
cast('''||TRIM(ColumnName)||''' as varchar(100)) column_name,
count(*) as #_OF_RECORDS,
count(distinct "'||TRIM(ColumnName)||'") as DISTINCT_VALUES,
cast(min("'||TRIM(ColumnName)||'") as varchar(255)) as MIN_VALUE,
cast(max("'||TRIM(ColumnName)||'") as varchar(255)) as MAX_VALUE,
min(character_length(cast("'||TRIM(ColumnName)||'" as varchar(2000)))) as MIN_LENGTH,
max(character_length(cast("'||TRIM(ColumnName)||'" as varchar(2000)))) as MAX_LENGTH,
sum(case when "'||TRIM(ColumnName)||'" IS NULL then 1 else 0 end) AS NUM_NULLS,
sum(case when TRIM(cast("' || Trim(ColumnName) || '" as varchar(255))) = '''' then 1 else 0 end) AS NUM_EMPTYS
from ' || trim(DatabaseName) || '.' || trim(TableName) ||'
union all' q
FROM DBC.COLUMNS
WHERE DatabaseName = ?\dbname
AND TableName = ?\tname;

SELECT 'select * from (select top 100 '''|| trim(TableName) || ''' table_name,
cast('''||ColumnId||''' as integer) as seq,
cast('''||TRIM(ColumnName)||''' as varchar(100)) column_name,
count("' ||TRIM(ColumnName)|| '") occurrences,' ||
case when columnType = 'CV' and columnLength > 100 then
'"' || TRIM(ColumnName) || '"' || ' val' else
'cast("' ||TRIM(ColumnName)|| '" as varchar(100)) val' end || '
from ' || trim(DatabaseName) || '.' || trim(TableName) || '
group by 5 order by 4 desc) "' || trim(ColumnName) || '"
union all' q
FROM DBC.COLUMNS
WHERE DatabaseName = ?\dbname
AND TableName = ?\tname;
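Both queries return one generated statement per column in the column q, and every one of them ends with union all. The last union all has to be dropped before the combined statement will run. Here is a small Python sketch of that clean-up step; the `assemble` helper and the sample fragments are hypothetical and not part of the original scripts.

```python
def assemble(generated_rows):
    """Join the generated fragments and drop the trailing 'union all'."""
    sql = "\n".join(row.rstrip() for row in generated_rows).rstrip()
    suffix = "union all"
    if sql.lower().endswith(suffix):
        # Only the final fragment's 'union all' is removed.
        sql = sql[: -len(suffix)].rstrip()
    return sql + ";"

# Hypothetical fragments standing in for the rows returned in column q.
fragments = [
    "select 1 a from t\nunion all",
    "select 2 a from t\nunion all",
]
combined = assemble(fragments)
```

If you paste the results by hand instead, deleting the last line of the output achieves the same thing.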

Wednesday, September 2, 2015

Sankey Visualization in MicroStrategy 10

I spent a little bit of time last week integrating the Sankey visualization into MicroStrategy 10. To deploy locally, just unzip D3Flow.zip into: MicroStrategy\Web ASPx\plugins. You’ll also need to put the sankey.js file into: MicroStrategy\Web ASPx\javascript\D3.

sankey.js

D3Flow.zip

This was based on cobbling together a couple of things:

I started with this tutorial to understand how it works:

https://lw.microstrategy.com/msdz/MSDL/10/docs/projects/VisSDK_All/Content/topics/HTML5/SampleCode_D3SimpleBarChart.htm

I tried to use this example, which worked, but did something funny when a node was used as both a source and target:

https://github.com/mstr-dev/Visualization-Plugins/tree/master/D3Flow

I modified the code to use this example from a guy who improved on the diagramming capabilities:

http://bl.ocks.org/soxofaan/7c96560677ead0425fe7

Why A Star Schema?

Description:
In interviewing some candidates for some projects, I got really frustrated when nobody could tell me “why a star schema is good for reporting”. It got even more frustrating when I couldn’t find a good / concise article on the internet. So, I wrote something up.

To understand the star schema, first let's talk about its opposite, 3rd Normal Form.

What is 3rd Normal Form?
A table is in 3rd Normal Form if all the data it represents relates directly to the key of that table. In the following example, where the key is [Tournament|Year], the [Winner] is determined by the key. However, the [Winner Date of Birth] depends on the [Winner] rather than on the [Tournament|Year], so this table is not considered to be in 3rd Normal Form.

Tournament	Year	Winner	Winner Date of Birth
Indiana Invitational	1998	Al Fredrickson	21 July 1975
Cleveland Open	1999	Bob Albertson	28 September 1968
Des Moines Masters	1999	Al Fredrickson	21 July 1975
Indiana Invitational	1999	Chip Masterson	14 March 1977

A 3rd Normal Form representation would look like this. These structures are ideal for OLTP (On-Line Transaction Processing), i.e. maintaining the data through an application. This is because each table is atomic and relatively small, both in terms of rows and columns. Any large tables are typically only inserted into and do not require scan operations.

Tournament	Year	Winner
Indiana Invitational	1998	Al Fredrickson
Cleveland Open	1999	Bob Albertson
Des Moines Masters	1999	Al Fredrickson
Indiana Invitational	1999	Chip Masterson

Winner	Date of Birth
Chip Masterson	14 March 1977
Al Fredrickson	21 July 1975
Bob Albertson	28 September 1968
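As a rough illustration of the two normalized tables, here is a minimal Python sketch using an in-memory SQLite database. The table and column names (`winner`, `tournament_result`, `date_of_birth`) are my own assumptions. It shows that the date of birth is stored once per winner and that a join recovers the original wide table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE winner (
    winner        TEXT PRIMARY KEY,
    date_of_birth TEXT
);
CREATE TABLE tournament_result (
    tournament TEXT,
    year       INTEGER,
    winner     TEXT REFERENCES winner(winner),
    PRIMARY KEY (tournament, year)
);
""")
conn.executemany("INSERT INTO winner VALUES (?, ?)", [
    ("Chip Masterson", "14 March 1977"),
    ("Al Fredrickson", "21 July 1975"),
    ("Bob Albertson", "28 September 1968"),
])
conn.executemany("INSERT INTO tournament_result VALUES (?, ?, ?)", [
    ("Indiana Invitational", 1998, "Al Fredrickson"),
    ("Cleveland Open", 1999, "Bob Albertson"),
    ("Des Moines Masters", 1999, "Al Fredrickson"),
    ("Indiana Invitational", 1999, "Chip Masterson"),
])

# Joining the two tables rebuilds the original four-column table.
rows = conn.execute("""
    SELECT r.tournament, r.year, r.winner, w.date_of_birth
    FROM tournament_result r
    JOIN winner w ON w.winner = r.winner
    ORDER BY r.year, r.tournament
""").fetchall()
winner_count = conn.execute("SELECT COUNT(*) FROM winner").fetchone()[0]
```

Al Fredrickson won twice, but his date of birth is stored only once, which is the point of the normalization.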

A typical data warehouse fact table may look like the following. This is similar to the starting table but is even further de-normalized. The winner birth date has been parsed out to separate the Month and Year. The purpose of de-normalization is to make available on the fact table the answers to all the questions that an analytical user may ask. For example, if I wanted to query and get the count of all tournament winners who were born in the month of September, the DB operation would be to scan this one table and count all records with a Winner Birth Month value of “September”. That’s less costly than parsing the Winner Date of Birth field and less costly than joining to a winner table which has that information. Decisions about which fields to de-normalize on the fact table should be driven by the business questions which are going to be asked.

Tournament	Tournament Year	Winner	Winner Birth Month	Winner Birth Year	Winner Date of Birth
Indiana Invitational	1998	Al Fredrickson	July	1975	21 July 1975
Cleveland Open	1999	Bob Albertson	September	1968	28 September 1968
Des Moines Masters	1999	Al Fredrickson	July	1975	21 July 1975
Indiana Invitational	1999	Chip Masterson	March	1977	14 March 1977
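The September question can be sketched in Python against an in-memory SQLite copy of this table. The table name `tournament_fact` and the column names are assumptions of mine. The first query filters on the pre-parsed month column, and the second parses the full date string at query time for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tournament_fact (
        tournament           TEXT,
        tournament_year      INTEGER,
        winner               TEXT,
        winner_birth_month   TEXT,
        winner_birth_year    INTEGER,
        winner_date_of_birth TEXT
    )
""")
conn.executemany("INSERT INTO tournament_fact VALUES (?, ?, ?, ?, ?, ?)", [
    ("Indiana Invitational", 1998, "Al Fredrickson", "July", 1975, "21 July 1975"),
    ("Cleveland Open", 1999, "Bob Albertson", "September", 1968, "28 September 1968"),
    ("Des Moines Masters", 1999, "Al Fredrickson", "July", 1975, "21 July 1975"),
    ("Indiana Invitational", 1999, "Chip Masterson", "March", 1977, "14 March 1977"),
])

# De-normalized column: a plain equality filter, no parsing and no join.
by_month_column = conn.execute(
    "SELECT COUNT(*) FROM tournament_fact WHERE winner_birth_month = 'September'"
).fetchone()[0]

# Same answer, but the date string has to be pattern-matched on every row.
by_parsing = conn.execute(
    "SELECT COUNT(*) FROM tournament_fact "
    "WHERE winner_date_of_birth LIKE '% September %'"
).fetchone()[0]
```

Both queries return the same count on this data; the difference is how much work the database does per row to get there.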

A few common optimizations in the data warehousing world include:

1) Replacing the value in the fact table with a numeric surrogate key to a lookup table. The necessary join does not lead to a performance hit because the database will quickly find the surrogate keys it needs from the small dimension table and keep those in memory while scanning the fact table. Furthermore, because the fact table contains numerical keys, each row takes less space making it faster to scan.

2) Ensuring there’s a proper index on the filter keys. If the surrogate key above is indexed, it is even easier to find, resulting in less scanning of the fact table. As a result, you may end up with a data model which looks like a star.
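Here is a minimal Python sketch of those two optimizations in an in-memory SQLite database. All table, column, and index names (`dim_winner`, `dim_tournament`, `fact_result`, `ix_fact_winner`) are hypothetical; the fact table holds only numeric surrogate keys, and the winner key is indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_winner (
    winner_key  INTEGER PRIMARY KEY,
    winner      TEXT,
    birth_month TEXT,
    birth_year  INTEGER
);
CREATE TABLE dim_tournament (
    tournament_key INTEGER PRIMARY KEY,
    tournament     TEXT
);
-- The fact table sits in the middle of the star and holds only keys and numbers.
CREATE TABLE fact_result (
    tournament_key  INTEGER REFERENCES dim_tournament(tournament_key),
    tournament_year INTEGER,
    winner_key      INTEGER REFERENCES dim_winner(winner_key)
);
CREATE INDEX ix_fact_winner ON fact_result(winner_key);
""")
conn.executemany("INSERT INTO dim_winner VALUES (?, ?, ?, ?)", [
    (1, "Al Fredrickson", "July", 1975),
    (2, "Bob Albertson", "September", 1968),
    (3, "Chip Masterson", "March", 1977),
])
conn.executemany("INSERT INTO dim_tournament VALUES (?, ?)", [
    (1, "Indiana Invitational"),
    (2, "Cleveland Open"),
    (3, "Des Moines Masters"),
])
conn.executemany("INSERT INTO fact_result VALUES (?, ?, ?)", [
    (1, 1998, 1), (2, 1999, 2), (3, 1999, 1), (1, 1999, 3),
])

# Filter on the small dimension, then pick up matching fact rows by key.
wins_by_month = conn.execute("""
    SELECT w.birth_month, COUNT(*)
    FROM fact_result f
    JOIN dim_winner w ON w.winner_key = f.winner_key
    GROUP BY w.birth_month
    ORDER BY w.birth_month
""").fetchall()
```

With only four rows the index makes no practical difference; the sketch is only meant to show the shape of the model, with the fact table in the center and the dimension tables around it.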

JSON Flattener

Description:

I couldn't find anything that would convert a JSON file to a flat CSV. While there are some examples out there which will traverse the first level of the tree, this one traverses the whole tree and repeats the parent values for the child elements.
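To illustrate what repeating the parent values means, here is a short sketch of the idea in Python. It is not the program itself, and the `flatten` and `to_csv` helpers, the dotted column names, and the sample data are all assumptions for illustration.

```python
import csv
import io

def flatten(node, prefix=""):
    """Return a list of flat dict rows for a parsed JSON value."""
    if isinstance(node, dict):
        rows = [{}]
        for key, value in node.items():
            child_rows = flatten(value, f"{prefix}{key}.")
            # Combine so values collected so far repeat on every child row.
            rows = [{**row, **child} for row in rows for child in child_rows]
        return rows
    if isinstance(node, list):
        rows = [row for item in node for row in flatten(item, prefix)]
        return rows or [{}]  # an empty list must not erase the parent row
    return [{prefix.rstrip("."): node}]

def to_csv(rows):
    """Write the rows as CSV, using the union of keys as the header."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical input: one parent value with two child elements.
data = {
    "team": "A",
    "players": [{"name": "Al", "wins": 2}, {"name": "Bob", "wins": 1}],
}
flat_rows = flatten(data)
csv_text = to_csv(flat_rows)
```

In this sketch, two sibling arrays under the same parent produce every combination of their rows, which may or may not be what you want for a given file.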

Technology:

This program I wrote is 100% HTML + JavaScript.

Code:

Demo:

Data: