Thursday, September 3, 2015

Simple Data Profiling (in Teradata)

My work often requires that I analyze flat files to understand the data: relationships, cardinality, unique keys, etc.  To do this effectively, I always:

  1. Load the data into a relational DB so that I can run queries and test theories.
  2. Profile the data to get a sense of the likely values, the frequency of nulls, etc.
Below are two queries which you can run in Teradata.  The result of each query is a series of additional SQL statements, which you can then run against the data to learn more about it.  The generated statements are joined with "union all", so remove the trailing "union all" from the last statement before executing.  These scripts can easily be modified for other RDBMSs (Oracle, Netezza, DB2, etc.).  With this simple automation, I can profile a data file in less than five minutes without the need for any additional software or licenses.

If you run these through Teradata Studio Express, it will prompt you for the following substitution parameters:

dbname - the database the table is in
tname - the name of the table

The output will be two data sets:

  1. One row per column containing the following fields:
    1. table_name - name of the table
    2. seq - the column's ordinal position in the table; it does not start at 1, but it is sequential
    3. column_name - name of the column
    4. #_of_records - the number of rows in the table
    5. distinct_values - the number of distinct values in the table for this column
    6. min_value - the smallest value in the table for this column
    7. max_value - the largest value in the table for this column
    8. min_length - the length of the shortest value in the table for this column
    9. max_length - the length of the longest value in the table for this column
    10. num_nulls - the number of nulls in the table for this column
    11. num_emptys - the number of empty strings in the table for this column
  2. Up to 100 rows per column containing the following fields for the top occurring values:
    1. table_name - name of the table
    2. seq - the column's ordinal position in the table; it does not start at 1, but it is sequential
    3. column_name - name of the column
    4. occurrences - the number of times this value occurs in the data
    5. val - the value of the column

SELECT 'select '''|| trim(TableName) || ''' table_name,
  cast('''||ColumnId||''' as integer) as seq,
  cast('''||TRIM(ColumnName)||''' as varchar(100)) column_name,
  count(*) as #_OF_RECORDS,
  count(distinct "'||TRIM(ColumnName)||'") as DISTINCT_VALUES,
  cast(min("'||TRIM(ColumnName)||'") as varchar(255)) as MIN_VALUE,
  cast(max("'||TRIM(ColumnName)||'") as varchar(255)) as MAX_VALUE,
  min(character_length(cast("'||TRIM(ColumnName)||'" as varchar(2000)))) as MIN_LENGTH,
  max(character_length(cast("'||TRIM(ColumnName)||'" as varchar(2000)))) as MAX_LENGTH,
  sum(case when "'||TRIM(ColumnName)||'" IS NULL then 1 else 0 end) AS NUM_NULLS,
  sum(case when TRIM(cast("' || Trim(ColumnName) || '" as varchar(255))) = '''' then 1 else 0 end) AS NUM_EMPTYS
from ' || trim(DatabaseName) || '.' || trim(TableName) ||'
union all' q
FROM DBC.COLUMNS
WHERE DatabaseName = ?\dbname
AND TableName = ?\tname;
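The same metadata-driven idea ports readily to other databases.  As a minimal sketch (not the script above), here is how the per-column summary might look in Python against SQLite, using PRAGMA table_info in place of DBC.COLUMNS; the "people" table and its columns are made up for illustration, and a few fields (min_length, max_length) are omitted for brevity:

```python
import sqlite3

# Hypothetical sample table standing in for a loaded flat file.
conn = sqlite3.connect(":memory:")
conn.execute("create table people (name text, age integer)")
conn.executemany("insert into people values (?, ?)",
                 [("Ann", 34), ("Bob", None), ("", 34)])

def profile(conn, tname):
    """One summary row per column, built from the table's metadata."""
    rows = []
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    for cid, name, *_ in conn.execute(f"PRAGMA table_info({tname})"):
        sql = f'''
            select '{tname}' as table_name,
                   {cid} as seq,
                   '{name}' as column_name,
                   count(*) as num_records,
                   count(distinct "{name}") as distinct_values,
                   cast(min("{name}") as text) as min_value,
                   cast(max("{name}") as text) as max_value,
                   sum("{name}" is null) as num_nulls,
                   sum(trim(cast("{name}" as text)) = '') as num_emptys
            from {tname}'''
        rows.append(conn.execute(sql).fetchone())
    return rows

for row in profile(conn, "people"):
    print(row)
```

Here the summary query runs directly per column rather than being emitted as text, which avoids the trailing "union all" cleanup; either approach works.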


SELECT 'select * from (select top 100 '''|| trim(TableName) || ''' table_name,
  cast('''||ColumnId||''' as integer) as seq,
  cast('''||TRIM(ColumnName)||''' as varchar(100)) column_name,
  count("' ||TRIM(ColumnName)|| '") occurrences,' || 
  case when columnType = 'CV' and columnLength > 100 then
    '"' || TRIM(ColumnName) || '"' || ' val' else
    'cast("' ||TRIM(ColumnName)|| '" as varchar(100)) val' end || '
from ' || trim(DatabaseName) || '.' || trim(TableName) || '
group by 5 order by 4 desc) "' || trim(ColumnName) || '"
union all' q
FROM DBC.COLUMNS
WHERE DatabaseName = ?\dbname
AND TableName = ?\tname;
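The top-values report can likewise be sketched for SQLite.  Assuming the same hypothetical setup as before, this shows the most frequent values for one column, capped at 100 rows like the script above:

```python
import sqlite3

# Hypothetical sample table; the data is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("create table people (name text)")
conn.executemany("insert into people values (?)",
                 [("Ann",), ("Bob",), ("Ann",), ("Ann",), ("Bob",)])

def top_values(conn, tname, cname, limit=100):
    """(table_name, column_name, occurrences, val) rows, most frequent first."""
    sql = f'''
        select '{tname}' as table_name,
               '{cname}' as column_name,
               count("{cname}") as occurrences,
               cast("{cname}" as text) as val
        from {tname}
        group by val
        order by occurrences desc
        limit {limit}'''
    return conn.execute(sql).fetchall()

print(top_values(conn, "people", "name"))
```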

4 comments:

  1. ColumnName values get truncated if you run these queries in Advanced Query Tool.

  2. I'm guessing it's getting truncated because the length of DBC.Columns.ColumnName is 30

  3. Great script! Thank you!

  4. Thank you! This was extremely helpful and hard to come by.
