Thursday, September 3, 2015

Simple Data Profiling (in Teradata)

My work often requires that I analyze flat files to understand the data: relationships, cardinality, unique keys, etc.  To do this effectively, I always:

  1. Load the data into a relational DB so that I can run queries and test theories.
  2. Profile the data to get a sense of the likely values, the frequency of nulls, etc.
Below are two queries which you can run in Teradata.  The result of each query is a series of additional SQL statements, which you can then run against the data to learn more about it.  The generated statements are joined with "union all", so remove the trailing "union all" from the last statement before executing.  These scripts can easily be modified for other RDBMSs (Oracle, Netezza, DB2, etc.).  With this simple automation, I can profile a data file in less than five minutes without the need for any additional software or licenses.

If you run these through Teradata Studio Express, it will prompt you for the following substitution parameters:

dbname - the database the table is in
tname - the name of the table

The output will be two data sets:

  1. One row per column containing the following fields:
    1. table_name - name of the table
    2. seq - the column's ordinal position in the table; it does not start at 1, but it is sequential
    3. column_name - name of the column
    4. #_of_records - the number of rows in the table
    5. distinct_values - the number of distinct values in the table for this column
    6. min_value - the smallest value in the table for this column
    7. max_value - the largest value in the table for this column
    8. min_length - the length of the shortest value in the table for this column
    9. max_length - the length of the longest value in the table for this column
    10. num_nulls - the number of nulls in the table for this column
    11. num_emptys - the number of empty strings in the table for this column
  2. Up to 100 rows per column containing the following fields for the top occurring values:
    1. table_name - name of the table
    2. seq - the column's ordinal position in the table; it does not start at 1, but it is sequential
    3. column_name - name of the column
    4. occurrences - the number of times this value occurs in the data
    5. val - the value of the column

SELECT 'select '''|| trim(TableName) || ''' table_name,
  cast('''||ColumnId||''' as integer) as seq,
  cast('''||TRIM(ColumnName)||''' as varchar(100)) column_name,
  count(*) as #_OF_RECORDS,
  count(distinct "'||TRIM(ColumnName)||'") as DISTINCT_VALUES,
  cast(min("'||TRIM(ColumnName)||'") as varchar(255)) as MIN_VALUE,
  cast(max("'||TRIM(ColumnName)||'") as varchar(255)) as MAX_VALUE,
  min(character_length(cast("'||TRIM(ColumnName)||'" as varchar(2000)))) as MIN_LENGTH,
  max(character_length(cast("'||TRIM(ColumnName)||'" as varchar(2000)))) as MAX_LENGTH,
  sum(case when "'||TRIM(ColumnName)||'" IS NULL then 1 else 0 end) AS NUM_NULLS,
  sum(case when TRIM(cast("' || Trim(ColumnName) || '" as varchar(255))) = '''' then 1 else 0 end) AS NUM_EMPTYS
from ' || trim(DatabaseName) || '.' || trim(TableName) ||'
union all' q
FROM DBC.COLUMNS
WHERE DatabaseName = ?\dbname
AND TableName = ?\tname;
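The same metadata-driven idea ports readily to other databases.  As a minimal sketch (not the script above), here is how the per-column summary might look in Python against SQLite, using PRAGMA table_info in place of DBC.COLUMNS; the "people" table and its columns are made up for illustration, and a few fields (min_length, max_length) are omitted for brevity:

```python
import sqlite3

# Hypothetical sample table standing in for a loaded flat file.
conn = sqlite3.connect(":memory:")
conn.execute("create table people (name text, age integer)")
conn.executemany("insert into people values (?, ?)",
                 [("Ann", 34), ("Bob", None), ("", 34)])

def profile(conn, tname):
    """One summary row per column, built from the table's metadata."""
    rows = []
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    for cid, name, *_ in conn.execute(f"PRAGMA table_info({tname})"):
        sql = f'''
            select '{tname}' as table_name,
                   {cid} as seq,
                   '{name}' as column_name,
                   count(*) as num_records,
                   count(distinct "{name}") as distinct_values,
                   cast(min("{name}") as text) as min_value,
                   cast(max("{name}") as text) as max_value,
                   sum("{name}" is null) as num_nulls,
                   sum(trim(cast("{name}" as text)) = '') as num_emptys
            from {tname}'''
        rows.append(conn.execute(sql).fetchone())
    return rows

for row in profile(conn, "people"):
    print(row)
```

Here the summary query runs directly per column rather than being emitted as text, which avoids the trailing "union all" cleanup; either approach works.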


SELECT 'select * from (select top 100 '''|| trim(TableName) || ''' table_name,
  cast('''||ColumnId||''' as integer) as seq,
  cast('''||TRIM(ColumnName)||''' as varchar(100)) column_name,
  count("' ||TRIM(ColumnName)|| '") occurrences,' || 
  case when columnType = 'CV' and columnLength > 100 then
    '"' || TRIM(ColumnName) || '"' || ' val' else
    'cast("' ||TRIM(ColumnName)|| '" as varchar(100)) val' end || '
from ' || trim(DatabaseName) || '.' || trim(TableName) || '
group by 5 order by 4 desc) "' || trim(ColumnName) || '"
union all' q
FROM DBC.COLUMNS
WHERE DatabaseName = ?\dbname
AND TableName = ?\tname;
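The top-values report can likewise be sketched for SQLite.  Assuming the same hypothetical setup as before, this shows the most frequent values for one column, capped at 100 rows like the script above:

```python
import sqlite3

# Hypothetical sample table; the data is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("create table people (name text)")
conn.executemany("insert into people values (?)",
                 [("Ann",), ("Bob",), ("Ann",), ("Ann",), ("Bob",)])

def top_values(conn, tname, cname, limit=100):
    """(table_name, column_name, occurrences, val) rows, most frequent first."""
    sql = f'''
        select '{tname}' as table_name,
               '{cname}' as column_name,
               count("{cname}") as occurrences,
               cast("{cname}" as text) as val
        from {tname}
        group by val
        order by occurrences desc
        limit {limit}'''
    return conn.execute(sql).fetchall()

print(top_values(conn, "people", "name"))
```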

4 comments:

  1. ColumnName values get truncated if you run these queries in Advanced Query Tool.

  2. I'm guessing it's getting truncated because the length of DBC.Columns.ColumnName is 30

  3. Great script! Thank you!

  4. Thank you! This was extremely helpful and hard to come by.
