My work often require that I analyze flat files to understand the data, relationships, cardinality, the unique keys etc. To do this effectively, I always:
If running through Teradata Studio Express, it will prompt you for the following:
dbname - the database the table is in
tname - the name of the table
The output will be two data sets:
- Load the data into a relational DB so that I can run queries and test theories.
- Profile the data to get a sense of the the likely values, the frequency of null, etc.
If running through Teradata Studio Express, it will prompt you for the following:
dbname - the database the table is in
tname - the name of the table
The output will be two data sets:
- One row per column containing the following fields:
- table_name - name of the table
- seq - sequence of the column in the table, this does not start with 1, but it is sequential
- column_name - name of the column
- #_of_records - the number of rows in the table
- distinct_values - the number of distinct values in the table for this column
- min_value - the smallest value in the table for this column
- max_value - the largest value in the table for this column
- min_length - the size of the value with the smallest size
- max_length - the size of the value with the largest size
- num_nulls - the number of nulls in the table for this column
- num_emptys - the number of empty strings in the table for this column
- Up to 100 rows per column containing the following fields for the top occurring values:
- table_name - name of the table
- seq - sequence of the column in the table, this does not start with 1, but it is sequential
- column_name - name of the column
- occurrences - the number of times this value occurs in the data
- val - the value of the column
SELECT 'select '''|| trim(TableName) || ''' table_name,
cast('''||ColumnId||''' as integer) as seq,
cast('''||TRIM(ColumnName)||''' as varchar(100)) column_name,
count(*) as #_OF_RECORDS,
count(distinct "'||TRIM(ColumnName)||'") as DISTINCT_VALUES,
cast(min("'||TRIM(ColumnName)||'") as varchar(255)) as MIN_VALUE,
cast(max("'||TRIM(ColumnName)||'") as varchar(255)) as MAX_VALUE,
min(character_length(cast("'||TRIM(ColumnName)||'" as varchar(2000)))) as MIN_LENGTH,
max(character_length(cast("'||TRIM(ColumnName)||'" as varchar(2000)))) as MAX_LENGTH,
sum(case when "'||TRIM(ColumnName)||'" IS NULL then 1 else 0 end) AS NUM_NULLS,
sum(case when TRIM(cast("' || Trim(ColumnName) || '" as
varchar(255))) = '''' then 1 else 0 end) AS NUM_EMPTYS
from ' || trim(DatabaseName) || '.' || trim(TableName) ||'
union all' q
FROM DBC.COLUMNS
WHERE DatabaseName = ?\dbname
AND TableName = ?\tname;
SELECT 'select * from (select top 100 '''|| trim(TableName) || ''' table_name,
cast('''||ColumnId||''' as integer) as seq,
cast('''||TRIM(ColumnName)||''' as varchar(100)) column_name,
count("' ||TRIM(ColumnName)|| '") occurences,' ||
case when columnType = 'CV' and columnLength > 100 then
'"' || TRIM(ColumnName) || '"' || ' val' else
'cast("' ||TRIM(ColumnName)|| '" as varchar(100)) val' end || '
from ' || trim(DatabaseName) || '.' || trim(TableName) || '
group by 5 order by 4 desc) "' || trim(ColumnName) || '"
union all' q
FROM DBC.COLUMNS
WHERE DatabaseName = ?\dbname
AND TableName = ?\tname;
ColumnName values get truncated if you run these queries in Advanced Query Tool.
ReplyDeleteI'm guessing its getting truncated because DBC.Columns.ColumnName is length is 30
ReplyDeleteGreat script! Thank you!
ReplyDeleteThank you! this was extremely helpful and hard to come by.
ReplyDelete