Testing primary keys in Impala/Hive
Unlike standard SQL databases, Hadoop based Imapla and Hive do not enforce primary keys. This function helps to test for them regardless.
Primary keys are perhaps the most basic building block of relational data modelling. If you have a table containing data, (one of) the first question(s) you should argueably ask is “what are the dimensions of the data?”. In Economics this is sometimes referred to as the “level of observation”. More formally it is a set of keys (“columns”) that uniquely identify a row in the table. In classical database archtictures such as OracleSQL you need to delare the primary keys of a table upon creating the table. In distributed systems such as Hadoop and its SQL engine and table metastore Hive however, primary keys are not declared explicitely. If they are not documented anywhere this can lead to confusion. In any case, it is useful to know how to test whether certain keys do constitute a set of primary keys. The following code helps you achieve this:
SELECT CASE WHEN MAX(tab.rows_count_each) > 1
THEN CAST(0 AS BOOLEAN)
ELSE CAST(1 AS BOOLEAN)
END AS 'Primary_Keys'
FROM (SELECT COUNT(*) AS rows_count_each
GROUP BY var1, var2) tab
For those of you with less of a background in SQL, the above is equivalent to
isid var1 var2