Division query — Find entities in table A that appear for all items in table B

A not so standard SQL problem

Division query

Recently I needed to merge data from several tables to figure out which companies engaged in certain actitivities. More specifically, I wanted to know which companies were active in all sub-categories of a field. This led me to learn about so called query division, an often miss-understood concept (even on SO).

Suppose you have the following problem. You have three tables (to ease interpretation let’s take a well-known example of customer purchases, but really it extends to any n:m relationship problem). The tables might looks like this:

Customers

C_ID ...
1
2
3
4

Products

P_ID ...
1
2

and Purchases

C_ID  P_ID
1     1
1     2
2     1
3     1
3     2
4     3
...

Displayed are only the primary keys, but you can imagine any range of additional, arbitrary colunms.

The goal is to write a query that identifies Customers 1 and 3 as habing bought all the products. My initial idea was similar to the first SO answer to find all customer that have bought as many distinct items as there are in the products table. This will actually work, provided that the products table does contain all the products that there are (this may be a very good or very poor assumption based on the data that you work with). If you are able to make that assumption the problems boils down to:

select 
    C_ID
from Customers
group by 
    C_ID
having count(distinct P_ID) = (select count(*) from Products)

But what if Purchases looked more like this:

C_ID  P_ID
1     1
1     4
2     1
3     1
3     2
4     3
...

Here, Customer 1 bought a product with P_ID=4, which does not appear in the product table. Yet the above query would identify him as a customer who bought all the products.

My next intution was to return the data to Python and loop. However, one should be immediately suspicious when looping comes to mind. Usually there is a better way. In particular since in my application downloading all the data would have been a massive pain.

After some more research I learned about the “division query” strategy: “Find all the customers for which there is no product, that has never been purchased”.

-- All the customers who have not
SELECT *
FROM   customers C
WHERE NOT EXISTS (  -- all the products which have not
                   SELECT *
                   FROM   products P 
                   WHERE  NOT EXISTS ( --ever been purchased
                                      SELECT *
                                      FROM   purchases PU
                                      WHERE  PU.c_id = c_id
                                      AND PU.p_id = P.p_id)) 

This will return the desired:

C_ID ...
1
3

It’s far from intuitive, but once you understand it, higly extendable. For example, you could easily add additional requirements in any of the subqueries.

Jannic Alexander Cutura
Jannic Alexander Cutura
Software ∪ Data Engineer

My interests include distributed computing and cloud as well as financial stability and regulation.

Related