SQL to Find Master Records with Identical Detail Records

March 17, 2015

in SQL, Subqueries

SQL to Find Master Records with Identical Detail Records

The query to identify duplicate records in Oracle in a single table is fairly straightforward. Things get tricky in a two-table (master-detail) scenario, in which the master table holds the header information (id, name, etc) and the detail table has sets of values associated with the master records.

What if we need to find those master records that have identical sets of values in the detail table? Can a single SQL list master records with identical detail records?

This post explains how this can be done.

The Master:Detail Case

Given two tables:

  • list_group: The table that stores header information (list_id, list_name,etc) for list_val.
  • list_val: The table that stores values associated with each element of list_group

The contents of the tables are:

SQL> select list_id
  2       , list_name
  3  from list_group;

LIST_ID  LIST_NAME
-------- ------------
CALLSTAT Call Status
MAILSTAT Mail Status
FAXSTAT  Fax Status
POSTSTAT Post Status

SQL> select list_id
  2       , value_code
  3       , value_name
  4  from list_val;

LIST_ID  VALUE_CO VALUE_NAME
-------- -------- ------------------
CALLSTAT NOANS    Unanswered
CALLSTAT NUMNA    Number Unavailable
CALLSTAT SUCCESS  Call Successful
MAILSTAT DELIV    Delivered
MAILSTAT TRANSIT  In-Transit
MAILSTAT RETURNED Returned
FAXSTAT  DELIV    Delivered
FAXSTAT  TRANSIT  In-Transit
POSTSTAT DELIV    Delivered
POSTSTAT TRANSIT  In-Transit
POSTSTAT RETURNED Returned

11 rows selected.

The sets of values are considered duplicate if the value "codes" in list_val are the same for the same master. In this example, MAILSTAT and POSTSTAT have duplicate values in list_val. FAXSTAT has a subset of those values, so it is not considered a "duplicate" though there is partial match.

The aim is to write a SQL to identify records in master table list_group that have duplicate sets of values in list_val.

Cut to the Chase: The Final SQL

Here’s a SQL that will identify master records with duplicate sets of detail records. An explanation and step-wise breakdown follows in the next section.

with cte as
(select 
   g1.list_id glist1
 , g2.list_id glist2
 from list_group g1
    , list_group g2
 where g1.list_id <> g2.list_id
 and not exists
 (
   select 1
   from list_val v1
   where v1.list_id = g1.list_id
   and not exists 
   (
     select 1 
     from list_val v2
     where v2.list_id = g2.list_id
     and v2.value_code = v1.value_code
   )
 )
)
select c1.*
from cte c1
   , cte c2
where c1.glist1 = c2.glist2
and c1.glist2 = c2.glist1
and c1.glist1 < c1.glist2;

The Approach

The SQL above follows this algorithm:

  1. Pair every element in the master table list_group with every other element, except itself. Say the master pair is (GLIST1, GLIST2).
  2. Exclude from the master pairs, those in which GLIST1 is not a subset of GLIST2 (i.e. all list_val detail values of GLIST1 are not present in GLIST2)
  3. Exclude from the master pairs, those in which GLIST2 is not a subset of GLIST1 (i.e. all list_val detail values of GLIST2 are not present in GLIST1)

What remains are master record pairs with perfect equivalence i.e. GLIST1 = GLIST2, i.e. all detail records of GLIST1 are present in GLIST2, and vice versa.

A step-wise run of parts of the SQL is shown below.

1. Pair every element in the master table list_group with every other element, except itself.

Say the master pair is (GLIST1, GLIST2).

-- Pair every element in master table 
-- list_group with every other element, 
-- except itself
select 
   g1.list_id glist1
 , g2.list_id glist2
 from list_group g1
    , list_group g2
 where g1.list_id <> g2.list_id;

When run:

SQL> -- Pair every element in master table
SQL> -- list_group with every other element,
SQL> -- except itself
SQL> select
  2     g1.list_id glist1
  3   , g2.list_id glist2
  4   from list_group g1
  5      , list_group g2
  6   where g1.list_id <> g2.list_id;

GLIST1   GLIST2
-------- --------
CALLSTAT MAILSTAT
CALLSTAT FAXSTAT
CALLSTAT POSTSTAT
MAILSTAT CALLSTAT
MAILSTAT FAXSTAT
MAILSTAT POSTSTAT
FAXSTAT  CALLSTAT
FAXSTAT  MAILSTAT
FAXSTAT  POSTSTAT
POSTSTAT CALLSTAT
POSTSTAT MAILSTAT
POSTSTAT FAXSTAT

12 rows selected.

2. Exclude from the master pairs, those in which GLIST1 is not a subset of GLIST2

(i.e. all list_val detail values of GLIST1 are not present in GLIST2)

Correlated subquery comes in handy when comparing the ids from the outer query in the detail subquery.

-- Exclude from the master pairs, those in
-- which GLIST1 is not a subset of GLIST2 
-- (i.e. all list_val detail values of GLIST1 
-- are not present in GLIST2)  
select
   g1.list_id glist1
 , g2.list_id glist2
 from list_group g1
    , list_group g2
 where g1.list_id <> g2.list_id
 and not exists
 (
   select 1
   from list_val v1
   where v1.list_id = g1.list_id
   and not exists
   (
     select 1
     from list_val v2
     where v2.list_id = g2.list_id
     and v2.value_code = v1.value_code
   )
 );

When run:

SQL> -- Exclude from the master pairs, those in
SQL> -- which GLIST1 is not a subset of GLIST2
SQL> -- (i.e. all list_val detail values of GLIST1
SQL> -- are not present in GLIST2)
SQL> select
  2     g1.list_id glist1
  3   , g2.list_id glist2
  4   from list_group g1
  5      , list_group g2
  6   where g1.list_id <> g2.list_id
  7   and not exists
  8   (
  9     select 1
 10     from list_val v1
 11     where v1.list_id = g1.list_id
 12     and not exists
 13     (
 14       select 1
 15       from list_val v2
 16       where v2.list_id = g2.list_id
 17       and v2.value_code = v1.value_code
 18     )
 19   );

GLIST1   GLIST2
-------- --------
MAILSTAT POSTSTAT
FAXSTAT  MAILSTAT
FAXSTAT  POSTSTAT
POSTSTAT MAILSTAT

3. Exclude from the master pairs, those in which GLIST2 is not a subset of GLIST1

(i.e. all list_val detail values of GLIST2 are not present in GLIST1)

We make use of a Common Table Expression (CTE) to factor out the repeating portion of the query.

-- Make a cte of the filtered master 
-- pairs. Exclude from the master 
-- pairs, those in which GLIST2 is  
-- not a subset of GLIST1. 
with cte as
(select 
   g1.list_id glist1
 , g2.list_id glist2
 from list_group g1
    , list_group g2
 where g1.list_id <> g2.list_id
 and not exists
 (
   select 1
   from list_val v1
   where v1.list_id = g1.list_id
   and not exists 
   (
     select 1 
     from list_val v2
     where v2.list_id = g2.list_id
     and v2.value_code = v1.value_code
   )
 )
)
select c1.*
from cte c1
   , cte c2
where c1.glist1 = c2.glist2
and c1.glist2 = c2.glist1
and c1.glist1 < c1.glist2;

When run:

SQL> -- Make a cte of the filtered master
SQL> -- pairs. Exclude from the master
SQL> -- pairs, those in which GLIST2 is
SQL> -- not a subset of GLIST1.
SQL> with cte as
  2  (select
  3     g1.list_id glist1
  4   , g2.list_id glist2
  5   from list_group g1
  6      , list_group g2
  7   where g1.list_id <> g2.list_id
  8   and not exists
  9   (
 10     select 1
 11     from list_val v1
 12     where v1.list_id = g1.list_id
 13     and not exists
 14     (
 15       select 1
 16       from list_val v2
 17       where v2.list_id = g2.list_id
 18       and v2.value_code = v1.value_code
 19     )
 20   )
 21  )
 22  select c1.*
 23  from cte c1
 24     , cte c2
 25  where c1.glist1 = c2.glist2
 26  and c1.glist2 = c2.glist1
 27  and c1.glist1 < c1.glist2;

GLIST1   GLIST2
-------- --------
MAILSTAT POSTSTAT

And there you have it — a single SQL to find master records with identical detail records.

Leave a Comment

Previous post:

Next post: