SQL: Compare Two Tables for Differences in Data

May 15, 2017

in Joins, Scripts, SQL

SQL: Compare Two Tables for Differences

An approach for comparing two tables for differences in data, using a pure SQL solution:

In a recent project, a shadow table had to be compared periodically with its main source table to identify the differences between the two tables.

The nature of differences fell into one of these buckets:

  1. Insert Required [INS]: If the source had rows not present in the shadow table
  2. Update Required [UPD]: If one or more columns in a row (identified by a primary key) had changed values
  3. Delete Required [DEL]: If the shadow table had rows not present in the source table

Comparing two tables for differences (INS/UPD/DEL) needed a different solution from a simple MERGE SQL statement. Here an upsert was not to be executed, only the nature of differences to be identified.

I came upon this post on Data Detection in Oracle SQL, which outlines a number of good solutions for similar use cases in a data warehouse environment.

We followed the DECODE method – the key reasons for the choice were its simplicity and elegance. New Oracle developers were able to quickly understand and support this approach.

Here’s a detailed breakdown of the solution, with sample SOURCE and TARGET table scripts that you could copy to test out in your own environment.

Case Study Setup Scripts

1. Create tables SOURCE and TARGET

-- SOURCE: Main table against which 
-- the shadow table is to be compared
create table source
(
  c1 varchar2(2) -- Primary Key
, c2 varchar2(2)
, c3 varchar2(2)
);

-- TARGET: Shadow table which is to 
-- be compared against SOURCE
create table target
(
  c1 varchar2(2) -- Primary Key
, c2 varchar2(2)
, c3 varchar2(2)
);

2. Populate data

Using INSERT ALL

-- Data for main table SOURCE
insert all
into source(c1,c2,c3) values('1','x','y')
into source(c1,c2,c3) values('2','xx','y')
into source(c1,c2,c3) values('3','x','y')
select * from dual;

-- Data for shadow table TARGET
insert all
into target(c1,c2,c3) values('1','x','y')
into target(c1,c2,c3) values('2','x','y')
into target(c1,c2,c3) values('4','x','yy')
select * from dual;

What we have now is representative data in which

  • One row has a changed value in SOURCE and needs to be updated in TARGET [UPD]
  • One row is new in SOURCE and needs to be inserted into TARGET [INS]
  • One row is missing in SOURCE and needs to be deleted from TARGET [DEL]

SQL Compare Two Tables for Differences

Compare two tables for differences: SQL Solution

A single SQL can give us the required result:

select case
   when t.c1 is null then 'INS'
   when s.c1 is null then 'DEL'
                     else 'UPD'
       end dml_flag
     , nvl (s.c1, t.c1) c1
     , s.c2, s.c3
from source s
full join target t 
      on (s.c1 = t.c1)
where decode (s.c2, t.c2, 0, 1)
    + decode (s.c3, t.c3, 0, 1) > 0;

When run:

SQL> select case
  2     when t.c1 is null then 'INS'
  3     when s.c1 is null then 'DEL'
  4                       else 'UPD'
  5         end dml_flag
  6       , nvl (s.c1, t.c1) c1
  7       , s.c2, s.c3
  8  from source s
  9  full join target t
 10        on (s.c1 = t.c1)
 11  where decode (s.c2, t.c2, 0, 1)
 12      + decode (s.c3, t.c3, 0, 1) > 0;

DML_FLAG  C1 C2 C3
--------- -- -- --
UPD       2  xx y
DEL       4
INS       3  x  y 

Understanding the SQL solution step-by-step

Read on for help with understanding how the SQL above works and how to apply it to your situation…

Step 1: FULL JOIN

The SQL first does a full join between SOURCE and TARGET based on the Primary Key (C1). A full join returns rows from both tables, with nulls if a match is not found in the other table.

This is the result of a basic full join between SOURCE and TARGET with no other filters/conditions:

SQL Identify Differences in Two Tables

Step 2: Filter out unchanged rows

After the full join, the SQL filters out rows in which no change is detected. This is done by

   I. applying a DECODE on every corresponding attribute pair in SOURCE and TARGET – to give 0 if no change in value, 1 if there is a change — using this format:

DECODE (.[columnX]
      , [target].[columnX]
      , 0 —- Value if .[columnX] = [target.columnX]
     , 1)

For N columns to be compared, there would be N such DECODES.

II. adding up the result of DECODEs
III. checking the final sum of the result of DECODEs against zero. A non-zero sum means that at least one attribute showed a variation in values between SOURCE and TARGET, and so the row qualifies for update.

This SQL gives us a listing of rows that have some sort of difference (INS/UPD/DEL) in data between SOURCE and TARGET:

select *
from source s
full join target t 
      on (s.c1 = t.c1)
where decode (s.c2, t.c2, 0, 1)
    + decode (s.c3, t.c3, 0, 1) > 0;

When run:

SQL> select *
  2  from source s
  3  full join target t
  4        on (s.c1 = t.c1)
  5  where decode (s.c2, t.c2, 0, 1)
  6      + decode (s.c3, t.c3, 0, 1) > 0;

C1 C2 C3 C1 C2 C3
-- -- -- -- -- --
2  xx y  2  x  y
         4  x  yy
3  x  y

Step 3: Derive DML_FLAG

The SQL finally applies a searched CASE expression on the SQL formed in Step 2, to derive the value of DML_FLAG:

  • If the full join found no rows in TARGET, then DML_FLAG = INS
  • If the full join found no rows in SOURCE, then DML_FLAG = DEL
  • If the full join found rows in both TARGET and SOURCE, then the row can only be showing up in the Step 2 SQL if at least one attribute has a variation in values. That is, DML_FLAG = UPD.

Putting it all together gives us the final SQL:

select case
   when t.c1 is null then 'INS'
   when s.c1 is null then 'DEL'
                     else 'UPD'
       end dml_flag
     , nvl (s.c1, t.c1) c1
     , s.c2, s.c3
from source s
full join target t 
      on (s.c1 = t.c1)
where decode (s.c2, t.c2, 0, 1)
    + decode (s.c3, t.c3, 0, 1) > 0;

When run:

SQL> select case
  2     when t.c1 is null then 'INS'
  3     when s.c1 is null then 'DEL'
  4                       else 'UPD'
  5         end dml_flag
  6       , nvl (s.c1, t.c1) c1
  7       , s.c2, s.c3
  8  from source s
  9  full join target t
 10        on (s.c1 = t.c1)
 11  where decode (s.c2, t.c2, 0, 1)
 12      + decode (s.c3, t.c3, 0, 1) > 0;

DML_FLAG  C1 C2 C3
--------- -- -- --
UPD       2  xx y
DEL       4
INS       3  x  y 

Leave a Comment

Previous post:

Next post: