Replace Duplicates in ID Column with Unique Values

December 9, 2014

in Analytics, DBA, DDL, Sequences, SQL

Replace Duplicate IDs with Unique Values

A common scenario: in the development environment, a table’s ID column is missing its unique constraint, and developers have been creating their own data for testing. Suddenly someone realizes that the ID column has duplicates. The DBA is summoned. "Enable the unique key", the DBA is told, "but do not delete our data."

Can the DBA escape this catch-22? For sure, says this post — with an approach of replacing *only* the duplicate IDs with unique values, and letting everything else stay untouched.

We’ll see this in action with a simple case study: a table BOOKS with two columns {ref_no, book_name}, in which ref_no is expected to be unique — but isn’t.

SQL> desc books
 Name                    Null?    Type
 ----------------------- -------- ----------------
 REF_NO                           NUMBER(3)
 BOOK_NAME                        VARCHAR2(50)


SQL> select * from books;

    REF_NO BOOK_NAME
---------- --------------------
         1 Alice in Wonderland
         2 Oliver Twist
         2 Pygmalion
         3 LOTR
         6 Animal Farm
         6 The Hobbit
         6 Life Of Pi
         7 Atonement

In this example, ID column (ref_no) with values 2 and 6 are duplicates.

SQL> -- List duplicates
SQL> select ref_no, count(*)
  2  from books
  3  group by ref_no
  4  having count(*) > 1;

    REF_NO   COUNT(*)
---------- ----------
         6          3
         2          2

As you can see, if we fix one row with ref_no 2, and two rows with ref_no 6, the duplicate issue will get resolved.

Here’s how.

Steps to Replace Duplicate IDs with Unique Values

1. Find the maximum value [X] of ID column

SQL> select max(ref_no) from books;

MAX(REF_NO)
-----------
          7

2. Create sequence to generate ID, starting with [X+1]

Set the start point of the new sequence ahead of the value identified in step 1, so that numbers generated by the sequence do not clash with any existing value in the ID column.

SQL> create sequence ref_no_seq
  2  start with 8;

Sequence created.

3. Update ID for duplicates in the table, using the new sequence

The SQL:

update books
set ref_no = ref_no_seq.nextval
where rowid in
-- Update only the duplicates
(
  select rowid
  from 
    (
     select rowid
       , ref_no
       , row_number() over 
         (partition by ref_no 
          order by book_name) sno
     from books
    )
  where sno > 1
);

When run:

SQL> update books
  2  set ref_no = ref_no_seq.nextval
  3  where rowid in
  4  -- Update only the duplicates
  5  (
  6    select rowid
  7    from
  8      (
  9       select rowid
 10         , ref_no
 11         , row_number() over
 12           (partition by ref_no
 13            order by book_name) sno
 14       from books
 15      )
 16    where sno > 1
 17  );

3 rows updated.

Verification

Check the data after duplicates have been resolved:

SQL> select * from books
  2  order by ref_no;

    REF_NO BOOK_NAME
---------- -------------------
         1 Alice in Wonderland
         2 Oliver Twist
         3 LOTR
         6 Animal Farm
         7 Atonement
         8 Pygmalion
         9 The Hobbit
        10 Life Of Pi

8 rows selected.

All good now, and the unique key constraint can now be enabled.

Caution: Note that this method will give you unique IDs, not gapless IDs — if you’re insisting on that, you’re in for trouble anyway!

Leave a Comment

Previous post:

Next post: