How to Delete Old Data from a Database Schema

August 19, 2013

in Data Dictionary, Scripts, SQL

Delete Old Data from a Database Schema

How would you delete old data from a database schema so that it retains only the current year’s rows in all the tables in the schema, and the rest of the data is erased?

First things first: Why would you want to delete old data from the entire database schema?

You may want to slim down a database schema for a number of reasons. Let’s say to set up the test environment for a data migration project, a replica of the legacy database schema has been created. The legacy database schema has millions of records spanning multiple years, but for testing migration scripts in the replica schema, just the most recent records — those of the current year – will do.

Delete Old Data from a Database Schema: Solution

What is the simplest way to set up the schema so that it contains only the current year’s rows and nothing else? Consider that each table in the schema has a standard audit column CREATED_DATETIME to store when the row was first inserted into the table.

We tried out two approaches to delete old data from the schema: one a direct method, the other a slightly roundabout but better performing method. The two approaches are:

  1. The DELETE script approach, which is easier to write but less efficient to execute.
  2. The CTAS-Drop-Rename script approach, which involves a few more steps to write but performs better for large data volumes.

The steps, scripts used and their execution is given below.

Approach 1: Direct DELETE Script

Steps:

1. Using an SQL script generator, generate script delete_statements.sql to delete rows where CREATED_DATETIME does not fall in the current year.
2. Run the auto-generated script delete_statements.sql of step 1, to delete rows older than the current year.

Script Generators:

Here is the SQL that generates the script delete_statements.sql as output:

-- This SQL creates an SQL script to 
-- delete old rows from all tables
-- in the schema.
select  'delete from '
     ||  object_name
     ||' where created_datetime <='
     ||' TRUNC(SYSDATE,'||''''||'YEAR'||''''||');' 
     delete_old_rows
from user_objects
where object_type = 'TABLE';

Execution:

The script in action:

SQL> -- This SQL creates an SQL script to
SQL> -- delete old rows from all tables
SQL> -- in the schema.
SQL> select  'delete from '
  2       ||  object_name
  3       ||' where created_datetime <='
  4       ||'TRUNC(SYSDATE,'||''''||'YEAR'||''''||');'
  5       delete_old_rows
  6  from user_objects
  7  where object_type = 'TABLE';

DELETE_OLD_ROWS
--------------------------------------------------------------------------

delete from REGIONS where created_datetime <= TRUNC(SYSDATE,'YEAR');
delete from COUNTRIES where created_datetime <= TRUNC(SYSDATE,'YEAR');
delete from LOCATIONS where created_datetime <= TRUNC(SYSDATE,'YEAR');
delete from DEPARTMENTS where created_datetime <= TRUNC(SYSDATE,'YEAR');
delete from JOBS where created_datetime <= TRUNC(SYSDATE,'YEAR');
delete from EMPLOYEES where created_datetime <= TRUNC(SYSDATE,'YEAR');
delete from JOB_HISTORY where created_datetime <= TRUNC(SYSDATE,'YEAR');

Run delete_old_rows.sql i.e. the output received from the SQL above. Verify that the result is fine after deletion, then commit.

This approach is perhaps the most intuitive way to delete old data from the entire schema, but massive DELETEs would generate large redo and undo data. If that is a concern, approach 2 would work better.

Approach 2: CTAS-Drop-Rename Script

Steps:

1. Using an SQL script generator, generate script create_slim_tables.sql to create tables via CTAS including only the rows in which CREATED_DATETIME falls in the current year. In the names of all the newly created tables, append a common suffix that does not clash with any of the old table names, say _TMP.
2. Using an SQL script generator, generate script drop_legacy_tables.sql to drop tables that do not have the suffix _TMP.
3. Using an SQL script generator, generate script rename_slim_tables.sql to rename _TMP tables as the original table. Recreate the indexes, constraints and grants on the tables.

Script generators:

The script generator for create_slim_tables.sql:

-- This SQL generates an SQL script to 
-- create _TMP tables with only the
-- latest rows from corresponding tables
-- in the schema.
select  'create table '
     ||  object_name || '_TMP'
     ||' as select * from '
     ||  object_name
     ||' where created_datetime >='
     ||' TRUNC(SYSDATE,'||''''||'YEAR'||''''||');' 
     create_slim_tables
from user_objects
where object_type = 'TABLE';

The script generator for drop_legacy_tables.sql:

-- This SQL generates an SQL script to 
-- drop the legacy tables from the schema.
-- It excludes the newly created 
-- tables ie those that end in _TMP.
select  'drop table '
     ||  object_name 
     || ';'
     drop_legacy_tables
from user_objects
where object_type = 'TABLE'
and object_name not like '%_TMP';

The script generator for rename_slim_tables.sql :

-- This SQL generates an SQL script 
-- to remove the _TMP suffix from the 
-- new  tables in the schema.
select  'rename ' 
     ||  object_name 
     ||' to '
     || substr(object_name,0,length(object_name)-4)
     || ';'
     rename_slim_tables
from user_objects
where object_type = 'TABLE'
and object_name like '%_TMP';

Execution:

The scripts in action – the one to create _TMP tables first. Spool or export the output of the SQL below into file create_slim_tables.sql.

SQL> -- This SQL generates an SQL script to
SQL> -- create _TMP tables with only the
SQL> -- latest rows from corresponding tables
SQL> -- in the schema.
SQL> select  'create table '
  2       ||  object_name || '_TMP'
  3       ||' as select * from '
  4       ||  object_name
  5       ||' where created_datetime >='
  6       ||' TRUNC(SYSDATE,'||''''||'YEAR'||''''||');'
  7       create_slim_tables
  8  from user_objects
  9  where object_type = 'TABLE';

CREATE_SLIM_TABLES
--------------------------------------------------------------------------------

create table REGIONS_TMP as select * from REGIONS where created_datetime >= TRUNC(SYSDATE,'YEAR');
create table COUNTRIES_TMP as select * from COUNTRIES where created_datetime >= TRUNC(SYSDATE,'YEAR');
create table LOCATIONS_TMP as select * from LOCATIONS where created_datetime >= TRUNC(SYSDATE,'YEAR');
create table DEPARTMENTS_TMP as select * from DEPARTMENTS where created_datetime >= TRUNC(SYSDATE,'YEAR');
create table JOBS_TMP as select * from JOBS where created_datetime >= TRUNC(SYSDATE,'YEAR');
create table EMPLOYEES_TMP as select * from EMPLOYEES where created_datetime >= TRUNC(SYSDATE,'YEAR');
create table JOB_HISTORY_TMP as select * from JOB_HISTORY where created_datetime >= TRUNC(SYSDATE,'YEAR');

Run create_slim_tables.sql i.e. the output received from the SQL above. After this step, for each table [X] in the schema, a new table [X]_TMP will get created with the latest records from table [X]. Next, we get rid of the old tables from the legacy database schema i.e. those tables that do not have the suffix _TMP. Spool or export the output of the SQL below into drop_legacy_tables.sql.

SQL> -- This SQL generates an SQL script to
SQL> -- drop the legacy tables from the schema.
SQL> -- It excludes the newly created
SQL> -- tables ie those that end in _TMP.
SQL> select  'drop table '
  2       ||  object_name
  3       || ';'
  4       drop_legacy_tables
  5  from user_objects
  6  where object_type = 'TABLE'
  7  and object_name not like '%_TMP';

DROP_LEGACY_TABLES
-----------------------------------------------

drop table REGIONS;
drop table COUNTRIES;
drop table LOCATIONS;
drop table DEPARTMENTS;
drop table JOBS;
drop table EMPLOYEES;
drop table JOB_HISTORY;

Run drop_legacy_tables.sql i.e. the output received from the SQL above. After this step, all the old tables will be dropped from the schema. Make sure that you have got a copy of the index/constraint/grant creation scripts on the original tables before dropping them. Next, we reset the names of all the _TMP tables to replace their originals. Spool or export the output of the SQL below into rename_slim_tables.sql.

SQL> -- This SQL generates an SQL script
SQL> -- to remove the _TMP suffix from the
SQL> -- new  tables in the schema.
SQL> select  'rename '
  2       ||  object_name
  3       ||' to '
  4       || substr(object_name,0,length(object_name) - 4)
  5       || ';'
  6       rename_slim_tables
  7  from user_objects
  8  where object_type = 'TABLE'
  9  and object_name like '%_TMP';

RENAME_SLIM_TABLES
---------------------------------------

rename REGIONS_TMP to REGIONS;
rename COUNTRIES_TMP to COUNTRIES;
rename LOCATIONS_TMP to LOCATIONS;
rename DEPARTMENTS_TMP to DEPARTMENTS;
rename JOBS_TMP to JOBS;
rename EMPLOYEES_TMP to EMPLOYEES;
rename JOB_HISTORY_TMP to JOB_HISTORY;

Run rename_slim_tables.sql i.e. the output of the SQL above. This will rename every table of the form [X]_TMP to [X]. Finally, run the script to recreate the indexes, constraints and grants.

At this stage we have got what we wanted – we have now replaced the legacy tables with their slimmer versions.

How would you delete all rows except the latest, if there is no timestamp column?

In the absence of a timestamp or sequence-generated column, there is no way Oracle can determine the "latest N rows". You will have to settle for "any N rows".

It is a misconception that filtering by max of ROWNUM or ROWID will give the latest records. ROWNUM is applied to query results and not to data storage — it can be proportional to date of entry only if there is a table column recording the time of entry on which sorting is applied. See an example of why ROWNUM > N never returns a result.

ROWID is the physical address of the row on disk and it can get reused, its value may not have any correlation with the date of entry.

Bottomline: if you want the latest N records, you must have a column on the table that indicates the date of entry. There is no inbuilt sorting by date in the Oracle database.

Leave a Comment

Previous post:

Next post: