Operating System - HP-UX

Edgar_8
Regular Advisor

duplicate record extraction before sqlldr

Hi,

We have comma-separated files with thousands of records that need to be loaded into an Oracle 9i table
with a primary key column. Is there a way we could search through the file at the Unix level on the
primary key value, deleting duplicate records before the file is sql loaded into the db? Each record in the
CSV file has 200 fields, with the 49th field being the primary key, so the Unix search should remove
duplicate records but also allow us to extract columns 1-50. Any ideas?

Thanks in advance!
8 REPLIES
Graham Cameron_1
Honored Contributor

Re: duplicate record extraction before sqlldr

Seems to me it would be far easier to let Oracle do the work.
Not only will it reject duplicates, it will also write them to a failure file.

If you must pre-process it at the Unix level, you could try identifying the duplicates with awk.
You may run into "line too long" limits, though.

Something like (not tested):

awk -F"," '
{ if ($49 in alreadyfound) {
printf ("Rejecting line %d: %s\n", $NR, $0) >> "DUPFILE"
next
}
print
alreadyfound[$49]=1
} ' YOURINPUTFILE > YOURUNIQUEFILE

-- Graham

Computers make it easier to do a lot of things, but most of the things they make it easier to do don't need to be done.
Indira Aramandla
Honored Contributor

Re: duplicate record extraction before sqlldr

Hi,

By definition a primary key is NOT NULL and UNIQUE. So when you try to load the file using sqlldr, it will reject all the duplicates, and the log will show you clearly how many records were loaded and how many were rejected. There will be a bad file containing all the duplicate records.

You need not manually edit the CSV file to eliminate the duplicate records, because the sqlldr utility does it automatically for you.
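
To illustrate (this is only a sketch; the control file, table, and column names below are invented, not taken from this thread), a minimal conventional-path load might look like this, with the rejected duplicates ending up in the bad file:

-- load.ctl (hypothetical control file)
LOAD DATA
INFILE 'data.csv'
BADFILE 'data.bad'
APPEND
INTO TABLE target_table
FIELDS TERMINATED BY ','
(col1, col2, col3)

sqlldr userid=scott/tiger control=load.ctl log=load.log

The log then reports the loaded and rejected row counts, and every row that violated the primary key ends up in data.bad.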

IA
Never give up, Keep Trying
Edgar_8
Regular Advisor

Re: duplicate record extraction before sqlldr

Hi Graham,

We are having performance issues with the current sqlldr load; it is simply too slow! So we are looking at
doing as much of the work up front as possible. The CSV data is being loaded in conventional mode into a
table which has a primary key and 3 indexes.
Any ideas how we could improve performance and identify what is affecting it?

Thanks in advance!
Sanjay Kumar Suri
Honored Contributor

Re: duplicate record extraction before sqlldr

Hello Edgar

Some guidelines are:

Preallocate the space based on expected data volume to prevent dynamic allocation of extents during the load.

If direct path loads are used, temporary segments are used to generate the indexes for the new data. These indexes are merged with the existing indexes at the end of the load.

Sorting the input data on the key fields of the largest index can also help improve performance (see the sketch below).
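
For example, building on the record layout described in this thread (the key in field 49), the pre-sort could be done at the Unix level and a direct path control file could then declare the ordering with SORTED INDEXES; the index name pk_idx and the table and column names are made up for illustration:

# pre-sort the CSV on the key field before running sqlldr
sort -t, -k49,49 input.csv -o input_sorted.csv

-- control file fragment (SORTED INDEXES applies to direct path loads only)
LOAD DATA
INFILE 'input_sorted.csv'
APPEND
INTO TABLE target_table
SORTED INDEXES (pk_idx)
FIELDS TERMINATED BY ','
(col1, col2, col3)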

sks

A rigid mind is very sure, but often wrong. A flexible mind is generally unsure, but often right.
Jean-Luc Oudart
Honored Contributor

Re: duplicate record extraction before sqlldr

Edgar,

The fastest way to load the data with sqlldr is direct path mode.
As you may have duplicates, you must get rid of them before the load itself.
If you have a sort tool such as SyncSort, you can easily filter the data and remove the duplicate records.
Furthermore, you can pre-sort the data on the primary key and speed up the load.

Also, with the UNRECOVERABLE option the load will not generate redo information.
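
A rough sketch of how that could be put together (the connect string, file, and table names are placeholders, not from this thread):

-- load_direct.ctl (UNRECOVERABLE is honoured for direct path loads)
UNRECOVERABLE
LOAD DATA
INFILE 'input_sorted.csv'
APPEND
INTO TABLE target_table
FIELDS TERMINATED BY ','
(col1, col2, col3)

sqlldr userid=user/password control=load_direct.ctl direct=true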

Regards,
Jean-Luc
fiat lux
Steve Lewis
Honored Contributor

Re: duplicate record extraction before sqlldr

In awk your problem is likely to be the record length being too long, so I also agree that it may be better to let Oracle do the work for you. Alternatively, you could chop the first 50 fields out before searching and sorting.

If the total record length is short, you could try it in HP-UX using this tested script:

cut -d, -f1-50 unlfile.csv | \
sort -t, -k 49,49 | \
awk -F, '{if($49!=prev)print;prev=$49}' > dupfree.csv

Brian_274
Frequent Advisor

Re: duplicate record extraction before sqlldr

You say that you have 1 primary key and 3 indexes, and you're using the primary key to filter out duplicate records. So why do you need the other 3 indexes during the load? Get rid of them before the load, then recreate them afterwards. Having only the primary key index in place will make the load go a lot faster.
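
Sketched out in SQL (the index, table, and column names here are invented for illustration):

-- before the load: drop the secondary indexes
DROP INDEX idx_a;
DROP INDEX idx_b;
DROP INDEX idx_c;

-- run the sqlldr load here, with only the primary key index in place

-- after the load: recreate them
CREATE INDEX idx_a ON target_table (col_a);
CREATE INDEX idx_b ON target_table (col_b);
CREATE INDEX idx_c ON target_table (col_c);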
Arturo Galbiati
Esteemed Contributor

Re: duplicate record extraction before sqlldr

This command will remove all but one of the records with a duplicated key in field 49:

sort -u -t, -k49,49 input_file -o output_file

HTH
Rgds,
Art