# Duplicate Records in a Unix file



## cgjoker

Hi there.

I have a file with a unique id in the first 30 characters.

I want to identify duplicate records, i.e. cases where the same unique id appears more than once in the file.

Can someone help with this?

So I'm thinking something like:

sort -u | ?? > duplicate_ids.log

thanks in advance.


P.S. What would be even handier is if I could enter a range of positions in the file to check, so for example, say I want positions 10 to 35 checked for dups, I would enter:

script.ksh 10 35


----------



## bpmurray

Assuming you're using a Unix sort, either on Unix or under Cygwin, you can do the following:

First, to sort according to a key, you can use:
* sort --key=start[,end]

To sort, removing duplicates, use:
* sort -u

To check an already-sorted file, reporting the first duplicate (or out-of-order line) it finds, use:
* sort -c -u

If you actually want to list duplicates, I'd use a shell script that first sorts the data, and then uses AWK to identify duplicates.
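
For example, a minimal sketch of that sort-then-awk approach (throwaway sample data stands in for the real file, and the key here is just the first 2 characters rather than 30):

```shell
# Sort on the key, then let awk print any record whose key matches
# the key of the previous (sorted) record.
printf '11 a\n22 b\n22 c\n33 d\n' |
sort -k 1.1,1.2 |
awk '{
    key = substr($0, 1, 2)     # for the real file: substr($0, 1, 30)
    if (key == prev) print     # same key as the line before: a dup
    prev = key
}'
```

which prints the second and later occurrences of each duplicated key (here, "22 c"). For the real file, the sort key would be 1.1,1.30 and the substr() width 30.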


----------



## cgjoker

Hey there...

I have got this going, I think, but it doesn't do exactly what I was hoping for...

sort -d -f -u ${file} > ${LOG}

So basically it creates another file that is sorted and has the dups removed.

I'd rather create a file that identifies only the dups, rather than what it's doing now, which is the opposite.


----------



## bpmurray

Well, try a couple of scripts like:


Code:


sort -d -f -k 1.${start},1.${end} ${file} | awk -v pos1=${start} pos2=${end} -f foo.awk -

and the awk script is something like:


Code:


BEGIN {
    firstRec = 1;
    lastKey = "";
    posn1 = pos1;
    if (pos2 > pos1)
        slen = pos2 - pos1 + 1;
    else
        slen = 0;
}
{
    if (firstRec != 1) {
        if ((slen == 0 && $0 == lastKey) ||
            (slen > 0 && substr($0, posn1, slen) == lastKey)) {
            print "Record #" NR ": " $0;
        }
    }
    if (slen > 0)
        lastKey = substr($0, posn1, slen);
    else
        lastKey = $0;
    firstRec = 0;
}


----------



## cgjoker

Thanks...

I tried the first one:

#!/bin/sh

file=$1
start=$2
end=$3
sort -d -f -k 1.${start},1.${end} ${file} | awk -v pos1=${start} pos2=${end} -f foo.awk -

and got this error:

[user:] dups.ksh dupfile.dat 1 29
awk: 0602-533 Cannot find or open file -f.
The source line number is 1.

I put the awk script into foo.awk and placed the awk script into the same directory as my dups.ksh script.

any ideas?


----------



## cgjoker

Sorry, I think I need a little more direction with this one.


----------



## bpmurray

Oops - typo: It should read: 

awk -v pos1=${start} -v pos2=${end} -f foo.awk -

i.e. there should be an extra "-v" before pos2=

Apologies!
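
With both -v flags in place, a self-contained version of the whole pipeline (awk inlined here rather than in foo.awk, with throwaway sample data and a 2-character key) would be:

```shell
printf '11 a\n22 b\n22 c\n33 d\n' |
sort -d -f -k 1.1,1.2 |
awk -v pos1=1 -v pos2=2 '
    { key = substr($0, pos1, pos2 - pos1 + 1) }            # extract the key
    NR > 1 && key == prev { print "Record #" NR ": " $0 }  # dup of previous
    { prev = key }'
```

On this sample it reports "Record #3: 22 c", the second record carrying the key "22".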


----------



## alphastode

Hi:

Your script only displays the records which have a duplicate in the string specified with start and end, but what needs to be done to also include the records which are not duplicated, so that the output has all the unique records?

like if input file has data:

11
22
22
33

then the output should be:
11
22
33

and currently your script is giving only:

22

Thanks in Advance

Alpha


----------



## bpmurray

The point of the awk script is to catch the dups. To only output unique recs, use:


Code:


sort -d -f -u -k 1.${start},1.${end} ${file}
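
Against the sample data from the previous post, that does give one line per distinct record (a quick check, without the -k key since the sample lines are the keys):

```shell
printf '11\n22\n22\n33\n' | sort -d -f -u
```

which prints 11, 22 and 33, one copy of each.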


----------



## Squashman

bpmurray said:


> The point of the awk script is to catch the dups. To only output unique recs, use:
> 
> 
> Code:
> 
> 
> sort -d -f -u -k 1.${start},1.${end} ${file}


Hey bpmurray,
Could he just sort the file first and then pipe it to uniq, or is that the same as what you are doing with the sort options?

sort -d -f | uniq -w30


----------



## bpmurray

The *-u* param to the sort means "unique", so it's the same thing really.
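
One related option worth noting: uniq can also print *only* the duplicated lines with -d, and -w limits the comparison to the first N characters (though -w is a GNU uniq extension, so it may not be available everywhere; the awk approach earlier in the thread is the portable route). A sketch on sample data, with a 2-character key:

```shell
# keep only lines whose first 2 characters repeat in the sorted input;
# uniq -d prints one representative line per duplicated group
printf '11 a\n22 b\n22 c\n33 d\n' |
sort |
uniq -w2 -d
```

which prints "22 b", the first line of the one duplicated group.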


----------

