I have a text file (about 2000 lines) from which I want to remove duplicate lines without losing the original order.

I found out that I can use awk for this. To make it a bit handier I wrote this little script. Unfortunately it is noticeably slower than using awk directly from the command line. Any ideas for improvements?

The whole point is that I want to type "uniqq" instead of "awk '!x[$0]++'", which is harder for me to remember.

#!/bin/bash
# filename: uniqq

        # uniq replacement. no post sort is needed. the original order persists

        read i
        while read line; do
                i=$i$'\n'$line
        done
        echo "$i" | awk '!x[$0]++'

Refactorings


Marco Valtas

February 8, 2009, 01:01

Hi, this problem reminded me of my time in bioinformatics. We often had problems like yours to solve, but instead of 2000 lines the files had far more, and performance was always a concern. I couldn't tell whether you want the file sorted or not. With a tool like uniq you will probably have to sort the file before removing any duplicates, because uniq only drops adjacent repeated lines.

My solution is not in bash, since in bioinformatics we use a lot of Perl. As you mentioned tools like bash and awk, I'm guessing you have Perl available too. I will also assume that you don't want to sort the file, just remove the duplicates. As you will see, you could add sorting with a couple of extra lines in the solution.

Hope this helps.

# Here are the steps.
# headers.dmp is a file of protein headers; here's a sample:

[mavcunha@strongcoffee tmp]$ head headers.dmp 
>Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen precursor - Theileria annulata
>P15711|104K_THEPA 104 kDa microneme/rhoptry antigen precursor - Theileria parva
>Q43495|108_SOLLC Protein 108 precursor - Solanum lycopersicum (Tomato) (Lycopersicon esculentum)
>P18646|10KD_VIGUN 10 kDa protein precursor - Vigna unguiculata (Cowpea)
>P13813|110KD_PLAKN 110 kDa antigen - Plasmodium knowlesi
>Q9XHP0|11S2_SESIN 11S globulin seed storage protein 2 precursor - Sesamum indicum (Oriental sesame) (Gingelly)
>P19084|11S3_HELAN 11S globulin seed storage protein G3 precursor - Helianthus annuus (Common sunflower)
>P13744|11SB_CUCMA 11S globulin subunit beta precursor [Contains: 11S globulin gamma chain - Cucurbita maxima (Pumpkin) (Winter squash)
>P32234|128UP_DROME GTP-binding protein 128up - Drosophila melanogaster (Fruit fly)
>P21215|12AH_CLOS4 12-alpha-hydroxysteroid dehydrogenase - Clostridium sp. (strain C 48-50)

# The original file has 389046 lines.
[mavcunha@strongcoffee tmp]$ wc -l headers.dmp 
  389046 headers.dmp

# and all of these lines are unique.
[mavcunha@strongcoffee tmp]$ cat headers.dmp | sort | uniq  | wc -l 
  389046

# I created a little script (code omitted) to mess with the file and create some duplicate lines.
[mavcunha@strongcoffee tmp]$ perl create_dups.pl headers.dmp > headers_dup.dmp 

# and the new file has more lines..
[mavcunha@strongcoffee tmp]$ wc -l headers_dup.dmp 
  502917 headers_dup.dmp

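# Here is the Perl code to remove duplicate lines. It's a simple algorithm:
#
# read a line,
# check whether the line already exists in an associative array.
# if it does, go to the next line.
# if not, store it in the associative array and print the line.

#!/usr/bin/perl
use strict;
use warnings;

@ARGV or die "usage: $0 FILE\n";
my $file = shift;
open(my $fh, '<', $file) or die "Cannot open $file: $!";
my %lines;                          # lines printed so far
while (my $line = <$fh>) {
    next if exists $lines{$line};   # duplicate: skip it
    $lines{$line} = 1;
    print $line;
}
close $fh;

# To run it, just pass the filename and redirect the output.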
[mavcunha@strongcoffee tmp]$ perl remove_dup.pl headers_dup.dmp > headers_clean.dmp

# now let's check if this file is really clean
# first let's count the lines

[mavcunha@strongcoffee tmp]$ wc -l headers_clean.dmp 
  389046 headers_clean.dmp

# seems the same, what about a diff with the original?
[mavcunha@strongcoffee tmp]$ diff headers.dmp headers_clean.dmp 
[mavcunha@strongcoffee tmp]$ 

# I think we've done it.
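
For comparison, the awk one-liner from the question follows the same algorithm as the Perl script: remember every line in an associative array and print a line only the first time it shows up. A spelled-out sketch of that idea is below; the names `dedup_keep_order` and `seen` are my own choices for illustration and do not come from the thread.

```shell
# Spelled-out equivalent of: awk '!x[$0]++'
# "dedup_keep_order" and "seen" are illustrative names only.
dedup_keep_order() {
    awk '
        !($0 in seen) {    # first time this exact line appears
            seen[$0] = 1   # remember it
            print          # emit it; later copies are skipped
        }
    ' "$@"
}

# Small demo: duplicates are dropped, first occurrences keep their order.
printf 'b\na\nb\nc\na\n' | dedup_keep_order
# prints b, a, c (one per line)
```

Because the array holds every distinct line, memory use grows with the number of unique lines, which is the same trade-off the Perl version makes.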

Eineki

February 8, 2009, 02:06

1 rating.

Your problem is the read loop. read is a bash builtin, not an external program, but the loop is still slow: bash handles the input one line at a time and rebuilds the whole string on every iteration, and awk cannot start until the loop has finished.

You should gain a lot of speed by passing standard input to awk with a single command: cat - (the - stands for standard input).

Try it and let me know if it helped.

#!/bin/bash
cat - | awk '!x[$0]++'
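
As a possible further simplification, awk reads standard input by itself when it is given no file operand, so the script should also work without cat. The sketch below writes such a variant into a scratch directory and tries it; the `mktemp -d` location and the `"$@"` pass-through for file names are my assumptions, not part of the reply above.

```shell
# Create a cat-free variant of the script in a scratch directory and try it.
dir=$(mktemp -d)
cat > "$dir/uniqq" <<'EOF'
#!/bin/bash
# awk reads standard input when no file is named;
# "$@" passes any file names given to the script through to awk.
exec awk '!x[$0]++' "$@"
EOF
chmod +x "$dir/uniqq"

printf 'one\ntwo\none\nthree\ntwo\n' | "$dir/uniqq"
# prints one, two, three (one per line)
```

With `exec`, the shell is replaced by awk instead of waiting for it, so nothing but awk itself runs once the script has started.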

ck01.myopenid.com

February 8, 2009, 11:02

Thanks a lot, Eineki. That's it! Your solution is almost as fast as calling awk directly. Perfect :o)

ck01.myopenid.com

February 8, 2009, 11:25

I was wondering whether this could also be done with "alias". Unfortunately I couldn't get it to work.

One try was this:

h0mer:~# alias uniqq='awk \'!x[$0]++\''
-bash: !x[$0]++\: event not found

mightybs

February 13, 2009, 13:35

In bash the necessary quoting can be tricky:

alias uniqq='awk '\''!x[$0]++'\'''
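
The alias works because a backslash has no special meaning inside single quotes: each `'\''` closes the quoted string, adds one escaped literal quote, and opens a new quoted string. If that splicing feels fragile, a shell function is one way around it. This is only a sketch, assuming the definition goes into a startup file such as ~/.bashrc, and the `"$@"` pass-through for file names is my addition.

```shell
# A shell function as an alternative to the alias.
# The awk program sits in ordinary single quotes, so no quote splicing is needed.
uniqq() {
    awk '!x[$0]++' "$@"
}

printf 'a\nb\na\n' | uniqq
# prints a, b (one per line)
```

Unlike an alias, the function is also available in non-interactive bash scripts that source the file defining it.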

ck01.myopenid.com

February 14, 2009, 16:45

Thanks. I will read up on quoting in bash ;-)
