Merge multiple lines in a file

Peggy White · ‎10-23-2009

I've hunted and hunted but nothing seems to apply to what I need. Any help will be much appreciated!

My input file looks like (Unix):
marker,allele1,allele2
RS1002244,1,1
RS1002244,1,3
RS1002244,3,3
RS1003719,2,2
RS1003719,2,4
RS1003719,4,4

Most markers are listed 3 times but a few have 3 alleles and are listed more.

An example of a marker with 3 alleles is:
marker,allele1,allele2
RS757210,2,2
RS757210,2,3
RS757210,2,4
RS757210,3,3
RS757210,3,4
RS757210,4,4

I would like to get output like:
marker,allele1,allele2,allele3
RS1002244,1,3,.
RS1003719,2,4,.
RS757210,2,3,4

Everything I've found gives me
RS1002244,1,1,1,3,3,3,
RS1003719,2,2,2,4,4,4,
etc.

Thanks very much in advance, Peggy 10/23

Steven Schweda · ‎10-23-2009

> Everything I've found [...]

Not a very complete description of what
you've tried.

I'd probably write a real program for a job
like this, but I assume that you're trying to
write a shell script.

Incomplete, but possibly useful:

dy # echo ',1,1,1,3,3,3,4,4' | sed -e 's/\(,.\)\1*/\1/g'
,1,3,4

Hein van den Heuvel · ‎10-23-2009

Peggy, please carefully review the example input data.
Are you sure that sample output matches that? Is the 'allele2' column used at all?
Can you re-state the problem with non-identical values in allele1 and allele2?
Like:
RS1003719,2,7
RS1003719,2,8
RS1003719,4,8
Or is it critical that an allele2 value comes back as allele1?
Is the input guaranteed to be sorted?

Anyway... here is some perl which generates the specified output from the specified input, but admittedly I doubt it matches the actual need.

--- x.pl -----------
while (<>) {                                # Go over all input
    $x{"$1 $2"} = 1 if /^(\w+),(\d+),/;     # remember marker and allele1 if found
}
$x{x} = 1;    # end sentinel: any key that sorts after the highest input marker

for (sort keys %x) {                        # go over accumulated markers
    ($marker, $a) = split;
    if ($marker eq $old) {                  # just add column if already seen
        $count++;
        $text .= ','.$a;
    } else {
        $text .= ",." if $count == 2;       # add empty third if need be
        print $text."\n" if $count;         # print except for first 1
        $count = 1;                         # First 1 for this marker
        $old = $marker;
        $text = $marker.','.$a;
    }
}
-------------
Run as: perl x.pl x

fwiw,
Hein.

Dennis Handly · ‎10-23-2009

You appear to be using business logic (domain-specific) terminology and not computer terminology, which deals with foos, bars, fields, keys, records and strings.

>An example of a marker with 3 alleles is:
>marker,allele1,allele2

I'm not sure I see the "3"? Also, is this a title line with a description of the fields?

>RS1002244,1,3,.

It seems you want to collect all of the numbers that occur after the first field (key) and sort unique them?

>Steven: echo ',1,1,1,3,3,3,4,4' | sed -e 's/\(,.\)\1*/\1/g'

Thanks, didn't know you could use \# on the LHS.

Steven Schweda · ‎10-23-2009

> Thanks, didn't know you could use \# on the LHS.

I'd never thought of trying it before, and I
wasn't sure until I had tried it, but there
it is. "man 5 regex" doesn't limit it, and
there's even an example using it that way.
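
To make the backreference trick a little more concrete, here is a minimal sketch. The function name collapse_runs is made up for illustration. It assumes single-character values, as in the sample data, and it only collapses repeats that sit next to each other, so the values would need to be grouped first.

```shell
#!/bin/sh
# collapse_runs: hypothetical helper name, not from the thread.
# \(,.\) captures a comma plus one character; \1* then swallows any
# immediately following copies of that same captured text.
collapse_runs() {
    sed -e 's/\(,.\)\1*/\1/g'
}

echo ',1,1,1,3,3,3,4,4' | collapse_runs   # prints ,1,3,4
echo ',1,3,1' | collapse_runs             # prints ,1,3,1 (repeat is not adjacent)
```

The second call shows the limitation: a value that reappears later in the line is left alone.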

> I'm not sure I see the "3"?

I was guessing that

RS757210,2,2
RS757210,2,3
RS757210,2,4
RS757210,3,3
RS757210,3,4
RS757210,4,4

had the three alleles, 2, 3, and 4 (in
various places), attached to the name
("marker") RS757210.

> You appear to be using business logic
> (domain-specific) terminology [...]

Yup. It pays to watch CSI to keep up on the
latest genetics terminology.

I long ago stopped expecting clear problem
statements in this forum. Hoping for, yes;
expecting, no. (I keep asking, but my
success rate is pretty low.)

Peggy White · ‎10-24-2009

Thanks very much to all! I'm sorry I wasn't clearer.

The column headings can be anything; I'm happy with name, column 2, and column 3.

There are indeed 3 separate values for the 2nd example I give - 2,2 - 2,3 - 2,4; values are 2, 3, and 4.

I would like output that lists each number once for each of the names it goes with. It doesn't matter if it's sorted or not.
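
For reference, the requirement as restated here (each value once per name, order not important) could be sketched in awk roughly as follows. This is only an illustration under a few assumptions: the function name merge_alleles is invented, values are taken to be numeric so that a heading line can be skipped, and short rows are padded with "." up to three value columns as in the desired output.

```shell
#!/bin/sh
# merge_alleles: hypothetical helper name, not from the thread.
# Reads "name,value,value" lines on stdin and prints one line per name
# with each distinct value listed once, in first-seen order.
merge_alleles() {
    awk -F, '
        $2 !~ /^[0-9]+$/ { next }          # skip heading or blank lines (assumes numeric values)
        {
            if (!($1 in line)) {           # first time this name is seen
                order[++n] = $1            # remember the input order of names
                line[$1] = $1
            }
            for (i = 2; i <= NF; i++) {
                if (!seen[$1, $i]++) {     # value is new for this name
                    line[$1] = line[$1] "," $i
                    count[$1]++
                }
            }
        }
        END {
            for (k = 1; k <= n; k++) {
                name = order[k]
                out = line[name]
                for (c = count[name]; c < 3; c++)
                    out = out ",."         # pad to three value columns
                print out
            }
        }
    '
}
```

It would be run as something like "merge_alleles < inputfile", where inputfile stands for the data file. Unlike the sed approach, the input does not have to be sorted, because the values already seen are remembered per name.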

I didn't include any code because I haven't been able to do much. I found one thing on this web page which is what I used for my last example, where all numbers were included. It was from March of this year, and the subject was "Merging lines into one from one file using awk or gawk".

Sorry, Peggy

Peggy White · ‎10-24-2009

Hein's works close to perfection! Thanks so much!

VK2COT · ‎10-24-2009

Hello Peggy,

I know you already got good suggestions.

Here is another one (just to show you
that we are all different :)

#!/usr/bin/perl

use strict;
use warnings;

my %seen = ();
my @MyArr = ();
my @arr = ();
my %myhash;
my %final;

# Read the sample records from the __DATA__ section below
while (<DATA>) {
    chomp $_;
    my @arr = split(/,/, $_);
    push(@MyArr, join ",", $arr[0], $arr[1]);    # keep marker and allele1 only
}

foreach my $elem (@MyArr)
{
    $seen{$elem}++;
    $myhash{$elem} = "($seen{$elem})";
    my @arr = split(/,/, $elem);
    $myhash{$elem} =~ s/\(|\)//g;                # strip the parentheses again
    if ( defined($final{$arr[0]}) ) {
        if ( $myhash{$elem} < 2 ) {              # first time this marker/allele pair is seen
            $final{$arr[0]} = "$final{$arr[0]},$arr[1]";
        }
    }
    else {
        $final{$arr[0]} = "$arr[0],$arr[1]";
    }
}

foreach my $hkey (sort keys %final) {
    my $ff = $final{$hkey} =~ tr/,/,/;           # count the commas
    my $add = q{};
    if ( $ff < 3 ) {
        $add=",.";                               # fewer than three alleles: append a placeholder
    }
    print "$final{$hkey}$add\n";
}

exit(0);

__DATA__
RS1002244,1,1
RS1002244,1,3
RS1002244,2,4
RS1002244,3,3
RS1003719,2,2
RS1003719,2,4
RS1003719,4,4

When you run it, the following comes:

RS1002244,1,3,.
RS1003719,2,4,.
RS757210,2,3,4

Cheers,

VK2COT

VK2COT - Dusan Baljevic

Hein van den Heuvel · ‎10-24-2009

I asked about the input being sorted to some extent.
IF all the rows for a given marker are guaranteed to come together, then the output can be generated as the rows are processed.

For example:

---------------------------------------
while (<>) {                           # Go over all input

    next unless /^(\w+),(\d+),/;       # marker and allele1 number on this line?
    if ($1 eq $old) {                  # just add column if already seen
        next if $allele{$2}++;         # Seen this one already?
        print ",$2";
        $count++;
    } else {
        print ",." if $count == 2;     # add empty third if need be
        print "\n" if $count;          # print except for first 1
        print "$1,$2";
        $old = $1;
        %allele = ($2 => 1);
        $count = 1;
    }
}
print ",." if $count == 2;
print "\n";

---------------

or using an array to build the output line....

--------------

while (<>) {                           # Go over all input

    if ( /^(\w+),(\d+),/ ) {           # marker and allele1 number on this line?
        $marker = $1;
    } else {
        next;
    }
    if ($marker eq $old) {             # just add column if already seen
        next if $allele{$2}++;         # Seen this one already?
        $allele[$count++] = $2;        # Put in list if new.
    } else {
        print join (q(,),$old,@allele),"\n" if $count;   # print except for first 1
        $count = 1;                    # First 1 for this marker
        $old = $marker;
        @allele = ($2, q(.), q(.));    # seed output columns
        %allele = ($2 => 1);           # only one allele value seen so far
    }
}
print join (q(,),$old,@allele),"\n";

--------------
TimTowTdi

enough already!
:-)

Hein.

Peggy White · ‎10-25-2009

3 fantastic answers! Thanks so much. Hopefully I can help someone someday. Peggy

James R. Ferguson · ‎10-25-2009

Hi Peggy:

Though late, I can't resist this Perl variation:

# cat ./myalleles
#!/usr/bin/perl
use strict;
use warnings;
my ( @allele, %locus );
while (<>) {
    chomp;
    @allele = split /,/;
    for my $n ( 1 .. @allele - 1 ) {
        if ( !grep( /$allele[$n]/, @{ $locus{ $allele[0] } } ) ) {
            push( @{ $locus{ $allele[0] } }, $allele[$n] );
        }
    }
}
for my $marker ( sort keys %locus ) {
    @allele = ( @{ $locus{$marker} } );
    print join ',', $marker, ( sort @allele );
    print "\n";
}
1;

...using your input data:

# cat ./myalleles.data
RS1002244,1,1
RS1002244,1,3
RS1002244,3,3
RS1003719,2,2
RS1003719,2,4
RS1003719,4,4
RS757210,2,2
RS757210,2,3
RS757210,2,4
RS757210,3,3
RS757210,3,4
RS757210,4,4

...when run produces:

# ./myalleles ./myalleles.data
RS1002244,1,3
RS1003719,2,4
RS757210,2,3,4

The script reads lines of input. Commas delineate the fields of each line. The first field (zero in Perl) is used as a unique hash key to represent the locus. Alleles are then added into an array of data for each locus as long as the array for the locus under examination does _not_ already contain the allele. This eliminates duplicates immediately.

When all data has been read, the loci are sorted and listed with the alleles of each in sorted order.

Regards!

...JRF...

Peggy White · ‎10-25-2009

Great to have so much to play with!

James R. Ferguson · ‎10-25-2009

Hi (again) Peggy:

Please make a one-line correction to my script to ensure correct matching.

Change the line:

if ( !grep( /$allele[$n]/, @{ $locus{ $allele[0] } } ) ) {

TO:

if ( !grep( /\b$allele[$n]\b/, @{ $locus{ $allele[0] } } ) ) {

This makes matching exact rather than fuzzy. For example, the original line would have treated a value of "2" as a match to "22" (etc) which is erroneous, and thus "2" would not have been added to a list already containing "22" --- not what we wanted.
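
The same pitfall can be seen outside Perl. As a rough illustration only, here it is with the ordinary shell grep rather than Perl's grep; the helper names contains_fuzzy and contains_exact are invented for this sketch. Each reads a list of values, one per line, and reports through its exit status whether the argument is present.

```shell
#!/bin/sh
# contains_fuzzy / contains_exact: hypothetical helper names, not from the thread.
contains_fuzzy() { grep -q -- "$1"; }         # pattern match anywhere: "2" is found inside "22"
contains_exact() { grep -q -x -F -- "$1"; }   # whole-line, fixed-string comparison

printf '22\n4\n' | contains_fuzzy 2 && echo "fuzzy: 2 found"
printf '22\n4\n' | contains_exact 2 || echo "exact: 2 not found"
```

With the list holding 22 and 4, the fuzzy test claims that 2 is already present, so it would wrongly be left out; the exact test does not.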

Regards!

...JRF...

Peggy White · ‎10-25-2009

Thanks again, and have a great week!
