Operating System - HP-UX

Merge multiple lines in a file

 
SOLVED
Peggy White
Occasional Advisor

Merge multiple lines in a file

I've hunted and hunted but nothing seems to apply to what I need. Any help will be much appreciated!

My input file looks like (Unix):
marker,allele1,allele2
RS1002244,1,1
RS1002244,1,3
RS1002244,3,3
RS1003719,2,2
RS1003719,2,4
RS1003719,4,4

Most markers are listed 3 times but a few have 3 alleles and are listed more.

An example of a marker with 3 alleles is:
marker,allele1,allele2
RS757210,2,2
RS757210,2,3
RS757210,2,4
RS757210,3,3
RS757210,3,4
RS757210,4,4

I would like to get output like:
marker,allele1,allele2,allele3
RS1002244,1,3,.
RS1003719,2,4,.
RS757210,2,3,4

Everything I've found gives me
RS1002244,1,1,1,3,3,3,
RS1003719,2,2,2,4,4,4,
etc.

Thanks very much in advance, Peggy 10/23
13 REPLIES
Steven Schweda
Honored Contributor

Re: Merge multiple lines in a file

> Everything I've found [...]

Not a very complete description of what
you've tried.

I'd probably write a real program for a job
like this, but I assume that you're trying to
write a shell script.

Incomplete, but possibly useful:

dy # echo ',1,1,1,3,3,3,4,4' | sed -e 's/\(,.\)\1*/\1/g'
,1,3,4
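
Editor's note: the one-liner above only collapses repeats that sit next to each other, and it assumes single-character values. The sketch below wraps the same sed expression in a function (the name collapse_adjacent is made up for illustration) and runs it on two merged lines of the kind Peggy describes, to show where it works and where it stops short.

```shell
#!/bin/sh
# collapse_adjacent: hypothetical wrapper around the sed expression from the post.
# \(,.\) captures a comma plus ONE character; \1* swallows immediate repeats of it.
collapse_adjacent() {
    sed -e 's/\(,.\)\1*/\1/g'
}

# Repeats are adjacent, so every duplicate is removed.
printf '%s\n' 'RS1002244,1,1,1,3,3,3' | collapse_adjacent
# prints: RS1002244,1,3

# Repeats are interleaved, so duplicates of 2, 3 and 4 survive.
printf '%s\n' 'RS757210,2,2,2,3,2,4,3,3,3,4,4,4' | collapse_adjacent
# prints: RS757210,2,3,2,4,3,4
```

For the three-allele marker, the values would need to be sorted before the collapse, and values longer than one character are not safe with this pattern (for instance ",1,12" comes out as ",12").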
Hein van den Heuvel
Honored Contributor
Solution

Re: Merge multiple lines in a file

Peggy, please carefully review the example input data.
Are you sure that sample output matches that? Is the 'allele2' column used at all?
Can you re-state the problem with non-identical values in allele1 and allele2?
Like:
RS1003719,2,7
RS1003719,2,8
RS1003719,4,8
Or is it critical that an allele2 value comes back as allele1?
Is the input guaranteed to be sorted?

Anyway... here is some perl which generates the specified output from the specified input, but admittedly I doubt it matches the actual need.

--- x.pl -----------
while (<>) {                                # Go over all input
    $x{"$1 $2"} = 1 if /^(\w+),(\d+),/;     # remember marker and allele1 if found
}
$x{x} = 1;  # end sentinel: any key that sorts after the highest input marker

for (sort keys %x) {                        # go over accumulated markers
    ($marker, $a) = split;
    if ($marker eq $old) {                  # same marker: just add a column
        $count++;
        $text .= ','.$a;
    } else {
        $text .= ",." if $count == 2;       # add empty third column if need be
        print $text."\n" if $count;         # print, except before the first marker
        $count = 1;                         # first value for this marker
        $old = $marker;
        $text = $marker.','.$a;
    }
}
-------------
Run as : perl x.pl x

fwiw,
Hein.
Dennis Handly
Acclaimed Contributor

Re: Merge multiple lines in a file

You appear to be using business logic (domain-specific) terminology and not computer terminology, which deals with foos, bars, fields, keys, records and strings.

>An example of a marker with 3 alleles is:
>marker,allele1,allele2

I'm not sure I see the "3"? Also, is this a title line with a description of the fields?

>RS1002244,1,3,.

It seems you want to collect all of the numbers that occur after the first field (key) and sort unique them?

>Steven: echo ',1,1,1,3,3,3,4,4' | sed -e 's/\(,.\)\1*/\1/g'

Thanks, didn't know you could use \# on the LHS.
Steven Schweda
Honored Contributor

Re: Merge multiple lines in a file

> Thanks, didn't know you could use \# on the LHS.

I'd never thought of trying it before, and I
wasn't sure until I had tried it, but there
it is. "man 5 regex" doesn't limit it, and
there's even an example using it that way.

> I'm not sure I see the "3"?

I was guessing that

RS757210,2,2
RS757210,2,3
RS757210,2,4
RS757210,3,3
RS757210,3,4
RS757210,4,4

had the three alleles, 2, 3, and 4 (in
various places), attached to the name
("marker") RS757210.

> You appear to be using business logic
> (domain-specific) terminology [...]

Yup. It pays to watch CSI to keep up on the
latest genetics terminology.


I long ago stopped expecting clear problem
statements in this forum. Hoping for, yes;
expecting, no. (I keep asking, but my
success rate is pretty low.)
Peggy White
Occasional Advisor

Re: Merge multiple lines in a file

Thanks very much to all! I'm sorry I wasn't clearer.

The column headings can be anything; I'm happy with name, column 2, and column 3.

There are indeed 3 separate values for the 2nd example I give - 2,2 - 2,3 - 2,4; values are 2, 3, and 4.

I would like output that lists each number once for each of the names it goes with. It doesn't matter if it's sorted or not.

I didn't include any code because I haven't been able to do much. I found one thing on this web page which is what I used for my last example, where all numbers were included. It was from March of this year, and the subject was "Merging lines into one from one file using awk or gawk".

Sorry, Peggy
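
Editor's note: the requirement as restated above (each number listed once per name, order not important) can be sketched in awk, since the earlier thread Peggy mentions used awk. This is only an illustration under stated assumptions: the function name merge_alleles is invented, values are collected from both the 2nd and 3rd columns (the Perl replies in this thread read only the 2nd), any line whose first field is "marker" is treated as a header, and output is padded with "." up to three value columns.

```shell
#!/bin/sh
# merge_alleles: hypothetical helper, not part of any reply in this thread.
# Reads marker,allele1,allele2 rows and prints one line per marker with
# each distinct value once, in first-seen order.
merge_alleles() {
    awk -F, '
        NF < 3 || $1 == "marker" { next }      # skip blank lines and header lines
        {
            if (!($1 in count)) {              # first time this marker appears
                order[++n] = $1
                count[$1] = 0
            }
            for (i = 2; i <= NF; i++) {        # look at every value column
                if (!(($1, $i) in seen)) {     # value is new for this marker
                    seen[$1, $i] = 1
                    list[$1] = list[$1] "," $i
                    count[$1]++
                }
            }
        }
        END {
            print "marker,allele1,allele2,allele3"
            for (k = 1; k <= n; k++) {
                m = order[k]
                line = m list[m]
                for (c = count[m]; c < 3; c++) # pad missing columns with "."
                    line = line ",."
                print line
            }
        }
    ' "$@"
}
```

It could be called as `merge_alleles input.csv` (the file name is a placeholder) or fed from a pipe. Because it keeps a per-marker set in memory, the input does not need to be sorted or grouped.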
Peggy White
Occasional Advisor

Re: Merge multiple lines in a file

Hein's works close to perfection! Thanks so much!
VK2COT
Honored Contributor

Re: Merge multiple lines in a file

Hello Peggy,

I know you already got good suggestions.

Here is another one (just to show you
that we are all different :)

#!/usr/bin/perl

use strict;
use warnings;

my %seen = ();
my @MyArr = ();
my @arr = ();
my %myhash;
my %final;

# Read the rows from the __DATA__ section below,
# keeping only "marker,allele1" from each line.
while (<DATA>) {
    chomp $_;
    my @arr = split(/,/, $_);
    push(@MyArr, join ",", $arr[0], $arr[1]);
}

foreach my $elem (@MyArr)
{
    $seen{$elem}++;
    $myhash{$elem} = "($seen{$elem})";
    my @arr = split(/,/, $elem);
    $myhash{$elem} =~ s/\(|\)//g;
    if ( defined($final{$arr[0]}) ) {
        # Append the value only the first time this marker/value pair is seen.
        if ( $myhash{$elem} < 2 ) {
            $final{$arr[0]} = "$final{$arr[0]},$arr[1]";
        }
    }
    else {
        $final{$arr[0]} = "$arr[0],$arr[1]";
    }
}

foreach my $hkey (sort keys %final) {
    my $ff = $final{$hkey} =~ tr/,/,/;    # count the commas in the line
    my $add = q{};
    if ( $ff < 3 ) {
        $add=",.";                        # pad to a third value column
    }
    print "$final{$hkey}$add\n";
}

exit(0);

__DATA__
RS1002244,1,1
RS1002244,1,3
RS1002244,3,3
RS1003719,2,2
RS1003719,2,4
RS1003719,4,4
RS757210,2,2
RS757210,2,3
RS757210,2,4
RS757210,3,3
RS757210,3,4
RS757210,4,4

When you run it, the following comes out:

RS1002244,1,3,.
RS1003719,2,4,.
RS757210,2,3,4

Cheers,

VK2COT
VK2COT - Dusan Baljevic
Hein van den Heuvel
Honored Contributor

Re: Merge multiple lines in a file

I asked about the input being sorted to some extent.
If all the rows for a given marker are guaranteed to come together, then the output can be generated as the rows are processed.

For example:

---------------------------------------
while (<>) {                          # Go over all input
    next unless /^(\w+),(\d+),/;      # marker and allele1 number on this line?
    if ($1 eq $old) {                 # same marker: just add a column
        next if $allele{$2}++;        # seen this allele already?
        print ",$2";
        $count++;
    } else {
        print ",." if $count == 2;    # pad the previous marker to three columns
        print "\n" if $count;         # end the previous line, except before the first marker
        print "$1,$2";                # start the line for the new marker
        $old = $1;
        %allele = ($2 => 1);
        $count = 1;
    }
}
print ",." if $count == 2;
print "\n";

---------------

or using an array to build the output line....

--------------

while (<>) {                                  # Go over all input
    if ( /^(\w+),(\d+),/ ) {                  # marker and allele1 number on this line?
        $marker = $1;
    } else {
        next;
    }
    if ($marker eq $old) {                    # same marker: add a column if the allele is new
        next if $allele{$2}++;                # Seen this one already?
        $allele[$count++] = $2;               # Put in list if new.
    } else {
        print join (q(,),$old,@allele),"\n" if $count;   # print previous marker, except before the first
        $count = 1;                           # First value for this marker
        $old = $marker;
        @allele = ($2, q(.), q(.));           # seed output columns
        %allele = ($2 => 1);                  # only one allele value seen so far
    }
}
print join (q(,),$old,@allele),"\n";

--------------
TIMTOWTDI

enough already!
:-)

Hein.
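
Editor's note: both streaming variants above depend on all rows for a marker arriving together. If that cannot be guaranteed, one option is to sort on the first field before streaming. The sketch below shows that idea in sh and awk rather than Perl; the function name stream_merge is invented, both value columns are read (an assumption, the Perl above reads only the first), lines whose first field is "marker" are dropped as headers, and no header is printed.

```shell
#!/bin/sh
# stream_merge: hypothetical helper. Sorts by marker so rows are grouped,
# then emits each marker as soon as the next one starts.
stream_merge() {
    LC_ALL=C sort -t, -k1,1 "$@" | awk -F, '
        function emit(   c) {                  # print the finished marker, if any
            if (old == "") return
            for (c = count; c < 3; c++)        # pad to three value columns
                line = line ",."
            print line
        }
        NF < 3 || $1 == "marker" { next }      # skip blank lines and header lines
        $1 != old {                            # a new marker begins
            emit()
            old = $1; line = $1; count = 0
            split("", seen)                    # clear the per-marker set
        }
        {
            for (i = 2; i <= NF; i++)
                if (!($i in seen)) {           # value is new for this marker
                    seen[$i] = 1
                    line = line "," $i
                    count++
                }
        }
        END { emit() }
    '
}
```

Only one marker's values are held in memory at a time, at the cost of the sort. The sort also means values appear in sorted row order rather than original file order, which Peggy said does not matter.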
Peggy White
Occasional Advisor

Re: Merge multiple lines in a file

3 fantastic answers! Thanks so much. Hopefully I can help someone someday. Peggy