Merging files in different formats

Peggy White
Occasional Advisor

Hello all, and I'm sure that's not the best subject line to use, but here goes.

I have 3 files, one for each of 3 different chromosomes. Each of the 3 files looks like this (with different SNPs):
SNP Al1 Al2
rs7342690 C A
rs11862844 A C
rs4021617 A G

I have 3 other files (for the same 3 chromosomes) that all have the same number of lines: 2,457, one per individual. Each line looks like:
C005->C005-000 ML_DOSE 1.178 1.177 1.333 1.782 0.225 0.437 0.586 1.999 2.000 0.523 ....

Each of the 2,457 lines starts with a four-character family id pointing (->) to a study id. Each value in this file after ML_DOSE corresponds (in order) to the SNPs in the first file.
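
A minimal Python sketch of that record layout, assuming the fields are whitespace-separated and that the text before `->` is the family id and the text after it is the study id. The function name `parse_dose_line` is made up for illustration, and the sample line is shortened to four values.

```python
def parse_dose_line(line):
    """Split one dosage record into (family_id, study_id, values)."""
    fields = line.split()
    # The first field looks like "C005->C005-000": family id, "->", study id.
    family_id, study_id = fields[0].split("->", 1)
    # fields[1] is the literal ML_DOSE label; the dosage values follow it.
    values = [float(v) for v in fields[2:]]
    return family_id, study_id, values


# Shortened sample line (assumption: the real lines carry one value per SNP).
sample = "C005->C005-000 ML_DOSE 1.178 1.177 1.333 1.782"
family_id, study_id, values = parse_dose_line(sample)
print(family_id, study_id, values)
```

The n-th entry of `values` would then line up with the n-th SNP row of the matching SNP file, not counting that file's header line.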

I need to match up the values in the file containing study ids to the file containing the SNPs. There are 577,282 SNPs in the first file (I need to match up 5), 598,112 SNPs in the second file (I need to match up 2), and 263,830 in the third file (I need to match up 4).

I've done this before but only for 1 study id, so it was just a matter of grepping out the ID, replacing blank spaces with \n in sed, and pasting the 2 files together. I can't figure out how to do this with 2,457 people though.
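
For reference, the single-study-id procedure described above could be sketched in Python roughly as follows. This is only an illustration: the function name `pair_for_study`, the second study id `C006-000`, and the shortened value lists are made-up sample data, not taken from the real files.

```python
def pair_for_study(study_id, snp_lines, dose_lines):
    """Pair each SNP name with the dosage value for one study id."""
    # SNP names are the first field of every line after the header.
    snps = [line.split()[0] for line in snp_lines[1:]]
    for line in dose_lines:
        fields = line.split()
        # The "grep" step: the study id follows "->" in the first field.
        if fields[0].split("->", 1)[1] == study_id:
            # The "sed + paste" step: one value per SNP, in order.
            return list(zip(snps, fields[2:]))
    return []  # study id not present in this file


# Made-up sample input in the layout from the question.
snp_lines = ["SNP Al1 Al2", "rs7342690 C A", "rs11862844 A C", "rs4021617 A G"]
dose_lines = [
    "C005->C005-000 ML_DOSE 1.178 1.177 1.333",
    "C006->C006-000 ML_DOSE 0.225 0.437 0.586",
]
pairs = pair_for_study("C005-000", snp_lines, dose_lines)
for snp, dose in pairs:
    print(snp, dose)
```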

Any help will be very much appreciated, and I will certainly send more details about the files if it helps. Thanks to all, Peggy
6 REPLIES
Goran Koruga
Honored Contributor

Re: Merging files in different formats

Hello.

Simply use hash arrays in Perl or awk.

First split the line into the fields you are interested in and then put the data in hash arrays as you wish.

Of course, it may become a hash of more complex structures, but that is not hard to do either. For Perl, see these man pages:

perldata
perldsc
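
As a rough illustration of the hash idea, here is a sketch in Python rather than Perl or awk (a Python dict plays the role of the hash). It assumes the SNP file has one header line and that the n-th dosage value belongs to the n-th SNP. The function name `extract_snps`, the second id `C006->C006-000`, and the shortened sample lines are made up for the example.

```python
def extract_snps(snp_lines, dose_lines, wanted):
    """Return {id_field: {snp_name: dosage_string}} for the wanted SNPs."""
    # Hash of SNP name -> 0-based position in the SNP file (header skipped).
    position = {}
    for index, line in enumerate(snp_lines[1:]):
        position[line.split()[0]] = index

    result = {}
    for line in dose_lines:
        fields = line.split()
        values = fields[2:]  # everything after the id field and ML_DOSE
        result[fields[0]] = {snp: values[position[snp]] for snp in wanted}
    return result


# Made-up sample input in the layout from the question.
snp_lines = ["SNP Al1 Al2", "rs7342690 C A", "rs11862844 A C", "rs4021617 A G"]
dose_lines = [
    "C005->C005-000 ML_DOSE 1.178 1.177 1.333",
    "C006->C006-000 ML_DOSE 0.225 0.437 0.586",
]
table = extract_snps(snp_lines, dose_lines, ["rs7342690", "rs4021617"])
for id_field, doses in table.items():
    print(id_field, doses)
```

With the real files, passing open file objects for the dosage lines instead of lists would let the loop read one line at a time, so only the position hash and the few wanted values per individual are kept in memory.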

Regards,
Goran
Viktor Balogh
Honored Contributor

Re: Merging files in different formats

>I need to match up the values in the file containing study ids to the file containing the SNPs.

if you need to match them based on the first field, then:

# man 1 join

if it's not the first field you want to match by then awk both files to rearrange the columns.

if you want to join them without any matching (I mean 1st line to 1st line and so on...)

# man 1 paste
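
To show the difference between the two, here is a small Python sketch of what each tool does conceptually. It is not a replacement for either command: the helper names `paste` and `join_first_field`, the SNP names `rs1` and `rs2`, and the values are made up, and the join helper assumes each key appears at most once per file.

```python
def paste(lines_a, lines_b, sep="\t"):
    """Pair lines by position, like paste on two files of equal length."""
    return [a + sep + b for a, b in zip(lines_a, lines_b)]


def join_first_field(lines_a, lines_b):
    """Pair lines whose first field matches, like a basic join."""
    rest_b = {}
    for line in lines_b:
        key, _, rest = line.partition(" ")
        rest_b[key] = rest
    joined = []
    for line in lines_a:
        key = line.split(" ", 1)[0]
        if key in rest_b:  # lines without a partner are dropped
            joined.append(line + " " + rest_b[key])
    return joined


# Hypothetical sample data.
snp_rows = ["rs1 C A", "rs2 A C"]
by_position = paste(snp_rows, ["1.2", "0.5"])
by_key = join_first_field(snp_rows, ["rs2 0.5", "rs1 1.2"])
print(by_position)
print(by_key)
```

Note that the real join expects both inputs to be sorted on the join field, which the dict lookup in this sketch does not need.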

****
Unix operates with beer.
Dennis Handly
Acclaimed Contributor

Re: Merging files in different formats

>different chromosomes.

Again refer to my comment about domain specific terminology in your previous thread:
http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1381141

>C005->C005-000 ML_DOSE 1.178 1.177 1.333 1.782 0.225 0.437 0.586 1.999 2.000 0.523 ....

How many fields are on each line? Or, if there are more than X, do you have a continuation line?

>Each value in this file after ML_DOSE corresponds (in order) to the SNPs in the first file.

Is there one field for each record in your first file?

>There are 577,282 SNPs in first file (I need to match up 5)

Match 5 to what?

>598,112 SNPs in the second file (I need to match up 2), and 263,830 in the third file (I need to match up 4).

I'm not sure what 2 and 4 you want to match?
You seem to mention 3 SNPs files and 3 other files for study ids. What files match to what?
Viktor Balogh
Honored Contributor

Re: Merging files in different formats

OK, let's make this a little clearer. There are two types of files: one for the SNPs and one for study_IDs.

SNP:

SNP Al1 Al2
rs7342690 C A
rs11862844 A C
rs4021617 A G

study_ID:

C005->C005-000 ML_DOSE 1.178 1.177 1.333 1.782 0.225 0.437 0.586 1.999 2.000 0.523 ....


>Each value in this file after ML_DOSE corresponds (in order) to the SNPs in the first file.

How? I don't see any correspondence here.
And after matching these, what output would you like to have?
****
Unix operates with beer.
Peggy White
Occasional Advisor

Re: Merging files in different formats

Thanks all. Sorry for my terminology; it's the terminology I use and what I need to use in my problem solving. Joining won't work because not all study IDs are in all files. I'm semi-decent with arrays, though I have a hard time with the hash examples involving the Flintstones and the Jetsons, but I will certainly look up the Perl man pages suggested. This task completely changed today, so I'll be starting it over.
Dennis Handly
Acclaimed Contributor

Re: Merging files in different formats

>sorry for my terminology;

The computer science terminology is pretty simple: files, records, fields and characters.

If you wish to name these items with your terminology, do so, but it is probably easier for us dummies to deal with "fileb", "field2", etc. :-)

It would also help if you can show some small sample input files and where fields need to go for your output.