Saturday, October 19, 2013

Concatenating Lines With AWK

So this is my first blog post, and it's a nerdy one at that. Now that I think about it, all my blog posts are going to be nerdy, so let's get started.

My brother asked the following question the other day: 

"I have a text file similar to this

>Sequence1
ACTG
ACTG
>Sequence2
AAAA
CCCC

And I need it to be:

>Sequence1
ACTGACTG
>Sequence2
AAAACCCC

So, I need the lines not to be word wrapped below the lines demarcated by >. Any idea how to do this? Do you guys know anyone who is a text editor pro or is there a simple script that can be written?"

And he included a sample file, "nsP4.txt". Here's my short answer:

awk '/>/{if (x)print x;print;x="";next}{x=(!x)?$0:x""$0;}END{print x;}' nsP4.txt > nsP4New.txt

And the long answer:


It's easier to follow if you break it apart. awk is invoked like this:

    awk 'program' FILE

Here 'program' is the program to execute and FILE is the input file. In this case the file is nsP4.txt and the program is:

    />/{if (x)print x;print;x="";next}{x=(!x)?$0:x""$0;}END{print x;}

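Here />/ is a regular expression pattern that matches any line containing ">", and each set of curly braces is an action. The first action is attached to the pattern, so it only runs for lines that contain ">". The second action has no pattern, so on its own it would run for every line, but the first action ends with next, which skips the rest of the program for that line. The net effect is that lines with ">" get the first set of curly braces and all other lines get the second. So in the first set of curly braces you execute the following if the line contains ">":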

    if (x)
        print x;
    print;
    x="";
    next

So if x is not empty (meaning we've collected a sequence under the previous header), print x, then print the current line, then assign the empty string to x, and finally move on to the next line of input.
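To see how the two sets of curly braces interact, here's a minimal sketch on a made-up four-line input. The "header:" and "body:" labels and the variable name out are only there for illustration; they aren't part of the real program. The second block has no pattern of its own, so it's the next in the first block that keeps the ">" lines out of it:

```shell
# Hypothetical sample input, piped in instead of read from a file.
out=$(printf '%s\n' '>Sequence1' 'ACTG' 'ACTG' '>Sequence2' |
  awk '/>/ { print "header: " $0; next }  # runs only for lines containing ">"
       { print "body: " $0 }')            # runs for every line that gets this far
printf '%s\n' "$out"
```

If you take the next out, the ">" lines should show up twice, once with each label.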

We execute the second set of curly braces if the line doesn't contain ">":

    x = (!x)?$0:x""$0;

This is a ternary operator and is the same as saying:

    if (!x)
        x = $0
    else
        x = x""$0

This means: if x is empty, assign the current line to x; otherwise assign the concatenation of x, "" and the current line. Since concatenating "" doesn't change anything, you could actually simplify this whole section in your case to:

    x = x$0;

The extra code is there in case you want to add some sort of delimiter where you currently have "", for example to separate the base sequences with a comma. The check on x keeps the delimiter from being added in front of the first piece. In that case you'd have:

    x = (!x)?$0:x","$0;
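As a quick check of both variants, this sketch runs them on the sample from the question. I'm feeding the lines in with printf instead of reading nsP4.txt, and the variable names sample, joined and csv are just ones I made up for the example:

```shell
# The sample from the question, held in a variable instead of a file.
sample=$(printf '%s\n' '>Sequence1' 'ACTG' 'ACTG' '>Sequence2' 'AAAA' 'CCCC')

# Simplified version: no delimiter, so plain concatenation is enough.
joined=$(printf '%s\n' "$sample" |
  awk '/>/{if (x)print x;print;x="";next}{x=x$0}END{print x}')

# Delimiter version: the ternary avoids a leading comma on the first piece.
csv=$(printf '%s\n' "$sample" |
  awk '/>/{if (x)print x;print;x="";next}{x=(!x)?$0:x","$0}END{print x}')

printf '%s\n' "$joined" "$csv"
```

The first result should have ACTGACTG and AAAACCCC under their headers, and the second ACTG,ACTG and AAAA,CCCC.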

Then the last thing that awk does is:

    END{print x;}

So when it gets to the end of the file, it prints the value of x. At that point x holds the last sequence, which no later ">" line has come along to print.
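Putting the pieces together, here's the same program spread over several lines with comments, run on the sample from the question. It's only a sketch: the input comes from printf rather than nsP4.txt, it uses the simplified concatenation without a delimiter, and result is just a name I picked to hold the output.

```shell
result=$(printf '%s\n' '>Sequence1' 'ACTG' 'ACTG' '>Sequence2' 'AAAA' 'CCCC' |
  awk '
    />/ {                # header line: contains ">"
        if (x) print x   # print the sequence collected so far, if any
        print            # print the header itself
        x = ""           # start collecting a new sequence
        next             # skip the block below for this line
    }
    { x = x $0 }         # sequence line: append it to x
    END { print x }      # print the last sequence
  ')
printf '%s\n' "$result"
```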

The very last thing: awk would just spit all of this out to the terminal window unless you tell it to put it somewhere else. So this part of the line:

...... > nsP4New.txt 

That's shell output redirection rather than part of awk: it takes the output of awk and sends it to the file nsP4New.txt, which is created if it doesn't exist and overwritten if it does.
