FASTA file format tips

With multi-FASTA files, you often need to extract the few sequences whose records contain a keyword of interest. One fast way to do this is with awk.

data.fa

>Chr1
ATCTGCTGCTCGGGCTGCTCTAT...
>Chr2
GTACGTCGTAGGACATGCATCG...
>MT1
TACGATCGATCAGCTCAGCATC...
>MT2
CGCCATGGATCAGCTACATGTA...

$ awk 'BEGIN {RS=">"} /Chr2/ {print ">"$0}' data.fa

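Note that in the BEGIN section of the script, we redefine the built-in record separator variable, RS=">", which by default is a newline. This way, awk treats each whole (multi-line) FASTA record as a single record, so the pattern /Chr2/ is tested against the header and the sequence lines together. The separator itself is dropped from each record, which is why the action prints ">" back in front of $0. Each record also keeps its trailing newline, and print adds one more, so a blank line follows the output.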

output:

>Chr2
GTACGTCGTAGGACATGCATCG...

example taken from: link
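
The same idea can be narrowed to match only the header line and to accept several keywords at once. The following is a minimal sketch, not part of the original example: the file toy.fa and its short sequences are made up for illustration, and the keyword list (Chr2, MT1) is an assumption. Setting FS to a newline makes $1 the header line of each record, and setting ORS to an empty string avoids the extra blank line that print would otherwise add.

```shell
# Hypothetical toy input; the sequences are made up for illustration only.
cat > toy.fa <<'EOF'
>Chr1
ATCTGCTG
CTCGGGCT
>Chr2
GTACGTCG
TAGGACAT
>MT1
TACGATCG
>MT2
CGCCATGG
EOF

# RS=">"  : each FASTA record (header plus sequence lines) becomes one awk record.
# FS="\n" : fields are lines, so $1 is the header line without the ">".
# ORS=""  : each record already ends in a newline, so print adds nothing extra.
# NR > 1  : skips the empty record that precedes the first ">".
result=$(awk 'BEGIN {RS=">"; FS="\n"; ORS=""}
              NR > 1 && $1 ~ /^(Chr2|MT1)$/ {print ">" $0}' toy.fa)

printf '%s\n' "$result"
```

With this toy input, the command prints the Chr2 and MT1 records and nothing else. Because the pattern is anchored with ^ and $ and applied to $1 only, a header such as Chr20 would not match, and neither would a keyword that happens to appear inside a sequence.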