Manipulating the output from a genome analysis - vcf and gff

Problem formulated and presented at the workshop by Jonas Söderberg,
Department of Cell and Molecular Biology, Molecular Evolution

Task at hand

We have a bunch of files that describes a standard fly, the genes of the fly and also where our new non-standard flies differ from the standard fly (the reference). These differences are called variants and are divided ito three categories; SNPs (Singular Nucleotide Polymorphisms), Insertions (Some DNA is inserted in our variant in comparison to the reference) and Deletions (Some DNA is deleted from our variant in comparison to the reference). We want to find out where and in what way the new flies have variants, if those variants are inside genes or not and finally what the genes (in which those variants are located) actually do.

Getting the RAW files

We have a comparison between a number of different new fly cell lines. These are found in a huge vcf file (dgrp2.vcf). We also have the annotations in a gff3 file (Drosophila_melanogaster.BDGP6.28.101.chr.gff3) and the genome itself in a fasta file.

The annotation of the fly genome:

Find the GFF3 here.

The genome sequence used in this project:

The genome is located here.

The comparison of all the variants in the cell lines in the freeze project

Here is the VCF file with the variants.

Common names and functions

Finally, full gene names and functions found in this file.

Tips before starting

For readability of the vcf file:

awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9}' dgrp2.vcf > dgrp2_trimmed.vcf &

If your awk takes too long to run (10h or so) - Make sure you have the latest awk, and also you might want to think about parallelisation.
My solutions also work for files with more than one chromosome, so in some cases they are longer than needed for your exercise. I left it so on purpose.
SNPs and INDELs are collectively named variants
Genes and CoDing Sequences (CDSs) are sometimes just collectively called genes
For shorter run times, extract chromosome 4 and look only at that.

Making awking the data easier

Start by splitting the task into sub-tasks. This makes it easier to see what happens and you might get interesting intermediary results.

bonus result

A table with counted and sorted different genomic features in chromosome 4.

bonus result

SNPs sorted by number. Just like the coins on day one.

The exercise

Identify the steps you need and use awk to do those. Open the hints if you get stuck.

Hints, ordered by subject. Don't use them unless necessary. They are not good code examples.

Overall, an example of things to look for

Hint Example

Let's say we want to find out all genes that contains a variant and all variants that are located within a gene. What do we want to do first? Take a look at the vcf file. That is the one that contains all the variants. Then look at the gff file, which contains the genes and other annotations. Finally, take a look at the DNA sequence. You will need to combine all three to answer the question.

Hint Example, what do we need to get?

Positions for SNPs and INDELs
Positions for genes and CDSs
Separation of variants (SNPs and INDELs) into two groups, inside and outside genes (and CDSs)
Separation of genes/CDSs into those with and without variants (and maybe how many there are per gene)

Chromosome 4

Hint What do we need?

Extract chr4 from the vcf and the gff and make new files

Hint

All lines from chromosome 4 start with a 4

Solution

awk '/^4/{print $0}' Drosophila_melanogaster.BDGP6.28.101.gff3 > Drosophila_melanogaster.chr4.gff3

awk '/^4/{print $0}' dgrp2_trimmed.vcf > dgrp2_chr4.vcf

bonus result example A table with counted and sorted different genomic features in chromosome 4.

   1 chromosome
   1 snoRNA
   2 pre_miRNA
   7 pseudogene
  11 pseudogenic_transcript
  26 ncRNA_gene
  31 ncRNA
  79 gene
 295 mRNA
 338 three_prime_UTR
 571 five_prime_UTR
2740 CDS
3155 exon

Variants

Hint Which output do we want?

Get distribution of variants and list them in two separate files. For a bonus plot of the lengths of the INDELS, get the length of all INDELS into a third file

Hint

Remove lines beginning with # (grep)

Hint

If columns 4 and 5 have different length, it's an INDEL. Otherwise it's a SNP.

Hint

You want the output to be a file with columns 1, 2, 4 and 5, a classifier (SNP or INDEL) and finally the length of the INDEL (put "-" for the SNPs)

Solution

cat dgrp2_chr4.vcf | grep -v "#" | awk '{if (length($4)>1||length($5)>1){a="INDEL";b=length($4)-length($5);cnt[b]+=1;} else {a="SNP";b="-";} printf("%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, a, b, $4, $5) > "indels_Drosophila_chr4";}END{for (x in cnt){print x,cnt[x] > "distr_Drosophila_chr4"}}'
or
cat dgrp2_chr4.vcf | grep -v "#" | ./indels.awk

indels.awk

#!/usr/bin/awk -f
{
    if (length($4)>1||length($5)>1){
        a="INDEL";
        b=length($4)-length($5);
        cnt[b]+=1;
    }
    else{
        a="SNP";
        b="-";
    }
printf("%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, a, b, $4, $5) > "indels_Drosophila_chr4";
}

END{
    for (x in cnt){
        print x,cnt[x] > "distr_Drosophila_chr4"
    }
}

Solution proposed by Loïs Rancilhac - 2022.08.30

awk '/_SNP/ {SNP++; print $0 > "chr4_SNPs.vcf"} /_DEL/ {DEL++; print $0 > "chr4_DEL.vcf"; LENGTH=length($4)-length($5); print LENGTH > "Deletions_lengths.txt"} /_INS/ {INS++; print $0 > "chr4_INS.vcf"; LENGTH=length($5)-length($4); print LENGTH > "Insertions_lengths.txt"} END{print "SNPs: "SNP"\nInsertions: "INS"\nDeletions: "DEL}' chr4.vcf

bonus result example SNPs sorted by number

1182 C->T  
1133 G->A  
 932 A->G  
 929 A->T  
 892 T->A  
 880 T->C  
 639 G->T  
 621 C->A  
 436 A->C  
 396 T->G  
 372 G->C  
 357 C->G

Genes with variants

Hint How do we get those?

Compare back and separate the annotation into features that do and don’t have variants. For a bonus, also record the number of variants in each feature

Hint

Make an index using the previous output to identify positions of variants

Hint

For each feature in the gff, check all position it covers to see if they are in your index, if so print to one file. If not, print to another.

Solution

awk 'FNR==NR{a[$1,$2]="T"; next}{ hits=0; for(N=$4; N<=$5; N++) { if (a[$1,N] == "T") {hits+=1}} if (hits>0) {print hits "\t" $0 > "haveSNPINDEL_Drosophila_chr4.gff"} else {print $0 > "noSNPINDEL_Drosophila_chr4.gff"}}' indels_Drosophila_chr4 Drosophila_melanogaster.chr4.gff3
or
./genes_var.awk indels_Drosophila_chr4 Drosophila_melanogaster.chr4.gff3

genes_var.awk

#!/usr/bin/awk -f

FNR==NR{
    a[$1,$2]="T";
    next
}
{
    hits=0;
    for(N=$4; N<=$5; N++) {
        if (a[$1,N] == "T") {
         hits+=1
        }
    }
    if (hits>0) {
        print hits "\t" $0 > "haveSNPINDEL_Drosophila_chr4.gff"
    }
    else {
        print $0 > "noSNPINDEL_Drosophila_chr4.gff"
    }
}

Genes/CDSs only

Hint What features do we look for?

Filter for genes and CDSs before doing the analysis.

Hint

Only genes and CDSs are interesting to us. Make a gff without the rest of the features.

Solution

awk '$3 ~ /gene|CDS/' Drosophila_melanogaster.chr4.gff3 > Drosophila_melanogaster.chr4_genesCDSs.gff3

List of variants

Hint How do we classify the variants?

Repeat step 3 for the SNPs/INDELs themselves, to see which are actually located inside genes

Hint

Make an index of all genes/CDSs (from your gff), where start and stop are paired

Hint

For each feature from your step 2 file, check the position against the index and print whether or not the variant is inside a gene.

Solution

awk 'FNR==NR{ingene[$1,$4]=$5; next}{state="Not in gene";for (pair in ingene) {split(pair, t, SUBSEP); if ($1==t[1] && $2>=t[2] && $2<=ingene[t[1],t[2]]) {state=(t[1] " " t[2] " " ingene[t[1],t[2]])}} print $0, " ", state }' Drosophila_melanogaster.chr4_genesCDSs.gff3 indels_Drosophila_chr4 > SNPsInGenes_Drosophila_ch4
or
./varlist.awk Drosophila_melanogaster.chr4_genesCDSs.gff3 indels_Drosophila_chr4 > SNPsInGenes_Drosophila_ch4

varlist.awk

#!/usr/bin/awk -f

FNR==NR{
    ingene[$1,$4]=$5; 
    next
}
{
    state="Not in gene";
    for (pair in ingene) {
        split(pair, t, SUBSEP); 
        if ($1==t[1] && $2>=t[2] && $2<=ingene[t[1],t[2]]) {
            state=(t[1] " " t[2] " " ingene[t[1],t[2]])
        }
    } 
    print $0, " ", state 
}

Hint Finding the names

Find a common field

Hint Which field?

gene ID

Hint Where is that?

column nine, not awk!

Solution

awk 'FNR==1{++fileidx} fileidx==1{split($9,a,";|:");ingene[$1,$4,$5]=a[2]} fileidx==2{FS="\t";name[$3]=$5} fileidx==3{state="Not in gene";for (trip in ingene) {split(trip, t, SUBSEP); if ($1==t[1] && $2>=t[2] && $2<=t[3]) {state=(t[1] "\t" t[2] "\t" t[3] "\t" name[ingene[t[1],t[2],t[3]]])}} print $0, "\t", state }' Drosophila_melanogaster.chr4_genesCDSs.gff3 fbgn_fbtr_fbpp_expanded_fb_2020_06.tsv indels_Drosophila_chr4 > SNPsInNamedGenes_Drosophila_ch4
or
./in_named.awk Drosophila_melanogaster.chr4_genesCDSs.gff3 fbgn_fbtr_fbpp_expanded_fb_2020_06.tsv indels_Drosophila_chr4 > SNPsInNamedGenes_Drosophila_ch4

in_named.awk

#!/usr/bin/awk -f
FNR==1{
    ++fileidx
}
{
    if (fileidx==1){
        split($9,a,";|:");
        ingene[$1,$4,$5]=a[2]
    } 
    if (fileidx==2){
        FS="\t";
        name[$3]=$5
    } 
    if (fileidx==3){
        state="Not in gene";
        for (trip in ingene) {
            split(trip, t, SUBSEP); 
            if ($1==t[1] && $2>=t[2] && $2<=t[3]) {
                state=(t[1] "\t" t[2] "\t" t[3] "\t" name[ingene[t[1],t[2],t[3]]])
            }
        } 
        print $0, "\t", state 
    }
}

Look at the distribution of genes:

awk -F"\t" '{print $10}' SNPsInNamedGenes_Drosophila_ch4 | sort | uniq -c | sort -n