Demo notebook for GMI Proficiency Test 2015


About the test


"What is the GMI PT?

This inter-laboratory performance test is provided to facilitate harmonization and standardization in whole genome sequencing and data analysis, with the aim to produce comparable data for the GMI initiative. "

This notebook is about the task 2:

" 2) Variant detection and phylogenetic/clustering analysis of three datasets each including fastq data from circa 20 genomes of S. enterica, E. coli and S. aureus. Note: If performing a reference based approach for variant detection, the reference applied for the analysis must be the species specific reference indicated in the proficiency test protocol."

  • Proficiency test protocol link

Supplied test material

"Three datasets, one for each of S. enterica, E. coli and S. aureus, will be available for download from the ftp-site ‘cgebase.cbs.dtu.dk’. Login to the ftp-site will be provided directly to each participant. Each dataset will consist of the original fastq files (i.e., whole genome sequence data) from circa 20 samples for phylogenetic cluster analysis based on a tool of the laboratory’s own choice; SNP-calling, gene-by-gene, etc."

Procedure and analysis of test material

"The three fastq datasets should be downloaded from the ftp-site. They are organized into three different .zip archives appropriately labeled with the taxon they represent. Within each archive the participant will find the paired-end reads. The objective associated with this dataset is to assess the variability of laboratories in the clusters identified through the analysis of next- generation sequencing data. As such, the participant should employ their preferred method for constructing a matrix (e.g., gene, SNP, presence/absence, etc.) and for clustering samples (e.g., distance-, maximum-likelihood-, Bayesian-based).

If performing a reference based approach for variant detection, the reference applied for the analysis must be: STA00025 (S. enterica), EC002143 (E. coli) and SAH596 (S. aureus)."

REPORTING OF RESULTS AND EVALUATION

"Specifically, up to two types of files should be submitted: For each dataset:

  1. The DNA sequence matrix used for clustering should be in fasta format
  2. The clusters themselves in newick format "


Preparation


Import python libraries used

In [1]:
from glob import glob

from Bio import AlignIO
from Bio import SeqIO

from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor

import ete2

import time
import subprocess
import os
import copy

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline


#some settings
matplotlib.rc('font', size=16)
matplotlib.rcParams['lines.linewidth'] = 3
matplotlib.rcParams['figure.figsize'] =(16,9)

Go to working directory

  • NOTE Machine specific
In [2]:
os.chdir('/nagyvinyok/adat83/sotejedlik/ribli/gmi_3')

Write bash params to file

  • NOTE Machine specific
In [3]:
%%writefile bash_params
Bowtie_Path=/nagyvinyok/adat88/kozos/sotejedlik/usr/bowtie2
SAMTOOLS=/nagyvinyok/adat88/kozos/sotejedlik/usr/samtools-1.2/samtools
BCFTOOLS=/nagyvinyok/adat88/kozos/sotejedlik/usr/samtools-1.2/bcftools
MAFFT=mafft
Overwriting bash_params


Downloading, unzipping data


  • We downloaded zipped sample data and reference genomes into the 'input' subdirectory
  • No we unzip them

Extract sample files

TIME It takes ~ 30 mins

In [ ]:
%%bash
. bash_params

if [ -d samples ]
then
  echo "Samples already extracted"
else
  mkdir samples
  unzip -d samples input/ecoliMay2015_PT.zip
  unzip -d samples input/saureusMay2015_PT.zip
  unzip -d samples input/typhimuriumMay2015_PT.zip
  #The S. aureus samples are individually gzipped
  cd samples/saureusPT
  gunzip *.gz
fi


Some basic statistics on the data


We did some very basic statstics on the provided dataset. This should be done in a more thorough manner in a real situation. We wanted to create as simple worklflow as possible. This goal of this notebook is to demonstrate the possibility of developping worklows in the IPython Notebook enviroment. For further quality control one can use FatsQC

Check how many samples we have

In [5]:
%%bash
. bash_params

echo 'Number of FASTQ files:'
echo "E. coli"
ls samples/ecoliPT/*_1.fastq | wc -l
echo "S. aureus"
ls samples/saureusPT/*_1.fastq | wc -l
echo "Typhimurium"
ls samples/typhimuriumPT/*_1.fastq | wc -l
Number of FASTQ files:
E. coli
21
S. aureus
24
Typhimurium
20

Number of reads in each sample

  • NOTE Very high fluctuation in coverage

TIME It takes ~ 2 mins

In [7]:
%%bash
. bash_params
mkdir work
for i in samples/ecoliPT/*_1.fastq; do \
    echo -n ${i}": "; cat $i | wc -l | awk '{print $1/4}' ;
done > work/readCount.dat
for i in samples/saureusPT/*_1.fastq; do \
    echo -n ${i}": "; cat $i | wc -l | awk '{print $1/4}' ;
done >> work/readCount.dat
for i in samples/typhimuriumPT/*_1.fastq; do \
    echo -n ${i}": "; cat $i | wc -l | awk '{print $1/4}' ;
done >> work/readCount.dat
In [8]:
nReads = pd.read_csv("work/readCount.dat", header=None, sep=" ")
nReads['name']=[str(x.split('/')[-1].split('_')[0]) for x in nReads[0]]

fig,ax=plt.subplots()
ax.bar(range(len(nReads)),nReads[1],facecolor='#5555ff', edgecolor='none')
ax.set_xticks(0.5+np.arange(len(nReads)))
ax.set_xticklabels(nReads['name'],rotation='vertical',fontsize=14)
dump=ax.set_xlim(0,len(nReads))

Read lengths in all samples

TIME It takes ~ 15 mins

In [9]:
%%bash
. bash_params

for i in samples/ecoliPT/*_1.fastq; do \
    echo -n ${i}": "; grep -v '[+@]' $i | awk 'BEGIN {;FS=""}{lengc[NF]++}\
    END{for(i in lengc) printf("%d %d "),i,lengc[i]/2; printf("\n")}';
done > work/readLen.dat
for i in samples/saureusPT/*_1.fastq; do \
    echo -n ${i}": "; grep -v '[+@]' $i | awk 'BEGIN {;FS=""}{lengc[NF]++}\
    END{for(i in lengc) printf("%d %d "),i,lengc[i]/2; printf("\n")}';
done >> work/readLen.dat
for i in samples/typhimuriumPT/*_1.fastq; do \
    echo -n ${i}": "; grep -v '[+@]' $i | awk 'BEGIN {;FS=""}{lengc[NF]++}\
    END{for(i in lengc) printf("%d %d "),i,lengc[i]/2; printf("\n")}';
done >> work/readLen.dat
In [10]:
names,lens,counts=[],[],[]
with open('work/readLen.dat') as f:
    for line in f:
        names.append(line.split(' ')[0].split('/')[-1].split('_')[0])
        lens.append(map(int,line.split(' ')[1:-1:2]))
        counts.append(map(int,line.split(' ')[2:-1:2]))

N=np.max([max(leng) for leng in lens])
m=np.zeros((N+1,len(names)))
for i in xrange(len(names)):
    for j in xrange(len(lens[i])):
        m[lens[i][j],i]+=counts[i][j]
    m[:,i]/=np.sum(counts[i])        

fig,ax=plt.subplots()
ax.imshow(m,interpolation='none',aspect=0.15,cmap=plt.cm.jet,
          norm=matplotlib.colors.LogNorm(),origin='lower')
ax.set_xticks(np.arange(len(names)))
ax.set_xticklabels(names,rotation='vertical',fontsize=14)
ax.set_ylim(30,N+10)
ax.set_ylabel('read length')
dump=ax.set_xlim(-1,len(names))

Check the fasta files, since we know nothing about them yet


NOTE They are called exactly like some samples

  • Probably they were created from those samples, with denovo assembly
  • Interesting that all of them was made from relatively low coverage samples
In [11]:
%%bash
cd input
ls *fasta
EC002143.fasta
SAH596.fasta
STA00025.fasta

Print the number of contigs, and some names of contigs

  • NOTE they are probably contigs from a denovo alignment.
    • NODE means which node it was computed on
    • coverage should be the average coverage on the contig
In [12]:
%%bash
grep '>' input/EC002143.fasta | wc -l
grep '>' input/SAH596.fasta | wc -l
grep '>' input/STA00025.fasta | wc -l
echo
head -n1 input/EC002143.fasta 
head -n1 input/SAH596.fasta 
head -n1 input/STA00025.fasta
335
73
192

>NODE_3_length_6711_cov_193.634186
>C00000163_n1_c1 130204H596-I1-E1 NODE_1_length_107505_cov_22.298033
>NODE_1_length_933_cov_23.973206

Cumulative distributions of contig lengths

In [17]:
fig,ax=plt.subplots()
fig.set_size_inches(16,9)
for fname in ['EC002143.fasta','SAH596.fasta','STA00025.fasta']:
    ref_dict=SeqIO.to_dict(SeqIO.parse('input/'+fname,"fasta"))
    x=np.sort([len(x) for x in ref_dict.values()])
    ax.plot(x,np.linspace(0,1,len(x)),'H',ms=10,mec='none',label=fname)
ax.set_xscale('log')
ax.set_xlabel('contig length')
ax.set_ylabel('cumulative distribution')
ax.legend(loc='best',fancybox=True)
dump=ax.set_ylim(0,1.1)

Full reference lengths

In [19]:
for fname in ['EC002143.fasta','SAH596.fasta','STA00025.fasta']:
    ref_dict=SeqIO.to_dict(SeqIO.parse('input/'+fname,"fasta"))
    print fname,'genome length:', np.sum([len(x) for x in ref_dict.values()])
EC002143.fasta genome length: 5183821
SAH596.fasta genome length: 2836332
STA00025.fasta genome length: 5088344


Alignment


Because the species of the samples were given, and we assumed they might be very similar to each other, We have decided to use a reference based method.

We have used the bowtie2 alignment software. We have chosen bowtie2 mainly because the author of this notebook has the most experience with this aligner. Other good aligner software should be also good (bwa, novoalign etc.).

Bowtie2 and reference genome index generation

TIME It only takes seconds

In [ ]:
%%bash
. bash_params

if [ -d indexes ]
then
  echo "Indices already generated"
else
  mkdir indexes
  $Bowtie_Path/bowtie2-build input/EC002143.fasta indexes/EC002143
  $SAMTOOLS faidx input/EC002143.fasta
  $Bowtie_Path/bowtie2-build input/SAH596.fasta indexes/SAH596
  $SAMTOOLS faidx input/SAH596.fasta
  $Bowtie_Path/bowtie2-build input/STA00025.fasta indexes/STA00025
  $SAMTOOLS faidx input/STA00025.fasta
fi

Alignment

  • We used bowtie2 with default settings

TIME It takes ~ 2.5 hours

In [ ]:
%%bash
. bash_params

mkdir -p align

#ecoliPT
DEST=align/ecoliPT
GENOME=EC002143
if [ -d $DEST ]
then
  echo "$DEST already aligned"
else
  mkdir $DEST
  for i in samples/ecoliPT/*_1.fastq
  do
    j=${i%_1.fastq}
    k=${j#samples/ecoliPT/}
    $Bowtie_Path/bowtie2 -p 22 -x indexes/$GENOME -1 ${j}_1.fastq -2 ${j}_2.fastq \
    | $SAMTOOLS view -b -S - > $DEST/$k.bam
    $SAMTOOLS sort -@ 10 $DEST/$k.bam $DEST/$k.s
    mv $DEST/$k.s.bam $DEST/$k.bam
    $SAMTOOLS index $DEST/$k.bam
  done
  wait
fi

#saureusPT
DEST=align/saureusPT
GENOME=SAH596
if [ -d $DEST ]
then
  echo "$DEST already aligned"
else
  mkdir $DEST
  for i in samples/saureusPT/*_1.fastq
  do
    j=${i%_1.fastq}
    k=${j#samples/saureusPT/}
    $Bowtie_Path/bowtie2 -p 22 -x indexes/$GENOME -1 ${j}_1.fastq -2 ${j}_2.fastq \
    | $SAMTOOLS view -b -S - > $DEST/$k.bam
    $SAMTOOLS sort -@ 10 $DEST/$k.bam $DEST/$k.s
    mv $DEST/$k.s.bam $DEST/$k.bam
    $SAMTOOLS index $DEST/$k.bam
  done
  wait
fi

#typhimuriumPT
DEST=align/typhimuriumPT
GENOME=STA00025
if [ -d $DEST ]
then
  echo "$DEST already aligned"
else
  mkdir $DEST
  for i in samples/typhimuriumPT/*_1.fastq
  do
    j=${i%_1.fastq}
    k=${j#samples/typhimuriumPT/}
    $Bowtie_Path/bowtie2 -p 22 -x indexes/$GENOME -1 ${j}_1.fastq -2 ${j}_2.fastq \
    | $SAMTOOLS view -b -S - > $DEST/$k.bam
    $SAMTOOLS sort -@ 10 $DEST/$k.bam $DEST/$k.s
    mv $DEST/$k.s.bam $DEST/$k.bam
    $SAMTOOLS index $DEST/$k.bam
  done
  wait
fi


Creating the initial DNA sequence matrix


Steps:

VCF creation:

  • The VCF format is a frequently used intermediary step in variant calling pipelines. The format specification.
  • For the creation of the VCF we used the standard samtools software, (version 1.2).

Variant calling, consensus sequence:

  • For selecting the most probable allel in each position we used "bcftools call" from the standard samtools software, (version 1.2). Link for bcftools.
  • For writing the consensus fasta sequence "bcftools consensus" from the standard samtools software, (version 1.2).

Creating the DNA equence matrix

  • We performed multiple alignment of the created fasta sequence with MAFFT software.

TIME It takes 20 hours!!! (from nov. 1. 19:53 to nov 2. 15.34.)

In [ ]:
%%bash
. bash_params
mkdir -p consensus

DEST=ecoliPT
GENOME=EC002143

if [ -d consensus/$DEST ]
then
  echo "consensus for $DEST already generated"
else
  mkdir consensus/$DEST

  for i in samples/$DEST/*_1.fastq
  do
    j=${i%_1.fastq}
    k=${j#samples/$DEST/}
    $SAMTOOLS mpileup -v -f input/$GENOME.fasta align/$DEST/$k.bam > consensus/$DEST/$k.vcf
    $BCFTOOLS index consensus/$DEST/$k.vcf
    $BCFTOOLS call -o consensus/$DEST/$k.call.vcf -O z -c consensus/$DEST/$k.vcf
    $BCFTOOLS index consensus/$DEST/$k.call.vcf
    $BCFTOOLS consensus -f input/$GENOME.fasta consensus/$DEST/$k.call.vcf > consensus/$DEST/$k.cons.fa
    awk -v SMPL=$k -v DDIR=consensus/$DEST/ \
      '/^>/ {fajl= DDIR "all_" substr($1,2) ".fa"; print ">" SMPL >> fajl} !/^>/{print >> fajl}' \
      consensus/$DEST/$k.cons.fa  
  done

  for i in consensus/$DEST/all_*.fa
  do
    $MAFFT --clustalout $i > ${i%.fa}.clw
  done
fi

DEST=saureusPT
GENOME=SAH596

if [ -d consensus/$DEST ]
then
  echo "consensus for $DEST already generated"
else
  mkdir consensus/$DEST

  for i in samples/$DEST/*_1.fastq
  do
    j=${i%_1.fastq}
    k=${j#samples/$DEST/}
    $SAMTOOLS mpileup -v -f input/$GENOME.fasta align/$DEST/$k.bam > consensus/$DEST/$k.vcf
    $BCFTOOLS index consensus/$DEST/$k.vcf
    $BCFTOOLS call -o consensus/$DEST/$k.call.vcf -O z -c consensus/$DEST/$k.vcf
    $BCFTOOLS index consensus/$DEST/$k.call.vcf
    $BCFTOOLS consensus -f input/$GENOME.fasta consensus/$DEST/$k.call.vcf > consensus/$DEST/$k.cons.fa
    awk -v SMPL=$k -v DDIR=consensus/$DEST/ \
      '/^>/ {fajl= DDIR "all_" substr($1,2) ".fa"; print ">" SMPL >> fajl} !/^>/{print >> fajl}' \
      consensus/$DEST/$k.cons.fa  
  done

  for i in consensus/$DEST/all_*.fa
  do
    $MAFFT --clustalout $i > ${i%.fa}.clw
  done
fi

DEST=typhimuriumPT
GENOME=STA00025
if [ -d consensus/$DEST ]
then
  echo "consensus for $DEST already generated"
else
  mkdir consensus/$DEST

  for i in samples/$DEST/*_1.fastq
  do
    j=${i%_1.fastq}
    k=${j#samples/$DEST/}
    $SAMTOOLS mpileup -v -f input/$GENOME.fasta align/$DEST/$k.bam > consensus/$DEST/$k.vcf
    $BCFTOOLS index consensus/$DEST/$k.vcf
    $BCFTOOLS call -o consensus/$DEST/$k.call.vcf -O z -c consensus/$DEST/$k.vcf
    $BCFTOOLS index consensus/$DEST/$k.call.vcf
    $BCFTOOLS consensus -f input/$GENOME.fasta consensus/$DEST/$k.call.vcf > consensus/$DEST/$k.cons.fa
    awk -v SMPL=$k -v DDIR=consensus/$DEST/ \
      '/^>/ {fajl= DDIR "all_" substr($1,2) ".fa"; print ">" SMPL >> fajl} !/^>/{print >> fajl}' \
      consensus/$DEST/$k.cons.fa  
  done

  for i in consensus/$DEST/all_*.fa
  do
    $MAFFT --clustalout $i > ${i%.fa}.clw
  done
fi


Creating the final DNA sequence matrix, and clustering the samples


We selected the following position fot he final DNA matrix:

  • The position must be contained in the consensus sequence of all samples. This criteria tries to eliminate the problematic positions in the reference genome.
  • There must not be mor than 2 different bases at a given position in all samples. This criteria also tries to eliminate the problematic positions in the reference genome.
  • There must be at least some difference between the samples. This criteria is used only to speed up calculations later.

We decided to use distance based clustering

  • First we wanted to create as simple worklflow as possible. This goal of this notebook is to demonstrate the possibility of developping worklows in the IPython Notebook enviroment.
  • Also we (WIGNER) are not experts in bacteriology, so we didn't wish to invent our complicated pipeline for this demonstration.

Distance construction, and Neighour-joining tree construction

Define functions to read the clustal files, and concatenate them into a multiple sequence alignment object, and write the result to file

  • this is the result for 1, :the sequence matrix used for clustering in fasta format
In [3]:
def read_clustal(input_str):
    start=time.time()
    init = None
    
    for fff in glob(input_str):
        if init == None :
            aln = filter_aln(AlignIO.read(fff, 'clustal'))
            aln.sort()
            init = 1
        else:
            aln_tmp = filter_aln(AlignIO.read(fff, 'clustal'))
            aln_tmp.sort()
            aln = aln + aln_tmp
        
    print 'Length of the final sequence: ',len(aln[0]),'bp'
    print 'Selecting positions took: ', time.time()-start,'s' 
    return aln

def filter_aln(in_aln):
    out_aln=in_aln[:,0:1] #init (will be deleted)
    leng=len(in_aln)
    
    #loop over positions
    for i in xrange(len(in_aln[0])):
        counts=[ in_aln[:,i].count(x) for x in ['a','c','g','t','-']]
        
        if ( counts[-1]==0 and #no zero coverage
            max(counts)!=leng and  #some difference needed
            sorted(counts)[2]==0): #only 2 different letters!
            out_aln+=in_aln[:,i:i+1]
    return out_aln[:,1:]
        

def write_seq_matrix(aln,output_fname):
    with open(output_fname, "w") as handle:
        SeqIO.write(aln, handle, "fasta")

Define functions to build the tree, plot it and write it to a file in newick format

  • We need to convert from Biopython Phylo newick format, to the format described in the protocol, We use ete2 for this purpose.
In [4]:
def build_tree(aln):
    calculator = DistanceCalculator('identity')
    constructor = DistanceTreeConstructor(calculator, 'nj')
    start=time.time()
    tree = constructor.build_tree(aln)
    print 'Building the tree took: ', time.time()-start,'s' 
    return tree

def get_dist_mat(aln):
    calculator = DistanceCalculator('identity')
    start=time.time()
    dm = calculator.get_distance(aln)
    print 'Calculating the distance matrix took: ', time.time()-start,'s'
    return dm

def plot_dist_mat(dm,cmap=plt.cm.Greens):
    fig,ax=plt.subplots()
    m=np.array(dm)
    ax.imshow(m,interpolation='none',cmap=cmap)
    ax.set_xticks(range(m.shape[0]))
    ax.set_xticklabels(dm.names,rotation='vertical')
    ax.set_yticks(range(m.shape[1]))
    dump=ax.set_yticklabels(dm.names)

def plot_tree(tree,xmin=None,xmax=None):
    def my_label(clade): #label the leaves
        if str(clade)[:5] == 'Inner':
            return None
        else:
            return clade
    Phylo.draw(tree,my_label)
    
def write_tree(tree,output_fname):
    #have to format the tree for the accepted output
    Phylo.write(tree,'temp.nw','newick')
    temp_tree=ete2.Tree('temp.nw',format=1)
    subprocess.call(['rm','temp.nw'])
    temp_tree.write(outfile=output_fname,format=9)

Create the matrix and tree, write results and plot the trees

Create directory for results

In [5]:
%%bash
mkdir results

E. coli

In [10]:
ecoli_aln=read_clustal(input_str='consensus/ecoliPT/all_*.clw')
write_seq_matrix(ecoli_aln,'results/WIGNER_EC.fasta')
ecoli_tree=build_tree(ecoli_aln)
ecoli_dm=get_dist_mat(ecoli_aln)
write_tree(ecoli_tree,output_fname='results/WIGNER_EC.tre')
Length of the final sequence:  92831 bp
Selecting positions took:  1068.09990597 s
Building the tree took:  50.0059378147 s
Calculating the distance matrix took:  48.4180049896 s

Plot the distance matrix

In [11]:
plot_dist_mat(ecoli_dm)

Plot the tree

  • Branch length is the rate of different positions source

    • 1: complelety different
    • 0: the same
  • We did't use all the positions, so these numbers should be multiplied by the length of the DNA sequence matrix, to get the number of different positions

In [12]:
plot_tree(ecoli_tree)

S. aureus

In [13]:
saureus_aln=read_clustal(input_str='consensus/saureusPT/all_*.clw')
write_seq_matrix(saureus_aln,'results/WIGNER_SA.fasta')
saureus_tree=build_tree(saureus_aln)
saureus_dm=get_dist_mat(saureus_aln)
write_tree(saureus_tree,output_fname='results/WIGNER_SA.tre')
Length of the final sequence:  6164 bp
Selecting positions took:  578.385573149 s
Building the tree took:  4.34834003448 s
Calculating the distance matrix took:  4.27276611328 s
In [14]:
plot_dist_mat(saureus_dm,cmap=plt.cm.Blues)
In [15]:
plot_tree(saureus_tree)

S. enterica

In [16]:
typh_aln=read_clustal(input_str='consensus/typhimuriumPT/all_*.clw')
write_seq_matrix(typh_aln,'results/WIGNER_ST.fasta')
typh_tree=build_tree(typh_aln)
typh_dm=get_dist_mat(typh_aln)
write_tree(typh_tree,output_fname='results/WIGNER_ST.tre')
Length of the final sequence:  31044 bp
Selecting positions took:  958.346100092 s
Building the tree took:  16.255177021 s
Calculating the distance matrix took:  14.5855751038 s
In [17]:
plot_dist_mat(typh_dm,cmap=plt.cm.Purples)
In [18]:
plot_tree(typh_tree)