This inter-laboratory performance test is provided to facilitate harmonization and standardization in whole-genome sequencing and data analysis, with the aim of producing comparable data for the GMI initiative.
" 2) Variant detection and phylogenetic/clustering analysis of three datasets each including fastq data from circa 20 genomes of S. enterica, E. coli and S. aureus. Note: If performing a reference based approach for variant detection, the reference applied for the analysis must be the species specific reference indicated in the proficiency test protocol."
"Three datasets, one for each of S. enterica, E. coli and S. aureus, will be available for download from the ftp-site ‘cgebase.cbs.dtu.dk’. Login to the ftp-site will be provided directly to each participant. Each dataset will consist of the original fastq files (i.e., whole genome sequence data) from circa 20 samples for phylogenetic cluster analysis based on a tool of the laboratory’s own choice; SNP-calling, gene-by-gene, etc."
"The three fastq datasets should be downloaded from the ftp-site. They are organized into three different .zip archives appropriately labeled with the taxon they represent. Within each archive the participant will find the paired-end reads. The objective associated with this dataset is to assess the variability of laboratories in the clusters identified through the analysis of next- generation sequencing data. As such, the participant should employ their preferred method for constructing a matrix (e.g., gene, SNP, presence/absence, etc.) and for clustering samples (e.g., distance-, maximum-likelihood-, Bayesian-based).
If performing a reference based approach for variant detection, the reference applied for the analysis must be: STA00025 (S. enterica), EC002143 (E. coli) and SAH596 (S. aureus)."
"Specifically, up to two types of files should be submitted: For each dataset:
Import python libraries used
from glob import glob
from Bio import AlignIO
from Bio import SeqIO
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
import ete2
import time
import subprocess
import os
import copy
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
#some settings
matplotlib.rc('font', size=16)
matplotlib.rcParams['lines.linewidth'] = 3
matplotlib.rcParams['figure.figsize'] =(16,9)
Go to working directory
os.chdir('/nagyvinyok/adat83/sotejedlik/ribli/gmi_3')
Write bash params to file
%%writefile bash_params
Bowtie_Path=/nagyvinyok/adat88/kozos/sotejedlik/usr/bowtie2
SAMTOOLS=/nagyvinyok/adat88/kozos/sotejedlik/usr/samtools-1.2/samtools
BCFTOOLS=/nagyvinyok/adat88/kozos/sotejedlik/usr/samtools-1.2/bcftools
MAFFT=mafft
Extract sample files
TIME: It takes ~30 minutes.
%%bash
. bash_params
if [ -d samples ]
then
echo "Samples already extracted"
else
mkdir samples
unzip -d samples input/ecoliMay2015_PT.zip
unzip -d samples input/saureusMay2015_PT.zip
unzip -d samples input/typhimuriumMay2015_PT.zip
#The S. aureus samples are individually gzipped
cd samples/saureusPT
gunzip *.gz
fi
We did some very basic statistics on the provided datasets. In a real situation this should be done more thoroughly. We wanted to create as simple a workflow as possible: the goal of this notebook is to demonstrate the possibility of developing workflows in the IPython Notebook environment. For further quality control one can use FastQC.
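The same basic checks (read count and read-length tally) can be sketched in pure Python. The `fastq_stats` helper below is hypothetical, not part of the submitted workflow, and assumes well-formed four-line FASTQ records (header, sequence, `+`, quality) with no line-wrapped sequences:

```python
from collections import Counter

def fastq_stats(fastq_lines):
    """Count reads and tally read lengths from FASTQ text lines.

    Assumes well-formed records of exactly 4 lines each
    (header, sequence, '+', quality) -- no wrapped sequences.
    """
    seqs = fastq_lines[1::4]  # the sequence line of each record
    n_reads = len(seqs)
    length_counts = Counter(len(s) for s in seqs)
    return n_reads, length_counts
```

The cells below do the equivalent with shell one-liners; this sketch only illustrates the logic.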
Check how many samples we have
%%bash
. bash_params
echo 'Number of FASTQ files:'
echo "E. coli"
ls samples/ecoliPT/*_1.fastq | wc -l
echo "S. aureus"
ls samples/saureusPT/*_1.fastq | wc -l
echo "Typhimurium"
ls samples/typhimuriumPT/*_1.fastq | wc -l
%%bash
. bash_params
mkdir work
for i in samples/ecoliPT/*_1.fastq; do \
echo -n ${i}": "; cat $i | wc -l | awk '{print $1/4}' ;
done > work/readCount.dat
for i in samples/saureusPT/*_1.fastq; do \
echo -n ${i}": "; cat $i | wc -l | awk '{print $1/4}' ;
done >> work/readCount.dat
for i in samples/typhimuriumPT/*_1.fastq; do \
echo -n ${i}": "; cat $i | wc -l | awk '{print $1/4}' ;
done >> work/readCount.dat
nReads = pd.read_csv("work/readCount.dat", header=None, sep=" ")
nReads['name']=[str(x.split('/')[-1].split('_')[0]) for x in nReads[0]]
fig,ax=plt.subplots()
ax.bar(range(len(nReads)),nReads[1],facecolor='#5555ff', edgecolor='none')
ax.set_xticks(0.5+np.arange(len(nReads)))
ax.set_xticklabels(nReads['name'],rotation='vertical',fontsize=14)
dump=ax.set_xlim(0,len(nReads))
TIME: It takes ~15 minutes.
%%bash
. bash_params
for i in samples/ecoliPT/*_1.fastq; do \
echo -n ${i}": "; grep -v '[+@]' $i | awk 'BEGIN {;FS=""}{lengc[NF]++}\
END{for(i in lengc) printf("%d %d "),i,lengc[i]/2; printf("\n")}';
done > work/readLen.dat
for i in samples/saureusPT/*_1.fastq; do \
echo -n ${i}": "; grep -v '[+@]' $i | awk 'BEGIN {;FS=""}{lengc[NF]++}\
END{for(i in lengc) printf("%d %d "),i,lengc[i]/2; printf("\n")}';
done >> work/readLen.dat
for i in samples/typhimuriumPT/*_1.fastq; do \
echo -n ${i}": "; grep -v '[+@]' $i | awk 'BEGIN {;FS=""}{lengc[NF]++}\
END{for(i in lengc) printf("%d %d "),i,lengc[i]/2; printf("\n")}';
done >> work/readLen.dat
names,lens,counts=[],[],[]
with open('work/readLen.dat') as f:
    for line in f:
        names.append(line.split(' ')[0].split('/')[-1].split('_')[0])
        lens.append(map(int,line.split(' ')[1:-1:2]))
        counts.append(map(int,line.split(' ')[2:-1:2]))
N=np.max([max(leng) for leng in lens])
m=np.zeros((N+1,len(names)))
for i in xrange(len(names)):
    for j in xrange(len(lens[i])):
        m[lens[i][j],i]+=counts[i][j]
    m[:,i]/=np.sum(counts[i])
fig,ax=plt.subplots()
ax.imshow(m,interpolation='none',aspect=0.15,cmap=plt.cm.jet,
norm=matplotlib.colors.LogNorm(),origin='lower')
ax.set_xticks(np.arange(len(names)))
ax.set_xticklabels(names,rotation='vertical',fontsize=14)
ax.set_ylim(30,N+10)
ax.set_ylabel('read length')
dump=ax.set_xlim(-1,len(names))
NOTE: the reference files are named exactly like some of the samples.
%%bash
cd input
ls *fasta
Print the number of contigs in each reference, and the first contig name
%%bash
grep '>' input/EC002143.fasta | wc -l
grep '>' input/SAH596.fasta | wc -l
grep '>' input/STA00025.fasta | wc -l
echo
head -n1 input/EC002143.fasta
head -n1 input/SAH596.fasta
head -n1 input/STA00025.fasta
Cumulative distributions of contig lengths
fig,ax=plt.subplots()
fig.set_size_inches(16,9)
for fname in ['EC002143.fasta','SAH596.fasta','STA00025.fasta']:
    ref_dict=SeqIO.to_dict(SeqIO.parse('input/'+fname,"fasta"))
    x=np.sort([len(x) for x in ref_dict.values()])
    ax.plot(x,np.linspace(0,1,len(x)),'H',ms=10,mec='none',label=fname)
ax.set_xscale('log')
ax.set_xlabel('contig length')
ax.set_ylabel('cumulative distribution')
ax.legend(loc='best',fancybox=True)
dump=ax.set_ylim(0,1.1)
Full reference lengths
for fname in ['EC002143.fasta','SAH596.fasta','STA00025.fasta']:
    ref_dict=SeqIO.to_dict(SeqIO.parse('input/'+fname,"fasta"))
    print fname,'genome length:', np.sum([len(x) for x in ref_dict.values()])
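Total length alone does not describe how fragmented a draft reference is; a common complementary summary is the N50. A short sketch (the `n50` helper is hypothetical and not used elsewhere in this notebook):

```python
def n50(contig_lengths):
    """Smallest contig length such that contigs at least this long
    cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```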
Because the species of the samples were given, and we assumed the samples within each dataset would be very similar to each other, we decided to use a reference-based method.
We used the bowtie2 alignment software, mainly because the author of this notebook has the most experience with this aligner. Other well-established aligners (bwa, novoalign, etc.) should work equally well.
TIME: It only takes seconds.
%%bash
. bash_params
if [ -d indexes ]
then
echo "Indices already generated"
else
mkdir indexes
$Bowtie_Path/bowtie2-build input/EC002143.fasta indexes/EC002143
$SAMTOOLS faidx input/EC002143.fasta
$Bowtie_Path/bowtie2-build input/SAH596.fasta indexes/SAH596
$SAMTOOLS faidx input/SAH596.fasta
$Bowtie_Path/bowtie2-build input/STA00025.fasta indexes/STA00025
$SAMTOOLS faidx input/STA00025.fasta
fi
%%bash
. bash_params
mkdir -p align
#ecoliPT
DEST=align/ecoliPT
GENOME=EC002143
if [ -d $DEST ]
then
echo "$DEST already aligned"
else
mkdir $DEST
for i in samples/ecoliPT/*_1.fastq
do
j=${i%_1.fastq}
k=${j#samples/ecoliPT/}
$Bowtie_Path/bowtie2 -p 22 -x indexes/$GENOME -1 ${j}_1.fastq -2 ${j}_2.fastq \
| $SAMTOOLS view -b -S - > $DEST/$k.bam
$SAMTOOLS sort -@ 10 $DEST/$k.bam $DEST/$k.s
mv $DEST/$k.s.bam $DEST/$k.bam
$SAMTOOLS index $DEST/$k.bam
done
wait
fi
#saureusPT
DEST=align/saureusPT
GENOME=SAH596
if [ -d $DEST ]
then
echo "$DEST already aligned"
else
mkdir $DEST
for i in samples/saureusPT/*_1.fastq
do
j=${i%_1.fastq}
k=${j#samples/saureusPT/}
$Bowtie_Path/bowtie2 -p 22 -x indexes/$GENOME -1 ${j}_1.fastq -2 ${j}_2.fastq \
| $SAMTOOLS view -b -S - > $DEST/$k.bam
$SAMTOOLS sort -@ 10 $DEST/$k.bam $DEST/$k.s
mv $DEST/$k.s.bam $DEST/$k.bam
$SAMTOOLS index $DEST/$k.bam
done
wait
fi
#typhimuriumPT
DEST=align/typhimuriumPT
GENOME=STA00025
if [ -d $DEST ]
then
echo "$DEST already aligned"
else
mkdir $DEST
for i in samples/typhimuriumPT/*_1.fastq
do
j=${i%_1.fastq}
k=${j#samples/typhimuriumPT/}
$Bowtie_Path/bowtie2 -p 22 -x indexes/$GENOME -1 ${j}_1.fastq -2 ${j}_2.fastq \
| $SAMTOOLS view -b -S - > $DEST/$k.bam
$SAMTOOLS sort -@ 10 $DEST/$k.bam $DEST/$k.s
mv $DEST/$k.s.bam $DEST/$k.bam
$SAMTOOLS index $DEST/$k.bam
done
wait
fi
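The three blocks above differ only in the dataset and reference names, so the same commands could be generated from a single Python loop. A sketch that only builds the bowtie2 argument lists (paths taken from `bash_params` above; nothing is executed here, and actually running the commands would additionally need `subprocess` and the samtools steps):

```python
import os

# path from bash_params above
BOWTIE2 = '/nagyvinyok/adat88/kozos/sotejedlik/usr/bowtie2/bowtie2'

# (dataset, reference) pairs, as in the three shell blocks
DATASETS = [('ecoliPT', 'EC002143'),
            ('saureusPT', 'SAH596'),
            ('typhimuriumPT', 'STA00025')]

def bowtie2_cmd(dataset, genome, sample):
    """Argument list for aligning one paired-end sample, mirroring
    the bowtie2 invocation used in the shell blocks above."""
    prefix = os.path.join('samples', dataset, sample)
    return [BOWTIE2, '-p', '22',
            '-x', os.path.join('indexes', genome),
            '-1', prefix + '_1.fastq',
            '-2', prefix + '_2.fastq']
```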
TIME: It takes ~20 hours (Nov 1, 19:53 to Nov 2, 15:34).
%%bash
. bash_params
mkdir -p consensus
DEST=ecoliPT
GENOME=EC002143
if [ -d consensus/$DEST ]
then
echo "consensus for $DEST already generated"
else
mkdir consensus/$DEST
for i in samples/$DEST/*_1.fastq
do
j=${i%_1.fastq}
k=${j#samples/$DEST/}
$SAMTOOLS mpileup -v -f input/$GENOME.fasta align/$DEST/$k.bam > consensus/$DEST/$k.vcf
$BCFTOOLS index consensus/$DEST/$k.vcf
$BCFTOOLS call -o consensus/$DEST/$k.call.vcf -O z -c consensus/$DEST/$k.vcf
$BCFTOOLS index consensus/$DEST/$k.call.vcf
$BCFTOOLS consensus -f input/$GENOME.fasta consensus/$DEST/$k.call.vcf > consensus/$DEST/$k.cons.fa
awk -v SMPL=$k -v DDIR=consensus/$DEST/ \
'/^>/ {fajl= DDIR "all_" substr($1,2) ".fa"; print ">" SMPL >> fajl} !/^>/{print >> fajl}' \
consensus/$DEST/$k.cons.fa
done
for i in consensus/$DEST/all_*.fa
do
$MAFFT --clustalout $i > ${i%.fa}.clw
done
fi
DEST=saureusPT
GENOME=SAH596
if [ -d consensus/$DEST ]
then
echo "consensus for $DEST already generated"
else
mkdir consensus/$DEST
for i in samples/$DEST/*_1.fastq
do
j=${i%_1.fastq}
k=${j#samples/$DEST/}
$SAMTOOLS mpileup -v -f input/$GENOME.fasta align/$DEST/$k.bam > consensus/$DEST/$k.vcf
$BCFTOOLS index consensus/$DEST/$k.vcf
$BCFTOOLS call -o consensus/$DEST/$k.call.vcf -O z -c consensus/$DEST/$k.vcf
$BCFTOOLS index consensus/$DEST/$k.call.vcf
$BCFTOOLS consensus -f input/$GENOME.fasta consensus/$DEST/$k.call.vcf > consensus/$DEST/$k.cons.fa
awk -v SMPL=$k -v DDIR=consensus/$DEST/ \
'/^>/ {fajl= DDIR "all_" substr($1,2) ".fa"; print ">" SMPL >> fajl} !/^>/{print >> fajl}' \
consensus/$DEST/$k.cons.fa
done
for i in consensus/$DEST/all_*.fa
do
$MAFFT --clustalout $i > ${i%.fa}.clw
done
fi
DEST=typhimuriumPT
GENOME=STA00025
if [ -d consensus/$DEST ]
then
echo "consensus for $DEST already generated"
else
mkdir consensus/$DEST
for i in samples/$DEST/*_1.fastq
do
j=${i%_1.fastq}
k=${j#samples/$DEST/}
$SAMTOOLS mpileup -v -f input/$GENOME.fasta align/$DEST/$k.bam > consensus/$DEST/$k.vcf
$BCFTOOLS index consensus/$DEST/$k.vcf
$BCFTOOLS call -o consensus/$DEST/$k.call.vcf -O z -c consensus/$DEST/$k.vcf
$BCFTOOLS index consensus/$DEST/$k.call.vcf
$BCFTOOLS consensus -f input/$GENOME.fasta consensus/$DEST/$k.call.vcf > consensus/$DEST/$k.cons.fa
awk -v SMPL=$k -v DDIR=consensus/$DEST/ \
'/^>/ {fajl= DDIR "all_" substr($1,2) ".fa"; print ">" SMPL >> fajl} !/^>/{print >> fajl}' \
consensus/$DEST/$k.cons.fa
done
for i in consensus/$DEST/all_*.fa
do
$MAFFT --clustalout $i > ${i%.fa}.clw
done
fi
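The awk one-liner in each block regroups the per-sample consensus by contig: every contig's sequence is appended to `all_<contig>.fa` under a header carrying the sample name. The same logic in Python (a sketch operating on in-memory strings rather than files; `regroup_by_contig` is a hypothetical helper):

```python
from collections import defaultdict

def regroup_by_contig(sample, fasta_text):
    """Map contig name -> FASTA entry headed by the sample name,
    mirroring the awk step that builds the all_<contig>.fa files."""
    groups = defaultdict(list)
    contig = None
    for line in fasta_text.splitlines():
        if line.startswith('>'):
            # contig name = first word of the header, '>' stripped
            contig = line[1:].split()[0]
            groups[contig].append('>' + sample)
        elif contig is not None:
            groups[contig].append(line)
    return {c: '\n'.join(lines) for c, lines in groups.items()}
```

Running this for every sample and concatenating the results per contig would reproduce the per-contig multi-FASTA files that MAFFT then aligns.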
def read_clustal(input_str):
    start=time.time()
    init = None
    for fff in glob(input_str):
        if init is None:
            aln = filter_aln(AlignIO.read(fff, 'clustal'))
            aln.sort()
            init = 1
        else:
            aln_tmp = filter_aln(AlignIO.read(fff, 'clustal'))
            aln_tmp.sort()
            aln = aln + aln_tmp
    print 'Length of the final sequence: ',len(aln[0]),'bp'
    print 'Selecting positions took: ', time.time()-start,'s'
    return aln

def filter_aln(in_aln):
    out_aln=in_aln[:,0:1] #init (will be deleted)
    leng=len(in_aln)
    #loop over positions
    for i in xrange(len(in_aln[0])):
        counts=[ in_aln[:,i].count(x) for x in ['a','c','g','t','-']]
        if ( counts[-1]==0 and    #no gaps (zero coverage)
             max(counts)!=leng and #some difference needed
             sorted(counts)[2]==0): #at most 2 different letters
            out_aln+=in_aln[:,i:i+1]
    return out_aln[:,1:]
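To make the column filter concrete, here is the same rule applied to plain strings (a toy re-implementation, independent of Biopython): a column is kept only if it has no gap, is not monomorphic, and contains at most two distinct bases.

```python
def filter_columns(seqs):
    """Keep alignment columns with no gap, some variation, and at
    most two distinct bases -- the same rule as filter_aln above."""
    kept = []
    for col in zip(*seqs):
        counts = sorted(col.count(x) for x in 'acgt-')
        no_gap = col.count('-') == 0
        variable = max(counts) != len(seqs)
        biallelic = counts[2] == 0  # at most two non-zero counts
        if no_gap and variable and biallelic:
            kept.append(col)
    return [''.join(chars) for chars in zip(*kept)]
```

On three toy sequences, only the variable biallelic columns survive; columns that are identical across all samples are dropped.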
def write_seq_matrix(aln,output_fname):
    with open(output_fname, "w") as handle:
        SeqIO.write(aln, handle, "fasta")

def build_tree(aln):
    calculator = DistanceCalculator('identity')
    constructor = DistanceTreeConstructor(calculator, 'nj')
    start=time.time()
    tree = constructor.build_tree(aln)
    print 'Building the tree took: ', time.time()-start,'s'
    return tree

def get_dist_mat(aln):
    calculator = DistanceCalculator('identity')
    start=time.time()
    dm = calculator.get_distance(aln)
    print 'Calculating the distance matrix took: ', time.time()-start,'s'
    return dm
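Biopython's 'identity' distance is simply the fraction of mismatched columns between two equal-length aligned sequences. A sketch of the same quantity:

```python
def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length
    sequences differ (the 'identity' distance model)."""
    assert len(a) == len(b)
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / float(len(a))
```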
def plot_dist_mat(dm,cmap=plt.cm.Greens):
    fig,ax=plt.subplots()
    m=np.array(dm)
    ax.imshow(m,interpolation='none',cmap=cmap)
    ax.set_xticks(range(m.shape[0]))
    ax.set_xticklabels(dm.names,rotation='vertical')
    ax.set_yticks(range(m.shape[1]))
    dump=ax.set_yticklabels(dm.names)

def plot_tree(tree,xmin=None,xmax=None):
    def my_label(clade): #label only the leaves
        if str(clade)[:5] == 'Inner':
            return None
        else:
            return clade
    Phylo.draw(tree,my_label)

def write_tree(tree,output_fname):
    #reformat the tree for the accepted output format
    Phylo.write(tree,'temp.nw','newick')
    temp_tree=ete2.Tree('temp.nw',format=1)
    subprocess.call(['rm','temp.nw'])
    temp_tree.write(outfile=output_fname,format=9)
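write_tree relies on ete2 only to re-serialize the tree with leaf names alone (ete2's format=9). If ete2 is unavailable, the branch lengths and internal-node labels that Bio.Phylo writes can be stripped with regular expressions; a sketch, assuming plain unquoted labels (no colons or parentheses inside names):

```python
import re

def topology_only(newick):
    """Strip branch lengths and internal-node labels from a Newick
    string, approximating ete2's leaf-name-only output (format=9).
    Assumes labels are plain alphanumerics (no quoted names)."""
    no_lengths = re.sub(r':[0-9.eE+-]+', '', newick)
    return re.sub(r'\)[^,();]+', ')', no_lengths)
```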
Create directory for results
%%bash
mkdir results
ecoli_aln=read_clustal(input_str='consensus/ecoliPT/all_*.clw')
write_seq_matrix(ecoli_aln,'results/WIGNER_EC.fasta')
ecoli_tree=build_tree(ecoli_aln)
ecoli_dm=get_dist_mat(ecoli_aln)
write_tree(ecoli_tree,output_fname='results/WIGNER_EC.tre')
plot_dist_mat(ecoli_dm)
plot_tree(ecoli_tree)
saureus_aln=read_clustal(input_str='consensus/saureusPT/all_*.clw')
write_seq_matrix(saureus_aln,'results/WIGNER_SA.fasta')
saureus_tree=build_tree(saureus_aln)
saureus_dm=get_dist_mat(saureus_aln)
write_tree(saureus_tree,output_fname='results/WIGNER_SA.tre')
plot_dist_mat(saureus_dm,cmap=plt.cm.Blues)
plot_tree(saureus_tree)
typh_aln=read_clustal(input_str='consensus/typhimuriumPT/all_*.clw')
write_seq_matrix(typh_aln,'results/WIGNER_ST.fasta')
typh_tree=build_tree(typh_aln)
typh_dm=get_dist_mat(typh_aln)
write_tree(typh_tree,output_fname='results/WIGNER_ST.tre')
plot_dist_mat(typh_dm,cmap=plt.cm.Purples)
plot_tree(typh_tree)