Test varscan run

  • 10M basepair long section

In [1]:
import subprocess
import pandas as pd
from pandas import DataFrame as df

path and tools

In [2]:
#gallus reference
galref="/home/ribli/input/index/gallus/Gallus_gallus.Galgal4.74.dna.toplevel.fa"

#input files
norm_bam="/nagyvinyok/adat83/sotejedlik/ngs_data/enzimologia/R3_bam/DS41.bam"
tum_bam="/nagyvinyok/adat83/sotejedlik/ngs_data/enzimologia/R3_bam/DS50.bam"

Create pileup on a smaller part for test run

  • Length is 10M basepair
In [3]:
#temp files
norm_pup="DS41_tmp.pup"
tum_pup="DS50_tmp.pup"

#position
pos=' 1:40,000,000-50,000,000 '

#pileup flags
pup_flags=' -A -Q 0 -B '

#create pileup
cmd='samtools mpileup '+ pup_flags+ ' -f ' + galref + ' -r '+pos + norm_bam  + ' > '+norm_pup
print cmd
print subprocess.call(cmd,shell=True)
cmd='samtools mpileup '+ pup_flags+ ' -f ' + galref + ' -r '+pos + tum_bam  + ' > '+tum_pup
print cmd
print subprocess.call(cmd,shell=True)
samtools mpileup  -A -Q 0 -B  -f /home/ribli/input/index/gallus/Gallus_gallus.Galgal4.74.dna.toplevel.fa -r  1:40,000,000-50,000,000 /nagyvinyok/adat83/sotejedlik/ngs_data/enzimologia/R3_bam/DS41.bam > DS41_tmp.pup
0
samtools mpileup  -A -Q 0 -B  -f /home/ribli/input/index/gallus/Gallus_gallus.Galgal4.74.dna.toplevel.fa -r  1:40,000,000-50,000,000 /nagyvinyok/adat83/sotejedlik/ngs_data/enzimologia/R3_bam/DS50.bam > DS50_tmp.pup
0

Run varscan

In [4]:
#varscan params
mincov=' 5 '
pval=' 0.05 '

#output file
output="testout_41_51.vsc"

#run varscan
cmd='java -jar VarScan.v2.3.7.jar somatic '+ norm_pup +' ' +tum_pup + ' ' + output
cmd+=' --min-coverage '+mincov+' --somatic-p-value '+ pval
print cmd
print subprocess.call(cmd,shell=True)
java -jar VarScan.v2.3.7.jar somatic DS41_tmp.pup DS50_tmp.pup testout_41_51.vsc --min-coverage  5  --somatic-p-value  0.05 
0

Check out output

In [5]:
snp_result=pd.read_csv(output+'.snp',sep='\t')
snp_result=snp_result.convert_objects(convert_numeric=True)
snp_result[['ref','var','normal_reads1','normal_reads2',
            'tumor_reads1','tumor_reads2',
           'somatic_status','variant_p_value','somatic_p_value']].iloc[:10]
Out[5]:
ref var normal_reads1 normal_reads2 tumor_reads1 tumor_reads2 somatic_status variant_p_value somatic_p_value
0 C T 0 22 0 26 Germline 1.553985e-28 1
1 T C 0 28 0 17 Germline 9.631367e-27 1
2 G A 0 23 0 20 Germline 1.506579e-25 1
3 T C 0 15 0 25 Germline 9.301702e-24 1
4 A T 0 13 0 26 Germline 3.674172e-23 1
5 G C 0 18 0 38 Germline 2.560229e-33 1
6 C T 0 21 0 40 Germline 2.608977e-36 1
7 C T 0 22 0 38 Germline 1.035037e-35 1
8 C T 0 22 0 29 Germline 2.502447e-30 1
9 G A 0 17 0 28 Germline 9.631367e-27 1

Somatic results

In [6]:
print 'Small preview'
snp_result[snp_result['somatic_status']=='Somatic'][['ref','var',
            'normal_reads1','normal_reads2',
            'tumor_reads1','tumor_reads2',
           'somatic_status','variant_p_value','somatic_p_value']].iloc[:10]
Small preview
Out[6]:
ref var normal_reads1 normal_reads2 tumor_reads1 tumor_reads2 somatic_status variant_p_value somatic_p_value
364 T C 10 2 6 10 Somatic 1 0.019156
486 C T 18 3 12 18 Somatic 1 0.001111
768 C A 5 1 1 6 Somatic 1 0.025058
774 G C 7 1 5 10 Somatic 1 0.018778
937 T C 18 0 14 5 Somatic 1 0.026676
1192 C A 20 0 6 2 Somatic 1 0.074074
1807 G T 11 0 16 4 Somatic 1 0.153981
1842 C A 21 0 8 2 Somatic 1 0.096774
1844 G A 14 0 15 5 Somatic 1 0.055718
1980 G T 14 0 15 4 Somatic 1 0.094721

Somatic filter bug?

  • There are many mutation with somatic status == Somatic, and they have somatic p-value higher than 0.05
In [7]:
somatic_bug=snp_result[(snp_result['somatic_status']=='Somatic') &
           (snp_result['somatic_p_value'] > 0.05 )][['ref','var',
            'normal_reads1','normal_reads2',
            'tumor_reads1','tumor_reads2',
           'somatic_status','variant_p_value','somatic_p_value']]
print 'There are numerous position involved: N=',len(somatic_bug.index)
print 'Small preview:'
somatic_bug.iloc[:10]
There are numerous position involved: N= 219
Small preview:
Out[7]:
ref var normal_reads1 normal_reads2 tumor_reads1 tumor_reads2 somatic_status variant_p_value somatic_p_value
1192 C A 20 0 6 2 Somatic 1 0.074074
1807 G T 11 0 16 4 Somatic 1 0.153981
1842 C A 21 0 8 2 Somatic 1 0.096774
1844 G A 14 0 15 5 Somatic 1 0.055718
1980 G T 14 0 15 4 Somatic 1 0.094721
2653 C A 11 0 11 3 Somatic 1 0.158261
3159 C T 9 0 7 2 Somatic 1 0.235294
3574 C A 5 0 7 3 Somatic 1 0.263736
3735 C G 7 0 7 2 Somatic 1 0.300000
3835 A T 6 0 8 2 Somatic 1 0.375000

Clean up

In [8]:
print subprocess.call(['rm',norm_pup,tum_pup,output+'.snp',output+'.indel'])
0

Because of the somatic filter bug,I run it without somatic filter, and then apply the filter by hand

  • Is it still safe?? If there is something wrong with somatic filtering, maybe now we are still affected?
In [9]:
#create pileup
cmd='samtools mpileup '+ pup_flags+ ' -f ' + galref + ' -r '+pos + norm_bam  + ' > '+norm_pup
print cmd
print subprocess.call(cmd,shell=True)
cmd='samtools mpileup '+ pup_flags+ ' -f ' + galref + ' -r '+pos + tum_bam  + ' > '+tum_pup
print cmd
print subprocess.call(cmd,shell=True)

#change somatic p-value filter to a very high value
pval=' 0.9 '

#run varscan
cmd='java -jar VarScan.v2.3.7.jar somatic '+ norm_pup +' ' +tum_pup + ' ' + output
cmd+=' --min-coverage '+mincov+' --somatic-p-value '+ pval
print cmd
print subprocess.call(cmd,shell=True)
samtools mpileup  -A -Q 0 -B  -f /home/ribli/input/index/gallus/Gallus_gallus.Galgal4.74.dna.toplevel.fa -r  1:40,000,000-50,000,000 /nagyvinyok/adat83/sotejedlik/ngs_data/enzimologia/R3_bam/DS41.bam > DS41_tmp.pup
0
samtools mpileup  -A -Q 0 -B  -f /home/ribli/input/index/gallus/Gallus_gallus.Galgal4.74.dna.toplevel.fa -r  1:40,000,000-50,000,000 /nagyvinyok/adat83/sotejedlik/ngs_data/enzimologia/R3_bam/DS50.bam > DS50_tmp.pup
0
java -jar VarScan.v2.3.7.jar somatic DS41_tmp.pup DS50_tmp.pup testout_41_51.vsc --min-coverage  5  --somatic-p-value  0.9 
0

Check out output

In [10]:
snp_result=pd.read_csv(output+'.snp',sep='\t')
snp_result=snp_result.convert_objects(convert_numeric=True)

See the called somatic snps with p < 0.05

In [11]:
snp_s_called=snp_result[(snp_result['somatic_status']=='Somatic') &
           (snp_result['somatic_p_value'] < 0.05 )][['ref','var',
            'normal_reads1','normal_reads2',
            'tumor_reads1','tumor_reads2',
           'somatic_status','variant_p_value','somatic_p_value']]
print 'Number of somatic snps found N=',len(snp_s_called.index)
print 'Small preview:'
snp_s_called.iloc[:10]
Number of somatic snps found N= 226
Small preview:
Out[11]:
ref var normal_reads1 normal_reads2 tumor_reads1 tumor_reads2 somatic_status variant_p_value somatic_p_value
368 T C 10 2 6 10 Somatic 1 0.019156
490 C T 18 3 12 18 Somatic 1 0.001111
778 C A 5 1 1 6 Somatic 1 0.025058
784 G C 7 1 5 10 Somatic 1 0.018778
950 T C 18 0 14 5 Somatic 1 0.026676
3655 G T 24 0 23 6 Somatic 1 0.020691
4003 T C 16 3 10 11 Somatic 1 0.017065
4085 G A 13 2 7 10 Somatic 1 0.009893
4146 T C 16 3 4 18 Somatic 1 0.000027
4400 A G 24 6 8 13 Somatic 1 0.002868

See the called somatic snps with much stricter p < 0.005

In [12]:
snp_s_called=snp_result[(snp_result['somatic_status']=='Somatic') &
           (snp_result['somatic_p_value'] < 0.005 )][['ref','var',
            'normal_reads1','normal_reads2',
            'tumor_reads1','tumor_reads2',
           'somatic_status','variant_p_value','somatic_p_value']]
print 'Number of somatic snps found N=',len(snp_s_called.index)
print 'Small preview:'
snp_s_called.iloc[:10]
Number of somatic snps found N= 55
Small preview:
Out[12]:
ref var normal_reads1 normal_reads2 tumor_reads1 tumor_reads2 somatic_status variant_p_value somatic_p_value
490 C T 18 3 12 18 Somatic 1 0.001111
4146 T C 16 3 4 18 Somatic 1 0.000027
4400 A G 24 6 8 13 Somatic 1 0.002868
6292 G T 17 3 9 14 Somatic 1 0.002445
7110 C T 17 4 6 19 Somatic 1 0.000138
10080 G C 18 4 11 16 Somatic 1 0.003872
12149 G A 12 2 10 16 Somatic 1 0.004663
12880 G A 19 3 14 19 Somatic 1 0.001077
15389 C T 38 9 27 24 Somatic 1 0.003109
15567 G A 10 1 9 17 Somatic 1 0.002033

Conclusions

There are many 'somatic' mutations found altough we know that there should be as few as possible

  • 10M long section is 1% of the genome, so number of mutations that varscan finds:
    • p < 0.05 : ~ 22600 'mutations'
    • p < 0.005 : ~ 5500 'mutations'
  • These are huge numbers

Should we try even stricter p value??

Note, that these samples (41,50) are particularly noisy!!! Even in our method


Clean up

In [13]:
print subprocess.call(['rm',norm_pup,tum_pup,output+'.snp',output+'.indel'])
0

My notes from before

1, the sum of the bases doesnt add up the coverage

- filtering based on base quality
- and other stuff like * : deleted base
counts in the coverage why??


2, somatic filter and somatic p value discrepancy???

- output file p values way higher than threshold
- but filter does something because the numbber 
of hits decreases with more strict filter 

- filter wokrs nice at pileup2snp but that has
no utility to be used with somatic filtering...


3, coverage filtering is based on pileup coverage value

- very low qualitz bases also contribute to coverage
- even * the deleted base counts in coverage...
    - this causes false positives...
    - cannot be solved with var-freq-for-hom filter