Varscan bug in somatic filtering

Varscan has a bug in somatic filtering (what is quite unbelievable for me) : If you specify a somatic filter, there are numerous entries in the output which dont pass the somatic filter threshold.

The Fischer-exact test values seem to be correct (not shown here, only checked by hand). I tested numerous Varscan version, all of them failed.

I'm running Varscan on a 10Mbp long sample section to show the bug


In [2]:
import subprocess
import pandas as pd
from pandas import DataFrame as df

path and tools

In [3]:
#gallus reference
galref="/home/ribli/input/index/gallus/Gallus_gallus.Galgal4.74.dna.toplevel.fa"

#input files
norm_bam="/nagyvinyok/adat83/sotejedlik/dt40/rmdup/DS50_RMdup_picard_realign.bam"
tum_bam="/nagyvinyok/adat83/sotejedlik/dt40/rmdup/DS41_RMdup_picard_realign.bam"

Create pileup on a smaller part for test run

  • Length is 10M basepair
In [4]:
#temp files
norm_pup="DS41_tmp.pup"
tum_pup="DS50_tmp.pup"

#position
pos=' 1:40,000,000-50,000,000 '

#pileup flags
pup_flags=' -Q 30 -B '

#create pileup
cmd='samtools mpileup '+ pup_flags+ ' -f ' + galref + ' -r '+pos + norm_bam  + ' > '+norm_pup
subprocess.call(cmd,shell=True)
print cmd
cmd='samtools mpileup '+ pup_flags+ ' -f ' + galref + ' -r '+pos + tum_bam  + ' > '+tum_pup
subprocess.call(cmd,shell=True)
print cmd
samtools mpileup  -Q 30 -B  -f /home/ribli/input/index/gallus/Gallus_gallus.Galgal4.74.dna.toplevel.fa -r  1:40,000,000-50,000,000 /nagyvinyok/adat83/sotejedlik/dt40/rmdup/DS50_RMdup_picard_realign.bam > DS41_tmp.pup
samtools mpileup  -Q 30 -B  -f /home/ribli/input/index/gallus/Gallus_gallus.Galgal4.74.dna.toplevel.fa -r  1:40,000,000-50,000,000 /nagyvinyok/adat83/sotejedlik/dt40/rmdup/DS41_RMdup_picard_realign.bam > DS50_tmp.pup

Run varscan

In [5]:
#varscan params
mincov=' 5 '
pval=' 0.05 '

#output file
output="testout_41_51_som_filt.vsc"

#run varscan
cmd='java -jar VarScan.v2.3.7.jar somatic '+ norm_pup +' ' +tum_pup + ' ' + output
cmd+=' --min-coverage '+mincov+' --somatic-p-value '+ pval
print cmd
print subprocess.call(cmd,shell=True)
java -jar VarScan.v2.3.7.jar somatic DS41_tmp.pup DS50_tmp.pup testout_41_51_som_filt.vsc --min-coverage  5  --somatic-p-value  0.05 
0

Check out output

In [6]:
snp_result=pd.read_csv(output+'.snp',sep='\t')
snp_result=snp_result.convert_objects(convert_numeric=True)
snp_result[['ref','var','normal_reads1','normal_reads2',
            'tumor_reads1','tumor_reads2',
           'somatic_status','variant_p_value','somatic_p_value']].iloc[:10]
Out[6]:
ref var normal_reads1 normal_reads2 tumor_reads1 tumor_reads2 somatic_status variant_p_value somatic_p_value
0 C T 0 22 0 17 Germline 3.674172e-23 1
1 T C 0 15 0 23 Germline 1.450827e-22 1
2 G A 0 17 0 18 Germline 8.913746e-21 1
3 T C 0 23 0 14 Germline 5.726949e-22 1
4 A T 0 23 0 11 Germline 3.514563e-20 1
5 G C 0 32 0 17 Germline 3.925015e-29 1
6 C T 0 36 0 18 Germline 4.022885e-32 1
7 C T 0 35 0 20 Germline 1.014948e-32 1
8 C T 0 27 0 20 Germline 6.151192e-28 1
9 G A 0 25 0 15 Germline 9.301702e-24 1

Somatic results

In [7]:
print 'Small preview'
snp_result[snp_result['somatic_status']=='Somatic'][['ref','var',
            'normal_reads1','normal_reads2',
            'tumor_reads1','tumor_reads2',
           'somatic_status','variant_p_value','somatic_p_value']].iloc[:10]
Small preview
Out[7]:
ref var normal_reads1 normal_reads2 tumor_reads1 tumor_reads2 somatic_status variant_p_value somatic_p_value
142 A T 20 0 8 4 Somatic 1 0.013765
225 C A 17 0 2 2 Somatic 1 0.028571
855 C T 13 3 7 12 Somatic 1 0.009749
1391 C T 16 0 9 3 Somatic 1 0.067155
1562 T A 20 0 11 3 Somatic 1 0.060829
1793 G A 6 0 8 2 Somatic 1 0.375000
2193 T C 10 0 7 2 Somatic 1 0.210526
2219 T A 15 0 8 2 Somatic 1 0.150000
2445 C T 13 0 8 2 Somatic 1 0.177866
2515 G A 8 0 5 6 Somatic 1 0.017028

Somatic filter bug

  • There are many mutation with somatic status == Somatic, and they have somatic p-value higher than 0.05
In [8]:
somatic_bug=snp_result[(snp_result['somatic_status']=='Somatic') &
           (snp_result['somatic_p_value'] > 0.05 )][['ref','var',
            'normal_reads1','normal_reads2',
            'tumor_reads1','tumor_reads2',
           'somatic_status','variant_p_value','somatic_p_value']]
print 'There are numerous position involved: N=',len(somatic_bug.index)
print 'Small preview:'
somatic_bug.iloc[:10]
There are numerous position involved: N= 185
Small preview:
Out[8]:
ref var normal_reads1 normal_reads2 tumor_reads1 tumor_reads2 somatic_status variant_p_value somatic_p_value
1391 C T 16 0 9 3 Somatic 1 0.067155
1562 T A 20 0 11 3 Somatic 1 0.060829
1793 G A 6 0 8 2 Somatic 1 0.375000
2193 T C 10 0 7 2 Somatic 1 0.210526
2219 T A 15 0 8 2 Somatic 1 0.150000
2445 C T 13 0 8 2 Somatic 1 0.177866
3083 T A 13 0 8 2 Somatic 1 0.177866
3500 T A 4 0 3 2 Somatic 1 0.277778
5037 G T 19 0 7 2 Somatic 1 0.095238
5934 G T 7 0 5 2 Somatic 1 0.230769

Clean up

In [8]:
print subprocess.call(['rm',norm_pup,tum_pup,output+'.snp',output+'.indel'])
0