Demo notebook

Integration of the CGE Outbreak Map in the IPython Notebook environment


About the Jupyter Notebook

The programming languages

  • This notebook is written in Python, but the same Jupyter framework works with many other languages (R, Ruby, Julia, Haskell, etc.). See the Jupyter project webpage for more information about language support
Link to Jupyter project

This is a markdown cell

  • You can write easy markdown headers and notes like this
  • Or you can write HTML, like the jumbotron above. In the notebook you can use the Bootstrap framework for nice buttons and similar elements, like the one below.
Link to Bootstrap
  • You can also write equations, which will be rendered by MathJax
$$E = mc^2$$

The purpose of this small demo is to explore the integration of pre-existing browser-based visualization tools in the IPython Notebook environment

We will use the CGE Outbreak Map for visualization

Purpose of the demonstration:

  • Show how ready-made parts of another pipeline can be integrated in the IPython Notebook environment
  • Show a pipeline that is easy to use, easy to tweak, and easy to enhance

We will use flu samples now

  • There are very few samples!

First, let's figure out how many flu samples there are

I will build the URL from logical blocks:

  • The thing below is called a code cell. If you press Shift+Enter or click the triangle in the menu bar, the code will be executed
In [1]:
url_base='http://www.ebi.ac.uk/ena/data/warehouse/search?' #base for advanced search
url_query='query=\"tax_tree(11320)\"' #influenza A taxon and all subordinates (tree)
url_result='&result=sample' # looking for samples, they have location
url_count='&resultcount' # count the results

url=url_base+url_query+url_result+url_count #concatenate

print 'The url is:',url #print
The url is: http://www.ebi.ac.uk/ena/data/warehouse/search?query="tax_tree(11320)"&result=sample&resultcount
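
The string concatenation above can also be done with the standard library's urlencode, which takes care of escaping the quotes and parentheses in the query. This is only a sketch: build_ena_url is a hypothetical helper, not part of the original notebook, and it assumes the server decodes percent-encoded parameters in the usual way. The import is written so that it runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_ena_url(base, params, flags=()):
    """Join a base URL, key=value parameters, and bare flags such as 'resultcount'.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    query = urlencode(params)  # percent-encodes quotes, parentheses, etc.
    if flags:
        query += '&' + '&'.join(flags)
    return base + query


url_base = 'http://www.ebi.ac.uk/ena/data/warehouse/search?'
params = [('query', '"tax_tree(11320)"'), ('result', 'sample')]
count_url = build_ena_url(url_base, params, flags=['resultcount'])
print(count_url)
```

A list of tuples is used instead of a dict so that the parameters keep the same order as in the hand-built URL.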

Query the URL and read the result back as a string

  • You can also click on the URL, and the results will be shown in the browser
In [2]:
import urllib #python modules for url-s
url_res = urllib.urlopen(url).read()
print url_res
#take the last token of the first line ("2,320") and strip the thousands separator
n_samples=int(''.join(url_res.split('\n')[0].split()[-1].split(',')))
Number of results: 2,320
Time taken: 0 seconds
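
The one-line parsing above relies on the count being the last token of the first line. A slightly more defensive variant is sketched below. It assumes the response always contains the text "Number of results:" as in the output shown here, and parse_result_count is a hypothetical name introduced for illustration.

```python
import re


def parse_result_count(response_text):
    """Return the integer that follows 'Number of results:' in the response.

    Raises ValueError if the expected text is missing.
    """
    match = re.search(r'Number of results:\s*([\d,]+)', response_text)
    if match is None:
        raise ValueError('no result count found in response')
    # Remove the thousands separator before converting, e.g. "2,320"
    return int(match.group(1).replace(',', ''))


# The response text copied from the cell output above
sample_response = 'Number of results: 2,320\nTime taken: 0 seconds\n'
n_samples = parse_result_count(sample_response)
print(n_samples)
```

Raising an error on an unexpected response makes a failed query visible right away, instead of failing later when the count is used to build the next URL.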

Now I will download the information associated with the samples

Build the URL again

In [3]:
url_base='http://www.ebi.ac.uk/ena/data/warehouse/search?'
url_query='query=\"tax_tree(11320)\"'
url_result='&result=sample'
url_display='&display=report' #report is the tab separated output
url_fields='&fields=accession,country,collection_date,host,location' #get accession, location and related fields
url_limits='&offset=1&length='+str(n_samples) #get all the results

url=url_base+url_query+url_result+url_display+url_fields+url_limits
print 'The url is:',url
The url is: http://www.ebi.ac.uk/ena/data/warehouse/search?query="tax_tree(11320)"&result=sample&display=report&fields=accession,country,collection_date,host,location&offset=1&length=2320

The result is a tab separated table

  • I will download the table to a string
In [27]:
ena_flu_loco_page = urllib.urlopen(url).read()

Load the table into a pandas DataFrame

  • Pandas is a very useful library for data analysis in Python
  • The DataFrame object is similar to R data frames
In [48]:
import pandas as pd #pandas
from StringIO import StringIO #for reading string into pandas
ena_flu_loco_table = pd.read_csv(StringIO(ena_flu_loco_page),sep='\t')

Peek into the table

  • Unfortunately most of the values are missing (NaNs)
In [49]:
ena_flu_loco_table.head()
Out[49]:
accession country collection_date host location
0 SAMD00018947 NaN NaN Homo sapiens NaN
1 SAMD00018948 NaN NaN Homo sapiens NaN
2 SAMEA1573029 NaN NaN NaN NaN
3 SAMEA1573030 NaN NaN NaN NaN
4 SAMEA1573031 NaN NaN NaN NaN

See how many samples have both a geolocation and a collection date

In [51]:
print "The number of samples with a geolocation and a date is: ",
print len(ena_flu_loco_table[
    ena_flu_loco_table['location'].notnull() &
    ena_flu_loco_table['collection_date'].notnull() ])
The number of samples with a geolocation and a date is:  39

Get rid of samples with no geolocation or no collection date

In [52]:
ena_flu_loco_table=ena_flu_loco_table[
    (pd.isnull(ena_flu_loco_table['location']) == False) &
    (pd.isnull(ena_flu_loco_table['collection_date']) == False) ]
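
The same filter can probably be written more compactly with pandas' dropna and its subset argument. The sketch below runs on a small toy table, because the real one depends on the download. The accessions A1 to A3 and the missing values are made up for illustration only.

```python
import pandas as pd

# Toy stand-in for ena_flu_loco_table; the rows are illustrative, not real data
toy = pd.DataFrame({
    'accession': ['A1', 'A2', 'A3'],
    'location': ['39.4871 S 176.8210 E', None, '37.7498 S 176.4095 E'],
    'collection_date': ['2005-01-01', '2004-01-01', None],
})

# Keep only the rows where both columns are present
complete = toy.dropna(subset=['location', 'collection_date'])
print(len(complete))
```

Only the first row survives, because the other two each miss one of the required fields.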

Parse the latitudes, longitudes, and dates

  • The data is in a different format than the map needs, so I have to convert it: the coordinates use hemisphere letters (N, E, S, W) instead of negative values
  • Some dates are given as two dates separated by '/'; I keep the first one
In [54]:
def parse_lat(string_loc):
    loc_list=string_loc.split(' ')
    if (loc_list[1] =='N'):
        return float(loc_list[0])
    elif (loc_list[1] =='S'):
        return -float(loc_list[0])
    
def parse_lon(string_loc):
    loc_list=string_loc.split(' ')
    if (loc_list[3] =='E'):
        return float(loc_list[2])
    elif (loc_list[3] =='W'):
        return -float(loc_list[2])
    
ena_flu_loco_table['lat']=map(parse_lat,ena_flu_loco_table['location'])
ena_flu_loco_table['lon']=map(parse_lon,ena_flu_loco_table['location'])
ena_flu_loco_table['date']=[x.split('/')[0] for x in ena_flu_loco_table['collection_date']]

ena_flu_loco_table=ena_flu_loco_table[['lat','lon','accession','country',
                                      'date','host']]
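
The two parsing functions above silently return None when the hemisphere letter is not recognised. A variant that parses both coordinates in one pass and fails loudly is sketched below. It assumes the location string looks like "39.4871 S 176.8210 E", which is reconstructed here from the parsed values in the table and not copied from the raw data. parse_location and first_date are hypothetical names.

```python
def parse_location(string_loc):
    """Convert 'LAT N|S LON E|W' into a signed (lat, lon) pair of floats."""
    parts = string_loc.split()
    if len(parts) != 4:
        raise ValueError('unexpected location format: %r' % string_loc)
    lat_value, lat_hemi, lon_value, lon_hemi = parts
    if lat_hemi not in ('N', 'S') or lon_hemi not in ('E', 'W'):
        raise ValueError('unexpected hemisphere in: %r' % string_loc)
    # South and West become negative values, as the map expects
    lat = float(lat_value) if lat_hemi == 'N' else -float(lat_value)
    lon = float(lon_value) if lon_hemi == 'E' else -float(lon_value)
    return lat, lon


def first_date(collection_date):
    """Return the first date of a 'date1/date2' range, or the date itself."""
    return collection_date.split('/')[0]


lat, lon = parse_location('39.4871 S 176.8210 E')
print(lat)
print(lon)
print(first_date('2005-01-01/2005-12-31'))
```

Returning both coordinates from one function means the string is split only once per sample.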

Peek into the table

In [55]:
ena_flu_loco_table.head()
Out[55]:
lat lon accession country date host
925 -39.4871 176.8210 SAMN01094185 New Zealand 2005-01-01 Mallard
926 -37.7498 176.4095 SAMN01094186 New Zealand 2004-01-01 Mallard
927 -39.4871 176.8210 SAMN01094187 New Zealand 2005-01-01 Mallard
928 -37.7498 176.4095 SAMN01094188 New Zealand 2005-01-01 Mallard
929 -37.7498 176.4095 SAMN01094189 New Zealand 2005-01-01 Mallard

Convert the table to the format accepted by the CGE Outbreak Map

In [60]:
import pandas as pd
cge_table=pd.DataFrame(columns=['city','google_location','source_note','strain','collection_date',
                               'country','region','collected_by','longitude','isolation_source',
                                'pathogenic','latitude','location_note','pathogenicity_note',
                               'organism','notes','zip_code'])

cge_table['latitude']=ena_flu_loco_table['lat']
cge_table['longitude']=ena_flu_loco_table['lon']
cge_table['country']=ena_flu_loco_table['country']
cge_table['collection_date']=ena_flu_loco_table['date']
cge_table['isolation_source']=ena_flu_loco_table['host']

Write it out as JSON to the location the map reads from

  • For now this overwrites the existing file!
  • Note: if you overwrite the influenza or Ebola data, you need to write the date in the matching format, or the map will fail. Ebola: year only; influenza: full date
In [61]:
cge_table.to_json('json/influenza_data.js',orient='records')
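
With orient='records', pandas writes the table as a JSON array with one object per row, and missing values become null. The sketch below shows that shape using only the standard json module. The record holds a subset of the columns, with values taken from the first row of the table shown earlier, and city stands in for the columns that are left empty.

```python
import json

# One row of the table, written by hand in the "records" layout
records = [
    {
        'latitude': -39.4871,
        'longitude': 176.821,
        'country': 'New Zealand',
        'collection_date': '2005-01-01',
        'isolation_source': 'Mallard',
        'city': None,  # columns that were never filled in are written as null
    },
]

text = json.dumps(records)
print(text)

# Reading it back gives the same list of dictionaries
restored = json.loads(text)
```

Looking at the layout this way can help when checking why the map rejects a file, for example because of the date format mentioned above.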

Load the CGE Outbreak Map

  • The data we created is in the Demo Data / Influenza tab
In [62]:
from IPython.display import HTML
HTML('''
<div class="wrap">
    <iframe class="frame" src="index.html"></iframe>
</div>

<style>
wrap {
    width: 1px;
    height: 1px;
    padding: 0;
    overflow: hidden;
}
.frame {
    width: 1050px;
    height: 780px;
    border: 0;
    -ms-transform: scale(0.9);
    -moz-transform: scale(0.9);
    -o-transform: scale(0.9);
    -webkit-transform: scale(0.9);
    transform: scale(0.9);
    
    -ms-transform-origin: 0 0;
    -moz-transform-origin: 0 0;
    -o-transform-origin: 0 0;
    -webkit-transform-origin: 0 0;
    transform-origin: 0 0;
}
</style>
''')
Out[62]:

TO DO

  • add a taxon name search, so that real taxon names can be used
  • nicer iframe frame (no scrollbars)
  • dedicated user JSON input file
  • package the map so it can be shown with a single Python command
  • had to change uppercase letters from the Bitbucket version
  • ...