Demo notebook

Query all the samples around Europe and draw them on an interactive map

About the Jupyter notebook

The programming languages

This notebook is written in Python, but you can use the same Jupyter framework with many other languages (R, Ruby, Julia, Haskell, etc.). See the Jupyter project webpage for more information about supported programming languages.

Link to Jupyter project

This is a markdown cell

You can write simple markdown headers and notes like this.
Or you can write HTML, like the jumbotron above. In the notebook, you can use the Bootstrap framework to get nice buttons etc., like the one below.

Link to Bootstrap

You can also write equations, which will be rendered by MathJax

$$E = mc^2$$

The purpose of this small demo is to explore geolocations in the ENA

First, I should figure out how many flu samples there are to download

ENA has an advanced search option that lets us discover data with some filtering. It has a graphical interface, but it also supports programmatic access through URL-based queries.
Advanced search graphical interface
Advanced search tutorial

I will build the URL from logical blocks:

The thing below is called a code cell. If you press Shift+Enter, or click the triangle in the menu bar, the code will be executed.

url_base='http://www.ebi.ac.uk/ena/data/warehouse/search?' #base for advanced search
url_query='query=\"geo_box1(30,-30,72,58)\"' # all samples around europe  
url_result='&result=sample' # looking for samples, they have location
url_count='&resultcount' # count the results

url=url_base+url_query+url_result+url_count #concatenate
print 'The url is:',url #print

The url is: http://www.ebi.ac.uk/ena/data/warehouse/search?query="geo_box1(30,-30,72,58)"&result=sample&resultcount
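As an aside, the same URL can also be assembled with the standard library's urlencode, which percent-escapes the quotes, parentheses and commas in the query. This is only a sketch, written for Python 3 (the cells in this notebook use Python 2). build_count_url is a hypothetical helper name, and I am assuming the server treats the escaped query the same way as the raw one.

```python
from urllib.parse import urlencode

# Same endpoint as in the cell above
URL_BASE = 'http://www.ebi.ac.uk/ena/data/warehouse/search?'

def build_count_url(box):
    """Hypothetical helper: URL that counts the samples inside a geo_box1 box."""
    # box holds the four geo_box1 arguments, in the same order as above
    query = '"geo_box1({},{},{},{})"'.format(*box)
    # urlencode escapes the special characters in each value
    params = urlencode([('query', query), ('result', 'sample')])
    # resultcount is a flag without a value, so it is appended by hand
    return URL_BASE + params + '&resultcount'

count_url = build_count_url((30, -30, 72, 58))
print(count_url)
```

Keeping the box as a single tuple makes it easy to rerun the count for another region without editing the string by hand.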

Query the URL and read the result back as a string

You can also click on the URL, and you will be presented with the results in the browser.

import urllib #python modules for url-s
res = urllib.urlopen(url).read()
print res
n_sample=int(''.join(res.split('\n')[0].split(' ')[-1].split(',')))
print "Number of samples: ",n_sample

Number of results: 30,290
Time taken: 0 seconds
Number of samples:  30290
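The chained split calls above depend on the count being the last word of the first line. A slightly more defensive way to pull the number out is a regular expression. The sketch below is written for Python 3 and works on a copy of the response text shown above, so it needs no network access. parse_result_count is a hypothetical helper name.

```python
import re

def parse_result_count(text):
    """Extract the integer from a line such as 'Number of results: 30,290'."""
    match = re.search(r'Number of results:\s*([\d,]+)', text)
    if match is None:
        raise ValueError('no result count found in the response')
    # drop the thousands separators before converting
    return int(match.group(1).replace(',', ''))

# copy of the response printed above
sample_response = 'Number of results: 30,290\nTime taken: 0 seconds\n'
print(parse_result_count(sample_response))  # prints 30290
```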

Now I will download all the geolocation information associated with the samples

Build the URL again

url_base='http://www.ebi.ac.uk/ena/data/warehouse/search?'
url_query='query=\"geo_box1(30,-30,72,58)\"'  
url_result='&result=sample'
url_display='&display=report' #report is the tab separated output
url_fields='&fields=accession,location' #get accession and location
url_limits='&offset=1&length='+str(n_sample) #get all the results

url=url_base+url_query+url_result+url_display+url_fields+url_limits
print 'The url is:',url #print

The url is: http://www.ebi.ac.uk/ena/data/warehouse/search?query="geo_box1(30,-30,72,58)"&result=sample&display=report&fields=accession,location&offset=1&length=30290
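If the table were too large to fetch in one request, the offset and length parameters could in principle be used to download it in pages. The Python 3 sketch below only computes the (offset, length) pairs. It assumes offsets are 1-based, as offset=1 above suggests. page_params and the page size of 10000 are my own illustrative choices, not something the service prescribes.

```python
def page_params(n_total, page_size):
    """Yield (offset, length) pairs that cover n_total rows, with offsets starting at 1."""
    for start in range(1, n_total + 1, page_size):
        # the last page may be shorter than page_size
        yield start, min(page_size, n_total - start + 1)

pages = list(page_params(30290, 10000))
for offset, length in pages:
    print('&offset={}&length={}'.format(offset, length))
```

Each pair would replace the url_limits block in the URL above, and the downloaded pieces would then be concatenated.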

The result is a tab-separated table. I will download the table into a string.

ena_flu_loco_page = urllib.urlopen(url).read()

Load the table into a pandas DataFrame

Pandas is a very useful library for data analysis in Python.
The DataFrame object is similar to R data frames.

import pandas as pd #pandas
from StringIO import StringIO #for reading string into pandas
ena_flu_loco_table = pd.read_csv(StringIO(ena_flu_loco_page),sep='\t')

Peek into the table

ena_flu_loco_table.head()
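For readers on Python 3, where the StringIO module no longer exists, the same loading step can be sketched with io.StringIO. The two rows below are copied from the table output of this notebook so that the example runs without a download. sample_page and table are placeholder names.

```python
import io
import pandas as pd

# two rows copied from the notebook output, in the tab-separated report format
sample_page = (
    'accession\tlocation\n'
    'SAMD00018983\t52.0167 N 4.3667 E\n'
    'SAMD00018984\t52.0167 N 4.3667 E\n'
)

# read_csv accepts any file-like object, so the string is wrapped in StringIO
table = pd.read_csv(io.StringIO(sample_page), sep='\t')
print(table.head())
```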

Parse the latitudes and longitudes

The data is in a different format from the one the map needs, so I need to convert it (N, E, S, W suffixes instead of negative values).

def parse_lat(string_loc):
    loc_list=string_loc.split(' ')
    if (loc_list[1] =='N'):
        return float(loc_list[0])
    elif (loc_list[1] =='S'):
        return -float(loc_list[0])
    
def parse_lon(string_loc):
    loc_list=string_loc.split(' ')
    if (loc_list[3] =='E'):
        return float(loc_list[2])
    elif (loc_list[3] =='W'):
        return -float(loc_list[2])
    
ena_flu_loco_table['lat']=map(parse_lat,ena_flu_loco_table['location'])
ena_flu_loco_table['lon']=map(parse_lon,ena_flu_loco_table['location'])

ena_flu_loco_table=ena_flu_loco_table[['lat','lon','accession']]

ena_flu_loco_table.head()
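The two parsers above return None when the hemisphere letter is not one of the expected ones. A variant that parses both coordinates in one pass and fails loudly on unexpected input could look like the Python 3 sketch below. parse_location and SIGN are hypothetical names, and the second test string is made up to exercise the S and W branches.

```python
# sign for each hemisphere letter
SIGN = {'N': 1.0, 'S': -1.0, 'E': 1.0, 'W': -1.0}

def parse_location(string_loc):
    """Turn a string like '52.0167 N 4.3667 E' into signed (lat, lon) floats."""
    lat, ns, lon, ew = string_loc.split()
    if ns not in ('N', 'S') or ew not in ('E', 'W'):
        raise ValueError('unexpected location: ' + string_loc)
    return SIGN[ns] * float(lat), SIGN[ew] * float(lon)

# first string is from the table above, second one is invented for the demo
locations = ['52.0167 N 4.3667 E', '33.5 S 70.25 W']
lats, lons = zip(*[parse_location(s) for s in locations])
print(lats, lons)
```

The resulting lats and lons sequences can be assigned to the 'lat' and 'lon' columns in the same way as the map results above.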

See how many unique locations we have

print 'Number of unique locations:',
print len(ena_flu_loco_table.groupby(['lat','lon']).size().reset_index())

Number of unique locations: 3303
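The same count can be obtained without building the intermediate table. The Python 3 sketch below shows two equivalent ways on a toy DataFrame. The toy coordinates and accession labels are invented for illustration.

```python
import pandas as pd

# toy data: two samples share one location, a third sits elsewhere
toy = pd.DataFrame({
    'lat': [52.0167, 52.0167, 48.2],
    'lon': [4.3667, 4.3667, 16.37],
    'accession': ['A', 'B', 'C'],
})

# ngroups is the number of distinct (lat, lon) pairs
n_groups = toy.groupby(['lat', 'lon']).ngroups
# drop_duplicates keeps one row per distinct pair
n_unique = len(toy[['lat', 'lon']].drop_duplicates())
print(n_groups, n_unique)  # prints 2 2
```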

Next, prepare the text that will be shown on the map when you click on a point with the mouse.

Contents:

Number of cases
List of accession numbers, truncated if too long

I am using the SQL-like groupby statement to group the samples.

#the function used for grouping
def form_acc(x):
    acc = list(x['accession'])
    if len(acc) < 5:
        #short list: show every accession
        acc_list = ' '.join(acc)
    else:
        #long list: keep the first two and the last two accessions
        acc_list = ' '.join(acc[:2]) + ' ... ' + ' '.join(acc[-2:])
    return pd.Series({'count': len(acc), 'acc_list': acc_list})

#group-by
uniq_locs_w_acc=ena_flu_loco_table.groupby(['lat','lon']).apply(form_acc).reset_index()
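To make the result of the grouping concrete, here is a small Python 3 sketch that produces the same two columns on toy data. It computes the count and the accession summary separately instead of going through apply. summarise_accessions is a hypothetical helper, and the accession labels are placeholders, not real ENA accessions.

```python
import pandas as pd

def summarise_accessions(accessions):
    """Join accessions; with 5 or more, keep only the first two and the last two."""
    acc = list(accessions)
    if len(acc) < 5:
        return ' '.join(acc)
    return ' '.join(acc[:2]) + ' ... ' + ' '.join(acc[-2:])

# toy data: six samples at one location, one sample at another
toy = pd.DataFrame({
    'lat': [52.0167] * 6 + [48.2],
    'lon': [4.3667] * 6 + [16.37],
    'accession': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'T1'],
})

grouped = toy.groupby(['lat', 'lon'])['accession']
summary = pd.DataFrame({
    'count': grouped.size(),                        # number of cases per location
    'acc_list': grouped.agg(summarise_accessions),  # popup text per location
}).reset_index()
print(summary)
```

The location with six samples gets the truncated list 'S1 S2 ... S5 S6', while the single-sample location keeps its full list.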

Plot the points on the map

I will use the Folium library, which is a Python wrapper for the Leaflet JavaScript library for map-based visualizations.

(The magic is in the HTML output of the cell. If you are interested, you can read the HTML source code of the notebook output cell.)

First define the map drawing function

from IPython.core.display import HTML
import folium

def inline_map(m, width=650, height=500):
    """Takes a folium instance and embed HTML."""
    m._build_map()
    srcdoc = m.HTML.replace('"', '&quot;')
    embed = HTML('<iframe srcdoc="{}" '
                 'style="width: {}px; height: {}px; '
                 'border: none"></iframe>'.format(srcdoc, width, height))
    return embed
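The embedding trick itself does not depend on Folium: a whole HTML page is escaped and placed in the srcdoc attribute of an iframe. The function above escapes only double quotes. The Python 3 sketch below uses html.escape from the standard library, which also escapes ampersands and angle brackets, so text that already contains entities survives the round trip. iframe_srcdoc is a hypothetical name and the page string is a made-up example.

```python
import html

def iframe_srcdoc(page_html, width=650, height=500):
    """Wrap a full HTML page in an <iframe srcdoc=...> string."""
    # quote=True also escapes double and single quotes
    srcdoc = html.escape(page_html, quote=True)
    return ('<iframe srcdoc="{}" '
            'style="width: {}px; height: {}px; '
            'border: none"></iframe>').format(srcdoc, width, height)

page = '<p title="a & b">hi</p>'
embedded = iframe_srcdoc(page)
print(embedded)
```

In a notebook, the returned string would be passed to HTML(...) from IPython, as in the cell above.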

Initialize the map object

width, height = 650, 500
flu_map = folium.Map(location=[47, -17], zoom_start=3,
                    tiles='OpenStreetMap', width=width, height=height)

Add points to the map object

Let's make the point area proportional to the number of cases
- This is misleading, because in some places all the nearby cases have the same position (Europe), while in others the positions are more scattered (Shanghai)

for i in xrange(len(uniq_locs_w_acc)):
    loc=(uniq_locs_w_acc.iloc[i]['lat'],uniq_locs_w_acc.iloc[i]['lon'] )
    name='Number of cases: '+str(uniq_locs_w_acc.iloc[i]['count'])
    name+='   Accessions: '+uniq_locs_w_acc.iloc[i]['acc_list']
    size=uniq_locs_w_acc.iloc[i]['count'] ** 0.5 
    
    flu_map.circle_marker(location=loc, radius=1e3*size,
                          line_color='none',fill_color='#3186cc',
                          fill_opacity=0.7, popup=name)
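The square root in the loop above is what makes the area, rather than the radius, proportional to the number of cases: a circle's area grows with the square of its radius. A minimal Python 3 sketch of that scaling, with marker_radius as a hypothetical helper and the same 1e3 scale factor as above:

```python
import math

def marker_radius(count, scale=1e3):
    """Radius whose circle area is proportional to count."""
    return scale * math.sqrt(count)

def circle_area(radius):
    return math.pi * radius ** 2

r1 = marker_radius(1)
r4 = marker_radius(4)
# four times the cases gives twice the radius and four times the area
print(r4 / r1, circle_area(r4) / circle_area(r1))
```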

And finally draw the map

inline_map(flu_map)

Some notes about this notebook:

Memory footprint:

python: ~500 MiB
chrome: ~200 MiB

Map:

Rendering is slow on a Bay Trail Intel processor
Folium has limited customization

Some small details:

Bad geolocations can be seen at 0 longitude, which is not surprising
Grid in France (due to truncation of decimal values?)
The line of the Danube can clearly be seen

Output of the first ena_flu_loco_table.head() call (raw table):

	accession	location
0	SAMD00018983	52.0167 N 4.3667 E
1	SAMD00018984	52.0167 N 4.3667 E
2	SAMD00018985	52.0167 N 4.3667 E
3	SAMD00018986	52.0167 N 4.3667 E
4	SAMD00018987	52.0167 N 4.3667 E

Output of the second ena_flu_loco_table.head() call (after parsing):

	lat	lon	accession
0	52.0167	4.3667	SAMD00018983
1	52.0167	4.3667	SAMD00018984
2	52.0167	4.3667	SAMD00018985
3	52.0167	4.3667	SAMD00018986
4	52.0167	4.3667	SAMD00018987