Demo notebook

Query all the samples around europe, and draw them on an interactive map


About the Jupyter-notebook

The programming languages

  • This notebook is written in python, but you can use the exact same jupyter framework in many different languages (R,Ruby,Julia,Haskell etc..). Please explore the jupyter project webpage for more information about support for programming languages
Link to Jupyter project

This is a markdown cell

  • You can write easy markdown headers and notes like this
  • Or you can write html ike the jumbotron above. In the notebook, you can use the Bootstrap framework to have nice buttons etc. like the one below.
Link to Bootstrap
  • You can also write equations, which will be rendered by MathJax
$$E = mc^2$$

The purpose of this small demo is to explore geolocations in the ENA

First I should figure, how many flu samples are there to download?

I will build the url from the logical blocks:

  • the thing below is called a code cell, If you push Shift+Enter, or click the triangle at the menubar, the code will be executed
In [1]:
url_base='http://www.ebi.ac.uk/ena/data/warehouse/search?' #base for advanced search
url_query='query=\"geo_box1(30,-30,72,58)\"' # all samples around europe  
url_result='&result=sample' # looking for samples, they have location
url_count='&resultcount' # count the results

url=url_base+url_query+url_result+url_count #concatenate
print 'The url is:',url #print
The url is: http://www.ebi.ac.uk/ena/data/warehouse/search?query="geo_box1(30,-30,72,58)"&result=sample&resultcount

Query the url, read the result back as a string

  • Actually you can also click on it, and you will be presented with the results int the browser
In [2]:
import urllib #python modules for url-s
res = urllib.urlopen(url).read()
print res
n_sample=int(''.join(res.split('\n')[0].split(' ')[-1].split(',')))
print "Number of samples: ",n_sample
Number of results: 30,290
Time taken: 0 seconds
Number of samples:  30290

Now i will download all the geolocation information associated with the samples

Build url again

In [3]:
url_base='http://www.ebi.ac.uk/ena/data/warehouse/search?'
url_query='query=\"geo_box1(30,-30,72,58)\"'  
url_result='&result=sample'
url_display='&display=report' #report is the tab separated output
url_fields='&fields=accession,location' #get accesion and location
url_limits='&offset=1&length='+str(n_sample) #get all the results

url=url_base+url_query+url_result+url_display+url_fields+url_limits
print 'The url is:',url #print
The url is: http://www.ebi.ac.uk/ena/data/warehouse/search?query="geo_box1(30,-30,72,58)"&result=sample&display=report&fields=accession,location&offset=1&length=30290

The result is a tab separated table, I will download the table to a string

In [4]:
ena_flu_loco_page = urllib.urlopen(url).read()

Load the table into a pandas DataFrame

  • Pandas is a very useful library for data analysis in python
  • The DataFrame object is similar to R dataframes
In [5]:
import pandas as pd #pandas
from StringIO import StringIO #for reading string into pandas
ena_flu_loco_table = pd.read_csv(StringIO(ena_flu_loco_page),sep='\t')

Peek into the table

In [6]:
ena_flu_loco_table.head()
Out[6]:
accession location
0 SAMD00018983 52.0167 N 4.3667 E
1 SAMD00018984 52.0167 N 4.3667 E
2 SAMD00018985 52.0167 N 4.3667 E
3 SAMD00018986 52.0167 N 4.3667 E
4 SAMD00018987 52.0167 N 4.3667 E

Parse the longitudes, longitudes

  • The data is in a different format than the map will need read, so I need to convert is. (N,E,S,W instead of negative values )
In [7]:
def parse_lat(string_loc):
    loc_list=string_loc.split(' ')
    if (loc_list[1] =='N'):
        return float(loc_list[0])
    elif (loc_list[1] =='S'):
        return -float(loc_list[0])
    
def parse_lon(string_loc):
    loc_list=string_loc.split(' ')
    if (loc_list[3] =='E'):
        return float(loc_list[2])
    elif (loc_list[3] =='W'):
        return -float(loc_list[2])
    
ena_flu_loco_table['lat']=map(parse_lat,ena_flu_loco_table['location'])
ena_flu_loco_table['lon']=map(parse_lon,ena_flu_loco_table['location'])

ena_flu_loco_table=ena_flu_loco_table[['lat','lon','accession']]
In [8]:
ena_flu_loco_table.head()
Out[8]:
lat lon accession
0 52.0167 4.3667 SAMD00018983
1 52.0167 4.3667 SAMD00018984
2 52.0167 4.3667 SAMD00018985
3 52.0167 4.3667 SAMD00018986
4 52.0167 4.3667 SAMD00018987

See how many unique locations we have

In [9]:
print 'Number of unique locations:',
print len(ena_flu_loco_table.groupby(['lat','lon']).size().reset_index())
Number of unique locations: 3303

Generate a popup string for each unique location

  • This will be shown on the map, when you click on the point with the mouse

Contents:

  • Number of cases
  • list of accession numbers, truncated if too long

I am using the sql-like groupby statement for group the samples

In [10]:
#the function used for grouping
def form_acc(x):
    if (x['accession'].size < 5):
        return pd.Series(
            dict({'count' : x['accession'].size, 'acc_list' : ' '.join(x['accession']),
                }))
    else:
        return pd.Series(
            dict({'count' : x['accession'].size, 'acc_list' : ' '.join(list(
                        x['accession'])[:2]) + ' ... ' + ' '.join(list(
                        x['accession'])[-2:])}))

#group-by
uniq_locs_w_acc=ena_flu_loco_table.groupby(['lat','lon']).apply(form_acc).reset_index()

Plot the points on map

I will use the Folium library which is python wrapper for the Leaflet javasript library for map based visualizations

  • (The magic will be in the html output in the cell, in you are interested you can read the html source code of the notebook output cell)

First define the map drawing function

In [11]:
from IPython.core.display import HTML
import folium

def inline_map(m, width=650, height=500):
    """Takes a folium instance and embed HTML."""
    m._build_map()
    srcdoc = m.HTML.replace('"', '&quot;')
    embed = HTML('<iframe srcdoc="{}" '
                 'style="width: {}px; height: {}px; '
                 'border: none"></iframe>'.format(srcdoc, width, height))
    return embed

Initialize the map object

In [12]:
width, height = 650, 500
flu_map = folium.Map(location=[47, -17], zoom_start=3,
                    tiles='OpenStreetMap', width=width, height=height)

Add point to the map object

  • Let's make point area proportional to number of cases
    • This is miseleading, beacuse somewhere all the cases around have the sample position (Europe), and somewhere the positions are more scattered (Shanghai)
In [13]:
for i in xrange(len(uniq_locs_w_acc)):
    loc=(uniq_locs_w_acc.iloc[i]['lat'],uniq_locs_w_acc.iloc[i]['lon'] )
    name='Number of cases: '+str(uniq_locs_w_acc.iloc[i]['count'])
    name+='   Accesions: '+uniq_locs_w_acc.iloc[i]['acc_list']
    size=uniq_locs_w_acc.iloc[i]['count'] ** 0.5 
    
    flu_map.circle_marker(location=loc, radius=1e3*size,
                          line_color='none',fill_color='#3186cc',
                          fill_opacity=0.7, popup=name)

And finally draw the map

In [14]:
inline_map(flu_map)
Out[14]:

Some notes about this notebook:

Memory footprint:

  • python: ~500 MiB
  • chrome: ~200 MiB

Map:

  • Rendering slow on a Bay-Trail intel proc
  • Folium has limited customization

Some small details:

  • Bad geolocations can be seen at 0 longitude, but it is not surprising
  • Grid in france (due to truncation of decimal values?)
  • Line of the Danube can clearly be seen