{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "

Demo notebook

\n", "\n", "

Query all the samples around Europe and draw them on an interactive map

\n", "\n", "
\n", "\n", "---\n", "#About the Jupyter notebook\n", "\n", "\n", "####The programming languages\n", "- This notebook is written in Python, but you can use the same Jupyter framework with many other languages (R, Ruby, Julia, Haskell, etc.). Please explore the Jupyter project webpage for more information about programming language support.\n", "\n", "
\n", "Link to Jupyter project\n", "
\n", "\n", "\n", "\n", "####This is a markdown cell\n", "- You can write simple markdown headers and notes like this\n", "- Or you can write HTML like the jumbotron above. In the notebook, you can use the Bootstrap framework to get nice buttons etc., like the one below.\n", "\n", "
\n", "Link to Bootstrap\n", "
\n", "\n", "- You can also write equations, which will be rendered by [MathJax](http://www.mathjax.org/)\n", "\n", "$$E = mc^2$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "#The purpose of this small demo is to explore geolocations in the [ENA](http://www.ebi.ac.uk/ena)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###First I should figure out how many samples there are to download\n", "\n", "- ENA has an advanced search option where we can discover data with some filtering. It has a graphical interface, but it also supports programmatic access through URL-based queries.\n", "- [Advanced search graphical interface](http://www.ebi.ac.uk/ena/data/warehouse/search)\n", "- [Advanced search tutorial](http://www.ebi.ac.uk/ena/support/advanced-search-tutorial)\n", "\n", "I will build the URL from its logical blocks:\n", "- The block below is called a code cell. If you press Shift+Enter or click the triangle in the menu bar, the code will be executed." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The url is: http://www.ebi.ac.uk/ena/data/warehouse/search?query=\"geo_box1(30,-30,72,58)\"&result=sample&resultcount\n" ] } ], "source": [ "url_base='http://www.ebi.ac.uk/ena/data/warehouse/search?' 
#base for advanced search\n", "url_query='query=\\\"geo_box1(30,-30,72,58)\\\"' # all samples around Europe\n", "url_result='&result=sample' # looking for samples, they have location\n", "url_count='&resultcount' # count the results\n", "\n", "url=url_base+url_query+url_result+url_count #concatenate\n", "print 'The url is:',url #print" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query the URL and read the result back as a string\n", "- You can also click on the URL, and you will be presented with the results in the browser" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of results: 30,290\n", "Time taken: 0 seconds\n", "Number of samples: 30290\n" ] } ], "source": [ "import urllib #Python module for URLs\n", "res = urllib.urlopen(url).read()\n", "print res\n", "#take the last word of the first line and drop the thousands separators\n", "n_sample=int(''.join(res.split('\\n')[0].split(' ')[-1].split(',')))\n", "print \"Number of samples: \",n_sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Now I will download all the geolocation information associated with the samples\n", "\n", "Build the URL again" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The url is: http://www.ebi.ac.uk/ena/data/warehouse/search?query=\"geo_box1(30,-30,72,58)\"&result=sample&display=report&fields=accession,location&offset=1&length=30290\n" ] } ], "source": [ "url_base='http://www.ebi.ac.uk/ena/data/warehouse/search?'\n", "url_query='query=\\\"geo_box1(30,-30,72,58)\\\"' \n", "url_result='&result=sample'\n", "url_display='&display=report' #report is the tab-separated output\n", "url_fields='&fields=accession,location' #get accession and location\n", "url_limits='&offset=1&length='+str(n_sample) #get all the results\n", "\n", "url=url_base+url_query+url_result+url_display+url_fields+url_limits\n", "print 'The 
url is:',url #print" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a tab-separated table; I will download it into a string" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ena_flu_loco_page = urllib.urlopen(url).read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the table into a pandas DataFrame\n", "- [Pandas](http://pandas.pydata.org/) is a very useful library for data analysis in Python\n", "- The DataFrame object is similar to R data frames" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd #pandas\n", "from StringIO import StringIO #for reading string into pandas\n", "ena_flu_loco_table = pd.read_csv(StringIO(ena_flu_loco_page),sep='\\t')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Peek into the table" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>accession</th>\n",
"      <th>location</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>SAMD00018983</td>\n",
"      <td>52.0167 N 4.3667 E</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>SAMD00018984</td>\n",
"      <td>52.0167 N 4.3667 E</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>SAMD00018985</td>\n",
"      <td>52.0167 N 4.3667 E</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>SAMD00018986</td>\n",
"      <td>52.0167 N 4.3667 E</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>SAMD00018987</td>\n",
"      <td>52.0167 N 4.3667 E</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "      accession            location\n", "0  SAMD00018983  52.0167 N 4.3667 E\n", "1  SAMD00018984  52.0167 N 4.3667 E\n", "2  SAMD00018985  52.0167 N 4.3667 E\n", "3  SAMD00018986  52.0167 N 4.3667 E\n", "4  SAMD00018987  52.0167 N 4.3667 E" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ena_flu_loco_table.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Parse the latitudes and longitudes\n", "- The data is in a different format than the map needs, so I need to convert it (N, E, S, W suffixes instead of signed values)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#location strings look like '52.0167 N 4.3667 E'\n", "def parse_lat(string_loc):\n", "    loc_list=string_loc.split(' ')\n", "    if (loc_list[1] =='N'):\n", "        return float(loc_list[0])\n", "    elif (loc_list[1] =='S'):\n", "        return -float(loc_list[0])\n", "\n", "def parse_lon(string_loc):\n", "    loc_list=string_loc.split(' ')\n", "    if (loc_list[3] =='E'):\n", "        return float(loc_list[2])\n", "    elif (loc_list[3] =='W'):\n", "        return -float(loc_list[2])\n", "\n", "ena_flu_loco_table['lat']=map(parse_lat,ena_flu_loco_table['location'])\n", "ena_flu_loco_table['lon']=map(parse_lon,ena_flu_loco_table['location'])\n", "\n", "ena_flu_loco_table=ena_flu_loco_table[['lat','lon','accession']]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>lat</th>\n",
"      <th>lon</th>\n",
"      <th>accession</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>52.0167</td>\n",
"      <td>4.3667</td>\n",
"      <td>SAMD00018983</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>52.0167</td>\n",
"      <td>4.3667</td>\n",
"      <td>SAMD00018984</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>52.0167</td>\n",
"      <td>4.3667</td>\n",
"      <td>SAMD00018985</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>52.0167</td>\n",
"      <td>4.3667</td>\n",
"      <td>SAMD00018986</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>52.0167</td>\n",
"      <td>4.3667</td>\n",
"      <td>SAMD00018987</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "       lat     lon     accession\n", "0  52.0167  4.3667  SAMD00018983\n", "1  52.0167  4.3667  SAMD00018984\n", "2  52.0167  4.3667  SAMD00018985\n", "3  52.0167  4.3667  SAMD00018986\n", "4  52.0167  4.3667  SAMD00018987" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ena_flu_loco_table.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###See how many unique locations we have" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of unique locations: 3303\n" ] } ], "source": [ "print 'Number of unique locations:',\n", "print len(ena_flu_loco_table.groupby(['lat','lon']).size().reset_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Generate a popup string for each unique location\n", "- This will be shown on the map when you click on the point with the mouse\n", "\n", "Contents:\n", "- Number of cases\n", "- List of accession numbers, truncated if too long\n", "\n", "I am using the SQL-like groupby statement to group the samples" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#the function used for grouping\n", "#fewer than 5 accessions: list all; otherwise keep the first two and last two\n", "def form_acc(x):\n", "    if (x['accession'].size < 5):\n", "        return pd.Series(\n", "            dict({'count' : x['accession'].size, 'acc_list' : ' '.join(x['accession']),\n", "                 }))\n", "    else:\n", "        return pd.Series(\n", "            dict({'count' : x['accession'].size, 'acc_list' : ' '.join(list(\n", "                x['accession'])[:2]) + ' ... 
' + ' '.join(list(\n", "                x['accession'])[-2:])}))\n", "\n", "#group-by\n", "uniq_locs_w_acc=ena_flu_loco_table.groupby(['lat','lon']).apply(form_acc).reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Plot the points on a map\n", "\n", "I will use the [Folium](http://folium.readthedocs.org/en/latest/) library, which is a Python wrapper for the [Leaflet](http://leafletjs.com/) JavaScript library for map-based visualizations\n", "- (The magic will be in the HTML output of the cell; if you are interested, you can read the HTML source code of the notebook output cell)\n", "\n", "\n", "\n", "First define the map drawing function" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from IPython.core.display import HTML\n", "import folium\n", "\n", "def inline_map(m, width=650, height=500):\n", "    \"\"\"Takes a folium instance and embeds it as HTML.\"\"\"\n", "    m._build_map()\n", "    srcdoc = m.HTML.replace('\"', '&quot;')\n", "    embed = HTML('<iframe srcdoc=\"{}\" style=\"width: {}px; height: {}px; border: none\"></iframe>'.format(srcdoc, width, height))\n", "    return embed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initialize the map object" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "width, height = 650, 500\n", "flu_map = folium.Map(location=[47, -17], zoom_start=3,\n", "                     tiles='OpenStreetMap', width=width, height=height)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add points to the map object\n", "\n", "- Let's make the point area proportional to the number of cases\n", "    - This is misleading, because in some regions all nearby cases share the same position (Europe), while in others the positions are more scattered (Shanghai)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for i in xrange(len(uniq_locs_w_acc)):\n", "    loc=(uniq_locs_w_acc.iloc[i]['lat'],uniq_locs_w_acc.iloc[i]['lon'] )\n", "    name='Number of cases: '+str(uniq_locs_w_acc.iloc[i]['count'])\n", "    name+=' Accessions: '+uniq_locs_w_acc.iloc[i]['acc_list']\n", "    #radius grows with sqrt(count), so the circle area grows with count\n", "    size=uniq_locs_w_acc.iloc[i]['count'] ** 0.5\n", "\n", "    flu_map.circle_marker(location=loc, radius=1e3*size,\n", "                          line_color='none',fill_color='#3186cc',\n", "                          fill_opacity=0.7, popup=name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally draw the map" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "