COMP0233: Research Software Engineering With Python

An example Python data analysis notebook

This page illustrates how to use Python to perform a simple but complete analysis: retrieve data, do some computations based on it, and visualise the results.

Don't worry if you don't understand everything on this page! Its purpose is to give you an example of things you can do and how to go about doing them. You are not expected to be able to reproduce an analysis like this in Python at this stage! We will be looking at the concepts and practices introduced on this page as we go through the course.

As we show the code for different parts of the work, we will be touching on various aspects you may want to keep in mind, either related to Python specifically, or to research programming more generally.

Why write software to manage your data and plots?

We can use programs for our entire research pipeline: not just big scientific simulation codes, but also the small scripts which we use to tidy up data and produce plots. These steps should be code, so that the whole research pipeline is recorded for reproducibility. Data manipulation in spreadsheets is much harder to share or check.

You can see another similar demonstration on the Software Carpentry site. We'll try to give links to other sources of Python training along the way. Part of our approach is that we assume you know how to use the internet! If you find something confusing out there, please bring it along to the next session. In this course, we'll always try to draw your attention to other sources of information about what we're learning. Paying attention to as many of these as you need is just as important as reading these core notes.

Importing Libraries

Research programming is all about using libraries: tools other people have provided, programs that do many cool things. By combining them we can feel really powerful while doing minimal work ourselves. The Python syntax to import someone else's library is "import".

In [1]:
%pip install -q geopy  # install geopy if not already installed
import geopy # A python library for investigating geographic information.
# https://pypi.org/project/geopy/
Note: you may need to restart the kernel to use updated packages.

Now, if you try to follow along on this example in a Jupyter notebook, you'll probably find that you just got an error message.

You'll need to wait until we've covered installation of additional Python libraries later in the course, then come back to this and try again. For now, just follow along and try to get the feel for how programming for data-focused research works.

In [2]:
# Select geocoding service provided by OpenStreetMap's Nominatim - https://wiki.openstreetmap.org/wiki/Nominatim
geocoder = geopy.geocoders.Nominatim(user_agent="comp0023") 
geocoder.geocode('Cambridge', exactly_one=False)
Out[2]:
[Location(Cambridge, Cambridgeshire, Cambridgeshire and Peterborough, England, United Kingdom, (52.2055314, 0.1186637, 0.0)),
 Location(Cambridge, Middlesex County, Massachusetts, United States, (42.3656347, -71.1040018, 0.0)),
 Location(Cambridge, Region of Waterloo, Ontario, Canada, (43.3600536, -80.3123023, 0.0)),
 Location(Cambridge, Isanti County, Minnesota, 55008, United States, (45.5727408, -93.2243921, 0.0)),
 Location(Cambridge, Waipa District, Waikato, 3434, New Zealand / Aotearoa, (-37.8917889, 175.4691069, 0.0)),
 Location(Cambridge, Henry County, Illinois, United States, (41.3025257, -90.1962861, 0.0)),
 Location(Cambridge, Dorchester County, Maryland, 21613, United States, (38.5714624, -76.0763177, 0.0)),
 Location(Cambridge, Guernsey County, Ohio, 43725, United States, (40.031183, -81.5884561, 0.0)),
 Location(Cambridge, Union Township, Story County, Iowa, United States, (41.8990768, -93.5294029, 0.0)),
 Location(Cambridge, Jefferson County, Kentucky, United States, (38.2217369, -85.616627, 0.0))]

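The results come out as a list of Location objects. Each one can be unpacked like a small nested container: a name followed by a (latitude, longitude) pair, i.e. [Name, [Latitude, Longitude]]. Programs represent data in a variety of different containers like this.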
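As a minimal sketch of how a name-plus-coordinates container like this can be taken apart, the snippet below uses a hand-written stand-in for one result. The values are copied from the output above, and `result` is an illustrative name rather than anything geopy defines.

```python
# A hand-written stand-in for one geocoding result:
# a name followed by a (latitude, longitude) pair.
result = ["Cambridge", (52.2055314, 0.1186637)]

# Indexing reaches into the containers one level at a time.
name = result[0]
latitude = result[1][0]

# Unpacking does the same job in one step and names every piece.
name, (latitude, longitude) = result
print(name, latitude, longitude)
```

Unpacking tends to read better than repeated indexing once a container has more than one level.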

Comments

Code after a # symbol doesn't get run.

In [3]:
print("This runs") # print("This doesn't")
# print("This doesn't either")
This runs

Functions

We can wrap code up in a function, so that we can repeatedly get just the information we want.

In [4]:
def geolocate(city):
    """Get the latitude and longitude of a specific location."""
    
    full_name, coordinates = geocoder.geocode(city)
    return coordinates

Defining functions which put together code to make a more complex task seem simple from the outside is the most important thing in programming. The output of the function is specified using the return keyword. The input to the function is put inside brackets after the function name:

In [5]:
geolocate(city='Cambridge')
Out[5]:
(52.2055314, 0.1186637)

Variables

We can store a result in a variable:

In [6]:
london_location = geolocate("London")
print(london_location)
(51.4893335, -0.14405508452768728)

More complex functions

We'll fetch a map of a place from the Google Maps server, given a longitude and latitude. The URLs look like: https://mt0.google.com/vt?x=658&y=340&z=10&lyrs=s. Since we'll frequently be generating these URLs, we will create two helper functions to make our life easier.

The first is a function to convert our latitude and longitude into the coordinate tiles system used by Google Maps. We will then create a second function to build up a web request from the URL given our parameters.

In [7]:
import os
import math
import requests

def deg2num(lat_deg, lon_deg, zoom):
    """Convert latitude and longitude to XY tiles coordinates."""

    lat_rad = math.radians(lat_deg)
    n = 2.0 ** zoom
    x_tiles_coord = int((lon_deg + 180.0) / 360.0 * n)
    y_tiles_coord = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)

    return (x_tiles_coord, y_tiles_coord)

def request_map_at(latitude, longitude, zoom=10, satellite=True):
    """Retrieve a map from Google at a given location."""

    base_url = "https://mt0.google.com/vt?"
    x_coord, y_coord = deg2num(latitude, longitude, zoom)

    params = dict(
        x=x_coord,
        y=y_coord,
        z=zoom,
    )
    if satellite:
        params['lyrs'] = 's'
    
    return requests.get(base_url, params=params)
In [8]:
london_latitude, london_longitude = london_location
map_response = request_map_at(london_latitude, london_longitude)
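To make the tile conversion concrete, here is a self-contained sketch that repeats the same formula and applies it to the London coordinates found earlier. It builds the query string with the standard library's `urlencode` instead of `requests`, which is an assumption made only so that the example needs no network access; `example_url` is an illustrative name.

```python
import math
from urllib.parse import urlencode

def deg2num(lat_deg, lon_deg, zoom):
    """Convert latitude and longitude to XY tile coordinates (same formula as above)."""
    lat_rad = math.radians(lat_deg)
    n = 2.0 ** zoom  # number of tiles along each axis at this zoom level
    x_tile = int((lon_deg + 180.0) / 360.0 * n)
    y_tile = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return (x_tile, y_tile)

# London coordinates as returned by geolocate("London") earlier on this page.
x_tile, y_tile = deg2num(51.4893335, -0.14405508452768728, zoom=10)

# Build the query string by hand; no request is sent.
query = urlencode({"x": x_tile, "y": y_tile, "z": 10, "lyrs": "s"})
example_url = "https://mt0.google.com/vt?" + query
print(example_url)
```

At zoom level 10 the world is divided into 1024 by 1024 tiles, so a longitude close to zero lands near the middle of the x range.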

Checking our work

Let's see what URL we ended up with.

First we will define two slices so that we can split the returned URL into the base URL and the part of the URL that corresponds to the location we requested:

In [9]:
url = map_response.url

first_25s = slice(0, 25)
from_25th = slice(25, None)

print(url)
print(url[first_25s])
print(url[from_25th])
https://mt0.google.com/vt?x=511&y=340&z=10&lyrs=s
https://mt0.google.com/vt
?x=511&y=340&z=10&lyrs=s

url is a string, and we can select parts of this string using the slices we defined above. first_25s selects the first 25 characters of the string (indices 0 to 24), and from_25th selects everything from index 25 onwards.
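The following sketch shows that a slice object behaves the same as the more common colon notation, using the URL printed above as a literal string. The variable names `base` and `query` are illustrative.

```python
url = "https://mt0.google.com/vt?x=511&y=340&z=10&lyrs=s"

first_25 = slice(0, 25)      # same as url[0:25] or url[:25]
from_25th = slice(25, None)  # same as url[25:]

# A named slice and the colon notation select the same characters.
assert url[first_25] == url[:25]

# The two pieces together give back the whole string, with nothing lost or repeated.
assert url[first_25] + url[from_25th] == url

base, query = url[:25], url[25:]
print(base)
print(query)
```

Named slices are mainly useful when the same cut points are reused in several places.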

We can write tests so that if we change our code later we can check the results are still valid. We will do this here using assert statements. If the condition in any of those assert statements is False, we will get an error, which tells us we need to fix something in our code.

In [10]:
assert "https://mt0.google.com/vt?" in url
assert "z=10" in url
assert "lyrs=s" in url
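As a sketch of what a failing check looks like, the hypothetical helper `check_map_url` below parses the query string with the standard library and attaches a message to each assert. Parsing is one possible refinement, not something the page's own tests do: a plain substring test such as "z=10" in url would also pass for a URL containing z=100.

```python
from urllib.parse import urlsplit, parse_qs

def check_map_url(url, expected_zoom):
    """Hypothetical helper: raise AssertionError if the URL is not what we expect."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # e.g. {'z': ['10'], ...}
    assert parts.netloc == "mt0.google.com", f"unexpected host: {parts.netloc}"
    assert query["z"] == [str(expected_zoom)], f"unexpected zoom: {query['z']}"

# This URL matches our expectations, so nothing happens.
check_map_url("https://mt0.google.com/vt?x=511&y=340&z=10&lyrs=s", expected_zoom=10)

# A failing check raises AssertionError; we catch it here only to show the message.
try:
    check_map_url("https://mt0.google.com/vt?x=511&y=340&z=100&lyrs=s", expected_zoom=10)
except AssertionError as error:
    failure_message = str(error)
    print("Test failed:", failure_message)
```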

Our previous function comes back with an object representing the web response. In Python, we can use the . operator to access a particular attribute of an object. In this case, the image at the requested URL is stored in the content attribute. It's a big file, so let's just look at the first few bytes:

In [11]:
map_response.content[0:20]
Out[11]:
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00'
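The first few bytes of a file often identify its format. The sketch below checks the bytes shown above against the well-known JPEG and PNG signatures; `guess_image_format` is a hypothetical helper written for this illustration and only knows about these two formats.

```python
# The first 20 bytes of the response content, copied from the output above.
header = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00'

def guess_image_format(data):
    """Hypothetical helper: guess an image format from its leading bytes."""
    if data.startswith(b'\x89PNG\r\n\x1a\n'):  # PNG signature
        return "png"
    if data.startswith(b'\xff\xd8\xff'):       # JPEG start-of-image marker
        return "jpeg"
    return "unknown"

detected = guess_image_format(header)
print(detected)
```

Image libraries typically perform a check of this kind themselves, so it is rarely needed in practice, but it is a quick way to see what a server actually returned.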

Displaying results

We'll need to do this a lot, so we can wrap up our previous function in another function to save on typing.

In [12]:
def map_content_at(latitude, longitude, zoom=10, satellite=True):
    """Retrieve a map image from Google at a given location."""

    # Pass the caller's zoom and satellite settings through instead of fixed values
    return request_map_at(latitude, longitude, zoom=zoom, satellite=satellite).content

We can use a library that comes with Jupyter notebook to display the image. This is one of the most powerful things about modern programming languages like Python - being able to work with images, documents, or any other kind of data just as easily as we can with numbers or strings.

In [13]:
import IPython

map_png = map_content_at(london_latitude, london_longitude)
In [14]:
print("The type of our map result is actually a: ", type(map_png))
The type of our map result is actually a:  <class 'bytes'>
In [15]:
IPython.display.Image(map_png)
Out[15]:
[Image: satellite map tile around London]
In [16]:
IPython.display.Image(map_content_at(*geolocate("New Delhi")))
Out[16]:
[Image: satellite map tile around New Delhi]

Manipulating Numbers

Now we get to our research project: we want to use satellite imagery to find out how urbanised the world is along a line between two cities. We expect the satellite image to be greener in the countryside.

We'll need to import a few more libraries to count how much green there is in an image.

In [17]:
from io import BytesIO  # A class that lets us treat bytes in memory as a file
import numpy as np  # A library to deal with matrices
import imageio.v3 as iio  # A library to deal with images

Let's define what we count as green:

In [18]:
def is_green(pixels):
    """Determine if each pixel in an image array is green."""
    
    # RGB indices
    red, green, blue = range(3)

    threshold = 1.1
    greener_than_red = pixels[:, :, green] > threshold * pixels[:, :, red]
    greener_than_blue = pixels[:, :, green] > threshold * pixels[:, :, blue]
    # Use a separate name so the mask does not overwrite the green channel index
    is_green_mask = np.logical_and(greener_than_red, greener_than_blue)

    return is_green_mask

This code has assumed we have our pixel data for the image as a $256 \times 256 \times 3$ 3-d matrix, with each of the three layers being red, green, and blue pixels.

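We find out which pixels are green by comparing, element by element, the middle (green, number 1) layer to the top (red, zero) and bottom (blue, 2) layers. A pixel counts as green only if its green value exceeds both its red and its blue value by more than the threshold factor of 1.1.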
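To see the element-by-element comparison at work, here is a small sketch on a hand-made 2 x 2 image rather than a real map tile. The pixel values are invented purely for illustration, and the variable names are not taken from the notebook.

```python
import numpy as np

# A hand-made 2 x 2 RGB image; the last axis holds (red, green, blue).
pixels = np.array(
    [
        [[10, 200, 10], [200, 10, 10]],   # clearly green, clearly red
        [[100, 105, 100], [0, 0, 0]],     # greyish (green less than 10% larger), black
    ],
    dtype=np.uint8,
)

threshold = 1.1
red_layer = pixels[:, :, 0]
green_layer = pixels[:, :, 1]
blue_layer = pixels[:, :, 2]

# Each comparison yields a 2 x 2 array of booleans, one per pixel.
mask = np.logical_and(
    green_layer > threshold * red_layer,
    green_layer > threshold * blue_layer,
)
print(mask)
print(mask.sum())  # True counts as 1, so the sum is the number of green pixels
```

Only the top-left pixel passes both comparisons; the greyish pixel fails because its green value is not more than 10% above the other two channels.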

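Now we just need to parse our data, which arrives as the bytes of a compressed image file, and turn it into our matrix format. (The code refers to it as a PNG, although the JFIF marker in the first few bytes printed earlier suggests the server actually sends a JPEG.)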

In [19]:
def count_green_in_png(data):
    """Determine the total number of green pixels in an image."""

    f = BytesIO(data)
    pixels = iio.imread(f) # Get our PNG image as a numpy array

    return np.sum(is_green(pixels))
In [20]:
london_map = map_content_at(london_latitude, london_longitude)
green_count_london = count_green_in_png(london_map)
print(green_count_london)
21417
In [21]:
iio.imread(BytesIO(london_map)).shape
Out[21]:
(256, 256, 3)

We'll also need a function to get an evenly spaced set of places between two endpoints:

In [22]:
def location_sequence(start, end, steps):
    """Generate a sequence of evenly spaced locations between two sets of coordinates."""

    start_latitude, start_longitude = start
    end_latitude, end_longitude = end
    
    latitudes = np.linspace(start_latitude, end_latitude, steps)
    longitudes = np.linspace(start_longitude, end_longitude, steps)

    path = np.vstack([latitudes, longitudes]).transpose()
    
    return path
In [23]:
london_to_cambridge = location_sequence(
    start=geolocate("London"),
    end=geolocate("Cambridge"),
    steps=5,
)
print(london_to_cambridge)
[[ 5.14893335e+01 -1.44055085e-01]
 [ 5.16683830e+01 -7.83753884e-02]
 [ 5.18474324e+01 -1.26956923e-02]
 [ 5.20264819e+01  5.29840039e-02]
 [ 5.22055314e+01  1.18663700e-01]]
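The evenly spaced points come from linear interpolation. As a rough sketch of what np.linspace computes for each coordinate, the hypothetical pure-Python function `interpolate` below applies the same idea to the London and Cambridge coordinates shown earlier; it assumes at least two steps.

```python
def interpolate(start, end, steps):
    """Hypothetical pure-Python stand-in for np.linspace(start, end, steps)."""
    spacing = (end - start) / (steps - 1)  # distance between neighbouring points
    return [start + i * spacing for i in range(steps)]

# Coordinates copied from the geolocate outputs earlier on this page.
latitudes = interpolate(51.4893335, 52.2055314, steps=5)
longitudes = interpolate(-0.14405508452768728, 0.1186637, steps=5)

# Pair each latitude with the corresponding longitude.
path = list(zip(latitudes, longitudes))
for latitude, longitude in path:
    print(f"{latitude:.7f}, {longitude:.7f}")
```

Interpolating latitude and longitude separately gives a straight line on this coordinate grid, which is a reasonable simplification over short distances.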

Creating Images

We should display the green content to check our work:

In [24]:
def show_green_in_png(data):
    """Convert all non-green pixels in an RGB image to black.

    Red and blue channel are set to 0 for all pixels.
    Pixels that are green will have the green channel set to its max value.
    Pixels that are non-green will have the green channel set to 0.
    """

    f = BytesIO(data)
    pixels = iio.imread(f) # Get our PNG image as a numpy array
    green_pixels = is_green(pixels)

    green_channel = 1
    binary_pixels = np.zeros_like(pixels, dtype=np.uint8)
    max_possible_value =  np.iinfo(binary_pixels.dtype).max
    binary_pixels[green_pixels, green_channel] = max_possible_value

    buffer = BytesIO()
    binary_image = iio.imwrite(buffer, binary_pixels, extension='.png')

    return buffer.getvalue()
In [25]:
london_location
Out[25]:
(51.4893335, -0.14405508452768728)
In [26]:
IPython.display.Image(
    map_content_at(london_latitude, london_longitude, satellite=True)
)
Out[26]:
[Image: satellite map tile around London]
In [27]:
IPython.display.Image(
    show_green_in_png(
        map_content_at(
            london_latitude,
            london_longitude,
            satellite=True,
        )
    )
)
Out[27]:
[Image: the same London tile with green pixels shown in green and all other pixels in black]

Looping

We can loop over each element in our list of coordinates and get a map for that place:

In [28]:
london_to_birmingham = location_sequence(
    start=geolocate("London"),
    end=geolocate("Birmingham"),
    steps=10,
)

london_to_birmingham_maps = []

for latitude, longitude in london_to_birmingham:

    current_map = map_content_at(latitude, longitude)
    london_to_birmingham_maps.append(current_map)
    
    IPython.display.display(
        IPython.display.Image(
            current_map,
        )
    )
[Images: ten satellite map tiles, one for each step between London and Birmingham]

So now we can count the green from London to Birmingham!

In [29]:
green_at_each_location = [count_green_in_png(current_map) for current_map in london_to_birmingham_maps]
print(green_at_each_location)
[np.int64(21417), np.int64(21417), np.int64(50321), np.int64(52703), np.int64(55081), np.int64(49484), np.int64(50265), np.int64(50096), np.int64(47362), np.int64(49284)]
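The counting cell uses a list comprehension, which is a compact way of writing the kind of loop used in the previous cell. The sketch below shows the two forms side by side on made-up data; `count_positive` is a hypothetical stand-in for count_green_in_png so that the example runs without any images.

```python
def count_positive(values):
    """Hypothetical stand-in for count_green_in_png: count the positive numbers."""
    return sum(1 for value in values if value > 0)

datasets = [[1, -2, 3], [-1, -1], [4, 5, 6]]

# Explicit loop: start with an empty list and append one result per item.
counts_from_loop = []
for dataset in datasets:
    counts_from_loop.append(count_positive(dataset))

# List comprehension: the same computation written as a single expression.
counts_from_comprehension = [count_positive(dataset) for dataset in datasets]

print(counts_from_loop)
print(counts_from_comprehension)
```

Both forms produce the same list; the comprehension is usually preferred when each item is simply transformed and collected.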

Plotting graphs

Let's plot a graph.

In [30]:
import matplotlib.pyplot as plt
%matplotlib inline
In [31]:
plt.plot(green_at_each_location)

plt.xticks(range(10))
plt.xlabel("Sequence step")
plt.ylabel(r"$N_{green}$")
Out[31]:
Text(0, 0.5, '$N_{green}$')
[Image: line plot of $N_{green}$ against sequence step]

From a research perspective, of course, this code needs a lot of work. But I hope the power of using programming is clear.

Composing Program Elements

We built little pieces of useful code, to:

  • Find latitude and longitude of a place
  • Get a map at a given latitude and longitude
  • Decide whether a (red,green,blue) triple is mainly green
  • Decide whether each pixel is mainly green
  • Plot a new image showing the green places
  • Find evenly spaced points between two places

By putting these together, we can make a function which can plot this graph automatically for any two places:

In [32]:
def green_between(start, end, steps):
    """Count the amount of green space along a linear path between two locations."""

    sequence = location_sequence(
        start=geolocate(start),
        end=geolocate(end),
        steps=steps,
    )
    maps = [map_content_at(latitude, longitude) for latitude, longitude in sequence]
    green_at_each_location = [count_green_in_png(current_map) for current_map in maps]
    
    return green_at_each_location
In [33]:
plt.plot(green_between('New York', 'Chicago', 20))
---------------------------------------------------------------------------
GeocoderUnavailable                       Traceback (most recent call last)
Cell In[33], line 1
----> 1 plt.plot(green_between('New York', 'Chicago', 20))

Cell In[32], line 6, in green_between(start, end, steps)
      4     sequence = location_sequence(
      5         start=geolocate(start),
----> 6         end=geolocate(end),
      7         steps=steps,
      8     )

Cell In[4], line 4, in geolocate(city)
----> 4     full_name, coordinates = geocoder.geocode(city)

[... geopy, requests and urllib3 library frames omitted ...]

GeocoderUnavailable: HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Max retries exceeded with url: /search?q=Chicago&format=json&limit=1 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Read timed out. (read timeout=1)"))
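The traceback above ends in a timeout from the geocoding service, which is a common hazard when code depends on a remote server. One possible way to cope, sketched below, is to retry the call a few times before giving up. `call_with_retries` and `flaky_geolocate` are hypothetical names invented for this illustration, and the fake service returns placeholder coordinates so that the example runs without a network.

```python
import time

def call_with_retries(function, *args, attempts=3, delay=1.0, **kwargs):
    """Hypothetical helper: call function, retrying if it raises an exception."""
    for attempt in range(1, attempts + 1):
        try:
            return function(*args, **kwargs)
        except Exception as error:  # in real code, catch only the expected error types
            if attempt == attempts:
                raise  # out of attempts: let the last error propagate
            print(f"Attempt {attempt} failed ({error}); retrying")
            time.sleep(delay * attempt)  # wait a little longer each time

# A fake "service" that fails twice and then succeeds, so no network is needed.
calls = {"count": 0}

def flaky_geolocate(city):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return (0.0, 0.0)  # placeholder coordinates, not a real location

coordinates = call_with_retries(flaky_geolocate, "Chicago", attempts=3, delay=0.0)
print(coordinates)
```

The message reports a read timeout of one second. geopy's geocoder classes also accept a timeout argument when they are created, so allowing the service more time may be a simpler remedy than retrying.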

We can also put the plotting command into a function, to make it more general:

In [34]:
def plot_green_between(start, end, steps):
    """Plot the amount of green space along a linear path between two locations."""
    green_between_locations = green_between(start, end, steps)
    plt.plot(green_between_locations)
    xticks_steps = 5 if steps > 10 else 1
    plt.xticks(range(0, steps, xticks_steps))
    plt.xlabel("Sequence step")
    plt.ylabel(r"$N_{green}$")
    plt.title(f"{start} -- {end}")
In [35]:
plot_green_between('New York', 'Chicago', 20)
---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
[... urllib3 and requests library frames omitted ...]
    675         decode_content=False,
    676         retries=self.max_retries,
    677         timeout=timeout,
    678         chunked=chunked,
    679     )
    681 except (ProtocolError, OSError) as err:

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/urllib3/connectionpool.py:871, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    868     log.warning(
    869         "Retrying (%r) after connection broken by '%r': %s", retries, err, url
    870     )
--> 871     return self.urlopen(
    872         method,
    873         url,
    874         body,
    875         headers,
    876         retries,
    877         redirect,
    878         assert_same_host,
    879         timeout=timeout,
    880         pool_timeout=pool_timeout,
    881         release_conn=release_conn,
    882         chunked=chunked,
    883         body_pos=body_pos,
    884         preload_content=preload_content,
    885         decode_content=decode_content,
    886         **response_kw,
    887     )
    889 # Handle redirect?

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/urllib3/connectionpool.py:871, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    868     log.warning(
    869         "Retrying (%r) after connection broken by '%r': %s", retries, err, url
    870     )
--> 871     return self.urlopen(
    872         method,
    873         url,
    874         body,
    875         headers,
    876         retries,
    877         redirect,
    878         assert_same_host,
    879         timeout=timeout,
    880         pool_timeout=pool_timeout,
    881         release_conn=release_conn,
    882         chunked=chunked,
    883         body_pos=body_pos,
    884         preload_content=preload_content,
    885         decode_content=decode_content,
    886         **response_kw,
    887     )
    889 # Handle redirect?

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/urllib3/connectionpool.py:841, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    839     new_e = ProtocolError("Connection aborted.", new_e)
--> 841 retries = retries.increment(
    842     method, url, error=new_e, _pool=self, _stacktrace=sys.exc_info()[2]
    843 )
    844 retries.sleep()

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/urllib3/util/retry.py:519, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    518     reason = error or ResponseError(cause)
--> 519     raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    521 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)

MaxRetryError: HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Max retries exceeded with url: /search?q=Chicago&format=json&limit=1 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Read timed out. (read timeout=1)"))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/geopy/adapters.py:482, in RequestsAdapter._request(self, url, timeout, headers)
    481 try:
--> 482     resp = self.session.get(url, timeout=timeout, headers=headers)
    483 except Exception as error:

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/requests/sessions.py:602, in Session.get(self, url, **kwargs)
    601 kwargs.setdefault("allow_redirects", True)
--> 602 return self.request("GET", url, **kwargs)

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/requests/sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/requests/sessions.py:703, in Session.send(self, request, **kwargs)
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/requests/adapters.py:700, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    698         raise SSLError(e, request=request)
--> 700     raise ConnectionError(e, request=request)
    702 except ClosedPoolError as e:

ConnectionError: HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Max retries exceeded with url: /search?q=Chicago&format=json&limit=1 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Read timed out. (read timeout=1)"))

During handling of the above exception, another exception occurred:

GeocoderUnavailable                       Traceback (most recent call last)
Cell In[35], line 1
----> 1 plot_green_between('New York', 'Chicago', 20)

Cell In[34], line 3, in plot_green_between(start, end, steps)
      1 def plot_green_between(start, end, steps):
      2     """Count the amount of green space along a linear path between two locations"""
----> 3     green_between_locations = green_between(start, end, steps)
      4     plt.plot(green_between_locations)
      5     xticks_steps = 5 if steps > 10 else 1

Cell In[32], line 6, in green_between(start, end, steps)
      1 def green_between(start, end, steps):
      2     """Count the amount of green space along a linear path between two locations."""
      4     sequence = location_sequence(
      5         start=geolocate(start),
----> 6         end=geolocate(end),
      7         steps=steps,
      8     )
      9     maps = [map_content_at(latitude, longitude) for latitude, longitude in sequence]
     10     green_at_each_location = [count_green_in_png(current_map) for current_map in maps]

Cell In[4], line 4, in geolocate(city)
      1 def geolocate(city):
      2     """Get the latitude and longitude of a specific location."""
----> 4     full_name, coordinates = geocoder.geocode(city)
      5     return coordinates

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/geopy/geocoders/nominatim.py:297, in Nominatim.geocode(self, query, exactly_one, timeout, limit, addressdetails, language, geometry, extratags, country_codes, viewbox, bounded, featuretype, namedetails)
    295 logger.debug("%s.geocode: %s", self.__class__.__name__, url)
    296 callback = partial(self._parse_json, exactly_one=exactly_one)
--> 297 return self._call_geocoder(url, callback, timeout=timeout)

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/geopy/geocoders/base.py:368, in Geocoder._call_geocoder(self, url, callback, timeout, is_json, headers)
    366 try:
    367     if is_json:
--> 368         result = self.adapter.get_json(url, timeout=timeout, headers=req_headers)
    369     else:
    370         result = self.adapter.get_text(url, timeout=timeout, headers=req_headers)

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/geopy/adapters.py:472, in RequestsAdapter.get_json(self, url, timeout, headers)
    471 def get_json(self, url, *, timeout, headers):
--> 472     resp = self._request(url, timeout=timeout, headers=headers)
    473     try:
    474         return resp.json()

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/geopy/adapters.py:494, in RequestsAdapter._request(self, url, timeout, headers)
    492         raise GeocoderServiceError(message)
    493     else:
--> 494         raise GeocoderUnavailable(message)
    495 elif isinstance(error, requests.Timeout):
    496     raise GeocoderTimedOut("Service timed out")

GeocoderUnavailable: HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Max retries exceeded with url: /search?q=Chicago&format=json&limit=1 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Read timed out. (read timeout=1)"))
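The traceback above ends in `GeocoderUnavailable` because the request to the geocoding service hit its 1 second read timeout. Network services fail like this from time to time, so one common response is to try again after a short pause. The sketch below shows that idea using only the standard library. The helper `call_with_retries` and the stand-in `flaky_geolocate` are hypothetical names invented for this illustration and are not part of the notebook or of geopy. In the notebook one would pass the real `geolocate` function and retry on geopy's `GeocoderUnavailable` instead.

```python
import time


def call_with_retries(func, *args, attempts=3, initial_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff on failure.

    Hypothetical helper for illustration only.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            sleep(delay)  # wait before trying again
            delay *= 2  # double the wait each time


# Hypothetical stand-in for a flaky network call such as geolocate(city):
# it fails twice with a timeout, then succeeds.
calls = {"count": 0}


def flaky_geolocate(city):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("The read operation timed out")
    return (1.0, 2.0)  # placeholder coordinates, not real data


coordinates = call_with_retries(
    flaky_geolocate,
    "Chicago",
    retry_on=(TimeoutError,),
    sleep=lambda seconds: None,  # skip the real waiting in this demonstration
)
print(coordinates)
```

Retrying only on the named exception types means that genuine bugs, such as a misspelled variable, still fail immediately. Another option is to give the service more time: geopy's geocoder classes accept a `timeout` argument, which could be set higher than the 1 second shown in the traceback.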

And that's it! We've covered, very quickly, a lot of the Python language, and introduced some of the most important concepts in modern software engineering.

Now we'll go back, carefully, through all the concepts we touched on, and learn how to use them properly ourselves.