Getting and using OSM data in Python

In this vignette we focus on two Python packages commonly used to obtain and assess OpenStreetMap (OSM) data, namely OSMnx and Pyrosm. We use them to complete the processing steps needed to work with OSM data for transport research, comprising:

Downloading OSM data, for both pre-defined areas and user specified areas
Plotting OSM data, to understand the data and its structure as well as visualising it

Package Installs and Imports

First we must install the required packages for our analysis.

This analysis was conducted in an interactive notebook running Python 3.7.13 (you can check your current version by running !python --version in a new code cell). Packages can be installed using the pip package manager (for example, !pip install pyrosm).

If you are running Python on a local machine (your own computer, rather than Colab) within a conda environment, you should install packages with the conda package manager instead. For example, use conda install -c conda-forge pyrosm rather than !pip install pyrosm.

To see if your desired package is supported by the conda package installer you can search for your package using the search bar in the centre of the anaconda homepage.

Package install from terminal (if required)

conda install -c conda-forge pyrosm
conda install -c conda-forge osmnx

# start by importing your packages
import pyrosm
import matplotlib
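Before running the imports above, it can help to confirm that the packages are actually installed in the active environment. The following is a minimal sketch using only the standard library; the helper name missing_packages is our own and is not part of pyrosm or OSMnx.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if find_spec(name) is None]


# The packages used in this vignette
required = ["pyrosm", "osmnx", "matplotlib"]
not_installed = missing_packages(required)
if not_installed:
    print("Please install:", ", ".join(not_installed))
else:
    print("All required packages are available.")
```

Any package reported as missing can then be installed with conda or pip as described above.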

Analysis

Now that we have installed and imported our desired packages we can continue with our analysis. Within this document we show how to obtain network data-packs of OSM infrastructure in Python using the pyrosm and osmnx packages.

Pyrosm Package

Here we show the functionality of the pyrosm package. The main difference between OSMnx and pyrosm is best described by the pyrosm documentation itself:

“the main difference between pyrosm and OSMnx is that OSMnx reads the data using an OverPass API, whereas pyrosm reads the data from local OSM data dumps that are downloaded from the PBF data providers (Geofabrik, BBBike). This makes it possible to parse OSM data faster and make it more feasible to extract data covering large regions.”

As we are based in Leeds (the University of Leeds) we will use 'Leeds' as our query.

Data Download

We can check whether our query is stored by the pyrosm providers with pyrosm.data.available. A number of regions, sub-regions and cities are available.

# Store available places, check Leeds is in stored places
available_places = pyrosm.data.available
print(available_places.keys())
print('Leeds available from providers:','Leeds' in available_places['cities'])

dict_keys(['test_data', 'regions', 'subregions', 'cities'])
Leeds available from providers: True
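The check above is case-sensitive and only looks in the 'cities' category. As a rough sketch, a place name could instead be searched for across every category, ignoring case. The helper find_place and the sample dictionary below are our own illustrations; the sample only imitates the general layout of pyrosm.data.available and is not its real content.

```python
def find_place(available, name):
    """Return the top-level categories of `available` that contain `name`.

    The comparison is case-insensitive, and both lists and nested
    dictionaries of lists are searched.
    """
    target = name.lower()

    def contains(container):
        if isinstance(container, dict):
            return any(contains(value) for value in container.values())
        return any(str(item).lower() == target for item in container)

    return [category for category, places in available.items() if contains(places)]


# Stand-in data for illustration only; with pyrosm installed you would pass
# pyrosm.data.available instead.
sample_available = {
    "regions": {"europe": ["great_britain", "france"]},
    "cities": ["Leeds", "London", "Helsinki"],
}
print(find_place(sample_available, "leeds"))  # prints ['cities']
```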

As Leeds is available from pyrosm providers, we will use this.

# Gets data from pyrosm providers (BBBike or Geofabrik) and stores it in a temporary directory (/tmp/pyrosm here).
# The file can be saved to a user-specified location with the additional argument directory, i.e. get_data(place_name, directory='save_path')
place_name = 'Leeds'
file_path = pyrosm.get_data(place_name)
print('Data downloaded to:', file_path)

Data downloaded to: /tmp/pyrosm/Leeds.osm.pbf

This has downloaded the OSM data in the Protocolbuffer Binary Format (pbf) file format ready to be parsed by the pyrosm OSM file reader.

Reproducible Data

OSM data is constantly updated as mappers add and update features, so the data you request from pyrosm providers may contain more features than our example.

If you would like to use the same dataset as us (Leeds.osm.pbf - 06/06/2022) this can be downloaded from our repository releases using the following code:

Note that you will need to set your own desired download location with the save_to variable:

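#____________Function__________________________________________________________
def download_file(url, save_to):
    # url     - url of downloadable file
    # save_to - existing directory to save the file to

    from os.path import join
    from requests import get

    print("downloading from url:", url)
    # stream=True avoids loading the whole file into memory at once
    req = get(url, stream=True)
    req.raise_for_status()

    # Use the server-supplied file name, falling back to the last part of the url
    disposition = req.headers.get('Content-Disposition', '')
    if "filename=" in disposition:
        filename = disposition.split("filename=")[1].strip('"')
    else:
        filename = url.split("/")[-1]
    out_path = join(save_to, filename)

    print("writing", filename)
    # The with block closes the file automatically when it finishes
    with open(out_path, 'wb') as file:
        for chunk in req.iter_content(chunk_size=8192):
            if chunk:
                file.write(chunk)
    print(filename, "has been downloaded to", save_to)
    return out_path
#______________________________________________________________________________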

url = "https://github.com/udsleeds/openinfra/releases/download/v0.2/Leeds_06_06_22.osm.pbf"

# Change this to your desired download directory.
save_to = "/home/james/Desktop/Python_Downloads"

download_file(url, save_to)

Data Loading

Now that we have downloaded the most up-to-date (or the reproducible) OSM dataset, we can continue by loading our data and observing its structure.

If you are following this guide using the reproducible dataset then file_path will simply be the save_to path + the file name.

For me that is file_path = /home/james/Desktop/Python_Downloads/Leeds_06_06_22.osm.pbf
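As a small sketch of the step described above, the path can be built with os.path.join rather than by joining strings by hand. The directory is the example one used in this guide and should be changed to your own.

```python
import os

# Values taken from the example above; change save_to to your own directory
save_to = "/home/james/Desktop/Python_Downloads"
file_name = "Leeds_06_06_22.osm.pbf"

# os.path.join inserts the path separator for us
file_path = os.path.join(save_to, file_name)
print(file_path)

# It is worth checking the file exists before handing it to the reader
if not os.path.isfile(file_path):
    print("File not found, check save_to and file_name:", file_path)
```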

# Initialises the OSM object that parses .osm.pbf files

osm = pyrosm.OSM(file_path)
print('osm type:', type(osm))

osm type: <class 'pyrosm.pyrosm.OSM'>

Notice that the osm (lower case) variable is actually the reader instance (type: <class 'pyrosm.pyrosm.OSM'>) for the given .osm.pbf dataset. As such this (osm) variable should always be used to make the calls for fetching different network datasets from the OSM pbf file.

There are a number of ways to request network data from the osm.pbf file. The most notable, and of most use in this demonstration, are:

get_network() and
get_data_by_custom_criteria()

Pre-defined Networks from get_network()

The functionality of OSM.get_network() is demonstrated below first. For assistance with this function we can call for the documentation with help.

# Calling the help function on OSM.get_network() to see documentation. 
help(pyrosm.OSM.get_network)

Help on function get_network in module pyrosm.pyrosm:

get_network(self, network_type='walking', extra_attributes=None, nodes=False)
    Parses street networks from OSM
    for walking, driving, and cycling.
    
    Parameters
    ----------
    
    network_type : str
        What kind of network to parse.
        Possible values are:
          - `'walking'`
          - `'cycling'`
          - `'driving'`
          - `'driving+service'`
          - `'all'`.
    
    extra_attributes : list (optional)
        Additional OSM tag keys that will be converted into columns in the resulting GeoDataFrame.
    
    nodes : bool (default: False)
        If True, 1) the nodes associated with the network will be returned in addition to edges,
        and 2) every segment of a road constituting a way is parsed as a separate row
        (to enable full connectivity in the graph).
    
    Returns
    -------
    
    gdf_edges or (gdf_nodes, gdf_edges)
    
    Return type
    -----------
    
    geopandas.GeoDataFrame or tuple
    
    See Also
    --------
    
    Take a look at the OSM documentation for further details about the data:
    `https://wiki.openstreetmap.org/wiki/Key:highway <https://wiki.openstreetmap.org/wiki/Key:highway>`__

As can be seen above, the get_network function accepts a number of network_type parameters depending on the network you are trying to analyse, including:

'all'
'driving'
'cycling'
'walking'
'driving+service'

where 'service' generally implies an access road to a building, service station, campsite, industrial estate, fuel station, wind turbine site etc.

Let's first obtain the network for all of Leeds, examine the data structure returned from our get_network() request, and visualise the network we have retrieved.

# Obtaining the total network for 'Leeds'
leeds_total_network = osm.get_network(network_type = 'all')
print('Variable shape:',leeds_total_network.shape, 'and type:', type(leeds_total_network), '\n')
leeds_total_network.head(2)

Variable shape: (136372, 39) and type: <class 'geopandas.geodataframe.GeoDataFrame'> 
(leeds_total_network.head(2) shown as image below)

leeds_total_network['geom_type'] = leeds_total_network['geometry'].geom_type
print(leeds_total_network['geom_type'].value_counts())

MultiLineString    136372
Name: geom_type, dtype: int64

We have saved the output from our get_network request as the variable leeds_total_network.

We can see this variable (leeds_total_network) is a geopandas GeoDataFrame (analogous to an Excel spreadsheet) with shape (136372, 39), implying this DataFrame contains 136,372 rows and 39 columns.

Each row corresponds to a unique feature (such as a way - a road, path, cyclepath etc.) and each column corresponds to a tag for that feature (such as feature geometry, osmid, pedestrian access etc.)

We can observe all keys have been returned by calling for the column names of the DataFrame (leeds_total_network) below.

keys = leeds_total_network.columns
print(keys)

Index(['access', 'area', 'bicycle', 'bicycle_road', 'bridge', 'busway',
       'cycleway', 'est_width', 'foot', 'footway', 'highway', 'int_ref',
       'junction', 'lanes', 'lit', 'maxspeed', 'motorcar', 'motor_vehicle',
       'name', 'oneway', 'overtaking', 'psv', 'ref', 'service', 'segregated',
       'sidewalk', 'smoothness', 'surface', 'tracktype', 'tunnel', 'turn',
       'width', 'id', 'timestamp', 'version', 'tags', 'osm_type', 'geometry',
       'length'],
      dtype='object')

These are all the default keys (returned as columns) when a network is requested from the osm.pbf file with get_network. We can see there is a column named 'geometry' which stores the linestring geometries of features we have requested, and is used in the visualisation of our networks.

However, it should be noted that in some instances a number of extra tags are returned within the tags column, in instances where specific OSM features have more information attributed to them. We can take a look at them below.

Should we wish for any of these additional tags to be returned as columns in the DataFrame, rather than being stored within the tags column, we can specify this with the extra_attributes argument of the get_network function. For example, get_network(network_type='walking', extra_attributes=["description", "crossing"]) would return all values for the tags description and crossing as their own columns.

# Keeps only the features that have additional tags within the 'tags' column (i.e. drops rows where 'tags' is empty) and shows the first 4 features
leeds_total_network_noNA = leeds_total_network.loc[leeds_total_network.tags.notna()]
leeds_total_network_noNA.head(4)

output as image below

As can be seen, the feature with index 20 (the 4th row above) has a number of additional tags within the 'tags' column. Let's take a closer look.

leeds_total_network_noNA.tags.iloc[3]

{"gritting":"priority_1","maintenance":"gritting","maxweight":"7.5"}

Through some reverse searching of these additional tags using Tag Finder, we can see the feature being described is likely a well used public road as the local authorities are required to grit it in icy conditions (maintenance:gritting).

Furthermore, we know this is likely a well used road due to the highest gritting priority (gritting:priority_1) with a maximum permissible weight of 7.5 tonnes.
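The tags value shown above looks like a JSON-formatted string. Assuming that is how it is stored, a short sketch of turning it into a Python dictionary, so that individual tags can be looked up by key, might be:

```python
import json

# The value shown above, assumed here to be stored as a JSON-formatted string
raw_tags = '{"gritting":"priority_1","maintenance":"gritting","maxweight":"7.5"}'

tags = json.loads(raw_tags)
for key, value in tags.items():
    print(key, "=", value)

# Values are strings, so numeric tags need converting before use
max_weight_tonnes = float(tags["maxweight"])
```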

Obtaining the feature's OSM ID (id) and feature type (osm_type) provides more context:

leeds_total_network_noNA.id.iloc[3], leeds_total_network_noNA.osm_type.iloc[3]

(2340358, 'way')

Having accessed the feature ID and osm_type field above, we can search for this feature using the OSM Nominatim search by ID field which uses osm_type + ID as the query argument.

So, for a way (W) with ID 2340358 our query becomes W2340358

Passing this query to Nominatim we find that this way corresponds to a road within central Wakefield. As such, and as hypothesised, it is likely a well used important road thus the high gritting priority in icy conditions.
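The query construction described above (the first letter of osm_type followed by the ID) can be sketched as a small helper. The function name nominatim_query is our own.

```python
def nominatim_query(osm_type, osm_id):
    """Build an 'osm_type + ID' query such as 'W2340358'."""
    # OSM element types and the single-letter prefixes used in the search
    prefixes = {"node": "N", "way": "W", "relation": "R"}
    return f"{prefixes[osm_type.lower()]}{osm_id}"


print(nominatim_query("way", 2340358))  # prints W2340358
```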

Network Visualisation

Here we look to plot the networks we have requested using the get_network function.

Remember that within the GeoDataFrame returned there was a column named 'geometry' which stores the geometries of returned features used for visualisation of requested networks.

Calling the .plot method on the returned GeoDataFrame automatically detects the column containing feature geometry and plots them, as shown below.

# Plotting the total network for Leeds
leeds_total_network.plot()

# Requesting and plotting the drivable network for Leeds
leeds_total_driving = osm.get_network(network_type='driving')
leeds_total_driving.plot()

Custom networks with get_data_by_custom_criteria()

pyrosm also offers the option to request custom networks using the get_data_by_custom_criteria function.

As previously described, the get_network function has five pre-defined network configurations: walking, driving, cycling, all, driving+service.

Here we will have a brief view of the get_data_by_custom_criteria function before re-creating one of the pre-defined network_type= filters by copying filtering steps from the get_network documentation.

Specifically, we will recreate the cycling network for Leeds. From function documentation we know:

def cycling_filter():
    """
    Cycling filters for different tags as in OSMnx for 'bike'.
    Filter out foot ways, motor ways, private ways, and anything
    specifying biking=no.
    Applied filters:
        '["area"!~"yes"]["highway"!~"footway|steps|corridor|elevator|escalator|motor|proposed|'
        'construction|abandoned|platform|raceway"]'
        '["bicycle"!~"no"]["service"!~"private"
    """
    return dict(
        area=["yes"],
        highway=[
            "footway",
            "steps",
            "corridor",
            "elevator",
            "escalator",
            "motor",
            "proposed",
            "construction",
            "abandoned",
            "platform",
            "raceway",
            "motorway",
            "motorway_link",
        ],
        bicycle=["no"],
        service=["private"],
    )

From “Applied filters:” we know that features with any of the tag-value pairs contained within the returned dictionary will be excluded from our network rather than kept.

This is due to the !~ operator within the filter, which means “does not match”.

i.e. to create a cycling network, remove features from the network if they have the highway tag with any of the values: "footway", "steps", "corridor", "elevator", etc.. Intuitively these are all features that are not accessible when cycling.
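To make the exclusion logic concrete, here is a simplified sketch in plain Python. It is an illustration only, not pyrosm's actual implementation: features are represented as hypothetical dictionaries of tags, values are compared by exact match, and the filter is a shortened version of the cycling filter above.

```python
def is_excluded(feature_tags, exclude_filter):
    """Return True if any tag of the feature matches the exclusion filter."""
    return any(
        feature_tags.get(key) in values
        for key, values in exclude_filter.items()
    )


# A shortened version of the cycling filter shown above
cycling_exclude = {
    "area": ["yes"],
    "highway": ["footway", "steps", "motorway", "motorway_link"],
    "bicycle": ["no"],
    "service": ["private"],
}

# Hypothetical features, each represented by a dictionary of its OSM tags
features = [
    {"highway": "residential", "name": "Example Street"},
    {"highway": "footway"},
    {"highway": "service", "service": "private"},
    {"highway": "cycleway", "bicycle": "designated"},
]

cycling_network = [f for f in features if not is_excluded(f, cycling_exclude)]
print([f["highway"] for f in cycling_network])  # prints ['residential', 'cycleway']
```

The footway is dropped because of its highway value, and the service road because of its service value, even though its highway value is allowed.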

The features removed by each of the pre-defined filters are shown in tabular form below, constructed from the documentation:

Cycling Filter removed tags

Tag	Values
area	“yes”
highway	“footway”, “steps”, “corridor”, “elevator”, “escalator”, “motor”, “proposed”, “construction”, “abandoned”, “platform”, “raceway”, “motorway”, “motorway_link”
bicycle	“no”
service	“private”

Driving Filter removed tags

Tag	Values
area	“yes”
highway	“cycleway”, “footway”, “path”, “pedestrian”, “steps”, “track”, “corridor”, “elevator”, “escalator”, “proposed”, “construction”, “bridleway”, “abandoned”, “platform”, “raceway”
motor_vehicle	“no”
motorcar	“no”
service	“parking”, “parking_aisle”, “private”, “emergency_access”

Walking Filter removed tags

Tag	Values
area	“yes”
highway	“cycleway”, “motor”, “proposed”, “construction”, “abandoned”, “platform”, “raceway”, “motorway”, “motorway_link”
foot	“no”
service	“private”
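The highway rows of the three tables above can be compared with ordinary set operations, which shows at a glance how the filters differ. This is a sketch that copies the values from the tables; the variable names are our own.

```python
# Excluded highway values, copied from the tables above
excluded_highway = {
    "cycling": {
        "footway", "steps", "corridor", "elevator", "escalator", "motor",
        "proposed", "construction", "abandoned", "platform", "raceway",
        "motorway", "motorway_link",
    },
    "driving": {
        "cycleway", "footway", "path", "pedestrian", "steps", "track",
        "corridor", "elevator", "escalator", "proposed", "construction",
        "bridleway", "abandoned", "platform", "raceway",
    },
    "walking": {
        "cycleway", "motor", "proposed", "construction", "abandoned",
        "platform", "raceway", "motorway", "motorway_link",
    },
}

# Values removed for cycling but kept for walking, and vice versa
only_cycling = excluded_highway["cycling"] - excluded_highway["walking"]
only_walking = excluded_highway["walking"] - excluded_highway["cycling"]

# Values removed by all three filters
common = set.intersection(*excluded_highway.values())

print("Removed for cycling only:", sorted(only_cycling))
print("Removed for walking only:", sorted(only_walking))
print("Removed by every filter:", sorted(common))
```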

We will now take a look at the get_data_by_custom_criteria function documentation to see how to define a custom network.

Help on function get_data_by_custom_criteria in module pyrosm.pyrosm:

get_data_by_custom_criteria(self, custom_filter, osm_keys_to_keep=None, filter_type='keep', tags_as_columns=None, keep_nodes=True, keep_ways=True, keep_relations=True, extra_attributes=None)
    `
    Parse OSM data based on custom criteria.
    
    Parameters
    ----------
    
    custom_filter : dict (required)
        A custom filter to filter only specific POIs from OpenStreetMap.
    
    osm_keys_to_keep : str | list
        A filter to specify which OSM keys should be kept.
    
    filter_type : str
        "keep" | "exclude"
        Whether the filters should be used to keep or exclude the data from OSM.
    
    tags_as_columns : list
        Which tags should be kept as columns in the resulting GeoDataFrame.
    
    keep_nodes : bool
        Whether or not the nodes should be kept in the resulting GeoDataFrame if they are found.
    
    keep_ways : bool
        Whether or not the ways should be kept in the resulting GeoDataFrame if they are found.
    
    keep_relations : bool
        Whether or not the relations should be kept in the resulting GeoDataFrame if they are found.
    
    extra_attributes : list (optional)
        Additional OSM tag keys that will be converted into columns in the resulting GeoDataFrame.

Studying the documentation, we see that we can create a dictionary of tags to either 'exclude' or 'keep' within our network. This is decided by the filter_type= parameter.

If we want to keep only features that carry a "highway" tag (roads, paths, cycleways etc., rather than other OSM features such as buildings), we can do so by specifying osm_keys_to_keep = "highway".

We will define our custom cycling network below:

# Specifying desired keys to be kept - this is our first level of filtering
keys_to_keep = "highway"

# Specifying key:value pairs to be filtered - this is the second level of filtering. 
cycling_filter = dict(area=["yes"],
        highway=[
            "footway",
            "steps",
            "corridor",
            "elevator",
            "escalator",
            "motor",
            "proposed",
            "construction",
            "abandoned",
            "platform",
            "raceway",
            "motorway",
            "motorway_link",
        ],
        bicycle=["no"],
        service=["private"])

# Specifying if the above tags should be kept or removed
filter_type = "exclude"

# Filter the network: 
# From the documentation on the get_network() function, nodes and relations are set False as default so we do so here too.
leeds_custom_cycling = osm.get_data_by_custom_criteria(custom_filter = cycling_filter, 
                                                       osm_keys_to_keep = keys_to_keep,
                                                       filter_type = filter_type,
                                                       keep_nodes = False,
                                                       keep_relations = False)

# Visualisation and stats
leeds_custom_cycling.plot()
print(leeds_custom_cycling.shape)

leeds_custom_cycling['geom_type'] = leeds_custom_cycling['geometry'].geom_type
print(leeds_custom_cycling.geom_type.value_counts())

(105460, 38)
MultiLineString    77219
LineString         28241
dtype: int64

Comparing to the default cycling network

leeds_cycling = osm.get_network(network_type = "cycling")
leeds_cycling['geom_type'] = leeds_cycling.geometry.geom_type
leeds_cycling.plot()
print(leeds_cycling.shape)
print(leeds_cycling.geom_type.value_counts())

(105460, 39)
MultiLineString    105460
dtype: int64

As can be seen we have recreated the get_network cycling filter with our custom query.

Both of the returned networks contain 105460 features, differing only in the number of columns returned.

We can find the difference in returned columns by calculating the set difference of each network's columns as follows:

set(leeds_custom_cycling.keys()).difference(set(leeds_cycling.keys()))
set(leeds_cycling.keys()).difference(set(leeds_custom_cycling.keys()))

{'length'}

Only one output is shown: the additional column within the leeds_cycling network. Every column within leeds_custom_cycling is already within leeds_cycling, so the first difference is an empty set, and a notebook cell only displays the value of its last expression.
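To see both directions of the comparison at once, the two set differences can be wrapped in a small helper and printed explicitly. This is a sketch with shortened stand-in column lists; with the real data you would pass leeds_custom_cycling.columns and leeds_cycling.columns. The name compare_columns is our own.

```python
def compare_columns(left, right):
    """Return (only_in_left, only_in_right) as sorted lists of column names."""
    left, right = set(left), set(right)
    return sorted(left - right), sorted(right - left)


# Shortened stand-in column lists for illustration
custom_columns = ["highway", "id", "tags", "geometry"]
default_columns = ["highway", "id", "tags", "geometry", "length"]

only_custom, only_default = compare_columns(custom_columns, default_columns)
print("Only in custom network:", only_custom)    # prints an empty list
print("Only in default network:", only_default)  # prints ['length']
```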

Saving Custom Networks

Here we look at saving the custom network we have defined for later use!
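One possible approach, sketched here under our own assumptions, is to write the GeoDataFrame to disk with its to_file method. The helper save_network, the output directory and the file name are hypothetical; GeoPackage is used as the default format but GeoJSON would work in the same way.

```python
import os


def save_network(gdf, out_dir, name, driver="GPKG"):
    """Write a GeoDataFrame to `out_dir` and return the path of the file written.

    `gdf` is expected to be a geopandas GeoDataFrame; only its to_file
    method is used here.
    """
    extensions = {"GPKG": ".gpkg", "GeoJSON": ".geojson"}
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, name + extensions[driver])
    gdf.to_file(out_path, driver=driver)
    return out_path


# Hypothetical usage with the network created above:
# saved_to = save_network(leeds_custom_cycling, "outputs", "leeds_custom_cycling")
# The file can later be reloaded with geopandas.read_file(saved_to)
```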