In this vignette we focus on two Python packages commonly used to obtain and assess OpenStreetMap (OSM) data, namely OSMnx and pyrosm, and on the processing steps needed to work with OSM data for transport research.
First we must install the required packages for our analysis.
This analysis has been conducted in an interactive notebook running Python 3.7.13 (you can check your current version by running
!python --version
in a new code cell). Packages can be installed using the pip package manager as shown below.
If you are running Python on a local machine (your own computer, rather than Colab) within a conda environment, you should instead install packages from the conda servers using the conda package manager.
Example: conda install -c conda-forge pyrosm rather than !pip install pyrosm
To check whether your desired package is available through the conda package installer, search for it using the search bar in the centre of the Anaconda homepage.
Package install from terminal (if required)
conda install -c conda-forge pyrosm
conda install -c conda-forge osmnx
# start by importing your packages
import pyrosm
import matplotlib
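As a quick sanity check after installation, the sketch below reports which of the packages used in this vignette can be found in the current environment. It relies only on the standard library; the helper name check_packages is our own and is not part of pyrosm or OSMnx.

```python
from importlib.util import find_spec


def check_packages(names):
    """Return {package name: True/False} depending on whether it can be found."""
    return {name: find_spec(name) is not None for name in names}


# Packages used in this vignette.
status = check_packages(["pyrosm", "osmnx", "matplotlib"])
for name, found in status.items():
    print(name, "found" if found else "missing - install it with pip or conda")
```

Any package reported as missing can then be installed with whichever package manager matches your environment.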
Now that we have installed and imported our desired packages we can continue with our analysis. Within this document we show how to obtain network data-packs of OSM infrastructure in Python using the pyrosm and osmnx packages.
Here we show the functionality of the pyrosm package. The main difference between OSMnx and pyrosm is best described by the pyrosm documentation itself:
“the main difference between pyrosm and OSMnx is that OSMnx reads the data using an OverPass API, whereas pyrosm reads the data from local OSM data dumps that are downloaded from the PBF data providers (Geofabrik, BBBike). This makes it possible to parse OSM data faster and make it more feasible to extract data covering large regions.”
As we are based in Leeds (the University of Leeds) we will use 'Leeds' as our query.
We can check whether our query is stored by the pyrosm providers with pyrosm.data.available, which lists a number of regions, sub-regions and cities.
# Store available places, check Leeds is in stored places
available_places = pyrosm.data.available
print(available_places.keys())
print('Leeds available from providers:', 'Leeds' in available_places['cities'])
dict_keys(['test_data', 'regions', 'subregions', 'cities'])
Leeds available from providers: True
As Leeds is available from pyrosm providers, we will use this.
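If you do not know in advance which category a place falls under, a small search helper can save some manual inspection. The sketch below is an illustration only: find_place is a hypothetical helper, and sample_available is a heavily shortened, made-up stand-in that we assume mirrors the shape of pyrosm.data.available (flat lists for some categories, dictionaries of lists for others).

```python
def find_place(available, name):
    """Return the top-level categories whose entries contain `name` (case-insensitive)."""
    def contains(entries):
        # Entries may be a flat list of names or a dict of lists (possibly nested).
        if isinstance(entries, dict):
            return any(contains(value) for value in entries.values())
        return any(str(item).lower() == name.lower() for item in entries)

    return [category for category, entries in available.items() if contains(entries)]


# Hypothetical, shortened stand-in for pyrosm.data.available.
sample_available = {
    "test_data": ["test_pbf"],
    "regions": {"europe": ["great_britain"]},
    "subregions": {"great_britain": ["england"]},
    "cities": ["Leeds", "London"],
}

print(find_place(sample_available, "leeds"))
```

With pyrosm installed, the same function could be called as find_place(pyrosm.data.available, 'Leeds'), provided the real dictionary has the structure assumed here.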
# Gets data from pyrosm providers (BBBike or Geofabrik) and stores it in a temporary directory - the file can be saved to a
# user-specified location with the additional argument directory, i.e. get_data(place_name, directory='save_path')
place_name = 'Leeds'
file_path = pyrosm.get_data(place_name)
print('Data downloaded to:', file_path)
Data downloaded to: /tmp/pyrosm/Leeds.osm.pbf
This has downloaded the OSM data in the Protocolbuffer Binary Format (PBF), ready to be parsed by the pyrosm OSM file reader.
OSM data is constantly updated as mappers add and update features, so the data you request from pyrosm providers may contain more features than our example.
If you would like to use the same dataset as us (Leeds.osm.pbf - 06/06/2022), it can be downloaded from our repository releases using the following code.
Note that you will need to set your own desired download location with the save_to variable:
#____________Function__________________________________________________________
def download_file(url, save_to):
    # url - url of downloadable file
    # save_to - directory to save the file to
    from os import chdir, getcwd
    from requests import get
    print("downloading from url:", url)
    req = get(url)
    # The file name is read from the response headers
    filename = req.headers['Content-Disposition'].split("filename=")[1]
    # Note: this changes the current working directory to save_to
    chdir(save_to)
    print("writing", filename)
    # The with block closes the file automatically once writing has finished
    with open(filename, 'wb') as file:
        for chunk in req.iter_content(chunk_size=8192):
            if chunk:
                file.write(chunk)
    print(filename, "has been downloaded to", getcwd())
#______________________________________________________________________________
url = "https://github.com/udsleeds/openinfra/releases/download/v0.2/Leeds_06_06_22.osm.pbf"

# Change this to your desired download directory.
save_to = "/home/james/Desktop/Python_Downloads"

download_file(url, save_to)
Now that we have downloaded the most up-to-date (or the reproducible) OSM dataset, we can continue by loading our data and observing its structure.
If you are following this guide using the reproducible dataset then file_path will simply be the save_to path + the file name.
For me that is
file_path = "/home/james/Desktop/Python_Downloads/Leeds_06_06_22.osm.pbf"
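Rather than typing the path by hand, it can be built from the two pieces described above. This is a minimal sketch under one assumption: that the file is saved under the name that ends the download URL. expected_file_path is a hypothetical helper, and POSIX-style paths are used to match the example directory.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def expected_file_path(url, save_to):
    """Join the download directory with the file name taken from the end of the URL."""
    filename = PurePosixPath(urlparse(url).path).name
    return str(PurePosixPath(save_to) / filename)


url = "https://github.com/udsleeds/openinfra/releases/download/v0.2/Leeds_06_06_22.osm.pbf"
save_to = "/home/james/Desktop/Python_Downloads"

file_path = expected_file_path(url, save_to)
print(file_path)
```

If the server reports a different file name in its response headers, the saved file will carry that name instead, so it is worth checking the directory after downloading.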
# Initialises the OSM object that parses .osm.pbf files
osm = pyrosm.OSM(file_path)
print('osm type:', type(osm))
osm type: <class 'pyrosm.pyrosm.OSM'>
Notice that the osm (lower case) variable is actually the reader instance (type: <class 'pyrosm.pyrosm.OSM'>) for the given .osm.pbf dataset. As such, this osm variable should always be used to make the calls for fetching different network datasets from the OSM PBF file.
There are a number of ways to request network data from the osm.pbf file; the two used in this demonstration are OSM.get_network() and OSM.get_data_by_custom_criteria().
The functionality of OSM.get_network() is demonstrated first.
For assistance with this function we can call up its documentation with help.
# Calling the help function on OSM.get_network() to see documentation.
help(pyrosm.OSM.get_network)
Help on function get_network in module pyrosm.pyrosm:
get_network(self, network_type='walking', extra_attributes=None, nodes=False)
Parses street networks from OSM
for walking, driving, and cycling.
Parameters
----------
network_type : str
What kind of network to parse.
Possible values are:
- `'walking'`
- `'cycling'`
- `'driving'`
- `'driving+service'`
- `'all'`.
extra_attributes : list (optional)
Additional OSM tag keys that will be converted into columns in the resulting GeoDataFrame.
nodes : bool (default: False)
If True, 1) the nodes associated with the network will be returned in addition to edges,
and 2) every segment of a road constituting a way is parsed as a separate row
(to enable full connectivity in the graph).
Returns
-------
gdf_edges or (gdf_nodes, gdf_edges)
Return type
-----------
geopandas.GeoDataFrame or tuple
See Also
--------
Take a look at the OSM documentation for further details about the data:
`https://wiki.openstreetmap.org/wiki/Key:highway <https://wiki.openstreetmap.org/wiki/Key:highway>`__
As can be seen above, the get_network function accepts a number of values for the network_type parameter, depending on the network you are trying to analyse:
'all'
'driving'
'cycling'
'walking'
'driving+service'
where 'service' generally implies an access road to a building, service station, campsite, industrial estate, fuel station, wind turbine site etc.
Let's first obtain the network for all of Leeds, examine the data structure returned from our get_network() request, and visualise the network we have retrieved.
# Obtaining the total network for 'Leeds'
leeds_total_network = osm.get_network(network_type='all')
print('Variable shape:', leeds_total_network.shape, 'and type:', type(leeds_total_network), '\n')
leeds_total_network.head(2)
Variable shape: (136372, 39) and type: <class 'geopandas.geodataframe.GeoDataFrame'>
(leeds_total_network.head(2) shown as image below)
leeds_total_network['geom_type'] = leeds_total_network['geometry'].geom_type
print(leeds_total_network['geom_type'].value_counts())
MultiLineString 136372
Name: geom_type, dtype: int64
We have saved the output from our get_network request as the variable leeds_total_network.
We can see this variable (leeds_total_network) is a geopandas GeoDataFrame (analogous to an Excel spreadsheet) with shape (136372, 39), meaning this DataFrame contains 136,372 rows and 39 columns.
Each row corresponds to a unique feature (such as a way - a road, path, cyclepath etc.) and each column corresponds to a tag for that feature (such as feature geometry, osmid, pedestrian access etc.).
We can see which keys have been returned by calling for the column names of the DataFrame (leeds_total_network) below.
keys = leeds_total_network.columns
print(keys)
Index(['access', 'area', 'bicycle', 'bicycle_road', 'bridge', 'busway',
'cycleway', 'est_width', 'foot', 'footway', 'highway', 'int_ref',
'junction', 'lanes', 'lit', 'maxspeed', 'motorcar', 'motor_vehicle',
'name', 'oneway', 'overtaking', 'psv', 'ref', 'service', 'segregated',
'sidewalk', 'smoothness', 'surface', 'tracktype', 'tunnel', 'turn',
'width', 'id', 'timestamp', 'version', 'tags', 'osm_type', 'geometry',
'length'],
dtype='object')
These are all the default keys (returned as columns) when a network is requested from the osm.pbf file with get_network. We can see there is a column named 'geometry', which stores the linestring geometries of the features we have requested and is used in the visualisation of our networks.
However, it should be noted that where specific OSM features have more information attributed to them, a number of extra tags are returned within the tags column. We can take a look at them below.
Should we wish for any of these additional tags to be returned as columns in the DataFrame, rather than being stored within the tags column, we can specify this with the extra_attributes argument of the get_network function.
i.e. get_network(network_type='walking', extra_attributes=["description", "crossing"]) would return all values for the tags description and crossing as their own columns.
# Keeps only the features that have additional tags within the 'tags' column (i.e. drops rows where no additional tags are returned) and shows the first 4 features
leeds_total_network_noNA = leeds_total_network.loc[leeds_total_network.tags.notna()]
leeds_total_network_noNA.head(4)
output as image below
As can be seen, the feature with index 20 (the 4th row above) has a number of additional tags within the 'tags' column - let's take a closer look.
leeds_total_network_noNA.tags.iloc[3]
{"gritting":"priority_1","maintenance":"gritting","maxweight":"7.5"}
Through some reverse searching of these additional tags using Tag Finder, we can see that the feature being described is likely a well-used public road, as the local authorities are required to grit it in icy conditions (maintenance:gritting).
This is supported by the road having the highest gritting priority (gritting:priority_1); it also has a maximum permissible weight of 7.5 tonnes.
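To work with these additional tags programmatically, the content of a tags cell can be turned into a Python dictionary. The sketch below assumes the cell holds a JSON string, as the output shown above suggests; parse_tags is a hypothetical helper, and missing values are treated as "no extra tags".

```python
import json


def parse_tags(raw_tags):
    """Convert the content of a 'tags' cell into a dict (empty dict if there are no tags)."""
    if isinstance(raw_tags, dict):
        return raw_tags
    if not isinstance(raw_tags, str):
        # Covers None and NaN, which mark features without extra tags.
        return {}
    return json.loads(raw_tags)


# The value shown above, assumed here to be stored as a JSON string.
raw = '{"gritting":"priority_1","maintenance":"gritting","maxweight":"7.5"}'
tags = parse_tags(raw)
print(tags["gritting"], float(tags["maxweight"]))
```

Individual values can then be looked up by key, for example to select every feature with a given gritting priority.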
Obtaining the feature's OSM ID (id) and feature type (osm_type) provides more context:
leeds_total_network_noNA.id.iloc[3], leeds_total_network_noNA.osm_type.iloc[3]
(2340358, 'way')
Having accessed the feature ID and osm_type fields above, we can search for this feature using the OSM Nominatim search-by-ID field, which takes the first letter of the osm_type followed by the ID as the query argument.
So, for a way (W) with ID 2340358 our query becomes W2340358.
Passing this query to Nominatim, we find that this way corresponds to a road within central Wakefield. As hypothesised, it is likely a well-used, important road, hence the high gritting priority in icy conditions.
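Building the query string from the two fields can be sketched as below. nominatim_id_query is a hypothetical helper; the prefixes follow the usual OSM convention of N, W and R for nodes, ways and relations.

```python
# First letter used for each OSM element type in an ID query.
OSM_TYPE_PREFIX = {"node": "N", "way": "W", "relation": "R"}


def nominatim_id_query(osm_type, osm_id):
    """Build an ID query such as 'W2340358' from a feature's osm_type and id."""
    return f"{OSM_TYPE_PREFIX[osm_type.lower()]}{int(osm_id)}"


print(nominatim_id_query("way", 2340358))
```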
Here we look to plot the networks we have requested using the get_network function.
Remember that within the GeoDataFrame returned there was a column named 'geometry', which stores the geometries of the returned features and is used to visualise the requested networks.
Calling the .plot method on the returned GeoDataFrame automatically detects the column containing the feature geometries and plots them, as shown below.
# Plotting the total network for Leeds
leeds_total_network.plot()
# Requesting and plotting the drivable network for Leeds
leeds_total_driving = osm.get_network(network_type='driving')
leeds_total_driving.plot()
pyrosm also offers the option to request custom networks using the get_data_by_custom_criteria function.
As previously described, the get_network function has five pre-defined network configurations: walking, driving, cycling, all and driving+service.
Here we will take a brief look at the get_data_by_custom_criteria function before re-creating one of the pre-defined network_type= filters by copying the filtering steps from the get_network documentation.
Specifically, we will recreate the cycling network for Leeds. From the function documentation we know:
def cycling_filter():
    """
    Cycling filters for different tags as in OSMnx for 'bike'.
    Filter out foot ways, motor ways, private ways, and anything
    specifying biking=no.
    Applied filters:
        '["area"!~"yes"]["highway"!~"footway|steps|corridor|elevator|escalator|motor|proposed|'
        'construction|abandoned|platform|raceway"]'
        '["bicycle"!~"no"]["service"!~"private"
    """
    return dict(
        area=["yes"],
        highway=[
            "footway",
            "steps",
            "corridor",
            "elevator",
            "escalator",
            "motor",
            "proposed",
            "construction",
            "abandoned",
            "platform",
            "raceway",
            "motorway",
            "motorway_link",
        ],
        bicycle=["no"],
        service=["private"],
    )
From “Applied filters:” we know that features carrying any of the tag:value pairs in the returned dictionary will be excluded from our network rather than kept.
This is due to the !~ operator within the filter, which means does not match.
i.e. to create a cycling network, remove features from the network if they have the highway tag with any of the values: "footway", "steps", "corridor", "elevator", etc.
Intuitively, these are all features that are not accessible when cycling.
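The exclusion logic described above can be illustrated on plain dictionaries of tags. This is a simplified sketch using exact value matching, not pyrosm's actual implementation; is_excluded, the shortened filter and the sample features are all hypothetical.

```python
def is_excluded(tags, exclude_filter):
    """True if any tag of the feature has a value listed in the exclusion filter."""
    return any(tags.get(key) in values for key, values in exclude_filter.items())


# Shortened version of the cycling filter.
cycling_exclude = {
    "area": ["yes"],
    "highway": ["footway", "steps", "motorway", "motorway_link"],
    "bicycle": ["no"],
    "service": ["private"],
}

# Hypothetical features, each described by its OSM tags.
features = [
    {"highway": "residential"},
    {"highway": "steps"},
    {"highway": "cycleway", "bicycle": "designated"},
    {"highway": "tertiary", "bicycle": "no"},
]

# Keep only the features that match none of the exclusion rules.
kept = [f for f in features if not is_excluded(f, cycling_exclude)]
print(kept)
```

Here the steps and the road tagged bicycle=no are dropped, while the residential street and the cycleway remain.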
The features removed by each of the pre-defined filters are shown in tabular form below, constructed from the documentation:
Cycling Filter removed tags
Tag | Values |
---|---|
area | “yes” |
highway | “footway”, “steps”, “corridor”, “elevator”, “escalator”, “motor”, “proposed”, “construction”, “abandoned”, “platform”, “raceway”, “motorway”, “motorway_link” |
bicycle | “no” |
service | “private” |
Driving Filter removed tags
Tag | Values |
---|---|
area | “yes” |
highway | “cycleway”, “footway”, “path”, “pedestrian”, “steps”, “track”, “corridor”, “elevator”, “escalator”, “proposed”, “construction”, “bridleway”, “abandoned”, “platform”, “raceway” |
motor_vehicle | “no” |
motorcar | “no” |
service | “parking”, “parking_aisle”, “private”, “emergency_access” |
Walking Filter removed tags
Tag | Values |
---|---|
area | “yes” |
highway | “cycleway”, “motor”, “proposed”, “construction”, “abandoned”, “platform”, “raceway”, “motorway”, “motorway_link” |
foot | “no” |
service | “private” |
We will now take a look at the get_data_by_custom_criteria function documentation to see how to define a custom network.
Help on function get_data_by_custom_criteria in module pyrosm.pyrosm:
get_data_by_custom_criteria(self, custom_filter, osm_keys_to_keep=None, filter_type='keep', tags_as_columns=None, keep_nodes=True, keep_ways=True, keep_relations=True, extra_attributes=None)
Parse OSM data based on custom criteria.
Parameters
----------
custom_filter : dict (required)
A custom filter to filter only specific POIs from OpenStreetMap.
osm_keys_to_keep : str | list
A filter to specify which OSM keys should be kept.
filter_type : str
"keep" | "exclude"
Whether the filters should be used to keep or exclude the data from OSM.
tags_as_columns : list
Which tags should be kept as columns in the resulting GeoDataFrame.
keep_nodes : bool
Whether or not the nodes should be kept in the resulting GeoDataFrame if they are found.
keep_ways : bool
Whether or not the ways should be kept in the resulting GeoDataFrame if they are found.
keep_relations : bool
Whether or not the relations should be kept in the resulting GeoDataFrame if they are found.
extra_attributes : list (optional)
Additional OSM tag keys that will be converted into columns in the resulting GeoDataFrame.
Studying the documentation, we see that we can create a dictionary of tags to either 'exclude' or 'keep' within our network; which of the two applies is decided by the filter_type= parameter.
If we want to keep only features that carry the "highway" key (roads, paths and similar ways, rather than buildings etc.), we can do so by specifying osm_keys_to_keep = "highway".
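How the two levels of filtering interact can be sketched on plain tag dictionaries. This is a toy model of the behaviour described in the documentation above, not pyrosm's implementation; apply_custom_filter and the sample features are hypothetical.

```python
def apply_custom_filter(features, custom_filter, osm_keys_to_keep, filter_type="keep"):
    """Simplified two-level filter over a list of tag dictionaries."""
    if filter_type not in ("keep", "exclude"):
        raise ValueError("filter_type must be 'keep' or 'exclude'")

    selected = []
    for tags in features:
        # Level 1: the feature must carry the key we are interested in.
        if osm_keys_to_keep not in tags:
            continue
        # Level 2: does any key:value pair of the feature appear in the filter?
        matches = any(tags.get(k) in values for k, values in custom_filter.items())
        # 'keep' retains matching features, 'exclude' retains the non-matching ones.
        if matches == (filter_type == "keep"):
            selected.append(tags)
    return selected


# Hypothetical features, each described by its OSM tags.
sample_features = [
    {"highway": "residential"},
    {"highway": "footway"},
    {"building": "yes"},
]
footway_filter = {"highway": ["footway"]}

excluded_result = apply_custom_filter(sample_features, footway_filter, "highway", "exclude")
kept_result = apply_custom_filter(sample_features, footway_filter, "highway", "keep")
print(excluded_result)
print(kept_result)
```

The building is dropped at the first level in both cases; the second level then decides whether the footway is the only feature kept or the only highway feature removed.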
We will define our custom cycling network below:
# Specifying desired keys to be kept - this is our first level of filtering
keys_to_keep = "highway"
# Specifying key:value pairs to be filtered - this is the second level of filtering.
cycling_filter = dict(
    area=["yes"],
    highway=[
        "footway",
        "steps",
        "corridor",
        "elevator",
        "escalator",
        "motor",
        "proposed",
        "construction",
        "abandoned",
        "platform",
        "raceway",
        "motorway",
        "motorway_link",
    ],
    bicycle=["no"],
    service=["private"],
)
# Specifying if the above tags should be kept or removed
filter_type = "exclude"
# Filter the network:
# From the documentation on the get_network() function, nodes and relations are set to False by default, so we do so here too.
leeds_custom_cycling = osm.get_data_by_custom_criteria(
    custom_filter=cycling_filter,
    osm_keys_to_keep=keys_to_keep,
    filter_type=filter_type,
    keep_nodes=False,
    keep_relations=False,
)
# Visualisation and stats
leeds_custom_cycling.plot()
print(leeds_custom_cycling.shape)
leeds_custom_cycling['geom_type'] = leeds_custom_cycling['geometry'].geom_type
print(leeds_custom_cycling.geom_type.value_counts())
(105460, 38)
MultiLineString 77219
LineString 28241
dtype: int64
Comparing this to the default cycling network:
leeds_cycling = osm.get_network(network_type = "cycling")
leeds_cycling['geom_type'] = leeds_cycling.geometry.geom_type
leeds_cycling.plot()
print(leeds_cycling.shape)
print(leeds_cycling.geom_type.value_counts())
(105460, 39)
MultiLineString 105460
dtype: int64
As can be seen, we have recreated the get_network cycling filter with our custom query.
Both of the returned networks contain 105,460 features, differing only in the number of columns returned.
We can find the difference in returned columns by calculating the set difference of each network's columns as follows:
set(leeds_custom_cycling.keys()).difference(set(leeds_cycling.keys()))
set(leeds_cycling.keys()).difference(set(leeds_custom_cycling.keys()))
{'length'}
Only one output is shown: length, the additional column within the leeds_cycling network. The first expression evaluates to an empty set, because every column within leeds_custom_cycling is already within leeds_cycling.
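The behaviour of the two set differences can be checked on a small example. The column lists below are hypothetical, shortened stand-ins for the columns of the two GeoDataFrames; the symmetric difference (^) is included as one way to see both directions at once.

```python
# Hypothetical, shortened column lists standing in for the two GeoDataFrames.
custom_columns = ["highway", "bicycle", "id", "geometry"]
default_columns = ["highway", "bicycle", "id", "geometry", "length"]

# Columns present in one network but not in the other.
only_in_custom = set(custom_columns) - set(default_columns)
only_in_default = set(default_columns) - set(custom_columns)

# Columns present in exactly one of the two networks.
in_one_only = set(custom_columns) ^ set(default_columns)

print(only_in_custom)
print(only_in_default)
print(in_one_only)
```

Printing each result explicitly also avoids relying on the notebook displaying only the value of the last expression in a cell.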