Speeding up GeoPandas I/O with geofeather

Speed! Image credit https://unsplash.com/photos/HUJDz6CJEaM

When a project has to scale, speed is one of the main things clients ask for. I like to use Python to rapidly prototype an idea or a workflow and then bring it to scale. When a process performs a lot of read/write, or input/output (I/O), operations, speeding these up can have a significant impact on the whole workflow. Repeated I/O can become a bottleneck, and multi-threading often helps here.
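As a rough illustration of the multi-threading point, here is a minimal sketch that reads several files concurrently using only the standard library. The helper names `read_bytes` and `read_all` are my own for this example, the files are throwaway temporary files, and whether threads actually help depends on your storage and on the library doing the reading.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_bytes(path):
    """Read one file; stands in for any I/O-bound read."""
    return Path(path).read_bytes()


def read_all(paths, max_workers=4):
    """Read many files concurrently, returning contents in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `paths` in its results
        return list(pool.map(read_bytes, paths))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(8):
            p = Path(tmp) / f"file_{i}.bin"
            p.write_bytes(bytes([i]) * 1024)  # 1 KiB of dummy data
            paths.append(p)
        contents = read_all(paths)
        print(len(contents), "files read")
```

Threads suit this kind of work because the program spends most of its time waiting on the disk rather than computing.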

I love GeoPandas, primarily because of its similarity to Pandas (and because you can easily convert a GeoDataFrame to a DataFrame and vice versa, assuming you have a geometry field). I use it a lot. When I saw this post about a new library called geofeather, written precisely to speed up reading and writing with GeoPandas, I wanted to check it out. Please do go and read the post.

I have built a Python test virtual environment on my machine (called python_test37; imaginative, eh?) and often use it for testing out new libraries without breaking my main environment. This is the environment I installed geofeather in.

Installation

pip install geofeather

That is it. So straightforward. After installation, import the library and you are good to go:

from geofeather import to_geofeather, from_geofeather, to_shp

A simple test

The blog post claimed:

“I am seeing about 1.5–2x speedups compared to reading shapefiles with Geopandas, and about 5–7x speedups compared to writing shapefiles with Geopandas.”

Fantastic! I have an RPA hexabin data set that I have used before for examples using GeoPandas, so I used this as a test data set. It contains 8,147 rows of data.

I have an associated notebook here if you prefer to jump straight to that. First off, standard imports:

import geopandas as gpd
import pandas as pd
from geofeather import to_geofeather, from_geofeather, to_shp

Next, read in the file. I am using the %%timeit cell magic in a Jupyter Notebook to measure the speed:

%%timeit
gdf = gpd.read_file('RPA_hexagons.shp')

It gives me 534 ms ± 57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each). Not too bad.
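If you want a similar measurement outside a notebook, the standard-library timeit module can do the job. This is a minimal sketch, not what %%timeit does internally: the helper name `time_call` is my own, and the summing workload is only a stand-in for a real read such as `gpd.read_file('RPA_hexagons.shp')`.

```python
import statistics
import timeit


def time_call(func, repeat=7, number=1):
    """Return (mean, std dev) in seconds per call over `repeat` runs."""
    # timeit.repeat returns one total time per run; divide by `number`
    # to get the time for a single call
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [t / number for t in totals]
    # population standard deviation across the runs
    return statistics.mean(per_call), statistics.pstdev(per_call)


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with the read or write to test
    mean_s, std_s = time_call(lambda: sum(range(100_000)))
    print(f"{mean_s * 1e3:.2f} ms ± {std_s * 1e3:.2f} ms per loop")
```

The defaults of 7 runs with 1 loop each mirror the settings reported in the notebook output.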

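Now I try to write that file. Note that %%timeit does not keep variables assigned inside the timed cell, so gdf needs to have been read once in a normal cell first: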

%%timeit
gdf.to_file("RPA_hexagons_stats_out.shp")

This gives me 1.77 s ± 130 ms per loop (mean ± std. dev. of 7 runs, 1 loop each), which does seem pretty slow. Almost 2 seconds to write 8147 rows.

Now let’s compare with geofeather. First, writing (I can’t read until I have a .feather file to read):

%%timeit
to_geofeather(gdf, 'RPS.feather')

This gives me 260 ms ± 51.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each). Wow! That is some speed-up: 0.26 seconds vs 1.77 seconds, just under 7 times faster. What about reading? That was already pretty fast:

%%timeit
df = from_geofeather('RPS.feather')

This gives me 221 ms ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each). This is a little over twice as fast (534 ms vs 221 ms, roughly 2.4x), not as much of an improvement as writing but still very impressive.
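The speed-up figures follow directly from the mean timings reported above. As a quick check, this small sketch divides the shapefile time by the geofeather time for each operation. The `speedup` helper is just for illustration, and the numbers are from my single test data set only.

```python
# Mean timings reported above, in seconds
shp_read, shp_write = 0.534, 1.77
feather_read, feather_write = 0.221, 0.260


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


write_speedup = speedup(shp_write, feather_write)
read_speedup = speedup(shp_read, feather_read)
print(f"write: {write_speedup:.1f}x, read: {read_speedup:.1f}x")
# prints "write: 6.8x, read: 2.4x"
```

The write figure sits at the top of the 5–7x range quoted in the post, and the read figure is slightly above the quoted 1.5–2x.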

Why care about this?

I suppose if you are just reading and writing one or two files then saving a few seconds will not be noticed, but if you are reading and writing thousands of files this is a massive performance improvement. The author notes that it is not intended to create another geospatial data format. For vector processing chains, geofeather is now part of my standard development toolkit. Another step forward for Python geoprocessing!
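To put the "thousands of files" point into numbers, here is a back-of-the-envelope sketch. It assumes a hypothetical batch of 1,000 files, each similar in size to my test file, and that the per-file timings above stay constant, which may not hold on your data or hardware.

```python
def seconds_saved(n_files, slow_s, fast_s):
    """Total seconds saved over n_files, given per-file timings."""
    return n_files * (slow_s - fast_s)


n_files = 1000  # hypothetical batch size, not from my test

# Per-file mean timings from the test above, in seconds
read_saved = seconds_saved(n_files, 0.534, 0.221)
write_saved = seconds_saved(n_files, 1.77, 0.260)

total_minutes = (read_saved + write_saved) / 60
print(f"about {total_minutes:.0f} minutes saved")
# prints "about 30 minutes saved"
```

Under those assumptions, one read plus one write per file works out to roughly half an hour saved per thousand files.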


I am a freelancer able to help you with your projects. I offer consultancy, training and writing. I’d be delighted to hear from you. Please check out the books I have written on QGIS 3.4

https://www.packtpub.com/application-development/learn-qgis-fourth-edition

https://www.packtpub.com/application-development/qgis-quick-start-guide


I have grouped all my previous blogs (technical stuff / tutorials / opinions / ideas) at http://gis.acgeospatial.co.uk.

Feel free to connect or follow me; I am always keen to talk about Earth Observation.

I am @map_andrew on Twitter.