Training data and Satellite data

  • blog

I don’t think I have come across a GitHub repo any code while looking for more resources to add for awesome-eo-code that is related to evaluating training data for Satellite imagery. I have often wondered how someone new to the industry could make their mark. In a sector that has focused a great deal of effort over the last years utilising machine and deep learning, to derive more insights from Satellite data, very few projects appear to look at collecting and evaluating training data to go into these models. The Radiant MLHub and Azavea Groundwork are leading the industry here. So perhaps if you want to make a difference in Earth Observation you could look at creating and evaluting training data, rather than just hoping someone will supply it.

With this in mind I thought I might be able to prototype a simple script that would allow me to plot the spectral profile of the mean pixel (for each band of an RGB image). There is huge potential to do more with this, the code could be extended to evaluate more bands, more complex data and attempt to guide which labelled vector feature appears good and which might not be so useful. In this blog (and associated code) I’ll step through an introduction to this idea using Python.

I am going to use training data in the form of a Shapefile with a field called ‘class’ that identifiies the terrain class and a simple RGB image. I’ll make these available in the repo so you can use them.

Settinging up

Assuming that you have a working Python environment setup and some core Geospatial libraries installed you are good to go. If you need some help check out my setting up Python post here and here (for Jupyter Notebooks)

Imports used in this script

%matplotlib inline
import geopandas as gpd
import pandas as pd
from osgeo import gdal, gdal_array
import numpy as np
import matplotlib.pyplot as plt
import shapely
import fiona
import rasterio
from rasterio import plot
from rasterio.mask import mask

Reading and plotting

Next lets read the shapefile in and plot in using GeoPandas (need help understanding this part? go here)

training_data = gpd.read_file(r"training.shp")
training_data.plot(column='class', cmap=None, legend=True, figsize=(10, 10))

Next lets load and plot the corresponding Satellite image that this training data set was based on, using Rasterio

datain = r"south_coast.tif"
raster =
arr =

and then lets overlay both of these two plots (need more info? go here)

fig, ax = plt.subplots(figsize=(10, 10)), transform=raster.transform, ax=ax)
training_data.plot(ax=ax, column='class', cmap=None, legend=True, alpha=0.75)

I want to extract the values from the Satellite image and plot and then do something with. I have 2 options to extract (or mask) the Satellite image with the Shapefile. Either use fiona

with"training.shp", "r") as shapefile:
   shapes = [feature["geometry"] for feature in shapefile]
out_image, out_transform = mask(raster, shapes, crop=True)
fig, ax = plt.subplots(figsize=(10, 10))

or using shapely

geom = [shapely.geometry.mapping(training_data)["features"][i]["geometry"] for i in range(0,len(training_data))]
out_image, out_transform = mask(raster, geom, crop=True)
fig, ax = plt.subplots(figsize=(10, 10))

I’ll let you decide which you prefer. Either way loop over the geometry to create a list of geometries (in the first case called ‘shapes’ and the second ‘geom’ and then parse it in as a parameter alonside the raster image to the Rasterio method called mask. Plotting it up you get:

I think this is already pretty useful. It should is be possible at this point to combine the labels from the shapefile with this raster as an input into a machine learning methodology. I run a one day course on this topic if you are interested drop me an email (

Extracting Values

In this blog though I want to extract these values, because I want to plot the mean pixel value for each of these training features. I do this by reusing the code above, but this time iteraing over each ‘geom’ (or ‘shapes’ depending on which method from above you prefer)

for count, value in enumerate(geom):
   out_image, out_transform = mask(raster, [value], crop=True)
   red = out_image[0,:,:].mean()
   green = out_image[1,:,:].mean()
   blue = out_image[2,:,:].mean()
   print(f"{training_data.iloc[count]['class']}: {red}, {green}, {blue}")

That allows me to get the red, green and blue bands and the mean values. I could get the max or min by using mean.max() or mean.min() if I wanted. This will print out:

Water: 347.2429110105581, 570.8265460030166, 732.142383107089
Field: 322.9537037037037, 595.5836139169472, 670.6175645342312
BareEarth: 575.673275862069, 533.7560344827587, 651.5491379310345
BareEarth: 429.76128472222223, 398.7057291666667, 479.7708333333333
Field: 456.66424682395643, 874.0199637023593, 952.8166969147005
Urban: 509.66346153846155, 617.0173076923077, 803.4378205128205
Cloud: 854.008064516129, 910.2085253456221, 1019.0921658986175
Sand: 307.9588100686499, 425.52402745995425, 563.7665903890161
Trees: 467.5656108597285, 597.4751131221719, 794.8054298642534
Trees: 193.95488721804512, 245.14786967418547, 322.328320802005
BareEarth: 632.3832288401254, 646.1669278996865, 797.2452978056426
Field: 628.3308589607635, 823.1198303287381, 971.2184517497349
Field: 821.4128787878788, 930.529356060606, 1068.3910984848485
Water: 190.1728545343883, 311.4396429296003, 409.89784946236557
Urban: 447.7516276041667, 578.8880208333334, 769.9345703125
BareEarth: 759.0893997445721, 738.4648786717752, 868.2605363984675
Field: 627.2172839506172, 814.0, 969.737037037037
Field: 451.36607142857144, 787.7839285714285, 898.7071428571429

Adding to a dataframe

Ideally I’d like to plot these up and I will come to that in the last step. Before I get there I need to get this data into a pandas dataframe. I think there are several ways of doing this, but this worked nicely for me:

lsmean = []
for count, value in enumerate(geom):
   out_image, out_transform = mask(raster, [value], crop=True)
   mean_img = out_image.mean(axis=(1,2))
df = pd.DataFrame(np.vstack(lsmean))
df.rename({0: 'mean_band1', 1: 'mean_band2', 2: 'mean_band3'}, axis=1, inplace=True)
df_mean = training_data.join(df)

In this block of code I am creating an empty list (lsmean) that I intend to append an array of the mean values for each feature for each band. I iterate over the geom, calling the mask method again, but this time I use the .mean(axis=(1,2) commands to get the mean for each band. If I was to plot the shape for this array it would look like

(3, 519, 751)

In this script I’m getting the mean of each of the 3 bands – hopefully you can see this. Once I have my mean_img I append the array to the lsmean list. Outside of the loop I have created a new dataframe (called df) and stacked these mean bands into columns and then renamed the fields (otherwise I have 0, 1, 2 as fieldnames). Finally I join my training_data geodataframe to this new dataframe and call it df_mean. Printing out the first 5 lines gives me:

   id      class                                           geometry  \
0   1      Water  POLYGON ((647905.136 5629875.093, 648156.469 5...   
1   2      Field  POLYGON ((650678.179 5629732.671, 650912.757 5...   
2   3  BareEarth  POLYGON ((651128.485 5627533.505, 651224.829 5...   
3   3  BareEarth  POLYGON ((650942.079 5627208.866, 651027.951 5...   
4   2      Field  POLYGON ((652427.040 5626062.159, 652533.856 5...   

   mean_band1  mean_band2  mean_band3  
0  347.242911  570.826546  732.142383  
1  322.953704  595.583614  670.617565  
2  575.673276  533.756034  651.549138  
3  429.761285  398.705729  479.770833  
4  456.664247  874.019964  952.816697  

Now that I have my data I am pretty much ready to plot. I just need to drop fields I don’t need (geometry and id) and then transpose (.T) the dataframe to allow me to use the .plot method.

df_mean = pd.DataFrame(df_mean.drop(columns=['geometry', 'id']))
df_mean = df_mean.set_index('class').T

If I print the first 5 lines (I only actually have 3 lines as I have 3 bands!) my dataframe now looks like:

class            Water       Field   BareEarth   BareEarth       Field  \
mean_band1  347.242911  322.953704  575.673276  429.761285  456.664247   
mean_band2  570.826546  595.583614  533.756034  398.705729  874.019964   
mean_band3  732.142383  670.617565  651.549138  479.770833  952.816697   

class            Urban        Cloud        Sand       Trees       Trees  \
mean_band1  509.663462   854.008065  307.958810  467.565611  193.954887   
mean_band2  617.017308   910.208525  425.524027  597.475113  245.147870   
mean_band3  803.437821  1019.092166  563.766590  794.805430  322.328321   

class        BareEarth       Field        Field       Water       Urban  \
mean_band1  632.383229  628.330859   821.412879  190.172855  447.751628   
mean_band2  646.166928  823.119830   930.529356  311.439643  578.888021   
mean_band3  797.245298  971.218452  1068.391098  409.897849  769.934570   

class        BareEarth       Field       Field  
mean_band1  759.089400  627.217284  451.366071  
mean_band2  738.464879  814.000000  787.783929  
mean_band3  868.260536  969.737037  898.707143  


Now I can plot. One line will do it

lines = df_mean.plot(figsize=(10,6), legend = True)

Which gives me:

Already you can tell that my data is very similar! As this is only an example dataset (and I am sure your training dataset will be much better) lets finish up by creating a graph for each class.

I will try and bring this together and share some code from the point of loading the image and shapefile in.

lsmean = []
for count, value in enumerate(geom):
out_image, out_transform = mask(raster, [value], crop=True)
mean_img = out_image.mean(axis=(1,2))
df = pd.DataFrame(np.vstack(lsmean))
df.rename({0: 'mean_band1', 1: 'mean_band2', 2: 'mean_band3'}, axis=1, inplace=True)
df_mean = training_data.join(df)
cat = list(df_mean['class'].unique())
def plot_training(cat, df_in):
df_cat = df_in[df_in['class'] == cat]
df = pd.DataFrame(df_cat.drop(columns=['geometry', 'id']))
df = df.set_index('class').T
title = cat
lines = df.plot(figsize=(10,6), legend = True, title = title)
return None
for category in cat:
plot_training(category, df_mean)

There are a couple of changes from above. This time I have created a list of the unique classes (or categories) called cat. I loop over these categories and parse in the df_mean (from before) and the cateory into a function called plot_training. This creates a new dataframe containing my class, drops unnecessary fields, transposes it and plots it up. Giving me images like:

I could then use this to investigate my training dataset and improve it by removing erronous training data, or perhaps finding mislabelled features. In a future blog I’ll show you how to bring this notebook into a script that you can call on the command line.


In this post I have shown you have to extract values from a Satellite image based on a labelled training dataset (Shapefile). The code here could form the basis for a much bigger code block that does alot more inspection and reporting on your training datasets. I think its useful to be able to plot the different classes quickly that will then allow you to return to and improve on your labelled training data.

All the code for this post is on my satellite imagery github repo here and the direct link for this corresponding notebook is here

I am a freelancer able to help you with your projects. I offer consultancy, training and writing. I’d be delighted to hear from you.

Feel free to connect or follow me; I am always keen to talk about Earth Observation.

I am @map_andrew on twitter