I don’t think I have come across a GitHub repo any code while looking for more resources to add for awesome-eo-code that is related to evaluating training data for Satellite imagery. I have often wondered how someone new to the industry could make their mark. In a sector that has focused a great deal of effort over the last years utilising machine and deep learning, to derive more insights from Satellite data, very few projects appear to look at collecting and evaluating training data to go into these models. The Radiant MLHub and Azavea Groundwork are leading the industry here. So perhaps if you want to make a difference in Earth Observation you could look at creating and evaluting training data, rather than just hoping someone will supply it.
With this in mind I thought I might be able to prototype a simple script that would allow me to plot the spectral profile of the mean pixel (for each band of an RGB image). There is huge potential to do more with this, the code could be extended to evaluate more bands, more complex data and attempt to guide which labelled vector feature appears good and which might not be so useful. In this blog (and associated code) I’ll step through an introduction to this idea using Python.
I am going to use training data in the form of a Shapefile with a field called ‘class’ that identifiies the terrain class and a simple RGB image. I’ll make these available in the repo so you can use them.
Settinging up
Assuming that you have a working Python environment setup and some core Geospatial libraries installed you are good to go. If you need some help check out my setting up Python post here and here (for Jupyter Notebooks)
Imports used in this script
%matplotlib inline
import geopandas as gpd
import pandas as pd
from osgeo import gdal, gdal_array
import numpy as np
import matplotlib.pyplot as plt
import shapely
import fiona
import rasterio
from rasterio import plot
from rasterio.mask import mask
Reading and plotting
Next lets read the shapefile in and plot in using GeoPandas (need help understanding this part? go here)
training_data = gpd.read_file(r"training.shp") training_data.plot(column='class', cmap=None, legend=True, figsize=(10, 10))

Next lets load and plot the corresponding Satellite image that this training data set was based on, using Rasterio
datain = r"south_coast.tif" raster = rasterio.open(datain) arr = raster.read() plt.figure(figsize=(10,10)) plot.show(rasterio.plot.adjust_band(arr))

and then lets overlay both of these two plots (need more info? go here)
fig, ax = plt.subplots(figsize=(10, 10))
plot.show(rasterio.plot.adjust_band(arr), transform=raster.transform, ax=ax)
training_data.plot(ax=ax, column='class', cmap=None, legend=True, alpha=0.75)

I want to extract the values from the Satellite image and plot and then do something with. I have 2 options to extract (or mask) the Satellite image with the Shapefile. Either use fiona
with fiona.open("training.shp", "r") as shapefile: shapes = [feature["geometry"] for feature in shapefile] out_image, out_transform = mask(raster, shapes, crop=True) fig, ax = plt.subplots(figsize=(10, 10)) plot.show(rasterio.plot.adjust_band(out_image))
or using shapely
geom = [shapely.geometry.mapping(training_data)["features"][i]["geometry"] for i in range(0,len(training_data))] out_image, out_transform = mask(raster, geom, crop=True) fig, ax = plt.subplots(figsize=(10, 10)) plot.show(rasterio.plot.adjust_band(out_image))
I’ll let you decide which you prefer. Either way loop over the geometry to create a list of geometries (in the first case called ‘shapes’ and the second ‘geom’ and then parse it in as a parameter alonside the raster image to the Rasterio method called mask. Plotting it up you get:

I think this is already pretty useful. It should is be possible at this point to combine the labels from the shapefile with this raster as an input into a machine learning methodology. I run a one day course on this topic if you are interested drop me an email (info@acgeospatial.co.uk)
Extracting Values
In this blog though I want to extract these values, because I want to plot the mean pixel value for each of these training features. I do this by reusing the code above, but this time iteraing over each ‘geom’ (or ‘shapes’ depending on which method from above you prefer)
for count, value in enumerate(geom): out_image, out_transform = mask(raster, [value], crop=True) red = out_image[0,:,:].mean() green = out_image[1,:,:].mean() blue = out_image[2,:,:].mean() print(f"{training_data.iloc[count]['class']}: {red}, {green}, {blue}")
That allows me to get the red, green and blue bands and the mean values. I could get the max or min by using mean.max() or mean.min() if I wanted. This will print out:
Water: 347.2429110105581, 570.8265460030166, 732.142383107089 Field: 322.9537037037037, 595.5836139169472, 670.6175645342312 BareEarth: 575.673275862069, 533.7560344827587, 651.5491379310345 BareEarth: 429.76128472222223, 398.7057291666667, 479.7708333333333 Field: 456.66424682395643, 874.0199637023593, 952.8166969147005 Urban: 509.66346153846155, 617.0173076923077, 803.4378205128205 Cloud: 854.008064516129, 910.2085253456221, 1019.0921658986175 Sand: 307.9588100686499, 425.52402745995425, 563.7665903890161 Trees: 467.5656108597285, 597.4751131221719, 794.8054298642534 Trees: 193.95488721804512, 245.14786967418547, 322.328320802005 BareEarth: 632.3832288401254, 646.1669278996865, 797.2452978056426 Field: 628.3308589607635, 823.1198303287381, 971.2184517497349 Field: 821.4128787878788, 930.529356060606, 1068.3910984848485 Water: 190.1728545343883, 311.4396429296003, 409.89784946236557 Urban: 447.7516276041667, 578.8880208333334, 769.9345703125 BareEarth: 759.0893997445721, 738.4648786717752, 868.2605363984675 Field: 627.2172839506172, 814.0, 969.737037037037 Field: 451.36607142857144, 787.7839285714285, 898.7071428571429
Adding to a dataframe
Ideally I’d like to plot these up and I will come to that in the last step. Before I get there I need to get this data into a pandas dataframe. I think there are several ways of doing this, but this worked nicely for me:
lsmean = [] for count, value in enumerate(geom): out_image, out_transform = mask(raster, [value], crop=True) mean_img = out_image.mean(axis=(1,2)) lsmean.append(mean_img) df = pd.DataFrame(np.vstack(lsmean)) df.rename({0: 'mean_band1', 1: 'mean_band2', 2: 'mean_band3'}, axis=1, inplace=True) df_mean = training_data.join(df) print(df_mean.head())
In this block of code I am creating an empty list (lsmean) that I intend to append an array of the mean values for each feature for each band. I iterate over the geom, calling the mask method again, but this time I use the .mean(axis=(1,2) commands to get the mean for each band. If I was to plot the shape for this array it would look like
(3, 519, 751)
In this script I’m getting the mean of each of the 3 bands – hopefully you can see this. Once I have my mean_img I append the array to the lsmean list. Outside of the loop I have created a new dataframe (called df) and stacked these mean bands into columns and then renamed the fields (otherwise I have 0, 1, 2 as fieldnames). Finally I join my training_data geodataframe to this new dataframe and call it df_mean. Printing out the first 5 lines gives me:
id class geometry \ 0 1 Water POLYGON ((647905.136 5629875.093, 648156.469 5... 1 2 Field POLYGON ((650678.179 5629732.671, 650912.757 5... 2 3 BareEarth POLYGON ((651128.485 5627533.505, 651224.829 5... 3 3 BareEarth POLYGON ((650942.079 5627208.866, 651027.951 5... 4 2 Field POLYGON ((652427.040 5626062.159, 652533.856 5... mean_band1 mean_band2 mean_band3 0 347.242911 570.826546 732.142383 1 322.953704 595.583614 670.617565 2 575.673276 533.756034 651.549138 3 429.761285 398.705729 479.770833 4 456.664247 874.019964 952.816697
Now that I have my data I am pretty much ready to plot. I just need to drop fields I don’t need (geometry and id) and then transpose (.T) the dataframe to allow me to use the .plot method.
df_mean = pd.DataFrame(df_mean.drop(columns=['geometry', 'id'])) df_mean = df_mean.set_index('class').T
If I print the first 5 lines (I only actually have 3 lines as I have 3 bands!) my dataframe now looks like:
class Water Field BareEarth BareEarth Field \ mean_band1 347.242911 322.953704 575.673276 429.761285 456.664247 mean_band2 570.826546 595.583614 533.756034 398.705729 874.019964 mean_band3 732.142383 670.617565 651.549138 479.770833 952.816697 class Urban Cloud Sand Trees Trees \ mean_band1 509.663462 854.008065 307.958810 467.565611 193.954887 mean_band2 617.017308 910.208525 425.524027 597.475113 245.147870 mean_band3 803.437821 1019.092166 563.766590 794.805430 322.328321 class BareEarth Field Field Water Urban \ mean_band1 632.383229 628.330859 821.412879 190.172855 447.751628 mean_band2 646.166928 823.119830 930.529356 311.439643 578.888021 mean_band3 797.245298 971.218452 1068.391098 409.897849 769.934570 class BareEarth Field Field mean_band1 759.089400 627.217284 451.366071 mean_band2 738.464879 814.000000 787.783929 mean_band3 868.260536 969.737037 898.707143
Plotting
Now I can plot. One line will do it
lines = df_mean.plot(figsize=(10,6), legend = True)
Which gives me:

Already you can tell that my data is very similar! As this is only an example dataset (and I am sure your training dataset will be much better) lets finish up by creating a graph for each class.
I will try and bring this together and share some code from the point of loading the image and shapefile in.
lsmean = []
for count, value in enumerate(geom):
out_image, out_transform = mask(raster, [value], crop=True)
mean_img = out_image.mean(axis=(1,2))
lsmean.append(mean_img)
df = pd.DataFrame(np.vstack(lsmean))
df.rename({0: 'mean_band1', 1: 'mean_band2', 2: 'mean_band3'}, axis=1, inplace=True)
df_mean = training_data.join(df)
cat = list(df_mean['class'].unique())
def plot_training(cat, df_in):
df_cat = df_in[df_in['class'] == cat]
df = pd.DataFrame(df_cat.drop(columns=['geometry', 'id']))
df = df.set_index('class').T
title = cat
lines = df.plot(figsize=(10,6), legend = True, title = title)
return None
for category in cat:
plot_training(category, df_mean)
There are a couple of changes from above. This time I have created a list of the unique classes (or categories) called cat. I loop over these categories and parse in the df_mean (from before) and the cateory into a function called plot_training. This creates a new dataframe containing my class, drops unnecessary fields, transposes it and plots it up. Giving me images like:

I could then use this to investigate my training dataset and improve it by removing erronous training data, or perhaps finding mislabelled features. In a future blog I’ll show you how to bring this notebook into a script that you can call on the command line.
Recap
In this post I have shown you have to extract values from a Satellite image based on a labelled training dataset (Shapefile). The code here could form the basis for a much bigger code block that does alot more inspection and reporting on your training datasets. I think its useful to be able to plot the different classes quickly that will then allow you to return to and improve on your labelled training data.
All the code for this post is on my satellite imagery github repo here and the direct link for this corresponding notebook is here
I am a freelancer able to help you with your projects. I offer consultancy, training and writing. Iād be delighted to hear from you.
Feel free to connect or follow me; I am always keen to talk about Earth Observation.