
Masking Geo-Spatial Level 3 satellite images using shapefiles: A Grid-based approach

A one-shot guide to masking level-3 satellite data using shapefiles.

How to process and mask atmospheric level-3 data over a particular region using the shapefile of that area


Photo by NASA on Unsplash

Introduction

Let us start with the concept of remote sensing. It is the study of Earth using data collected via sensors (such as satellites). These sensors can monitor and collect data on different aspects of the planet, including vegetation, the atmosphere, and the oceans, and they provide the data in the form of heat maps or images. Image processing and masking are fundamental tools for extracting information from the data collected by these sensors. This article focuses on how to process and mask atmospheric NO2 data over a particular region using the shapefile of that region. The data used here was collected by the Ozone Monitoring Instrument (OMI) and downloaded from NASA's Earthdata website; you can download it from the link below.

https://search.earthdata.nasa.gov/search/granules?p=C1266136111-GES_DISC&pg[0][v]=f&pg[0][gsk]=-start_date&q=omi%20no2&tl=1628603996.133!3!!

Data Description

The data used here is level-3, 2-D gridded satellite data recorded by the OMI instrument, with a resolution of 0.25 x 0.25 degrees. It consists of tropospheric columnar nitrogen dioxide (NO2). The detailed structure of the dataset can be extracted as shown below:

Use your own file path in filename
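The original snippet is not shown here; a minimal sketch using h5py's visititems to walk the file tree and print every group and dataset (the file path is a placeholder) might look like this:

import h5py as hdf

filename = r'path to your own .he5 file'  # use your own file path in filename

with hdf.File(filename, 'r') as f:
    # Recursively print the name and type of every group and dataset
    f.visititems(lambda name, obj: print(name, obj))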

The main data fields are under the Data Fields subtree of the dataset. There are four data variables, as described below:

ColumnAmountNO2: the total NO2 column density.

ColumnAmountNO2CloudScreened: the total NO2 column density with cloud noise filtered out.

ColumnAmountNO2Trop: the column density of NO2 in the tropospheric region.

ColumnAmountNO2TropCloudScreened: the column density of NO2 in the tropospheric region without cloud noise.

In this article, we will be working with the ColumnAmountNO2TropCloudScreened data variable.

Tools and Technologies

Python 3.7 is used as the main language for extracting and processing the dataset. The version matters: at the time of writing, the geopandas package did not work on later versions of Python. Jupyter Notebook is used for executing the scripts. The datasets are in the .he5 file format; to read and extract the data from the data files, a specialized package called h5py is required. All of these are available in the conda environment, which you can install via the link below using Anaconda.

Anaconda | Individual Edition

The packages necessary for executing the code are:
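The original import list is not shown here; the following minimal set matches the aliases (hdf, np, pd, gpd) used throughout the code below:

import os                              # traverse the data folder
import numpy as np
import pandas as pd
import h5py as hdf                     # read .he5 files
import geopandas as gpd                # geo-data processing
import shapely                         # geometry type checks
from shapely.geometry import Point     # build point geometries
import matplotlib.pyplot as plt        # visualization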

Importing Data

To import the data, the package named os is required, which is an inbuilt package in Python. It traverses the given path and reads all the file names contained in the folder into a Python list. After successfully retrieving the dataset names with their paths as shown below, the data is extracted using the h5py package.

While extracting the data, it also needs to be cleaned, i.e., all pixel values less than zero need to be filtered out and replaced with NaN values. The data extraction and cleaning are done in one go to save some computational overhead and also to reduce the size of the code. After cleaning the data, the mean of all the data is taken along axis=0 to get the mean data.

Note: Taking the mean is on a requirement basis. If some analysis is to be done on a daily basis, one can skip the mean step and run the masking for each day's data individually.

This whole process needs to be done in two steps:

  • Collect all the data file names.
  • Extract the data using h5py.
Collecting all files from the respective path
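A minimal sketch of step one (the same snippet reappears in the Execution section); the folder path is a placeholder:

filenames = []
# Walk the data folder and collect the full path of every file in it
for path, subdirs, files in os.walk(r'path to your own data'):
    for filename in files:
        f = path + '/' + filename
        filenames.append(str(f))

With the file names collected, the second step extracts and cleans the data: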
def extractData(filenames):
    data = []
    for fname in filenames:
        with hdf.File(fname, 'r') as fHandle:
            # Data extraction
            # Use any one of the data variables from the data description
            dataNo2 = fHandle['HDFEOS/GRIDS/ColumnAmountNO2/Data Fields/ColumnAmountNO2TropCloudScreened'][:]
        dataNo2[dataNo2 < 0] = np.nan  # cleaning: negative pixels are fill values
        data.append(dataNo2)
    dataArray = np.array(data)
    mean = np.nanmean(dataArray, axis=0)  # mean over all days (axis=0)
    return mean

Here, the dataset path passed to fHandle is the path taken from the data description.

Read Shapefile

For testing purposes, the shapefile of India's state borders is used. To read and process the shapefile, geopandas is required. Geopandas is a Python package used extensively for geo-data processing. For visualization purposes, matplotlib is used.
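The readSHape helper called in the Execution section is not shown in the original snippets; a minimal sketch, assuming it simply reads and previews the shapefile, might look like this:

def readSHape(shpFile):
    # Read the shapefile into a GeoDataFrame
    map1 = gpd.read_file(shpFile)
    # Quick preview of the borders
    map1.plot(figsize=(10, 10))
    plt.show()
    return map1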

Creating Latitude and Longitude list

In order to convert the image data to a gridded geo-dataset, the exact position of each data point must be related to a corresponding position in the global coordinate system (latitude and longitude). To relate the data, latitude and longitude vectors are generated using the resolution of the data. Since the NO2 data is global, the bounds for latitude are set to [-90, 90] and the bounds for longitude to [-180, 180]. Since the resolution of the data is 0.25 degrees, we create equispaced arrays with a 0.25 interval while maintaining the latitude and longitude bounds.
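For the 0.25-degree grid, this gives 720 latitudes and 1440 longitudes (the same arrays are built again in the Execution section):

latitude = np.arange(-90, 90, 0.25)     # 720 grid latitudes
longitude = np.arange(-180, 180, 0.25)  # 1440 grid longitudes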

Getting boundaries from shapefile

In order to clip the global data to the coordinates of the shapefile, we must extract the latitude and longitude bounds from the shapefile we are using. These will later be used for regridding the data after masking and also for clipping the data to the required coordinates. This can be done by generating another set of latitude and longitude lists with the bounds of the shapefile.

A shapefile might consist of multiple polygons, so it is required to get latitude and longitude coordinates from all the polygons. Taking the maximum and minimum among these provides the bounds of the whole shapefile, as shown below.

def getBounds(shape):
    x1 = []
    y1 = []
    for i in range(len(shape)):
        geom = shape.iloc[i].geometry
        if isinstance(geom, shapely.geometry.polygon.Polygon):
            # Single polygon: take its exterior ring directly
            x = geom.exterior.coords.xy[0]
            y = geom.exterior.coords.xy[1]
            x1.append(min(x))
            x1.append(max(x))
            y1.append(min(y))
            y1.append(max(y))
        else:
            # MultiPolygon: collect bounds from every member polygon
            for poly in geom.geoms:
                x = poly.exterior.coords.xy[0]
                y = poly.exterior.coords.xy[1]
                x1.append(min(x))
                x1.append(max(x))
                y1.append(min(y))
                y1.append(max(y))
    return x1, y1

def getBoundary(shape, res):
    # x holds longitudes, y holds latitudes
    x, y = getBounds(shape)
    my_lats = np.arange(np.min(y)-res/2, np.max(y)+res/2, res)
    my_lons = np.arange(np.min(x)-res/2, np.max(x)+res/2, res)
    return (my_lats, my_lons)
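Note that getBounds collects longitudes in x1 and latitudes in y1 (the x axis is longitude, the y axis is latitude), and getBoundary returns the latitude array first. Keep this in mind in the clipping code below, where x1 receives the latitude bounds and y1 the longitude bounds.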

Creating and clipping Geo-DataFrame

Until now, we were using the raw data from the files. But in order to mask the data later, it needs to be converted to a geo-dataframe consisting of geographical points. This can easily be accomplished using the geopandas library.

While creating the dataframe, the data can be clipped using the latitude and longitude bounds of the shapefile. Clipping can be explained using the analogy of a camera, as shown in the diagram below.

Block diagram of world and viewport

In our case, the world is the actual data with longitudes ranging from -180 to +180 and latitudes ranging from -90 to +90. The coordinates of the viewport are determined by the bounds of the shapefile.

def createGeoData(lt, ln, mean_No, shape):
    lat = []
    lon = []
    data = []
    # Flatten the 2-D grid into parallel latitude/longitude/value lists
    for i in range(len(lt)):
        for j in range(len(ln)):
            lat.append(lt[i])
            lon.append(ln[j])
            data.append(mean_No[i][j])

    Geo_Dataset = pd.DataFrame({'latitude': lat, 'longitude': lon, 'NO2': data})
    # Clip data with the shapefile boundaries
    # (getBoundary returns latitudes first: x1 = latitude bounds, y1 = longitude bounds)
    x1, y1 = getBoundary(shape, 0.25)
    temp = Geo_Dataset[Geo_Dataset['latitude'] > int(min(x1))]
    temp = temp[temp['latitude'] < int(max(x1))]
    temp = temp[temp['longitude'] > int(min(y1))]
    temp = temp[temp['longitude'] < int(max(y1))]
    # Build point geometries in the WGS84 coordinate system
    geometry = [Point(xy) for xy in zip(temp['longitude'], temp['latitude'])]
    geo_df = gpd.GeoDataFrame(temp, crs='EPSG:4326', geometry=geometry)
    geo_df.reset_index(drop=True, inplace=True)
    return geo_df

Masking the data

Since the data has been converted to a geo-dataframe consisting of geometry (points), these points can be directly checked for containment in any of the polygons of the shapefile. The concept of checking whether a point lies inside a polygon is used here, via the Point and Polygon classes of the shapely library. Points that fall inside any one of the polygons of the shapefile are kept intact, and the rest are assigned NaN values.

def mask(map1, geo_df):
    pol = map1.geometry
    pts = geo_df.geometry
    l, t, df = [], [], []
    for k in range(len(pts)):
        flag = True
        for i in range(len(pol)):
            # Keep the point if any polygon of the shapefile contains it
            if pol.iloc[i].contains(pts.iloc[k]):
                l.append(geo_df.latitude[k])
                t.append(geo_df.longitude[k])
                df.append(geo_df.NO2[k])
                flag = False
                break
        if flag:
            # Point lies outside every polygon: mask it with NaN
            l.append(np.nan)
            t.append(np.nan)
            df.append(np.nan)
    newdf = pd.DataFrame({'latitude': l, 'longitude': t, 'NO2': df})
    return newdf
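Note: the nested loop above is straightforward but slow for large grids, since every point is tested against every polygon. For bigger datasets, geopandas offers a vectorized spatial join (gpd.sjoin) that performs the same point-in-polygon test much faster.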

Regridding the data

While converting the data to a geo-dataframe, the data was flattened to make clipping easier. So after masking, the data needs to be re-gridded for visualization.

Since the data is already clipped, the length of the data should be equal to the length of the boundary latitude array times the length of the boundary longitude array obtained from the shapefile.

Note: If the shape of the data does not match with the given condition, check if the bounds are retrieved correctly and the clipping is done properly.

The grid can be created and filled with the data in the following way:

def createGrid(my_lats, my_lons, mask):
    grd = np.zeros((len(my_lats), len(my_lons)))
    # Sanity check: the grid size must match the number of masked values
    print(grd.shape, mask.NO2.shape)
    k = 0
    # Refill the flattened, masked values row by row
    for i in range(len(my_lats)):
        for j in range(len(my_lons)):
            grd[i][j] = mask.NO2[k]
            k = k + 1
    return grd
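Since mask() preserves the row order of the clipped dataframe, the same grid can also be built in one step with NumPy (a one-line alternative, assuming the sizes match; names as in the Execution section):

grd = msk.NO2.to_numpy().reshape(len(my_lat), len(my_lon))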

Visualization

To visualize the data, matplotlib is used. There is not much to explain about the visualization 😆. Just tinker with the color-bar parameters according to your needs.

def plot(grd, map1, x1, y1):
    plt.rcParams["font.weight"] = "bold"
    fig, ax = plt.subplots(figsize=(15, 15))
    # Draw the shapefile borders on top of the data
    map1.plot(ax=ax, alpha=0.7, facecolor='none', edgecolor='black')
    ax.set_title("NO2 Distribution", fontweight='bold', size=20)
    ax.set_xlabel('Longitude (E)', fontsize=20)
    ax.set_ylabel('Latitude (N)', fontsize=20)
    # Here x1 holds the longitude bounds and y1 the latitude bounds
    bounds = [int(min(x1)), int(max(x1)), int(min(y1)), int(max(y1))]
    img = ax.imshow(grd, aspect='auto', interpolation='none', origin='lower', extent=bounds)
    cb = fig.colorbar(img)
    cb.ax.tick_params(labelsize=18)
    cb.set_label('molecules cm^-2', labelpad=-57, size=14)
    ax.tick_params(labelsize=20, width=2.0)
    img.set_clim(0.1*1e16, 0.5*1e16)
    plt.show()

Execution

Now let us see how the above-mentioned functions work together. They must be called in a specific sequence. Here it is:

filenames = []
# Step 1: collect all data file names
for path, subdirs, files in os.walk(r'path to your own data'):
    for filename in files:
        f = path + '/' + filename
        filenames.append(str(f))
# Step 2: extract, clean, and average the data
mean_data = extractData(filenames)
# Global 0.25-degree grid coordinates
latitude = np.arange(-90, 90, 0.25)
longitude = np.arange(-180, 180, 0.25)
# Read the shapefile (use your own shapefile path)
shpFile = r'path to your shapefile'
m = readSHape(shpFile)
geo_df = createGeoData(latitude, longitude, mean_data, m)
msk = mask(m, geo_df)
my_lat, my_lon = getBoundary(m, 0.25)
data = createGrid(my_lat, my_lon, msk)
plot(data, m, my_lon, my_lat)

Result

So let us have a look at the output of all the code 😃

Conclusion

As you can see, the global data is masked to the inside of the shapefile. These methods are applicable to any geographical area or border shapefile, and they are a very handy tool for geo-spatial analysis.

