Over 70% of the world’s fresh water is used for irrigation, so there is strong demand for accurate estimates and metrics that can support sensible water use in the agricultural sector.
Reference evapotranspiration is the evapotranspiration from a hypothetical grass surface that is assumed to be well-irrigated (unlimited water). It is used in irrigation planning to determine the amount of water required by crops. Since the reference evapotranspiration is a hypothetical concept, it is estimated rather than measured.
The estimate depends on many near-surface atmospheric variables and, in essence, expresses the evaporative demand of the atmosphere. One of the most common and accurate methods for estimating it is the Penman-Monteith equation.
In this article, we use machine learning to predict the reference evapotranspiration in the state of California using in-situ measurements. We initially select 8 features as inputs to our model: average, maximum, and minimum temperature, relative humidity, wind speed, elevation, solar radiation, and a categorical variable that captures the spatial variation in these atmospheric variables.
1. Data
We obtained the data from the California Irrigation Management Information System (CIMIS), which has 145 in-situ weather stations located across California. These stations are placed on well-watered grass surfaces. Since the stations measure all the parameters required to calculate the reference evapotranspiration, we calculated it using the Penman-Monteith equation and used the result as our response variable.
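To make the calculation of the response variable concrete, here is a minimal sketch of the daily FAO-56 form of the Penman-Monteith equation. It is an assumption that this daily form matches the calculation used here; the function names, argument names, and example inputs are hypothetical, and net radiation is assumed to be supplied directly rather than derived from measured solar radiation.

```python
import math


def saturation_vapor_pressure(temp_c):
    """Saturation vapor pressure (kPa) at air temperature temp_c (deg C)."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def fao56_daily_et0(t_avg, t_max, t_min, rh_mean, wind_2m, net_rad,
                    elevation, soil_heat=0.0):
    """Daily reference evapotranspiration (mm/day), FAO-56 Penman-Monteith.

    t_avg, t_max, t_min : air temperature (deg C)
    rh_mean             : mean relative humidity (%)
    wind_2m             : wind speed at 2 m height (m/s)
    net_rad, soil_heat  : net radiation and soil heat flux (MJ/m^2/day)
    elevation           : station elevation (m)
    """
    # Atmospheric pressure (kPa) and psychrometric constant (kPa/deg C).
    pressure = 101.3 * ((293.0 - 0.0065 * elevation) / 293.0) ** 5.26
    gamma = 0.000665 * pressure

    # Slope of the saturation vapor pressure curve at t_avg (kPa/deg C).
    delta = 4098.0 * saturation_vapor_pressure(t_avg) / (t_avg + 237.3) ** 2

    # Saturation and actual vapor pressure (kPa).
    e_sat = (saturation_vapor_pressure(t_max) + saturation_vapor_pressure(t_min)) / 2.0
    e_act = rh_mean / 100.0 * e_sat

    radiation_term = 0.408 * delta * (net_rad - soil_heat)
    aerodynamic_term = gamma * 900.0 / (t_avg + 273.0) * wind_2m * (e_sat - e_act)
    return (radiation_term + aerodynamic_term) / (delta + gamma * (1.0 + 0.34 * wind_2m))


# Hypothetical warm, fairly dry summer day.
et0 = fao56_daily_et0(t_avg=25.0, t_max=32.0, t_min=18.0, rh_mean=50.0,
                      wind_2m=2.0, net_rad=15.0, elevation=100.0)
print(f"ET0 = {et0:.2f} mm/day")
```

The radiation term and the aerodynamic term show why solar radiation, temperature, humidity, and wind speed all appear among the candidate features.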

The hourly data were averaged to obtain daily data, and only the growing season (April to September) of the 5 years 2015–2019 was used for the study. The data for all stations were stacked to produce 110,929 rows of data.
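The aggregation and filtering step can be sketched with pandas as follows. The column names and the synthetic hourly values are assumptions for illustration, not the CIMIS schema.

```python
import numpy as np
import pandas as pd

# Synthetic hourly record for one hypothetical station, 2015 through 2019.
hours = pd.date_range("2015-01-01 00:00", "2019-12-31 23:00",
                      freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
hourly = pd.DataFrame(
    {
        "air_temp": 15.0 + 10.0 * rng.random(len(hours)),
        "rel_hum": 40.0 + 30.0 * rng.random(len(hours)),
    },
    index=hours,
)

# Collapse the hourly values to one row per day.
daily = pd.DataFrame(
    {
        "TAVG": hourly["air_temp"].resample("D").mean(),
        "TMAX": hourly["air_temp"].resample("D").max(),
        "TMIN": hourly["air_temp"].resample("D").min(),
        "RH": hourly["rel_hum"].resample("D").mean(),
    }
)

# Keep only the growing season, April (4) to September (9).
in_season = (daily.index.month >= 4) & (daily.index.month <= 9)
growing = daily[in_season]
print(len(growing), "daily rows for this station")
```

Repeating this per station and concatenating the results gives the stacked table; days with missing measurements would reduce the row count below the full calendar total.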
2. Feature Engineering
The features chosen for our machine learning model were the atmospheric variables RH, TMAX, TMIN, TAVG, and wind speed, which describe the conditions at the site, and the elevation, which accounts for the terrain. However, since in-situ measurements are being used, the dependence of reference evapotranspiration on these atmospheric variables could change with location. In California, we may expect stations close to the coast to behave similarly. To capture this variation in space, k-means clustering was used to group stations with similar atmospheric conditions.
The features chosen for the k-means clustering were the air temperature, relative humidity, and wind speed. To group similar stations together, these features were averaged for each station over the 5-year period.
The number of clusters (k) was chosen using the following methods:
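A minimal sketch of this step, assuming scikit-learn and pandas: average the three features per station, then cluster the station-level averages. The data are synthetic, the column names are hypothetical, and standardizing the features before clustering is an assumption rather than something stated above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_stations, n_days = 145, 30

# Each synthetic station gets its own climate, plus day-to-day noise.
station_temp = rng.normal(20.0, 4.0, n_stations)
station_rh = rng.normal(55.0, 12.0, n_stations)
station_wind = rng.normal(2.5, 0.8, n_stations)

daily = pd.DataFrame(
    {
        "station_id": np.repeat(np.arange(n_stations), n_days),
        "TAVG": np.repeat(station_temp, n_days) + rng.normal(0.0, 2.0, n_stations * n_days),
        "RH": np.repeat(station_rh, n_days) + rng.normal(0.0, 5.0, n_stations * n_days),
        "wind": np.repeat(station_wind, n_days) + rng.normal(0.0, 0.5, n_stations * n_days),
    }
)

# One row per station: long-term mean of each clustering feature.
station_means = daily.groupby("station_id")[["TAVG", "RH", "wind"]].mean()

# Scale so that no feature dominates the Euclidean distance (assumption).
scaled = StandardScaler().fit_transform(station_means)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)
station_means["cluster"] = kmeans.labels_
print(station_means["cluster"].value_counts().sort_index())
```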
I. The elbow method

The elbow method shows a break at k=4 and hence we chose 4 clusters to group our stations.
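The elbow method fits k-means for a range of k and looks for the point where the within-cluster sum of squares (inertia) stops dropping sharply. A minimal sketch with scikit-learn on synthetic data (the blobs stand in for the station averages and are an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for 145 station-level feature vectors.
X, _ = make_blobs(n_samples=145, centers=4, n_features=3, random_state=0)

inertia_by_k = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia_by_k[k] = model.inertia_  # within-cluster sum of squared distances

for k, inertia in inertia_by_k.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting inertia against k and choosing the k at the bend is a visual judgment, which is why it is usually paired with a second criterion.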
II. Silhouette score
A silhouette score of 0.36 tells us that the clusters are not cleanly separable on these features and overlap to some extent. However, we used only 145 data points and long-term (~5 years) averaged values as our features, so the local conditions may not vary as distinctly as we expected.
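For each point, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. The score is the average over all points and lies between -1 and 1. A minimal sketch with scikit-learn on synthetic, deliberately overlapping data (an assumption, not the station data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Wide blobs so that neighbouring clusters overlap.
X, _ = make_blobs(n_samples=145, centers=4, n_features=3,
                  cluster_std=3.0, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.2f}")
```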
The labels generated from the clustering of the stations were then used in our main dataset to identify similar stations and predict the reference evapotranspiration.
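Attaching the station-level labels to the daily rows amounts to a join on the station identifier. The sketch below uses pandas with hypothetical station IDs and column names; one-hot encoding the label is an assumption about how the categorical variable was fed to the models.

```python
import pandas as pd

# Hypothetical cluster label per station.
station_clusters = pd.DataFrame(
    {"station_id": [2, 6, 7], "cluster": [0, 3, 0]}
)

# Hypothetical daily rows, several per station.
daily = pd.DataFrame(
    {
        "station_id": [2, 2, 6, 7],
        "TAVG": [21.4, 22.0, 17.8, 20.9],
    }
)

# Every daily row inherits the label of its station.
merged = daily.merge(station_clusters, on="station_id", how="left")

# Turn the label into indicator columns (assumption: one-hot encoding).
encoded = pd.get_dummies(merged, columns=["cluster"], prefix="cluster")
print(encoded)
```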
Additionally, we used a random forest regressor to evaluate the importance of each feature and ranked the features accordingly.
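A minimal sketch of such a ranking with scikit-learn's impurity-based importances. The data are synthetic and the target is constructed so that the first two columns dominate, purely to show the mechanics; the feature names are placeholders and the output is not a result of this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["solar_rad", "TAVG", "TMAX", "TMIN", "RH", "wind", "elevation"]

# Synthetic features in [0, 1) and a target driven mostly by columns 0 and 1.
X = rng.random((500, len(feature_names)))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=500)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; sort from most to least important.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name:10s} {importance:.3f}")
```

Impurity-based importances can be split unevenly among strongly correlated features (for example the three temperature columns), so a low score does not necessarily mean a feature carries no information.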

Solar radiation and the average temperature have the highest feature importance and the rest show relatively lower importance. Also, we want to predict the target value with predictors that are easily available and can be remotely sensed using satellites, and since wind speed is not a very important parameter, we limit ourselves to the following features:

3. Approach
In our analysis, we used an artificial neural network to predict the reference evapotranspiration. Since most of the features showed some linear relationship with our response variable, we initially used ridge regression for prediction. We chose ridge regression because some of the features show a strong linear relationship with one another; for instance, the temperature is linearly related to the humidity. In the presence of multicollinearity in the feature space, ridge regression gives more stable coefficient estimates than ordinary linear regression.
However, since the response appears to be nonlinear in some of the features, a multi-layer perceptron with two hidden layers was used to capture this nonlinearity. Through experimentation, the following architecture produced the best results.
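The effect of multicollinearity can be seen in a small synthetic example, sketched here with scikit-learn. Two nearly collinear predictors stand in for temperature and humidity; the variable names, the data, and the penalty strength alpha=1.0 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

temp = rng.normal(size=n)
humidity = -temp + 0.01 * rng.normal(size=n)  # almost a linear copy of temp
X = np.column_stack([temp, humidity])
y = 2.0 * temp + 0.1 * rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("OLS coefficient norm:  ", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
```

Because the ridge objective adds a penalty on the squared coefficient norm, its coefficient vector is never longer than the ordinary least squares one, which keeps the estimates from swinging wildly when two columns carry almost the same information.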

The feature space contains climatic and geographic variables with different magnitudes and units. Feature scaling puts these variables on a comparable scale and is essential for an MLP, as it improves the prediction accuracy of our ANN.
The ReLU function was used as the activation function in our ANN, and the model was trained to minimize the loss function.
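A minimal sketch of a scaled two-hidden-layer MLP, assuming scikit-learn. The layer sizes (32, 16), the synthetic data, and the use of MLPRegressor (which minimizes squared error) are all assumptions; the architecture and framework actually used are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Synthetic features on very different scales (illustration only).
solar = rng.uniform(5.0, 30.0, n)
temp = rng.uniform(5.0, 35.0, n)
rh = rng.uniform(10.0, 100.0, n)
elev = rng.uniform(0.0, 2000.0, n)
X = np.column_stack([solar, temp, rh, elev])

# Synthetic nonlinear target; not a physical evapotranspiration model.
y = 0.1 * solar + 0.005 * solar * temp - 0.01 * rh + rng.normal(0.0, 0.1, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The scaler is fitted on the training split only, then reused on the test split.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 16), activation="relu",
                 max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)

test_r2 = model.score(X_test, y_test)
print(f"Test R^2 on synthetic data: {test_r2:.3f}")
```

Wrapping the scaler and the network in one pipeline keeps the scaling statistics from leaking test-set information into training.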

We also fitted ridge regression and random forest models for comparison.
4. Model Performance
Model predictions were compared with the true values, and the results show that the model explains 93% of the variability in the reference evapotranspiration (R² = 0.93).
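The "variability explained" figure is the coefficient of determination, R² = 1 - SS_res / SS_tot. A minimal sketch with NumPy on a handful of made-up values (the numbers are hypothetical and unrelated to the results above):

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# Hypothetical observed and predicted values (mm/day).
y_true = np.array([4.0, 5.0, 6.0, 7.0])
y_pred = np.array([4.5, 5.0, 5.5, 7.0])

r2 = r_squared(y_true, y_pred)
error = rmse(y_true, y_pred)
print(f"R^2 = {r2:.2f}, RMSE = {error:.3f} mm/day")
```

Reporting an error in physical units alongside R² makes it easier to judge whether the fit is good enough for irrigation scheduling.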


5. What can we conclude?
All three models have shown high R² values, but the ANN performed slightly better than the random forest and the ridge regression model. The errors in the model could be partially attributed to using a limited feature space and excluding the wind speed component. The error for each day (every row) was also plotted against the features to investigate the presence of systematic bias in the model. However, the variation of the errors with the features did not show any bias in predicting the reference evapotranspiration.
In the future, this model could be improved by the addition of more features such as soil moisture to improve the clustering of the stations based on similar site conditions. It would also be interesting to deploy the model using satellite data instead of in-situ measurements for large-scale reference evapotranspiration prediction.
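One numerical complement to plotting errors against a feature is to bin the feature and compare the mean error per bin: an unbiased model gives bin means near zero, while a systematic bias shows up as a trend. The sketch below uses NumPy with synthetic residuals; the helper name and the data are hypothetical.

```python
import numpy as np


def binned_mean_residual(feature, residual, n_bins=5):
    """Mean residual within equal-count (quantile) bins of one feature."""
    inner_edges = np.quantile(feature, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bin_index = np.digitize(feature, inner_edges)  # values in 0 .. n_bins - 1
    return np.array([residual[bin_index == b].mean() for b in range(n_bins)])


rng = np.random.default_rng(0)
feature = rng.uniform(0.0, 30.0, 5000)         # stand-in for one input column
unbiased = rng.normal(0.0, 0.5, 5000)          # errors unrelated to the feature
biased = unbiased + 0.05 * (feature - 15.0)    # errors that drift with the feature

flat = binned_mean_residual(feature, unbiased)
trend = binned_mean_residual(feature, biased)
print("unbiased model, mean error per bin:", np.round(flat, 3))
print("biased model, mean error per bin:  ", np.round(trend, 3))
```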