The Past

This section outlines our research into past work in this space - in other words, our literature review. Significant work has been done on using satellite imagery to predict crop yields, and several models of varying sophistication exist from the USDA and other vendors. Our approach attempts to extend this body of work by predicting soybean yields in Brazil, incorporating weather data in addition to satellite data.

This project uses vegetation indices calculated from MODIS satellite images to predict crop yields on a farm-by-farm basis [2]. Our efforts abstract away from the complicated mixture of factors that affect yields on the ground, such as soil conditions, pest pressures, weather, irrigation, fertilizer use, crop rotation, and other advanced farming methods. We assume that vegetation indices can capture the effects of all such surface inputs and that, by applying the latest machine learning techniques, we can find patterns in the data that predict yields as accurately as more data-intensive models.

Several researchers, notably William Allen, Harold Gausman, and Joseph Woolley, have established how the optical properties of plants relate to their growth and productivity. The different wavelengths of light reflected by plants and soils create unique spectral signatures that indicate water content, photosynthetic potential, chlorophyll concentration, and other characteristics correlated with plant health and developmental stage. For a good summary of this work, see [14].

Gauging the health and productivity of plants throughout a growing season using their spectral signatures from imperfect satellite images is complex. To combat the many distortions of reflected light, such as the changing ratio of soil to plant canopy and the differing angles of the sun at the time of image capture, researchers have developed vegetation indices. A vegetation index is a formula relating the various wavelengths of light to plant canopy structures, and different indices measure different qualities of cropland. For example, the Normalized Difference Vegetation Index (NDVI), a standard in agricultural remote sensing research and the index used in this project, is calculated from the red and near-infrared wavebands. It is applied to measure changes in the amount of green biomass over time, which can reveal uneven patterns of growth. NDVI has also been used to address other significant issues faced by those interested in measuring agricultural yields, including the identification and quantification of cropland [17, 18], the tracking of field abandonment and recultivation [4], and the calculation of the intensity of cropland use [5].
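For reference, NDVI is computed from the red and near-infrared (NIR) reflectance values of a pixel. A minimal sketch of the calculation (the band values here are illustrative):

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    Values fall between -1 and 1; dense, healthy vegetation tends toward
    higher values, while bare soil and water sit near or below zero.
    """
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red)

# Example: a well-vegetated pixel vs. a bare-soil pixel
print(ndvi([0.45, 0.25], [0.05, 0.20]))  # -> [0.8, 0.111...]
```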

Various vegetation indices have been used to predict crop yields. Comparisons of the performance of different indices are given in [6], [8], and [16]. Indices have also been combined with surface variables such as soil moisture, surface temperature, and rainfall to build more complex models (see [3] and [15]).

Instead of adding more data to yield-prediction models, this project applies newer machine learning techniques to pure vegetation index models. Other researchers have applied machine learning to agricultural remote sensing problems. Stepwise regression was used in [7] to predict rice production in China. Hierarchical clustering, Bayesian neural networks, and model-based recursive partitioning were used in [8] to predict barley, canola, and spring wheat yields on the Canadian Prairies. The authors in [9] trained a neural network using the shuffled complex evolution optimization algorithm developed at the University of Arizona (SCE-UA) to predict corn and soybean yields at the county level in the US corn belt. Random forests were used in [10] and [19] to classify pixels in satellite images into different categories of land use. In [11], unsupervised linear unmixing was used on hyperspectral data to discriminate between soil and vegetation within one pixel. Multilayer perceptrons and radial basis functions were used in [13] to predict the crop area and yields of corn fields in India.

This project uses time series of satellite images across one growing cycle to predict soybean yields in Brazil. For some of our models, we turned these time series of NDVI values into trajectories and used them to predict yields at one point in time, as is done in [12]. We also applied convolutional neural networks to our time series of satellite images in a way similar to [1].

If I have seen further it is by standing on the shoulders of giants.
- Sir Isaac Newton


The Present - Users and Key Stakeholders

Defining the User

There are many constituents who can benefit from yield predictions for soybeans and other crops, including farmers, policy-makers, research houses, and traders. For the purposes of this research project, the primary users were deemed to be commodities traders, who buy and sell futures on a day-to-day basis. They help make the market (and, as a result, prices), and their access to crop yield estimates has a direct impact on market dynamics.

Research Approach

In an effort to help guide our research, the team solicited feedback from a small group of commodities traders, which was collected through an online questionnaire. Because of the narrow audience, the data collection was limited to 5 respondents. However, the results provided good directional feedback that helped shape our research.

    Below is the list of questions asked, which were designed to allow open-ended responses:
  1. What types of financial instruments do you trade? If you trade commodities futures, which specific commodities?
  2. What type of data factors into your trading decision making?
  3. For commodities futures, have you ever used signals from non-traditional or relatively new data sources (i.e. something that you just started using in the past 2 years)? If so, please describe these new data sources.
  4. How important would it be to have accurate yield predictions as you assess commodities futures pricing (e.g. having an estimate for the corn yield for the harvest coming up in 6 months)?
  5. Do you have access to any sources of yield predictions today? How accurate are they?
  6. How far in advance would access to accurate yield predictions give you a competitive advantage in pricing (e.g. 1 year before harvest, 6 months before harvest, etc.)?
  7. At what resolution would you want yield predictions (nationwide estimates? county level? individual farms?)?
  8. Please add any other thoughts or comments that you think might be helpful as we work on this project.

Research Findings

    Generally, the respondents gave fairly consistent answers to most questions. The key takeaways that are most relevant to this project are listed in the bullets below:
  • Some of the commonly used data sources cited by the respondents include: “fundamentals”, weekly statistics, market flows, macro themes and interest rates. These terms are broad and in some cases could include yield estimates.
  • All respondents said that yield estimates would be important or very important to them. One respondent did indicate that yield estimates are available today, but typically closer to the harvest date.
  • A subset of respondents had in-house research groups as well as external consultants who produced yield estimates. However, there was interest in signals that might indicate when environmental changes render those estimates obsolete.
  • A longer time horizon for yield estimates is of great interest. However, there is considerable skepticism that estimates made 1 year out will be accurate. In general, it appears that the most accurate yield estimates arrive about 3 months prior to harvest.
  • Most respondents felt that nationwide estimates would be the right level of granularity. One respondent saw value in county-level estimates.

Key Takeaways

    The research helped steer the project with the following guidelines:
  • Focus on national predictions
  • Include weather data, as this was deemed to be an important feature
  • The primary value of this effort would be a more accurate prediction made in a reasonably long time window (e.g. 3 - 6 months from harvest)

Brazilian Growth Cycle

I KEEP six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.
- The Elephant's Child, Rudyard Kipling


Data Harvesting and Basic Wrangling

Farm Yield Data

The farm-level soybean yield data we used was provided by a research team from Tufts University on special request. We express our gratitude to the team and to Prof. Coco Krumme for facilitating access to the data. The data contains fields such as dates, farm coordinates in latitude and longitude, yield in metric tons per hectare, and a number of pest and disease indicators (e.g. brown_stink_bug, cucurbit_beetle, fall_armyworm, false_medideira, green_stink_bug, snail, spodoptera_spp, stink_bug, tobacco_budworm, velvetbean_caterpillar and white_fly).

After removing all null data points, the visualization of the farm locations showed that we still had a very broad distribution over Brazil, necessitating a wide geographic area of satellite images in order to provide coverage over all the farms. Additionally, we noticed that some of the locations were outside of the geographic boundaries of Brazil. Further investigation showed that these locations had miscoded lat/long coordinates - either wrong sign of latitude and longitude, a zero value for one of the coordinates, or simply transposed latitude and longitude values. We chose to remove these data points to reduce any ambiguity regarding the validity of those farm locations.
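A minimal sketch of this cleaning step, assuming a pandas DataFrame with hypothetical column names (lat, lon, yield) and an approximate bounding box for Brazil:

```python
import pandas as pd

# Approximate bounding box for Brazil (degrees); the exact cutoffs are illustrative.
LAT_MIN, LAT_MAX = -34.0, 6.0
LON_MIN, LON_MAX = -74.0, -34.0

def clean_farms(farms: pd.DataFrame) -> pd.DataFrame:
    """Drop null records and any farm whose coordinates fall outside Brazil."""
    farms = farms.dropna(subset=["lat", "lon", "yield"])
    in_brazil = (farms["lat"].between(LAT_MIN, LAT_MAX)
                 & farms["lon"].between(LON_MIN, LON_MAX))
    # Miscoded points (flipped signs, zeros, transposed lat/lon) land outside
    # the box; rather than guess the intended location, we drop them.
    return farms[in_brazil].copy()
```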

Satellite Data

Our primary data source was a subset of MODIS data, which provides vegetation indices every 16 days at 250m resolution. From the NASA website at http://modis.gsfc.nasa.gov/data/dataprod/mod13.php:

“MODIS (or Moderate Resolution Imaging Spectroradiometer) is a key instrument aboard the Terra(originally known as EOS AM-1) and Aqua (originally known as EOS PM-1) satellites. MODIS vegetation indices, produced on 16-day intervals and at multiple spatial resolutions, provide consistent spatial and temporal comparisons of vegetation canopy greenness, a composite property of leaf area, chlorophyll and canopy structure. Two vegetation indices are derived from atmospherically-corrected reflectance in the red, near-infrared, and blue wavebands; the normalized difference vegetation index (NDVI), which provides continuity with NOAA's AVHRR NDVI time series record for historical and climate applications, and the enhanced vegetation index (EVI), which minimizes canopy-soil variations and improves sensitivity over dense vegetation conditions. The two products more effectively characterize the global range of vegetation states and processes.”

Due to stale MODIS product APIs and restrictions on download size per request, we could only manually download data as small grids covering 200km by 200km sections. Considering the range of latitudes and longitudes of our farm locations, our initial calculations showed that we would have 165 different grids to download. However, further research narrowed that to only 64 grids (details below). The data collected spans April 2006 to February 2016 and contains 223 distinct satellite images at 16-day intervals. Only the NDVI and EVI vegetation index files were downloaded, at roughly 1GB each per grid, or about 128GB in total.

At the outset, we noticed that the MODIS website only allows two parameters for data requests: the center coordinates and the distance from the center. It then generates a square-like grid. The grid is not a perfect square because the ground distance covered by one degree of longitude shrinks as latitude moves away from the equator. We therefore cannot think in a flat 2-D world and simply divide the longitude range by a constant: too small a step in longitude produces extra grids, while too large a step leaves uncovered area. Instead, we used the maximum latitude and longitude as the starting point and calculated the coordinates of the next grid center by subtracting the degree changes in latitude and longitude that correspond to 200km of spherical distance. We repeated this until we had covered the minimum latitude and longitude needed to include all of our farms.
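A minimal sketch of this stepping logic, assuming a simple spherical-Earth approximation (~111.32 km per degree of latitude); the exact constants and stopping conditions we used may have differed slightly:

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate ground distance of one degree of latitude

def grid_centers(lat_max, lon_max, lat_min, lon_min, step_km=200.0):
    """Walk south/west from the maximum coordinates in ~step_km increments.

    The latitude step is constant, but the longitude step is recomputed for
    each latitude band because a degree of longitude covers less ground
    distance away from the equator.
    """
    centers = []
    step_lat = step_km / KM_PER_DEG_LAT
    lat = lat_max
    while lat > lat_min - step_lat:
        step_lon = step_km / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
        lon = lon_max
        while lon > lon_min - step_lon:
            centers.append((lat, lon))
            lon -= step_lon
        lat -= step_lat
    return centers
```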

This gave us a total of 165 grids covering all of our farm locations; however, not all of them contained a farm. We then matched each farm to its nearest grid according to the farm-to-grid-center distance. After this step, we found that 82 grids contained no farms and 19 grids contained only 1 or 2 farms. Considering that these farms do not appear in all years of yield data and have minimal effect on total yield, we decided to filter them out, leaving only 64 grids for download, each containing at least 5 farms. After manually assigning thirteen grids to each teammate for download, we made our data requests at http://daacmodis.ornl.gov/cgi-bin/MODIS/GLBVIZ_1_Glb/modis_subset_order_global_col5.pl and were notified via email once each data set was ready. The response time ranged from 40 minutes to 3 days.
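A minimal sketch of the matching and filtering step (great-circle distances via the haversine formula; the DataFrame column names are illustrative):

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def assign_farms_to_grids(farms: pd.DataFrame, centers, min_farms=5):
    """Attach each farm to its nearest grid center, then keep only the grids
    that contain at least `min_farms` farms."""
    farms = farms.copy()
    farms["grid"] = [
        int(np.argmin([haversine_km(lat, lon, clat, clon) for clat, clon in centers]))
        for lat, lon in zip(farms["lat"], farms["lon"])
    ]
    counts = farms["grid"].value_counts()
    return farms[farms["grid"].isin(counts[counts >= min_farms].index)]
```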

Weather Data

Our initial weather data source was the NOAA repository. GHCND (the Global Historical Climatology Network - Daily) maintains different summaries based on data exchanged under the World Meteorological Organization (WMO) World Weather Watch Program. Absent API access, we resorted to manual downloads of the monthly summaries from http://www.ncdc.noaa.gov/cdo-web/search.

After investigating the NOAA weather dataset for Brazil, we found that our monthly summaries were very sparse, with about 50% of the values missing. We immediately started looking for alternative sources for this missing weather data and came upon Tutiempo. Tutiempo contains historical climate information for every country in the world. In the case of Brazil, it has data from more than 100 weather stations, and most of them cover the most recent 10 years of weather that we needed. Since an API was not available, we wrote a custom web scraper. The scraper generated and walked through all of the Tutiempo Brazil dataset URLs, downloaded the necessary data, and organized it into CSV files keyed by weather station latitude and longitude. Since the Tutiempo site was fairly adept at disconnecting us, the scraper was designed to restart from the last point of failure (a sketch of this loop appears after the field list below). Once the download was complete, we had gathered the following fields:

T: Average Temperature (°C)
TM: Maximum temperature (°C)
Tm: Minimum temperature (°C)
SLP: Atmospheric pressure at sea level (hPa)
H: Average relative humidity (%)
PP: Total rainfall and / or snowmelt (mm)
VV: Average visibility (Km)
V: Average wind speed (Km/h)
VM: Maximum sustained wind speed (Km/h)
VG: Maximum speed of wind (Km/h)
RA: Indicator whether there was rain or drizzle (In the monthly average, the total days it rained)
SN: Indicator if it snowed (In the monthly average, the total days it snowed)
TS: Indicator if there was a thunderstorm (In the monthly average, the total days with thunderstorms)
FG: Indicator whether there was fog (In the monthly average, the total days with fog)
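A minimal sketch of the resumable scraping loop. The station URL pattern and the HTML parsing are site-specific and not reproduced here; parse_rows() below is a hypothetical placeholder for that parsing.

```python
import csv
import os
import time
import requests

def parse_rows(html):
    """Placeholder: turn one Tutiempo monthly-summary page into a list of dicts."""
    raise NotImplementedError("site-specific HTML table parsing goes here")

def scrape_station_page(url, out_path, pause=2.0, max_retries=5):
    """Download one station page to CSV, skipping files already saved so an
    interrupted run can resume from the last point of failure."""
    if os.path.exists(out_path):
        return  # already scraped on a previous run
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            rows = parse_rows(resp.text)
            with open(out_path, "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
                writer.writeheader()
                writer.writerows(rows)
            return
        except requests.RequestException:
            time.sleep(pause * attempt)  # back off when the site drops the connection
    raise RuntimeError("giving up on " + url)
```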

War is ninety percent information.
- Napoleon Bonaparte


The Future: Modelling

Baseline Features and Labels

The first pass of modeling was built on a baseline data set consisting of one training example for every farm/year combination. In other words, a given farm will have a measured soybean yield for a given year, say 2012; that same farm will have a different yield in 2013, and that is recorded as a separate training example. The primary features for any given training example are the NDVI values of the center pixel in the associated satellite images for that farm over a given time period. Satellite images were collected roughly twice a month, starting 10 months before the given year’s harvest. The target variable is the yield (in metric tons per hectare) for that farm in that year. A list of the features and target variable is included below:

Table 1 - Initial Model Features - Single Pixel Values
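A minimal sketch of how such baseline rows could be assembled; the container names (ndvi_series, yields), both keyed by (farm, year), are illustrative rather than taken from our actual code:

```python
import pandas as pd

def baseline_rows(ndvi_series, yields):
    """Build one training example per farm/year: 21 center-pixel NDVI values
    (ndvi_t0 .. ndvi_t20) plus the yield target in metric tons per hectare."""
    rows = []
    for (farm, year), values in ndvi_series.items():
        row = {"farm": farm, "year": year}
        row.update({"ndvi_t%d" % t: v for t, v in enumerate(values)})
        row["target_yield"] = yields[(farm, year)]
        rows.append(row)
    return pd.DataFrame(rows)
```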

After running some initial models, we decided to expand the area of pixels for which NDVI values were collected. Each pixel represents an area of 250m x 250m (about 15 acres), and without knowledge of the size of the farms, it seemed prudent to collect more than the single pixel representing the centerpoint of the farm. Several “rings” of pixels were collected, as illustrated in Figure 1 below:

Figure 1 - Illustration of “Rings” as a parameter in feature engineering

Table 2 - Model Features with n rings of pixels

As will be discussed later, expanding to 3 rings provided a material benefit to a simple OLS model. Based on this result, all further modeling efforts included at least 3 rings of NDVI data around the center pixel, and in many cases more.

Additional Data Cleaning & Training /Test Data Splits

One issue that reduced the number of training examples was farms whose centers were close to an edge of the satellite image. For these farms, the rings of pixels were often truncated, resulting in only a partial data set, so these examples were removed. There were also several N/A values; after trying various strategies to fill them in, including interpolation, we found that simply using the mean value of the satellite image worked well enough.

The 7,456 remaining examples were split randomly into 5,964 training examples and 1,492 test examples. For this model, to keep the feature space small, 6 rings of pixel data were collected, but an average of each ring was calculated, resulting in 7 pixel values (the center pixel value plus 6 ring averages) for each satellite image. Since there were 21 images per farm, the final feature space consisted of 147 pixel values (21 time periods x 7 pixel values per time period).
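A minimal sketch of the ring-averaging step on a single 13x13 NDVI window (6 rings around the center pixel); the window layout is assumed, not taken from our actual extraction code:

```python
import numpy as np

def ring_averages(window, n_rings=6):
    """Return the center pixel value followed by the mean of each concentric
    ring in a (2*n_rings + 1) x (2*n_rings + 1) block of NDVI values."""
    window = np.asarray(window, dtype=float)
    c = n_rings  # index of the center pixel
    feats = [window[c, c]]
    for r in range(1, n_rings + 1):
        block = window[c - r:c + r + 1, c - r:c + r + 1].copy()
        block[1:-1, 1:-1] = np.nan          # blank the interior, keep only the ring
        feats.append(np.nanmean(block))
    return np.array(feats)                  # 7 values per image; 21 images -> 147 features
```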

Incorporation of Weather Data

After spending some time with just the NDVI data, weather data was incorporated into the models as a new feature set. A summary of the related feature engineering is included below:

    NOAA Weather Data:
  • This data set had a lot of missing values. On average, each variable had more than 50% invalid values. Overall, only 1,755 observations out of 7,456 had complete weather data.
  • Several variables were dropped that were mostly zero.
  • Several variables that were heavily skewed were log transformed.
  • The final training data set dimensions were 5,964 x 492 and the test set was 1,492 x 492.
  • The large number of missing values and features resulted in very unstable models with negative R-squared values and large residuals.
    Restructuring the Data Set:
  • Instead of constructing the dataset with one farm per row, we reshaped it to one image per row. This gives 21 times more rows and lets us add the image sequence as a dummy variable (a sketch of this restructuring appears after this list).
  • New dimension of training set becomes 107,905 x 41, and testing set 26,977 x 41.
  • Despite the missing values, the NOAA data added some predictive power to all models. We filled in the missing values with the variable mean.
    Tutiempo Weather Data:
  • The data quality was much better: 101,667 out of 134,882 images had complete observations after dropping two mostly missing variables (VG and SLP).
  • Log transformed FG, PP, RA, TS since they are right-skewed. Converted binary variable SN to a dummy variable.
  • Final train set 81,333 x 54; final test set 20,334 x 54.
  • Saw further improvements in results compared with NOAA.
    Target variable transformation:
  • Noticed that there were several very high yield farms that were really hard to predict.
  • Log transformed the Yield variable and saw much better model performance.
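A minimal sketch of the restructuring and transformations listed above, assuming the wide table uses the illustrative column names from the earlier sketches (ndvi_t0..ndvi_t20, target_yield) plus the Tutiempo weather fields:

```python
import numpy as np
import pandas as pd

def restructure(df: pd.DataFrame) -> pd.DataFrame:
    """One farm/year per row -> one image per row, with skewed weather
    variables log-transformed and log(yield) as the target."""
    ndvi_cols = ["ndvi_t%d" % t for t in range(21)]
    id_cols = [c for c in df.columns if c not in ndvi_cols]
    long = df.melt(id_vars=id_cols, value_vars=ndvi_cols,
                   var_name="image_seq", value_name="ndvi")   # 21x more rows
    long = pd.get_dummies(long, columns=["image_seq"])        # image sequence dummies
    for col in ["FG", "PP", "RA", "TS"]:                      # right-skewed weather variables
        long[col] = np.log1p(long[col])
    long = long[long["target_yield"] > 0]                     # drop zero yields before the log
    long["log_yield"] = np.log(long["target_yield"])
    return long
```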

Modeling Strategy & Metrics of Performance

Various classes of models were examined: linear models (OLS and Ridge Regression), non-linear models (Support Vector Regression, Random Forests, and Gradient Boosting Decision Trees), and neural networks (CNNs, RNNs, and GRUs). Each is described below.

Several metrics were used to compare performance across models (a scikit-learn sketch of this model and metric comparison appears after the model descriptions below):

  • R-squared
  • Root Mean Squared Error (RMSE)
  • Mean Absolute Error (MAE)
  • Mean Absolute Percent Error (MAPE)

    Linear Models

    • OLS
      • Definition: Ordinary Least Squares regression models are the most basic form of linear regression models. An OLS model is built by finding parameters that minimize the squared distance between observed and predicted values of the dependent variable.
      • Why we used them: We used OLS as a starting point to establish a simple baseline to which we could compare all additional models.
      • What features were inputs: NDVI rings and weather data
    • Ridge Regression
      • Definition: Ridge regression is a variant of OLS regression in which a regularization term is added to penalize large parameter estimates. This approach is often used to mitigate multicollinearity.
      • Why we used them: Given the nature of the data set (NDVI values in a time series), there was a risk of multicollinearity and, as a result, a high variance in parameter estimates. Therefore, we decided to use Ridge Regression as a follow on to OLS.
      • What features were inputs: NDVI rings and weather data

    Non-Linear Models

    • Support Vector Regression (SVR)
      • Definition: Support Vector Regression uses Support Vector Machines (SVMs) to find a line/hyperplane that fits the yield values as closely as possible within a margin of tolerance. In contrast to OLS, SVMs can produce non-linear 'best fit' curves. They do this by using non-linear kernels to transform the feature space and finding optimal linear support vectors in that transformed space. When these linear support vectors are transformed back into the original feature space, they can become non-linear.
      • Why we used them: We suspected that our yield prediction problem was non-linear.
      • What features were inputs: NDVI rings and weather data
    • Random Forest (RF)
      • Definition: Random forest models fall under a class of ensemble models that take collections of decision trees and use the mean prediction across the trees. Decision tree models are learning algorithms which have a decision flow structure. Each node asks a question about a feature, and the resulting branches represent “decisions” made based on the value of that feature.
      • Why we used them: The initial linear models that we ran were found to have poor predictive power (initial R-squared values were in the range of 0.05 - 0.10). Given the complex nature of the dataset (e.g. time series of arrays of NDVI pixel values), it seemed logical to explore non-linear models. Random Forests are known to be robust across many classes of problems; for example, they handle outliers well, of which there were many in this dataset.
      • What features were inputs: NDVI rings and weather data
    • Gradient Boosting Decision Trees (GBDT)
      • Definition: This class of models is also an ensemble of decision trees. However, unlike Random Forests, GBDTs iteratively build new predictors from linear combinations of baseline weak learners (i.e. shallow trees that have high bias and low variance) using the gradient of the loss function as a key parameter.
      • Why we used them: Since Random Forests use ensembles of fully grown trees, and therefore tend to have lower bias and higher variance, it seemed prudent to take an alternate approach with GBDTs which are iteratively built on weak learners that tend to be on the opposite end of the bias / variance tradeoff (weak learners with high bias and low variance).
      • What features were inputs: NDVI rings and weather data
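    As referenced above, the sketch below shows how this model and metric comparison could be run with scikit-learn; the hyperparameters are illustrative defaults rather than the exact settings behind every result table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def mape(y_true, y_pred):
    """Mean absolute percent error."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

MODELS = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "SVR": SVR(kernel="rbf"),
    "RF": RandomForestRegressor(n_estimators=10, random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
}

def compare_models(X_train, y_train, X_test, y_test):
    """Fit each model class and report the four comparison metrics."""
    results = {}
    for name, model in MODELS.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "R-squared": r2_score(y_test, pred),
            "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
            "MAE": mean_absolute_error(y_test, pred),
            "MAPE": mape(y_test, pred),
        }
    return results
```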

    Results of Linear and Non-Linear Models

    All runs below use 10 rings of NDVI data, Tutiempo weather, year and image categorical variables, missing pixel values filled with 0.5, missing weather excluded, and a log-transformed yield target.

    Before harvest (full growing season):

    | Model | MSE | RMSE | MAE | MAPE | R-squared | Var Explained | Training Dim | Testing Dim |
    |---|---|---|---|---|---|---|---|---|
    | OLS | 0.0709 | 0.2664 | 0.2051 | 0.1962 | 0.1112 | 0.1113 | 81195x54 | 20299x54 |
    | Ridge | 0.0709 | 0.2664 | 0.2051 | 0.1962 | 0.1113 | 0.1113 | 81195x54 | 20299x54 |
    | SVR | 0.0640 | 0.2530 | 0.1956 | 0.1838 | 0.1984 | 0.1990 | 81195x54 | 20299x54 |
    | RF | 0.0646 | 0.2542 | 0.1944 | 0.1762 | 0.1904 | 0.1905 | 81195x54 | 20299x54 |
    | GB | 0.0653 | 0.2556 | 0.1982 | 0.1860 | 0.1814 | 0.1814 | 81195x54 | 20299x54 |

    Three months ahead of harvest:

    | Model | MSE | RMSE | MAE | MAPE | R-squared | Var Explained | Training Dim | Testing Dim |
    |---|---|---|---|---|---|---|---|---|
    | OLS | 0.0700 | 0.2646 | 0.2036 | 0.1916 | 0.1061 | 0.1061 | 59528x48 | 14883x48 |
    | Ridge | 0.0700 | 0.2646 | 0.2036 | 0.1916 | 0.1061 | 0.1061 | 59528x48 | 14883x48 |
    | SVR | 0.0625 | 0.2499 | 0.1932 | 0.1794 | 0.2026 | 0.2033 | 59528x48 | 14883x48 |
    | RF | 0.0614 | 0.2479 | 0.1891 | 0.1697 | 0.2157 | 0.2157 | 59528x48 | 14883x48 |
    | GB | 0.0636 | 0.2522 | 0.1952 | 0.1804 | 0.1883 | 0.1883 | 59528x48 | 14883x48 |

    Six months ahead of harvest:

    | Model | MSE | RMSE | MAE | MAPE | R-squared | Var Explained | Training Dim | Testing Dim |
    |---|---|---|---|---|---|---|---|---|
    | OLS | 0.0703 | 0.2652 | 0.2036 | 0.1945 | 0.1097 | 0.1099 | 40728x42 | 10182x42 |
    | Ridge | 0.0703 | 0.2652 | 0.2036 | 0.1945 | 0.1098 | 0.1101 | 40728x42 | 10182x42 |
    | SVR | 0.0635 | 0.2520 | 0.1952 | 0.1831 | 0.1965 | 0.1985 | 40728x42 | 10182x42 |
    | RF | 0.0627 | 0.2504 | 0.1913 | 0.1745 | 0.2064 | 0.2067 | 40728x42 | 10182x42 |
    | GB | 0.0638 | 0.2526 | 0.1958 | 0.1831 | 0.1926 | 0.1929 | 40728x42 | 10182x42 |

    Neural Networks

    Missing Values:

    Missing values within a 13x13 image were replaced with the average value of the existing pixels in that image. Farms with at least one entirely missing image (4,568 out of 9,162) were dropped from the analysis. This left 4,594 farms for training the models.
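    A minimal sketch of that replacement rule for one image (missing pixels assumed to be encoded as NaN):

```python
import numpy as np

def fill_missing_with_image_mean(image):
    """Replace missing pixels in a 13x13 NDVI image with the mean of the
    pixels that are present; fully missing images are dropped upstream."""
    image = np.array(image, dtype=float)  # copy so the input is untouched
    mask = np.isnan(image)
    if mask.all():
        return image
    image[mask] = np.nanmean(image)
    return image
```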

    Feature Engineering:

    The three types of neural networks we built required a different set of independent variables than the linear and nonlinear models discussed above. There were two main types of inputs to the models -- whole images and trajectories.

    • Whole Images
      • 1 image per farm: Time 20 (early February right before harvest)
      • 7 images per farm: Time 14 - 20 (early November through early February - entire soybean growing season)
    • Trajectories (Short Time Series)
      • 1 trajectory of values through time for each pixel - 169 trajectories per farm
      • Single Trajectory (7 values - early November through early February)
      • Differences with 3 Lags (7 values - early November through early February and their lagged differences)
      • Percent Differences with 3 Lags (7 values - early November through early February and their lagged percentage differences)
      • From Maximum (10 Images Before, 2 Images After): Because we do not know when a field was planted, it is difficult to know on a particular date what stage of growth plants have reached. It is a standard assumption in remote sensing that the maximum NDVI value in a trajectory represents the peak of growth and greening for plants. For soybeans, this roughly means that harvest takes place two images, or four weeks, after maximum NDVI. It is therefore customary to find the maximum value and take the ten images before and the two images after it, giving a total trajectory length of 13 values. (A minimal sketch of this extraction appears after the reference below.)
        • Averages from Maximum (similar to [7]; see table below.)
        • Differences from Maximum
        • Percentage Differences from Maximum
        • Single Trajectory from Maximum

    Averages from Maximum:

    From [7]: Huang J, Wang X, Li X, Tian H, Pan Z (2013). Remotely Sensed Rice Yield Prediction Using Multi-Temporal NDVI Data Derived from NOAA's AVHRR. PLoS ONE 8(8): e70816. doi:10.1371/journal.pone.0070816
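    A minimal sketch of extracting a "from maximum" trajectory for one pixel's NDVI time series (NaN padding is used when the peak sits close to either end of the series; the padding choice is an assumption):

```python
import numpy as np

def trajectory_from_max(ndvi_values, before=10, after=2):
    """Locate the peak NDVI value and keep `before` images before it and
    `after` images after it (13 values for 10/2, 7 values for the 4/2 variant)."""
    v = np.asarray(ndvi_values, dtype=float)
    peak = int(np.nanargmax(v))
    padded = np.concatenate([np.full(before, np.nan), v, np.full(after, np.nan)])
    return padded[peak:peak + before + after + 1]  # peak index shifts by `before` after padding
```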

    • Convolutional Neural Networks (CNN)
      • Definition: Convolutional neural networks are modeled after the visual cortex in the human brain. When humans take in a visual image, the brain breaks it up into overlapping tiles. The neurons in the visual cortex then work over these tiles to detect patterns. CNNs take in an image of pixel values and similarly break it up into overlapping sub-images to detect patterns. This convolution happens at the beginning of the algorithm, before the perceptrons in the neural network do their job. Because the convolutions will pick up patterns that are then fed into the neurons, the user can do minimal preprocessing of the data. We built 5-layer CNNs, with 3 convolutional layers (that included subsampling and pooling) and two globally connected neural layers using Python’s Theano module.
      • Why we used them: We are working with images, and this framework obviates the need for preprocessing.
      • What features were inputs: Our first set of CNNs were fed whole images. The second set were fed trajectories.
      • Code: Link to iPython Notebook
      Convolutional Neural Network Results
      | # | Model | Target Value | Features | Data | MAPE |
      |---|---|---|---|---|---|
      | 1 | 5-layer with 1 channel | yield | Trajectories - from all 21 images/time periods - [1 x 21] array of NDVI values | 13 Grids - 228,082 training examples | 45% |
      | 2 | 3-layer with 1 channel | yield | Image at Time 20 (beginning of February, right before harvest) - [1 x 169] array of NDVI values | 13 Grids - 1,180 training examples | 29% |
      | 3 | 3-layer with 21 channels | yield | All 21 Images: Time0 to Time20 (early April through early February) - [1 x 3549] array of NDVI values | 13 Grids - 1,180 training examples | 26% |
      | 4 | 3-layer with 21 channels | yield | All 21 Images: Time0 to Time20 (early April through early February) - [1 x 3549] array of NDVI values | All 64 Grids - 7,127 training examples | 39% |
      | 5 | 5-layer with 21 channels | yield | All 21 Images: Time0 to Time20 (early April through early February) - [1 x 3549] array of NDVI values | All 64 Grids - 7,127 training examples | 37% |
      | 6 | 5-layer with 1 channel | yield | All 21 Images: Time0 to Time20 (early April through early February) - [1 x 3549] array of NDVI values | All 64 Grids - 7,127 training examples | 55% |
      | 7 | 5-layer with 11 channels | yield | 11 Images: Time10 to Time20 (early September through early February) - [1 x 1859] array of NDVI values | All 64 Grids - 7,127 training examples | 37% |
      | 8 | 5-layer with 1 channel | yield | Trajectory Differences - 3 Lags - 11 Images (early September through early February) - [1 x 38] array of NDVI values | All 64 Grids - 1,204,632 training examples | 74% |
      | 9 | 5-layer with 1 channel | yield | Trajectory from Max NDVI Value - Averages - 7 images - [1 x 28] array of NDVI values | All 64 Grids - 777,213 training examples | 49% |
      | 10 | 5-layer with 1 channel | yield | Trajectory from Max NDVI Value - Differences - 3 Lags - 7 images - [1 x 22] array of NDVI values | All 64 Grids - 777,213 training examples | 46% |
      | 11 | 5-layer with 1 channel | yield | Replaced Missing Values - Trajectory from Max NDVI Value - Percentage Differences - 3 Lags - 7 images - [1 x 22] array of NDVI values | All 64 Grids - 483,327 training examples | 54% |
      Image = 13x13 grid of pixels around each farm lat/lon = 169 total pixels
      Trajectory = For one pixel, take the NDVI values across images/across time periods = 169 trajectories for each farm.
      21 Images for each farm = early April of prior year to early February of current year.
      Missing Values = 0.5
      Note: Soybean growing season is November of previous year to February of current year.
    • Recurrent Neural Networks (RNN)
      • Definition: Recurrent Neural Networks have memory. They take in a series of values collected over time (x) and predict an outcome at each time step (o). The prediction depends on the series of input values that came before it. Information from previous steps -- the memory (s) -- is encoded in a directed neural network. We built RNNs using Python’s Theano module. (A minimal numpy sketch of this recurrence appears after the results tables below.)
      • Why we used them: We wanted to take a sequence of NDVI values and make predictions of the yield all along the soybean growing cycle. We wanted to see how early we could start to make accurate predictions.
      • What features were inputs: In our first set of models, we used trajectories only. In our second set of models, we used NDVI rings and weather data.
      • Code: Link to iPython Notebook
      Recurrent Neural Network Results
      | # | Model | Target Value | Missing Values | Epochs | Features | Data | MAPE |
      |---|---|---|---|---|---|---|---|
      | 1 | tanh rectifier | log(yield) | Replaced with image average | 100 | Trajectory from Max NDVI Value - 4 before/2 after - 7 images - [1 x 7] array of NDVI values | 10,000 training examples | 26% |
      | 2 | softplus rectifier | yield | Replaced with image average | 100 | Trajectory from Max NDVI Value - 4 before/2 after - 7 images - [1 x 7] array of NDVI values | 64 Grids - 483,327 training examples | 40% |
      | 3 | softplus rectifier - 200 hidden nodes | log(yield) | Replaced with image average | 1000 | Trajectory from Max NDVI Value - 10 before/2 after - 13 images - 0 Rings - Tutiempo Weather Data | 64 Grids - 1,934 training examples | 34% |
      | 4 | softplus rectifier - 500 hidden nodes | log(yield) | Replaced with image average | 1000 | Trajectory from Max NDVI Value - 10 before/2 after - 13 images - 10 Rings - Tutiempo Weather Data | 64 Grids - 1,201 training examples | 52% |
    • Gated Recurrent Units (GRU)
      • Definition: Gated Recurrent Units are a generalization of RNNs. They also have memory, but their memory is dynamic. At each time step, gates are used to decide which new information to let in and which old information to let go. We built GRUs with two gated layers and an embedding layer using Python’s Theano module.
      • Why we used them: GRUs allow longer sequences of information to be remembered.
      • What features were inputs: We used trajectories only.
      • Code: Link to iPython Notebook
      Gated Recurrent Units Results
      | # | Model | Target Value | Missing Values | Epochs | Features | Data | MAPE |
      |---|---|---|---|---|---|---|---|
      | 1 | 2-layer | log(yield) | Replaced with image average | 100 | Trajectory from Max NDVI Value - 4 before/2 after - 7 images - [1 x 7] array of NDVI values | 10,000 random training examples | 25% |
      | 2 | 2-layer | yield | Replaced with image average | 3 | Trajectory from Max NDVI Value - 4 before/2 after - 7 images - [1 x 7] array of NDVI values | 10,000 random training examples | 41% |
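      As referenced above, here is a minimal numpy sketch of the recurrence behind the RNN (and, with gating added, the GRU), using the x/s/o notation from the definition; it is illustrative only and is not our Theano implementation.

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, b, c):
    """Forward pass of a simple RNN: the hidden state s carries memory across
    time steps, and a yield estimate o_t is produced at every step."""
    s = np.zeros(W.shape[0])
    outputs = []
    for x_t in x_seq:                      # one NDVI (or NDVI + weather) vector per time step
        s = np.tanh(U @ x_t + W @ s + b)   # update the memory from the new input and old state
        outputs.append(V @ s + c)          # prediction at this time step
    return np.array(outputs), s
```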


    Conclusions

    Our Goals

      The goals of this project were three-fold:
    1. Assess the predictive power of satellite images and weather data
    2. Compare and contrast modeling strategies
    3. Assess model accuracy as a function of time horizon

    First, in our assessments, we found that satellite images are measurably effective inputs for predicting yields. The yield models become even more robust when weather data is added. This is important because this data is easily accessible, as compared to the data used in the most cutting-edge models. The USDA is the gold standard, and their models incorporate many measurements directly from the fields, which requires boots on the ground analyzing samples and interviewing farmers about their techniques.

    Second, we found that the most effective models for predicting yields from satellite and weather data were Random Forests. In particular, our best-performing Random Forest models had ten trees and were trained using mean squared error as the objective function. We also gave the trees maximum latitude: any node with at least two samples was allowed to split, and leaves (end nodes) were allowed to hold as few as one sample.
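    In scikit-learn terms, that configuration corresponds roughly to the following sketch (the criterion keyword is spelled "mse" in older versions of the library):

```python
from sklearn.ensemble import RandomForestRegressor

best_rf = RandomForestRegressor(
    n_estimators=10,            # ten trees
    criterion="squared_error",  # mean squared error objective
    min_samples_split=2,        # any node with at least two samples may split
    min_samples_leaf=1,         # leaves may hold a single sample
    random_state=0,
)
```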

    The results of our model comparisons are listed in the tables below. The first table shows how our models performed right before harvest, when they had all the data from the growing season as inputs. Both Support Vector Regression and Random Forests were effective in this timeframe. The second table shows how our models performed three months ahead of harvest. Here, Random Forest models are the clear winner. The third table shows how our models performed six months ahead of harvest. Again, Random Forests were the top performing model.

    Before Harvest Model Results

    | Model | R-squared | RMSE | MAE | MAPE |
    |---|---|---|---|---|
    | Baseline: Mean Model | 0.00% | 0.9664 | 0.7597 | 34.87% |
    | OLS Regression | 11.12% | 0.2664 | 0.2051 | 19.62% |
    | Ridge Regression | 11.13% | 0.2664 | 0.2051 | 19.62% |
    | Support Vector Regression | 19.84% | 0.2530 | 0.1956 | 18.38% |
    | Random Forests | 19.04% | 0.2542 | 0.1944 | 17.62% |
    | Gradient Boosting Decision Trees | 18.14% | 0.2556 | 0.1982 | 18.60% |
    | Convolutional Neural Network | -4.98% | 0.9767 | 0.7533 | 24.63% |
    | Recurrent Neural Network | 17.87% | 0.8776 | 0.6736 | 22.17% |
    | Gated Recurrent Units | -1.62% | 0.9762 | 0.7596 | 25.28% |

    Late Planting Season Model Results (3 months ahead)

    | Model | R-squared | RMSE | MAE | MAPE |
    |---|---|---|---|---|
    | Baseline: Mean Model | 0.00% | 0.9664 | 0.7597 | 34.87% |
    | OLS Regression | 10.61% | 0.2646 | 0.2036 | 19.16% |
    | Ridge Regression | 10.61% | 0.2646 | 0.2036 | 19.16% |
    | Support Vector Regression | 20.26% | 0.2499 | 0.1932 | 17.94% |
    | Random Forests | 21.57% | 0.2479 | 0.1891 | 16.97% |
    | Gradient Boosting Decision Trees | 18.83% | 0.2522 | 0.1952 | 18.04% |
    | Convolutional Neural Network | NA | NA | NA | NA |
    | Recurrent Neural Network | -9.03% | 1.0459 | 0.8051 | 32.18% |
    | Gated Recurrent Units | -5.70% | 0.9956 | 0.7588 | 24.22% |

    Early Planting Season Model Results (6 months ahead)

    | Model | R-squared | RMSE | MAE | MAPE |
    |---|---|---|---|---|
    | Baseline: Mean Model | 0.00% | 0.9664 | 0.7597 | 34.87% |
    | OLS Regression | 10.97% | 0.2652 | 0.2036 | 19.45% |
    | Ridge Regression | 10.98% | 0.2652 | 0.2036 | 19.45% |
    | Support Vector Regression | 19.65% | 0.2520 | 0.1952 | 18.31% |
    | Random Forests | 20.64% | 0.2504 | 0.1913 | 17.45% |
    | Gradient Boosting Decision Trees | 19.26% | 0.2526 | 0.1958 | 18.31% |
    | Convolutional Neural Network | NA | NA | NA | NA |
    | Recurrent Neural Network | -16.35% | 1.0805 | 0.8366 | 32.09% |
    | Gated Recurrent Units | NA | NA | NA | NA |

    Finally, we found that our best predictions were made three months out from harvest. Our best model had a mean absolute percent error of 16.97%. Comparing our performance to that of the USDA’s more data-intensive models, we did not outperform them, but we are getting closer. The graph below shows the USDA’s predictions three months out from harvest from 1990 through 2013; their absolute percent error ranges from 1% to 16%. How much closer could we come to this gold standard with more research?

    Future Research

      Ours is a rich dataset and is ripe for more model building and research. Here are some suggested next steps:
    • First, to address the needs of our stakeholders, actual commodity prices should be incorporated into the models as targets and features. Unfortunately, we were not able to take our models that far due to employer compliance constraints.
    • Second, our models look at very small subsets of satellite images to make predictions for one farm. The models should be adapted to take in regional and national level images to make predictions for these larger areas.
    • Third, the Recurrent Neural Networks and Gated Recurrent Units could use more exploration. But in order to do this, the models need to run faster. The current code is by no means optimized and can use more development.
    • Fourth, we have not incorporated the weather data into the RNN and GRU models. We believe this could significantly increase the performance of these models.
    • Finally, we suggest testing alternative vegetation indices. We looked at NDVI and EVI. Others could be calculated, and models could be built using multiple indices as features. We never combined NDVI and EVI in the same model. This should be tried.

    Zeus, whose will has marked for man
    The sole way where wisdom lies;
    Ordered one eternal plan:
    Man must suffer to be wise.
    - Agamemnon, by Aeschylus



    Additional data

    Detailed Results of Linear and Non-Linear Models

    One Farm Per Row

    | # Rings | Log Yield | Weather | Year Categ | Image Categ | Model | MSE | RMSE | R-squared | Var Explained | Training Dim | Testing Dim | Missing Pixel | Missing Weather |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 0 | 0 | 0 | 0 | 0 | LR | 1.1761 | 1.0845 | 0.0366 | 0.0369 | 5964x21 | 1492x21 | 0.5 | NA |
    | 0 | 0 | 0 | 0 | 0 | Ridge | 1.1758 | 1.0843 | 0.0368 | 0.0371 | 5964x21 | 1492x21 | 0.5 | NA |
    | 0 | 0 | 0 | 0 | 0 | SVM | 1.1730 | 1.0831 | 0.0391 | 0.0412 | 5964x21 | 1492x21 | 0.5 | NA |
    | 0 | 0 | 0 | 0 | 0 | RF | 1.2621 | 1.1234 | -0.0339 | -0.0339 | 5964x21 | 1492x21 | 0.5 | NA |
    | 0 | 0 | 0 | 0 | 0 | GB | 1.1499 | 1.0723 | 0.0580 | 0.0581 | 5964x21 | 1492x21 | 0.5 | NA |
    | 0 | 0 | 0 | 1 | 0 | LR | 1.1254 | 1.0608 | 0.0781 | 0.0783 | 5964x30 | 1492x30 | 0.5 | NA |
    | 0 | 0 | 0 | 1 | 0 | Ridge | 1.1251 | 1.0607 | 0.0783 | 0.0785 | 5964x30 | 1492x30 | 0.5 | NA |
    | 0 | 0 | 0 | 1 | 0 | SVM | 1.1302 | 1.0631 | 0.0741 | 0.0768 | 5964x30 | 1492x30 | 0.5 | NA |
    | 0 | 0 | 0 | 1 | 0 | RF | 1.1671 | 1.0803 | 0.0440 | 0.0440 | 5964x30 | 1492x30 | 0.5 | NA |
    | 0 | 0 | 0 | 1 | 0 | GB | 1.0886 | 1.0434 | 0.1082 | 0.1083 | 5964x30 | 1492x30 | 0.5 | NA |
    | 1 | 0 | 0 | 1 | 0 | LR | 1.1274 | 1.0618 | 0.0764 | 0.0766 | 5964x51 | 1492x51 | 0.5 | NA |
    | 1 | 0 | 0 | 1 | 0 | Ridge | 1.1264 | 1.0613 | 0.0772 | 0.0774 | 5964x51 | 1492x51 | 0.5 | NA |
    | 1 | 0 | 0 | 1 | 0 | SVM | 1.1314 | 1.0637 | 0.0732 | 0.0762 | 5964x51 | 1492x51 | 0.5 | NA |
    | 1 | 0 | 0 | 1 | 0 | RF | 1.1939 | 1.0927 | 0.0219 | 0.0220 | 5964x51 | 1492x51 | 0.5 | NA |
    | 1 | 0 | 0 | 1 | 0 | GB | 1.0900 | 1.0440 | 0.1071 | 0.1072 | 5964x51 | 1492x51 | 0.5 | NA |
    | 2 | 0 | 0 | 1 | 0 | LR | 1.1311 | 1.0635 | 0.0734 | 0.0735 | 5964x72 | 1492x72 | 0.5 | NA |
    | 2 | 0 | 0 | 1 | 0 | Ridge | 1.1279 | 1.0620 | 0.0760 | 0.0762 | 5964x72 | 1492x72 | 0.5 | NA |
    | 2 | 0 | 0 | 1 | 0 | SVM | 1.1294 | 1.0627 | 0.0748 | 0.0775 | 5964x72 | 1492x72 | 0.5 | NA |
    | 2 | 0 | 0 | 1 | 0 | RF | 1.1699 | 1.0816 | 0.0416 | 0.0416 | 5964x72 | 1492x72 | 0.5 | NA |
    | 2 | 0 | 0 | 1 | 0 | GB | 1.0964 | 1.0471 | 0.1018 | 0.1019 | 5964x72 | 1492x72 | 0.5 | NA |
    | 3 | 0 | 0 | 1 | 0 | LR | 1.1368 | 1.0662 | 0.0688 | 0.0690 | 5964x93 | 1492x93 | 0.5 | NA |
    | 3 | 0 | 0 | 1 | 0 | Ridge | 1.1299 | 1.0630 | 0.0744 | 0.0746 | 5964x93 | 1492x93 | 0.5 | NA |
    | 3 | 0 | 0 | 1 | 0 | SVM | 1.1281 | 1.0621 | 0.0759 | 0.0784 | 5964x93 | 1492x93 | 0.5 | NA |
    | 3 | 0 | 0 | 1 | 0 | RF | 1.1290 | 1.0625 | 0.0751 | 0.0753 | 5964x93 | 1492x93 | 0.5 | NA |
    | 3 | 0 | 0 | 1 | 0 | GB | 1.0907 | 1.0444 | 0.1065 | 0.1066 | 5964x93 | 1492x93 | 0.5 | NA |
    | 4 | 0 | 0 | 1 | 0 | LR | 1.1389 | 1.0672 | 0.0670 | 0.0671 | 5964x114 | 1492x114 | 0.5 | NA |
    | 4 | 0 | 0 | 1 | 0 | Ridge | 1.1295 | 1.0628 | 0.0747 | 0.0749 | 5964x114 | 1492x114 | 0.5 | NA |
    | 4 | 0 | 0 | 1 | 0 | SVM | 1.1261 | 1.0612 | 0.0775 | 0.0799 | 5964x114 | 1492x114 | 0.5 | NA |
    | 4 | 0 | 0 | 1 | 0 | RF | 1.1641 | 1.0789 | 0.0464 | 0.0464 | 5964x114 | 1492x114 | 0.5 | NA |
    | 4 | 0 | 0 | 1 | 0 | GB | 1.0926 | 1.0453 | 0.1050 | 0.1051 | 5964x114 | 1492x114 | 0.5 | NA |
    | 5 | 0 | 0 | 1 | 0 | LR | 1.1433 | 1.0693 | 0.0634 | 0.0635 | 5964x135 | 1492x135 | 0.5 | NA |
    | 5 | 0 | 0 | 1 | 0 | Ridge | 1.1300 | 1.0630 | 0.0743 | 0.0744 | 5964x135 | 1492x135 | 0.5 | NA |
    | 5 | 0 | 0 | 1 | 0 | SVM | 1.1251 | 1.0607 | 0.0783 | 0.0808 | 5964x135 | 1492x135 | 0.5 | NA |
    | 5 | 0 | 0 | 1 | 0 | RF | 1.1488 | 1.0718 | 0.0589 | 0.0589 | 5964x135 | 1492x135 | 0.5 | NA |
    | 5 | 0 | 0 | 1 | 0 | GB | 1.0877 | 1.0429 | 0.1090 | 0.1091 | 5964x135 | 1492x135 | 0.5 | NA |
    | 6 | 0 | 0 | 1 | 0 | LR | 1.1447 | 1.0699 | 0.0622 | 0.0623 | 5964x156 | 1492x156 | 0.5 | NA |
    | 6 | 0 | 0 | 1 | 0 | Ridge | 1.1297 | 1.0629 | 0.0746 | 0.0747 | 5964x156 | 1492x156 | 0.5 | NA |
    | 6 | 0 | 0 | 1 | 0 | SVM | 1.1246 | 1.0605 | 0.0788 | 0.0812 | 5964x156 | 1492x156 | 0.5 | NA |
    | 6 | 0 | 0 | 1 | 0 | RF | 1.1674 | 1.0805 | 0.0437 | 0.0440 | 5964x156 | 1492x156 | 0.5 | NA |
    | 6 | 0 | 0 | 1 | 0 | GB | 1.0843 | 1.0413 | 0.1118 | 0.1120 | 5964x156 | 1492x156 | 0.5 | NA |
    | 7 | 0 | 0 | 1 | 0 | LR | 1.1431 | 1.0692 | 0.0636 | 0.0636 | 5964x177 | 1492x177 | 0.5 | NA |
    | 7 | 0 | 0 | 1 | 0 | Ridge | 1.1266 | 1.0614 | 0.0771 | 0.0772 | 5964x177 | 1492x177 | 0.5 | NA |
    | 7 | 0 | 0 | 1 | 0 | SVM | 1.1234 | 1.0599 | 0.0797 | 0.0821 | 5964x177 | 1492x177 | 0.5 | NA |
    | 7 | 0 | 0 | 1 | 0 | RF | 1.1454 | 1.0702 | 0.0617 | 0.0619 | 5964x177 | 1492x177 | 0.5 | NA |
    | 7 | 0 | 0 | 1 | 0 | GB | 1.0867 | 1.0424 | 0.1098 | 0.1100 | 5964x177 | 1492x177 | 0.5 | NA |
    | 8 | 0 | 0 | 1 | 0 | LR | 1.1494 | 1.0721 | 0.0584 | 0.0585 | 5964x198 | 1492x198 | 0.5 | NA |
    | 8 | 0 | 0 | 1 | 0 | Ridge | 1.1268 | 1.0615 | 0.0769 | 0.0770 | 5964x198 | 1492x198 | 0.5 | NA |
    | 8 | 0 | 0 | 1 | 0 | SVM | 1.1224 | 1.0594 | 0.0805 | 0.0828 | 5964x198 | 1492x198 | 0.5 | NA |
    | 8 | 0 | 0 | 1 | 0 | RF | 1.1449 | 1.0700 | 0.0621 | 0.0624 | 5964x198 | 1492x198 | 0.5 | NA |
    | 8 | 0 | 0 | 1 | 0 | GB | 1.0874 | 1.0428 | 0.1092 | 0.1095 | 5964x198 | 1492x198 | 0.5 | NA |
    | 9 | 0 | 0 | 1 | 0 | LR | 1.1557 | 1.0750 | 0.0532 | 0.0533 | 5964x219 | 1492x219 | 0.5 | NA |
    | 9 | 0 | 0 | 1 | 0 | Ridge | 1.1273 | 1.0617 | 0.0765 | 0.0766 | 5964x219 | 1492x219 | 0.5 | NA |
    | 9 | 0 | 0 | 1 | 0 | SVM | 1.1217 | 1.0591 | 0.0811 | 0.0834 | 5964x219 | 1492x219 | 0.5 | NA |
    | 9 | 0 | 0 | 1 | 0 | RF | 1.1266 | 1.0614 | 0.0771 | 0.0773 | 5964x219 | 1492x219 | 0.5 | NA |
    | 9 | 0 | 0 | 1 | 0 | GB | 1.0788 | 1.0387 | 0.1163 | 0.1164 | 5964x219 | 1492x219 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | LR | 1.1599 | 1.0770 | 0.0498 | 0.0498 | 5964x240 | 1492x240 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | Ridge | 1.1316 | 1.0638 | 0.0730 | 0.0731 | 5964x240 | 1492x240 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | SVM | 1.1211 | 1.0588 | 0.0816 | 0.0838 | 5964x240 | 1492x240 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | RF | 1.1261 | 1.0612 | 0.0775 | 0.0776 | 5964x240 | 1492x240 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | GB | 1.0783 | 1.0384 | 0.1167 | 0.1169 | 5964x240 | 1492x240 | 0.5 | NA |
                    
    One Image Per Row

    | # Rings | Log Yield | Weather | Year Categ | Image Categ | Model | MSE | RMSE | R-squared | Var Explained | Training Dim | Testing Dim | Missing Pixel | Missing Weather |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 10 | 0 | 0 | 0 | 0 | LR | 1.1543 | 1.0744 | 0.0010 | 0.0011 | 107905x11 | 26977x11 | 0.5 | NA |
    | 10 | 0 | 0 | 0 | 0 | Ridge | 1.1543 | 1.0744 | 0.0010 | 0.0011 | 107905x11 | 26977x11 | 0.5 | NA |
    | 10 | 0 | 0 | 0 | 0 | SVM | 1.1497 | 1.0722 | 0.0050 | 0.0060 | 107905x11 | 26977x11 | 0.5 | NA |
    | 10 | 0 | 0 | 0 | 0 | RF | 1.2521 | 1.1190 | -0.0836 | -0.0836 | 107905x11 | 26977x11 | 0.5 | NA |
    | 10 | 0 | 0 | 0 | 0 | GB | 1.1467 | 1.0708 | 0.0076 | 0.0077 | 107905x11 | 26977x11 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | LR | 1.0842 | 1.0412 | 0.0617 | 0.0618 | 107905x20 | 26977x20 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | Ridge | 1.0842 | 1.0412 | 0.0617 | 0.0618 | 107905x20 | 26977x20 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | SVM | 1.0838 | 1.0411 | 0.0620 | 0.0635 | 107905x20 | 26977x20 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | RF | 1.1480 | 1.0714 | 0.0065 | 0.0065 | 107905x20 | 26977x20 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | GB | 1.0755 | 1.0371 | 0.0692 | 0.0693 | 107905x20 | 26977x20 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 1 | LR | 1.0845 | 1.0414 | 0.0614 | 0.0615 | 107905x41 | 26977x41 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 1 | Ridge | 1.0845 | 1.0414 | 0.0615 | 0.0616 | 107905x41 | 26977x41 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 1 | SVM | 1.0812 | 1.0398 | 0.0643 | 0.0659 | 107905x41 | 26977x41 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 1 | RF | 1.1297 | 1.0629 | 0.0223 | 0.0223 | 107905x41 | 26977x41 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 1 | GB | 1.0739 | 1.0363 | 0.0706 | 0.0707 | 107905x41 | 26977x41 | 0.5 | NA |
    | 10 | 0 | NOAA | 1 | 1 | LR | 1.0455 | 1.0225 | 0.0952 | 0.0953 | 107905x41 | 26977x41 | 0.5 | fill in avg |
    | 10 | 0 | NOAA | 1 | 1 | Ridge | 1.0455 | 1.0225 | 0.0952 | 0.0953 | 107905x41 | 26977x41 | 0.5 | fill in avg |
    | 10 | 0 | NOAA | 1 | 1 | SVM | 0.9684 | 0.9841 | 0.1619 | 0.1663 | 107905x41 | 26977x41 | 0.5 | fill in avg |
    | 10 | 0 | NOAA | 1 | 1 | RF | 1.0197 | 1.0098 | 0.1175 | 0.1175 | 107905x41 | 26977x41 | 0.5 | fill in avg |
    | 10 | 0 | NOAA | 1 | 1 | GB | 0.9747 | 0.9873 | 0.1565 | 0.1565 | 107905x41 | 26977x41 | 0.5 | fill in avg |
    | 10 | 0 | tutiempo | 1 | 1 | LR | 1.0286 | 1.0142 | 0.0872 | 0.0872 | 95152x53 | 23789x53 | 0.5 | exclude |
    | 10 | 0 | tutiempo | 1 | 1 | Ridge | 1.0286 | 1.0142 | 0.0872 | 0.0872 | 95152x53 | 23789x53 | 0.5 | exclude |
    | 10 | 0 | tutiempo | 1 | 1 | SVM | 0.9507 | 0.9750 | 0.1564 | 0.1594 | 95152x53 | 23789x53 | 0.5 | exclude |
    | 10 | 0 | tutiempo | 1 | 1 | RF | 0.9411 | 0.9701 | 0.1649 | 0.1652 | 95152x53 | 23789x53 | 0.5 | exclude |
    | 10 | 0 | tutiempo | 1 | 1 | GB | 0.9622 | 0.9809 | 0.1462 | 0.1462 | 95152x53 | 23789x53 | 0.5 | exclude |
    | 10 | 0 | tutiempo | 1 | 1 | LR | 1.0479 | 1.0237 | 0.0855 | 0.0855 | 125260x53 | 31316x53 | 0.5 | fill in avg |
    | 10 | 0 | tutiempo | 1 | 1 | Ridge | 1.0478 | 1.0236 | 0.0855 | 0.0855 | 125260x53 | 31316x53 | 0.5 | fill in avg |
    | 10 | 0 | tutiempo | 1 | 1 | SVM | 0.9721 | 0.9860 | 0.1516 | 0.1541 | 125260x53 | 31316x53 | 0.5 | fill in avg |
    | 10 | 0 | tutiempo | 1 | 1 | RF | 0.9555 | 0.9775 | 0.1661 | 0.1661 | 125260x53 | 31316x53 | 0.5 | fill in avg |
    | 10 | 0 | tutiempo | 1 | 1 | GB | 0.9893 | 0.9946 | 0.1366 | 0.1366 | 125260x53 | 31316x53 | 0.5 | fill in avg |
    | 10 | 0 | tutiempo, cate_SN | 1 | 1 | LR | 1.0715 | 1.0351 | 0.0882 | 0.0883 | 81333x54 | 20334x54 | linear interpolate | exclude |
    | 10 | 0 | tutiempo, cate_SN | 1 | 1 | Ridge | 1.0715 | 1.0351 | 0.0882 | 0.0883 | 81333x54 | 20334x54 | linear interpolate | exclude |
    | 10 | 0 | tutiempo, cate_SN | 1 | 1 | SVM | 0.9988 | 0.9994 | 0.1501 | 0.1531 | 81333x54 | 20334x54 | linear interpolate | exclude |
    | 10 | 0 | tutiempo, cate_SN | 1 | 1 | RF | 1.0300 | 1.0149 | 0.1235 | 0.1238 | 81333x54 | 20334x54 | linear interpolate | exclude |
    | 10 | 0 | tutiempo, cate_SN | 1 | 1 | GB | 1.0042 | 1.0021 | 0.1455 | 0.1456 | 81333x54 | 20334x54 | linear interpolate | exclude |
                    
    Log Transformation on Yield: all measurements are in log terms (removed Yield = 0)

    | # Rings | Log Yield | Weather | Year Categ | Image Categ | Model | MSE | RMSE | R-squared | Var Explained | Training Dim | Testing Dim | Missing Pixel | Missing Weather | MAE | MAPE |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | LR | 0.0709 | 0.2664 | 0.1112 | 0.1113 | 81195x54 | 20299x54 | 0.5 | exclude | 0.2051 | 0.1962 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | Ridge | 0.0709 | 0.2664 | 0.1113 | 0.1113 | 81195x54 | 20299x54 | 0.5 | exclude | 0.2051 | 0.1962 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | SVM | 0.0640 | 0.2530 | 0.1984 | 0.1990 | 81195x54 | 20299x54 | 0.5 | exclude | 0.1956 | 0.1838 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | RF | 0.0646 | 0.2542 | 0.1904 | 0.1905 | 81195x54 | 20299x54 | 0.5 | exclude | 0.1944 | 0.1762 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | GB | 0.0653 | 0.2556 | 0.1814 | 0.1814 | 81195x54 | 20299x54 | 0.5 | exclude | 0.1982 | 0.1860 |
    | 10 | 1 | tutiempo, cate_SN | 0 | 1 | LR | 0.0775 | 0.2784 | 0.0565 | 0.0566 | 81333x45 | 20334x45 | 0.5 | exclude | 0.2102 | 0.2019 |
    | 10 | 1 | tutiempo, cate_SN | 0 | 1 | Ridge | 0.0775 | 0.2784 | 0.0566 | 0.0566 | 81333x45 | 20334x45 | 0.5 | exclude | 0.2102 | 0.2019 |
    | 10 | 1 | tutiempo, cate_SN | 0 | 1 | SVM | 0.0659 | 0.2567 | 0.1971 | 0.1989 | 81333x45 | 20334x45 | 0.5 | exclude | 0.1970 | 0.1836 |
    | 10 | 1 | tutiempo, cate_SN | 0 | 1 | RF | 0.0680 | 0.2608 | 0.1723 | 0.1723 | 81333x45 | 20334x45 | 0.5 | exclude | 0.1986 | 0.1775 |
    | 10 | 1 | tutiempo, cate_SN | 0 | 1 | GB | 0.0692 | 0.2631 | 0.1576 | 0.1576 | 81333x45 | 20334x45 | 0.5 | exclude | 0.2018 | 0.1887 |
                    
    Three Month Ahead Prediction: log transformation on Yield, removed Yield = 0

    | # Rings | Log Yield | Weather | Year Categ | Image Categ | Model | MSE | RMSE | R-squared | Var Explained | Training Dim | Testing Dim | Missing Pixel | Missing Weather | MAE | MAPE |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | LR | 0.0700 | 0.2646 | 0.1061 | 0.1061 | 59528x48 | 14883x48 | 0.5 | exclude | 0.2036 | 0.1916 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | Ridge | 0.0700 | 0.2646 | 0.1061 | 0.1061 | 59528x48 | 14883x48 | 0.5 | exclude | 0.2036 | 0.1916 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | SVM | 0.0625 | 0.2499 | 0.2026 | 0.2033 | 59528x48 | 14883x48 | 0.5 | exclude | 0.1932 | 0.1794 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | RF | 0.0614 | 0.2479 | 0.2157 | 0.2157 | 59528x48 | 14883x48 | 0.5 | exclude | 0.1891 | 0.1697 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | GB | 0.0636 | 0.2522 | 0.1883 | 0.1883 | 59528x48 | 14883x48 | 0.5 | exclude | 0.1952 | 0.1804 |

    Six Month Ahead Prediction: log transformation on Yield, removed Yield = 0

    | # Rings | Log Yield | Weather | Year Categ | Image Categ | Model | MSE | RMSE | R-squared | Var Explained | Training Dim | Testing Dim | Missing Pixel | Missing Weather | MAE | MAPE |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | LR | 0.0703 | 0.2652 | 0.1097 | 0.1099 | 40728x42 | 10182x42 | 0.5 | exclude | 0.2036 | 0.1945 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | Ridge | 0.0703 | 0.2652 | 0.1098 | 0.1101 | 40728x42 | 10182x42 | 0.5 | exclude | 0.2036 | 0.1945 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | SVM | 0.0635 | 0.2520 | 0.1965 | 0.1985 | 40728x42 | 10182x42 | 0.5 | exclude | 0.1952 | 0.1831 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | RF | 0.0627 | 0.2504 | 0.2064 | 0.2067 | 40728x42 | 10182x42 | 0.5 | exclude | 0.1913 | 0.1745 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | GB | 0.0638 | 0.2526 | 0.1926 | 0.1929 | 40728x42 | 10182x42 | 0.5 | exclude | 0.1958 | 0.1831 |