The Past

This section outlines our research into past work in this space - in other words, our literature review. Significant work has been done on using satellite imagery to predict crop yields, and several models of varying sophistication exist from the USDA and other vendors. Our approach attempts to extend this body of work by predicting soybean yields in Brazil, incorporating weather data in addition to satellite data.

This project uses vegetation indices calculated from MODIS satellite images to predict crop yields on a farm-by-farm basis [2]. Our efforts abstract away from the complicated mixture of factors that affect yields on the ground, such as soil conditions, pest pressures, weather, irrigation, fertilizer use, crop rotation, and other advanced farming methods. We assume that vegetation indices can capture the effects of all such surface inputs and that, by applying the latest machine learning techniques, we can find patterns in the data that predict yields as accurately as more data-intensive models.

Several researchers, notably William Allen, Harold Gausman, and Joseph Woolley, have established how the optical properties of plants relate to their growth and productivity. The different wavelengths of light reflected by plants and soils create unique spectral signatures that indicate water content, photosynthetic potential, chlorophyll concentration, and other characteristics correlated with plant health and developmental stage. For a good summary of this work, see [14].

Gauging the health and productivity of plants throughout a growing season using their spectral signatures from imperfect satellite images is complex. To combat the many distortions of reflected light, such as the changing ratio of soil to plant canopy and the differing angles of the sun at the time of image capture, researchers have developed vegetation indices. A vegetation index is a formula relating the various wavelengths of light to plant canopy structures, and different indices measure different qualities of cropland. For example, the Normalized Difference Vegetation Index (NDVI), a standard in agricultural remote sensing research and the index used in this project, is calculated from the red and near-infrared wavebands. It is applied to measure changes in the amount of green biomass over time, which can reveal uneven patterns of growth. NDVI has also been used to address other significant issues faced by those interested in measuring agricultural yields, including the identification and quantification of cropland [17, 18], the tracking of field abandonment and recultivation [4], and the calculation of the intensity of cropland use [5].
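For reference, NDVI is computed from the red and near-infrared (NIR) reflectance values of a pixel. A minimal sketch of the calculation (the band values here are illustrative):

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    Values fall between -1 and 1; dense, healthy vegetation tends toward
    higher values, while bare soil and water sit near or below zero.
    """
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red)

# Example: a well-vegetated pixel vs. a bare-soil pixel
print(ndvi([0.45, 0.25], [0.05, 0.20]))  # -> [0.8, 0.111...]
```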

Various vegetation indices have been used to predict crop yields. Comparisons of the performance of different indices are given in [6], [8], and [16]. Indices have also been combined with surface variables such as soil moisture, surface temperature, and rainfall to build more complex models (see [3] and [15]).

Instead of adding more data to yield-prediction models, this project applies newer machine learning techniques to pure vegetation index models. Other researchers have applied machine learning to agricultural remote sensing problems. Stepwise regression was used in [7] to predict rice production in China. Hierarchical clustering, Bayesian neural networks, and model-based recursive partitioning were used in [8] to predict barley, canola, and spring wheat yields on the Canadian Prairies. The authors in [9] trained a neural network using the shuffled complex evolution optimization algorithm developed at the University of Arizona (SCE-UA) to predict corn and soybean yields at the county level in the US corn belt. Random forests were used in [10] and [19] to classify pixels in satellite images into different categories of land use. In [11], unsupervised linear unmixing was used on hyperspectral data to discriminate between soil and vegetation within one pixel. Multilayer perceptrons and radial basis functions were used in [13] to predict the crop area and yields of corn fields in India.

This project uses time series of satellite images across one growing cycle to predict soybean yields in Brazil. For some of our models, we turned these time series of NDVI values into trajectories and used them to predict yields at one point in time, as is done in [12]. We also applied convolutional neural networks to our time series of satellite images in a way similar to [1].

If I have seen further it is by standing on the shoulders of giants.
- Sir Isaac Newton


The Present - Users and Key Stakeholders

Defining the User

There are many constituents who can benefit from yield predictions for soybeans and other crops, including farmers, policy-makers, research houses, and traders. For the purposes of this research project, the primary users were deemed to be commodities traders, who buy and sell futures on a day-to-day basis. They help make the market (and, as a result, prices), and their access to crop yield estimates has a direct impact on market dynamics.

Research Approach

In an effort to help guide our research, the team solicited feedback from a small group of commodities traders, which was collected through an online questionnaire. Because of the narrow audience, the data collection was limited to 5 respondents. However, the results provided good directional feedback that helped shape our research.

    Below is the list of questions asked, which were designed to allow open-ended responses:
  1. What types of financial instruments do you trade? If you trade commodities futures, which specific commodities?
  2. What type of data factors into your trading decision making?
  3. For commodities futures, have you ever used signals from non-traditional or relatively new data sources (i.e. something that you just started using in the past 2 years)? If so, please describe these new data sources.
  4. How important would it be to have accurate yield predictions as you assess commodities futures pricing (e.g. having an estimate for the corn yield for the harvest coming up in 6 months)?
  5. Do you have access to any sources of yield predictions today? How accurate are they?
  6. How far in advance would access to accurate yield predictions give you a competitive advantage in pricing (e.g. 1 year before harvest, 6 months before harvest, etc.)?
  7. At what resolution would you want yield predictions (nationwide estimates? county level? individual farms?)?
  8. Please add any other thoughts or comments that you think might be helpful as we work on this project.

Research Findings

    Generally, the respondents gave fairly consistent answers to most questions. The key takeaways that are most relevant to this project are listed in the bullets below:
  • Some of the commonly used data sources cited by the respondents include: “fundamentals”, weekly statistics, market flows, macro themes and interest rates. These terms are broad and in some cases could include yield estimates.
  • All respondents said that yield estimates would be important or very important to them. One respondent did indicate that yield estimates are available today, but typically closer to the harvest date.
  • A subset of respondents had in-house research groups as well as external consultants who produced yield estimates. However, there was interest in signals that might indicate when environmental changes render those estimates obsolete.
  • A longer time horizon for yield estimates is of great interest. However, there is considerable skepticism that estimates made 1 year out will be accurate. In general, it appears that the most accurate yield estimates arrive about 3 months prior to harvest.
  • Most respondents felt that nationwide estimates would be the right level of granularity. One respondent saw value in county-level estimates.

Key Takeaways

    The research helped steer the project with the following guidelines:
  • Focus on national predictions
  • Include weather data, as this was deemed to be an important feature
  • The primary value of this effort would be a more accurate prediction made in a reasonably long time window (e.g. 3 - 6 months from harvest)

Brazilian Growth Cycle

I KEEP six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.
- The Elephant's Child, Rudyard Kipling


Data Harvesting and Basic Wrangling

Farm Yield Data

The farm-level soybean yield data we used was provided by a research team from Tufts University on special request. We express our gratitude to the team and to Prof. Coco Krumme for facilitating access to the data. The data contains fields such as dates, farm coordinates in latitude and longitude, yield in metric tons per hectare, and a number of pest and disease indicators (e.g. brown_stink_bug, cucurbit_beetle, fall_armyworm, false_medideira, green_stink_bug, snail, spodoptera_spp, stink_bug, tobacco_budworm, velvetbean_caterpillar and white_fly).

After removing all null data points, the visualization of the farm locations showed that we still had a very broad distribution over Brazil, necessitating a wide geographic area of satellite images in order to provide coverage over all the farms. Additionally, we noticed that some of the locations were outside of the geographic boundaries of Brazil. Further investigation showed that these locations had miscoded lat/long coordinates - either wrong sign of latitude and longitude, a zero value for one of the coordinates, or simply transposed latitude and longitude values. We chose to remove these data points to reduce any ambiguity regarding the validity of those farm locations.
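A minimal sketch of this cleaning step, assuming a pandas DataFrame with hypothetical column names (lat, lon, yield) and an approximate bounding box for Brazil:

```python
import pandas as pd

# Approximate bounding box for Brazil (degrees); the exact cutoffs are illustrative.
LAT_MIN, LAT_MAX = -34.0, 6.0
LON_MIN, LON_MAX = -74.0, -34.0

def clean_farms(farms: pd.DataFrame) -> pd.DataFrame:
    """Drop null records and any farm whose coordinates fall outside Brazil."""
    farms = farms.dropna(subset=["lat", "lon", "yield"])
    in_brazil = (farms["lat"].between(LAT_MIN, LAT_MAX)
                 & farms["lon"].between(LON_MIN, LON_MAX))
    # Miscoded points (flipped signs, zeros, transposed lat/lon) land outside
    # the box; rather than guess the intended location, we drop them.
    return farms[in_brazil].copy()
```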

Satellite Data

Our primary data source was a subset of MODIS data, which provides vegetation indices every 16 days at 250m resolution. From the NASA website at http://modis.gsfc.nasa.gov/data/dataprod/mod13.php:

“MODIS (or Moderate Resolution Imaging Spectroradiometer) is a key instrument aboard the Terra(originally known as EOS AM-1) and Aqua (originally known as EOS PM-1) satellites. MODIS vegetation indices, produced on 16-day intervals and at multiple spatial resolutions, provide consistent spatial and temporal comparisons of vegetation canopy greenness, a composite property of leaf area, chlorophyll and canopy structure. Two vegetation indices are derived from atmospherically-corrected reflectance in the red, near-infrared, and blue wavebands; the normalized difference vegetation index (NDVI), which provides continuity with NOAA's AVHRR NDVI time series record for historical and climate applications, and the enhanced vegetation index (EVI), which minimizes canopy-soil variations and improves sensitivity over dense vegetation conditions. The two products more effectively characterize the global range of vegetation states and processes.”

Due to stale MODIS product APIs and restrictions on download size per request, we could only manually download data as small grids covering 200km by 200km sections. Considering the range of latitudes and longitudes of our farm locations, our initial calculations showed that we would have 165 different grids to download. However, further research narrowed that to only 64 grids (details below). The data collected spans April 2006 to February 2016 and contains 223 distinct satellite images at 16-day intervals. Only the NDVI and EVI vegetation index files were downloaded, at roughly 1GB each per grid, or about 128GB in total.

At the outset, we noticed that the MODIS website only allows two parameters for data requests: the center coordinates and the distance from the center. It then generates a square-like grid. The grid is not a perfect square because the ground distance covered by one degree of longitude shrinks as latitude moves away from the equator. We therefore cannot think in a flat 2-D world and simply divide the longitude range by a constant: too small a step in longitude produces extra grids, while too large a step leaves uncovered area. Instead, we used the maximum latitude and longitude as the starting point and calculated the coordinates of the next grid center by subtracting the degree changes in latitude and longitude that correspond to 200km of spherical distance. We repeated this until we had covered the minimum latitude and longitude needed to include all of our farms.
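A minimal sketch of this stepping logic, assuming a simple spherical-Earth approximation (~111.32 km per degree of latitude); the exact constants and stopping conditions we used may have differed slightly:

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate ground distance of one degree of latitude

def grid_centers(lat_max, lon_max, lat_min, lon_min, step_km=200.0):
    """Walk south/west from the maximum coordinates in ~step_km increments.

    The latitude step is constant, but the longitude step is recomputed for
    each latitude band because a degree of longitude covers less ground
    distance away from the equator.
    """
    centers = []
    step_lat = step_km / KM_PER_DEG_LAT
    lat = lat_max
    while lat > lat_min - step_lat:
        step_lon = step_km / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
        lon = lon_max
        while lon > lon_min - step_lon:
            centers.append((lat, lon))
            lon -= step_lon
        lat -= step_lat
    return centers
```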

This gave us a total of 165 grids covering all of our farm locations; however, not all of them contained a farm. We then matched each farm to its nearest grid according to the farm-to-grid-center distance. After this step, we found that 82 grids contained no farms and 19 grids contained only 1 or 2 farms. Considering that these farms do not appear in all years of yield data and have minimal effect on total yield, we decided to filter them out, leaving only 64 grids for download, each containing at least 5 farms. After manually assigning thirteen grids to each teammate for download, we made our data requests at http://daacmodis.ornl.gov/cgi-bin/MODIS/GLBVIZ_1_Glb/modis_subset_order_global_col5.pl and were notified via email once each data set was ready. The response time ranged from 40 minutes to 3 days.
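A minimal sketch of the matching and filtering step (great-circle distances via the haversine formula; the DataFrame column names are illustrative):

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def assign_farms_to_grids(farms: pd.DataFrame, centers, min_farms=5):
    """Attach each farm to its nearest grid center, then keep only the grids
    that contain at least `min_farms` farms."""
    farms = farms.copy()
    farms["grid"] = [
        int(np.argmin([haversine_km(lat, lon, clat, clon) for clat, clon in centers]))
        for lat, lon in zip(farms["lat"], farms["lon"])
    ]
    counts = farms["grid"].value_counts()
    return farms[farms["grid"].isin(counts[counts >= min_farms].index)]
```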

Weather Data

Our initial weather data source was the NOAA repository. GHCND (the Global Historical Climatology Network - Daily) maintains different summaries based on data exchanged under the World Meteorological Organization (WMO) World Weather Watch Program. Absent API access, we resorted to manual downloads of the monthly summaries from http://www.ncdc.noaa.gov/cdo-web/search.

After investigating the NOAA weather dataset for Brazil, we found that our monthly summaries were very sparse, with about 50% of the values missing. We immediately started looking for alternative sources for this missing weather data and came upon Tutiempo. Tutiempo contains historical climate information for every country in the world. In the case of Brazil, it has data from more than 100 weather stations, and most of them cover the most recent 10 years of weather that we needed. Since an API was not available, we wrote a custom web scraper. The scraper generated and walked through all of the Tutiempo Brazil dataset URLs, downloaded the necessary data, and organized it into CSV files keyed by weather station latitude and longitude. Since the Tutiempo site was fairly adept at disconnecting us, the scraper was designed to restart from the last point of failure (a sketch of this loop appears after the field list below). Once the download was complete, we had gathered the following fields:

T: Average Temperature (°C)
TM: Maximum temperature (°C)
Tm: Minimum temperature (°C)
SLP: Atmospheric pressure at sea level (hPa)
H: Average relative humidity (%)
PP: Total rainfall and / or snowmelt (mm)
VV: Average visibility (Km)
V: Average wind speed (Km/h)
VM: Maximum sustained wind speed (Km/h)
VG: Maximum speed of wind (Km/h)
RA: Indicator whether there was rain or drizzle (In the monthly average, the total days it rained)
SN: Indicator if it snowed (In the monthly average, the total days it snowed)
TS: Indicator if there was a thunderstorm (In the monthly average, the total days with thunderstorms)
FG: Indicator whether there was fog (In the monthly average, the total days with fog)
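A minimal sketch of the resumable scraping loop. The station URL pattern and the HTML parsing are site-specific and not reproduced here; parse_rows() below is a hypothetical placeholder for that parsing.

```python
import csv
import os
import time
import requests

def parse_rows(html):
    """Placeholder: turn one Tutiempo monthly-summary page into a list of dicts."""
    raise NotImplementedError("site-specific HTML table parsing goes here")

def scrape_station_page(url, out_path, pause=2.0, max_retries=5):
    """Download one station page to CSV, skipping files already saved so an
    interrupted run can resume from the last point of failure."""
    if os.path.exists(out_path):
        return  # already scraped on a previous run
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            rows = parse_rows(resp.text)
            with open(out_path, "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
                writer.writeheader()
                writer.writerows(rows)
            return
        except requests.RequestException:
            time.sleep(pause * attempt)  # back off when the site drops the connection
    raise RuntimeError("giving up on " + url)
```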

War is ninety percent information.
- Napoleon Bonaparte


The Future: Modelling

Baseline Features and Labels

The first pass of modeling was built on a baseline data set consisting of one training example for every farm/year combination. In other words, a given farm will have a measured soybean yield for a given year, say 2012; that same farm will have a different yield in 2013, and that is recorded as a separate training example. The primary features for any given training example are the NDVI values of the center pixel in the associated satellite images for that farm over a given time period. Satellite images were collected roughly twice a month, starting 10 months before the given year’s harvest. The target variable is the yield (in metric tons per hectare) for that farm in that year. A list of the features and target variable is included below:

Table 1 - Initial Model Features - Single Pixel Values
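A minimal sketch of how such baseline rows could be assembled; the container names (ndvi_series, yields), both keyed by (farm, year), are illustrative rather than taken from our actual code:

```python
import pandas as pd

def baseline_rows(ndvi_series, yields):
    """Build one training example per farm/year: 21 center-pixel NDVI values
    (ndvi_t0 .. ndvi_t20) plus the yield target in metric tons per hectare."""
    rows = []
    for (farm, year), values in ndvi_series.items():
        row = {"farm": farm, "year": year}
        row.update({"ndvi_t%d" % t: v for t, v in enumerate(values)})
        row["target_yield"] = yields[(farm, year)]
        rows.append(row)
    return pd.DataFrame(rows)
```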

After running some initial models, we decided to expand the area of pixels for which NDVI values were collected. Each pixel represents an area of 250m x 250m (about 15 acres), and without knowledge of the size of the farms, it seemed prudent to collect more than the single pixel representing the centerpoint of the farm. Several “rings” of pixels were collected, as illustrated in Figure 1 below:

Figure 1 - Illustration of “Rings” as a parameter in feature engineering

Table 2 - Model Features with n rings of pixels

As will be discussed later, expanding to 3 rings provided a material benefit to a simple OLS model. Based on this result, all further modeling efforts included at least 3 rings of NDVI data around the center pixel, and in many cases more.

Additional Data Cleaning & Training /Test Data Splits

One issue that reduced the number of training examples was farms whose centers were close to an edge of the satellite image. For these farms, the rings of pixels were often truncated, resulting in only a partial data set, so these examples were removed. There were also several N/A values; after trying various strategies to fill them in, including interpolation, we found that simply using the mean value of the satellite image worked well enough.

The 7,456 remaining examples were split randomly into 5,964 training examples and 1,492 test examples. For this model, to keep the feature space small, 6 rings of pixel data were collected, but an average of each ring was calculated, resulting in 7 pixel values (the center pixel value plus 6 ring averages) for each satellite image. Since there were 21 images per farm, the final feature space consisted of 147 pixel values (21 time periods x 7 pixel values per time period).
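A minimal sketch of the ring-averaging step on a single 13x13 NDVI window (6 rings around the center pixel); the window layout is assumed, not taken from our actual extraction code:

```python
import numpy as np

def ring_averages(window, n_rings=6):
    """Return the center pixel value followed by the mean of each concentric
    ring in a (2*n_rings + 1) x (2*n_rings + 1) block of NDVI values."""
    window = np.asarray(window, dtype=float)
    c = n_rings  # index of the center pixel
    feats = [window[c, c]]
    for r in range(1, n_rings + 1):
        block = window[c - r:c + r + 1, c - r:c + r + 1].copy()
        block[1:-1, 1:-1] = np.nan          # blank the interior, keep only the ring
        feats.append(np.nanmean(block))
    return np.array(feats)                  # 7 values per image; 21 images -> 147 features
```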

Incorporation of Weather Data

After spending some time with just the NDVI data, weather data was incorporated into the models as a new feature set. A summary of the related feature engineering is included below:

    NOAA Weather Data:
  • This data set had a lot of missing values. On average, each variable had more than 50% invalid values. Overall, only 1,755 observations out of 7,456 had complete weather data.
  • Several variables were dropped that were mostly zero.
  • Several variables that were heavily skewed were log transformed.
  • The final training data set dimensions were 5,964 x 492 and the test set was 1,492 x 492.
  • The large number of missing values and features resulted in very unstable models with negative R-squared values and large residuals.
    Restructuring the Data Set:
  • Instead of constructing the dataset with one farm per row, we reshaped it to one image per row. This gives 21 times more rows and lets us add the image sequence as a dummy variable (a sketch of this restructuring appears after this list).
  • New dimension of training set becomes 107,905 x 41, and testing set 26,977 x 41.
  • Despite the missing values, the NOAA data added some predictive power to all models. We filled in the missing values with the variable mean.
    Tutiempo Weather Data:
  • The data quality was much better: 101,667 out of 134,882 images had complete observations after dropping two mostly missing variables (VG and SLP).
  • Log transformed FG, PP, RA, TS since they are right-skewed. Converted binary variable SN to a dummy variable.
  • Final train set 81,333 x 54; final test set 20,334 x 54.
  • Saw further improvements in results compared with NOAA.
    Target variable transformation:
  • Noticed that there were several very high yield farms that were really hard to predict.
  • Log transformed the Yield variable and saw much better model performance.
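A minimal sketch of the restructuring and transformations listed above, assuming the wide table uses the illustrative column names from the earlier sketches (ndvi_t0..ndvi_t20, target_yield) plus the Tutiempo weather fields:

```python
import numpy as np
import pandas as pd

def restructure(df: pd.DataFrame) -> pd.DataFrame:
    """One farm/year per row -> one image per row, with skewed weather
    variables log-transformed and log(yield) as the target."""
    ndvi_cols = ["ndvi_t%d" % t for t in range(21)]
    id_cols = [c for c in df.columns if c not in ndvi_cols]
    long = df.melt(id_vars=id_cols, value_vars=ndvi_cols,
                   var_name="image_seq", value_name="ndvi")   # 21x more rows
    long = pd.get_dummies(long, columns=["image_seq"])        # image sequence dummies
    for col in ["FG", "PP", "RA", "TS"]:                      # right-skewed weather variables
        long[col] = np.log1p(long[col])
    long = long[long["target_yield"] > 0]                     # drop zero yields before the log
    long["log_yield"] = np.log(long["target_yield"])
    return long
```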

Modeling Strategy & Metrics of Performance

Various classes of models were examined: linear models (OLS and Ridge Regression), non-linear models (Support Vector Regression, Random Forests, and Gradient Boosting Decision Trees), and neural networks (CNNs, RNNs, and GRUs). Each is described below.

Several metrics were used to compare performance across models (a scikit-learn sketch of this model and metric comparison appears after the model descriptions below):

  • R-squared
  • Root Mean Squared Error (RMSE)
  • Mean Absolute Error (MAE)
  • Mean Absolute Percent Error (MAPE)

    Linear Models

    • OLS
      • Definition: Ordinary Least Squares regression models are the most basic form of linear regression models. An OLS model is built by finding parameters that minimize the squared distance between observed and predicted values of the dependent variable.
      • Why we used them: We used OLS as a starting point to establish a simple baseline to which we could compare all additional models.
      • What features were inputs: NDVI rings and weather data
    • Ridge Regression
      • Definition: Ridge regression is a variant of OLS regression in which a regularization term is added to penalize large parameter estimates. This approach is often used to mitigate multicollinearity.
      • Why we used them: Given the nature of the data set (NDVI values in a time series), there was a risk of multicollinearity and, as a result, a high variance in parameter estimates. Therefore, we decided to use Ridge Regression as a follow on to OLS.
      • What features were inputs: NDVI rings and weather data

    Non-Linear Models

    • Support Vector Regression (SVR)
      • Definition: Support Vector Regression uses Support Vector Machines (SVMs) to find a line/hyperplane that fits the yield values as closely as possible within a margin of tolerance. In contrast to OLS, SVMs can produce non-linear 'best fit' curves. They do this by using non-linear kernels to transform the feature space and finding optimal linear support vectors in that transformed space. When these linear support vectors are transformed back into the original feature space, they can become non-linear.
      • Why we used them: We suspected that our yield prediction problem was non-linear.
      • What features were inputs: NDVI rings and weather data
    • Random Forest (RF)
      • Definition: Random forest models fall under a class of ensemble models that take collections of decision trees and use the mean prediction across the trees. Decision tree models are learning algorithms which have a decision flow structure. Each node asks a question about a feature, and the resulting branches represent “decisions” made based on the value of that feature.
      • Why we used them: The initial linear models that we ran were found to have poor predictive power (initial R-squared values were in the range of 0.05 - 0.10). Given the complex nature of the dataset (e.g. time series of arrays of NDVI pixel values), it seemed logical to explore non-linear models. Random Forests are known to be robust across many classes of problems; for example, they handle outliers well, of which there were many in this dataset.
      • What features were inputs: NDVI rings and weather data
    • Gradient Boosting Decision Trees (GBDT)
      • Definition: This class of models is also an ensemble of decision trees. However, unlike Random Forests, GBDTs iteratively build new predictors from linear combinations of baseline weak learners (i.e. shallow trees that have high bias and low variance) using the gradient of the loss function as a key parameter.
      • Why we used them: Since Random Forests use ensembles of fully grown trees, and therefore tend to have lower bias and higher variance, it seemed prudent to take an alternate approach with GBDTs which are iteratively built on weak learners that tend to be on the opposite end of the bias / variance tradeoff (weak learners with high bias and low variance).
      • What features were inputs: NDVI rings and weather data
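    As referenced above, the sketch below shows how this model and metric comparison could be run with scikit-learn; the hyperparameters are illustrative defaults rather than the exact settings behind every result table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def mape(y_true, y_pred):
    """Mean absolute percent error."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

MODELS = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "SVR": SVR(kernel="rbf"),
    "RF": RandomForestRegressor(n_estimators=10, random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
}

def compare_models(X_train, y_train, X_test, y_test):
    """Fit each model class and report the four comparison metrics."""
    results = {}
    for name, model in MODELS.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "R-squared": r2_score(y_test, pred),
            "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
            "MAE": mean_absolute_error(y_test, pred),
            "MAPE": mape(y_test, pred),
        }
    return results
```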

    Results of Linear and Non-Linear Models

    All runs below use 10 rings of NDVI data, Tutiempo weather, year and image categorical variables, missing pixel values filled with 0.5, missing weather excluded, and a log-transformed yield target.

    Before harvest (full growing season):

    | Model | MSE | RMSE | MAE | MAPE | R-squared | Var Explained | Training Dim | Testing Dim |
    |---|---|---|---|---|---|---|---|---|
    | OLS | 0.0709 | 0.2664 | 0.2051 | 0.1962 | 0.1112 | 0.1113 | 81195x54 | 20299x54 |
    | Ridge | 0.0709 | 0.2664 | 0.2051 | 0.1962 | 0.1113 | 0.1113 | 81195x54 | 20299x54 |
    | SVR | 0.0640 | 0.2530 | 0.1956 | 0.1838 | 0.1984 | 0.1990 | 81195x54 | 20299x54 |
    | RF | 0.0646 | 0.2542 | 0.1944 | 0.1762 | 0.1904 | 0.1905 | 81195x54 | 20299x54 |
    | GB | 0.0653 | 0.2556 | 0.1982 | 0.1860 | 0.1814 | 0.1814 | 81195x54 | 20299x54 |

    Three months ahead of harvest:

    | Model | MSE | RMSE | MAE | MAPE | R-squared | Var Explained | Training Dim | Testing Dim |
    |---|---|---|---|---|---|---|---|---|
    | OLS | 0.0700 | 0.2646 | 0.2036 | 0.1916 | 0.1061 | 0.1061 | 59528x48 | 14883x48 |
    | Ridge | 0.0700 | 0.2646 | 0.2036 | 0.1916 | 0.1061 | 0.1061 | 59528x48 | 14883x48 |
    | SVR | 0.0625 | 0.2499 | 0.1932 | 0.1794 | 0.2026 | 0.2033 | 59528x48 | 14883x48 |
    | RF | 0.0614 | 0.2479 | 0.1891 | 0.1697 | 0.2157 | 0.2157 | 59528x48 | 14883x48 |
    | GB | 0.0636 | 0.2522 | 0.1952 | 0.1804 | 0.1883 | 0.1883 | 59528x48 | 14883x48 |

    Six months ahead of harvest:

    | Model | MSE | RMSE | MAE | MAPE | R-squared | Var Explained | Training Dim | Testing Dim |
    |---|---|---|---|---|---|---|---|---|
    | OLS | 0.0703 | 0.2652 | 0.2036 | 0.1945 | 0.1097 | 0.1099 | 40728x42 | 10182x42 |
    | Ridge | 0.0703 | 0.2652 | 0.2036 | 0.1945 | 0.1098 | 0.1101 | 40728x42 | 10182x42 |
    | SVR | 0.0635 | 0.2520 | 0.1952 | 0.1831 | 0.1965 | 0.1985 | 40728x42 | 10182x42 |
    | RF | 0.0627 | 0.2504 | 0.1913 | 0.1745 | 0.2064 | 0.2067 | 40728x42 | 10182x42 |
    | GB | 0.0638 | 0.2526 | 0.1958 | 0.1831 | 0.1926 | 0.1929 | 40728x42 | 10182x42 |

    Neural Networks

    Missing Values:

    Missing values within a 13x13 image were replaced with the average value of the existing pixels in that image. Farms with at least one entirely missing image (4,568 out of 9,162) were dropped from the analysis. This left 4,594 farms for training the models.
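    A minimal sketch of that replacement rule for one image (missing pixels assumed to be encoded as NaN):

```python
import numpy as np

def fill_missing_with_image_mean(image):
    """Replace missing pixels in a 13x13 NDVI image with the mean of the
    pixels that are present; fully missing images are dropped upstream."""
    image = np.array(image, dtype=float)  # copy so the input is untouched
    mask = np.isnan(image)
    if mask.all():
        return image
    image[mask] = np.nanmean(image)
    return image
```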

    Feature Engineering:

    The three types of neural networks we built required a different set of independent variables than the linear and nonlinear models discussed above. There were two main types of inputs to the models -- whole images and trajectories.

    • Whole Images
      • 1 image per farm: Time 20 (early February right before harvest)
      • 7 images per farm: Time 14 - 20 (early November through early February - entire soybean growing season)
    • Trajectories (Short Time Series)
      • 1 trajectory of values through time for each pixel - 169 trajectories per farm
      • Single Trajectory (7 values - early November through early February)
      • Differences with 3 Lags (7 values - early November through early February and their lagged differences)
      • Percent Differences with 3 Lags (7 values - early November through early February and their lagged percentage differences)
      • From Maximum (10 Images Before, 2 Images After): Because we do not know when a field was planted, it is difficult to know on a particular date what stage of growth plants have reached. It is a standard assumption in remote sensing that the maximum NDVI value in a trajectory represents the peak of growth and greening for plants. For soybeans, this roughly means that harvest takes place two images, or four weeks, after maximum NDVI. It is therefore customary to find the maximum value and take the ten images before and the two images after it, giving a total trajectory length of 13 values. (A minimal sketch of this extraction appears after the reference below.)
        • Averages from Maximum (similar to [7]; see table below.)
        • Differences from Maximum
        • Percentage Differences from Maximum
        • Single Trajectory from Maximum

    Averages from Maximum:

    From [7]: Huang J, Wang X, Li X, Tian H, Pan Z (2013). Remotely Sensed Rice Yield Prediction Using Multi-Temporal NDVI Data Derived from NOAA's AVHRR. PLoS ONE 8(8): e70816. doi:10.1371/journal.pone.0070816
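    A minimal sketch of extracting a "from maximum" trajectory for one pixel's NDVI time series (NaN padding is used when the peak sits close to either end of the series; the padding choice is an assumption):

```python
import numpy as np

def trajectory_from_max(ndvi_values, before=10, after=2):
    """Locate the peak NDVI value and keep `before` images before it and
    `after` images after it (13 values for 10/2, 7 values for the 4/2 variant)."""
    v = np.asarray(ndvi_values, dtype=float)
    peak = int(np.nanargmax(v))
    padded = np.concatenate([np.full(before, np.nan), v, np.full(after, np.nan)])
    return padded[peak:peak + before + after + 1]  # peak index shifts by `before` after padding
```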

    • Convolutional Neural Networks (CNN)
      • Definition: Convolutional neural networks are modeled after the visual cortex in the human brain. When humans take in a visual image, the brain breaks it up into overlapping tiles. The neurons in the visual cortex then work over these tiles to detect patterns. CNNs take in an image of pixel values and similarly break it up into overlapping sub-images to detect patterns. This convolution happens at the beginning of the algorithm, before the perceptrons in the neural network do their job. Because the convolutions will pick up patterns that are then fed into the neurons, the user can do minimal preprocessing of the data. We built 5-layer CNNs, with 3 convolutional layers (that included subsampling and pooling) and two globally connected neural layers using Python’s Theano module.
      • Why we used them: We are working with images, and this framework obviates the need for preprocessing.
      • What features were inputs: Our first set of CNNs were fed whole images. The second set were fed trajectories.
      • Code: Link to iPython Notebook
      Convolutional Neural Network Results
      | # | Model | Target Value | Features | Data | MAPE |
      |---|---|---|---|---|---|
      | 1 | 5-layer with 1 channel | yield | Trajectories - from all 21 images/time periods - [1 x 21] array of NDVI values | 13 Grids - 228,082 training examples | 45% |
      | 2 | 3-layer with 1 channel | yield | Image at Time 20 (beginning of February, right before harvest) - [1 x 169] array of NDVI values | 13 Grids - 1,180 training examples | 29% |
      | 3 | 3-layer with 21 channels | yield | All 21 Images: Time0 to Time20 (early April through early February) - [1 x 3549] array of NDVI values | 13 Grids - 1,180 training examples | 26% |
      | 4 | 3-layer with 21 channels | yield | All 21 Images: Time0 to Time20 (early April through early February) - [1 x 3549] array of NDVI values | All 64 Grids - 7,127 training examples | 39% |
      | 5 | 5-layer with 21 channels | yield | All 21 Images: Time0 to Time20 (early April through early February) - [1 x 3549] array of NDVI values | All 64 Grids - 7,127 training examples | 37% |
      | 6 | 5-layer with 1 channel | yield | All 21 Images: Time0 to Time20 (early April through early February) - [1 x 3549] array of NDVI values | All 64 Grids - 7,127 training examples | 55% |
      | 7 | 5-layer with 11 channels | yield | 11 Images: Time10 to Time20 (early September through early February) - [1 x 1859] array of NDVI values | All 64 Grids - 7,127 training examples | 37% |
      | 8 | 5-layer with 1 channel | yield | Trajectory Differences - 3 Lags - 11 Images (early September through early February) - [1 x 38] array of NDVI values | All 64 Grids - 1,204,632 training examples | 74% |
      | 9 | 5-layer with 1 channel | yield | Trajectory from Max NDVI Value - Averages - 7 images - [1 x 28] array of NDVI values | All 64 Grids - 777,213 training examples | 49% |
      | 10 | 5-layer with 1 channel | yield | Trajectory from Max NDVI Value - Differences - 3 Lags - 7 images - [1 x 22] array of NDVI values | All 64 Grids - 777,213 training examples | 46% |
      | 11 | 5-layer with 1 channel | yield | Replaced Missing Values - Trajectory from Max NDVI Value - Percentage Differences - 3 Lags - 7 images - [1 x 22] array of NDVI values | All 64 Grids - 483,327 training examples | 54% |
      Image = 13x13 grid of pixels around each farm lat/lon = 169 total pixels
      Trajectory = For one pixel, take the NDVI values across images/across time periods = 169 trajectories for each farm.
      21 Images for each farm = early April of prior year to early February of current year.
      Missing Values = 0.5
      Note: Soybean growing season is November of previous year to February of current year.
    • Recurrent Neural Networks (RNN)
      • Definition: Recurrent Neural Networks have memory. They take in a series of values collected over time (x) and predict an outcome at each time step (o). The prediction depends on the series of input values that came before it. Information from previous steps -- the memory (s) -- is encoded in a directed neural network. We built RNNs using Python’s Theano module. (A minimal numpy sketch of this recurrence appears after the results tables below.)
      • Why we used them: We wanted to take a sequence of NDVI values and make predictions of the yield all along the soybean growing cycle. We wanted to see how early we could start to make accurate predictions.
      • What features were inputs: In our first set of models, we used trajectories only. In our second set of models, we used NDVI rings and weather data.
      • Code: Link to iPython Notebook
      Recurrent Neural Network Results
      | # | Model | Target Value | Missing Values | Epochs | Features | Data | MAPE |
      |---|---|---|---|---|---|---|---|
      | 1 | tanh rectifier | log(yield) | Replaced with image average | 100 | Trajectory from Max NDVI Value - 4 before/2 after - 7 images - [1 x 7] array of NDVI values | 10,000 training examples | 26% |
      | 2 | softplus rectifier | yield | Replaced with image average | 100 | Trajectory from Max NDVI Value - 4 before/2 after - 7 images - [1 x 7] array of NDVI values | 64 Grids - 483,327 training examples | 40% |
      | 3 | softplus rectifier - 200 hidden nodes | log(yield) | Replaced with image average | 1000 | Trajectory from Max NDVI Value - 10 before/2 after - 13 images - 0 Rings - Tutiempo Weather Data | 64 Grids - 1,934 training examples | 34% |
      | 4 | softplus rectifier - 500 hidden nodes | log(yield) | Replaced with image average | 1000 | Trajectory from Max NDVI Value - 10 before/2 after - 13 images - 10 Rings - Tutiempo Weather Data | 64 Grids - 1,201 training examples | 52% |
    • Gated Recurrent Units (GRU)
      • Definition: Gated Recurrent Units are a generalization of RNNs. They also have memory, but their memory is dynamic. At each time step, gates are used to decide which new information to let in and which old information to let go. We built GRUs with two gated layers and an embedding layer using Python’s Theano module.
      • Why we used them: GRUs allow longer sequences of information to be remembered.
      • What features were inputs: We used trajectories only.
      • Code: Link to iPython Notebook
      Gated Recurrent Units Results
      | # | Model | Target Value | Missing Values | Epochs | Features | Data | MAPE |
      |---|---|---|---|---|---|---|---|
      | 1 | 2-layer | log(yield) | Replaced with image average | 100 | Trajectory from Max NDVI Value - 4 before/2 after - 7 images - [1 x 7] array of NDVI values | 10,000 random training examples | 25% |
      | 2 | 2-layer | yield | Replaced with image average | 3 | Trajectory from Max NDVI Value - 4 before/2 after - 7 images - [1 x 7] array of NDVI values | 10,000 random training examples | 41% |
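      As referenced above, here is a minimal numpy sketch of the recurrence behind the RNN (and, with gating added, the GRU), using the x/s/o notation from the definition; it is illustrative only and is not our Theano implementation.

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, b, c):
    """Forward pass of a simple RNN: the hidden state s carries memory across
    time steps, and a yield estimate o_t is produced at every step."""
    s = np.zeros(W.shape[0])
    outputs = []
    for x_t in x_seq:                      # one NDVI (or NDVI + weather) vector per time step
        s = np.tanh(U @ x_t + W @ s + b)   # update the memory from the new input and old state
        outputs.append(V @ s + c)          # prediction at this time step
    return np.array(outputs), s
```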


    Conclusions

    Our Goals

      The goals of this project were three-fold:
    1. Assess the predictive power of satellite images and weather data
    2. Compare and contrast modeling strategies
    3. Assess model accuracy as a function of time horizon

    First, in our assessments, we found that satellite images are measurably effective inputs for predicting yields. The yield models become even more robust when weather data is added. This is important because this data is easily accessible, as compared to the data used in the most cutting-edge models. The USDA is the gold standard, and their models incorporate many measurements directly from the fields, which requires boots on the ground analyzing samples and interviewing farmers about their techniques.

    Second, we found that the most effective models for predicting yields from satellite and weather data were Random Forests. In particular, our best-performing Random Forest models had ten trees and were trained using mean squared error as the objective function. We also gave the trees maximum latitude: any node with at least two samples was allowed to split, and leaves (end nodes) were allowed to hold as few as one sample.
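    In scikit-learn terms, that configuration corresponds roughly to the following sketch (the criterion keyword is spelled "mse" in older versions of the library):

```python
from sklearn.ensemble import RandomForestRegressor

best_rf = RandomForestRegressor(
    n_estimators=10,            # ten trees
    criterion="squared_error",  # mean squared error objective
    min_samples_split=2,        # any node with at least two samples may split
    min_samples_leaf=1,         # leaves may hold a single sample
    random_state=0,
)
```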

    The results of our model comparisons are listed in the tables below. The first table shows how our models performed right before harvest, when they had all the data from the growing season as inputs. Both Support Vector Regression and Random Forests were effective in this timeframe. The second table shows how our models performed three months ahead of harvest. Here, Random Forest models are the clear winner. The third table shows how our models performed six months ahead of harvest. Again, Random Forests were the top performing model.

    Before Harvest Model Results

    | Model | R-squared | RMSE | MAE | MAPE |
    |---|---|---|---|---|
    | Baseline: Mean Model | 0.00% | 0.9664 | 0.7597 | 34.87% |
    | OLS Regression | 11.12% | 0.2664 | 0.2051 | 19.62% |
    | Ridge Regression | 11.13% | 0.2664 | 0.2051 | 19.62% |
    | Support Vector Regression | 19.84% | 0.2530 | 0.1956 | 18.38% |
    | Random Forests | 19.04% | 0.2542 | 0.1944 | 17.62% |
    | Gradient Boosting Decision Trees | 18.14% | 0.2556 | 0.1982 | 18.60% |
    | Convolutional Neural Network | -4.98% | 0.9767 | 0.7533 | 24.63% |
    | Recurrent Neural Network | 17.87% | 0.8776 | 0.6736 | 22.17% |
    | Gated Recurrent Units | -1.62% | 0.9762 | 0.7596 | 25.28% |

    Late Planting Season Model Results (3 months ahead)

    | Model | R-squared | RMSE | MAE | MAPE |
    |---|---|---|---|---|
    | Baseline: Mean Model | 0.00% | 0.9664 | 0.7597 | 34.87% |
    | OLS Regression | 10.61% | 0.2646 | 0.2036 | 19.16% |
    | Ridge Regression | 10.61% | 0.2646 | 0.2036 | 19.16% |
    | Support Vector Regression | 20.26% | 0.2499 | 0.1932 | 17.94% |
    | Random Forests | 21.57% | 0.2479 | 0.1891 | 16.97% |
    | Gradient Boosting Decision Trees | 18.83% | 0.2522 | 0.1952 | 18.04% |
    | Convolutional Neural Network | NA | NA | NA | NA |
    | Recurrent Neural Network | -9.03% | 1.0459 | 0.8051 | 32.18% |
    | Gated Recurrent Units | -5.70% | 0.9956 | 0.7588 | 24.22% |

    Early Planting Season Model Results (6 months ahead)

    | Model | R-squared | RMSE | MAE | MAPE |
    |---|---|---|---|---|
    | Baseline: Mean Model | 0.00% | 0.9664 | 0.7597 | 34.87% |
    | OLS Regression | 10.97% | 0.2652 | 0.2036 | 19.45% |
    | Ridge Regression | 10.98% | 0.2652 | 0.2036 | 19.45% |
    | Support Vector Regression | 19.65% | 0.2520 | 0.1952 | 18.31% |
    | Random Forests | 20.64% | 0.2504 | 0.1913 | 17.45% |
    | Gradient Boosting Decision Trees | 19.26% | 0.2526 | 0.1958 | 18.31% |
    | Convolutional Neural Network | NA | NA | NA | NA |
    | Recurrent Neural Network | -16.35% | 1.0805 | 0.8366 | 32.09% |
    | Gated Recurrent Units | NA | NA | NA | NA |

    Finally, we found that our best predictions were made three months out from harvest. Our best model had a mean absolute percent error of 16.97%. Comparing our performance to that of the USDA’s more data-intensive models, we did not outperform them, but we are getting closer. The graph below shows the USDA’s predictions three months out from harvest from 1990 through 2013; their absolute percent error ranges from 1% to 16%. How much closer could we come to this gold standard with more research?

    Future Research

      Ours is a rich dataset and is ripe for more model building and research. Here are some suggested next steps:
    • First, to address the needs of our stakeholders, actual commodity prices should be incorporated into the models as targets and features. Unfortunately, we were not able to take our models that far due to employer compliance constraints.
    • Second, our models look at very small subsets of satellite images to make predictions for one farm. The models should be adapted to take in regional and national level images to make predictions for these larger areas.
    • Third, the Recurrent Neural Networks and Gated Recurrent Units could use more exploration. But in order to do this, the models need to run faster. The current code is by no means optimized and can use more development.
    • Fourth, we have not incorporated the weather data into the RNN and GRU models. We believe this could significantly increase the performance of these models.
    • Finally, we suggest testing alternative vegetation indices. We looked at NDVI and EVI. Others could be calculated, and models could be built using multiple indices as features. We never combined NDVI and EVI in the same model. This should be tried.

    Zeus, whose will has marked for man
    The sole way where wisdom lies;
    Ordered one eternal plan:
    Man must suffer to be wise.
    - Agamemnon, by Aeschylus



    Additional data

    Detailed Results of Linear and Non-Linear Models

    One Farm Per Row

    | # Rings | Log Yield | Weather | Year Categ | Image Categ | Model | MSE | RMSE | R-squared | Var Explained | Training Dim | Testing Dim | Missing Pixel | Missing Weather |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 0 | 0 | 0 | 0 | 0 | LR | 1.1761 | 1.0845 | 0.0366 | 0.0369 | 5964x21 | 1492x21 | 0.5 | NA |
    | 0 | 0 | 0 | 0 | 0 | Ridge | 1.1758 | 1.0843 | 0.0368 | 0.0371 | 5964x21 | 1492x21 | 0.5 | NA |
    | 0 | 0 | 0 | 0 | 0 | SVM | 1.1730 | 1.0831 | 0.0391 | 0.0412 | 5964x21 | 1492x21 | 0.5 | NA |
    | 0 | 0 | 0 | 0 | 0 | RF | 1.2621 | 1.1234 | -0.0339 | -0.0339 | 5964x21 | 1492x21 | 0.5 | NA |
    | 0 | 0 | 0 | 0 | 0 | GB | 1.1499 | 1.0723 | 0.0580 | 0.0581 | 5964x21 | 1492x21 | 0.5 | NA |
    | 0 | 0 | 0 | 1 | 0 | LR | 1.1254 | 1.0608 | 0.0781 | 0.0783 | 5964x30 | 1492x30 | 0.5 | NA |
    | 0 | 0 | 0 | 1 | 0 | Ridge | 1.1251 | 1.0607 | 0.0783 | 0.0785 | 5964x30 | 1492x30 | 0.5 | NA |
    | 0 | 0 | 0 | 1 | 0 | SVM | 1.1302 | 1.0631 | 0.0741 | 0.0768 | 5964x30 | 1492x30 | 0.5 | NA |
    | 0 | 0 | 0 | 1 | 0 | RF | 1.1671 | 1.0803 | 0.0440 | 0.0440 | 5964x30 | 1492x30 | 0.5 | NA |
    | 0 | 0 | 0 | 1 | 0 | GB | 1.0886 | 1.0434 | 0.1082 | 0.1083 | 5964x30 | 1492x30 | 0.5 | NA |
    | 1 | 0 | 0 | 1 | 0 | LR | 1.1274 | 1.0618 | 0.0764 | 0.0766 | 5964x51 | 1492x51 | 0.5 | NA |
    | 1 | 0 | 0 | 1 | 0 | Ridge | 1.1264 | 1.0613 | 0.0772 | 0.0774 | 5964x51 | 1492x51 | 0.5 | NA |
    | 1 | 0 | 0 | 1 | 0 | SVM | 1.1314 | 1.0637 | 0.0732 | 0.0762 | 5964x51 | 1492x51 | 0.5 | NA |
    | 1 | 0 | 0 | 1 | 0 | RF | 1.1939 | 1.0927 | 0.0219 | 0.0220 | 5964x51 | 1492x51 | 0.5 | NA |
    | 1 | 0 | 0 | 1 | 0 | GB | 1.0900 | 1.0440 | 0.1071 | 0.1072 | 5964x51 | 1492x51 | 0.5 | NA |
    | 2 | 0 | 0 | 1 | 0 | LR | 1.1311 | 1.0635 | 0.0734 | 0.0735 | 5964x72 | 1492x72 | 0.5 | NA |
    | 2 | 0 | 0 | 1 | 0 | Ridge | 1.1279 | 1.0620 | 0.0760 | 0.0762 | 5964x72 | 1492x72 | 0.5 | NA |
    | 2 | 0 | 0 | 1 | 0 | SVM | 1.1294 | 1.0627 | 0.0748 | 0.0775 | 5964x72 | 1492x72 | 0.5 | NA |
    | 2 | 0 | 0 | 1 | 0 | RF | 1.1699 | 1.0816 | 0.0416 | 0.0416 | 5964x72 | 1492x72 | 0.5 | NA |
    | 2 | 0 | 0 | 1 | 0 | GB | 1.0964 | 1.0471 | 0.1018 | 0.1019 | 5964x72 | 1492x72 | 0.5 | NA |
    | 3 | 0 | 0 | 1 | 0 | LR | 1.1368 | 1.0662 | 0.0688 | 0.0690 | 5964x93 | 1492x93 | 0.5 | NA |
    | 3 | 0 | 0 | 1 | 0 | Ridge | 1.1299 | 1.0630 | 0.0744 | 0.0746 | 5964x93 | 1492x93 | 0.5 | NA |
    | 3 | 0 | 0 | 1 | 0 | SVM | 1.1281 | 1.0621 | 0.0759 | 0.0784 | 5964x93 | 1492x93 | 0.5 | NA |
    | 3 | 0 | 0 | 1 | 0 | RF | 1.1290 | 1.0625 | 0.0751 | 0.0753 | 5964x93 | 1492x93 | 0.5 | NA |
    | 3 | 0 | 0 | 1 | 0 | GB | 1.0907 | 1.0444 | 0.1065 | 0.1066 | 5964x93 | 1492x93 | 0.5 | NA |
    | 4 | 0 | 0 | 1 | 0 | LR | 1.1389 | 1.0672 | 0.0670 | 0.0671 | 5964x114 | 1492x114 | 0.5 | NA |
    | 4 | 0 | 0 | 1 | 0 | Ridge | 1.1295 | 1.0628 | 0.0747 | 0.0749 | 5964x114 | 1492x114 | 0.5 | NA |
    | 4 | 0 | 0 | 1 | 0 | SVM | 1.1261 | 1.0612 | 0.0775 | 0.0799 | 5964x114 | 1492x114 | 0.5 | NA |
    | 4 | 0 | 0 | 1 | 0 | RF | 1.1641 | 1.0789 | 0.0464 | 0.0464 | 5964x114 | 1492x114 | 0.5 | NA |
    | 4 | 0 | 0 | 1 | 0 | GB | 1.0926 | 1.0453 | 0.1050 | 0.1051 | 5964x114 | 1492x114 | 0.5 | NA |
    | 5 | 0 | 0 | 1 | 0 | LR | 1.1433 | 1.0693 | 0.0634 | 0.0635 | 5964x135 | 1492x135 | 0.5 | NA |
    | 5 | 0 | 0 | 1 | 0 | Ridge | 1.1300 | 1.0630 | 0.0743 | 0.0744 | 5964x135 | 1492x135 | 0.5 | NA |
    | 5 | 0 | 0 | 1 | 0 | SVM | 1.1251 | 1.0607 | 0.0783 | 0.0808 | 5964x135 | 1492x135 | 0.5 | NA |
    | 5 | 0 | 0 | 1 | 0 | RF | 1.1488 | 1.0718 | 0.0589 | 0.0589 | 5964x135 | 1492x135 | 0.5 | NA |
    | 5 | 0 | 0 | 1 | 0 | GB | 1.0877 | 1.0429 | 0.1090 | 0.1091 | 5964x135 | 1492x135 | 0.5 | NA |
    | 6 | 0 | 0 | 1 | 0 | LR | 1.1447 | 1.0699 | 0.0622 | 0.0623 | 5964x156 | 1492x156 | 0.5 | NA |
    | 6 | 0 | 0 | 1 | 0 | Ridge | 1.1297 | 1.0629 | 0.0746 | 0.0747 | 5964x156 | 1492x156 | 0.5 | NA |
    | 6 | 0 | 0 | 1 | 0 | SVM | 1.1246 | 1.0605 | 0.0788 | 0.0812 | 5964x156 | 1492x156 | 0.5 | NA |
    | 6 | 0 | 0 | 1 | 0 | RF | 1.1674 | 1.0805 | 0.0437 | 0.0440 | 5964x156 | 1492x156 | 0.5 | NA |
    | 6 | 0 | 0 | 1 | 0 | GB | 1.0843 | 1.0413 | 0.1118 | 0.1120 | 5964x156 | 1492x156 | 0.5 | NA |
    | 7 | 0 | 0 | 1 | 0 | LR | 1.1431 | 1.0692 | 0.0636 | 0.0636 | 5964x177 | 1492x177 | 0.5 | NA |
    | 7 | 0 | 0 | 1 | 0 | Ridge | 1.1266 | 1.0614 | 0.0771 | 0.0772 | 5964x177 | 1492x177 | 0.5 | NA |
    | 7 | 0 | 0 | 1 | 0 | SVM | 1.1234 | 1.0599 | 0.0797 | 0.0821 | 5964x177 | 1492x177 | 0.5 | NA |
    | 7 | 0 | 0 | 1 | 0 | RF | 1.1454 | 1.0702 | 0.0617 | 0.0619 | 5964x177 | 1492x177 | 0.5 | NA |
    | 7 | 0 | 0 | 1 | 0 | GB | 1.0867 | 1.0424 | 0.1098 | 0.1100 | 5964x177 | 1492x177 | 0.5 | NA |
    | 8 | 0 | 0 | 1 | 0 | LR | 1.1494 | 1.0721 | 0.0584 | 0.0585 | 5964x198 | 1492x198 | 0.5 | NA |
    | 8 | 0 | 0 | 1 | 0 | Ridge | 1.1268 | 1.0615 | 0.0769 | 0.0770 | 5964x198 | 1492x198 | 0.5 | NA |
    | 8 | 0 | 0 | 1 | 0 | SVM | 1.1224 | 1.0594 | 0.0805 | 0.0828 | 5964x198 | 1492x198 | 0.5 | NA |
    | 8 | 0 | 0 | 1 | 0 | RF | 1.1449 | 1.0700 | 0.0621 | 0.0624 | 5964x198 | 1492x198 | 0.5 | NA |
    | 8 | 0 | 0 | 1 | 0 | GB | 1.0874 | 1.0428 | 0.1092 | 0.1095 | 5964x198 | 1492x198 | 0.5 | NA |
    | 9 | 0 | 0 | 1 | 0 | LR | 1.1557 | 1.0750 | 0.0532 | 0.0533 | 5964x219 | 1492x219 | 0.5 | NA |
    | 9 | 0 | 0 | 1 | 0 | Ridge | 1.1273 | 1.0617 | 0.0765 | 0.0766 | 5964x219 | 1492x219 | 0.5 | NA |
    | 9 | 0 | 0 | 1 | 0 | SVM | 1.1217 | 1.0591 | 0.0811 | 0.0834 | 5964x219 | 1492x219 | 0.5 | NA |
    | 9 | 0 | 0 | 1 | 0 | RF | 1.1266 | 1.0614 | 0.0771 | 0.0773 | 5964x219 | 1492x219 | 0.5 | NA |
    | 9 | 0 | 0 | 1 | 0 | GB | 1.0788 | 1.0387 | 0.1163 | 0.1164 | 5964x219 | 1492x219 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | LR | 1.1599 | 1.0770 | 0.0498 | 0.0498 | 5964x240 | 1492x240 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | Ridge | 1.1316 | 1.0638 | 0.0730 | 0.0731 | 5964x240 | 1492x240 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | SVM | 1.1211 | 1.0588 | 0.0816 | 0.0838 | 5964x240 | 1492x240 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | RF | 1.1261 | 1.0612 | 0.0775 | 0.0776 | 5964x240 | 1492x240 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | GB | 1.0783 | 1.0384 | 0.1167 | 0.1169 | 5964x240 | 1492x240 | 0.5 | NA |
                    
    One Image Per Row

    | # Rings | Log Yield | Weather | Year Categ | Image Categ | Model | MSE | RMSE | R-squared | Var Explained | Training Dim | Testing Dim | Missing Pixel | Missing Weather |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 10 | 0 | 0 | 0 | 0 | LR | 1.1543 | 1.0744 | 0.0010 | 0.0011 | 107905x11 | 26977x11 | 0.5 | NA |
    | 10 | 0 | 0 | 0 | 0 | Ridge | 1.1543 | 1.0744 | 0.0010 | 0.0011 | 107905x11 | 26977x11 | 0.5 | NA |
    | 10 | 0 | 0 | 0 | 0 | SVM | 1.1497 | 1.0722 | 0.0050 | 0.0060 | 107905x11 | 26977x11 | 0.5 | NA |
    | 10 | 0 | 0 | 0 | 0 | RF | 1.2521 | 1.1190 | -0.0836 | -0.0836 | 107905x11 | 26977x11 | 0.5 | NA |
    | 10 | 0 | 0 | 0 | 0 | GB | 1.1467 | 1.0708 | 0.0076 | 0.0077 | 107905x11 | 26977x11 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | LR | 1.0842 | 1.0412 | 0.0617 | 0.0618 | 107905x20 | 26977x20 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | Ridge | 1.0842 | 1.0412 | 0.0617 | 0.0618 | 107905x20 | 26977x20 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | SVM | 1.0838 | 1.0411 | 0.0620 | 0.0635 | 107905x20 | 26977x20 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | RF | 1.1480 | 1.0714 | 0.0065 | 0.0065 | 107905x20 | 26977x20 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 0 | GB | 1.0755 | 1.0371 | 0.0692 | 0.0693 | 107905x20 | 26977x20 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 1 | LR | 1.0845 | 1.0414 | 0.0614 | 0.0615 | 107905x41 | 26977x41 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 1 | Ridge | 1.0845 | 1.0414 | 0.0615 | 0.0616 | 107905x41 | 26977x41 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 1 | SVM | 1.0812 | 1.0398 | 0.0643 | 0.0659 | 107905x41 | 26977x41 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 1 | RF | 1.1297 | 1.0629 | 0.0223 | 0.0223 | 107905x41 | 26977x41 | 0.5 | NA |
    | 10 | 0 | 0 | 1 | 1 | GB | 1.0739 | 1.0363 | 0.0706 | 0.0707 | 107905x41 | 26977x41 | 0.5 | NA |
    | 10 | 0 | NOAA | 1 | 1 | LR | 1.0455 | 1.0225 | 0.0952 | 0.0953 | 107905x41 | 26977x41 | 0.5 | fill in avg |
    | 10 | 0 | NOAA | 1 | 1 | Ridge | 1.0455 | 1.0225 | 0.0952 | 0.0953 | 107905x41 | 26977x41 | 0.5 | fill in avg |
    | 10 | 0 | NOAA | 1 | 1 | SVM | 0.9684 | 0.9841 | 0.1619 | 0.1663 | 107905x41 | 26977x41 | 0.5 | fill in avg |
    | 10 | 0 | NOAA | 1 | 1 | RF | 1.0197 | 1.0098 | 0.1175 | 0.1175 | 107905x41 | 26977x41 | 0.5 | fill in avg |
    | 10 | 0 | NOAA | 1 | 1 | GB | 0.9747 | 0.9873 | 0.1565 | 0.1565 | 107905x41 | 26977x41 | 0.5 | fill in avg |
    | 10 | 0 | tutiempo | 1 | 1 | LR | 1.0286 | 1.0142 | 0.0872 | 0.0872 | 95152x53 | 23789x53 | 0.5 | exclude |
    | 10 | 0 | tutiempo | 1 | 1 | Ridge | 1.0286 | 1.0142 | 0.0872 | 0.0872 | 95152x53 | 23789x53 | 0.5 | exclude |
    | 10 | 0 | tutiempo | 1 | 1 | SVM | 0.9507 | 0.9750 | 0.1564 | 0.1594 | 95152x53 | 23789x53 | 0.5 | exclude |
    | 10 | 0 | tutiempo | 1 | 1 | RF | 0.9411 | 0.9701 | 0.1649 | 0.1652 | 95152x53 | 23789x53 | 0.5 | exclude |
    | 10 | 0 | tutiempo | 1 | 1 | GB | 0.9622 | 0.9809 | 0.1462 | 0.1462 | 95152x53 | 23789x53 | 0.5 | exclude |
    | 10 | 0 | tutiempo | 1 | 1 | LR | 1.0479 | 1.0237 | 0.0855 | 0.0855 | 125260x53 | 31316x53 | 0.5 | fill in avg |
    | 10 | 0 | tutiempo | 1 | 1 | Ridge | 1.0478 | 1.0236 | 0.0855 | 0.0855 | 125260x53 | 31316x53 | 0.5 | fill in avg |
    | 10 | 0 | tutiempo | 1 | 1 | SVM | 0.9721 | 0.9860 | 0.1516 | 0.1541 | 125260x53 | 31316x53 | 0.5 | fill in avg |
    | 10 | 0 | tutiempo | 1 | 1 | RF | 0.9555 | 0.9775 | 0.1661 | 0.1661 | 125260x53 | 31316x53 | 0.5 | fill in avg |
    | 10 | 0 | tutiempo | 1 | 1 | GB | 0.9893 | 0.9946 | 0.1366 | 0.1366 | 125260x53 | 31316x53 | 0.5 | fill in avg |
    | 10 | 0 | tutiempo, cate_SN | 1 | 1 | LR | 1.0715 | 1.0351 | 0.0882 | 0.0883 | 81333x54 | 20334x54 | linear interpolate | exclude |
    | 10 | 0 | tutiempo, cate_SN | 1 | 1 | Ridge | 1.0715 | 1.0351 | 0.0882 | 0.0883 | 81333x54 | 20334x54 | linear interpolate | exclude |
    | 10 | 0 | tutiempo, cate_SN | 1 | 1 | SVM | 0.9988 | 0.9994 | 0.1501 | 0.1531 | 81333x54 | 20334x54 | linear interpolate | exclude |
    | 10 | 0 | tutiempo, cate_SN | 1 | 1 | RF | 1.0300 | 1.0149 | 0.1235 | 0.1238 | 81333x54 | 20334x54 | linear interpolate | exclude |
    | 10 | 0 | tutiempo, cate_SN | 1 | 1 | GB | 1.0042 | 1.0021 | 0.1455 | 0.1456 | 81333x54 | 20334x54 | linear interpolate | exclude |
                    
    Log Transformation on Yield: all measurements are in log terms (removed Yield = 0)

    | # Rings | Log Yield | Weather | Year Categ | Image Categ | Model | MSE | RMSE | R-squared | Var Explained | Training Dim | Testing Dim | Missing Pixel | Missing Weather | MAE | MAPE |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | LR | 0.0709 | 0.2664 | 0.1112 | 0.1113 | 81195x54 | 20299x54 | 0.5 | exclude | 0.2051 | 0.1962 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | Ridge | 0.0709 | 0.2664 | 0.1113 | 0.1113 | 81195x54 | 20299x54 | 0.5 | exclude | 0.2051 | 0.1962 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | SVM | 0.0640 | 0.2530 | 0.1984 | 0.1990 | 81195x54 | 20299x54 | 0.5 | exclude | 0.1956 | 0.1838 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | RF | 0.0646 | 0.2542 | 0.1904 | 0.1905 | 81195x54 | 20299x54 | 0.5 | exclude | 0.1944 | 0.1762 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | GB | 0.0653 | 0.2556 | 0.1814 | 0.1814 | 81195x54 | 20299x54 | 0.5 | exclude | 0.1982 | 0.1860 |
    | 10 | 1 | tutiempo, cate_SN | 0 | 1 | LR | 0.0775 | 0.2784 | 0.0565 | 0.0566 | 81333x45 | 20334x45 | 0.5 | exclude | 0.2102 | 0.2019 |
    | 10 | 1 | tutiempo, cate_SN | 0 | 1 | Ridge | 0.0775 | 0.2784 | 0.0566 | 0.0566 | 81333x45 | 20334x45 | 0.5 | exclude | 0.2102 | 0.2019 |
    | 10 | 1 | tutiempo, cate_SN | 0 | 1 | SVM | 0.0659 | 0.2567 | 0.1971 | 0.1989 | 81333x45 | 20334x45 | 0.5 | exclude | 0.1970 | 0.1836 |
    | 10 | 1 | tutiempo, cate_SN | 0 | 1 | RF | 0.0680 | 0.2608 | 0.1723 | 0.1723 | 81333x45 | 20334x45 | 0.5 | exclude | 0.1986 | 0.1775 |
    | 10 | 1 | tutiempo, cate_SN | 0 | 1 | GB | 0.0692 | 0.2631 | 0.1576 | 0.1576 | 81333x45 | 20334x45 | 0.5 | exclude | 0.2018 | 0.1887 |
                    
    Three Month Ahead Prediction: log transformation on Yield, removed Yield = 0

    | # Rings | Log Yield | Weather | Year Categ | Image Categ | Model | MSE | RMSE | R-squared | Var Explained | Training Dim | Testing Dim | Missing Pixel | Missing Weather | MAE | MAPE |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | LR | 0.0700 | 0.2646 | 0.1061 | 0.1061 | 59528x48 | 14883x48 | 0.5 | exclude | 0.2036 | 0.1916 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | Ridge | 0.0700 | 0.2646 | 0.1061 | 0.1061 | 59528x48 | 14883x48 | 0.5 | exclude | 0.2036 | 0.1916 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | SVM | 0.0625 | 0.2499 | 0.2026 | 0.2033 | 59528x48 | 14883x48 | 0.5 | exclude | 0.1932 | 0.1794 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | RF | 0.0614 | 0.2479 | 0.2157 | 0.2157 | 59528x48 | 14883x48 | 0.5 | exclude | 0.1891 | 0.1697 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | GB | 0.0636 | 0.2522 | 0.1883 | 0.1883 | 59528x48 | 14883x48 | 0.5 | exclude | 0.1952 | 0.1804 |

    Six Month Ahead Prediction: log transformation on Yield, removed Yield = 0

    | # Rings | Log Yield | Weather | Year Categ | Image Categ | Model | MSE | RMSE | R-squared | Var Explained | Training Dim | Testing Dim | Missing Pixel | Missing Weather | MAE | MAPE |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | LR | 0.0703 | 0.2652 | 0.1097 | 0.1099 | 40728x42 | 10182x42 | 0.5 | exclude | 0.2036 | 0.1945 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | Ridge | 0.0703 | 0.2652 | 0.1098 | 0.1101 | 40728x42 | 10182x42 | 0.5 | exclude | 0.2036 | 0.1945 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | SVM | 0.0635 | 0.2520 | 0.1965 | 0.1985 | 40728x42 | 10182x42 | 0.5 | exclude | 0.1952 | 0.1831 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | RF | 0.0627 | 0.2504 | 0.2064 | 0.2067 | 40728x42 | 10182x42 | 0.5 | exclude | 0.1913 | 0.1745 |
    | 10 | 1 | tutiempo, cate_SN | 1 | 1 | GB | 0.0638 | 0.2526 | 0.1926 | 0.1929 | 40728x42 | 10182x42 | 0.5 | exclude | 0.1958 | 0.1831 |