Gro’s Scalable Data Platform
Gro’s expanding database and ontology architecture handles a widening array of complex challenges across global agriculture. As a result, we are increasingly able to provide powerful real-time forecasts and insights in response to unanticipated, market-moving developments.
We update Gro’s data in two steps: harvest and transformation. The harvest of each individual data source is a unique process that depends on its complexity and file format. We retain copies of all the raw data to preserve a history of each source’s original values should a change occur. Harvesting data in this fashion gives users access to a full history of revisions for many sources, which facilitates backtesting of trading models through the use of Gro’s as-reported data.
Data then goes through transformation, in which we fit each data point into a common ontology. The transformation step is critical as it makes our data easily searchable, improving discoverability and promoting unencumbered access to our data. Through this transformation process, we stitch together data from disparate sources through standardized crop and region names, unit conversions, and foreign language translations. We also categorize data geospatially, allowing straightforward comparisons of data from different sources within the same regions. Our proprietary ontology makes it easy for Gro to add new data sources that are useful to clients or our internal modeling efforts. Importantly, we run frequent quality checks to preserve the integrity of each data series.
Gro users can access the entire data platform, which now holds over 55 million series and 650 trillion data points with global coverage. Gro’s platform enables users to discover relationships and arrive at their own insights, without the onerous prep-work steps. Gro API clients can incorporate data into their models through simple database queries. In addition, all of Gro’s predictive models and their methodologies are made available to API clients.
In 2019 alone, we have experienced unexpected disruptions such as massive flooding in the American Midwest, which led to unprecedented amounts of abandoned acreage and financial distress; the outbreak of serious crop and animal diseases in China; and a US government shutdown. In each instance, we were able to leverage our domain-specific machine learning models, powered by the underlying Gro data platform, to develop a forecast model for the issue at hand in a matter of weeks. Specifically, Gro developed a series of machine learning frameworks that forecast (i) prevent plant acreage in the American Midwest, (ii) the spread of fall armyworms, and (iii) supply and demand estimates for 35 crops grown around the world when the US government shut down in early 2019. Gro API users can further customize these models to suit their needs, modifying Gro’s baseline methodology or adding data sets as necessary.
The Complexity of GFS
The Global Forecast System (GFS) is NOAA’s weather model, which incorporates the new Finite-Volume Cubed Sphere (FV3) core. This updated core provides greater accuracy, especially for near-range forecasts. Three of the model outputs are available through Gro: maximum temperature, minimum temperature, and precipitation, each forecast 16 days into the future at a grid resolution of roughly 28 kilometers. We harvest the earliest 00z (midnight Greenwich Mean Time) forecast run daily.
The GFS outputs present an especially complex geospatial data ingestion task. Gro automates the harvesting and data processing to serve the outputs to users quickly and efficiently. If a single user attempted to manually harvest, quality check, aggregate, and compute global regional values for a single day’s run (16 days of data), the effort could take several days of work, depending on their workstation’s speed. By contrast, Gro’s optimized automated workflow completes the processing in a few hours.
A number of factors make geospatial data sets, and GFS in particular, so time-intensive. First, harvesting the files entails several steps, the first of which is determining the expected release time of the data sets from the source. Not every source releases data at the same time and date with every new upload, and the automated scripts must be flexible enough to account for these differences. In addition, some geospatial sources are delivered in hundreds of smaller pieces, with varying expected file sizes. For the GFS source, Gro harvests four files per variable for each day of the 16-day forecast. We check each source harvest for file completeness and flag incomplete or failed harvests. If a source release is delayed longer than expected, a flag is set for follow-up, which may include contacting the source directly.
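The completeness check described above can be sketched as follows. This is a minimal illustration, not Gro’s actual harvesting code: the variable names, the file-naming pattern, and the `min_bytes` threshold are all hypothetical. Only the counts (three variables, 16 forecast days, four files per day) come from the text.

```python
from itertools import product

# Hypothetical short names for the three variables named in the text.
VARIABLES = ("tmax", "tmin", "precip")
FORECAST_DAYS = range(1, 17)  # 16-day forecast
STEPS_PER_DAY = range(4)      # four six-hour files per forecast day


def expected_files(run_date):
    """Return the file names one complete run should contain.

    The naming pattern is invented for this sketch.
    """
    return {
        f"gfs_{run_date}_{var}_d{day:02d}_s{step}.grib2"
        for var, day, step in product(VARIABLES, FORECAST_DAYS, STEPS_PER_DAY)
    }


def check_harvest(run_date, harvested_sizes, min_bytes=1):
    """Compare harvested files against the expected set.

    harvested_sizes maps file name -> size in bytes.
    Returns (missing, too_small) as sorted lists so that an incomplete
    or failed harvest can be flagged for follow-up.
    """
    expected = expected_files(run_date)
    present = set(harvested_sizes)
    missing = sorted(expected - present)
    too_small = sorted(
        name for name in expected & present
        if harvested_sizes[name] < min_bytes
    )
    return missing, too_small
```

Under these assumptions a complete run contains 3 × 16 × 4 = 192 files, and a harvest is treated as complete only when both returned lists are empty.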
Once the data is harvested and stored, data preparation begins. Data preparation is entirely source-specific. Some sources need minimal preparation, while most need one or more of the following steps: decompressing, converting to a standard file format, mosaicking files together, transforming the files to a standard projection, renaming files to include additional metadata, rescaling to more intuitive values, and aggregating files. The GFS data sets, for example, need decompression from GRIB2 format, filename changes, and aggregation of four six-hour files into one 24-hour file, repeated 16 times for each day’s run.
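To make the six-hour-to-daily aggregation concrete, here is a minimal sketch that works on plain nested lists rather than real GRIB2 rasters. The choice of reducer per variable (sum for precipitation, max for maximum temperature, min for minimum temperature) is an assumption for illustration, as is the premise that each input grid already holds a six-hour value per pixel. The source does not specify either detail.

```python
def aggregate_daily(six_hour_grids, how):
    """Collapse four six-hour grids (lists of rows) into one 24-hour grid.

    how: "sum" (e.g. precipitation totals), "max" (e.g. maximum
    temperature), or "min" (e.g. minimum temperature). These pairings
    are assumptions made for this sketch.
    """
    reducers = {"sum": sum, "max": max, "min": min}
    if how not in reducers:
        raise ValueError(f"unknown aggregation: {how!r}")
    if len(six_hour_grids) != 4:
        raise ValueError("expected exactly four six-hour grids")

    reduce_cell = reducers[how]
    n_rows = len(six_hour_grids[0])
    n_cols = len(six_hour_grids[0][0])
    # Reduce the four values that share a pixel position into one value.
    return [
        [
            reduce_cell([grid[r][c] for grid in six_hour_grids])
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]
```

Running this once per forecast day, 16 times per run, would yield the daily files described above.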
Next, we translate the harvested and prepped files into user-friendly formats that are ready for analysis. Heatmaps allow for easy comparison from one time step to the next and visually highlight areas of concern. Each geospatial source has its own specialized legend, with colors and legend breaks assigned based on scientific convention and the expected range and distribution of the data. Zonal aggregations of the data to regions (i.e. districts/counties, provinces/states, and countries) provide more easily digestible information for charts and models. By condensing pixels (in some cases, hundreds of them) into a single value per period per region, we transform a large image file into a powerful set of values that users can easily access to discover patterns and inform decisions.
The maps above show how Gro’s data can be converted from satellite pixels to administrative regions. On the left, we have a heatmap of GFS precipitation forecast for Brazil on September 29th. On the right, the same data is presented in a choropleth map at the provincial level through the process of zonal aggregation.
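Zonal aggregation can be illustrated with a small sketch. It assumes each pixel has already been assigned to exactly one region and takes a simple unweighted mean. A production workflow would likely also handle pixels that straddle boundaries and differences in pixel area, which this example ignores. The function name and the `nodata` convention are hypothetical.

```python
from collections import defaultdict


def zonal_mean(values, region_ids, nodata=None):
    """Average pixel values per region.

    values and region_ids are grids (lists of rows) of the same shape.
    region_ids holds a region identifier per pixel, or None where the
    pixel falls outside every region. Pixels equal to nodata are skipped.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value_row, region_row in zip(values, region_ids):
        for value, region in zip(value_row, region_row):
            if region is None or value == nodata:
                continue
            totals[region] += value
            counts[region] += 1
    # One value per region: the mean of all valid pixels inside it.
    return {region: totals[region] / counts[region] for region in totals}
```

For example, a 2×3 grid with three valid pixels in region "A" and one in region "B" collapses to two numbers, one per region, which is the form that feeds charts and models.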
For many of our geospatial sources, we also compute additional metrics from the image files, such as aggregations from daily to weekly or monthly data, means, and anomalies. These computations are available in Gro via heatmaps or database values.
Gro creates heatmaps from the GFS source files for each variable and each day, with legends similar to those of Gro’s other precipitation and temperature sources for easy comparison. More than 40,000 regional computations are completed for each of the 16 days of forecast data and uploaded to the database for users. (A single GFS variable will produce over 600,000 new values every day.) Recreating this step on a single workstation without automation can take several days of processing.
Quality checks are run at all necessary steps, flagging issues such as missing files and values that fall outside expected ranges, which allows for quick responses. Data files are stored for future computations and for easy access to archived data without reharvesting from the source, which is imperative if the original source goes offline for any length of time.
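A range check of the kind mentioned above might look like the following sketch. The bounds and units in `EXPECTED_RANGES` are placeholder assumptions chosen for illustration, not Gro’s actual thresholds, and the variable names are hypothetical.

```python
# Placeholder bounds, assumed for this sketch only.
EXPECTED_RANGES = {
    "tmax": (-90.0, 60.0),     # degrees Celsius (assumed unit)
    "tmin": (-90.0, 60.0),     # degrees Celsius (assumed unit)
    "precip": (0.0, 2000.0),   # millimeters per day (assumed unit)
}


def flag_out_of_range(variable, regional_values):
    """Return the sorted region ids whose value is missing or implausible.

    regional_values maps region id -> value (or None when missing).
    NaN values are flagged too, because every comparison with NaN is False.
    """
    low, high = EXPECTED_RANGES[variable]
    return sorted(
        region
        for region, value in regional_values.items()
        if value is None or not (low <= value <= high)
    )
```

An empty result means every regional value passed. Anything else would be routed to whoever follows up on flagged data.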
Practical Uses for GFS Data
GFS data is particularly useful for improving existing models that rely on temperature and precipitation data. Amid concerns about dry conditions ahead of the South American growing season, here we examine current corn and soybean crop-weighted soil moisture readings. We also show how GFS data can be used with historical precipitation to predict future soil moisture levels.
To construct these crop-weighted measurements, we use data from IBGE, which reports annual production for many crops in over five thousand municipalities in Brazil. The weights for each municipality are then applied to Gro’s soil moisture and rainfall data and aggregated to create a single national production-weighted series.
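The weighting step can be sketched as a production-weighted mean. This is an illustrative version under stated assumptions: municipalities without a value or with non-positive production are skipped, and the remaining weights are renormalized. The source does not say how such cases are handled. All names are hypothetical.

```python
def production_weighted_mean(values_by_region, production_by_region):
    """Combine regional values into one series point, weighted by production.

    values_by_region: region id -> measurement (e.g. soil moisture).
    production_by_region: region id -> annual crop production.
    Regions with no measurement or non-positive production are skipped
    (an assumption of this sketch), and weights are renormalized.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for region, production in production_by_region.items():
        value = values_by_region.get(region)
        if value is None or production <= 0:
            continue
        weighted_sum += production * value
        total_weight += production
    if total_weight == 0:
        raise ValueError("no regions with both a value and positive production")
    return weighted_sum / total_weight
```

Applying this function to each date in turn produces a single national series in which heavily producing municipalities dominate the result.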
Brazil’s crop-weighted soil moisture readings for corn (blue line) and soybeans (green line) are currently below normal (red lines) for this time of year. Soil moisture levels usually increase through October and November, but the seasonal trend is in doubt this year given the lack of rain in the current forecast.
The charts above show Brazil’s corn and soybean soil moisture levels, which are currently 16 percent and 11 percent below the five-year average, respectively. While low early-season soil moisture has not been definitively shown to impact yield or overall production, continued low moisture levels will introduce enough potential stress to crops to keep markets on edge until more accurate yield measurements can be obtained. Therefore, using GFS data to predict future soil moisture can help market participants monitor potential planting progress, crop health, and likely prices.
The production-weighted precipitation series combine historical Tropical Rainfall Measuring Mission (TRMM) data with GFS forecasts, creating a seamless indicator that includes both realized and projected data. These series are then accumulated from July 1st each year. Season-to-date precipitation is currently 55 percent below the five-year average for corn and 47 percent below the five-year average for soybeans. Adding GFS forecast data paints an even more dire picture. Corn’s season-to-date precipitation is expected to fall to 68 percent below the five-year average by the first week of October, and the same measure for soybeans is expected to fall to some 60 percent below.
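The accumulation and the comparison with the five-year average can be sketched as follows. The function names are hypothetical, and the example treats observed and forecast values as simple daily totals that can be concatenated, which is an assumption about how the two sources are joined.

```python
from itertools import accumulate


def season_to_date(observed_daily, forecast_daily=()):
    """Running precipitation total from the season start.

    Observed values come first and forecast values are appended, giving
    one continuous cumulative series of realized plus projected data.
    """
    return list(accumulate(list(observed_daily) + list(forecast_daily)))


def percent_vs_average(current_total, prior_totals):
    """Percent difference between this season's total and the mean of
    prior seasons' totals on the same date (negative means below average)."""
    average = sum(prior_totals) / len(prior_totals)
    return 100.0 * (current_total - average) / average
```

With five prior seasons averaging 12 units and a current cumulative total of 6, `percent_vs_average` returns -50.0, meaning 50 percent below the five-year average.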
Brazil’s crop-weighted precipitation measures for corn (blue line) and soybeans (green line) have fallen well behind normal this year. Current GFS forecasts (red lines) show little respite in coming weeks. GFS data with global coverage is now available in the Gro web app.
Historically, cumulative precipitation has exhibited a positive correlation with future soil moisture readings, albeit with modest statistical significance. According to Gro’s analysis using historical TRMM precipitation and current GFS forecasts, Brazil’s soil moisture levels are expected to remain below normal in the weeks ahead as we enter the key Brazilian growing season. Therefore, market participants should remain wary of the higher-than-normal risks of crop damage and yield declines due to dryness this year.
In addition to predicting future soil moisture levels, GFS data can be used in many other situations, including estimating US prevented plantings, improving yield models, forecasting seasonal demand, and predicting crop area loss from tropical weather events. The Multivariate El Niño/Southern Oscillation Index (MEI), another recently added Gro data source, tracks a climate pattern that influences weather in many different regions, including Brazil. MEI can be used alongside GFS and TRMM to create long-term weather-based models.
Gro’s platform is built to increase the efficiency of global agricultural markets, providing solutions to many problems faced by businesses across the global agricultural supply chain. The analysis-ready data in Gro from series like GFS, NDVI, TRMM, soil moisture, temperature, acreage, and others, along with their histories, give users an unprecedented ability to rapidly model complex scenarios and outcomes as they develop.
The addition of GFS weather forecasting data rounds out a full suite of global geospatial variables in Gro relevant to agriculture. We encourage anyone who wants to use weather forecasts to build or improve a model to contact firstname.lastname@example.org to learn more about our valuable data sets and analysis.