Brazil Sao Paulo sugarcane crop model using Gro API

In this notebook, we walk through the process of building a simple "crop model" using data in Gro. Our crop models use statistical or machine learning algorithms to predict the yield of a specific crop in specific regions. The accompanying Gro web app display for Sao Paulo Sugarcane provides an overview of the data series we will use to create a model that predicts the sugar processing yield in Sao Paulo.

Preliminaries

Here we use a CropModel object as well as some feature transformation functions. We also save the entity ids for the item sugarcane, the metric processing yield, and the region Sao Paulo; we will use these throughout the notebook to simplify the retrieval, storage, and manipulation of data. In order to download data from our API, you need to set up a GROAPI_TOKEN. Instructions are in this wiki page.
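A minimal sketch of making the token available to the current session (the token value below is a placeholder, not a real credential); normally you would export GROAPI_TOKEN in your shell before launching Jupyter.

import os
os.environ['GROAPI_TOKEN'] = '<your-token-here>'  # placeholder value, replace with your own token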

In [5]:
import warnings
warnings.filterwarnings('ignore')
import logging
logging.basicConfig(level=logging.ERROR) 
import os
import numpy as np
import pandas as pd
import itertools
import datetime

from api.client.crop_model import CropModel
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn.cluster import KMeans

model = CropModel('api.gro-intelligence.com', os.environ['GROAPI_TOKEN'])
country_id = model.search_for_entity('regions', "brazil")
region_id = model.search_for_entity('regions', "Sao Paulo")
item_id =  model.search_for_entity('items', "sugarcane, TRS")

Historical Sugar Processing yields

To get the province-level sucrose yield data for sugarcane in Sao Paulo state, we set the entities (item, metric and region) and get the available data series. The source for Sao Paulo sugarcane yields is UNICA.

In [6]:
yield_entities = {}
yield_entities['item_id']  = item_id
yield_entities['region_id'] = region_id
yield_entities['metric_id'] = model.search_for_entity('metrics', "processing yield mass/mass")
yield_entities['source_id'] = 47 # unica
data_series_list = model.get_data_series(**yield_entities)
print("There are {} data series for {}.".format(len(data_series_list), yield_entities))
for data_series in data_series_list:
    print("source_id {}: {} to {}".format(
        data_series['source_id'], data_series['start_date'], data_series['end_date']))
    model.add_single_data_series(data_series)
There are 2 data series for {'item_id': 7539, 'region_id': 10408, 'metric_id': 6881050, 'source_id': 47}.
source_id 47: 2007-04-01T00:00:00.000Z to 2020-03-31T00:00:00.000Z
source_id 47: 2008-04-01T00:00:00.000Z to 2021-04-30T00:00:00.000Z
  • As printed above, there are 2 available data series for the Sugar processing yield of Sao Paulo
  • Next we retrieve the data frame of the sugarcane total recoverable sugars (TRS) series and make sure to subset to the metric (Processing Yield (mass/mass)), semi-monthly frequency and region (Sao Paulo)
In [7]:
df = model.get_df()
yield_df = df.loc[(df.metric_id == yield_entities['metric_id']) & \
                  (df.region_id == yield_entities['region_id']) & \
                  (df.frequency_id == 23)
                 ]
yield_df.end_date = pd.to_datetime(yield_df.end_date, utc=True)
yield_df['DOY']  = yield_df.end_date.apply(lambda x: int(x.strftime('%j')))
yield_df['Year'] = yield_df.end_date.apply(lambda x: x.year)
yield_df.sort_values(['Year', 'DOY']).tail()
Out[7]:
start_date end_date value unit_id input_unit_id input_unit_scale reporting_date metric_id item_id region_id partner_region_id frequency_id source_id DOY Year
258 2021-02-16 00:00:00+00:00 2021-02-28 00:00:00+00:00 89.03 867.0 867.0 1 NaT 6881050 7539 10408 0 23 47 59 2021
259 2021-03-01 00:00:00+00:00 2021-03-15 00:00:00+00:00 97.63 867.0 867.0 1 NaT 6881050 7539 10408 0 23 47 74 2021
260 2021-03-16 00:00:00+00:00 2021-03-31 00:00:00+00:00 118.68 867.0 867.0 1 NaT 6881050 7539 10408 0 23 47 90 2021
261 2021-04-01 00:00:00+00:00 2021-04-15 00:00:00+00:00 101.49 867.0 867.0 1 NaT 6881050 7539 10408 0 23 47 105 2021
262 2021-04-16 00:00:00+00:00 2021-04-30 00:00:00+00:00 118.60 867.0 867.0 1 NaT 6881050 7539 10408 0 23 47 120 2021
  • The resulting data frame includes the start and end dates of the period corresponding to each processing yield observation, along with ids representing the frequency, unit, unit scale, item, metric, and region of our Gro data series. Note: the unit used here for TRS is kg/ton, or kilograms per US short ton, not metric tonnes.
  • The Day of Year (DOY) and Year columns are derived from the end_date column to help with the feature transformation steps that follow

  • A scatter plot helps us visualize the progression of the TRS processing yield over time. We suspect that it follows a similar seasonal pattern every year and that more recent years may have higher TRS than earlier years.

  • So, we plot the following figure of TRS versus day of year, color-coded by year.
In [8]:
fig, axes = plt.subplots(1, 1)
yield_df[yield_df['value']<300].plot(x='DOY', y='value', kind='scatter', c='Year', 
                                     cmap='rainbow', ax=axes, figsize=(12,8), fontsize=20, legend=True)
plt.xlabel("Day of Year", fontsize=18)
plt.ylabel("Yield (kg/t)", fontsize=18)
plt.show()
  • As we can see in the figure above, the sugar processing yield increases from about day of year (DOY) 100 to about DOY 270 and then declines
  • The processing yield between DOY 350 and DOY 100 of the following year appears to show no clear pattern, and only recent years have data points there
  • For the purpose of this analysis, we therefore limit the study period to DOY 100 through 350
  • Note: it is possible to limit the time period of the data being pulled; you just need to specify 'start_date' and 'end_date' in the entities dictionary that we used earlier as inputs to the CropModel get_data_series method (a minimal sketch follows)
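A minimal sketch of that option, reusing the yield_entities dictionary defined earlier (the date bounds are illustrative):

bounded_entities = dict(yield_entities)
bounded_entities['start_date'] = '2015-04-01'  # illustrative lower bound
bounded_entities['end_date'] = '2020-12-31'    # illustrative upper bound
for data_series in model.get_data_series(**bounded_entities):
    model.add_single_data_series(data_series)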
In [9]:
df_within_range = yield_df[(yield_df['DOY'] > 100) & (yield_df['DOY'] < 350)]
  • To make a TRS forecast for a future period, we generate an additional data frame of new dates. For example, here we add a date 15 days after the end of the most recent available TRS period in order to make a forecast for that period.
In [10]:
# date to predict
new_date = df_within_range['end_date'].max() + datetime.timedelta(days=15)
print(new_date.strftime('%Y-%m-%d'))
2021-05-15
In [11]:
# expand the df_within_range with this new date
unknown_df = df_within_range.set_index('end_date').reindex([new_date]).reset_index()
unknown_df['DOY']  = unknown_df.end_date.apply(lambda x: int(x.strftime('%j')))
unknown_df['Year'] = unknown_df.end_date.apply(lambda x: x.year)
df_within_range = pd.concat([df_within_range, unknown_df
                             ], ignore_index=True)
  • Next, convert the end_date column to datetime
  • Convert the DOY and Year columns to integer type
In [12]:
df_within_range['end_date'] = pd.to_datetime(df_within_range['end_date'], utc=True)
df_within_range.DOY = df_within_range.DOY.astype(int)
df_within_range.Year = df_within_range.Year.astype(int)
df_within_range.rename(columns={'value': 'TRS'}, inplace=True)
df_within_range.tail()
Out[12]:
start_date end_date TRS unit_id input_unit_id input_unit_scale reporting_date metric_id item_id region_id partner_region_id frequency_id source_id DOY Year
215 2020-11-01 00:00:00+00:00 2020-11-15 00:00:00+00:00 158.06 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 320 2020
216 2020-11-16 00:00:00+00:00 2020-11-30 00:00:00+00:00 156.76 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 335 2020
217 2021-04-01 00:00:00+00:00 2021-04-15 00:00:00+00:00 101.49 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 105 2021
218 2021-04-16 00:00:00+00:00 2021-04-30 00:00:00+00:00 118.60 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 120 2021
219 NaT 2021-05-15 00:00:00+00:00 NaN NaN NaN NaN NaT NaN NaN NaN NaN NaN NaN 135 2021

NDVI of Sao Paulo

The main signal we will use to model TRS yields is NDVI, which represents vegetation biomass per pixel, and thus is a good physical proxy for vegetative growth (production mass per unit of area).

First, we load the historical province-level NDVI data. There are actually two data series for each region, one with 8-day and one with 16-day periods. We use the 8-day series (frequency_id = 3).
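A hedged sketch for listing the candidate NDVI series before choosing one; it assumes each series dictionary returned by get_data_series carries 'frequency_id', 'start_date', and 'end_date' keys, consistent with the fields used elsewhere in this notebook.

ndvi_query = {
    'item_id': model.search_for_entity('items', "Vegetation NDVI"),
    'metric_id': model.search_for_entity('metrics', "Vegetation Indices index"),
    'region_id': region_id,
}
for series in model.get_data_series(**ndvi_query):
    # print the frequency and time coverage of each candidate series
    print(series.get('frequency_id'), series.get('start_date'), series.get('end_date'))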

In [13]:
entities = {}
entities['item_id'] =  model.search_for_entity('items', "Vegetation NDVI")
entities['metric_id'] = model.search_for_entity('metrics', "Vegetation Indices index")
entities['frequency_id'] = 3


entities['region_id'] = region_id
for data_series in model.get_data_series(**entities):
    model.add_single_data_series(data_series)
In [14]:
df = model.get_df()
raw_ndvi = df[(df['item_id'] == entities['item_id']) & (df['metric_id']==entities['metric_id'])]
raw_ndvi.head()
Out[14]:
start_date end_date value unit_id input_unit_id input_unit_scale reporting_date metric_id item_id region_id partner_region_id frequency_id source_id
0 2000-02-18 00:00:00+00:00 2000-02-25 00:00:00+00:00 0.740052 189.0 189.0 1 NaT 70029 321 10408 0 3 3
1 2000-02-26 00:00:00+00:00 2000-03-04 00:00:00+00:00 0.695881 189.0 189.0 1 NaT 70029 321 10408 0 3 3
2 2000-03-05 00:00:00+00:00 2000-03-12 00:00:00+00:00 0.691287 189.0 189.0 1 NaT 70029 321 10408 0 3 3
3 2000-03-13 00:00:00+00:00 2000-03-20 00:00:00+00:00 0.688005 189.0 189.0 1 NaT 70029 321 10408 0 3 3
4 2000-03-21 00:00:00+00:00 2000-03-28 00:00:00+00:00 0.712438 189.0 189.0 1 NaT 70029 321 10408 0 3 3
  • A quick way to summarize the statistics of this NDVI dataframe is to use the describe() method.
  • As shown in the table below, there are 971 data points; the mean NDVI value is about 0.63, with a standard deviation of about 0.08, a minimum of about 0.40, and a maximum of about 0.76.
In [15]:
raw_ndvi.describe()
Out[15]:
value unit_id input_unit_id input_unit_scale metric_id item_id region_id partner_region_id frequency_id source_id
count 971.000000 971.0 971.0 971.0 971.0 971.0 971.0 971.0 971.0 971.0
mean 0.630839 189.0 189.0 1.0 70029.0 321.0 10408.0 0.0 3.0 3.0
std 0.079715 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
min 0.404479 189.0 189.0 1.0 70029.0 321.0 10408.0 0.0 3.0 3.0
25% 0.565856 189.0 189.0 1.0 70029.0 321.0 10408.0 0.0 3.0 3.0
50% 0.648854 189.0 189.0 1.0 70029.0 321.0 10408.0 0.0 3.0 3.0
75% 0.700651 189.0 189.0 1.0 70029.0 321.0 10408.0 0.0 3.0 3.0
max 0.759790 189.0 189.0 1.0 70029.0 321.0 10408.0 0.0 3.0 3.0
  • The start and end dates of the individual NDVI periods do not match those of the sugar processing yield periods, because NDVI is reported in 8-day periods while the TRS series is reported semi-monthly.
  • In order to synchronize the two data series, we first transform the NDVI data to a daily time step and then pick the NDVI value that corresponds to each sugar processing yield period's end date.
  • We transform NDVI using linear interpolation: NDVI is an indicator of the amount of biomass, which generally varies continuously as the vegetation grows (a toy sketch of this step follows).
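A toy sketch of the same idea on a two-point series (dates and values are illustrative); the cell below applies it to the full regional NDVI table.

import pandas as pd

# two consecutive 8-day NDVI period end dates, upsampled to daily and
# linearly interpolated in between
eight_day = pd.Series([0.70, 0.66],
                      index=pd.to_datetime(['2020-04-06', '2020-04-14']))
daily = eight_day.asfreq('D').interpolate(method='linear')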
In [16]:
# transform the 8-day data into a daily dataframe
date_index = 'end_date'
startday = 100
endday = 350
region_index='region_id'
feature = 'value'
raw_ndvi[date_index] = pd.to_datetime(raw_ndvi[date_index], utc=True)
min_date = raw_ndvi[date_index].min()
max_date = pd.to_datetime("{}-{}".format(raw_ndvi[date_index].max().year, endday), 
                          format='%Y-%j', utc=True)
date_range = pd.date_range(min_date, max_date, freq='D', name=date_index)
raw_ndvi_pivot = raw_ndvi.pivot_table(index=date_index, columns=region_index, values=feature)

raw_ndvi_filled = raw_ndvi_pivot.reset_index().set_index(date_index).reindex(
    date_range).asfreq('D').interpolate(
    method='linear',limit_direction='forward', axis=0).stack(   
    region_index).reset_index(name=feature)
raw_ndvi_filled.describe()
Out[16]:
region_id value
count 7966.0 7966.000000
mean 10408.0 0.630630
std 0.0 0.078133
min 10408.0 0.404479
25% 10408.0 0.568370
50% 10408.0 0.643511
75% 10408.0 0.699447
max 10408.0 0.759790
  • The data frame summarized above shows the result of linearly interpolating NDVI from 8-day periods to a daily time step
  • Next, subset the NDVI data to our study timeframe between DOY 100 and 350
In [17]:
# Get the day of year (DOY) as well as the year from the date index
raw_ndvi_filled.loc[:, 'doy'] = raw_ndvi_filled[date_index].dt.dayofyear
raw_ndvi_filled.loc[:, 'year'] = raw_ndvi_filled[date_index].dt.year
ndvi_within_range = raw_ndvi_filled[(raw_ndvi_filled['doy'] > 100) & (
    raw_ndvi_filled['doy'] < 350)]
  • Then merge the NDVI data frame with the TRS (sugar processing yield) data frame.
  • Keep the NDVI values on the dates that match the TRS period end dates.
In [18]:
ndvi_within_range.rename(columns={'value': 'ndvi'}, inplace=True)
ndvi_df = ndvi_within_range[ndvi_within_range['year'].isin(df_within_range['Year'].unique())]


ndvi_df.loc[:, 'DOY'] = ndvi_df.doy.astype(int)
ndvi_df.loc[:, 'Year'] = ndvi_df.year.astype(int)

ndvi_y = df_within_range[['DOY','Year','TRS']].merge(ndvi_df[['end_date','doy','year', 'ndvi']],
                        left_on=['DOY','Year'], right_on= ['doy','year'], how='left')
ndvi_y.tail()
Out[18]:
DOY Year TRS end_date doy year ndvi
215 320 2020 158.06 2020-11-15 00:00:00+00:00 320 2020 0.538198
216 335 2020 156.76 2020-11-30 00:00:00+00:00 335 2020 0.583034
217 105 2021 101.49 2021-04-15 00:00:00+00:00 105 2021 0.681949
218 120 2021 118.60 2021-04-30 00:00:00+00:00 120 2021 0.645152
219 135 2021 NaN 2021-05-15 00:00:00+00:00 135 2021 0.621904
  • A scatter plot of TRS versus NDVI, color-coded by day of year (DOY), shows the relationship between the two variables and how that relationship changes across different periods of the year.
In [19]:
fig, ax = plt.subplots(figsize=(12, 8))
        
ndvi_y.plot(x='ndvi', y='TRS', c='doy', cmap='rainbow', kind='scatter', ax=ax, fontsize=18)
plt.xlabel('NDVI', fontsize=18)
plt.ylabel('TRS (kg/ton)', fontsize=18)
plt.show()
  • As shown in the figure above, there is an inverse relationship between total recoverable sugar (TRS) content and NDVI.
  • This negative correlation has a greater magnitude (a steeper slope) during earlier DOY than during the rest of the year

Daily FEWS_PET versus Sugar Content

In addition to NDVI, various climate features may capture variation in the sugar processing yield. Potential evapotranspiration, available in Gro as FEWS_PET, is one important measurement.
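The next cell uses hard-coded ids for this series. As a hedged alternative, the ids could be looked up by name with search_for_entity, as we did for the yield and NDVI series; the search strings below are assumptions and may need adjusting to match the exact Gro entity names.

# assumed search strings; verify the returned ids against the hard-coded ones used below
pet_item_id = model.search_for_entity('items', "Evapotranspiration")
pet_metric_id = model.search_for_entity('metrics', "Evapotranspiration")
print(pet_item_id, pet_metric_id)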

In [20]:
entities = {}
entities['item_id'] =  5072
entities['metric_id'] = 4660031
entities['frequency_id'] = 1 # daily
entities['source_id'] = 44
entities['region_id'] = region_id

for data_series in model.get_data_series(**entities):
    model.add_single_data_series(data_series)
In [21]:
model.get_df()
raw_pet = model._data_frame[model._data_frame['metric_id'] == 4660031]
raw_pet.tail()
Out[21]:
start_date end_date value unit_id input_unit_id input_unit_scale reporting_date metric_id item_id region_id partner_region_id frequency_id source_id
7420 2021-05-20 00:00:00+00:00 2021-05-20 00:00:00+00:00 3.553505 2.0 2.0 1 NaT 4660031 5072 10408 0 1 44
7421 2021-05-21 00:00:00+00:00 2021-05-21 00:00:00+00:00 3.831638 2.0 2.0 1 NaT 4660031 5072 10408 0 1 44
7422 2021-05-22 00:00:00+00:00 2021-05-22 00:00:00+00:00 2.403207 2.0 2.0 1 NaT 4660031 5072 10408 0 1 44
7423 2021-05-23 00:00:00+00:00 2021-05-23 00:00:00+00:00 1.515502 2.0 2.0 1 NaT 4660031 5072 10408 0 1 44
7424 2021-05-24 00:00:00+00:00 2021-05-24 00:00:00+00:00 2.692021 2.0 2.0 1 NaT 4660031 5072 10408 0 1 44
In [22]:
raw_pet.loc[:, date_index] = pd.to_datetime(raw_pet[date_index], utc=True)

raw_pet.loc[:, 'doy'] = raw_pet[date_index].dt.dayofyear
raw_pet.loc[:, 'year'] = raw_pet[date_index].dt.year
  • Since PET is available at a daily time step, we can aggregate it directly to the same periods as the sugar processing yield (TRS) dataset.
  • First, merge in the DOY information from the TRS dataset, and then aggregate (sum) PET over those periods (a toy sketch of this step follows).
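A toy sketch of this labelling-and-summing step (numbers are illustrative): days that fall before a period's end date are back-filled with that period's DOY label and then summed within each label.

import pandas as pd

daily = pd.DataFrame({'doy': [101, 102, 103, 104, 105, 106],
                      'value': [1.0, 2.0, 0.5, 0.0, 3.0, 1.5]})
period_ends = pd.DataFrame({'doy': [103, 106], 'DOY': [103, 106]})  # TRS period end DOYs
labelled = daily.merge(period_ends, on='doy', how='left')           # DOY is NaN except on period end days
period_sums = labelled.fillna(method='bfill').groupby('DOY')['value'].sum()
# period ending DOY 103 -> 3.5 (days 101-103); period ending DOY 106 -> 4.5 (days 104-106)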
In [23]:
START_YEAR = df_within_range['Year'].min()

pet_agg = raw_pet[['doy','year', 'value']].merge(df_within_range[['DOY','Year','TRS']],
                        right_on=['DOY','Year'], left_on= ['doy','year'], how='left')
pet_agg = pet_agg[pet_agg['year'] >= START_YEAR][pet_agg['doy']>=100][pet_agg['doy']<=350]
pet_grouped = pet_agg.fillna(method='bfill').groupby(['year','DOY']).sum().reset_index()

pet_y = df_within_range.merge(pet_grouped[['DOY', 'value', 'year']], 
                               left_on=['DOY','Year'], right_on=['DOY', 'year'], how='left')
pet_y.sort_values(['Year','DOY']).tail()
Out[23]:
start_date end_date TRS unit_id input_unit_id input_unit_scale reporting_date metric_id item_id region_id partner_region_id frequency_id source_id DOY Year value year
215 2020-11-01 00:00:00+00:00 2020-11-15 00:00:00+00:00 158.06 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 320 2020 84.400187 2020
216 2020-11-16 00:00:00+00:00 2020-11-30 00:00:00+00:00 156.76 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 335 2020 89.291531 2020
217 2021-04-01 00:00:00+00:00 2021-04-15 00:00:00+00:00 101.49 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 105 2021 24.885540 2021
218 2021-04-16 00:00:00+00:00 2021-04-30 00:00:00+00:00 118.60 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 120 2021 53.653951 2021
219 NaT 2021-05-15 00:00:00+00:00 NaN NaN NaN NaN NaT NaN NaN NaN NaN NaN NaN 135 2021 49.032044 2021
  • Next we use a color coded scatter plot to show the relationship between TRS and PET.
In [24]:
fig, ax = plt.subplots(figsize=(12, 8))
        
pet_y.plot(x='value', y='TRS', c='DOY', cmap='rainbow', kind='scatter', ax=ax, fontsize=18)
plt.xlabel('Potential Evapotranspiration (mm)', fontsize=18)
plt.ylabel('TRS (kg/ton)', fontsize=18)
plt.show()
  • There appears to be a positive relationship between TRS and PET for DOY between 170 and 260.
  • The rest of the year shows a much weaker relationship.

Daily GPM precipitation with/without GFS weather forecast versus Sugar Content

Daily precipitation is another weather variable that has proved important to the sugar processing yield. GPM provides measured daily weather data, while GFS provides daily forecasted weather data, including precipitation. Since GFS forecasts precipitation up to about two weeks ahead of today, incorporating it should improve predictions of sugar yield for those future periods.

In [25]:
entities = {}
entities['item_id'] =  10081
entities['metric_id'] = 2100031
entities['frequency_id'] = 1 # daily
entities['source_id'] = 126
entities['region_id'] = region_id

for data_series in model.get_data_series(**entities):
    if data_series['region_id']==region_id:
        model.add_single_data_series(data_series)
In [26]:
model.get_df()
raw_gpm = model._data_frame[model._data_frame['metric_id'] == 2100031]
raw_gpm.loc[:, date_index] = pd.to_datetime(raw_gpm[date_index], utc=True)

raw_gpm.loc[:, 'doy'] = raw_gpm[date_index].dt.dayofyear
raw_gpm.loc[:, 'year'] = raw_gpm[date_index].dt.year
raw_gpm.tail()
Out[26]:
start_date end_date value unit_id input_unit_id input_unit_scale reporting_date metric_id item_id region_id partner_region_id frequency_id source_id doy year
7658 2021-05-20 00:00:00+00:00 2021-05-20 00:00:00+00:00 0.011674 2.0 2.0 1 NaT 2100031 2039 10408 0 1 126 140 2021
7659 2021-05-21 00:00:00+00:00 2021-05-21 00:00:00+00:00 0.021535 2.0 2.0 1 NaT 2100031 2039 10408 0 1 126 141 2021
7660 2021-05-22 00:00:00+00:00 2021-05-22 00:00:00+00:00 15.975290 2.0 2.0 1 NaT 2100031 2039 10408 0 1 126 142 2021
7661 2021-05-23 00:00:00+00:00 2021-05-23 00:00:00+00:00 2.444301 2.0 2.0 1 NaT 2100031 2039 10408 0 1 126 143 2021
7662 2021-05-24 00:00:00+00:00 2021-05-24 00:00:00+00:00 0.490827 2.0 2.0 1 NaT 2100031 2039 10408 0 1 126 144 2021
  • Similar to PET, GPM is also a daily time step data source, so we follow the same procedure to aggregate (sum) it to the same periods as the TRS dataset.
In [27]:
gpm_agg = raw_gpm[['doy','year', 'value']].merge(df_within_range[['DOY','Year','TRS']],
                        right_on=['DOY','Year'], left_on= ['doy','year'], how='left')

gpm_agg = gpm_agg[gpm_agg['year'] >= START_YEAR][gpm_agg['doy']>=100][gpm_agg['doy']<=350]


gpm_grouped = gpm_agg.fillna(method='bfill').groupby(['DOY','year']).sum().reset_index()

gpm_y = df_within_range.merge(gpm_grouped[['DOY', 'value', 'year']], 
                               left_on=['DOY','Year'], right_on=['DOY', 'year'], how='left')
gpm_y.tail()
Out[27]:
start_date end_date TRS unit_id input_unit_id input_unit_scale reporting_date metric_id item_id region_id partner_region_id frequency_id source_id DOY Year value year
215 2020-11-01 00:00:00+00:00 2020-11-15 00:00:00+00:00 158.06 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 320 2020 45.406355 2020
216 2020-11-16 00:00:00+00:00 2020-11-30 00:00:00+00:00 156.76 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 335 2020 51.885940 2020
217 2021-04-01 00:00:00+00:00 2021-04-15 00:00:00+00:00 101.49 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 105 2021 0.416071 2021
218 2021-04-16 00:00:00+00:00 2021-04-30 00:00:00+00:00 118.60 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 120 2021 20.128966 2021
219 NaT 2021-05-15 00:00:00+00:00 NaN NaN NaN NaN NaT NaN NaN NaN NaN NaN NaN 135 2021 12.941844 2021
  • Next, we use a scatter plot to show the relationship between TRS and GPM precipitation.
In [28]:
fig, ax = plt.subplots(figsize=(12, 8))
        
gpm_y.plot(x='value', y='TRS', c='DOY', cmap='rainbow', kind='scatter', ax=ax, fontsize=18)
plt.xlabel('GPM precipitation (mm)', fontsize=18)
plt.ylabel('TRS (kg/ton)', fontsize=18)
plt.show()
  • As shown in the scatter plot above, there appears to be a mild negative relationship between TRS and GPM precipitation in the period following DOY 300 and in the period preceding DOY 200.

Next, we obtain GFS data using the same method as GPM.

In [29]:
entities = {}
entities['item_id'] =  10081
entities['metric_id'] = 2100031
entities['frequency_id'] = 1 # daily
entities['source_id'] = 105   #GFS
entities['region_id'] = region_id

for data_series in model.get_data_series(**entities):
    if data_series['region_id']==region_id:
        model.add_single_data_series(data_series)
In [30]:
model.get_df()
raw_gfs = model._data_frame[(model._data_frame['metric_id'] == 2100031) & (
    model._data_frame['item_id'] == 2039)]
raw_gfs[date_index] = pd.to_datetime(raw_gfs[date_index], utc=True)

raw_gfs.loc[:, 'doy'] = raw_gfs[date_index].dt.dayofyear
raw_gfs.loc[:, 'year'] = raw_gfs[date_index].dt.year

raw_gfs.describe()
Out[30]:
value unit_id input_unit_id input_unit_scale metric_id item_id region_id partner_region_id frequency_id source_id doy year
count 8366.000000 8366.0 8366.0 8366.0 8366.0 8366.0 8366.0 8366.0 8366.0 8366.000000 8366.000000 8366.000000
mean 4.202691 2.0 2.0 1.0 2100031.0 2039.0 10408.0 0.0 1.0 124.235357 183.214559 2011.209300
std 6.909359 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.826456 105.679660 6.390524
min 0.000000 2.0 2.0 1.0 2100031.0 2039.0 10408.0 0.0 1.0 105.000000 1.000000 2000.000000
25% 0.047539 2.0 2.0 1.0 2100031.0 2039.0 10408.0 0.0 1.0 126.000000 91.000000 2006.000000
50% 0.929760 2.0 2.0 1.0 2100031.0 2039.0 10408.0 0.0 1.0 126.000000 184.000000 2011.000000
75% 5.842109 2.0 2.0 1.0 2100031.0 2039.0 10408.0 0.0 1.0 126.000000 275.000000 2017.000000
max 76.834189 2.0 2.0 1.0 2100031.0 2039.0 10408.0 0.0 1.0 126.000000 366.000000 2021.000000
  • Similar to GPM, we can aggregate (sum) GFS data to the same periods as the TRS dataset.
In [31]:
gfs_agg = raw_gfs[['doy','year', 'value']].merge(df_within_range[['DOY','Year','TRS']],
                        right_on=['DOY','Year'], left_on= ['doy','year'], how='left')

gfs_agg = gfs_agg[gfs_agg['year'] >= START_YEAR][gfs_agg['doy']>=100][gfs_agg['doy']<=350]


gfs_grouped = gfs_agg.fillna(method='bfill').groupby(['DOY','year']).sum().reset_index()

gfs_y = df_within_range.merge(gfs_grouped[['DOY', 'value', 'year']], 
                               left_on=['DOY','Year'], right_on=['DOY', 'year'], how='left')
gfs_y.tail()
Out[31]:
start_date end_date TRS unit_id input_unit_id input_unit_scale reporting_date metric_id item_id region_id partner_region_id frequency_id source_id DOY Year value year
215 2020-11-01 00:00:00+00:00 2020-11-15 00:00:00+00:00 158.06 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 320 2020 86.191647 2020
216 2020-11-16 00:00:00+00:00 2020-11-30 00:00:00+00:00 156.76 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 335 2020 122.190925 2020
217 2021-04-01 00:00:00+00:00 2021-04-15 00:00:00+00:00 101.49 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 105 2021 2.643811 2021
218 2021-04-16 00:00:00+00:00 2021-04-30 00:00:00+00:00 118.60 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 120 2021 42.933008 2021
219 NaT 2021-05-15 00:00:00+00:00 NaN NaN NaN NaN NaT NaN NaN NaN NaN NaN NaN 135 2021 25.036564 2021
  • Next, we use a scatter plot to show the relationship between TRS and GFS precipitation.
In [32]:
fig, ax = plt.subplots(figsize=(12, 8))
        
gfs_y.plot(x='value', y='TRS', c='DOY', cmap='rainbow', kind='scatter', ax=ax, fontsize=18)
plt.xlabel('GFS precipitation (mm)', fontsize=18)
plt.ylabel('TRS (kg/ton)', fontsize=18)
plt.show()
  • As shown in the scatter plot above, there appears to be a mild negative relationship between TRS and GFS precipitation in the period preceding DOY 200. Unlike GPM, it does not show a negative relationship for the period after DOY 300.

Now, we merge GFS with GPM. For future dates where actual GPM precipitation data is not yet available, we fill in the forecasted precipitation from GFS.

In [33]:
raw_gpm1 = raw_gpm[['year','doy','value']].rename(columns={'value':'gpm'})
raw_gfs1 = raw_gfs[['year','doy','value']].rename(columns={'value':'gfs'})

df = raw_gfs1.merge(raw_gpm1, left_on=['doy','year'], right_on=['doy', 'year'], how='left')
df['value']=df['gpm'].fillna(df.gfs)
df[-24:]
Out[33]:
year doy gfs gpm value
8342 2021 137 3.510115e-03 0.004293 4.292619e-03
8343 2021 138 2.017481e-01 0.043577 4.357659e-02
8344 2021 139 5.097325e-03 0.003405 3.405195e-03
8345 2021 140 1.698443e-04 0.011674 1.167423e-02
8346 2021 141 0.000000e+00 0.021535 2.153517e-02
8347 2021 142 3.901263e+00 15.975290 1.597529e+01
8348 2021 143 3.836230e+00 2.444301 2.444301e+00
8349 2021 144 2.358494e-02 0.490827 4.908269e-01
8350 2021 145 7.987973e-07 NaN 7.987973e-07
8351 2021 146 1.067412e-01 NaN 1.067412e-01
8352 2021 147 1.089477e-01 NaN 1.089477e-01
8353 2021 148 2.270501e-01 NaN 2.270501e-01
8354 2021 149 2.412948e+00 NaN 2.412948e+00
8355 2021 150 5.002596e+00 NaN 5.002596e+00
8356 2021 151 1.494073e+01 NaN 1.494073e+01
8357 2021 152 3.417473e+00 NaN 3.417473e+00
8358 2021 153 1.213291e-01 NaN 1.213291e-01
8359 2021 154 8.267987e+00 NaN 8.267987e+00
8360 2021 155 1.456630e+00 NaN 1.456630e+00
8361 2021 156 1.722801e-02 NaN 1.722801e-02
8362 2021 157 3.529386e-03 NaN 3.529386e-03
8363 2021 158 2.765836e-04 NaN 2.765836e-04
8364 2021 159 4.765325e-04 NaN 4.765325e-04
8365 2021 160 5.851390e-03 NaN 5.851390e-03
  • At the time this notebook was run, GPM had data up to DOY 144 of 2021 (the day before the run date), while GFS extended to DOY 160, 16 days further.
  • For the days where both are available, the predicted precipitation (GFS) typically differs from the measured precipitation (GPM); there is no consistent pattern relating the two series.
  • Beyond the last GPM date, the predicted precipitation is near zero on some days, which may suggest that GFS does not have enough information to make a prediction that far ahead.

At this point, both the GPM and GFS data are ready, and you can run models with or without the GFS forecast data. Note that we are not using GFS to replace GPM; rather, we use GFS data only for the future dates where GPM data are not yet available.

In [34]:
gfs_agg = df[['doy','year', 'value']].merge(df_within_range[['DOY','Year','TRS']],
                        right_on=['DOY','Year'], left_on= ['doy','year'], how='left')

gfs_agg = gfs_agg[gfs_agg['year'] >= START_YEAR][gfs_agg['doy']>=100][gfs_agg['doy']<=350]


gfs_grouped = gfs_agg.fillna(method='bfill').groupby(['DOY','year']).sum().reset_index()

gfs_y = df_within_range.merge(gfs_grouped[['DOY', 'value', 'year']], 
                               left_on=['DOY','Year'], right_on=['DOY', 'year'], how='left')
gfs_y.tail()
Out[34]:
start_date end_date TRS unit_id input_unit_id input_unit_scale reporting_date metric_id item_id region_id partner_region_id frequency_id source_id DOY Year value year
215 2020-11-01 00:00:00+00:00 2020-11-15 00:00:00+00:00 158.06 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 320 2020 90.812711 2020
216 2020-11-16 00:00:00+00:00 2020-11-30 00:00:00+00:00 156.76 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 335 2020 103.771880 2020
217 2021-04-01 00:00:00+00:00 2021-04-15 00:00:00+00:00 101.49 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 105 2021 0.832142 2021
218 2021-04-16 00:00:00+00:00 2021-04-30 00:00:00+00:00 118.60 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 120 2021 40.257933 2021
219 NaT 2021-05-15 00:00:00+00:00 NaN NaN NaN NaN NaT NaN NaN NaN NaN NaN NaN 135 2021 25.883688 2021

Daily LST temperature versus Sugar Content

Land surface temperature (LST) is another temperature data source reported at a daily time step.

In [35]:
entities = {}
entities['item_id'] =  3457
entities['metric_id'] = 2540047
entities['region_id'] = region_id
entities['frequency_id']= 1

for data_series in model.get_data_series(**entities):
    if data_series['region_id'] == region_id :
        model.add_single_data_series(data_series)
  • Again, we aggregate (average) the LST temperature data to the same periods as the TRS sugar processing yield data.
In [36]:
model.get_df()
raw_lst = model._data_frame[model._data_frame['metric_id'] == entities['metric_id']]
print(raw_lst.end_date.unique())
raw_lst.loc[:, date_index] = pd.to_datetime(raw_lst[date_index], utc=True)

raw_lst.loc[:, 'doy'] = raw_lst[date_index].dt.dayofyear
raw_lst.loc[:, 'year'] = raw_lst[date_index].dt.year

raw_lst.describe()
<DatetimeArray>
['2000-02-25 00:00:00+00:00', '2000-02-26 00:00:00+00:00',
 '2000-02-27 00:00:00+00:00', '2000-02-28 00:00:00+00:00',
 '2000-02-29 00:00:00+00:00', '2000-03-01 00:00:00+00:00',
 '2000-03-02 00:00:00+00:00', '2000-03-03 00:00:00+00:00',
 '2000-03-04 00:00:00+00:00', '2000-03-05 00:00:00+00:00',
 ...
 '2021-05-13 00:00:00+00:00', '2021-05-14 00:00:00+00:00',
 '2021-05-15 00:00:00+00:00', '2021-05-16 00:00:00+00:00',
 '2021-05-17 00:00:00+00:00', '2021-05-18 00:00:00+00:00',
 '2021-05-19 00:00:00+00:00', '2021-05-20 00:00:00+00:00',
 '2021-05-21 00:00:00+00:00', '2021-05-24 00:00:00+00:00']
Length: 6193, dtype: datetime64[ns, UTC]
Out[36]:
value unit_id input_unit_id input_unit_scale metric_id item_id region_id partner_region_id frequency_id source_id doy year
count 6193.000000 6193.0 6193.0 6193.0 6193.0 6193.0 6193.0 6193.0 6193.0 6193.0 6193.000000 6193.000000
mean 28.117894 36.0 36.0 1.0 2540047.0 3457.0 10408.0 0.0 1.0 26.0 180.877119 2010.347651
std 4.133873 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 99.534863 6.149391
min 11.534710 36.0 36.0 1.0 2540047.0 3457.0 10408.0 0.0 1.0 26.0 1.000000 2000.000000
25% 25.448933 36.0 36.0 1.0 2540047.0 3457.0 10408.0 0.0 1.0 26.0 98.000000 2005.000000
50% 28.170784 36.0 36.0 1.0 2540047.0 3457.0 10408.0 0.0 1.0 26.0 180.000000 2010.000000
75% 30.852866 36.0 36.0 1.0 2540047.0 3457.0 10408.0 0.0 1.0 26.0 262.000000 2016.000000
max 42.723178 36.0 36.0 1.0 2540047.0 3457.0 10408.0 0.0 1.0 26.0 366.000000 2021.000000
In [37]:
lst_agg = raw_lst[['end_date','value', 'doy','year']].merge(df_within_range[['end_date','DOY']],
                        on='end_date' , how='left').drop_duplicates()
lst_agg = lst_agg[lst_agg['year'] >= START_YEAR][lst_agg['doy']>=100][lst_agg['doy']<=350]


lst_grouped = lst_agg.fillna(method='bfill').groupby(['DOY','year']).mean().reset_index()

lst_y = df_within_range.merge(lst_grouped[['DOY', 'value', 'year']], 
                               left_on=['DOY','Year'], right_on=['DOY', 'year'], how='left')
lst_y.tail()
Out[37]:
start_date end_date TRS unit_id input_unit_id input_unit_scale reporting_date metric_id item_id region_id partner_region_id frequency_id source_id DOY Year value year
215 2020-11-01 00:00:00+00:00 2020-11-15 00:00:00+00:00 158.06 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 320 2020 35.712360 2020.0
216 2020-11-16 00:00:00+00:00 2020-11-30 00:00:00+00:00 156.76 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 335 2020 NaN NaN
217 2021-04-01 00:00:00+00:00 2021-04-15 00:00:00+00:00 101.49 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 105 2021 30.107659 2021.0
218 2021-04-16 00:00:00+00:00 2021-04-30 00:00:00+00:00 118.60 867.0 867.0 1.0 NaT 6881050.0 7539.0 10408.0 0.0 23.0 47.0 120 2021 NaN NaN
219 NaT 2021-05-15 00:00:00+00:00 NaN NaN NaN NaN NaT NaN NaN NaN NaN NaN NaN 135 2021 26.650294 2021.0
  • Again, we plot a scatter plot of TRS versus LST to visualize the relationship between total recoverable sugar and the land surface temperature.
In [38]:
fig, ax = plt.subplots(figsize=(12, 8))
        
lst_y.plot(x='value', y='TRS', c='DOY', cmap='rainbow', kind='scatter', ax=ax, fontsize=18)
plt.xlabel('LST temperature (degrees celsius)', fontsize=18)
plt.ylabel('TRS (kg/ton)', fontsize=18)
plt.show()
  • The LST appears to be positively correlated with the TRS, especially for the DOY between 170 and 270

Merge the variables together and assess the relationships among them

In [39]:
pet_y.rename(columns={'value': 'pet'}, inplace=True)
gpm_y.rename(columns={'value': 'gpm'}, inplace=True)
lst_y.rename(columns={'value': 'lst'}, inplace=True)
pet_y.DOY = pet_y.DOY.astype(int)
gpm_y.DOY = gpm_y.DOY.astype(int)
lst_y.DOY = lst_y.DOY.astype(int)
df_tog = ndvi_y.merge(pet_y[['pet', 'DOY','year']], on=['DOY', 'year'], how='left'
                    ).merge(gpm_y[['gpm', 'DOY','year']], on=['DOY', 'year'], how='left'
                    ).merge(lst_y[['lst', 'DOY','year']], on=['DOY', 'year'], how='left'
                    )

df_tog.sort_values(['year','DOY'], ascending=True).tail()
Out[39]:
DOY Year TRS end_date doy year ndvi pet gpm lst
215 320 2020 158.06 2020-11-15 00:00:00+00:00 320 2020 0.538198 84.400187 45.406355 35.712360
216 335 2020 156.76 2020-11-30 00:00:00+00:00 335 2020 0.583034 89.291531 51.885940 NaN
217 105 2021 101.49 2021-04-15 00:00:00+00:00 105 2021 0.681949 24.885540 0.416071 30.107659
218 120 2021 118.60 2021-04-30 00:00:00+00:00 120 2021 0.645152 53.653951 20.128966 NaN
219 135 2021 NaN 2021-05-15 00:00:00+00:00 135 2021 0.621904 49.032044 12.941844 26.650294
In [40]:
corr = df_tog[['ndvi','pet','lst', 'gpm','TRS']].corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated/removed in recent NumPy
mask[np.triu_indices_from(mask)] = True
sns.set(font_scale=2) 
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce36b9d590>
  • NDVI is not correlated with precipitation and is negatively correlated with LST and PET.
  • PET, as a modeled product with temperature and precipitation as major inputs, is positively correlated with LST and GPM.
  • Temperature appears to be mildly positively correlated with precipitation, because the dry season corresponds to the winter (colder) season in the southern hemisphere, where Sao Paulo is located.

Cluster the points to find the distinct periods during the season

  • The previous scatter plots between TRS and the weather variables revealed three clusters of DOY periods.
  • Because TRS responds differently to weather conditions at different times of the year, it is appropriate to model those three periods separately.
  • First, we need to clearly define the boundaries of those three clusters; KMeans is a simple clustering method for doing so.
In [41]:
X = np.array(df_tog[['ndvi','pet','lst', 'gpm','TRS','DOY']].dropna())
f_names = ['ndvi','pet','lst', 'gpm','TRS','DOY']
kmeans = KMeans(3)
kmeans.fit(X)
# save new clusters for chart
y_km = kmeans.fit_predict(X)
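As an optional variant (not what this notebook does), the features could be standardized before clustering, since they sit on very different scales (NDVI roughly 0-1, precipitation up to about 160 mm); whether this changes the period boundaries found below is worth checking. The sketch reuses the X array built in the cell above.

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X_scaled = StandardScaler().fit_transform(X)                        # zero mean, unit variance per feature
y_km_scaled = KMeans(n_clusters=3, random_state=0).fit_predict(X_scaled)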
  • Add a variable named group to record the KMeans clustering result, as printed below
In [42]:
data_decision = pd.DataFrame(X)
data_decision.columns = f_names
data_decision['group'] = y_km
data_decision.head()
Out[42]:
ndvi pet lst gpm TRS DOY group
0 0.679789 25.817958 25.110343 28.980030 104.10 136.0 2
1 0.620524 26.142522 23.242788 7.257239 127.81 182.0 0
2 0.591495 32.461089 23.706658 1.079680 135.72 197.0 0
3 0.536887 45.161768 25.642185 9.449393 141.85 213.0 0
4 0.552193 39.370971 26.093677 148.132260 148.26 228.0 1
In [43]:
print(data_decision.groupby('group').min())
data_decision.groupby('group').max()
           ndvi        pet        lst       gpm     TRS    DOY
group                                                         
0      0.454647  25.977041  21.780792  0.197556  122.96  181.0
1      0.486248  37.918631  24.456103  6.612893  121.21  227.0
2      0.596103  19.757497  19.686612  0.416071   55.45  105.0
Out[43]:
ndvi pet lst gpm TRS DOY
group
0 0.673067 84.318708 33.580666 78.964679 160.21 273.0
1 0.704174 88.300020 35.712360 156.484292 172.91 349.0
2 0.727892 54.754388 31.398701 160.930850 143.27 196.0

We see that the three clusters correspond roughly to the following periods

  • period DOY 100-182
  • period DOY 196-259
  • period DOY 273-350

The cutoff DOYs between periods are therefore roughly 190 and 265; a minimal sketch of tagging each observation with its period using these cutoffs is shown below. We then visualize the relationship between all combinations of variables, colored by the clustering groups
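This sketch assumes the df_tog data frame built above; the 'period' column name and labels are illustrative, not part of the original workflow.

import pandas as pd

# bin each observation into one of the three seasonal periods using the cutoff DOYs
df_tog['period'] = pd.cut(df_tog['DOY'], bins=[0, 190, 265, 366],
                          labels=['early', 'mid', 'late'])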

In [44]:
combos = list(itertools.combinations(f_names, 2))

n_vars = len(combos)
n_rows = n_vars//2 + n_vars%2
sns.set(font_scale=1) 
alpha = 0.2
fig, axes = plt.subplots(n_rows, 2, figsize=(12, n_rows*4))
for i in range(n_vars):
    ax = axes[i//2, i % 2]
    var_1, var2 = combos[i]
    if 'DOY' in combos[i]:
        var_1 = 'DOY'
        var2 = (set(combos[i]) - {'DOY'}).pop()
    ax.scatter(X[y_km ==0,f_names.index(var_1)], X[y_km == 0,f_names.index(var2)], s=50, c='blue', alpha = alpha)
    ax.scatter(X[y_km ==1,f_names.index(var_1)], X[y_km == 1,f_names.index(var2)], s=50, c='black', alpha = alpha)
    ax.scatter(X[y_km ==2,f_names.index(var_1)], X[y_km == 2,f_names.index(var2)], s=50, c='red', alpha = alpha)
    ax.set_xlabel(var_1, fontsize=15)
    ax.set_ylabel(var2, fontsize=15)