# Anomaly Detection Tool¶

In this notebook, we outline a process for identifying anomalies in geospatial data, which can be used, for example, to set up alerts for data series. This example focuses on monitoring when rainfall or NDVI in a given area is higher or lower than usual. Below, you will find frameworks for pulling a batch of data for multiple regions, transforming that data into a more statistically meaningful format, identifying key properties of that data, and then flagging recent values that are anomalous at two-sigma significance. The code is extensible to different significance tests, different datasets, and different use cases.

## Set up the coding environment¶

In [2]:
from groclient import GroClient
from datetime import datetime, date, timedelta
import pandas as pd
import os
from groclient.lib import REGION_LEVELS

In [3]:
GROAPI_TOKEN = os.environ['GROAPI_TOKEN']
API_HOST = 'api.gro-intelligence.com'
gclient = GroClient(API_HOST, GROAPI_TOKEN)


In this first example, we'll investigate weather conditions for regions within Salta, a major soybean-producing province of Argentina. In this step, we'll identify the IDs for all of the districts within Salta. If we wish to discover anomalies in another region, we can simply change the region_id to that of our region of interest.

In [4]:
data_name = "Salta Province in Argentina Temperature Anomalies"
region_id = 10152  # This is the region id for Salta from the Gro platform
districts_list = [region['id'] for region in gclient.get_descendant_regions(region_id, REGION_LEVELS['district'])]  # Districts are region level 5 in Gro's ontology
print("district ids:")
print(districts_list)

district ids:
[102446, 102454, 102459, 102463, 102453, 102449, 102458, 102451, 102448, 102457, 102447, 102465, 102467, 102462, 102464, 102455, 102468, 102461, 102450, 102452, 102460, 102456, 102466]


## Land Temperature Anomalies¶

Since Gro can have multiple sources for a given data series, we will need a function that helps us obtain the best data series for multiple locations. We want to choose the best series every time, so we'll use the rank_series_by_source method, which returns sources ranked by data coverage.
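The for/return idiom used below can look odd at first: since rank_series_by_source yields candidates best-first, returning inside the loop keeps only the top-ranked one. A minimal sketch of the same pattern with a stand-in generator (the source_id values are made up for illustration):

```python
def ranked_candidates():
    # Stand-in for rank_series_by_source: yields candidates best-first
    yield {'source_id': 1}  # best coverage
    yield {'source_id': 7}  # weaker coverage

def pick_best(candidates):
    for candidate in candidates:
        return candidate  # returning inside the loop keeps only the top-ranked item
    return None  # empty iterator: no series available

print(pick_best(ranked_candidates()))  # {'source_id': 1}
```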

In [5]:
def best_series_picker(selection):
    '''
    Chooses the series with the greatest density and history of data points
    '''
    for series in gclient.rank_series_by_source(selection):
        return gclient.get_data_series(**series)[0]


Set up a method to pull data from Gro using the desired regions, metric, and item.

In [6]:
def get_data_for_multiple_regions(regions_list, metric_id, item_id):
    selections = []
    for region in regions_list:
        try:
            # You can get these ids from our web app at gro-intelligence.com
            selections.append(best_series_picker([{'metric_id': metric_id, 'item_id': item_id, 'region_id': region}]))
        except Exception:
            print("no data for region", region)
    output = []
    for set_of_ids in selections:
        if set_of_ids is not None:
            output.append(gclient.get_data_points(**set_of_ids))

    return output


## Time to pull data from the Gro API and examine what we have. (This may take a few seconds)¶

You can get the ids necessary to pull a data series at gro-intelligence.com by searching for the terms you need. In this case, we are going to look at Land temperature (daytime, modeled), derived from the MODIS MOD11 product and available in Gro.

In [7]:
# You can access ids in the web app at gro-intelligence.com
metric_id = 2540047 # Temperature
item_id = 3457 # Land temperature (daytime, modeled)
regions_list = districts_list
land_temperature_array = get_data_for_multiple_regions(regions_list, metric_id, item_id)
# We will use land_temperature_array later!


Check on that data by printing one data point. (If you try to print the whole thing, the notebook will crash.)

In [8]:
first_location_data = land_temperature_array[0]
most_recent_data_point = first_location_data[-1]
print(most_recent_data_point)

{'start_date': '2020-01-09T00:00:00.000Z', 'end_date': '2020-01-09T00:00:00.000Z', 'value': 38.8685747587904, 'input_unit_id': 36, 'input_unit_scale': 1, 'unit_id': 36, 'metric_id': 2540047, 'item_id': 3457, 'region_id': 102446, 'frequency_id': 1}


Let's look at what we are working with:

In [9]:
most_recent_value = (most_recent_data_point['value'])
item = gclient.lookup('items',first_location_data[0]['item_id'])['name']
unit = gclient.lookup('units',first_location_data[0]['unit_id'])['name']
region = gclient.lookup('regions',first_location_data[0]['region_id'])['name']
metric_id = first_location_data[0]['metric_id']
metric_name = gclient.lookup('metrics',metric_id)['name']
time_stamp = " on " + most_recent_data_point['end_date'][:10] + " "
frequency = gclient.lookup('frequencies',first_location_data[0]['frequency_id'])['name']
TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'
years = datetime.strptime(most_recent_data_point['end_date'], TIMESTAMP_FORMAT).year - datetime.strptime(first_location_data[0]['end_date'], TIMESTAMP_FORMAT).year

print("most recently reported value", most_recent_value)
print("item name:", item)
print("unit name:", unit)
print("name of region:", region)
print("name of metric:", metric_name)
print("last reported data point:", time_stamp)
print("frequency reported:", frequency)
print("time period (years):", years)

most recently reported value 38.8685747587904
item name: Land temperature (daytime, modeled)
unit name: Celsius
name of region: Anta
name of metric: Temperature
last reported data point:  on 2020-01-09
frequency reported: daily
time period (years): 20
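The `years` span above comes from parsing Gro's timestamp strings with TIMESTAMP_FORMAT. Here is the same calculation on two hard-coded, illustrative dates (not taken from a real series):

```python
from datetime import datetime

TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'

# Hypothetical first and last end_date values from a series
first = datetime.strptime('2000-02-24T00:00:00.000Z', TIMESTAMP_FORMAT)
last = datetime.strptime('2020-01-09T00:00:00.000Z', TIMESTAMP_FORMAT)

print(last.year - first.year)  # 20
```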


Next, we will want to give our users friendlier wording in their anomaly alerts. This is optional, but for many readers "vegetation health" is easier to interpret than "NDVI." The variable `data` below is one data series from the land_temperature_array.

In [10]:
RAINFALL_METRIC_ID = 2100031
SOIL_MOISTURE_METRIC_ID = 15531082
NDVI_METRIC_ID = 431132
ETA_METRIC_ID = 2010042
TEMP_METRIC_ID = 2540047

def check_if_geospatial(data):
    metric_id = data[0]['metric_id']
    if metric_id == RAINFALL_METRIC_ID:  # TRMM
        metric_name = '7 day cumulative rainfall'
    elif metric_id == SOIL_MOISTURE_METRIC_ID:
        metric_name = 'soil moisture'
    elif metric_id == NDVI_METRIC_ID:
        metric_name = 'vegetation health (NDVI)'
    elif metric_id == ETA_METRIC_ID:
        metric_name = 'evapotranspiration'
    elif metric_id == TEMP_METRIC_ID:
        metric_name = '7 day average temperature'
    else:
        return False, ''
    return True, metric_name

In [11]:
# Let's try it out on our first data location
is_geospatial, new_metric_name = check_if_geospatial(first_location_data)
print("geospatial data:", is_geospatial)
print("type:", new_metric_name)

geospatial data: True
type: 7 day average temperature
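The if/elif chain above can also be expressed as a dict lookup. This behavior-equivalent variant is only a sketch; the function name `check_if_geospatial_dict` is ours, and the ids are the same constants defined above:

```python
# Same metric ids as the constants defined earlier in the notebook
GEOSPATIAL_METRIC_NAMES = {
    2100031: '7 day cumulative rainfall',   # rainfall (TRMM)
    15531082: 'soil moisture',
    431132: 'vegetation health (NDVI)',
    2010042: 'evapotranspiration',
    2540047: '7 day average temperature',
}

def check_if_geospatial_dict(data):
    # Empty string for unknown metrics, mirroring the (False, '') branch above
    metric_name = GEOSPATIAL_METRIC_NAMES.get(data[0]['metric_id'], '')
    return bool(metric_name), metric_name

print(check_if_geospatial_dict([{'metric_id': 2540047}]))
# (True, '7 day average temperature')
```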


## Transform data for more significant signals¶

For some data series, we don't want to look at values as frequently as they are reported; instead, we want 7-day cumulative or 7-day average values. Here we will modify those specified series, and temperature is one of them!

In [12]:
def transform_by_data_type(df, current_value, frequency):
    '''
    Change frequency for certain weather items
    '''
    metric_id = df.loc[0, 'metric_id']
    if metric_id in (TEMP_METRIC_ID, SOIL_MOISTURE_METRIC_ID):  # averaged metrics
        df['value'] = df['value'].fillna(df['value'].mean())
        df['value'] = df['value'].rolling(7).mean()
        frequency = "7-day average"
    elif metric_id == RAINFALL_METRIC_ID:  # summed metric
        df['value'] = df['value'].fillna(0)
        df['value'] = df['value'].rolling(7).sum()
        frequency = "7-day cumulative"
    else:
        return df, current_value, frequency
    df = df.iloc[::7, :]  # keep every 7th row so the windows don't overlap
    df.index = pd.RangeIndex(len(df.index))
    current_value = df.iloc[-1]['value']
    return df, current_value, frequency

# Get percent change from average
df = pd.DataFrame(first_location_data)

df, current_value, frequency = transform_by_data_type(df, most_recent_value, frequency)

print("updated frequency:", frequency)
print("most recent transformed value:", current_value)

updated frequency: 7-day average
most recent transformed value: 31.458684808882985
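To see what the transform does without pulling any data, here is the same rolling-and-downsample logic on a toy series of 14 synthetic "daily" values. Note that the first retained row is NaN because its 7-day window is not yet full, which mirrors the real transform:

```python
import pandas as pd

# Toy data: 14 synthetic daily values 0..13
df = pd.DataFrame({'value': [float(i) for i in range(14)]})

# 7-day rolling mean, then keep every 7th row, as transform_by_data_type does
df['value'] = df['value'].rolling(7).mean()
df = df.iloc[::7, :]
df.index = pd.RangeIndex(len(df.index))

print(df['value'].tolist())  # [nan, 4.0] -- row 0 has an incomplete window
```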


## Create a significance / anomaly test¶

Now we need to find the lower and upper bounds for our significance test; we have chosen two standard deviations from the mean. We will then use those bounds to flag significant deviations from the mean, along with record highs and lows. This test could be replaced with any other test for significance.

In [13]:
def find_lower_and_upper(df, standard_devs):
    '''
    Standard deviation calculations for alerts:
    Using standard deviations we can find out whether recent numbers
    reported to the user are abnormal or to be expected.
    Returns lower and upper bounds for a series and a given number of standard deviations
    '''
    mean = df.mean()
    std = df.std()
    lower_bound = mean - (standard_devs * std)
    upper_bound = mean + (standard_devs * std)
    return lower_bound, upper_bound

df_lower_upper = df['value']
standard_deviations = 2
lower_bound, upper_bound = find_lower_and_upper(df_lower_upper,standard_deviations)

print("lower bound:", lower_bound)
print("upper bound:", upper_bound)

lower bound: 16.17364207751991
upper bound: 35.84491956150512
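As a quick sanity check of the bounds logic on data we control, here is the same mean ± 2σ computation on a toy series with one obvious outlier. (Note that pandas' `.std()` defaults to the sample standard deviation, ddof=1, which `find_lower_and_upper` inherits.)

```python
import pandas as pd

# Toy series: values near 10, plus one clear outlier
values = pd.Series([9.0, 10.0, 11.0, 10.0, 9.0, 10.0, 11.0, 30.0])

mean = values.mean()
std = values.std()  # sample standard deviation (ddof=1)
lower, upper = mean - 2 * std, mean + 2 * std

# Only the outlier falls outside the two-sigma band
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [30.0]
```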

In [14]:
def selected_tests(df, lower_bound, upper_bound, current_value):
    is_significant = False
    statement = ""
    if current_value == df["value"].max():
        is_significant = True
        statement = "record high in "
    elif current_value == df["value"].min():
        is_significant = True
        statement = "record low in "
    elif current_value <= lower_bound:
        is_significant = True
        statement = "significant low in "
    elif current_value >= upper_bound:
        is_significant = True
        statement = "significant high in "
    else:
        statement = "Not significant for period: "
    return is_significant, statement

is_significant, statement = selected_tests(df,lower_bound,upper_bound,current_value)

statement += str(years) + " years "
current_value = ((round(current_value,2)))

print("test result:", statement)

test result: Not significant for period: 20 years


We also want to contextualize this number by stating exactly what the deviation from the mean looks like.

In [15]:
def get_percent_change_statement(number, avg, metric_id):
    if metric_id == ETA_METRIC_ID:  # ETa is reported as a percent of normal
        percent_val = number / 100
        deviation = number - 100
    elif metric_id == NDVI_METRIC_ID:  # NDVI is already a difference from the 10-yr mean
        percent_val = abs(number)
        deviation = number
    else:
        deviation = number - avg
        if avg != 0:
            percent_val = abs(deviation) / abs(avg)
        else:
            return ''
    # Format string response
    if deviation >= 0:
        statement = "{0:.0%}".format(percent_val) + " above average which is a"
    else:
        statement = "{0:.0%}".format(percent_val) + " below average which is a"

    return statement

mean = df["value"].mean()
percent_change = get_percent_change_statement(current_value, mean, metric_id)

print("percent deviation:", percent_change[:4])

percent deviation: 21%
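For metrics measured in absolute units, the statement above boils down to a simple formatting rule. The helper name `format_deviation` is hypothetical, used only to illustrate the same `{0:.0%}` formatting:

```python
def format_deviation(value, avg):
    # Percent deviation from the average, formatted as in the alerts above
    deviation = value - avg
    percent_val = abs(deviation) / abs(avg)
    direction = "above" if deviation >= 0 else "below"
    return "{0:.0%} {1} average".format(percent_val, direction)

print(format_deviation(31.46, 26.0))  # 21% above average
```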


## Putting it all together¶

After walking through the process step by step, we want to add some methods so we can integrate this into our workflow and use it for many data series at a time.

In [16]:
def current_is_significant(data):
    '''
    Returns True if the latest value is a record or a two-sigma outlier,
    along with a statement describing the result
    '''
    current_value = data[-1]['value']
    frequency = gclient.lookup('frequencies', data[0]['frequency_id'])['name']

    df = pd.DataFrame(data)
    df, current_value, frequency = transform_by_data_type(df, current_value, frequency)
    current_value = df.iloc[-1]['value']

    df_lower_upper = df['value']
    standard_deviations = 2
    lower_bound, upper_bound = find_lower_and_upper(df_lower_upper, standard_deviations)

    is_significant, statement = selected_tests(df, lower_bound, upper_bound, current_value)

    # Add the length of the series, in years, to the statement
    years = datetime.strptime(data[-1]['end_date'], TIMESTAMP_FORMAT).year - datetime.strptime(data[0]['end_date'], TIMESTAMP_FORMAT).year
    statement += str(years) + " years "
    return is_significant, statement, round(current_value, 2)

def return_alerts(data):
    '''
    Alert if current data is significant in a Gro series (helper name is illustrative).
    Make sure there was data in the original request before running the rest.
    '''
    alert = ""
    # Make sure the most recent reporting date is within the last 30 days
    current_date = datetime.now()
    try:
        # Make sure there are more than 3 data points
        assert len(data) > 3
        most_recent = data[-1]['reporting_date']
        data_date = datetime.strptime(most_recent, TIMESTAMP_FORMAT)
    except (AssertionError, KeyError):
        most_recent = data[-1]['end_date']
        data_date = datetime.strptime(most_recent, TIMESTAMP_FORMAT)
    if (current_date - data_date).days >= 30:
        return ""

    # Alert if the current value is significant
    bin_val1, statement1, val1 = current_is_significant(data)
    if bin_val1:
        alert += " " + statement1

    ''' You can add tests for seasonal significance, tests for alternate distributions, etc.
    bin_val2, statement2, val2 = YOUR_TEST_HERE(data)
    if bin_val2:
        alert += " " + statement2
    '''
    return alert

def string_formatting(data, number):
    try:
        # Make sure there was data returned
        most_recent_value = data[number]['value']

        # Provide metadata for the reader from the Gro API client and format into a human-readable string
        item = gclient.lookup('items', data[0]['item_id'])['name']
        unit = gclient.lookup('units', data[0]['unit_id'])['name']
        region = gclient.lookup('regions', data[0]['region_id'])['name']
        metric_id = data[0]['metric_id']
        metric_name = gclient.lookup('metrics', metric_id)['name']
        time_stamp = " on " + data[number]['end_date'][:10] + " "
        frequency = gclient.lookup('frequencies', data[0]['frequency_id'])['name']

    except (TypeError, KeyError):
        print("type error or key error in string_formatting")
        return ""

    # Get percent change from average
    df = pd.DataFrame(data)
    df, current_value, frequency = transform_by_data_type(df, most_recent_value, frequency)
    mean = df["value"].mean()
    percent_change = get_percent_change_statement(most_recent_value, mean, metric_id)

    # Check if this is a geospatial data series
    is_geospatial, new_metric_name = check_if_geospatial(data)

    # If this IS a geospatial metric
    if is_geospatial:
        # Create string
        significance_headline = region + " " + new_metric_name + ": " + percent_change

    # If this is NOT a geospatial metric
    else:
        # Make sure the value exists
        if most_recent_value:
            most_recent_value = round(most_recent_value, 2)
        else:
            most_recent_value = 0

        # Add commas and get rid of decimals
        if most_recent_value > 100:
            most_recent_value = format(int(most_recent_value), ',d')
        else:
            most_recent_value = str(most_recent_value)
        # Create string
        item_metric = item + " " + metric_name
        significance_headline = region + " " + item_metric + ": " + most_recent_value + " " + unit + ", " + percent_change
    return significance_headline

def work_to_distribute(data):
    '''
    Method run by workers; this runs each of the tests in return_alerts for one series
    '''
    data_line_2 = return_alerts(data)
    if data_line_2 != "":
        data_line_1 = string_formatting(data, -1)
        return data_line_1 + data_line_2 + "\n"
    return ""

def compute_amomalies(array, data_name):
    '''
    Takes an array of data series and a category name and returns any alerts,
    string-formatted under a header
    '''
    string = ""
    string_add = ""
    string_list = []
    if array is not None:
        try:
            for ele in array:
                string_list.append(work_to_distribute(ele))
        except IndexError:
            print("IndexError in compute_amomalies for " + data_name)

        for x in string_list:
            string_add += x
        if string_add:
            string = data_name

    data_statement = string + " \n" + string_add

    return data_statement

argentina_string = compute_amomalies(land_temperature_array, data_name)
print("argentina_string:")
print(argentina_string)

argentina_string:



## NDVI Anomalies in Brazil¶

Now that we have our alerts from Argentina about land temperature (if the cell above prints nothing, there were no alerts for significant temperature in Argentina), we will look at NDVI (Normalized Difference Vegetation Index) anomalies in Brazil as well. Our data source for NDVI allows us to estimate the amount of chlorophyll in plants in a given region.

In [17]:
# Tocantins Province in Brazil
tocantins_province_id = 10428
districts_list_brazil = [region['id'] for region in gclient.get_descendant_regions(tocantins_province_id, REGION_LEVELS['district'])]
print(districts_list_brazil)

[110117, 110031, 110130, 110097, 110120, 110059, 110124, 110047, 110041, 110118, 110055, 110061, 110084, 110082, 110096, 110077, 110109, 110100, 109997, 110045, 110021, 110074, 109998, 110022, 110024, 110086, 110001, 109995, 110048, 110064, 110004, 110098, 110016, 110037, 110010, 110078, 110107, 109999, 110051, 110017, 110091, 110116, 110023, 110131, 110089, 110065, 110088, 110028, 110002, 110090, 110115, 110092, 110018, 110104, 110052, 110056, 110011, 110106, 110085, 110083, 110035, 110122, 110087, 110121, 110006, 109996, 110068, 110014, 110034, 110057, 110066, 110102, 110042, 110071, 110039, 110053, 110040, 110007, 110094, 110049, 110030, 110073, 110029, 110129, 110076, 110105, 110075, 110067, 110062, 110113, 110114, 110000, 110020, 110046, 110050, 110063, 110101, 110110, 110128, 110036, 110093, 110069, 110012, 110003, 110103, 110080, 110127, 110099, 110060, 110015, 110008, 110033, 110070, 110027, 110095, 110079, 110112, 110019, 110013, 110005, 110125, 110111, 110108, 110072, 110119, 110126, 110032, 110026, 110054, 110123, 110025, 110009, 110044, 110038, 110043, 110081, 110058, 100022399, 100022400, 100022401]


## Pull Tocantins province NDVI and analyze for significance (this may take a minute)¶

In [18]:
metric_id = 431132  # Vegetation difference from 10-yr mean (2001-2010)
item_id = 321 # Vegetation (NDVI)

ndvi_array = get_data_for_multiple_regions(districts_list_brazil, metric_id, item_id)
brazil_string = compute_amomalies(ndvi_array, "Tocantins Province in Brazil NDVI Anomalies")

print(brazil_string)

No Content
No Content

Tocantins Province in Brazil NDVI Anomalies
Wanderlândia vegetation health (NDVI): 11% above average which is a significant high in 20 years
Porto Nacional vegetation health (NDVI): 7% above average which is a significant high in 20 years
Lajedão vegetation health (NDVI): 9% above average which is a significant high in 20 years
Almas vegetation health (NDVI): 7% above average which is a significant high in 20 years
Angico vegetation health (NDVI): 10% above average which is a significant high in 20 years
Novo Acordo vegetation health (NDVI): 9% above average which is a significant high in 20 years
Xambioá vegetation health (NDVI): 9% above average which is a significant high in 19 years
Caseara vegetation health (NDVI): 7% above average which is a significant high in 20 years
Riachinho vegetation health (NDVI): 9% above average which is a significant high in 20 years
Dianopolis vegetation health (NDVI): 12% above average which is a significant high in 20 years
Nazaré vegetation health (NDVI): 9% above average which is a significant high in 20 years
Santa Terezinha do Tocantins vegetation health (NDVI): 11% above average which is a significant high in 20 years
São Bento do Tocantins vegetation health (NDVI): 9% above average which is a significant high in 20 years
Ananás vegetation health (NDVI): 8% above average which is a significant high in 20 years
Itaguatins vegetation health (NDVI): 10% above average which is a significant high in 20 years
Oliveira de Fátima vegetation health (NDVI): 7% above average which is a significant high in 19 years