Anomaly Detection Tool

In this notebook, we outline a process for identifying anomalies in geospatial data, which can be used, for example, to set up alerts for data series. This example focuses on monitoring when the amount of rainfall or NDVI in a certain area is higher or lower than usual. Below, you can find some helpful frameworks for pulling a batch of data for multiple regions, transforming that data into a form that gives stronger statistical signals, identifying key properties of that data, and then flagging recent values that are anomalous at two-sigma significance. This code is extendable to include different tests for significance, different data sets, and different use cases.

Set up the coding environment

In [2]:
from groclient import GroClient
from datetime import datetime, date, timedelta
import pandas as pd
import os
from groclient.lib import REGION_LEVELS
In [3]:
GROAPI_TOKEN = os.environ['GROAPI_TOKEN']
API_HOST = 'api.gro-intelligence.com'
gclient = GroClient(API_HOST, GROAPI_TOKEN)

In this first example, we'll investigate weather conditions for regions within Salta, a major soybean-producing province of Argentina. In this step, we'll identify the IDs for all of the districts within Salta. If we wish to discover anomalies in another region, we can simply change the region_id to that of our region of interest.

In [4]:
data_name = "Salta Province in Argentina Temperature Anomalies"
region_id = 10152 # This is the region id for salta from the gro platform
districts_list = [region['id'] for region in gclient.get_descendant_regions(region_id, REGION_LEVELS['district'])] #districts are region level 5 in Gro's ontology
print("district ids:")
print(districts_list)
district ids:
[102446, 102454, 102459, 102463, 102453, 102449, 102458, 102451, 102448, 102457, 102447, 102465, 102467, 102462, 102464, 102455, 102468, 102461, 102450, 102452, 102460, 102456, 102466]

Land Temperature Anomalies

Since Gro can have multiple sources for a given data series, we will need a function to help us obtain the best data series for each location. We'll use the rank_series_by_source method to retrieve the source with the best data coverage and take its top result.

In [5]:
def best_series_picker(selection):
    '''
    Chooses the series with the greatest density and history of data points
    by taking the top-ranked result from rank_series_by_source.
    Returns None when no source covers the selection.
    '''
    for series in gclient.rank_series_by_source(selection):
        return gclient.get_data_series(**series)[0]
    return None
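
For example, a hypothetical one-off call for the first Salta district listed above (region 102446), using the same temperature metric and item IDs that appear later in this notebook, would look like this:

best = best_series_picker([{'metric_id': 2540047, 'item_id': 3457, 'region_id': 102446}])
print(best)  # a dict of series identifiers, or None if no source covers the selection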

Set up a method to pull data from Gro using the desired regions, metric, and item.

In [6]:
def get_data_for_multiple_regions(regions_list, metric_id, item_id):
    # You can get these ids from our web app at app.gro-intelligence.com
    selections = []
    for region in regions_list:
        try:
            selections.append(best_series_picker([{'metric_id': metric_id, 'item_id': item_id, 'region_id': region}]))
        except Exception:
            print("no data for region", region)
    output = []
    for set_of_ids in selections:
        if set_of_ids is not None:
            output.append(gclient.get_data_points(**set_of_ids))

    return output

Time to pull data from the Gro API and examine what we have. (This may take a few seconds)

You can get the IDs necessary to pull a data series at gro-intelligence.com by searching for the terms you need. In this case we are going to look at Land temperature (daytime, modeled) from the MOD11 satellite data hosted on Gro.

In [7]:
# You can access ids in the web app at gro-intelligence.com
metric_id = 2540047 # Temperature
item_id = 3457 # Land temperature (daytime, modeled)
regions_list = districts_list
land_temperature_array = get_data_for_multiple_regions(regions_list, metric_id, item_id) 
# We will use land_temperature_array later!

Check on that data by printing one data point! (Printing the whole array may crash the notebook.)

In [8]:
first_location_data = land_temperature_array[0]
most_recent_data_point = first_location_data[-1]
print(most_recent_data_point)
{'start_date': '2020-01-09T00:00:00.000Z', 'end_date': '2020-01-09T00:00:00.000Z', 'value': 38.8685747587904, 'input_unit_id': 36, 'input_unit_scale': 1, 'unit_id': 36, 'metric_id': 2540047, 'item_id': 3457, 'region_id': 102446, 'frequency_id': 1}

Let's look at what we are working with:

In [9]:
most_recent_value = (most_recent_data_point['value'])
item = gclient.lookup('items',first_location_data[0]['item_id'])['name']
unit = gclient.lookup('units',first_location_data[0]['unit_id'])['name']
region = gclient.lookup('regions',first_location_data[0]['region_id'])['name']
metric_id = first_location_data[0]['metric_id']
metric_name = gclient.lookup('metrics',metric_id)['name']
time_stamp = " on " + most_recent_data_point['end_date'][:10] + " "
frequency = gclient.lookup('frequencies',first_location_data[0]['frequency_id'])['name']
TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'
years = datetime.strptime(most_recent_data_point['end_date'], TIMESTAMP_FORMAT).year - datetime.strptime(first_location_data[0]['end_date'], TIMESTAMP_FORMAT).year

print("most recently reported value", most_recent_value)
print("item name:", item)
print("unit name:", unit)
print("name of region:", region)
print("name of metric:", metric_name)
print("last reported data point:", time_stamp)
print("frequency reported:", frequency)
print("time period (years):", years)
most recently reported value 38.8685747587904
item name: Land temperature (daytime, modeled)
unit name: Celsius
name of region: Anta
name of metric: Temperature
last reported data point:  on 2020-01-09 
frequency reported: daily
time period (years): 20

Next, we will want to give our users friendly wording when they read their anomaly alerts. This is optional, but for many readers "vegetation health" will help a user interpret "NDVI." The variable 'data' is one data series from the land_temperature_array.

In [10]:
RAINFALL_METRIC_ID = 2100031
SOIL_MOISTURE_METRIC_ID = 15531082
NDVI_METRIC_ID = 431132
ETA_METRIC_ID = 2010042
TEMP_METRIC_ID = 2540047

def check_if_geospatial(data):
    metric_id = data[0]['metric_id']
    if(metric_id==RAINFALL_METRIC_ID):# Trmm
        metric_name = '7 day cumulative rainfall'
    elif(metric_id==SOIL_MOISTURE_METRIC_ID):# Soil Moisture
        metric_name = 'soil moisture'
    elif(metric_id==NDVI_METRIC_ID):# NDVI
        metric_name = 'vegetation health (NDVI)'
    elif(metric_id==ETA_METRIC_ID):# ETA
        metric_name = 'evapotranspiration'
    elif(metric_id==TEMP_METRIC_ID):# Temp
        metric_name = '7 day average temperature'
    else:
        return False, ''
    return True, metric_name
In [11]:
# Let's try it out on our first data location
is_geospatial, new_metric_name = check_if_geospatial(first_location_data)
print("geospatial data:", is_geospatial)
print("type:", new_metric_name)
geospatial data: True
type: 7 day average temperature

Transform data for more significant signals

For some data series we don't want to look at values as frequently as they are reported; instead, we want to look at 7-day cumulative or 7-day average values. Here we will transform those specified series, and temperature is one of them!

In [12]:
def transform_by_data_type(df, current_value, frequency):
    '''
    Change frequency for certain weather items:
    a 7-day rolling mean for temperature and soil moisture,
    a 7-day rolling sum for rainfall.
    '''
    metric_id = df.loc[0,'metric_id']
    if(metric_id == TEMP_METRIC_ID or metric_id == SOIL_MOISTURE_METRIC_ID): # average these metrics
        df['value'] = df['value'].fillna(df['value'].mean())
        df['value'] = df['value'].rolling(7).mean()
        frequency = "7-day average"
    elif(metric_id == RAINFALL_METRIC_ID): # sum rainfall over the window
        df['value'] = df['value'].fillna(0)
        df['value'] = df['value'].rolling(7).sum()
        frequency = "7-day cumulative"
    else:
        return df, current_value, frequency
    df = df.iloc[::7, :] # keep every 7th row so the rolling windows don't overlap
    df.index = pd.RangeIndex(len(df.index))
    current_value = df.iloc[-1]['value']
    return df, current_value, frequency


# Get percent change from average
df = pd.DataFrame(first_location_data)

df, current_value, frequency = transform_by_data_type(df, most_recent_value, frequency)

print("updated frequency:", frequency)
print("most recent transformed value:", current_value)
updated frequency: 7-day average
most recent transformed value: 31.458684808882985

Create a significance / anomaly test

Now we need to find the lower and upper bounds for our significance test; we have chosen to consider two standard deviations from the mean. After that, we will use the bounds to test for significant deviations from the mean and for record highs and lows. This test could be replaced with any other test for significance.

In [13]:
def find_lower_and_upper(df, standard_devs):
    '''
    Standard deviation calculations for alerts:
    Using standard deviations we can find out whether recent numbers
    reported to the user are abnormal or to be expected.
    Returns lower and upper bounds for a series and a given number of standard deviations.
    '''
    mean = df.mean()
    std = df.std()
    lower_bound = (mean - (standard_devs * std))
    upper_bound = (mean + (standard_devs * std))
    return lower_bound, upper_bound


df_lower_upper = df['value']
standard_deviations = 2
lower_bound, upper_bound = find_lower_and_upper(df_lower_upper,standard_deviations)

print("lower bound:", lower_bound)
print("upper bound:", upper_bound)
lower bound: 16.17364207751991
upper bound: 35.84491956150512
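
As a sanity check that is independent of the Gro data, you can run find_lower_and_upper on a synthetic normal series (a hypothetical example; numpy is used here only for the simulation):

import numpy as np

synthetic = pd.Series(np.random.default_rng(0).normal(loc=30, scale=2, size=1000))
lo, hi = find_lower_and_upper(synthetic, 2)
# About 95% of normally distributed values fall within two standard deviations,
# so roughly 5% of the synthetic points should land outside [lo, hi]
print(lo, hi, ((synthetic < lo) | (synthetic > hi)).mean())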
In [14]:
def selected_tests(df,lower_bound,upper_bound,current_value):
    is_significant = False
    statement = ""
    if(current_value == df["value"].max()):
        is_significant = True
        statement = "record high in "
    elif(current_value == df["value"].min()):
        is_significant = True
        statement = "record low in "
    elif(current_value<=lower_bound):
        is_significant = True
        statement = "significant low in "
    elif(current_value>=upper_bound):
        is_significant = True
        statement = "significant high in "
    else:
        statement = "Not significant for period: "
    return is_significant, statement


is_significant, statement = selected_tests(df,lower_bound,upper_bound,current_value)

statement += str(years) + " years "
current_value = ((round(current_value,2)))

print("test result:", statement)
test result: Not significant for period: 20 years 
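
To see the other branches of selected_tests fire, you can pass a contrived current value (hypothetical numbers, for illustration only):

# A value just above the two-sigma upper bound triggers the significant-high branch
is_sig, stmt = selected_tests(df, lower_bound, upper_bound, upper_bound + 1)
print(is_sig, stmt)  # True significant high in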

We also want to contextualize this number by saying exactly what the deviation from the mean looks like.

In [15]:
def get_percent_change_statement(number, avg, metric_id):
    if(metric_id == ETA_METRIC_ID): # ETA values are percentages of the long-term mean (100 = average)
        percent_val = number/100
        deviation = number - 100
    elif(metric_id == NDVI_METRIC_ID): # NDVI values are already fractional differences from the 10-yr mean
        percent_val = abs(number)
        deviation = number
    else:
        deviation = (number) - (avg)
        if(avg!=0):
            percent_val = (abs(deviation) / abs(avg))
        else:
            return ''
    # Format string response
    if(deviation>=0):
        statement = str("{0:.0%}".format(percent_val)) + " above average which is a"
    else:
        statement = str("{0:.0%}".format(percent_val)) + " below average which is a"

    return statement

mean = df["value"].mean()
percent_change = get_percent_change_statement(current_value, mean, metric_id)

print("percent deviation:", percent_change[:4])
percent deviation: 21% 
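
Note the trailing "which is a" fragment: it is built to be concatenated with the significance statement from selected_tests. A quick check with made-up numbers (and a hypothetical metric_id of 0 to hit the generic branch):

print(get_percent_change_statement(110, 100, 0))
# -> 10% above average which is a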

Putting it all together

After walking through the process, we want to add some methods so we can integrate this into our workflow and run it over many data series at a time.

In [16]:
def current_is_significant(data):
    '''
    Returns True if the latest transformed value is significant:
    a record high or low, or outside the two-sigma bounds.
    '''
    current_value = (data[-1]['value'])
    frequency = gclient.lookup('frequencies',data[0]['frequency_id'])['name']

    df = pd.DataFrame(data)
    df, current_value, frequency = transform_by_data_type(df, current_value, frequency)
    current_value = df.iloc[-1]['value']

    df_lower_upper = df['value']
    standard_deviations = 2
    lower_bound, upper_bound = find_lower_and_upper(df_lower_upper,standard_deviations)
    
    is_significant, statement = selected_tests(df,lower_bound,upper_bound,current_value)
    
    # Get frequency
    years = datetime.strptime(data[-1]['end_date'], TIMESTAMP_FORMAT).year - datetime.strptime(data[0]['end_date'], TIMESTAMP_FORMAT).year
    statement += str(years) + " years "
    return is_significant, statement, ((round(current_value,2)))


def return_alerts(data):
    '''
    Alert if current data is significant in a Gro series.
    Skips series whose most recent data point is more than 30 days old.
    '''
    alert = ""
    current_date = datetime.now()
    try:
        # Make sure there are more than 3 data points
        assert(len(data) > 3)
        most_recent = data[-1]['reporting_date']
    except (AssertionError, KeyError):
        # Fall back to end_date when reporting_date is unavailable
        most_recent = data[-1]['end_date']
    data_date = datetime.strptime(most_recent, TIMESTAMP_FORMAT)
    delta = current_date - data_date
    if(delta.days >= 30):
        return ""
    
    # Alert if current is max
    bin_val1,statement1,val1 = current_is_significant(data)
    if(bin_val1):
        alert += " " + statement1
    
    ''' You can add tests for seasonal significance, tests for alternate distributions etc.
    bin_val2,statement2,val2 = YOUR_TEST_HERE(data)  
    if(bin_val2):
        alert += " " + statement2
    '''
    return alert


def string_formatting(data, number):
    try:
        # Make sure there was data returned 
        most_recent_value = (data[number]['value'])
        
        # Provide metadata for the reader from gro api client and format into human readable string
        item = gclient.lookup('items',data[0]['item_id'])['name']
        unit = gclient.lookup('units',data[0]['unit_id'])['name']
        region = gclient.lookup('regions',data[0]['region_id'])['name']
        metric_id = data[0]['metric_id']
        metric_name = gclient.lookup('metrics',metric_id)['name']
        time_stamp = " on " + data[number]['end_date'][:10] + " "
        frequency = gclient.lookup('frequencies',data[0]['frequency_id'])['name']
        
    except (TypeError, KeyError):
        print("type error or key error in string_formatting")
        return ""

    # Get percent change from average
    df = pd.DataFrame(data)
    df, current_value, frequency = transform_by_data_type(df, most_recent_value, frequency)
    mean = df["value"].mean()
    percent_change = get_percent_change_statement(most_recent_value, mean, metric_id)

    # Check if this is a geospatial data series
    is_geospatial, new_metric_name = check_if_geospatial(data)

    # If this IS a geospatial metric
    if(is_geospatial):
        # Create string
        significance_headline = region+ " " + new_metric_name + ": " + percent_change

    # If this is NOT a geospatial metric
    else:
        # Make sure it exists
        if(most_recent_value):
            most_recent_value = round(most_recent_value,2)
        else:
            most_recent_value = 0

        # Adds commas and gets rid of decimals
        if(most_recent_value>100):
            most_recent_value = (format(int(most_recent_value), ',d'))
        else:
            most_recent_value = str(most_recent_value)
        # Create string
        item_metric = item + " " + metric_name
        significance_headline = region+ " " + item_metric + ": " + most_recent_value + " " + unit + ", " + percent_change
    return significance_headline


def work_to_distribute(data):
    '''
    Method run by workers; runs each of the tests in return_alerts
    for a single data series.
    '''
    string_add = ""
    data_line_2 = return_alerts(data)
    if(data_line_2!=""):
        data_line_1 = string_formatting(data,-1)
        string_add += (data_line_1 + data_line_2 + "\n")
    return string_add


def compute_anomalies(array, data_name):
    '''
    Takes an array of data series and a category name and returns any alerts,
    string formatted under a header
    '''
    string = ""
    string_add = ""
    string_list = []
    if(array is not None):
        try:
            for ele in array:
                string_list.append(work_to_distribute(ele))
        except IndexError:
            print("IndexError in compute_anomalies for " + data_name)
            return string_add
            
        for x in string_list:
            string_add += x
        if(string_add!=""):
            string = data_name
        else:
            string = ""

    data_statement = string + " \n" + string_add

    return data_statement


argentina_string = compute_anomalies(land_temperature_array, data_name)
print("argentina_string:")
print(argentina_string)
argentina_string:
 
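
The commented block in return_alerts is the extension point for additional tests. As one hypothetical example (a sketch, not part of the original notebook), a seasonal test could compare the latest value only against values from the same calendar month in prior years, reusing find_lower_and_upper:

def seasonal_significance_test(data, standard_devs=2):
    '''
    Hypothetical extra test: flags the latest value as a seasonal anomaly if it
    falls outside two-sigma bounds computed from same-calendar-month history.
    '''
    df = pd.DataFrame(data)
    df['month'] = pd.to_datetime(df['end_date']).dt.month
    latest = df.iloc[-1]
    same_month_values = df[df['month'] == latest['month']]['value']
    lower, upper = find_lower_and_upper(same_month_values, standard_devs)
    if(latest['value'] <= lower):
        return True, "seasonal low in ", latest['value']
    elif(latest['value'] >= upper):
        return True, "seasonal high in ", latest['value']
    return False, "", latest['value']

Because it returns the same (bool, statement, value) triple as current_is_significant, it can be dropped into the commented slot in return_alerts unchanged.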

NDVI Anomalies in Brazil

Now that we have our alerts from Argentina about land temperature (if the cell above prints nothing, that means there were no alerts for significant temperature in Argentina), we will look for Brazil NDVI (Normalized Difference Vegetation Index) anomalies as well. Our data source for NDVI allows us to estimate the amount of chlorophyll in plants in a given region.

In [17]:
# Tocantins Province in Brazil
tocantins_province_id = 10428
districts_list_brazil = [region['id'] for region in gclient.get_descendant_regions(tocantins_province_id, REGION_LEVELS['district'])]
print(districts_list_brazil)
[110117, 110031, 110130, 110097, 110120, 110059, 110124, 110047, 110041, 110118, 110055, 110061, 110084, 110082, 110096, 110077, 110109, 110100, 109997, 110045, 110021, 110074, 109998, 110022, 110024, 110086, 110001, 109995, 110048, 110064, 110004, 110098, 110016, 110037, 110010, 110078, 110107, 109999, 110051, 110017, 110091, 110116, 110023, 110131, 110089, 110065, 110088, 110028, 110002, 110090, 110115, 110092, 110018, 110104, 110052, 110056, 110011, 110106, 110085, 110083, 110035, 110122, 110087, 110121, 110006, 109996, 110068, 110014, 110034, 110057, 110066, 110102, 110042, 110071, 110039, 110053, 110040, 110007, 110094, 110049, 110030, 110073, 110029, 110129, 110076, 110105, 110075, 110067, 110062, 110113, 110114, 110000, 110020, 110046, 110050, 110063, 110101, 110110, 110128, 110036, 110093, 110069, 110012, 110003, 110103, 110080, 110127, 110099, 110060, 110015, 110008, 110033, 110070, 110027, 110095, 110079, 110112, 110019, 110013, 110005, 110125, 110111, 110108, 110072, 110119, 110126, 110032, 110026, 110054, 110123, 110025, 110009, 110044, 110038, 110043, 110081, 110058, 100022399, 100022400, 100022401]

Pull Tocantins province NDVI and analyze for significance (this may take a minute)

In [18]:
metric_id = 431132 # Vegetation difference from 10-yr mean (2001-2010)
item_id = 321 # Vegetation (NDVI)

ndvi_array = get_data_for_multiple_regions(districts_list_brazil, metric_id, item_id)
brazil_string = compute_anomalies(ndvi_array, "Tocantins Province in Brazil NDVI Anomalies")

print(brazil_string)
No Content
No Content
Tocantins Province in Brazil NDVI Anomalies 
Wanderlândia vegetation health (NDVI): 11% above average which is a significant high in 20 years 
Porto Nacional vegetation health (NDVI): 7% above average which is a significant high in 20 years 
Lajedão vegetation health (NDVI): 9% above average which is a significant high in 20 years 
Almas vegetation health (NDVI): 7% above average which is a significant high in 20 years 
Angico vegetation health (NDVI): 10% above average which is a significant high in 20 years 
Novo Acordo vegetation health (NDVI): 9% above average which is a significant high in 20 years 
Xambioá vegetation health (NDVI): 9% above average which is a significant high in 19 years 
Caseara vegetation health (NDVI): 7% above average which is a significant high in 20 years 
Riachinho vegetation health (NDVI): 9% above average which is a significant high in 20 years 
Dianopolis vegetation health (NDVI): 12% above average which is a significant high in 20 years 
Nazaré vegetation health (NDVI): 9% above average which is a significant high in 20 years 
Santa Terezinha do Tocantins vegetation health (NDVI): 11% above average which is a significant high in 20 years 
São Bento do Tocantins vegetation health (NDVI): 9% above average which is a significant high in 20 years 
Ananás vegetation health (NDVI): 8% above average which is a significant high in 20 years 
Itaguatins vegetation health (NDVI): 10% above average which is a significant high in 20 years 
Oliveira de Fátima vegetation health (NDVI): 7% above average which is a significant high in 19 years