#StackBounty: #python #python-3.x #pandas #dataframe #pandas-groupby How to check for stuck data in a pandas dataframe

Bounty: 50

All,

** EDIT **

Please show timing as dealing with time-series data can have a lot of rows.

This is a common problem that I face while working with time-series data sets across languages. So, let us say we have hourly data for a day. I would like to check for 2 things:

  1. A variable number of consecutive stuck values, where the required run length depends on the group
  2. Jumps in the data that exceed a tolerance

Here is the example data set to work with and what I have attempted:

import pandas as pd
import numpy as np

# Constants
UPPER_LIMIT_RANGE_FILTER = 1.2
LOWER_LIMIT_RANGE_FILTER = 0.5


def count_consecutive_values(
        df_input: pd.DataFrame,
        column: str,
        n_consecutive_values: int
) -> pd.DataFrame:
    """

    :param df_input: input data frame to test consecutive values in
    :param column: column with consecutive values in
    :param n_consecutive_values: # of consecutive occurrences to count
    :return: original data frame with an extra column called 'count'
    """
    df_input[column + '_count'] = df_input[column].groupby(
        (df_input[column] != df_input[column].shift(n_consecutive_values)).cumsum()).cumsum()
    return df_input


# Create a random data frame
df = pd.DataFrame(data=[["2015-01-01 00:00:00", -0.76, 2, 2, 1.2],
                        ["2015-01-01 01:00:00", -0.73, 2, 4, 1.1],
                        ["2015-01-01 02:00:00", -0.71, 2, 4, 1.1],
                        ["2015-01-01 03:00:00", -0.68, 2, 32, 1.1],
                        ["2015-01-01 04:00:00", -0.65, 2, 2, 1.0],
                        ["2015-01-01 05:00:00", -0.76, 2, 2, 1.2],
                        ["2015-01-01 06:00:00", -0.73, 2, 4, 1.1],
                        ["2015-01-01 07:00:00", -0.71, 2, 4, 1.1],
                        ["2015-01-01 08:00:00", -0.68, 2, 32, 1.1],
                        ["2015-01-01 09:00:00", -0.65, 2, 2, 1.0],
                        ["2015-01-01 10:00:00", -0.76, 2, 2, 1.2],
                        ["2015-01-01 11:00:00", -0.73, 2, 4, 1.1],
                        ["2015-01-01 12:00:00", -0.71, 2, 4, 1.1],
                        ["2015-01-01 13:00:00", -0.68, 2, 32, 1.1],
                        ["2015-01-01 14:00:00", -0.65, 2, 2, 1.0],
                        ["2015-01-01 15:00:00", -0.76, 2, 2, 1.2],
                        ["2015-01-01 16:00:00", -0.73, 2, 4, 1.1],
                        ["2015-01-01 17:00:00", -0.71, 2, 4, 1.1],
                        ["2015-01-01 18:00:00", -0.68, 2, 32, 1.1],
                        ["2015-01-01 19:00:00", -0.65, 2, 2, 1.0],
                        ["2015-01-01 20:00:00", -0.76, 2, 2, 1.2],
                        ["2015-01-01 21:00:00", -0.73, 2, 4, 1.1],
                        ["2015-01-01 22:00:00", -0.71, 2, 4, 1.1],
                        ["2015-01-01 23:00:00", -0.68, 2, 32, 1.1],
                        ["2015-01-02 00:00:00", -0.65, 2, 2, 1.0]],
                  columns=['DateTime', 'column1', 'column2', 'column3', 'column4'])
consecutive_values_to_test_for = {
    'Zone1': 4,
    'Zone2': 2
}

# Set the index
df["DateTime"] = pd.to_datetime(df["DateTime"])
df.set_index("DateTime", inplace=True)

# Calculate difference between every 2 values in each column
df1 = df.diff()
print(df1)

# Add hour and time of day to create flag
df1['Hour'] = df1.index.hour
df1['Flag'] = np.where((df1['Hour'] <= 8) | (df1['Hour'] >= 18), 'Zone1', 'Zone2')

# Create Groups & apply filters on groups
grouped_data = df1.groupby(['Flag'])

Problem 1:

So, I have split my day's worth of data into two sets, Zone1 and Zone2. Now I would like to see whether the data is stuck: a boolean flag should be raised at every timestamp where at least 2 consecutive occurrences of a value are observed in Zone1, or at least 4 consecutive occurrences of a value are observed in Zone2.
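One possible sketch of this stuck-value check uses the run-id trick: compare each value with its predecessor, cumulatively sum the changes to label runs, then number the rows within each run. The frame, column, and threshold names below are illustrative rather than taken from the code above, and run lengths are computed over the whole series instead of resetting at zone boundaries (a design choice worth revisiting):

```python
import numpy as np
import pandas as pd

# Illustrative data: one value column on an hourly index
df_demo = pd.DataFrame(
    {"column3": [2, 4, 4, 4, 4, 2, 2, 32]},
    index=pd.date_range("2015-01-01", periods=8, freq="h"),
)
df_demo["Flag"] = np.where(df_demo.index.hour <= 3, "Zone1", "Zone2")

# Hypothetical per-zone run lengths that count as "stuck"
thresholds = {"Zone1": 4, "Zone2": 2}

# A new run starts whenever the value changes from the previous row
runs = (df_demo["column3"] != df_demo["column3"].shift()).cumsum()

# Number the rows within each run; +1 turns the counter into a run length
run_length = df_demo.groupby(runs).cumcount() + 1

# A row is "stuck" when its run length reaches its zone's threshold
df_demo["stuck"] = run_length >= df_demo["Flag"].map(thresholds)
print(df_demo["stuck"].tolist())
```

This stays vectorised, so it should scale to large time series; the per-zone threshold is applied row-wise via `map`, so no Python-level loop over groups is needed.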

Problem 2:

I would like a boolean flag at every timestamp where a column's value changes from one timestamp to the next by more than a pre-defined tolerance.

I think problem 2 is straightforward and could be solved with the following, but I could use some help with detecting the stuck values.

My solution for problem 2 (jump values):

def flag_jumps(
        df_input: pd.DataFrame,
        tolerance: float = 10**-2
) -> pd.DataFrame:
    """
    Returns a data frame the size of the input

    Flags rows in each column where tolerance is violated
    :param df_input: input data frame to test for jumps
    :param tolerance: acceptable value of tolerance
    :return: data frame with flags indicating whether tolerance has been violated or not
    """
    # Calculate the difference between every two consecutive rows
    df2 = df_input.diff()

    # Use the absolute difference so downward jumps are flagged too
    df3 = df2.abs().gt(tolerance)

    return df3
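
As a self-contained illustration of the jump check: `diff()` can be negative, so comparing the absolute difference against the tolerance catches downward jumps as well. The demo frame and values below are made up; the first row's `NaN` diff compares as `False`, so it is never flagged:

```python
import pandas as pd

# Made-up demo frame: small steps of 0.02-0.03 plus one large jump of 0.28
df_demo = pd.DataFrame(
    {"column1": [-0.76, -0.73, -0.71, -0.68, -0.40]},
    index=pd.date_range("2015-01-01", periods=5, freq="h"),
)

tolerance = 10 ** -2

# |diff| > tolerance flags jumps in either direction
jumps = df_demo.diff().abs().gt(tolerance)
print(jumps["column1"].tolist())
```

With a tolerance of 0.01, every step here exceeds the threshold, so all rows after the first are flagged.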

