Bounty: 50
All,
** EDIT **
Please include timings in your answers, since time-series data sets can have a lot of rows.
This is a common problem I face while working with time-series data sets across languages. Say we have hourly data for a day. I would like to check for two things:
- A variable number of consecutive stuck values, depending on the group
- Jumps in the data that violate a tolerance
Here is the example data set to work with and what I have attempted:
import pandas as pd
import numpy as np
# Constants
UPPER_LIMIT_RANGE_FILTER = 1.2
LOWER_LIMIT_RANGE_FILTER = 0.5
def count_consecutive_values(
        df_input: pd.DataFrame,
        column: str
) -> pd.DataFrame:
    """
    :param df_input: input data frame to test consecutive values in
    :param column: column to count consecutive values in
    :return: original data frame with an extra column '<column>_count'
             holding the running count of consecutive repeats
    """
    # A new run starts whenever the value changes from the previous row;
    # cumsum over that boolean gives each run a distinct id, and
    # cumcount gives the running position within the run.
    run_id = (df_input[column] != df_input[column].shift()).cumsum()
    df_input[column + '_count'] = df_input[column].groupby(run_id).cumcount() + 1
    return df_input
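As a reference point for the attempt above, the standard pandas run-length idiom on a plain Series looks like this (a minimal, self-contained sketch; the values are made up for illustration):

```python
import pandas as pd

s = pd.Series([7, 7, 7, 3, 3, 7])
# A new run starts whenever the value differs from the previous row;
# cumsum over that boolean turns it into a run id.
run_id = s.ne(s.shift()).cumsum()             # 1, 1, 1, 2, 2, 3
# Running count of repeats within each run.
run_count = s.groupby(run_id).cumcount() + 1  # 1, 2, 3, 1, 2, 1
print(run_count.tolist())
```

Comparing `run_count` (or the run's total size via `transform('size')`) against a threshold then gives the stuck flag.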
# Create a random data frame
df = pd.DataFrame(data=[["2015-01-01 00:00:00", -0.76, 2, 2, 1.2],
["2015-01-01 01:00:00", -0.73, 2, 4, 1.1],
["2015-01-01 02:00:00", -0.71, 2, 4, 1.1],
["2015-01-01 03:00:00", -0.68, 2, 32, 1.1],
["2015-01-01 04:00:00", -0.65, 2, 2, 1.0],
["2015-01-01 05:00:00", -0.76, 2, 2, 1.2],
["2015-01-01 06:00:00", -0.73, 2, 4, 1.1],
["2015-01-01 07:00:00", -0.71, 2, 4, 1.1],
["2015-01-01 08:00:00", -0.68, 2, 32, 1.1],
["2015-01-01 09:00:00", -0.65, 2, 2, 1.0],
["2015-01-01 10:00:00", -0.76, 2, 2, 1.2],
["2015-01-01 11:00:00", -0.73, 2, 4, 1.1],
["2015-01-01 12:00:00", -0.71, 2, 4, 1.1],
["2015-01-01 13:00:00", -0.68, 2, 32, 1.1],
["2015-01-01 14:00:00", -0.65, 2, 2, 1.0],
["2015-01-01 15:00:00", -0.76, 2, 2, 1.2],
["2015-01-01 16:00:00", -0.73, 2, 4, 1.1],
["2015-01-01 17:00:00", -0.71, 2, 4, 1.1],
["2015-01-01 18:00:00", -0.68, 2, 32, 1.1],
["2015-01-01 19:00:00", -0.65, 2, 2, 1.0],
["2015-01-01 20:00:00", -0.76, 2, 2, 1.2],
["2015-01-01 21:00:00", -0.73, 2, 4, 1.1],
["2015-01-01 22:00:00", -0.71, 2, 4, 1.1],
["2015-01-01 23:00:00", -0.68, 2, 32, 1.1],
["2015-01-02 00:00:00", -0.65, 2, 2, 1.0]],
columns=['DateTime', 'column1', 'column2', 'column3', 'column4'])
consecutive_values_to_test_for = {
    'Zone1': 4,
    'Zone2': 2
}
# Set the index
df["DateTime"] = pd.to_datetime(df["DateTime"])
df.set_index("DateTime", inplace=True)
# Calculate difference between every 2 values in each column
df1 = df.diff()
print(df1)
# Add hour and time of day to create flag
df1['Hour'] = df1.index.hour
df1['Flag'] = np.where((df1['Hour'] <= 8) | (df1['Hour'] >= 18), 'Zone1', 'Zone2')
# Create Groups & apply filters on groups
grouped_data = df1.groupby(['Flag'])
Problem 1:
So, I have split my day's worth of data into two sets, Zone1 and Zone2. Now I would like to detect whether the data is stuck: there should be a boolean flag at every timestamp where a value repeats for at least the number of consecutive occurrences configured for that zone in consecutive_values_to_test_for (e.g. 4 in one zone and 2 in the other).
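A vectorised sketch for problem 1 (hedged: flag_stuck, its thresholds mapping, and the demo frame are my own illustration, not an established API; it assumes a 'Flag' column like the one created above):

```python
import pandas as pd

def flag_stuck(df_input: pd.DataFrame, column: str, thresholds: dict) -> pd.Series:
    # Identify runs of identical values: the run id increments whenever
    # the value changes from the previous row.
    s = df_input[column]
    run_id = s.ne(s.shift()).cumsum()
    # Length of the run each row belongs to (this flags the whole stuck
    # run; use cumcount() + 1 instead to flag only from the Nth repeat on).
    run_len = s.groupby(run_id).transform('size')
    # Per-row threshold looked up from the row's zone flag.
    limit = df_input['Flag'].map(thresholds)
    return run_len.ge(limit)

demo = pd.DataFrame({'column1': [1, 1, 1, 2, 2, 3],
                     'Flag': ['Zone1'] * 6})
print(flag_stuck(demo, 'column1', {'Zone1': 3, 'Zone2': 2}).tolist())
# [True, True, True, False, False, False]
```

If a stuck run should not carry across a zone boundary, extend the run id with the flag change as well, e.g. `s.ne(s.shift()) | df_input['Flag'].ne(df_input['Flag'].shift())` before the cumsum.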
Problem 2:
I would like a boolean flag at every timestamp where the value in a column changes from one timestamp to the next by more than a pre-defined tolerance.
I think problem 2 is straightforward and can be solved with the following, but I could use some help with detecting the stuck values.
My attempt at problem 2 (jump values):
def flag_jumps(
        df_input: pd.DataFrame,
        tolerance: float = 10 ** -2
) -> pd.DataFrame:
    """
    Returns a data frame the size of the input,
    flagging rows in each column where the tolerance is violated.

    :param df_input: input data frame to test for jumps
    :param tolerance: acceptable amount of change between two consecutive rows
    :return: data frame of booleans indicating whether the tolerance has been violated
    """
    # Calculate the difference between every two consecutive rows
    df2 = df_input.diff()
    # Check for tolerance violation in either direction (up or down)
    df3 = df2.abs().gt(tolerance)
    return df3
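For the timing request in the edit, a quick benchmark sketch on a larger synthetic frame (the size, seed, and column names here are illustrative assumptions):

```python
import time
import numpy as np
import pandas as pd

# A year of synthetic hourly data with four noisy columns.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.normal(scale=0.02, size=(24 * 365, 4)),
                   columns=['column1', 'column2', 'column3', 'column4'])

start = time.perf_counter()
# The same vectorised check as flag_jumps: absolute row-to-row
# change greater than the tolerance.
flags = big.diff().abs().gt(0.05)
elapsed = time.perf_counter() - start
print(f"{len(big)} rows checked in {elapsed * 1000:.2f} ms")
```

Note that the first row of the diff is NaN and therefore never flagged.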