#StackBounty: #python #dataframe #interpolation #raster #folium create interpolated polygons from GPS data with value column

Bounty: 100

I have some GPS data with a value assigned to each point (think like air quality). I can plot those points, with folium for instance, and map the value to the size of a circle, like this:

import pandas, numpy, folium
lat = numpy.random.uniform(45, 45.01, 250)
lon = numpy.random.uniform(3, 3.02, 250)
value = numpy.random.uniform(0,50,250)
df = pandas.DataFrame({'lat': lat, 'lon': lon, 'value': value})
mymap = folium.Map(location = [lat.mean(), lon.mean()], tiles="OpenStreetMap", zoom_start=14)
for elt in list(zip(df.lat, df.lon, df.value)):
    folium.Circle(elt[:2], color="blue", radius=elt[2]).add_to(mymap)
mymap.save('mymap.html')

[image: folium map with the points drawn as circles sized by value, next to a mock-up of the desired interpolated polygons]

I would like to process this data, interpolate it, and create a shapefile as the output, with interpolated polygons containing the average value and showing areas of high and low values (mock-up on the right of the picture above). The boundaries of the polygons would, of course, be auto-generated from the interpolation.

How can I achieve this? I have tried the HeatMap tool in folium, but it is designed to interpolate the density of points, not the value associated with each point.
I hope it's not too complex. Thanks!
Note: I use folium, but I'm fine with other Python libraries.
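
A possible direction, not from the original post: interpolate the scattered values onto a regular grid with scipy, extract filled contour bands with matplotlib, and write them out as polygons with shapely and geopandas. The grid resolution, contour levels, output file name, the allsegs-based polygon extraction, and the WGS84 CRS are all illustrative assumptions; this is a sketch, not a tested recipe.

import numpy
import matplotlib.pyplot as plt
from scipy.interpolate import griddata
from shapely.geometry import Polygon
import geopandas

# Interpolate the scattered values onto a regular lon/lat grid
grid_lon, grid_lat = numpy.meshgrid(
    numpy.linspace(df.lon.min(), df.lon.max(), 200),
    numpy.linspace(df.lat.min(), df.lat.max(), 200),
)
grid_val = griddata((df.lon, df.lat), df.value, (grid_lon, grid_lat), method='linear')

# Turn the interpolated surface into filled contour bands
levels = [0, 10, 20, 30, 40, 50]
contours = plt.contourf(grid_lon, grid_lat, grid_val, levels=levels)

# Convert each contour band boundary into a shapely polygon
polygons, band_values = [], []
for level_index, segs in enumerate(contours.allsegs):
    for seg in segs:
        if len(seg) >= 3:  # a polygon needs at least 3 vertices
            polygons.append(Polygon(seg))
            band_values.append(contours.levels[level_index])  # lower bound of the band

gdf = geopandas.GeoDataFrame({'value': band_values, 'geometry': polygons}, crs='EPSG:4326')
gdf.to_file('interpolated_polygons.shp')

The resulting layer could then be drawn back onto the folium map with folium.GeoJson, coloring each polygon by its value column.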


Get this bounty!!!

#StackBounty: #python #python-3.x #pandas #dataframe #pandas-groupby How to check for stuck data in a pandas dataframe

Bounty: 50

All,

** EDIT **

Please show timings, since time-series data sets can have a lot of rows.

This is a common problem that I face while working with time-series data sets across languages. So, let us say we have hourly data for a day. I would like to check for 2 things:

  1. Variable # of consecutive stuck values depending on a group
  2. Jumps in data which violate a tolerance

Here is the example data set to work with and what I have attempted:

import pandas as pd
import numpy as np

# Constants
UPPER_LIMIT_RANGE_FILTER = 1.2
LOWER_LIMIT_RANGE_FILTER = 0.5


def count_consecutive_values(
        df_input: pd.DataFrame,
        column: str,
        n_consecutive_values: int
) -> pd.DataFrame:
    """

    :param df_input: input data frame to test consecutive values in
    :param column: column with consecutive values in
    :param n_consecutive_values: # of consecutive occurrences to count
    :return: original data frame with an extra column called 'count'
    """
    df_input[column + '_count'] = df_input[column].groupby(
        (df_input[column] != df_input[column].shift(n_consecutive_values)).cumsum()).cumsum()
    return df_input


# Create a random data frame
df = pd.DataFrame(data=[["2015-01-01 00:00:00", -0.76, 2, 2, 1.2],
                        ["2015-01-01 01:00:00", -0.73, 2, 4, 1.1],
                        ["2015-01-01 02:00:00", -0.71, 2, 4, 1.1],
                        ["2015-01-01 03:00:00", -0.68, 2, 32, 1.1],
                        ["2015-01-01 04:00:00", -0.65, 2, 2, 1.0],
                        ["2015-01-01 05:00:00", -0.76, 2, 2, 1.2],
                        ["2015-01-01 06:00:00", -0.73, 2, 4, 1.1],
                        ["2015-01-01 07:00:00", -0.71, 2, 4, 1.1],
                        ["2015-01-01 08:00:00", -0.68, 2, 32, 1.1],
                        ["2015-01-01 09:00:00", -0.65, 2, 2, 1.0],
                        ["2015-01-01 10:00:00", -0.76, 2, 2, 1.2],
                        ["2015-01-01 11:00:00", -0.73, 2, 4, 1.1],
                        ["2015-01-01 12:00:00", -0.71, 2, 4, 1.1],
                        ["2015-01-01 13:00:00", -0.68, 2, 32, 1.1],
                        ["2015-01-01 14:00:00", -0.65, 2, 2, 1.0],
                        ["2015-01-01 15:00:00", -0.76, 2, 2, 1.2],
                        ["2015-01-01 16:00:00", -0.73, 2, 4, 1.1],
                        ["2015-01-01 17:00:00", -0.71, 2, 4, 1.1],
                        ["2015-01-01 18:00:00", -0.68, 2, 32, 1.1],
                        ["2015-01-01 19:00:00", -0.65, 2, 2, 1.0],
                        ["2015-01-01 20:00:00", -0.76, 2, 2, 1.2],
                        ["2015-01-01 21:00:00", -0.73, 2, 4, 1.1],
                        ["2015-01-01 22:00:00", -0.71, 2, 4, 1.1],
                        ["2015-01-01 23:00:00", -0.68, 2, 32, 1.1],
                        ["2015-01-02 00:00:00", -0.65, 2, 2, 1.0]],
                  columns=['DateTime', 'column1', 'column2', 'column3', 'column4'])
consecutive_values_to_test_for = {
    'Zone_1': 4,
    'Zone_2': 2
}

# Set the index
df["DateTime"] = pd.to_datetime(df["DateTime"])
df.set_index("DateTime", inplace=True)

# Calculate difference between every 2 values in each column
df1 = df.diff()
print(df1)

# Add hour and time of day to create flag
df1['Hour'] = df1.index.hour
df1['Flag'] = np.where((df1['Hour'] <= 8) | (df1['Hour'] >= 18), 'Zone1', 'Zone2')

# Create Groups & apply filters on groups
grouped_data = df1.groupby(['Flag'])

Problem 1 :

So, I have split my day's worth of data into two sets, Zone 1 and Zone 2. Now, I would like to see if the data is stuck. There should be a boolean flag at every timestamp where at least 2 consecutive occurrences of a value are observed in Zone 1, while in Zone 2 the flag should only be raised after at least 4 consecutive occurrences of the value.
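
A minimal sketch of one way to do this, based on run-length labelling with shift/cumsum. It relies on the 'Flag' column created above, and the thresholds follow the prose (2 for Zone 1, 4 for Zone 2), so adjust the names and numbers as needed:

def flag_stuck_values(df_input: pd.DataFrame, column: str, thresholds: dict) -> pd.Series:
    # Label each run of identical consecutive values with its own id
    run_id = df_input[column].ne(df_input[column].shift()).cumsum()
    # 1-based position of each row within its run
    run_length = df_input.groupby(run_id).cumcount() + 1
    # Zone-dependent minimum run length required to call the value "stuck"
    required = df_input['Flag'].map(thresholds)
    return run_length >= required

# example: flag stuck (repeated) values in column1 of the frame that carries the Flag column
df1['column1_stuck'] = flag_stuck_values(df1, 'column1', {'Zone1': 2, 'Zone2': 4})

This flags timestamps from the point the threshold is reached onwards; to flag every row of a qualifying run, the boolean can be broadcast back over each run with groupby(run_id).transform('max').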

Problem 2 :

I would like a boolean flag at every timestamp when the value in a column changes from one timestamp to another by more than the pre-defined tolerance value.

I think problem 2 is straightforward and can be solved with the following, but I could use some help with detecting the stuck values.

My solution for problem 2 (jump values):

def flag_jumps(
        df_input: pd.DataFrame,
        tolerance: float = 10**-2
) -> pd.DataFrame:
    """
    Returns a data frame the size of the input

    Flags rows in each column where tolerance is violated
    :param df_input: input data frame to test for jumps
    :param tolerance: acceptable value of tolerance
    :return: data frame with flags indicating whether tolerance has been violated or not
    """
    # Calculate the difference between every two rows
    df2 = df_input.diff()
    
    # Check for tolerance violation in either direction (absolute change)
    df3 = df2.abs().gt(tolerance)

    return df3


Get this bounty!!!

#StackBounty: #python #python-3.x #pandas #dataframe #data-science Better way to iterate over dataset and change a feature value for ot…

Bounty: 50

I have a dataset of velocities registered by sensors on highways, and I'm changing the label values to the avg5 (average velocity over a 5-minute timestamp) 2 hours in the future (the normal setup is 30 minutes: the label value of "now" is the observed avg5 30 minutes in the future).

My dataset has the following features and values:

[image: head of the dataset showing its features and values]

[image: expanded view of the dataset]

And I'm doing this switch of values this way:

import datetime

hours_added = datetime.timedelta(hours=2)

for index in data_copy.index:
    # look up the row of the same sensor exactly 2 hours ahead
    hours_ahead = data.loc[index, "timestamp5"] + hours_added
    result = data_copy[(data_copy["timestamp5"] == hours_ahead)
                       & (data_copy["sensor_id"] == data_copy["sensor_id"].loc[index])]

    # only overwrite the label if exactly one future row was found
    if len(result) == 1:
        data_copy.at[index, "label"] = result["avg5"]

    # progress indicator
    if index % 50 == 0:
        print(f"Index: {index}")

The code queries 2 hours ahead and fetches the result for the same sensor_id I'm currently iterating over. I only change the value of my label if the query returns something (len(result) == 1).

My dataframe has 2950521 rows, and at the moment I'm publishing this question the kernel has been running for more than 24 hours and has only reached index 371650.

So I started thinking that I'm doing something wrong, or that there must be a better way to change these values that doesn't take so long.

Updates
The desired behavior is to assign to the label the avg5 of the respective sensor_id 2 hours in the future.
Let's take as an example the two images from this question and suppose that, instead of 2 hours, I want to assign the avg5 from 10 minutes in the future (the sensor_id in this example is the same).

So the label of the row with index 0, instead of being 50.79, should be 51.59 (the avg5 value of the row with index 2).
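
A minimal sketch of a vectorized alternative, built on a self-merge instead of a row-by-row loop. The column names sensor_id, timestamp5, avg5 and label are taken from the question; everything else is an assumption, and duplicate (sensor_id, timestamp5) pairs would need to be dropped beforehand so the merge stays one-to-one:

import numpy as np
import pandas as pd

# Shift the future observations back by 2 hours so they line up with "now"
future = data_copy[["sensor_id", "timestamp5", "avg5"]].copy()
future["timestamp5"] = future["timestamp5"] - pd.Timedelta(hours=2)
future = future.rename(columns={"avg5": "avg5_future"})

# A left merge keeps the original row order and leaves NaN where no match exists
merged = data_copy.merge(future, on=["sensor_id", "timestamp5"], how="left")

# Overwrite the label only where a future observation was found
data_copy["label"] = np.where(merged["avg5_future"].notna().to_numpy(),
                              merged["avg5_future"].to_numpy(),
                              data_copy["label"].to_numpy())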


Get this bounty!!!

#StackBounty: #python-3.x #pandas #dataframe #formatting dataframe: transform row-based transaction data into aggregates per date

Bounty: 50

I retrieve data from an SQLite database (and transform it into a pandas dataframe) in the following format:

Driver | Date loading | Date unloading | Loading Adress | Unloading Address
Peter  | 02.05.2020   | 03.05.2020     | 12342, Berlin  | 14221, Utrecht
Peter  | 03.05.2020   | 04.05.2020     | 14221, Utrecht | 13222, Amsterdam
Franz  | 03.05.2020   | 03.05.2020     | 11111, Somewher| 11221, Somewhere2
Franz  | 03.05.2020   | 05.05.2020     | 11223, Upsalla | 14231, Berlin

The date range can be specified for the query, so that it gives an overview of which driver has which transports to deliver within the specified date range, ordered by date.

The goal of the transformation is a weekly plan for each driver, with the dates from the range spread across the columns. For the data above, this would look like the following:

Driver | 02.05.2020           | 03.05.2020            | 04.05.2020         | 05.05.2020      |
Peter  | Loading:             | Unloading:              Unloading:
         12342, Berlin          14221, Utrecht          13222, Amsterdam
                                Loading:
                                14221, Utrecht

Franz  |                      | Loading:              |                    | Unloading:
                                11111, Somewher                              14231, Berlin
                                Unloading:
                                11221, Somewhere2
                                Loading:
                                11223, Upsalla

Is there any way to achieve the described output with dataframe operations? Within each date column I need to keep the order: loading first, unloading second, then move on to the next data row if the date is the same.
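
A minimal sketch of one way to build that wide layout, assuming the column names from the example (including the 'Loading Adress' spelling); the line-break cell format and the reshaping approach are assumptions:

import pandas as pd

df = pd.DataFrame({
    'Driver': ['Peter', 'Peter', 'Franz', 'Franz'],
    'Date loading': ['02.05.2020', '03.05.2020', '03.05.2020', '03.05.2020'],
    'Date unloading': ['03.05.2020', '04.05.2020', '03.05.2020', '05.05.2020'],
    'Loading Adress': ['12342, Berlin', '14221, Utrecht', '11111, Somewher', '11223, Upsalla'],
    'Unloading Address': ['14221, Utrecht', '13222, Amsterdam', '11221, Somewhere2', '14231, Berlin'],
})

# One event per row: loading events get order 0, unloading events order 1,
# so within each original row loading always comes before unloading
loads = df[['Driver', 'Date loading', 'Loading Adress']].rename(
    columns={'Date loading': 'Date', 'Loading Adress': 'Address'})
loads['Event'] = 'Loading:\n' + loads['Address']
loads['order'] = 0
unloads = df[['Driver', 'Date unloading', 'Unloading Address']].rename(
    columns={'Date unloading': 'Date', 'Unloading Address': 'Address'})
unloads['Event'] = 'Unloading:\n' + unloads['Address']
unloads['order'] = 1

events = (pd.concat([loads, unloads])
            .rename_axis('row').reset_index()
            .sort_values(['row', 'order']))

# One cell per driver and date, with the events joined by line breaks
plan = (events.groupby(['Driver', 'Date'])['Event']
              .agg('\n'.join)
              .unstack('Date')
              .fillna(''))
print(plan)

The dates stay plain strings here; converting them to real dates before unstacking keeps the columns in chronological order across months.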


Get this bounty!!!

#StackBounty: #data-mining #dataframe #data-analysis Automatically Detect Valid/Interesting pivot/drill downs of a dataset

Bounty: 50

Imagine you are given a tabular data set with a limited set of columns and rows, and you are asked to find valid/interesting pivot configurations by exploring the data. The brute-force option is to compute every possible pivot configuration, somehow score the resulting tables (by sparsity, count, ...), and pick the ones that score highest. This is obviously very time-consuming.

I understand that my definition of "valid/interesting" is fuzzy here, but is there a more principled approach (say, using correlations, column cardinality, ...) to automatically find good pivot configurations for a given dataset? Any pointer is highly appreciated.


Get this bounty!!!

#StackBounty: #pandas #dataframe #pyspark #apache-spark-sql #pandas-groupby Make groupby.apply more efficient or convert to spark

Bounty: 200

All,

I am using pandas groupby.apply with my own custom function. However, I have noticed that it is very, very slow. Can someone help me convert this code to run on Spark dataframes?

Here is a simple example for people to work with:

import pandas as pd
import operator

df = pd.DataFrame({
    'Instruments': ['A', 'B', 'A', 'B', 'A', 'C', 'C', 'B'],
    'Sers': ['Wind', 'Tool', 'Wind', 'Wind', 'Tool', 'Tool', 'Tool', 'Wind'],
    'Sounds': [42, 21, 34, 56, 43, 61, 24, 23]
})
def get_stats(data_frame):

    # For each grouped data_frame, cutoff all Sounds greater than 99th percentile
    cutoff_99 = data_frame[data_frame.Sounds <= data_frame.Sounds.quantile(.99)]

    # Based on total number of records, select the most-abundant sers
    sers_to_use = max((cutoff_99.Sers.value_counts() / cutoff_99.shape[0]).to_dict().items(), key = operator.itemgetter(1))[0]

    # Give me the average sound of the selected sers
    avg_sounds_of_sers_to_use = cutoff_99.loc[cutoff_99["Sers"] == sers_to_use].Sounds.mean()

    # Pre-allocate lists
    cool = []
    mean_sounds = []
    ratios = []
    _difference = []


    for i in cutoff_99.Sers.unique():
        # add each unique sers of that dataframe 
        cool.append(i) 

        # get the mean sound of that ser
        sers_mean_sounds = (cutoff_99.loc[cutoff_99["Sers"] == i].Sounds).mean()

        # add each mean sound for each sers
        mean_sounds.append(sers_mean_sounds) 

        # get the ratio of the sers to use vs. the current sers; add all of the ratios to the list
        ratios.append(avg_sounds_of_sers_to_use / sers_mean_sounds)

        # get the percent difference and add it to a list
        _difference.append(
            float(
                round(
                    abs(avg_sounds_of_sers_to_use - sers_mean_sounds)
                    / ((avg_sounds_of_sers_to_use + sers_mean_sounds) / 2),
                    2,
                )
                * 100
            )
        )

    # return a series with these lists/values.
    return pd.Series({
        'Cools': cool,
        'Chosen_Sers': sers_to_use,
        'Average_Sounds_99_Percent': mean_sounds,
        'Mean_Ratios': ratios,
        'Percent_Differences': _difference
    }) 

I call the function as follows in pandas:
df.groupby('Instruments').apply(get_stats)
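
A minimal sketch of one way to run the same per-group logic on a Spark dataframe with groupBy().applyInPandas (available from Spark 3.0); the schema, the session setup, and the wrapper below are assumptions and have not been tested against the real data:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)

def get_stats_spark(pdf: pd.DataFrame) -> pd.DataFrame:
    # Reuse the existing pandas logic on each group's pandas dataframe
    stats = get_stats(pdf)
    return pd.DataFrame([{
        'Instruments': pdf['Instruments'].iloc[0],
        'Chosen_Sers': stats['Chosen_Sers'],
        'Cools': stats['Cools'],
        'Average_Sounds_99_Percent': stats['Average_Sounds_99_Percent'],
        'Mean_Ratios': stats['Mean_Ratios'],
        'Percent_Differences': stats['Percent_Differences'],
    }])

schema = ('Instruments string, Chosen_Sers string, Cools array<string>, '
          'Average_Sounds_99_Percent array<double>, Mean_Ratios array<double>, '
          'Percent_Differences array<double>')

result = sdf.groupBy('Instruments').applyInPandas(get_stats_spark, schema=schema)
result.show(truncate=False)

Each group is shipped to an executor as a pandas dataframe, so the custom logic stays unchanged while the groups are processed in parallel.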


Get this bounty!!!

#StackBounty: #python #data-mining #pandas #data-cleaning #dataframe Unformatted data entries

Bounty: 50

I have been working recently on an independent project using a database for cybersecurity attack classification. I imported the database using pandas (Python) and, before starting the processing step, I noticed that some of the entries contain symbols such as "-", "0x000b", "0xc0a8", and many others; it is difficult to track them and see how many of these unformatted entries are present, especially when the database is so big. Is there a way to take the whole dataframe, spot all the possibly unformatted or erroneous entries and substitute them with NaN, to treat them later as missing values?
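
A minimal sketch of one way to approach this, assuming the malformed entries are string tokens sitting in otherwise numeric columns; the regular expression and the 90% survival threshold are illustrative assumptions:

import numpy as np
import pandas as pd

def coerce_bad_entries(df: pd.DataFrame) -> pd.DataFrame:
    cleaned = df.copy()
    # Replace placeholder dashes and hex-looking tokens such as 0x000b with NaN
    cleaned = cleaned.replace(r'^\s*-\s*$|^0x[0-9a-fA-F]+$', np.nan, regex=True)
    # For text columns that should be numeric, coerce anything unparsable to NaN
    for col in cleaned.select_dtypes(include='object'):
        converted = pd.to_numeric(cleaned[col], errors='coerce')
        if converted.notna().mean() > 0.9:   # keep the conversion only if most values survive
            cleaned[col] = converted
    return cleaned

# cleaned_df = coerce_bad_entries(raw_df)
# cleaned_df.isna().sum()   # how many entries were flagged per column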

Thanks in advance!


Get this bounty!!!

#StackBounty: #python #pandas #dataframe Pandas Processing Large CSV Data

Bounty: 50

I am processing a large data set, at least 8 GB in size, using pandas.

I encountered a problem reading the whole set at once, so I read the file chunk by chunk.

In my understanding, chunking the whole file creates many separate dataframes, so my existing routine only removes the duplicate values within each of those dataframes and not the duplicates across the whole file.

I need to remove the duplicates across this whole data set based on three different columns, which in some cases may or may not all exist.

I tried to use pd.concat, but I also ran into memory problems, so instead I write the results to a CSV file and append the output of each dataframe to it.

After running the code, the file doesn't shrink much, so I think my assumption is right that the current routine is not removing the duplicates across the whole data set.

I'm a newbie in Python, so it would really help if someone could point me in the right direction.

    def removeduplicates(filename):
        CHUNK_SIZE = 250000
        df_iterator = pd.read_csv(filename, na_filter=False, chunksize=CHUNK_SIZE,
                                  low_memory=False)
        # new_df = pd.DataFrame()
        for df in df_iterator:
            list_headers = list(df.columns)

            UNIQUE_HEADER = 'Unique String (Combine Values)'

            if UNIQUE_HEADER in df.columns:
                del df[UNIQUE_HEADER]

            email_column = find_column("Email", list_headers)
            phone_column = find_column("Phone", list_headers)
            website_column = find_column("Web", list_headers)

            # print(email_column, phone_column, website_column)

            if email_column != '' and phone_column != '' and website_column !='':
                df = df.reset_index().drop_duplicates(subset=[email_column, phone_column,website_column],
                                                keep='first').set_index('index')
                unique_strings = df.apply(lambda row: str(row[email_column]) + 
                                                    str(row[phone_column]) + str(row[website_column]),axis=1)
            elif email_column == '' and phone_column !='' and website_column !='':
                df = df.reset_index().drop_duplicates(subset=[phone_column, website_column],
                                              keep='first').set_index('index')
                unique_strings = df.apply(lambda row: str(row[phone_column]) + str(row[website_column]), axis=1)
            elif email_column != '' and phone_column != '' and website_column == '':
                df = df.reset_index().drop_duplicates(subset=[email_column, phone_column],
                                              keep='first').set_index('index')
                unique_strings = df.apply(lambda row: str(row[email_column]) + str(row[phone_column]), axis=1)

            df.insert(1, UNIQUE_HEADER, unique_strings)
            df.to_csv("E:/test.csv", mode="a", index=False)
            # new_df = pd.concat([new_df, df])
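
One possible direction, sketched below and not from the original post: keep a running set of already-seen keys across chunks, so that later chunks can drop rows whose key appeared in an earlier chunk. The helper name, output path, and key handling are assumptions (the set of keys must fit in memory, and the existing find_column logic can be plugged in to pick the key columns per file):

import pandas as pd

def remove_duplicates_across_chunks(filename, key_columns, output_path, chunk_size=250000):
    seen_keys = set()
    first_chunk = True
    for chunk in pd.read_csv(filename, na_filter=False, chunksize=chunk_size, low_memory=False):
        # Build one key string per row from the deduplication columns
        keys = chunk[key_columns].astype(str).agg(''.join, axis=1)
        # Keep rows not duplicated within this chunk nor seen in any earlier chunk
        mask = ~keys.duplicated() & ~keys.isin(seen_keys)
        seen_keys.update(keys[mask])
        chunk[mask].to_csv(output_path, mode='w' if first_chunk else 'a',
                           header=first_chunk, index=False)
        first_chunk = False

# remove_duplicates_across_chunks('input.csv', ['Email', 'Phone', 'Web'], 'E:/deduped.csv')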


Get this bounty!!!

#StackBounty: #python-3.x #pandas #csv #dataframe #converters Why the column type can't read as in converters's setting?

Bounty: 500

I want to read a CSV file with string type for a specified column; the data file is located here:

data file to test

Please download and save it as $HOME/cbond.csv (I can't upload it to Dropbox or other net disks because of the GFW; jianguoyun provides an English GUI, so create your own free account and download my sample data file).

import pandas as pd
df = pd.read_csv('cbond.csv', sep=',', header=0, converters={'正股代码': str})

I make the column 正股代码 in the CSV file a string type with converters, then check all column data types with df.info().

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239 entries, 0 to 238
Data columns (total 17 columns):
代码       239 non-null int64
转债名称     239 non-null object
现价       239 non-null float64
涨跌幅      239 non-null float64
正股名称     239 non-null object
正股价      239 non-null float64
正股涨跌     239 non-null float64
转股价      239 non-null float64
回售触发价    239 non-null float64
强赎触发价    239 non-null float64
到期时间     239 non-null object
剩余年限     239 non-null float64
正股代码     239 non-null object
转股起始日    239 non-null object
发行规模     239 non-null float64
剩余规模     239 non-null object
转股溢价率    239 non-null float64
dtypes: float64(10), int64(1), object(6)

Why is the column 正股代码 shown as

   正股代码     239 non-null object

instead of

   正股代码     239 non-null string  

?

I tried upgrading pandas:

sudo apt-get install --upgrade  python3-pandas
Reading package lists... Done
Building dependency tree       
Reading state information... Done
python3-pandas is already the newest version (0.19.2-5.1).

I also tried different statements:

>>> import pandas as pd
>>> pd.__version__
'0.24.2'
>>> test_1  = pd.read_csv('cbond.csv',dtype={'正股代码':'string'})
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/dtypes/common.py", line 2011, in pandas_dtype
    npdtype = np.dtype(dtype)
TypeError: data type "string" not understood

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 490, in pandas._libs.parsers.TextReader.__cinit__
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/dtypes/common.py", line 2017, in pandas_dtype
    dtype))
TypeError: data type 'string' not understood
>>> test_2  = pd.read_csv('cbond.csv',dtype={'正股代码':'str'})
>>> test_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239 entries, 0 to 238
Data columns (total 17 columns):
代码       239 non-null int64
转债名称     239 non-null object
现价       239 non-null float64
涨跌幅      239 non-null float64
正股代码     239 non-null object
正股名称     239 non-null object
正股价      239 non-null float64
正股涨跌     239 non-null float64
转股价      239 non-null float64
回售触发价    239 non-null float64
强赎触发价    239 non-null float64
到期时间     239 non-null object
剩余年限     239 non-null float64
转股起始日    239 non-null object
发行规模     239 non-null float64
剩余规模     239 non-null object
转股溢价率    239 non-null float64
dtypes: float64(10), int64(1), object(6)
memory usage: 31.8+ KB
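
For what it's worth, a short sketch of how to confirm the converter worked, under the assumption that this is a pandas-version issue: before pandas 1.0 there is no dedicated string dtype, so columns of Python str values are reported as object in df.info(), and dtype='string' only became valid with the StringDtype introduced in pandas 1.0.

import pandas as pd

df = pd.read_csv('cbond.csv', sep=',', header=0, converters={'正股代码': str})

# The values really are Python str objects, even though the dtype says 'object'
print(df['正股代码'].map(type).value_counts())    # expect <class 'str'> for every row

# On pandas >= 1.0 the column can be stored in the dedicated nullable string dtype
# df['正股代码'] = df['正股代码'].astype('string')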


Get this bounty!!!

#StackBounty: #python-3.x #pandas #dataframe #datetime #resampling How to resample data inside multiindex dataframe

Bounty: 100

I have the following dataframe:

[image: the dataframe, shown as a screenshot in the original post]

I need to resample the data to calculate the weekly pct_change(). How can I get the weekly change?

Something like data['pct_week'] = data['Adj Close'].resample('W').ffill().pct_change(), but the data needs to be grouped with data.groupby(['month', 'week']).

This way every month would yield 4 values for the weekly change, which I can then graph.

What I did was df['pct_week'] = data['Adj Close'].groupby(['week', 'day']).pct_change(), but I got this error: TypeError: 'type' object does not support item assignment.
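
A minimal sketch of one way to get a weekly change per group, assuming a MultiIndex with a datetime level named 'Date' and one grouping level named 'Ticker'; both level names are assumptions, since the real frame is only shown as an image:

import pandas as pd

# Last adjusted close of each week, per group
weekly = (data['Adj Close']
          .groupby([pd.Grouper(level='Ticker'),
                    pd.Grouper(level='Date', freq='W')])
          .last())

# Week-over-week percentage change, computed within each group
weekly_pct = weekly.groupby(level='Ticker').pct_change()
print(weekly_pct.head())

If the weekly change needs to live on the original daily rows as a pct_week column, it can be broadcast back with a merge or reindex on the group and week keys.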


Get this bounty!!!