#StackBounty: #python #pandas #dataframe Pandas Processing Large CSV Data

Bounty: 50

I am processing a large data set, at least 8 GB in size, using pandas.

I ran into problems reading the whole set at once, so I read the file chunk by chunk.

As I understand it, chunking the file creates many separate DataFrames, so my existing routine only removes the duplicate values within each individual DataFrame and not the duplicates across the whole file.
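For example, with two tiny made-up chunks, dropping duplicates per chunk still leaves a duplicate that spans the chunk boundary:

    import pandas as pd

    # two "chunks" that share a row across the chunk boundary
    chunk1 = pd.DataFrame({"Email": ["a@x.com", "b@x.com"], "Phone": ["111", "222"]})
    chunk2 = pd.DataFrame({"Email": ["a@x.com", "c@x.com"], "Phone": ["111", "333"]})

    # each chunk is de-duplicated in isolation, so both copies of a@x.com / 111 survive
    deduped = [c.drop_duplicates(subset=["Email", "Phone"]) for c in (chunk1, chunk2)]
    print(pd.concat(deduped))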

I need to remove the duplicates across this whole data set based on three columns, any of which may or may not exist in a given file.

I tried to use pd.concat, but that also ran out of memory, so instead I write each processed chunk out to a CSV file, appending the results of every DataFrame to it.
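In simplified form the append step looks like this (a sketch with placeholder file names; my real code is below):

    import pandas as pd

    first_chunk = True
    for chunk in pd.read_csv("big_input.csv", chunksize=250000):
        chunk = chunk.drop_duplicates()
        # append each processed chunk; write the header row only for the first chunk
        chunk.to_csv("deduped_output.csv", mode="a", index=False, header=first_chunk)
        first_chunk = False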

After running the code, the output file is not much smaller, which suggests my assumption is right: the current routine is not removing duplicates across the whole data set.

I’m a newbie in Python, so it would really help if someone could point me in the right direction.

    import pandas as pd

    def removeduplicates(filename):
        CHUNK_SIZE = 250000
        df_iterator = pd.read_csv(filename, na_filter=False, chunksize=CHUNK_SIZE,
                                  low_memory=False)
        # new_df = pd.DataFrame()
        first_chunk = True
        for df in df_iterator:
            list_headers = list(df.columns)

            UNIQUE_HEADER = 'Unique String (Combine Values)'

            # drop any previously generated unique-string column
            if UNIQUE_HEADER in df.columns:
                del df[UNIQUE_HEADER]

            # find_column() returns the matching header name, or '' if not found
            email_column = find_column("Email", list_headers)
            phone_column = find_column("Phone", list_headers)
            website_column = find_column("Web", list_headers)

            # print(email_column, phone_column, website_column)

            # drop duplicates within this chunk on whichever key columns exist
            if email_column != '' and phone_column != '' and website_column != '':
                df = df.reset_index().drop_duplicates(subset=[email_column, phone_column, website_column],
                                                      keep='first').set_index('index')
                unique_strings = df.apply(lambda row: str(row[email_column]) +
                                          str(row[phone_column]) + str(row[website_column]), axis=1)
            elif email_column == '' and phone_column != '' and website_column != '':
                df = df.reset_index().drop_duplicates(subset=[phone_column, website_column],
                                                      keep='first').set_index('index')
                unique_strings = df.apply(lambda row: str(row[phone_column]) + str(row[website_column]), axis=1)
            elif email_column != '' and phone_column != '' and website_column == '':
                df = df.reset_index().drop_duplicates(subset=[email_column, phone_column],
                                                      keep='first').set_index('index')
                unique_strings = df.apply(lambda row: str(row[email_column]) + str(row[phone_column]), axis=1)
            else:
                # none of the expected column combinations are present in this file
                unique_strings = pd.Series('', index=df.index)

            df.insert(1, UNIQUE_HEADER, unique_strings)
            # append each chunk, writing the CSV header only once
            df.to_csv("E:/test.csv", mode="a", index=False, header=first_chunk)
            first_chunk = False
            # new_df = pd.concat([new_df, df])
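
One direction I am considering (just a sketch, not tested on the full 8 GB file): keep a running set of the combined key strings across chunks and drop any row whose key was already seen in an earlier chunk, assuming the set of unique keys fits in memory. Here key_columns would be whichever of the email/phone/website columns actually exist, and the file names are placeholders:

    import pandas as pd

    def remove_duplicates_across_chunks(filename, key_columns, out_path):
        seen = set()          # combined key strings from all earlier chunks
        first_chunk = True
        for df in pd.read_csv(filename, na_filter=False, chunksize=250000, low_memory=False):
            # build one combined key string per row from the chosen columns
            keys = df[key_columns].astype(str).apply(''.join, axis=1)
            # keep rows not seen in earlier chunks and not repeated within this chunk
            mask = ~keys.isin(seen) & ~keys.duplicated()
            seen.update(keys)
            df = df[mask]
            df.to_csv(out_path, mode="a", index=False, header=first_chunk)
            first_chunk = False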

