#StackBounty: #python-3.x #pandas #dataframe #merge #pandas-groupby Why does merging unequally matched rows not work on local dataset?

Bounty: 50

I have a pandas dataframe with questions (type = 1) and answers (type = 2). The columns section_id and type are integers; all other columns are strings. I want to merge the "answer rows" with their corresponding "question rows" (equal values in section_id), appending some of the answer rows' values as extra columns (Ans, ans_t) to their corresponding "question rows".

import pandas as pd

c = ['pos', 'Ans', 'Q_ID', 'leg', 'que_l', 'ans_l', 'par', 'ans_f', 'que_date', 'ask', 'M_ID', 'part', 'area', 'que_t', 'ans_t', 'ISO', 'con', 'id', 'section_id', 'type', 'dep', 'off']
d = [[None, None, '16-17/1/2017-11-15/1', '16-17', '14.0', None, 'aaa', 'hhh', '2016-11-20', 'Peter Muller', '41749', 'bbb', 'ccc', 'ddd', None, 'eee', 'fff', '111865.q2', 24339851, 1, None, None],
     [None, None, '16-17/24/17-11-09/1', '16-17', '28.0', None, 'aaa', 'hhh', '2016-11-20', 'Peter Muller', '41749', 'bbb', 'ccc', 'ppp', None, 'eee', 'fff', '111867.q1', 24339851, 1, None, None],
     [None, None, '16-17/73/17-10-09/1', '16-17', '69.0', None, 'aaa', 'hhh', '2016-11-20', 'Peter Muller', '41749', 'bbb', 'ccc', 'lll', None, 'eee', 'fff', '111863.r0', 24339851, 1, None, None],
     ['erg', 'wer', '16-17/42/16-10-09/1', '16-17', None, 67.0, 'aaa', 'hhh', '2016-11-20', None, '46753', 'bbb', 'ccc', None, 'ttt', 'eee', 'asd', '111863.r0', 24339851, 2, None, None],
     [None, None, '16-17/12/16-12-08/1', '16-17', '37.0', None, 'aaa', 'hhh', '2016-10-10', 'Peter Muller', '41749', 'bbb', 'qqq', 'rrr', None, 'eee', 'fff', '108143.r0', 24303320, 1, None, None],
     ['erg', 'wer', '16-17/12/16-12-07/1', '16-17', None, 64.0, 'aaa', 'hhh', '2016-10-10', None, '46753', 'bbb', 'qqq', None, 'uuu', 'eee', 'asd', '108143.r0', 24303320, 2, None, None],
     [None, None, '16-17/77/16-12-04/1', '16-17', '46.0', None, 'aaa', 'hhh', '2016-10-08', 'Markus John', '34567', 'ztr', 'yyy', 'nnn', None, 'eee', 'www', '127193.q0', 10343145, 1, None, None],
     ['qwe', 'wer', '16-17/37/17-11-07/1', '16-17', None, 60.0, 'aaa', 'hhh', '2016-12-12', None, '19745', 'bbb', 'gtt', None, 'ooo', 'eee', 'asd', '906213.r0', 23222978, 2, None, None]]
data = pd.DataFrame(d,columns=c)
data.loc[data['type'] == 2, 'Ans.1'] = data['Ans']
data.loc[data['type'] == 2, 'ans_t.1'] = data['ans_t']
my_cols = ['que_t','ans_t','Ans','Ans.1','ans_t.1']
data[my_cols] = data.sort_values(['section_id','type']).groupby('section_id')[my_cols].transform(lambda x: x.bfill())
data.dropna(subset=['que_t'],inplace=True)
data.reset_index(drop=True,inplace=True)
print(data)

The code works fine on the minimal reproducible example. Unfortunately the dataset is too large to account for every detail, which is why this example may not necessarily be representative.

Problem: When I run the code on the actual dataset, nothing gets merged, even though I manually checked for section_id duplicates.

Before executing the code, I remove empty cells from the dataset:

import numpy as np

# where() returns a new frame, so the result has to be assigned back
data = data.where(pd.notnull(data), None)
# turn whitespace-only cells into NaN
data.replace(r'^\s+$', np.nan, regex=True, inplace=True)

which doesn't solve the problem.

Question: How do I need to adjust my code in order to account for details (e.g. encoding, formats, ...) in the dataset that could cause it not to merge?
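One way to narrow this down is to check which Python types are actually stored in the key columns. The sketch below is only an illustration under assumptions: the helper name describe_key_columns and the sample frame are hypothetical, and it assumes the real data may hold section_id or type as strings (for example after reading a CSV), in which case data['type'] == 2 would match nothing.

```python
import pandas as pd

def describe_key_columns(frame, columns=("section_id", "type")):
    """Count the Python types stored in each key column (hypothetical helper)."""
    report = {}
    for col in columns:
        # e.g. {'int': 2, 'str': 1} reveals a column with mixed types
        report[col] = frame[col].map(lambda v: type(v).__name__).value_counts().to_dict()
    return report

# Hypothetical sample: keys stored as strings, one with trailing whitespace
sample = pd.DataFrame({"section_id": [24339851, "24339851 ", 24303320],
                       "type": ["1", "2", "1"]})
print(describe_key_columns(sample))
print((sample["type"] == 2).sum())  # the string "2" is not equal to the integer 2

# Normalise both key columns: strip whitespace, then convert to nullable integers
for col in ("section_id", "type"):
    cleaned = sample[col].astype(str).str.strip()
    sample[col] = pd.to_numeric(cleaned, errors="coerce").astype("Int64")
print((sample["type"] == 2).sum())
```

If the report shows str where int was expected, normalising the key columns before the groupby or merge may be enough; whether that is the cause in the real dataset is an assumption to verify.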

Appendix:

Someone told me to remove data from the dataset gradually, checking each time that the testcase is still reproducible. If some removal results in the testcase not working then reinstate it and remove something else instead. When there’s absolutely nothing that can be removed, you have your minimal data set.
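The reduction procedure described above could be sketched roughly as follows. The function shrink and the predicate has_string_type are hypothetical names, and the predicate stands in for whatever check tells you the problem is still reproducible; dropping one section_id group at a time is simple but may be slow on a large dataset.

```python
import pandas as pd

def shrink(frame, still_fails, key="section_id"):
    """Greedily drop one key group at a time while still_fails(frame) stays True."""
    current = frame
    for value in frame[key].unique():
        candidate = current[current[key] != value]
        # accept the removal only if the problem is still reproducible
        if len(candidate) and still_fails(candidate):
            current = candidate
    return current

# Hypothetical failure: some 'type' value is a string instead of an integer
def has_string_type(frame):
    return bool(frame["type"].map(lambda v: isinstance(v, str)).any())

demo = pd.DataFrame({"section_id": [1, 1, 2, 2, 3],
                     "type": [1, 2, 1, "2", 1]})
small = shrink(demo, has_string_type)
print(small)
```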

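Someone else said I should apply a parsing function to parse the data:

def parse(x):
    # return the value as an int, or NaN if it cannot be converted
    try:
        return int(x)
    except (ValueError, TypeError):  # TypeError covers None
        return np.nan

data['col_name'] = data['col_name'].apply(parse)
print(data.dtypes)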

Or should I search for non-numeric strings and replace them with NaN?

import re

# cells containing a non-digit character become NaN
replaced_with_nan = data['col_name'].replace(re.compile(r'\D+'), np.nan)
data['col_name'] = replaced_with_nan.astype(float)
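As a possible alternative to both variants, pandas ships pd.to_numeric, which with errors="coerce" converts unparseable values to NaN instead of raising. A minimal sketch, assuming a hypothetical column that mixes numeric strings, junk and missing values:

```python
import pandas as pd

# Hypothetical column: numeric strings, a junk value, a missing value, padded text
col = pd.Series(["14.0", "28.0", "n/a", None, " 69.0 "])

# Strip surrounding whitespace first, then coerce anything unparseable to NaN
numeric = pd.to_numeric(col.str.strip(), errors="coerce")
print(numeric)
```

Counting the NaN values before and after the conversion shows how many cells failed to parse, which may point to the rows worth inspecting.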

Here is another approach which, like the answer from Andrej Kesely, returns an empty dataframe when used on the actual dataset:

df1 = data.loc[data['type'] == 1].copy()
df2 = data.loc[data['type'] == 2].copy()
merged_df = pd.merge(df1, df2, on='section_id', how='outer')
merged_df = merged_df.loc[:,['section_id','que_t_x','ans_t_y','Ans_x','Ans_y']]
merged_df.rename(columns={'que_t_x':'que_t','ans_t_y':'ans_t','Ans_x':'Ans','Ans_y':'Ans.1'}, inplace=True)
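To see why a merge comes back empty or unmatched, the indicator parameter of merge can help: it adds a _merge column stating whether each row was found on the left, on the right, or on both sides. The frames below are a cut-down, hypothetical version of the example data, not the real dataset.

```python
import pandas as pd

# Hypothetical question and answer frames, reduced to the relevant columns
questions = pd.DataFrame({"section_id": [24339851, 24303320, 10343145],
                          "que_t": ["ddd", "rrr", "nnn"]})
answers = pd.DataFrame({"section_id": [24339851, 24303320, 23222978],
                        "ans_t": ["ttt", "uuu", "ooo"]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = questions.merge(answers, on="section_id", how="outer", indicator=True)
counts = merged["_merge"].value_counts().to_dict()
print(counts)
```

If every row on the real data ends up as left_only or right_only, the key values probably differ in type or formatting between the two halves.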

