相关文章推荐
不爱学习的课本  ·  DeletedSecret ...·  1 年前    · 
坚韧的啤酒  ·  SQL Server 2005 ...·  1 年前    · 
Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

Apply a lambda with a shift function in python pandas were some null elements are to be replaced

Ask Question

I am trying to do the following in a dataframe. Change the Value of Column Attrition if the Period is not 1, then multiple the value of the retention column in that row by the attrition value in the row above within the groupby. My attempt is below:

import pandas as pd
data = {'Country': ['DE', 'DE', 'DE', 'US', 'US', 'US', 'FR', 'FR', 'FR'],
    'Week': ['201426', '201426', '201426', '201426', '201425', '201425', '201426', '201426', '201426'],
    'Period': [1, 2, 3, 1, 1, 2, 1, 2, 3],
    'Attrition': [0.5,'' ,'' ,0.85 ,0.865,'' ,0.74 ,'','' ],
    'Retention': [0.95,0.85,0.94,0.85,0.97,0.93,0.97,0.93,0.94]}
df = pd.DataFrame(data, columns= ['Country', 'Week', 'Period', 'Attrition','Retention'])
print df

Gives me this output:

  Country    Week  Period Attrition  Retention
0      DE  201426       1       0.5       0.95
1      DE  201426       2                 0.85
2      DE  201426       3                 0.94
3      US  201426       1      0.85       0.85
4      US  201425       1     0.865       0.97
5      US  201425       2                 0.93
6      FR  201426       1      0.74       0.97
7      FR  201426       2                 0.93
8      FR  201426       3                 0.94

The below:

df['Attrition'] = df.groupby(['Country','Week']).apply(lambda x: x.Attrition.shift(1)*x['Retention'] if x.Period != 1 else x.Attrition)
print df

gives me this error:

df['Attrition'] = df.groupby(['Country','Week']).apply(lambda x: x.Attrition.shift(1)*x['Retention'] if x.Period != 1 else x.Attrition)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

UPDATE: Complete Compiled Solution

The below is the complete working solution of what i was after which was basically using Primer's answer but adding a while loop to keep running the Lambda function on the dataframe column until there are no more NaNs.

import pandas as pd
import numpy as np
data = {'Country': ['DE', 'DE', 'DE', 'US', 'US', 'US', 'FR', 'FR', 'FR'],
    'Week': ['201426', '201426', '201426', '201426', '201425', '201425', '201426', '201426', '201426'],
    'Period': [1, 2, 3, 1, 1, 2, 1, 2, 3],
    'Attrition': [0.5, '' ,'' ,0.85 ,0.865,'' ,0.74 ,'','' ],
    'Retention': [0.95,0.85,0.94,0.85,0.97,0.93,0.97,0.93,0.94]}
df = pd.DataFrame(data, columns= ['Country', 'Week', 'Period', 'Attrition','Retention'])
print df

OUTPUT: Starting DF

  Country    Week  Period Attrition  Retention
0      DE  201426       1       0.5       0.95
1      DE  201426       2                 0.85
2      DE  201426       3                 0.94
3      US  201426       1      0.85       0.85
4      US  201425       1     0.865       0.97
5      US  201425       2                 0.93
6      FR  201426       1      0.74       0.97
7      FR  201426       2                 0.93
8      FR  201426       3                 0.94

Solution:

#Replaces empty string with NaNs
df['Attrition'] = df['Attrition'].replace('', np.nan)
#Stores a count of the number of null or NaNs in the column.
ContainsNaN = df['Attrition'].isnull().sum()
#run the loop while there are some NaNs in the column.
while ContainsNaN > 0:    
    df['Attrition'] = df.groupby(['Country','Week']).apply(lambda x: pd.Series(np.where((x.Period != 1), x.Attrition.shift() * x['Retention'], x.Attrition)))        
    ContainsNaN = df['Attrition'].isnull().sum()
print df

OUTPUT: Result

  Country    Week  Period Attrition  Retention
0      DE  201426       1       0.5       0.95
1      DE  201426       2     0.425       0.85
2      DE  201426       3    0.3995       0.94
3      US  201426       1      0.85       0.85
4      US  201425       1     0.865       0.97
5      US  201425       2   0.80445       0.93
6      FR  201426       1      0.74       0.97
7      FR  201426       2    0.6882       0.93
8      FR  201426       3  0.646908       0.94
                Actually, just as a followup to this. The code above was working fine and problem solved. Then I adapted it to my more complex problem and I started getting an error, Now even this code above doesnt run. this the error i got for line 22. "raise TypeError('incompatible index of inserted column ' My pandas version is 0.15.2 I had 0.12 but read there was some issues, so upgraded to 0.15.2 but it didnt fix the problem. TypeError: incompatible index of inserted column with frame index" I also using Python 2.7.5 | Anaconda 1.8.0 (64-bit) and the Spyder interface.
– IcemanBerlin
                Jan 30, 2015 at 9:05

First of all your Attrition column mixes numeric data with empty strings '', which is generally not a good idea and should be fixed before attempting calculations on this column:

df.loc[df['Attrition'] == '', 'Attrition'] = pd.np.nan
df['Attrition'] = df.Attrition.astype('float')

The Error you get is because your condition in .apply: x.Period != 1 produces a Boolean array:

0    False
1     True
2     True
3    False
4    False
5     True
6    False
7     True
8     True
Name: Period, dtype: bool

Which .apply does not know how to handle, because of its ambiguity (i.e what should be True in this case?).

You might consider numpy.where for this task:

import numpy as np
g = df.groupby(['Country','Week'], as_index=0, group_keys=0)
df['Attrition'] = g.apply(lambda x: pd.Series(np.where((x.Period != 1), x.Attrition.shift() * x['Retention'], x.Attrition)).fillna(method='ffill')).values

yielding:

  Country    Week  Period  Attrition  Retention
0      DE  201426       1      0.500       0.95
1      DE  201426       2      0.425       0.85
2      DE  201426       3      0.425       0.94
3      US  201426       1      0.740       0.85
4      US  201425       1      0.688       0.97
5      US  201425       2      0.688       0.93
6      FR  201426       1      0.865       0.97
7      FR  201426       2      0.804       0.93
8      FR  201426       3      0.850       0.94

Note that I have added .fillna method, which fills NaN with last observed value.

Thanks Primer, I am just going through your response one by one. Thanks for clarifying the issue with the empty string. I didn't quite understand how this line was working 'df.loc[df['Attrition'] == '', 'Attrition'] = pd.np.nan' I managed to find an example that was simpler for me to work with i.e. 'df['Attrition'] = df['Attrition'].replace('', np.nan)' this also seems to work. do you see any issues with this method. – IcemanBerlin Jan 20, 2015 at 8:52 for the last part, in row 2 period 3 I was hoping this would also subsequently be filled with the same formula i.e. period 2(0.425) x 0.94. I am guessing I would need to step through each row or repeat the lambda until there are no more NaNs? I was assuming a lambda function would automatically apply the function in a sort of iterrows fashion. – IcemanBerlin Jan 20, 2015 at 8:55 I see no problem with using .replace to fill in NaNs in this example. Your original question did not specify anything about filling NaNs after applying the desired calculation, hence there there was nothing done about that. To address this issue I have slightly modified the answer - adding .fillna() method in the .apply part). Hope this answers the question. – Primer Jan 20, 2015 at 10:27 Actually I dont want to fill NaN, I wanted each Period to have the Attrition.shift(1) x Retention formula applied except for the 1st period. Your original repsonse only filled the first NaN or Empty string with the calculation. I actually have 52 periods in each groupby in my real data and I need all of them to have the calculation applied. I think this lambda though wouldnt work in this case unless i stepped through the groupby rows. in my comment above I meant i wanted period 3 in row 2 to be 0.435 x 0.94. not a ffill of period 2. – IcemanBerlin Jan 20, 2015 at 13:01

Thanks for contributing an answer to Stack Overflow!

  • Please be sure to answer the question. Provide details and share your research!

But avoid

  • Asking for help, clarification, or responding to other answers.
  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.