>>> df = pd.DataFrame({'user_id': ['a', 'a', 's', 's', 's'],
...                    'session': [4, 5, 4, 5, 5],
...                    'revenue': [-1, 0, 1, 2, 1]})
>>> df
   revenue  session user_id
0       -1        4       a
1        0        5       a
2        1        4       s
3        2        5       s
4        1        5       s
Each combination of session and revenue represents a kind of type, and I want to count the number of rows of each kind. For example, the number of rows with revenue=-1, session=4 and user_id='a' is 1. Simply calling count() after groupby() doesn't give the result I want:
>>> df.groupby('user_id').count()
         revenue  session
user_id
a              2        2
s              3        3
How can I do that?
You seem to want to group by several columns at once:
df.groupby(['revenue','session','user_id'])['user_id'].count()
should give you what you want.
From pandas 1.1, this will be my recommended method for counting the number of rows in groups (i.e., the group size). To count the number of non-nan rows in a group for a specific column, check out the accepted answer.
df.groupby(['A', 'B']).size() # df.groupby(['A', 'B'])['C'].count()
df.value_counts(subset=['A', 'B'])
Note that size and count are not identical: the former counts all rows per group, while the latter counts only non-null rows. See this other answer of mine for more.
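The size-versus-count distinction only shows up when a group contains nulls. A small sketch (the column names here are illustrative, not from the original answers):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'C': [1.0, np.nan, 2.0]})

# size() counts every row in each group, NaN or not
sizes = df.groupby('A').size()         # x -> 2, y -> 1

# count() counts only the non-null values of the chosen column
counts = df.groupby('A')['C'].count()  # x -> 1, y -> 1
```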
Minimal Example
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'cat', 'ant'])

        num_legs  num_wings
falcon         2          2
dog            4          0
cat            4          0
ant            6          0
df.value_counts(subset=['num_legs', 'num_wings'], sort=False)
num_legs num_wings
2 2 1
4 0 2
6 0 1
dtype: int64
Compare this output with
df.groupby(['num_legs', 'num_wings'])['num_legs'].size()
num_legs num_wings
2 2 1
4 0 2
6 0 1
Name: num_legs, dtype: int64
Performance
It's also faster if you don't sort the result:
%timeit df.groupby(['num_legs', 'num_wings'])['num_legs'].count()
%timeit df.value_counts(subset=['num_legs', 'num_wings'], sort=False)
640 µs ± 28.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
568 µs ± 6.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I struggled with the same issue and made use of the solution provided above. You can actually designate any of the columns to count:

df.groupby(['revenue','session','user_id'])['revenue'].count()
df.groupby(['revenue','session','user_id'])['session'].count()

would give the same answer.