How to sum in pandas by unique index in several columns?
Question
I have a pandas DataFrame that details online activity in terms of "clicks" during a user session. There are as many as 50,000 unique users, and the DataFrame has around 1.5 million rows. Obviously, most users have multiple records.
The four columns are a unique user ID, the date when the user began the service ("Registration"), the date the user used the service ("Session"), and the total number of clicks.
The DataFrame is organized as follows:
User_ID    Registration  Session      clicks
2349876    2012-02-22    2014-04-24   2 
1987293    2011-02-01    2013-05-03   1 
2234214    2012-07-22    2014-01-22   7 
9874452    2010-12-22    2014-08-22   2 
...
(There is also a default integer index starting at 0, but User_ID could be set as the index.)
I would like to aggregate the total number of clicks per user since the Registration date. The resulting DataFrame (or pandas Series) would list User_ID and "Total_Number_Clicks":
User_ID    Total_Clicks
2349876    722 
1987293    341
2234214    220 
9874452    1405 
...
How does one do this in pandas? Is this done with .agg()? Each User_ID needs to be summed individually.
As there are 1.5 million records, does this scale?
Recommended answer
IIUC, you can use groupby, sum and reset_index:
print(df)
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2
print(df.groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
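For readers on Python 3, the same aggregation can be sketched as a self-contained script. The two extra rows (so that some users have more than one record) and the output column name `Total_Clicks` are assumptions made for illustration, not part of the original data. `.agg("sum")` is shown next to `.sum()` because the question asks about it; for a single column the two should give the same result.

```python
import pandas as pd

# Four rows from the question plus two hypothetical extra rows,
# so that two users have more than one record.
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452, 2349876, 1987293],
    "Registration": pd.to_datetime([
        "2012-02-22", "2011-02-01", "2012-07-22",
        "2010-12-22", "2012-02-22", "2011-02-01",
    ]),
    "Session": pd.to_datetime([
        "2014-04-24", "2013-05-03", "2014-01-22",
        "2014-08-22", "2014-05-01", "2013-06-10",
    ]),
    "clicks": [2, 1, 7, 2, 5, 3],
})

# Sum clicks per user; reset_index turns User_ID back into a column,
# and rename gives the column the name used in the question's desired output.
total = (
    df.groupby("User_ID")["clicks"]
      .sum()
      .reset_index()
      .rename(columns={"clicks": "Total_Clicks"})
)
print(total)

# The same totals via .agg(); this returns a Series indexed by User_ID.
total_agg = df.groupby("User_ID")["clicks"].agg("sum")
print(total_agg)
```

By default groupby sorts the group keys, which is why the users come back in ascending User_ID order rather than in the order they first appear.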
If the first column, User_ID, is the index:
print(df)
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2014-01-22       7
9874452   2010-12-22 2014-08-22       2
print(df.groupby(level=0)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
Or:
print(df.groupby(df.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
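A minimal sketch of the index-based variants, assuming User_ID has already been set as the index. The duplicated user row is hypothetical and only there so that the sum is visible; the check at the end confirms that grouping by `level=0` and grouping by `df.index` agree on this sample.

```python
import pandas as pd

# User_ID is the index here; the last row is a hypothetical second
# record for user 2349876.
df = pd.DataFrame(
    {"clicks": [2, 1, 7, 2, 5]},
    index=pd.Index(
        [2349876, 1987293, 2234214, 9874452, 2349876], name="User_ID"
    ),
)

# Group by the first (and only) index level.
by_level = df.groupby(level=0)["clicks"].sum()

# Group by the index object itself; equivalent for a single-level index.
by_index = df.groupby(df.index)["clicks"].sum()

print(by_level.reset_index())
print(by_level.equals(by_index))
```

`level=0` tends to read more clearly when the frame has a MultiIndex, since it names which level is being grouped.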
As Alexander pointed out, if some Session dates are earlier than the Registration date for a User_ID, you need to filter the data before the groupby:
print(df)
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2
print(df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
I changed the third row of the data to make a better sample:
print(df)
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2012-01-22       7
9874452   2010-12-22 2014-08-22       2
print(df.Session >= df.Registration)
User_ID
2349876     True
1987293     True
2234214    False
9874452     True
dtype: bool
print(df[df.Session >= df.Registration])
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
9874452   2010-12-22 2014-08-22       2
df1 = df[df.Session >= df.Registration]
print(df1.groupby(df1.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2349876       2
2  9874452       2
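The filter-then-group steps can be sketched in Python 3 as follows, using the modified sample in which user 2234214 has a Session date before Registration. The final `reindex` step is an optional addition that is not in the original answer: it assumes you would rather report such users with a total of 0 than have them drop out of the result.

```python
import pandas as pd

# Sample with the third row changed so that Session < Registration.
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452],
    "Registration": pd.to_datetime(
        ["2012-02-22", "2011-02-01", "2012-07-22", "2010-12-22"]
    ),
    "Session": pd.to_datetime(
        ["2014-04-24", "2013-05-03", "2012-01-22", "2014-08-22"]
    ),
    "clicks": [2, 1, 7, 2],
})

# Boolean mask: keep only sessions on or after the registration date.
mask = df["Session"] >= df["Registration"]

# Filter first, then sum clicks per user.
totals = df.loc[mask].groupby("User_ID")["clicks"].sum()
print(totals.reset_index())

# Optional (an assumption, not part of the original answer): keep users
# whose rows were all filtered out, reporting 0 clicks for them.
totals_all = totals.reindex(df["User_ID"].unique(), fill_value=0)
print(totals_all.reset_index())
```

Converting the date columns with `pd.to_datetime` matters here: if they were left as strings, the comparison would be lexicographic, which only happens to work for ISO-formatted dates.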