Python: copying a Series from DataFrame columns when the index column contains duplicates


python python-2.7 pandas


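The following line, which operates on data extracted from a CSV and loaded with pandas (pd),

return pd.Series((wb['impressions'].values * 1.0)/(wb['ad_requests'].values * 1.0), index=wb['\xef\xbb\xbf"ad_tag_name"']).to_dict()

no longer works, because column A (col 3) now contains multiple entries with the same name (for example, 4 rows of he.com_300x250_bottomloopmobile).

Column C will always be empty except on the first entry of each unique value.

I now need to sum these multiple values for each "key" in column A, do the same for column C, and then feed those sums into the division and the Series creation.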

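Tried on its own, groupby() behaves well (the duplicate keys are dropped, which is exactly what I want):

However, when I add index=wb['\xef\xbb\xbf"ad_tag_name"'] back in to try to rebuild the full formula, pandas no longer drops the duplicates: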

In [37]: pd.Series(wb.groupby('\xef\xbb\xbf"ad_tag_name"').sum()['impressions'], index=wb['\xef\xbb\xbf"ad_tag_name"'])
Out[37]: 
"ad_tag_name"
he.com_300x250_bottomloopmobile          26752
he.com_300x250_bottomloopmobile          26752
he.com_300x250_bottomloopmobile          26752
he.com_300x250_bottomslidemobile         31217
he.com_300x250_bottomslidemobile         31217
he.com_300x250_bottomslidemobile         31217
he.com_300x250_bottomslidemobile         31217
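
The repeated rows above are consistent with how pandas treats a Series passed as data together with an explicit index: the existing Series is reindexed to the new labels, so every duplicate label looks up the same grouped sum. A minimal sketch of that behavior, using made-up tag names "a" and "b" and invented numbers rather than the real data:

```python
import pandas as pd

# Hypothetical stand-in for wb: tag "a" appears 3 times, tag "b" twice.
wb = pd.DataFrame({
    "ad_tag_name": ["a", "a", "a", "b", "b"],
    "impressions": [1, 2, 3, 10, 20],
})

# groupby alone yields one row per unique tag.
grouped = wb.groupby("ad_tag_name")["impressions"].sum()

# Passing the original (duplicated) column as index reindexes the grouped
# Series back to 5 labels, repeating each group's sum.
rebuilt = pd.Series(grouped, index=wb["ad_tag_name"])

print(grouped)
print(rebuilt)
```

In other words, the duplicates are reintroduced by the index argument, not by groupby().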

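Assuming the groupby() component of the formula can stay as it is, how do we tell the Series creation to recognize the duplicate keys in the index column?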

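It seems you need to assign the output back to wb, using sum to aggregate all numeric columns so that the duplicates are removed, and add as_index=False to get DataFrame output: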
wb = wb.groupby("ad_tag_name", as_index=False).sum()
#alternative solution
#wb = wb.groupby("ad_tag_name").sum().reset_index()
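
As a rough check that the two variants above are interchangeable, the sketch below runs both on a small made-up frame (the tag names and numbers are invented for illustration) and compares the results:

```python
import pandas as pd

# Hypothetical data: two tags, each repeated.
wb = pd.DataFrame({
    "ad_tag_name": ["a", "a", "b", "b", "b"],
    "impressions": [1, 2, 10, 20, 30],
})

# Variant 1: keep the grouping key as an ordinary column.
by_flag = wb.groupby("ad_tag_name", as_index=False).sum()

# Variant 2: group into the index, then move it back to a column.
by_reset = wb.groupby("ad_tag_name").sum().reset_index()

print(by_flag.equals(by_reset))
```

Either way, ad_tag_name ends up as a regular column with one row per unique tag, which is what the later Series construction needs.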


Sample:
import numpy as np
import pandas as pd

wb = pd.DataFrame({'ad_tag_name':['he.com_300x250_bottomloopmobile'] * 3 +
                                 ['he.he.com_300x250_bottomslidemobile'] * 4, 
                   'impressions':[309, 3029,23414,1465,5725,2918,11109],
                    'ad_requests':[37849,np.nan,np.nan, 42300,np.nan, np.nan, np.nan]})

#print (wb)    

wb = wb.groupby('ad_tag_name', as_index=False).sum()
print (wb)
                           ad_tag_name  ad_requests  impressions
0      he.com_300x250_bottomloopmobile      37849.0        26752
1  he.he.com_300x250_bottomslidemobile      42300.0        21217

a = pd.Series((wb['impressions'].values * 1.0)/(wb['ad_requests'].values * 1.0), 
           index=wb['ad_tag_name']).to_dict()

print (a)
{'he.he.com_300x250_bottomslidemobile': 0.50158392434988175, 
'he.com_300x250_bottomloopmobile': 0.70680863431002139}
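
A possible variation on the sample, not part of the original answer: grouping into the index and dividing the two summed columns directly avoids the .values round trip and the separate index argument. It relies on sum() skipping NaN by default, so the empty ad_requests cells do not affect the totals.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer.
wb = pd.DataFrame({
    'ad_tag_name': ['he.com_300x250_bottomloopmobile'] * 3 +
                   ['he.he.com_300x250_bottomslidemobile'] * 4,
    'impressions': [309, 3029, 23414, 1465, 5725, 2918, 11109],
    'ad_requests': [37849, np.nan, np.nan, 42300, np.nan, np.nan, np.nan],
})

# One row per unique tag; the tag becomes the index of the result.
sums = wb.groupby('ad_tag_name')[['impressions', 'ad_requests']].sum()

# Element-wise division aligns on the (now unique) tag index.
ratio = (sums['impressions'] / sums['ad_requests']).to_dict()
print(ratio)
```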

Also, to get rid of the \xef\xbb\xbf prefix, add encoding='utf-8-sig' when reading the CSV, or upgrade to the latest version, because this behavior is a bug.
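
A small sketch of the encoding suggestion, assuming the file is read with pd.read_csv. The bytes below are an invented two-line CSV that starts with a UTF-8 byte order mark, standing in for the real file:

```python
import io

import pandas as pd

# Hypothetical CSV content beginning with the UTF-8 BOM (\xef\xbb\xbf).
raw = b'\xef\xbb\xbf"ad_tag_name","impressions"\n"x",1\n'

# 'utf-8-sig' decodes UTF-8 and strips a leading BOM if one is present.
wb = pd.read_csv(io.BytesIO(raw), encoding='utf-8-sig')

print(list(wb.columns))
```

With the BOM stripped, the column can be addressed as wb['ad_tag_name'] instead of wb['\xef\xbb\xbf"ad_tag_name"'].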
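Thanks, but you have duplicate values in the final df. Impressions should be summed per unique ad tag name. Every identical value in the ad_tag_name column refers to the same entity, so the final df (in your example) should have only 3 rows, one each for a, b and c.

Interesting approach, aggregating and summing all the columns before applying the calculation. Elegant, thanks.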