Python 3.x pandas DataFrame: add a 'count' column for multiple occurrences/duplicates in one column

python-3.x pandas

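I have a pandas DataFrame and I would like to collapse duplicates (in one column, here the first) by adding a "count" column (here the last column, preset to 1 for every row). My DataFrame looks like this: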

df = pandas.DataFrame([["a", ..., 1],  # last column is always 1 (this will be the 'count' column)
                       ["a", ..., 1],  # "a" = identical, other values not necessarily
                       ["b", ..., 1],
                       ["c", ..., 1],
                       ["a", ..., 1],
                       ["d", ..., 1],
                       ["d", ..., 1]])

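Note that I am interested in letters that recur in the first column. The other columns are not necessarily duplicated and can be ignored here. I would like to go through the DataFrame row by row and do the following:

The first time an instance appears in the first column (e.g. the first occurrence of "a"), check whether the value in the last column of that row is exactly 1; if not, set it to 1.
The second time the same instance appears (e.g. "a" appears again in the second row): delete that row and add +1 to the value in the last column of the row where the instance first occurred.

I am not sure whether it is best to do this in the same DataFrame or in a new one, but I would like to end up with a df like this: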

df2 = pandas.DataFrame([["a", ..., 3], # no changes except for last column counting three instances of "a": this line and two further lines
                                       # line deleted: "a" reoccurs
                       ["b", ..., 1],  # no changes
                       ["c", ..., 1],  # no changes
                                       # line deleted:  "a" reoccurs
                       ["d", ..., 2],  # no changes except last column counting two instances of "d": this line and one more
                                   ])  # line deleted:  "d" reoccurs

I really don't know how to do this and would appreciate some advice. Thanks in advance.

The code below

import pandas as pd
df = pd.DataFrame({"first":["a", "b", "b", "a", "b", "c"], "second":range(6)})
result = df.groupby('first').first()
result['count'] = df['first'].value_counts()
result.reset_index(inplace=True)

creates the DataFrame

  first  second
0     a       0
1     b       1
2     b       2
3     a       3
4     b       4
5     c       5

and turns it into

  first  second  count
0     a       0      2
1     b       1      3
2     c       5      1

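which is exactly what you need.

Update. In the comments you asked how to apply different aggregations to different columns. Here is an example: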
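If the original row order matters, or the DataFrame already carries a preset count column as in the question, one possible variant is to drop the repeated rows and then overwrite the count. This is only a sketch: the column names `key`, `other` and `count` and the sample values are made up for illustration, since the question elides the real columns.

```python
import pandas as pd

# Hypothetical data shaped like the question: a key column, one other
# column, and a 'count' column preset to 1 on every row.
df = pd.DataFrame({
    "key": ["a", "a", "b", "c", "a", "d", "d"],
    "other": [10, 11, 12, 13, 14, 15, 16],
    "count": [1, 1, 1, 1, 1, 1, 1],
})

# Keep only the first row of each key, in order of first appearance.
df2 = df.drop_duplicates(subset="key", keep="first").copy()

# Overwrite 'count' with the number of rows each key had in the original df.
df2["count"] = df2["key"].map(df["key"].value_counts())
df2 = df2.reset_index(drop=True)

print(df2)
```

Unlike `groupby`, which sorts the group keys by default, this keeps the keys in the order in which they first appear.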

import pandas as pd
df = pd.DataFrame({"first": ["a", "b", "b", "a", "b", "c"],
                   "second": range(6), "third": range(6)})
# Take the first value of 'second' and the maximum of 'third' within each group
result = df.groupby('first').agg({'second': lambda x: x.iloc[0], 'third': 'max'})
result['count'] = df['first'].value_counts()
result.reset_index(inplace=True)

produces

  first  second  third  count
0     a       0      3      2
1     b       1      4      3
2     c       5      5      1

So the second and third columns are aggregated differently.

Using David's data:
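The same per-column aggregation can probably also be written in one step with named aggregation, assuming pandas 0.25 or later. This is a sketch on the same sample data, not part of the original answer. Note that the string `"first"` skips missing values, which can differ from `x.iloc[0]` when a group starts with NaN.

```python
import pandas as pd

df = pd.DataFrame({"first": ["a", "b", "b", "a", "b", "c"],
                   "second": range(6), "third": range(6)})

result = (
    df.groupby("first")
      .agg(second=("second", "first"),   # first value of 'second' per group
           third=("third", "max"),       # maximum of 'third' per group
           count=("second", "size"))     # number of rows per group
      .reset_index()
)

print(result)
```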

df.groupby('first').agg({'first':'count','second':'first'}).rename(columns={'first':'count'})
Out[1177]: 
       count  second
first               
a          2       0
b          3       1
c          1       5

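Thanks! Is there a way to specify that I want to keep the highest (int) value of a column? For example: for a given value, do not keep the first occurrence in column second, but instead keep the highest value in the df result (here: 4)?

Yes, just use max() instead of count() - see the documentation for more information.

No, I don't mean the maximum of column first; I want to count the occurrences in column first while keeping, in each row, the maximum of column second. I think you mean result = df.groupby('first').max(), right? If so: (how) can I specify which column to keep the maximum of? With .max() and .sum() I get the maximum or the sum of all columns, but I can't find a way to specify that I want .first() for column a (and columns c and d) and .max() or .sum() for column b. Is there a way to do this?

Yes, you can do that, or look at the updated code in my answer.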
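For the case raised in the comments (keep the first value for most columns but the sum or maximum for one of them), one way to avoid listing every column by hand is to build the aggregation dict programmatically. This is a sketch under assumptions: the column names `key`, `a`, `b`, `c` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical data: 'key' holds the duplicates, 'a' and 'c' should keep
# their first value, and 'b' should be summed within each group.
df = pd.DataFrame({
    "key": ["x", "y", "x", "y", "x"],
    "a": [1, 2, 3, 4, 5],
    "b": [10, 20, 30, 40, 50],
    "c": ["p", "q", "r", "s", "t"],
})

# Default to 'first' for every non-key column, then override column 'b'.
how = {col: "first" for col in df.columns if col != "key"}
how["b"] = "sum"  # or "max"

grouped = df.groupby("key")
result = grouped.agg(how)
result["count"] = grouped.size()  # rows per key, aligned on the group index
result = result.reset_index()

print(result)
```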