Python 3.x: how to keep the duplicate group with the lower mean?

python python-3.x csv

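Hello, I'm new to Python and a friend suggested I ask for help on Stack Overflow, so I decided to give it a try. I am currently using Python 3.x.

I have a dataset of more than 100k rows in a CSV file with no column headers, and I have loaded it into a pandas DataFrame. Because the file is confidential I cannot show the real data here, but below is a sample of the data. The columns can be defined as follows: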

("id", "name", "number", "time", "text_id", "text", "text")

1 | apple | 12 | 123 | 2 | abc | abc

1 | apple | 12 | 222 | 2 | abc | abc

2 | orange | 32 | 123 | 2 | abc | abc

2 | orange | 11 | 123 | 2 | abc | abc

3 | apple | 12 | 333 | 2 | abc | abc

3 | apple | 12 | 443 | 2 | abc | abc

3 | apple | 12 | 553 | 2 | abc | abc
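For reference, a minimal sketch of how a headerless file like this could be loaded with column names. The in-memory string stands in for the confidential file, a comma separator is assumed, and `text1` is an assumed name for the second `text` column, chosen so that the column labels stay unique.

```python
import io
import pandas as pd

# Stand-in for the confidential file; real code would pass a file path instead.
raw = io.StringIO(
    "1,apple,12,123,2,abc,abc\n"
    "1,apple,12,222,2,abc,abc\n"
    "2,orange,32,123,2,abc,abc\n"
    "2,orange,11,123,2,abc,abc\n"
    "3,apple,12,333,2,abc,abc\n"
    "3,apple,12,443,2,abc,abc\n"
    "3,apple,12,553,2,abc,abc\n"
)

# "text1" is an assumed label for the second "text" column, so that
# selecting a column by name is never ambiguous.
columns = ["id", "name", "number", "time", "text_id", "text", "text1"]

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(raw, header=None, names=columns)
print(df)
```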

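As you can see from the name column, I have two duplicate "apple" clusters with different IDs.

So my question is: how do I delete the entire cluster (all of its rows) that has the higher mean of "time"?

Example: if (cluster with ID 1).mean(time) < (cluster with ID 3).mean(time), then delete all rows in the cluster with ID 3.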

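Expected output:

1 | apple | 12 | 123 | 2 | abc | abc

1 | apple | 12 | 222 | 2 | abc | abc

2 | orange | 32 | 123 | 2 | abc | abc

2 | orange | 11 | 123 | 2 | abc | abc

I need a lot of help and will take any I can get. I am running out of time, so thank you in advance.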

You can use groupby and apply to get the rows to remove first. Then you can use take to get the final result.

import pandas as pd

## select the rows whose time is above the mean time of their group
def my_func(df):
    return df[df['time'] > df['time'].mean()]

## get the rows to be removed
df1 = df.groupby(by='name', group_keys=False).apply(my_func)

## keep only the rows we want (assumes the default 0..n-1 index)
index_to_keep = set(range(df.shape[0])) - set(df1.index)
df2 = df.take(sorted(index_to_keep))

For example:

## df
   id    name  number  time  text_id text text1
0   1   apple      12   123        2  abc   abc
1   1   apple      12   222        2  abc   abc
2   2  orange      32   123        2  abc   abc
3   2  orange      11   123        2  abc   abc
4   3   apple      12   333        2  abc   abc
5   3   apple      12   444        2  abc   abc
6   3   apple      12   553        2  abc   abc

df1 = df.groupby(by='name', group_keys=False).apply(my_func)

## df1
   id   name  number  time  text_id text text1
5   3  apple      12   444        2  abc   abc
6   3  apple      12   553        2  abc   abc

index_to_keep = set(range(df.shape[0])) - set(df1.index)
df2 = df.take(sorted(index_to_keep))

#index_to_keep
{0, 1, 2, 3, 4}

# df2
   id    name  number  time  text_id text text1
0   1   apple      12   123        2  abc   abc
1   1   apple      12   222        2  abc   abc
2   2  orange      32   123        2  abc   abc
3   2  orange      11   123        2  abc   abc
4   3   apple      12   333        2  abc   abc
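As a side note, the same row-level result can probably be reached without building a set of positions. The sketch below swaps in `groupby(...).transform("mean")` and `DataFrame.drop` for the apply/take combination above; it uses a trimmed copy of the example data, and the names `name_mean` and `rows_to_remove` are only illustrative.

```python
import pandas as pd

# Trimmed copy of the example data (only the columns that matter here).
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 3],
    "name": ["apple", "apple", "orange", "orange", "apple", "apple", "apple"],
    "time": [123, 222, 123, 123, 333, 444, 553],
})

# Mean time of each name group, repeated on every row of that group.
name_mean = df.groupby("name")["time"].transform("mean")

# Rows whose time is above the mean of their name group.
rows_to_remove = df[df["time"] > name_mean]

# Drop those rows by index label; this also works for a non-default index.
df2 = df.drop(rows_to_remove.index)
print(df2)
```

Like the code above, this filters individual rows against the mean of all rows sharing a name, so row 4 (ID 3) is still kept.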

Try the following:
import pandas as pd

df = pd.read_csv('filename.csv', header=None)
df.columns = ['id', 'name', 'number', 'time', 'text_id', 'text', 'text']

print(df)

for eachname in df['name'].unique():
    # All rows sharing this name, split into clusters by id
    eachname_df = df.loc[df['name'] == eachname]
    grouped_df = eachname_df.groupby(['id', 'name'])
    avg_name = grouped_df['time'].mean()

    for a, b in grouped_df:
        # Drop every cluster whose mean time is not the lowest for this name
        if b['time'].mean() != avg_name.min():
            # Index.get_values() was removed in pandas 1.0; drop by index directly
            df = df.drop(b.index)

print(df)


Result:
   id    name  number  time  text_id text text
0   1   apple      12   123        2  abc  abc
1   1   apple      12   222        2  abc  abc
2   2  orange      32   123        2  abc  abc
3   2  orange      11   123        2  abc  abc
4   3   apple      12   333        2  abc  abc
5   3   apple      12   443        2  abc  abc
6   3   apple      12   553        2  abc  abc

   id    name  number  time  text_id text text
0   1   apple      12   123        2  abc  abc
1   1   apple      12   222        2  abc  abc
2   2  orange      32   123        2  abc  abc
3   2  orange      11   123        2  abc  abc
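For a file with 100k+ rows, dropping rows inside nested loops may be slow. Below is a hedged, vectorized sketch of the same rule using `groupby(...).transform`; it is an alternative technique, not the code above, and the names `cluster_mean`, `best_mean` and `result` are only illustrative. It runs on a trimmed copy of the sample data.

```python
import pandas as pd

# Trimmed copy of the sample data from the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 3],
    "name": ["apple", "apple", "orange", "orange", "apple", "apple", "apple"],
    "number": [12, 12, 32, 11, 12, 12, 12],
    "time": [123, 222, 123, 123, 333, 443, 553],
})

# Mean time of each (name, id) cluster, repeated on every row of the cluster.
cluster_mean = df.groupby(["name", "id"])["time"].transform("mean")

# Lowest cluster mean among all clusters that share a name.
best_mean = cluster_mean.groupby(df["name"]).transform("min")

# Keep only the rows of the cluster(s) with the lowest mean for their name.
result = df[cluster_mean == best_mean]
print(result)
```

If two clusters with the same name tie for the lowest mean, both are kept, which matches the behaviour of the loop above.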

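Hi @SCC, thanks for your reply, but what I am looking for is index_to_keep being {0, 1, 2, 3}: row 4 must be removed as well, because it belongs to the cluster with ID 3. Is there a way to compute this by cluster mean (based on time)? That is, whichever cluster has the higher mean gets deleted. Example: (cluster with ID 1).mean(time) < (cluster with ID 3).mean(time) means deleting all rows of the cluster with ID 3.

You can adjust the condition in my_func() in my example, for instance by changing the comparison operator, depending on your requirements.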