Python Pandas: unique on a condition while appending column values


Consider a DataFrame like this:

coordinates                     metric      year
[55.2274742137, 25.1560686018]  met_1       2014
[55.1554330879, 25.0986809174]  met_2       2015
[55.1554330879, 25.0986809174]  met_2       2016
[55.14353879, 25.44]            met_221212  2020
[55.11239959, 25.3232]          met_2132    2022

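For anyone who wants to reproduce the problem, the frame above could be built roughly as follows. This is only a sketch: it assumes that coordinates holds plain Python lists of two floats and that year holds integers, which the question does not state explicitly.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: each coordinates cell is a Python list, year is an int.
df = pd.DataFrame({
    "coordinates": [
        [55.2274742137, 25.1560686018],
        [55.1554330879, 25.0986809174],
        [55.1554330879, 25.0986809174],
        [55.14353879, 25.44],
        [55.11239959, 25.3232],
    ],
    "metric": ["met_1", "met_2", "met_2", "met_221212", "met_2132"],
    "year": [2014, 2015, 2016, 2020, 2022],
})
```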
Expected result:

coordinates                     metric      year
[55.2274742137, 25.1560686018]  met_1       2014
[55.1554330879, 25.0986809174]  met_2       [2015,2016]
[55.14353879, 25.44]            met_221212  2020
[55.11239959, 25.3232]          met_2132    2022

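I want to find the records that are duplicated on the coordinates and metric columns. When duplicates exist, the year values should be collected into a list and that list passed as the new year column. Then I want to drop the duplicates.

You need groupby, but if one of the grouping columns contains lists you get:

TypeError: unhashable type: 'list'

so convert the lists to hashable tuples first.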
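The error comes from Python rather than from pandas: grouping keys have to be hashable, and a list is not, while a tuple of floats is. A minimal sketch in plain Python, where the variable names are only illustrative:

```python
coords = [55.2274742137, 25.1560686018]

# A list is mutable, so Python refuses to hash it.
try:
    hash(coords)
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# The same values as a tuple hash without complaint.
tuple_hash = hash(tuple(coords))
```

This is why the answers convert the column with apply(tuple) before grouping and convert it back with apply(list) afterwards.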

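Another problem is that if you want a list only when a group has more than
one value, you need a slightly more complicated function, here a lambda with
a conditional expression: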

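# groupby needs hashable keys, so convert the coordinate lists to tuples first
df.coordinates = df.coordinates.apply(tuple)
# the parentheses let the method chain span two lines
df = (df.groupby(['coordinates','metric'], sort=False)['year']
        .apply(lambda x: list(x) if len(x) > 1 else x.item()))
df = df.reset_index()
# convert the keys back to lists
df.coordinates = df.coordinates.apply(list)
print(df)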
                      coordinates      metric          year
0  [55.2274742137, 25.1560686018]       met_1          2014
1  [55.1554330879, 25.0986809174]       met_2  [2015, 2016]
2            [55.14353879, 25.44]  met_221212          2020
3          [55.11239959, 25.3232]    met_2132          2022
If it is acceptable for every value in the output column to be a list:

df.coordinates = df.coordinates.apply(tuple)
df = df.groupby(['coordinates','metric'], sort=False)['year'].apply(list)
df = df.reset_index()
df.coordinates = df.coordinates.apply(list)
print(df)
                      coordinates      metric          year
0  [55.2274742137, 25.1560686018]       met_1        [2014]
1  [55.1554330879, 25.0986809174]       met_2  [2015, 2016]
2            [55.14353879, 25.44]  met_221212        [2020]
3          [55.11239959, 25.3232]    met_2132        [2022]
If you need the output as strings:

df.coordinates = df.coordinates.apply(tuple)
# the parentheses let the method chain span two lines
df = (df.groupby(['coordinates','metric'], sort=False)['year']
        .apply(lambda x: ','.join(x.astype(str))))
df = df.reset_index()
df.coordinates = df.coordinates.apply(list)
print(df)
                      coordinates      metric       year
0  [55.2274742137, 25.1560686018]       met_1       2014
1  [55.1554330879, 25.0986809174]       met_2  2015,2016
2            [55.14353879, 25.44]  met_221212       2020
3          [55.11239959, 25.3232]    met_2132       2022
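The variants above share the same three steps: make the key hashable, group and collect, then restore the lists. A self-contained sketch that wraps those steps in one function might look like this. The function name collapse_years and its as_string flag are hypothetical and not part of pandas, and the sample data is the frame from the question.

```python
import pandas as pd

def collapse_years(df, as_string=False):
    """Collapse rows that share coordinates and metric (hypothetical helper)."""
    out = df.copy()
    # Lists are unhashable, so use tuples as grouping keys.
    out["coordinates"] = out["coordinates"].apply(tuple)
    grouped = out.groupby(["coordinates", "metric"], sort=False)["year"]
    if as_string:
        # Join the years of each group into one comma-separated string.
        result = grouped.apply(lambda s: ",".join(s.astype(str)))
    else:
        # Collect the years of each group into a list.
        result = grouped.apply(list)
    result = result.reset_index()
    # Restore the original list representation of the keys.
    result["coordinates"] = result["coordinates"].apply(list)
    return result

df = pd.DataFrame({
    "coordinates": [
        [55.2274742137, 25.1560686018],
        [55.1554330879, 25.0986809174],
        [55.1554330879, 25.0986809174],
        [55.14353879, 25.44],
        [55.11239959, 25.3232],
    ],
    "metric": ["met_1", "met_2", "met_2", "met_221212", "met_2132"],
    "year": [2014, 2015, 2016, 2020, 2022],
})

as_lists = collapse_years(df)
as_strings = collapse_years(df, as_string=True)
```

Working on a copy leaves the caller's frame untouched, and sort=False keeps the groups in order of first appearance, as in the outputs shown above.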

You can use groupby to help here:

import pandas as pd

# dummy data
df = pd.DataFrame([[[55.2274742137, 25.1560686018], "met_1", 2014], 
                  [[55.1554330879, 25.0986809174], "met_2", 2015], 
                  [[55.1554330879, 25.0986809174], "met_2", 2015]],
                  columns=["coordinates", "metric", "year"])

print(df)
    coordinates                     metric  year
0   [55.2274742137, 25.1560686018]  met_1   2014
1   [55.1554330879, 25.0986809174]  met_2   2015
2   [55.1554330879, 25.0986809174]  met_2   2015

# define apply function
def aggregate(sub_df):
    years = sub_df["year"].values
    if len(years) > 1:
        return years
    else:
        return years[0]

# groupby needs hashable items, that's why we convert to tuple before
df["coordinates"] = df["coordinates"].apply(tuple)

# groupby and apply aggregator
print(df.groupby(["coordinates", "metric"]).apply(aggregate))

coordinates                     metric
(55.1554330879, 25.0986809174)  met_2     [2015, 2015]
(55.2274742137, 25.1560686018)  met_1            2014
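Note that the result above is a Series indexed by tuples, the groups come back sorted rather than in their original order, and the multi-value entry is a NumPy array. If a flat frame with list-valued coordinates is wanted instead, one possible variation is sketched below. It applies the aggregator to the year column only and uses the hypothetical name aggregate_years; it is an adaptation, not the code from the answer.

```python
import pandas as pd

df = pd.DataFrame(
    [[[55.2274742137, 25.1560686018], "met_1", 2014],
     [[55.1554330879, 25.0986809174], "met_2", 2015],
     [[55.1554330879, 25.0986809174], "met_2", 2015]],
    columns=["coordinates", "metric", "year"],
)

def aggregate_years(years):
    """Return a list for several years, a scalar for a single one."""
    values = years.tolist()
    return values if len(values) > 1 else values[0]

# groupby needs hashable keys, so convert to tuples first
df["coordinates"] = df["coordinates"].apply(tuple)

result = (
    df.groupby(["coordinates", "metric"], sort=False)["year"]
      .apply(aggregate_years)
      .reset_index()
)
# turn the keys back into lists
result["coordinates"] = result["coordinates"].apply(list)
```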