Python Pandas: unique with a condition, appending another column's values

python pandas

Consider a DataFrame like this:

coordinates                     metric      year
[55.2274742137, 25.1560686018]  met_1       2014
[55.1554330879, 25.0986809174]  met_2       2015
[55.1554330879, 25.0986809174]  met_2       2016
[55.14353879, 25.44]            met_221212  2020
[55.11239959, 25.3232]          met_2132    2022

Expected result:

coordinates                     metric      year
[55.2274742137, 25.1560686018]  met_1       2014
[55.1554330879, 25.0986809174]  met_2       [2015, 2016]
[55.14353879, 25.44]            met_221212  2020
[55.11239959, 25.3232]          met_2132    2022

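I want to find the records that are duplicated on the coordinates and metric columns. When they are, append their year values to a list and use that list as the new year column. Then I want to drop the duplicates.

You need groupby on coordinates and metric.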

But if the grouping column holds lists, it fails with:
TypeError: unhashable type: 'list'
Convert the lists to tuples, which are hashable.
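To make the hashability point concrete, here is a minimal sketch on shortened, made-up coordinates (the values and the variable names list_is_hashable and n_groups are illustrative, not from the question). It checks hashability with Python's built-in hash() and then groups on the tuple version of the column.

```python
import pandas as pd

# Illustrative data only; the coordinates are shortened for readability.
df = pd.DataFrame({
    "coordinates": [[55.15, 25.09], [55.15, 25.09], [55.22, 25.15]],
    "metric": ["met_2", "met_2", "met_1"],
    "year": [2015, 2016, 2014],
})

# Lists are mutable, so Python refuses to hash them.
try:
    hash(df["coordinates"].iloc[0])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Tuples of floats are hashable, so the column can serve as a group key.
df["coordinates"] = df["coordinates"].apply(tuple)
n_groups = df.groupby(["coordinates", "metric"], sort=False).ngroups

print(list_is_hashable, n_groups)
```

Converting back with apply(list) after grouping restores the original cell type if later code expects lists.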

Another issue: if you want a list only when a group has more than one value, you need a slightly more complicated lambda:
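# lists are unhashable, so group on tuples instead
df.coordinates = df.coordinates.apply(tuple)
# parentheses allow the method chain to span two lines
df = (df.groupby(['coordinates','metric'], sort=False)['year']
        .apply(lambda x: list(x) if len(x) > 1 else x.item()))
df = df.reset_index()
# restore the original list type
df.coordinates = df.coordinates.apply(list)
print(df)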
                      coordinates      metric          year
0  [55.2274742137, 25.1560686018]       met_1          2014
1  [55.1554330879, 25.0986809174]       met_2  [2015, 2016]
2            [55.14353879, 25.44]  met_221212          2020
3          [55.11239959, 25.3232]    met_2132          2022
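For anyone who wants to run this end to end, the following is a self-contained sketch that builds the question's data and applies the same steps. The helper name collapse and the variable out are hypothetical, introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "coordinates": [
        [55.2274742137, 25.1560686018],
        [55.1554330879, 25.0986809174],
        [55.1554330879, 25.0986809174],
        [55.14353879, 25.44],
        [55.11239959, 25.3232],
    ],
    "metric": ["met_1", "met_2", "met_2", "met_221212", "met_2132"],
    "year": [2014, 2015, 2016, 2020, 2022],
})


def collapse(years):
    """Return a list for groups with several years, else the single year."""
    values = years.tolist()
    return values if len(values) > 1 else values[0]


out = df.copy()
# Tuples are hashable, so they can be used as group keys.
out["coordinates"] = out["coordinates"].apply(tuple)
out = (
    out.groupby(["coordinates", "metric"], sort=False)["year"]
       .apply(collapse)
       .reset_index()
)
# Restore the original list type of the coordinates column.
out["coordinates"] = out["coordinates"].apply(list)
print(out)
```

Note that the resulting year column mixes scalars and lists, so it has object dtype and later code has to check the type of each cell.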

If it is acceptable, use lists for all values in the output column:
df.coordinates = df.coordinates.apply(tuple)
df = df.groupby(['coordinates','metric'], sort=False)['year'].apply(list)
df = df.reset_index()
df.coordinates = df.coordinates.apply(list)
print (df)
                      coordinates      metric          year
0  [55.2274742137, 25.1560686018]       met_1        [2014]
1  [55.1554330879, 25.0986809174]       met_2  [2015, 2016]
2            [55.14353879, 25.44]  met_221212        [2020]
3          [55.11239959, 25.3232]    met_2132        [2022]
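One possible reason to prefer lists in every row is that later code can treat the column uniformly. A small sketch under that assumption, on shortened made-up data (the names grouped, counts and flat are illustrative): Series.str.len() counts the years per group, and DataFrame.explode() reverses the aggregation.

```python
import pandas as pd

# Illustrative data only; coordinates are already tuples here.
df = pd.DataFrame({
    "coordinates": [(55.15, 25.09), (55.15, 25.09), (55.22, 25.15)],
    "metric": ["met_2", "met_2", "met_1"],
    "year": [2015, 2016, 2014],
})

grouped = (
    df.groupby(["coordinates", "metric"], sort=False)["year"]
      .apply(list)
      .reset_index()
)

# Every cell is a list, so per-row lengths need no type checks.
counts = grouped["year"].str.len().tolist()

# explode() puts each list element back on its own row.
flat = grouped.explode("year", ignore_index=True)

print(counts, len(flat))
```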

If you need the output as strings:
df.coordinates = df.coordinates.apply(tuple)
# parentheses allow the method chain to span two lines
df = (df.groupby(['coordinates','metric'], sort=False)['year']
        .apply(lambda x: ','.join(x.astype(str))))
df = df.reset_index()
df.coordinates = df.coordinates.apply(list)
print(df)
                      coordinates      metric       year
0  [55.2274742137, 25.1560686018]       met_1       2014
1  [55.1554330879, 25.0986809174]       met_2  2015,2016
2            [55.14353879, 25.44]  met_221212       2020
3          [55.11239959, 25.3232]    met_2132       2022

You can use groupby here:
import pandas as pd

# dummy data
df = pd.DataFrame([[[55.2274742137, 25.1560686018], "met_1", 2014],
                   [[55.1554330879, 25.0986809174], "met_2", 2015],
                   [[55.1554330879, 25.0986809174], "met_2", 2015]],
                  columns=["coordinates", "metric", "year"])

print(df)
    coordinates                     metric  year
0   [55.2274742137, 25.1560686018]  met_1   2014
1   [55.1554330879, 25.0986809174]  met_2   2015
2   [55.1554330879, 25.0986809174]  met_2   2015

# define apply function
def aggregate(sub_df):
    years = sub_df["year"].values
    if len(years) > 1:
        return years
    else:
        return years[0]

# groupby needs hashable items, that's why we convert to tuple before
df["coordinates"] = df["coordinates"].apply(tuple)

# groupby and apply aggregator
print(df.groupby(["coordinates", "metric"]).apply(aggregate))

coordinates                     metric
(55.1554330879, 25.0986809174)  met_2     [2015, 2015]
(55.2274742137, 25.1560686018)  met_1            2014