Python Pandas: unique rows with another column's values appended
Consider a DataFrame like this:
coordinates metric year
[55.2274742137, 25.1560686018] met_1 2014
[55.1554330879, 25.0986809174] met_2 2015
[55.1554330879, 25.0986809174] met_2 2016
[55.14353879, 25.44] met_221212 2020
[55.11239959, 25.3232] met_2132 2022
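For anyone who wants to run the snippets further down, the sample above could be rebuilt roughly as follows. This is only a sketch: it assumes the coordinates column holds plain Python lists (which is what the error message in the answer suggests) and that year is an integer column.

```python
import pandas as pd

# Sample frame from the question; coordinates are assumed to be Python lists.
df = pd.DataFrame({
    "coordinates": [
        [55.2274742137, 25.1560686018],
        [55.1554330879, 25.0986809174],
        [55.1554330879, 25.0986809174],
        [55.14353879, 25.44],
        [55.11239959, 25.3232],
    ],
    "metric": ["met_1", "met_2", "met_2", "met_221212", "met_2132"],
    "year": [2014, 2015, 2016, 2020, 2022],
})
print(df)
```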
Expected result:
coordinates metric year
[55.2274742137, 25.1560686018] met_1 2014
[55.1554330879, 25.0986809174] met_2 [2015,2016]
[55.14353879, 25.44] met_221212 2020
[55.11239959, 25.3232] met_2132 2022
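I want to find the records that are duplicated on the coordinates and metric columns. When such duplicates exist, their year values should be collected into a list, and that list should become the new year value. Then I want to drop the duplicate rows.
You need to group by coordinates and metric and aggregate year.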
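However, if a grouping column contains lists, groupby fails with:
TypeError: unhashable type: 'list'
The fix is to convert the lists to tuples, which are hashable.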
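The difference can be seen in a small sketch. The three-row frame and its values are made up for illustration; only the list-to-tuple conversion mirrors the answer.

```python
import pandas as pd

# Hypothetical miniature frame; the values are illustrative only.
df = pd.DataFrame({
    "coordinates": [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]],
    "metric": ["met_2", "met_2", "met_9"],
    "year": [2015, 2016, 2020],
})

# Lists are mutable, so Python refuses to hash them.
try:
    hash(df.loc[0, "coordinates"])
    list_is_hashable = True
except TypeError as exc:
    list_is_hashable = False
    print(exc)  # unhashable type: 'list'

# Tuples of floats are hashable, so they can serve as group keys.
df["coordinates"] = df["coordinates"].apply(tuple)
result = df.groupby(["coordinates", "metric"], sort=False)["year"].apply(list)
print(result)
```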
Another problem: if you want a list only where a group has more than one value, you need a slightly more complicated lambda function:
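# lists are unhashable, so use tuples as group keys
df.coordinates = df.coordinates.apply(tuple)
# the parentheses allow the method chain to span several lines
df = (df.groupby(['coordinates', 'metric'], sort=False)['year']
        .apply(lambda x: list(x) if len(x) > 1 else x.item()))
df = df.reset_index()
# convert the keys back to lists
df.coordinates = df.coordinates.apply(list)
print(df)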
coordinates metric year
0 [55.2274742137, 25.1560686018] met_1 2014
1 [55.1554330879, 25.0986809174] met_2 [2015, 2016]
2 [55.14353879, 25.44] met_221212 2020
3 [55.11239959, 25.3232] met_2132 2022
If it is acceptable for every value in the output column to be a list:
df.coordinates = df.coordinates.apply(tuple)
df = df.groupby(['coordinates','metric'], sort=False)['year'].apply(list)
df = df.reset_index()
df.coordinates = df.coordinates.apply(list)
print (df)
coordinates metric year
0 [55.2274742137, 25.1560686018] met_1 [2014]
1 [55.1554330879, 25.0986809174] met_2 [2015, 2016]
2 [55.14353879, 25.44] met_221212 [2020]
3 [55.11239959, 25.3232] met_2132 [2022]
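One possible reason to prefer this all-lists variant is that every cell then has the same type, so later steps do not need to check whether a value is a scalar or a list. A minimal sketch, assuming shortened made-up coordinates that are already tuples and a hypothetical n_years column name:

```python
import pandas as pd

# Illustrative data; coordinates are shortened and already stored as tuples.
df = pd.DataFrame({
    "coordinates": [(55.22, 25.15), (55.15, 25.09), (55.15, 25.09)],
    "metric": ["met_1", "met_2", "met_2"],
    "year": [2014, 2015, 2016],
})

out = (
    df.groupby(["coordinates", "metric"], sort=False)["year"]
      .apply(list)
      .reset_index()
)

# Every cell is a list, so the element-wise length needs no type checks.
out["n_years"] = out["year"].str.len()
print(out)
```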
If you need the output as strings:
df.coordinates = df.coordinates.apply(tuple)
# cast the years to str so they can be joined with commas
df = (df.groupby(['coordinates', 'metric'], sort=False)['year']
        .apply(lambda x: ','.join(x.astype(str))))
df = df.reset_index()
df.coordinates = df.coordinates.apply(list)
print(df)
coordinates metric year
0 [55.2274742137, 25.1560686018] met_1 2014
1 [55.1554330879, 25.0986809174] met_2 2015,2016
2 [55.14353879, 25.44] met_221212 2020
3 [55.11239959, 25.3232] met_2132 2022
You can use groupby to help here:
import pandas as pd

# dummy data
df = pd.DataFrame([[[55.2274742137, 25.1560686018], "met_1", 2014],
                   [[55.1554330879, 25.0986809174], "met_2", 2015],
                   [[55.1554330879, 25.0986809174], "met_2", 2015]],
                  columns=["coordinates", "metric", "year"])
print(df)
coordinates metric year
0 [55.2274742137, 25.1560686018] met_1 2014
1 [55.1554330879, 25.0986809174] met_2 2015
2 [55.1554330879, 25.0986809174] met_2 2015
# define apply function: return all years of a group if there are several,
# otherwise the single year as a scalar
def aggregate(sub_df):
    years = sub_df["year"].values
    if len(years) > 1:
        return years
    else:
        return years[0]
# groupby needs hashable items, that's why we convert to tuple before
df["coordinates"] = df["coordinates"].apply(tuple)
# groupby and apply aggregator
print(df.groupby(["coordinates", "metric"]).apply(aggregate))
coordinates metric
(55.1554330879, 25.0986809174) met_2 [2015, 2015]
(55.2274742137, 25.1560686018) met_1 2014
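The result above is a Series indexed by (coordinates, metric), with tuples as coordinates. If a flat DataFrame in the shape of the expected result is wanted, one way to get there is sketched below. The helper name aggregate_years is hypothetical, and the sketch groups only the year column instead of passing whole sub-frames to apply.

```python
import pandas as pd

df = pd.DataFrame(
    [[[55.2274742137, 25.1560686018], "met_1", 2014],
     [[55.1554330879, 25.0986809174], "met_2", 2015],
     [[55.1554330879, 25.0986809174], "met_2", 2015]],
    columns=["coordinates", "metric", "year"],
)

def aggregate_years(years):
    """Return a list when a group has several years, else the single value."""
    values = years.tolist()
    return values if len(values) > 1 else values[0]

# Tuples are hashable, so they can be used as group keys.
df["coordinates"] = df["coordinates"].apply(tuple)
out = (
    df.groupby(["coordinates", "metric"])["year"]
      .apply(aggregate_years)
      .reset_index()  # index levels become ordinary columns again
)
# Restore the original list representation of the coordinates.
out["coordinates"] = out["coordinates"].apply(list)
print(out)
```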