Shuffle DataFrame rows out of order if the values between two columns "overlap"
I have the following DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame({"first_element":[20, 125, 156, 211, 227, 220, 230, 472, 4765], "second_element":[35, 145, 178, 233, 321, 234, 231, 498, 8971], "next":[0.32, 0.04, 0.59, 0.103, 0.37, 0.92, 0.81, 0.24, 0.77]})
df = df[["first_element", "second_element", "next"]]
print(df)
### print(df) outputs:
   first_element  second_element   next
0             20              35  0.320
1            125             145  0.040
2            156             178  0.590
3            211             233  0.103
4            227             321  0.370
5            220             234  0.920
6            230             231  0.810
7            472             498  0.240
8           4765            8971  0.770
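In this DataFrame, each row is treated as an interval on the real line, [first_element, second_element], e.g. 20 to 35, 125 to 145.
If I wanted to sort df by these two columns, I would use .sort_values on first_element and second_element,
which outputs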
print(sorted_df)
   first_element  second_element   next
0             20              35  0.320
1            125             145  0.040
2            156             178  0.590
3            211             233  0.103
5            220             234  0.920
4            227             321  0.370
6            230             231  0.810
7            472             498  0.240
8           4765            8971  0.770
Several intervals intersect/overlap, namely [211, 233], [220, 234], [227, 321], [230, 231]. Because [230, 231] is a subset of [211, 233], there are several ways to order these two.
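To make "overlap" and "subset" concrete, here is a minimal sketch, assuming the intervals are closed (both endpoints included). The helper names `overlaps` and `contains` are hypothetical and are not part of pandas.

```python
def overlaps(a, b):
    """True if closed intervals a = (start, end) and b = (start, end) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


def contains(outer, inner):
    """True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


print(overlaps((211, 233), (220, 234)))  # True
print(contains((211, 233), (230, 231)))  # True
print(overlaps((156, 178), (211, 233)))  # False
```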
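My goal is (1) to write a function that finds all overlapping intervals (based on the values in the two columns first_element and second_element), and (2) to randomly shuffle these intervals.
Goal (2) sounds quite tricky, because several groups of overlapping intervals have to be shuffled/rearranged separately. For example, suppose our DataFrame were larger and had the following overlapping intervals: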
[211, 233], [220, 234], [227, 321], [230, 231], [5550, 5879], [5400, 5454]
I would like to shuffle [211, 233], [220, 234], [227, 321], [230, 231] and [5550, 5879], [5400, 5454] separately, without mixing up the subsets of overlapping intervals.
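One possible way to find the groups without an explicit loop is sketched below. It assumes that intervals are closed, so intervals that merely touch count as overlapping, and that overlap is transitive (a chain of overlapping intervals forms one group). The function name `label_overlap_groups` and the column name `group` are made up for illustration.

```python
import pandas as pd


def label_overlap_groups(df, start="first_element", end="second_element"):
    """Return a Series of group ids; rows whose intervals chain-overlap share an id."""
    s = df.sort_values([start, end])
    # Largest right edge among all earlier rows (in sorted order); NaN for the first row.
    prev_max_end = s[end].cummax().shift()
    # A new group starts when an interval begins after everything seen so far.
    # The comparison with NaN is False, so the first row gets group 0.
    new_group = s[start] > prev_max_end
    # Running count of group starts, returned in the original row order.
    return new_group.cumsum().reindex(df.index)


df = pd.DataFrame({
    "first_element": [20, 125, 156, 211, 227, 220, 230, 472, 4765],
    "second_element": [35, 145, 178, 233, 321, 234, 231, 498, 8971],
    "next": [0.32, 0.04, 0.59, 0.103, 0.37, 0.92, 0.81, 0.24, 0.77],
})
df["group"] = label_overlap_groups(df)
print(df)
```

Rows 3 to 6 of the sample data end up with the same group id, while every other row forms a group of its own.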
There are several ways to shuffle rows with pandas, e.g. shuffling by index:
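import random

def shuffle_by_index(df):
    index = list(df.index)
    random.shuffle(index)
    # .ix has been removed from pandas; .loc selects rows by label
    df = df.loc[index]
    # reset_index returns a new frame, so the result must be assigned
    df = df.reset_index(drop=True)
    return df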
Or using sklearn (sklearn.utils.shuffle).
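But (1) how do I search for all overlapping intervals in a pythonic/pandas way, and (2) how do I select the subsets of these overlapping intervals and shuffle only those, each separately?

This is not the best way to solve the problem, but it will give the result you want. I have left the second part to you.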
import numpy as np
import pandas as pd
df = pd.DataFrame({"first_element":[20, 125, 156, 211, 227, 220, 230, 472, 4765], "second_element":[35, 145, 178, 233, 321, 234, 231, 498, 8971], "next":[0.32, 0.04, 0.59, 0.103, 0.37, 0.92, 0.81, 0.24, 0.77]})
df = df[["first_element", "second_element", "next"]]
sorted_df = df.sort_values(["first_element", "second_element"], ascending=[True, False])
sorted_df.reset_index(0, inplace = True)
prev_min = sorted_df.first_element.iloc[0]
prev_max = sorted_df.second_element.iloc[0]
labels = []
label_counter = 1
labels.append(label_counter)
for rowIndex in range(1, sorted_df.shape[0]):
    row = sorted_df.iloc[rowIndex]
    if row.first_element > prev_max:
        # starts after the running maximum: a new group of intervals begins here
        prev_min = row.first_element
        prev_max = row.second_element
        label_counter += 1
        labels.append(label_counter)
    elif row.first_element >= prev_min:
        # overlaps the current group: extend its right edge if needed
        prev_max = max(prev_max, row.second_element)
        labels.append(label_counter)
sorted_df['overlapping_index'] = labels
# group sorted_df by overlapping_index, then shuffle the rows within each group
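I understand that the last line of code produces a DataFrame with all the overlapping intervals labelled. What I don't know is how to (1) split these indices into separate groups of intersecting intervals, and (2) shuffle these indices randomly so that the final output is the original DataFrame.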
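A minimal sketch of that remaining step follows, assuming a frame that already carries a group label column such as overlapping_index. The function name `shuffle_within_groups` and the `seed` parameter are hypothetical. It permutes row positions inside each group and leaves every other row where it was.

```python
import numpy as np
import pandas as pd


def shuffle_within_groups(df, group_col, seed=None):
    """Shuffle rows inside each multi-row group; single-row groups stay in place."""
    rng = np.random.default_rng(seed)
    order = np.arange(len(df))
    # groupby(...).indices maps each group label to the integer positions of its rows.
    for positions in df.groupby(group_col).indices.values():
        if len(positions) > 1:
            # Permute the positions of this group among themselves only.
            order[positions] = rng.permutation(positions)
    return df.iloc[order].reset_index(drop=True)


# Sample data in sorted order, labelled the way the loop above labels it.
sorted_df = pd.DataFrame({
    "first_element": [20, 125, 156, 211, 220, 227, 230, 472, 4765],
    "second_element": [35, 145, 178, 233, 234, 321, 231, 498, 8971],
    "overlapping_index": [1, 2, 3, 4, 4, 4, 4, 5, 6],
})
shuffled = shuffle_within_groups(sorted_df, "overlapping_index", seed=0)
print(shuffled)
```

Because only positions within a group are exchanged, the overlapping_index column of the result is identical to the input, and the non-overlapping rows do not move. Dropping overlapping_index afterwards would give back the original columns.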
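The sklearn variant mentioned in the question:
import sklearn.utils
shuffled = sklearn.utils.shuffle(df)
# reset the index of the shuffled frame, not of the original df
shuffled = shuffled.reset_index(drop=True)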