Python: finding rows in the same DataFrame based on conditions
For every row in a DataFrame, I want to find the rows that are similar to it, and then potentially place those rows under the relevant row in the same DataFrame. Basically, I have power consumption over a period of time, and I want to find matching entries from the past according to criteria that I define. The head of my DataFrame is attached. Is this possible?
timestamp power daytype ... dayofweek weekday quarter
0 2014-10-15 12:30:00 0.031707 weekday ... 2 2 4
1 2014-10-15 12:45:00 0.140829 weekday ... 2 2 4
2 2014-10-15 13:00:00 1.703882 weekday ... 2 2 4
3 2014-10-15 13:15:00 0.032661 weekday ... 2 2 4
4 2014-10-15 13:30:00 0.032939 weekday ... 2 2 4
Following @brentertainer's reply, I tried the following:
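import numpy as np
import pandas as pd

frames = []
for index, row in dfAll.iterrows():  # iterrows is a method and must be called
    mask = np.logical_and.reduce([
        dfAll['date'] == row['date'],
        dfAll['hour'] == row['hour']
    ])
    # DataFrame.append returned a new frame (and was removed in pandas 2.0),
    # so collect the pieces and concatenate once at the end
    frames.append(dfAll.loc[mask, :])
dfNew = pd.concat(frames)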
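I was hoping to build up a new DataFrame with these filtered values for each row. Also, can I add an extra column containing the index of the row that the entries were filtered for?
I think the answer to your question is "yes", but the scenario you describe feels rather abstract. I have provided a similarly abstract example that illustrates some of the possibilities, and I hope you can see how it applies to your situation. Depending on what constitutes "similar", change the definition of mask inside the function.
Create dummy data: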
import pandas as pd
import numpy as np
# make example repeatable
np.random.seed(0)
# make dummy data
N = 100
df = pd.DataFrame(data=np.random.choice(range(5), size=(N, 8)))
df.columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
Updated suggestion:
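def similar_rows(idx, row, df):
    # rows of df that count as "similar" to the given row
    mask = np.logical_and.reduce([
        df['a'] == row['a'],
        abs(df['b'] - row['b']) <= 1,
        df['h'] == (3 - row['h'])
    ])
    # copy so that insert() does not operate on a slice of df
    df_tmp = df.loc[mask, :].copy()
    # record which row these matches belong to
    df_tmp.insert(0, 'original_index', idx)
    return df_tmp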
# create result
df_new = pd.concat([similar_rows(idx, row, df) for idx, row in df.iterrows()])
df_new.reset_index(inplace=True)
df_new.rename({'index': 'similar_index'}, axis=1, inplace=True)
print(df_new.head(10))
similar_index original_index a b c d e f g h
0 1 0 4 0 0 4 2 1 0 1
1 88 0 4 1 4 0 0 2 3 1
2 0 1 4 0 3 3 3 1 3 2
3 59 1 4 1 4 1 4 1 2 2
4 82 1 4 0 2 3 4 3 0 2
5 4 2 1 1 1 0 2 4 3 3
6 7 2 1 1 3 3 2 3 0 3
7 37 2 1 0 2 4 4 2 4 3
8 14 3 2 3 1 2 1 4 2 3
9 16 3 2 3 0 4 0 0 2 3
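The loop above builds one mask per row. As a rough sketch of an alternative, the same three conditions can be evaluated for all row pairs at once with NumPy broadcasting. This is not part of the original answer; the names pair_mask, orig_pos, sim_pos and pairs are made up for illustration, and the approach assumes the N x N boolean array fits in memory.

```python
import numpy as np
import pandas as pd

# same dummy data as in the answer
np.random.seed(0)
N = 100
df = pd.DataFrame(np.random.choice(range(5), size=(N, 8)),
                  columns=list("abcdefgh"))

a = df["a"].to_numpy()
b = df["b"].to_numpy()
h = df["h"].to_numpy()

# pair_mask[i, j] is True when row j is "similar" to row i
# (same three conditions as in similar_rows)
pair_mask = (
    (a[None, :] == a[:, None])
    & (np.abs(b[None, :] - b[:, None]) <= 1)
    & (h[None, :] == 3 - h[:, None])
)

# positions of all True entries, in row-major order
orig_pos, sim_pos = np.nonzero(pair_mask)
pairs = pd.DataFrame({
    "original_index": df.index[orig_pos],
    "similar_index": df.index[sim_pos],
})

# attach the values of each matched row
df_new = pairs.join(df, on="similar_index")
print(df_new.head(10))
```

Memory grows with the square of the number of rows, so for long time series it may be necessary to process the rows in chunks instead.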
Original suggestion:
# get row at random
row = df.loc[np.random.choice(N), :]
print('Randomly Selected Row:')
print(pd.DataFrame(row).T)
# create and apply a mask for arbitrarily similar rows
mask = np.logical_and.reduce([
    df['a'] == row['a'],
    abs(df['b'] - row['b']) <= 1,
    df['h'] == (3 - row['h'])
])
print('"Similar" Results:')
df_filtered = df.loc[mask, :]
print(df_filtered)
Randomly Selected Row:
a b c d e f g h
23 3 2 4 3 3 0 3 0
"Similar" Results:
a b c d e f g h
26 3 2 2 4 3 1 2 3
60 3 1 2 2 4 2 2 3
86 3 2 4 1 3 0 4 3
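To tie the abstract example back to the power-consumption data in the question, the mask might be adapted roughly as follows. This is only a sketch: the column names timestamp, power and daytype come from the question's header, but the sample values, the tolerance of 0.05, the same-hour criterion, the restriction to earlier timestamps and the helper name similar_past_rows are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# small invented sample shaped like the question's data
df_power = pd.DataFrame({
    "timestamp": pd.date_range("2014-10-13 12:00", periods=6, freq="D"),
    "power": [0.03, 0.05, 0.90, 0.04, 0.06, 0.02],
    "daytype": ["weekday"] * 5 + ["weekend"],
})

def similar_past_rows(idx, row, df, tol=0.05):
    """Earlier rows with the same daytype and hour and power within tol."""
    mask = np.logical_and.reduce([
        df["timestamp"] < row["timestamp"],                # only look at the past
        df["daytype"] == row["daytype"],                   # strict equality
        df["timestamp"].dt.hour == row["timestamp"].hour,  # same hour of day
        (df["power"] - row["power"]).abs() <= tol,         # flexible range
    ])
    out = df.loc[mask, :].copy()
    out.insert(0, "original_index", idx)
    return out

pieces = [similar_past_rows(idx, row, df_power)
          for idx, row in df_power.iterrows()]
# skip rows that had no match before concatenating
matches = pd.concat([p for p in pieces if not p.empty])
matches = matches.reset_index().rename(columns={"index": "similar_index"})
print(matches)
```

Each line of the mask is one criterion, so adding or loosening a condition (for example a temperature tolerance) means adding or editing one entry in the list.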
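Please define precisely what "similar to the selected row" means.
Similar refers to user-specified conditions, for example, finding rows by date, day type, and temperature within a +/-2 range.
Because of the limits on comments, I have clarified the question further in response to @brentertainer.
@Sanja I have updated my post. If you base similarity only on a few strict equalities, there are more efficient ways to do this. But if you need it to be flexible (for things like the +/-2 range), this still works.
Thank you. I have run into performance problems, so I will be careful with this. Thanks again. Yes, I need the flexibility.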
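The remark about strict equalities can be illustrated with a self-merge: when every criterion is an exact match, one DataFrame.merge on the key columns pairs each row with all of its matches without a Python-level loop. The sketch below is a guess at what "more efficient" could mean here, not the commenter's own code; the choice of 'a' and 'b' as key columns and the decision to drop self-matches are assumptions.

```python
import numpy as np
import pandas as pd

# dummy data in the same shape as the answer's example
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 5, size=(100, 8)), columns=list("abcdefgh"))

keys = ["a", "b"]  # hypothetical: columns that must match exactly

# left side: just the keys plus the index of the row being matched
left = df[keys].reset_index().rename(columns={"index": "original_index"})
# right side: the full candidate rows plus their own index
right = df.reset_index().rename(columns={"index": "similar_index"})

# every (original, similar) pair that agrees on all key columns
pairs = left.merge(right, on=keys, how="inner")
# an equality join always matches a row with itself, so drop those pairs
pairs = pairs[pairs["original_index"] != pairs["similar_index"]]
print(pairs.head(10))
```

Range conditions such as a +/-2 tolerance cannot be expressed as merge keys directly, which is why the mask-based approach remains the flexible option.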