Python: How can I binary search on a combination of column values in a DataFrame?
Apologies if this is a simple question that the pandas documentation already answers, but I have searched for how to do this without any luck. I have a pandas DataFrame with multiple columns, and I want to look up specific rows using binary search, because my dataset is large and I will be doing a lot of lookups. My data looks like this:
Name Course Week Grade
------------- ------- ---- -----
Homer Simpson MATH001 1 97
Homer Simpson MATH001 3 85
Homer Simpson CSCI100 1 89
John McGuirk MATH001 2 78
John McGuirk CSCI100 1 100
John McGuirk CSCI100 2 96
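I want to be able to search my data quickly for a specific combination of name, course, and week. Each distinct combination of name, course, and week has either zero rows or one row in the dataset. If the combination I search for is missing, I want the search to return 0.
For example, I want to search for the values (John McGuirk, CSCI100, 1).
Is there a built-in way to do this, or do I have to write my own binary search?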
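Update:
I tried the built-in approach suggested by one of the commenters below. I also tried a custom binary search written for my particular data, and a second custom binary search that uses recursion so it can handle columns different from those in my specific example.
The DataFrame used for these tests contains 10,000 rows. My timings are below. Both binary searches performed better than using […]
to get the row. I am far from a Python expert, so I am not sure how well optimized my code is.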
# Load data
from pandas import DataFrame, read_csv
import math
import pandas as pd
import time
file = 'grades.xlsx'
df = pd.read_excel(file)
# This was suggested by one of the commenters below
def get_grade(name, course, week):
    mask = (df.name.values == name) & (df.course.values == course) & (df.week.values == week)
    row = df[mask]
    if not row.empty:
        return row.grade.values[0]
    else:
        return 0
# Binary search that is specific to my particular data
def get_grade_binary_search(name, course, week):
    lower = 0
    upper = len(df.index) - 1
    while lower <= upper:
        mid = math.floor((lower + upper) / 2)
        row_name = df.iat[mid, 0]
        if name < row_name:
            upper = mid - 1
        elif name > row_name:
            lower = mid + 1
        else:
            row_course = df.iat[mid, 1]
            if course < row_course:
                upper = mid - 1
            elif course > row_course:
                lower = mid + 1
            else:
                row_week = df.iat[mid, 2]
                if week < row_week:
                    upper = mid - 1
                elif week > row_week:
                    lower = mid + 1
                else:
                    return df.iat[mid, 3]
    return 0
# General purpose binary search
def get_grade_binary_search_recursive(search_value):
    lower = 0
    upper = len(df.index) - 1
    while lower <= upper:
        mid = math.floor((lower + upper) / 2)
        comparison = compare(search_value, 0, mid)
        if comparison < 0:
            upper = mid - 1
        elif comparison > 0:
            lower = mid + 1
        else:
            return df.iat[mid, len(search_value)]
    # No matching row: return 0, as the other lookups do
    return 0
# Utility method: compares the search key with a row, column by column
def compare(search_value, search_column_index, df_value_index):
    if search_column_index >= len(search_value):
        return 0
    if search_value[search_column_index] < df.iat[df_value_index, search_column_index]:
        return -1
    elif search_value[search_column_index] > df.iat[df_value_index, search_column_index]:
        return 1
    else:
        return compare(search_value, search_column_index + 1, df_value_index)
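Because Python tuples compare element by element, the column-by-column comparison done by compare can also be expressed with the standard-library bisect module. The following is a minimal sketch rather than the poster's code: it assumes the rows are already sorted by (name, course, week), and the names rows, keys, grades, and get_grade_bisect are hypothetical. The sample rows come from the table in the question.

```python
from bisect import bisect_left

# Sample rows from the question, sorted by (name, course, week).
rows = sorted([
    ("Homer Simpson", "MATH001", 1, 97),
    ("Homer Simpson", "MATH001", 3, 85),
    ("Homer Simpson", "CSCI100", 1, 89),
    ("John McGuirk", "MATH001", 2, 78),
    ("John McGuirk", "CSCI100", 1, 100),
    ("John McGuirk", "CSCI100", 2, 96),
])

# Split into a sorted list of key tuples and a parallel list of grades.
keys = [row[:3] for row in rows]
grades = [row[3] for row in rows]


def get_grade_bisect(name, course, week):
    """Return the grade for the key, or 0 if the key is absent."""
    key = (name, course, week)
    pos = bisect_left(keys, key)  # binary search for the insertion point
    if pos < len(keys) and keys[pos] == key:
        return grades[pos]
    return 0
```

For example, get_grade_bisect("John McGuirk", "CSCI100", 1) looks up the row used as the example in the question, and a key that is not present falls through to the final return 0.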
elapsed time: 26.130020141601562
sum of grades: 498724
# Binary search specific to this data
sum_of_grades = 0
start = time.time()
for week in range(first_week, last_week + 1):
    for name in names:
        for course in courses:
            val = get_grade_binary_search(name, course, week)
            sum_of_grades += val
end = time.time()
print('elapsed time: ', end - start)
print('sum of grades: ', sum_of_grades)
# Binary search with recursion
sum_of_grades = 0
start = time.time()
for week in range(first_week, last_week + 1):
    for name in names:
        for course in courses:
            val = get_grade_binary_search_recursive([name, course, week])
            sum_of_grades += val
end = time.time()
print('elapsed time: ', end - start)
print('sum_of_grades: ', sum_of_grades)
elapsed time: 4.4506165981292725
sum of grades: 498724
elapsed time: 7.559535264968872
sum of grades: 498724
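As a point of comparison with the hand-written searches, the same lookup can be delegated to a pandas index. This is a minimal sketch under stated assumptions, not something from the original post: it assumes the column names shown in the table (Name, Course, Week, Grade), and grade_by_key and get_grade_indexed are hypothetical names. How pandas performs the lookup internally is not claimed here, and no timings are implied.

```python
import pandas as pd

# Sample data from the question (assumed column names).
df = pd.DataFrame({
    "Name": ["Homer Simpson", "Homer Simpson", "Homer Simpson",
             "John McGuirk", "John McGuirk", "John McGuirk"],
    "Course": ["MATH001", "MATH001", "CSCI100",
               "MATH001", "CSCI100", "CSCI100"],
    "Week": [1, 3, 1, 2, 1, 2],
    "Grade": [97, 85, 89, 78, 100, 96],
})

# Series of grades keyed by a sorted (Name, Course, Week) MultiIndex.
grade_by_key = df.set_index(["Name", "Course", "Week"])["Grade"].sort_index()


def get_grade_indexed(name, course, week):
    """Return the grade for the key, or 0 if the combination is missing."""
    try:
        return int(grade_by_key.loc[(name, course, week)])
    except KeyError:
        return 0
```

The index is built once, so the cost of sorting is paid up front and each later call is a single keyed lookup.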
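From the comments:
searchsorted uses binary search to find the required insertion point. Also, please post the desired output DataFrame.
Is there a problem with using numpy.where or the df[((df.Name == 'foo') & (df.Week == 'bar'))] syntax?
What are the "specific combinations" you want to search for, as mentioned above? Include your data so that we can copy and paste it.
If you want to dig into the technical background, pandas uses boolean indexing.
Not sure if this helps, but I just ran a timed test of a selection based on 4 columns on a DataFrame with 5 million rows: 64 ms ± 595 µs per loop (mean ± std. dev. of 7 runs, 10 loops each).
The problem I have is that I don't know how to use it to find the insertion point in a DataFrame that is sorted by several columns rather than just the first one.
Have you tried something like: my_dataframe.sort_values(by=['Name', 'Course', 'Week'], ascending=True)
I tried that, but it returns a DataFrame, while searchsorted needs a Series.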
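One possible way around the limitation raised in the last comments is to apply numpy.searchsorted one column at a time, narrowing the row window after each column. This is a sketch under assumptions, not an answer from the thread: it assumes the column names from the table and a DataFrame sorted by all three key columns, and key_arrays, grade_array, and get_grade_searchsorted are hypothetical names.

```python
import numpy as np
import pandas as pd

# Sample data from the question (assumed column names).
df = pd.DataFrame({
    "Name": ["Homer Simpson", "Homer Simpson", "Homer Simpson",
             "John McGuirk", "John McGuirk", "John McGuirk"],
    "Course": ["MATH001", "MATH001", "CSCI100",
               "MATH001", "CSCI100", "CSCI100"],
    "Week": [1, 3, 1, 2, 1, 2],
    "Grade": [97, 85, 89, 78, 100, 96],
})

key_columns = ["Name", "Course", "Week"]
sorted_df = df.sort_values(by=key_columns).reset_index(drop=True)

# Extract one NumPy array per column once, outside the lookup function.
key_arrays = [sorted_df[col].to_numpy() for col in key_columns]
grade_array = sorted_df["Grade"].to_numpy()


def get_grade_searchsorted(name, course, week):
    """Narrow the window [lo, hi) column by column with binary searches."""
    lo, hi = 0, len(grade_array)
    for values, target in zip(key_arrays, (name, course, week)):
        # Rows in [lo, hi) agree on all earlier columns, so this slice is sorted.
        window = values[lo:hi]
        left = np.searchsorted(window, target, side="left")
        right = np.searchsorted(window, target, side="right")
        lo, hi = lo + left, lo + right
        if lo == hi:  # empty window: the combination is not present
            return 0
    return int(grade_array[lo])
```

Each column costs two binary searches over a shrinking window, so the DataFrame itself is only needed for the initial sort.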