Python 3.x: Efficient filtering of a list in Python


python-3.x list filter time-complexity


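I have a database table named "do_not_call" that holds information about files, each of which contains 10-digit phone numbers in increasing order. The "filename" column holds the name of the file that contains the range of numbers from "first_phone" to "last_phone". The "do_not_call" table has about 2500 records.

I have a list of SQLAlchemy records, and I need to find the file that covers the "phone" field of each record. So I wrote a function that takes the SQLAlchemy records and returns a dictionary in which each key is a filename and each value is the list of phone numbers from the SQLAlchemy records that fall between the first and last phone numbers contained in that file: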
from datetime import datetime

def get_file_mappings(dbcursor, import_records):
    start_time = datetime.now()
    phone_list = [int(rec.phone) for rec in import_records]
    dnc_sql = "SELECT * from do_not_call;"
    dbcursor.execute(dnc_sql)
    dnc_result = dbcursor.fetchall()
    file_mappings = {}
    for file_info in dnc_result:
        first_phone = int(file_info.get('first_phone'))
        last_phone = int(file_info.get('last_phone'))
        phone_ranges = list(filter(lambda phone: phone in range(first_phone, last_phone), phone_list))
        if phone_ranges:
            file_mappings.update({file_info.get('filename'): phone_ranges})
            phone_list = list(set(phone_list) - set(phone_ranges))
    # print(file_mappings)
    print("Time = ", datetime.now() - start_time)
    return file_mappings

For example, if the phone list is
[2023143300, 2024393100, 2027981539, 2022760321, 2026416368, 2027585911]
the function returns the file mappings
    {'1500000_2020-9-24_Global_45A62481-17A2-4E45-82D6-DDF8B58B1BF8.txt': [2023143300, 2022760321],
     '1700000_2020-9-24_Global_45A62481-17A2-4E45-82D6-DDF8B58B1BF8.txt': [2024393100],
     '1900000_2020-9-24_Global_45A62481-17A2-4E45-82D6-DDF8B58B1BF8.txt': [2027981539, 2026416368, 2027585911]}

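The problem is that this takes a long time to run: on average 1.5 seconds for 1000 records. Is there a better approach or algorithm for this? Any help is much appreciated.
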
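This is a very inefficient way to bin things into a sorted list. You are not taking advantage of the fact that your bins are sorted (or could easily be sorted if they are not). By testing the phone numbers with the lambda, you are effectively running a big nested loop: every file row scans the whole remaining phone list.

You can make some marginal improvements by being consistent about using sets (see below). In the end, though, you could and should find each phone's place in the list of bins with an efficient search such as bisection. See the example below, which times the original approach, an all-sets implementation, and bisection.

If your phone_list is absolutely massive, other approaches may be advantageous, such as finding where the bin cutoffs fall in a sorted copy of the phone list. But the version below is roughly 500x faster than what you have now for 1000 or 10000 records.
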
# phone sorter
import random
import bisect
import time
from collections import defaultdict

# make some fake data of representative size
low_phone = 200_000_0000
data = []   # [file, low_phone, high_phone]
for idx in range(2500):
    row = []
    row.append(f'file_{idx}')
    row.append(low_phone + idx * 20000000)
    row.append(low_phone + (idx + 1) * 20000000 - 20)  # some gap
    data.append(row)

high_phone = data[-1][-1]

# generate some random phone numbers in range
num_phones = 10000
phone_list_orig = [random.randint(low_phone, high_phone) for t in range(num_phones)]

# orig method...
phone_list = phone_list_orig[:]
tic = time.time()
results = {}
for row in data:
    low = row[1]
    high = row[2]
    phone_ranges = list(filter(lambda phone: phone in range(low, high), phone_list))
    if phone_ranges:
        results.update({row[0]:phone_ranges})
        phone_list = list(set(phone_list) - set(phone_ranges))
toc = time.time()
print(f'orig time: {toc-tic:.3f}')

# with sets across the board...
phone_list = set(phone_list_orig)
tic = time.time()
results2 = {}
for row in data:
    low = row[1]
    high = row[2]
    phone_ranges = set(filter(lambda phone: phone in range(low, high), phone_list))
    if phone_ranges:
        results2.update({row[0]:phone_ranges})
        phone_list = phone_list - phone_ranges
toc = time.time()
print(f'using sets time: {toc-tic:.3f}')

# using bisection search
phone_list = set(phone_list_orig)
tic = time.time()
results3 = defaultdict(list)
lows = [t[1] for t in data]
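for phone in phone_list:
    # index of the last bin whose lower bound is <= phone (-1 if below all bins)
    location = bisect.bisect(lows, phone) - 1
    # inclusive upper bound; range(low, high) above excludes high itself
    if location >= 0 and phone <= data[location][2]:
        results3[data[location][0]].append(phone)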
toc = time.time()
print(f'using bisection sort time: {toc-tic:.3f}')

# for k in sorted(results3):
#   print(k, ':', results.get(k))

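# the three methods store list/set values in different orders, so compare as sets
def as_sets(mapping):
    return {name: set(phones) for name, phones in mapping.items()}

assert as_sets(results) == as_sets(results2) == as_sets(results3)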
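
The bisection idea can also be fitted back into the shape of the question's function. The following is only a minimal sketch under stated assumptions: `map_phones_to_files` is a hypothetical name, the rows are assumed to be dict-like with the `filename`, `first_phone` and `last_phone` keys from the question, `last_phone` is treated as inclusive (unlike `range()` in the question's code), and the sample bounds are invented for illustration.

```python
import bisect
from collections import defaultdict

def map_phones_to_files(dnc_rows, phones):
    """Group phones by the file whose [first_phone, last_phone] range holds them."""
    # Sort the bins by lower bound so they can be binary-searched.
    bins = sorted(
        (int(r['first_phone']), int(r['last_phone']), r['filename'])
        for r in dnc_rows
    )
    lows = [b[0] for b in bins]
    mappings = defaultdict(list)
    for phone in phones:
        # Last bin whose lower bound is <= phone; -1 means below every bin.
        idx = bisect.bisect_right(lows, phone) - 1
        if idx >= 0 and phone <= bins[idx][1]:
            mappings[bins[idx][2]].append(phone)
    return dict(mappings)

# Hypothetical rows standing in for the do_not_call table.
rows = [
    {'filename': 'a.txt', 'first_phone': 2020000000, 'last_phone': 2023999999},
    {'filename': 'b.txt', 'first_phone': 2024000000, 'last_phone': 2025999999},
    {'filename': 'c.txt', 'first_phone': 2026000000, 'last_phone': 2027999999},
]
phones = [2023143300, 2024393100, 2027981539, 2022760321, 2026416368, 2027585911]
file_mappings = map_phones_to_files(rows, phones)
print(file_mappings)
```

Each phone costs one binary search over the bins, so the work grows with the number of phones times the logarithm of the number of files, instead of phones times files.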

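Comments:

list(filter(lambda phone: phone in range(first_phone, last_phone), phone_list))
would be more idiomatic (and likely faster) as
[phone for phone in phone_list if phone in range(first_phone, last_phone)]

@Carcigenicate Yes, that cut 0.5 seconds for 1000 records.

Ah, I didn't expect that much of a gain. It makes sense that it is somewhat faster, though, because producing a lazy collection (via filter) and then forcing it into a list has always been slower for me. The use of lambda also seems slow.
list(set(phone_list) - set(phone_ranges))
would also likely be much faster as
phone_range_set = set(phone_ranges); phone_list = [phone for phone in phone_list if phone not in phone_range_set]
Creating two sets just to do a subtraction seems unnecessarily expensive. Double-check the output of these, since I'm only eyeballing them, but they should be equivalent.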
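
The two rewrites suggested in these comments can be checked side by side. This is a small sketch with invented sample values; it swaps the `phone in range(...)` test for a chained comparison, which is equivalent for integers, and the bin bounds are hypothetical.

```python
phone_list = [2023143300, 2024393100, 2027981539, 2022760321]
first_phone, last_phone = 2020000000, 2024000000  # hypothetical bin bounds

# filter + lambda, as in the question
via_filter = list(filter(lambda phone: phone in range(first_phone, last_phone), phone_list))

# list comprehension with a chained comparison (same half-open range)
via_comprehension = [phone for phone in phone_list if first_phone <= phone < last_phone]

# drop the matched phones with one set and membership tests,
# instead of converting both lists to sets and subtracting
matched = set(via_comprehension)
remaining = [phone for phone in phone_list if phone not in matched]

print(via_filter, via_comprehension, remaining)
```

Unlike the set subtraction, the comprehension keeps the original order of `phone_list` and any duplicate entries that were not matched.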

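What is the typical size of phone_list? Your results also seem a bit backward from what I would expect... I would think you'd want key=phone, value=file?

Thanks for the answer. The binary search approach is fast. What if there are too many phone numbers? I developed this with a few thousand records in mind, but in production there could be millions of records.

Great. If the answer helped, you can accept it with the check mark. If the phone list becomes huge, and there is a way to keep it sorted (or to sort it offline or similar), then you might consider reversing the direction of the bisection. That is, bisect the bin boundaries from the files into the phone list and use the resulting indices for slicing. Or you could do it in batches. Something to tinker with.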
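
The reversed direction mentioned in the last comment might look roughly like the sketch below. It is an illustration only: `slice_phones_by_bins` is a hypothetical helper, the bins are assumed to be `(filename, first_phone, last_phone)` tuples with an inclusive upper bound, and the small numbers are made up to keep the example readable.

```python
import bisect

def slice_phones_by_bins(bins, sorted_phones):
    """bins: iterable of (filename, first_phone, last_phone);
    sorted_phones: phone numbers in ascending order."""
    mappings = {}
    for filename, first_phone, last_phone in bins:
        # Locate both bin boundaries inside the sorted phone list.
        start = bisect.bisect_left(sorted_phones, first_phone)
        stop = bisect.bisect_right(sorted_phones, last_phone)
        if start < stop:
            # Every phone between the two indices belongs to this bin.
            mappings[filename] = sorted_phones[start:stop]
    return mappings

bins = [('a.txt', 100, 199), ('b.txt', 200, 299), ('c.txt', 300, 399)]
phones = sorted([150, 101, 310, 399, 450])
sliced = slice_phones_by_bins(bins, phones)
print(sliced)
```

Once the phones are sorted, this performs two binary searches per file rather than one per phone, which is why it may pay off when there are millions of phones and only a few thousand files.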

orig time: 5.236
using sets time: 4.597
using bisection sort time: 0.012
[Finished in 9.9s]