
Python: a faster approach than a nested double for loop when iterating over a large list (18,895 elements)


Here is the code:

import csv
import re

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
    open('cities2.txt', 'rb') as cities, \
    open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter = ",")
    reader = csv.reader(csv_f)
    city_lst = cities.readlines()

    for row in reader:
        for city in city_lst:
            city = city.strip()
            match = re.search((r'\b{0}\b').format(city), row[0])
            if match:
                writer.writerow(row)
                break
"alcohol_rehab_ltp.csv" has 145 rows, and "cities2.txt" has 18,895 lines (so 18,895 elements once converted to a list). The process takes a while to run; I haven't timed it, but it's probably around 5 minutes. Am I overlooking something simple (or more complex) here that could make this script run faster? I will be running other .csv files, which may have up to 1,000 rows, against the large text file "cities2.txt". Any ideas on how to speed this up would be greatly appreciated!

Here is the csv file (header: Keywords (144), Avg CPC, Local Searches, Advertiser Competition):

[alcohol rehab san diego],$49.54,90,High
[alcohol rehab dallas],$86.48,110,High
[alcohol rehab atlanta],$60.93,50,High
[free alcohol rehab centers],$11.88,110,High
[christian alcohol rehab centers],–,70,High
[alcohol rehab las vegas],$33.40,70,High
[alcohol rehab cost],$57.37,110,High
Some lines from the text file:

san diego
dallas
atlanta
dallas
los angeles
denver

Although I don't think the loop/IO is that big a bottleneck, you could start there if you like.

I can offer two suggestions:

(r'\b{0}\b').format(c.strip()) can be moved outside the loop, which improves performance somewhat, because we no longer have to call strip() and format() on every iteration.

Also, you don't have to write out each result inside the loop. Instead, you can build a results list, output_list, collect results into it during the loop, and write it out once afterwards:

import csv
import re
import datetime

start = datetime.datetime.now()

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
    open('cities2.txt', 'rb') as cities, \
    open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter = ",")
    space = ""
    reader = csv.reader(csv_f)
    city_lst = [(r'\b{0}\b').format(c.strip()) for c in cities.readlines()]
    output_list = []
    for row in reader:
        for city in city_lst:
            #city = city.strip()
            match = re.search(city, row[0])
            if match:
                output_list.append(row)
                break
    writer.writerows(output_list)



end = datetime.datetime.now()

print end - start


Note that I'm assuming there is a better way than re.search to find the cities in a row, since cities will usually be separated by whitespace-like delimiters; otherwise the complexity is worse than O(n*m).

One approach is to use a hash table:

ht = [0] * MAX  # MAX is a chosen table size

Read in all the cities (the assumption being that there are thousands of them) and fill in the hash table:

ht[hash(city) % MAX] = 1

Now, as you iterate over each row from the reader:

for row in reader:
    for word in row:
        if ht[hash(word) % MAX] == 1:
            # possible match; hash collisions can give false positives
            pass
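In Python, the hash-table idea above is most directly expressed with a built-in set, which avoids choosing MAX and handling collisions by hand. A minimal runnable sketch, with city names and rows invented for illustration:

```python
# A set is Python's built-in hash table; membership tests are O(1) on average.
city_set = {"san diego", "dallas", "atlanta"}

rows = [
    ["[alcohol rehab san diego]", "$49.54", "90", "High"],
    ["[alcohol rehab cost]", "$57.37", "110", "High"],
]

matched = []
for row in rows:
    # Strip the surrounding brackets, then test every word suffix so that
    # multi-word cities like "san diego" are still found.
    words = row[0].strip("[]").split()
    if any(" ".join(words[i:]) in city_set for i in range(len(words))):
        matched.append(row)
```

Unlike a fixed-size table, a set grows as needed and never reports a false positive from a collision.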


I think you can use a set and indexing:

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
    open('cities2.txt', 'rb') as cities, \
    open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter = ",")
    space = ""
    reader = csv.reader(csv_f)
    # make set of all city names, lookups are 0(1)
    city_set = {line.rstrip() for line in cities}
    output_list = []
    header = next(reader) # skip header
    for row in reader:
        try:
            # names are either first or last with two words preceding or following 
            # so split twice on whitespace from either direction
            if row[0].split(None,2)[-1].rstrip("]") in city_set or row[0].rsplit(None, 2)[0][1:] in city_set:
                output_list.append(row)
        except IndexError as e:
            print(e,row[0])
    writer.writerows(output_list)

The running time is now O(n) rather than quadratic.
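To see what the two splits in the condition above extract, here they are applied to one keyword from the sample CSV (the input string is taken from the question):

```python
row0 = "[alcohol rehab san diego]"

# Split at most twice from the left: the last piece is everything after the
# first two words, with the trailing "]" stripped off.
city_from_end = row0.split(None, 2)[-1].rstrip("]")

# Split at most twice from the right: the first piece is everything before
# the last two words, with the leading "[" stripped off.
city_from_start = row0.rsplit(None, 2)[0][1:]

print(city_from_end)    # the two-word suffix of the keyword
print(city_from_start)  # the two-word prefix of the keyword
```

For this row the first expression recovers the city "san diego" at the end; the second expression would catch a city at the start of a keyword instead.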


First, as @Shawn Zhang suggested, (r'\b{0}\b').format(c.strip()) can be moved outside the loop, and you can build a list of results to avoid writing to the file on every iteration.

Second, you can try re.compile to pre-compile the regular expression, which may improve its performance.

Third, try profiling it a bit to find the bottleneck, for example with timeit or a similar profiling tool.
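As a rough sketch of how timeit can compare the two matching strategies (the sample data here is invented; in practice you would time the real files):

```python
import timeit

setup = """
import re
cities = ["san diego", "dallas", "atlanta"] * 50
patterns = [re.compile(r'\\b%s\\b' % c) for c in cities]
text = "[alcohol rehab san diego]"
"""

# Time one pass over all city patterns with a regex search per city...
regex_time = timeit.timeit(
    "any(p.search(text) for p in patterns)", setup=setup, number=200)
# ...versus a plain substring test per city.
substr_time = timeit.timeit(
    "any(c in text for c in cities)", setup=setup, number=200)

print(regex_time, substr_time)
```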

Also, if the city is always in the first column, which I assume is named 'city', why not use csv.DictReader() to read the csv? I'm sure it's faster than a regex.
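A hedged sketch of the csv.DictReader suggestion, assuming the header names the first column "Keywords" as in the sample above (the in-memory file stands in for alcohol_rehab_ltp.csv):

```python
import csv
import io

# In-memory stand-in for the real CSV file.
data = io.StringIO(
    "Keywords,Avg CPC,Local Searches,Advertiser Competition\n"
    "[alcohol rehab san diego],$49.54,90,High\n"
    "[alcohol rehab cost],$57.37,110,High\n"
)

city_set = {"san diego", "dallas"}

matched = []
for row in csv.DictReader(data):
    # DictReader exposes columns by header name; strip the brackets and
    # use a plain substring test instead of a regex.
    keyword = row["Keywords"].strip("[]")
    if any(city in keyword for city in city_set):
        matched.append(row["Keywords"])
```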

EDIT: Given the file samples you provided, I got rid of re (since you don't really seem to need it) and, with the code below, got more than a 10x speed-up:

import csv

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
    open('cities2.txt', 'rb') as cities, \
    open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter = ",")
    output_list = []
    reader = csv.reader(csv_f)
    city_lst = cities.readlines()

    for row in reader:
        for city in city_lst:
            city = city.strip()
            if city in row[0]:
                output_list.append(row)
    writer.writerows(output_list)


Build a single regular expression containing all of the city names:

city_re = re.compile(r'\b('+ '|'.join(c.strip() for c in cities.readlines()) + r')\b')
Then do:

for row in reader:
    match = city_re.search(row[0])
    if match:
        writer.writerow(row)
This reduces the loop from 18,895 x 145 iterations down to just 18,895, while the regex engine does the best it can at string-prefix matching of those 145 city names.

For convenient testing, here is the full listing:

import csv
import re

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
    open('cities2.txt', 'rb') as cities, \
    open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter = ",")
    reader = csv.reader(csv_f)

    city_re = re.compile(r'\b('+ '|'.join(c.strip() for c in cities.readlines()) + r')\b')

    for row in reader:
        match = city_re.search(row[0])
        if match:
            writer.writerow(row)


The nested loops are relatively small (18,895 x 145 iterations). Have you timed the code at all? Are you sure the bottleneck that costs you 5 minutes really is the loop? If so, I would try to get rid of the regular expressions and instead split the string on non-word characters, using if city in re.split(r'\W+', row[0]): (hoisting the split out of the 'cities' loop). When I actually measured it, it was 2 minutes 30 seconds. I hadn't used datetime before, so I'm glad Shawn Zhang added it to his code so I could see how to use it. Why use re at all? Could you show some of your input? I think you're doing far more work than necessary with re, since the city name will be contained in the first column of each csv row.
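The commenter's splitting idea needs re.split rather than str.split, since str.split does not accept a regular expression. A small sketch under that assumption, using a row modelled on the question's sample data:

```python
import re

city_set = {"dallas", "atlanta"}

first_col = "[alcohol rehab dallas]"
# Split on runs of non-word characters; the brackets produce empty
# strings at the ends, which we filter out.
words = [w for w in re.split(r'\W+', first_col) if w]

found = any(w in city_set for w in words)
```

Note that splitting into single words misses multi-word cities such as "san diego"; those need a multi-word-aware approach like the set-based answers above.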