Python word/expression list frequency distribution - performance improvement

I have another Python question about building a frequency distribution from text matched against a predefined word list. I am processing more than 100,000 text files (each consisting of roughly 15,000 words), and I want to read each file and match it against a word/expression list (vocabulary) of about 6,000 entries. The result should be a dictionary of all entries with their respective frequencies. Here is what I am currently doing:

sample_text = "As the prices of U.S. homes started to falter, doubts arose throughout the global financial system. Banks became weaker, private credit markets stopped functioning, and by the end of the year it was clear that the world banks had sunk into a global recession."

vocabulary_dict = {'prices': 0, 'banks': 0, 'world banks': 0,
                   'private credit markets': 0, 'recession': 0,
                   'global recession': 0}

from os import listdir
from collections import Counter
import re
import nltk

def list_textfiles(directory):
    # Create a list of all files stored in DIRECTORY ending in '.txt'
    textfiles = []
    for filename in listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(directory + "/" + filename)
    return textfiles

def read_textfile(filename):
    # Helper assumed by the loop below: return the file's contents as one string
    with open(filename, encoding='utf-8') as infile:
        return infile.read()

for filename in list_textfiles(directory):
    # Read each report, match the tokenized text against the predefined
    # word list, and count the occurrences of each element of that list
    sample_text = read_textfile(filename).lower()
    splitted = nltk.word_tokenize(sample_text)
    c = Counter(splitted)
    vocabulary_dict = dict.fromkeys(vocabulary_dict, 0)  # reset counts per file
    outfile = open(filename[:-4] + '_output' + '.txt', mode='w')
    # Extract company name and fiscal year end from the filename
    string = str(filename)
    string_print = string[string.rfind('/')+1:string.find('-')] + ':' + string[-6:-4] + '.' + string[-8:-6] + '.' + string[-12:-8]
    for k in sorted(vocabulary_dict):
        # Entries may consist of one or more tokens: count multi-token
        # entries via a word-boundary regex, single tokens via the Counter
        spl = k.split()
        ln = len(spl)
        if ln > 1:
            vocabulary_dict[k] += len(re.findall(r'\b{0}\b'.format(re.escape(k)), sample_text))
        else:
            vocabulary_dict[k] += c[k]
    outfile.write(string_print + '\n')
    # Write each dictionary entry line by line to the corresponding output file,
    # including company name, fiscal year end and tabulated frequency distribution
    for key, value in sorted(vocabulary_dict.items()):
        outfile.write(str(key) + '\t' + str(value) + '\n')
    outfile.close()

# Output according to the example above should be of the form:
"selected part of filename (=string_print)"
'prices' 1
'banks' 2
'world banks' 1
'private credit markets' 1
'recession' 1
'global recession' 1
The code works fine, but I think there is still room for optimization, since processing a single text file currently takes about one minute. My question: is there a way to make the matching of the text against the word/expression list faster?
Thanks a lot for your help :)
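
One direction that may help, as a minimal sketch (the names 'compiled' and 'count_entries' are hypothetical, and it assumes word-boundary regex matching is acceptable for single-word entries as well): pre-compile one pattern per vocabulary entry once, outside the file loop, and count every entry with findall, so the per-file work needs no tokenization at all.

import re

# Pre-compile one pattern per vocabulary entry, once, outside the file loop;
# re.escape guards entries that contain regex metacharacters
compiled = {k: re.compile(r'\b{}\b'.format(re.escape(k)))
            for k in vocabulary_dict}

def count_entries(text):
    # One findall per entry over the lowercased text; no tokenization needed
    text = text.lower()
    return {k: len(rx.findall(text)) for k, rx in compiled.items()}

Inside the loop, the counting step then reduces to counts = count_entries(read_textfile(filename)): the patterns are built once up front, and the per-file nltk.word_tokenize call, which likely dominates the runtime on 15,000-word files, disappears entirely.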

I don't know whether this is faster, but it is certainly shorter. Give it a spin:

text = "As the prices of U.S. homes started to falter, doubts arose throughout the global financial system. Banks became weaker, private credit markets stopped functioning, and by the end of the year it was clear that the world banks had sunk into a global recession."

newDict = dict((k, text.count(k) + text.count(k.title())) for k in vocabulary_dict)
In any case, you should ask this question.
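
A caveat on the one-liner above (an editorial note, not part of the answer): str.count matches raw substrings and is case-sensitive, so k.title() catches 'Banks' but not e.g. 'BANKS', and 'prices' would also be counted inside a longer word such as 'underprices'. A sketch of the same idea that lowercases the text once instead of adding cased copies:

lowered = text.lower()
newDict = dict((k, lowered.count(k)) for k in vocabulary_dict)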