Python 3.x: How to create a more efficient way of parsing words between two large text files (Python 3.6.4)

I'm new to Python and this is my first attempt at applying what I've learned, but I know I'm being inefficient. The code works, but it takes several minutes to finish executing on a novel-length text file.

Is there a more efficient way to achieve the same output? Any criticism of my style would also be greatly appreciated. Thank you all!

def realWords(inFile, dictionary, outFile):
    with open(inFile, 'r') as inf, open(dictionary, 'r') as dictionary, open(outFile, 'w') as outf:
        realWords = ''
        dList = []
        for line in dictionary:
            dSplit = line.split()
            for word in dSplit:
                dList.append(word)
        for line in inf:
            wordSplit = line.split()
            for word in wordSplit:
                if word in dList:
                    realWords += word + ' '
        outf.write(realWords)
        print('File of real words created')
        inf.close()
        dictionary.close()
        outf.close()

'''
I created a function to compare the words in a text file to real words taken 
from a reference dictionary (like the Webster Unabridged Dictionary). It 
takes a text file and breaks it up into individual word components. It then 
compares each word to each word in the reference dictionary text file in 
order to test whether the word is a real word or not. This is done so as to 
eliminate non-real words, names, and some other junk. For each word that 
passes the test, each word is then added to the same empty string. Once all 
words have been parsed, the output string containing all real words is 
written to a new text file.
'''
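
For reference, a minimal usage sketch of the function above (the file names are hypothetical placeholders, not from the original post):

# Hypothetical paths; substitute your own files.
realWords('my_novel.txt', 'webster_unabridged.txt', 'real_words_output.txt')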

For every single word in your novel, you search once through the entire dictionary to see whether you can find that word. That is slow.

You would benefit greatly from the set() data structure, which lets you determine in constant time whether an element is in it.

You can also speed the code up further by dropping the string concatenation and using .join() instead.

I've adjusted your code to use set() and .join(), which should speed it up considerably:

def realWords(inFile, dictionary, outFile):
    with open(inFile, 'r') as inf, open(dictionary, 'r') as dictionary, open(outFile, 'w') as outf:
        realWords = []  # note: a list, for constant-time appends
        dList = set()
        for line in dictionary:
            dSplit = line.split()
            for word in dSplit:
                dList.add(word)
        for line in inf:
            wordSplit = line.split()
            for word in wordSplit:
                if word in dList:  # done in constant time because dList is a set
                    realWords.append(word)
        outf.write(' '.join(realWords))
        print('File of real words created')
        inf.close()
        dictionary.close()
        outf.close()
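
To see why the set matters, here is a small, self-contained timing sketch (not part of the original answer; the word list is made up) comparing membership tests on a list and on a set:

import timeit

words = ['word%d' % i for i in range(100_000)]  # fake dictionary of 100,000 entries
as_list = list(words)
as_set = set(words)

# A list membership test scans the elements (O(n) on average);
# a set membership test is a hash lookup (O(1) on average).
print(timeit.timeit("'word99999' in as_list", globals=globals(), number=100))
print(timeit.timeit("'word99999' in as_set", globals=globals(), number=100))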

You can use set() for fast word lookups, and you can speed up the string building by using ' '.join(your_list), something like this:

def write_real_words(in_file, dictionary, out_file):
    with open(in_file, 'r') as i, open(dictionary, 'r') as d, open(out_file, 'w') as o:
        dictionary_words = set()
        for l in d:
            dictionary_words |= set(l.split())
        real_words = [word for l in i for word in l.split() if word in dictionary_words]
        o.write(" ".join(real_words))
        print('File of real words created')
As for style, most of the above is PEP compliant. I shortened the variable names to avoid horizontal scrolling in the code block, so I'd suggest using more descriptive names for actual use.
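
For example, the same function with more descriptive names (a purely cosmetic sketch of the renaming suggested above; behavior is unchanged):

def write_real_words(input_path, dictionary_path, output_path):
    with open(input_path, 'r') as input_file, \
         open(dictionary_path, 'r') as dictionary_file, \
         open(output_path, 'w') as output_file:
        dictionary_words = set()
        for line in dictionary_file:
            dictionary_words |= set(line.split())
        real_words = [word for line in input_file
                      for word in line.split() if word in dictionary_words]
        output_file.write(" ".join(real_words))
        print('File of real words created')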

I wrote up one possible answer. My main comments are:

1) The functions are more modular; that is, each function does less (i.e., it does one thing well). The function realWords can only be reused in the very specific case where you want to do exactly what you proposed. The functions below each do less, so they are more likely to be reusable.

2) I added stripping of special characters from words, to avoid type II errors (that is, to avoid missing a real word and labeling it nonsense).

3) I added storage of every word that gets flagged as not real. The main QC step for this workflow is to iteratively inspect the output in the "nonsense" category and systematically rescue any real words that were missed.

4) The real words are stored as a set in Python, to guarantee the fastest possible lookup time.

5) I haven't run this, since I don't have the appropriate input files, so there may be some typos or bugs.

# Real words could be missed if they adjoin a special character, so strip all incoming words of special chars.
def clean_words_in_line(input_line):
    """Iterate through a line, remove special characters, return the cleaned words."""
    chars_to_strip = ":;,."  # add characters as need be to remove them
    clean_words = []
    for dirty_word in input_line:
        clean_words.append(dirty_word.strip(chars_to_strip))
    return clean_words

def ref_words_to_set(dct_file):
    """Iterate through a source file of known words, build a list of real words, return it as a set."""
    clean_word_list = []
    with open(dct_file, 'r') as dt_fh:
        for line in dt_fh:
            line = line.strip().split()
            clean_line = clean_words_in_line(line)
            for word in clean_line:
                clean_word_list.append(word)
    clean_word_set = set(clean_word_list)  # use a set to minimize lookup time
    return clean_word_set

def find_real_words(my_novel, cws):
    """Iterate through a book or novel, checking each cleaned word against the known-word set."""
    words_in_dict = []
    quite_possibly_runcible = []
    with open(my_novel) as mn_fh:
        for line in mn_fh:
            line = line.strip().split()
            clean_line = clean_words_in_line(line)
            for word in clean_line:
                if word in cws:
                    words_in_dict.append(word)
                else:
                    quite_possibly_runcible.append(word)
    return words_in_dict, quite_possibly_runcible


set_of_real_words = ref_words_to_set("The_Webster_Unabridged_Dictionary.txt")
real_words, non_sense = find_real_words("Don_Quixote.txt", set_of_real_words)

with open("Verified_words.txt", 'a') as outF:
    outF.write(" ".join(real_words) + "\n")

with open("Lears_words.txt", 'a') as n_outF:
    n_outF.write(" ".join(non_sense) + "\n")
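
One note on the character stripping in clean_words_in_line: str.strip() only removes the listed characters from the ends of a word, not from the inside. A tiny illustrative sketch (the example words are made up):

chars_to_strip = ":;,."
print("windmill,".strip(chars_to_strip))  # -> windmill
print("don't".strip(chars_to_strip))      # -> don't (internal punctuation is left alone)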

This answer is aimed at understanding, not just at handing over better code.

What you need to do here is learn about computational complexity.

The complexity of reading the dictionary is O(number of lines in the dictionary * words per line), or simply O(number of words in the dictionary).

Reading inf looks similarly cheap at first glance. However, idiomatic Python hides a trap here: `if word in dList` is not a constant-time operation for every container type; for a list it scans the whole list, so each test costs O(number of words in the dictionary). On top of that, the language requires a new object for += (in limited cases it can optimize this away, but don't rely on it), so each concatenation costs O(current length of realWords). Assuming most words really are in the dictionary, that length is on the order of the length of the file.

So the overall complexity of this step is O(words in inFile * words in the dictionary) with the += optimization, or O((words in inFile)² * words in the dictionary) without it.

Since the first step has the smaller complexity, and the smaller component drops out, the overall complexity is simply the complexity of this second step.

The other answers have complexity O(words in the dictionary + words in the file), which is irreducible because the two sides of the + are unrelated. Of course, this assumes no hash collisions, but as long as your dictionary is not built from user input, that is a safe assumption. (If it is, grab a convenient container with good worst-case performance.)
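
A small, self-contained sketch (not part of the original answer; the sizes are arbitrary) that makes the cost of repeated += visible next to ' '.join():

import timeit

words = ['word'] * 50_000  # stand-in for the real words found in the file

def build_with_concat():
    out = ''
    for w in words:
        out += w + ' '  # may reallocate and copy the growing string on each pass
    return out

def build_with_join():
    return ' '.join(words) + ' '  # single pass over the list

print(timeit.timeit(build_with_concat, number=10))
print(timeit.timeit(build_with_join, number=10))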
