Python: Improving the performance of code that performs spell correction on text data

python performance nlp nltk spacy

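I have text data in the form of comments that I want to preprocess. Apart from removing noise such as URLs and numbers, and lemmatizing the text, I also want to perform spell correction. Specifically, I only want to correct words that occur no more than a given number of times, to avoid false positives. For this I use pyspellchecker for the correction and NLTK's FreqDist for the word frequencies, but doing so significantly increases the time needed for preprocessing.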

I have tried to make the code as performant as I can, but I am stuck and wonder whether there is still room for improvement.

Here is my code (the imports and the full loop are listed at the end of this page).
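Spell checking is a fairly heavy process.

You can try filtering some of the tokens out of dict_misspell so that the correction is called on fewer words. Analyze the unknown words in a subset of your comments and create rules that filter out certain kinds of tokens.

Examples: words shorter than 2 characters; words containing digits; emoji; named entities; ...

Thanks, that's a good idea, I'll give it a try.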
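The filtering rules suggested above might be sketched roughly as follows. This is only an illustration under assumptions: the helper names `is_correction_candidate` and `select_candidates`, the `min_len` and `max_freq` parameters, and the decision to allow apostrophes and hyphens inside words are not from the original post. Named-entity filtering is left out because it would need the spaCy pipeline's entity recognizer.

```python
from collections import Counter
from typing import Iterable, Mapping, Set


def is_correction_candidate(token: str, min_len: int = 2) -> bool:
    """Return True if a token looks worth sending to the spell checker.

    Hypothetical rules, following the examples in the answer:
    skip very short tokens, tokens containing digits, and tokens with
    non-alphabetic symbols such as emoji.
    """
    if len(token) < min_len:
        return False
    if any(ch.isdigit() for ch in token):
        return False
    # Assumption: apostrophes and hyphens are allowed inside words.
    if not all(ch.isalpha() or ch in "'-" for ch in token):
        return False
    return True


def select_candidates(
    unknown_words: Iterable[str],
    frequencies: Mapping[str, int],
    max_freq: int = 5,
) -> Set[str]:
    """Keep only rare unknown words that pass the rule-based filter."""
    return {
        word
        for word in unknown_words
        if frequencies[word] <= max_freq and is_correction_candidate(word)
    }


# Toy data standing in for the frequency distribution and the unknown words.
frequencies = Counter({"teh": 1, "x": 1, "h4ck": 2, "🙂🙂": 1, "spacy": 50})
selected = select_candidates(frequencies.keys(), frequencies)
print(selected)  # {'teh'}
```

Only the words that survive this filter would then be passed to the spell checker's correction call, which is where most of the time goes.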
Imports:

import spacy
from spellchecker import SpellChecker  # provided by the pyspellchecker package
from nltk.probability import FreqDist

nlp = spacy.load("en_core_web_sm")
spell = SpellChecker()
fdist = FreqDist()

Code:

processed_comments = []
dict_misspell = {}  # misspelled word -> indices of the comments that contain it

# list_of_comments is the list of raw comment strings
pipe = nlp.pipe(list_of_comments, batch_size=512, disable=["tagger", "parser"])
for j, doc in enumerate(pipe):
    tokens = [
        token.lemma_.lower()
        for token in doc
        if not (token.is_punct or token.is_digit or token.like_url
                or token.like_email or token.like_num)
    ]
    processed_comments.append(" ".join(tokens))
    fdist.update(tokens)  # update in place instead of building a new FreqDist per comment

    # remember which comments contain misspellings to avoid having to look at every comment later
    for misspelled_word in spell.unknown(tokens):
        dict_misspell.setdefault(misspelled_word, []).append(j)

# spell correction is done after the rest because only then is the frequency dict fully built
for mis, misspelling_idxs in dict_misspell.items():
    if fdist[mis] <= 5:  # only fix below a certain word frequency to avoid false positives
        correct_spelling = spell.correction(mis)
        if correct_spelling is None or correct_spelling == mis:
            continue  # nothing to replace
        for idx in misspelling_idxs:
            # replace whole tokens only, so the misspelling is not rewritten inside longer words
            processed_comments[idx] = " ".join(
                correct_spelling if tok == mis else tok
                for tok in processed_comments[idx].split(" ")
            )
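One further option for the correction step is to compute each correction once, collect the results in a lookup table, and then rewrite the comments in a single token-wise pass. The sketch below is an assumption-laden illustration, not a measured improvement: `build_corrections` and `apply_corrections` are hypothetical helper names, and `correct_fn` stands in for whatever correction callable is used (in the setup above that would be `spell.correction`).

```python
from typing import Callable, Dict, Iterable, List, Optional


def build_corrections(
    words: Iterable[str],
    correct_fn: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Call the (expensive) correction function once per distinct word."""
    corrections: Dict[str, str] = {}
    for word in words:
        fixed = correct_fn(word)
        # Skip words with no suggestion or with an unchanged suggestion.
        if fixed is not None and fixed != word:
            corrections[word] = fixed
    return corrections


def apply_corrections(
    comments: Iterable[str],
    corrections: Dict[str, str],
) -> List[str]:
    """Rewrite space-separated comments token by token using the lookup table."""
    return [
        " ".join(corrections.get(tok, tok) for tok in comment.split(" "))
        for comment in comments
    ]


# Toy stand-in for a real spell checker: a dict lookup that returns None
# when it has no suggestion.
toy_corrector = {"teh": "the", "recieve": "receive"}.get

table = build_corrections(["teh", "recieve", "zzzq"], toy_corrector)
fixed_comments = apply_corrections(["teh cat", "i recieve teh mail", "tehran"], table)
print(fixed_comments)  # ['the cat', 'i receive the mail', 'tehran']
```

Because whole tokens are looked up rather than substrings, a word such as "tehran" is left alone even though it contains "teh".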