
I want to remove non-English words from sentences in Python 3.x


I have a whole bunch of user queries. Some of these queries also contain garbage characters, e.g.,

I work in google asdasb asnlkasn

and I only need

I work in google

import nltk
import spacy
import truecase

# Vocabulary of known English words from the nltk corpus
words = set(nltk.corpus.words.words())
nlp = spacy.load('en_core_web_lg')

def check_ner(text):
    # Return the text of every named entity spaCy finds in the input
    doc = nlp(text)
    ner_list = []
    for ent in doc.ents:
        ner_list.append(ent.text)
    return ner_list


sent = "I work in google asdasb asnlkasn"
sent = truecase.get_true_case(sent)  # restore capitalization so NER can work
ner_list = check_ner(sent)

# Keep a token if it is a known English word, is not alphabetic
# (punctuation, numbers), or is part of a recognized named entity
final_sent = " ".join(w for w in nltk.wordpunct_tokenize(sent)
                      if w.lower() in words or not w.isalpha()
                      or w in ner_list)
I tried this, but it does not remove the garbage, because NER detects "google asdasb asnlkasn" as WORK_OF_ART, or sometimes "asdasb asnlkasn" as PERSON. I have to include NER because words = set(nltk.corpus.words.words()) does not contain Google, Microsoft, Apple, etc., or any other named-entity values in its corpus.
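The corpus gap is easy to confirm directly. A minimal check (assuming the nltk "words" corpus has been downloaded, e.g. via nltk.download('words')):

import nltk
# nltk.download('words')  # first run only

words = set(nltk.corpus.words.words())
print("work" in words)    # True: an ordinary English word
print("google" in words)  # likely False: company names are not in the list
print("asdasb" in words)  # False: garbage token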

You can use this to identify your non-words:

words = set(nltk.corpus.words.words())

sent = "I work in google asdasb asnlkasn"
# Keep tokens that are known English words or non-alphabetic
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
Try using this. Thanks to @DYZ.
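For reference, a quick sketch of what that filter produces on the example query (assuming the nltk word list lacks "google", as noted above): the garbage tokens are dropped, but "google" is dropped with them, which is exactly why NER is needed here.

import nltk

words = set(nltk.corpus.words.words())
sent = "I work in google asdasb asnlkasn"

cleaned = " ".join(w for w in nltk.wordpunct_tokenize(sent)
                   if w.lower() in words or not w.isalpha())
print(cleaned)  # likely "I work in": garbage removed, but "google" lost too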

However, since you said you need NER for Google, Apple, etc., and it is causing the misidentification, what you can do is compute scores for the NER predictions using beam parsing. You can then set a threshold on the acceptable score and drop any entity that falls below it. I believe these nonsensical words will get a very low probability score when classified as, say, PERSON, and you can also use the scores to drop categories such as WORK_OF_ART if you don't need them.

An example of scoring with beam parse:

import spacy
from collections import defaultdict

model = 'en_core_web_lg'
nlp = spacy.load(model)
print("Loaded model '%s'" % model)
text = u'I work in Google asdasb asnlkasn'

# Run the pipeline without NER so we can beam-parse the doc ourselves
with nlp.disable_pipes('ner'):
    doc = nlp(text)

threshold = 0.2
# beam_parse keeps multiple candidate analyses (spaCy v2 API)
beams = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)

# Sum the probability mass each (start, end, label) span receives
entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

print('Entities and scores (detected with beam search)')
for key in entity_scores:
    start, end, label = key
    score = entity_scores[key]
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))

This worked in my tests, even where the plain NER failed to identify it correctly.

Check this related question: if you want to keep it as a string rather than a spaCy span, you should use:

doc[start:end].text
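Putting the pieces together, a minimal sketch of the full filter (a hypothetical continuation of the beam-parse snippet above: entity_scores, threshold, doc, and words are assumed to come from the earlier code, and spaCy's own tokens are used so they line up with the entity spans). The idea is to keep a token if it is a known English word, non-alphabetic, or inside an entity span that scored above the threshold:

# Collect the tokens of every entity span whose summed beam score
# passed the threshold
confident_ents = set()
for (start, end, label), score in entity_scores.items():
    if score > threshold:
        confident_ents.update(tok.text for tok in doc[start:end])

final_sent = " ".join(
    tok.text for tok in doc
    if tok.text.lower() in words      # known English word
    or not tok.text.isalpha()         # punctuation / numbers
    or tok.text in confident_ents     # high-confidence named entity
)
print(final_sent)  # garbage dropped; "Google" should survive via its score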