I want to remove non-English words from a sentence in Python 3.x
I have a large batch of user queries. Some of them contain junk characters, for example:
I work in google asdasb asnlkasn
I only need:
I work in google
import nltk
import spacy
import truecase

# English vocabulary from the NLTK words corpus (needs nltk.download('words') once)
words = set(nltk.corpus.words.words())
nlp = spacy.load('en_core_web_lg')

def check_ner(word):
    # Collect the text of every entity spaCy detects in the input
    doc = nlp(word)
    ner_list = []
    for ent in doc.ents:
        ner_list.append(ent.text)
    return ner_list

sent = "I work in google asdasb asnlkasn"
sent = truecase.get_true_case(sent)
ner_list = check_ner(sent)

# Keep dictionary words, non-alphabetic tokens, and anything NER flagged
final_sent = " ".join(
    w for w in nltk.wordpunct_tokenize(sent)
    if w.lower() in words or not w.isalpha() or w in ner_list
)
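I tried this, but it does not remove the junk characters, because NER detects "Google asdasb asnlkasn" as WORK_OF_ART, or sometimes "asdasb asnlkasn" as PERSON.
I have to include NER because
words = set(nltk.corpus.words.words())
does not contain Google, Microsoft, Apple, etc., or any other NER values.

You can use the following to identify your non-words: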
import nltk

words = set(nltk.corpus.words.words())
sent = "I work in google asdasb asnlkasn"
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
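Try this. Thanks to @DYZ.
However, you said you need NER for names such as Google and Apple, and that NER is what causes the wrong detections. What you can do is compute a score for each NER prediction using beam parsing, set a threshold for acceptable scores, and drop every entity that falls below it. I believe these meaningless words will get a very low probability score when they are classified as, say, PERSON. You can also use the labels to drop whole categories you do not need, such as WORK_OF_ART.
Example of scoring with beam parse: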
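To see what this filter keeps and drops without downloading the NLTK corpus, here is a minimal self-contained sketch. The tiny `vocab` set is only a stand-in for `nltk.corpus.words.words()`, the regex is assumed to match the pattern that `wordpunct_tokenize` uses, and `keep_known_words` is a hypothetical helper name.

```python
import re

# Stand-in for set(nltk.corpus.words.words()); an assumption for illustration.
vocab = {"i", "work", "in"}

def keep_known_words(sentence, vocab):
    """Keep tokens that are in vocab or are not purely alphabetic."""
    # Assumed equivalent of nltk.wordpunct_tokenize: word runs or punctuation runs.
    tokens = re.findall(r"\w+|[^\w\s]+", sentence)
    return " ".join(t for t in tokens if t.lower() in vocab or not t.isalpha())

result = keep_known_words("I work in google asdasb asnlkasn", vocab)
print(result)
```

With this stand-in vocabulary, "google" is dropped together with the junk tokens, because it is not a dictionary word either. That is the gap the question tries to close with NER.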
import spacy
from collections import defaultdict

# Assumption: output_dir was not defined in the original snippet. It should be
# the name or path of the model to load, e.g. the model used in the question.
output_dir = 'en_core_web_lg'
nlp = spacy.load(output_dir)
print("Loaded model '%s'" % output_dir)

text = u'I work in Google asdasb asnlkasn'

# Run the pipeline without NER; entities are scored separately below
with nlp.disable_pipes('ner'):
    doc = nlp(text)

threshold = 0.2
# nlp.entity is the spaCy 2.x handle for the NER component
beams = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)

entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            # Accumulate the score of every parse that contains this entity
            entity_scores[(start, end, label)] += score

print('Entities and scores (detected with beam search)')
for key in entity_scores:
    start, end, label = key
    score = entity_scores[key]
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))
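It worked in my tests: the NER fails to recognize the junk text. Check this question. If you want to keep the entity as a string rather than a spaCy span, use: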
doc[start:end].text
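Putting the two ideas together, one possible way to use the scores is to keep a token if it is a dictionary word, is non-alphabetic, or falls inside an entity span whose accumulated score clears the threshold. The sketch below works on plain Python data so that it runs without a model. `filter_tokens`, the stand-in `vocab`, and the sample scores are all hypothetical, and the `(start, end, label)` keys are assumed to be token offsets with `end` exclusive, as in the `entity_scores` dict above.

```python
def filter_tokens(tokens, vocab, entity_scores, threshold=0.2, drop_labels=()):
    """Return tokens that are known words, non-alphabetic, or confident entities."""
    trusted = set()  # token indices covered by an accepted entity span
    for (start, end, label), score in entity_scores.items():
        if score > threshold and label not in drop_labels:
            trusted.update(range(start, end))
    return [
        tok for i, tok in enumerate(tokens)
        if tok.lower() in vocab or not tok.isalpha() or i in trusted
    ]

tokens = ["I", "work", "in", "Google", "asdasb", "asnlkasn"]
vocab = {"i", "work", "in"}  # stand-in for the NLTK words corpus
# Made-up scores for illustration only, not real model output.
entity_scores = {
    (3, 4, "ORG"): 0.85,
    (4, 6, "PERSON"): 0.05,
    (3, 6, "WORK_OF_ART"): 0.30,
}
cleaned = " ".join(
    filter_tokens(tokens, vocab, entity_scores, drop_labels={"WORK_OF_ART"})
)
print(cleaned)
```

With a real model, `tokens` would come from something like `[t.text for t in doc]`. Both the threshold and the set of dropped labels would need tuning on your own queries, since a span that scores above the threshold under an unwanted label keeps its junk tokens unless that label is excluded.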