Python: Can I tell a stemmer to ignore words from a specific language? (python, nltk)


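I am using the Cistem stemmer for German.

The documents I am working with also contain English words.

So I want to tell the German stemmer to ignore English words, and likewise tell my English stemmer to ignore German words.

For example:

My German text contains the English word "case". The German stemmer reduces it to "cas", but it should stay "case". In other words, the English word "case" should be ignored.

Is this possible?

My code: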

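from nltk.stem.cistem import Cistem

stemmer = Cistem()
sl = []
for line in o:  # o: an iterable of text lines (presumably an open file)
    sp = line.split()
    sl.append(sp)

# Note: segment() returns a (stem, suffix) tuple; stem() returns only the stem.
st = [[stemmer.segment(s) for s in l] for l in sl]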

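A good approach is to compare how often a given word occurs in a corpus of English documents with how often it occurs in a corpus of German documents.
For example, if a word w1 occurs more frequently in the German Wikipedia than in the English Wikipedia, it is probably a German word.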
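
As a minimal sketch of this idea, the comparison can be written with plain word counts. The two tiny corpora below are made-up stand-ins for the two Wikipedias, and the names `relative_freq` and `guess_language` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter

def relative_freq(counts, word):
    """Share of all tokens in the corpus that are this word (case-insensitive)."""
    total = sum(counts.values())
    return counts[word.lower()] / total if total else 0.0

def guess_language(word, counts_de, counts_en):
    """Return 'DE' if the word is relatively more frequent in the German
    corpus, otherwise 'EN' (ties and unknown words fall back to 'EN')."""
    f_de = relative_freq(counts_de, word)
    f_en = relative_freq(counts_en, word)
    return "DE" if f_de > f_en else "EN"

# Toy corpora (assumption: real counts would come from large corpora).
counts_de = Counter("der fall ist klar und der case ist selten".split())
counts_en = Counter("the case is clear and the case is common".split())

print(guess_language("case", counts_de, counts_en))
print(guess_language("fall", counts_de, counts_en))
```

Relative rather than absolute frequencies are compared here so that the two corpora do not need to be the same size.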

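Now, rather than downloading, parsing, and counting word frequencies for both versions of Wikipedia, a more direct way is to use pre-trained models that include an indication of the word frequencies encountered during training.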

We can use spaCy's English and German models:

import spacy
nlpDE = spacy.load("de_core_news_md")
nlpEN = spacy.load('en_core_web_md')

# some test sentences in both languages:
sl = [ "Python is an interpreted, high-level, general-purpose programming language.",
"Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.", 
"Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.", 
"Python ist eine universelle, üblicherweise interpretierte höhere Programmiersprache.",
" Wegen ihrer klaren und übersichtlichen Syntax gilt Python als einfach zu erlernen.",
"Python unterstützt mehrere Programmierparadigmen, z. B. die objektorientierte, die aspektorientierte und die funktionale Programmierung"]

#let's randomly shuffle this list of test sentences:
from random import shuffle
shuffle(sl)
s = " ".join(sl)

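# Compare the word's log-probability in each model's vocabulary.
# Note: this relies on the models providing lexeme probabilities; in spaCy v3
# these may require the separate spacy-lookups-data package.
def compare(word):
    prob_en = nlpEN.vocab[word].prob
    prob_de = nlpDE.vocab[word].prob
    if prob_en > prob_de:
        return 'EN'
    return 'DE'  # ties also fall back to 'DE'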


doc = nlpEN(s)    
print([(t, compare(t.text))  for t in doc if not t.is_punct])

And the result of this approach on the sample data:

[(Python, 'EN'), (is, 'EN'), (an, 'DE'), (interpreted, 'EN'), (high, 'EN'), (level, 'EN'), 
(general, 'EN'), (purpose, 'EN'), (programming, 'EN'), (language, 'EN'), (Python, 'EN'), 
(unterstützt, 'DE'), (mehrere, 'DE'), (Programmierparadigmen, 'DE'), (z., 'DE'), (B., 'DE'), (die, 'DE'), 
(objektorientierte, 'DE'), (die, 'DE'), (aspektorientierte, 'DE'), (und, 'DE'), (die, 'DE'), (funktionale, 'DE'),
 (Programmierung, 'DE'), (Created, 'EN'), (by, 'EN'), (Guido, 'DE'), (van, 'DE'), (Rossum, 'DE'), (and, 'EN'),
 (first, 'EN'), (released, 'EN'), (in, 'DE'), (1991, 'EN'), (Python, 'EN'), ('s, 'EN'), (design, 'EN'), 
(philosophy, 'EN'), (emphasizes, 'EN'), (code, 'EN'), (readability, 'EN'), (with, 'EN'), (its, 'EN'), 
(notable, 'EN'), (use, 'EN'), (of, 'EN'), (significant, 'EN'), (whitespace, 'EN'), ( , 'EN'), (Wegen, 'DE'), 
(ihrer, 'DE'), (klaren, 'DE'), (und, 'DE'), (übersichtlichen, 'DE'), (Syntax, 'DE'), (gilt, 'DE'), (Python, 'EN'), 
(als, 'DE'), (einfach, 'DE'), (zu, 'DE'), (erlernen, 'DE'), (Python, 'EN'), (ist, 'DE'), (eine, 'DE'), 
(universelle, 'DE'), (üblicherweise, 'DE'), (interpretierte, 'DE'), (höhere, 'DE'), (Programmiersprache, 'DE'),
 (Its, 'EN'), (language, 'EN'), (constructs, 'EN'), (and, 'EN'), (object, 'EN'), (oriented, 'EN'), (approach, 'EN'), 
(aim, 'EN'), (to, 'EN'), (help, 'EN'), (programmers, 'EN'), (write, 'EN'), (clear, 'EN'), (logical, 'EN'),
 (code, 'EN'), (for, 'EN'), (small, 'EN'), (and, 'EN'), (large, 'EN'), (scale, 'EN'), (projects, 'EN')]
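
To connect this back to the original question, the per-word labels can be used to decide which stemmer, if any, touches a token. The following is only a sketch under stated assumptions: `stem_selectively` is a hypothetical helper, and `toy_classify` and `toy_german_stem` are crude stand-ins so that the example runs without spaCy or NLTK installed.

```python
def stem_selectively(tokens, classify, stemmers):
    """Stem each token with the stemmer registered for its predicted language.

    tokens   -- list of word strings
    classify -- callable mapping a word to a language label (e.g. compare)
    stemmers -- dict mapping a label to a stemming callable; tokens whose
                label has no entry are returned unchanged
    """
    result = []
    for tok in tokens:
        stem = stemmers.get(classify(tok))
        result.append(stem(tok) if stem else tok)
    return result

# Stand-in classifier (assumption): treats a fixed set of words as English.
def toy_classify(word):
    return "EN" if word.lower() in {"case", "use"} else "DE"

# Stand-in German stemmer (assumption): just strips a trailing "e".
def toy_german_stem(word):
    return word[:-1] if word.endswith("e") else word

tokens = ["der", "case", "ist", "eine", "ausnahme"]
print(stem_selectively(tokens, toy_classify, {"DE": toy_german_stem}))
```

With the real libraries, one would presumably pass `compare` as `classify` and `{"DE": Cistem().stem}` as `stemmers`; an English stemmer such as NLTK's `PorterStemmer().stem` could be registered under `"EN"` in the same dictionary.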

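A straightforward idea is to check, for each token, whether it belongs to a given German vocabulary or to an English vocabulary. Based on that, you can decide whether to apply the stemmer to the token.

For this you need a dictionary for each language. Perhaps you can use the word lists in nltk, or check occurrence rates or frequencies in a corpus, so that you can determine whether a word is used in the target language or is a borrowing.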
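
A minimal sketch of the dictionary lookup might look like this. The two word sets are tiny made-up examples, and `should_stem` is a hypothetical helper. In practice the English set could be built from a word list such as `nltk.corpus.words`, while the German set would need a separate source.

```python
def should_stem(token, target_vocab, foreign_vocab):
    """Decide whether a token should go to the target-language stemmer.

    A token known in the target language is stemmed, even if it also exists
    in the other language. A token known only in the other language is
    skipped. Unknown tokens are stemmed by default (a design choice).
    """
    word = token.lower()
    if word in target_vocab:
        return True
    if word in foreign_vocab:
        return False
    return True

# Toy vocabularies (assumption: real ones would hold many thousands of words).
german_vocab = {"der", "fall", "fälle", "ist", "die"}
english_vocab = {"case", "the", "is", "die"}

for tok in ["Fälle", "case", "die", "xyz"]:
    print(tok, should_stem(tok, german_vocab, english_vocab))
```

Words that exist in both languages, such as "die", are the weak spot of this approach; the frequency-based comparison handles them more gracefully.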

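Can you be more specific? What is the expected output?

Hi @mario_sunny, I added an example.

Did spaCy's word-frequency approach work for you?

Hi @DBaker! I just tried your code (I had no time over the last few days) and it works very well. Thank you for the detailed answer! One question: if I want to analyze sentences, can I change [word] to [sentence]?

Hi @gython, I'm glad it works for you! The code already processes the sentences word by word: each sentence is tokenized into words, and then the language of each word is predicted.

Thanks again @DBaker. If I want to work at the sentence level rather than the word level, can you recommend some resources?

I'm not sure what exactly you mean by "at the sentence level": do you mean deciding whether a given sentence is English or German? If so, you can count the words in the sentence predicted as "EN" and compare that with the number predicted as "DE". I'll add links to some references in my next comment.

[…] is a good resource on tokenization.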
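
The majority vote suggested in the comments could be sketched roughly as follows. `sentence_language` is a hypothetical helper, and `toy_classify` is a stand-in for the `compare` function from the answer so that the example runs on its own.

```python
from collections import Counter

def sentence_language(words, classify):
    """Label a sentence by majority vote over its per-word labels.

    Ties fall back to 'DE', mirroring the tie behaviour of compare().
    """
    counts = Counter(classify(w) for w in words)
    return "EN" if counts["EN"] > counts["DE"] else "DE"

# Stand-in classifier (assumption): a fixed set of English words.
def toy_classify(word):
    english = {"python", "is", "an", "interpreted", "language", "case"}
    return "EN" if word.lower() in english else "DE"

print(sentence_language("Python is an interpreted language".split(), toy_classify))
print(sentence_language("Der case ist eine Ausnahme".split(), toy_classify))
```

A single borrowed word such as "case" does not flip the label of an otherwise German sentence, which is the point of voting over the whole sentence.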