Finding different realizations of a word in a sentence string - Python

python string nlp

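(This question is about string checking in general rather than natural language processing per se. If you do view it as an NLP problem, imagine a language that current analyzers cannot handle; for simplicity I'll use English strings as examples.)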

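Let's say there are only 6 possible forms in which a word can be realized:

the initial letter capitalized

its plural form with an "s"

its plural form with an "es"

capitalized + "es"

capitalized + "s"

the basic form without plural or capitalization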

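Suppose I want to find the index of the first instance of any form of the word coach in a sentence. Is there a simpler way than these two methods?

Long if condition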

sentence = "this is a sentence with the Coaches"
target = "coach"

print(target.capitalize())

# Compare every word against each of the six forms by hand.
for j, i in enumerate(sentence.split(" ")):
  if i == target.capitalize() or i == target.capitalize()+"es" or \
     i == target.capitalize()+"s" or i == target+"es" or i==target+"s" or \
     i == target:
    print(j)

Iterating with try/except

variations = [target, target+"es", target+"s", target.capitalize()+"es",
target.capitalize()+"s", target.capitalize()]

# Look each form up in the word list; list.index raises ValueError if absent.
for i in variations:
  try:
    j = sentence.split(" ").index(i)
    print(j)
  except ValueError:
    continue

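I suggest taking a look at NLTK's stem package.

With it you can "remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those affixes required for eg. grammatical role, tense, derivational morphology leaving only the stem of the word."

If your language is not currently covered by NLTK, you should consider extending NLTK. If you really need something simple and don't want to bother with NLTK, you should still write your code as a collection of small, easily composable utility functions, for example: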

import string 

def variation(stem, word):
    return word.lower() in [stem, stem + 'es', stem + 's']

def variations(sentence, stem):
    sentence = cleanPunctuation(sentence).split()
    return ( (i, w) for i, w in enumerate(sentence) if variation(stem, w) )

def cleanPunctuation(sentence):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence if ch not in exclude)

def firstVariation(sentence, stem):
    for i, w  in variations(sentence, stem):
        return i, w

sentence = "First coach, here another two coaches. Coaches are nice."

print(firstVariation(sentence, 'coach'))

# print all variations/forms of 'coach' found in the sentence:
print("\n".join([str(i) + ' ' + w for i,w in variations(sentence, 'coach')]))
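
Because the question allows exactly six forms, the same check can also be sketched with a precomputed set and `next()`. The helper names `forms` and `first_index` are made up for this illustration and are not part of the code above. Unlike `variation`, which lowercases the whole word, this sketch only accepts the six listed forms, so "COACH" is rejected.

```python
import string

def forms(stem):
    """Return the six allowed surface forms of stem as a set."""
    cap = stem.capitalize()
    return {stem, stem + "s", stem + "es", cap, cap + "s", cap + "es"}

def first_index(sentence, stem):
    """Return the word index of the first form of stem, or None if absent."""
    wanted = forms(stem)
    # Strip surrounding punctuation so that "coach," still counts as "coach".
    words = (w.strip(string.punctuation) for w in sentence.split())
    return next((i for i, w in enumerate(words) if w in wanted), None)

print(first_index("this is a sentence with the Coaches", "coach"))
```

The set lookup is constant time per word, and `next()` stops at the first hit instead of scanning the rest of the sentence.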

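Morphology is typically a finite-state phenomenon, so regular expressions are the perfect tool for handling it. Build an RE that matches all of the cases with a function like this: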

import re

def inflect(stem):
    """Returns an RE that matches all inflected forms of stem."""
    # The suffix group is optional so that the bare stem ("coach") matches too.
    pat = "^[%s%s]%s(?:e?s)?$" % (stem[0], stem[0].upper(), re.escape(stem[1:]))
    return re.compile(pat)

Usage:

>>> sentence = "this is a sentence with the Coaches"
>>> target = inflect("coach")
>>> [(i, w) for i, w in enumerate(sentence.split()) if re.match(target, w)]
[(6, 'Coaches')]
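
To get only the first hit without splitting the sentence, a similar pattern can be anchored on word boundaries and passed to `re.search`. This is a sketch under assumptions: `inflect_anywhere` and `first_word_index` are hypothetical names, and the word index is only right when the match starts a whitespace-separated token.

```python
import re

def inflect_anywhere(stem):
    """Compile an RE matching stem as a whole word, optionally capitalized, with optional -s/-es."""
    first = "[%s%s]" % (re.escape(stem[0].lower()), re.escape(stem[0].upper()))
    return re.compile(r"\b%s%s(?:e?s)?\b" % (first, re.escape(stem[1:])))

def first_word_index(sentence, stem):
    """Return (word_index, matched_text) for the first match, or None."""
    m = inflect_anywhere(stem).search(sentence)
    if m is None:
        return None
    # Count the whitespace-separated words that precede the match.
    return len(sentence[:m.start()].split()), m.group()

print(first_word_index("First coach, here another two coaches.", "coach"))
```

The `\b` anchors also take care of trailing punctuation, so "coach," matches while "coaching" does not.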

If the inflection rules get more complicated than this, consider using [...].
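
One way to keep richer rules readable, offered only as a sketch and not necessarily the tool meant above, is to generate the pattern from a list of suffixes. The function `inflect_with` and the suffix list (including "ing", which is not one of the six forms in the question) are illustrative assumptions.

```python
import re

def inflect_with(stem, suffixes=("", "s", "es")):
    """Compile an RE matching stem (optionally capitalized) plus any listed suffix."""
    first = "[%s%s]" % (re.escape(stem[0].lower()), re.escape(stem[0].upper()))
    # Longest suffixes first keeps the alternation order predictable.
    ordered = sorted(suffixes, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    return re.compile("^%s%s(?:%s)$" % (first, re.escape(stem[1:]), alternatives))

pattern = inflect_with("coach", suffixes=("", "s", "es", "ing"))
words = "Coaching the coaches is what a coach does".split()
print([(i, w) for i, w in enumerate(words) if pattern.match(w)])
```

Adding a new inflection then means adding one string to the list rather than editing the regular expression by hand.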

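#2 and #3 are the same. Should one be "s" and the other "es"?

It would be much simpler with regular expressions.

It's a language that NLTK's stemmers don't support yet. I'm trying to build a limited rule-based system that can reach high precision in NLP, so without annotated data a stemmer based on statistical classification isn't feasible. But NLTK should be the first place to look for any NLP-related task (though =)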