Python: extracting topic keywords from text (python, python-2.7, nltk)

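I want to extract a list of ingredients from a cooking recipe. To do this, I keep a long list of ingredients in a file and check every one of them against the recipe. The code looks like this: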

ingredients = ['sugar', 'flour', 'apple']
found = []
recipe = '''
1 teaspoon of sugar
2 tablespoons of flour.
3 apples
'''
for ingredient in ingredients:
    if ingredient in recipe:
        found.append(ingredient)

I am looking for a more efficient way to do this, because the list of possible ingredients can be very large. Any ideas?

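You could try using nltk for part-of-speech (POS) tagging, keep only the nouns, and then use a custom stop list to exclude nouns that refer to quantities, such as teaspoon, pinch, and so on. This leaves you with a much smaller list to build and maintain by hand, and a much shorter list to check against, like this: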

ingredients = set(nouns) - set(stopwords)  # take the difference
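
The noun-filtering step described above can be sketched roughly as follows. This is only an illustration: the `tagged` list is hand-written in the `(word, tag)` format that `nltk.pos_tag` returns (its tags are assumed, not real tagger output), and both the `stopwords` list and the helper name `candidate_ingredients` are hypothetical. In practice the pairs would come from something like `nltk.pos_tag(nltk.word_tokenize(recipe))`, which needs the NLTK tokenizer and tagger models to be installed.

```python
# Hand-written sample in the (word, tag) format that nltk.pos_tag returns.
# The tags are assumed for illustration; they are not actual tagger output.
tagged = [
    ('1', 'CD'), ('teaspoon', 'NN'), ('of', 'IN'), ('sugar', 'NN'),
    ('2', 'CD'), ('tablespoons', 'NNS'), ('of', 'IN'), ('flour', 'NN'),
    ('.', '.'), ('3', 'CD'), ('apples', 'NNS'),
]

# Hypothetical custom stop list of measurement nouns.
stopwords = ['teaspoon', 'teaspoons', 'tablespoon', 'tablespoons', 'pinch']


def candidate_ingredients(tagged_tokens, stopwords):
    """Keep nouns (Penn Treebank tags starting with 'NN') not in the stop list."""
    nouns = [word.lower() for word, tag in tagged_tokens
             if tag.startswith('NN')]
    return set(nouns) - set(stopwords)  # take the difference


ingredients = candidate_ingredients(tagged, stopwords)
print(sorted(ingredients))  # ['apples', 'flour', 'sugar']
```

Note that the plural `apples` survives as-is; the tagger only labels words, it does not reduce them to a base form.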

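To actually check for the ingredients in a recipe more efficiently, you are probably best off doing what @jbrown suggests: intersect the words in the recipe with the ingredient list (POS tagging is probably not worth it at this stage).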

You can split the input and use sets:

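ingredients = set(['sugar', 'flour', 'apple'])
# split() with no argument splits on any whitespace, including newlines;
# strip('.,;:') drops trailing punctuation such as the period in 'flour.'
recipe_elements = set([i.strip('.,;:') for i in recipe.split()])
used_ingredients = ingredients & recipe_elements    # the intersection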

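Depending on where the input comes from, you may need to clean it up in various ways. You would also need to benchmark this to see whether it is really faster. Without extra work (for example, converting everything to singular), it will not match "apple" when the recipe says "apples", as in the example.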
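
A minimal sketch of that extra clean-up, assuming the ingredient list is already lowercase and singular. The helper names `normalize` and `find_ingredients` are made up for this example, and the singularization rule is deliberately naive (it simply drops a trailing "s"), so it will mishandle words such as "berries" or "molasses"; a proper stemmer or lemmatizer would be needed for real data.

```python
import string


def normalize(word):
    """Lowercase a token, strip surrounding punctuation, naively singularize."""
    word = word.strip(string.punctuation).lower()
    # Naive rule (an assumption of this sketch): drop one trailing 's'.
    if len(word) > 3 and word.endswith('s'):
        word = word[:-1]
    return word


def find_ingredients(recipe, ingredients):
    """Return the known ingredients that occur as words in the recipe."""
    words = set(normalize(w) for w in recipe.split())
    return set(ingredients) & words  # the intersection


recipe = '''
1 teaspoon of sugar
2 tablespoons of flour.
3 apples
'''

found = find_ingredients(recipe, ['sugar', 'flour', 'apple'])
print(sorted(found))  # ['apple', 'flour', 'sugar']
```

Because the recipe is tokenized once and each lookup is a set operation, the cost grows with the length of the recipe rather than with the size of the ingredient list.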

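Are you looking for some kind of dictionary of ingredients?

I am looking for an efficient approach, because if you have many recipes, checking the whole list against each recipe would be very inefficient.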