Is there an efficient way to extract key phrases from a given sentence using a TF-IDF scheme in Python?

python nlp nltk text-extraction

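I am trying to extract a key phrase from a given sentence using a TF-IDF scheme. To do that, I first try to find the candidate words or candidate phrases in the sentence and then compute word frequencies within the sentence. However, when I introduced a new CFG rule to find possible key phrases in the sentence, I got an error.

Here is my script: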

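import re
import string
from itertools import chain, groupby

from nltk import RegexpParser, pos_tag, word_tokenize
from nltk.corpus import stopwords

rm_punct=re.compile('[{}]'.format(re.escape(string.punctuation)))
stop_words=set(stopwords.words('english'))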

def get_cand_words(sent, cand_type='word', remove_punct=False):
    candidates=list()
    sent=rm_punct.sub(' ', sent)
    tokenized=word_tokenize(sent)
    tagged_words=pos_tag(tokenized)
    if cand_type=='word':
        pos_tag_patt=tags = set(['JJ', 'JJR', 'JJS', 'NN', 'NNP', 'NNS', 'NNPS'])
        tagged_words=chain.from_iterable(tagged_words)
        for word, tag in enumerate(tagged_words):
            if tag in pos_tag_patt and word not in stop_words:
                candidates.append(word)

    elif cand_type == 'phrase':
        grammar = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'
        chunker = RegexpParser(grammar)
        all_tag = chain.from_iterable([chunker.parse(tag) for tag in tagged_words])
        for key, group in groupby(all_tag, lambda tag: tag[2] != 'O'):
            candidate = ' '.join([word for (word, pos, chunk) in group])
            if key is True and candidate not in stop_words:
                candidates.append(candidate)
    else:
        print("return word or phrase as target phrase")
    return candidates

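I wrote the code above based on an approach for extracting key phrases from long paragraphs of text. My goal is to find a unique key phrase in a given sentence, but the implementation above does not work well.

How can I fix this ValueError? How can I make the implementation above extract key phrases from a given sentence or a list of sentences? Is there a better way to do this? Any other ideas? Thanks.

Goal:

I want to find the single most relevant noun-adjective phrase, or compound noun-adjective phrase, in a given sentence. How can I do this in Python? Does anyone know how to do it? Thanks in advance.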

Can you try this code?

import re
import string
from itertools import chain, groupby

import nltk
from nltk.corpus import stopwords

rm_punct = re.compile('[{}]'.format(re.escape(string.punctuation)))
stop_words = set(stopwords.words('english'))

def get_cand_words(sent, cand_type='word', remove_punct=False):
    candidates = []
    # Punctuation is always replaced by spaces; remove_punct is currently unused
    sent = rm_punct.sub(' ', sent)
    if cand_type == 'word':
        pos_tag_patt = {'JJ', 'JJR', 'JJS', 'NN', 'NNP', 'NNS', 'NNPS'}
        # pos_tag returns (word, tag) pairs, so iterate over them directly
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sent)):
            if tag in pos_tag_patt and word.lower() not in stop_words:
                candidates.append(word)

    elif cand_type == 'phrase':
        grammar = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'
        chunker = nltk.RegexpParser(grammar)
        # The chunker expects one list of (word, tag) pairs per sentence
        tagged_sents = nltk.pos_tag_sents(nltk.word_tokenize(text) for text in nltk.sent_tokenize(sent))
        # tree2conlltags flattens each chunk tree into (word, pos, iob) triples
        all_tag = list(chain.from_iterable(nltk.chunk.tree2conlltags(chunker.parse(tagged_sent)) for tagged_sent in tagged_sents))
        # Consecutive tokens whose IOB tag is not 'O' form one candidate phrase
        for key, group in groupby(all_tag, lambda tag: tag[2] != 'O'):
            candidate = ' '.join([word for (word, pos, chunk) in group])
            if key is True and candidate.lower() not in stop_words:
                candidates.append(candidate)
    else:
        print("return word or phrase as target phrase")
    return candidates
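
The grouping step in the phrase branch relies on every item being a (word, POS, IOB) triple, which is the shape `nltk.chunk.tree2conlltags` produces; plain `(word, tag)` pairs have no third element for `tag[2]` to read. The sketch below isolates that step so it can be checked without NLTK. The triples are hand-written for illustration and the helper name `group_chunks` is hypothetical, not part of the original code.

```python
from itertools import groupby

# Hand-written (word, POS, IOB) triples in the shape that
# nltk.chunk.tree2conlltags returns; the tags are illustrative only.
conll_tags = [
    ("Hillary", "NNP", "B-KT"),
    ("Clinton", "NNP", "I-KT"),
    ("agrees", "VBZ", "O"),
    ("with", "IN", "O"),
    ("John", "NNP", "B-KT"),
    ("McCain", "NNP", "I-KT"),
]

def group_chunks(triples):
    """Join consecutive tokens whose IOB tag is not 'O' into phrases."""
    phrases = []
    for in_chunk, group in groupby(triples, key=lambda t: t[2] != "O"):
        if in_chunk:
            phrases.append(" ".join(word for word, _pos, _iob in group))
    return phrases

print(group_chunks(conll_tags))  # ['Hillary Clinton', 'John McCain']
```

One caveat of grouping on `!= 'O'` alone: two chunks that sit directly next to each other, with no `O` token between them, are merged into a single phrase. Splitting on `B-` tags would avoid that if it matters for your data.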

I don't see any significant update in your answer; it is almost the same as my original post, and your code doesn't work. Could you delete your answer?

@Jerry - I made changes in the elif block. The tagged_words and all_tag assignment lines were changed. This code runs fine for me. The output I get is ['Hillary Clinton', 'John McCain', 'George Bush', 'benefit', 'doubt about Iran'].

When I choose 'phrase' for the cand_type parameter, the code above doesn't work for me. I get an error: "Currently, NLTK pos_tag only supports English and Russian (i.e. lang='eng' or lang='rus')". Why? What is your setup, then? Also, could you update your answer with your working code?
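
Once candidates are extracted, the TF-IDF step mentioned in the question can be approximated by treating each sentence as a document and scoring its candidates against the others. What follows is a minimal sketch in plain Python under that assumption; the function name `rank_candidates`, the smoothed IDF formula, and the sample candidate lists are illustrative choices, not taken from the original post.

```python
import math
from collections import Counter

def rank_candidates(candidates_per_sentence):
    """Return, for each sentence, its candidates sorted by descending TF-IDF."""
    n_docs = len(candidates_per_sentence)
    # Document frequency: in how many sentences each candidate appears.
    df = Counter()
    for cands in candidates_per_sentence:
        df.update({c.lower() for c in cands})

    ranked = []
    for cands in candidates_per_sentence:
        tf = Counter(c.lower() for c in cands)
        total = sum(tf.values())
        scores = {
            # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
            c: (count / total) * (math.log((1 + n_docs) / (1 + df[c])) + 1)
            for c, count in tf.items()
        }
        # Sort by score, breaking ties alphabetically for stable output.
        ranked.append(sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])))
    return ranked

# Hypothetical candidate phrases for two sentences.
sample = [
    ["economic policy", "tax cut", "economic policy"],
    ["tax cut", "foreign policy"],
]
for sentence_ranking in rank_candidates(sample):
    print(sentence_ranking[0][0])
```

With a single sentence every candidate has the same IDF, so the ranking falls back to raw frequency; the scores only become informative when several sentences are scored together.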