NLP methods/tools to extract keywords from a list of sentences


I have a large list of sentences and would like to tag each of them with its own unique keyword, to help me identify which sentences are similar for grouping purposes.

For example:

The dog ran fast. - tagged as: dog
The cat is sleeping - tagged as: cat
The German Shepherd is awake. - tagged as: dog

I've been looking at tools such as AlchemyAPI and OpenCalais to extract keywords, but it seems these are geared more towards extracting meaning from a block of data, such as a whole document or paragraph, rather than tagging 1000 individual sentences that are unique yet similar.

In short, ideally I would like to:

  • Take a sentence from a document or web page (most likely from a large spreadsheet or a list of tweets)
  • Place a unique identifier on it (some type of keyword)
  • Group the sentences by keyword
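The three steps above can be sketched in plain Python. This is a toy illustration only: the stop-word list, the suffix-stripping "stemmer", and the synonym table below are hypothetical stand-ins I made up for this example; a real pipeline would use NLTK's stopwords corpus, `PorterStemmer`, and something like a WordNet lookup to map "shepherd" onto "dog".

```python
from collections import defaultdict

# Toy stand-ins for real NLP resources (see lead-in above).
STOPWORDS = {"the", "a", "an", "is", "are", "ran", "fast", "awake", "sleeping"}
SYNONYMS = {"shepherd": "dog"}  # map breed names etc. onto one canonical tag

def crude_stem(word):
    # Naive suffix stripping; PorterStemmer applies proper rules instead.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tag_sentence(sentence):
    # Lowercase, strip punctuation, drop stop words, stem what remains.
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    content = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    # Prefer a word the synonym table can normalise, else the first content word.
    mapped = [SYNONYMS[w] for w in content if w in SYNONYMS]
    if mapped:
        return mapped[0]
    return content[0] if content else None

def group_by_tag(sentences):
    # Step 3: bucket the sentences by their assigned keyword.
    groups = defaultdict(list)
    for s in sentences:
        groups[tag_sentence(s)].append(s)
    return dict(groups)
```

With the three example sentences, `group_by_tag` produces a `"dog"` group of two sentences and a `"cat"` group of one.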

  • By attaching identifiers, I think you mean something like NLTK's part-of-speech tagging (word classes) along with word stemming. Here is a link to the NLTK book, which could help you; the download instructions are given there as well.
    IMO the language of choice for this should be Python. I have a couple of examples you might want to look at:

    Stemming words

    >>> import nltk
    >>> from nltk.stem import PorterStemmer
    >>> stemmer = PorterStemmer()
    >>> stemmer.stem('cooking')
    'cook'
    
    Creating a part-of-speech tagged word corpus

    >>> from nltk.corpus.reader import TaggedCorpusReader
    >>> reader = TaggedCorpusReader('.', r'.*\.pos')
    >>> reader.words()
    ['The', 'expense', 'and', 'time', 'involved', 'are', ...]
    >>> reader.tagged_words()
    [('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
    >>> reader.sents()
    [['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
    >>> reader.tagged_sents()
    [[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]
    >>> reader.paras()
    [[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]]
    >>> reader.tagged_paras()
    [[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]]
    
    >>> from nltk.tokenize import SpaceTokenizer
    >>> reader = TaggedCorpusReader('.', r'.*\.pos', word_tokenizer=SpaceTokenizer())
    >>> reader.words()
    ['The', 'expense', 'and', 'time', 'involved', 'are', ...]
    
    >>> from nltk.tokenize import LineTokenizer
    >>> reader = TaggedCorpusReader('.', r'.*\.pos', sent_tokenizer=LineTokenizer())
    >>> reader.sents()
    [['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
    
    >>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=lambda t: t.lower())
    >>> reader.tagged_words(simplify_tags=True)
    [('The', 'at-tl'), ('expense', 'nn'), ('and', 'cc'), ...]
    
    >>> from nltk.tag import simplify
    >>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_brown_tag)
    >>> reader.tagged_words(simplify_tags=True)
    [('The', 'DET'), ('expense', 'N'), ('and', 'CNJ'), ...]
    >>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_tag)
    >>> reader.tagged_words(simplify_tags=True)
    [('The', 'A'), ('expense', 'N'), ('and', 'C'), ...]
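The tagged output above can feed the keyword step directly: nouns (Brown-style tags starting with `NN`) are usually the best tag candidates. A small sketch over an already-tagged sentence, using the reader's own example output (the `noun_keywords` helper is mine, not part of NLTK):

```python
def noun_keywords(tagged_sentence):
    # Keep only tokens whose POS tag marks a noun (Brown corpus 'NN*' tags).
    return [word.lower() for word, tag in tagged_sentence
            if tag.upper().startswith("NN")]

# The tagged sentence returned by reader.tagged_sents() above.
tagged = [('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'),
          ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'),
          ('astronomical', 'JJ'), ('.', '.')]
print(noun_keywords(tagged))  # ['expense', 'time']
```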
    
    Both of the code samples above are taken from the NLTK book examples. I have posted them so you can take them at face value, whether they prove useful or not.
    Think along the lines of combining both features. Do they serve your purpose?

    Also, you may want to look into stop words, in order to whittle the first sentence you gave down to the right answer.
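To illustrate, filtering a small stop-word list out of the first example sentence leaves only the candidate keyword. The list below is a hand-rolled stand-in (with "ran" and "fast" added purely for this illustration); in practice you would use `nltk.corpus.stopwords.words('english')` after `nltk.download('stopwords')`.

```python
# Hand-rolled stand-in for NLTK's stop-word corpus; "ran"/"fast" are
# added here only to whittle the example sentence down to its keyword.
STOPWORDS = {"the", "a", "an", "is", "are", "ran", "fast"}

def content_words(sentence):
    # Lowercase, strip punctuation, and drop stop words.
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    return [t for t in tokens if t not in STOPWORDS]

print(content_words("The dog ran fast."))  # ['dog']
```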


    What do you mean by "unique keyword"? For example, if the input were "The dog ran faster than the cat", would you tag it as "dog" or "cat"? How do you expect an algorithm to determine a single, unique tag that sums up an entire sentence? — Sorry, I just meant simple keywords… in your example, both dog and cat would apply.