Nlp 从句子列表中提取关键字的方法/工具
我有一个很大的句子列表,我想用它们各自独特的关键字来标记每个句子,以帮助我识别哪些句子是相似的,以便进行分组 例如: The dog ran fast. - tagged as: dog The cat is sleeping - tagged as: cat The German Sheppard is awake. - tagged as dog 狗跑得很快标记为:狗 猫正在睡觉-标记为:猫 德国人谢泼德醒了被标记为狗 我一直在研究诸如alchemy api和openCalais之类的工具来提取关键词,但是,似乎您更多地使用这些工具来从一块数据中提取含义,比如整个文档或段落,而不是标记1000个独特但相似的单独句子 简言之,理想情况下,我希望:Nlp 从句子列表中提取关键字的方法/工具,nlp,search-engine,data-mining,text-mining,semantic-analysis,Nlp,Search Engine,Data Mining,Text Mining,Semantic Analysis,我有一个很大的句子列表,我想用它们各自独特的关键字来标记每个句子,以帮助我识别哪些句子是相似的,以便进行分组 例如: The dog ran fast. - tagged as: dog The cat is sleeping - tagged as: cat The German Sheppard is awake. - tagged as dog 狗跑得很快标记为:狗 猫正在睡觉-标记为:猫 德国人谢泼德醒了被标记为狗 我一直在研究诸如alchemy api和openCalais之类的工具来
我认为附加标识符的意思类似于nltk的词性标注(词类)以及词干分析。这是nltkbook的一个链接,可能会对您有所帮助。下载说明如下所示
IMO选择的语言应该是Python 我有几个例子,你可能想看看: 词干词
>>>import nltk
>>>from nltk.stem import PorterStemmer
>>>stemmer = PorterStemmer()
>>>stemmer.stem('cooking')
#'cook'
>>> from nltk.corpus.reader import TaggedCorpusReader
>>> reader = TaggedCorpusReader('.', r'.*\.pos')
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader.tagged_sents()
[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]
>>> reader.paras()
[[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]]
>>> reader.tagged_paras()
[[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]]
>>> from nltk.tokenize import SpaceTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', word_tokenizer=SpaceTokenizer())
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> from nltk.tokenize import LineTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', sent_tokenizer=LineTokenizer())
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=lambda t: t.lower())
>>> reader.tagged_words(simplify_tags=True)
[('The', 'at-tl'), ('expense', 'nn'), ('and', 'cc'), ...]
>>> from nltk.tag import simplify
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_brown_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'DET'), ('expense', 'N'), ('and', 'CNJ'), ...]
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'A'), ('expense', 'N'), ('and', 'C'), ...]
创建词性标记词语料库
>>>import nltk
>>>from nltk.stem import PorterStemmer
>>>stemmer = PorterStemmer()
>>>stemmer.stem('cooking')
#'cook'
>>> from nltk.corpus.reader import TaggedCorpusReader
>>> reader = TaggedCorpusReader('.', r'.*\.pos')
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader.tagged_sents()
[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]
>>> reader.paras()
[[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]]
>>> reader.tagged_paras()
[[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]]
>>> from nltk.tokenize import SpaceTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', word_tokenizer=SpaceTokenizer())
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> from nltk.tokenize import LineTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', sent_tokenizer=LineTokenizer())
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=lambda t: t.lower())
>>> reader.tagged_words(simplify_tags=True)
[('The', 'at-tl'), ('expense', 'nn'), ('and', 'cc'), ...]
>>> from nltk.tag import simplify
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_brown_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'DET'), ('expense', 'N'), ('and', 'CNJ'), ...]
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'A'), ('expense', 'N'), ('and', 'C'), ...]
以上两个代码示例取自nltk的书籍示例。我已经发布了,所以无论它是否有用,您都可以按面值来接受。按照这两个特性的组合思路思考。它们符合你的目的吗?
另外,你可能想研究一下停止词,以便从你给出的第一句话中找出正确的答案。我认为附加标识符的意思类似于nltk的词性标记(词类)和词干分析。这是nltkbook的一个链接,可能会对您有所帮助。下载说明如下所示
IMO选择的语言应该是Python 我有几个例子,你可能想看看: 词干词
>>>import nltk
>>>from nltk.stem import PorterStemmer
>>>stemmer = PorterStemmer()
>>>stemmer.stem('cooking')
#'cook'
>>> from nltk.corpus.reader import TaggedCorpusReader
>>> reader = TaggedCorpusReader('.', r'.*\.pos')
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader.tagged_sents()
[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]
>>> reader.paras()
[[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]]
>>> reader.tagged_paras()
[[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]]
>>> from nltk.tokenize import SpaceTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', word_tokenizer=SpaceTokenizer())
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> from nltk.tokenize import LineTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', sent_tokenizer=LineTokenizer())
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=lambda t: t.lower())
>>> reader.tagged_words(simplify_tags=True)
[('The', 'at-tl'), ('expense', 'nn'), ('and', 'cc'), ...]
>>> from nltk.tag import simplify
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_brown_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'DET'), ('expense', 'N'), ('and', 'CNJ'), ...]
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'A'), ('expense', 'N'), ('and', 'C'), ...]
创建词性标记词语料库
>>>import nltk
>>>from nltk.stem import PorterStemmer
>>>stemmer = PorterStemmer()
>>>stemmer.stem('cooking')
#'cook'
>>> from nltk.corpus.reader import TaggedCorpusReader
>>> reader = TaggedCorpusReader('.', r'.*\.pos')
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader.tagged_sents()
[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]
>>> reader.paras()
[[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]]
>>> reader.tagged_paras()
[[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]]
>>> from nltk.tokenize import SpaceTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', word_tokenizer=SpaceTokenizer())
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> from nltk.tokenize import LineTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', sent_tokenizer=LineTokenizer())
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=lambda t: t.lower())
>>> reader.tagged_words(simplify_tags=True)
[('The', 'at-tl'), ('expense', 'nn'), ('and', 'cc'), ...]
>>> from nltk.tag import simplify
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_brown_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'DET'), ('expense', 'N'), ('and', 'CNJ'), ...]
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'A'), ('expense', 'N'), ('and', 'C'), ...]
以上两个代码示例取自nltk的书籍示例。我已经发布了,所以无论它是否有用,您都可以按面值来接受。按照这两个特性的组合思路思考。它们符合你的目的吗?
此外,你可能还想研究一下停止词,以便从你给出的第一句话中找出正确的答案。你所说的“唯一关键字”是什么意思?例如,如果输入为“狗跑得比猫快”,您是否将其标记为“狗”或“猫”?如何期望算法确定一个单独的、唯一的标记来概括整个句子?对不起,只是指关键字…在您的示例中,狗和猫适用。您所说的“唯一关键字”是什么意思?例如,如果输入为“狗跑得比猫快”,您是否将其标记为“狗”或“猫”?你希望算法如何确定一个单一的、唯一的标记来概括整个句子?对不起,在你的例子中,只需要简单的关键字…狗和猫就可以了。