N-grams from text in Python

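This is an update of my previous question, with a few changes:

Say I have 100 tweets. From those tweets I need to extract: 1) food names and 2) beverage names. I also need to attach a type (drink or food) and an id number (each item has a unique id) to every extraction.

I already have a lexicon with names, types and id numbers: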

lexicon = {
'dr pepper': {'type': 'drink', 'id': 'd_123'},
'coca cola': {'type': 'drink', 'id': 'd_234'},
'cola': {'type': 'drink', 'id': 'd_345'},
'banana': {'type': 'food', 'id': 'f_456'},
'banana split': {'type': 'food', 'id': 'f_567'},
'cream': {'type': 'food', 'id': 'f_678'},
'ice cream': {'type': 'food', 'id': 'f_789'}}

Tweet example:

After various processing of "tweet_1" I have the following sentences:

sentences = [
'dr pepper is better than coca cola and suits banana split with ice cream', 
'coca cola and banana is not a good combo']

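My requested output (it can be a type other than a list):

Importantly, the output should NOT extract unigrams that are part of n-grams (n>1):

Ideally, before the extraction I would like to be able to run my sentences through various nltk filters, such as lemmatize() and pos_tag(), to get output like the following. But with this regexp solution, if I do that, all words are split into unigrams, or the string "coca cola" yields 1 unigram and 1 bigram, which produces output I do not want (as in the example above). Ideal output (again, the output type is not important):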

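I would use a loop to filter, with an if statement that looks for each lexicon key as a substring of the sentence. If you want to include unigrams as well, remove the condition

len(k.split()) > 1

If you want only unigrams, change it to:

len(k.split()) == 1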

filtered_list = ['tweet_id_1']
for k, v in lexicon.items():
    for s in sentences:
        if k in s and len(k.split()) > 1:
            filtered_list.extend((k, v))
print(filtered_list)
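
As a minimal sketch of the toggle described above, the word-count condition can be passed in as a parameter. The function name filter_lexicon and its keep argument are illustrative and not part of the original answer. The match is still a plain substring test, so a unigram such as 'cola' is reported even when it only occurs inside 'coca cola'.

```python
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']


def filter_lexicon(sentences, lexicon, keep=lambda n_words: n_words > 1):
    """Return (phrase, info) pairs for lexicon keys found as substrings.

    `keep` (a hypothetical parameter) decides, from the number of words
    in a phrase, whether that phrase is considered at all.
    """
    found = []
    for phrase, info in lexicon.items():
        if not keep(len(phrase.split())):
            continue
        for sentence in sentences:
            if phrase in sentence:  # substring test, as in the loop above
                found.append((phrase, info))
    return found


# Only multi-word phrases (the default condition)
ngrams_only = filter_lexicon(sentences, lexicon)
# Only single-word phrases
unigrams_only = filter_lexicon(sentences, lexicon, keep=lambda n: n == 1)

print([phrase for phrase, _ in ngrams_only])
print([phrase for phrase, _ in unigrams_only])
```

Passing keep=lambda n: True would correspond to removing the length condition altogether.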

This may not be the most efficient solution, but it should get you started:

sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

chunks = []

for sentence in sentences:
    for lex in lexicon_list:
        if lex in sentence:
            chunks.append({lex: list(lexicon[lex].values())})
            sentence = sentence.replace(lex, '')

print(chunks)

Output

[{'dr pepper': ['drink', 'd_123']}, {'coca cola': ['drink', 'd_234']}, {'banana split': ['food', 'f_567']}, {'ice cream': ['food', 'f_789']}, {'coca cola': ['drink', 'd_234']}, {'banana': ['food', 'f_456']}]
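Explanation

lexicon_list = list(lexicon.keys()) collects the phrases to search for, and the sort call orders them by word count, longest first, so that larger chunks are found (and removed from the sentence) before the shorter ones they contain.

The output is a list of dicts, where each dict maps a phrase to a list value.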
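
The longest-first idea can also be expressed as a single regular expression. The sketch below is one possible variant, not the answer's code: it escapes each phrase with re.escape and adds \b word boundaries (both are assumptions about what is wanted), so that 'cola' is not reported inside 'coca cola' and 'banana' is not matched inside 'bananas'. The helper name extract is hypothetical.

```python
import re

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

# Longest phrases first (by word count, then by characters): regex
# alternation tries alternatives left to right, so 'banana split' is
# tried before 'banana' at the same position.
phrases = sorted(lexicon, key=lambda p: (len(p.split()), len(p)), reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b")


def extract(sentence):
    """Return (phrase, type, id) for each non-overlapping lexicon match."""
    return [(m.group(0), lexicon[m.group(0)]['type'], lexicon[m.group(0)]['id'])
            for m in pattern.finditer(sentence)]


for sentence in sentences:
    print(extract(sentence))
```

Because finditer returns non-overlapping matches, the text consumed by 'coca cola' is not scanned again for 'cola', which plays the same role as the sentence.replace(lex, '') step above.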
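Unfortunately I cannot comment because of my low reputation, but Vivek's answer can be improved with 1) a regex, 2) the pos_tag tokens included as NN, and 3) a dictionary structure that lets you select the results tweet by tweet: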

import re
import nltk
from collections import OrderedDict

tweets = {"tweet_1": ['dr pepper is better than coca cola and suits banana split with ice cream',
                      'coca cola and banana is not a good combo']}

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

# a regex should be much faster than the "in" operator
pattern = "(" + "|".join(lexicon_list) + ")"
pattern = re.compile(pattern)

# Here we make the dictionary of our phrases and their tagged equivalents
lexicon_pos_tag = {word: nltk.pos_tag(nltk.word_tokenize(word)) for word in lexicon_list}
# If you train a model that recognizes e.g. "banana split" as ("banana split", "NN")
# rather than as ("banana", "NN") and ("split", "NN"), you could use the following:
# lexicon_pos_tag = {word: nltk.pos_tag(word) for word in lexicon_list}

# chunks maps each tweet id to a list with one result dict per sentence
chunks = OrderedDict()
for tweet in tweets:
    chunks[tweet] = []
    for sentence in tweets[tweet]:
        temp = OrderedDict()
        for word in pattern.findall(sentence):
            temp[word] = [lexicon_pos_tag[word], [lexicon[word]["type"], lexicon[word]["id"]]]
        chunks[tweet].append(temp)

The final output is:

OrderedDict([('tweet_1', [OrderedDict([('dr pepper', [[('dr', 'NN'), ('pepper', 'NN')], ['drink', 'd_123']]), ('coca cola', [[('coca', 'NN'), ('cola', 'NN')], ['drink', 'd_234']]), ('banana split', [[('banana', 'NN'), ('split', 'NN')], ['food', 'f_567']]), ('ice cream', [[('ice', 'NN'), ('cream', 'NN')], ['food', 'f_789']])]), OrderedDict([('coca cola', [[('coca', 'NN'), ('cola', 'NN')], ['drink', 'd_234']]), ('banana', [[('banana', 'NN')], ['food', 'f_456']])])])])

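This does not find "banana" in the second sentence. It should detect all n-grams, but without producing copies of the same string.

Possible duplicate? Not a duplicate, but very similar.

Thank you for your reply. However, the point of pos_tag is not that every "banana" should be NN, but that only the bananas of type NN should be found, using a pre-trained model.

Sure, but as I pointed out in the code comment above lexicon_pos_tag: if you run the code above after training a pos_tag model, then the line
lexicon_pos_tag = {word: nltk.pos_tag(word) for word in lexicon_list}
will create a dictionary like {"banana split": ("banana split", "NN")}, which is then used further down in
temp[word] = [lexicon_pos_tag[word], ...

Thanks! For now I am working with your original regexp solution, but I will try the update as well. Great input! :)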
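
The point raised in these comments, keeping a phrase only when its tokens are tagged as nouns, can be sketched at the token level. The tags below are written by hand for illustration; in practice they would come from something like nltk.pos_tag(nltk.word_tokenize(sentence)), and the real tags may differ. The function name extract_tagged and the allowed_prefix parameter are hypothetical.

```python
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}


def extract_tagged(tagged_tokens, lexicon, allowed_prefix='NN'):
    """Greedy longest-match of lexicon phrases over (token, tag) pairs.

    A phrase is kept only if every one of its tokens has a tag that
    starts with `allowed_prefix` (an assumption about the desired filter).
    """
    max_len = max(len(phrase.split()) for phrase in lexicon)
    found = []
    i = 0
    while i < len(tagged_tokens):
        # Try the longest window first so n-grams win over their unigrams.
        for n in range(max_len, 0, -1):
            window = tagged_tokens[i:i + n]
            if len(window) < n:
                continue
            phrase = ' '.join(token for token, _ in window)
            if phrase in lexicon and all(tag.startswith(allowed_prefix)
                                         for _, tag in window):
                found.append((phrase, lexicon[phrase]['type'], lexicon[phrase]['id']))
                i += n  # skip the tokens consumed by this match
                break
        else:
            i += 1  # nothing matched at this position
    return found


# Hand-written tags, for illustration only.
tagged = [('coca', 'NN'), ('cola', 'NN'), ('and', 'CC'), ('banana', 'NN'),
          ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('good', 'JJ'),
          ('combo', 'NN')]
result = extract_tagged(tagged, lexicon)
print(result)
```

Working on tokens rather than on the raw string also means that lemmatizing or tagging the sentence first does not break multi-word phrases apart, as long as the lexicon keys are written in the same tokenized form.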