
n-grams from text in Python


An update of my previous version, with a few changes:

Say I have 100 tweets. From these tweets I need to extract: 1) food names, and 2) beverage names. I also need to attach the type (drink or food) and an id number to each extraction (each item has a unique id).

I already have a lexicon with the names, types and id numbers:

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}
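
Each lexicon entry maps a phrase to its metadata, so lookups are direct, e.g.:

print(lexicon['dr pepper']['type'])   # drink
print(lexicon['banana split']['id'])  # f_567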

Tweet example:

After various processing of "tweet_1", I have the following sentences:

sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']
My requested output (it can be a type other than a list):

Importantly, the output should not extract unigrams that appear inside ngrams (n>1):



Ideally, I would like to be able to run my sentences through various nltk filters such as lemmatize() and pos_tag() before the extraction, to get an output like the one below. But with this regexp solution, if I do that, then all the words are split into unigrams, or they generate 1 unigram and 1 bigram from the string "coca cola", which produces output I don't want (as in the example above). Ideal output (again, the output type is not important):


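For reference, a minimal sketch of that kind of preprocessing (assuming nltk's WordNetLemmatizer and the default tagger; the exact pipeline is not shown in the original post):

import nltk
from nltk.stem import WordNetLemmatizer

# Requires the 'punkt', 'averaged_perceptron_tagger' and 'wordnet'
# nltk data packages.
lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize('coca cola and banana is not a good combo')
tagged = nltk.pos_tag(tokens)  # e.g. [('coca', 'NN'), ('cola', 'NN'), ...]
lemmas = [lemmatizer.lemmatize(tok) for tok in tokens]
print(tagged)
print(lemmas)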
I would use a loop that filters with an if statement, looking for the string in the keys. If you want to include unigrams, remove

len(key.split()) > 1

If you only want to include unigrams, change it to:

len(key.split()) == 1

filtered_list = ['tweet_id_1']

for k, v in lexicon.items():
    for s in sentences:
        if k in s and len(k.split()) > 1:
            filtered_list.extend((k, v))

print(filtered_list)
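
On Python 3.7+, where dicts preserve insertion order, running this as-is should print something like the list below; 'coca cola' appears twice because it occurs in both sentences, and unigrams such as 'banana' are skipped entirely by the len(k.split()) > 1 filter:

['tweet_id_1', 'dr pepper', {'type': 'drink', 'id': 'd_123'}, 'coca cola', {'type': 'drink', 'id': 'd_234'}, 'coca cola', {'type': 'drink', 'id': 'd_234'}, 'banana split', {'type': 'food', 'id': 'f_567'}, 'ice cream', {'type': 'food', 'id': 'f_789'}]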

Probably not the most efficient solution, but this will definitely get you started:

sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

# Sort phrases by word count, longest first, so that multi-word phrases
# ('banana split') are matched before their sub-phrases ('banana').
lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

chunks = []

for sentence in sentences:
    for lex in lexicon_list:
        if lex in sentence:
            chunks.append({lex: list(lexicon[lex].values())})
            # Remove the matched phrase so that shorter phrases
            # contained in it cannot match again.
            sentence = sentence.replace(lex, '')

print(chunks)
Output

[{'dr pepper': ['drink', 'd_123']}, {'coca cola': ['drink', 'd_234']}, {'banana split': ['food', 'f_567']}, {'ice cream': ['food', 'f_789']}, {'coca cola': ['drink', 'd_234']}, {'banana': ['food', 'f_456']}]
Explanation

lexicon_list = list(lexicon.keys()) gets the list of phrases that need to be searched and sorts them by word count (so that the larger chunks are found first).

The output is a list of dicts, where each dict has list values.
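
To see why the longest-first sort matters, a small illustration (my example, not from the answer):

# Without longest-first ordering, 'banana' would match first and the
# replace() would leave 'split' behind, so 'banana split' is never found.
sentence = 'suits banana split with ice cream'
for lex in sorted(['banana', 'banana split'], key=lambda s: len(s.split()), reverse=True):
    if lex in sentence:
        print('matched:', lex)  # matched: banana split
        sentence = sentence.replace(lex, '')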

Unfortunately I cannot comment due to low reputation, but Vivek's answer can be improved by 1) regex, 2) pos_tag with NN, and 3) a dictionary structure in which the results are keyed by tweet, so you can pick out one tweet's results:

import re
import nltk
from collections import OrderedDict

tweets = {"tweet_1": ['dr pepper is better than coca cola and suits banana split with ice cream', 'coca cola and banana is not a good combo']}

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

# regex will be much faster than the "in" operator
pattern = "(" + "|".join(lexicon_list) + ")"
pattern = re.compile(pattern)

# Here we make the dictionary of our phrases and their tagged equivalents
lexicon_pos_tag = {word: nltk.pos_tag(nltk.word_tokenize(word)) for word in lexicon_list}
# If you train a model that recognizes e.g. "banana split" as ("banana split", "NN")
# rather than ("banana", "NN") and ("split", "NN"), you could use the following:
# lexicon_pos_tag = {word: nltk.pos_tag(word) for word in lexicon_list}

# chunks will register the results keyed by tweet id
chunks = OrderedDict()
for tweet in tweets:
    chunks[tweet] = []
    for sentence in tweets[tweet]:
        temp = OrderedDict()
        for word in pattern.findall(sentence):
            temp[word] = [lexicon_pos_tag[word], [lexicon[word]["type"], lexicon[word]["id"]]]
        chunks[tweet].append(temp)
The final output is:

OrderedDict([('tweet_1',
          [OrderedDict([('dr pepper',
                         [[('dr', 'NN'), ('pepper', 'NN')],
                          ['drink', 'd_123']]),
                        ('coca cola',
                         [[('coca', 'NN'), ('cola', 'NN')],
                          ['drink', 'd_234']]),
                        ('banana split',
                         [[('banana', 'NN'), ('split', 'NN')],
                          ['food', 'f_567']]),
                        ('ice cream',
                         [[('ice', 'NN'), ('cream', 'NN')],
                          ['food', 'f_789']])]),
           OrderedDict([('coca cola',
                         [[('coca', 'NN'), ('cola', 'NN')],
                          ['drink', 'd_234']]),
                        ('banana',
                         [[('banana', 'NN')], ['food', 'f_456']])])])])
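
One possible refinement (my suggestion, not part of the answer above): escaping the phrases and adding word boundaries keeps the alternation from matching inside longer words:

import re

lexicon_list = ['banana split', 'coca cola', 'dr pepper', 'ice cream', 'banana', 'cola', 'cream']
# re.escape guards against regex metacharacters in the phrases, and \b
# stops e.g. 'cola' from matching inside a word such as 'colada'.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, lexicon_list)) + r")\b")
print(pattern.findall('a pina colada with coca cola'))  # ['coca cola']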

This doesn't find "banana" in the second sentence. It should detect all ngrams, but not produce duplicates of the same string.

Possible duplicate?

Not a duplicate, but very similar.

Thanks for the reply. However, the point of pos_tag is not that every "banana" should be an NN, but that only bananas of the NN type are found by a pre-trained model.

Sure, but as I noted in the lexicon_pos_tag comment above: if you run the above code after training a pos_tag model, the line lexicon_pos_tag = {word: nltk.pos_tag(word) for word in lexicon_list} would create a dictionary like {"banana split": ("banana split", "NN")}, which is then used further on in temp[word] = [lexicon_pos_tag[word], ….

Thanks! For now I'm working with your original regexp solution, but I'll try the update too. Great input! :)