Python: grouping consecutive words tagged B or I

Tags: python, python-3.x, nlp

I have the following data:

[[('Natural', 'JJ', 'B'), ('language', 'NN', 'I'), ('processing', 'NN', 'I'), ('is', 'VBZ', 'O'), ('one', 'CD', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('important', 'JJ', 'O'), ('branch', 'NN', 'O'), ('of', 'IN', 'O'), ('CS', 'NNP', 'B'), ('.', '.', 'I')] ... ...]]
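I want to group consecutive words tagged B or I, ignoring words tagged 'O'.

The output keywords should look like this:

Natural language processing, CS, Machine learning, deep learning

I wrote the following code: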

data=[[('Natural', 'JJ', 'B'), ('language', 'NN', 'I'), ('processing', 'NN', 'I'), ('is', 'VBZ', 'O'), ('one', 'CD', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('important', 'JJ', 'O'), ('branch', 'NN', 'O'), ('of', 'IN', 'O'), ('CS', 'NNP', 'B'), ('.', '.', 'I')],
[('Machine', 'NN', 'B'), ('learning', 'NN', 'I'), (',', ',', 'I'), ('deep', 'JJ', 'I'), ('learning', 'NN', 'I'), ('are', 'VBP', 'O'), ('heavily', 'RB', 'O'), ('used', 'VBN', 'O'), ('in', 'IN', 'O'), ('natural', 'JJ', 'B'), ('language', 'NN', 'I'), ('processing', 'NN', 'I'), ('.', '.', 'I')],
[('It', 'PRP', 'O'), ('is', 'VBZ', 'O'), ('too', 'RB', 'O'), ('cool', 'JJ', 'O'), ('.', '.', 'O')]]
Key_words = []
index = 0
for sen in data:
    for i in range(len(sen)):
        while index < len(sen):
I don't know what to do next. Can anyone help me?

Thanks.

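Hope this helps:

# data is a list of sentences, so flatten it before filtering on the tag
tokens = [tok for sen in data for tok in sen]
remove_o = list(filter(lambda x: x[2] in ['I', 'B'], tokens))
words = [item[0] for item in remove_o]
result = ' '.join(words)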
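
The snippet above produces one joined string rather than separate keywords. A minimal sketch of an explicit loop, closer to what the question started, might look like the following. The helper name group_chunks is hypothetical, and treating punctuation as a break between keywords is an assumption taken from the expected output in the question.

```python
import string

PUNCTUATION = set(string.punctuation)

def group_chunks(sentence):
    """Collect runs of B/I-tagged words in one sentence (hypothetical helper)."""
    chunks = []
    current = []
    for word, _pos, tag in sentence:
        # Assumption: punctuation separates keywords even when tagged B or I.
        is_break = tag == 'O' or word in PUNCTUATION
        if is_break or tag == 'B':
            # An O tag, a punctuation mark, or a new B tag closes the open chunk.
            if current:
                chunks.append(' '.join(current))
            current = []
        if not is_break:
            current.append(word)
    if current:
        chunks.append(' '.join(current))
    return chunks

sample = [('Natural', 'JJ', 'B'), ('language', 'NN', 'I'), ('processing', 'NN', 'I'),
          ('is', 'VBZ', 'O'), ('CS', 'NNP', 'B'), ('.', '.', 'I')]
print(group_chunks(sample))
# ['Natural language processing', 'CS']
```

Calling group_chunks on each sentence of data would give one list of keywords per sentence.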

You can use itertools.groupby for a fairly compact solution:

import itertools
import string

data = [[('Natural', 'JJ', 'B'), ('language', 'NN', 'I'), ('processing', 'NN', 'I'), ('is', 'VBZ', 'O'), ('one', 'CD', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('important', 'JJ', 'O'), ('branch', 'NN', 'O'), ('of', 'IN', 'O'), ('CS', 'NNP', 'B'), ('.', '.', 'I')],
[('Machine', 'NN', 'B'), ('learning', 'NN', 'I'), (',', ',', 'I'), ('deep', 'JJ', 'I'), ('learning', 'NN', 'I'), ('are', 'VBP', 'O'), ('heavily', 'RB', 'O'), ('used', 'VBN', 'O'), ('in', 'IN', 'O'), ('natural', 'JJ', 'B'), ('language', 'NN', 'I'), ('processing', 'NN', 'I'), ('.', '.', 'I')],
[('It', 'PRP', 'O'), ('is', 'VBZ', 'O'), ('too', 'RB', 'O'), ('cool', 'JJ', 'O'), ('.', '.', 'O')]]

punctuation = set(string.punctuation)
keywords = [[' '.join(w[0] for w in g) for k, g in itertools.groupby(sen, key=lambda x: x[0] not in punctuation and x[2] != 'O') if k] for sen in data]

print(keywords)
# [['Natural language processing', 'CS'],
#  ['Machine learning', 'deep learning', 'natural language processing'],
#  []]
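
To see why this works, it may help to unpack the one-liner for a single sentence. The sketch below uses a shortened sample sentence and a named key function (is_keyword_token, a name introduced here for illustration) in place of the lambda; it is meant as an illustration of how groupby splits the tokens into runs, not as a replacement for the answer.

```python
import itertools
import string

punctuation = set(string.punctuation)

sentence = [('Machine', 'NN', 'B'), ('learning', 'NN', 'I'), (',', ',', 'I'),
            ('deep', 'JJ', 'I'), ('learning', 'NN', 'I'), ('are', 'VBP', 'O')]

def is_keyword_token(token):
    # True for words that belong to a keyword: not punctuation and not tagged O.
    return token[0] not in punctuation and token[2] != 'O'

# groupby yields one (key, group) pair per run of consecutive tokens
# that share the same key value.
runs = [(key, [tok[0] for tok in group])
        for key, group in itertools.groupby(sentence, key=is_keyword_token)]
print(runs)
# [(True, ['Machine', 'learning']), (False, [',']),
#  (True, ['deep', 'learning']), (False, ['are'])]

# Keeping only the runs whose key is True gives the keywords.
phrases = [' '.join(words) for key, words in runs if key]
print(phrases)
# ['Machine learning', 'deep learning']
```

Note that groupby only looks at the key value, so two keywords that sit directly next to each other (a B tag immediately after an I tag) would end up in the same run.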

You need the first value of each tuple whenever the third element is not 'O', right? You can do it like this:

output = [j[0] for i in data for j in i if(j[2]!='O')]
The code above is equivalent to:

for i in data:
    for j in i:
        if(j[2]!='O'): # if(j[2] in ['I','B']) also works
            print(j[0]) # Or append to the output list
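
Both forms return a single flat list and keep punctuation that carries an I tag. If one list per sentence is wanted, a nested comprehension is one option. This is a sketch on a small made-up sample, and dropping punctuation is an assumption based on the expected output in the question.

```python
import string

punctuation = set(string.punctuation)

sample = [[('CS', 'NNP', 'B'), ('.', '.', 'I')],
          [('It', 'PRP', 'O'), ('is', 'VBZ', 'O')]]

# One list of words per sentence; punctuation tagged I is dropped as well.
per_sentence = [[tok[0] for tok in sen
                 if tok[2] != 'O' and tok[0] not in punctuation]
                for sen in sample]
print(per_sentence)
# [['CS'], []]
```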