Convert column data from a text file into nested lists in Python?


I have a txt file that contains sentences and their labels written in columns, like this:

O   are
O   there
O   any
O   good
B-GENRE romantic
I-GENRE comedies
O   out
B-YEAR  right
I-YEAR  now

O   show
O   me
O   a
O   movie
O   about
B-PLOT  cars
I-PLOT  that
I-PLOT  talk

I want to read the data from this txt file into two nested lists. The desired output should look like this:

labels = [['O','O','O','O','B-GENRE','I-GENRE','O','B-YEAR','I-YEAR'],['O','O','O','O','O','B-PLOT','I-PLOT','I-PLOT']]
sentences = [['are','there','any','good','romantic','comedies','out','right','now'],['show','me','a','movie','about','cars','that','talk']]

I tried the following:

with open("engtrain.bio.txt", "r") as f:
  lsta = []
  for line in f:
    lsta.append([x for x in line.replace("\n", "").split()])


But I get the following output:

[['O', 'are'],
 ['O', 'there'],
 ['O', 'any'],
 ['O', 'good'],
 ['B-GENRE', 'romantic'],
 ['I-GENRE', 'comedies'],
 ['O', 'out'],
 ['B-YEAR', 'right'],
 ['I-YEAR', 'now'],
 [],
 ['O', 'show'],
 ['O', 'me'],
 ['O', 'a'],
 ['O', 'movie'],
 ['O', 'about'],
 ['B-PLOT', 'cars'],
 ['I-PLOT', 'that'],
 ['I-PLOT', 'talk']]

Update: I also tried the following:


with open("engtest.bio.txt", "r") as f:
  lines = f.readlines()
  labels = []
  sentences = []
  for l in lines:
    as_list = l.split("\t")
    labels.append(as_list[0])
    sentences.append(as_list[1].replace("\n", ""))

Unfortunately, I still get an error:

IndexError                                Traceback (most recent call last)
<ipython-input-66-63c266df6ace> in <module>()
      6     as_list = l.strip().split("\t")
      7     labels.append(as_list[0])
----> 8     sentences.append(as_list[1].replace("\n", ""))

IndexError: list index out of range


The original data comes from this link (engtest.bio or engtrain.bio):

Can you help me?

Thanks in advance.

Iterate over each line and split it on the tab character:

all_labels, all_sentences = [], []
with open('inp', 'r') as f:
    lines = f.readlines()
    lines.append('') # make sure we process the last sentence
    labels, sentences = [], []
    for line in lines:
        line = line.strip()
        if not line: # detect the end of a sentence
            if len(labels): # make sure we got some words here
                all_labels.append(labels)
                all_sentences.append(sentences)
                labels, sentences = [], []
            continue
        # extend the current sentence
        label, sentence = line.split()
        labels.append(label)
        sentences.append(sentence)

print(all_labels)
print(all_sentences)

labels = [[]]
sentences = [[]]
with open('engtrain.bio', 'r') as f:
    for line in f.readlines():
        line = line.rstrip()
        if line:
            label, sentence = line.split('\t')
            labels[-1].append(label)
            sentences[-1].append(sentence)
        else:
            labels.append([])
            sentences.append([])

Output labels:

[['O', 'O', 'O', 'B-ACTOR', 'I-ACTOR'], ['O', 'O', 'O', 'O', 'B-ACTOR', 'I-ACTOR', 'O', 'O', 'B-YEAR'] ...

Output sentences:

[['what', 'movies', 'star', 'bruce', 'willis'], ['show', 'me', 'films', 'with', 'drew', 'barrymore', 'from', 'the', '1980s'] ...
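As a variation on the loops above, the grouping by blank lines can also be expressed with `itertools.groupby` from the standard library. This is only a sketch: the helper name `read_bio` and the inline sample data are made up for illustration, and it assumes every non-blank line has exactly two whitespace-separated fields (label, then word).

```python
from itertools import groupby


def read_bio(lines):
    """Split 'LABEL<whitespace>WORD' lines into per-sentence nested lists.

    `read_bio` is a hypothetical helper name, not part of any library.
    """
    labels, sentences = [], []
    stripped = (line.strip() for line in lines)
    # groupby clusters consecutive lines by truthiness:
    # runs of blank lines (False) versus runs of non-blank lines (True).
    for has_text, group in groupby(stripped, key=bool):
        if not has_text:
            continue  # one or more blank lines: just a sentence separator
        pairs = [row.split(None, 1) for row in group]
        labels.append([label for label, _ in pairs])
        sentences.append([word for _, word in pairs])
    return labels, sentences


# Small inline sample standing in for the real file.
sample = "O\tare\nO\tthere\n\nB-PLOT\tcars\nI-PLOT\tthat\n\n".splitlines()
labels, sentences = read_bio(sample)
print(labels)
print(sentences)
```

To use it on the real data, pass the open file object instead of `sample`, for example `with open('engtrain.bio') as f: labels, sentences = read_bio(f)`. Because a whole run of blank lines forms a single group, repeated or trailing blank lines do not produce empty sentences.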


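The lines in the file can be logically grouped into sections, separated by blank lines. So you actually have a two-level data structure: you need to process a list of sections, and within each section you need to process a list of lines. The text file itself, of course, is just a flat list of lines, so we need to rebuild the two levels.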

This is a very general pattern, so here is one way to code it that can be reused no matter what you need to do within each section:

labels = []
sentences = []

# Prepare next section
inner_labels = []
inner_sentences = []

with open('engtrain.bio.txt') as f:
    for line in f:
        if not line.strip():
            # Finish previous section (skip empty ones, e.g. repeated blank lines)
            if inner_labels:
                labels.append(inner_labels)
                sentences.append(inner_sentences)
            # Prepare next section
            inner_labels = []
            inner_sentences = []
            continue
        # Process line in section
        label, word = line.split()
        inner_labels.append(label)
        inner_sentences.append(word)

# Finish previous section (the file may not end with a blank line)
if inner_labels:
    labels.append(inner_labels)
    sentences.append(inner_sentences)

To reuse it in a different situation, just redefine "Prepare next section", "Process line in section", and "Finish previous section".

There may be a more Pythonic way to preprocess the list of lines and so on, but this is a reliable pattern that gets the job done.
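One possible way to make the pattern reusable is to move the section handling into a generator and keep the per-line work separate. The names `iter_sections` and `split_columns` below are hypothetical, and the sketch assumes each non-blank line holds exactly two whitespace-separated fields.

```python
def iter_sections(lines):
    """Yield one list of non-blank lines per blank-line-separated section."""
    section = []
    for line in lines:
        line = line.strip()
        if line:
            section.append(line)   # process line in section
        elif section:
            yield section          # finish previous section
            section = []           # prepare next section
    if section:
        yield section              # the input may not end with a blank line


def split_columns(section):
    """Turn ['O are', 'B-X foo'] into (['O', 'B-X'], ['are', 'foo'])."""
    labels, words = zip(*(row.split() for row in section))
    return list(labels), list(words)


# Small inline sample standing in for the real file.
text = "O are\nO there\n\n\nB-PLOT cars\nI-PLOT that\n"
parsed = [split_columns(s) for s in iter_sections(text.splitlines())]
labels = [section_labels for section_labels, _ in parsed]
sentences = [section_words for _, section_words in parsed]
print(labels)
print(sentences)
```

With this split, `iter_sections` can be reused unchanged for any blank-line-delimited file, and only `split_columns` needs to change when the per-section processing differs.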


Thanks for your answer! However, applying this code to the original dataset produces the following error: IndexError: list index out of range. I guess the problem is the blank lines between sentences? I used a list comprehension of the form words_2d = [line.split() ... if line != '\n'] and it handles the blank lines. However, it returns two flat lists. I need the sentences in nested lists :(

Got it, I missed the part about using blank lines to mark the end of a sentence. Please see the updated code.

Thanks! I have a small question: how can you download the file as .bio? I copy-pasted it into a txt file, because the direct download does not have the right structure.

Wow! You did in 12 lines what I did in 25. I'm upvoting :-)

@AliF In the browser, click Save As and then choose the "All Files" type option.
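To illustrate the IndexError discussed in the comments above: a blank line, once stripped and split on a tab, yields a list with a single empty string, so there is no element at index 1. The helper `parse_line` below is a hypothetical name for a small guard, sketched under the assumption that valid lines contain exactly two whitespace-separated fields.

```python
blank = "\n"
parts = blank.strip().split("\t")
print(parts)       # a list holding one empty string
print(len(parts))  # only index 0 exists, so parts[1] raises IndexError


def parse_line(line):
    """Return (label, word), or None for a blank separator line."""
    fields = line.split()
    if not fields:
        return None          # blank line: the caller should start a new sentence
    label, word = fields     # raises ValueError if the line is malformed
    return label, word


print(parse_line("B-GENRE romantic\n"))
print(parse_line("\n"))
```

Filtering blank lines out up front avoids the error but also discards the sentence boundaries, which is why that approach produces flat lists rather than nested ones.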