Using Python splitlines() to convert a text file into a list while combining some lines into a single list item


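I am working on converting text files that were originally written for a VBA application into lists of strings for a new Python application. Each text file has multiple strings on separate lines, but for simplicity I am giving just one string per text file. The problem I am running into is that, because of the Excel/VBA line character limit, a vector can span multiple lines. Here is an example:

vector(1)="This is the first vector that only takes 1 line!"
vector(2)="This is some of the text for vector 2 but it continues!"
vector(2)= vector(2) & "This is the continuation of the text for vector 2!"
vector(3)= "This is a new vector with only a single line!"

What I am trying to do is loop through the list created by splitlines() and build a new list: for each line, look back at the previous line to see whether it has the same "vector(x)" label, and if so, concatenate the strings before appending to the final list. However, it adds both the unfinished string and the concatenated string to the list. Here is the code I am using: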

import os
import re

Lines = open(doc).read().splitlines()
New_Lines = []
previous_label = 0
vector_label = 0
previous_contents = 0
vector_contents = 0
for z, vector_check in enumerate(Lines, 1):
    if vector_check.startswith("vector"):
        v_split = re.split(r"=", vector_check)
        previous_label = vector_label
        vector_label = v_split[0]
        previous_contents = vector_contents
        vector_contents = v_split[1]
    else:
        continue
    # print(vector_label)
    if previous_label != vector_label:
        repeat = 0
        New_Lines.append(vector_contents)
    else:
        repeat += 1
        vec_split_2 = re.split(r"&", v_split[1])
        vector_contents = previous_contents[:-1] + " " + vec_split_2[1][2:]
        New_Lines.append(vector_contents)
        print(vector_contents)
        continue

i = 1
for obj in New_Lines:
    print("vector_CRS(" + str(i) + ")=" + obj)
    i += 1

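This gives the result:

vector_CRS(1)="This is the first vector that only takes 1 line!"
vector_CRS(2)="This is some of the text for vector 2 but it continues!"
vector_CRS(3)="This is some of the text for vector 2 but it continues! This is the continuation of the text for vector 2!"
vector_CRS(4)="This is a new vector with only a single line!"

I have also tried looking ahead in the list (which is why the enumerate is there), but the results were worse than these. This is the last "puzzle piece" of a larger script, and although it feels simple, as if I am missing an easy answer, I have already spent hours trying to fix this part.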
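
A likely reason for the extra entry is that the continuation branch appends a second, longer item instead of updating the item already added for the first line of that vector. The following is only a minimal sketch of that idea, not the original script: `sample_lines` and `combine` are hypothetical names, and it assumes continuation lines always directly follow the line they continue.

```python
# Hypothetical sample input mirroring the example lines in the question.
sample_lines = [
    'vector(1)="This is the first vector that only takes 1 line!"',
    'vector(2)="This is some of the text for vector 2 but it continues!"',
    'vector(2)= vector(2) & "This is the continuation of the text for vector 2!"',
    'vector(3)= "This is a new vector with only a single line!"',
]


def combine(lines):
    """Join continuation lines onto the entry created for the same label."""
    combined = []
    previous_label = None
    for line in lines:
        if not line.startswith("vector"):
            continue  # skip anything that is not a vector assignment
        label, _, rest = line.partition("=")
        # Keep only what follows the first double quote on the right-hand side.
        quote_at = rest.find('"')
        text = rest[quote_at + 1:].strip()
        if text.endswith('"'):
            text = text[:-1]  # drop the closing quote when there is one
        if label == previous_label:
            # Same label as the previous line: update the last item in place
            # rather than appending a second copy.
            combined[-1] = combined[-1] + " " + text
        else:
            combined.append(text)
        previous_label = label
    return combined


result = combine(sample_lines)
for number, text in enumerate(result, 1):
    print(f'vector_CRS({number})="{text}"')
```

With the sample input this yields three items, with both vector 2 lines merged into the second one.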

If you have a text file vectors.txt that looks like this:

vector(1)="This is the first vector that only takes 1 line!"
vector(2)="This is some of the text for vector 2 but it continues!"
vector(2)= vector(2) & "This is the continuation of the text for vector 2!"
vector(3)= "This is a new vector with only a single line!"

you can use a regular expression pattern together with itertools.groupby to group the vectors by their number. Then, using another regular expression, merge the contents of all the vectors in each group:

def main():

    with open("vectors.txt", "r") as file:
        lines = file.read().splitlines()

    def merge_vectors(lines):
        from itertools import groupby
        import re

        for _, group in groupby(lines, key=lambda line: re.match(r"vector\((\d+)\)", line).group(1)):
            yield " ".join(re.search("\"(.+)\"", item).group(1) for item in group)

    print(list(merge_vectors(lines)))

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

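Output:

['This is the first vector that only takes 1 line!', 'This is some of the text for vector 2 but it continues! This is the continuation of the text for vector 2!', 'This is a new vector with only a single line!']

This assumes that the lines in the vectors.txt file are already grouped together by vector number, because itertools.groupby only groups consecutive items that share a key. For example, it assumes you cannot have the following:

vector(1)="Part of one"
vector(2)="Part of two"
vector(1)= vector(1) & "Also part of one"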
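
If the lines could be interleaved like that, one possible alternative is to collect the pieces in a dictionary keyed by vector number instead of relying on consecutive grouping. This is a sketch under assumptions rather than part of the solution above: `interleaved`, `LINE_RE` and `merge_by_number` are hypothetical names, and every vector line is assumed to contain one quoted string with a closing quote.

```python
import re

# Hypothetical interleaved input, like the counter-example above.
interleaved = [
    'vector(1)="Part of one"',
    'vector(2)="Part of two"',
    'vector(1)= vector(1) & "Also part of one"',
]

# Group 1 is the vector number, group 2 is the text between the quotes.
LINE_RE = re.compile(r'vector\((\d+)\)\s*=.*?"(.*)"')


def merge_by_number(lines):
    """Collect quoted text per vector number, regardless of line order."""
    parts = {}  # dicts preserve insertion order in Python 3.7+
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines that are not complete vector assignments
        number, text = match.groups()
        parts.setdefault(int(number), []).append(text)
    return {number: " ".join(texts) for number, texts in parts.items()}


merged_by_number = merge_by_number(interleaved)
print(merged_by_number)
```

The trade-off is that the whole file is held in the dictionary before anything is produced, whereas the groupby version can yield each vector as soon as its group ends.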

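EDIT - I have taken a look at the text file in your repl.it. I made some changes to the regular expression patterns and to the code in general; mostly I just made some steps more explicit. The patterns are now more lenient: for example, something like vector(2)=vector(2) & " will no longer throw an exception, but since there is no content between double quotes it will be ignored. Lines that do not end with a double quote are also handled. All lines are filtered before processing so that only lines starting with vector_CRS(...) are included, so you no longer need to skip the first five or so lines manually.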

def main():
    import re

    line_pattern = r"vector_CRS\((?P<vector_number>\d+)\)"
    # [^"]* instead of a greedy .* so that an optional closing quote
    # is not swallowed into the captured content
    content_pattern = "\"(?P<content>[^\"]*)\"?"

    def is_vector_line(line):
        return re.match(line_pattern, line) is not None

    with open("vectors.txt", "r") as file:
        # keep only the vector lines, then strip surrounding whitespace
        lines = list(map(str.strip, filter(is_vector_line, file)))

    def merge_vectors(lines):
        from itertools import groupby

        def key(line):
            return re.match(line_pattern, line).group("vector_number")

        def get_content(item):
            return re.search(content_pattern, item).group("content")

        for _, group in groupby(lines, key=key):
            # filter(None, ...) drops empty contents before joining
            yield " ".join(filter(None, map(get_content, group)))

    merged = list(merge_vectors(lines))

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
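
To make the three cases described in the edit concrete, here is a small sketch of how a lenient content pattern behaves on a closed line, an unclosed line and a line with nothing after the quote. The sample lines are made up for illustration, and the pattern shown is an assumed variant: it uses `[^"]*` because a greedy `.*` placed before an optional closing quote would capture that quote as well.

```python
import re

# Assumed variant of a lenient content pattern: [^"]* keeps quote
# characters out of the captured group, and the closing quote is optional.
CONTENT_RE = re.compile(r'"(?P<content>[^"]*)"?')

# Hypothetical sample lines covering the three cases.
samples = {
    "closed": 'vector_CRS(1)="complete text"',
    "unclosed": 'vector_CRS(1)=vector_CRS(1) & "text without closing quote',
    "empty": 'vector_CRS(2)=vector_CRS(2) & "',
}

contents = {
    name: CONTENT_RE.search(line).group("content")
    for name, line in samples.items()
}

# Empty strings are falsy, so filter(None, ...) drops the "empty" case.
kept = list(filter(None, contents.values()))
print(contents)
print(kept)
```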

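Thanks! This looks like it will work well - I will test it this afternoon.

It works great! The vectors will always be in order and grouped accordingly. One thing I did notice, though, is that the txt files I will be using start the vectors on line 6. Before that there is some unnecessary data (the name of the source file, the name of the outgoing transmission, etc.). However, the vectors always start on line 6, so I will try to build a workaround into your solution. Thanks again!

@KevinW. Glad to hear it! If they always start on line 6, you can simply slice the lines starting from the fifth index (the sixth line): lines = file.read().splitlines()[5:]

Line 7 in the text file, the line that starts with vector_CRS(1)=vector_CRS(1) "compte pour…, has no closing double quote.

@KevinW. Sure, let me know if you have any other questions.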
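
As a quick illustration of the slicing suggestion in the comments, the sketch below builds a made-up file body with five header lines followed by two vector lines and drops the header with a slice. The content of `raw_text` is hypothetical; only the `splitlines()[5:]` step comes from the discussion.

```python
# Hypothetical file content: five header lines, then the vector lines.
raw_text = "\n".join(
    [f"header line {n}" for n in range(1, 6)]
    + ['vector_CRS(1)="first"', 'vector_CRS(2)="second"']
)

# Index 5 is the sixth line, so [5:] drops exactly the five header lines.
vector_lines = raw_text.splitlines()[5:]
print(vector_lines)
```

This only works while the header length is fixed; filtering on the line prefix, as in the edited answer, does not depend on that.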