Python: join lines to the previous line based on the number of letters in the first column

python regex python-3.x csv


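I am new to coding and trying to figure out how to fix a broken CSV file so that it can be processed properly.

The file was exported from a case management system and contains the fields user name, case number, time spent, note, and date.

The problem is that the free-text notes contain line breaks, and when exporting the CSV the tool does not wrap them in quotation marks to mark them as a single string within the field.

See the example below: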

user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;

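I would like to join rows 3, 4 and 5 so that they read: tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;

Since every row starts with a user name (always 3 letters), I thought I could iterate over the lines, find the ones that do not start with a user name, and join them to the previous line. But it does not really work as expected.

This is what I have so far: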

import re

with open('Rapp.txt', 'r') as f:

 for line in f:
  previous = line #keep current line in variable to join next line
  if not re.match(r'^[A-Za-z]{3}', line): #regex to match 3 letters
   print(previous.join(line))

The script produces no output and just ends silently. Any ideas?

I think I would take a slightly different approach:

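import re

all_the_data = ""

with open('Rapp.txt', 'r') as f:
    for line in f:
        # A complete record ends with a date followed by ';'.
        # Any other line continues on the next one, so drop its line break.
        # Note: the header line does not end with a date either,
        # so it ends up on the same line as the first record.
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            line = re.sub(r"\n", "", line)
        all_the_data = "".join([all_the_data, line])
print(all_the_data)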

There are several ways to do this, each with its pros and cons, but I think this one is simple.
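One caveat of the date-based check is that the header line does not end with a date, so it would be merged into the first record. The following is only a minimal sketch of one way around that, assuming the sample rows from the question and that the first line is always the header. The helper name join_until_date and the choice to join with a space are illustrative, not part of the answer above.

```python
import re

# A complete record ends with a date followed by ';'.
DATE_END = re.compile(r"\d{4}-\d{1,2}-\d{1,2};$")

def join_until_date(lines):
    """Hypothetical helper: collect lines until one ends with a date."""
    records, pending = [], []
    for line in lines:
        pending.append(line.rstrip("\n"))
        if DATE_END.search(pending[-1]):
            records.append(" ".join(pending))
            pending = []
    if pending:  # keep an unfinished trailing record instead of dropping it
        records.append(" ".join(pending))
    return records

sample = [
    "user;case;hours;note;date;\n",
    "tnn;124;2;random comment;2017-11-27;\n",
    "tnn;125;3;I am writing a comment\n",
    "that contains new lines\n",
    "without quotation marks;2017-11-28;\n",
]

header, *body = sample  # assumption: the first line is the header
records = [header.rstrip("\n")] + join_until_date(body)
for record in records:
    print(record)
```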

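Loop over the file as you did; if a line does not end with a date and a ;, strip its line break and append it to all_the_data. That way you never have to look back through the file. Again, there are many ways to do this. If you would rather use the logic of a line starting with 3 letters and a ; and look back instead, this works: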

import re

all_the_data = ""

with open('Rapp.txt', 'r') as f:
    for line in f:
        # A line that does not start with 3 letters and ';' continues the
        # previous record, so drop the line break that currently ends all_the_data.
        if not re.search(r"^[A-Za-z]{3};", line):
            all_the_data = re.sub(r"\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])

print("results:")
print(all_the_data)

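That is pretty much what we want. The logic is: if the current line does not start correctly, remove the line break at the end of the previous line from all_the_data.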
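The same look-back logic can be tried without a file on disk. This is a rough sketch, assuming the sample rows from the question. The helper name join_continuations is made up for illustration, and joining with a space (rather than with nothing) is an assumption based on the output the question asks for.

```python
import re

# A record starts with a 3-letter user name followed by ';'.
USER_PREFIX = re.compile(r"^[A-Za-z]{3};")

def join_continuations(lines):
    """Hypothetical helper: glue continuation lines onto the record before them."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if USER_PREFIX.match(line) or not records:
            # New record (or the very first line, i.e. the header).
            records.append(line)
        else:
            # Continuation: append to the previous record.
            records[-1] = records[-1] + " " + line
    return records

sample = [
    "user;case;hours;note;date;\n",
    "tnn;123;4;solved problem;2017-11-27;\n",
    "tnn;125;3;I am writing a comment\n",
    "that contains new lines\n",
    "without quotation marks;2017-11-28;\n",
    "HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;\n",
]

records = join_continuations(sample)
for record in records:
    print(record)
```

Keeping the records in a list instead of one growing string also makes it easy to write them back out one per line afterwards.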

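If you need help with regular expressions themselves, this site is great:

The regex in your code matches every line (string) in the txt file, because each line starts with at least three letters. The if condition is therefore never true, so nothing is printed.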

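with open('./Rapp.txt', 'r') as f:
    join_words = []

    for line in f:
        line = line.strip()
        # A ';' within the first 4 characters marks the start of a new record,
        # so print the record collected so far and start a new one.
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            print(' '.join(join_words))
            join_words = []
        join_words.append(line)

    # Print the last record. Joining with a space keeps the note in one field.
    print(' '.join(join_words))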

I try to avoid regular expressions where possible to keep the code clear. However, a regular expression is probably the better choice here.
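To see why the pattern from the question matches every line, it can be checked against the three lines of the broken record. This is only an illustrative sketch; the stricter pattern with a trailing ; is one possible tightening, not the only one.

```python
import re

lines = [
    "tnn;125;3;I am writing a comment",
    "that contains new lines",
    "without quotation marks;2017-11-28;",
]

loose = re.compile(r"^[A-Za-z]{3}")    # pattern used in the question
strict = re.compile(r"^[A-Za-z]{3};")  # also requires the ';' after the user name

loose_hits = [bool(loose.match(line)) for line in lines]
strict_hits = [bool(strict.match(line)) for line in lines]

print(loose_hits)   # [True, True, True]
print(strict_hits)  # [True, False, False]
```

The continuation lines start with ordinary words such as "that", so three leading letters alone cannot tell them apart from a user name.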

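A simple approach is to use a generator as a filter over the original file. The filter joins a line to the previous one if its 4th character is not a semicolon (;). The code could be: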

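def preprocess(fd):
    previous = next(fd, None)  # the first line (header) is taken as is
    if previous is None:       # empty file: nothing to yield
        return
    for line in fd:
        # Slicing avoids an IndexError on lines shorter than 4 characters.
        if line[3:4] == ';':
            yield previous
            previous = line
        else:
            # Continuation line: glue it onto the record being built.
            previous = previous.strip() + " " + line
    yield previous  # don't forget last line!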

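You can then use it like this:

import csv

with open('test.txt') as fd:
    # The file uses ';' as separator, so pass it as the delimiter.
    rd = csv.DictReader(preprocess(fd), delimiter=';')
    for row in rd:
        ...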

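The trick here is that the csv module only needs an object that returns one line each time the next function is applied to it, so a generator is a good fit.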
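A quick way to try this without touching the real export is to feed the generator from an in-memory file. The sketch below assumes the sample rows from the question and uses io.StringIO as a stand-in for the file; the generator is repeated (with a guard for short lines) so the snippet runs on its own.

```python
import csv
import io

def preprocess(fd):
    """Yield complete records, gluing continuation lines onto the previous one."""
    previous = next(fd, None)
    if previous is None:
        return
    for line in fd:
        if line[3:4] == ";":  # 3-letter user name followed by ';'
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous

# Stand-in for the broken export.
broken = io.StringIO(
    "user;case;hours;note;date;\n"
    "tnn;123;4;solved problem;2017-11-27;\n"
    "tnn;125;3;I am writing a comment\n"
    "that contains new lines\n"
    "without quotation marks;2017-11-28;\n"
)

# The trailing ';' on each line produces one extra, empty column.
rows = list(csv.DictReader(preprocess(broken), delimiter=";"))
for row in rows:
    print(row["case"], "->", row["note"])
```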

But this is only a workaround; the proper fix is for the previous step to produce a correct CSV file directly.
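For comparison, this is roughly what a correct export looks like when it is produced with Python's csv module: a field containing a line break is wrapped in quotation marks and reads back as a single field. It is only a sketch, since the actual export tool is not known; the row values are taken from the question.

```python
import csv
import io

note = "I am writing a comment\nthat contains new lines\nwithout quotation marks"

# Write one header and one record; the default quoting wraps the note in quotes
# because it contains line breaks.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")
writer.writerow(["user", "case", "hours", "note", "date"])
writer.writerow(["tnn", "125", "3", note, "2017-11-28"])

exported = buffer.getvalue()
print(exported)

# Reading it back yields the note as one field, line breaks included.
rows = list(csv.reader(io.StringIO(exported, newline=""), delimiter=";"))
print(len(rows))
```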

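What happens if a note contains

? If possible, maybe you should try to fix the CSV export.

I used the second solution because the real file contains more fields and the last field is not a date. That value is sometimes empty, which makes the first solution hard to use.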