Python: reading a file with start and stop conditions

regex linux python-3.x pandas


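Hi all, I have the file data below and would like to process it to get the expected output. As a Python learner, I am curious whether there is a way to do this based on start and stop boolean indexing.

Here, a record in the file starts with the string SRV:. In some cases a record starts and ends on the same line, while in other cases it continues onto the following lines.

File text data:

Expected output:

Is there a better way to achieve this? I am fine with pandas as well.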

Use groupby, with the cumulative sum of startswith('SRV') as the group key, then aggregate with join:

df1 = (df['col'].groupby(df['col'].str.startswith('SRV').cumsum())
                .agg(' '.join)
                .reset_index(drop=True)
                .to_frame(name='new'))
print (df1)
                                                 new
0                             SRV: this is for bryan
1                             SRV: this is for terry
2  SRV: this is for torain sec01: This is reserve...
3                               SRV: this is for Jun

Details:

print (df['col'].str.startswith('SRV').cumsum())
0    1
1    2
2    3
3    3
4    3
5    3
6    4
Name: col, dtype: int32
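
One way to see why this works: startswith('SRV') is True exactly on the lines that open a record, and a running sum of those flags only increases on such a line, so every continuation line inherits the number of the SRV line above it. Below is a minimal plain-Python sketch of the same idea, using itertools.accumulate in place of the pandas cumsum. The names lines, flags, keys and records are made up for this illustration.

```python
from itertools import accumulate

lines = [
    "SRV: this is for bryan",
    "SRV: this is for terry",
    "SRV: this is for torain",
    "sec01: This is reserved",
    "sec02: This is open for all",
    "sec03: Closed!",
    "SRV: this is for Jun",
]

# True (counts as 1) where a new record starts, False (0) elsewhere
flags = [line.startswith("SRV") for line in lines]

# Running total: it only grows on an SRV line, so continuation
# lines keep the key of the record they belong to
keys = list(accumulate(int(f) for f in flags))
print(keys)

# Collect the lines that share a key, then join each group
records = {}
for key, line in zip(keys, lines):
    records.setdefault(key, []).append(line)

for key, parts in records.items():
    print(key, " ".join(parts))
```

The keys list holds the same values as the cumsum output shown above, which is all the groupby needs to decide which rows belong together.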

To build the DataFrame, use:

import pandas as pd
from io import StringIO

temp = """col
SRV: this is for bryan

SRV: this is for terry

SRV: this is for torain
sec01: This is reserved
sec02: This is open for all
sec03: Closed!

SRV: this is for Jun"""
# after testing, replace StringIO(temp) with 'filename.csv'
# (pd.compat.StringIO is no longer available in recent pandas releases, so io.StringIO is used)
df = pd.read_csv(StringIO(temp), sep="|")

print (df)
                           col
0       SRV: this is for bryan
1       SRV: this is for terry
2      SRV: this is for torain
3      sec01: This is reserved
4  sec02: This is open for all
5               sec03: Closed!
6         SRV: this is for Jun

Pure Python solution:

from itertools import groupby
from operator import itemgetter

out = []
with open("file.csv") as f1:
    last = 0
    for i, line in enumerate(f1):
        line = line.strip()
        if not line:
            # skip blank separator lines so they do not end up in the joined text
            continue
        if line.startswith('SRV'):
            # remember the index of the most recent SRV line
            last = i
        out.append([line, last])

with open("out_file.csv", "w") as f2:
    # consecutive lines that share the same SRV index form one record
    for _, g in groupby(out, key=itemgetter(1)):
        f2.write(' '.join(map(itemgetter(0), g)) + '\n')
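
If the file is large, the same grouping can be done in a single pass without first collecting every line in a list. The sketch below is one possible variant and not part of the original answer: iter_records is a hypothetical helper name, and the "SRV" prefix and the single-space join are taken from the code above. An in-memory io.StringIO stands in for the real file so the example runs on its own.

```python
import io


def iter_records(lines, start="SRV"):
    """Yield one joined string per record; a record starts at a line
    beginning with `start` and runs until the next such line."""
    current = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank separator lines
        if line.startswith(start) and current:
            # a new record begins: emit the one collected so far
            yield " ".join(current)
            current = []
        current.append(line)
    if current:
        # emit the last record, which has no following start line
        yield " ".join(current)


sample = io.StringIO(
    "SRV: this is for bryan\n"
    "\n"
    "SRV: this is for terry\n"
    "\n"
    "SRV: this is for torain\n"
    "sec01: This is reserved\n"
    "sec02: This is open for all\n"
    "sec03: Closed!\n"
    "\n"
    "SRV: this is for Jun\n"
)
result = list(iter_records(sample))
for record in result:
    print(record)
```

With a real file, the open file object can be passed in place of sample, since iterating over a file yields its lines one at a time. Any lines that appear before the first SRV line are emitted as a record of their own.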

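You can try something like

df[0].groupby(df[0].str.startswith('SRV').cumsum()).apply(''.join)

where 0 is the column name. (Note: this uses a pandas DataFrame.)

@anky_91, that works well too.

This is really great, jezrael. +1

jezrael, can you explain how it remembers that it has to hold the data until it sees the next srv?

@user294110 - added a pure Python solution to the edited answer.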
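
Since the question is tagged regex, a regular-expression version is also possible. This is only a sketch under assumptions and not something from the answers above: it assumes the whole file fits in memory and that every record starts with SRV: at the beginning of a line. The names text, pattern and records are illustrative.

```python
import re

text = (
    "SRV: this is for bryan\n"
    "\n"
    "SRV: this is for terry\n"
    "\n"
    "SRV: this is for torain\n"
    "sec01: This is reserved\n"
    "sec02: This is open for all\n"
    "sec03: Closed!\n"
    "\n"
    "SRV: this is for Jun\n"
)

# Start condition: a line beginning with "SRV:".
# Stop condition: the next line that begins with "SRV:" (negative lookahead),
# or the end of the text.
pattern = re.compile(r"^SRV:.*(?:\n(?!SRV:).*)*", flags=re.MULTILINE)

records = []
for chunk in pattern.findall(text):
    # drop blank lines and join the remaining lines with a single space
    parts = [part.strip() for part in chunk.splitlines() if part.strip()]
    records.append(" ".join(parts))

for record in records:
    print(record)
```

For a real file, text could be obtained with open("file.csv").read(). For very large files a line-by-line approach is likely the better fit.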
