Regex: Python, reading a file with start and stop conditions [Regex, Linux, Python 3.x, Pandas]


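Hi everyone, I have the file data shown below and want to process it to get the expected output. As a Python learner, I'm curious whether there is a way to do this based on start and stop Boolean indexing.

Here, each record in the file starts with the string "SRV:". In some cases a record starts and ends on the same line, while in other cases it continues over the following lines.

File text data:
Expected output:
Is there a better way to achieve this? I'm fine with pandas too.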

Use startswith with cumsum to build the groups, then aggregate with join:

df1 = (df['col'].groupby(df['col'].str.startswith('SRV').cumsum())
                .agg(' '.join)
                .reset_index(drop=True)
                .to_frame(name='new'))
print (df1)
                                                 new
0                             SRV: this is for bryan
1                             SRV: this is for terry
2  SRV: this is for torain sec01: This is reserve...
3                               SRV: this is for Jun
Details:

print (df['col'].str.startswith('SRV').cumsum())
0    1
1    2
2    3
3    3
4    3
5    3
6    4
Name: col, dtype: int32
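To see how the grouping "remembers" which record a continuation line belongs to, it may help to reproduce the same steps without pandas. The following is only a minimal sketch, assuming the seven non-blank lines of the sample data; itertools.accumulate stands in for cumsum here, and the names mask, group_ids and records are illustrative, not part of the original answer.

```python
from itertools import accumulate

# The seven non-blank lines from the sample data.
lines = [
    "SRV: this is for bryan",
    "SRV: this is for terry",
    "SRV: this is for torain",
    "sec01: This is reserved",
    "sec02: This is open for all",
    "sec03: Closed!",
    "SRV: this is for Jun",
]

# 1 where a new record starts, 0 on continuation lines (the boolean mask).
mask = [int(line.startswith("SRV")) for line in lines]

# Running total of the mask, the plain-Python counterpart of cumsum().
# The total only grows on an SRV line, so every continuation line
# keeps the number of the most recent SRV line.
group_ids = list(accumulate(mask))

# Collect the lines by group id (dicts preserve insertion order),
# then join each group into a single string.
records = {}
for gid, line in zip(group_ids, lines):
    records.setdefault(gid, []).append(line)
merged = [" ".join(parts) for parts in records.values()]

print(group_ids)
for row in merged:
    print(row)
```

Nothing is stored between lines beyond the running total: lines with the same total simply end up in the same group.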
For the DataFrame, use:

from io import StringIO

import pandas as pd

temp=u"""col
SRV: this is for bryan

SRV: this is for terry

SRV: this is for torain
sec01: This is reserved
sec02: This is open for all
sec03: Closed!

SRV: this is for Jun"""
# pd.compat.StringIO was removed from newer pandas versions, so io.StringIO is used here.
# After testing, replace 'StringIO(temp)' with 'filename.csv'.
df = pd.read_csv(StringIO(temp), sep="|")

print (df)
                           col
0       SRV: this is for bryan
1       SRV: this is for terry
2      SRV: this is for torain
3      sec01: This is reserved
4  sec02: This is open for all
5               sec03: Closed!
6         SRV: this is for Jun
Pure Python solution:

from itertools import groupby
from operator import itemgetter

out = []
with open("file.csv") as f1:
    last = 0
    for i, line in enumerate(f1):
        line = line.strip()
        if not line:
            continue  # skip blank lines so they do not add stray spaces
        if line.startswith('SRV'):
            last = i  # index of the most recent SRV line is the group key
        out.append([line, last])

with open("out_file.csv", "w") as f2:
    # consecutive lines sharing the same key belong to one SRV record
    for _, g in groupby(out, key=itemgetter(1)):
        f2.write(' '.join(map(itemgetter(0), g)) + '\n')
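The same start-condition logic can also be written as a generator that streams the input instead of storing a group index for every line. This is a sketch under stated assumptions, not part of the original answer: merge_records and its start parameter are hypothetical names, and the sample text is the data from the DataFrame example.

```python
def merge_records(lines, start="SRV"):
    """Yield one joined string per record.

    A record begins at a line starting with `start` and runs until
    the next such line (or the end of the input).
    """
    current = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank separator lines
        if line.startswith(start) and current:
            # A new record begins, so emit the one collected so far.
            yield " ".join(current)
            current = []
        current.append(line)
    if current:
        # Emit the final record, which has no following start line.
        yield " ".join(current)


sample = """SRV: this is for bryan

SRV: this is for terry

SRV: this is for torain
sec01: This is reserved
sec02: This is open for all
sec03: Closed!

SRV: this is for Jun"""

result = list(merge_records(sample.splitlines()))
for record in result:
    print(record)
```

Because the function accepts any iterable of strings, an open file object could be passed in place of sample.splitlines().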

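You can try something like df[0].groupby(df[0].str.startswith('SRV').cumsum()).apply(''.join), where 0 is the column name. (Note: this uses a pandas DataFrame.)
@anky_91, that works well too.
This is really great, jezrael, +1.
jezrael, can you explain how it remembers that it has to keep the data until it sees the next srv?
@user294110 - added a pure Python solution to the edited answer.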