Python: Split into lines and delete specific lines based on a search

python regex python-3.x linux unix

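I have a CSV file like the one shown below. With my limited knowledge of Python, I am trying to split its contents into lines that each start with the field "sec", and then delete the lines that contain a field matching sip:+99*, sip:+88*, or sip:+77*.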

cat text.csv

sec,sip:+1111,2222,3333,4444,5555,sec,6666,sip:+7777,8888,sec,sip:+9999,1000,1100,110,1200,1300,1400

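The desired output is one line per occurrence of the string "sec", with a line removed whenever it contains a field that begins with sip:+99*, sip:+88*, or sip:+77* (any digits may follow, e.g. sip:+99xxxx).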

Desired output after splitting:

sec,sip:+1111,2222,3333,4444,5555
sec,6666,sip:+7777,8888
sec,sip:+9999,1000,1100,110,1200,1300,1400

Desired output after deleting the lines with a matching field:

sec,sip:+1111,2222,3333,4444,5555

I have tried Python code using the csv and re modules, without success. I am new to Python programming, so please help.

def aggr(s):
  """Split a comma-separated string into substrings that each start with 'sec'."""
  lst = s.split(',')
  current = [lst[0]]
  result = []
  for i in lst[1:]:
    if i == 'sec':
      if current:
        result.append(','.join(current))
        current = []
    current.append(i)
  if current:
    result.append(','.join(current))

  return result

# Input String
s = 'sec,sip:+1111,2222,3333,4444,5555,sec,6666,sip:+7777,8888,sec,sip:+9999,1000,1100,110,1200,1300,1400'

# Aggregate substrings (i.e. substrings that start with 'sec')
l = aggr(s)
print('\n'.join(l))

# Filter out undesired substrings
prefixes = ('sip:+99', 'sip:+88', 'sip:+77')
# Only check the third column for a prefix match; rows with fewer than three fields are kept
result = [i for i in l
          if not (len(i.split(',')) > 2 and i.split(',')[2].startswith(prefixes))]
print()
print('\n'.join(result))

Output

sec,sip:+1111,2222,3333,4444,5555
sec,6666,sip:+7777,8888
sec,sip:+9999,1000,1100,110,1200,1300,1400

sec,sip:+1111,2222,3333,4444,5555
sec,sip:+9999,1000,1100,110,1200,1300,1400
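
The splitting step can also be applied to a file object rather than a hard-coded string. The following is only a minimal sketch under stated assumptions: the helper name iter_records is hypothetical, io.StringIO stands in for the real text.csv, and a record is assumed to continue across line breaks until the next "sec" field.

```python
import io


def iter_records(lines, marker='sec'):
    """Yield lists of fields, starting a new list at every `marker` field."""
    current = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        for field in line.strip().split(','):
            if field == marker and current:
                # A new marker closes the record collected so far
                yield current
                current = []
            current.append(field)
    if current:
        yield current


# io.StringIO stands in for open('text.csv') so the sketch runs without the file
sample = io.StringIO(
    'sec,sip:+1111,2222,3333,4444,5555,sec,6666,sip:+7777,8888,'
    'sec,sip:+9999,1000,1100,110,1200,1300,1400\n'
)
records = [','.join(r) for r in iter_records(sample)]
print('\n'.join(records))
```

To run this against the actual file, pass an open file object (for example the result of open('text.csv')) in place of sample.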

Python:

import re

s = 'sec,sip:+1111,2222,3333,4444,5555,sec,6666,sip:+7777,8888,sec,sip:+9999,1000,1100,110,1200,1300,1400'

# Start offset of every 'sec' field; \b keeps 'sec' from matching inside a longer token
pos = [m.start() for m in re.finditer(r'\bsec\b', s)]

# Slice the string between consecutive offsets; the last row runs to the end of s
raw_data = []
for start_idx, end_idx in zip(pos, pos[1:] + [len(s)]):
    # rstrip drops the comma that separated this row from the next one
    raw_data.append(s[start_idx:end_idx].rstrip(','))
print('\n'.join(raw_data))

# Drop every row that contains a sip:+77*, sip:+88* or sip:+99* field anywhere
p = re.compile(r'sip:\+(?:77|88|99)')
result = [row for row in raw_data if not p.search(row)]
print('\n' + '\n'.join(result))

Output after filtering with the regular expression:

sec,sip:+1111,2222,3333,4444,5555
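
As an alternative to slicing by match offsets, re.split with a lookahead can cut the string in a single call. This is a sketch rather than a definitive solution: it assumes "sec" always appears as a complete field, and the names rows, blocked and kept are chosen here for illustration.

```python
import re

s = ('sec,sip:+1111,2222,3333,4444,5555,sec,6666,sip:+7777,8888,'
     'sec,sip:+9999,1000,1100,110,1200,1300,1400')

# Cut at each comma that is directly followed by a whole 'sec' field.
# The lookahead consumes nothing, so 'sec' stays at the start of each row.
rows = re.split(r',(?=sec(?:,|$))', s)

# A field is blocked when it starts with sip:+77, sip:+88 or sip:+99
blocked = re.compile(r'(?:^|,)sip:\+(?:77|88|99)')
kept = [row for row in rows if not blocked.search(row)]

print('\n'.join(rows))
print()
print('\n'.join(kept))
```

Because the filter rejects rows that contain a blocked field, rather than accepting rows that contain an allowed one, a row with no sip field at all is kept.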

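Hi Darryl, thank you very much, it worked. I have one small doubt, though: result = [i for i in l if not any(i.startswith('sec,' + x) for x in prefixes)] is written so that a line counts as unwanted only when sip:+99* comes directly after "sec". My data is dynamic and can produce lines such as sec,6666,sip:+7777,8888, which still get printed because sip:+7777 does not directly follow "sec". Is it possible to search only the third field and delete the whole line? In {sec,6666,sip:+7777,8888}, sip:+77* is found in the third field, so the entire line should be dropped rather than printed.

@SSK -- I believe I made the mod you requested. It still aggregates the lines into strings that begin with sec. However, it now simply checks whether the line contains the substring (i.e. sip:+77) anywhere, rather than only directly after sec.

@ssk -- You're welcome, but also check out Neda Peyrone's solution below as an interesting alternative.

@DarrylG.. Thanks. The two approaches differ, but both give the output I need. I am currently working out for myself how to print lines based on a match in the third column only. The example above looks for +99* anywhere and deletes those unwanted lines, but it was my mistake to ask it that way in the first place. My actual requirement is to delete a line only where the third column matches sip:+99*, sip:+88*, or sip:+77*, so that I get just two lines of output: sec,sip:+1111,2222,3333,4444,5555 and sec,sip:+9999,1000,1100,110,1200,1300,1400 (the other line has a +77* match in its third field).

@ssk -- Modified to match on the third column. As you will notice in the code, there is only one small change, namely in the line that builds the result: I added i.split(',')[2] to retrieve column 3.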
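
The comments above discuss three different matching rules: a blocked prefix directly after "sec", a blocked prefix in any field, and a blocked prefix in the third field only. The sketch below puts them side by side on the sample rows. The function names are hypothetical and introduced only for this comparison, and rows too short to have the field in question are assumed to be kept.

```python
PREFIXES = ('sip:+99', 'sip:+88', 'sip:+77')


def blocked_after_sec(row):
    """Blocked prefix in the field directly after 'sec' (second field)."""
    fields = row.split(',')
    return len(fields) > 1 and fields[1].startswith(PREFIXES)


def blocked_anywhere(row):
    """Blocked prefix at the start of any field."""
    return any(f.startswith(PREFIXES) for f in row.split(','))


def blocked_third_field(row):
    """Blocked prefix in the third field only."""
    fields = row.split(',')
    return len(fields) > 2 and fields[2].startswith(PREFIXES)


rows = [
    'sec,sip:+1111,2222,3333,4444,5555',
    'sec,6666,sip:+7777,8888',
    'sec,sip:+9999,1000,1100,110,1200,1300,1400',
]

results = {}
for rule in (blocked_after_sec, blocked_anywhere, blocked_third_field):
    # Keep the rows that the rule does not flag
    results[rule.__name__] = [r for r in rows if not rule(r)]
    print(rule.__name__)
    print('\n'.join(results[rule.__name__]))
    print()
```

On this sample the first rule keeps the first two rows, the second keeps only the first row, and the third keeps the first and last rows, which is the two-line output requested in the final comment.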