Python regex for extracting specific strings

python regex pandas


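I want to iterate over a column of records (directory paths stored as strings) and extract the record ID enclosed in parentheses. In some cases, however, the parentheses contain details that are not a record ID and need to be ignored.

Code: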

df1['Doc ID'] = df['Folder Path'].str.extract(r'.*\((.*)\).*', expand=True)  # this does not ignore instances with (2018-03) or (yyyy-mm)

I also tried:

df1['Doc ID'] = df['Folder Path'].str.extract(r'\((?!date_format)([^()]+)\)', expand=True)  # this does not ignore (Data Only)

   Folder Path                                         Doc ID
1  /report/support + admin. (256)/ Global (2018-03)    (256)               # ignores: (2018-03)
2  /reports/limit/sector(139)/2017                     (139)
3  /reports/sector/region(147,189 and 132)/2018        (147, 189 and 132)
4  /reports/support.(Data Only)/Region (2558)          (2558)              # ignores: (Data Only)

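This uses a negative lookahead to filter out "Data Only", and a character class that excludes hyphens to filter out date formats: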

(\((?!Data Only)[^\-]+\))

Setup:

import pandas as pd

df = pd.DataFrame(
    {'Path': ['(Data Only) text (1, 2 and 3)',
    '(2013-08) foo (123)',
    '(Data Only) bar (1,2,3,4,5 and 6)']}
)

                                Path
0      (Data Only) text (1, 2 and 3)
1                (2013-08) foo (123)
2  (Data Only) bar (1,2,3,4,5 and 6)

Using str.extract:

df.Path.str.extract(r'(\((?!Data Only)[^\-]+\))', expand=True)

                   0
0       (1, 2 and 3)
1              (123)
2  (1,2,3,4,5 and 6)
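
One caveat with the pattern above: `[^\-]+` also matches parentheses, so a path with two hyphen-free groups such as `(123) foo (456)` would be captured as one long match. The sketch below is a stricter variant, not the answerer's method. It assumes the only groups to skip are the literal `(Data Only)` and dates shaped like `(yyyy-mm)`, and it applies the pattern to the four sample paths from the question. The `pattern` variable name is introduced here for illustration.

```python
import pandas as pd

# The four sample paths from the question.
df = pd.DataFrame({
    'Folder Path': [
        '/report/support + admin. (256)/ Global (2018-03)',
        '/reports/limit/sector(139)/2017',
        '/reports/sector/region(147,189 and 132)/2018',
        '/reports/support.(Data Only)/Region (2558)',
    ]
})

# Assumed rules: skip the literal "(Data Only)" and any "(yyyy-mm)" group.
# [^()]+ keeps each match inside a single pair of parentheses.
pattern = r'(\((?!Data Only\))(?!\d{4}-\d{2}\))[^()]+\))'

# expand=False returns a Series because the pattern has one capture group.
df['Doc ID'] = df['Folder Path'].str.extract(pattern, expand=False)
print(df['Doc ID'].tolist())
# ['(256)', '(139)', '(147,189 and 132)', '(2558)']
```

If other non-ID groups turn up in the data, each one would need its own negative lookahead, or the match could be restricted to what an ID is expected to look like instead.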

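So, what is your expected output? What distinguishes a record ID from a non-record ID?

Hi! My desired output is in the "Doc ID" column. Record IDs differ from non-record IDs in that they do not include "(Data Only)", nor "(yyyy-mm)" or any other date format. I don't know how to combine these two conditions into a single expression that finds the data inside the parentheses.

Are those really the only two options, the literal text (Data Only) and date formats? Or is (Data Only) just a generic placeholder for other text?

Unfortunately, I have no background in parsing with regular expressions (do you have any suggestions for a good place to learn?). The data in this file is a real mess, and this is the best way to identify the record ID. (Data Only) is literally what appears in the string: about 50% of the records containing it have an associated record ID, and the other half do not. I'll run this through the script now. Thanks for your help!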