Unable to extract strings in Python using multiple special characters or patterns at once
I have a dataset from which I am trying to extract simple town names out of the longer, messy versions shown here. Most of the names are followed by a parenthesis "(.*", but some do not follow this pattern and end with ":" (see row 200). Finally, some have no parentheses and separate the parts with a comma instead (see rows 240 and 246). Ideally, this is what I would like to see:
'RegionName'
196 Boston
197 Bridgewater
198 Cambridge
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato
242 Marshall
243 Moorhead
244 Morris
245 Northfield
246 North Mankato
247 St. Cloud
248 St. Joseph
249 St. Peter
My current code is:
df['RegionName'] = df['Region'].str.extract('(.*)[:(,]', expand=False)
But this gives me an odd result in which the parentheses are not handled correctly:
196 Boston (Boston University, Boston College, Bos...
197 Bridgewater
198 Cambridge (Harvard University, Massachusetts I...
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato (Minnesota State University, Mankato)
242 Marshall
243 Moorhead (Minnesota State University, Moorhead
244 Morris
245 Northfield (Carleton College
246 North Mankato
247 St. Cloud (St. Cloud State University
248 St. Joseph
249 St. Peter
I also tried:
df['RegionName'] = df['Region'].str.extract('(.*)[ (.*|:|,]', expand=False)
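I am not sure how to extract the strings using all three patterns at once. A two-line solution would also be fine.
Thanks (and apologies if the formatting is bad!)
Use this regex: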
([\w\s.]+)(?<!\s)
If you do not care about trailing whitespace, you can remove the negative lookbehind (?<!\s) at the end.
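As a quick check of the regex above, here is a minimal sketch using the standard re module rather than pandas. The helper name town_name and the sample list are made up for illustration; the sample strings are taken from the rows shown in the question.

```python
import re

# Regex from the answer: a run of word characters, whitespace or dots,
# ending at a position that is not preceded by whitespace.
TOWN = re.compile(r'([\w\s.]+)(?<!\s)')

def town_name(region):
    """Hypothetical helper: return the leading town name, or None if nothing matches."""
    m = TOWN.match(region)
    return m.group(1) if m else None

samples = [
    'Boston (Boston University, Boston College, Bos...',
    'The Colleges of Worcester Consortium:',
    'Faribault, South Central College',
    'St. Cloud (St. Cloud State University, The Col...',
]
names = [town_name(s) for s in samples]
print(names)
```

The character class stops at the first "(", ":" or ",", and the lookbehind makes the match back off past the space that precedes a parenthesis.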
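Because there are only three possible delimiters, you can take advantage of chained split() calls: if the delimiter is not found, split returns the string unmodified (as the only element of the resulting list).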
>>> s = """196 Boston (Boston University, Boston College, Bos...
... 197 Bridgewater (Bridgewater State College)[2]
... 198 Cambridge (Harvard University, Massachusetts I...
... 199 Chestnut Hill (Boston College)
... 200 The Colleges of Worcester Consortium:
... 201 Dudley (Nichols College)
... 240 Faribault, South Central College
... 241 Mankato (Minnesota State University, Mankato),...
... 242 Marshall (Southwest Minnesota State University...
... 243 Moorhead (Minnesota State University, Moorhead...
... 244 Morris (University of Minnesota Morris)[2]
... 245 Northfield (Carleton College, St. Olaf College...
... 246 North Mankato, South Central College
... 247 St. Cloud (St. Cloud State University, The Col...
... 248 St. Joseph (College of Saint Benedict)[2]
... 249 St. Peter (Gustavus Adolphus College)[2]"""
>>> for i in s.split('\n'):
...     number, text = i.split('(')[0].split(',')[0].split(':')[0].split(' ',1)
...     print('{} {}'.format(number, text.strip()))
...
196 Boston
197 Bridgewater
198 Cambridge
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato
242 Marshall
243 Moorhead
244 Morris
245 Northfield
246 North Mankato
247 St. Cloud
248 St. Joseph
249 St. Peter
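The chained splits above can also be collapsed into a single re.split call. This is only a sketch of an equivalent variant, not part of the original answer; the function name split_row is hypothetical.

```python
import re

def split_row(line):
    """Hypothetical helper: return (number, town) for one printed row."""
    # Split once at the first '(', ',' or ':' and keep what comes before it.
    head = re.split(r'[(,:]', line, maxsplit=1)[0]
    # The first space separates the row number from the town name.
    number, text = head.split(' ', 1)
    return number, text.strip()

rows = [
    '196 Boston (Boston University, Boston College, Bos...',
    '200 The Colleges of Worcester Consortium:',
    '246 North Mankato, South Central College',
]
result = [split_row(r) for r in rows]
for number, town in result:
    print('{} {}'.format(number, town))
```

Like str.split, re.split returns the whole string as a one-element list when no delimiter is present, so rows without any of the three characters pass through unchanged.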
You can apply the same transformation to your strings.
You can simply extract any zero or more characters other than :, ( or , at the start of the string with
df['RegionName'] = df['Region'].str.extract(r'^([^:(,]*)\b', expand=False)
If you are using Python 2.x, add (?u) at the start of the pattern so that the word boundary \b also matches at the correct positions in Unicode strings.
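Details
^ - start of the string
([^:(,]*) - Group 1: zero or more (*) consecutive characters other than :, ( and , (the [^...] syntax forms a negated character class)
\b - a word boundary; the greedy group backtracks to it, which trims any trailing space before the delimiter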
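To see why the negated character class behaves differently from the pattern in the question, here is a small sketch with the standard re module (no pandas needed). The variable names are illustrative only.

```python
import re

# Pattern from the question: the greedy (.*) runs to the end of the string and
# backtracks only as far as the LAST ':', '(' or ','.
GREEDY = re.compile(r'(.*)[:(,]')
# Pattern from this answer: the group cannot cross the FIRST delimiter.
NEGATED = re.compile(r'^([^:(,]*)\b')

s = 'Mankato (Minnesota State University, Mankato),...'
greedy = GREEDY.match(s).group(1)
negated = NEGATED.match(s).group(1)
print(repr(greedy))
print(repr(negated))

# Rows that end in ':' or contain only a comma are handled the same way.
colon = NEGATED.match('The Colleges of Worcester Consortium:').group(1)
comma = NEGATED.match('North Mankato, South Central College').group(1)
print(repr(colon), repr(comma))
```

The greedy version keeps everything up to the last delimiter, which matches the partially cleaned rows shown in the question.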
See the example below and the Python 3 demo:
>>> from pandas import DataFrame
>>> import pandas as pd
>>> item_list = ['Boston (Boston University, Boston College, Bos...','Bridgewater (Bridgewater State College)[2]','Cambridge (Harvard University, Massachusetts I...','Chestnut Hill (Boston College)','The Colleges of Worcester Consortium:','Dudley (Nichols College)','Faribault, South Central College','Mankato (Minnesota State University, Mankato),...','Marshall (Southwest Minnesota State University...','Moorhead (Minnesota State University, Moorhead...','Morris (University of Minnesota Morris)[2]','Northfield (Carleton College, St. Olaf College...','North Mankato, South Central College','St. Cloud (St. Cloud State University, The Col...','St. Joseph (College of Saint Benedict)[2]','St. Peter (Gustavus Adolphus College)[2]']
>>> df = pd.DataFrame(item_list, columns=['Region'])
>>> df['RegionName'] = df['Region'].str.extract(r'^([^:(,]*)\b', expand=False)
>>> df['RegionName']
RegionName
0 Boston
1 Bridgewater
2 Cambridge
3 Chestnut Hill
4 The Colleges of Worcester Consortium
5 Dudley
6 Faribault
7 Mankato
8 Marshall
9 Moorhead
10 Morris
11 Northfield
12 North Mankato
13 St. Cloud
14 St. Joseph
15 St. Peter
>>>
I don't think the row numbers and padding are part of the data - that is just how the data is formatted when printed to the console.
Thanks for this answer. What I did not show in my dataset is that it also lists states, e.g. "Michigan[edit]", which I want to remove. Previously, str.extract turned them into NaN, so I would drop them. But your approach leaves them in the dataset (they become "Michigan[edit"). How would you adjust for this? Part 2: I tried: df['RegionName']=df['RegionName'].str.replace(r'(^.\[edit$),np.NaN) which turned everything into NaN. Why is that? Instead, I replaced the value with an empty cell and then replaced the empty cells with NaN, but that seems inefficient.
Please try and use your previous dropna method.
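One possible way to make the state rows fail to match, so that they can be dropped as before, is a negative lookahead in front of the capture group. This is an assumption for illustration and not necessarily the pattern suggested in the comments; the helper name town_or_none is hypothetical, and plain re is used instead of pandas.

```python
import re

# Assumed pattern (not from the thread): the lookahead (?!.*\[edit\]) rejects
# any string containing "[edit]" before the town name is captured.
PATTERN = re.compile(r'^(?!.*\[edit\])([^:(,]*)\b')

def town_or_none(region):
    """Hypothetical helper: town name, or None for state headings such as 'Michigan[edit]'."""
    m = PATTERN.match(region)
    return m.group(1) if m else None

rows = [
    'Michigan[edit]',
    'Boston (Boston University, Boston College, Bos...',
    'Faribault, South Central College',
]
cleaned = [town_or_none(r) for r in rows]
print(cleaned)
```

With str.extract, rows that do not match come back as NaN, so the same pattern should let dropna remove the state headings.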