Simpler, safer string manipulation in Python

python regex string python-2.7 indexing


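I do a lot of amateur data scrubbing and cleaning in Python; it is much faster than doing it in Excel. But I feel like I do everything the hard way. The biggest pain is that I don't know how to safely get an item from a list index or a string index without hitting an error, or without stacking one unreadable try/except on top of another throughout my code.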

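Here is an example I just came up with for cleaning city/state combinations out of Trulia profile URLs. Sometimes they don't give the state, but the pattern is fairly standardized.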

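import re
import string

checkstr = 'http://www.trulia.com/profile/agent-name-agent-orlando-fl-24408364/'

# Default both values so the final print works even when nothing matches
city = ''
state = ''
citystrs = re.findall(r'-agent-(.*)-\d', checkstr)[0:1]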
print citystrs
for citystr in citystrs:
    if '-' in citystr:
        if len(citystr.split('-')[-1]) == 2:
            state = citystr.split('-')[-1].upper().strip()
            city = string.replace(citystr.upper(), state, '')
            city = string.replace(city, '-', ' ').title().strip()
        else:
            city = string.replace(citystr, '-', ' ').title().strip()
    else:
        city = citystr.title().strip()

print city, state

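I don't need more than one "answer", but I use the slice [0:1] together with a for loop because I don't want an error to stop my code (I do this about 2 million times) when the pattern doesn't fit and findall(...)[0] would fail.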

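Can I get some advice on a Pythonic (and efficient) way to do this more simply?

Edit 1: I'm not looking to disqualify strings. I want this to be safe enough to run through everything and "do the best it can" (i.e., more conforming > less).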

Edit 2: I left a very obvious detail out of the example: cities with multi-word names have internal dashes ("-").

Why not just use slicing all the way?

if '-' in citystr:
    sep_index = citystr.find('-')
    city = citystr[0:sep_index].title()
    state = citystr[sep_index+1:].upper()
else:
    city = citystr.title()
    state = ''
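
For city names that themselves contain dashes (the case raised in Edit 2), splitting on the last dash rather than the first may work better. The following is only a sketch: the helper name `split_city_state` is made up for illustration, and it assumes, as the question's code does, that a trailing two-character segment is always a state abbreviation.

```python
def split_city_state(citystr):
    """Split a slug such as 'new-york-ny' into ('New York', 'NY').

    Assumption: a final two-character segment is a state code.
    """
    # rpartition splits on the LAST dash; sep is '' when there is no dash.
    head, sep, tail = citystr.rpartition('-')
    if sep and len(tail) == 2:
        city_slug, state = head, tail.upper()
    else:
        city_slug, state = citystr, ''
    city = city_slug.replace('-', ' ').title().strip()
    return city, state
```

Because `rpartition` never raises on a missing separator, no try/except is needed. The trade-off is that a city whose last word happens to be two letters long, with no state after it, would be misread as city plus state.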

Using timeit (number=10000):
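
The timing figures themselves are not reproduced here. A comparison of this kind could be set up roughly as follows; the function names `with_slicing` and `with_regex` and the comparison regex are placeholders chosen for illustration, not the code that was originally timed, and the absolute numbers will depend on the machine and the Python version.

```python
import re
import timeit

CITYSTR = 'orlando-fl'
# Illustrative regex that splits on the first dash, like the slicing code.
PATTERN = re.compile(r'(?P<city>[^-]*)(?:-(?P<state>.*))?$')

def with_slicing(citystr=CITYSTR):
    # The slicing approach: split at the first dash, if any.
    if '-' in citystr:
        sep_index = citystr.find('-')
        return citystr[0:sep_index].title(), citystr[sep_index + 1:].upper()
    return citystr.title(), ''

def with_regex(citystr=CITYSTR):
    # Same split expressed as a precompiled regex, for comparison only.
    m = PATTERN.match(citystr)
    return m.group('city').title(), (m.group('state') or '').upper()

slicing_time = timeit.timeit(with_slicing, number=10000)
regex_time = timeit.timeit(with_regex, number=10000)
print('slicing: %.4fs  regex: %.4fs' % (slicing_time, regex_time))
```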

Here is how I would do it:

import re

reg = re.compile(r'-agent-(?P<city>[^-]*)(?:-(?P<state>[^-]*))?-\d')    

checkstr = 'http://www.trulia.com/profile/agent-name-agent-orlando-fl-24408364/'

m = reg.search(checkstr)

city = m.group('city').title()
state = m.group('state').upper() if (m.group('state')) else ''

print city, state

If you need to use the pattern many times, you can precompile it with `re.compile`.

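Instead of `.*`, which is very permissive and causes backtracking, I use `[^-]*` (anything that is not a dash, zero or more times), which stops before the first dash.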
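
To illustrate the difference on the question's URL, here is a small sketch comparing the question's greedy capture with a negated character class. The variable names `greedy` and `negated` are chosen here for illustration; the snippet should run under both Python 2.7 and Python 3.

```python
import re

checkstr = 'http://www.trulia.com/profile/agent-name-agent-orlando-fl-24408364/'

# Greedy .* runs to the end of the string, then backtracks to the last "-<digit>".
greedy = re.search(r'-agent-(.*)-\d', checkstr)
# [^-]* cannot cross a dash, so it stops at the first one after "-agent-".
negated = re.search(r'-agent-([^-]*)', checkstr)

print(greedy.group(1))   # orlando-fl
print(negated.group(1))  # orlando
```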

The state and the dash before it are in an optional group: `(?:-(?P<state>[^-]*))?`. So the pattern succeeds even when the string has no state part.
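
A quick check of the optional group, as a sketch: the second URL is a hypothetical variant of the question's example with the state removed.

```python
import re

reg = re.compile(r'-agent-(?P<city>[^-]*)(?:-(?P<state>[^-]*))?-\d')

with_state = reg.search(
    'http://www.trulia.com/profile/agent-name-agent-orlando-fl-24408364/')
# Hypothetical URL: same profile, but without the state segment.
without_state = reg.search(
    'http://www.trulia.com/profile/agent-name-agent-orlando-24408364/')

# An optional group that did not take part in the match yields None.
print(with_state.group('city', 'state'))     # ('orlando', 'fl')
print(without_state.group('city', 'state'))  # ('orlando', None)
```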

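With this change, `re.findall` is no longer needed; you can use `re.search`, which returns a single match (or None when nothing matches). Note that if you are not sure of the string format, you can always add a test to check whether there is a match.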
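
One way such a test might look, sketched as a small helper. The name `parse_city_state` and the choice to return empty strings on failure are assumptions made here, picked so that a loop over millions of URLs keeps going instead of raising.

```python
import re

reg = re.compile(r'-agent-(?P<city>[^-]*)(?:-(?P<state>[^-]*))?-\d')

def parse_city_state(url):
    """Return (city, state); both are '' when the URL does not fit the pattern."""
    m = reg.search(url)
    if m is None:
        # No match: report "nothing found" rather than raising.
        return '', ''
    city = m.group('city').title()
    # The state group is None when the optional part is absent.
    state = (m.group('state') or '').upper()
    return city, state
```

Note that the pattern as written allows at most two dash-separated segments between `-agent-` and the number, so a hypothetical URL ending in `-agent-new-york-ny-24408364/` does not match and comes back as `('', '')`. Multi-word cities would need a looser pattern.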

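To make the code more readable, I use named captures, `(?P<name>...)`, so you can easily retrieve a group's content with `m.group('name')`. If you want a slight speed gain you can use numbered groups instead, but it's not very important.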
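
For comparison, a sketch of the same pattern with numbered groups. It uses `m.groups(default='')` so that a missing state comes back as an empty string; whether this is measurably faster is not shown here.

```python
import re

# Same pattern, with plain (numbered) capture groups instead of named ones.
reg = re.compile(r'-agent-([^-]*)(?:-([^-]*))?-\d')

checkstr = 'http://www.trulia.com/profile/agent-name-agent-orlando-fl-24408364/'

m = reg.search(checkstr)
if m is not None:
    # groups() returns all groups as a tuple; unmatched ones become ''.
    city, state = m.groups(default='')
    city, state = city.title(), state.upper()
else:
    city, state = '', ''

print('%s %s' % (city, state))  # Orlando FL
```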
