Python: How do I split this string (a log line) on multiple different characters/patterns?


I have a string like this:

66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

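I'm having a really hard time parsing this. I basically only need the IP address, the date, the "GET" request, the response code (404 in this line), and the rest as one longer string. The result should be a comma-separated list, like: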

['66.249.69.97', '24/Sep/2014:22:25:44 +0000', '"GET /071300/242153 HTTP/1.1"','404','"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"']

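It's fine if a solution doesn't return exactly that, but I've been trying to parse this for hours and can't figure out how.

I have already tried split() and strip() in loops, and I'm about ready to try regex... which I would have to review/relearn. Is there an easier way that I'm overlooking?

I'm using a Python 2 notebook, so Python 3 is not an option.

Thanks in advance.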

Edit: @阿伦, I now have this:

    p = re.compile(r'(?P<ip_addr>\d+(\.\d+){3}) - - \[(?P<date_time>.+?)\] (?P<http_method>\".+?\") (?P<return_code>\d+) \d+ "-" (?P<client>\".+?\")')

    def pattern_match(line):
        m = p.search(line)
        return [m.group('ip_addr'), m.group('date_time'), m.group('http_method'), m.group('return_code'), m.group('client')]

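rdd.collect() is 5 lines of text, and when I iterate over it and print, all 5 lines are printed. With this function, however, only 4 get printed... and then it fails with:

AttributeError: 'NoneType' object has no attribute 'group'

Any ideas?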

Here is one way:

>>> import re
>>> s = '''66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'''
>>> p = re.compile(r'(?P<ip_addr>\d+(\.\d+){3}) - - \[(?P<date_time>.+?)\] (?P<http_method>\".+?\") (?P<return_code>\d+) \d+ "-" (?P<client>\".+?\")')
>>> m = p.search(s)
>>> [m.group('ip_addr'), m.group('date_time'), m.group('http_method'), m.group('return_code'), m.group('client')]
['66.249.69.97', '24/Sep/2014:22:25:44 +0000', '"GET /071300/242153 HTTP/1.1"', '404', '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"']
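For readers who are relearning regex, the same pattern can be spelled out piece by piece with re.VERBOSE. This is only a sketch of the expression above, not a different parser: the name LOG_PATTERN is made up for illustration, the inner group is made non-capturing, and literal spaces are written as [ ] because verbose mode ignores unescaped whitespace.

```python
import re

# Hypothetical name; the groups mirror the pattern in the answer above.
LOG_PATTERN = re.compile(r'''
    (?P<ip_addr>\d+(?:\.\d+){3})   # four dot-separated numbers (IPv4 address)
    [ ]-[ ]-[ ]                    # the two "-" placeholder fields
    \[(?P<date_time>.+?)\]         # timestamp between square brackets
    [ ](?P<http_method>".+?")      # quoted request line, quotes included
    [ ](?P<return_code>\d+)        # status code
    [ ]\d+                         # response size, matched but not captured
    [ ]"-"                         # referer, assumed here to be exactly "-"
    [ ](?P<client>".+?")           # quoted user agent, quotes included
''', re.VERBOSE)

line = ('66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] '
        '"GET /071300/242153 HTTP/1.1" 404 514 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

m = LOG_PATTERN.search(line)
fields = [m.group(name) for name in
          ('ip_addr', 'date_time', 'http_method', 'return_code', 'client')]
```

Written this way, each hard-coded assumption (the two "-" fields, the "-" referer, a numeric size) sits on its own line and is easy to loosen later.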


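Note that the regex makes some assumptions here and may not work for every case, but it should help you get started.

(Regex will be well worth your time to learn. Hope this gets you started :-)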
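Those assumptions are where an AttributeError on a NoneType can come from: search() returns None for any line the pattern does not match, and calling .group() on None fails. One possible cause is a line whose referer or user fields are not a literal "-". The sketch below is an illustration, not a drop-in fix: RELAXED, FIELDS, parse_line and parse_lines are hypothetical names, and the loosened pattern is a guess at what other lines might look like. It runs on Python 2.7 as well as Python 3.

```python
import re

# Hypothetical relaxed pattern: the identity/user fields, the size and the
# referer may hold any value instead of a hard-coded "-" or number.
RELAXED = re.compile(
    r'(?P<ip_addr>\d+(?:\.\d+){3}) \S+ \S+ \[(?P<date_time>[^\]]+)\] '
    r'(?P<http_method>"[^"]*") (?P<return_code>\d{3}) \S+ '
    r'"[^"]*" (?P<client>"[^"]*")'
)
FIELDS = ('ip_addr', 'date_time', 'http_method', 'return_code', 'client')


def parse_line(line):
    """Return the five fields as a list, or None if the line does not match."""
    m = RELAXED.search(line)
    if m is None:  # search() found nothing, so there is no match object to query
        return None
    return [m.group(name) for name in FIELDS]


def parse_lines(lines):
    """Split lines into (parsed rows, lines that did not match)."""
    parsed, rejected = [], []
    for line in lines:
        row = parse_line(line)
        if row is None:
            rejected.append(line)
        else:
            parsed.append(row)
    return parsed, rejected
```

Collecting the rejected lines instead of crashing makes it possible to look at the one line that fails and see which assumption it breaks.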

Regex should be a good choice, but as an alternative you could try something else, such as the following:

temp1 = '66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'.split("\"")
temp2 = temp1[0].split("-")
print [temp2[0].strip(), temp2[-1].strip(" []")] + [i.strip() for i in temp1[1:] if i not in "- "]

It isn't perfect and it may be a bit ugly, but honestly I like it better than the regex because I find it easier to read.
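One of the imperfections is that the status code and the size come back together as '404 514'. A possible variation on the same split-on-quotes idea, wrapped in a function, is sketched below. The function name split_log_line and the choice to return None for lines with an unexpected shape are assumptions made for illustration; the code avoids print so it runs on Python 2.7 and Python 3 alike.

```python
def split_log_line(line):
    """Split an access-log line on double quotes, then pick the fields apart.

    Returns [ip, date, request, status, user_agent], or None when the line
    does not have the expected shape.
    """
    parts = line.split('"')
    # Expected layout after splitting on '"':
    #   parts[0] = 'IP - - [date] '   parts[1] = request line
    #   parts[2] = ' status size '    parts[3] = referer
    #   parts[4] = ' '                parts[5] = user agent
    if len(parts) < 6:
        return None
    head = parts[0]
    start, end = head.find('['), head.find(']')
    head_fields = head.split()
    status_fields = parts[2].split()
    if start == -1 or end < start or not head_fields or not status_fields:
        return None
    ip = head_fields[0]
    date = head[start + 1:end]
    status = status_fields[0]  # drop the size that follows the status code
    # Put the quotes back so the output matches the list asked for in the question.
    return [ip, date, '"%s"' % parts[1], status, '"%s"' % parts[5]]
```

Like the one-liner above, this relies on the request and user agent containing no embedded double quotes.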

This works fine for me. What kind of error are you getting?
