Python: parsing Whois data with regex while ignoring repeated fields
I am trying to parse the results of a whois query. I am interested in retrieving the route, descr and origin fields, as shown below:
route: 129.45.67.8/91
descr: FOO-BAR
descr: Information 2
origin: AS5462
notify: foo@bar.net
mnt-by: AS5462-MNT
remarks: For abuse notifications please file an online case @ http://www.foo.com/bar
changed: foo@bar.net 20000101
source: RIPE
remarks: ****************************
remarks: * THIS OBJECT IS MODIFIED
remarks: * Please note that all data that is generally regarded as personal
remarks: * data has been removed from this object.
remarks: * To view the original object, please query the RIPE Database at:
remarks: * http://www.foo.net/bar
remarks: ****************************
route: 123.45.67.8/91
descr: FOO-BAR
origin: AS3269
mnt-by: BAR-BAZ
changed: foo@bar.net 20000101
source: RIPE
remarks: ****************************
remarks: * THIS OBJECT IS MODIFIED
remarks: * Please note that all data that is generally regarded as personal
remarks: * data has been removed from this object.
remarks: * To view the original object, please query the RIPE Database at:
remarks: * http://www.ripe.net/whois
remarks: ****************************
To do this, I use the following code and regular expression:
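import re

search = "FOO-BAR"
FILE = "whois"  # assumed path to the saved whois output
with open(FILE, "r") as f:
    content = f.read()
# route line, then a descr line containing the search term, then the origin line
r = re.compile(r'route:\s+(.*)\ndescr:\s+(.*' + search + r'.*).*\norigin:\s+(.*)', re.IGNORECASE)
res = r.findall(content)
print(res)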
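This works as expected for records that contain a single descr field, but it skips records that contain several descr fields.
In that case I get the following result: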
[('123.45.67.8/91', 'FOO-BAR', 'AS3269')]
The expected result contains the route field, the first descr field when there are several descr lines, and the origin field:
[('129.45.67.8/91', 'FOO-BAR', 'AS5462'), ('123.45.67.8/91', 'FOO-BAR', 'AS3269')]
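What is the correct regular expression to parse records that contain either one or several descr lines?

I got very close to what you are asking for: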
import re

search = "FOO-BAR"
with open('whois', "r") as f:
    content = f.read()

r = re.compile(
    r'route:\s+(.*)\n'
    r'(descr:\s+(?!FOO-BAR).*\n)*'  # 0-n descr: lines that do not start with FOO-BAR
    r'descr:\s+(FOO-BAR)\n'         # the descr: line that holds FOO-BAR
    r'(descr:\s+(?!FOO-BAR).*\n)*'  # 0-n descr: lines that do not start with FOO-BAR
    r'origin:\s+(.*)',
    re.IGNORECASE)
#r = re.compile('(route:\n)((descr:)(?!FOO-BAR)(.*)\n)*((descr:)(FOO-BAR)\n)?((descr:)(?!FOO-BAR)(.*)\n)*')
res = r.findall(content)
print(res)
The result is:
>>> [('129.45.67.8/91', '', 'FOO-BAR', 'descr: Information 2\n', 'AS5462'),
('123.45.67.8/91', '', 'FOO-BAR', '', 'AS3269')]
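The empty strings and the stray 'descr: Information 2\n' entry in that output come from the capturing parentheses around the optional descr lines. Below is a minimal sketch of one possible cleanup, assuming Python 3. Non-capturing groups (?:...) skip the extra descr lines, and re.escape() keeps the search term from being read as regex syntax. The helper name find_routes and the inline SAMPLE text are hypothetical and exist only for illustration.

```python
import re

# Shortened copy of the whois output from the question (illustration only).
SAMPLE = """\
route: 129.45.67.8/91
descr: FOO-BAR
descr: Information 2
origin: AS5462
notify: foo@bar.net
route: 123.45.67.8/91
descr: FOO-BAR
origin: AS3269
mnt-by: BAR-BAZ
"""


def find_routes(text, search):
    """Return (route, descr, origin) tuples for records whose descr contains search."""
    pattern = re.compile(
        r'^route:[ \t]+(.*)\n'
        r'(?:descr:.*\n)*?'  # lazily skip descr lines before the matching one
        r'descr:[ \t]+(.*' + re.escape(search) + r'.*)\n'
        r'(?:descr:.*\n)*'   # skip any descr lines after the matching one
        r'origin:[ \t]+(.*)',
        re.IGNORECASE | re.MULTILINE)
    return pattern.findall(text)


print(find_routes(SAMPLE, "FOO-BAR"))
# [('129.45.67.8/91', 'FOO-BAR', 'AS5462'), ('123.45.67.8/91', 'FOO-BAR', 'AS3269')]
```

Because only three groups capture, findall() returns 3-tuples directly. The descr value returned is the first descr line of the record that contains the search term, and re.IGNORECASE makes that match case-insensitive.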
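With a little cleanup, you can get the desired result.

Is there anything wrong with that?
Regex seems like overkill for this task. What about line.startswith() and a few counters?
@IntrepidBrit My understanding is that pywhois produces parsed WHOIS data from a given domain name, whereas in this case the WHOIS data comes from a free-text search.
@georgesl I can look into line.startswith(), which I did not know about, but I would also like a solution that uses regex.
I completely rewrote the question and hope it is on topic now, thanks.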