Python web parser: extracting URLs with re.findall()

I am working on a web parser.

I am trying to use a regular expression to pull statements like the following out of a page:

ex) location.href = "login/html";
ex) location.href = "featureId/html";

I want to get the complete statement for every match, but I cannot get them.

The code is as follows:

# -*- coding: utf-8 -*-
import urllib2
import re

url_reg= re.compile('(location\.(href|assign|replace)|window\.location)\s*(=|\()+.*(;|$)')
url ='http://zero.webappsecurity.com/'
request = urllib2.Request(url)
res = urllib2.urlopen(request)
html = res.read().decode('utf-8')

print html
print re.findall(url_reg, html)

Running it prints the following:

[(u'location.href', u'href', u'=', u';'), (u'location.href', u'href', u'=', u';'), (u'location.href', u'href', u'=', u';'), (u'location.href', u'href', u'=', u';'), (u'location.href', u'href', u'=', u';'), (u'location.href', u'href', u'=', u';')]

What I originally expected to get was the following:

location.href = path + "login" + ".html";
location.href = path + featureId + ".html";
location.href = "/" + "online-banking" + ".html";
location.href = path + featureName +".html";

Please give me some advice.

You have not described your problem clearly. These are the strings you want, right?

window.location.href = path + "login" + ".html";
window.location.href = path + featureId + ".html";
window.location.href = "/" + "online-banking" + ".html";
window.location.href = path + featureName +".html";
window.location.href = link.page;
window.location.href = path + link.page + ".html";

Then you should use a pattern like this:

...
url_reg = re.compile(r'window\.location\.href = ["/ +\-\.; a-zA-Z]*')
print url_reg.findall(html)
...
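
To check the suggested pattern without fetching the page, here is a minimal sketch that runs it against the six strings listed above. The inline `html` text is an assumption standing in for the downloaded page, and the snippet is written so that it also runs on Python 3.

```python
import re

# Sample lines copied from the strings listed above; in the real script
# this text would come from the downloaded page instead.
html = '''
window.location.href = path + "login" + ".html";
window.location.href = path + featureId + ".html";
window.location.href = "/" + "online-banking" + ".html";
window.location.href = path + featureName +".html";
window.location.href = link.page;
window.location.href = path + link.page + ".html";
'''

# The pattern has no capture groups, so findall() returns the whole
# matched text for each hit rather than a tuple of groups.
url_reg = re.compile(r'window\.location\.href = ["/ +\-\.; a-zA-Z]*')

matches = url_reg.findall(html)
for m in matches:
    print(m)
```

Because the character class does not contain a newline, each match stops at the end of its line. Right-hand sides that use characters outside the class (digits, underscores, parentheses) would be cut short, so the class may need extending for other pages.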

Some of your strings seem to be missing. Please tell me which ones are missing.
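
As a side note on the tuples printed in the question: when a pattern contains capture groups, `re.findall()` returns the contents of the groups, not the full match. The sketch below illustrates this with the question's own pattern, and shows two possible ways to get the whole statement back: non-capturing groups, or `finditer()` with `group(0)`. The inline `html` sample is hypothetical and only stands in for the real page.

```python
import re

# Hypothetical sample standing in for the downloaded page.
html = '''<script>
location.href = path + "login" + ".html";
location.href = path + featureId + ".html";
</script>
'''

# The question's pattern: four capture groups, so findall() returns a
# 4-tuple of group contents per match.
grouped = re.compile(
    r'(location\.(href|assign|replace)|window\.location)\s*(=|\()+.*(;|$)')

# Same pattern with non-capturing groups (?:...): findall() now returns
# the full matched text.
ungrouped = re.compile(
    r'(?:location\.(?:href|assign|replace)|window\.location)\s*(?:=|\()+.*(?:;|$)')

tuples = grouped.findall(html)
full = ungrouped.findall(html)

# Alternative: keep the groups and read group(0) from each match object.
full_via_iter = [m.group(0) for m in grouped.finditer(html)]

print(tuples)
print(full)
```

Note that `.*` is greedy, so if several statements share one line the match would run to the last `;` on that line.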