Python BeautifulSoup: getting variables out of a junk page
Hi all, I'm trying to pull some variables out of a badly formatted page:
html = response.read()
soup = BeautifulSoup(html)
links = soup.findAll('a')
for link in links:
    for x in link.attrs:
        print x
Output:
(u'href', u"javascript:Set_Variables('FIRSTNAME,LASTNAME', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'123456789123', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'FOOOOOOO',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'54',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'2014',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BAZZZZ',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BARRRRRRRRRR',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'07/31/2015',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'')")
(u'onmouseover', u"javascript: return window.status=''")
(u'href', u"javascript:Set_Variables('FIRSTNAME,LASTNAME', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'123456789123', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'FOOOOOOO',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'54',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'2014',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BAZZZZ',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BARRRRRRRRRR',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'07/31/2015',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'')")
(u'onmouseover', u"javascript: return window.status=''")
Question:
How can I get FIRSTNAME,LASTNAME, FOOOOOOO, BARRRRRRRRRR, BAZZZZ, and 123456789123 out of this mess?
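Thanks.

First, you only need to look at the href attribute here.

Take everything inside the parentheses, split it on whitespace, and strip the trailing commas and the quotes (this works because none of the values contain whitespace):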
args = link['href'].partition('(')[-1].rpartition(')')[0]
args = [v.rstrip(',').strip("'") for v in args.split()]
Demo:
>>> href = u"javascript:Set_Variables('FIRSTNAME,LASTNAME', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'123456789123', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'FOOOOOOO',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'54',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'2014',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BAZZZZ',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BARRRRRRRRRR',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'07/31/2015',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'')"
>>> href.partition('(')[-1].rpartition(')')[0]
u"'FIRSTNAME,LASTNAME', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'123456789123', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'FOOOOOOO',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'54',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'2014',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BAZZZZ',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BARRRRRRRRRR',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'07/31/2015',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t''"
>>> [v.rstrip(',').strip("'") for v in href.partition('(')[-1].rpartition(')')[0].split()]
[u'FIRSTNAME,LASTNAME', u'123456789123', u'FOOOOOOO', u'54', u'2014', u'BAZZZZ', u'BARRRRRRRRRR', u'07/31/2015', u'']
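
The whitespace split above would break a value such as 'FOO BAR' into two pieces. If that could happen on your page, one possible variation is to pull out each quoted string with a regular expression instead. This is only a sketch: the helper name extract_args and the shortened sample href are made up for illustration, and it assumes the values never contain an escaped quote (\').

```python
import re

# Matches one single-quoted string and captures what is between the quotes.
# Assumption: values never contain an escaped quote (\'), as in the sample output.
QUOTED = re.compile(r"'([^']*)'")


def extract_args(href):
    """Return the quoted arguments found between the outermost parentheses.

    Hypothetical helper, not part of BeautifulSoup. An href without
    parentheses yields an empty list.
    """
    inner = href.partition('(')[-1].rpartition(')')[0]
    return QUOTED.findall(inner)


# Shortened, made-up sample in the same shape as the real href,
# with one value that contains a space.
href = ("javascript:Set_Variables('FIRSTNAME,LASTNAME', \r\n\t\t'123456789123', "
        "\r\n\t\t'FOO BAR',\r\n\t\t'07/31/2015',\r\n\t\t'')")

args = extract_args(href)
print(args)
# ['FIRSTNAME,LASTNAME', '123456789123', 'FOO BAR', '07/31/2015', '']
```

If the two names are needed separately, args[0].split(',') should give them, assuming the first argument always has that comma-separated form.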