How to get the content of a <table> from a website using Python?
I have some <tr>s, as follows:
<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>
Now I use the following code to process them:
response = urllib2.urlopen('http://poj.org/status', timeout=10)
html = response.read()
response.close()
pattern = re.compile(r'<tr align.*</tr>')
match = pattern.findall(html)
pat = re.compile(r'<td>.*?</td>')
p = re.compile(r'<[/]?.*?>')
for item in match:
    for i in pat.findall(item):
        print p.sub(r'', i)
    print '================================================='
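(One caveat with the pattern above: the `.*` in r'<tr align.*</tr>' is greedy, so if several rows share one physical line, findall returns them all as a single match. A minimal Python 3 illustration, with a made-up sample line:)

```python
import re

# two table rows on one physical line, as many sites serve them
line = '<tr align=center><td>1</td></tr><tr align=center><td>2</td></tr>'

greedy = re.findall(r'<tr align.*</tr>', line)   # swallows both rows at once
lazy = re.findall(r'<tr align.*?</tr>', line)    # stops at the first </tr>

print(len(greedy), len(lazy))  # -> 1 2
```

The non-greedy `.*?` variant splits the rows correctly, which is what the loop above seems to expect.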
You really don't need to use regular expressions directly to parse HTML.
Have a look at the material on HTML processing. Why do these things by hand when you already have HTML/XML parsers that can do the job easily?
Given what you want, as described in the question above, it can be done in 2-3 lines of code.
For example:
>>> from bs4 import BeautifulSoup as bs
>>> html = """
<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>
"""
>>> soup = bs(html)
>>> soup.td
<td>10876151</td>
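If pulling in bs4 is not an option, the same extraction can be done with the parser bundled with Python; a minimal Python 3 sketch using html.parser (the CellExtractor class and the sample row are illustrative, not part of the answer above):

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    # collect the text of every <td>, one list entry per cell
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            # text inside nested tags (e.g. <a>) still belongs to the cell
            self.cells[-1] += data

parser = CellExtractor()
parser.feed('<tr align=center><td>10876151</td>'
            '<td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td>'
            '<td><a href=problem?id=3155>3155</a></td></tr>')
print(parser.cells)  # -> ['10876151', 'yangfanhit', '3155']
```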
You can use BeautifulSoup to parse the html. To write the table content out in csv format:
#!/usr/bin/env python
import csv
import sys
import urllib2
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
soup = BeautifulSoup(urllib2.urlopen('http://poj.org/status'))
writer = csv.writer(sys.stdout)
for tr in soup.find('table', 'a')('tr'):
    writer.writerow([td.get_text() for td in tr('td')])
Output:
Also take a look at pyquery. If you are familiar with jQuery, it is easy to understand. Here is an example that returns the table header and data as a list of dictionaries:
import itertools
from pyquery import PyQuery as pq
# parse html
html = pq(url="http://poj.org/status")
# extract header values from table
header = [header.text for header in html(".a").find(".in").find("td")]
# extract data values from table rows in nested list
detail = [[td.text for td in tr] for tr in html(".a").children().not_(".in")]
# merge header and detail to create list of dictionaries
result = [dict(itertools.izip(header, values)) for values in detail]
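The last line is plain zip-into-dict merging; with literal lists (sample values taken from the rows above) and Python 3, where itertools.izip has become the built-in zip, it behaves like this:

```python
header = ["Run ID", "User", "Problem", "Result"]
detail = [["10876151", "yangfanhit", "3155", "Accepted"],
          ["10876150", "BandBandRock", "2503", "Accepted"]]

# pair each header with the matching cell of every row
result = [dict(zip(header, values)) for values in detail]

print(result[0]["User"])     # -> yangfanhit
print(result[1]["Problem"])  # -> 2503
```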
Possible duplicate. Don't parse HTML with regular expressions; Tony the Pony will eat you alive. Use a proper parser instead, such as lxml for Python. Possible duplicate.
Run ID,User,Problem,Result,Memory,Time,Language,Code Length,Submit Time
10876151,yangfanhit,3155,Accepted,344K,219MS,C++,3940B,2012-10-02 16:42:45
10876150,BandBandRock,2503,Accepted,16348K,2750MS,G++,840B,2012-10-02 16:42:25
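If that csv output then needs to be consumed again, the header row pairs naturally with csv.DictReader; a small Python 3 sketch, feeding the output above in as a literal string:

```python
import csv
import io

data = """Run ID,User,Problem,Result,Memory,Time,Language,Code Length,Submit Time
10876151,yangfanhit,3155,Accepted,344K,219MS,C++,3940B,2012-10-02 16:42:45
10876150,BandBandRock,2503,Accepted,16348K,2750MS,G++,840B,2012-10-02 16:42:25
"""

# DictReader takes the first row as the field names
rows = list(csv.DictReader(io.StringIO(data)))

print(rows[1]["User"])  # -> BandBandRock
```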