Combining several regex queries into one in Python
I have a page.htm file:
</td></tr>
<tr>
<td height="120" class="box_pic">
<a href="view.php?item=1322679" target="_blank"><img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105"></a>
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=79159" target="_blank">ABird</a></span></td>
</td></tr>
<tr>
<td height="120" class="box_pic">
<a href="view.php?item=1546679" target="_blank"><img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105"></a>
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=78759" target="_blank">ADog</a></span></td>
</td></tr>
<tr>
<td height="120" class="box_pic">
<a href="view.php?item=5622679" target="_blank"><img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXfdgfdgZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105"></a>
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=87159" target="_blank">ACat56</a></span></td>
I have 3 regex queries that extract elements from this page:
import re
with open('page.htm', 'r') as our_file:
    page = our_file.read()
result = re.findall(r'view\.php\?item=(\d+)', page)
result2 = re.findall(r'user&id=(\d+)', page)
result3 = re.findall(r'user&id=.*>(\w+)', page)
print(result, len(result))
print(result2, len(result2))
print(result3, len(result3))
The result I get is:
['1322679', '1546679', '5622679'] 3
['79159', '78759', '87159'] 3
['ABird', 'ADog', 'ACat56'] 3
Do you know how to combine these three queries into one, so that:
1) the file is analyzed once instead of 3 times
2) only ONE re.findall() is used
3) the data is joined the way I need:
a) 1322679 79159 ABird
b) 1546679 78759 ADog
c) 5622679 87159 ACat56
The resulting query should look something like this:
result = re.findall(r'view\.php\?item=(\d+) SOMETHING_HERE user&id=(\d+) SOMETHING_HERE .*>(\w+)', page)
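In the end, I found a solution. This is the answer, and it meets all the requirements:
import re
with open('page.htm', 'r') as our_file:
    page = our_file.read()
# Strip all whitespace (newlines included) so that '.' never has to match
# a line break; this is why no re.DOTALL flag is needed below.
page = re.sub(r'\s', '', page)
result = re.findall(r'view\.php\?item=(\d+).*?user&id=(\d+).*?>(\w+)', page)
print(result, len(result))
Result: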
[('1322679', '79159', 'ABird'), ('1546679', '78759', 'ADog'), ('5622679', '87159', 'ACat56')] 3
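An alternative that may be worth trying is to leave the whitespace in place and pass re.DOTALL as the flags argument, so that '.' is allowed to cross line breaks. The sketch below assumes the same pattern as above; the names ROW_PATTERN and extract_rows and the shortened sample string are made up for illustration and are not part of the original post.

```python
import re

# Same pattern as in the solution above. re.DOTALL lets '.' match newlines,
# so a single match can span the several lines that make up one entry.
ROW_PATTERN = re.compile(
    r'view\.php\?item=(\d+).*?user&id=(\d+).*?>(\w+)',
    re.DOTALL,
)

def extract_rows(page):
    """Return a list of (item_id, user_id, nick) tuples found in one pass."""
    return ROW_PATTERN.findall(page)

# Shortened, hypothetical stand-in for the contents of page.htm.
sample = '''<a href="view.php?item=1322679" target="_blank"><img src="x"></a>
</td>
<td class="box_prc"><a class="usernick" href="/index.php?action=user&id=79159" target="_blank">ABird</a></td>
<a href="view.php?item=1546679" target="_blank"><img src="x"></a>
<td class="box_prc"><a class="usernick" href="/index.php?action=user&id=78759" target="_blank">ADog</a></td>'''

rows = extract_rows(sample)
print(rows, len(rows))
```

For the real file, the string read from page.htm would be passed to extract_rows() instead of sample. The lazy '.*?' still matters here: with a greedy '.*' and re.DOTALL, the first match would run on to the last entry in the file.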
Here is how to do this properly with an HTML parser, in Python 2:
from urlparse import parse_qs, urlparse
from bs4 import BeautifulSoup

def only(x):
    # Return the single element of x; fail loudly if there is not exactly one
    x = list(x)
    assert len(x) == 1
    return x[0]

def url_params(a):
    # Parse the query string of a link's href into a dict of lists
    return parse_qs(urlparse(a['href']).query)

def main():
    with open('page.html') as f:
        soup = BeautifulSoup(f, 'html.parser')
    rows = soup.find_all('tr', recursive=False)
    # Data is in alternating rows, so take pairs of rows at a time
    for row1, row2 in zip(rows[::2], rows[1::2]):
        a = only(row1.select('td.box_pic a'))
        item_id = only(url_params(a)['item'])
        a = only(row2.select('a.usernick'))
        user_id = only(url_params(a)['id'])
        nick = a.text
        print item_id, user_id, nick

main()
Output:
1322679 79159 ABird
1546679 78759 ADog
5622679 87159 ACat56
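Now, this may not be as concise as the re approach, but this code knows the structure of the input, which makes it more robust. If the structure of the input changes, e.g. the format of the URLs or the shape of the HTML, this code will either keep working correctly or raise an error telling you that things are not as expected. The re approach could easily keep running but produce incorrect results, which is not a situation you want to be in. If you want to extract more information in the future, it is easy to add the necessary lines without disturbing the existing code.

Comments:
You can use a pipe (|) as "or".
@WillemVanOnsem No, it doesn't work. Following your hint I got: [('1322679','','',('79159',''),('1546679','','',,('78759',''),('5622679','','','',('87159','')6
Google "regex html", then use BeautifulSoup.
@Alexall OK. I googled it. They say it is hard because of characters like " and \. But I am not trying to get URLs here. I am trying to get something simple. My separate queries work too.
What alex is trying to say is that regex is not the way to parse HTML; even if you can get something to work, it will be hard to modify. There are better tools available, e.g. Beautiful Soup, and you should use them instead.
With from urllib.parse import parse_qs, urlparse and print() it works for Python 3. But anyway, I will try to understand how it works. Thanks for your attention and time.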
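As the last comment notes, on Python 3 the URL helpers live in urllib.parse. The following is a minimal sketch of just the query-string part of the answer, using only the standard library. The helper url_param is a hypothetical name introduced here (the answer's url_params takes a BeautifulSoup tag instead of a string), and the hrefs are copied from the sample page.

```python
from urllib.parse import parse_qs, urlparse

def only(x):
    """Return the single element of x; fail loudly if there is not exactly one."""
    x = list(x)
    assert len(x) == 1, 'expected exactly one value, got %d' % len(x)
    return x[0]

def url_param(href, name):
    """Hypothetical helper: read one query-string parameter from an href string."""
    return only(parse_qs(urlparse(href).query)[name])

print(url_param('view.php?item=1322679', 'item'))
print(url_param('/index.php?action=user&id=79159', 'id'))
```

This also shows the "fail loudly" behaviour the answer relies on: a missing parameter raises KeyError and a repeated one trips the assertion in only(), rather than quietly returning a wrong value.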