
Joining regex queries in Python


I have a page.htm file:

</td></tr>

  <tr>
    <td height="120" class="box_pic">
    <a href="view.php?item=1322679" target="_blank"><img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105"></a>
    </td>
  </tr>

  <tr align="center" valign="middle"> 
    <td valign="top"> 
    <table width="100%" border="0" cellspacing="0" cellpadding="0">
        <tr> 
          <td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0>&nbsp;<a class="usernick" href="/index.php?action=user&id=79159" target="_blank">ABird</a></span></td>

</td></tr>

  <tr>
    <td height="120" class="box_pic">
    <a href="view.php?item=1546679" target="_blank"><img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105"></a>
    </td>
  </tr>

  <tr align="center" valign="middle"> 
    <td valign="top"> 
    <table width="100%" border="0" cellspacing="0" cellpadding="0">
        <tr> 
          <td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0>&nbsp;<a class="usernick" href="/index.php?action=user&id=78759" target="_blank">ADog</a></span></td>

</td></tr>

  <tr>
    <td height="120" class="box_pic">
    <a href="view.php?item=5622679" target="_blank"><img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXfdgfdgZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105"></a>
    </td>
  </tr>

  <tr align="center" valign="middle"> 
    <td valign="top"> 
    <table width="100%" border="0" cellspacing="0" cellpadding="0">
        <tr> 
          <td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0>&nbsp;<a class="usernick" href="/index.php?action=user&id=87159" target="_blank">ACat56</a></span></td>
I have 3 regex queries that pull elements out of this page:

import re

with open('page.htm', 'r') as our_file:
    page = our_file.read()

result = re.findall(r'view\.php\?item=(\d+)', page)
result2 = re.findall(r'user&id=(\d+)', page)
result3 = re.findall(r'user&id=.*>(\w+)', page)
print(result, len(result))
print(result2, len(result2))
print(result3, len(result3))
The results I get are:

['1322679', '1546679', '5622679'] 3
['79159', '78759', '87159'] 3
['ABird', 'ADog', 'ACat56'] 3
Do you know how to combine these three queries into one, so that:

1) the file would be analyzed once instead of three times
2) only ONE re.findall() would be used
3) the data would be joined the way I need:

   a) 1322679 79159 ABird
   b) 1546679 78759 ADog
   c) 5622679 87159 ACat56
The combined query should look something like this:

result = re.findall(r'view\.php\?item=(\d+) SOMETHING_HERE user&id=(\d+) SOMETHING_HERE .*>(\w+)', page)

In the end, I found a solution.

Here is the answer; it satisfies all the requirements:

import re

with open('page.htm', 'r') as our_file:
    page = our_file.read()

# \s already covers tabs, carriage returns and newlines
page = re.sub(r'\s', '', page)

# Note: a bare `re.DOTALL` expression on its own line has no effect;
# flags only apply when passed to the matching function.
result = re.findall(r'view\.php\?item=(\d+).*?user&id=(\d+).*?>(\w+)', page)

print(result, len(result))
The result:

[('1322679', '79159', 'ABird'), ('1546679', '78759', 'ADog'), ('5622679', '87159', 'ACat56')] 3
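A bare `re.DOTALL` expression has no effect on its own; the flag must be passed to the matching function. When it is, `.*?` matches across newlines, so the whitespace-stripping `re.sub` step is not needed. A minimal sketch, with a hypothetical two-line sample standing in for page.htm:

```python
import re

# Hypothetical inline sample standing in for page.htm.
page = '''<a href="view.php?item=1322679" target="_blank"><img ...></a>
<a class="usernick" href="/index.php?action=user&id=79159" target="_blank">ABird</a>'''

# Passing re.DOTALL as the flags argument lets .*? match across newlines,
# so no pre-processing of the page text is required.
result = re.findall(r'view\.php\?item=(\d+).*?user&id=(\d+).*?>(\w+)', page, re.DOTALL)
print(result)  # [('1322679', '79159', 'ABird')]
```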

Here is how to do this properly with an HTML parser, in Python 2:

from urlparse import parse_qs, urlparse
from bs4 import BeautifulSoup

def only(x):
    x = list(x)
    assert len(x) == 1
    return x[0]

def url_params(a):
    return parse_qs(urlparse(a['href']).query)

def main():
    with open('page.html') as f:
        soup = BeautifulSoup(f, 'html.parser')
    rows = soup.find_all('tr', recursive=False)

    # Data is in alternating rows, so take pairs of rows at a time
    for row1, row2 in zip(rows[::2], rows[1::2]):
        a = only(row1.select('td.box_pic a'))
        item_id = only(url_params(a)['item'])
        a = only(row2.select('a.usernick'))
        user_id = only(url_params(a)['id'])
        nick = a.text
        print item_id, user_id, nick

main()
Output:

1322679 79159 ABird
1546679 78759 ADog
5622679 87159 ACat56

Now, this may not be as concise as the regex approach, but this code understands the structure of the input, which makes it far more robust. If the structure of the input changes (say, the format of the URLs or the shape of the HTML), this code will either keep working correctly or raise an error telling you that things are not as expected. The regex approach could easily keep running and quietly return wrong results, which is not what you want. And if you want to extract more information later, it is easy to add the necessary lines without disturbing the existing code.
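In Python 3 the same two functions live in `urllib.parse` instead of `urlparse`. A minimal sketch of what the `url_params` helper extracts from one of the hrefs above:

```python
from urllib.parse import parse_qs, urlparse

# Python 3 home of the functions the answer imports from urlparse.
# parse_qs maps each query parameter to a LIST of values, which is
# why the answer wraps every lookup in only().
href = 'view.php?item=1322679'
params = parse_qs(urlparse(href).query)
print(params)  # {'item': ['1322679']}
```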

Comments:

- You can use a pipe (|) as an "or".
- @WillemVanOnsem No, it doesn't work. From your hint I got tuples with empty slots, something like [('1322679', '', ''), ('', '79159', ''), ...]
- Google "regex html", then use BeautifulSoup.
- @Alexall OK, I googled it. They say it is hard because of ", , and \. But I am not trying to grab the URLs here; I am after something simple, and my separate queries do work.
- What alex is trying to say is that regular expressions are not the way to parse HTML: even if you get something to work, it is hard to modify. There are better tools, such as BeautifulSoup, and you should use them instead.
- With `from urllib.parse import parse_qs, urlparse` and `print()` it works in Python 3 too. Either way, I will try to understand how it works. Thanks for your attention and time.