
Searching HTML with BeautifulSoup in Python


I wrote some code to search the HTML, but the results are not what I want. Below is some of the HTML from the page whose links I want to pull. I want to get the word "sayfa", but I don't know how to do it. Example:

<table cellpadding="0" cellspacing="0" border="0" width="100%" style="margin-bottom:3px">
<tr valign="bottom">
    <td class="smallfont"><a href="http://www.vbulletin.com.tr/newthread.php?do=newthread&amp;f=16" rel="nofollow"><img src="http://www.vbulletin.com.tr/images/fsimg/butonlar/newthread.gif" alt="Yeni Konu Oluştur" border="0" /></a></td>
    <td align="right"><div class="pagenav" align="right">
<table class="tborder" cellpadding="3" cellspacing="1" border="0">
<tr>
    <td class="vbmenu_control" style="font-weight:normal">Sayfa 1 Toplam 5 Sayfadan</td>


        <td class="alt2"><span class="smallfont" title="Toplam 100 sonuçtan 1 ile 20 arası sonuç gösteriliyor."><strong>1</strong></span></td>
 <td class="alt1"><a class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa2/" title="Toplam 100 sonuçtan 21 ile 40 arası sonuç gösteriliyor.">2</a></td><td class="alt1"><a class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa3/" title="Toplam 100 sonuçtan 41 ile 60 arası sonuç gösteriliyor.">3</a></td>
    <td class="alt1"><a rel="next" class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa2/" title="Sonraki Sayfa - Toplam 100 sonuçtan 21 ile 40 arası sonuç gösteriliyor.">&gt;</a></td>
    <td class="alt1" nowrap="nowrap"><a class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa5/" title="Sonuncu Sayfa - Toplam 100 sonuçtan 81 ile 100 arası sonuç gösteriliyor.">Son Sayfa <strong>&raquo;</strong></a></td>
    <td class="vbmenu_control" title="forumdisplay.php?f=16&amp;order=desc"><a name="PageNav"></a></td>
</tr>
</table>
</div></td>
</tr>
</table>
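
In the snippet above, all of the "sayfa" page links sit inside the div with class "pagenav", so one way to get at them is to search that div. A minimal sketch with bs4, assuming the markup above is held in a string named html (the excerpt below is trimmed to the div and two of its links purely for illustration):

from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the pagination markup shown above
html = """
<div class="pagenav">
  <a class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa2/">2</a>
  <a class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa3/">3</a>
</div>
"""

soup = BeautifulSoup(html)
pagenav = soup.find("div", {"class": "pagenav"})
if pagenav is not None:
    # Only anchors that actually carry an href attribute
    for a in pagenav.findAll("a", href=True):
        if "sayfa" in a["href"]:
            print a["href"]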
Try this:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
import requests

domain = "http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/"
page = requests.get(domain)
result = BeautifulSoup(page.text)

# Walk every <a> nested inside a <span> and keep the "sayfa" page links
for span in result.findAll("span"):
    for a in span.findAll("a"):
        href = a.get("href")
        # Some anchors have no href, so guard against None before testing it
        if href and "javascript" not in href and "sayfa" in href:
            print href
This will get you the href links.

Output:
http://www.forumsokagi.com/forum.php
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
etc...
In a list comp:

urls = [span.a["href"] for span in soup.findAll('span') if span.a]
If you print span.a in the loop you will sometimes see None, so you need to check if span.a before using span.a["href"], otherwise you get TypeError: 'NoneType' object has no attribute '__getitem__'.

You can use a set comp, since there are duplicate urls:

urls = {span.a["href"] for span in soup.findAll('span') if span.a}
Then search for whatever urls you need:

for url in sorted(urls):
    if "sayfa" in url:
        print url
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa3/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa4/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa7/

In [26]: import urllib2

In [27]: from bs4 import BeautifulSoup

In [28]: domain="http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/"

In [29]: page = urllib2.urlopen(domain).read()

In [30]: soup = BeautifulSoup(page)

In [31]: urls = {span.a["href"] for span in soup.findAll('span') if span.a}

In [32]: for url in sorted(urls):
   ....:     if "sayfa" in url:
   ....:             print url
   ....:         
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa3/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa4/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa7/

This assumes the urls you want contain the word sayfa.

You can also do this using lxml:

import urllib2
import lxml.html
domain="http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/"
data=urllib2.urlopen(domain).read()
tree = lxml.html.fromstring(data)
for i in  tree.xpath('//a/@href'):
    if "sayfa" in i:
        print i
Output:

http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa3/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa4/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa7/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa3/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa4/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa7/
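
The repeated lines in this output are simply the same pagination links appearing more than once in the markup (the numbered links and the next/last arrows point to the same pages). As in the set-comprehension answer above, collecting the XPath results into a set removes the duplicates; a small sketch along those lines:

import urllib2
import lxml.html

domain = "http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/"
data = urllib2.urlopen(domain).read()
tree = lxml.html.fromstring(data)

# Deduplicate with a set comprehension so each "sayfa" link prints only once
urls = {href for href in tree.xpath('//a/@href') if "sayfa" in href}
for url in sorted(urls):
    print url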

Comments:

Where do you want to search for the term "sayfa" - in the HTML source or in the url itself?
Import requests and BeautifulSoup and use SoupStrainer; you will get the result you want. Please try the sample code @carms642 www.vbulletin.com.tr/vbulletin-temel-bilgiler/
What is SoupStrainer?
With SoupStrainer we can parse selectively @heinst (a sketch of the SoupStrainer approach follows the example at the end of these comments).
I get UserWarning: The "parseOnlyThese" argument to the BeautifulSoup constructor has been renamed to "parse_only". Try parse_only instead of parseOnlyThese. Please check the details.
The BeautifulSoup version you are using may be different @carms642
I couldn't find a solution @carms642
What do you mean? The data should not be duplicated.
The data should not be duplicated, that's why I used a set.
It has to come out this way? I can't figure out what I should do to get distinct output on screen. Example:
import urllib2
import lxml.html
domain="http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/"
data=urllib2.urlopen(domain).read()
tree = lxml.html.fromstring(data)
for i in  tree.xpath('//a/@href'):
    if "sayfa" in i:
        print i
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa3/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa4/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa7/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa3/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa4/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa7/
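
For completeness, here is roughly what the SoupStrainer suggestion from the comments looks like. A SoupStrainer tells bs4 to keep only the matching tags while building the tree (only <a> tags here), which keeps parsing cheap on large pages; note that bs4 spells the keyword parse_only, while the parseOnlyThese spelling belongs to the old BeautifulSoup 3 constructor, which is what the deprecation warning quoted in the comments is about. A sketch, assuming requests and bs4 are installed:

import requests
from bs4 import BeautifulSoup, SoupStrainer

domain = "http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/"
page = requests.get(domain)

# Parse only the <a> tags that actually carry an href attribute
only_links = SoupStrainer("a", href=True)
soup = BeautifulSoup(page.text, parse_only=only_links)

# Track seen hrefs so each "sayfa" link prints only once
seen = set()
for a in soup.findAll("a", href=True):
    href = a["href"]
    if "sayfa" in href and href not in seen:
        seen.add(href)
        print href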