Web scraping in Python


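I want to scrape all ~62,000 names from a petition page with Python. I am trying to use the BeautifulSoup4 library.

However, it is just not working.

Here is my code so far: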

import urllib2, re
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())

divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]

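What am I doing wrong? Also, I would like to somehow reach the next page and add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.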

What do you mean by "not working"? Do you get an empty list, or an error?

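If you are getting an empty list, it is because the class "name_location" does not exist in the document the server returns. Also check the bs4 documentation.

In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on the site in a short amount of time, slowing down legitimate users' requests. Not to mention stealing all of their data.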

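Consider an alternate approach, such as (politely) asking for a dump of the data (as mentioned above).

Or, if you absolutely do need to scrape:

1. Space your requests using a timer

2. Scrape smartly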
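As a rough illustration of the first point, the sketch below pauses between consecutive requests. The function name `fetch_politely`, its parameters, and the stand-in fetcher are all hypothetical; in real use `fetch` would wrap whatever HTTP call you already make, and the right delay depends on the site.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch` and `sleep` are passed in so the pacing logic can be tried
    without touching a real site.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in fetcher (an assumption for the demo): no network access.
    def fake_fetch(url):
        return "<response for {}>".format(url)

    pages = fetch_politely(
        ["http://example.com/a", "http://example.com/b"],
        fake_fetch,
        delay_seconds=0.1,
    )
    print(pages)
```

A fixed delay is the simplest choice; it keeps the request rate predictable for the server at the cost of a longer total run.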

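I had a quick look at the page, and it appears they use AJAX to request the signatures. Why not simply copy their AJAX request? It is most likely some kind of REST call. By doing so you request only the data you need, which lessens the load on their server. It will also be easier to process the data, because it comes in a well-structured format.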
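To show why structured data is easier to process than rendered HTML, here is a minimal sketch that pulls names out of an XML payload with the standard library. The element names `signature`, `firstname`, and `lastname` are taken from the code elsewhere in this thread; the sample document and the helper `extract_names` are made up for illustration and may not match the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample payload; the real feed's layout may differ.
SAMPLE_XML = """<signatures>
  <signature><firstname>Ada</firstname><lastname>Lovelace</lastname></signature>
  <signature><firstname>Alan</firstname><lastname>Turing</lastname></signature>
</signatures>"""


def extract_names(xml_text):
    """Return a list of 'First Last' strings, one per <signature> element."""
    root = ET.fromstring(xml_text)
    names = []
    for signature in root.iter("signature"):
        # findtext returns the default when the child element is missing.
        first = signature.findtext("firstname", default="").strip()
        last = signature.findtext("lastname", default="").strip()
        full = " ".join(part for part in (first, last) if part)
        if full:
            names.append(full)
    return names


if __name__ == "__main__":
    print(extract_names(SAMPLE_XML))
```

No CSS classes or page layout are involved, so the extraction does not break when the site's HTML changes.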

Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
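Checking a URL against robots.txt rules can be automated with Python 3's standard `urllib.robotparser`. The sketch below parses rules from a string so it runs offline; the rule text is a hypothetical example, not a copy of the site's actual file, and `is_allowed` is a helper name chosen here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
EXAMPLE_RULES = """User-agent: *
Disallow: /xml/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(EXAMPLE_RULES, "http://example.com/xml/data.xml"))
    print(is_allowed(EXAMPLE_RULES, "http://example.com/index.html"))
```

Against a live site you would normally call `set_url()` and `read()` on the parser instead of passing the text in by hand.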

You could try something like this:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')

# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []
while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")

    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass

    # Find the next page to scrape
    prev = soup.find("prev_signature")

    # Check if another page of results exists - if not, break from the loop
    if prev is None:
        break

    # Get the previous URL
    url = prev.contents[0]

    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")   
print("= Printing Results =")
print("====================\n")
print(results)

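Be warned that there is a lot of data there to go through, and I have no idea whether this is against the website's terms of service, so you would need to check that.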
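The script above relies on `urllib2`, which exists only in Python 2. A possible Python 3 sketch of the same page-following idea is shown below, using only the standard library. The function `collect_names`, its `fetch` parameter, and the `max_pages` cap are assumptions introduced here; the element names come from the script above.

```python
import xml.etree.ElementTree as ET


def collect_names(start_url, fetch, max_pages=1000):
    """Follow <prev_signature> links from start_url and gather names.

    `fetch` is any callable that returns the XML text for a URL.
    """
    names = []
    seen = set()
    url = start_url
    # Stop on a missing link, a repeated URL, or the page cap.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        root = ET.fromstring(fetch(url))
        for signature in root.iter("signature"):
            first = signature.findtext("firstname", default="")
            last = signature.findtext("lastname", default="")
            names.append((first + " " + last).strip())
        # findtext returns None when there is no further page.
        url = (root.findtext(".//prev_signature") or "").strip()
    return names
```

For real requests, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()`, ideally combined with a delay between calls.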

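What does the list contain? Also, please don't use the variable name list, as it shadows the Python built-in of the same name.

Scrapy would make scraping each page trivial, but it requires using and learning the Scrapy framework.

Note: 1) it looks like the site's AUP does not allow this, and 2) even if you write a simple loop over next page, next page, and so on, you may get blocked because you will be making a lot of requests... Why not email them and ask whether the information you want is available?

There is nothing in it. I'll update that then. I will also try emailing them now, but I would still like to try to solve this problem.

It is an empty list. The class seems to exist when I inspect the element in Chrome, which is strange, because it does not exist when I view the page source, now that you mention it.

I could use the other approach, but I don't know how. Could you help me make the request to wherever the signatures are?

I sent an email, but it was no use.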