Web scraping tutorial using Python 3?


python web-scraping


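I'm trying to learn Python 3.x so that I can scrape websites. People have recommended that I use BeautifulSoup 4 or lxml.html. Could someone point me in the right direction for a tutorial or examples of using BeautifulSoup with Python 3.x?

Thanks for your help.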

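Actually, I just wrote some sample code in Python. I wrote and tested it on Python 2.7, but both of the packages I used (requests and BeautifulSoup) are fully compatible with Python 3.

Here is some code to get you started with web scraping in Python: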

import sys
import requests
from bs4 import BeautifulSoup  # BeautifulSoup 4; the old "BeautifulSoup" module (version 3) does not run on Python 3


def scrape_google(keyword):

    # dynamically build the URL that we'll be making a request to
    url = "http://www.google.com/search?q={term}".format(
        term=keyword.strip().replace(" ", "+"),
    )

    # spoof some headers so the request appears to be coming from a browser, not a bot
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        # only advertise encodings that requests can decode on its own
        "accept-encoding": "gzip,deflate",
        "accept-language": "en-US,en;q=0.8",
    }

    # make the request to the search url, passing in the spoofed headers
    r = requests.get(url, headers=headers)  # assign the response to a variable r

    # check the status code of the response to make sure the request went well
    if r.status_code != 200:
        print("request denied")
        return
    else:
        print("scraping " + url)

    # convert the plaintext HTML markup into a DOM-like structure that we can search
    # (naming the parser explicitly avoids a warning from BeautifulSoup 4)
    soup = BeautifulSoup(r.text, "html.parser")

    # each result is an <li> element with class="g"; this is our wrapper
    results = soup.find_all("li", "g")

    # iterate over each of the result wrapper elements
    for result in results:

        # the main link is an <h3> element with class="r"
        heading = result.find("h3", "r")
        result_anchor = heading.find("a") if heading else None
        if result_anchor is None:
            continue  # skip wrappers that do not contain a result link

        # print out each link in the results
        print(result_anchor.contents)


if __name__ == "__main__":

    # you can pass in a keyword to search for when you run the script
    # by default, we'll search for the "web scraping" keyword
    try:
        keyword = sys.argv[1]
    except IndexError:
        keyword = "web scraping"

    scrape_google(keyword)
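
One thing worth noting about the URL-building step above: replacing spaces with "+" does not escape characters such as "&" or "#", so a keyword containing them would produce a broken query string. Here is a minimal sketch of a safer way to build the same URL using only the standard library. The helper name `build_search_url` and its `base` parameter are my own, made up for illustration; they are not part of requests or BeautifulSoup.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(keyword, base="http://www.google.com/search"):
    """Return the search URL for `keyword` with the query string percent-encoded.

    `build_search_url` is a hypothetical helper, not part of requests or BeautifulSoup.
    """
    # urlencode turns spaces into "+" and escapes characters such as "&" and "#"
    query = urlencode({"q": keyword.strip()})
    return "{base}?{query}".format(base=base, query=query)


url = build_search_url("  web scraping & python 3  ")
print(url)

# round-trip check: the keyword survives intact inside the "q" parameter
print(parse_qs(urlsplit(url).query)["q"])
```

If you are already using requests, you can probably skip the helper entirely and write `requests.get("http://www.google.com/search", params={"q": keyword}, headers=headers)`, which lets requests do the encoding for you.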


If you just want to learn more about Python 3 in general and are already familiar with Python 2.x, a guide to transitioning from Python 2 to Python 3 may be helpful.
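
To give a rough idea of what changes when you move scraping code from Python 2 to Python 3, here is a small offline sketch of three differences you are likely to hit: `print` is a function, the URL helpers moved into `urllib.parse`, and raw response bodies are `bytes` that must be decoded into `str`. The helper `decode_body` and the sample `body` are made up for illustration, and the UTF-8 default is an assumption; a real page may use a different encoding.

```python
from urllib.parse import quote_plus  # Python 2 spelling: from urllib import quote_plus


def decode_body(raw_bytes, encoding="utf-8"):
    """Decode a raw response body (bytes in Python 3) into text (str).

    `decode_body` is a hypothetical helper used only for this illustration.
    """
    # errors="replace" keeps going when the page contains invalid byte sequences
    return raw_bytes.decode(encoding, errors="replace")


# stand-in for what urllib.request.urlopen(...).read() returns: bytes, not str
body = "<title>caf\u00e9</title>".encode("utf-8")

print(isinstance(body, bytes))  # print is a function in Python 3
print(decode_body(body))
print(quote_plus("web scraping"))
```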

If you want to do web scraping, use Python 2: by far the best web scraping framework for Python has no 3.x version.