Web scraping tutorial using Python 3?
Tags: python, web-scraping, python-3.2

I am trying to learn Python 3.x so that I can scrape websites. People have recommended that I use BeautifulSoup4 or lxml.html. Could someone point me in the right direction for a tutorial or examples of using BeautifulSoup with Python 3.x?
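Thanks for your help.

I actually just wrote some sample code for this in Python. I wrote and tested it on Python 2.7, but the two packages I use (requests and BeautifulSoup) are both fully compatible with Python 3. Here is some code to get you started with web scraping in Python: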
import sys

import requests
# BeautifulSoup 4 is imported from the bs4 package; it runs on Python 3
from bs4 import BeautifulSoup


def scrape_google(keyword):
    # dynamically build the URL that we'll be making a request to
    url = "http://www.google.com/search?q={term}".format(
        term=keyword.strip().replace(" ", "+"),
    )
    # spoof some headers so the request appears to be coming from a browser, not a bot
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "accept-encoding": "gzip,deflate,sdch",
        "accept-language": "en-US,en;q=0.8",
    }
    # make the request to the search url, passing in the spoofed headers
    r = requests.get(url, headers=headers)  # assign the response to a variable r
    # check the status code of the response to make sure the request went well
    if r.status_code != 200:
        print("request denied")
        return
    print("scraping " + url)
    # convert the plaintext HTML markup into a DOM-like structure that we can search
    soup = BeautifulSoup(r.text, "html.parser")
    # each result is an <li> element with class="g"; this is our wrapper
    results = soup.find_all("li", "g")
    # iterate over each of the result wrapper elements
    for result in results:
        # the main link is an <a> inside an <h3> element with class="r"
        heading = result.find("h3", "r")
        result_anchor = heading.find("a") if heading else None
        if result_anchor is None:
            continue  # skip wrappers that lack the expected heading link
        # print out each link in the results
        print(result_anchor.contents)


if __name__ == "__main__":
    # you can pass in a keyword to search for when you run the script
    # by default, we'll search for the "web scraping" keyword
    try:
        keyword = sys.argv[1]
    except IndexError:
        keyword = "web scraping"
    scrape_google(keyword)
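
The URL-building step above only swaps spaces for "+", so a keyword containing characters such as "&" or "+" would produce a malformed query string. One possible variation, sketched here with the standard library's `urllib.parse.urlencode`, percent-encodes the keyword instead. The helper name `build_search_url` and its `base` parameter are made up for this illustration and are not part of the answer's code.

```python
from urllib.parse import urlencode


def build_search_url(keyword, base="http://www.google.com/search"):
    """Return the search URL for ``keyword`` with the query percent-encoded.

    ``build_search_url`` and ``base`` are illustrative names only.
    """
    # urlencode escapes reserved characters and turns spaces into "+"
    query = urlencode({"q": keyword.strip()})
    return "{base}?{query}".format(base=base, query=query)


print(build_search_url("web scraping"))
print(build_search_url("c++ & python"))
```

The returned string could be passed to `requests.get` in place of the hand-built `url`. Alternatively, requests can encode the query itself if the keyword is passed through its `params` argument.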
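If you just want to learn more about Python 3 in general and are already familiar with Python 2.x, then material on transitioning from Python 2 to Python 3 may be helpful. If you want to do web scraping, use Python 2. [...] is by far the best web scraping framework for Python, and it has no 3.x version.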
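
For anyone making the Python 2 to Python 3 transition with scraping code, one difference that tends to surface early is that Python 3 keeps raw bytes (`bytes`) and text (`str`) strictly separate, so a downloaded page body has to be decoded before it can be searched as text. The following is a minimal sketch of that step using only built-in functionality. The function name `decode_body` and the fallback order are assumptions chosen for illustration, not something prescribed by either answer.

```python
def decode_body(raw, declared_encoding=None):
    """Decode raw response bytes to str.

    Tries the declared encoding (if any), then UTF-8, then latin-1.
    ``decode_body`` is a hypothetical helper for illustration.
    """
    for encoding in (declared_encoding, "utf-8"):
        if encoding is None:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # wrong or unknown encoding: fall through to the next candidate
            continue
    # latin-1 maps every byte to a code point, so this never raises
    return raw.decode("latin-1")


print(decode_body("café".encode("utf-8")))
```

Libraries such as requests already do this kind of decoding when you read `r.text`, so the sketch is mainly relevant when working with raw bytes, for example from `r.content`.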