Web scraping in Python using BeautifulSoup


python web-scraping


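I'm a beginner, and I have been scraping a web page that contains some quotes I want to extract.

Could you also check the code that copies the scraped data to a CSV file?

I'm getting an error in the findAll call: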

for row in table.findAll('div', attrs = {'class':'quote'}):
AttributeError: 'NoneType' object has no attribute 'findAll'

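The site does not have any div tag with the id container. You can use

The site's HTML differs from the HTML you assumed in your script. I have corrected the first three fields; I think you can do the rest yourself. The following should work for you: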

import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.passiton.com/inspirational-quotes?page={}"

quotes = []
page = 1

while True:
    r = requests.get(URL.format(page))
    print(r.url)
    soup = BeautifulSoup(r.content, 'html5lib')

    # Stop as soon as a page contains no quote links.
    if not soup.select_one("#all_quotes .text-center > a"):
        break

    for row in soup.select("#all_quotes .text-center"):
        quote = {}
        # select_one() returns None when nothing matches, so calling .get()
        # on the result raises AttributeError; fall back to an empty string.
        try:
            quote['quote'] = row.select_one('a img.shadow').get("alt")
        except AttributeError:
            quote['quote'] = ""
        try:
            quote['url'] = row.select_one('a').get('href')
        except AttributeError:
            quote['url'] = ""
        try:
            quote['img'] = row.select_one('a img.shadow').get('src')
        except AttributeError:
            quote['img'] = ""
        quotes.append(quote)

    page += 1

# Requires Python 3: the built-in open() in Python 2 does not accept
# the newline= and encoding= keyword arguments.
with open('inspirational_quotes.csv', 'w', newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, ['quote', 'url', 'img'])
    w.writeheader()
    w.writerows(quotes)
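
The CSV step at the end of the script above can be tried on its own, without any network access. The following is a minimal sketch, not part of the original answer: the helper name quotes_to_csv and the sample rows are made up for illustration, and it writes to an in-memory buffer instead of a file. It shows that csv.DictWriter leaves a cell empty when a row has no value for one of the field names.

```python
import csv
import io


def quotes_to_csv(quotes, fieldnames):
    """Serialize a list of dicts to CSV text; missing keys become empty cells."""
    buffer = io.StringIO(newline="")
    # restval is the value written for any field name absent from a row.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(quotes)
    return buffer.getvalue()


# Hypothetical sample data standing in for scraped rows.
sample = [
    {"quote": "Be kind", "url": "/quotes/1", "img": "/img/1.jpg"},
    {"quote": "Stay curious", "url": "/quotes/2"},  # no 'img' key
]

csv_text = quotes_to_csv(sample, ["quote", "url", "img"])
print(csv_text)
```

If a column comes out blank in the real output file, it is worth checking that the keys stored in each row dict match the field names passed to DictWriter exactly.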

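Your error means that soup.find('div', attrs={'id': 'container'}) returned None, i.e. no div element with the id container was found. Have you checked the HTML? You could show the relevant part that you want to extract in your question.

I think the problem is that the URL you are using actually redirects to a different page:

@dspencer Yes sir, I have checked the HTML. "container" seems to be correct, and I checked "row" as well. Could you check the page?

Yes, as soon as we click on a quote, it redirects to a different page.

@dspencer table = soup.find('div', attrs={'id': 'all_quotes'}) is the change in the code that did the job. I now get a CSV file with the headers 'theme', 'url', 'img', 'lines', 'author', but the data under them is blank. What might I be doing wrong?

Hello, I ran this and got the following result:

Traceback (most recent call last):
  File "/home/martin/.atom/python/examples/bs_values_com.py", line 31
    with open('inspirational_quotes.csv', 'w', newline="", encoding="utf-8") as f:
TypeError: file() takes at most 3 arguments (4 given)
[Finished in 55.861s]

I'm still trying to work out how to fix it.

Sir, I don't understand the answer you gave. The code above could be executed a year ago. Could you check again?

Hello, I checked this and got the same traceback as above: TypeError: file() takes at most 3 arguments (4 given).

@zero, it works perfectly for me. Are you sure no Excel file with that name has already been created? The script creates one as the pages are scraped.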
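
The point made in the comments above, that find() returns None when no element matches, can be reproduced on a small inline document. This is only a sketch under stated assumptions: the HTML string and the helper extract_quotes are hypothetical, loosely modelled on the ids mentioned in the thread, and it uses the built-in html.parser so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure may differ.
HTML = """
<div id="all_quotes">
  <div class="quote">First quote</div>
  <div class="quote">Second quote</div>
</div>
"""


def extract_quotes(html, container_id):
    """Return the text of each div.quote in the container, or [] if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("div", attrs={"id": container_id})
    if table is None:
        # Nothing matched; calling table.find_all() here would raise
        # AttributeError: 'NoneType' object has no attribute 'find_all'.
        return []
    rows = table.find_all("div", attrs={"class": "quote"})
    return [row.get_text(strip=True) for row in rows]


missing = extract_quotes(HTML, "container")   # no such id in the markup
found = extract_quotes(HTML, "all_quotes")    # id that does exist
print(missing)
print(found)
```

Checking for None before calling find_all() turns the crash into an empty result, which makes it easier to see that the selector, not the loop, is what needs fixing.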

from requests import get

# Fetch one random quote as JSON and read the quote text and author fields.
url = 'https://quote-garden.herokuapp.com/quotes/random'
res = get(url)
res = res.json()
quote = res["quoteText"]
quoteauthor = res["quoteAuthor"]
