Python: I'm getting an error when trying to scrape data from a website
Tags: python, web-scraping, beautifulsoup

I wrote a data-scraping script. It works fine on some pages, but on others it fails with:

KeyError: 'isbn'

Can you guide me on how to fix this? Here is my code:

import requests
import re
import json
from bs4 import BeautifulSoup
import csv
import sys
import codecs


def Soup(content):
    soup = BeautifulSoup(content, 'html.parser')
    return soup


def Main(url):
    r = requests.get(url)
    soup = Soup(r.content)
    scripts = soup.findAll("script", type="application/ld+json",
                           text=re.compile("data"))
    prices = [span.text for span in soup.select(
        "p.product-field.price span span") if span.text != "USD"]
    with open("AudioBook/Fiction & Literature/African American.csv", 'a', encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Title", "Writer", "Price", "IMG", "URL", "ISBN"])
        for script, price in zip(scripts, prices):
            script = json.loads(script.text)
            title = script["data"]["name"]
            author = script["data"]["author"][0]["name"]
            img = f'https:{script["data"]["thumbnailUrl"]}'
            isbn = script["data"]["isbn"]
            url = script["data"]["url"]
            writer.writerow([title, author, price, img, url, isbn])


for x in range(1, 10):
    url = ("https://www.kobo.com/ww/en/audiobooks/contemporary-1?pageNumber=" + str(x))
    print("Scraping page " + str(x) + ".....")
    Main(url)

Since the audiobook listing pages do not include an ISBN, you can handle this case with a default value, for example:
isbn = script["data"].get("isbn", "")
In this case, if the "isbn" key does not exist in script["data"], the call returns an empty string instead of raising KeyError.
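
To make the difference between indexing and dict.get concrete, here is a minimal sketch. The helper name isbn_or_default and the two sample entries are hypothetical; they only mimic the script["data"] shape used in the question and are not real data from the site.

```python
def isbn_or_default(entry, default=""):
    """Return entry["data"]["isbn"], or `default` when the key is absent."""
    # dict.get never raises KeyError; it falls back to `default` instead.
    return entry["data"].get("isbn", default)


# Hypothetical entries shaped like the parsed ld+json blocks in the question.
ebook = {"data": {"name": "Example ebook", "isbn": "9780000000000"}}
audiobook = {"data": {"name": "Example audiobook"}}  # no "isbn" key

print(repr(isbn_or_default(ebook)))
print(repr(isbn_or_default(audiobook)))
```

With this approach the CSV row is still written for audiobooks; the ISBN column is simply left empty.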
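Alternatively, you can fetch the ISBN from the audiobook's own page (the script["data"]["url"] value above). The revised Main function at the end of this answer shows one way to do that.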
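
This suggests that script["data"] has no key isbn. Have you checked your HTML?

Yes, I checked, and there is an isbn key.

My hunch is that you have at least one case where there isn't. You should decide how to handle the entries that have no ISBN, for example by writing an empty string to the CSV.

Bro, it works, but it only shows empty ISBN fields. If you check the HTML, an ISBN is available. Could you take a look at the HTML, please, sir?

I checked; the ISBN is not part of the data (see the page source). The ISBN only appears once you navigate to the book-specific page on the site.

So how can we extract the ISBN? Is there any solution?

You need to follow the URL you extracted with script["data"]["url"], i.e. fetch that page with requests.get and then extract the ISBN from it.

Bro, my code has no problem with this. With some links my code works fine, so why doesn't it work for certain URLs?
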
def Main(url):
    r = requests.get(url)
    soup = Soup(r.content)
    scripts = soup.findAll("script", type="application/ld+json",
                           text=re.compile("data"))
    prices = [span.text for span in soup.select(
        "p.product-field.price span span") if span.text != "USD"]
    with open("AudioBook/Fiction & Literature/African American.csv", 'a', encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Title", "Writer", "Price", "IMG", "URL", "ISBN"])
        for script, price in zip(scripts, prices):
            script = json.loads(script.text)
            title = script["data"]["name"]
            author = script["data"]["author"][0]["name"]
            img = f'https:{script["data"]["thumbnailUrl"]}'
            # NEW CODE
            url = script["data"]["url"]
            if "isbn" in script["data"]:
                # ebook listings
                isbn = script["data"]["isbn"]
            else:
                # audiobook listings
                r = requests.get(url)
                inner_soup = Soup(r.content)
                try:
                    inner_script = json.loads(
                        inner_soup.find("script", type="application/ld+json",
                                        text=re.compile("workExample")).text)
                    isbn = inner_script["workExample"]["isbn"]
                except AttributeError:
                    isbn = ""
            # END NEW CODE
            writer.writerow([title, author, price, img, url, isbn])
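
The try/except above only covers the case where no matching script tag is found (AttributeError). A more defensive variant could also tolerate invalid JSON or a missing "workExample" or "isbn" key. The sketch below uses only the standard library and works on the text of the ld+json block. The helper name extract_isbn is hypothetical, and the assumption that "workExample" may be either a single object or a list of objects has not been verified against the site.

```python
import json


def extract_isbn(ld_json_text):
    """Return the ISBN found under "workExample", or "" if there is none."""
    try:
        data = json.loads(ld_json_text)
    except (TypeError, ValueError):
        # TypeError: the input was None or not a string.
        # ValueError: the text was not valid JSON (JSONDecodeError subclasses it).
        return ""
    if not isinstance(data, dict):
        return ""
    work = data.get("workExample")
    # Assumption: "workExample" may be one object or a list of objects.
    if isinstance(work, list):
        work = next(
            (w for w in work if isinstance(w, dict) and "isbn" in w), None)
    if isinstance(work, dict):
        return str(work.get("isbn", ""))
    return ""
```

Inside Main, the text of the script tag (when one is found) could be passed to this helper, so that any of these failure modes yields an empty ISBN column rather than an exception.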