Python AttributeError: 'NoneType' object has no attribute 'text' (web scraping)
This is my case study on web scraping. In the last piece of code I ran into the error "'NoneType' object has no attribute 'text'", so I tried to work around it with the getattr function, but it didn't help:
import requests
from bs4 import BeautifulSoup

# productlinks is built earlier in the notebook (that code is omitted here)
for link in productlinks:
    source = requests.get(link)
    soup = BeautifulSoup(source.content, 'lxml')
    # getattr() returns the default instead of raising when find() matches nothing
    name = getattr(soup.find('h1', class_='item-heading__name'), 'text', None)
    price = getattr(soup.find('p', class_='item-heading__price'), 'text', None)
    feature = getattr(soup.find('div', class_='tab-accordion__content active'), 'text', None)
    summary = {
        'name': name,
        'price': price,
        'feature': feature
    }
    print(summary)
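For context on why the error went away but the values are still empty: getattr(obj, 'text', None) simply returns the default when soup.find() matches nothing, so it hides the problem instead of fixing it. A minimal, self-contained sketch of that behaviour (not tied to the site above):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='other'>hello</p>", "html.parser")

tag = soup.find("h1", class_="item-heading__name")  # no match, so find() returns None
print(tag)  # None

# tag.text would raise: AttributeError: 'NoneType' object has no attribute 'text'
# getattr() swallows that and hands back the default instead:
print(getattr(tag, "text", None))  # None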
Here is the output. It only shows None values:
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
First, always turn off JS for the page you are scraping. You will then notice that the markup classes change, and those are exactly the objects you want to target.
Also, when looping over the pages, don't forget that the stop value of Python's range() is exclusive. That is, range(1, 28) will stop at page 27.
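A quick illustration of that off-by-one:

pages = list(range(1, 28))
print(pages[0], pages[-1], len(pages))  # 1 27 27 -- page 28 is never visited

pages = list(range(1, 29))  # stop at 29 to actually cover all 28 pages
print(pages[-1])  # 28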
Here is how I would go about it:
import json
import requests
from bs4 import BeautifulSoup

cookies = {
    "ServerID": "1033",
    "__zlcmid": "10tjXhWpDJVkUQL",
}

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}


def extract_info(bs: BeautifulSoup, tag: str, attr_value: str) -> list:
    return [i.text.strip() for i in bs.find_all(tag, {"itemprop": attr_value})]


all_pages = []
for page in range(1, 29):
    print(f"Scraping data from page {page}...")
    current_page = f"https://www.birdsnest.com.au/womens/dresses?page={page}"
    source = requests.get(current_page, headers=headers, cookies=cookies)
    soup = BeautifulSoup(source.content, 'html.parser')

    brand = extract_info(soup, tag="strong", attr_value="brand")
    name = extract_info(soup, tag="h2", attr_value="name")
    price = extract_info(soup, tag="span", attr_value="price")

    all_pages.extend(
        [
            {
                "brand": b,
                "name": n,
                "price": p,
            } for b, n, p in zip(brand, name, price)
        ]
    )

print(f"{all_pages}\nFound: {len(all_pages)} dresses.")

with open("all_the_dresses2.json", "w") as jf:
    json.dump(all_pages, jf, indent=4)
That gets you a JSON with all the dresses:
{
    "brand": "boho bird",
    "name": "Prissy Dress",
    "price": "$189.95"
},
{
    "brand": "boho bird",
    "name": "Dandelion Dress",
    "price": "$139.95"
},
{
    "brand": "Lula Soul",
    "name": "Dandelion Dress",
    "price": "$179.95"
},
{
    "brand": "Honeysuckle Beach",
    "name": "Cotton V-Neck A-Line Splice Dress",
    "price": "$149.95"
},
{
    "brand": "Honeysuckle Beach",
    "name": "Lenny Pinafore",
    "price": "$139.95"
},
and so on for the remaining 27 pages ...
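If you want to sanity-check the result afterwards (my addition, not part of the original answer), you can load the file back in:

import json

with open("all_the_dresses2.json") as jf:
    dresses = json.load(jf)

print(f"Loaded {len(dresses)} dresses, first entry: {dresses[0]}")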
Here is the first link in productlinks:
This link is blocked. If you go to one of the links in the productlinks list, they point to empty pages.
Here is my output using all of your code: {Scraping data from page 1... Scraping data from page 2... Scraping data from page 3... Scraping data from page 4... Scraping data from page 5... Scraping data from page 6... Scraping data from page 7...} ...
Once the script finishes, the output will be in a JSON file. This line saves it to a file: open("all_the_dresses2.json", "w")
Oh, I see, you are on Google Colab. I don't know anything about it, but you can add this line to print the results: print(f"{all_pages}\nFound: {len(all_pages)} dresses.")
OMG! It worked. Thank you so much for your help. I tried so many times to fix it.
If you found my answer useful, please upvote and/or accept it.
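As a side note on the "blocked link" issue mentioned in the comments: a quick way to tell whether a product page is served empty or blocked is to check the status code and the size of the response body before parsing. A rough sketch (the 1024-byte threshold is my own guess, not from the thread):

import requests

# Same user-agent trick as in the answer above
HEADERS = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}

def looks_blocked(url: str) -> bool:
    # A non-OK status or a suspiciously small body usually means
    # the page was blocked or served without its real content.
    resp = requests.get(url, headers=HEADERS, timeout=10)
    return resp.status_code != 200 or len(resp.content) < 1024

# Usage (hypothetical URL):
# print(looks_blocked("https://www.birdsnest.com.au/womens/dresses/some-product"))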