Python AttributeError: 'NoneType' object has no attribute 'text' (web scraping)
This is my case study on web scraping. In the last piece of code I ran into the error "'NoneType' object has no attribute 'text'", so I tried to work around it with the getattr function, but it didn't help:
import requests
from bs4 import BeautifulSoup

# productlinks is built earlier in the notebook (that code is omitted here)
for link in productlinks:
    source = requests.get(link)
    soup = BeautifulSoup(source.content, 'lxml')
    # getattr() returns the default instead of raising when find() matches nothing
    name = getattr(soup.find('h1', class_='item-heading__name'), 'text', None)
    price = getattr(soup.find('p', class_='item-heading__price'), 'text', None)
    feature = getattr(soup.find('div', class_='tab-accordion__content active'), 'text', None)
    summary = {
        'name': name,
        'price': price,
        'feature': feature
    }
    print(summary)
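For context on why the error went away but the values are still empty: getattr(obj, 'text', None) simply returns the default when soup.find() matches nothing, so it hides the problem instead of fixing it. A minimal, self-contained sketch of that behaviour (not tied to the site above):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='other'>hello</p>", "html.parser")

tag = soup.find("h1", class_="item-heading__name")  # no match, so find() returns None
print(tag)  # None

# tag.text would raise: AttributeError: 'NoneType' object has no attribute 'text'
# getattr() swallows that and hands back the default instead:
print(getattr(tag, "text", None))  # None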
Here is the output. It only shows None values:
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
First, always turn off JS for the page you are scraping. You will then notice that the markup classes change, and those are exactly the objects you want to target.
Also, when looping over the pages, don't forget that the stop value of Python's range() is exclusive. That is, range(1, 28) will stop at page 27.
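A quick illustration of that off-by-one:

pages = list(range(1, 28))
print(pages[0], pages[-1], len(pages))  # 1 27 27 -- page 28 is never visited

pages = list(range(1, 29))  # stop at 29 to actually cover all 28 pages
print(pages[-1])  # 28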
Here is how I would go about it:
import json
import requests
from bs4 import BeautifulSoup

cookies = {
    "ServerID": "1033",
    "__zlcmid": "10tjXhWpDJVkUQL",
}

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}


def extract_info(bs: BeautifulSoup, tag: str, attr_value: str) -> list:
    return [i.text.strip() for i in bs.find_all(tag, {"itemprop": attr_value})]


all_pages = []
for page in range(1, 29):
    print(f"Scraping data from page {page}...")
    current_page = f"https://www.birdsnest.com.au/womens/dresses?page={page}"
    source = requests.get(current_page, headers=headers, cookies=cookies)
    soup = BeautifulSoup(source.content, 'html.parser')

    brand = extract_info(soup, tag="strong", attr_value="brand")
    name = extract_info(soup, tag="h2", attr_value="name")
    price = extract_info(soup, tag="span", attr_value="price")

    all_pages.extend(
        [
            {
                "brand": b,
                "name": n,
                "price": p,
            } for b, n, p in zip(brand, name, price)
        ]
    )

print(f"{all_pages}\nFound: {len(all_pages)} dresses.")

with open("all_the_dresses2.json", "w") as jf:
    json.dump(all_pages, jf, indent=4)
That gets you a JSON with all the dresses:
{
    "brand": "boho bird",
    "name": "Prissy Dress",
    "price": "$189.95"
},
{
    "brand": "boho bird",
    "name": "Dandelion Dress",
    "price": "$139.95"
},
{
    "brand": "Lula Soul",
    "name": "Dandelion Dress",
    "price": "$179.95"
},
{
    "brand": "Honeysuckle Beach",
    "name": "Cotton V-Neck A-Line Splice Dress",
    "price": "$149.95"
},
{
    "brand": "Honeysuckle Beach",
    "name": "Lenny Pinafore",
    "price": "$139.95"
},
and so on for the remaining 27 pages ...
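If you want to sanity-check the result afterwards (my addition, not part of the original answer), you can load the file back in:

import json

with open("all_the_dresses2.json") as jf:
    dresses = json.load(jf)

print(f"Loaded {len(dresses)} dresses, first entry: {dresses[0]}")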
Here is the first link in productlinks:
This link is blocked. If you go to one of the links in the productlinks list, they point to empty pages.
Here is my output using all of your code: {Scraping data from page 1... Scraping data from page 2... Scraping data from page 3... Scraping data from page 4... Scraping data from page 5... Scraping data from page 6... Scraping data from page 7...} ...
Once the script finishes, the output will be in a JSON file. This line saves it to a file: open("all_the_dresses2.json", "w")
Oh, I see, you are on Google Colab. I don't know anything about it, but you can add this line to print the results: print(f"{all_pages}\nFound: {len(all_pages)} dresses.")
OMG! It worked. Thank you so much for your help. I tried so many times to fix it.
If you found my answer useful, please upvote and/or accept it.
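As a side note on the "blocked link" issue mentioned in the comments: a quick way to tell whether a product page is served empty or blocked is to check the status code and the size of the response body before parsing. A rough sketch (the 1024-byte threshold is my own guess, not from the thread):

import requests

# Same user-agent trick as in the answer above
HEADERS = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}

def looks_blocked(url: str) -> bool:
    # A non-OK status or a suspiciously small body usually means
    # the page was blocked or served without its real content.
    resp = requests.get(url, headers=HEADERS, timeout=10)
    return resp.status_code != 200 or len(resp.content) < 1024

# Usage (hypothetical URL):
# print(looks_blocked("https://www.birdsnest.com.au/womens/dresses/some-product"))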