Python: Scraping project URLs from Kickstarter using Beautiful Soup


python python-3.x web-scraping


I am trying to scrape the URLs of projects from a website using BeautifulSoup. I am using the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.kickstarter.com/discover/advanced?category_id=28&staff_picks=1&sort=newest&seed=2639586&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

project_name_list = soup.find(class_='grid-row flex flex-wrap')

project_name_list_items = project_name_list.find_all('a')
print(project_name_list_items)

for project_name in project_name_list_items:
    links = project_name.get('href')
    print(links)

But this is the output I get:

[<a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>]
None
None
None
None
None
None

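So I still get no URLs. Also, the page I am scraping has a "Load more" section at the bottom of the page. How can I get the URLs from that section as well?

Any help is much appreciated.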

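The data is not embedded in the HTML markup itself. It is embedded as JSON in an HTML attribute called data-project. One solution is to use find_all("div") and keep only the elements that have that attribute.

Also, while the URL is present in the JSON, it lacks a query parameter named ref, whose value is found in another HTML attribute called data-ref.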

import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.kickstarter.com/discover/advanced?category_id=28&staff_picks=1&sort=newest&seed=2639586&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

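# Each project card is a <div> that carries its JSON payload in "data-project"
# and the value of the ref query parameter in "data-ref".
data = [
    (json.loads(i["data-project"]), i["data-ref"])
    for i in soup.find_all("div")
    if i.get("data-project")
]

# Print each project URL with the ref query parameter appended.
for i in data:
    print(f'{i[0]["urls"]["web"]["project"]}?ref={i[1]}')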
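To see why the original attempt printed None while the attribute-based approach works, here is a minimal offline sketch. The SAMPLE_HTML string, the example project URL, the data-ref value, and the helper name extract_project_urls are all made up for illustration; the real Kickstarter markup may differ and may change over time.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described above:
# a placeholder <a> with no href, and the real data in "data-project".
SAMPLE_HTML = """
<div class="grid-row flex flex-wrap">
  <div data-project='{"urls": {"web": {"project": "https://www.kickstarter.com/projects/example/demo"}}}' data-ref="discovery_category_newest">
    <a class="block img-placeholder w100p"><div class="img-placeholder"></div></a>
  </div>
</div>
"""


def extract_project_urls(html):
    """Return project URLs found in data-project attributes (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # attrs={"data-project": True} matches only tags that have the attribute.
    for div in soup.find_all("div", attrs={"data-project": True}):
        project = json.loads(div["data-project"])
        base = project["urls"]["web"]["project"]
        ref = div.get("data-ref")
        # Append ?ref=... only when the card actually carries a data-ref value.
        urls.append(f"{base}?ref={ref}" if ref else base)
    return urls


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The placeholder anchors have no href attribute, so .get("href") returns None.
placeholder_hrefs = [a.get("href") for a in soup.find_all("a")]

project_urls = extract_project_urls(SAMPLE_HTML)
print(placeholder_hrefs)
print(project_urls)
```

Filtering with attrs={"data-project": True} is an alternative to checking i.get("data-project") inside the comprehension; both select the same elements as long as the attribute value is non-empty.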

You can then cover what the "Load more" button loads by iterating over the pages, incrementing the page query parameter.
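As a rough sketch of incrementing the page query parameter, the snippet below builds the URLs for several consecutive pages using only the standard library. The helper name page_urls and the choice of three pages are assumptions for illustration; how many pages actually exist is not stated in the answer.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

START_URL = (
    "https://www.kickstarter.com/discover/advanced"
    "?category_id=28&staff_picks=1&sort=newest&seed=2639586&page=1"
)


def page_urls(url, first_page, last_page):
    """Yield copies of url with the page parameter set to each page number."""
    parts = urlparse(url)
    query = parse_qs(parts.query)  # values are lists, e.g. {"page": ["1"]}
    for page in range(first_page, last_page + 1):
        query["page"] = [str(page)]
        # doseq=True writes each list element as a separate key=value pair.
        yield urlunparse(parts._replace(query=urlencode(query, doseq=True)))


urls = list(page_urls(START_URL, 1, 3))
for u in urls:
    print(u)
```

Each generated URL could then be fetched with requests.get and parsed the same way as the first page. One plausible stopping rule, which is an assumption and not confirmed by the answer, is to stop once a page yields no elements with a data-project attribute.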


But how can I get the text in the project body?

@MaryamRahmaniMoghaddam Like this?

I think this may be related to this question. I have posted a new question here. Please help me if you can. Thanks.