Beautiful Soup and Python scraping - can't reference the correct element
I took a pre-made script for Craigslist that scrapes the values I'm looking for, and I want to build on it to scrape data from other forums, e.g. Pinkbike. The original script, which works fine:
```python
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time
#from config import *

Free_CL_URL = "https://philadelphia.craigslist.org/d/bicycles/search/bia"


def crawlFree(pageval):
    # crawls the free items section and parses the HTML
    if pageval == 0:
        r = requests.get(Free_CL_URL).text
        soup = BeautifulSoup(r, 'html.parser')
    else:
        r = requests.get(Free_CL_URL + "?s=" + str(pageval)).text
        time.sleep(1)
        soup = BeautifulSoup(r, 'html.parser')
    return soup


def searchItems(input):
    # in each page crawled from crawlFree, extract the titles, lowercase the
    # words and compare against search strings to build a result list
    itemlist = []
    for i in input:
        TitleSplit = str(i.contents[0]).split()
        TitleSplit = str([TitleSplit.lower() for TitleSplit in TitleSplit])
        if "cyclocross" in TitleSplit:
            print(str("\n" + i.contents[0]))
            itemlist.append(i.contents[0])
            print((i.attrs['href']))
            itemlist.append(i.attrs['href'])
        elif "58" in TitleSplit:
            print(str("\n" + i.contents[0]))
            itemlist.append(i.contents[0])
            print((i.attrs['href']))
            itemlist.append(i.attrs['href'])
        elif "cx" in TitleSplit:
            print(str("\n" + i.contents[0]))
            itemlist.append(i.contents[0])
            print((i.attrs['href']))
            itemlist.append(i.attrs['href'])
    return itemlist


pageval = 0
totalist = []
while True:
    time.sleep(0.2)
    soup = crawlFree(pageval)
    # crawl pages until you hit one containing the following text,
    # signifying the end of the category
    if "search and you will find" in soup.text and "the harvest moon wanes" in soup.text:
        print("\nEnd of Script")
        break
    else:
        print("\nSearching page " + str((int(pageval / 120))))
        links = soup.find_all('a', class_="result-title hdrlnk")
        itemlist = searchItems(links)
        totalist.append(itemlist)
        pageval += 120

now = datetime.now()
current_time = now.strftime("%H:%M:%S")

# message compilation and delivery
message = "Subject:CL Free Bot Report - " + str(len(totalist)) + "\n\n"
for sublist in totalist:
    for item in sublist:
        message += str("\n" + str(item) + "\n")
print(message)
```
The part I'm now running into trouble with is:
```python
print("\nSearching page " + str((int(pageval / 120))))
links = soup.find_all('a', class_="result-title hdrlnk")
itemlist = searchItems(links)
totalist.append(itemlist)
```
The `links` variable pulling the `a href` doesn't seem to carry over well to Pinkbike. But when I try to pull the value like this:
```python
print("\nSearching page " + str((int(pageval / 120))))
links = soup.find_all('a', class_="href")
itemlist = searchItems(links)
totalist.append(itemlist)
```
I don't seem to get the value back, and I can't work out why. I've tried formatting it a few different ways without success.
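For context, in BeautifulSoup `class_` filters on the tag's CSS `class` attribute, while `href` is an ordinary tag attribute and is filtered by keyword argument. A minimal sketch with made-up HTML illustrating the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking a Craigslist result link.
html = '<a class="result-title hdrlnk" href="/item/1">Bike</a>'
soup = BeautifulSoup(html, 'html.parser')

# Matches: the tag's class attribute is exactly "result-title hdrlnk".
by_class = soup.find_all('a', class_="result-title hdrlnk")

# Matches nothing: no tag has class="href"; "href" is not a class here.
by_fake_class = soup.find_all('a', class_="href")

# Matches any <a> that has an href attribute at all.
by_attr = soup.find_all('a', href=True)
```

So `find_all('a', class_="href")` would only ever match tags literally written as `<a class="href" ...>`.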
The full code of what I'm attempting is below:
```python
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time
#from config import *

URL = "https://www.pinkbike.com/buysell/list/?category=77"


def crawlFree(pageval):
    # crawls our results from the url above
    if pageval == 0:
        r = requests.get(URL).text
        soup = BeautifulSoup(r, 'html.parser')
    else:
        r = requests.get(URL + "?s=" + str(pageval)).text
        time.sleep(1)
        soup = BeautifulSoup(r, 'html.parser')
    return soup


def searchItems(input):
    # in each page crawled from crawlFree, extract the titles, lowercase the
    # words and compare against search strings to build a result list
    itemlist = []
    for i in input:
        TitleSplit = str(i.contents[0]).split()
        TitleSplit = str([TitleSplit.lower() for TitleSplit in TitleSplit])
        if "cyclocross" in TitleSplit:
            print(str("\n" + i.contents[0]))
            itemlist.append(i.contents[0])
            print((i.attrs['href']))
            itemlist.append(i.attrs['href'])
        elif "58" in TitleSplit:
            print(str("\n" + i.contents[0]))
            itemlist.append(i.contents[0])
            print((i.attrs['href']))
            itemlist.append(i.attrs['href'])
        elif "large" in TitleSplit:
            print(str("\n" + i.contents[0]))
            itemlist.append(i.contents[0])
            print((i.attrs['href']))
            itemlist.append(i.attrs['href'])
    return itemlist


pageval = 0
totalist = []
while True:
    time.sleep(0.2)
    soup = crawlFree(pageval)
    # crawl pages until you hit one containing the following text,
    # signifying the end of the category
    if "search and you will find" in soup.text and "the harvest moon wanes" in soup.text:
        print("\nEnd of Script")
        break
    else:
        print("\nSearching page " + str((int(pageval / 120))))
        links = soup.find_all('a', class_="href")
        itemlist = searchItems(links)
        totalist.append(itemlist)
        pageval += 120

now = datetime.now()
current_time = now.strftime("%H:%M:%S")

# message compilation and delivery
message = "Subject:CL Free Bot Report - " + str(len(totalist)) + "\n\n"
for sublist in totalist:
    for item in sublist:
        message += str("\n" + str(item) + "\n")
print(message)
```
I can't comment directly because of reputation, so posting as an answer. The following works for your else clause:
```python
links = [div.find_all('a')[1] for div in soup.find_all('div', class_='bsitem')]
```
But… you'll have to do a bit more, because in scraping there is no one-size-fits-all solution. You also have to handle moving on to the next page: there are only 20 bikes per page instead of 120, and the parameter isn't `s` but `page`.
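The points above (second `<a>` inside each `div.bsitem`, 20 ads per page, a `page` query parameter instead of `s`) could be sketched roughly like this. This is an illustrative assumption about the site's paging, not verified behavior, and the empty-page stop condition is a guess:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.pinkbike.com/buysell/list/?category=77"


def extract_links(soup):
    # Second <a> inside each div.bsitem is assumed to be the listing link.
    return [div.find_all('a')[1] for div in soup.find_all('div', class_='bsitem')]


def crawl_all():
    results = []
    page = 1
    while True:
        # requests appends &page=N to the existing ?category=77 query string.
        r = requests.get(URL, params={"page": page})
        soup = BeautifulSoup(r.text, 'html.parser')
        links = extract_links(soup)
        if not links:  # assumed stop condition: a page with no listings
            break
        results.extend((a.contents[0], a.attrs.get('href')) for a in links)
        page += 1      # 20 ads per page, so step by 1, not by 120
    return results
```

`extract_links` keeps the parsing separate from the fetching, so the selector can be tested on saved HTML without hitting the site.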
Hope this hint helps; let us know.

Nice, I see what you did there! I see now where the problem was: it's a div class, not an `a href`. Now I need to look at your suggestion for how we turn to the next page! Thank you very much, I was able to sort it out.