Beautiful Soup and Python scraping - can't reference the correct element
I took a pre-made script for Craigslist that scrapes the values I'm looking for, and I want to build on it to scrape data from other forums, e.g. Pinkbike. The original script, which works fine:
```python
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time
#from config import *

Free_CL_URL = "https://philadelphia.craigslist.org/d/bicycles/search/bia"


def crawlFree(pageval):
    # crawls the free items section and parses the HTML
    if pageval == 0:
        r = requests.get(Free_CL_URL).text
        soup = BeautifulSoup(r, 'html.parser')
    else:
        r = requests.get(Free_CL_URL + "?s=" + str(pageval)).text
        time.sleep(1)
        soup = BeautifulSoup(r, 'html.parser')
    return soup


def searchItems(input):
    # in each page crawled from crawlFree, extract the titles, lowercase the
    # words and compare against search strings to build a result list
    itemlist = []
    for i in input:
        TitleSplit = str(i.contents[0]).split()
        TitleSplit = str([TitleSplit.lower() for TitleSplit in TitleSplit])
        if "cyclocross" in TitleSplit:
            print(str("\n" + i.contents[0]))
            itemlist.append(i.contents[0])
            print((i.attrs['href']))
            itemlist.append(i.attrs['href'])
        elif "58" in TitleSplit:
            print(str("\n" + i.contents[0]))
            itemlist.append(i.contents[0])
            print((i.attrs['href']))
            itemlist.append(i.attrs['href'])
        elif "cx" in TitleSplit:
            print(str("\n" + i.contents[0]))
            itemlist.append(i.contents[0])
            print((i.attrs['href']))
            itemlist.append(i.attrs['href'])
    return itemlist


pageval = 0
totalist = []
while True:
    time.sleep(0.2)
    soup = crawlFree(pageval)
    # crawl pages until you hit one containing the following text,
    # signifying the end of the category
    if "search and you will find" in soup.text and "the harvest moon wanes" in soup.text:
        print("\nEnd of Script")
        break
    else:
        print("\nSearching page " + str((int(pageval / 120))))
        links = soup.find_all('a', class_="result-title hdrlnk")
        itemlist = searchItems(links)
        totalist.append(itemlist)
        pageval += 120

now = datetime.now()
current_time = now.strftime("%H:%M:%S")

# message compilation and delivery
message = "Subject:CL Free Bot Report - " + str(len(totalist)) + "\n\n"
for sublist in totalist:
    for item in sublist:
        message += str("\n" + str(item) + "\n")
print(message)
```
The part I'm now running into trouble with is:
```python
print("\nSearching page " + str((int(pageval / 120))))
links = soup.find_all('a', class_="result-title hdrlnk")
itemlist = searchItems(links)
totalist.append(itemlist)
```
The `links` variable pulling the `a href` doesn't seem to carry over well to Pinkbike. But when I try to pull the value like this:
```python
print("\nSearching page " + str((int(pageval / 120))))
links = soup.find_all('a', class_="href")
itemlist = searchItems(links)
totalist.append(itemlist)
```
I don't seem to get the value back, and I can't work out why. I've tried formatting it a few different ways without success.
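For context, in BeautifulSoup `class_` filters on the tag's CSS `class` attribute, while `href` is an ordinary tag attribute and is filtered by keyword argument. A minimal sketch with made-up HTML illustrating the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking a Craigslist result link.
html = '<a class="result-title hdrlnk" href="/item/1">Bike</a>'
soup = BeautifulSoup(html, 'html.parser')

# Matches: the tag's class attribute is exactly "result-title hdrlnk".
by_class = soup.find_all('a', class_="result-title hdrlnk")

# Matches nothing: no tag has class="href"; "href" is not a class here.
by_fake_class = soup.find_all('a', class_="href")

# Matches any <a> that has an href attribute at all.
by_attr = soup.find_all('a', href=True)
```

So `find_all('a', class_="href")` would only ever match tags literally written as `<a class="href" ...>`.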
The full code of what I'm attempting is below:
```python
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time
#from config import *

URL = "https://www.pinkbike.com/buysell/list/?category=77"


def crawlFree(pageval):
    # crawls our results from the url above
    if pageval == 0:
        r = requests.get(URL).text
        soup = BeautifulSoup(r, 'html.parser')
    else:
        r = requests.get(URL + "?s=" + str(pageval)).text
        time.sleep(1)
        soup = BeautifulSoup(r, 'html.parser')
    return soup


def searchItems(input):
    # in each page crawled from crawlFree, extract the titles, lowercase the
    # words and compare against search strings to build a result list
    itemlist = []
    for i in input:
        TitleSplit = str(i.contents[0]).split()
        TitleSplit = str([TitleSplit.lower() for TitleSplit in TitleSplit])
        if "cyclocross" in TitleSplit:
            print(str("\n" + i.contents[0]))
            itemlist.append(i.contents[0])
            print((i.attrs['href']))
            itemlist.append(i.attrs['href'])
        elif "58" in TitleSplit:
            print(str("\n" + i.contents[0]))
            itemlist.append(i.contents[0])
            print((i.attrs['href']))
            itemlist.append(i.attrs['href'])
        elif "large" in TitleSplit:
            print(str("\n" + i.contents[0]))
            itemlist.append(i.contents[0])
            print((i.attrs['href']))
            itemlist.append(i.attrs['href'])
    return itemlist


pageval = 0
totalist = []
while True:
    time.sleep(0.2)
    soup = crawlFree(pageval)
    # crawl pages until you hit one containing the following text,
    # signifying the end of the category
    if "search and you will find" in soup.text and "the harvest moon wanes" in soup.text:
        print("\nEnd of Script")
        break
    else:
        print("\nSearching page " + str((int(pageval / 120))))
        links = soup.find_all('a', class_="href")
        itemlist = searchItems(links)
        totalist.append(itemlist)
        pageval += 120

now = datetime.now()
current_time = now.strftime("%H:%M:%S")

# message compilation and delivery
message = "Subject:CL Free Bot Report - " + str(len(totalist)) + "\n\n"
for sublist in totalist:
    for item in sublist:
        message += str("\n" + str(item) + "\n")
print(message)
```
I can't comment directly because of reputation, so posting as an answer. The following works for your else clause:
```python
links = [div.find_all('a')[1] for div in soup.find_all('div', class_='bsitem')]
```
But… you'll have to do a bit more, because in scraping there is no one-size-fits-all solution. You also have to handle moving on to the next page: there are only 20 bikes per page instead of 120, and the parameter isn't `s` but `page`.
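The points above (second `<a>` inside each `div.bsitem`, 20 ads per page, a `page` query parameter instead of `s`) could be sketched roughly like this. This is an illustrative assumption about the site's paging, not verified behavior, and the empty-page stop condition is a guess:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.pinkbike.com/buysell/list/?category=77"


def extract_links(soup):
    # Second <a> inside each div.bsitem is assumed to be the listing link.
    return [div.find_all('a')[1] for div in soup.find_all('div', class_='bsitem')]


def crawl_all():
    results = []
    page = 1
    while True:
        # requests appends &page=N to the existing ?category=77 query string.
        r = requests.get(URL, params={"page": page})
        soup = BeautifulSoup(r.text, 'html.parser')
        links = extract_links(soup)
        if not links:  # assumed stop condition: a page with no listings
            break
        results.extend((a.contents[0], a.attrs.get('href')) for a in links)
        page += 1      # 20 ads per page, so step by 1, not by 120
    return results
```

`extract_links` keeps the parsing separate from the fetching, so the selector can be tested on saved HTML without hitting the site.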
Hope this hint helps; let us know.

Nice, I see what you did there! I see now where the problem was: it's a div class, not an `a href`. Now I need to look at your suggestion for how we turn to the next page! Thank you very much, I was able to sort it out.