Web scraping: how to iterate over multiple links with Python, BeautifulSoup, and requests, scrape each one, and save the output to a CSV
Tags: web-scraping, beautifulsoup, python-requests

I have this code, but I don't know how to read the links from a CSV or a list. I want to read the links, scrape the details from each one, and save the data for each link into its own column of an output CSV. Below is the code I built to fetch the specific data:
from bs4 import BeautifulSoup
import requests

url = "http://www.ebay.com/itm/282231178856"
r = requests.get(url)
x = BeautifulSoup(r.content, "html.parser")
# print(x.prettify())

# title extraction
z = x.find_all("h1", {"itemprop": "name"})
for item in z:
    try:
        print(item.text.replace('Details about ', ''))
    except AttributeError:
        pass

# category extraction
m = x.find_all("span", {"itemprop": "name"})
for item in m:
    try:
        print(item.text)
    except AttributeError:
        pass

# item condition extraction
n = x.find_all("div", {"itemprop": "itemCondition"})
for item in n:
    try:
        print(item.text)
    except AttributeError:
        pass

# sold number extraction
k = x.find_all("span", {"class": "vi-qtyS vi-bboxrev-dsplblk vi-qty-vert-algn vi-qty-pur-lnk"})
for item in k:
    try:
        print(item.text)
    except AttributeError:
        pass

# watchers extraction
u = x.find_all("span", {"class": "vi-buybox-watchcount"})
for item in u:
    try:
        print(item.text)
    except AttributeError:
        pass

# returns details extraction
t = x.find_all("span", {"id": "vi-ret-accrd-txt"})
for item in t:
    try:
        print(item.text)
    except AttributeError:
        pass

# per-hour / per-day views extraction
a = x.find_all("div", {"class": "vi-notify-new-bg-dBtm"})
for item in a:
    try:
        print(item.text)
    except AttributeError:
        pass

# trending-at price extraction
b = x.find_all("span", {"class": "mp-prc-red"})
for item in b:
    try:
        print(item.text)
    except AttributeError:
        pass
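Each of the eight blocks above repeats the same find_all-and-print pattern, so the repetition can be reduced with one small helper. A sketch (the helper name `extract_texts` and the inline sample HTML are mine, for illustration only):

```python
from bs4 import BeautifulSoup

def extract_texts(soup, tag, attrs):
    """Return the stripped text of every element matching tag/attrs."""
    return [item.text.strip() for item in soup.find_all(tag, attrs)]

# Example with a tiny inline page instead of a live request
html = '<h1 itemprop="name">Details about Widget</h1><span itemprop="name">Toys</span>'
soup = BeautifulSoup(html, "html.parser")
print(extract_texts(soup, "h1", {"itemprop": "name"}))    # ['Details about Widget']
print(extract_texts(soup, "span", {"itemprop": "name"}))  # ['Toys']
```

With this, each field in the original script becomes a single call like `extract_texts(x, "span", {"class": "vi-buybox-watchcount"})`.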
Your question is a bit vague. Which links are you talking about? There are a hundred of them on a single eBay page. And which pieces of information do you want to collect? Again, there are a ton. But anyway, here goes:
# First, create a list of the urls you want to iterate on
urls = []
soup = BeautifulSoup(r.text, "html.parser")  # r is the response for your starting page
# Assuming your links of interest are the "href" values of <a> tags
a_tags = soup.find_all("a")
for tag in a_tags:
    urls.append(tag["href"])

# Second, iterate over the urls while storing the info
info_1, info_2 = [], []
for link in urls:
    # Do your scraping here; maybe it's time to turn your existing loops into a function?
    info_a, info_b = YourFunctionReturningValues(link)
    info_1.append(info_a)
    info_2.append(info_b)
Hope this helps. And of course, don't hesitate to provide more information so you can get a more detailed answer.
On attributes with BeautifulSoup:
About the csv module:

Thanks for your help — not sharing the links was a small mistake on my part. I meant the main URL of the item page itself, which follows this structure --> ebay.com/itm/. I have thousands of item IDs, so using the example code above I can put each item ID into a list, use a URL-concatenation tutorial to join each ID onto the base URL, loop through them to get the data I need, and then save that data to a new CSV created by the code itself. I'll write another sample and post it here for your comments so I can learn more.

Hey, maybe you should close this question now. I think your structure is right, so start breaking the problem into parts: (1) define a function that uses Beautiful Soup to detect the HTML elements of a page you're interested in, (2) decide how to store the collected information, (3) write it to a CSV file :) Give it a try, and come back later with more precise questions if needed! Trust me, you've already understood how it works.
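Putting the plan from these comments together — item IDs in, one `ebay.com/itm/<id>` URL per ID, one output row per item — a minimal sketch could look like the following. The function names (`parse_item`, `scrape_item`, `run`) and the file names `items.csv`/`output.csv` are my own placeholders; the two selectors are taken from the question's code:

```python
import csv
import requests
from bs4 import BeautifulSoup

def parse_item(html):
    """Pull the title and condition out of one item page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1", {"itemprop": "name"})
    cond_tag = soup.find("div", {"itemprop": "itemCondition"})
    title = title_tag.text.replace("Details about ", "").strip() if title_tag else ""
    condition = cond_tag.text.strip() if cond_tag else ""
    return title, condition

def scrape_item(item_id):
    """Fetch one item page by id and parse it."""
    r = requests.get("http://www.ebay.com/itm/" + str(item_id))
    return parse_item(r.content)

def run(input_path, output_path):
    # Item IDs are assumed to sit in the first column of the input CSV
    with open(input_path, newline="") as f:
        item_ids = [row[0] for row in csv.reader(f) if row]
    # Write one output row per item id
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["item_id", "title", "condition"])
        for item_id in item_ids:
            title, condition = scrape_item(item_id)
            writer.writerow([item_id, title, condition])

# run("items.csv", "output.csv")
```

Keeping the HTML parsing in its own `parse_item(html)` function means it can be tested on a saved page without hitting the network, and extending it to the other fields (watchers, sold count, and so on) is just more `soup.find` calls.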
# Don't forget to import the csv module
import csv

with open(r"path_to_file.csv", "w", newline="") as my_file:
    csv_writer = csv.writer(my_file, delimiter=",")
    csv_writer.writerows(zip(urls, info_1, info_2))
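To see what `writerows(zip(...))` produces, here is a small self-contained example with toy values (written to an in-memory buffer instead of a file): `zip` lines up the parallel lists so each CSV row corresponds to one scraped URL.

```python
import csv
import io

urls = ["http://www.ebay.com/itm/1", "http://www.ebay.com/itm/2"]
info_1 = ["Widget", "Gadget"]
info_2 = ["New", "Used"]

# Write to an in-memory buffer to show the resulting rows
buf = io.StringIO()
writer = csv.writer(buf, delimiter=",")
writer.writerows(zip(urls, info_1, info_2))
print(buf.getvalue())
```

This prints one comma-separated row per URL, e.g. `http://www.ebay.com/itm/1,Widget,New`.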