Python: build a Scrapy spider to load more content and scrape the product URLs from a page
python, selenium, web-scraping, scrapy

Hi everyone, I have built a script in Python that uses Selenium to scroll endlessly and click the "load more" button, but apparently it only yields half of the products, and it is also very time-consuming. Now I would like to write a Scrapy script that collects all of the product links into a CSV file. The script I have written is:
from selenium import webdriver
import time
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoSuchWindowException

path_to_chromedriver = 'C:/Users/Admin/AppData/Local/Programs/Python/Python37-32/chromedriver.exe'

# disable notification pop-ups and start the browser maximized
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications": 2}
chrome_options.add_experimental_option("prefs", prefs)
chrome_options.add_argument("start-maximized")
browser = webdriver.Chrome(options=chrome_options, executable_path=path_to_chromedriver)

# read the category URLs to visit, one per line
with open('E:/grainger2.txt', 'r', encoding='utf-8-sig') as f:
    content = [x.strip() for x in f.readlines()]

with open('E:/grainger11.csv', 'a', encoding="utf-8") as f:
    # header matching the three fields written per row below
    f.write("marker,link,sublink\n")
    for dotnum in content:
        browser.get(dotnum)
        SCROLL_PAUSE_TIME = 1

        # scroll down until the page height stops growing
        last_height = browser.execute_script("return document.body.scrollHeight")
        while True:
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(SCROLL_PAUSE_TIME)
            new_height = browser.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height

        # keep clicking the "load more" button until it can no longer be found
        while True:
            try:
                load_more = browser.find_element_by_css_selector(
                    ".btn.list-view__load-more.list-view__load-more--js")
                load_more.click()
                time.sleep(2)
            except NoSuchWindowException:
                pass
            except Exception:
                break

        # collect the product URLs that have been loaded so far
        try:
            for product in browser.find_elements_by_css_selector(
                    ".list-view__product.list-view__product--js"):
                product_url = product.get_attribute("data-url-ie8")
                print(product_url)
                f.write("loadlink," + dotnum + "," + product_url + "\n")
        except (NoSuchWindowException, NoSuchElementException):
            pass
A sample link is:

Using the above script I only get 200 product links, whereas the page contains 9748 products. I want to extract all of the links; if anyone can help me, it would be a great help.

I think you are overcomplicating this beyond what you need. I suggest you use Scrapy standalone (you don't need Selenium) and follow the pagination link hidden in the page to walk through all of the pages. Have a look at the source:
<section class="searchControls paginator-control">
<a
href="/category/drill-bushings/machine-tool-accessories/machining/ecatalog/N-hg1?searchRedirect=products&requestedPage=2"
class="btn list-view__load-more list-view__load-more--js"
data-current-page="1"
data-product-offset="32"
data-total-products="9749"
data-page-url="/category/drill-bushings/machine-tool-accessories/machining/ecatalog/N-hg1?searchRedirect=products"
id="list-view__load-more--js">
View More
</a>
</section>
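The data attributes on that link are worth reading: data-total-products="9749" confirms the catalogue size, and data-product-offset="32" shows that each request returns 32 products, so walking this link to the end means roughly 9749 / 32 ≈ 305 requests; the 200 links you are currently getting amount to only the first few pages.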
I suggest you rewrite this code to walk that pagination block instead; you will end up with a much simpler and more reliable solution.
For a basic example, see this.

I can't get it working; I'm new to Scrapy. Could you help me build the script? The products only become visible once you scroll down to that section.
# go through the pagination links to access the infinite scroll
# (per the source above, the load-more link sits inside
# <section class="searchControls paginator-control">, so match on the class)
next_page = response.css('.paginator-control a::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse_item)
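Putting the pieces together, here is a minimal sketch of what such a spider could look like. It is untested: the spider name and the start URL are placeholder guesses based on the paths in the question, and the product selector (.list-view__product--js with its data-url-ie8 attribute) is simply carried over from the Selenium script above; adjust them to the real page.

import scrapy


class ProductsSpider(scrapy.Spider):
    # hypothetical name and start URL; replace with the real category page
    name = 'products'
    start_urls = [
        'https://www.grainger.com/category/drill-bushings/machine-tool-accessories/machining/ecatalog/N-hg1',
    ]

    def parse(self, response):
        # each product tile carries its URL in the data-url-ie8 attribute,
        # the same attribute the Selenium script reads
        for url in response.css(
                '.list-view__product.list-view__product--js::attr(data-url-ie8)').extract():
            yield {'link': response.url, 'sublink': response.urljoin(url)}

        # follow the hidden "View More" link until no further page exists
        next_page = response.css('.paginator-control a::attr(href)').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Running it with scrapy runspider products_spider.py -o products.csv writes the collected links straight to a CSV file, so no manual file handling is needed.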