Scrapy: "Older" pages with Scrapy, Rules, and Link Extractors

web-scraping scrapy

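I've been working on a project with Scrapy. With help from this lovely community, I managed to scrape the first page of this website. I'm also trying to scrape the information on the "Older" pages. I have read up on CrawlSpider, Rules, and LinkExtractor, and I believed I had the right code. I want the spider to run the same loop on each subsequent page. Unfortunately, when I run it, it just spits out the first page and never continues to the "Older" pages.

I'm not sure what I need to change and would really appreciate your help. Some posts go back to February 2004... I'm new to data mining and not sure whether scraping every single post is a realistic goal. If it is, I'd like to. Any help is appreciated. Thanks.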

import scrapy
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors import LinkExtractor



class Roto_News_Spider2(crawlspider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.rotoworld.com/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse_page", follow= True),)


    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position= item.xpath(".//div[@class='player']/text()").extract()[0].replace("-","").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player,"Position": position, "Team": team,"Report":report,"Impact":impact,"Date":date,"Source":source}

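My suggestion: Selenium

If you want to change pages automatically, you can use Selenium.

Selenium lets you interact with the page: click buttons, type into inputs, and so on. You would need to change your code to scrape the data and then click the "Older" button; the page then changes and you continue scraping. Selenium is a very useful tool, and I'm using it right now in a personal project. You can take a look at how it works. For the page you're trying to scrape, you can't just change the link to reach the next page, so you need Selenium to move between pages.

Hope that helps.
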
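Selenium isn't needed in this case. Before scraping, open the URL in your browser, press F12 to open the developer tools, and watch the traffic in the Network tab. When you press "Next" or "Older", a new set of HTTP requests shows up there, including the POST that the button triggers. That gives you everything you need. Once you understand how it works, you can write a working spider: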
import scrapy
from scrapy import FormRequest


class Roto_News_Spider2(scrapy.Spider):
    # A plain Spider is used instead of CrawlSpider: the "Older" button is an
    # <input> that submits a form, not a link, so a Rule/LinkExtractor has
    # nothing to follow. The Scrapy docs also advise against overriding
    # parse() in a CrawlSpider.
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.<DOMAIN>/playernews/nfl/football/',
    ]

    def parse(self, response):
        # Each news entry sits in its own <div class="pb">.
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position = item.xpath(".//div[@class='player']/text()").extract()[0].replace("-", "").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player, "Position": position, "Team": team, "Report": report, "Impact": impact, "Date": date, "Source": source}

        # Stop once the page no longer offers an "Older" button.
        older = response.css('input#cp1_ctl00_btnNavigate1')
        if not older:
            return

        # Collect every field the browser would submit with the form:
        # the hidden ASP.NET state plus the navigation inputs.
        inputs = response.css('div.aspNetHidden input')
        inputs.extend(response.css('div.RW_pn input'))

        formdata = {}
        for field in inputs:
            name = field.css('::attr(name)').extract_first()
            value = field.css('::attr(value)').extract_first()
            if name:
                formdata[name] = value or ''

        # Simulate a click on the "Older" button: send click coordinates
        # (<name>.x / <name>.y) and drop the buttons that were not pressed.
        formdata['ctl00$cp1$ctl00$btnNavigate1.x'] = '42'
        formdata['ctl00$cp1$ctl00$btnNavigate1.y'] = '17'
        formdata.pop('ctl00$cp1$ctl00$btnFilterResults', None)
        formdata.pop('ctl00$cp1$ctl00$btnNavigate1', None)

        action_url = 'http://www.<DOMAIN>/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav&rw=1'

        # POST the form back and parse the returned page with this same method.
        yield FormRequest(
            action_url,
            formdata=formdata,
            callback=self.parse
        )

Be careful: you need to replace every <DOMAIN> in my code with the correct one.
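
Since the formdata part caused some confusion, here is a minimal standard-library sketch of the same idea outside Scrapy: collect every named `<input>` in the form, swap the "Older" button's own field for click coordinates, and URL-encode the result into a POST body. The HTML snippet and its values are made up for illustration, and `InputCollector` and `build_older_formdata` are hypothetical helper names. Only the button name and `hidPageLastLine` come from this thread. `__VIEWSTATE` and `__EVENTVALIDATION` are the usual ASP.NET WebForms hidden fields, and I'm assuming the real page uses them and that the button is an image input (which is why it submits `.x`/`.y`).

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical, heavily simplified stand-in for the page's ASP.NET form.
SAMPLE_HTML = """
<form method="post" action="football-player-news">
  <div class="aspNetHidden">
    <input type="hidden" name="__VIEWSTATE" value="abc123" />
    <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  </div>
  <div class="RW_pn">
    <input type="hidden" name="ctl00$cp1$ctl00$hidPageLastLine" value="25" />
    <input type="image" name="ctl00$cp1$ctl00$btnNavigate1" id="cp1_ctl00_btnNavigate1" />
  </div>
</form>
"""


class InputCollector(HTMLParser):
    """Collects name/value pairs from every named <input> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Missing values are sent as empty strings, as in the spider.
            self.fields[name] = attributes.get("value") or ""


def build_older_formdata(html, button="ctl00$cp1$ctl00$btnNavigate1"):
    """Build the fields a browser would send when "Older" is clicked."""
    collector = InputCollector()
    collector.feed(html)
    formdata = dict(collector.fields)
    # An image button is submitted as click coordinates, not under its own name.
    formdata.pop(button, None)
    formdata[button + ".x"] = "42"  # arbitrary coordinates inside the button
    formdata[button + ".y"] = "17"
    return formdata


formdata = build_older_formdata(SAMPLE_HTML)
body = urlencode(formdata)  # roughly what FormRequest puts in the POST body
print(body)
```

The hidden state fields change with every response, which is why they have to be read again from each page before requesting the next one. Scrapy also provides `FormRequest.from_response`, which automates part of this collection; I haven't checked whether it picks the right form and button on this page.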
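Well, thanks for the quick reply. I started with BeautifulSoup, then moved to Selenium when I learned I couldn't reach the other links with it. Someone suggested I check out Scrapy because it "can do what Selenium does", and so on. Haha. So you're saying there's no way to scrape the older pages with Scrapy?
It can, but not always. I've tried doing it with Scrapy, but sometimes Selenium works better because it can wait for a tag to be visible, clickable, and a lot more. You can use Selenium inside your spider; you only need a few modifications. If you look at my code, you'll see.
Okay, cool. I'll check it out. If I have questions, can I ask you here?
For sure. Edit your post to include your "new" question.
Jordan Freundlich, I haven't tested how it works past page 5. I don't know how it will behave when 'ctl00$cp1$ctl00$hidPageLastLine' equals zero.
Hey, thanks for the help! Mind explaining the formdata part?
Yeah, that confuses me haha. Can anyone help me figure it out?
Formdata is a dict of the fields to be sent in the POST request. If you explore how the site works by pressing F12 in your browser and going to the Network tab, you'll understand.
I ran the code as is. The start URL raised an error, so I took out the "<DOMAIN>". Then the code ran fine but only output the first page's information. I've looked at

If your goal is to fetch data across multiple pages, you don't need Scrapy. If you still want a Scrapy-related solution, I suggest using Splash to handle the pagination.
I would do the following to get the items (assuming you already have Selenium installed on your machine):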
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.rotoworld.com/playernews/nfl/football/")
wait = WebDriverWait(driver, 10)

while True:
    # Wait for the news entries on the current page, then print each player name.
    for item in wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='pb']"))):
        player = item.find_element(By.XPATH, ".//div[@class='player']/a").text
        print(player)  # Python 3 strings are Unicode, so no manual encoding is needed

    try:
        idate = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='date']"))).text
        if "Jun 9" in idate:  # put here any date you want to go back to (the point where the scraper stops)
            break
        # Click "Older", then wait until the old entries are detached from the DOM.
        wait.until(EC.presence_of_element_located((By.XPATH, "//input[@id='cp1_ctl00_btnNavigate1']"))).click()
        wait.until(EC.staleness_of(item))
    except WebDriverException:  # covers timeouts, e.g. when there is no "Older" button left
        break

driver.quit()
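
One note on the stop condition: `"Jun 9" in idate` only fires if the first date shown on a page is exactly that day, so the loop can run past it if the page boundaries fall differently. A rough sketch of a date comparison that could replace the substring test is below. It assumes the date text looks like "Jun 9 - 4:52 PM" (I haven't verified the site's format) and that every post is from 2018, as in the question's `+ " 2018"`. `is_before_cutoff` is a hypothetical helper name, and posts that cross a year boundary would need extra handling.

```python
from datetime import datetime


def is_before_cutoff(date_text, cutoff, year=2018):
    """Return True if the post's day is strictly earlier than the cutoff.

    Assumes text such as "Jun 9 - 4:52 PM"; only the "Jun 9" part is used,
    and the year is supplied by the caller because the page omits it.
    """
    day_part = date_text.split(" - ")[0].strip()
    parsed = datetime.strptime(f"{day_part} {year}", "%b %d %Y")
    return parsed < cutoff


cutoff = datetime(2018, 6, 9)
print(is_before_cutoff("Jun 14 - 4:52 PM", cutoff))  # False: newer than the cutoff
print(is_before_cutoff("Jun 8 - 11:05 AM", cutoff))  # True: older, so stop here
```

In the loop above, `if is_before_cutoff(idate, cutoff): break` would take the place of the substring check.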