Web scraping: SgmlLinkExtractor() does not extract all URLs


Hi all,

I have a strange problem.

It looks like Scrapy is not extracting all of the URLs that exist on a page. That is, it only finds/extracts the URLs found on tags like these:

Does anyone have a solution for this?


Thanks in advance.

I looked at the link you shared with Firebug, opened the Network tab, and realized this is the link you actually want:

$ scrapy shell "https://www.knaw.nl/en/members/members/@@faceted_query?b_start[]=0&version=cb403bd0d9fed8ab5ee81b142c8d1f9a"
...
>>> sel.xpath('//a/@href').extract()
[u'https://www.knaw.nl/en/members/members/8199', 
 u'https://www.knaw.nl/en/members/members/8199', 
 u'https://www.knaw.nl/en/members/members/3786', 
 u'https://www.knaw.nl/en/members/members/3786',
 ...]
This can be used in a spider like so:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request

class MySpider(BaseSpider):
    ...
    start_urls = ['https://www.knaw.nl/en/members/members/@@faceted_query?b_start[]=0&version=cb403bd0d9fed8ab5ee81b142c8d1f9a']

    def parse(self, response):
        sel = Selector(response)
        # The @@faceted_query response contains plain <a href="..."> links,
        # so the same XPath as in the shell session above works here.
        for link in sel.xpath('//a/@href').extract():
            yield Request(url=link, callback=self.some_function_to_extract_page)
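
Once the ... placeholder above is filled in with at least a name attribute, a spider saved as a standalone file can be run without a full Scrapy project using scrapy runspider (the filename here is just a placeholder):

$ scrapy runspider knaw_members.py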
With this URL you can also use a CrawlSpider together with SgmlLinkExtractor:

>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor().extract_links(response)
[Link(url='https://www.knaw.nl/en/members/members/8199', .....]
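
A minimal CrawlSpider along those lines could look like the sketch below; the spider name, the callback, and the allow pattern for the member pages are assumptions of mine, not something from the original answer:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MembersCrawlSpider(CrawlSpider):
    name = 'members_crawl'
    start_urls = ['https://www.knaw.nl/en/members/members/@@faceted_query?b_start[]=0&version=cb403bd0d9fed8ab5ee81b142c8d1f9a']

    # The allow pattern is an assumption based on the member URLs shown
    # above (e.g. .../en/members/members/8199).
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/en/members/members/\d+$'),
             callback='parse_member'),
    )

    def parse_member(self, response):
        # Hypothetical callback: extract whatever member fields you need here.
        pass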

As a side note, notice that the same links appear twice. You can either try to extract each of them only once, or let Scrapy filter them for you; that happens automatically, so by default Scrapy will not scrape them twice.
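
In the shell session above, collapsing the duplicates yourself could look something like this (just one way to do it; relying on Scrapy's default request dupefilter works equally well):

>>> seen = set()
>>> links = [href for href in sel.xpath('//a/@href').extract()
...          if not (href in seen or seen.add(href))]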

Without an example page it is hard to investigate. Have you used scrapy shell and called SgmlLinkExtractor().extract_links(response)? You could try other link extractors, such as LxmlParserLinkExtractor (from scrapy.contrib.linkextractors.lxmlhtml import LxmlParserLinkExtractor).

This is the start URL: . As you can imagine, I want to extract the URLs pointing to the members, as well as the URL pointing to the next page. I tried both of the suggestions you wrote above, but neither worked.

It looks to me like the members are loaded by some JavaScript (I can see an XHR request to https://www.knaw.nl/en/members/members/@@faceted_query?b_start%5B%5D=0…). You need to mimic that, I think (or is there a link in the page for visitors without JavaScript enabled?).

But I really don't know how to solve that, and I don't understand your answer. I'm a beginner in web development, so I would be grateful if you could give me more clues :)
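As a follow-up to the comments: against the @@faceted_query URL from the answer (rather than the JavaScript-driven members page itself), the alternative link extractor mentioned above can be tried the same way in scrapy shell. This is only a sketch, assuming the old scrapy.contrib API referenced in the comment:

>>> from scrapy.contrib.linkextractors.lxmlhtml import LxmlParserLinkExtractor
>>> LxmlParserLinkExtractor().extract_links(response)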