Python: How do I follow multiple links in Scrapy?

python tree scrapy

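[Scrapy and Python]

I want to follow and extract all links located at the XPath //div[@class="work_area_content"]/a, and keep traversing with the same XPath until I reach the deepest level of every link. I tried the code below. However, it only goes through the main level and does not follow each link.

I suspect this is related to the links variable, which is a list that contains no values, but I don't know why the list is empty.
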
class DatabloggerSpider(CrawlSpider):
    # The name of the spider
    name = "jobs"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ['1.1.1.1']

    # The URLs to start with
    start_urls = ['1.1.1.1/TestSuites']


    # Method for parsing items
    def parse(self, response):
        # The list of items that are found on the particular page
        items = []
        # Only extract canonicalized and unique links (with respect to the current page)
        test_str = response.text
        # Removes string between two placeholders with regex
        regex = r"(Back to)(.|\n)*?<br><br>"
        regex_response = re.sub(regex, "", test_str)
        regex_response2 = HtmlResponse(regex_response) ##TODO: fix here!

        #print(regex_response2)
        links = LinkExtractor(canonicalize=True, unique=True, restrict_xpaths = ('//div[@class="work_area_content"]/a')).extract_links(regex_response2)
        print(type(links))
        # #Now go through all the found links
        print(links)
        for link in links:
            item = DatabloggerScraperItem()
            item['url_from'] = response.url
            item['url_to'] = link.url
            items.append(item)
            print(items)
        yield scrapy.Request(links, callback=self.parse, dont_filter=True)

        #Return all the found items
        return items

I think you should use SgmlLinkExtractor with the follow=True parameter set, for example:

links = SgmlLinkExtractor(follow=True, restrict_xpaths = ('//div[@class="work_area_content"]/a')).extract_links(regex_response2)

Because you are using a CrawlSpider, you should define rules; take a look at a complete example.