Python Scrapy: print results in real time instead of waiting for the crawl to complete
Can Scrapy print results in real time? I plan to crawl a large website and worry that if my VPN connection drops, the whole crawl will be wasted because nothing will have been written out. I am currently using a VPN and rotating user agents; I know that rotating proxies instead of a VPN would be ideal, but that is planned for a future upgrade of the script.
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Module-level file handle: every print() below writes one line to it
results = open('results.csv', 'w')

class TestSpider(CrawlSpider):
    name = "test"

    # Allowed domains and start URLs are read from plain-text files, one per line
    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]
    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]

    rules = (Rule(LinkExtractor(allow=('/'), deny=('9', '10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            result = re.findall(pattern, response.text)
            print(response.url, ">", pattern, '>', len(result), file=results)
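A side note on this first version: output sent to a file handle with print() is buffered, so results.csv can stay empty until the process exits cleanly, which defeats the point of real-time output. Passing flush=True is a minimal tweak (standard Python, nothing Scrapy-specific) that pushes each line to disk as soon as it is printed:

            # flush=True forces the line to disk immediately instead of leaving it in the buffer
            print(response.url, ">", pattern, '>', len(result), file=results, flush=True)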
Many thanks.
Update
Harada's script works perfectly; nothing needed changing apart from how the file is saved. All I had to do was make a few adjustments to my current files, shown below, and everything worked.
The spider - items defined
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import TestItem

class TestSpider(CrawlSpider):
    name = "test"

    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]
    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]

    rules = (Rule(LinkExtractor(allow=('/'), deny=('9', '10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            # Create a fresh item per pattern so each yield carries its own data
            items = TestItem()
            result = re.findall(pattern, response.text)
            items['url'] = response.url
            items['pattern'] = pattern
            items['count'] = len(result)
            yield items
items.py - add the items as fields
import scrapy

class TestItem(scrapy.Item):
    url = scrapy.Field()
    pattern = scrapy.Field()
    count = scrapy.Field()
settings.py - uncomment the item pipeline
ITEM_PIPELINES = {
    'test.pipelines.TestPipeline': 300,
}
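With the three files above in place, the crawl is started the usual way (assuming a standard Scrapy project layout; the project name test matches the pipeline path in settings.py):

scrapy crawl test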
You can add a script to a pipeline so that the data you have at that point gets saved to a file. Add a counter as a variable on the pipeline, and whenever it reaches some threshold (say, every 1000 items yielded), write to a file. The code would look something like this; I tried to keep it as general as possible.
class MyPipeline:
    def __init__(self):
        # variable that keeps track of the total number of items yielded
        self.total_count = 0
        self.data = []

    def process_item(self, item, spider):
        self.data.append(item)
        self.total_count += 1
        if self.total_count % 1000 == 0:
            # write to your file of choice....
            # I'm not sure how your data is stored throughout the crawling process
            # If it's a variable of the pipeline like self.data,
            # then just write that to the file
            with open("test.txt", "w") as myfile:
                myfile.write(f'{self.data}')
        return item
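A variation on the same idea, if truly line-by-line persistence is the goal: open the file once when the spider starts and append every item as it arrives, so an interrupted crawl loses at most the row in flight. A minimal sketch; the open_spider/close_spider hooks and the csv module are standard Scrapy/Python, while the filename and field names are assumptions matching the items above:

import csv

class AppendingCsvPipeline:
    # Hypothetical alternative pipeline: persist each item the moment it is yielded
    def open_spider(self, spider):
        # newline='' is the csv-module convention for output files
        self.file = open('results_live.csv', 'a', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow([item['url'], item['pattern'], item['count']])
        self.file.flush()  # push the row to disk immediately
        return item

    def close_spider(self, spider):
        self.file.close()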
Thanks! I only started using Scrapy/Python a week ago, so I'm still figuring out how and where to add things to the process. @AJ2 No problem, I've edited my answer to make the example clearer. In theory you can (and are highly encouraged to) use feed exports; see here for more information: This worked perfectly, thank you so much! I just needed to make a few adjustments to my current files and it works flawlessly.
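For reference, the feed exports mentioned above write items out incrementally as the crawl runs, so they solve the same problem without a custom pipeline. A minimal sketch, assuming Scrapy 2.1+ (where the FEEDS setting exists); older versions expose the same behaviour through the -o command-line flag:

# settings.py - items are appended to the feed as they are scraped
FEEDS = {
    'results.csv': {'format': 'csv'},
}

Or equivalently, without touching settings.py: scrapy crawl test -o results.csv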