Python Scrapy: print results in real time instead of waiting for the crawl to complete
Can Scrapy print results in real time? I plan to crawl a large website and worry that if my VPN connection drops, the whole crawl will be wasted because nothing will have been written out. I am currently using a VPN and rotating user agents; I know that rotating proxies instead of a VPN would be ideal, but that is planned for a future upgrade of the script.
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Module-level file handle: every print() below writes one line to it
results = open('results.csv', 'w')

class TestSpider(CrawlSpider):
    name = "test"

    # Allowed domains and start URLs are read from plain-text files, one per line
    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]
    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]

    rules = (Rule(LinkExtractor(allow=('/'), deny=('9', '10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            result = re.findall(pattern, response.text)
            print(response.url, ">", pattern, '>', len(result), file=results)
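A side note on this first version: output sent to a file handle with print() is buffered, so results.csv can stay empty until the process exits cleanly, which defeats the point of real-time output. Passing flush=True is a minimal tweak (standard Python, nothing Scrapy-specific) that pushes each line to disk as soon as it is printed:

            # flush=True forces the line to disk immediately instead of leaving it in the buffer
            print(response.url, ">", pattern, '>', len(result), file=results, flush=True)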
Many thanks.
Update
Harada's script works perfectly; nothing needed changing apart from how the file is saved. All I had to do was make a few adjustments to my current files, shown below, and everything worked.
The spider - items defined
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import TestItem

class TestSpider(CrawlSpider):
    name = "test"

    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]
    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]

    rules = (Rule(LinkExtractor(allow=('/'), deny=('9', '10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            # Create a fresh item per pattern so each yield carries its own data
            items = TestItem()
            result = re.findall(pattern, response.text)
            items['url'] = response.url
            items['pattern'] = pattern
            items['count'] = len(result)
            yield items
items.py - add the items as fields
import scrapy

class TestItem(scrapy.Item):
    url = scrapy.Field()
    pattern = scrapy.Field()
    count = scrapy.Field()
settings.py - uncomment the item pipeline
ITEM_PIPELINES = {
    'test.pipelines.TestPipeline': 300,
}
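With the three files above in place, the crawl is started the usual way (assuming a standard Scrapy project layout; the project name test matches the pipeline path in settings.py):

scrapy crawl test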
You can add a script to a pipeline so that the data you have at that point gets saved to a file. Add a counter as a variable on the pipeline, and whenever it reaches some threshold (say, every 1000 items yielded), write to a file. The code would look something like this; I tried to keep it as general as possible.
class MyPipeline:
    def __init__(self):
        # variable that keeps track of the total number of items yielded
        self.total_count = 0
        self.data = []

    def process_item(self, item, spider):
        self.data.append(item)
        self.total_count += 1
        if self.total_count % 1000 == 0:
            # write to your file of choice....
            # I'm not sure how your data is stored throughout the crawling process
            # If it's a variable of the pipeline like self.data,
            # then just write that to the file
            with open("test.txt", "w") as myfile:
                myfile.write(f'{self.data}')
        return item
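A variation on the same idea, if truly line-by-line persistence is the goal: open the file once when the spider starts and append every item as it arrives, so an interrupted crawl loses at most the row in flight. A minimal sketch; the open_spider/close_spider hooks and the csv module are standard Scrapy/Python, while the filename and field names are assumptions matching the items above:

import csv

class AppendingCsvPipeline:
    # Hypothetical alternative pipeline: persist each item the moment it is yielded
    def open_spider(self, spider):
        # newline='' is the csv-module convention for output files
        self.file = open('results_live.csv', 'a', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow([item['url'], item['pattern'], item['count']])
        self.file.flush()  # push the row to disk immediately
        return item

    def close_spider(self, spider):
        self.file.close()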
Thanks! I only started using Scrapy/Python a week ago, so I'm still figuring out how and where to add things to the process. @AJ2 No problem, I've edited my answer to make the example clearer. In theory you can (and are highly encouraged to) use feed exports; see here for more information: This worked perfectly, thank you so much! I just needed to make a few adjustments to my current files and it works flawlessly.
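For reference, the feed exports mentioned above write items out incrementally as the crawl runs, so they solve the same problem without a custom pipeline. A minimal sketch, assuming Scrapy 2.1+ (where the FEEDS setting exists); older versions expose the same behaviour through the -o command-line flag:

# settings.py - items are appended to the feed as they are scraped
FEEDS = {
    'results.csv': {'format': 'csv'},
}

Or equivalently, without touching settings.py: scrapy crawl test -o results.csv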