Python: Iterating a Scrapy crawl over multiple text files containing URLs

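I have been looking for an answer to this question but could not find one here (apologies if I overlooked it).

I have 20 text files, each containing thousands of URLs (call them test1.txt through test20.txt). My goal is to loop my scrape over the URLs stored in each of these 20 text files and store the data in 20 CSV files. Is there a convenient way to do this? I have pasted my spider below; it successfully scrapes the URLs from the first file and saves the data.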

import scrapy
from scrapy.spider import Spider
from scrapy.selector import Selector
from proquest.items import ProquestItem
import HTMLParser
import xml.sax.saxutils as saxutils

class ProquestSpider(Spider):
    name = 'proquest'
    f=open("/Users/danny/tutorial/test1.txt")
    start_urls=[url.strip() for url in f.readlines()]
    def parse(self, response):
        hxs = Selector(response)
        items = []
        item = ProquestItem()
        item['date'] = hxs.xpath('./NumericPubDate/text()').extract()
        item['rectype'] = hxs.xpath('./RecordTitle/text()').extract()
        item['pubtitle'] = hxs.xpath('./PubTitle/text()').extract()
        item['fulltext'] = hxs.xpath('./FullText/text()').extract()
        items.append(item)
        with open('/Users/danny/tutorial/log.txt', 'a') as f:
##            f.write('{0}, {1}, {2}\n'.format(item['date'], item['rectype'], item['pubtitle']))
            f.write('{0}, {1}, {2}, {3}\n'.format(item['date'], item['rectype'], item['pubtitle'], item['fulltext']))
        return items
    f.close()

You can use the spider's __init__ method to read the values from the files when the spider starts.

For example:

def __init__(self, *args, **kwargs):
    super(ProquestSpider, self).__init__(*args, **kwargs)
    self.start_urls = []
    for i in range(1, 21):  # files are numbered 1 to 20
        # test1.txt ... test20.txt, as named in the question; adjust the path as needed.
        # Open for reading: mode 'w' would truncate the file instead.
        with open('test{}.txt'.format(i), 'r') as url_file:
            self.start_urls.extend(url.strip() for url in url_file.read().splitlines())

This populates start_urls with the values from the .txt files.
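Putting every URL into a single start_urls list loses track of which file each URL came from, while the question asks for one CSV per input file. Below is a minimal sketch of one way to keep that link, in plain Python and independent of Scrapy. The helper names read_urls, csv_name_for and urls_by_output are hypothetical; they are not part of Scrapy or of the answer above.

```python
import os


def read_urls(path):
    """Return the non-empty, stripped lines of a text file as a list of URLs."""
    with open(path) as url_file:
        return [line.strip() for line in url_file if line.strip()]


def csv_name_for(path):
    """Map an input file such as test1.txt to an output name such as test1.csv."""
    base, _ext = os.path.splitext(path)
    return base + ".csv"


def urls_by_output(paths):
    """Build {output CSV name: [URLs]} so each input file keeps its own output."""
    return {csv_name_for(path): read_urls(path) for path in paths}
```

Each key could then drive a separate crawl. Scrapy passes arguments given with -a on the command line to the spider's __init__ as keyword arguments, and -o writes the scraped items to a feed file, so one run per file, such as `scrapy crawl proquest -a url_file=test1.txt -o test1.csv`, is one option. Here url_file is an assumed argument name that the spider would have to read itself.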