Scrapy feed output contains the expected output several times instead of once
I wrote a spider whose sole purpose is to extract a single number from the pager at the bottom of the page, namely the maximum number of pages (for example, the number 255 in the example below). I managed to do this with a LinkExtractor based on a regular expression that the URLs of those pages match. The spider looks like this:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from Funda.items import MaxPageItem

class FundaMaxPagesSpider(CrawlSpider):
    name = "Funda_max_pages"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]
    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])  # Link to a page containing thumbnails of several houses, such as http://www.funda.nl/koop/amsterdam/p10/

    rules = (
        Rule(le_maxpage, callback='get_max_page_number'),
    )

    def get_max_page_number(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0  # Initialize the maximum page number
        page_numbers = []
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):  # Select only pages with a link depth of 3
                page_number = int(link.url.split("/")[-2].strip('p'))  # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                page_numbers.append(page_number)
                # if page_number > max_page_number:
                #     max_page_number = page_number  # Update the maximum page number if the current value is larger than its previous value
        max_page_number = max(page_numbers)
        print("The maximum page number is %s" % max_page_number)
        yield {'max_page_number': max_page_number}
If I run this with feed output by entering scrapy crawl Funda_max_pages -o Funda_max_pages.json at the command line, the resulting JSON file looks like this:
[
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257}
]
What I find strange is that the dict is output 7 times instead of once; after all, the yield statement is outside the for loop. Can anyone explain this behavior?
This is expected CrawlSpider behavior:
- the spider requests the start URL;
- the rule's LinkExtractor extracts 7 paging URLs from it;
- Scrapy downloads each of those 7 pages and calls get_max_page_number once per response;
- every call to get_max_page_number returns a dictionary.
The yield statement is outside the for loop within a single call, but the callback itself runs 7 times, so 7 items end up in the feed output.
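Given that, one way to get the item exactly once is to not follow the pager links at all and read the pager directly from the start page's response, so the callback runs a single time. A minimal sketch under that assumption (the spider below is not from the question; its name is invented and the extraction logic mirrors the original):

import scrapy
from scrapy.linkextractors import LinkExtractor

class FundaMaxPageOnceSpider(scrapy.Spider):
    name = "Funda_max_pages_once"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    def parse(self, response):
        # Same allow pattern as the question's LinkExtractor
        le = LinkExtractor(allow=r'%s+p\d+' % self.start_urls[0])
        page_numbers = [
            int(link.url.split("/")[-2].strip('p'))
            for link in le.extract_links(response)
            if link.url.count('/') == 6 and link.url.endswith('/')
        ]
        if page_numbers:
            # parse() runs only for the start URL, so exactly one item is yielded
            yield {'max_page_number': max(page_numbers)}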
As a workaround, I wrote the output to a text file in place of the JSON feed output:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

class FundaMaxPagesSpider(CrawlSpider):
    name = "Funda_max_pages"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]
    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])  # Link to a page containing thumbnails of several houses, such as http://www.funda.nl/koop/amsterdam/p10/

    rules = (
        Rule(le_maxpage, callback='get_max_page_number'),
    )

    def get_max_page_number(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0  # Initialize the maximum page number
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):  # Select only pages with a link depth of 3
                print("The link is %s" % link.url)
                page_number = int(link.url.split("/")[-2].strip('p'))  # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                if page_number > max_page_number:
                    max_page_number = page_number  # Update the maximum page number if the current value is larger than its previous value
        print("The maximum page number is %s" % max_page_number)
        place_name = link.url.split("/")[-3]  # For example, "amsterdam" in 'http://www.funda.nl/koop/amsterdam/p10/'
        print("The place name is %s" % place_name)
        filename = str(place_name) + "_max_pages.txt"  # File name with the place name as prefix
        with open(filename, 'w') as f:  # text mode ('w') so the string can be written in Python 3
            f.write('max_page_number = %s' % max_page_number)  # Write the maximum page number to a text file
        yield {'max_page_number': max_page_number}

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(FundaMaxPagesSpider)
process.start()  # the script will block here until the crawling is finished
I also adapted the spider to run as a script. The script produces a text file amsterdam_max_pages.txt containing the single line max_page_number = 257.
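As a side note, feed output does not require the command line: when running the spider as a script, the same JSON export can be configured through settings. A sketch, assuming a Scrapy version that supports the FEEDS setting (2.1 or later):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEEDS': {'Funda_max_pages.json': {'format': 'json'}},  # same as -o Funda_max_pages.json
})
process.crawl(FundaMaxPagesSpider)
process.start()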
作为一种解决方法,我将输出写入一个文本文件,以代替JSON提要输出:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
class FundaMaxPagesSpider(CrawlSpider):
name = "Funda_max_pages"
allowed_domains = ["funda.nl"]
start_urls = ["http://www.funda.nl/koop/amsterdam/"]
le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0]) # Link to a page containing thumbnails of several houses, such as http://www.funda.nl/koop/amsterdam/p10/
rules = (
Rule(le_maxpage, callback='get_max_page_number'),
)
def get_max_page_number(self, response):
links = self.le_maxpage.extract_links(response)
max_page_number = 0 # Initialize the maximum page number
for link in links:
if link.url.count('/') == 6 and link.url.endswith('/'): # Select only pages with a link depth of 3
print("The link is %s" % link.url)
page_number = int(link.url.split("/")[-2].strip('p')) # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
if page_number > max_page_number:
max_page_number = page_number # Update the maximum page number if the current value is larger than its previous value
print("The maximum page number is %s" % max_page_number)
place_name = link.url.split("/")[-3] # For example, "amsterdam" in 'http://www.funda.nl/koop/amsterdam/p10/'
print("The place name is %s" % place_name)
filename = str(place_name)+"_max_pages.txt" # File name with as prefix the place name
with open(filename,'wb') as f:
f.write('max_page_number = %s' % max_page_number) # Write the maximum page number to a text file
yield {'max_page_number': max_page_number}
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(FundaMaxPagesSpider)
process.start() # the script will block here until the crawling is finished
我还改编了spider,将其作为脚本运行。脚本将生成一个文本文件
amsterdam_max_pages.txt
,其中一行max_pages_number:257
You are still crawling 7 URLs, but now you are overwriting the same file 7 times with max_page_number = 257...
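If the text-file workaround is kept, one way to avoid the repeated overwrites is to accumulate the maximum across all callbacks and write the file a single time when the spider finishes: Scrapy calls a spider's closed(reason) method once at shutdown. A sketch of that variant (the file name is hard-coded to amsterdam_max_pages.txt here for brevity):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FundaMaxPagesWriteOnceSpider(CrawlSpider):
    name = "Funda_max_pages_write_once"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]
    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])
    rules = (
        Rule(le_maxpage, callback='get_max_page_number'),
    )

    max_page_number = 0  # running maximum, updated by every callback

    def get_max_page_number(self, response):
        for link in self.le_maxpage.extract_links(response):
            if link.url.count('/') == 6 and link.url.endswith('/'):
                page_number = int(link.url.split("/")[-2].strip('p'))
                self.max_page_number = max(self.max_page_number, page_number)

    def closed(self, reason):
        # called once when the crawl finishes, so the file is written only once
        with open('amsterdam_max_pages.txt', 'w') as f:
            f.write('max_page_number = %s' % self.max_page_number)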