Python: Scrapy web crawl gone wrong
I'm new to Scrapy and am trying to learn it by crawling yellowpages.com. My goal is to write Python code that fills in the search fields on the yellowpages.com home page (business and location) and then scrapes the resulting URLs. My code looks like this:
import scrapy
from scrapy.spiders import Spider
from scrapy.selector import Selector
from spider.items import Website


class YellowPages(Spider):
    name = "yellow"
    allowed_domains = ["yellowpages.com"]
    start_urls = [
        "http://www.yellowpages.com/"
    ]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='search-form']",
            formdata={
                "query": "business",
                "location": "78735"},
            callback=self.after_results
        )

    def after_results(self, response):
        self.logger.info("info msg")
I want to search for "business" in location "78735". However, these are not the values that get passed to the site. My log looks like this:
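2016-01-28 23:55:36 [scrapy] DEBUG: Crawled (200) (referer: None)
2016-01-28 23:55:36 [scrapy] DEBUG: Crawled (200) (referer: http://www.yellowpages.com/)
In the second URL, the term Los+Angeles was somehow inserted. When I fill in the search fields and submit them manually, the URL looks like this: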
http://www.yellowpages.com/search?search_terms=business&geo_location_terms=78735
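Can anyone tell me what went wrong and how to fix it?
Many thanks.
FYI, here is part of the HTML source of the yellowpages.com home page:
What do you want to find? - Search by business name or keyword
Where?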
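FormRequest.from_response matches the keys in formdata against the name attributes of the form's input fields. The URL produced by a manual search shows that those names are search_terms and geo_location_terms, so query and location do not override anything and the form's pre-filled values are submitted as well. Set the search_terms and geo_location_terms form parameters instead: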
def parse(self, response):
    return scrapy.FormRequest.from_response(
        response,
        formxpath="//form[@id='search-form']",
        formdata={
            "search_terms": "business",
            "geo_location_terms": "78735"},
        callback=self.after_results
    )
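To illustrate why the original keys led to a URL containing Los+Angeles, here is a standard-library sketch that roughly models how form defaults and formdata are merged. It is a simplified model, not Scrapy's actual code. The HTML snippet is a hypothetical stand-in for the real page: the id values and the pre-filled location are assumptions, and InputCollector and build_query are illustrative names.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical stand-in for the home page form; the id values and the
# pre-filled location are assumptions, not copied from the real page.
FORM_HTML = """
<form id="search-form" action="/search" method="get">
  <input id="query" name="search_terms" value="">
  <input id="location" name="geo_location_terms" value="Los Angeles, CA">
</form>
"""


class InputCollector(HTMLParser):
    """Collect (name, default value) pairs from <input> tags, in order."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name"):
            self.fields.append((attrs["name"], attrs.get("value") or ""))


def build_query(html, formdata):
    """Merge formdata over the form defaults and return a query string."""
    collector = InputCollector()
    collector.feed(html)
    # Defaults whose name is not overridden are kept as they are.
    merged = [(k, v) for k, v in collector.fields if k not in formdata]
    # Keys from formdata are always sent, whether or not the form has them.
    merged.extend(formdata.items())
    return urlencode(merged)


# Keys that match no field name: the defaults survive alongside them.
wrong = build_query(FORM_HTML, {"query": "business", "location": "78735"})
# Keys that match the field names: the defaults are replaced.
right = build_query(
    FORM_HTML, {"search_terms": "business", "geo_location_terms": "78735"}
)
print(wrong)
print(right)
```

Under these assumptions, the first query string keeps the pre-filled location and appends query and location as extra parameters, while the second contains only the two intended values.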
Tested with the following spider:
import scrapy
from scrapy.spiders import Spider


class YellowPages(Spider):
    name = "yellow"
    allowed_domains = ["yellowpages.com"]
    start_urls = [
        "http://www.yellowpages.com/"
    ]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='search-form']",
            formdata={
                "search_terms": "business",
                "geo_location_terms": "78735"},
            callback=self.after_results
        )

    def after_results(self, response):
        # Print the name of every business listed on the results page.
        for result in response.css("div.result a[itemprop=name]::text").extract():
            print(result)
It prints a list of businesses for "Austin, TX".
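One way to confirm which values were actually submitted is to decode the query string of the URL that was crawled, for example by passing response.url from after_results. The sketch below uses only the standard library; search_params is a hypothetical helper name, and the URL is the one from the question.

```python
from urllib.parse import parse_qsl, urlsplit


def search_params(url):
    """Return the query-string parameters of url as a dict."""
    # keep_blank_values=True keeps fields that were submitted empty.
    return dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))


url = (
    "http://www.yellowpages.com/search"
    "?search_terms=business&geo_location_terms=78735"
)
params = search_params(url)
print(params)
```

If geo_location_terms decodes to anything other than the value you passed, the form default was not overridden.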