Python 2.7: How to get clean results when scraping data from a website with Scrapy
I am new to Python and I am trying to scrape data from a yellow-pages site. I can scrape it, but the results are a mess. This is what I get:
2013-03-24 20:26:47+0800 [scrapy] INFO: Scrapy 0.14.4 started (bot: eyp)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware,DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled item pipelines:
2013-03-24 20:26:47+0800 [eyp] INFO: Spider opened
2013-03-24 20:26:47+0800 [eyp] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
How can I get a clean result? I only want the name, address, phone number, and link.

By the way, the code I am using for this is:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from eyp.items import EypItem

class EypSpider(BaseSpider):
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//ol[@class="result"]/li')
        items = []
        for title in titles:
            item = EypItem()
            item['title'] = title.select(".//p/text()").extract()
            item['link'] = title.select(".//a/@href").extract()
            items.append(item)
        return items
Your code is a bit messy, but I will try to help:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class EypItem(Item):
    name = Field()
    address = Field()
    phone = Field()

class eypSpider(BaseSpider):
    name = "eyp.ph"
    allowed_domains = ["eyp.ph"]
    start_urls = ["http://www.eyp.ph/home-real-estate/search/real-estate/davao/cat/real-estate-brokers"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//li/div[@class='details']")
        items = []
        for site in sites:
            itemE = EypItem()
            itemE["name"] = site.select("normalize-space(p[1]/text())").extract()
            itemE["address"] = site.select("normalize-space(p[2]/text())").extract()
            itemE["phone"] = site.select("normalize-space(p[3]/text())").extract()
            items.append(itemE)
        return items
You were missing the definition of the EypItem class, so I have suggested one. Save the above as test.py and run it from the command line:
$ scrapy runspider test.py -o items.json -t json
This will give you a JSON output file named items.json. A sample of the output looks like this:
[{"phone": ["Phone: +63(907)6390603"], "name": ["(CARLOS A. VARGAS)"], "address": ["Mezzanine Wee Eng Apartment, Guerrero Street, Davao City, Davao Del Sur"]},
{"phone": ["Phone: +63(921)9566577"], "name": ["(ROGELIO G. CARBIERO)"], "address": ["Sto. Nino Heights, Pantinople Village, Davao City, Davao Del Sur"]},
{"phone": ["Phone: +63(917)3137855"], "name": ["(FLORIZEL C. CHAVEZ)"], "address": ["12 Tulip Street, El Rio Vista Village P4a, Davao City, Davao Del Sur"]},
..........
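Note that the fields in the output above still carry labels and punctuation from the page ("Phone: " prefixes, parentheses around names). If you want them fully clean, one option is a small post-processing pass over the exported items. This is just an illustrative sketch: the field layout mirrors the sample output above, and the exact prefixes and punctuation are assumptions about this particular page:

```python
def clean_item(item):
    # Each field was produced by .extract(), so it is a one-element list.
    name = item["name"][0].strip("() ")           # drop surrounding parentheses
    phone = item["phone"][0]
    if phone.startswith("Phone: "):               # drop the page's label
        phone = phone[len("Phone: "):]
    address = item["address"][0].strip()
    return {"name": name, "phone": phone, "address": address}

raw = {"phone": ["Phone: +63(907)6390603"],
       "name": ["(CARLOS A. VARGAS)"],
       "address": ["Mezzanine Wee Eng Apartment, Guerrero Street, Davao City, Davao Del Sur"]}
print(clean_item(raw))
```

The same cleanup could also be done inside the spider itself, but keeping extraction and cleaning separate makes each step easier to debug.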
It seems that with item['title'] you are selecting every p element under your selection. Shouldn't you be more precise about the content you want? If you want to scrape the name, phone number, address, and link, should your item really have only title and link fields? And shouldn't you select the links you want more precisely, rather than every link as you do now? You should study the basic manual before asking for help, don't you think? I have raised three questions here; please read them carefully.

I see. The definition is in another file: from eyp.items import EypItem. But good point. What is the deal with normalize-space here?

normalize-space is an XPath call that removes whitespace. As mentioned elsewhere, an Item Loader may be a more appropriate approach.
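For reference, the whitespace handling that XPath's normalize-space() performs (trim both ends, collapse each internal run of whitespace into a single space) can be reproduced in plain Python, which is handy if you would rather post-process extracted strings instead. This is a small illustrative sketch, not Scrapy-specific code:

```python
def normalize_space(text):
    # Mimics XPath normalize-space(): str.split() with no argument
    # splits on any whitespace run and discards empty strings, so
    # joining with " " trims the ends and collapses internal runs.
    return " ".join(text.split())

print(normalize_space("  Phone:   +63(907)6390603 \n"))  # Phone: +63(907)6390603
```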