Python scrapy xpath selector returns duplicate data


I'm trying to extract the business name and address from each listing and export them to a CSV file, but I'm having trouble with the CSV output. I think bizs = hxs.select("//div[@class='listing_content']") may be causing the problem.

yp_spider.py

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from yp.items import Biz

class MySpider(BaseSpider):
    name = "ypages"
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/sanfrancisco/restaraunts"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        bizs = hxs.select("//div[@class='listing_content']")
        items = []

        for biz in bizs:
            item = Biz()
            item['name'] = biz.select("//h3/a/text()").extract()
            item['address'] = biz.select("//span[@class='street-address']/text()").extract()
            print item
            items.append(item)

items.py

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class Biz(Item):
    name = Field()
    address = Field()

    def __str__(self):
        return "Website: name=%s address=%s" %  (self.get('name'), self.get('address'))

The output of "scrapy crawl ypages -o list.csv -t csv" is one long list of business names followed by a list of locations, and the same data is repeated multiple times.

You should add a "." to select with a relative XPath. Here is an example from the Scrapy documentation:

At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the whole document, not only those inside the selected <div> elements:

>>> for p in divs.select('//p'):  # this is wrong - gets all <p> from the whole document
...     print p.extract()
This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.select('.//p'):  # extracts all <p> inside
...     print p.extract()
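
Applied to the spider in the question, this means prefixing the XPaths inside the for loop with a dot so they are evaluated relative to each listing_content div. A minimal sketch of the corrected parse method, keeping the question's old-Scrapy / Python 2 style (the trailing return items is an assumption on my part so that the -o CSV feed export actually receives the items):

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        bizs = hxs.select("//div[@class='listing_content']")
        items = []

        for biz in bizs:
            item = Biz()
            # the leading dot makes the XPath relative to the current listing
            item['name'] = biz.select(".//h3/a/text()").extract()
            item['address'] = biz.select(".//span[@class='street-address']/text()").extract()
            items.append(item)

        # assumed addition: return the items so the CSV exporter can write them
        return items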

This works great, thanks! Any help with removing the \n characters from the scraped data?
Python's str provides a replace method, which can be used to remove "\n".
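
A small sketch of what that cleanup might look like inside the loop (the clean helper is hypothetical, not part of the original answer):

def clean(values):
    # extract() returns a list of strings; drop "\n" and surrounding whitespace
    return [v.replace("\n", " ").strip() for v in values]

# inside the loop in parse():
item['name'] = clean(biz.select(".//h3/a/text()").extract())
item['address'] = clean(biz.select(".//span[@class='street-address']/text()").extract())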