Using Python & Scrapy

python web-scraping scrapy

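I'm new to Scrapy (and Python!), and I'm trying to scrape the commentary from the Cricinfo website. Here is an example web page:

I'm interested in scraping the numbers (e.g. 0.1) and the text next to them.

Using Firebug, I can see that the XPath for "0.1" is: /html/body/div[2]/div[3]/div[4]/div[5]/div/div/div[3]/table/tbody/tr/td[2]/div/table/tbody/tr[2]/td[1]/p

and the XPath for the text next to it is: /html/body/div[2]/div[3]/div[4]/div[5]/div/div/div[3]/table/tbody/tr/td[2]/div/table/tbody/tr[2]/td[2]/p

My spider looks like this: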

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from crictest.items import CrictestItem

class MySpider(BaseSpider):
    name = "cricinfo"
    allowed_domains = ["espncricinfo.com/"]
    start_urls = ["http://www.espncricinfo.com/champions-league-twenty20-2014/engine/match/763595.html?innings=1;view=commentary/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//html/body/div[2]/div[3]/div[4]/div[5]/div/div[3]/table/tbody/tr/td[2]/div/table/tbody/tr')
        items =[]
        for row in rows:
            item = CrictestItem()
            item['overnum'] = row.select('td[1]/p/text()').extract()
            item['overnumtext'] = row.select('td[2]/p/text()').extract()
            items.append(item)
        return items

I'm trying to loop over the rows (/tr) and return td[1]/p/text() and then td[2]/p/text() for each one. My items.py looks like:

import scrapy


class CrictestItem(scrapy.Item):
    overnum = scrapy.Field()
    overnumtext = scrapy.Field()

Running

scrapy crawl cricinfo -o items.csv -t csv

only gives me an items.csv file with no data in it.

Where am I going wrong? Any help would be greatly appreciated.

The XPath you are using is incorrect, and it is also very fragile.
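To illustrate what "fragile" means here: browser inspectors display the parsed DOM, in which the browser inserts a <tbody> into every table even when the server's HTML has none. Whether that is what happened on this particular page is an assumption. The sketch below uses only the standard library (xml.etree.ElementTree as a stand-in for Scrapy's selectors) and a hypothetical, simplified fragment to show an inspector-style absolute path matching nothing while a class-based path still works.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (not the real Cricinfo page):
# the server sends the table WITHOUT a <tbody> element.
RAW_HTML = """
<html><body>
  <div><table>
    <tr><td class="battingComms"><b>0.1</b> short ball, no run</td></tr>
    <tr><td class="battingComms"><b>0.2</b> driven for four</td></tr>
  </table></div>
</body></html>
"""

root = ET.fromstring(RAW_HTML.strip())  # root is the <html> element

# Absolute path as a browser inspector would show it, including <tbody>.
absolute = root.findall("./body/div/table/tbody/tr/td")

# Attribute-based path: independent of nesting depth and of <tbody>.
by_class = root.findall(".//td[@class='battingComms']")

print(len(absolute))  # 0
print(len(by_class))  # 2
```

The absolute path breaks as soon as one step differs from the raw HTML, while the class-based path keeps working if the surrounding layout changes.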

As I understand it, you need the bold numbers and the text next to them. I would rely on the battingComms class of the td elements:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from crictest.items import CrictestItem


class MySpider(BaseSpider):
    name = "cricinfo"
    allowed_domains = ["espncricinfo.com"]  # domain name only, no trailing slash
    start_urls = ["http://www.espncricinfo.com/champions-league-twenty20-2014/engine/match/763595.html?innings=1;view=commentary/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//td[@class="battingComms" and b]')
        for row in rows:
            item = CrictestItem()
            item['overnum'] = row.select('b/text()').extract()[0]
            item['overnumtext'] = row.select('b/following-sibling::text()').extract()[0]
            yield item

Output on the console:

{'overnum': u'0.4',
 'overnumtext': u" bingo! that's a good ol slog from van Wyk right across the line of a good length ball that nips back in. No bat involved, but loads of timber. Lovely bowling from Paris and he knows it "}
{'overnum': u'1.3',
 'overnumtext': u' and dies by his reputation. Behrendorff is assisted by some swing away, Delport flings his bat at with all his might and only ends up with an outside edge that is pouched behind the wicket. Brilliant catch from Whiteman as he leaps to his left and stretches as high as he could '}
...
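The selection logic can be tried offline without Scrapy. The following is a minimal sketch using only the standard library, run against a hypothetical fragment that imitates the assumed structure (a td with class battingComms containing a b element followed by text). ElementTree does not support the following-sibling axis, so the text after the b element is read from its .tail attribute instead, which plays the same role as b/following-sibling::text() here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating the assumed page structure.
FRAGMENT = """
<table>
  <tr><td class="battingComms"><b>0.4</b> bingo! that's a good ol slog </td></tr>
  <tr><td class="battingComms">end of over summary, no bold number</td></tr>
  <tr><td class="battingComms"><b>1.3</b> and dies by his reputation </td></tr>
</table>
"""


def extract_rows(markup):
    """Return one dict per battingComms cell that has a <b> child."""
    root = ET.fromstring(markup.strip())
    rows = []
    for cell in root.iter("td"):
        if cell.get("class") != "battingComms":
            continue
        bold = cell.find("b")
        if bold is None:
            continue  # mirrors the `and b` predicate in the XPath
        rows.append({
            "overnum": bold.text,
            # .tail is the text that directly follows the </b> tag
            "overnumtext": (bold.tail or "").strip(),
        })
    return rows


for row in extract_rows(FRAGMENT):
    print(row["overnum"], "|", row["overnumtext"])
```

Cells without a b child are skipped, so the number of items depends on how many cells on the page actually contain a bold number.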

You can get accurate results by following the example below.

Use following-sibling in the XPath to get the proper results.

The HTML code is:

<div id="provider-region-addresses">
<h3>Contact details</h3>
<h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>More information</dt>
            <dd>North Shore Hospital</dd><dt>Physical address</dt>
                <dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd><dt>Postal address</dt>
                <dd>Private Bag 93503, Takapuna, Auckland 0740</dd><dt>Postcode</dt>
                <dd>0740</dd><dt>District/town</dt>

                <dd>
                North Shore, Takapuna</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 486 8996</dd><dt>Fax</dt>
                <dd>(09) 486 8342</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>Physical address</dt>
                <dd>Helensville</dd><dt>Postal address</dt>
                <dd>PO Box 13, Helensville 0840</dd><dt>Postcode</dt>
                <dd>0840</dd><dt>District/town</dt>

                <dd>
                Rodney, Helensville</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 420 9450</dd><dt>Fax</dt>
                <dd>(09) 420 7050</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>Physical address</dt>
                <dd>Warkworth</dd><dt>Postal address</dt>
                <dd>PO Box 505, Warkworth 0941</dd><dt>Postcode</dt>
                <dd>0941</dd><dt>District/town</dt>

                <dd>
                Rodney, Warkworth</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 422 2700</dd><dt>Fax</dt>
                <dd>(09) 422 2709</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>More information</dt>
            <dd>Waitakere Hospital</dd><dt>Physical address</dt>
                <dd>55-75 Lincoln Rd, Henderson, Auckland 0610</dd><dt>Postal address</dt>
                <dd>Private Bag 93115, Henderson, Auckland 0650</dd><dt>Postcode</dt>
                <dd>0650</dd><dt>District/town</dt>

                <dd>
                Waitakere, Henderson</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 839 0000</dd><dt>Fax</dt>
                <dd>(09) 837 6634</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>More information</dt>
            <dd>Hibiscus Coast Community Health Centre</dd><dt>Physical address</dt>
                <dd>136 Whangaparaoa Rd, Red Beach 0932</dd><dt>Postcode</dt>
                <dd>0932</dd><dt>District/town</dt>

                <dd>
                Rodney, Red Beach</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 427 0300</dd><dt>Fax</dt>
                <dd>(09) 427 0391</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    </div>

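Looks more like it, but it doesn't get every number. Only 11 records? Also, how would I find out about the battingComms class? Thanks

@Del how could I know what exactly you want to get from this page?

Sorry if I wasn't clear at first. I want a CSV file with two columns. One column shows the over numbers: 0.1, 0.3 ... 19.5, 19.6. The other column shows the text next to that number on the web page.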

def parse(self, response):
        hxs = HtmlXPathSelector(response)

        practice = hxs.select('//h1/text()').extract()
        items1 = []

        results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
        for result in results:
            item = WebhealthItem1()
            #item['url'] = result.select('//dl/a/@href').extract()
            item['practice'] = practice
            item['hours'] = map(unicode.strip,
                result.select('dt[contains(.," Contact hours")]/following-sibling::dd[1]/text()').extract())
            item['more_hours'] = map(unicode.strip,
                result.select('dt[contains(., "More information")]/following-sibling::dd[1]/text()').extract())
            item['physical_address'] = map(unicode.strip,
                result.select('dt[contains(., "Physical address")]/following-sibling::dd[1]/text()').extract())
            item['postal_address'] = map(unicode.strip,
                result.select('dt[contains(., "Postal address")]/following-sibling::dd[1]/text()').extract())
            item['postcode'] = map(unicode.strip,
                result.select('dt[contains(., "Postcode")]/following-sibling::dd[1]/text()').extract())
            item['district_town'] = map(unicode.strip,
                result.select('dt[contains(., "District/town")]/following-sibling::dd[1]/text()').extract())
            item['region'] = map(unicode.strip,
                result.select('dt[contains(., "Region")]/following-sibling::dd[1]/text()').extract())
            item['phone'] = map(unicode.strip,
                result.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract())
            item['website'] = map(unicode.strip,
                result.select('dt[contains(., "Website")]/following-sibling::dd[1]/a/@href').extract())
            item['email'] = map(unicode.strip,
                result.select('dt[contains(., "Email")]/following-sibling::dd[1]/a/text()').extract())
            items1.append(item)
        return items1
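The dt[...]/following-sibling::dd[1] pattern pairs each label with the first dd that comes after it. Here is a minimal standard-library sketch of the same pairing, run on a shortened excerpt of the HTML shown earlier. The helper name dl_to_dict is hypothetical, and ElementTree stands in for Scrapy's selectors because it has no following-sibling axis. Note that unicode exists only in Python 2; in Python 3 the equivalent of map(unicode.strip, ...) is calling str.strip on each value.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the first <dl> from the HTML above.
DL = """
<dl class="clear">
    <dt>More information</dt>
        <dd>North Shore Hospital</dd><dt>Physical address</dt>
        <dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd><dt>District/town</dt>

        <dd>
        North Shore, Takapuna</dd><dt>Phone</dt>
        <dd>(09) 486 8996</dd>
</dl>
"""


def dl_to_dict(markup):
    """Map each <dt> label to the text of the <dd> that follows it."""
    dl = ET.fromstring(markup.strip())
    fields = {}
    label = None
    for child in dl:  # direct children, in document order
        if child.tag == "dt":
            label = (child.text or "").strip()
        elif child.tag == "dd" and label is not None:
            # itertext() also picks up text inside nested tags such as <a>
            fields[label] = "".join(child.itertext()).strip()
            label = None  # only the first <dd> after a <dt> is used
    return fields


record = dl_to_dict(DL)
print(record["District/town"])  # North Shore, Takapuna
```

Looking fields up by label rather than by position matters here because the dl blocks do not all contain the same entries (for example, "More information" is missing from some of them).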