Python Scrapy spider cannot extract web page content using XPath

python xpath web-crawler scrapy

I have a Scrapy spider that uses XPath selectors to extract page content. Could you check where I am going wrong?

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import HtmlXPathSelector
from medicalproject.items import MedicalprojectItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector 
from scrapy import Request  


class MySpider(CrawlSpider):
      name = "medical"
      allowed_domains = ["yananow.org"]
      start_urls = ["http://yananow.org/query_stories.php"]

rules = (
    Rule(SgmlLinkExtractor(allow=[r'display_story.php\?\id\=\d+']),callback='parse_page',follow=True),       
    )

def parse_items(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.xpath('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td')
    items = []
    for title in titles:
        item = MedicalprojectItem()
        item["patient_name"] = title.xpath("/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td/img[1]/text()").extract()
        item["stories"] = title.xpath("/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td/div/font/p/text()").extract()
        items.append(item)
    return(items)

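There are a lot of issues with your code, so here is a different approach.

I chose a plain scrapy.Spider rather than a CrawlSpider to have more control over the scraping process, in particular to grab the name from the query page and the story from the detail page.

I tried to simplify the XPath expressions by not descending into the (nested) table structure and looking for content patterns instead. So if you want to extract a story, there must be a link to a story.

Comments:

Your XPath is far off. You are not going to get the patient name from display_story.php; you need to pull it from the table on the query_stories.php index. Remember that you can test an XPath in the Chrome console with $x("xpath here").

I am interested in the particular story of a patient that is displayed in display_story.php. I want to extract the whole story of each individual patient, one by one.

@Ash For example, from the page display_story.php?id=1023 I need the name and the complete story.

I have run the XPath against that page. /html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td/img[1]/text() returns an empty array because it does not exist. The best I can come up with is to use '/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td/div/font/p[last()]/text()' and then name.split(" ", 1)[0] to remove the extra characters after the name, but this only works on some pages. You really need to scrape the name from the table on the index page.

I copied this XPath from Chrome and I don't understand why I can't fetch the data. OK, I can get the name from the index, but I also need to fetch the whole story.

Thank you, perfect solution!

Here is the tested code from the answer (with comments):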
# -*- coding: utf-8 -*-
import scrapy

class MyItem(scrapy.Item):
    name = scrapy.Field()
    story = scrapy.Field()

class MySpider(scrapy.Spider):

    name = 'medical'
    allowed_domains = ['yananow.org']
    start_urls = ['http://yananow.org/query_stories.php']

    def parse(self, response):

        rows = response.xpath('//a[contains(@href,"display_story")]')

        #loop over all links to stories
        for row in rows:
            myItem = MyItem() # Create a new item
            myItem['name'] = row.xpath('./text()').extract() # assign name from link
            story_url = response.urljoin(row.xpath('./@href').extract()[0]) # extract url from link
            request = scrapy.Request(url = story_url, callback = self.parse_detail) # create request for detail page with story
            request.meta['myItem'] = myItem # pass the item with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem'] # extract the item (with the name) from the response
        text_raw = response.xpath('//font[@size=3]//text()').extract() # extract the story (text)
        myItem['story'] = ' '.join(t.strip() for t in text_raw) # strip each text fragment and join them (works on Python 2 and 3, unlike unicode.strip)
        yield myItem # return the item
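
A note on why the XPath copied from Chrome found nothing: browsers typically add <tbody> elements to their DOM even when the raw HTML has none, and an <img> element never contains text, so img[1]/text() is empty on any page. The sketch below illustrates both points. It is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of Scrapy so that it runs without extra packages, and the markup in RAW_HTML is a hypothetical, simplified stand-in, not the actual source of yananow.org.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page source (an assumption for illustration).
# Note that the raw markup contains no <tbody> element.
RAW_HTML = """
<html><body><div>
  <table>
    <tr><td><img src="photo.jpg" alt="Jane" /><p>Jane's story ...</p></td></tr>
  </table>
</div></body></html>
""".strip()

root = ET.fromstring(RAW_HTML)  # root is the <html> element

# Path as copied from a browser, including the <tbody> the browser inserted.
browser_path = "./body/div/table/tbody/tr/td"
# Content-oriented path that does not depend on the exact nesting.
robust_path = ".//td"

cells_browser = root.findall(browser_path)  # no match: the source has no <tbody>
cells_robust = root.findall(robust_path)    # matches the single <td>

# <img> is an empty element, so it carries no text of its own.
img_text = root.find(".//img").text

print(len(cells_browser), len(cells_robust), img_text)
```

This is the same reasoning behind the '//a[contains(@href,"display_story")]' and '//font[@size=3]//text()' expressions in the spider above: they match on content patterns rather than on the full table nesting.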