Python Scrapy output issue


python web-scraping scrapy


I'm having trouble getting my items to come out the way I want. My code is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import request
from scrapy.selector import HtmlXPathSelector
from texashealth.items import TexashealthItem

class texashealthspider(CrawlSpider):

    name="texashealth"
    allowed_domains=['jobs.texashealth.org']
    start_urls=['http://jobs.texashealth.org/search/?&q=&title=Filter%3A%20title&facility=Filter%3A%20facility&location=Filter%3A%20city&date=Filter%3A%20date']

    rules=(
    Rule(SgmlLinkExtractor(allow=("search/",)), callback="parse_health", follow=True),
    #Rule(SgmlLinkExtractor(allow=("startrow=\d",)),callback="parse_health",follow=True),
    )

    def parse_health(self, response):
        hxs=HtmlXPathSelector(response)
        titles=hxs.select('//tbody/tr/td')
        items = []

        for titles in titles:
            item=TexashealthItem()
            item['title']=titles.select('span[@class="jobTitle"]/a/text()').extract()
            item['link']=titles.select('span[@class="jobTitle"]/a/@href').extract()
            item['shifttype']=titles.select('span[@class="jobShiftType"]/text()').extract()
            item['location']=titles.select('span[@class="jobLocation"]/text()').extract()
            items.append(item)
        print items
        return items

The output, shown in JSON format, looks like this:

[
    TexashealthItem(location=[], link=[u'/job/Fort-Worth-ULTRASONOGRAPHER-II-Job-TX-76101/31553900/'], shifttype=[], title=[u'ULTRASONOGRAPHER II Job']), 
    TexashealthItem(location=[], link=[], shifttype=[u'Texas Health Fort Worth'], title=[]), 
    TexashealthItem(location=[u'Fort Worth, TX, US'], link=[], shifttype=[], title=[]), 
    TexashealthItem(location=[], link=[], shifttype=[], title=[]), 
    TexashealthItem(location=[], link=[u'/job/Kaufman-RN-Acute-ICU-Full-Time-Kaufman-Job-TX-75142/35466900/'], shifttype=[], title=[u'RN--Telemetry--Full Time--Kaufman Job']), 
    TexashealthItem(location=[], link=[], shifttype=[u'Texas Health Kaufman'], title=[]), 
    TexashealthItem(location=[u'Kaufman, TX, US'], link=[], shifttype=[], title=[]), 
    TexashealthItem(location=[], link=[], shifttype=[], title=[]), 
    TexashealthItem(location=[], link=[u'/job/Fort-Worth-NURSE-PRACTITIONER-Occ-Med-Full-Time-Alliance-Job-TX-76101/35465400/'], shifttype=[], title=[u'NURSE PRACTITIONER-Occ Med-Full Time-Alliance Job']), 
    TexashealthItem(location=[], link=[], shifttype=[u'Texas Health Alliance'], title=[]), 
    TexashealthItem(location=[u'Fort Worth, TX, US'], link=[], shifttype=[], title=[]), 
    TexashealthItem(location=[], link=[], shifttype=[], title=[])
]

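As shown above, the fields of each item come out at different intervals: the title and link appear in one entry, and the remaining fields appear in separate entries.

Is there a way to get all the fields of a job in a single item?

Thanks for your help.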

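You should loop over the table rows (tr elements), not the table cells (td elements). Each td contains only one of the spans, so every cell yields an item with a single field filled in.

I suggest using hxs.select('//table[@id="searchresults"]/tbody/tr') and then .//span... inside each loop iteration: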

# Inside parse_health: select one node per table row instead of per cell
rows=hxs.select('//table[@id="searchresults"]/tbody/tr')
items = []
for row in rows:
    item=TexashealthItem()  # fresh item for each row
    # the leading "." keeps each XPath relative to the current row
    item['title']=row.select('.//span[@class="jobTitle"]/a/text()').extract()
    item['link']=row.select('.//span[@class="jobTitle"]/a/@href').extract()
    item['shifttype']=row.select('.//span[@class="jobShiftType"]/text()').extract()
    item['location']=row.select('.//span[@class="jobLocation"]/text()').extract()
    items.append(item)
return items
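
To see why the loop target matters, here is a minimal, self-contained sketch that reproduces the effect without Scrapy. It uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors, and the SAMPLE markup is an assumption reconstructed from the output in the question (one tr per job, four td cells, the last one empty); the real page may be structured differently. The helper names texts, hrefs and extract are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed markup, guessed from the question's output; not the real page.
SAMPLE = """
<table id="searchresults">
  <tbody>
    <tr>
      <td><span class="jobTitle"><a href="/job/Fort-Worth-ULTRASONOGRAPHER-II-Job-TX-76101/31553900/">ULTRASONOGRAPHER II Job</a></span></td>
      <td><span class="jobShiftType">Texas Health Fort Worth</span></td>
      <td><span class="jobLocation">Fort Worth, TX, US</span></td>
      <td></td>
    </tr>
    <tr>
      <td><span class="jobTitle"><a href="/job/Kaufman-RN-Acute-ICU-Full-Time-Kaufman-Job-TX-75142/35466900/">RN--Telemetry--Full Time--Kaufman Job</a></span></td>
      <td><span class="jobShiftType">Texas Health Kaufman</span></td>
      <td><span class="jobLocation">Kaufman, TX, US</span></td>
      <td></td>
    </tr>
  </tbody>
</table>
""".strip()


def texts(node, path):
    """Text of every element matching path, relative to node."""
    return [el.text for el in node.findall(path)]


def hrefs(node, path):
    """href attribute of every element matching path, relative to node."""
    return [el.get("href") for el in node.findall(path)]


def extract(nodes):
    """Build one dict per node, mirroring the fields of TexashealthItem."""
    items = []
    for node in nodes:
        items.append({
            "title": texts(node, ".//span[@class='jobTitle']/a"),
            "link": hrefs(node, ".//span[@class='jobTitle']/a"),
            "shifttype": texts(node, ".//span[@class='jobShiftType']"),
            "location": texts(node, ".//span[@class='jobLocation']"),
        })
    return items


table = ET.fromstring(SAMPLE)

# Looping over cells: one mostly empty item per <td>, as in the question.
per_cell = extract(table.findall("./tbody/tr/td"))

# Looping over rows: one complete item per job.
per_row = extract(table.findall("./tbody/tr"))

for item in per_row:
    print(item)
```

With this sample, the per-cell loop produces one item per td, each with at most one span's data (the empty fourth cell gives an item with every field empty), while the per-row loop produces one item per job with all four fields filled.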

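The example output comes from the print items line. You should comment it out. When invoked with scrapy crawl texashealth -t json -o output.json, Scrapy's JSON serializer should do what you want.

I've already tried that, but I still get the same result. Any ideas? Do you think it has something to do with the fields?

I'm just saying that your example output isn't JSON. I don't really understand what you mean by the items being displayed at different intervals. What is your expected output, for this item for example?

Oh, OK. When you look at the output, you can see that location is empty. As you scroll, you'll also find that the shifttype field is empty. Only link and title are filled in. But as you keep scrolling, the next entry does have location filled in, and the same goes for shifttype. They show up as separate entries, and I'd like to know what might be causing that. You'll also find an entry where all the fields are empty. So basically, what should be a single entry with all the fields is instead split into separate entries, each holding one field.

Ah, I see. You should loop over the table rows (trs), not the table cells (tds). I would use hxs.select('//table[@id="searchresults"]/tbody/tr') and then .//span... inside each loop iteration.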