Python: Scrapy project, scraping a schedule


python xpath web-scraping scrapy


So I'm trying to scrape the schedule on this page …

… using this code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class SchemaSpider(BaseSpider):
    name = "schema"
    allowed_domains = ["http://stats.swehockey.se/"]
    start_urls = [
        "http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tbody/tr')

        for row in rows:
            date = row.select('/td[1]/div/span/text()').extract()
            teams = row.select('/td[2]/text()').extract()

            print date, teams

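But I can't get it to work. What am I doing wrong? I've been trying to figure it out myself for a couple of hours now, but I have no idea why my XPath doesn't work properly.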

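Two problems:

- `tbody` is a tag that modern browsers add when they render the page. Scrapy simply doesn't see it in the HTML, so an XPath that includes it matches nothing.
- The XPaths for the date and the teams weren't right: you should use relative XPaths (`.//`), because a path that starts with `/` is evaluated from the document root rather than from the current row. The td indexes were also wrong: they should be 2 and 3, not 1 and 2.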
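Both points can be checked offline. The sketch below is only an illustration: it uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors, and the `PAGE` markup, dates, and team names are made-up sample data, not the real page. ElementTree supports only a subset of XPath (no `text()`, for instance), so `findtext` is used to read the cell contents.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup (an assumption, not the real page): note that the
# source contains no <tbody>, just as in the HTML a scraper downloads.
PAGE = """
<html><body>
<table class="tblContent">
  <tr><td>1</td><td><div><span>2013-09-14</span></div></td><td>Team A - Team B</td></tr>
  <tr><td>2</td><td><div><span>2013-09-17</span></div></td><td>Team C - Team D</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# The path copied from a browser's inspector includes tbody and matches nothing.
with_tbody = root.findall('.//table[@class="tblContent"]/tbody/tr')
# Dropping tbody matches the rows that are actually in the source.
without_tbody = root.findall('.//table[@class="tblContent"]/tr')

schedule = []
for row in without_tbody:
    # Paths starting with ".//" are evaluated relative to the current row.
    date = row.findtext('.//td[2]/div/span')
    teams = row.findtext('.//td[3]')
    schedule.append((date, teams))

print(len(with_tbody), len(without_tbody))
print(schedule)
```

The same idea carries over to the spider: select the rows without `tbody`, then query each row with a path that starts with `.//`.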

Here's the complete code with some modifications (working). The listing appears at the end of this page.

Hope that helps.


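Thanks a lot! I'm new to both Python and Scrapy, so I guess I still have a few things to figure out. The next step is to split the date and teams into Google Calendar format and filter them so that only home games for AIK or Djurgårdens are added (if there are any). Would you mind helping me with that, so I have an example to look at later?

You're welcome. Sure, but consider asking a separate question, and make sure you're specific.

Understood. I've posted a new question. If you have time, I'd gladly accept your help. Cheers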


from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SchemaItem(Item):
    date = Field()
    teams = Field()


class SchemaSpider(BaseSpider):
    name = "schema"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["stats.swehockey.se"]
    start_urls = [
        "http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # No tbody here: the downloaded HTML does not contain that tag
        rows = hxs.select('//table[@class="tblContent"]/tr')

        for row in rows:
            item = SchemaItem()
            # Relative XPaths (.//) are evaluated from the current row
            item['date'] = row.select('.//td[2]/div/span/text()').extract()
            item['teams'] = row.select('.//td[3]/text()').extract()

            yield item
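
For the follow-up raised in the comments (keeping only the home games of AIK or Djurgårdens), one possible starting point is sketched below. It rests on assumptions the thread does not confirm: that the teams cell reads like "Home - Away", and that each item holds lists, as `extract()` returns. The helper names `split_teams` and `home_games` and the sample data are hypothetical, and conversion to Google Calendar format is not attempted.

```python
# Hypothetical club filter; names are matched as prefixes of the home team.
HOME_CLUBS = ("AIK", "Djurgårdens")


def split_teams(teams_text):
    """Split 'Home - Away' into (home, away); return None if it does not fit."""
    parts = [part.strip() for part in teams_text.split(" - ")]
    if len(parts) != 2:
        return None
    return parts[0], parts[1]


def home_games(items, clubs=HOME_CLUBS):
    """Keep only the games whose home team starts with one of the club names."""
    selected = []
    for item in items:
        # extract() returns a list, so take the first value if there is one.
        date = item["date"][0].strip() if item["date"] else None
        teams = item["teams"][0] if item["teams"] else ""
        pair = split_teams(teams)
        if date is None or pair is None:
            continue  # header rows or rows in an unexpected format
        home, away = pair
        if home.startswith(clubs):
            selected.append({"date": date, "home": home, "away": away})
    return selected


# Made-up items in the shape the spider yields (assumed format).
sample = [
    {"date": ["2013-09-14"], "teams": ["AIK - Team B"]},
    {"date": ["2013-09-15"], "teams": ["Team C - AIK"]},
    {"date": ["2013-09-17"], "teams": ["Djurgårdens IF - Team D"]},
    {"date": [], "teams": []},
]
games = home_games(sample)
print(games)
```

Prefix matching keeps the filter tolerant of suffixes such as "IF", but it would need tightening if another club's name began with the same letters.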