Python: Concatenating nested XPath text in Scrapy

python html xpath web-scraping scrapy

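I have been trying to concatenate some nested text using XPath in Scrapy. I believe it uses XPath 1.0? I have looked at many other posts, but nothing seems to get me what I want.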

Here is the specific part of the HTML (from the actual page):

I tried using an approach from another post,

but this only gives me back:

[u'<td colspan="5" style="border-bottom: #BCD9E3 3px solid">Finn and Princess Bubblegum must protect the <a href="/wiki/Candy_Kingdom" title="Candy Kingdom">Candy Kingdom</a> from a horde of candy zombies they accidentally created.\n</td>',

This gives me back:

[u'Finn and Princess Bubblegum must protect the ', u'Candy Kingdom', u' from a horde of candy zombies they accidentally created.\n', u'Finn must travel to ', u'Lumpy Space', u' to find a cure that will save Jake, who was accidentally bitten, (more stuff here)]

Does anyone know an XPath trick for concatenation?

Thanks!

Edit: spider code

from scrapy import Selector, Spider


class AT_Episode_Detail_Spider_2(Spider):

    name = "ep_detail_2"
    allowed_domains = ["adventuretime.wikia.com"]
    start_urls = [
        "http://adventuretime.wikia.com/wiki/List_of_episodes"
    ]

    def parse(self, response):
        sel = Selector(response)

        # Returns every text node of every description cell as one flat list,
        # so the pieces of a single description are not grouped together.
        description = sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']//text()").extract()
        print(description)

Concatenate it manually via join():

description = " ".join(sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']//text()").extract())
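A note on the choice of separator, as a minimal sketch. The fragments below are the first three strings from the output shown in the question; the variable names are only illustrative. Each text node already carries its own surrounding spaces, so joining with " " doubles them, while joining with an empty string keeps the original spacing.

```python
# Text fragments as returned by .extract() for the first description cell
# (copied from the output shown in the question).
fragments = [
    u'Finn and Princess Bubblegum must protect the ',
    u'Candy Kingdom',
    u' from a horde of candy zombies they accidentally created.\n',
]

# Joining with a space adds a second space around the link text.
spaced = " ".join(fragments)

# Joining with an empty string preserves the spacing of the page itself;
# strip() drops the trailing newline.
tight = "".join(fragments).strip()

# Optional: collapse any run of whitespace into a single space.
normalized = " ".join("".join(fragments).split())

print(repr(spaced))
print(repr(tight))
```

Which variant fits depends on the markup: if the page does not put spaces around its links, a " " separator followed by whitespace normalization may be the safer option.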

Or use the Join() processor together with an Item Loader.

Here is sample code that gets a list of episode descriptions:

def parse(self, response):
    # One joined string per description cell; text inside <sup> is skipped.
    description = [" ".join(row.xpath(".//text()[not(ancestor::sup)]").extract())
                   for row in response.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan]")]
    print(description)

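join() isn't quite what I want. I should have been more specific. Notice that the data I want back contains more than one string. I only want to combine each piece of text with its own a tags, not merge all of the text and a tags together. I'll update my html shortly...

@pyramidface You can, and probably still should, use join() to solve it, except that you would need to iterate over the rows to build a list of descriptions. Could you also post the complete spider code you have, so that I can get more context? Thanks.

OK, I've updated the answer to include code that gets a list of descriptions. Is this what you were asking about? Thanks.

Ah, awesome! Thanks. But this raises a few questions for me... Can you tell me how/why you are using response instead of sel? I tried printing out a row without the leading dot in .//text() and the output was crazy, haha. What does the dot do? Also... I noticed that the text inside sup (at the same level as the a tags), such as [226], also gets merged into the final output. How could I ignore sup with the code you wrote?

I have found a solution, but not through XPath. Maybe just do some Python parsing to remove anything in square brackets.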
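The plain-Python fallback mentioned in the last comment could look roughly like this. It is only a sketch: strip_bracketed is a hypothetical helper name, and the pattern assumes that every bracketed span is an unwanted footnote marker such as [226], which may not hold for every description.

```python
import re

# Matches a bracketed span such as "[226]" plus any whitespace before it.
BRACKETED = re.compile(r"\s*\[[^\]]*\]")


def strip_bracketed(text):
    """Remove bracketed spans and trim surrounding whitespace."""
    return BRACKETED.sub("", text).strip()


print(strip_bracketed(u"they accidentally created.[226]\n"))
```

Filtering in XPath with not(ancestor::sup) is more targeted, since it removes only text that really sits inside a sup element and leaves any legitimate square brackets in the description alone.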
