How to extract only the text from a Scrapy selector in Python


I have this code:

   site = hxs.select("//h1[@class='state']")
   log.msg(str(site[0].extract()),level=log.ERROR)

The output is:

 [scrapy] ERROR: <h1 class="state"><strong>
            1</strong>
            <span> job containing <strong>php</strong> in <strong>region</strong> paying  <strong>$30-40k per year</strong></span>
                </h1>

Is it possible to get only the text, without any HTML tags?

I don't have a Scrapy instance running, so I can't test this, but you could try using text() in the search expression.

For example:

site = hxs.select("//h1[@class='state']/text()")

You can use BeautifulSoup to strip the HTML tags. Here is an example:

from BeautifulSoup import BeautifulSoup
''.join(BeautifulSoup(str(site[0].extract())).findAll(text=True))

You can then strip out any extra whitespace, newlines, and so on.
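As a minimal sketch of that cleanup step, the snippet below collapses every run of whitespace into a single space. The helper name collapse_whitespace and the raw string (written to resemble the text left over from the question's h1 once the tags are gone) are assumptions for illustration, not part of the original answer.

```python
def collapse_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace, so joining with " " normalizes it.
    return " ".join(text.split())


# Hypothetical input: roughly what remains of the question's <h1> after
# the tags have been removed.
raw = (
    "\n            1\n"
    "             job containing php in region paying  $30-40k per year\n"
    "                "
)

clean = collapse_whitespace(raw)
print(clean)
```

This works on the output of any of the tag-stripping approaches here, since it only looks at the resulting string.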

If you don't want to use another module, you could try a simple regular expression:

import re

# replace html tags with ' '
text = re.sub(r'<[^>]*?>', ' ', str(site[0].extract()))
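A regular expression can misfire on comments, attribute values containing ">", or HTML entities. If the goal is still to avoid third-party modules, one alternative (a different technique from the regex above, not something the original answer proposed) is the standard library's html.parser. The sketch below is only an illustration; the class name TextExtractor and the function strip_tags are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp;
        # arrive in handle_data already decoded.
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the concatenated text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<div>Please!!!<span>remove me</span></div>"))
```

Unlike the regex, this concatenates text nodes without inserting spaces where tags used to be, so adjacent elements run together unless you add separators yourself.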

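In the XPath above, you are selecting the h1 tag whose class attribute is state.

That is why it selects everything inside the h1 element.

If you only want the text of the h1 tag itself, all you need is: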

//h1[@class='state']/text()

If you want the text of the h1 tag and of its child tags, you have to use:

//h1[@class='state']//text()

So the difference is /text() for the text of a specific tag only, and //text() for the text of a specific tag together with its child tags.
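To make the difference concrete without a running Scrapy instance, here is a sketch that uses only the standard library's xml.etree.ElementTree as a stand-in. It is not Scrapy's selector API, and ElementTree's XPath subset has no text() step, so the helpers direct_text and all_text are hypothetical functions that emulate which text nodes /text() and //text() would select. The markup is the h1 from the question, which happens to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# The <h1> from the question, copied as-is.
MARKUP = """<h1 class="state"><strong>
            1</strong>
            <span> job containing <strong>php</strong> in <strong>region</strong> paying  <strong>$30-40k per year</strong></span>
                </h1>"""

h1 = ET.fromstring(MARKUP)


def direct_text(el):
    """Text nodes that are direct children of el (emulates el/text())."""
    parts = [el.text or ""]
    # Text that follows a child element is stored as that child's tail.
    parts.extend(child.tail or "" for child in el)
    return parts


def all_text(el):
    """Text nodes of el and all of its descendants (emulates el//text())."""
    return list(el.itertext())


def normalize(parts):
    """Join text nodes and collapse whitespace runs."""
    return " ".join("".join(parts).split())


direct = normalize(direct_text(h1))
everything = normalize(all_text(h1))

print(repr(direct))
print(repr(everything))
```

For this particular h1 the direct text nodes are only the whitespace between the child tags, so the first result is empty after normalization. All the visible text lives inside strong and span, which is why the descendant form is the one that returns the sentence.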

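The code below should work for you. In the question's HTML all of the visible text sits inside child tags of the h1, so it uses //text():

site = ''.join(hxs.select("//h1[@class='state']//text()").extract()).strip()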

The excellent explanation of the difference between /text() and //text() in XPath should not be underestimated.

You can use html2text:

import html2text

converter = html2text.HTML2Text()
print(converter.handle("<div>Please!!!<span>remove me</span></div>"))

You can use BeautifulSoup's get_text() function:

from bs4 import BeautifulSoup

text = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(text, "html.parser")

print(soup.get_text())