Python: How can I get all the plain text from a website with Scrapy?

python html xpath web-scraping scrapy

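I would like to have all the text visible on a website, after the HTML is rendered. I'm working in Python with the Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Is there any solution for this?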

Have you tried xpath('//body//text()').re(r'(\w+)')?
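To see what the .re() suggestion would yield, here is a minimal sketch that emulates it with the standard-library re module. It assumes that the selector's .re() applies the pattern to each extracted text node and flattens the matches into one list; the text_nodes list is made-up sample data, not output from a real page.

```python
import re

# Hypothetical text nodes, as //body//text() might return them.
text_nodes = ["\n  Hello, ", "world", "!\n", "  It's 2 o'clock.  "]

# Apply the pattern to every node and flatten the matches into one list.
words = [word for node in text_nodes for word in re.findall(r"\w+", node)]

print(words)
```

Note that this keeps only runs of word characters, so punctuation is dropped and words are split at apostrophes. That may or may not be acceptable when the goal is readable text.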

Alternatively, the simplest option is to extract //body//text() and join everything found:

''.join(sel.xpath("//body//text()").extract()).strip()

where sel is a Selector instance.

Another option is to use nltk's clean_html():

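Another option is to use BeautifulSoup's get_text():

get_text()
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.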
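As a rough sketch of how get_text() could be used on a whole page, the example below removes script and style elements first and then asks for whitespace-stripped strings joined by spaces. The sample HTML is invented, and the explicit removal step is a precaution: depending on the BeautifulSoup version and parser, script contents may already be excluded from get_text().

```python
from bs4 import BeautifulSoup

# Invented sample page.
html = """<html><body>
<h1>Title</h1>
<script>var tracking = true;</script>
<style>p { color: red; }</style>
<p>Hello <b>world</b>!</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Drop elements whose contents are never rendered as visible text.
for tag in soup(["script", "style"]):
    tag.decompose()

# strip=True trims each string; separator controls how they are joined.
text = soup.get_text(separator=" ", strip=True)

print(text)
```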

Another option is to use lxml.html's text_content():

.text_content()
Returns the text content of the element, including the text content of its children, with no markup.
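Because text_content() includes the text of every child, it also returns the contents of script and style elements. A minimal sketch of one way around that, using an invented sample page, is to drop those elements from the tree before reading the text:

```python
import lxml.html

# Invented sample page.
html = """<html><body>
<h1>Title</h1>
<script>var tracking = true;</script>
<style>p { color: red; }</style>
<p>Hello <b>world</b>!</p>
</body></html>"""

tree = lxml.html.fromstring(html)

# Remove non-visible elements together with their contents.
for element in tree.xpath("//script | //style"):
    element.drop_tree()

# Collapse the remaining whitespace into single spaces.
text = " ".join(tree.text_content().split())

print(text)
```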

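xpath('//body//text()') does not always descend into the nodes under the last tag you used (body, in your case). If you run xpath('//body/node()/text()').extract(), you will see the nodes that are in your HTML body. You can try xpath('//body/descendant::text()').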

This actually works quite well, but it still returns some HTML tags and other markup.

I have deleted my question. I used the following code:

html = sel.select("//body//text()")
tree = lxml.html.fromstring(html)
item['description'] = tree.text_content().strip()

but I get an error: is_full_html = _looks_like_full_html_unicode(html) exceptions.TypeError: expected string or buffer. What went wrong?
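A likely cause of that TypeError is that the selector call returns a list of selector objects rather than a string, while lxml.html.fromstring() expects the markup itself as a string. The sketch below shows the idea with a hypothetical helper, html_to_text, and an invented page string; in a spider, the page markup would presumably come from the response (for example response.text in recent Scrapy versions), which is an assumption not confirmed by the thread.

```python
import lxml.html


def html_to_text(markup: str) -> str:
    """Return the text content of an HTML string, without tags.

    Hypothetical helper for illustration only.
    """
    if not isinstance(markup, str):
        # A selector or selector list ends up here; pass the raw markup instead.
        raise TypeError("expected the HTML markup as a string")
    tree = lxml.html.fromstring(markup)
    return tree.text_content().strip()


# Invented page standing in for the downloaded markup.
page = "<html><body><p>Hello <b>world</b></p></body></html>"
description = html_to_text(page)

print(description)
```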

Just an update: nltk has deprecated their clean_html method and instead suggests:

NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

>>> import nltk
>>> html = """
... <div class="post-text" itemprop="description">
... 
...         <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p>
... 
...     </div>"""
>>> nltk.clean_html(html)  # older nltk releases only; newer ones raise NotImplementedError
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.\nWith xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.get_text().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !

>>> import lxml.html
>>> tree = lxml.html.fromstring(html)
>>> print(tree.text_content().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !