NLP preprocessing from HTML to text


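I see that NLTK recommends using BeautifulSoup's get_text() to preprocess HTML into text for subsequent NLP analysis. But it does not seem to work well. In the example below, xyz and abc are joined together, although they should not be. Is there a better preprocessing utility for converting HTML to text in NLP applications?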

$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1:

html_doc = "<h2>xyz</h2><p>abc</p>"

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.get_text())
$ ./main.py 
xyzabc
I suggest using the html2text tool. Here is a test run from the command line:

$ html2text --ignore-links https://content.cultureandempire.com/chapter1.html 

  * Culture & Empire
  *   * __Introduction
  * __**1.** Preface 
  * __**2.** Chapter 1 - Magic Machines 
  * __**3.** Chapter 2 - Spheres of Light 
  * __**4.** Chapter 3 - Faceless Societies 
  * __**5.** Chapter 4 - Freedom in Chains 
  * __**6.** Chapter 5 - Eyes of the Spider 
  * __**7.** Chapter 6 - Wealth of Nations 
  * __**8.** Chapter 7 - March of the Kaiju 
  * __**9.** Chapter 8 - The Reveal 
  * __**10.** Postface 
  * __**11.** Appendix 1 
  *   * Published with GitBook 

#  __Culture & Empire

# Chapter 1. Magic Machines

> Far away, in a different place, a civilization called Culture had taken
seed, and was growing. It owned little except a magic spell called Knowledge.

In this chapter, I'll examine how the Internet is changing our society. It's
happening quickly. The most significant changes have occurred during just the
last 10 years or so. More and more of our knowledge about the world and other
people is transmitted and stored digitally. What we know and who we know are
moving out of our minds and into databases. These changes scare many people,
whereas in fact they contain the potential to free us, empowering us to
improve society in ways that were never before possible.

## From Bricks to Bits

Otherwise, you can use … or ….
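
For the exact snippet in the question, it may also be enough to stay with BeautifulSoup and pass a separator to get_text(). This is a sketch of that option, not necessarily what the answer had in mind; separator and strip are existing get_text() parameters, while the choice of "\n" is an assumption about what suits the downstream NLP step.

```python
from bs4 import BeautifulSoup

# The snippet from the question.
html_doc = "<h2>xyz</h2><p>abc</p>"
soup = BeautifulSoup(html_doc, "html.parser")

# separator is placed between adjacent text nodes;
# strip=True trims each piece and drops whitespace-only ones.
text = soup.get_text(separator="\n", strip=True)
print(text)
# xyz
# abc
```

Note that the separator is inserted between every pair of text nodes, including those split by inline tags such as <b>, so a single space may be a safer choice when sentences contain inline markup.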

Did you see my answer? If it works or answers the question, could you mark it as correct, or at least upvote it?