Python BeautifulSoup not parsing HTML correctly

So, I have the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'
soup = BeautifulSoup(html, "lxml")

print(soup.get_text())

But the output is empty, whereas for other HTML samples it works fine. The HTML looks like this because it was extracted from a table.

html = '<p>Content</p></td></table>'

For example, this one works fine. Any help?

Edit: I know the HTML is invalid, but the second HTML example is also invalid and it still works.

Your HTML is not valid HTML. Why don't you change it to the following?

html = '<table><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'

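But there is probably something missing before the sample you posted. Where does the HTML code come from?

If the consistent problem is a missing opening tag, you can use a regular expression to work out what it should be, like this: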

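from bs4 import BeautifulSoup
import re

html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'
# Collect every closing tag, in order of appearance.
closing_tags = re.findall(r'</[a-z]*>', html)
# If the fragment starts with a different closing tag than it ends with,
# assume the opening tag for the final closing tag is missing and prepend it.
if closing_tags and closing_tags[0] != closing_tags[-1]:
    html = closing_tags[-1].replace('/', '') + html

soup = BeautifulSoup(html, "lxml")
print(soup.get_text())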
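
A different approach, as a rough sketch rather than a tested fix for every parser: since the trouble seems to come from the stray closing tags at the start of the fragment, you could simply drop them before parsing. The helper name `strip_leading_closing_tags` and the pattern below are my own assumptions, not part of Beautiful Soup, and the sample string is shortened.

```python
import re

# Matches a run of closing tags (optionally separated by whitespace)
# at the very beginning of the string. Hypothetical helper pattern.
LEADING_CLOSERS = re.compile(r'^(?:\s*</[A-Za-z][^>]*>)+')

def strip_leading_closing_tags(fragment):
    """Return the fragment without the closing tags it starts with."""
    return LEADING_CLOSERS.sub('', fragment)

html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr></table>'
cleaned = strip_leading_closing_tags(html)
print(cleaned)
```

The cleaned string can then be passed to `BeautifulSoup(cleaned, "lxml")`. Whether lxml then returns the text is worth checking on your own version, because the fragment still has no opening `<table>`.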

This is because lxml has trouble parsing invalid HTML. Use html.parser instead of lxml:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr><tr>  <td colspan="3">   Data I want  </td> </tr> <tr><td colspan="3">   Data I want  </td> </tr></table>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.get_text())
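
To see why this parser copes with the fragment, it may help to look at what the tokenizer underneath reports. Beautiful Soup's 'html.parser' option is built on the standard library's `html.parser.HTMLParser`. The sketch below uses only that class, with a shortened sample string and a hypothetical `EventLogger` subclass of my own. It logs the events in order, so you can see that the stray closing tags arrive as ordinary end-tag events and the text still comes through.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record the events the standard-library parser emits, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the log readable.
        if data.strip():
            self.events.append(("text", data.strip()))

html = '</p></td></tr><tr><td colspan="3">   Data I want  </td></tr></table>'
logger = EventLogger()
logger.feed(html)
logger.close()
for event in logger.events:
    print(event)
```

The leading `</p></td></tr>` show up as three end-tag events with nothing open to close. As far as I can tell, Beautiful Soup just ignores end tags that match no open element, which would explain why the text survives with this parser.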

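I know it isn't valid HTML, because I extracted it from a large source file. But the second example I gave is also invalid HTML, and it parses fine and outputs the content.

I guess BeautifulSoup only handles some invalid HTML, not all of it. It seems that nonsensical tags at the end of the string are not a problem, but starting with them is.

I don't see why extracting it from a large file would explain or excuse the HTML being invalid. It seems you should extract it better. Do you need help with that?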

With html.parser, the script above prints:

 Data I want      Data I want       Data I want