Extracting data from HTML tags in Python


python html python-3.x


I want to extract the text inside an HTML tag (a paragraph) in Python:

 &lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt;

 Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

 &lt;/span&gt;&lt;/p&gt;

My code is:

from HTMLParser import HTMLParser
from bs4 import BeautifulSoup

x = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;"""

p1 = HTMLParser()
p1.unescape(x)
bdy_soup = BeautifulSoup(p1.unescape(x)).get_text(separator=";")
print(bdy_soup)

This code does not return anything. Please help me get this working; any help would be appreciated.

Use html.unescape to convert the HTML entities back into plain characters, then use bs4.BeautifulSoup(html_content).text to extract the content.

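>>> x = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;"""
>>> import html
>>> xx = html.unescape(x)
>>> import bs4
>>> bs4.BeautifulSoup(xx, "html").text
' Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. '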

You can use a regular expression to extract the data between two HTML tags:

r'<title[^>]*>([^<]+)</title>'
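As a rough sketch of how that pattern could be applied to the input from the question: the string is HTML-escaped, so it has to be unescaped before a tag pattern can match. The helper name text_between_tags and the shortened sample string are assumptions made for illustration; the pattern itself is the one above with the tag name turned into a parameter.

```python
import html
import re

# Shortened sample in the same escaped form as the question's input.
escaped = (
    "&lt;p style=&quot;text-align: justify;&quot;&gt;"
    "&lt;span style=&quot;font-size: small;&quot;&gt;"
    " Irrespective of the kind of small business you own. "
    "&lt;/span&gt;&lt;/p&gt;"
)


def text_between_tags(escaped_markup, tag):
    """Hypothetical helper: return the text of the first <tag> element, or None."""
    # Turn &lt;p ...&gt; back into <p ...> so the tag pattern can match.
    markup = html.unescape(escaped_markup)
    pattern = r"<{0}[^>]*>([^<]+)</{0}>".format(re.escape(tag))
    match = re.search(pattern, markup)
    return match.group(1).strip() if match else None


print(text_between_tags(escaped, "span"))
print(text_between_tags(escaped, "p"))
```

The second call returns None because the <p> element contains a nested <span> rather than plain text, and [^<]+ cannot cross a tag boundary. For nested markup an HTML parser is usually the safer choice.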

r']*>([^

You can do it like this. First install HTMLParser and beautifulsoup4:

# Python 2 code. HTMLParser.unescape was removed in Python 3.9;
# on Python 3 use html.unescape instead.
from HTMLParser import HTMLParser
from bs4 import BeautifulSoup

p = "&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;"

p1 = HTMLParser()
# Decode the entities, parse the resulting markup, and join the text nodes with newlines.
bdy_soup = BeautifulSoup(p1.unescape(p)).get_text(separator="\n")
print bdy_soup

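Are you reading it from an HTML page or from a text file?
@prakash palnati I am reading it from a SQL table.
@s.s You can use BeautifulSoup to extract the exact data. First do import html, then html.unescape(x).
@manoj jadhav Can you explain with code?
@s.s Check my post.
That is not working. Can you help me with the code?
@s.s I modified it.
I modified my question. Please help me.
I posted my answer with the text as p={"p………gt;/p"}; it shows an error.
@s.s What is the exact input? Can you post the complete code snippet?
Exactly this: &lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;
Your code runs but does not return any output. What do I have to print?
Please add print bdy_soup.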

The code worked after installing the lxml parser. Thank you, everyone, for your help.

import html
import bs4

x = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;"""

# Decode the HTML entities first, then parse with lxml (needs: pip install lxml).
p1 = html.unescape(x)
bdy_soup = bs4.BeautifulSoup(p1, "lxml").get_text(separator="\n")
print(bdy_soup)
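If installing lxml is not an option, roughly the same result can be sketched with the Python 3 standard library alone. This swaps BeautifulSoup for a small html.parser.HTMLParser subclass; the names TextCollector and extract_text are hypothetical, and the sketch only collects text nodes rather than building a document tree.

```python
import html
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the non-empty text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def extract_text(escaped_markup, separator="\n"):
    """Unescape the entities, parse the markup, and join the text nodes."""
    collector = TextCollector()
    collector.feed(html.unescape(escaped_markup))
    collector.close()  # flush any text still buffered by the parser
    return separator.join(collector.chunks)


x = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;"""

print(extract_text(x))
```

The separator only matters when the markup holds more than one text node, for example several paragraphs in one string.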