Extracting data from HTML tags in Python
I want to extract the text of a paragraph from inside HTML tags in Python:
<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>
My code is:
from HTMLParser import HTMLParser
from bs4 import BeautifulSoup
x = """<p style="text-align: justify;"><span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""
p1 = HTMLParser()
p1.unescape(x)
bdy_soup = BeautifulSoup(p1.unescape(x)).get_text(separator=";")
print(bdy_soup)
This code does not return anything. Please help me get this working; any help would be appreciated.
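Use html.unescape to convert the HTML character entities back into plain characters, then bs4.BeautifulSoup(html_content).text to extract the content:
>>> x = """<p style="text-align: justify;"><span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""
>>> import html
>>> xx = html.unescape(x)
>>> xx
'<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>'
>>> import bs4
>>> bs4.BeautifulSoup(xx, "html").text
' Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. '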
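If bs4 is not available, the same two steps (decode the entities, then drop the tags) can be sketched with the standard library alone. This is only an illustration and not what the answer above uses: TextExtractor and extract_text are hypothetical names introduced here, and the shortened sample string is made up.

```python
import html
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(markup, separator="\n"):
    # Step 1: turn entities such as &lt; and &quot; back into characters.
    decoded = html.unescape(markup)
    # Step 2: feed the decoded markup to the parser and join the text it saw.
    parser = TextExtractor()
    parser.feed(decoded)
    parser.close()
    return separator.join(parser.chunks)


# Hypothetical entity-escaped input, shortened from the question's paragraph.
escaped = (
    "&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span&gt; "
    "Irrespective of the kind of small business you own. "
    "&lt;/span&gt;&lt;/p&gt;"
)
print(extract_text(escaped))
```

The separator argument plays the same role as in BeautifulSoup's get_text: it is placed between consecutive text nodes.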
You can use a regular expression to extract the data between two HTML tags:
r'<title[^>]*>([^<]+)</title>'
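As an illustration of the pattern above, here is a minimal sketch using the standard re module. The sample strings and the span variant of the pattern are assumptions added here. A regex like this only works when the element contains no nested tags, so a real parser is usually the safer choice for whole pages.

```python
import re

# Sample inputs made up for this illustration.
page = "<html><head><title>Small business marketing</title></head></html>"
snippet = (
    '<p style="text-align: justify;"><span style="font-size: small;">'
    " Irrespective of the kind of small business you own. </span></p>"
)

# The pattern from the answer: capture what sits between <title ...> and </title>.
title_match = re.search(r'<title[^>]*>([^<]+)</title>', page)
if title_match:
    print(title_match.group(1))

# The same idea applied to the innermost tag of the question's paragraph.
span_match = re.search(r'<span[^>]*>([^<]+)</span>', snippet)
span_text = span_match.group(1).strip() if span_match else None
print(span_text)
```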
You can do it this way. Install beautifulsoup4 first; HTMLParser is part of the Python 2 standard library, so it needs no separate installation.
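# Python 2 code (on Python 3, use html.unescape instead of HTMLParser().unescape)
from HTMLParser import HTMLParser
from bs4 import BeautifulSoup

# Triple quotes let the string contain the double quotes of the HTML attributes
p = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""

p1 = HTMLParser()
# Decode the HTML entities, parse the result, and join the text nodes with newlines
bdy_soup = BeautifulSoup(p1.unescape(p)).get_text(separator="\n")
print bdy_soup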
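Are you reading it from an HTML page or a text file?
@prakash palnati I am reading it from a SQL table.
@s.s You can use BeautifulSoup to extract the exact data. First do import html, then html.unescape(x).
@manoj jadhav Can you explain the code?
@s.s Check my post.
This does not work. Can you help by explaining the code?
I have modified it.
@s.s I have edited my post. Please take a look.
With the text as p={"p………gt;/p"} it shows an error.
@s.s What is the exact input? Can you post the complete code snippet?
Exactly this: <p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>
Your code runs but does not return any output. What do I have to print?
Please add print bdy_soup.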
The code worked after I installed the lxml parser. Thank you, everyone, for your help.
import html

import bs4

x = """<p style="text-align: justify;"><span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""

# Decode HTML entities such as &quot; before parsing
p1 = html.unescape(x)
# Parse with the lxml backend (requires the lxml package) and join the text nodes with newlines
bdy_soup = bs4.BeautifulSoup(p1, "lxml").get_text(separator="\n")
print(bdy_soup)
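The separator argument only makes a visible difference when the markup contains more than one text node. A small sketch of that, assuming bs4 is installed and using the built-in "html.parser" backend so that lxml is not required; the two-span markup is a made-up example:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two separate text nodes.
markup = "<p><span> First sentence. </span><span> Second sentence. </span></p>"
soup = BeautifulSoup(markup, "html.parser")

# Text nodes are joined with the separator; strip=True trims each node first.
text = soup.get_text(separator="\n", strip=True)
print(text)
```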