Extracting data from HTML tags in Python (tags: python, html, python-3.x)


I want to extract the text (the paragraph) contained in these HTML tags using Python:

 <p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;">

 Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

 </span></p>
My code is:

    from HTMLParser import HTMLParser
    from bs4 import BeautifulSoup

    x = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;"""

    p1 = HTMLParser()
    p1.unescape(x)
    bdy_soup = BeautifulSoup(p1.unescape(x)).get_text(separator=";")
    print(bdy_soup)
This code returns nothing. Please help me get this working; any help would be appreciated.

  • Use html.unescape to convert the HTML character references into plain characters
  • Use bs4.BeautifulSoup(html_content).text to extract the content

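    >>> x = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;"""
    >>> import html
    >>> xx = html.unescape(x)
    >>> xx
    '<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>'
    >>> import bs4
    >>> bs4.BeautifulSoup(xx, "html.parser").text
    ' Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. '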
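If installing BeautifulSoup is not an option, the same two steps (decode the entities, then drop the tags) can be sketched with the standard library alone. This swaps BeautifulSoup for `html.parser.HTMLParser`; the names `TextCollector` and `extract_text` are hypothetical helpers invented for this illustration, not part of any answer in the thread.

```python
import html
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # HTMLParser calls this for every run of text between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(escaped, separator="\n"):
    """Decode entity-escaped HTML, then join its non-empty text nodes."""
    collector = TextCollector()
    collector.feed(html.unescape(escaped))
    collector.close()
    return separator.join(collector.chunks)


x = ("&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span "
     "style=&quot;font-size: small;&quot;&gt; Irrespective of the kind of small "
     "business you own, using traditional sales and marketing tactics can prove "
     "to be expensive. &lt;/span&gt;&lt;/p&gt;")
print(extract_text(x))
```

This is only a sketch for simple fragments like the one in the question; for messy real-world pages a full parser such as BeautifulSoup is usually the more robust choice.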
    You can use a regular expression to extract the data between two HTML tags:

    r'<title[^>]*>([^<]+)</title>'
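A minimal sketch of how the pattern above might be applied with Python's `re` module. The `<span>` variant, the helper `first_group`, and the sample strings are assumptions added for illustration. A regular expression like this only works for simple, non-nested markup.

```python
import re

# Pattern quoted in the answer: captures the text inside a <title> element.
TITLE_RE = re.compile(r'<title[^>]*>([^<]+)</title>')

# Assumed adaptation of the same pattern to the question's <span> markup.
SPAN_RE = re.compile(r'<span[^>]*>([^<]+)</span>')


def first_group(pattern, markup):
    """Return the first captured group, stripped, or None if nothing matches."""
    match = pattern.search(markup)
    return match.group(1).strip() if match else None


page = "<html><head><title>Sample page</title></head></html>"  # made-up input
snippet = ('<p style="text-align: justify;"><span style="font-size: small;"> '
           'Irrespective of the kind of small business you own, using traditional '
           'sales and marketing tactics can prove to be expensive. </span></p>')

print(first_group(TITLE_RE, page))
print(first_group(SPAN_RE, snippet))
```

If the input is still entity-escaped, as in the question, it would need to go through `html.unescape` first, since the pattern looks for literal `<` and `>` characters.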
    

    You can do it like this. First install beautifulsoup4:

    import html  # html.unescape is the Python 3 counterpart of HTMLParser().unescape
    from bs4 import BeautifulSoup

    p = ("&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span "
         "style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;")
    # Decode the entities first, then let BeautifulSoup strip the tags.
    # "html.parser" is the parser bundled with Python, so no extra install is needed.
    bdy_soup = BeautifulSoup(html.unescape(p), "html.parser").get_text(separator="\n")
    print(bdy_soup)
    

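    Comments:
    Are you reading it from an HTML page or from a text file?
    @prakash palnati: I am reading it from a SQL table.
    @s.s: You can use BeautifulSoup to extract the exact data. First run import html, then html.unescape(x).
    @manoj jadhav: Can you explain the code?
    @s.s: Check my post.
    This isn't working. Can you help me by explaining the code?
    I have modified it.
    @s.s: I edited my question. Please help me.
    I tried the answer you posted; with the text as p = {"p………gt;/p"} it shows an error.
    @s.s: What is the exact input? Can you post the complete code snippet?
    Exactly this: <p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>
    Your code runs but doesn't return any output. What do I have to print?
    Please add print bdy_soup.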
    The code worked after I installed the lxml parser. Thank you, everyone, for your help.
    
     import html

     import bs4  # the "lxml" parser used below requires the lxml package to be installed

     x = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;"""

     # Decode the entities, then parse with lxml and join the text nodes with newlines.
     p1 = html.unescape(x)
     bdy_soup = bs4.BeautifulSoup(p1, "lxml").get_text(separator="\n")
     print(bdy_soup)