Python: remove all HTML tags from a string


I am trying to access the article content of a website using BeautifulSoup and the following code:

import urllib2
from bs4 import BeautifulSoup

site = 'http://www.example.com'
page = urllib2.urlopen(site)
soup = BeautifulSoup(page)
content = soup.find_all('p')
content = str(content)
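The content object contains all of the main text inside the page's 'p' tags, but other tags still remain in the output (the original post showed this as an image). I would like to remove every character enclosed in a matching pair of < > as well as the brackets themselves, so that only the text remains.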

I tried the following approach, but it does not seem to work:

' '.join(item for item in content.split() if not (item.startswith('<') and item.endswith('>')))
What is the best way to remove substrings from a string that start and end with a certain pattern, such as < >?

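You can use get_text():

The following example is from the BeautifulSoup documentation:

>>> markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
>>> soup = BeautifulSoup(markup)
>>> soup.get_text()
u'\nI linked to example.com\n'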
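
Applied to the question's situation, get_text() can be called on each 'p' element instead of converting the whole result list with str(). The following is a minimal sketch, assuming Python 3 with the bs4 package installed; the html_doc string is made-up sample markup standing in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the page fetched in the question.
html_doc = """
<html><body>
<p>First <b>bold</b> paragraph with a <a href="http://example.com/">link</a>.</p>
<p>Second paragraph.</p>
</body></html>
"""

# Name the parser explicitly so the result does not depend on which
# optional parsers happen to be installed.
soup = BeautifulSoup(html_doc, "html.parser")

# get_text() returns only the text nodes of each <p>, dropping nested tags.
paragraphs = [p.get_text() for p in soup.find_all("p")]
text = "\n".join(paragraphs)
print(text)
```

Calling str() on the result of find_all() serializes the elements back to markup, which is why the tags showed up in the question's output.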
Using a regular expression:

re.sub('<[^<]+?>', '', text)
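
As a rough sketch of how that pattern behaves, the snippet below wraps it in a helper and runs it on two inputs. The helper name strip_tags_regex and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as in the answer: "<", then one or more non-"<" characters
# (matched lazily), then ">".
TAG_RE = re.compile(r"<[^<]+?>")

def strip_tags_regex(text):
    """Remove every substring that looks like an HTML tag."""
    return TAG_RE.sub("", text)

print(strip_tags_regex("<p>Hello <b>world</b></p>"))

# Caveat: the pattern cannot tell markup from literal angle brackets,
# so text between a bare "<" and a later ">" is removed as well.
print(repr(strip_tags_regex("1 < 2 and 3 > 2")))
```

The second call shows why a regular expression is only a reasonable choice for simple, well-formed fragments. A real parser is safer for arbitrary pages.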
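Using NLTK:

import nltk
from urllib import urlopen  # Python 2; in Python 3 use urllib.request.urlopen
url = "https://stackoverflow.com/questions/tagged/python"
html = urlopen(url).read()
# Note: clean_html is only implemented in NLTK 2.x; NLTK 3 raises NotImplementedError
raw = nltk.clean_html(html)
print(raw)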
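
Because nltk.clean_html is no longer implemented in NLTK 3 (it raises NotImplementedError), a comparable result can be sketched with the standard library's html.parser. This is a swapped-in technique rather than the NLTK approach above; it assumes Python 3, and the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of a document, ignoring all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text nodes.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags (may be called several times).
        self.chunks.append(data)

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)

print(html_to_text("<p>Hello, <i>world</i> &amp; friends</p>"))
```

Note that this sketch also keeps the contents of script and style elements, since they reach handle_data like any other text.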

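Pyparsing makes it easy to write an HTML stripper: define a pattern that matches all opening and closing HTML tags, then transform the input using that pattern as a suppressor. This still leaves the &xxx; HTML entities to be converted, which can be done with xml.sax.saxutils.unescape: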

source = """
<p><strong>Editors' Pick: Originally published March 22.<br /> <br /> Apple</strong> <span class=" TICKERFLAT">(<a href="/quote/AAPL.html">AAPL</a> - <a href="http://secure2.thestreet.com/cap/prm.do?OID=028198&amp;ticker=AAPL">Get Report</a><a class=" arrow" href="/quote/AAPL.html"><span class=" tickerChange" id="story_AAPL"></span></a>)</span> is waking up the echoes with the reintroduction of a&nbsp;4-inch iPhone, a model&nbsp;its creators hope will lead the company to victory not just in emerging markets, but at home as well.</p> 
<p>&quot;There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features,&quot; Jackdaw Research Chief Analyst Jan Dawson said in e-mailed comments.</p> 
<p>The new model, dubbed the iPhone SE, &quot;should unleash a decent upgrade cycle over the coming months,&quot; Dawson said.&nbsp;Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.</p>
<div class=" butonTextPromoAd">
 <div class=" ym" id="ym_44444440"></div>"""

from pyparsing import anyOpenTag, anyCloseTag
from xml.sax.saxutils import unescape as unescape
unescape_xml_entities = lambda s: unescape(s, {"&apos;": "'", "&quot;": '"', "&nbsp;":" "})

stripper = (anyOpenTag | anyCloseTag).suppress()

print(unescape_xml_entities(stripper.transformString(source)))
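
For the entity-conversion step, Python 3's standard library also offers html.unescape, which knows the HTML named entities without a hand-written mapping. The snippet below is a small sketch under that assumption (Python 3.4 or later); the stripped string is a made-up stand-in for text that has already had its tags removed.

```python
import html

# Hypothetical tag-free text that still contains HTML entities.
stripped = ("&quot;There&apos;s demand,&quot; said the analyst about "
            "a&nbsp;4-inch iPhone &amp; more")

decoded = html.unescape(stripped)

# &nbsp; decodes to a non-breaking space (U+00A0); swap it for a plain
# space if ordinary whitespace is wanted, as the mapping above does.
plain = decoded.replace("\u00a0", " ")
print(plain)
```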

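(In the future, please do not provide sample text or code as images that cannot be copied and pasted.)

If you are restricted from using any libraries, you can simply use the following code to remove the HTML tags.

I have only corrected your approach. Thanks for the idea.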

content="<h4 style='font-size: 11pt; color: rgb(67, 67, 67); font-family: arial, sans-serif;'>Sample text for display.</h4> <p>&nbsp;</p>"


' '.join([word for line in [item.strip() for item in content.replace('<',' <').replace('>','> ').split('>') if not (item.strip().startswith('<') or (item.strip().startswith('&') and item.strip().endswith(';')))] for word in line.split() if not (word.strip().startswith('<') or (word.strip().startswith('&') and word.strip().endswith(';')))])
A simple algorithm that will work in any language, without importing modules or other libraries. The code is self-documenting:

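def removetags_fc(data_str):
    # True while we are outside a tag and characters should be kept
    appendingmode_bool = True
    output_str = ''
    for char_str in data_str:
        if char_str == '<':
            # a tag starts: stop keeping characters
            appendingmode_bool = False
        elif char_str == '>':
            # the tag ends: resume keeping characters, but skip the '>' itself
            appendingmode_bool = True
            continue
        if appendingmode_bool:
            output_str += char_str
    return output_str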
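
The same character-scanning idea can be sketched with a usage example. This variant is an illustration, not the answer's own code: the name strip_tags_scan is hypothetical, it collects characters in a list instead of concatenating strings, and it treats '<' as the start of a tag and a following '>' as its end.

```python
def strip_tags_scan(data):
    """Drop everything between '<' and the next '>' (inclusive)."""
    inside_tag = False
    kept = []
    for ch in data:
        if ch == "<":
            inside_tag = True           # a tag starts: stop keeping characters
        elif ch == ">" and inside_tag:
            inside_tag = False          # the tag ends: resume after this character
        elif not inside_tag:
            kept.append(ch)             # ordinary text outside any tag
    return "".join(kept)

sample = "<h4 style='color: red;'>Sample text for display.</h4> <p>&nbsp;</p>"
print(strip_tags_scan(sample))
```

Entities such as &nbsp; are left untouched by this scan and would need a separate decoding step.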
The pyparsing example above prints:
Editors' Pick: Originally published March 22.  Apple (AAPL - Get Report) is waking up the echoes with the reintroduction of a 4-inch iPhone, a model its creators hope will lead the company to victory not just in emerging markets, but at home as well. 
"There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features," Jackdaw Research Chief Analyst Jan Dawson said in e-mailed comments. 
The new model, dubbed the iPhone SE, "should unleash a decent upgrade cycle over the coming months," Dawson said. Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.