Python: remove all HTML tags from a string

python html string parsing

I am trying to access the article content of a website using BeautifulSoup and the following code:

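import urllib2
from bs4 import BeautifulSoup

site = 'http://www.example.com'
req = urllib2.Request(site)        # build the request that urlopen() expects
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
content = soup.find_all('p')       # every <p> element on the page
content = str(content)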

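The content object contains all of the main text from the page that sits inside the 'p' tags, but other tags are still present in the output, as shown in the picture below. I would like to remove every character enclosed in a matching pair of < > brackets, along with the brackets themselves, so that only the text remains.

I have tried the following approach, but it does not seem to work: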

' '.join(item for item in content.split() if not (item.startswith('<') and item.endswith('>')))

What is the best way to remove substrings from a string when they start and end with a certain pattern, such as < and >?

You can use BeautifulSoup's get_text() method.

The following example is from the Beautiful Soup documentation:

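>>> markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
>>> soup = BeautifulSoup(markup)
>>> soup.get_text()
u'\nI linked to example.com\n'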
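
Applied to the question, get_text() can be called on each paragraph rather than converting the find_all() result with str(). The following is a minimal Python 3 sketch; the sample HTML string is made up for illustration, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for the downloaded article
html = (
    "<html><body>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<p>Second <a href='#'>link</a>.</p>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# get_text() returns only the text nodes of each <p>, dropping nested tags
paragraphs = [p.get_text() for p in soup.find_all("p")]
text = "\n".join(paragraphs)
print(text)
```

Joining the per-paragraph strings keeps one line per paragraph, which is usually easier to work with than the string form of the whole result list.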

Using a regular expression:

re.sub('<[^<]+?>', '', text)
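
As a runnable Python 3 sketch of the same pattern, the snippet below wraps it in a helper and also decodes entities with the standard library's html.unescape. The helper name strip_tags_regex and the sample string are illustrative assumptions. Note that a regular expression does not parse HTML: text such as "a < b and c > d" would lose everything between the two brackets.

```python
import re
from html import unescape

# Same pattern as above: '<', then one or more non-'<' characters, then '>'
TAG_RE = re.compile(r"<[^<]+?>")

def strip_tags_regex(text):
    """Remove anything that looks like a tag, then decode HTML entities."""
    return unescape(TAG_RE.sub("", text))

sample = "<p>Fish &amp; <b>chips</b></p>"
result = strip_tags_regex(sample)
print(result)
```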

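Using NLTK (this applies to Python 2 with NLTK 2.x; in NLTK 3, clean_html() raises NotImplementedError and points to BeautifulSoup's get_text() instead):

import nltk
from urllib import urlopen      # Python 2 location of urlopen

url = "https://stackoverflow.com/questions/tagged/python"
html = urlopen(url).read()      # download the raw HTML
raw = nltk.clean_html(html)     # strip the markup (NLTK 2.x only)
print(raw)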
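
Where neither NLTK 2.x nor BeautifulSoup is available, a similar result can be sketched with the standard library's html.parser module in Python 3. The class and function names below (TextExtractor, html_to_text) are hypothetical, and skipping the contents of script and style elements is a design choice of this sketch, not something the answers above specify.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and similar
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(html_to_text("<p>Hello <b>world</b> &amp; more</p><script>var x = 1;</script>"))
```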

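Pyparsing makes it easy to write an HTML stripper: define a pattern that matches all opening and closing HTML tags, then transform the input using that pattern as a suppressor. This still leaves the &xxx; HTML entities to be converted, which xml.sax.saxutils.unescape can do: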
source = """
<p><strong>Editors' Pick: Originally published March 22.<br /> <br /> Apple</strong> <span class=" TICKERFLAT">(<a href="/quote/AAPL.html">AAPL</a> - <a href="http://secure2.thestreet.com/cap/prm.do?OID=028198&amp;ticker=AAPL">Get Report</a><a class=" arrow" href="/quote/AAPL.html"><span class=" tickerChange" id="story_AAPL"></span></a>)</span> is waking up the echoes with the reintroduction of a&nbsp;4-inch iPhone, a model&nbsp;its creators hope will lead the company to victory not just in emerging markets, but at home as well.</p> 
<p>&quot;There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features,&quot; Jackdaw Research Chief Analyst Jan Dawson said in e-mailed comments.</p> 
<p>The new model, dubbed the iPhone SE, &quot;should unleash a decent upgrade cycle over the coming months,&quot; Dawson said.&nbsp;Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.</p>
<div class=" butonTextPromoAd">
 <div class=" ym" id="ym_44444440"></div>"""

from pyparsing import anyOpenTag, anyCloseTag
from xml.sax.saxutils import unescape as unescape
unescape_xml_entities = lambda s: unescape(s, {"&apos;": "'", "&quot;": '"', "&nbsp;":" "})

stripper = (anyOpenTag | anyCloseTag).suppress()

print(unescape_xml_entities(stripper.transformString(source)))
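
In Python 3, the entity step can also be handled by html.unescape from the standard library, which knows the full set of named and numeric entities, so no custom mapping is needed. The sketch below uses a made-up sample string. One difference from the mapping above: &nbsp; decodes to U+00A0 (a non-breaking space) rather than a plain space, so the sketch replaces it explicitly.

```python
from html import unescape

# Hypothetical fragment with the kinds of entities seen in the source text
raw = "a&nbsp;4-inch iPhone, &quot;pent-up demand&quot; &amp; more"

decoded = unescape(raw)
# &nbsp; becomes U+00A0; swap it for an ordinary space if that is preferred
cleaned = decoded.replace("\u00a0", " ")
print(cleaned)
```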

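(In the future, please do not provide sample text or code as an image that cannot be copied and pasted.)

If you are restricted from using any library, you can simply use the following code to remove the HTML tags.
I have just corrected what you tried. Thanks for the idea.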
content="<h4 style='font-size: 11pt; color: rgb(67, 67, 67); font-family: arial, sans-serif;'>Sample text for display.</h4> <p>&nbsp;</p>"


' '.join([word for line in [item.strip() for item in content.replace('<',' <').replace('>','> ').split('>') if not (item.strip().startswith('<') or (item.strip().startswith('&') and item.strip().endswith(';')))] for word in line.split() if not (word.strip().startswith('<') or (word.strip().startswith('&') and word.strip().endswith(';')))])

A simple algorithm that will work in every language, without importing modules or other libraries.
The code is self-documenting:
def removetags_fc(data_str):
    appendingmode_bool = True   # True while we are outside a tag
    output_str = ''
    for char_str in data_str:
        if char_str == '<':
            # A tag starts: stop copying characters
            appendingmode_bool = False
        elif char_str == '>':
            # The tag ends: resume copying, but skip the '>' itself
            appendingmode_bool = True
            continue
        if appendingmode_bool:
            output_str += char_str
    return output_str
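
A variant of the character-scan idea above, written as a hedged Python 3 sketch: it collects characters in a list and joins them once at the end, which avoids repeated string concatenation on long inputs. The name strip_tags_scan is hypothetical. Unlike a plain flag toggle, this version keeps a stray '>' that appears outside any tag. Entities such as &nbsp; are left untouched, and a literal '<' in the text is still treated as the start of a tag.

```python
def strip_tags_scan(data):
    """Drop everything between '<' and the next '>', keep the rest."""
    inside_tag = False
    kept = []
    for ch in data:
        if ch == "<":
            inside_tag = True          # entering a tag: stop collecting
        elif ch == ">" and inside_tag:
            inside_tag = False         # leaving the tag: skip the '>' itself
        elif not inside_tag:
            kept.append(ch)            # ordinary text outside any tag
    return "".join(kept)

content = "<h4 style='font-size: 11pt;'>Sample text for display.</h4> <p>&nbsp;</p>"
stripped = strip_tags_scan(content)
print(stripped)
```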

Output of the pyparsing example above:

Editors' Pick: Originally published March 22.  Apple (AAPL - Get Report) is waking up the echoes with the reintroduction of a 4-inch iPhone, a model its creators hope will lead the company to victory not just in emerging markets, but at home as well.
"There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features," Jackdaw Research Chief Analyst Jan Dawson said in e-mailed comments.
The new model, dubbed the iPhone SE, "should unleash a decent upgrade cycle over the coming months," Dawson said. Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.
