Python: How do I remove the content inside nested tags with BeautifulSoup?
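How do I remove the content inside nested tags with BeautifulSoup? Existing posts show the opposite: how to retrieve the content inside nested tags.
I tried .text, but it only strips the tags and keeps their content: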
>>> from bs4 import BeautifulSoup as bs
>>> html = "<foo>Something something <bar> blah blah</bar> something</foo>"
>>> bs(html).find_all('foo')[0]
<foo>Something something <bar> blah blah</bar> something else</foo>
>>> bs(html).find_all('foo')[0].text
u'Something something blah blah something else'
Desired output:
Something something something else
You can check whether each child is a bs4.element.NavigableString, for example:
from bs4 import BeautifulSoup as bs
import bs4

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

def get_only_text(elem):
    # Yield only the strings that are direct children of elem,
    # skipping nested tags and everything inside them.
    for item in elem.children:
        if isinstance(item, bs4.element.NavigableString):
            yield item

print(''.join(get_only_text(bs(html, "html.parser").find_all('foo')[0])))
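The same idea can be sketched with built-in search methods instead of a hand-written generator. The helper names direct_text and drop_nested below are hypothetical, and "html.parser" is chosen only so the example has no extra dependency. The first helper leaves the tree untouched; the second removes the nested tags in place with decompose().

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"


def direct_text(elem):
    """Join only the strings that are direct children of elem (tree is not modified)."""
    return "".join(elem.find_all(string=True, recursive=False))


def drop_nested(elem):
    """Remove every direct child tag of elem, content included, then return the remaining text."""
    # find_all returns a list-like ResultSet, so removing nodes while looping is safe.
    for child in elem.find_all(True, recursive=False):
        child.decompose()
    return elem.get_text()


soup = BeautifulSoup(html, "html.parser")
foo = soup.find("foo")

kept = direct_text(foo)      # non-destructive
removed = drop_nested(foo)   # mutates the tree

# Collapse the double spaces left behind where the nested tags used to be.
print(" ".join(kept.split()))
```

Use direct_text when the parsed document is still needed afterwards, and drop_nested when the goal is to write the stripped markup back out.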
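Here is my simple approach: soup.body.clear() or soup.tag.clear().
Suppose you want to clear the content of <table></table> and add a new DataFrame. Later, you can use this clear method to update the table in your web page's HTML file easily, instead of using flask/django: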
import pandas as pd
import bs4

# Goal: convert a 1.2-million-row .csv into a DataFrame, then into an HTML table,
# and add it to the HTML of my web page. Later I want to refresh the data
# at any time just by switching the csv variable.
bizcsv = pd.read_csv("business.csv")
dframe = pd.DataFrame(bizcsv)
dfhtml = dframe.to_html()  # convert DataFrame to table, HTML format

# Use dfhtml_update later to update your table without the <table> tags;
# the <table> is easy for BS to select & clear!
dfhtml_update = dfhtml.replace('<table border="1" class="dataframe">', '').replace('</table>', '')

# A small function to unescape (&lt; to <) the tags back into HTML format
def unescape(s):
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    # this has to be last:
    s = s.replace("&amp;", "&")
    return s

with open("page.html") as page:  # return to here when updating
    txt = page.read()
soup = bs4.BeautifulSoup(txt, features="lxml")

soup.body.append(dfhtml)  # adds table to <body>
with open("page.html", "w") as outf:
    outf.write(unescape(str(soup)))  # writes to page.html

# Let's say you want to make seamless table updates to your webpage
# instead of using flask or django; return to the "with open" step above.
soup.table.clear()  # clears everything in <table></table>
soup.table.append(dfhtml_update)
with open("page.html", "w") as outf:
    outf.write(unescape(str(soup)))
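As a smaller illustration of clear(), here is a minimal sketch that empties a table and refills it. It is a variation on the answer, not the answer's own method: the new rows are parsed into tags before being appended, so no unescape step is needed. The sample markup and the variable names are made up for the example, and "html.parser" is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for page.html.
page = "<html><body><table><tr><td>old</td></tr></table></body></html>"
soup = BeautifulSoup(page, "html.parser")

# clear() removes everything inside the tag but keeps the tag itself.
soup.table.clear()
emptied = str(soup.table)

# Parse the replacement rows into real tags, then move them into the table.
fragment = BeautifulSoup("<tr><td>new</td></tr>", "html.parser")
for row in fragment.find_all("tr"):
    soup.table.append(row)

updated = str(soup.table)
print(updated)
```

Appending parsed tags rather than a raw string means BeautifulSoup does not escape the angle brackets on output, at the cost of one extra parse of the new rows.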
So in this example, you want to remove the content of bar? Should there be an "else" in the second line of the code?