How to extract text from HTML in Python while ignoring a certain tag
I have an input file like the one shown below, and I am trying to extract its text and strip the HTML tags. Note that I want every p to end up on its own line, whereas a br should keep the text on the same line; either way, the br tag itself must be removed.
<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en" xmlns:tts="http://www.w3.org/ns/ttml#parameter"><head><styling><style id="b1"/></sty ling></head><body><div xml:lang="en" style="b1"><p begin="" end="0.143">HISTORY</p><p begin="0.143" end="0.286">HISTORY TV"</p><p begin= "0.286" end="0.714">HISTORY TV" THIS</p><p begin="0.714" end="0.857">HISTORY TV" THIS<br/>WEEKEND</p><p begin="0.857" end="3">HISTORY TV " THIS<br/>WEEKEND ON</p><p begin="3" end="3.333">HISTORY TV" THIS<br/>WEEKEND ON C-SPAN3.</p><p begin="3.333" end="3.667">WEEKEND ON C- SPAN3.<br/>>>></p><p begin="3.667" end="4">WEEKEND ON C-SPAN3.<br/>>>> "THE</p><p begin="4" end="4.5">WEEKEND ON C-SPA N3.<br/>>>> "THE MARCH</p><p begin="4.5" end="5">WEEKEND ON C-SPAN3.<br/>>>> "THE MARCH ON</p><p begin="5" end="5.5">W EEKEND ON C-SPAN3.<br/>>>> "THE MARCH ON WASHINGTON"</p><p begin="5.5" end="5.667">>>> "THE MARCH ON WASHINGTON"<br/>F OR</p><p begin="5.667" end="5.833">>>> "THE MARCH ON WASHINGTON"<br/>FOR JOBS</p><p begin="5.833" end="6">>>> "THE MAR CH ON WASHINGTON"<br/>FOR JOBS AND</p><p begin="6" end="6.2">>>> "THE MARCH ON WASHINGTON"<br/>FOR JOBS AND FREEDOM</p><p begin ="6.2" end="6.4">>>> "THE MARCH ON WASHINGTON"<br/>FOR JOBS AND FREEDOM WAS</p><p begin="6.4" end="7">>>> "THE MARCH O N WASHINGTON"<br/>FOR JOBS AND FREEDOM WAS 49</p><p begin="7" end="8">FOR JOBS AND FREEDOM WAS 49<br/>YEARS</p><p begin="8" end="8.5">FO R JOBS AND FREEDOM WAS 49<br/>YEARS AGO.</p><p begin="8.5" end="8.75">YEARS AGO.<br/>ON</p><p begin="8.75" end="9">YEARS AGO.<br/>ON AUG UST</p><p begin="9" end="13">YEARS AGO.<br/>ON AUGUST 28th,</p><p begin="13" end="13.333">YEARS AGO.<br/>ON AUGUST 28th, 1963.</p><p beg in="13.333" end="13.5">ON AUGUST 28th, 1963.<br/>THE</p><p begin="13.5" end="13.667">ON AUGUST 28th, 1963.<br/>THE MARCH</p><p begin="13 .667" end="13.833">ON AUGUST 28th, 1963.<br/>THE MARCH WAS</p><p begin="13.833" end="14">ON AUGUST 28th, 1963.<br/>THE MARCH WAS ORKGANI ZED</p><p begin="14" end="14.167">ON AUGUST 
28th, 1963.<br/>THE MARCH WAS ORKGANIZED TO</p><p begin="14.167" end="14.667">ON AUGUST 28th , 1963.<br/>THE MARCH WAS ORKGANIZED TO PUSH</p><p begin="14.667" end="14.833">THE MARCH WAS ORKGANIZED TO PUSH<br/>FOR</p>
How can I accomplish this?
This is the code I used:
import re
import os
def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub(' ', str(data)).strip()

directory = './reprocess'
for filename in os.listdir(directory):
    if filename.endswith(".dfxp"):
        print("Processing: {}".format(filename))
        with open("./reprocess/"+filename, "r") as inputFile:
            data = inputFile.read().splitlines()
        new_data = ""
        for line in data:
            new_data = new_data + remove_html_tags(line) + "\n"
        with open("./rmout/"+filename, "w") as text_file:
            text_file.write(new_data)
Summary
The code uses bs4 (BeautifulSoup4) and consists of three main steps:
import bs4
import re
from pprint import pprint
# raw data
html = "(as provided)"
# 1. cleansing
# (1) remove known unwanted patterns
html = html.replace("\n", "")  # drop stray line breaks; replacing " " would delete every space
html = html.replace(">>>", "")
# remove <br> tags (can also remove after the soup is built)
html = re.sub(r"<br\s*/?>", " ", html) # careful! error-prone!
# (2) regularize multiple spaces
html = re.sub(r"\s{2,}", " ", html)
# 2. construct soup (DOM)
soup = bs4.BeautifulSoup(html, 'html.parser')
# 3. extract text in target elements
ls_lines = []
for el in soup.find_all("p"):
    ls_lines.append(el.get_text().strip())
# check
for line in ls_lines:
    print(line)
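The same idea can also be sketched with only the standard library's html.parser, without editing the markup string by regex before parsing. This is a rough sketch rather than the method above: the names ParagraphText and paragraph_lines are made up for illustration, and it assumes p elements are never nested.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p>, treating <br> as a single space."""

    def __init__(self):
        super().__init__()
        self.lines = []    # one finished string per <p>
        self._buf = None   # chunks of the <p> being read, or None outside <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buf = []
        elif tag == "br" and self._buf is not None:
            self._buf.append(" ")  # keep the text on the same line

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            # join the chunks and collapse runs of whitespace
            self.lines.append(" ".join("".join(self._buf).split()))
            self._buf = None


def paragraph_lines(markup):
    """Return a list with one text line per <p> element in `markup`."""
    parser = ParagraphText()
    parser.feed(markup)
    parser.close()
    return parser.lines


sample = ('<p begin="0.714" end="0.857">HISTORY TV" THIS<br/>WEEKEND</p>'
          '<p begin="7" end="8">FOR JOBS AND FREEDOM WAS 49<br/>YEARS</p>')
for line in paragraph_lines(sample):
    print(line)
```

Unlike the bs4 version, this sketch leaves ">>>" in the text; strip it afterwards if it is unwanted.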
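Since the input is TTML, another option is the ttconv library, which reads the document into a model:
import ttconv.imsc.reader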
import ttconv.isd
import xml.etree.ElementTree as et
tt_doc = """<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="fr" xmlns="http://www.w3.org/ns/ttml">
<body>
<div>
<p begin="1s" end="2s">Hello</p>
<p begin="3s" end="4s">Bonjour</p>
</div>
</body>
</tt>"""
m = ttconv.imsc.reader.to_model(et.ElementTree(et.fromstring(tt_doc)))
st = ttconv.isd.ISD.significant_times(m)
for t in st:
    isd = ttconv.isd.ISD.from_model(m, t)
    # walk through all Text elements in `isd` to extract text
ttconv also supports converting TTML/IMSC to SRT, a simple text-based format:
tt.py convert -i <input .ttml file> -o <output .srt file> --otype SRT --itype TTML
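Don't parse HTML files with regular expressions. Use BeautifulSoup.
<br> tags can be evil in bs4, as noted in comments. The find_all / replace_with approach somehow failed to remove the <br> tags for me (maybe the bs4 API changed?). I wonder whether there is a clean and reliable solution as of 2020 (BeautifulSoup v4.9.1).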
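If the file is well-formed XML, which TTML/DFXP files normally are, one possible way to sidestep the <br> handling in bs4 is the standard library's xml.etree.ElementTree. The following is only a sketch under that assumption: ttml_lines and _text are illustrative names, and the namespace is the one declared in the sample input.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/ns/ttml}"  # default namespace declared on <tt>
P, BR = NS + "p", NS + "br"


def _text(el):
    """Concatenate the text inside `el`, rendering each <br/> as a space."""
    parts = [el.text or ""]
    for child in el:
        parts.append(" " if child.tag == BR else _text(child))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)


def ttml_lines(xml_text):
    """Return one whitespace-normalized line per <p> element."""
    root = ET.fromstring(xml_text)
    return [" ".join(_text(p).split()) for p in root.iter(P)]


sample = (
    '<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en"><body><div>'
    '<p begin="0.714" end="0.857">HISTORY TV" THIS<br/>WEEKEND</p>'
    '<p begin="3.333" end="3.667">WEEKEND ON C-SPAN3.<br/>>>></p>'
    '</div></body></tt>'
)
print("\n".join(ttml_lines(sample)))
```

A strict XML parser raises ParseError on damaged markup, so this only works once the file parses cleanly.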
Output of the BeautifulSoup code:
HISTORY
HISTORY TV"
HISTORY TV" THIS
HISTORY TV" THIS WEEKEND
HISTORY TV" THIS WEEKEND ON
HISTORY TV" THIS WEEKEND ON C-SPAN3.
WEEKEND ON C-SPAN3.
WEEKEND ON C-SPAN3. "THE
WEEKEND ON C-SPAN3. "THE MARCH
WEEKEND ON C-SPAN3. "THE MARCH ON
WEEKEND ON C-SPAN3. "THE MARCH ON WASHINGTON"
"THE MARCH ON WASHINGTON" FOR
"THE MARCH ON WASHINGTON" FOR JOBS
"THE MARCH ON WASHINGTON" FOR JOBS AND
"THE MARCH ON WASHINGTON" FOR JOBS AND FREEDOM
"THE MARCH ON WASHINGTON" FOR JOBS AND FREEDOM WAS
"THE MARCH ON WASHINGTON" FOR JOBS AND FREEDOM WAS 49
FOR JOBS AND FREEDOM WAS 49 YEARS
FOR JOBS AND FREEDOM WAS 49 YEARS AGO.
YEARS AGO. ON
YEARS AGO. ON AUGUST
YEARS AGO. ON AUGUST 28th,
YEARS AGO. ON AUGUST 28th, 1963.
ON AUGUST 28th, 1963. THE
ON AUGUST 28th, 1963. THE MARCH
ON AUGUST 28th, 1963. THE MARCH WAS
ON AUGUST 28th, 1963. THE MARCH WAS ORKGANIZED
ON AUGUST 28th, 1963. THE MARCH WAS ORKGANIZED TO
ON AUGUST 28th, 1963. THE MARCH WAS ORKGANIZED TO PUSH
THE MARCH WAS ORKGANIZED TO PUSH FOR