How to extract text from HTML in Python while ignoring a certain tag

Tags: python, python-3.x, regex

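I have an input file like the one below, and I am trying to extract its text and strip the HTML tags. Note that I want every p element on its own line. A br must not start a new line: the text on either side of it should stay on the same line, but the br tag itself still has to be removed.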

<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en" xmlns:tts="http://www.w3.org/ns/ttml#parameter"><head><styling><style id="b1"/></sty    ling></head><body><div xml:lang="en" style="b1"><p begin="" end="0.143">HISTORY</p><p begin="0.143" end="0.286">HISTORY TV"</p><p begin=    "0.286" end="0.714">HISTORY TV" THIS</p><p begin="0.714" end="0.857">HISTORY TV" THIS<br/>WEEKEND</p><p begin="0.857" end="3">HISTORY TV    " THIS<br/>WEEKEND ON</p><p begin="3" end="3.333">HISTORY TV" THIS<br/>WEEKEND ON C-SPAN3.</p><p begin="3.333" end="3.667">WEEKEND ON C-    SPAN3.<br/>&gt;&gt;&gt;</p><p begin="3.667" end="4">WEEKEND ON C-SPAN3.<br/>&gt;&gt;&gt; "THE</p><p begin="4" end="4.5">WEEKEND ON C-SPA    N3.<br/>&gt;&gt;&gt; "THE MARCH</p><p begin="4.5" end="5">WEEKEND ON C-SPAN3.<br/>&gt;&gt;&gt; "THE MARCH ON</p><p begin="5" end="5.5">W    EEKEND ON C-SPAN3.<br/>&gt;&gt;&gt; "THE MARCH ON WASHINGTON"</p><p begin="5.5" end="5.667">&gt;&gt;&gt; "THE MARCH ON WASHINGTON"<br/>F    OR</p><p begin="5.667" end="5.833">&gt;&gt;&gt; "THE MARCH ON WASHINGTON"<br/>FOR JOBS</p><p begin="5.833" end="6">&gt;&gt;&gt; "THE MAR    CH ON WASHINGTON"<br/>FOR JOBS AND</p><p begin="6" end="6.2">&gt;&gt;&gt; "THE MARCH ON WASHINGTON"<br/>FOR JOBS AND FREEDOM</p><p begin    ="6.2" end="6.4">&gt;&gt;&gt; "THE MARCH ON WASHINGTON"<br/>FOR JOBS AND FREEDOM WAS</p><p begin="6.4" end="7">&gt;&gt;&gt; "THE MARCH O    N WASHINGTON"<br/>FOR JOBS AND FREEDOM WAS 49</p><p begin="7" end="8">FOR JOBS AND FREEDOM WAS 49<br/>YEARS</p><p begin="8" end="8.5">FO    R JOBS AND FREEDOM WAS 49<br/>YEARS AGO.</p><p begin="8.5" end="8.75">YEARS AGO.<br/>ON</p><p begin="8.75" end="9">YEARS AGO.<br/>ON AUG    UST</p><p begin="9" end="13">YEARS AGO.<br/>ON AUGUST 28th,</p><p begin="13" end="13.333">YEARS AGO.<br/>ON AUGUST 28th, 1963.</p><p beg    in="13.333" end="13.5">ON AUGUST 28th, 1963.<br/>THE</p><p begin="13.5" end="13.667">ON AUGUST 28th, 1963.<br/>THE MARCH</p><p begin="13    .667" end="13.833">ON AUGUST 28th, 
1963.<br/>THE MARCH WAS</p><p begin="13.833" end="14">ON AUGUST 28th, 1963.<br/>THE MARCH WAS ORKGANI    ZED</p><p begin="14" end="14.167">ON AUGUST 28th, 1963.<br/>THE MARCH WAS ORKGANIZED TO</p><p begin="14.167" end="14.667">ON AUGUST 28th    , 1963.<br/>THE MARCH WAS ORKGANIZED TO PUSH</p><p begin="14.667" end="14.833">THE MARCH WAS ORKGANIZED TO PUSH<br/>FOR</p>
How can I accomplish this?

This is the code I used:

import re
import os

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub(' ', str(data)).strip()

directory = './reprocess'
for filename in os.listdir(directory):
    if filename.endswith(".dfxp"):
        print("Processing: {}".format(filename))
        with open("./reprocess/"+filename, "r") as inputFile:
            data = inputFile.read().splitlines()
            new_data = ""
            for line in data:
                new_data = new_data + remove_html_tags(line) + "\n"
        with open("./rmout/"+filename, "w") as text_file:
            text_file.write(new_data)
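
The attempt above cannot produce one line per p for two reasons: the whole document sits on very few physical lines, and every tag (p and br alike) is replaced by the same space. A minimal sketch of a regex-only adjustment follows, assuming the input is as regular as the sample. The function name `ttml_to_lines` is made up for illustration, and the `>>>` markers are kept because the question does not ask to drop them.

```python
import html
import re

def ttml_to_lines(markup):
    """Return one text line per <p>; a <br/> becomes a space (illustrative helper)."""
    text = re.sub(r"\s+", " ", markup)        # ignore the file's own line breaks
    text = re.sub(r"</p\s*>", "\n", text)     # end of a paragraph -> new line
    text = re.sub(r"<br\s*/?>", " ", text)    # <br/> keeps the text on the same line
    text = re.sub(r"<[^>]+>", "", text)       # drop every remaining tag
    text = html.unescape(text)                # decode entities such as &gt; last
    lines = (re.sub(r" {2,}", " ", ln).strip() for ln in text.split("\n"))
    return [ln for ln in lines if ln]

sample = (
    '<div><p begin="0.714" end="0.857">HISTORY TV" THIS<br/>WEEKEND</p>'
    '<p begin="3.333" end="3.667">WEEKEND ON C-SPAN3.<br/>&gt;&gt;&gt;</p></div>'
)
for line in ttml_to_lines(sample):
    print(line)
```

Entities are decoded only after the tags are gone, so a decoded `>` can never be mistaken for the end of a tag. This stays a shortcut for very regular input; a real parser is more robust.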
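Digest

The code uses bs4 (BeautifulSoup4) and consists of 3 main steps:

  • Pre-processing data cleansing: sometimes it is more convenient to clean some data in the raw text than in the processed data. If so, do not hesitate to do it.
  • Construct the soup (DOM)
  • Extract the target elements and post-process the extracted text

Code

Disclaimer: test thoroughly and always expect exceptions. The problem solver cannot foresee issues that do not appear in the sample data.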

    import bs4
    import re
    from pprint import pprint
    
    # raw data
    html = "(as provided)"
    
    # 1. cleansing
    
    # (1) remove known unwanted patterns
    html = html.replace("    ", "")
    html = html.replace("&gt;&gt;&gt;", "")
    # remove <br> tags (can also remove after the soup is built)
    html = re.sub(r"<br\s*/?>", " ", html)  # careful! error-prone!
    
    # (2) regularize multiple spaces
    html = re.sub(r"\s{2,}", " ", html)
    
    # 2. construct soup (DOM)
    soup = bs4.BeautifulSoup(html, 'html.parser')
    
    # 3. extract text in target elements    
    ls_lines = []
    for el in soup.find_all("p"):
        ls_lines.append(el.get_text().strip())
    
    # check
    for line in ls_lines:
        print(line)
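
The same three steps can also be sketched with the standard library alone, for setups where installing bs4 is not an option. This is an illustration under assumptions, not the answer's method: the class name `ParagraphText` is invented here, `html.parser.HTMLParser` stands in for the soup, and a br is turned into a space inside the parser instead of by a regex beforehand.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text of each <p>, treating <br> as a space (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities such as &gt; arrive decoded
        self.lines = []     # finished paragraphs, one string each
        self._buf = None    # pieces of the current paragraph, None outside <p>

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <br/>.
        if tag == "p":
            self._buf = []
        elif tag == "br" and self._buf is not None:
            self._buf.append(" ")

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            # join the pieces and collapse runs of whitespace
            self.lines.append(" ".join("".join(self._buf).split()))
            self._buf = None

parser = ParagraphText()
parser.feed('<div><p begin="4" end="4.5">WEEKEND ON C-SPAN3.<br/>&gt;&gt;&gt; "THE MARCH</p></div>')
parser.close()
for line in parser.lines:
    print(line)
```

Unlike the bs4 code above, this sketch keeps the `>>>` markers; they can be removed with `str.replace` in the same way if they are unwanted.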
    

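Is the input always TTML? If so, a TTML/IMSC document can be split into a series of Intermediate Synchronic Documents (ISDs), each corresponding to a period of time during which the content of the TTML/IMSC document is static. The text can then easily be extracted from each ISD.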

    import ttconv.imsc.reader
    import ttconv.isd
    import xml.etree.ElementTree as et
    
    tt_doc = """<?xml version="1.0" encoding="UTF-8"?>
      <tt xml:lang="fr" xmlns="http://www.w3.org/ns/ttml">
      <body>
        <div>
          <p begin="1s" end="2s">Hello</p>
          <p begin="3s" end="4s">Bonjour</p>
        </div>
      </body>
      </tt>"""
    
    m = ttconv.imsc.reader.to_model(et.ElementTree(et.fromstring(tt_doc)))
    
    st = ttconv.isd.ISD.significant_times(m)
    
    for t in st:
      isd = ttconv.isd.ISD.from_model(m, t)
      
      # walk through all Text elements in `isd` to extract text
    
ttconv also supports conversion from TTML/IMSC to SRT, a simple text-based format:

    tt.py convert -i <input .ttml file> -o <output .srt file> --otype SRT --itype TTML
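
Because TTML is XML, a plain XML parser is another option when ttconv is not available. The following is a minimal sketch, assuming the file is well-formed and uses the `http://www.w3.org/ns/ttml` namespace as in the question. The function name `paragraph_lines` is illustrative. Timing attributes are ignored, so unlike the ISD approach this does not work out which text is on screen at a given time.

```python
import xml.etree.ElementTree as ET

TTML_NS = "{http://www.w3.org/ns/ttml}"  # namespace taken from the question's <tt> element

def paragraph_lines(ttml_text):
    """Return the text of every <p>, one entry per paragraph (illustrative helper)."""
    root = ET.fromstring(ttml_text)
    lines = []
    for p in root.iter(TTML_NS + "p"):
        # itertext() yields the text before and after each <br/>;
        # joining with a space keeps both parts on the same line.
        joined = " ".join(p.itertext())
        lines.append(" ".join(joined.split()))  # collapse runs of whitespace
    return lines

sample = (
    '<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en"><body><div>'
    '<p begin="0.714" end="0.857">HISTORY TV" THIS<br/>WEEKEND</p>'
    '<p begin="7" end="8">FOR JOBS AND FREEDOM WAS 49<br/>YEARS</p>'
    '</div></body></tt>'
)
for line in paragraph_lines(sample):
    print(line)
```

The namespace prefix is required because ElementTree stores every tag with its full namespace; searching for a bare `p` would find nothing in a namespaced TTML document.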
    
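Don't use regular expressions to parse HTML files. Use BeautifulSoup.

<br> tags can be evil in bs4, as comments elsewhere point out. The find_all / replace_with approach somehow failed to remove the <br> tags for me (maybe the bs4 API changed?). I wonder whether there is a clean and reliable solution as of 2020 (BeautifulSoup v4.9.1).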

Output of the bs4 code above:

    HISTORY
    HISTORY TV"
    HISTORY TV" THIS
    HISTORY TV" THIS WEEKEND
    HISTORY TV" THIS WEEKEND ON
    HISTORY TV" THIS WEEKEND ON C-SPAN3.
    WEEKEND ON C-SPAN3.
    WEEKEND ON C-SPAN3. "THE
    WEEKEND ON C-SPAN3. "THE MARCH
    WEEKEND ON C-SPAN3. "THE MARCH ON
    WEEKEND ON C-SPAN3. "THE MARCH ON WASHINGTON"
    "THE MARCH ON WASHINGTON" FOR
    "THE MARCH ON WASHINGTON" FOR JOBS
    "THE MARCH ON WASHINGTON" FOR JOBS AND
    "THE MARCH ON WASHINGTON" FOR JOBS AND FREEDOM
    "THE MARCH ON WASHINGTON" FOR JOBS AND FREEDOM WAS
    "THE MARCH ON WASHINGTON" FOR JOBS AND FREEDOM WAS 49
    FOR JOBS AND FREEDOM WAS 49 YEARS
    FOR JOBS AND FREEDOM WAS 49 YEARS AGO.
    YEARS AGO. ON
    YEARS AGO. ON AUGUST
    YEARS AGO. ON AUGUST 28th,
    YEARS AGO. ON AUGUST 28th, 1963.
    ON AUGUST 28th, 1963. THE
    ON AUGUST 28th, 1963. THE MARCH
    ON AUGUST 28th, 1963. THE MARCH WAS
    ON AUGUST 28th, 1963. THE MARCH WAS ORKGANIZED
    ON AUGUST 28th, 1963. THE MARCH WAS ORKGANIZED TO
    ON AUGUST 28th, 1963. THE MARCH WAS ORKGANIZED TO PUSH
    THE MARCH WAS ORKGANIZED TO PUSH FOR
    