Python: adding a break tag every N characters in an HTML string


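I need to "text-wrap" an HTML string so that <br> elements are applied only to the text within the HTML string.

I can apply a style to the whole text string (if only one style is needed).

However, attaching styles to the string mixes the actual text up with the style markup (obviously).

Example, with sample test data. String input: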

<html><body><p>Author Correction: Hybrid organic-inorganic polariton laser<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span>not<span style="color:red"> been fixed in the paper.</span>  =========  Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span><span style="color:red"> been fixed in the paper.</span></p></body></html>
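
i.e. rendered as:

Author Correction: Hybrid organic-inorganic polariton laser.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has not been fixed in the paper. ========= Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has been fixed in the paper.

(The text inside the <span> elements is displayed in red.)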

Desired output:

<html><body><p>Author Correction: Hybrid organic-inorganic <br>polariton laser<span style="color:red">.. A correction to this article has <br>been published and is linked from the HTML and PDF <br>versions of this paper. The error has </span>not<span style="color:red"> been <br>fixed in the paper.</span>  =========  Publisher <br>Correction: Predictors of chronic kidney disease <br>in type 1 diabetes: a longitudinal study from the <br>AMD Annals initiative<span style="color:red">.. A correction to this <br>article has been published and is linked from the <br>HTML and PDF versions of this paper. The error has </span><span style="color:red"> <br>been fixed in the paper.</span></p></body></html>
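
i.e. rendered as:

Author Correction: Hybrid organic-inorganic
polariton laser.. A correction to this article has
been published and is linked from the HTML and PDF
versions of this paper. The error has not been
fixed in the paper.  =========  Publisher
Correction: Predictors of chronic kidney disease
in type 1 diabetes: a longitudinal study from the
AMD Annals initiative.. A correction to this
article has been published and is linked from the
HTML and PDF versions of this paper. The error has
been fixed in the paper.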
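One way to check a candidate result against the rendered lines above is to split the HTML on the break tag and strip the remaining markup. The following is only a minimal sketch: `visible_lines` is a hypothetical helper name, the regular expressions assume simple tags with no `>` inside attribute values, and whitespace and entities are left exactly as they appear in the string.

```python
import re

# Assumption: breaks are written as <br>, <br/> or <br /> (any letter case).
BR = re.compile(r"<br\s*/?>", re.IGNORECASE)
# Assumption: a tag is anything from '<' up to the next '>'.
TAG = re.compile(r"<[^>]*>")

def visible_lines(html):
    """Split on break tags and drop all other tags, one string per rendered line."""
    return [TAG.sub("", part) for part in BR.split(html)]

sample = ('<p>Hybrid organic-inorganic <br>polariton laser'
          '<span style="color:red">.. A</span></p>')

for line in visible_lines(sample):
    # Trailing spaces are ignored when measuring the line width.
    print(len(line.rstrip()), repr(line))
```

Printing the length of each line makes it easy to see whether any line runs past the intended width.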

An approximate brute-force approach, assuming there is no tag encapsulation (if there is, please let me know), could be:

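def put_tags_every_N(_input, _tag, _n):
    """Insert _tag after every _n visible characters, skipping over markup."""
    _k = 0            # visible (non-tag) characters counted so far
    _in_tag = False   # True while scanning between '<' and '>'
    _len = len(_input)
    _i = 0
    while _i < _len:
        _c = _input[_i]
        if _c == '<':
            _in_tag = True
        elif _c == '>':
            _in_tag = False
        if not _in_tag and _c != '>':
            _k += 1
            if _k % _n == 0:
                # Splice the tag in right after the current character.
                _input = _input[:_i+1]+_tag+_input[_i+1:]
                _len += len(_tag)
        _i += 1
    return _input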

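def put_tags_every_N_nowordcut(_input, _tag, _n):
    """Like put_tags_every_N, but wait for a word boundary before inserting _tag."""
    _k = 0                  # visible characters since the last inserted tag
    _in_tag = False         # True while scanning between '<' and '>'
    _position_past = False  # True once _n characters have been passed
    _len = len(_input)
    _i = 0
    while _i < _len:
        _c = _input[_i]
        if _c == '<':
            _in_tag = True
        elif _c == '>':
            _in_tag = False
        if not _in_tag and _c != '>':
            _k += 1
            if _k % _n == 0:
                _position_past = True
            # A boundary is a space, or the character just before a tag or a dot.
            # The index check avoids an IndexError on the last character.
            _next_is_boundary = _i + 1 < _len and _input[_i+1] in ('<', '.')
            if _position_past and (_next_is_boundary or _c == ' '):
                _input = _input[:_i+1]+_tag+_input[_i+1:]
                _len += len(_tag)
                _position_past = False
                _k = 0
        _i += 1
    return _input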

_tmp_input = '<html><body><p>Author Correction: Hybrid organic-inorganic polariton laser<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span>not<span style="color:red"> been fixed in the paper.</span>  =========  Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span><span style="color:red"> been fixed in the paper.</span></p></body></html>'
# print(put_tags_every_N(_input=_tmp_input, _tag='<br>', _n=50))
# print(put_tags_every_N(_input=_tmp_input, _tag='<br>', _n=5))
print(put_tags_every_N_nowordcut(_input=_tmp_input, _tag='<br>', _n=50))
# print(put_tags_every_N_nowordcut(_input=_tmp_input, _tag='<br>', _n=5))
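
The functions above insert the tag at the first boundary after N visible characters have been passed, so a line can end up longer than N. A variant that breaks before a word would overflow, which is closer to the desired output in the question, could look like the sketch below. It is not part of the original answer: `wrap_html` and `TOKEN` are hypothetical names, tags are assumed to be simple (no `>` inside attribute values), every tag is treated as zero-width inline markup, and entities such as `&amp;` are counted character by character.

```python
import re

# A token is a tag, a run of whitespace, or a run of visible non-space text.
# The final '<' alternative keeps a stray '<' instead of silently dropping it.
TOKEN = re.compile(r"<[^>]*>|\s+|[^<\s]+|<")

def wrap_html(html, width, tag="<br>"):
    """Greedy word wrap on the visible text; tags take up no width."""
    tokens = TOKEN.findall(html)
    out = []
    col = 0          # visible characters on the current line
    in_word = False  # True while inside a word (a word may span tags)
    for i, tok in enumerate(tokens):
        if tok.startswith("<"):
            out.append(tok)
        elif tok.isspace():
            col += len(tok)
            in_word = False
            out.append(tok)
        else:
            if not in_word:
                # Measure the whole word, looking through any tags inside it.
                word_len = 0
                for nxt in tokens[i:]:
                    if nxt.startswith("<"):
                        continue
                    if nxt.isspace():
                        break
                    word_len += len(nxt)
                # Break before the word if it would overflow the line.
                if col > 0 and col + word_len > width:
                    out.append(tag)
                    col = 0
                in_word = True
            col += len(tok)
            out.append(tok)
    return "".join(out)

print(wrap_html('<p>one two <span style="color:red">three four</span> five</p>', 8))
```

With the question's sample input and a width of 50, this rule should place the breaks at the same points as the desired output, because a trailing space is allowed to overhang the line and a word that would pass column 50 is moved to the next line. Since only `<br>` tags are added, removing them again gives back the original string.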
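
When wrapping text there are many things to consider: font size, non-proportional characters, inter-character spacing, all of which can change throughout the text with different markup tags. The one thing that knows all of this is the browser that renders the text to the screen, and that is controlled by CSS. Why not add a style instead of adding <br> manually?

This is for an application that does not have *any CSS capability, specifically for displaying small snippets of text in a graphics package. Only a very small subset of HTML tags works with it, but <br> tags do work. *I know this may not be technically correct, since inline HTML styling is used, but I don't know much about the internals of the package that processes the input. I had styled these strings nicely before realizing that the styles are not handled by the package I'm using, so now I have to fall back to these tags, such as <br>.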