Python: adding a break every N characters in an HTML string

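I need to "text-wrap" an HTML string so that <br> elements are applied only to the text in the HTML string.

I can apply styling to a plain text string (if only one style is needed).

However, adding styles to the string further mixes the actual text with the styling markup (obviously).

Example test data. String input: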

<html><body><p>Author Correction: Hybrid organic-inorganic polariton laser<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span>not<span style="color:red"> been fixed in the paper.</span>  =========  Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span><span style="color:red"> been fixed in the paper.</span></p></body></html>

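i.e., rendered (the text inside the <span> elements is red):

Author Correction: Hybrid organic-inorganic polariton laser.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has not been fixed in the paper.  =========  Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has  been fixed in the paper.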

Desired output:

<html><body><p>Author Correction: Hybrid organic-inorganic <br>polariton laser<span style="color:red">.. A correction to this article has <br>been published and is linked from the HTML and PDF <br>versions of this paper. The error has </span>not<span style="color:red"> been <br>fixed in the paper.</span>  =========  Publisher <br>Correction: Predictors of chronic kidney disease <br>in type 1 diabetes: a longitudinal study from the <br>AMD Annals initiative<span style="color:red">.. A correction to this <br>article has been published and is linked from the <br>HTML and PDF versions of this paper. The error has </span><span style="color:red"> <br>been fixed in the paper.</span></p></body></html>

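i.e., rendered:

Author Correction: Hybrid organic-inorganic
polariton laser.. A correction to this article has
been published and is linked from the HTML and PDF
versions of this paper. The error has not been
fixed in the paper.  =========  Publisher
Correction: Predictors of chronic kidney disease
in type 1 diabetes: a longitudinal study from the
AMD Annals initiative.. A correction to this
article has been published and is linked from the
HTML and PDF versions of this paper. The error has
been fixed in the paper.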
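One way to check a candidate output against the requirement above is to measure how many visible characters sit between consecutive <br> tags while ignoring all other markup. The following is a minimal sketch using the standard library's html.parser module; the names VisibleLineLengths and visible_line_lengths are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser


class VisibleLineLengths(HTMLParser):
    """Collect the length of the visible text between consecutive <br> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lengths = [0]  # one entry per visual line

    def handle_starttag(self, tag, attrs):
        # A <br> starts a new visual line; every other tag is ignored.
        if tag == "br":
            self.lengths.append(0)

    def handle_data(self, data):
        # Text nodes are the only thing that counts towards the line length.
        self.lengths[-1] += len(data)


def visible_line_lengths(html):
    parser = VisibleLineLengths()
    parser.feed(html)
    parser.close()
    return parser.lengths


sample = '<p>Hello <br>big <span style="color:red">red <br>world</span></p>'
print(visible_line_lengths(sample))  # [6, 8, 5]
```

Trailing spaces before a <br> are counted here, so a line that ends in a space reports one more character than the word text alone.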

An approximate brute-force approach, assuming the tags are not nested inside one another (if they are, please say so), could be:

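def put_tags_every_N(_input, _tag, _n):
    # Insert _tag after every _n-th visible (non-markup) character.
    # _tag is expected to be markup itself, e.g. '<br>'.
    _k = 0           # visible characters counted so far
    _in_tag = False  # True while scanning between '<' and '>'
    _len = len(_input)
    _i = 0
    while _i < _len:
        _c = _input[_i]
        if _c == '<':
            _in_tag = True
        elif _c == '>':
            _in_tag = False
        if not _in_tag and _c != '>':
            _k += 1
            if _k % _n == 0:
                # Splice the tag in; later iterations skip it as markup.
                _input = _input[:_i+1] + _tag + _input[_i+1:]
                _len += len(_tag)
        _i += 1
    return _input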

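def put_tags_every_N_nowordcut(_input, _tag, _n):
    # Like put_tags_every_N, but once _n visible characters have passed,
    # wait for a space, or for a character followed by '<' or '.',
    # so that words are not cut in the middle.
    _k = 0
    _in_tag = False
    _position_past = False  # True once _n characters passed without a break
    _len = len(_input)
    _i = 0
    while _i < _len:
        _c = _input[_i]
        if _c == '<':
            _in_tag = True
        elif _c == '>':
            _in_tag = False
        if not _in_tag and _c != '>':
            _k += 1
            if _k % _n == 0:
                _position_past = True
            # Guard the look-ahead so the last character cannot raise IndexError.
            _next = _input[_i+1] if _i + 1 < _len else ''
            if _position_past and (_next in ('<', '.') or _c == ' '):
                _input = _input[:_i+1] + _tag + _input[_i+1:]
                _len += len(_tag)
                _position_past = False
                _k = 0
        _i += 1
    return _input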

_tmp_input = '<html><body><p>Author Correction: Hybrid organic-inorganic polariton laser<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span>not<span style="color:red"> been fixed in the paper.</span>  =========  Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span><span style="color:red"> been fixed in the paper.</span></p></body></html>'
# print(put_tags_every_N(_input=_tmp_input, _tag='<br>', _n=50))
# print(put_tags_every_N(_input=_tmp_input, _tag='<br>', _n=5))
print(put_tags_every_N_nowordcut(_input=_tmp_input, _tag='<br>', _n=50))
# print(put_tags_every_N_nowordcut(_input=_tmp_input, _tag='<br>', _n=5))
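The functions above place a break once at least N visible characters have passed. A different technique, shown here only as a sketch, is a greedy word wrap that treats the number as a maximum line width: when a line of visible text would exceed the width, the break goes after the last space seen on that line. The names TAG_RE and wrap_html_text are hypothetical, and the regular expression assumes simple markup in which text never contains a literal '<' or '>'.

```python
import re

# Assumption: every '<...>' run is a tag; text never contains raw '<' or '>'.
TAG_RE = re.compile(r"(<[^>]*>)")


def wrap_html_text(html, width=50, tag="<br>"):
    """Greedy wrap: break after the last space once a line exceeds `width`."""
    out = []           # output pieces: single characters or whole tags
    line_len = 0       # visible characters on the current line
    last_space = None  # (index in `out` after the last space, line_len there)
    for i, token in enumerate(TAG_RE.split(html)):
        if i % 2 == 1:
            out.append(token)  # captured tags are copied and never counted
            continue
        for ch in token:
            out.append(ch)
            line_len += 1
            if ch == " ":
                last_space = (len(out), line_len)
            elif line_len > width and last_space is not None:
                pos, len_at_space = last_space
                out.insert(pos, tag)      # break right after that space
                line_len -= len_at_space  # keep the part already on the new line
                last_space = None
    return "".join(out)


print(wrap_html_text("<p>aaa bbb <b>ccc ddd</b> eee</p>", width=8))
# <p>aaa bbb <br><b>ccc ddd</b> <br>eee</p>
```

A trailing space is allowed to overhang the width, and a single word longer than the width is left unbroken. Building a list and joining it once also avoids re-slicing the whole string on every insertion.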

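When wrapping text there is a lot to consider, e.g. font size, non-proportional characters, and inter-character spacing, all of which can change throughout the text with different markup tags. The thing that knows all of this is the browser that renders the text to the screen, and that is controlled by CSS. Why not add a style instead of adding <br> manually?

This is for an application that does not have *any CSS capability. Specifically, it is for displaying small pieces of text in a graphics package. Only a very small subset of HTML markup works with it, but <span> and <br> tags do work. *I know this may not be technically correct, since inline HTML styling is used, but I do not know much about the internals of the package that processes the input. I had styled these strings nicely before I realised the styles are not handled by the package I am using, so now I have to fall back on these <span> and <br> tags.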