Python: BeautifulSoup modifying text


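I need to post-process a large number of XHTML files, but I did not generate them, so I cannot fix the code that produced them. I cannot run a regex over the whole file, only over highly selective parts, because some links and ids contain digits that must not be changed globally.

I have simplified the example a lot, because the original files contain RTL text. I am only interested in modifying the digits in the visible text, not in the markup. There seem to be three different cases.

Code snippets from bk1.xhtml:

Case 1: cross-reference with a link; the digits are in the embedded bookref text inside the xt span

<aside epub:type='footnote' id="FN96"><p class="x"><a class="notebackref" href="#bk1_21_9"><span class="notemark">*</span>text</a>
<span class="xt"> <a class='bookref' href='bk50.xhtml#bk50_118_26'>some text with these digits: 26:118</a></span></p></aside>

Case 2: cross-reference without a link; the digits are in the xt text, with no embedded bookref

<aside epub:type='footnote' id="FN100"><p class="x"><a class="notebackref" href="#bk1_21_42"><span class="notemark">*</span>text</a>
<span class="xt">some text with these digits: 26:118</span></p></aside>

Case 3: footnote without a link, but with digits in the ft text

<aside epub:type='footnote' id="FN107"><p class="f"><a class="notebackref" href="#bk1_22_44"><span class="notemark">§</span>text</a>
<span class="ft">some text with these digits: 22</span></p></aside>

I am trying to work out how to identify the text strings in the user-visible part, so that I modify only the relevant digits:

Case 1: I need to capture only the bookref link, assign its "some text 26:118" substring to a variable, and run a regex against that variable; then substitute the result back into the file at the same place.

Case 2: I need to capture only the element that contains "some text 26:118", change only the digits in that substring by running a regex against the variable, and then substitute the result back into the file at the same place.

Case 3: I need to capture only the element that contains "some text 22", change only the digits in that substring by running a regex against the variable, and then substitute the result back into the file at the same place.

I have thousands of these to do across many files. I know how to iterate over the files.

After processing all the patterns in one file, I need to write out the changed tree.

I only need to post-process it to fix the text.

I have been googling, reading, and watching a lot of tutorials, and I am confused.

Thanks for your help.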

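It seems you need the replace_with() method (replaceWith() in older code). You first have to find all the text you want to match: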

from bs4 import BeautifulSoup

cases = '''
<aside epub:type='footnote' id="FN96"><p class="x"><a class="notebackref" href="#bk1_21_9"><span class="notemark">*</span>text</a>
<span class="xt"> <a class='bookref' href='bk50.xhtml#bk50_118_26'>some text with these digits: 26:118</a></span></p></aside>

<aside epub:type='footnote' id="FN100"><p class="x"><a class="notebackref" href="#bk1_21_42"><span class="notemark">*</span>text</a>
<span class="xt">some text with these digits: 26:118</span></p></aside>

<aside epub:type='footnote' id="FN107"><p class="f"><a class="notebackref" href="#bk1_22_44"><span class="notemark">§</span>text</a>
<span class="ft">some text with these digits: 22</span></p></aside>
'''

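soup = BeautifulSoup(cases, 'lxml')  # the 'lxml' parser needs the lxml package installed

# find_all is the current name of findAll
case1 = soup.find_all('a', {'class': 'bookref'})
case2 = soup.find_all('span', {'class': 'xt'})
case3 = soup.find_all('span', {'class': 'ft'})

for match in case1 + case2 + case3:
    # .string is None when the tag has more than one child,
    # e.g. the case 1 span that wraps a link
    text = match.string
    print(text)
    if text:
        new_text = text.replace('some text', 'modified!')  # put your regex logic here
        text.replace_with(new_text)  # swap the text node in place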

The first run prints the original strings. If we run the loop a second time, it now prints:

modified! with these digits: 26:118
None
modified! with these digits: 26:118
modified! with these digits: 22
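As a sketch of what the "regex" step could look like, the loop can walk every text node below the `xt` and `ft` spans and apply `re.sub` to the text only, so that `href` and `id` values are never touched. The `fix_digits` function and its "swap the two numbers around the colon" rule are purely hypothetical stand-ins for whatever digit change is actually needed, and `html.parser` is used here only so the example runs without extra packages.

```python
import re
from bs4 import BeautifulSoup

html = '''<p class="x"><a class="notebackref" href="#bk1_21_9">text</a>
<span class="xt"> <a class="bookref" href="bk50.xhtml#bk50_118_26">some text with these digits: 26:118</a></span>
<span class="ft">some text with these digits: 22</span></p>'''

# Hypothetical rule for illustration only: turn "26:118" into "118:26".
PAIR = re.compile(r'(\d+):(\d+)')

def fix_digits(text):
    """Return text with every number pair around a colon swapped."""
    return PAIR.sub(r'\2:\1', text)

soup = BeautifulSoup(html, 'html.parser')

for span in soup.find_all('span', class_=['xt', 'ft']):
    # string=True returns every text node below the span, nested or not,
    # so the linked (case 1) and unlinked (case 2, 3) variants are handled alike.
    for node in span.find_all(string=True):
        new_text = fix_digits(node)
        if new_text != node:
            node.replace_with(new_text)

result = str(soup)
print(result)
```

Because only text nodes are rewritten, the link target `bk50.xhtml#bk50_118_26` keeps its digits while the visible reference changes.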

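Comment (LarsH): Does this address the need "after processing all the patterns in one file, I need to write out the changed tree"?

Reply: I missed that requirement, but I think it can be done easily by just writing the text out to a file.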
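One way the write-out step might look, as a rough sketch: `str(soup)` serializes the modified tree, and that string can be written to a new file. The `process_file` helper, the file names, and the placeholder edit are assumptions made for illustration. Which parser round-trips these XHTML files most faithfully is worth checking on a sample file first, since parsers differ in how they treat the surrounding document.

```python
import tempfile
from pathlib import Path
from bs4 import BeautifulSoup

def process_file(src, dst, parser='html.parser'):
    """Parse src, edit the visible text in the footnote spans, write the tree to dst."""
    soup = BeautifulSoup(Path(src).read_text(encoding='utf-8'), parser)
    for span in soup.find_all('span', class_=['xt', 'ft']):
        for node in span.find_all(string=True):
            # Placeholder edit, same as in the answer; put the regex logic here.
            node.replace_with(node.replace('some text', 'modified!'))
    # str(soup) turns the whole (modified) tree back into markup.
    Path(dst).write_text(str(soup), encoding='utf-8')

# Small demonstration on a temporary file (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'bk1.xhtml'
    dst = Path(tmp) / 'bk1.fixed.xhtml'
    src.write_text(
        '<p class="f"><span class="ft">some text with these digits: 22</span></p>',
        encoding='utf-8',
    )
    process_file(src, dst)
    written = dst.read_text(encoding='utf-8')

print(written)
```

Writing to a separate output path rather than overwriting the source keeps the originals available for comparison while the transformation is being tuned.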

Output of the first run, before any replacement has happened:

some text with these digits: 26:118
None
some text with these digits: 26:118
some text with these digits: 22
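The `None` in the output comes from the case 1 `xt` span: `.string` is only set when a tag has exactly one child, and that span holds a whitespace string plus the link. The following small sketch (using `html.parser` and a shortened snippet, both chosen just for the demonstration) shows the difference between `.string`, the text nodes, and `get_text()`.

```python
from bs4 import BeautifulSoup

snippet = '<span class="xt"> <a class="bookref" href="bk50.xhtml#bk50_118_26">26:118</a></span>'
soup = BeautifulSoup(snippet, 'html.parser')
span = soup.find('span', class_='xt')

outer = span.string                   # None: the span has two children
inner = span.a.string                 # the single text node inside the link
nodes = span.find_all(string=True)    # every text node below the span
joined = span.get_text()              # all text nodes concatenated

print(outer, inner, nodes, repr(joined))
```

This is why the answer's loop guards the replacement with `if text:`; the linked text is still reached through the separate `a.bookref` match.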