Regex matching over multiple lines (regex, text-processing, calibre)

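I'm currently trying to do some basic cleanup on a PDF so that I can convert it to ePub for use on my e-reader. All I'm doing is removing the page numbers (easy) and the footnotes (so far, stumped). Basically, I want an expression that finds the tag pattern at the start of each footnote (<br>, followed by a newline, a number, and a letter or quotation mark), then selects that pattern and everything after it until it reaches the <hr/> tag at the start of the next page. Here is some sample text: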

The phantoms, for so they then seemed, were flitting on the other side of <br>
the deck, and, with a noiseless celerity, were casting loose the tackles and bands <br>
of the boat which swung there. This boat had always been deemed one of the spare boats <br>
technically called the captain’s, on account of its hanging from the starboard quarter.<br>
The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
 <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>

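Since all the footnotes are formatted this way, I want to select every group of lines that starts with " <br>" (note the leading space) and ends with the <hr/> tag. This is my first real attempt at using regular expressions, so I tried a few solutions: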

\s<br>\n\d+\s[a-zA-Z”].*

This correctly selects the <br> and the first line of the footnote, but stops at the line break.

\s<br>\n\d+\s[a-zA-Z”].*\n.*\n.*\n.*\n.*

This selects the right number of lines, but it obviously only works for footnotes with exactly three lines of text.

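\s<br>\n\d+\s[a-zA-Z”]((.*\n)*<hr/>)

This starts at the right place in the first footnote, but then selects the rest of the document. My reading of this expression was: "start with <br>, then a number followed by a space, followed by a letter or quotation mark, then select everything, including newlines, until reaching <hr/>".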

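\s<br>\n\d+\s[a-zA-Z”]((?:.*\r?\n?)*)<hr/>\n

Same idea as (2), the expression just above, and the same result, although I'm not familiar enough with regex to fully understand this one.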

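Basically, my problem is that my expressions either exclude newlines (and ignore the ending pattern) or include every newline and return the whole text (apparently still ignoring the ending pattern).

How can I get it to return only the text between the two patterns, newlines included?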
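The two failure modes described in the question can be reproduced with a small sketch. It uses Python's re module purely for illustration, and the two-page TEXT below is an abbreviated stand-in for the sample, not the real book. The character class uses a straight double quote because that is what the sample footnote starts with; both choices are assumptions.

```python
import re

# Abbreviated, hypothetical two-page stand-in for the sample text.
TEXT = (
    "evilly protruding from its steel-like lips. <br>\n"
    " <br>\n"
    '1 "Hardly" had they pulled out from under the lee, when a <br>\n'
    "fourth keel pulled round under the stern, <br>\n"
    "127 <br>\n"
    "<br>\n"
    "<hr/>\n"
    "Second page body. <br>\n"
    " <br>\n"
    "2 Another footnote. <br>\n"
    "128 <br>\n"
    "<br>\n"
    "<hr/>\n"
    "Third page body. <br>\n"
)

# Footnote start: whitespace, <br>, newline, number, whitespace, letter or quote.
START = r'\s<br>\n\d+\s[a-zA-Z"]'

# By default "." does not match "\n", so ".*" stops at the end of the line.
one_line = re.search(START + r".*", TEXT).group()

# "(.*\n)*" is greedy: it takes as many lines as possible and only gives
# lines back until <hr/> fits, so the match ends at the LAST <hr/>.
greedy = re.search(START + r"(.*\n)*<hr/>", TEXT).group()

print(repr(one_line))
print(greedy.count("<hr/>"), "page separators swallowed by the greedy match")
```

In this sketch the first pattern yields only the opening line of footnote 1, while the greedy one runs through the second page and its footnote as well.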

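Your attempts are very close. In your first attempt, you probably need to set the flag that allows . to match newlines. Normally it does not. In your second attempt (the one using (.*\n)*), you need to put the non-greedy ? modifier on every pattern that matches .*. Otherwise, .* will try to match the rest of the text.

Maybe something like this:

/^ <br>\n\d+\s[a-zA-Z”“](.*?\n)*?<hr\/>/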
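For readers trying this outside Perl, here is a minimal Python sketch of both suggestions: a lazy line-by-line group, and a lazy .*? combined with the flag that lets . match newlines (re.DOTALL in Python). The TEXT sample is an abbreviated, made-up two-page stand-in, and the straight quote in the character class is an assumption based on how the sample footnote begins.

```python
import re

# Abbreviated, hypothetical two-page stand-in for the sample text.
TEXT = (
    "evilly protruding from its steel-like lips. <br>\n"
    " <br>\n"
    '1 "Hardly" had they pulled out from under the lee, when a <br>\n'
    "fourth keel pulled round under the stern, <br>\n"
    "127 <br>\n"
    "<br>\n"
    "<hr/>\n"
    "Second page body. <br>\n"
    " <br>\n"
    "2 Another footnote. <br>\n"
    "128 <br>\n"
    "<br>\n"
    "<hr/>\n"
    "Third page body. <br>\n"
)

START = r'\s<br>\n\d+\s[a-zA-Z"“”]'

# Variant A: lazy quantifiers, consuming one line at a time until <hr/>.
lazy_lines = re.compile(START + r"(.*?\n)*?<hr/>")

# Variant B: let "." match newlines and stop at the first <hr/>.
lazy_dotall = re.compile(START + r".*?<hr/>", re.DOTALL)

matches_a = [m.group() for m in lazy_lines.finditer(TEXT)]
matches_b = [m.group() for m in lazy_dotall.finditer(TEXT)]

for footnote in matches_a:
    print(repr(footnote))
```

On this sample both variants find the same two footnote blocks, each ending at its own <hr/>. Variant A only tests for <hr/> at the start of a line, whereas variant B would also stop at an <hr/> in the middle of a line, so they can differ on other inputs.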

But in any case, this is best done with Perl. Perl is where all the advanced regex features come from.

use strict;
use diagnostics;
our $text=

Thanks! I added the non-greedy modifier to the second version (\s<br>\n\d+\s[a-zA-Z]((.*?\n)*?<hr/>)) and it worked perfectly.

Removed text:
[ <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>]

New text:
[The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
<!-- Removed -->
More text.
]
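
The removed/new text shown above can be approximated in Python as well. This is only a sketch under assumptions: strip_footnotes and its marker parameter are hypothetical names, the pattern follows the non-greedy suggestion with a start-of-line anchor, and the character class accepts straight or curly double quotes.

```python
import re

# Footnote block: a line holding " <br>", then a numbered line, then
# lazily as few lines as possible up to the <hr/> page separator.
FOOTNOTE = re.compile(r'^ <br>\n\d+\s[a-zA-Z"“”](.*?\n)*?<hr/>', re.MULTILINE)


def strip_footnotes(text, marker="<!-- Removed -->"):
    """Return (cleaned_text, removed_blocks); hypothetical helper."""
    removed = [m.group() for m in FOOTNOTE.finditer(text)]
    cleaned = FOOTNOTE.sub(marker, text)
    return cleaned, removed


SAMPLE = (
    "The figure that now stood by its bows was tall and swart, with one white tooth <br>\n"
    "evilly protruding from its steel-like lips. <br>\n"
    " <br>\n"
    '1 "Hardly" had they pulled out from under the ship’s lee, when a <br>\n'
    "fourth keel, coming from the windward side, pulled round under the stern, <br>\n"
    "and showed the five strangers <br>\n"
    "127 <br>\n"
    "<br>\n"
    "<hr/>\n"
    "More text.\n"
)

cleaned, removed = strip_footnotes(SAMPLE)
print("Removed text:\n[" + removed[0] + "]\n")
print("New text:\n[" + cleaned + "]")
```

Returning the removed blocks alongside the cleaned text makes it easy to spot-check that only footnotes were deleted before converting the whole book.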