Java regex to extract paragraphs from the page below

java regex itext

I extracted this text from a PDF using iText and put it into a String variable:

(1) A a, — al'-fah; of Hebrew origin; the first letter of the alphabet;
figurative only (from its use as a numeral) the first: — Alpha.
Often used (usually ajn an, before a vowel) also in composition
(as a contraction from (427) (a]neu,)) in the sense of privation;
so in many words beginning with this letter; occasionally in the
sense of union (as a contraction of (260) (a[ma)).
(2) ÆAarw>n, — ah-ar-ohn'; of Hebrew origin [Hebrew {175}
('Aharown)]; Aaron, the brother of Moses: — Aaron.
(3) ÆAbaddw>n, — ab-ad-dohn'; of Hebrew origin [Hebrew {11}
('abaddown)]; a destroying angel: — Abaddon.
(4) ajbarh>v, — ab-ar-ace'; from (1) (a) (as a negative particle) and (922)
(ba>rov); weightless, i.e. (figurative) not burdensome: — from
being burdensome.
(5) ÆAbba~, — ab-bah'; of Chaldee origin [Hebrew {2} ('ab (Chaldee))];
father (as a vocative): — Abba.
(6) &Abel, — ab'-el; of Hebrew origin [Hebrew {1893} (Hebel)]; Abel,
the son of Adam: — Abel.
(7) ÆAbia>, — ab-ee-ah'; of Hebrew origin [Hebrew {29} ('Abiyah)];
Abijah, the name of two Israelites: — Abia.
(8) ÆAbia>qar, — ab-ee-ath'-ar; of Hebrew origin [Hebrew {54}
('Ebyathar)]; Abiathar, an Israelite: — Abiathar.
(9) ÆAbilhnh>, — ab-ee-lay-nay'; of foreign origin [compare Hebrew {58}
('abel)]; Abilene, a region of Syria: — Abilene.
(10) ÆAbiou>d, — ab-ee-ood'; of Hebrew origin [Hebrew {31}
('Abiyhuwd)]; Abihud, an Israelite: — Abiud.

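Paragraphs in the string begin with a parenthesized number, ([0-9]), as in (9) or (5). I want to use pagestring.split("regex") to extract each paragraph that starts with that character sequence. Any help?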

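This avoids splitting on a "(999)" embedded in the text. It relies on the assumption that a parenthesized number marking the start of a paragraph is always preceded by a line ending. Note also that the sample text has nothing before the first parenthesized number, so the split produces an empty "paragraph" at the front; hence the if statement.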

  String text = ...;
  String[] paras = text.split( "(?<=(^|\\n))\\(\\d+\\)" );
  for( String para: paras ){
      if( para.length() > 0 ){
          System.out.println( "Para: " + para );
      }
  }
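The split above consumes the "(n)" marker itself, so the paragraph number is lost. If the number should stay attached to its paragraph, one option is to split on a zero-width lookahead instead. The following is only a sketch under the same assumption (a paragraph marker always starts a line); the class name ParagraphSplitter, the method splitParagraphs, and the shortened sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphSplitter {

    // Splits *before* every "(digits)" that begins a line.
    // (?m) makes ^ match at each line start; the lookahead (?=...) consumes
    // nothing, so the "(n)" marker stays at the front of its paragraph.
    static List<String> splitParagraphs(String text) {
        List<String> result = new ArrayList<>();
        for (String para : text.split("(?m)(?=^\\(\\d+\\))")) {
            String trimmed = para.trim();
            if (!trimmed.isEmpty()) {   // skip an empty leading piece, if any
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the text extracted from the PDF.
        String text = "(1) A a, as a contraction from (427) (a]neu,)\n"
                    + "in the sense of privation.\n"
                    + "(2) Aaron, the brother of Moses.\n"
                    + "(10) Abihud, an Israelite.";
        for (String para : splitParagraphs(text)) {
            System.out.println("Para: " + para);
        }
    }
}
```

The embedded "(427)" is left alone because it does not begin a line. A continuation line that happens to start with a parenthesized number would still be treated as a new paragraph, which is the same limitation the original pattern has.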

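You can use the split method with the following regular expression, "[\n|.]\\([0-9]{1,2}\\)", which extracts all paragraphs from the text (covering numbers 0 to 99):

[\n|.]: considers only new paragraphs and ignores any (n) inside the paragraph text. Inside a character class, | and . are literal, so this matches a newline, a pipe, or a period.

\\([0-9]{1,2}\\): matches any group of one or two digits inside ().

The following gives an array containing all the paragraphs: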

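  String[] parts = st.split("[\n|.]\\([0-9]{1,2}\\)");

For more information on regular expression usage, see the reference linked in the original answer.

Great! Is there a tutorial or guide you can recommend? Regular expressions really confuse me.

I learned regular expressions a long time ago, so I can't really recommend a tutorial. But it offers a fun way to learn.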
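As an alternative to split, the paragraphs can also be pulled out with Pattern and Matcher, capturing each number and its body separately. This is a rough sketch rather than part of either answer: the class name ParagraphIndex, the method index, and the shortened sample text are hypothetical, and it again assumes that every paragraph marker starts a line.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphIndex {

    // Group 1: the paragraph number. Group 2: the body, matched lazily up to
    // the next line that starts with "(digits)" or the end of the input (\z).
    // MULTILINE lets ^ match at line starts; DOTALL lets . span line breaks.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "^\\((\\d+)\\)\\s*(.*?)(?=^\\(\\d+\\)|\\z)",
            Pattern.MULTILINE | Pattern.DOTALL);

    // Returns number -> body, in the order the paragraphs appear in the text.
    static Map<Integer, String> index(String text) {
        Map<Integer, String> paragraphs = new LinkedHashMap<>();
        Matcher m = PARAGRAPH.matcher(text);
        while (m.find()) {
            paragraphs.put(Integer.parseInt(m.group(1)), m.group(2).trim());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the text extracted from the PDF.
        String text = "(1) A a, as a contraction from (427) (a]neu,)\n"
                    + "in the sense of privation.\n"
                    + "(2) Aaron, the brother of Moses.\n"
                    + "(10) Abihud, an Israelite.";
        for (Map.Entry<Integer, String> e : index(text).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Unlike the {1,2} quantifier in the split pattern, \d+ places no limit on the number of digits, so entries numbered 100 and above would also be picked up.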