Java regex won't match


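I'm trying to write a program that returns all the text between \begin{theorem} and \end{theorem}, and between \begin{proof} and \end{proof}.

Using a regex seemed natural, but the delimiters contain many potential metacharacters, so they need to be escaped.

Here is the code I wrote: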

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LatexTheoremProofExtractor {

    // This is the LaTeX source that will be processed
    private String source = null;

    // These are the list of theorems and proofs that are extracted, respectively 
    private ArrayList<String> theorems = null;
    private ArrayList<String> proofs = null;

    // These are the patterns to match theorems and proofs, respectively 
    private static final Pattern THEOREM_REGEX = Pattern.compile("\\begin\\{theorem\\}(.+?)\\end\\{theorem\\}");
    private static final Pattern PROOF_REGEX = Pattern.compile("\\begin\\{proof\\}(.+?)\\end\\{proof\\}");

    LatexTheoremProofExtractor(String source) {
        this.source = source;
    }

    public void parse() {
        extractEntity("theorem");
        extractEntity("proof");
    }

    private void extractTheorems() {
        if(theorems != null) {
            return;
        }

        theorems = new ArrayList<String>();

        final Matcher matcher = THEOREM_REGEX.matcher(source);
        while (matcher.find()) {
            theorems.add(new String(matcher.group(1)));
        }   
    }

    private void extractProofs() {
        if(proofs != null) {
            return;
        }

        proofs = new ArrayList<String>();

        final Matcher matcher = PROOF_REGEX.matcher(source);
        while (matcher.find()) {
            proofs.add(new String(matcher.group(1)));
        }       
    }

    private void extractEntity(final String entity) {   
        if(entity.equals("theorem")) {
            extractTheorems();
        } else if(entity.equals("proof")) {
            extractProofs();
        } else {
            // TODO: Throw an exception or something
        }       
    }

    public ArrayList<String> getTheorems() {
        return theorems;
    }

}


Here is my failing test:

@Test 
public void testTheoremExtractor() {
    String source = "\\begin\\{theorem\\} Hello, World! \\end\\{theorem\\}";
    LatexTheoremProofExtractor extractor = new LatexTheoremProofExtractor(source);
    extractor.parse();
    ArrayList<String> theorems = extractor.getTheorems();
    assertEquals(theorems.get(0).trim(), "Hello, World!");
}


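Clearly, my test expects exactly one match, which should be "Hello, World!" (after trimming).

Currently the returned list is empty but non-null, so my Matchers are not matching the pattern. Can anyone help me understand why?

Thanks,
erip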
Your first regex needs to be:
Pattern THEOREM_REGEX = Pattern.compile("\\\\begin\\\\\\{theorem\\\\\\}(.+?)\\\\end\\\\\\{theorem\\\\\\}");

This is because matching a literal backslash with a regex requires \\\\ in the Java string literal.
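To make the two layers of escaping concrete, here is a minimal sketch. The class name BackslashLevels, the helper firstGroup and the environment name x are made up for illustration, and the input is plain LaTeX-style text rather than the string from the test above. With only two backslashes in the literal, the regex engine sees \b (a word boundary) and \e (the escape character), so the pattern compiles but never matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackslashLevels {
    // Returns the first capture group of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The Java literal "\\begin{x}" is the text \begin{x} (one backslash).
        String input = "\\begin{x} body \\end{x}";

        // Two backslashes in the literal: the regex engine receives \begin and \end,
        // i.e. \b (word boundary) + "egin" and \e (escape character U+001B) + "nd".
        String tooFew = "\\begin\\{x\\}(.+?)\\end\\{x\\}";
        System.out.println(firstGroup(tooFew, input)); // null

        // Four backslashes in the literal: the regex engine receives \\begin,
        // which matches one literal backslash followed by "begin".
        String enough = "\\\\begin\\{x\\}(.+?)\\\\end\\{x\\}";
        System.out.println(firstGroup(enough, input)); // " body " (spaces included)
    }
}
```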
Here are the updates you need to make to your code. The 2 regexes in the extractor should be changed to:
private static final Pattern THEOREM_REGEX = Pattern.compile(Pattern.quote("\\begin\\{theorem\\}") + "(.+?)" + Pattern.quote("\\end\\{theorem\\}"));
private static final Pattern PROOF_REGEX = Pattern.compile(Pattern.quote("\\begin\\{proof\\}") + "(.+?)" + Pattern.quote("\\end\\{proof\\}"));

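The result will be "Hello, World!".

The string you have is actually \begin\{theorem\} Hello, World! \end\{theorem\}. Literal backslashes are doubled in Java string literals, and to match a literal backslash with a regex in Java you need to write \\\\. To avoid that, Pattern.quote tells the regex engine to treat everything inside it as literal text.

For more details, see the documentation of Pattern.quote:

Returns a literal pattern String for the specified String.

This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern. Metacharacters or escape sequences in the input sequence will be given no special meaning.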
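The quoting behaviour described above can be checked with a small sketch. The class name QuoteDemo and the helper findsLiterally are hypothetical, chosen only for this example; Pattern.quote itself is the standard java.util.regex method.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // True if input contains literal exactly as written, with no regex interpretation.
    static boolean findsLiterally(String literal, String input) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).find();
    }

    public static void main(String[] args) {
        // Pattern.quote wraps the text in \Q ... \E, so braces and backslashes lose their meaning.
        System.out.println(Pattern.quote("\\begin{theorem}")); // \Q\begin{theorem}\E

        // Unquoted, "a.c" would match "abc" because . matches any character.
        System.out.println(findsLiterally("a.c", "abc"));     // false
        System.out.println(findsLiterally("a.c", "x a.c y")); // true
    }
}
```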
There seems to be an error in your test code that the other answers don't address. You create the test string like this:
String source = "\\begin\\{theorem\\} Hello, World! \\end\\{theorem\\}";

...but in the text, you say the source string should be:
\begin{theorem} Hello, World! \end{theorem}

If that's true, the string literal should be:
"\\begin{theorem} Hello, World! \\end{theorem}"

To create the regex, use:
Pattern.quote("\\begin{theorem}") + "(.*?)" + Pattern.quote("\\end{theorem}")

...or escape it manually:
"\\\\begin\\{theorem\\}(.*?)\\\\end\\{theorem\\}"

Use \\\\ to match a single \.
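Putting the corrected string literal and the quoted delimiters together, one possible end-to-end sketch looks like this. The class EnvironmentExtractor and its extract method are hypothetical and are not the asker's class. Pattern.DOTALL is an addition that the answers do not mention, included on the assumption that real theorem bodies may span several lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EnvironmentExtractor {
    // Collects the trimmed body of every \begin{env} ... \end{env} block in source.
    static List<String> extract(String env, String source) {
        Pattern p = Pattern.compile(
                Pattern.quote("\\begin{" + env + "}")
                        + "(.*?)" // reluctant, so each block ends at the nearest \end
                        + Pattern.quote("\\end{" + env + "}"),
                Pattern.DOTALL); // assumption: let . match line breaks too
        List<String> bodies = new ArrayList<>();
        Matcher m = p.matcher(source);
        while (m.find()) {
            bodies.add(m.group(1).trim());
        }
        return bodies;
    }

    public static void main(String[] args) {
        String source = "\\begin{theorem} Hello, World! \\end{theorem}\n"
                + "\\begin{proof}\nTrivial.\n\\end{proof}";
        System.out.println(extract("theorem", source)); // [Hello, World!]
        System.out.println(extract("proof", source));   // [Trivial.]
    }
}
```

Building the pattern from the environment name keeps the theorem and proof cases on one code path, at the cost of compiling the pattern on each call.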
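@Stribizev, I made the change, but it gives me the same result: size 0.
If the text is constant as you say, \begin{theorem} and so on, I don't think you need a regex. Why not split the data on those delimiters, or use indexOf()?
@LiranBo Pretty sure String#split uses Pattern/Matcher under the hood.
Take a look at the 2 updated regexes in your extractor. I added more links and explanations. Actually, you can use \\Q and \\E instead, but the important thing is to apply it to both regexes.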
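The indexOf() suggestion from the comments could be sketched as follows. This is only an illustration of that idea, with a hypothetical class IndexOfExtractor and method between; it assumes the delimiters are fixed strings and that blocks are not nested.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfExtractor {
    // Returns the text between each open/close pair, scanning left to right.
    static List<String> between(String source, String open, String close) {
        List<String> bodies = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = source.indexOf(open, from);
            if (start < 0) {
                break; // no further opening delimiter
            }
            int bodyStart = start + open.length();
            int end = source.indexOf(close, bodyStart);
            if (end < 0) {
                break; // unterminated block: stop rather than guess
            }
            bodies.add(source.substring(bodyStart, end));
            from = end + close.length();
        }
        return bodies;
    }

    public static void main(String[] args) {
        String source = "\\begin{theorem} A \\end{theorem} text \\begin{theorem} B \\end{theorem}";
        // No regex is involved, so the delimiters need only Java-level escaping.
        System.out.println(between(source, "\\begin{theorem}", "\\end{theorem}")); // [ A ,  B ]
    }
}
```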