Extracting tags between tags with a Java regular expression


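I want to extract the tags between […] and […]:

String patternHtml = "(.*?)(.*?)(.*?(.*?)";
Pattern rHtml = Pattern.compile(patternHtml, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher mHtml = rHtml.matcher(html);

I don't know why, but this extracts all the tags, together with […] and […].

Please: I need to use a regular expression, so please don't suggest alternatives such as parser libraries…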

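If you only want to (and I quote) "extract tags", which I interpret as the opening nodes inside the body of the HTML text, you can use the solution below.

Note that this is barbaric. You should not "parse" HTML with regular expressions (I know you know that, but other readers may not).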

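import java.util.regex.Matcher;
import java.util.regex.Pattern;

// simple html file, has head/body and line breaks
String html = "<html>\r\n<head>\r\n<title>Foo</title>\r\n</head>\r\n" +
        "<body>\r\n<h1>Blah</h1>\r\n<h3>Meh</h3>\r\n</body>\r\n</html>";
// the pattern only checks for opening nodes without attributes
Pattern tagsWithinBody = Pattern.compile("<\\p{Alnum}+>");
// matcher is applied to whatever text is in between the "<body>" open and close nodes
int bodyStart = html.indexOf("<body>") + "<body>".length();
Matcher matcher = tagsWithinBody.matcher(html.substring(bodyStart, html.indexOf("</body>")));
// iterates over matcher as long as it finds text
while (matcher.find()) {
    System.out.println(matcher.group());
}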

Output:

<h1>
<h3>
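
The snippet above only matches bare tags such as `<h1>` and relies on an exact, lowercase `<body>`. As a rough sketch of how the same idea could be stretched to tolerate attributes and mixed case, something like the following might work. The class name `BodyTagExtractor`, the method `openingTagsInBody` and both patterns are made up for illustration and are not part of the original answer; the sketch also assumes that attribute values never contain a `>` character, which real HTML does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BodyTagExtractor {
    // Hypothetical pattern: group 1 captures everything between <body ...> and </body>.
    private static final Pattern BODY = Pattern.compile(
            "<body\\b[^>]*>(.*?)</body\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical pattern: an opening tag with optional attributes; group 1 is the tag name.
    // Closing tags are skipped because '/' cannot start the tag name.
    private static final Pattern OPENING_TAG = Pattern.compile(
            "<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    // Returns the names of all opening tags found inside the first body element.
    static List<String> openingTagsInBody(String html) {
        List<String> tags = new ArrayList<>();
        Matcher body = BODY.matcher(html);
        if (!body.find()) {
            return tags; // no body element found
        }
        Matcher tag = OPENING_TAG.matcher(body.group(1));
        while (tag.find()) {
            tags.add(tag.group(1));
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<html>\r\n<head>\r\n<title>Foo</title>\r\n</head>\r\n"
                + "<BODY class=\"main\">\r\n<h1 id=\"a\">Blah</h1>\r\n<h3>Meh</h3>\r\n</BODY>\r\n</html>";
        System.out.println(openingTagsInBody(html)); // prints [h1, h3]
    }
}
```

This is still regex-on-HTML, so comments, scripts, or a `>` inside an attribute value can all fool it. It is meant as a learning exercise rather than something to ship.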

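Have you tried removing (.*?) from the beginning and the end of your regular expression? Also, don't forget to use the find() method on the matcher object. By the way, I hope you are not going to use this code in a real application and are only learning regular expressions.

Do you realize that your question is essentially the same as "I don't know why, but my screwdriver won't drive this nail… Please: I need to use a screwdriver, so please don't suggest alternatives such as a hammer…"? The answer is very simple: use the right tool for the job.

I know it is better to use a parser library. But if you look at my post, that is not my question.

@Orçunyumarcı "But that is not my question": not to put too fine a point on it, but the question is silly. It basically reads: "I am not willing to accept help." That is not a question, it is a statement, and one we don't like.

This question appears to be off-topic because it is about an OP who has already indicated they are not interested in help.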
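
To illustrate the find() advice from the first comment, here is a minimal sketch of how `matches()` and `find()` differ on the same input. The literal `<body>` and `</body>` delimiters are an assumption (the exact pattern from the question is not recoverable), and the class name `FindVersusMatches` and the method `bodyViaFind` are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVersusMatches {
    // Assumed delimiters; no leading or trailing (.*?) is needed when find() is used.
    private static final Pattern BODY = Pattern.compile(
            "<body>(.*?)</body>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text between the first <body> and </body>, or null if there is none.
    static String bodyViaFind(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Blah</h1></body></html>";
        // matches() requires the entire input to match the pattern, so it fails here.
        System.out.println(BODY.matcher(html).matches()); // prints false
        // find() scans for the first occurrence anywhere in the input.
        System.out.println(bodyViaFind(html));            // prints <h1>Blah</h1>
    }
}
```

Padding the pattern with (.*?) on both sides is usually a workaround for calling `matches()`. With `find()` the padding is unnecessary, and group 1 holds just the text between the delimiters.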
