Java: extracting text between tags and attributes


java regex design-patterns


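I am trying to extract the text between specific tags and attributes. For now, I am only trying to extract the tags. I read a ".gexf" file, which contains XML data, and store that data in a String. I then try to extract the text between the "nodes" tags. Here is my code so far: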

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    private static String filePath = "src/babel.gexf";

    public String readFile(String filePath) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(filePath));
        try {
            StringBuilder sb = new StringBuilder();
            String line = br.readLine();
            while (line != null) {
                sb.append(line);
                sb.append("\n");
                line = br.readLine();
            }
            return sb.toString();
        } finally {
            br.close();
        }
    }

    public void getNodesContent(String content) throws IOException {
        final Pattern pattern = Pattern.compile("<nodes>(\\w+)</nodes>", Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }

    public static void main(String [] args) throws IOException {
        Main m = new Main();
        String result = m.readFile(filePath);
        m.getNodesContent(result);
    }
}


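With the code above I get no results. When I try it with a sample string such as "my string", I do get a result. Link to the gexf file (I had to upload it because it is too long):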

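Without a sample of the file, I can only suggest so much. What I can tell you is that you can get the substring of text with a tag-search loop. Here is an example: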

String s = "<a>test</a><b>list</b><a>class</a>";
int start = 0, end = 0;
// Scan for an opening <a> tag; i + 2 must stay in bounds.
for (int i = 0; i + 2 < s.length(); i++) {
    if (s.charAt(i) == '<' && s.charAt(i + 1) == 'a' && s.charAt(i + 2) == '>') {
        start = i + 3; // first character after "<a>"
        // Scan forward for the closing </a>; j + 3 must stay in bounds.
        // Starting at 'start' (not start + 3) also handles content shorter than 3 characters.
        for (int j = start; j + 3 < s.length(); j++) {
            if (s.charAt(j) == '<' && s.charAt(j + 1) == '/' && s.charAt(j + 2) == 'a' && s.charAt(j + 3) == '>') {
                end = j;
                System.out.println(s.substring(start, end));
                break;
            }
        }
    }
}



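The code above searches the string s for the opening tag, then continues from the position where it was found until it reaches the closing tag. It then uses those two positions to take a substring of s, which is the text between the two tags. You can stack as many of these tag searches as you need. Here is an example with two tag searches: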
String s = "<a>test</a><b>list</b><a>class</a>";
int start = 0, end = 0;
// Scan for an opening <a> or <b> tag; i + 2 must stay in bounds.
for (int i = 0; i + 2 < s.length(); i++) {
    if (s.charAt(i) == '<' && (s.charAt(i + 1) == 'a' || s.charAt(i + 1) == 'b') && s.charAt(i + 2) == '>') {
        start = i + 3; // first character after the opening tag
        // Scan forward for the next </a> or </b>; j + 3 must stay in bounds.
        for (int j = start; j + 3 < s.length(); j++) {
            if (s.charAt(j) == '<' && s.charAt(j + 1) == '/'
                    && (s.charAt(j + 2) == 'a' || s.charAt(j + 2) == 'b') && s.charAt(j + 3) == '>') {
                end = j;
                System.out.println(s.substring(start, end));
                break;
            }
        }
    }
}


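The only difference is that I added clauses to the if statements to also get the text between the b tags. This system is very versatile, and I think you will find plenty of uses for it.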
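The same tag-search idea can also be written with String.indexOf, which avoids the manual bounds checks. This is only a sketch: the class name TagSearch and the helper textBetween are made up for illustration, and it assumes simple tags without attributes or nesting, as in the sample string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original answer.
class TagSearch {

    /** Collects the text between each <tag>...</tag> pair, in order of appearance. */
    static List<String> textBetween(String s, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> found = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = s.indexOf(open, from);
            if (start < 0) {
                break;                        // no further opening tag
            }
            start += open.length();           // first character after the opening tag
            int end = s.indexOf(close, start);
            if (end < 0) {
                break;                        // opening tag without a closing tag
            }
            found.add(s.substring(start, end));
            from = end + close.length();      // continue after the closing tag
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "<a>test</a><b>list</b><a>class</a>";
        System.out.println(textBetween(s, "a")); // prints [test, class]
        System.out.println(textBetween(s, "b")); // prints [list]
    }
}
```

Calling the helper once per tag name keeps each search independent, so an opening a tag is never paired with a closing b tag.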
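I don't think placing the entire file contents into a single String is a good idea, but I suppose that depends on how much content the file holds. If there is a lot of content, I would read it in a different way. It would have been nice to see a fictitious example of what the file contains.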
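One way to avoid holding the whole file in a String is to stream it with the JDK's StAX parser. The sketch below is built on assumptions, because the actual file was not shown: it assumes the usual GEXF layout in which each node element carries a label attribute, and the class name NodeLabels and method readLabels are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of the original answer.
class NodeLabels {

    /** Streams the XML and collects the "label" attribute of every <node> element. */
    static List<String> readLabels(Reader in) throws XMLStreamException {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> labels = new ArrayList<>();
        try {
            while (xml.hasNext()) {
                // Assumption: nodes look like <node id="..." label="..."/>.
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && "node".equals(xml.getLocalName())) {
                    // Adds null for a node that has no label attribute.
                    labels.add(xml.getAttributeValue(null, "label"));
                }
            }
        } finally {
            xml.close();
        }
        return labels;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Made-up sample data standing in for the real .gexf file.
        String sample = "<gexf><graph><nodes>"
                + "<node id=\"0\" label=\"Hello\"/>"
                + "<node id=\"1\" label=\"World\"/>"
                + "</nodes></graph></gexf>";
        System.out.println(readLabels(new StringReader(sample))); // prints [Hello, World]
    }
}
```

For a real file, a FileReader or Files.newBufferedReader for the .gexf path could be passed in place of the StringReader.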
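I think you can try this little method. At its core it uses a Regular Expression (RegEx) with Pattern/Matcher to retrieve the desired substring from between tags.
It is important to read the documentation supplied with the method: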
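// Imports needed by the class that contains this method:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**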
 * This method will retrieve a string contained between string tags. You
 * specify what the starting and ending tags are within the startTag and
 * endTag parameters. It is you who determines what the start and end tags
 * are to be which can be any strings.<br><br>
 *
 * @param inputString (String) Any string to process.<br>
 *
 * @param startTag (String) The Start Tag or String. Data content retrieved
 * will be directly after this tag.<br><br>
 *
 * The supplied Start Tag criteria can contain a single special wildcard tag
 * (~*~) providing you also place something like the closing chevron (>)
 * for an HTML tag after the wildcard tag, for example:<pre>
 *
 * If we have a string which looks like this:
 *      {@code
 *      "<p style=\"padding-left:40px;\">Hello</p>"
 *      }
 *      (Note: to pass double quote marks in a string they must be escaped)
 *
 * and we want to use this method to extract the word "Hello" from between the
 * two HTML tags then your Start Tag can be supplied as "&lt;p~*~&gt;" and of course
 * your End Tag can be "&lt;/p&gt;". The "&lt;p~*~&gt;" would be the same as supplying
 * "&lt;p style=\"padding-left:40px;\"&gt;". Anything between the characters &lt;p and
 * the supplied close chevron (&gt;) is taken into consideration. This allows for
 * contents extraction regardless of what HTML attributes are attached to the
 * tag. The use of a wildcard tag (~*~) is also allowed in a supplied End
 * Tag.</pre><br>
 *
 * The wildcard is used as a special tag so that strings that actually
 * contain asterisks (*) can be processed as regular asterisks.<br>
 *
 * @param endTag (String) The End Tag or String. Data content retrieval will
 * end just before this Tag is reached.<br>
 *
 * The supplied End Tag criteria can contain a single special wildcard tag
 * (~*~) providing you also place something like the closing chevron (&gt;)
 * for an HTML tag after the wildcard tag, for example:<pre>
 *
 * If we have a string which looks like this:
 *      {@code
 *      "<p style=\"padding-left:40px;\">Hello</p>"
 *      }
 *      (Note: to pass double quote marks in a string they must be escaped)
 *
 * and we want to use this method to extract the word "Hello" from between the
 * two HTML tags then your Start Tag can be supplied as "&lt;p style=\"padding-left:40px;\"&gt;"
 * and your End Tag can be "&lt;/~*~&gt;". The "&lt;/~*~&gt;" would be the same as supplying
 * "&lt;/p&gt;". Anything between the characters &lt;/ and the supplied close chevron (&gt;)
 * is taken into consideration. This allows for contents extraction regardless of what the
 * HTML tag might be. The use of a wildcard tag (~*~) is also allowed in a supplied Start Tag.</pre><br>
 *
 * The wildcard is used as a special tag so that strings that actually
 * contain asterisks (*) can be processed as regular asterisks.<br>
 *
 * @param trimFoundData (Optional - Boolean - Default is true) By default
 * all retrieved data is trimmed of leading and trailing white-spaces. If
 * you do not want this then supply false to this optional parameter.
 *
 * @return (1D String Array) If there is more than one pair of Start and End
 * Tags contained within the supplied input String then each set is placed
 * into the Array separately.<br>
 *
 * @throws IllegalArgumentException if any supplied method String argument
 * is null or empty ("").
 */
public static String[] getBetweenTags(String inputString, String startTag,
        String endTag, boolean... trimFoundData) {
    if (inputString == null || inputString.equals("") || startTag == null ||
            startTag.equals("") || endTag == null || endTag.equals("")) {
        throw new IllegalArgumentException("\ngetBetweenTags() Method Error! - "
                + "A supplied method argument contains Null (\"\")!\n"
                + "Supplied Method Arguments:\n"
                + "==========================\n"
                + "inputString = \"" + inputString + "\"\n"
                + "startTag = \"" + startTag + "\"\n"
                + "endTag = \"" + endTag + "\"\n");
    }

    List<String> list = new ArrayList<>();
    boolean trimFound = true;
    if (trimFoundData.length > 0) {
        trimFound = trimFoundData[0];
    }

    Matcher matcher;
    if (startTag.contains("~*~") || endTag.contains("~*~")) {
        startTag = startTag.replace("~*~", ".*?");
        endTag = endTag.replace("~*~", ".*?");
        // (?s) lets the captured content span line breaks, as in the non-wildcard branch below.
        Pattern pattern = Pattern.compile("(?iu)" + startTag + "(?s)(.*?)" + endTag);
        matcher = pattern.matcher(inputString);
    } else {
        String regexString = Pattern.quote(startTag) + "(?s)(.*?)" + Pattern.quote(endTag);
        Pattern pattern = Pattern.compile("(?iu)" + regexString);
        matcher = pattern.matcher(inputString);
    }

    while (matcher.find()) {
        String match = matcher.group(1);
        if (trimFound) {
            match = match.trim();
        }
        list.add(match);
    }
    return list.toArray(new String[list.size()]);
}
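
To tie the method back to the question, here is a reduced sketch of the non-wildcard branch only (Pattern.quote plus a lazy, DOTALL capture). The class BetweenTagsDemo, the helper between and the sample XML are made up for illustration. It also suggests why the pattern in the question finds nothing: \w+ only matches word characters, while the content of the nodes element presumably contains whitespace, line breaks and child tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; a cut-down variant of the method above without wildcard support.
class BetweenTagsDemo {

    /** Returns the trimmed text found between each startTag/endTag pair. */
    static List<String> between(String input, String startTag, String endTag) {
        // Pattern.quote treats the tags as literal text; DOTALL lets '.' match line breaks.
        Pattern p = Pattern.compile(
                Pattern.quote(startTag) + "(.*?)" + Pattern.quote(endTag),
                Pattern.DOTALL);
        Matcher m = p.matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample; the real .gexf content was not shown in the question.
        String xml = "<nodes>\n  <node id=\"0\" label=\"Hello\"/>\n</nodes>";

        // The question's pattern: \w+ cannot cross whitespace, '<' or quotes.
        System.out.println(Pattern.compile("<nodes>(\\w+)</nodes>").matcher(xml).find()); // prints false

        System.out.println(between(xml, "<nodes>", "</nodes>")); // prints [<node id="0" label="Hello"/>]
    }
}
```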
