Java: extracting information from text

java regex nlp

I have the following text:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.              

Name                                 Group                       12345678        
ALEX A ALEX                                                                   
ID#                                  PUBLIC NETWORK                  
XYZ123456789                                                                  


Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

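I want to extract the ID value that sits under the ID# keyword in the text.

The problem is that in different text files the ID can be located in different places, for example in the middle of other text, like this: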

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's          
standard dummy text ever since the 1500s, when an unknown printer took a     XYZ123456789    galley of type and scrambled it to make a type specimen book.

In addition, there can be extra lines between ID# and the value:

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's      
printing and typesetting industry. Lorem Ipsum has been the                                  printing and typesetting industry. Lorem Ipsum has been the 
standard dummy text ever since the 1500s, when an unknown printer took a     XYZ123456789    galley of type and scrambled it to make a type specimen book.

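How can I extract this ID value? Is there a standard technique that can be used to extract this kind of information, for example regular expressions or some approach built on top of them? Is it possible to apply NLP here?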

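Here is a suggestion off the top of my head. The general idea is to convert the source text into an array (or list) of lines and then iterate through them until you find the "ID#" token. Once you know the position of ID# within that line, you can iterate through the remaining lines looking for text at that same position. This should work with the examples you have given, although any different layout may cause a wrong value to be returned.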

String s = null; //your source text
String idValue = null; //what we'll assign the ID value to

//format the string into lines
String[] lines = s.split("\\r?\\n"); //this handles both Windows and Unix-style line termination

//go through the lines looking for the ID# token and storing its horizontal position in the line
for (int i=0; i<lines.length; i++) {
    String line = lines[i];
    int startIndex = line.indexOf("ID#");

    //if we found the ID token, then go through the remaining lines starting from the next one down
    if (startIndex > -1) {
        for (int j=i+1; j<lines.length; j++) {
            line = lines[j];

            //check if this line is long enough
            if (line.length() > startIndex) {

                //remove everything prior to the index where the ID# token was
                line = line.substring(startIndex);

                //if the line starts with a space then it's not an ID
                if (!line.startsWith(" ")) {

                    //look for the first whitespace after the ID value we've found
                    int endIndex = line.indexOf(" ");

                    //if there's no end index, then the ID is at the end of the line
                    if (endIndex == -1) {
                        idValue = line;
                    } else {
                        //if there is an end index, then remove everything to just leave the ID value
                        idValue = line.substring(0, endIndex);
                    }

                    break;
                }
            }
        }

        break;
    }

}
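
The asker notes in the comments that the value is sometimes shifted a few spaces to the right of ID#. A minimal sketch of the same line-scanning idea with a small tolerance for that shift is given here. The class name ColumnIdExtractor, the helper tokenAt and the maxShift parameter are hypothetical names introduced for illustration, and the check that rejects a column falling in the middle of a word is an added assumption, not part of the code above.

```java
final class ColumnIdExtractor {

    /**
     * Returns the first token found under the "ID#" marker, or null if none.
     * maxShift (hypothetical parameter) is how many columns to the right of
     * the marker's column the value is allowed to start.
     */
    static String extractId(String text, int maxShift) {
        String[] lines = text.split("\\r?\\n"); // Windows and Unix line endings
        for (int i = 0; i < lines.length; i++) {
            int column = lines[i].indexOf("ID#");
            if (column < 0) {
                continue; // marker is not on this line
            }
            // Marker found: scan the lines below it at the same column.
            for (int j = i + 1; j < lines.length; j++) {
                String value = tokenAt(lines[j], column, maxShift);
                if (value != null) {
                    return value;
                }
            }
            return null; // marker found, but nothing usable underneath
        }
        return null; // no marker at all
    }

    /** Returns the token starting at column..column+maxShift, or null. */
    private static String tokenAt(String line, int column, int maxShift) {
        if (line.length() <= column) {
            return null; // line is too short to reach the column
        }
        // Assumption: a value never starts in the middle of a word.
        if (column > 0 && !Character.isWhitespace(line.charAt(column - 1))) {
            return null;
        }
        int start = column;
        int limit = Math.min(line.length(), column + maxShift + 1);
        while (start < limit && Character.isWhitespace(line.charAt(start))) {
            start++; // skip the allowed leading spaces
        }
        if (start == limit) {
            return null; // only whitespace inside the allowed window
        }
        int end = start;
        while (end < line.length() && !Character.isWhitespace(line.charAt(end))) {
            end++;
        }
        return line.substring(start, end);
    }
}
```

Calling ColumnIdExtractor.extractId(text, 3) allows the value to start up to three columns to the right of the marker, and a maxShift of 0 requires exact column alignment as in the code above. An unrelated word that happens to start inside that window on an intermediate line would still be returned, so the tolerance is best kept small.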

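It seems the ID value has no definite format, so a single one-line regular expression will not help here.
You have to use two regular expressions to get the expected output. The first one is: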
(?m)^(.*)ID#.*([\s\S]*)

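It tries to find ID# within a line and captures two blocks of text. The first block is everything from the start of that line up to ID#, and the second is everything that comes after the line containing ID#.
Then we compute the length of the first capturing group. This gives us the column number at which we should start looking for the ID in the following lines: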
m.group(1).length();

Then we build a second regular expression that uses this length (X stands for that length):
(?m)^.{X}(?<!\S)\h{0,3}(\S+)
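
A minimal sketch of how the two expressions could be wired together in Java is shown here. The class name TwoRegexIdExtractor and the method name extractId are hypothetical; the patterns are the two given in this answer, with X filled in from the length of the first capturing group. The second pattern is applied only to the text after the ID# line, which is what the second capturing group of the first pattern holds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoRegexIdExtractor {

    // group(1): text before ID# on its line; group(2): everything after that line
    private static final Pattern HEADER =
            Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)");

    /** Returns the token found under the ID# column, or null if there is none. */
    static String extractId(String text) {
        Matcher header = HEADER.matcher(text);
        if (!header.find()) {
            return null; // no ID# marker in the text
        }
        int column = header.group(1).length(); // column where ID# starts
        String rest = header.group(2);         // lines below the ID# line

        // Skip exactly `column` characters from the start of a line, require
        // that the previous character is not part of a word, allow up to three
        // horizontal spaces, then capture the run of non-whitespace characters.
        Pattern value = Pattern.compile(
                "(?m)^.{" + column + "}(?<!\\S)\\h{0,3}(\\S+)");
        Matcher m = value.matcher(rest);
        return m.find() ? m.group(1) : null;
    }
}
```

Lines that are too short, or that contain only spaces at that column, simply fail to match, so find() moves on to the next line. As with the column-scanning approach, an intermediate line with an unrelated word starting within three columns of the marker would be returned instead of the ID.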

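Is the first letter of the ID value always in the column under the I of ID#?
Sometimes it can be shifted to the right by spaces (relative to ID#).
What is the format of the ID? Maybe that would be easier to work with.
@Jan In most cases I think it can only contain digits and letters, but unfortunately there are many other similar structures in the text.
Similar to?
This assumes the ID is at least 3 characters long.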