StackOverflowError in a Java regular expression
I am using regular expressions to extract some strings from several documents. I am stuck on a StackOverflowError that is caused by one particular regular expression; without that expression the program runs smoothly. My code:
package com.gauge.ie.Annotator;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.print.attribute.Size2DSyntax;
import org.apache.commons.io.FilenameUtils;
import org.apache.uima.util.FileUtils;
public class RecursiveFileDisplay
{
static List<String> misclist=new ArrayList<String>();
static List<String> list=new ArrayList<String>();
static LinkedHashMap<String,String> authormap=new LinkedHashMap<>();
static List<String> random=new ArrayList<String>();
static List<String> benchlist=new ArrayList<String>();
static LinkedHashMap<String,String> benchmap=new LinkedHashMap<>();
static List<String> misc1list=new ArrayList<String>();
String csvfile="/home/gauge/Documents/Docs/madras.csv";
FileWriter fw;
public RecursiveFileDisplay()throws IOException
{
fw=new FileWriter("/home/gauge/Documents/Docs/supremecourt.csv",true);
// TODO Auto-generated constructor stub
}
public static void main(String[] args) throws Exception
{
RecursiveFileDisplay rsd=new RecursiveFileDisplay();
File currentDir = new File("/home/gauge/Documents/Docs/SampleData/SupremeCourt");
rsd.readFilesFromDirectory(currentDir);
System.out.println(benchlist.size());
System.out.println(list.size());
System.out.println(random.size());
rsd.writeCSV();
}
public void writeCSV()throws IOException
{
for(String str:list)
{
fw.append(str);
fw.append("\n");
fw.flush();
}
System.out.println("Csv file is done!");
}
public void readFilesFromDirectory(File dir)
{
try
{
int i=0;
Pattern p1=Pattern.compile("(Author):(.*)");
Pattern p=Pattern.compile("(Bench:)(.*)");
Pattern p2=Pattern.compile("JUDGMENT(.*?)J[.]");
Pattern p3=Pattern.compile("(([H|h]on)|(HON)).*((ble)|BLE)(.*)");
//Pattern p4=Pattern.compile(",\\s*([^,]+),[^,]*\\b(J|JJ)\\.");//\s\w*(?=\w*[,]\sJ[.]*\b)
Pattern p5=Pattern.compile("\\s\\w*(?=\\w*[,]\\sJ[.]*\\b)");
Pattern p4=Pattern.compile("\\w*(?=\\w*[,]*\\s*((JJ)|(L.J)|(C.J)|(J))[.]\\s\\b)");
Pattern p6=Pattern.compile("(BENCH:)((.|\\n)*)(BENCH)((.|\\n)*)(CITATION)");
File[] listfiles=dir.listFiles();
for(File file:listfiles)
{
if(file.isFile())
{
String str="";
String line="";
BufferedReader br=new BufferedReader(new FileReader(file));
while((line=br.readLine())!=null)
{
str+=line+"\n";
}
Matcher match=p.matcher(str);
Matcher match1=p1.matcher(str);
Matcher match2=p2.matcher(str);
Matcher match3=p3.matcher(str);
Matcher match4=p4.matcher(str);
Matcher match5=p5.matcher(str);
Matcher match6=p6.matcher(str);
if(match.find())
{
if(match1.find())
{
list.add(file.toString()+"\t"+match.group(2)+"\t"+match1.group(2)); //filename, judgename ,authorname
System.out.println(match1.group(2));
}
else
{
list.add(file.toString()+"\t"+match.group(2)+"\t"+" ");
System.out.println(match.group(2));
}
}
else if(match1.find())
{
list.add(file.toString()+"\t"+" "+"\t"+match1.group(2));
}
else if(match2.find())
{
list.add(file.toString()+"\t"+match2.group()+"\t"+" ");
}
else if(match3.find())
{
list.add(file.toString()+"\t"+match3.group()+"\t"+" ");
}
else if(match4.find())
{
//do nothing
}
else if(match5.find())
{
list.add(file.toString()+"\t"+match5.group()+"\t"+" ");
System.out.println(file.toString());
}
else if(match6.find())
{
System.out.println("lalalalal");
}
else
{
misclist.add(file.toString()); //list of documents which have no Judgenames
String name = UUID.randomUUID().toString();
PrintWriter pw=new PrintWriter("/home/gauge/Documents/Docs/Misc"+"/"+name);
pw.write(str);
pw.flush();
}
}
else if(file.isDirectory())
{
readFilesFromDirectory(file.getAbsoluteFile());
System.out.println("recursion");
}
}
}
catch(StackOverflowError soe)
{
soe.printStackTrace();
System.err.print(soe);
}
catch (Exception e)
{
e.printStackTrace();
System.err.print(e);
}
}
}
The problem comes from the (.|\\n)* part of p6:
Pattern p6=Pattern.compile("(BENCH:)((.|\\n)*)(BENCH)((.|\\n)*)(CITATION)");
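On the Oracle/OpenJDK JRE, (.|\\n)* is compiled into a node structure whose implementation uses recursion (a GroupTail node points back to a Loop node) to match repetitions of a non-deterministic pattern; the implementation always treats an alternation as non-deterministic.
On a long string the stack is exhausted, and you get a StackOverflowError.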
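The effect described above can be probed with a small experiment. This is only a sketch: the class name, helper names, and generated input are made up for illustration, and whether the alternation version actually overflows depends on the JVM's thread stack size and the input length, so the program reports what happens instead of assuming it.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Builds a multi-line input (hypothetical test data) that contains the three markers.
    static String buildInput(int lines) {
        StringBuilder sb = new StringBuilder("BENCH:\n");
        for (int i = 0; i < lines; i++) {
            sb.append("some line of judgment text\n");
        }
        return sb.append("BENCH\nmore text\nCITATION\n").toString();
    }

    // Returns true if the pattern finds a match, false if matching exhausted the stack.
    static boolean findsWithoutOverflow(Pattern p, String input) {
        try {
            return p.matcher(input).find();
        } catch (StackOverflowError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String input = buildInput(50_000);
        // The pattern from the question: alternation inside a repeated group.
        Pattern alternation =
            Pattern.compile("(BENCH:)((.|\\n)*)(BENCH)((.|\\n)*)(CITATION)");
        // The same idea with a plain dot and the DOTALL flag.
        Pattern dotAll =
            Pattern.compile("(BENCH:)(.*)(BENCH)(.*)(CITATION)", Pattern.DOTALL);
        System.out.println("alternation matched: " + findsWithoutOverflow(alternation, input));
        System.out.println("DOTALL matched:      " + findsWithoutOverflow(dotAll, input));
    }
}
```

Catching StackOverflowError is done here only to observe the failure; it is not a recommended way to handle it in production code.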
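If you want to match any character without exception, use . on its own, combined with the Pattern.DOTALL flag.
- You can pass the flag to the Pattern.compile method to turn it on for the whole expression: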
Pattern p6 = Pattern.compile("(BENCH:)(.*)(BENCH)(.*)(CITATION)", Pattern.DOTALL);
- Or, as Jonny 5 suggested in the comments, you can use the inline flag: (?s)
- Or you can turn the flag on for a sub-pattern only: (?s:.*)
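The three ways of enabling DOTALL listed above can be compared side by side. The following is a minimal sketch; the class name, field names, and sample text are invented for the illustration, and the patterns are shortened versions of p6.

```java
import java.util.regex.Pattern;

class DotAllFlagForms {
    // Flag passed to Pattern.compile: applies to the whole expression.
    static final Pattern FLAG_ARG = Pattern.compile("BENCH:(.*)CITATION", Pattern.DOTALL);
    // Inline flag (?s): also applies to everything that follows it.
    static final Pattern INLINE = Pattern.compile("(?s)BENCH:(.*)CITATION");
    // Scoped flag (?s:...): applies only inside the non-capturing group.
    static final Pattern SCOPED = Pattern.compile("BENCH:((?s:.*))CITATION");
    // No flag: the dot does not match line terminators.
    static final Pattern NO_FLAG = Pattern.compile("BENCH:(.*)CITATION");

    public static void main(String[] args) {
        String text = "BENCH:\nA\nB\nCITATION";
        System.out.println(FLAG_ARG.matcher(text).find()); // true
        System.out.println(INLINE.matcher(text).find());   // true
        System.out.println(SCOPED.matcher(text).find());   // true
        System.out.println(NO_FLAG.matcher(text).find());  // false
    }
}
```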
By the way, are you sure you want p3 to match a string such as |onrable?
Pattern p3 = Pattern.compile("(([H|h]on)|(HON)).*((ble)|BLE)(.*)");
If not, remove the | from the character class:
Pattern p3 = Pattern.compile("(([Hh]on)|(HON)).*((ble)|BLE)(.*)");
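To see why the | matters, here is a small check using only the leading fragment of p3. The class and field names are made up for the example. Inside square brackets, | is an ordinary character, not an alternation operator.

```java
import java.util.regex.Pattern;

class PipeInCharClass {
    // [H|h] is a class of three characters: 'H', '|' and 'h'.
    static final Pattern WITH_PIPE = Pattern.compile("[H|h]on");
    // [Hh] is a class of two characters: 'H' and 'h'.
    static final Pattern WITHOUT_PIPE = Pattern.compile("[Hh]on");

    public static void main(String[] args) {
        System.out.println(WITH_PIPE.matcher("|on").matches());    // true
        System.out.println(WITHOUT_PIPE.matcher("|on").matches()); // false
        System.out.println(WITHOUT_PIPE.matcher("Hon").matches()); // true
        System.out.println(WITHOUT_PIPE.matcher("hon").matches()); // true
    }
}
```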
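I also see too many capturing groups; please check whether they are really necessary. The alternation in (.|\n)* pushes onto the stack and captures for every single character. Once the string to match is long enough, the stack overflows.
One option is to change it to .* and use the DOTALL option. Another is to look carefully at what you are trying to capture and why, and then either use a different regular expression that achieves the same result or build your own simple state machine that scans the character stream.
It looks like you are either grepping recursively through a directory (in which case you could simply use grep) or trying to parse something out (in which case you could build a parser).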
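As one illustration of the "scan it yourself" option, the span that p6 looks for can be located with plain String.indexOf calls, which involve no regex recursion at all. This is a sketch under assumptions: the class and method names are hypothetical, and it is intended to find the same span as the lazy pattern BENCH:.*?BENCH.*?CITATION, not to replace the other patterns.

```java
class BenchSectionScanner {
    // Returns the text from the first "BENCH:" through the first "CITATION"
    // that follows a later "BENCH", or null if the three markers do not
    // occur in that order.
    static String extractBenchSection(String text) {
        int start = text.indexOf("BENCH:");
        if (start < 0) {
            return null;
        }
        int secondBench = text.indexOf("BENCH", start + "BENCH:".length());
        if (secondBench < 0) {
            return null;
        }
        int citation = text.indexOf("CITATION", secondBench + "BENCH".length());
        if (citation < 0) {
            return null;
        }
        return text.substring(start, citation + "CITATION".length());
    }

    public static void main(String[] args) {
        String text = "x BENCH:\nA\nBENCH\nB\nCITATION y";
        System.out.println(extractBenchSection(text));
        System.out.println(extractBenchSection("no markers here")); // null
    }
}
```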
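Use a StringBuilder when concatenating a large number of strings. Better still, read the whole file in one call, as the Files.readAllBytes call in the code at the end of this answer does, which avoids the hassle and garbage-collection cost of turning every line into its own String.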
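Both approaches can be sketched as follows. The class and method names are hypothetical, and UTF-8 is an assumption about the input files (the question's FileReader uses the platform default charset instead).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadWholeFile {
    // Line by line, appending to one StringBuilder instead of using str += line.
    static String readWithBuilder(Path path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Whole file in one call; no per-line String objects are created.
    static String readAllAtOnce(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    // Writes the content to a temporary file and checks that both readers agree.
    static boolean bothReadersAgree(String content) {
        try {
            Path tmp = Files.createTempFile("judgment", ".txt");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                return readWithBuilder(tmp).equals(readAllAtOnce(tmp));
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bothReadersAgree("BENCH:\nA\nCITATION\n"));
    }
}
```

Note that the line-by-line version normalizes every line ending to \n and always ends with one, so the two results can differ for files that use \r\n or lack a final newline.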
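Remove any grouping that is not actually needed for alternation or capturing. Any group that is needed only for alternation and not for capturing should start with ?:. You may want to use named groups for the rest. Test your regular expressions in a tool first; some tools will even generate a code snippet for you.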
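Applied to p3, the advice above might look like the sketch below. The group name rest, the class name, and the sample input are made up for illustration; the pattern is a variant of p3, not the answerer's own rewrite.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupKinds {
    // (?:...) groups for alternation only and captures nothing;
    // (?<rest>...) captures under a name instead of a number.
    static final Pattern HON =
        Pattern.compile("(?:[Hh]on|HON).*(?:ble|BLE)(?<rest>.*)");

    public static void main(String[] args) {
        Matcher m = HON.matcher("Hon'ble Mr. Justice X"); // made-up sample text
        if (m.find()) {
            System.out.println(m.groupCount());           // 1
            System.out.println("[" + m.group("rest") + "]"); // [ Mr. Justice X]
        }
    }
}
```

Only one capturing group remains, so there is less bookkeeping per match and the code no longer depends on counting parentheses.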
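Consider what | means inside a square-bracket character class and, similarly, what is gained by a bracketed class that contains only one character, such as [.] or [,].
Pattern.compile is a static method and needs to be called only once per pattern, not on every recursive call. For example, take those lines out of the method and put them in the class (I have edited p6):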
// Note: with fewer groups, match.group(2) in the question's code becomes group(1) for p and p1.
private static final Pattern p = Pattern.compile("Bench:(.*)");
private static final Pattern p1 = Pattern.compile("Author:(.*)");
private static final Pattern p2 = Pattern.compile("JUDGMENT(.*?)J\\.");
private static final Pattern p3 = Pattern.compile("[Hh](on|ON).*(ble|BLE)(.*)");
// Alternative p4, commented out in the question; only one p4 may be declared:
// private static final Pattern p4 = Pattern.compile(",\\s*([^,]+),[^,]*\\b(J|JJ)\\.");
private static final Pattern p5 = Pattern.compile("\\s\\w*(?=\\w*,\\sJ\\.*\\b)"); //? [.]* ?
private static final Pattern p4 =
Pattern.compile("\\w*(?=\\w*,*\\s*(JJ|L.J|C.J|J)\\.\\s\\b)"); //? [,]* ?
private static final Pattern p6 =
Pattern.compile("BENCH:.*?BENCH.*?CITATION", Pattern.DOTALL);
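If you get into the habit of spacing your code out from the start, you will more easily spot things like "path/dir"+"/"+name and "\t"+" ": written as "path/dir" + "/" + name and "\t" + " ", they obviously combine into "path/dir/" + name and "\t ".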
Finally, I would narrow the scope of the matchers and change the brace style, though that may just be me:
package com.you.take.me.to.funky;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class town {
private static List<String> list = new ArrayList<>();
private static List<String> misclist = new ArrayList<>();
private static Pattern p0 = Pattern.compile("(Bench:)(.*)");
private static Pattern p1 = Pattern.compile("(Author):(.*)");
private static Pattern p2 = Pattern.compile("JUDGMENT(.*?)J[.]");
private static Pattern p3 = Pattern.compile("(([H|h]on)|(HON)).*((ble)|BLE)(.*)");
// Alternative p4 (CAKE); commented out because p4 is declared below:
// private static Pattern p4 = Pattern.compile(",\\s*([^,]+),[^,]*\\b(J|JJ)\\.");
private static Pattern p5 = Pattern.compile("\\s\\w*(?=\\w*[,]\\sJ[.]*\\b)");
private static Pattern p4 =
Pattern.compile("\\w*(?=\\w*[,]*\\s*((JJ)|(L.J)|(C.J)|(J))[.]\\s\\b)");
private static Pattern p6 =
Pattern.compile("BENCH:.*?BENCH.*?CITATION", Pattern.DOTALL);
public static void main(String[] args) {
Path path = Paths.get(args[0]);
String str;
try {
str = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
} catch (IOException e) {
e.printStackTrace();
return; // str was never assigned, so there is nothing to match
}
Matcher match = p0.matcher(str);
if (match.find()) {
Matcher match1 = p1.matcher(str);
if (match1.find()) {
// pathname judgename authorname
list.add(path.toString() +
"\t" + match.group(2) +
"\t" + match1.group(2));
System.out.println(match1.group(2));
} else {
list.add(path.toString() + "\t" + match.group(2) + "\t ");
System.out.println(match.group(2));
}
} else {
match = p1.matcher(str);
if (match.find()) {
list.add(path.toString() + "\t \t" + match.group(2));
} else {
match = p2.matcher(str);
if (match.find()) {
list.add(path.toString() + "\t" + match.group() + "\t ");
} else {
match = p3.matcher(str);
if (match.find()) {
list.add(path.toString() + "\t" + match.group() + "\t ");
} else {
match = p4.matcher(str);
if (match.find()) {
//do nothing
} else {
match = p5.matcher(str);
if (match.find()) {
list.add(path.toString() + "\t" + match.group() + "\t ");
System.out.println(path.toString());
} else {
match = p6.matcher(str);
if (match.find()) {
System.out.println("DEBUG MARKER");
} else {
// list of documents which have no Judgenames
misclist.add(path.toString());
String name = UUID.randomUUID().toString();
// try-with-resources closes the writer even if write() fails
try (PrintWriter pw = new PrintWriter("/h/g/d/d/m/" + name)) {
pw.write(str);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
}
}
}
}
}
}
}
}