How to discard certain parts of a CSV file using Java regular expressions


java regex


I have a CSV file that I have to parse with Java:

2012-11-01 00,  1106,   2194.1971066908
2012-11-01 01,  760,    1271.8460526316
.
.
.
2012-11-30 21,  1353,   1464.0014781966
2012-11-30 22,  1810,   1338.8331491713
2012-11-30 23,  1537,   1222.7826935589
        
720 rows selected.      
        
Elapsed: 00:37:00.23

This is the Java code I wrote to isolate each column and store it in a list:

public void extractFile(String fileName){
        try{
            BufferedReader bf = new BufferedReader(new FileReader(fileName));
            try {
                String readBuff = bf.readLine();
                
                while (readBuff!=null){
                    
                    Pattern checkData = Pattern.compile("[a-zA-Z]");
                    Matcher match = checkData.matcher(readBuff);
                    
                    if (match.find()){
                        readBuff = null;
                    }
                    
                    else if (!match.find()){
                        
                        String[] splitReadBuffByComma = new String[3];
                        splitReadBuffByComma = readBuff.split(",");
                        
                            for (int x=0; x<splitReadBuffByComma.length; x++){
                                
                                if (x==0){
                                    dHourList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==1){
                                    throughputList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==2){
                                    avgRespTimeList.add(splitReadBuffByComma[x]);
                                }
                            }
                    }
                    
                    readBuff = bf.readLine();
                }
            }
            finally{
                bf.close();
            }
        }
        catch(FileNotFoundException e){
            System.out.println("File not found dude: "+ e);
        }
        catch(IOException e){
            System.out.println("Error Exception dude: "+e);
        }
    }

But the output still contains "720 rows selected." and one more line that should not be there.

Update 2, working code:

while (readBuff!=null){
                    
                    
                    Pattern checkData = Pattern.compile("^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");
                    
                    Matcher match = checkData.matcher(readBuff);
                    
                    if (!match.find()){
                        readBuff = null;
                    }
                    
                    else{
                        
                        String[] splitReadBuffByComma = new String[3];
                        splitReadBuffByComma = readBuff.split(",");
                        
                            for (int x=0; x<splitReadBuffByComma.length; x++){
                                
                                if (x==0){
                                    dHourList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==1){
                                    throughputList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==2){
                                    avgRespTimeList.add(splitReadBuffByComma[x]);
                                }
                            }
                    }
                    
                    readBuff = bf.readLine();
                }
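
A short sketch of why the plain else branch in this loop matters. Matcher.find() resumes after the previous successful match unless the matcher is reset, so a second find() on the same line does not repeat the first test. The class and method names below (FindTwiceDemo, findTwice) are made up for illustration; the pattern is the one from the working code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindTwiceDemo {
    // The date-prefix pattern from the working code, compiled once.
    static final Pattern DATE_LINE = Pattern.compile(
        "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");

    // Returns the results of two consecutive find() calls on one Matcher.
    static boolean[] findTwice(String line) {
        Matcher m = DATE_LINE.matcher(line);
        boolean first = m.find();   // searches from the start of the line
        boolean second = m.find();  // resumes after the end of the first match
        return new boolean[] { first, second };
    }

    public static void main(String[] args) {
        boolean[] r = findTwice("2012-11-01 00,  1106,   2194.1971066908");
        System.out.println(r[0] + " " + r[1]); // prints "true false"
    }
}
```

Because the first match already consumed the whole line, the second call has nothing left to match, which is why testing find() once and branching with if/else is the safer shape.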

Thanks a lot.

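First, it makes sense to insert a "^" at the beginning of the checkData regular expression. The expression will then only look at the start of the line instead of searching the whole string, which makes it faster.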

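You could also start the regex with something that looks more like the date format (for example, 4 digits and a dash), because in the trailing lines the row count is never followed by a dash.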

Maybe something like this:

Pattern checkData = Pattern.compile("^\\d\\d\\d\\d-");

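If you are sure you will not get unexpected data, this is enough. If you want the program to also cope with malformed CSV data, extend the regex so that it captures the whole line and use matches().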
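
A minimal sketch of that whole-line approach, under stated assumptions: the pattern below is not from the answer, it is a guess at the column layout based on the sample rows (date plus hour, an integer, a decimal number), and the names WholeLineMatch, DATA_LINE and parse are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeLineMatch {
    // Assumed layout: "yyyy-MM-dd HH, <integer>, <decimal>" with optional padding.
    static final Pattern DATA_LINE = Pattern.compile(
        "(\\d{4}-\\d{2}-\\d{2} \\d{2}),\\s*(\\d+),\\s*(\\d+(?:\\.\\d+)?)\\s*");

    // Returns the three columns, or empty if the line is not a data row.
    static Optional<String[]> parse(String line) {
        Matcher m = DATA_LINE.matcher(line);
        if (!m.matches()) {            // matches() requires the entire line to fit
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2), m.group(3) });
    }

    public static void main(String[] args) {
        System.out.println(parse("2012-11-01 00,  1106,   2194.1971066908").isPresent());
        System.out.println(parse("720 rows selected.      ").isPresent());
    }
}
```

Since matches() only succeeds when the whole line fits the pattern, the capture groups can feed the three lists directly and no separate split(",") is needed.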

Try this [your code, slightly modified]:

public void extractFile(String fileName){
        try{
            BufferedReader bf = new BufferedReader(new FileReader(fileName));
            try {
                String readBuff = bf.readLine();

                while (readBuff!=null){

                    Pattern checkData = Pattern.compile("^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");
                    Matcher match = checkData.matcher(readBuff);

                    if (!match.find()){
                        readBuff = null;
                    }

                    else {  // a second find() here would resume after the first match and fail

                        String[] splitReadBuffByComma = new String[3];
                        splitReadBuffByComma = readBuff.split(",");

                            for (int x=0; x<splitReadBuffByComma.length; x++){

                                if (x==0){
                                    dHourList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==1){
                                    throughputList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==2){
                                    avgRespTimeList.add(splitReadBuffByComma[x]);
                                }
                            }
                    }

                    readBuff = bf.readLine();
                }
            }
            finally{
                bf.close();
            }
        }
        catch(FileNotFoundException e){
            System.out.println("File not found dude: "+ e);
        }
        catch(IOException e){
            System.out.println("Error Exception dude: "+e);
        }
    }

Update

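As far as I understand, your input contains many lines that start with a date but do not contain commas. To handle that, change the previous pattern to the following: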

^(19|20)\d\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\s+\d+,[^,]+,[^,]+$

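Or, escaped for a Java string literal: double every backslash, so that \d becomes \\d, \2 becomes \\2 and \s becomes \\s.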
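
A small sketch comparing the two patterns from this answer. The class name PatternComparison and the sample line without commas are invented for illustration; only the two regular expressions come from the answer.

```java
import java.util.regex.Pattern;

final class PatternComparison {
    // Earlier pattern: a date at the start, then anything.
    static final Pattern LOOSE = Pattern.compile(
        "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");

    // Updated pattern: date, hour, then exactly two more comma-separated fields.
    static final Pattern STRICT = Pattern.compile(
        "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\s+\\d+,[^,]+,[^,]+$");

    static boolean loose(String line)  { return LOOSE.matcher(line).find(); }
    static boolean strict(String line) { return STRICT.matcher(line).find(); }

    public static void main(String[] args) {
        String data = "2012-11-01 00,  1106,   2194.1971066908";
        String noCommas = "2012-11-01 report generated"; // hypothetical line
        System.out.println(loose(data) + " " + strict(data));
        System.out.println(loose(noCommas) + " " + strict(noCommas));
    }
}
```

Both patterns accept the data row, but only the stricter one rejects a line that merely starts with a date.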
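You don't have to use a regex (if the data looks like your example). You can check:

- whether the line contains a comma ",", or
- whether the array you get from split has length 3, or
- change the while condition a bit and break out when the line ends with "selected".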
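
The checks above can be sketched without any regex. This is only an illustration under assumptions: the class and method names (SplitFilter, keepDataRows) are hypothetical, the lines are taken from a list rather than read from the file, and the footer test trims the line and looks for "selected." with its trailing period, as in the sample output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitFilter {
    // Keeps only rows that split into exactly three comma-separated fields
    // and stops at the "... rows selected." footer.
    static List<String[]> keepDataRows(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().endsWith("selected.")) {
                break;                       // footer reached, nothing useful follows
            }
            String[] parts = line.split(",");
            if (parts.length != 3) {
                continue;                    // blank or non-data line
            }
            for (int i = 0; i < parts.length; i++) {
                parts[i] = parts[i].trim();  // drop the column padding
            }
            rows.add(parts);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2012-11-01 00,  1106,   2194.1971066908",
            "        ",
            "720 rows selected.      ",
            "Elapsed: 00:37:00.23");
        System.out.println(keepDataRows(lines).size()); // prints 1
    }
}
```

Either check alone filters the sample input; combining them just makes the loop stop early once the footer appears.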
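See.
We are not allowed to use third-party APIs.
Hi, I tried your suggestion and printed the contents of dHourList. It still shows "720 rows selected." and "Elapsed: ...". I have also edited the question with the new regex you suggested; please check the update.
Hi, I was able to get it working. Your solution does work, it just needed a few changes in places. See Update 2. Thanks a lot.
Hi, thanks for your answer. I tried putting a lot of if conditions into the condition of the while loop, but I couldn't get it to work. I was hoping a regex would solve my problem.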

# ^(19|20)\d\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\b.+$
# 
# Options: ^ and $ match at line breaks
# 
# Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
# Match the regular expression below and capture its match into backreference number 1 «(19|20)»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «19»
#       Match the characters “19” literally «19»
#    Or match regular expression number 2 below (the entire group fails if this one fails to match) «20»
#       Match the characters “20” literally «20»
# Match a single digit 0..9 «\d»
# Match a single digit 0..9 «\d»
# Match the regular expression below and capture its match into backreference number 2 «([-/.])»
#    Match a single character present in the list “-/.” «[-/.]»
# Match the regular expression below and capture its match into backreference number 3 «(0[1-9]|1[012])»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «0[1-9]»
#       Match the character “0” literally «0»
#       Match a single character in the range between “1” and “9” «[1-9]»
#    Or match regular expression number 2 below (the entire group fails if this one fails to match) «1[012]»
#       Match the character “1” literally «1»
#       Match a single character present in the list “012” «[012]»
# Match the same text as most recently matched by capturing group number 2 «\2»
# Match the regular expression below and capture its match into backreference number 4 «(0[1-9]|[12][0-9]|3[01])»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «0[1-9]»
#       Match the character “0” literally «0»
#       Match a single character in the range between “1” and “9” «[1-9]»
#    Or match regular expression number 2 below (attempting the next alternative only if this one fails) «[12][0-9]»
#       Match a single character present in the list “12” «[12]»
#       Match a single character in the range between “0” and “9” «[0-9]»
#    Or match regular expression number 3 below (the entire group fails if this one fails to match) «3[01]»
#       Match the character “3” literally «3»
#       Match a single character present in the list “01” «[01]»
# Assert position at a word boundary «\b»
# Match any single character that is not a line break character «.+»
#    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert position at the end of a line (at the end of the string or before a line break character) «$»

The updated pattern, escaped for a Java string literal:

^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\s+\\d+,[^,]+,[^,]+$