Java: handling ""-" with univocity

Do you know how I can get the rows split properly? Some rows are glued together, and I don't know what to do about it or why it happens.

  col. 0: Date
  col. 1: Col2
  col. 2: Col3
  col. 3: Col4
  col. 4: Col5
  col. 5: Col6
  col. 6: Col7
  col. 7: Col7
  col. 8: Col8

  col. 0: 2017-05-23
  col. 1: String
  col. 2: lo rem ipsum
  col. 3: dolor sit amet
  col. 4: mcdonalds.com/online.html
  col. 5: null
  col. 6: "","-""-""2017-05-23"
  col. 7: String
  col. 8: lo rem ipsum
  col. 9: dolor sit amet
  col. 10: burgerking.com
  col. 11: https://burgerking.com/
  col. 12: 20
  col. 13: 2
  col. 14: fake

  col. 0: 2017-05-23
  col. 1: String
  col. 2: lo rem ipsum
  col. 3: dolor sit amet
  col. 4: wendys.com
  col. 5: null
  col. 6: "","-""-""2017-05-23"
  col. 7: String
  col. 8: lo rem ipsum
  col. 9: dolor sit amet
  col. 10: buggagump.com
  col. 11: null
  col. 12: "","-""-""2017-05-23"
  col. 13: String
  col. 14: cheese
  col. 15: ad eum
  col. 16: mcdonalds.com/online.html
  col. 17: null
  col. 18: "","-""-""2017-05-23"
  col. 19: String
  col. 20: burger
  col. 21: ludus dissentiet
  col. 22: www.mcdonalds.com
  col. 23: https://www.mcdonalds.com/
  col. 24: 25
  col. 25: 3
  col. 26: fake

  col. 0: 2017-05-23
  col. 1: String
  col. 2: wine
  col. 3: id erat utamur
  col. 4: bubbagump.com
  col. 5: https://buggagump.com/
  col. 6: 25
  col. 7: 3
  col. 8: fake
  done
Sample CSV (the \r\n line endings may have been damaged by copy/paste). Available from:

Parser settings:

  CsvParserSettings settings = new CsvParserSettings();

  settings.setDelimiterDetectionEnabled(true);
  settings.setQuoteDetectionEnabled(true);

  settings.setLineSeparatorDetectionEnabled(false); // all the same using `true`
  settings.getFormat().setLineSeparator("\r\n");

  CsvParser parser = new CsvParser(settings);

  List<String[]> rows;

  rows = parser.parseAll(getReader("testFiles/" + "malformed csv small.csv"));

  for (String[] row : rows)
  {
    System.out.println("");
    int i = 0;

    for (String element : row)
    {
      System.out.println("col. " + i++ + ": " + element);
    }
  }

  System.out.println("done");

While you are testing the auto-detection process, I suggest printing the detected format with:

CsvFormat format = parser.getDetectedFormat();
System.out.println(format);
This prints:

CsvFormat:
    Comment character=#
    Field delimiter=,
    Line separator (normalized)=\n
    Line separator sequence=\r\n
    Quote character="
    Quote escape character=-
    Quote escape escape character=null
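As you can see, the parser is not detecting the quote escape correctly. The format detection process is usually very good, but there is no guarantee it will always get things right, especially with small test samples. In your sample I can't see why it would pick '-' as the escape character, so I opened an issue to investigate what makes it detect that one.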
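To illustrate why a wrong quote escape can glue rows together, here is a minimal sketch in plain Java. It is a simplified model, not univocity's actual algorithm: the class `QuoteEscapeDemo`, the method `closingQuoteIndex` and the input fragment are all hypothetical, and the rule that a quote only closes a field when a delimiter or line break follows is an assumption about lenient parsing.

```java
class QuoteEscapeDemo {

    /**
     * Returns the index of the quote that closes the quoted field starting at
     * index 0, or -1 if the input ends before a closing quote is found.
     * Simplified model (assumption): a quote closes the field only when it is
     * followed by the delimiter, a line break, or the end of the input.
     */
    static int closingQuoteIndex(String s, char quote, char escape, char delimiter) {
        int i = 1; // index 0 holds the opening quote
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean hasNext = i + 1 < s.length();
            if (c == escape && hasNext && s.charAt(i + 1) == quote) {
                i += 2; // escaped quote: belongs to the value, skip both characters
                continue;
            }
            if (c == quote) {
                char next = hasNext ? s.charAt(i + 1) : '\n';
                if (next == delimiter || next == '\r' || next == '\n') {
                    return i; // genuine closing quote
                }
                // otherwise treat it as a stray quote inside the value
            }
            i++;
        }
        return -1; // the field never closed: it swallowed the rest of the input
    }

    public static void main(String[] args) {
        // Illustrative fragment: a quoted field, a line break, then the next row.
        String input = "\"-\"\"-\"\r\n2017-05-23,String";
        System.out.println(closingQuoteIndex(input, '"', '"', ',')); // 5
        System.out.println(closingQuoteIndex(input, '"', '-', ',')); // -1
    }
}
```

With '"' as the escape, the field closes right before the line break. With '-' as the escape, each -" pair is read as an escaped quote, no closing quote is recognised, and the next row is pulled into the same field.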

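If you know that none of your input files will ever use '-' as the quote escape, what you can do for now is detect the format, test what it picked up from the input, and then parse the content, like this: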

public List<String[]> parse(File input, CsvFormat format) {
    CsvParserSettings settings = new CsvParserSettings();
    if (format == null) { //no format specified? Let's detect what we are dealing with
        settings.detectFormatAutomatically();

        CsvParser parser = new CsvParser(settings);
        parser.beginParsing(input); //just call begin parsing to kick off the auto-detection process
        format = parser.getDetectedFormat(); //capture the format
        parser.stopParsing(); //stop the parser - no need to read anything yet.

        System.out.println(format);

        if (format.getQuoteEscape() == '-') { //got something weird detected? Let's amend it.
            format.setQuoteEscape('"');
        }

        return parse(input, format); //now parse with the intended format
    } else {
        settings.setFormat(format); //this parses with the format adjusted earlier.
        CsvParser parser = new CsvParser(settings);
        return parser.parseAll(input);
    }

}

With that, your data will be extracted correctly. Hope this helps.
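As a sanity check after parsing, you could verify that every row has as many columns as the header. In the output from the question the header has 9 columns while the glued rows have 15 and 27, so a check like this would flag them. This is only a sketch in plain Java; `RowWidthCheck` and `findRaggedRows` are hypothetical names, not part of univocity.

```java
import java.util.ArrayList;
import java.util.List;

class RowWidthCheck {

    /** Returns the indexes of rows whose column count differs from the header (row 0). */
    static List<Integer> findRaggedRows(List<String[]> rows) {
        List<Integer> ragged = new ArrayList<>();
        if (rows.isEmpty()) {
            return ragged;
        }
        int expected = rows.get(0).length; // header width
        for (int i = 1; i < rows.size(); i++) {
            if (rows.get(i).length != expected) {
                ragged.add(i);
            }
        }
        return ragged;
    }

    public static void main(String[] args) {
        // Row widths taken from the output shown in the question: 9, 15, 27, 9.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[9]);
        rows.add(new String[15]);
        rows.add(new String[27]);
        rows.add(new String[9]);
        System.out.println(findRaggedRows(rows)); // [1, 2]
    }
}
```

An empty result suggests the rows were split as expected; any index it returns points at a row worth inspecting together with the detected format.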

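I don't think this has anything to do with line breaks. Check your quote settings: part of the input seems to be interpreted as quoted text.

@pvg, this is related to the auto-detection process. See my answer below.

@TmTron, this is related to the auto-detection process. See my answer below.

I was focused on the rows not being split properly and missed what was happening with the quote escape. Everything is fine now. Thanks a lot.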
Call the parse method with a null format to trigger the detection:

List<String[]> rows = parse(new File("/Users/jbax/Downloads/malformed csv r n small.csv"), null);