Reading a CSV file with multi-line strings in Scala


I have a CSV file that I want to read line by line. The problem is that some cell values are quoted and contain line breaks.

Here is a sample CSV:

Product,Description,Price
Product A,This is Product A,20
Product B,"This is much better
than Product A",200

The standard getLines() function cannot handle this:

Source.fromFile(inputFile).getLines()  // will split at every line break, regardless if quoted or not

With getLines, the result looks like this:

Array("Product", "Description", "Price")
Array("Product A", "This is Product A", "20")
Array("Product B", "\"This is much better")
Array("than Product A\"", "200")

But it should look like this:

Array("Product", "Description", "Price")
Array("Product A", "This is Product A", "20")
Array("Product B", "\"This is much better\nthan Product A\"", "200")

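I tried reading the whole file and splitting it with a regular expression similar to the one in this post.

The regex itself works fine, but I get a stack overflow exception because the file is too large to be processed in memory in one go. With a smaller version of the file it worked well.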

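As mentioned in that post, foldLeft() could help with bigger files. But I am not sure how it should work when iterating over every character of the string, since three things have to be passed along at once: the character of the current iteration, the line currently being built, and the list of lines already created.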

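Maybe I could write my own tail-recursive version of getLines, but I am not sure whether there is a more practical solution than processing the input character by character.

Do you see any other functional-style solution?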

Thanks and regards,

Felix

You can use a third-party library for this, such as opencsv.

Maven repo ->

Code example ->

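The simplest answer is to find an external library that does it for you.

If that is not an option, the foldLeft solution is, in my opinion, the best functional-style approach. Here is a simple version: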

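  import scala.io.Source

  val lines = Source.fromFile(inputFile).getLines()

  // State: (complete records so far, partial record that is still inside an open quote)
  lines.foldLeft((Seq.empty[String], "")) {
    case ((accumulatedLines, accumulatedString), newLine) =>
      val isInAnOpenString = accumulatedString.nonEmpty
      val lineHasOddQuotes = newLine.count(_ == '"') % 2 == 1
      (isInAnOpenString, lineHasOddQuotes) match {
        // closing quote found: finish the record, restoring the line break getLines removed
        case (true, true) => (accumulatedLines :+ (accumulatedString + "\n" + newLine)) -> ""
        // still inside the quoted value: keep accumulating
        case (true, false) => accumulatedLines -> (accumulatedString + "\n" + newLine)
        // an odd number of quotes opens a multi-line value
        case (false, true) => accumulatedLines -> newLine
        // ordinary single-line record
        case (false, false) => (accumulatedLines :+ newLine) -> ""
      }
  }._1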

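Note that this version does not handle many special cases, such as several multi-line values within a single line, but it should give you a good start.

The main idea is to put pretty much everything you need to keep in memory into the fold's accumulator, and to update that state step by step from there.

As you can see, you can put as much logic as you need inside a foldLeft. In this example I added extra booleans and a nested match to make it easier to read.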
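To make the "state in the accumulator" idea concrete, here is a minimal, self-contained sketch that runs the same kind of fold over in-memory lines instead of a file. The names `FoldDemo` and `joinRecords` are hypothetical, and the sketch assumes that an `Option` is an acceptable way to model "no record is currently open"; like the version above, it only looks at quote parity per line.

```scala
object FoldDemo {
  // Hypothetical helper: joins physical lines into logical CSV records.
  // Accumulator: (finished records, the record still inside an open quote, if any).
  def joinRecords(lines: Iterator[String]): Seq[String] = {
    val (done, pending) =
      lines.foldLeft((Vector.empty[String], Option.empty[String])) {
        case ((records, open), line) =>
          val oddQuotes = line.count(_ == '"') % 2 == 1
          open match {
            // an odd quote count closes the open record
            case Some(partial) if oddQuotes => (records :+ (partial + "\n" + line), None)
            // still inside the quoted value
            case Some(partial) => (records, Some(partial + "\n" + line))
            // an odd quote count opens a multi-line record
            case None if oddQuotes => (records, Some(line))
            // plain single-line record
            case None => (records :+ line, None)
          }
      }
    // Keep an unterminated record instead of silently dropping it
    done ++ pending.toList
  }
}
```

Called with the four physical lines of the sample CSV, `FoldDemo.joinRecords` should return three records, the last one containing the embedded line break. Note that the fold still collects every record in memory before returning.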

So my advice is: foldLeft, and don't panic!

I wonder whether the new (Scala 2.13) unfold() could be put to good use here.

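// "file" is assumed to be an already opened scala.io.Source
val lines = Iterator.unfold(file.getLines()) { itr =>
  Option.when(itr.hasNext) {
    val sb = new StringBuilder(itr.next())
    // keep appending physical lines while the quote count is odd (a quoted value is still open)
    while (itr.hasNext && sb.count(_ == '"') % 2 > 0)
      sb.append("\\n" + itr.next()) // literal backslash-n so the join is visible when printed; use "\n" for a real line break
    (sb.toString, itr)
  }
}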

Now you can iterate over the contents as needed.

lines.foreach(println)
//Product,Description,Price
//Product A,This is Product A,20
//Product B,"This is much better\nthan Product A",200
//Product C,a "third rate" product,5

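Note that this is very simplistic: it just counts all quotes and looks for an even number. It does not treat escaped quotes (\") any differently, but using a regex to count only the non-escaped quotes should not be too difficult.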
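One way the regex idea could look is sketched below. The names `QuoteCount`, `countUnescaped` and `records` are hypothetical, and the sketch assumes backslash escaping (`\"`); it does not handle an escaped backslash directly in front of a quote (`\\"`). Quotes escaped by doubling (`""`), as in RFC 4180 style CSV, do not change the parity of the count, so they need no special treatment.

```scala
object QuoteCount {
  // A double quote that is not directly preceded by a backslash
  private val unescapedQuote = "(?<!\\\\)\"".r

  // Hypothetical helper: number of non-escaped quotes in the text
  def countUnescaped(s: CharSequence): Int =
    unescapedQuote.findAllIn(s).size

  // Same unfold approach as above, but only non-escaped quotes are counted
  // (requires Scala 2.13+ for Iterator.unfold and Option.when)
  def records(lines: Iterator[String]): Iterator[String] =
    Iterator.unfold(lines) { itr =>
      Option.when(itr.hasNext) {
        val sb = new StringBuilder(itr.next())
        // an odd count means a quoted value is still open
        while (itr.hasNext && countUnescaped(sb) % 2 != 0)
          sb.append('\n').append(itr.next())
        (sb.toString, itr)
      }
    }
}
```

This version joins lines with a real line break rather than a visible `\n` marker, and it recounts the whole buffer on every appended line, which is fine for short records but quadratic for very long ones.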

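Because we are working with iterators, this should be memory efficient and able to handle files of any size, as long as there is no stray unmatched quote that causes the rest of the file to be read in as a single line of text.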

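Thanks @C4stor, this looks great and I really like your suggestion. I did not quite get the "->" operator (afaik it is only used for maps). But having seen the power of foldLeft in your example, I modified it a bit so that it reads character by character and also copes with multiple quoted line breaks within one line.

-> is used to form tuples, so a -> b is equivalent to (a, b). It helps avoid piling up parentheses everywhere.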
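The character-by-character variant mentioned in the comment was not posted, so the following is only a guess at what such a fold might look like. `CharFold` and `splitRecords` are hypothetical names; the state is a triple of the finished records, the record being built, and a flag saying whether the current position is inside quotes.

```scala
object CharFold {
  // Hypothetical sketch: splits a character stream into CSV records,
  // treating a line break inside quotes as part of the value.
  def splitRecords(chars: Iterator[Char]): Vector[String] = {
    val (records, current, _) =
      chars.foldLeft((Vector.empty[String], "", false)) {
        // every quote toggles the "inside quotes" flag
        case ((done, line, inQuotes), '"') => (done, line + '"', !inQuotes)
        // a line break outside quotes ends the current record
        case ((done, line, false), '\n') => (done :+ line, "", false)
        // anything else, including a quoted line break, is appended
        case ((done, line, inQuotes), c) => (done, line + c, inQuotes)
      }
    // keep a last record that is not terminated by a line break
    if (current.isEmpty) records else records :+ current
  }
}
```

Since `scala.io.Source` is itself an `Iterator[Char]`, something like `CharFold.splitRecords(Source.fromFile(inputFile))` should work. The sketch appends to an immutable `String` for readability, which is slow for long records, it does not strip `\r` from Windows line endings, and it keeps all records in memory.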
