Scala: Reading a CSV in Spark where the last column is an array of values (values inside parentheses, separated by commas)

scala apache-spark

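I have a CSV file in which the last column is wrapped in parentheses and its values are separated by commas. The number of values in that last column varies from row to row. When I read the file as a DataFrame with the column names listed below, I get Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match. My CSV file looks like this: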

a1,b1,true,2017-05-16T07:00:41.0000000,2.5,(c1,d1,e1)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,f2,g2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,k2,f2)

What I eventually want is something like this:

root
 |-- MId: string (nullable = true)
 |-- PId: string (nullable = true)
 |-- IsTeacher: boolean (nullable = true)
 |-- STime: datetype (nullable = true)
 |-- TotalMinutes: double (nullable = true)
 |-- SomeArrayHeader: array<string> (nullable = true)

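I thought of reading the file without giving column names and then converting everything after the 5th column into an array type, but the parentheses get in the way. Is there a way to tell Spark, while reading, that the fields inside the parentheses are actually a single field of array type?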

OK. This solution is specific to your case only. The one below worked for me:

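  // Assumes `spark` is an existing SparkSession (as in spark-shell).
  import org.apache.spark.sql.functions.{col, regexp_replace, split}

  // Treat "(" as the quote character so the commas after it stay in one column.
  val df = spark.read.option("quote", "(").csv("in/staff.csv").toDF(
    "MId",
    "PId",
    "IsTeacher",
    "STime",
    "TotalMinutes",
    "arr")
  df.show()
  // Strip the trailing ")" and split what is left into an array of strings.
  val df2 = df.withColumn("arr", split(regexp_replace(col("arr"), "[)]", ""), ","))
  df2.printSchema()
  df2.show()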

Output:

+---+---+---------+--------------------+------------+---------------+
|MId|PId|IsTeacher|               STime|TotalMinutes|            arr|
+---+---+---------+--------------------+------------+---------------+
| a1| b1|     true|2017-05-16T07:00:...|         2.5|      c1,d1,e1)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|c2,d2,e2,f2,g2)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|            c2)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|         c2,d2)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|      c2,d2,e2)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|c2,d2,e2,k2,f2)|
+---+---+---------+--------------------+------------+---------------+

root
 |-- MId: string (nullable = true)
 |-- PId: string (nullable = true)
 |-- IsTeacher: string (nullable = true)
 |-- STime: string (nullable = true)
 |-- TotalMinutes: string (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---+---+---------+--------------------+------------+--------------------+
|MId|PId|IsTeacher|               STime|TotalMinutes|                 arr|
+---+---+---------+--------------------+------------+--------------------+
| a1| b1|     true|2017-05-16T07:00:...|         2.5|        [c1, d1, e1]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|[c2, d2, e2, f2, g2]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|                [c2]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|            [c2, d2]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|        [c2, d2, e2]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|[c2, d2, e2, k2, f2]|
+---+---+---------+--------------------+------------+--------------------+
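Note that the schema printed above still has IsTeacher, STime and TotalMinutes as strings, while the question asks for boolean, a date/time type and double. A minimal sketch of one way to cast them is below. The object name `TypedStaff` and the method `withTypes` are made up for illustration, and choosing `timestamp` for STime is an assumption, since the question only says "datetype". Whether a plain cast accepts seven fractional-second digits may depend on the Spark version, so it is worth checking for nulls in STime after the cast.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object TypedStaff {
  // Hypothetical helper: casts the string columns produced by the CSV reader
  // to the types asked for in the question. Values that cannot be cast
  // become null rather than raising an error.
  def withTypes(df: DataFrame): DataFrame =
    df.withColumn("IsTeacher", col("IsTeacher").cast("boolean"))
      .withColumn("STime", col("STime").cast("timestamp")) // assumption: timestamp, not date
      .withColumn("TotalMinutes", col("TotalMinutes").cast("double"))
}
```

It could be applied to the result of the snippet above as `TypedStaff.withTypes(df2).printSchema()`.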

This is not proper CSV content; the last column would have to be enclosed in double quotes. You can't read an ArrayType with the csv reader. I'll check whether there is any workaround.

Yes, I know. But my file only looks like this. If it were in quotes, the csv read would have no problem, but it is not in quotes.

Thanks! What does quote do there? My files are GBs in size; is there a performance problem with this?

"quote" is a feature of the CSV parser: when the quote character appears, the parser reads everything up to the next quote character into one column. So we can have delimiters in the data as long as they are surrounded by the quote character. The default quote character is " (double quote). The CSV parser doesn't allow more than one character for the quote feature. In our case we have ( and ), so I gave "(" when reading, and in df2 I removed the ")". Luckily this column is the last one, so it reads until the end of the line.

So will quote here still treat the default quote (") as one of the quote characters, and treat "(" as one as well?

In our case we have overridden it with "(". You can have only one quote character.
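As the comments point out, the quote trick only works because the parenthesised column is the last one and the parser accepts a single quote character. If relying on quote handling feels fragile, the same per-line logic can be sketched in plain Scala without a CSV parser: split on the first five commas only, then strip the parentheses from the remainder. The names `ParenArrayLine`, `StaffLine` and `parse` are hypothetical, and the sketch assumes the first five fields never contain commas.

```scala
object ParenArrayLine {
  // Hypothetical record type mirroring the six columns named in the question.
  final case class StaffLine(
      mId: String,
      pId: String,
      isTeacher: String,
      sTime: String,
      totalMinutes: String,
      arr: List[String])

  // Splits on at most the first five commas so the parenthesised tail
  // stays together, then removes "(" and ")" and splits the tail itself.
  // Returns None when the line has fewer than six comma-separated parts.
  def parse(line: String): Option[StaffLine] =
    line.split(",", 6) match {
      case Array(m, p, t, s, tm, rest) =>
        val inner = rest.trim.stripPrefix("(").stripSuffix(")")
        val arr =
          if (inner.isEmpty) List.empty[String]
          else inner.split(",").map(_.trim).toList
        Some(StaffLine(m, p, t, s, tm, arr))
      case _ => None
    }
}
```

In Spark, a function like this could in principle be mapped over the lines returned by `spark.read.textFile(...)`, which keeps the parsing independent of the CSV reader's quote option.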