Scala Spark: Convert a column of strings to an array

scala apache-spark pyspark

How can I convert a column that has been read as a string into a column of arrays? That is, convert from the schema below:

scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)

+---+---+
|  a|  b|
+---+---+
|  1|2,3|
+---+---+
|  2|4,5|
+---+---+

to a schema in which b is an array column.

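Please share both Scala and Python implementations if possible. On a related note, how can I take care of this while reading the file itself? My data has about 450 columns, and only a few of them need to be specified in this format. Currently I read the file in pyspark like this: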

df = spark.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true', delimiter='|').load(input_file)

Thanks.

There are several ways to do this.

The best way is to use the split function and cast the result to an array type.

Hope this helps.

Using a UDF gives you exactly the required schema, like this:

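import org.apache.spark.sql.functions.{col, udf}

// Split the comma-separated string and parse each piece as a Long.
val toArray = udf((b: String) => b.split(",").map(_.toLong))

val test1 = test.withColumn("b", toArray(col("b")))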

It gives you a schema like this:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|  b  |
+---+-----+
|  1|[2,3]|
+---+-----+
|  2|[4,5]|
+---+-----+

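As for applying the schema during the file read itself, I think that is a tough task. So for now, you can apply the transformation after creating test with the DataFrameReader.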

I hope this helps.

In Python (pyspark) it would be:

from pyspark.sql.functions import col, split

# Split on a comma plus optional whitespace, then cast the strings to longs.
test = test.withColumn(
    "b",
    split(col("b"), r",\s*").cast("array<long>")
)

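Can anyone give an example of the reverse, converting an array of strings into a tab-delimited column?

If the 2,3 above is a tuple (2,3), then an array needs to be created. @thebluephantom

@koiralow What do you mean by a tuple? Can you share the schema? You can use the array() function to create a list from columns.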
