Apache Spark: extracting values into multiple columns based on their names

apache-spark pyspark

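I have a string column and need to extract its values into multiple columns, based on the name each value is associated with. The target columns are State, Area, Sub Area, ID and Name.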

otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7

Thanks for your help.

Try the following:

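import org.apache.spark.sql.functions.udf
import spark.implicits._  // enables the $"..." column syntax; assumes a SparkSession named `spark`

// Must return the values in this order: State, Area, Sub Area, ID, Name
def myFunc: String => Array[String] = s => Array(/* TODO parse the string as you wish */)
val myUDF = udf(myFunc)

// Parse once into an array column, then pick each element out by position
df.withColumn("parsedInput", myUDF(df("input")))
  .select(
    $"parsedInput"(0).as("State"),
    $"parsedInput"(1).as("Area"),
    $"parsedInput"(2).as("Sub Area"),
    $"parsedInput"(3).as("ID"),
    $"parsedInput"(4).as("Name"))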

Here "input" is the column holding your original string (for example, "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7").

Make sure your UDF always returns a valid array: the right number of items, in the right order.
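One possible way to fill in the parsing step, shown as a plain-Python sketch rather than as the Scala UDF above. The name parse_record and the patterns in FIELD_PATTERNS are assumptions derived only from the single sample string in the question. The point it illustrates is that the parser always returns exactly five items in a fixed order, using an empty string for anything it cannot find.

```python
import re

# Hypothetical patterns, derived only from the sample string in the question.
# Their order defines the array positions: State, Area, Sub Area, ID, Name.
FIELD_PATTERNS = [
    ("State", r"State ([^_ ]*)"),
    ("Area", r"State [^_ ]*_(\S*)"),
    ("Sub Area", r"Area=<([^>]*)>"),
    ("ID", r"ID (\S*)"),
    ("Name", r"Name (\S*)"),
]


def parse_record(s):
    """Return [State, Area, Sub Area, ID, Name]; missing fields become ''."""
    values = []
    for _name, pattern in FIELD_PATTERNS:
        match = re.search(pattern, s)
        values.append(match.group(1) if match else "")
    return values


sample = "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7"
parsed = parse_record(sample)
print(parsed)
```

In PySpark, a function like this could be wrapped with pyspark.sql.functions.udf and a return type of ArrayType(StringType()), then indexed by position just as in the Scala version. Because the list length never varies, every positional lookup stays valid.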

If the pattern is always fixed, you can use regexp_extract:

from pyspark.sql.functions import regexp_extract

df = spark.createDataFrame([{"raw": "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 "}], 'raw string') 

(df
 .select(regexp_extract('raw', 'State ([^_]*)', 1).alias('State'), 
         regexp_extract('raw', 'State ([a-zA-Z]*)_([^ ]*)', 2).alias('Area'), 
         regexp_extract('raw', 'Area=<(.*)>', 1).alias('Sub Area'), 
         regexp_extract('raw', 'ID ([^ ]*)', 1).alias('ID'),
         regexp_extract('raw', 'Name ([^ ]*)', 1).alias('Name')).show())

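regexp_extract takes three arguments: the first is the column to match against, the second is the pattern, and the third is the index of the capture group to extract.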
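The group index is easiest to see on a pattern with two capture groups. The sketch below uses Python's re module instead of Spark, so it runs without a cluster; the helper extract is a hypothetical stand-in that mimics regexp_extract, including its behaviour of returning an empty string when nothing matches. The pattern is the one used for the Area column, and the "Zip" pattern is made up purely to show the no-match case.

```python
import re

raw = "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 "
pattern = r"State ([a-zA-Z]*)_([^ ]*)"


def extract(text, regex, idx):
    """Hypothetical stand-in for regexp_extract: group idx of the first match, or ''."""
    m = re.search(regex, text)
    if m is None:
        return ""
    return m.group(idx) or ""


whole = extract(raw, pattern, 0)   # group 0 is the entire match
state = extract(raw, pattern, 1)   # first parenthesised group
area = extract(raw, pattern, 2)    # second parenthesised group
missing = extract(raw, r"Zip ([0-9]+)", 1)  # made-up pattern that does not match

print(repr(whole), repr(state), repr(area), repr(missing))
```

Spark evaluates the pattern with Java's regex engine rather than Python's, but for simple ASCII patterns like these the two should agree.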

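Thanks for the reply. I see that this part, regexp_extract('raw', 'State ([a-zA-Z]*)_([^ ]*)', 2).alias('Area'), fails. My data is sometimes "DALLocate_SFO-4/3/9" and sometimes "DALLocate_SFO-4/3/9_DAX-10/3/3". In the second case I need 'SFO-4/3/9_DAX-10/3/3' as the value. Any idea how to handle this? ('State ([^_]*)_(.*) sub', 2) did not help much.

I have a string whose value looks like 'sfofosite ID Expose Name 10 ID 3 Area 10'. When I use the regex, it picks up ID=Expose. But if I extract by string and space, it should be ID=3. Can you help? @Matt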
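A hedged sketch of patterns that might cover both follow-up cases, again checked with Python's re module rather than Spark. The names AREA, ID_NUM and first_group are hypothetical, and the string called double is an assumed full row built from the fragment quoted in the comment. The idea is to let Area run up to the next whitespace character, so extra "_DAX-..." segments are kept, and to require digits after "ID " so that "ID Expose" is skipped.

```python
import re


def first_group(text, regex):
    """Return capture group 1 of the first match, or '' if nothing matches."""
    m = re.search(regex, text)
    return m.group(1) if m else ""


# Everything after the first underscore, up to the next whitespace character.
AREA = r"State [^_ ]*_(\S+)"
# Only accept an ID that is followed by digits.
ID_NUM = r"ID (\d+)"

single = "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7"
# Assumed row: the sample string with the longer value from the comment swapped in.
double = "otherPartofString State DALLocate_SFO-4/3/9_DAX-10/3/3 sub Area=<8> ID 8 Name 7"
mixed = "sfofosite ID Expose Name 10 ID 3 Area 10"

area_single = first_group(single, AREA)
area_double = first_group(double, AREA)
id_mixed = first_group(mixed, ID_NUM)

print(area_single, area_double, id_mixed)
```

If these patterns suit the real data, they can be passed to regexp_extract as raw strings, for example regexp_extract('raw', r'ID (\d+)', 1). This assumes that IDs are always numeric and that the area value never contains spaces.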
