PySpark string extraction
I have a column named target in a Spark DataFrame. Its values look like this:
ab=px_d_1200;ab=9;ab=t_d_o_1000;artid=delish.recipe.46338;artid=delish_recipe_46338;avb=85;cat=recipes;role=3;sect=cooking
ab=px_d_1200;ab=8;ab=t_d_o_1000;apn=640x480_370;artid=delish.recipe.25860457;artid=delish_recipe_25860457;avb=90;cat=recipes;clc=chicken-breast-recipes;clc=insanely-easy-chicken-dinners;clc=weeknight-dinners;embedid=a6311e94-3b66-4712-8fca-eaa423e4e69a;gs_cat=response_check;gs_cat=gl_english;role=3;sect=cooking;sub=recipe-ideas;tool=recipe;urlhash=5425cac3a9c2959917d0634f5bd6d842
I need to extract role=X. In addition, the value after the equals sign needs to be saved in a separate column.
The desired output is:
role
3
3
This may be a workable solution.
Create the DataFrame here:
from pyspark.sql import functions as F

# Assumes an active SparkSession is available as `spark`
df = spark.createDataFrame([(1,"ab=px_d_1200;ab=9;ab=t_d_o_1000;artid=delish.recipe.46338;artid=delish_recipe_46338;avb=85;cat=recipes;role=3;sect=cooking")],[ "col1","col2"])
df.show(truncate=False)
+----+--------------------------------------------------------------------------------------------------------------------------+
|col1|col2 |
+----+--------------------------------------------------------------------------------------------------------------------------+
|1 |ab=px_d_1200;ab=9;ab=t_d_o_1000;artid=delish.recipe.46338;artid=delish_recipe_46338;avb=85;cat=recipes;role=3;sect=cooking|
+----+--------------------------------------------------------------------------------------------------------------------------+
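# Keep only rows that contain a role key
df_new = df.filter(F.col("col2").contains("role="))
# Split on ';' and explode so that each key=value pair gets its own row
df_new = df_new.withColumn("split_col", F.explode(F.split(F.col("col2"), ";")))
# Keep only the role pair (startswith avoids matching keys that merely contain "role")
df_new = df_new.filter(F.col("split_col").startswith("role="))
# Split the pair on '=' into [key, value]
df_new = df_new.withColumn("final_col", F.split(F.col("split_col"), "="))
# Take the last element, i.e. the value after '='
df_new = df_new.withColumn("role", F.element_at(F.col("final_col"), -1))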
df_new.show()
+----+--------------------+---------+---------+----+
|col1| col2|split_col|final_col|role|
+----+--------------------+---------+---------+----+
| 1|ab=px_d_1200;ab=9...| role=3|[role, 3]| 3|
+----+--------------------+---------+---------+----+
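As a possible alternative to the split/explode steps above, a single regular expression might do the same job. The sketch below is plain Python so the pattern can be checked without a cluster; the helper name extract_role and the constant ROLE_PATTERN are hypothetical, not part of the original answer. The same pattern should be usable with F.regexp_extract, which returns an empty string when nothing matches.

```python
import re

# Hypothetical pattern: "role" must sit at the start of the string or right
# after a ';', so a key such as "subrole" is not picked up by accident.
ROLE_PATTERN = r"(?:^|;)role=([^;]*)"


def extract_role(value):
    """Return the text after 'role=' or '' when the key is absent."""
    match = re.search(ROLE_PATTERN, value)
    return match.group(1) if match else ""


row = (
    "ab=px_d_1200;ab=9;ab=t_d_o_1000;artid=delish.recipe.46338;"
    "artid=delish_recipe_46338;avb=85;cat=recipes;role=3;sect=cooking"
)
print(extract_role(row))

# Assumed PySpark equivalent (not run here), with F = pyspark.sql.functions:
# df = df.withColumn("role", F.regexp_extract(F.col("col2"), ROLE_PATTERN, 1))
```

Unlike the explode approach, this keeps one output row per input row and does not drop rows that lack a role key; they simply get an empty value.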
Perfect, this is exactly what I needed. Thank you very much.
Glad it helped :) I would appreciate it if you could upvote as well.