Apache spark 基于特定字符的索引值的Pyspark dataframe列子字符串

Apache spark 基于特定字符的索引值的Pyspark dataframe列子字符串,apache-spark,pyspark,apache-spark-sql,Apache Spark,Pyspark,Apache Spark Sql,专家们,我有一个简单的要求,但无法找到实现目标的功能 我使用的是pyspark(spark 1.6和Python2.7),有一个简单的pyspark dataframe列,其中包含某些值,如- 1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC 1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234 1849adb0-89o8iulk89o89-898

专家们,我有一个简单的要求,但无法找到实现目标的功能

我使用的是pyspark(spark 1.6和Python2.7),有一个简单的pyspark dataframe列,其中包含某些值,如-

1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC
1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234
1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678
这些值的共同点是,只有一个“下划线”,其后是某些字符(可以是任意数量的字符)。这些是我在输出中感兴趣的字符。我想使用一个substring或regex函数,它将在列值中找到“下划线”的位置,并选择“from underline position+1”,直到列值结束。 因此,输出将看起来像一个数据帧,其值如下所示-

ABC
1234
12345678
我尝试使用子字符串,但可以找到任何可以“索引”“下划线”的内容


谢谢

您可以使用
regexp\u extract
\u

示例:

from pyspark.sql.functions import *

df=spark.sql("""select stack(3,"1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC","1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234","1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678") as (txt)""")

df.withColumn("extract",regexp_extract(col("txt"),"_(.*)",1)).show(10,False)
+-------------------------------------------------------------------+--------+
|txt                                                                |extract |
+-------------------------------------------------------------------+--------+
|1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC     |ABC     |
|1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234    |1234    |
|1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678|12345678|
+-------------------------------------------------------------------+--------+
结果:

from pyspark.sql.functions import *

df=spark.sql("""select stack(3,"1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC","1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234","1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678") as (txt)""")

df.withColumn("extract",regexp_extract(col("txt"),"_(.*)",1)).show(10,False)
+-------------------------------------------------------------------+--------+
|txt                                                                |extract |
+-------------------------------------------------------------------+--------+
|1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC     |ABC     |
|1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234    |1234    |
|1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678|12345678|
+-------------------------------------------------------------------+--------+

您可以使用
regexp\u extract
\u

示例:

from pyspark.sql.functions import *

df=spark.sql("""select stack(3,"1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC","1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234","1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678") as (txt)""")

df.withColumn("extract",regexp_extract(col("txt"),"_(.*)",1)).show(10,False)
+-------------------------------------------------------------------+--------+
|txt                                                                |extract |
+-------------------------------------------------------------------+--------+
|1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC     |ABC     |
|1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234    |1234    |
|1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678|12345678|
+-------------------------------------------------------------------+--------+
结果:

from pyspark.sql.functions import *

df=spark.sql("""select stack(3,"1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC","1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234","1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678") as (txt)""")

df.withColumn("extract",regexp_extract(col("txt"),"_(.*)",1)).show(10,False)
+-------------------------------------------------------------------+--------+
|txt                                                                |extract |
+-------------------------------------------------------------------+--------+
|1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC     |ABC     |
|1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234    |1234    |
|1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678|12345678|
+-------------------------------------------------------------------+--------+

无需使用任何regexp

请按如下所示尝试。 基本上,字符上拆分,并通过getItem()获取第二个项目

结果

+-------------------------------------------------------------------+--------+
|input_v                                                            |get_val |
+-------------------------------------------------------------------+--------+
|1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC     |ABC     |
|1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234    |1234    |
|1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678|12345678|
+-------------------------------------------------------------------+--------+```



无需使用任何regexp

请按如下所示尝试。 基本上,字符上拆分,并通过getItem()获取第二个项目

结果

+-------------------------------------------------------------------+--------+
|input_v                                                            |get_val |
+-------------------------------------------------------------------+--------+
|1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC     |ABC     |
|1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234    |1234    |
|1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678|12345678|
+-------------------------------------------------------------------+--------+```



那么你想要从第一个下划线到行尾的字符串?那么你想要从第一个下划线到行尾的字符串?