Apache Spark PySpark: Create a column by parsing a string in another column

I want to join two dataframes.

One dataframe looks like this, where syscode_ntwrk is separated by a dash:

spark.createDataFrame(
    [
        (1, '1234 - ESPN'), 
        (2, '1234 - ESPN'),
        (3, '963 - CNN'), 
        (4, '963 - CNN'),
    ],
    ['id', 'col1'] 
)
The other is in this format, where syscode_ntwrk is concatenated together:

spark.createDataFrame(
    [
        (100, '1234ESPN'), 
        (297, '1234ESPN'),
        (3989, '963CNN'), 
        (478, '963CNN'),
    ],
    ['counts', 'col1'] 
)

Is there a way to create a new column in the second dataframe that matches the syscode_ntwrk format of the first dataframe? The syscode will always be a set of numbers and the ntwrk will always be a set of letters, so is there a regex that adds a dash, surrounded by spaces, between the two?

You can use regexp_extract to extract the two groups and concat_ws to combine them into the desired format:

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        (100, '1234ESPN'), 
        (297, '1234ESPN'),
        (3989, '963CNN'), 
        (478, '963CNN'),
    ],
    ['counts', 'col1'] 
)

df.select(
    F.concat_ws(
        ' - ',
        # group 1 = the digits (syscode), group 2 = the letters (ntwrk)
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 1),
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 2)
    ).alias('parsed')
).show()

+-----------+
|     parsed|
+-----------+
|1234 - ESPN|
|1234 - ESPN|
|  963 - CNN|
|  963 - CNN|
+-----------+
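
Since the stated goal was a join, here is a minimal sketch of one way to finish the task: keep the parsed value as a real column with withColumn, then join it against the first dataframe. The df1 and df2 names below are assumptions for illustration, not from the original post.

df1 = spark.createDataFrame(
    [
        (1, '1234 - ESPN'),
        (2, '1234 - ESPN'),
        (3, '963 - CNN'),
        (4, '963 - CNN'),
    ],
    ['id', 'col1']
)

# Add the reformatted value as a joinable column on the second dataframe.
df2 = df.withColumn(
    'parsed',
    F.concat_ws(
        ' - ',
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 1),
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 2),
    )
)

# Equivalent single call, using Java-style $1/$2 backreferences in the replacement:
# df2 = df.withColumn('parsed', F.regexp_replace('col1', r'(\d+)([a-zA-Z]+)', '$1 - $2'))

df1.join(df2, df1.col1 == df2.parsed, 'inner').show()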