Python 3.x: handling a combination of delimited and non-delimited columns to get new rows for the corresponding values

python-3.x apache-spark pyspark apache-spark-sql pyspark-dataframes

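I have a specific scenario when creating a PySpark DataFrame: two columns are pipe-delimited, and a third is not delimited but is an aggregate of the two pipe-delimited columns.

Since my product and quantity columns are pipe-delimited, I split each product from its quantity and explode them to get the individual quantities, as shown below: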

product   quantity
a            1
b            1
b            3
c            2
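
To make the split-and-explode step concrete, here is a minimal plain-Python model of it, not the Spark code itself. The two-row input is an assumption inferred from the outputs shown in this thread (the original input table is not included), and explode_pairs is a hypothetical helper name. It also assumes both columns hold the same number of entries per row; plain zip silently truncates otherwise.

```python
def explode_pairs(rows, sep="|"):
    """Return one (product, quantity) tuple per delimited entry of each row."""
    out = []
    for row in rows:
        products = row["product"].split(sep)      # "a|b" -> ["a", "b"]
        quantities = row["quantity"].split(sep)   # "1|1" -> ["1", "1"]
        # Pair the N-th product with the N-th quantity, then flatten.
        for product, quantity in zip(products, quantities):
            out.append((product, int(quantity)))
    return out


# Assumed sample input (hypothetical; reconstructed from the outputs above).
sample_rows = [
    {"product": "a|b", "quantity": "1|1", "revenue": 3},
    {"product": "b|c", "quantity": "3|2", "revenue": 9},
]

exploded = explode_pairs(sample_rows)
for product, quantity in exploded:
    print(product, quantity)
```

Each input row contributes as many output rows as it has delimited entries, which is why two input rows become four product/quantity rows.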

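However, revenue has no delimiter, so for now I just put zero in that column. What I am trying to get is the result below, where the revenue sits on a separate row whose product is hard-coded as 'no delimiter in rev':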

product           quantity  revenue
a                    1        0
b                    1        0
no delimiter in rev  0        3
b                    3        0
c                    2        0
no delimiter in rev  0        9

Any insight on how to achieve this would be very helpful.

You can union the original DataFrame, with its product column set to 'no delimiter in rev', with the exploded DataFrame, as follows:

from pyspark.sql.functions import split, col, explode, lit, expr

# create new df with product = 'no delimiter in rev' and quantity= 0
df1 = df.withColumn("product", lit("no delimiter in rev")) \
    .withColumn("quantity", lit(0))

# create another df by exploding the product/quantity structure and revenue=0
df2 = df.withColumn('product', split(col("product"), "\\|")) \
    .withColumn('quantity', split(col("quantity"), "\\|")) \
    .withColumn("product_quantity", explode(expr("arrays_zip(product, quantity)"))) \
    .selectExpr("product_quantity.*", "0 as revenue")

# union the 2 data frames
df1.union(df2).show()

#+-------------------+--------+-------+
#|            product|quantity|revenue|
#+-------------------+--------+-------+
#|no delimiter in rev|       0|      3|
#|no delimiter in rev|       0|      9|
#|                  a|       1|      0|
#|                  b|       1|      0|
#|                  b|       3|      0|
#|                  c|       2|      0|
#+-------------------+--------+-------+
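
Note that the union output lists the two revenue rows first, whereas the question asks for each revenue row to follow the products of the same source row. The sketch below is a plain-Python model of that desired ordering, not part of the original answer. The two-row input is an assumption inferred from the outputs in this thread, and rows_with_revenue is a hypothetical helper name.

```python
def rows_with_revenue(rows, sep="|", label="no delimiter in rev"):
    """Return (product, quantity, revenue) tuples in the order the question asks for."""
    result = []
    for row in rows:
        products = row["product"].split(sep)
        quantities = row["quantity"].split(sep)
        # Exploded product rows carry revenue 0.
        for product, quantity in zip(products, quantities):
            result.append((product, int(quantity), 0))
        # One extra row per source row carries the undelimited revenue.
        result.append((label, 0, row["revenue"]))
    return result


# Assumed sample input (hypothetical; reconstructed from the outputs above).
sample_rows = [
    {"product": "a|b", "quantity": "1|1", "revenue": 3},
    {"product": "b|c", "quantity": "3|2", "revenue": 9},
]

interleaved = rows_with_revenue(sample_rows)
for row in interleaved:
    print(row)
```

In Spark, row order after a union should not be relied on, so reproducing this order would likely need an explicit sort key, for example a per-row id added before the split and an orderBy after the union. That is an assumption about one possible approach, not something the answer above does.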

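Thanks for the insight. Currently, for the quantity part, after exploding and running a few calculations, the DataFrame I am creating, which is equivalent to df2 in the solution, is

df2 = df.groupBy('some entries').agg(countDistinct('id').alias('unique_count'), sum('quantitys').alias('sum_quantity'))

and then the solution's df1 would be

df2 = df.withColumn('sum_revenue', lit(0).cast(DoubleType()))

So if I use your df1 instead of this, it does not pick up the total revenue from the revenue column. Any insights?

@思考数学 I'm not sure I understand what you are trying to do here. Of course, if you do new aggregations and change the logic, it will not work like the solution above. I think this could be a new question :)

Thanks @Blackishop. If I can't solve the problem, I will create a new question with full details.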