PySpark transformation: from column names to rows

I am working with PySpark and would like to transform this Spark DataFrame:

    +----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
    | TS | ABC[0].VAL.VAL[0].UNT[0].sth1 | ABC[0].VAL.VAL[0].UNT[1].sth1 | ABC[0].VAL.VAL[1].UNT[0].sth1 | ABC[0].VAL.VAL[1].UNT[1].sth1 | ABC[0].VAL.VAL[0].UNT[0].sth2 | ABC[0].VAL.VAL[0].UNT[1].sth2 | ABC[0].VAL.VAL[1].UNT[0].sth2 | ABC[0].VAL.VAL[1].UNT[1].sth2 |
    +----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
    | 1  | some_value                    | some_value                    | some_value                    | some_value                    | some_value                    | some_value                    | some_value                    | some_value                    |
    +----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
into this:

+----+-----+-----+------------+------------+
| TS | VAL | UNT |    sth1    |    sth2    |
+----+-----+-----+------------+------------+
|  1 |   0 |   0 | some_value | some_value |
|  1 |   0 |   1 | some_value | some_value |
|  1 |   1 |   0 | some_value | some_value |
|  1 |   1 |   1 | some_value | some_value |
+----+-----+-----+------------+------------+
Do you have any idea how I could do this with some fancy transformation?

Edit: This is how I solved it:

from pyspark.sql.functions import array, col, explode, struct, lit
import re

df = sc.parallelize([(1, 0.0, 0.6, 0.1, 0.4, 0.7, 0.2, 0.4, 0.1),
                     (2, 0.6, 0.7, 0.1, 0.5, 0.8, 0.3, 0.1, 0.3)]).toDF(
    ["TS", "ABC[0].VAL.VAL[0].UNT[0].sth1", "ABC[0].VAL.VAL[0].UNT[1].sth1",
     "ABC[0].VAL.VAL[1].UNT[0].sth1", "ABC[0].VAL.VAL[1].UNT[1].sth1",
     "ABC[0].VAL.VAL[0].UNT[0].sth2", "ABC[0].VAL.VAL[0].UNT[1].sth2",
     "ABC[0].VAL.VAL[1].UNT[0].sth2", "ABC[0].VAL.VAL[1].UNT[1].sth2"])

# Dots in column names clash with struct-field access, so replace them with underscores.
newcols = list(map(lambda x: x.replace(".", "_"), df.columns))
df = df.toDF(*newcols)

# Every column except TS is melted into a (VAL, UNT, Parameter, data) struct,
# with the indices and the label pulled out of the column name by regex.
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in ["TS"]))
kvs = explode(array([struct(
    lit(re.search(r"VAL\[(\d{1,2})\]", c).group(1)).alias("VAL"),
    lit(re.search(r"UNT\[(\d{1,2})\]", c).group(1)).alias("UNT"),
    lit(re.search(r"([^_]+$)", c).group(1)).alias("Parameter"),
    col(c).alias("data")) for c in cols
])).alias("kvs")

# Explode to long format, then pivot the Parameter labels back into columns.
display(df.select(["TS"] + [kvs])
          .select(["TS"] + ["kvs.VAL", "kvs.UNT", "kvs.Parameter", "kvs.data"])
          .groupBy("TS", "VAL", "UNT")
          .pivot("Parameter")
          .sum("data")
          .orderBy("TS", "VAL", "UNT"))
Output:

+----+-----+-----+------+------+
| TS | VAL | UNT | sth1 | sth2 |
+----+-----+-----+------+------+
|  1 |   0 |   0 |    0 |  0.7 |
|  1 |   0 |   1 |  0.6 |  0.2 |
|  1 |   1 |   0 |  0.1 |  0.4 |
|  1 |   1 |   1 |  0.4 |  0.1 |
|  2 |   0 |   0 |  0.6 |  0.8 |
|  2 |   0 |   1 |  0.7 |  0.3 |
|  2 |   1 |   0 |  0.1 |  0.1 |
|  2 |   1 |   1 |  0.5 |  0.3 |
+----+-----+-----+------+------+
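
For intuition, here is a minimal sketch (reusing the df and kvs defined above; the name long_df is only for illustration) that inspects the intermediate long format the explode produces before the groupBy/pivot:

# Sketch: the explode turns each input row into one row per (VAL, UNT, Parameter)
# combination; this long format is what groupBy/pivot then reshapes.
long_df = (df
           .select("TS", kvs)
           .select("TS", "kvs.VAL", "kvs.UNT", "kvs.Parameter", "kvs.data"))
long_df.show()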

How could this be done better?


Your approach is fine. The only thing I would really change is to extract the relevant parts of the column name in a single regex search. I would also drop one redundant select in favor of the groupBy, but that hardly matters.

import re
from pyspark.sql.functions import lit, explode, array, struct, col

df = sc.parallelize([(1, 0.0, 0.6, 0.1, 0.4, 0.7, 0.2, 0.4, 0.1),
                     (2, 0.6, 0.7, 0.1, 0.5, 0.8, 0.3, 0.1, 0.3)]).toDF(
    ["TS", "ABC[0].VAL.VAL[0].UNT[0].sth1", "ABC[0].VAL.VAL[0].UNT[1].sth1",
     "ABC[0].VAL.VAL[1].UNT[0].sth1", "ABC[0].VAL.VAL[1].UNT[1].sth1",
     "ABC[0].VAL.VAL[0].UNT[0].sth2", "ABC[0].VAL.VAL[0].UNT[1].sth2",
     "ABC[0].VAL.VAL[1].UNT[0].sth2", "ABC[0].VAL.VAL[1].UNT[1].sth2"])

newcols = list(map(lambda x: x.replace(".", "_"), df.columns))
df = df.toDF(*newcols)

def extract_indices_and_label(column_name):
    # One regex search pulls out the VAL index, the UNT index and the label.
    s = re.match(r"\D+\[\d+\]\D+\[(\d+)\]\D+\[(\d+)\]_([^_]+)$", column_name)
    m, n, label = s.groups()
    return int(m), int(n), label

def create_struct(column_name):
    val, unt, label = extract_indices_and_label(column_name)
    return struct(lit(val).alias("val"),
                  lit(unt).alias("unt"),
                  lit(label).alias("label"),
                  col(column_name).alias("value"))

df2 = df.select(
    df.TS,
    explode(array([create_struct(c) for c in df.columns[1:]])))

Printing df2's schema is instructive here: it shows that the structure is almost there:

df2.printSchema()

root
 |-- TS: long (nullable = true)
 |-- col: struct (nullable = false)
 |    |-- val: integer (nullable = false)
 |    |-- unt: integer (nullable = false)
 |    |-- label: string (nullable = false)
 |    |-- value: double (nullable = true)

df3 = (df2
       .groupBy(df2.TS, df2.col.val.alias("VAL"), df2.col.unt.alias("UNT"))
       .pivot("col.label", values=("sth1", "sth2"))
       .sum("col.value"))

df3.orderBy("TS", "VAL", "UNT").show()

+---+---+---+----+----+
| TS|VAL|UNT|sth1|sth2|
+---+---+---+----+----+
|  1|  0|  0| 0.0| 0.7|
|  1|  0|  1| 0.6| 0.2|
|  1|  1|  0| 0.1| 0.4|
|  1|  1|  1| 0.4| 0.1|
|  2|  0|  0| 0.6| 0.8|
|  2|  0|  1| 0.7| 0.3|
|  2|  1|  0| 0.1| 0.1|
|  2|  1|  1| 0.5| 0.3|
+---+---+---+----+----+
If you know up front that only the two columns sth1 and sth2 will be pivoted, you can add them to pivot's values argument, which makes it more efficient still.
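
The reason is that without an explicit values list Spark first has to run an extra job to collect the distinct labels before it can plan the pivot. A minimal sketch of the two variants, reusing df2 from above:

# Without values, Spark must first compute the distinct labels (an extra job):
df2.groupBy(df2.TS, df2.col.val.alias("VAL"), df2.col.unt.alias("UNT")) \
   .pivot("col.label") \
   .sum("col.value")

# With the labels given up front, that extra pass is skipped:
df2.groupBy(df2.TS, df2.col.val.alias("VAL"), df2.col.unt.alias("UNT")) \
   .pivot("col.label", values=("sth1", "sth2")) \
   .sum("col.value")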

Comments: Please show what you have tried so far; Stack Overflow is not a free coding service. And keeping your last sentence in mind: if what you want is a critique of working code, I suggest asking it as a question elsewhere, since Stack Overflow focuses on fixing broken code, and as far as I can tell your code is not broken. Thx for the input, Oliver. By the way, @Lossa, if the way this DataFrame is generated can be changed, you might not need this parsing at all, but you would need to show that code. That said, the code you showed works and is reasonably efficient.
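
To illustrate that last remark: if the data could be ingested with its nesting intact rather than as flattened column names, the VAL/UNT indices would come out of the explode directly and no regex parsing would be needed. A minimal sketch, assuming a hypothetical DataFrame df_nested with a guessed schema of TS: long and ABC: array<struct<VAL: struct<VAL: array<struct<UNT: array<struct<sth1: double, sth2: double>>>>>>> (both the name and the schema are assumptions, not the asker's actual source):

from pyspark.sql.functions import col, posexplode

# Hypothetical: df_nested keeps the original nesting, so the indices are
# just the positions returned by posexplode.
step1 = df_nested.select(
    "TS",
    posexplode(col("ABC")[0]["VAL"]["VAL"]).alias("VAL", "v"))  # VAL = array position

step2 = step1.select(
    "TS", "VAL",
    posexplode(col("v")["UNT"]).alias("UNT", "u"))              # UNT = array position

result = step2.select("TS", "VAL", "UNT", "u.sth1", "u.sth2")
result.orderBy("TS", "VAL", "UNT").show()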