PySpark transformation: from column names to rows

I am working with PySpark and would like to transform this Spark DataFrame:

    +----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
    | TS | ABC[0].VAL.VAL[0].UNT[0].sth1 | ABC[0].VAL.VAL[0].UNT[1].sth1 | ABC[0].VAL.VAL[1].UNT[0].sth1 | ABC[0].VAL.VAL[1].UNT[1].sth1 | ABC[0].VAL.VAL[0].UNT[0].sth2 | ABC[0].VAL.VAL[0].UNT[1].sth2 | ABC[0].VAL.VAL[1].UNT[0].sth2 | ABC[0].VAL.VAL[1].UNT[1].sth2 |
    +----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
    | 1  | some_value                    | some_value                    | some_value                    | some_value                    | some_value                    | some_value                    | some_value                    | some_value                    |
    +----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
into this:

+----+-----+-----+------------+------------+
| TS | VAL | UNT |    sth1    |    sth2    |
+----+-----+-----+------------+------------+
|  1 |   0 |   0 | some_value | some_value |
|  1 |   0 |   1 | some_value | some_value |
|  1 |   1 |   0 | some_value | some_value |
|  1 |   1 |   1 | some_value | some_value |
+----+-----+-----+------------+------------+
Do you have any idea how I could do this with some fancy transformation?

Edit: This is how I solved it:

from pyspark.sql.functions import array, col, explode, struct, lit
import re

df = sc.parallelize([(1, 0.0, 0.6, 0.1, 0.4, 0.7, 0.2, 0.4, 0.1),
                     (2, 0.6, 0.7, 0.1, 0.5, 0.8, 0.3, 0.1, 0.3)]).toDF(
    ["TS", "ABC[0].VAL.VAL[0].UNT[0].sth1", "ABC[0].VAL.VAL[0].UNT[1].sth1",
     "ABC[0].VAL.VAL[1].UNT[0].sth1", "ABC[0].VAL.VAL[1].UNT[1].sth1",
     "ABC[0].VAL.VAL[0].UNT[0].sth2", "ABC[0].VAL.VAL[0].UNT[1].sth2",
     "ABC[0].VAL.VAL[1].UNT[0].sth2", "ABC[0].VAL.VAL[1].UNT[1].sth2"])

# Dots in column names clash with struct-field access, so replace them with underscores.
newcols = list(map(lambda x: x.replace(".", "_"), df.columns))
df = df.toDF(*newcols)

# Every column except TS is melted into a (VAL, UNT, Parameter, data) struct,
# with the indices and the label pulled out of the column name by regex.
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in ["TS"]))
kvs = explode(array([struct(
    lit(re.search(r"VAL\[(\d{1,2})\]", c).group(1)).alias("VAL"),
    lit(re.search(r"UNT\[(\d{1,2})\]", c).group(1)).alias("UNT"),
    lit(re.search(r"([^_]+$)", c).group(1)).alias("Parameter"),
    col(c).alias("data")) for c in cols
])).alias("kvs")

# Explode to long format, then pivot the Parameter labels back into columns.
display(df.select(["TS"] + [kvs])
          .select(["TS"] + ["kvs.VAL", "kvs.UNT", "kvs.Parameter", "kvs.data"])
          .groupBy("TS", "VAL", "UNT")
          .pivot("Parameter")
          .sum("data")
          .orderBy("TS", "VAL", "UNT"))
Output:

+----+-----+-----+------+------+
| TS | VAL | UNT | sth1 | sth2 |
+----+-----+-----+------+------+
|  1 |   0 |   0 |    0 |  0.7 |
|  1 |   0 |   1 |  0.6 |  0.2 |
|  1 |   1 |   0 |  0.1 |  0.4 |
|  1 |   1 |   1 |  0.4 |  0.1 |
|  2 |   0 |   0 |  0.6 |  0.8 |
|  2 |   0 |   1 |  0.7 |  0.3 |
|  2 |   1 |   0 |  0.1 |  0.1 |
|  2 |   1 |   1 |  0.5 |  0.3 |
+----+-----+-----+------+------+
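
For intuition, here is a minimal sketch (reusing the df and kvs defined above; the name long_df is only for illustration) that inspects the intermediate long format the explode produces before the groupBy/pivot:

# Sketch: the explode turns each input row into one row per (VAL, UNT, Parameter)
# combination; this long format is what groupBy/pivot then reshapes.
long_df = (df
           .select("TS", kvs)
           .select("TS", "kvs.VAL", "kvs.UNT", "kvs.Parameter", "kvs.data"))
long_df.show()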

How could this be done better?


Your approach is fine. The only thing I would really change is to extract the relevant parts of the column name in a single regex search. I would also drop one redundant select in favor of the groupBy, but that hardly matters.

import re
from pyspark.sql.functions import lit, explode, array, struct, col

df = sc.parallelize([(1, 0.0, 0.6, 0.1, 0.4, 0.7, 0.2, 0.4, 0.1),
                     (2, 0.6, 0.7, 0.1, 0.5, 0.8, 0.3, 0.1, 0.3)]).toDF(
    ["TS", "ABC[0].VAL.VAL[0].UNT[0].sth1", "ABC[0].VAL.VAL[0].UNT[1].sth1",
     "ABC[0].VAL.VAL[1].UNT[0].sth1", "ABC[0].VAL.VAL[1].UNT[1].sth1",
     "ABC[0].VAL.VAL[0].UNT[0].sth2", "ABC[0].VAL.VAL[0].UNT[1].sth2",
     "ABC[0].VAL.VAL[1].UNT[0].sth2", "ABC[0].VAL.VAL[1].UNT[1].sth2"])

newcols = list(map(lambda x: x.replace(".", "_"), df.columns))
df = df.toDF(*newcols)

def extract_indices_and_label(column_name):
    # One regex search pulls out the VAL index, the UNT index and the label.
    s = re.match(r"\D+\[\d+\]\D+\[(\d+)\]\D+\[(\d+)\]_([^_]+)$", column_name)
    m, n, label = s.groups()
    return int(m), int(n), label

def create_struct(column_name):
    val, unt, label = extract_indices_and_label(column_name)
    return struct(lit(val).alias("val"),
                  lit(unt).alias("unt"),
                  lit(label).alias("label"),
                  col(column_name).alias("value"))

df2 = df.select(
    df.TS,
    explode(array([create_struct(c) for c in df.columns[1:]])))

Printing df2's schema is instructive here: it shows that the structure is almost there:

df2.printSchema()

root
 |-- TS: long (nullable = true)
 |-- col: struct (nullable = false)
 |    |-- val: integer (nullable = false)
 |    |-- unt: integer (nullable = false)
 |    |-- label: string (nullable = false)
 |    |-- value: double (nullable = true)

df3 = (df2
       .groupBy(df2.TS, df2.col.val.alias("VAL"), df2.col.unt.alias("UNT"))
       .pivot("col.label", values=("sth1", "sth2"))
       .sum("col.value"))

df3.orderBy("TS", "VAL", "UNT").show()

+---+---+---+----+----+
| TS|VAL|UNT|sth1|sth2|
+---+---+---+----+----+
|  1|  0|  0| 0.0| 0.7|
|  1|  0|  1| 0.6| 0.2|
|  1|  1|  0| 0.1| 0.4|
|  1|  1|  1| 0.4| 0.1|
|  2|  0|  0| 0.6| 0.8|
|  2|  0|  1| 0.7| 0.3|
|  2|  1|  0| 0.1| 0.1|
|  2|  1|  1| 0.5| 0.3|
+---+---+---+----+----+
If you know up front that only the two columns sth1 and sth2 will be pivoted, you can add them to pivot's values argument, which makes it more efficient still.
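
The reason is that without an explicit values list Spark first has to run an extra job to collect the distinct labels before it can plan the pivot. A minimal sketch of the two variants, reusing df2 from above:

# Without values, Spark must first compute the distinct labels (an extra job):
df2.groupBy(df2.TS, df2.col.val.alias("VAL"), df2.col.unt.alias("UNT")) \
   .pivot("col.label") \
   .sum("col.value")

# With the labels given up front, that extra pass is skipped:
df2.groupBy(df2.TS, df2.col.val.alias("VAL"), df2.col.unt.alias("UNT")) \
   .pivot("col.label", values=("sth1", "sth2")) \
   .sum("col.value")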

Comments: Please show what you have tried so far; Stack Overflow is not a free coding service. And keeping your last sentence in mind: if what you want is a critique of working code, I suggest asking it as a question elsewhere, since Stack Overflow focuses on fixing broken code, and as far as I can tell your code is not broken. Thx for the input, Oliver. By the way, @Lossa, if the way this DataFrame is generated can be changed, you might not need this parsing at all, but you would need to show that code. That said, the code you showed works and is reasonably efficient.
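
To illustrate that last remark: if the data could be ingested with its nesting intact rather than as flattened column names, the VAL/UNT indices would come out of the explode directly and no regex parsing would be needed. A minimal sketch, assuming a hypothetical DataFrame df_nested with a guessed schema of TS: long and ABC: array<struct<VAL: struct<VAL: array<struct<UNT: array<struct<sth1: double, sth2: double>>>>>>> (both the name and the schema are assumptions, not the asker's actual source):

from pyspark.sql.functions import col, posexplode

# Hypothetical: df_nested keeps the original nesting, so the indices are
# just the positions returned by posexplode.
step1 = df_nested.select(
    "TS",
    posexplode(col("ABC")[0]["VAL"]["VAL"]).alias("VAL", "v"))  # VAL = array position

step2 = step1.select(
    "TS", "VAL",
    posexplode(col("v")["UNT"]).alias("UNT", "u"))              # UNT = array position

result = step2.select("TS", "VAL", "UNT", "u.sth1", "u.sth2")
result.orderBy("TS", "VAL", "UNT").show()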