Python: How to iterate over multiple DataFrame columns in PySpark?
Suppose I have a DataFrame df with only one column, where
df.show()
is
|a, b,c,d|
|a, b,c,d|
So I would like to get a df1, where df1.show()
is
|a | b | c|
In short, I want to break a DataFrame with only one column down into a DataFrame with multiple columns. My idea is:
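import pyspark.sql.functions
# Split column x on spaces; no trailing comma, which would turn split_col into a tuple
split_col = pyspark.sql.functions.split(df['x'], ' ')
df = df.withColumn('0', split_col.getItem(0))
df = df.withColumn('1', split_col.getItem(1))
and so on.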
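But what if I have many columns? Is there a way to do this other than writing out a large number of iterations in PySpark? Thanks.

You can use a select clause to iterate over the indices and set the column names, as shown below. In this version split is hit on every pass of the loop, so the physical plan contains one split expression per output column, which is less efficient: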
from pyspark.sql import functions as F
df.select(*[(F.split("x",' ')[i]).alias(str(i)) for i in range(100)]).explain()
#== Physical Plan ==
#*(1) Project [split(x#200, )[0] AS 0#1708, split(x#200, )[1]
#  AS 1#1709, split(x#200, )[2] AS 2#1710, split(x#200, )[3] AS
#  3#1711, split(x#200, )[4] AS 4#1712, split(x#200, )[5] AS
#  5#1713, split(x#200, )[6] AS 6#1714, split(x#200, )[7] AS
#  7#1715, split(x#200, )[8] AS 8#1716, split(x#200, )[9] AS
#  9#1717, split(x#200, )[10] AS 10#1718, split(x#200, )[11] AS
#  11#1719, split(x#200, )[12] AS 12#1720, split(x#200, )[13] AS
#  13#1721, split(x#200, )[14] AS 14#1722, split(x#200, )[15] AS
#  15#1723, split(x#200, )[16] AS 16#1724, split(x#200, )[17] AS
#  17#1725, split(x#200, )[18] AS 18#1726, split(x#200, )[19] AS
#  19#1727, split(x#200, )[20] AS 20#1728, split(x#200, )[21] AS
#  21#1729, split(x#200, )[22] AS 22#1730, split(x#200, )[23] AS
#  23#1731, ... 76 more fields]
#+- *(1) Scan ExistingRDD[x#200]
Instead, you can split the column once and let Spark project from a single split operation rather than many:
from pyspark.sql import functions as F
df\
.withColumn("x", F.split('x',' '))\
.select(*[(F.col("x")[i]).alias(str(i)) for i in range(100)]).drop("x").explain()
#== Physical Plan ==
#*(1) Project [x#1908[0] AS 0#1910, x#1908[1] AS 1#1911,
#  x#1908[2] AS 2#1912, x#1908[3] AS 3#1913, x#1908[4] AS 4#1914,
#  x#1908[5] AS 5#1915, x#1908[6] AS 6#1916, x#1908[7] AS 7#1917,
#  x#1908[8] AS 8#1918, x#1908[9] AS 9#1919, x#1908[10] AS 10#1920,
#  x#1908[11] AS 11#1921, x#1908[12] AS 12#1922, x#1908[13] AS
#  13#1923, x#1908[14] AS 14#1924, x#1908[15] AS 15#1925, x#1908[16]
#  AS 16#1926, x#1908[17] AS 17#1927, x#1908[18] AS 18#1928,
#  x#1908[19] AS 19#1929, x#1908[20] AS 20#1930, x#1908[21] AS
#  21#1931, x#1908[22] AS 22#1932, x#1908[23] AS 23#1933, ... 76
#  more fields]
#+- *(1) Project [split(x#200, ) AS x#1908]
#   +- *(1) Scan ExistingRDD[x#200]