PySpark row-wise condition on a Spark DataFrame with 1000 columns


I have a Spark DataFrame (df) with n rows and m columns, plus two Python lists (lowerL and upperL), each containing m values. I want to sample all rows of df that fall between lowerL and upperL, and then take the sum of df.col_1000 (one of df's columns) over the sampled DataFrame. I am using PySpark (Spark 1.6.1).

For n=5 and m=4:

df looks like:

+-----+-----+-----+------+
|col_1|col_2|col_3|Result|
+-----+-----+-----+------+
|62.45| 41.2|62.49|   1.0|
|56.45|46.39|60.38|   1.0|
|68.37|43.56|71.97|   0.0|
| 53.9| 51.7|70.12|   1.0|
| 56.4|57.32|48.39|   0.0|
+-----+-----+-----+------+

lowerL looks like:

+-----+-----+-----+------+
|col_1|col_2|col_3|Result|
+-----+-----+-----+------+
|51.81|42.22|51.48|  -1.0|
+-----+-----+-----+------+

upperL looks like:

+-----+-----+-----+------+
|col_1|col_2|col_3|Result|
+-----+-----+-----+------+
|61.91|58.63|72.48|   2.0|
+-----+-----+-----+------+

The result I want is:

+-----+-----+-----+------+
|col_1|col_2|col_3|Result|
+-----+-----+-----+------+
|56.45|46.39|60.38|   1.0|
| 53.9| 51.7|70.12|   1.0|
+-----+-----+-----+------+

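For reference, since lowerL and upperL are described above as plain Python lists with one bound per column, the row-wise check can be sketched by zipping the lists with df.columns and AND-ing the per-column conditions. This is only a sketch under that assumption; the bound values below are taken from the small example, not from the real 1000-column data:

from functools import reduce
import pyspark.sql.functions as F

# One lower/upper bound per column, aligned with df.columns.
lowerL = [51.81, 42.22, 51.48, -1.0]
upperL = [61.91, 58.63, 72.48, 2.0]

# Strict "lower < value < upper" test for every column, combined with AND.
per_column = [(F.col(c) > F.lit(lo)) & (F.col(c) < F.lit(hi))
              for c, lo, hi in zip(df.columns, lowerL, upperL)]
row_condition = reduce(lambda a, b: a & b, per_column)

sampled = df.filter(row_condition)
sampled.show()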

Here is an example:

df = sqlContext.createDataFrame(
    [[62.45, 41.2, 62.49, 1.0],
     [56.45, 46.39, 60.38, 1.0],
     [68.37, 43.56, 71.97, 0.0],
     [53.9, 51.7, 70.12, 1.0],
     [56.4, 57.32, 48.39, 0.0]],
    ['col_1', 'col_2', 'col_3', 'Result'])
df.show()

+-----+-----+-----+------+
|col_1|col_2|col_3|Result|
+-----+-----+-----+------+
|62.45| 41.2|62.49|   1.0|
|56.45|46.39|60.38|   1.0|
|68.37|43.56|71.97|   0.0|
| 53.9| 51.7|70.12|   1.0|
| 56.4|57.32|48.39|   0.0|
+-----+-----+-----+------+

lower_df = sqlContext.createDataFrame([[51.81, 42.22, 51.48, -1.0]], ['col_1', 'col_2', 'col_3', 'Result'])
lower_df.show()

+-----+-----+-----+------+
|col_1|col_2|col_3|Result|
+-----+-----+-----+------+
|51.81|42.22|51.48|  -1.0|
+-----+-----+-----+------+


upper_df = sqlContext.createDataFrame([[61.91, 58.63, 72.48, 2.0]], ['col_1', 'col_2', 'col_3', 'Result'])
upper_df.show()

+-----+-----+-----+------+
|col_1|col_2|col_3|Result|
+-----+-----+-----+------+
|61.91|58.63|72.48|   2.0|
+-----+-----+-----+------+

import pyspark.sql.functions as F

# Pull the single row of bounds out of each DataFrame as a Row object.
lower_df_val = lower_df.first()
upper_df_val = upper_df.first()

# For every column, keep only the rows strictly between the lower and upper bound.
# Each filter() call narrows df further, so the surviving rows satisfy all columns.
for fil in [(F.col(df_column) > F.lit(lower_df_val[df_column])) &
            (F.col(df_column) < F.lit(upper_df_val[df_column]))
            for df_column in df.columns]:
    df = df.filter(fil)

df.show()


+-----+-----+-----+------+
|col_1|col_2|col_3|Result|
+-----+-----+-----+------+
|56.45|46.39|60.38|   1.0|
| 53.9| 51.7|70.12|   1.0|
+-----+-----+-----+------+
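To finish the task described in the question (the sum of col_1000 over the sampled rows), one aggregation on the filtered DataFrame should be enough. col_1000 is not part of this small example, so the snippet below uses Result as a stand-in:

import pyspark.sql.functions as F  # already imported above; repeated for completeness

# Sum of the target column over the rows that passed every bound
# ('df' is the filtered DataFrame produced by the loop above).
# On the real 1000-column DataFrame, replace 'Result' with 'col_1000'.
total = df.agg(F.sum('Result').alias('total')).first()['total']
print(total)  # 2.0 for the two example rows that remain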
Do you have a sample? This line is confusing: "I want to sample all rows between lowerL and upperL."
I have edited the question with an example. In the example, 2 of the 5 rows of df fall between lowerL and upperL.