Assigning a score based on conditions in a PySpark DataFrame

I created a DataFrame in PySpark using the code below.

df = sqlContext.createDataFrame(
    [(1,'Y','Y',0,0,0,2,'Y','N','Y','Y'),
     (2,'N','Y',2,1,2,3,'N','Y','Y','N'),
     (3,'Y','N',3,1,0,0,'N','N','N','N'),
     (4,'N','Y',5,0,1,0,'N','N','N','Y'),
     (5,'Y','N',2,2,0,1,'Y','N','N','Y'),
     (6,'Y','Y',0,0,3,6,'Y','N','Y','N'),
     (7,'N','N',1,1,3,4,'N','Y','N','Y'),
     (8,'Y','Y',1,1,2,0,'Y','Y','N','N')
    ],
    ('id', 'compatible', 'product', 'ios', 'pc', 'other', 'devices', 'customer', 'subscriber', 'circle', 'smb')
)
df.show()

+---+----------+-------+---+---+-----+-------+--------+----------+------+---+
| id|compatible|product|ios| pc|other|devices|customer|subscriber|circle|smb|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+
|  1|         Y|      Y|  0|  0|    0|      2|       Y|         N|     Y|  Y|
|  2|         N|      Y|  2|  1|    2|      3|       N|         Y|     Y|  N|
|  3|         Y|      N|  3|  1|    0|      0|       N|         N|     N|  N|
|  4|         N|      Y|  5|  0|    1|      0|       N|         N|     N|  Y|
|  5|         Y|      N|  2|  2|    0|      1|       Y|         N|     N|  Y|
|  6|         Y|      Y|  0|  0|    3|      6|       Y|         N|     Y|  N|
|  7|         N|      N|  1|  1|    3|      4|       N|         Y|     N|  Y|
|  8|         Y|      Y|  1|  1|    2|      0|       Y|         Y|     N|  N|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+
Now I want to create a new column in the DataFrame above, based on some conditions:

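1) Apply the rules below only when the compatible column is Y.

2) For each of the product, customer, subscriber, circle and smb columns, if the value is Y assign 10, else 0.

3) If the sum of the ios, pc and other columns is greater than 4, assign 10, else 0.

4) If the devices column is greater than 4, assign 10, else 0.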

Then add up all of the values above and store the total in a score column of the PySpark DataFrame.

The output I want looks like this:

+---+----------+-------+---+---+-----+-------+--------+----------+------+---+-----+
| id|compatible|product|ios| pc|other|devices|customer|subscriber|circle|smb|score|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+-----+
|  1|         Y|      Y|  0|  0|    0|      2|       Y|         N|     Y|  Y|   50|
|  2|         N|      Y|  2|  1|    2|      3|       N|         Y|     Y|  N|    0|
|  3|         Y|      N|  3|  1|    0|      0|       N|         N|     N|  N|    0|
|  4|         N|      Y|  5|  0|    1|      0|       N|         N|     N|  Y|    0|
|  5|         Y|      N|  2|  2|    0|      1|       Y|         N|     N|  Y|   30|
|  6|         Y|      Y|  0|  0|    3|      6|       Y|         N|     Y|  N|   40| 
|  7|         N|      N|  1|  1|    3|      4|       N|         Y|     N|  Y|    0|
|  8|         Y|      Y|  1|  1|    2|      0|       Y|         Y|     N|  N|   30|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+-----+
I tried the following:

df1 = df.where(f.col('compatible') == 'Y').\
    withColumn('score', f.when(f.col('product') == 'Y', 10) +
               f.when(f.col('ios') + f.col('pc') + f.col('other') > 4, 10) + f.when(f.col('devices') > 0, 10) +
               f.when(f.col('customer') == 'Y', 10) + f.when(f.col('subscriber') == 'Y', 10) +
               f.when(f.col('circle') == 'Y', 10) + f.when(f.col('smb') == 'Y', 10).otherwise(0))
The result I get is:

+---+----------+-------+---+---+-----+-------+--------+----------+------+---+-----+
| id|compatible|product|ios| pc|other|devices|customer|subscriber|circle|smb|score|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+-----+
|  1|         Y|      Y|  0|  0|    0|      2|       Y|         N|     Y|  Y| null|
|  3|         Y|      N|  3|  1|    0|      0|       N|         N|     N|  N| null|
|  5|         Y|      N|  2|  2|    0|      1|       Y|         N|     N|  Y| null|
|  6|         Y|      Y|  0|  0|    3|      6|       Y|         N|     Y|  N| null|
|  8|         Y|      Y|  1|  1|    2|      0|       Y|         Y|     N|  N| null|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+-----+

How can I achieve what I want?

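Each when needs its own otherwise(0): a when with no otherwise evaluates to null for rows that do not match, and null plus any number is null, which is why your score column came back null. Moving the compatible check into a when, instead of filtering with where, also keeps the N rows in the result with a score of 0. The following when/otherwise conditions should meet your requirements: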

# f is the usual alias for the pyspark.sql.functions module
from pyspark.sql import functions as f

df.withColumn('score',
              f.when(df['compatible'] == 'Y',
                     f.when(df['product'] == 'Y', 10).otherwise(0) +
                     f.when(df['customer'] == 'Y', 10).otherwise(0) +
                     f.when(df['subscriber'] == 'Y', 10).otherwise(0) +
                     f.when(df['circle'] == 'Y', 10).otherwise(0) +
                     f.when(df['smb'] == 'Y', 10).otherwise(0) +
                     f.when((df['ios'] + df['pc'] + df['other']) > 4, 10).otherwise(0) +
                     f.when(df['devices'] > 4, 10).otherwise(0)
                ).otherwise(0))\
    .show(truncate=False)
which should give you

+---+----------+-------+---+---+-----+-------+--------+----------+------+---+-----+
|id |compatible|product|ios|pc |other|devices|customer|subscriber|circle|smb|score|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+-----+
|1  |Y         |Y      |0  |0  |0    |2      |Y       |N         |Y     |Y  |40   |
|2  |N         |Y      |2  |1  |2    |3      |N       |Y         |Y     |N  |0    |
|3  |Y         |N      |3  |1  |0    |0      |N       |N         |N     |N  |0    |
|4  |N         |Y      |5  |0  |1    |0      |N       |N         |N     |Y  |0    |
|5  |Y         |N      |2  |2  |0    |1      |Y       |N         |N     |Y  |20   |
|6  |Y         |Y      |0  |0  |3    |6      |Y       |N         |Y     |N  |40   |
|7  |N         |N      |1  |1  |3    |4      |N       |Y         |N     |Y  |0    |
|8  |Y         |Y      |1  |1  |2    |0      |Y       |Y         |N     |N  |30   |
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+-----+
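The scores in this table can be cross-checked by replaying the four rules in plain Python, without Spark. The sketch below is only an illustration: score_row, FLAG_COLUMNS and the devices_threshold parameter are hypothetical names that do not appear in the question. With the threshold of 4 stated in rule 4 it reproduces the table above (40 for id 1, 20 for id 5). With a threshold of 0, which is what the attempted code in the question used for devices, it reproduces the 50 and 30 shown in the desired output, so the difference between the two tables appears to come from that threshold.

```python
# Plain-Python replay of the scoring rules; no Spark required.
COLUMNS = ('id', 'compatible', 'product', 'ios', 'pc', 'other',
           'devices', 'customer', 'subscriber', 'circle', 'smb')

ROWS = [
    (1, 'Y', 'Y', 0, 0, 0, 2, 'Y', 'N', 'Y', 'Y'),
    (2, 'N', 'Y', 2, 1, 2, 3, 'N', 'Y', 'Y', 'N'),
    (3, 'Y', 'N', 3, 1, 0, 0, 'N', 'N', 'N', 'N'),
    (4, 'N', 'Y', 5, 0, 1, 0, 'N', 'N', 'N', 'Y'),
    (5, 'Y', 'N', 2, 2, 0, 1, 'Y', 'N', 'N', 'Y'),
    (6, 'Y', 'Y', 0, 0, 3, 6, 'Y', 'N', 'Y', 'N'),
    (7, 'N', 'N', 1, 1, 3, 4, 'N', 'Y', 'N', 'Y'),
    (8, 'Y', 'Y', 1, 1, 2, 0, 'Y', 'Y', 'N', 'N'),
]

# Columns that each contribute 10 points when their value is 'Y' (rule 2).
FLAG_COLUMNS = ('product', 'customer', 'subscriber', 'circle', 'smb')


def score_row(row, devices_threshold=4):
    """Hypothetical helper: score one row according to the four rules."""
    r = dict(zip(COLUMNS, row))
    if r['compatible'] != 'Y':                      # rule 1
        return 0
    score = 10 * sum(r[c] == 'Y' for c in FLAG_COLUMNS)   # rule 2
    if r['ios'] + r['pc'] + r['other'] > 4:         # rule 3
        score += 10
    if r['devices'] > devices_threshold:            # rule 4
        score += 10
    return score


# Threshold 4, as written in rule 4 of the question.
scores_rule = [score_row(row) for row in ROWS]
# Threshold 0, as used in the attempted code in the question.
scores_attempt = [score_row(row, devices_threshold=0) for row in ROWS]

print(scores_rule)     # [40, 0, 0, 0, 20, 40, 0, 30]
print(scores_attempt)  # [50, 0, 0, 0, 30, 40, 0, 30]
```

If the 50 and 30 in the desired output are really what you want, changing the devices comparison from > 4 to > 0 in the Spark expression should produce them.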
You can modularize it as

def firstCondition(dataframe):
    return f.when(dataframe['product'] == 'Y', 10).otherwise(0) + \
           f.when(dataframe['customer'] == 'Y', 10).otherwise(0) + \
           f.when(dataframe['subscriber'] == 'Y', 10).otherwise(0) + \
           f.when(dataframe['circle'] == 'Y', 10).otherwise(0) + \
           f.when(dataframe['smb'] == 'Y', 10).otherwise(0)

def secondCondition(dataframe):
    return f.when((dataframe['ios'] + dataframe['pc'] + dataframe['other']) > 4, 10).otherwise(0)

def thirdCondition(dataframe):
    return f.when(dataframe['devices'] > 4, 10).otherwise(0)

df.withColumn('score',
              f.when(df['compatible'] == 'Y', firstCondition(df) + secondCondition(df) + thirdCondition(df)).otherwise(0))\
    .show(truncate=False)
I hope the answer is helpful.