Scala Spark DataFrame join shows unexpected result: 0 rows
I am using spark-1.6.0 and I want to join 2 DataFrames, df_train_raw and df_user_clicks_info. They are shown below as they appear in the YARN log.
df_user_clicks_info
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
| 104752237| 1.71| 0| 0| 0| 4| 4| 4| 0.8| 0| 0| 0| 0| 4| 0| 4.0| 0| 0| 0| 4| 0| 4|
| 105517237| 17.14| 12| 36| 12| 0| 60| 0| 9.6| 0| 0| 0| 0| 48| 0| 36.0| 12| 36| 12| 0| 12| 0|
| 109901037| 2.14| 0| 3| 3| 6| 3| 0| 2.4| 0| 0| 3| 6| 3| 0| 1.5| 0| 3| 0| 0| 0| 0|
| 105246837| 8.0| 8| 0| 0| 16| 32| 0| 8.0| 8| 0| 0| 8| 24| 0| 8.0| 0| 0| 0| 8| 8| 0|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
————————————
root
|-- subscriberid: string (nullable = true)
|-- user_clicks_avg_everyday_a_week: double (nullable = false)
|-- user_clicks_sum_time_1_9_a_week: long (nullable = false)
|-- user_clicks_sum_time_9_14_a_week: long (nullable = false)
|-- user_clicks_sum_time_14_17_a_week: long (nullable = false)
|-- user_clicks_sum_time_17_19_a_week: long (nullable = false)
|-- user_clicks_sum_time_19_23_a_week: long (nullable = false)
|-- user_clicks_sum_time_23_1_a_week: long (nullable = false)
|-- user_clicks_avg_everyday_weekday: double (nullable = false)
|-- user_clicks_sum_time_1_9_weekday: long (nullable = false)
|-- user_clicks_sum_time_9_14_weekday: long (nullable = false)
|-- user_clicks_sum_time_14_17_weekday: long (nullable = false)
|-- user_clicks_sum_time_17_19_weekday: long (nullable = false)
|-- user_clicks_sum_time_19_23_weekday: long (nullable = false)
|-- user_clicks_sum_time_23_1_weekday: long (nullable = false)
|-- user_clicks_avg_everyday_weekdend: double (nullable = false)
|-- user_clicks_sum_time_1_9_weekdend: long (nullable = false)
|-- user_clicks_sum_time_9_14_weekdend: long (nullable = false)
|-- user_clicks_sum_time_14_17_weekdend: long (nullable = false)
|-- user_clicks_sum_time_17_19_weekdend: long (nullable = false)
|-- user_clicks_sum_time_19_23_weekdend: long (nullable = false)
|-- user_clicks_sum_time_23_1_weekdend: long (nullable = false)
df_user_clicks_info.select("subscriberid").take(20).foreach(println)
[104752237]
[105517237]
[109901037]
[105246837]
I tried to inner join them with this code:
val df_tmp_tmp_0 = df_train_raw.join(df_user_clicks_info, Seq("subscriberid"))
df_tmp_tmp_0.show()
But the result I got is empty:
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|objectid|label|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
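I don't know why. Nothing here looks wrong to me. I hope to get some help, thanks.
After hearing two friends' suggestions about whitespace, I tried again: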
df_train_raw
————————————
+------------+-----------+-----+
|subscriberid| objectid|label|
+------------+-----------+-----+
| 104752237|11029932485| 0|
| 105246837|11029932485| 0|
| 105517237|11029932485| 0|
| 108917037|11030797988| 0|
| 108917037|11029648595| 0|
| 109901037|11029648595| 0|
| 105517237|11030720502| 0|
| 105246837|11029986502| 0|
| 104752237|11029191717| 0|
| 105246837|11029191717| 0|
| 105517237|11029191717| 0|
| 109901037|11030138623| 0|
| 105517237|11014105538| 0|
| 105517237|11014105543| 0|
| 105517237|11016478156| 0|
| 105517237|11023285357| 0|
| 105246837|11026067980| 0|
| 105246837|11030797988| 0|
| 108917037|11029932485| 0|
| 109901037|11029932485| 0|
+------------+-----------+-----+
only showing top 20 rows
————————————
root
|-- subscriberid: long (nullable = true)
|-- objectid: long (nullable = true)
|-- label: integer (nullable = true)
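The two printed schemas declare different types for the join key: subscriberid is long in df_train_raw and string in df_user_clicks_info. A minimal sketch of checking this programmatically is below. The helper name keyTypes is hypothetical (it is not part of the original post), and the sketch only reports the declared types; it does not claim that the mismatch is the cause of the empty join.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DataType

// Hypothetical helper (not from the original post): returns the declared
// type of the column `key` in each of the two DataFrames.
def keyTypes(left: DataFrame, right: DataFrame, key: String): (DataType, DataType) =
  (left.schema(key).dataType, right.schema(key).dataType)
```

With the names from the question, it could be used as `val (l, r) = keyTypes(df_train_raw, df_user_clicks_info, "subscriberid")`, followed by a comparison of `l` and `r` before joining.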
I also printed the "subscriberid" column, which shows that there are no spaces:
df_train_raw.select("subscriberid").take(20).foreach(println)
Result:
[104752237]
[105246837]
[105517237]
[108917037]
[108917037]
[109901037]
[105517237]
[105246837]
[104752237]
[105246837]
[105517237]
[109901037]
[105517237]
[105517237]
[105517237]
[105517237]
[105246837]
[105246837]
[108917037]
[109901037]
Then, df_user_clicks_info:
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
| 104752237| 1.71| 0| 0| 0| 4| 4| 4| 0.8| 0| 0| 0| 0| 4| 0| 4.0| 0| 0| 0| 4| 0| 4|
| 105517237| 17.14| 12| 36| 12| 0| 60| 0| 9.6| 0| 0| 0| 0| 48| 0| 36.0| 12| 36| 12| 0| 12| 0|
| 109901037| 2.14| 0| 3| 3| 6| 3| 0| 2.4| 0| 0| 3| 6| 3| 0| 1.5| 0| 3| 0| 0| 0| 0|
| 105246837| 8.0| 8| 0| 0| 16| 32| 0| 8.0| 8| 0| 0| 8| 24| 0| 8.0| 0| 0| 0| 8| 8| 0|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
————————————
root
|-- subscriberid: string (nullable = true)
|-- user_clicks_avg_everyday_a_week: double (nullable = false)
|-- user_clicks_sum_time_1_9_a_week: long (nullable = false)
|-- user_clicks_sum_time_9_14_a_week: long (nullable = false)
|-- user_clicks_sum_time_14_17_a_week: long (nullable = false)
|-- user_clicks_sum_time_17_19_a_week: long (nullable = false)
|-- user_clicks_sum_time_19_23_a_week: long (nullable = false)
|-- user_clicks_sum_time_23_1_a_week: long (nullable = false)
|-- user_clicks_avg_everyday_weekday: double (nullable = false)
|-- user_clicks_sum_time_1_9_weekday: long (nullable = false)
|-- user_clicks_sum_time_9_14_weekday: long (nullable = false)
|-- user_clicks_sum_time_14_17_weekday: long (nullable = false)
|-- user_clicks_sum_time_17_19_weekday: long (nullable = false)
|-- user_clicks_sum_time_19_23_weekday: long (nullable = false)
|-- user_clicks_sum_time_23_1_weekday: long (nullable = false)
|-- user_clicks_avg_everyday_weekdend: double (nullable = false)
|-- user_clicks_sum_time_1_9_weekdend: long (nullable = false)
|-- user_clicks_sum_time_9_14_weekdend: long (nullable = false)
|-- user_clicks_sum_time_14_17_weekdend: long (nullable = false)
|-- user_clicks_sum_time_17_19_weekdend: long (nullable = false)
|-- user_clicks_sum_time_19_23_weekdend: long (nullable = false)
|-- user_clicks_sum_time_23_1_weekdend: long (nullable = false)
df_user_clicks_info.select("subscriberid").take(20).foreach(println)
[104752237]
[105517237]
[109901037]
[105246837]
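It doesn't work either. :(

Thanks to the friends who helped me. The cause, I think, is a bug in SPARK-1.6.0, and I worked around it by changing the data flow without upgrading Spark. At first I wanted to get df_3 from df_1 and df_2, but because of the problem described in the question I did not get the result I wanted. So I tried another route that produces df_tmp_1 and df_tmp_2, then joined those and got the result. I don't know why this works either, but if you use SPARK-1.6.0 and hit a join problem like mine, it seems worth trying.

Can you cast to bigint and then compare? I guess the data types in both DataFrames might be different. Let me know the result.

@SadamHussain M, thanks for your suggestion~ I tried casting "subscriberid" in both DataFrames to long, but it didn't work~ :(

They are string-type columns. Make sure there are no spaces before or after the numbers in either DataFrame. A value such as "162323641" with a leading or trailing space will not equal "162323641", so those rows will not join.

@Selnay thanks for your suggestion~ I checked the "subscriberid" column used for the join in both DataFrames. I printed it, there are no spaces, and the join still does not work. :(

Try
val df_tmp_tmp_0 = df_train_raw.join(df_user_clicks_info, df_train_raw("subscriberid") === df_user_clicks_info("subscriberid"))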
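The two suggestions from the comments (cast the key to the same type and strip surrounding whitespace) can be combined before the join, as in the sketch below. The helper names normalizeKey and joinOnKey are hypothetical and not from the original thread. The asker reported that casting alone did not help in their spark-1.6.0 setup, so treat this as a way to rule out type and whitespace mismatches rather than a confirmed fix.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, trim}

// Hypothetical helper: replaces the key column with a trimmed copy cast to long,
// so that " 104752237 " (string) and 104752237 (long) compare as equal.
def normalizeKey(df: DataFrame, key: String): DataFrame =
  df.withColumn(key, trim(col(key).cast("string")).cast("long"))

// Hypothetical helper: inner join on the normalized key, using an explicit
// equality expression as in the suggestion above. Both key columns are kept.
def joinOnKey(left: DataFrame, right: DataFrame, key: String): DataFrame = {
  val l = normalizeKey(left, key)
  val r = normalizeKey(right, key)
  l.join(r, l(key) === r(key))
}
```

With the names from the question, this would be called as `joinOnKey(df_train_raw, df_user_clicks_info, "subscriberid")`. Values that cannot be parsed as a number become null after the cast and therefore never match in the join.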