
R: error when filtering after a join (sparklyr)

Tags: r, apache-spark, sparklyr

The code I am using looks like the one below (just a simple join).
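(Reconstructed sketch, not the original code: judging from the query plan in the error further down, the join was presumably something like the following. The table names dez and deg, the join keys, and the 2500 threshold are taken from that plan; the connection setup is assumed.)

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")   # assumed; the actual connection is not shown in the question

# Hive tables named in the query plan below
dez <- tbl(sc, "dez")
deg <- tbl(sc, "deg")

dez %>%
  inner_join(deg, by = c("elemuid", "timefrom" = "timefromdeg")) %>%  # default suffixes ".x"/".y"
  filter(number.x > 2500)   # the step that triggers the AnalysisException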

The content of the individual data frames is not important here. The join itself works. To save computing power, I would like to filter (or do whatever else) directly after the join.

But now I get an error message saying that Spark cannot resolve the variable number.x.

I don't understand this, because the variable is listed right there in the error message:

Error: org.apache.spark.sql.AnalysisException: cannot resolve '`number.x`' given input columns: [elemname.x, kind.y, timefrom, timetodeg, timeto, kind.x, elemuid, elemname.y, number.y, number.x]; line 7 pos 7;
'Project [*]
+- 'Filter ('number.x > 2500.0)
   +- SubqueryAlias yoxgbdyqlw
      +- Project [elemuid#7505 AS elemuid#7495, elemname#7506 AS elemname.x#7496, kind#7507 AS kind.x#7497, number#7508 AS number.x#7498, timefrom#7509 AS timefrom#7499, timeto#7510 AS timeto#7500, elemname#7512 AS elemname.y#7501, kind#7513 AS kind.y#7502, number#7514 AS number.y#7503, timetodeg#7516 AS timetodeg#7504]
         +- Join Inner, ((timefrom#7509 = timefromdeg#7515) && (elemuid#7505 = elemuid#7511))
            :- SubqueryAlias TBL_LEFT
            :  +- SubqueryAlias dez
            :     +- HiveTableRelation `default`.`dez`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [elemuid#7505, elemname#7506, kind#7507, number#7508, timefrom#7509, timeto#7510]
            +- SubqueryAlias TBL_RIGHT
               +- SubqueryAlias deg
                  +- HiveTableRelation `default`.`deg`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [elemuid#7511, elemname#7512, kind#7513, number#7514, timefromdeg#7515, timetodeg#7516]
A collect() after the join is not an option, because then I run out of memory. Is there any possibility to make this work?


I would be glad for any help.

TL;DR: Don't use the default join suffixes (c(".x", ".y")); pass explicit, dot-free ones such as suffix = c("_x", "_y") instead. A reproducible example follows at the end of this answer.

The problem is not really specific to joins: Spark SQL does not handle dots in column names well, so you should avoid . in names in general. If for some reason you do end up with such a column, you can always drop down to the native Spark API and rename it, although as far as I know this rename cannot be expressed with the dplyr API alone (maybe some rlang trick could, but I cannot think of one right now):

# A column whose name contains a dot
df3 <- copy_to(sc, tibble(value.x = rnorm(42)))

df3 %>% 
  spark_dataframe() %>%                                    # drop to the underlying Spark DataFrame (jobj)
  invoke("withColumnRenamed", "`value.x`", "value_x") %>%  # rename through the native API
  sdf_register()                                           # register the result as a sparklyr tbl again

# # Source:   table<sparklyr_tmp_61acdbbc592> [?? x 1]
# # Database: spark_connection
#    value_x
#      <dbl>
#  1 -0.0162
#  2  0.944 
#  3  0.821 
#  4  0.594 
#  5  0.919 
#  6  0.782 
#  7  0.0746
#  8 -1.99  
#  9  0.620 
# 10 -0.0561
# # ... with more rows
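And here is the reproducible example for the TL;DR: with explicit suffixes, filtering directly after the join just works: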
set.seed(1)

df1 <- copy_to(sc, tibble(id = 1:3, value = rnorm(3)))
df2 <- copy_to(sc, tibble(id = 1:3, value = rnorm(3)))

df1 %>% 
  inner_join(df2, by = c("id"), suffix = c("_x", "_y")) %>%  # explicit, dot-free suffixes
  filter(value_y > -0.836)                                    # filtering on the suffixed column resolves fine

# # Source:   lazy query [?? x 3]
# # Database: spark_connection
#      id value_x value_y
#   <dbl>   <dbl>   <dbl>
# 1    1.  -0.626   1.60 
# 2    2.   0.184   0.330
# 3    3.  -0.836  -0.820
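Applied to the tables from the question, the same idea would look roughly like this (table and column names are taken from the query plan above; otherwise this is a hypothetical sketch):

dez %>%
  inner_join(deg,
             by = c("elemuid", "timefrom" = "timefromdeg"),   # join keys from the query plan
             suffix = c("_x", "_y")) %>%                      # avoid the default ".x"/".y"
  filter(number_x > 2500)                                     # resolves without any backtick workarounds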