Grouped correlation between two variables using dbplyr and corrr


I have a connection to Impala:

con <- DBI::dbConnect(odbc::odbc(), "impala connector", schema = "some_schema")        
library(dplyr)
library(dbplyr) #I have to load both of them, if not tbl won't work
table <- tbl(con, 'serverTable')
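Running it like this, I get the error:

Error in stats::cor(x = x, y = y, use = use, method = method) : supply both 'x' and 'y' or a matrix-like 'x'

If I instead try to use cor from stats, as cor(VAR, num_date), I get the error:

Error in new_result(connection@ptr, statement, immediate) : nanodbc/nanodbc.cpp:1412: HY000: [Cloudera][ImpalaODBC] (370) Query analysis error occurred during query execution: [HY000] : AnalysisException: some_schema.cor() unknown

It looks as if dbplyr cannot translate cor into SQL: if I run show_query instead of collect, I can see the call passed through to the query untranslated.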

Edit: I solved the problem using SQL:

SELECT id, cor
FROM(
SELECT id,
((tot_sum - (VAR_sum * date_sum / _count)) / sqrt((VAR_sq - pow(VAR_sum, 2.0) / _count) * (date_sq - pow(date_sum, 2.0) / _count))) AS cor
FROM (
SELECT id,
    sum(VAR) AS VAR_sum,
    sum(CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS date_sum,
    sum(VAR * VAR) AS VAR_sq,
    sum(CAST(CAST(date AS TIMESTAMP) AS DOUBLE) * CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS date_sq,
    sum(VAR * CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS tot_sum,
    count(*) as _count
FROM (
SELECT id, VAR, date
FROM (
SELECT id, VAR, date
FROM schema
WHERE VAR IS NOT NULL) AS a
WHERE VAR < -10 OR VAR > -32) AS b
GROUP BY id) AS c) AS d
WHERE ABS(cor) > 0.9 AND ABS(cor) <= 1
Thanks to this post:
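The SQL above computes Pearson's correlation per group from plain sums. As a rough sketch only, the same formula can be written with dplyr verbs that dbplyr generally knows how to translate (sum, n, sqrt). The toy data frame and the short column names (sx, sxy, and so on) are assumptions made for illustration, and the result is checked locally against stats::cor rather than against Impala.

```r
library(dplyr)

# Hypothetical toy data standing in for the server table
set.seed(1)
toy <- tibble(
  id = rep(c("a", "b"), each = 20),
  VAR = rnorm(40),
  num_date = rep(1:20, times = 2) + rnorm(40)
)

# Pearson correlation per id, built only from sums and a row count
grouped_cor <- toy %>%
  filter(!is.na(VAR), !is.na(num_date)) %>%
  group_by(id) %>%
  summarise(
    cnt = n(),
    sx  = sum(VAR),
    sy  = sum(num_date),
    sxx = sum(VAR * VAR),
    syy = sum(num_date * num_date),
    sxy = sum(VAR * num_date)
  ) %>%
  mutate(
    cor = (sxy - sx * sy / cnt) /
      sqrt((sxx - sx^2 / cnt) * (syy - sy^2 / cnt))
  ) %>%
  select(id, cor)

# Reference values from stats::cor on the same local data
reference <- toy %>%
  group_by(id) %>%
  summarise(cor = cor(VAR, num_date))
```

On a local data frame the two results should agree up to floating-point error. Whether the same pipeline translates cleanly when applied to the remote tbl depends on the backend, so it is worth inspecting it with show_query() before calling collect().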

cor is not in the list of functions that dplyr can translate; see here:

You can try the following in your code:

mutate(corr = translate_sql(corr(VAR, num_date)))

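This should be translated directly to CORR(VAR, num_date). These translations do not work for every database type. If it cannot be made to work in your case, you may have no choice but to collect the data before running the function that cannot be translated.

Thanks for your answer and the link. Unfortunately, it does not work either: I still get the same error, some_schema.corr() unknown.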
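A minimal sketch of the collect-first fallback follows. It uses dbplyr::memdb_frame() (which needs the RSQLite package) as a hypothetical stand-in for the Impala table, and the data values are made up. Only the needed columns are pulled into R, and cor then runs locally per group.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for the remote table: an in-memory SQLite table
remote <- memdb_frame(
  id = rep(c("a", "b"), each = 5),
  VAR = c(1, 2, 3, 4, 5, 5, 3, 4, 1, 2),
  num_date = c(10, 20, 30, 40, 50, 10, 20, 30, 40, 50)
)

local_cor <- remote %>%
  select(id, VAR, num_date) %>%  # transfer only the columns that are needed
  collect() %>%                  # from here on, everything runs in R
  group_by(id) %>%
  summarise(cor = cor(VAR, num_date))
```

The obvious cost is that every selected row is transferred to the R session, so this is only reasonable when the filtered table fits comfortably in memory.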