Grouped correlation between two variables using dbplyr and corrr
I have a connection to Impala:
con <- DBI::dbConnect(odbc::odbc(), "impala connector", schema = "some_schema")
library(dplyr)
library(dbplyr) #I have to load both of them, if not tbl won't work
table <- tbl(con, 'serverTable')
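Like this, I get the error
Error in stats::cor(x = x, y = y, use = use, method = method): supply both 'x' and 'y' or a matrix-like 'x'
If I instead try cor from stats, cor(VAR, num_date), I get the error
Error in new_result(connection@ptr, statement, immediate): nanodbc/nanodbc.cpp:1412: HY000: [Cloudera][ImpalaODBC] (370) Query analysis error occurred during query execution: [HY000] : AnalysisException: some_schema.cor() unknown
It looks as though dbplyr cannot translate cor to SQL; if I run show_query instead of collect, I can see that.
Edit:
I solved the problem using SQL: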
SELECT id, cor
FROM (
  SELECT id,
         -- Pearson correlation built from the per-id sums computed below
         ((tot_sum - (VAR_sum * date_sum / _count)) /
          sqrt((VAR_sq - pow(VAR_sum, 2.0) / _count) *
               (date_sq - pow(date_sum, 2.0) / _count))) AS cor
  FROM (
    SELECT id,
           sum(VAR) AS VAR_sum,
           sum(CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS date_sum,
           sum(VAR * VAR) AS VAR_sq,
           sum(CAST(CAST(date AS TIMESTAMP) AS DOUBLE) * CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS date_sq,
           sum(VAR * CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS tot_sum,
           count(*) AS _count
    FROM (
      SELECT id, VAR, date
      FROM (
        SELECT id, VAR, date
        FROM schema
        WHERE VAR IS NOT NULL) AS a
      -- as written, this condition holds for every non-NULL VAR;
      -- use AND instead of OR to keep only -32 < VAR < -10
      WHERE VAR < -10 OR VAR > -32) AS b
    GROUP BY id) AS c) AS d
WHERE ABS(cor) > 0.9 AND ABS(cor) <= 1
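The query above computes per-id sums and combines them into Pearson's r, that is (sum(xy) - sum(x) * sum(y) / n) divided by sqrt((sum(x^2) - sum(x)^2 / n) * (sum(y^2) - sum(y)^2 / n)). As a rough sketch, the same idea can be written with dplyr verbs that rely only on sum(), n() and sqrt(), which dbplyr can generally translate to SQL. The helper name grouped_pearson and the small demo data frame are made up for illustration; the column names id, VAR and num_date come from the question, and the generated SQL has not been checked against Impala here.

```r
library(dplyr)

# Hypothetical helper: grouped Pearson correlation from plain sums,
# avoiding cor(), which dbplyr cannot translate for this backend.
grouped_pearson <- function(data) {
  data %>%
    filter(!is.na(VAR), !is.na(num_date)) %>%
    group_by(id) %>%
    summarise(
      n_obs  = n(),
      sum_x  = sum(VAR, na.rm = TRUE),
      sum_y  = sum(num_date, na.rm = TRUE),
      sum_xx = sum(VAR * VAR, na.rm = TRUE),
      sum_yy = sum(num_date * num_date, na.rm = TRUE),
      sum_xy = sum(VAR * num_date, na.rm = TRUE)
    ) %>%
    mutate(
      cor = (sum_xy - sum_x * sum_y / n_obs) /
        sqrt((sum_xx - sum_x * sum_x / n_obs) *
             (sum_yy - sum_y * sum_y / n_obs))
    ) %>%
    select(id, cor)
}

# Made-up local data standing in for the remote table
demo <- data.frame(
  id       = c(1, 1, 1, 2, 2, 2),
  VAR      = c(1, 2, 3, 3, 2, 1),
  num_date = c(10, 20, 30, 10, 20, 30)
)

demo_cor <- grouped_pearson(demo)
print(demo_cor)
```

On the remote table, grouped_pearson(table) %>% show_query() would show the SQL that dbplyr generates before anything is collected.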
Thanks to this post:
cor is not in the list of functions that dplyr can translate; see here:
You can try the following in your code:
mutate(corr = translate_sql(corr(VAR, num_date)))
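This should translate directly to CORR(VAR, num_date). These translations do not work for every database type. If it does not work in your case, you may have no choice but to collect the data before running the function that cannot be translated.
Thanks for your answer and the link. Unfortunately that does not work either; I still get the same error: some_schema.corr() unknown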
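The collect-first fallback mentioned in the answer could look roughly like the sketch below. It assumes the filtered data is small enough to pull into R. A made-up local data frame (local_tbl) stands in for tbl(con, 'serverTable') so that the snippet runs without a database; on a data frame collect() simply returns its input, while on the remote table it would download the rows.

```r
library(dplyr)

# Made-up stand-in for tbl(con, 'serverTable')
local_tbl <- data.frame(
  id       = c(1, 1, 1, 2, 2, 2),
  VAR      = c(1, 2, 3, 3, 2, 1),
  num_date = c(10, 20, 30, 10, 20, 30)
)

grouped_cor <- local_tbl %>%
  # Do the filtering before collect() so it would run on the server
  filter(!is.na(VAR), !is.na(num_date)) %>%
  select(id, VAR, num_date) %>%
  collect() %>%
  # From here on everything runs in R, so stats::cor() is available
  group_by(id) %>%
  summarise(cor = cor(VAR, num_date))

print(grouped_cor)
```

Keeping filter() and select() ahead of collect() limits how much data has to be transferred.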