SQL Server RevoScaleR: rxPredict, the number of parameters does not match the number of variables

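I set up an R Server for myself following Microsoft's instructions, and their example works very well.

The example (New York taxi data) uses non-categorical variables (distance, taxi fare, etc.) to predict a categorical variable (1 or 0 for whether or not a tip was paid).

I am trying to predict a similar binary output using categorical variables as input, with linear regression (the rxLinMod function), and I am getting an error.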

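The error says that the number of parameters does not match the number of variables, but it looks to me as if the number of variables is actually the number of levels within each factor (variable).

To replicate

Create a table called example in SQL Server: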

USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
CREATE TABLE [dbo].[example](
    [Person] [nvarchar](max) NULL,
    [City] [nvarchar](max) NULL,
    [Bin] [int] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];
Put some data in it:

insert into [dbo].[example] values ('John','London',0);
insert into [dbo].[example] values ('Paul','New York',0);
insert into [dbo].[example] values ('George','Liverpool',1);
insert into [dbo].[example] values ('Ringo','Paris',1);
insert into [dbo].[example] values ('John','Sydney',1);
insert into [dbo].[example] values ('Paul','Mexico City',1);
insert into [dbo].[example] values ('George','London',1);
insert into [dbo].[example] values ('Ringo','New York',1);
insert into [dbo].[example] values ('John','Liverpool',1);
insert into [dbo].[example] values ('Paul','Paris',0);
insert into [dbo].[example] values ('George','Sydney',0);
insert into [dbo].[example] values ('Ringo','Mexico City',0);
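I also use a SQL function that returns the variables in table format, as that is what the Microsoft example seems to require. Create the function formatAsTable: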

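USE [my_database];
GO
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
-- CREATE FUNCTION must be the first statement in its batch, so close this batch with GO (SSMS/sqlcmd).
GO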
CREATE FUNCTION [dbo].[formatAsTable] (
@City nvarchar(max)='',
@Person nvarchar(max)='')
RETURNS TABLE
AS
  RETURN
  (
  -- Add the SELECT statement with parameter references here
  SELECT
    @City AS City,
    @Person AS Person
  );
We now have a table with two categorical variables, Person and City.

Let's start predicting. In R, run the following:

library(RevoScaleR)
# Set up the database connection
connStr <- "Driver=SQL Server;Server=<servername>;Database=<dbname>;Uid=<uid>;Pwd=<password>"
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir, 
                    wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(cc)
# Set the SQL which gets our data base
sampleDataQuery <- "SELECT * from [dbo].[example] "
# Set up the data source
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery, connectionString = connStr, 
                                colClasses = c(City = "factor",Bin="logical",Person="factor"
                                ),
                                rowsPerRead=500)    
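Next, fit a linear model of Bin on City and Person with rxLinMod, stored as isWonObj, and summarise it. Note that the summary looks like this: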

...
Total independent variables: 11 (Including number dropped: 3)
...

Coefficients:
                           Bin
(Intercept)       6.666667e-01
City=London      -1.666667e-01
City=New York     4.450074e-16
City=Liverpool    3.333333e-01
City=Paris        4.720871e-16
City=Sydney      -1.666667e-01
City=Mexico City       Dropped
Person=John      -1.489756e-16
Person=Paul      -3.333333e-01
Person=George          Dropped
Person=Ringo           Dropped
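It says there are 11 variables, which is fine, as this is the sum of the levels in the factors (6 cities and 4 people) plus the intercept.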
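The count of 11 and the 3 dropped terms can be reproduced in base R without RevoScaleR. This is only a sketch under assumptions: the data frame `example` and the matrix `X` are illustrative names, the rows are copied from the INSERT statements above, and the rank calculation is ordinary linear algebra rather than a description of what rxLinMod does internally.

```r
# Rebuild the example table from the question as a plain data frame.
example <- data.frame(
  Person = rep(c("John", "Paul", "George", "Ringo"), times = 3),
  City   = rep(c("London", "New York", "Liverpool", "Paris",
                 "Sydney", "Mexico City"), times = 2),
  Bin    = c(0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0),
  stringsAsFactors = TRUE
)

# One indicator (dummy) column per factor level, plus an intercept column.
X <- cbind(
  Intercept = 1,
  model.matrix(~ City - 1, data = example),
  model.matrix(~ Person - 1, data = example)
)

n_columns <- ncol(X)             # intercept + 6 cities + 4 people
n_rank    <- qr(X)$rank          # number of linearly independent columns
n_dropped <- n_columns - n_rank  # redundant columns a fit cannot estimate

cat("columns:", n_columns, "rank:", n_rank, "dropped:", n_dropped, "\n")
```

In this data John and George visit exactly the same three cities, and Paul and Ringo the other three, which adds one more linear dependency on top of the usual one per factor. That is consistent with the summary reporting 11 variables with 3 dropped.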

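Now, when I try to predict the Bin value based on City and Person, I get an error.

First, I format the City and Person I want to predict for as a table. Then I use this as the input to the prediction.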

sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
                      , colClasses = c(City = "factor",Person="factor"))
Now when I try to predict, I get an error:

scoredOutput <- RxSqlServerData(
  connectionString = connStr,
  table = "binaryOutput"
)

rxPredict(modelObject = isWonObj, data = pred, outData = scoredOutput, 
          predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE,checkFactorLevels = FALSE)
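I can see where the 11 comes from, but I only supplied 2 values to the predict query, so I can't see where the 3 comes from, or why there is a problem.

Any assistance is appreciated.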

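The answer appears to be consistent with how R treats factor variables, although the error message could draw a clearer distinction between factors, levels, variables and parameters.

It appears that the inputs used to generate a prediction can't simply be characters, or factors with no levels. They need to have the same levels as the factors of the same variables used when the model was parameterised.

As such, the following lines, which declare City and Person as factors without specifying their levels, are the source of the problem: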

sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
                      , colClasses = c(City = "factor",Person="factor"))
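The difference between a factor with and without its full set of levels can be seen in base R. The sketch below is illustrative only: `city_levels`, `person_levels` and `column_information` are hypothetical names, the level order is taken from the coefficient table in the question, and the idea of passing the list through the `colInfo` argument of RxSqlServerData is an assumption that is not executed here.

```r
# Levels in the order the coefficients are listed in the model summary above.
city_levels   <- c("London", "New York", "Liverpool", "Paris",
                   "Sydney", "Mexico City")
person_levels <- c("John", "Paul", "George", "Ringo")

# A factor built only from the single value to score knows one level ...
london_only <- factor("London")
nlevels(london_only)

# ... whereas passing the training levels explicitly keeps all six.
london_full <- factor("London", levels = city_levels)
nlevels(london_full)

# Hypothetical column description carrying the same information.
# Assumption: it would be supplied as colInfo = column_information to
# RxSqlServerData() in place of colClasses; that call is not run here.
column_information <- list(
  City   = list(type = "factor", levels = city_levels),
  Person = list(type = "factor", levels = person_levels)
)
str(column_information)
```

With only one level, the scoring data describes far fewer model columns than the fitted model expects, which fits the kind of mismatch the error message reports.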

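Are you sure that specifying colInfo solves the problem? There seems to be a general issue in rxPredict itself, rather than one that only appears in combination with SQL Server: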

# lm() and predict() don't have a problem with missing factor levels ("two" in this case):
fac <- c("one", "two", "three")
val = c(1, 2, 3)
trainingData <- data.frame(fac, val, stringsAsFactors = TRUE)
lmModel <- lm(val ~ fac, data = trainingData)
print(summary(lmModel))
predictionData <- data.frame(fac = c("one", "three", "one", "one"), stringsAsFactors = TRUE)
lmPred <- predict(lmModel, newdata = predictionData)
lmPred
# The result is OK:
# 1 2 3 4
# 1 3 1 1

# rxLinMod() and rxPredict() behave differently:
rxModel <- rxLinMod(val ~ fac, data = trainingData)
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE)
# The following error is thrown:
# "INTERNAL ERROR: In rxPredict, the number of parameters does not match
# the number of  variables: 3 vs. 4."
# checkFactorLevels = FALSE doesn't help here, it actually seems to just
# check the order of factor levels.
levels(predictionData$fac) <- c("two", "three", "one")
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE)
# The following error is thrown (twice):
# ERROR:order of factor levels in the data are inconsistent with
# the order of the model coefficients:fac = two versus fac = one. Set
# checkFactorLevels = FALSE to ignore.
rxPred <- rxPredict(rxModel, data = predictionData, checkFactorLevels = FALSE, writeModelVars = TRUE)
rxPred
#   val_Pred    fac
#1  1           two
#2  3           three
#3  1           two
#4  1           two
# This looks suspicious at best. While the prediction values are still
# correct if you look only at the order of the records in trainingData,
# the model variables are messed up.
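One detail in the code above can be checked in base R alone: assigning to `levels()` renames the existing levels rather than re-coding the data, which would explain the "two, three, two, two" in the fac column. The sketch below contrasts that with rebuilding the factor against the training levels. The variable names are illustrative, and whether rxPredict then accepts the re-coded data is not verified here.

```r
training_fac <- factor(c("one", "two", "three"))
levels(training_fac)   # alphabetical by default

prediction_fac <- factor(c("one", "three", "one", "one"))

# Assigning to levels() renames the existing levels position by position;
# the underlying integer codes stay the same, so the values change meaning.
relabelled <- prediction_fac
levels(relabelled) <- c("two", "three", "one")
as.character(relabelled)

# Rebuilding the factor from its character values keeps every value and
# adopts the training levels, including the unused level "two".
recoded <- factor(as.character(prediction_fac), levels = levels(training_fac))
as.character(recoded)
levels(recoded)
```

The second form keeps the original values intact while giving the prediction data the same level set as the training data.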
Just setting the factor levels (… levels(predictionData$fac)

I edited my original answer to incorporate your question. If I understood it correctly, it should help.

Good point. Problems like this are why I have since tried to avoid RevoScaleR whenever possible!