R: How to calculate the accuracy of random forest and H2O gradient boosting models


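Using the code below, I am able to predict sales and write the submission file. However, I would like to calculate the accuracy of the whole model and see how it performs on the test set (test1). Here is the complete RStudio code.

There are two data sets: 1) train and 2) test. The task is to predict sales.

All the data sets can be found here: https://www.kaggle.com/c/rossmann-store-sales/data

The RF is trained on the train data with the selected features and then predicts sales for the test data.

Is there any way to calculate the prediction accuracy of the RF model, so that I can compare it with the accuracy of the H2O gradient boosting model? I seem to be lost.

Thanks a bunch.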



```{r,echo=TRUE,message=F, warning=F}
library(readr)
library(randomForest)
set.seed(415)
```

Reading the CSV files to be analyzed 

```{r,echo=TRUE,message=F, warning=F}
train <- read_csv("train.csv")
test  <- read_csv("test.csv")
store <- read_csv("store.csv")

## Merge each file with the store data: the store-level features live in a separate file and must be joined in to see their full effect on sales.
train1 <- merge(train,store) 
test1 <- merge(test,store)
```
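One thing worth checking at this step: `merge()` sorts its result on the join column by default (`sort = TRUE`), so the merged frame is generally not in the same row order as the input. A minimal sketch with made-up toy frames (`test_toy` and `store_toy` are hypothetical, not the Kaggle files):

```r
# Toy stand-ins for test.csv and store.csv (hypothetical values)
test_toy  <- data.frame(Id = 1:4, Store = c(3, 1, 2, 1))
store_toy <- data.frame(Store = c(1, 2, 3), StoreType = c("a", "b", "c"))

# Joined on the shared column "Store"; the result is sorted by Store
merged <- merge(test_toy, store_toy)
print(merged)

# The Id column no longer runs 1, 2, 3, 4
print(identical(merged$Id, test_toy$Id))
```

Because of this, anything that has to line up row by row with predictions made from the merged frame (an `Id` column, for instance) is safer taken from the merged frame itself.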


Convert all the NA values in the train data to zeros. In the test data, store 622 has 11 missing values in the "Open" column, so I have decided to impute "1" (open) for those rows; otherwise the predictions for that store will not be correct.

```{r}
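## Replace every NA in the training data with 0
train1[is.na(train1)] <- 0

## Treat the missing Open values (store 622) in the test data as open,
## then fill the remaining NAs with 0 so both sets are encoded the same way
test1$Open[is.na(test1$Open)] <- 1
test1[is.na(test1)] <- 0

## Keep only the rows where the store was open
train1 <- train1[which(train1$Open == 1), ]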
```

Both train1 and test1 have a "Date" column. We will separate Date into month, year and day (plus day of year and week). These derived variables should be easier for the model to use when predicting sales.

```{r}

train1$Date <- as.Date(train1$Date)
test1$Date <- as.Date(test1$Date)

train1$month <- as.integer(format(train1$Date, "%m"))
train1$year <- as.integer(format(train1$Date, "%y"))
train1$day <- as.integer(format(train1$Date, "%d"))
train1$DayOfYear <- as.integer(as.POSIXlt(train1$Date)$yday)
train1$week <- as.integer( format(train1$Date+3, "%U"))


test1$month <- as.integer(format(test1$Date, "%m"))
test1$year <- as.integer(format(test1$Date, "%y"))
test1$day <- as.integer(format(test1$Date, "%d"))
test1$DayOfYear <-  as.integer(as.POSIXlt(test1$Date)$yday)
test1$week <- as.integer( format(test1$Date+3, "%U"))
```
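To make the derived columns concrete, here is a small sketch that applies the same format codes to a single, arbitrarily chosen date. Note that `%y` returns the two-digit year and that `$yday` counts from 0.

```r
# One example date (chosen arbitrarily for illustration)
d <- as.Date("2015-07-31")

features <- c(
  month     = as.integer(format(d, "%m")),      # 7
  year      = as.integer(format(d, "%y")),      # 15 (two-digit year)
  day       = as.integer(format(d, "%d")),      # 31
  DayOfYear = as.integer(as.POSIXlt(d)$yday),   # 211 (0-based day of year)
  # week of the year (weeks start on Sunday) for the date shifted by 3 days
  week      = as.integer(format(d + 3, "%U"))
)
print(features)
```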


```{r}

names(train1)
summary(train1)

names(test1)
summary(test1)

```

Features relevant to our analysis; the Sales column is left out because it is what we are going to predict.

```{r}
variable.names <- names(train1)[c(1,2,6,7,8:12,14:23)]

for (f in variable.names) {
  if (is.character(train1[[f]])) {
    levels <- unique(c(train1[[f]], test1[[f]]))
    train1[[f]] <- as.integer(factor(train1[[f]], levels=levels))
    test1[[f]]  <- as.integer(factor(test1[[f]],  levels=levels))
  }
}
```
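The reason for building the levels from train and test together is that the integer codes then mean the same thing in both sets. A toy sketch (the vectors `train_col` and `test_col` are made up for illustration):

```r
# Hypothetical character columns from a train and a test set
train_col <- c("a", "b", "c")
test_col  <- c("c", "b")

# Shared levels: the same string gets the same code in both sets
shared_levels <- unique(c(train_col, test_col))
train_code <- as.integer(factor(train_col, levels = shared_levels))
test_code  <- as.integer(factor(test_col,  levels = shared_levels))
print(train_code)  # 1 2 3
print(test_code)   # 3 2

# Encoding the test column on its own: "c" becomes 2 here, not 3
test_alone <- as.integer(factor(test_col))
print(test_alone)  # 2 1
```

With separate encodings the model would see "c" as 3 during training and as 2 at prediction time, which is why the loop above passes the same `levels` to both calls.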



Random forest
```{r}
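## max_depth is not an argument of randomForest() (tree size is controlled
## by nodesize / maxnodes), so it is dropped here.
## importance=TRUE is needed for the permutation importance asked for by type = 1.
result <- randomForest(train1[,variable.names], 
                    log(train1$Sales+1),
                    mtry=5,
                    ntree=50,
                    sampsize=150000,
                    importance=TRUE,
                    do.trace=TRUE)

importance(result, type = 1)   # permutation importance (%IncMSE)
importance(result, type = 2)   # total decrease in node impurity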
varImpPlot(result)                 
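## Back-transform from the log scale used for training
pred <- exp(predict(result, test1)) -1
## merge() re-sorts the rows by Store, so Id must come from test1 to line up with pred
submission <- data.frame(Id=test1$Id, Sales=pred)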
write_csv(submission, "resultfile.csv")

```
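On the accuracy question itself: test.csv has no Sales column, so the error cannot be computed on test1 directly. One common approach is to hold back part of train1, fit on the rest, and score the predictions on the held-back rows. Below is a minimal sketch of two scoring functions in base R. The names `rmse` and `rmspe` and the toy numbers are my own, not part of any package. As far as I understand, the Rossmann competition is scored with RMSPE and ignores rows with zero sales, which is what `rmspe` imitates here.

```r
# Root mean squared error, in the units of `actual`
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Root mean squared percentage error; rows with zero actual sales are
# skipped because the relative error is undefined there
rmspe <- function(actual, predicted) {
  keep <- actual != 0
  sqrt(mean(((actual[keep] - predicted[keep]) / actual[keep])^2))
}

# Toy values standing in for held-out Sales and model predictions
actual    <- c(100, 200, 0, 400)
predicted <- c(110, 180, 5, 400)

toy_rmse  <- rmse(actual, predicted)
toy_rmspe <- rmspe(actual, predicted)
print(c(rmse = toy_rmse, rmspe = toy_rmspe))
```

To use this with the random forest, one option is to set aside some rows of train1 (for example the most recent weeks by Date) before fitting, predict those rows, back-transform with `exp(...) - 1`, and pass the held-out Sales and the predictions to `rmspe()`. Scoring the H2O gradient boosting model on the same held-out rows with the same function should give directly comparable numbers. The fitted forest also carries an out-of-bag estimate in `result$mse`, but that is on the log scale used for training.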
