Create a character vector column of predefined text and bind it to an existing data frame using rbind or bind_rows
Hello,

I am going to pose two [probably] very trivial questions for your excellent review.

Question #1

I have a relatively tidy df (dat) with dimensions 10299 x 563. The two datasets that [make up] dat share the same 563 variables: 'subject' (numeric), 'label' (numeric), and 3:563 (variable names taken from a text file). Observations 1:2947 come from the 'test' dataset and observations 2948:10299 come from the 'train' dataset.

I want to insert a column into dat (header = 'type') that holds the string "test" for rows 1:2947 and the string "train" for rows 2948:10299, so that I can later group by dataset or use other similar aggregation functions in dplyr/tidyr.

I created a test df (testdf = 1:10299; dim(testdf) = 10299 x 1) and then:
testdat[1:2947 , "type"] <- c("test")
testdat[2948:10299, "type"] <- c("train")
> head(ds, 2);tail(ds, 2)
X1.10299 type
1 1 test
2 2 test
X1.10299 type
10298 10298 train
10299 10299 train
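For reference, the same column can also be built in a single step with `rep()`. This is only a minimal sketch on a stand-in data frame: `ds` and its `id` column are placeholders, and the row counts (2947 test rows out of 10299) are the ones quoted in the question.

```r
# Stand-in for the real data: one column, 10299 rows (sizes from the question)
n_test  <- 2947
n_total <- 10299
ds <- data.frame(id = seq_len(n_total))

# Repeat each label the required number of times and attach it as a column
ds$type <- rep(c("test", "train"), times = c(n_test, n_total - n_test))

# Count the rows per label as a quick sanity check
table(ds$type)
```

If the two pieces are still separate data frames, passing them to `dplyr::bind_rows()` as a named list together with `.id = "type"` should produce the same labelling without counting rows by hand.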
train df
train[1:10,1:5]
Actual code (ignore the function calls; I am doing most of my testing through the console)
The dataset I am using with this code
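The problem you are facing stems from the fact that there are duplicate names in the variable list used to create the data frame objects. If you make sure the column names are unique and shared between the objects, the code will run. Based on the code you used above, I have put together a complete working example (fixes and various edits are noted in the comments):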
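Hi Zach, this is really strange, because nothing you are doing looks like it should produce what you are seeing. Can you reproduce the behavior with a subset of the data that you can post here (a smaller number of rows/columns via dput()) or read in?

@forrest.Stevens Updated the question with a [hopefully] reproducible dataset. Very interesting! Do you think the reason I get a column delta between rbind and bind_rows is that bind_rows is dropping columns with non-alphanumeric characters?

I don't know, because I haven't seen an actual excerpt of your code and how all the data is read in. But judging from the check you did above with identical(), that doesn't seem to be the case? As I said, I am puzzled by this result and can't see an obvious reason for it, especially when all the data types are the same, etc. (bind_rows() will barf if the column data types are mixed).

Updated the original post to include the actual code I wrote while testing through the console.

OK, I still can't explain what you are observing, and without the original data files I may be stuck. I edited the code above to include your original column names, thinking this might be some oddity in dplyr's name checking, but I still don't understand why bind_rows() would proceed without error yet end up dropping columns. Usually, if the column names don't match, the columns are kept and filled with NAs.

Hi @Forrest, I added a link to the dataset I am using just before the code block. This is unusual behavior.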
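The NA-filling behavior mentioned in the comments can be checked on toy data. The sketch below uses two hypothetical two-row data frames (`a` and `b` are not part of the original data) whose second column names differ by one letter, and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical frames: the second column is "labels" in one and "label" in the other
a <- data.frame(subject = 1:2, labels = c(5, 5))
b <- data.frame(subject = 3:4, label  = c(4, 4))

# bind_rows() matches columns by name; unmatched columns are kept and padded with NA
out <- bind_rows(a, b)
names(out)
colSums(is.na(out))

# Base rbind() refuses data frames whose column names differ
rbind_result <- try(rbind(a, b), silent = TRUE)
inherits(rbind_result, "try-error")
```

A difference in column counts between `rbind()` and `bind_rows()` is therefore worth checking against the column names first, for example with `setdiff(names(a), names(b))`.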
test.names <- names(test)
train.names <- names(train)
identical(test.names, train.names)
> TRUE
dat <- bind_rows(test, train)
subject labels tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1 2 5 0.2571778 -0.02328523 -0.01465376
2 2 5 0.2860267 -0.01316336 -0.11908252
3 2 5 0.2754848 -0.02605042 -0.11815167
4 2 5 0.2702982 -0.03261387 -0.11752018
5 2 5 0.2748330 -0.02784779 -0.12952716
6 2 5 0.2792199 -0.01862040 -0.11390197
7 2 5 0.2797459 -0.01827103 -0.10399988
8 2 5 0.2746005 -0.02503513 -0.11683085
9 2 5 0.2725287 -0.02095401 -0.11447249
10 2 5 0.2757457 -0.01037199 -0.09977589
subject label tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1 1 5 0.2885845 -0.020294171 -0.1329051
2 1 5 0.2784188 -0.016410568 -0.1235202
3 1 5 0.2796531 -0.019467156 -0.1134617
4 1 5 0.2791739 -0.026200646 -0.1232826
5 1 5 0.2766288 -0.016569655 -0.1153619
6 1 5 0.2771988 -0.010097850 -0.1051373
7 1 5 0.2794539 -0.019640776 -0.1100221
8 1 5 0.2774325 -0.030488303 -0.1253604
9 1 5 0.2772934 -0.021750698 -0.1207508
10 1 5 0.2805857 -0.009960298 -0.1060652
run_analysis <- function () {
  #Vars available for use throughout the function that should be preserved
  vars <- read.table("features.txt", header = FALSE, sep = "")
  lookup_table <- data.frame(activitynum = c(1,2,3,4,5,6),
                             activity_label = c("walking", "walking_up",
                                                "walking_down", "sitting",
                                                "standing", "laying"))
  test <- test_read_process(vars, lookup_table)
  train <- train_read_process(vars, lookup_table)
}
test_read_process <- function(vars, lookup_table) {
  #read in the three documents for cbinding later
  test.sub <- read.table("test/subject_test.txt", header = FALSE)
  test.labels <- read.table("test/y_test.txt", header = FALSE)
  test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")
  #cbind the cols together and set remaining colNames to var names in vars
  test.dat <- cbind(test.sub, test.labels, test.obs)
  colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))
  #Use lookup_table to set the "test_labels" string values that correspond
  #to their integer IDs
  #test.lookup <- merge(test, lookup_table, by.x = "labels",
  #                     by.y ="activitynum", all.x = T)
  #Remove temporary symbols from globalEnv/memory
  rm(test.sub, test.labels, test.obs)
  #return
  return(test.dat)
}
train_read_process <- function(vars, lookup_table) {
  #read in the three documents for cbinding
  train.sub <- read.table("train/subject_train.txt", header = FALSE)
  train.labels <- read.table("train/y_train.txt", header = FALSE)
  train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")
  #cbind the cols together and set remaining colNames to var names in vars
  train.dat <- cbind(train.sub, train.labels, train.obs)
  colnames(train.dat) <- c("subject", "label", as.character(vars[,2]))
  #Clean up temporary symbols from globalEnv/memory
  rm(train.sub, train.labels, train.obs, vars)
  return(train.dat)
}
vars <- read.table(file="features.txt", header=F, stringsAsFactors=F)
## FRS: This is the source of original problem:
duplicated(vars[,2])
vars[317:340,2]
duplicated(vars[317:340,2])
vars[396:419,2]
## FRS: I edited the following to both account for your data and variable
## issues:
test_read_process <- function() {
  #read in the three documents for cbinding later
  test.sub <- read.table("test/subject_test.txt", header = FALSE)
  test.labels <- read.table("test/y_test.txt", header = FALSE)
  test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")
  #cbind the cols together and set remaining colNames to var names in vars
  test.dat <- cbind(test.sub, test.labels, test.obs)
  #colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))
  colnames(test.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))
  return(test.dat)
}
train_read_process <- function() {
  #read in the three documents for cbinding
  train.sub <- read.table("train/subject_train.txt", header = FALSE)
  train.labels <- read.table("train/y_train.txt", header = FALSE)
  train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")
  #cbind the cols together and set remaining colNames to var names in vars
  train.dat <- cbind(train.sub, train.labels, train.obs)
  #colnames(train.dat) <- c("subject", "labels", as.character(vars[,2]))
  colnames(train.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))
  return(train.dat)
}
test_df <- test_read_process()
train_df <- train_read_process()
identical(names(test_df), names(train_df))
library("dplyr")
## FRS: These could be piped together but I've kept them separate for clarity:
train_df %>%
  mutate(test="train") ->
  train_df
test_df %>%
  mutate(test="test") ->
  test_df
test_df %>%
  bind_rows(train_df) ->
  out_df
head(out_df)
out_df
## FRS: You can set your column names to those of the original
## variable list but you still have duplicates to deal with:
names(out_df) <- c("subject", "labels", as.character(vars[,2]), "test")
duplicated(names(out_df))
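
One possible way to deal with the remaining duplicates is base R's `make.unique()`, which appends a numeric suffix to repeated names. This is a sketch only: `feature_names` is a hypothetical stand-in for `vars[,2]`, and the `"_"` separator is an arbitrary choice.

```r
# Hypothetical feature names with one repeat, standing in for vars[, 2]
feature_names <- c("energy-1,8", "energy-1,8", "mean-X")

# anyDuplicated() returns the index of the first repeat, or 0 if there is none
anyDuplicated(feature_names) > 0

# Append "_1", "_2", ... to repeated names so every column name is distinct
unique_names <- make.unique(feature_names, sep = "_")
unique_names

# The cleaned names no longer contain duplicates
anyDuplicated(unique_names) == 0
```

The cleaned vector could then be used in place of `as.character(vars[,2])` when assigning column names, so that both data frames end up with the same unique names before binding.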