Parsing a SurveyMonkey CSV file with R

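I'm trying to analyse a large survey created with SurveyMonkey. The CSV file has hundreds of columns, and the output format is hard to work with because the headers span two rows.

  • Has anyone found a simple way to manage the headers in the CSV file so that the analysis is manageable?
  • How do other people analyse their survey results?

Thanks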

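How about this: use read.csv() with header=FALSE. Make two arrays, one holding the two header rows and the other holding the survey answers. Then paste the two header rows together, column by column. Finally, assign the result with colnames().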
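The steps in this answer could be sketched roughly as follows. The file contents, the question wording and the variable names (raw_csv, top, bottom, answers) are made up for illustration; a real export will have many more columns and may need extra clean-up.

```r
# Hypothetical two-row-header export, written to a temp file for illustration.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Respondent ID,Which fruits do you like?,,Any comments?",
  ",Apples,Pears,Open-Ended Response",
  "1,Apples,,None",
  "2,,Pears,More pears please"
), path)

# Read everything as text, with no header row.
raw_csv <- read.csv(path, header = FALSE, colClasses = "character")

# Split into the two header rows and the survey answers.
top     <- unlist(raw_csv[1, ], use.names = FALSE)
bottom  <- unlist(raw_csv[2, ], use.names = FALSE)
answers <- raw_csv[-(1:2), , drop = FALSE]

# Paste the two header rows together and use the result as column names.
colnames(answers) <- trimws(paste(top, bottom))
rownames(answers) <- NULL

print(colnames(answers))
```

Note that the third column ends up named only "Pears", because its top header cell is blank; carrying the question title forward is a separate step that this sketch leaves out.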

What I ended up doing was printing out the headers from LibreOffice, labelled V1, V2, and so on. Then I just read the file in as

 m1 <- read.csv('Sheet1.csv', header=FALSE, skip=1)

You can export it from SurveyMonkey in a convenient form that suits R; see "Advanced Spreadsheet Format" under Download Responses.


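As of November 2013, the page layout seems to have changed. Choose Analyze Results > Export All > All Responses Data > Original View > XLS+ (Open in advanced statistical and analytical software). Then go to Exports and download the file. You get the raw data, with the first row holding the question headers and each following row holding one response. If you have many responses or questions, the export may be split across several files.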


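The problem with the headers is that columns for "select all that apply" questions have a blank top row, and the real column heading is in the row below. This only affects that type of question.

With that in mind, I wrote a loop that goes through all the columns and, where the column name is blank (fewer than 2 characters long), replaces it with the value from the second row.

You can then delete the second row of data and end up with a tidy data frame.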

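# Replace blank column names (fewer than 2 characters) with the value
# held in the first data row of that column.
for (i in seq_len(ncol(df))) {
  newname <- colnames(df)[i]
  if (nchar(newname) < 2) {
    colnames(df)[i] <- as.character(df[[i]][1])
  }
}

# Drop the row that held the sub-headers.
df <- df[-1, ]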
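For comparison, the same replacement can be written without an explicit loop. This is only a sketch on a made-up data frame: it assumes the blank names arrive as empty strings, which depends on how the file was read, and it keeps the same "fewer than 2 characters" test as the loop.

```r
# Toy data (hypothetical): two blank column names, real labels in row 1.
df <- data.frame(
  a = c("Response", "Yes", "No"),
  b = c("Option A", "1", NA),
  c = c("Option B", NA, "1"),
  stringsAsFactors = FALSE
)
colnames(df) <- c("Q1", "", "")

# Columns whose name is blank (fewer than 2 characters).
blank <- nchar(colnames(df)) < 2

# Take the replacement names from the first data row of those columns.
colnames(df)[blank] <- unlist(df[1, blank, drop = FALSE], use.names = FALSE)

# Drop the row that held the sub-headers.
df <- df[-1, , drop = FALSE]
rownames(df) <- NULL

print(df)
```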

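I have to deal with this fairly often, and having the headers spread over two rows is a bit painful. This function fixes that, so you only have a single header row to deal with. It also joins the headers of multi-column questions, so you get top - bottom style names.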

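#' @param x The path to a surveymonkey csv file
#' @return A character vector with one name per column
fix_names <- function(x) {
  # Read only the two header rows
  rs <- read.csv(
    x,
    nrows = 2,
    stringsAsFactors = FALSE,
    header = FALSE,
    check.names = FALSE,
    na.strings = "",
    encoding = "UTF-8"
  )

  # Blank cells and generic placeholders carry no naming information
  rs[rs == ""] <- NA
  rs[rs == "NA"] <- "Not applicable"
  rs[rs == "Response"] <- NA
  rs[rs == "Open-Ended Response"] <- NA

  nms <- rep(NA_character_, ncol(rs))
  pre <- NULL  # most recent question title, used as a prefix

  for (i in seq_len(ncol(rs))) {

    current_top <- rs[1, i]
    current_bottom <- rs[2, i]

    # Look ahead one column; after the last column there is nothing to see
    if (i < ncol(rs)) {
      coming_top <- rs[1, i + 1]
      coming_bottom <- rs[2, i + 1]
    } else {
      coming_top <- NA
      coming_bottom <- NA
    }

    # This column starts a multi-column question: remember its title
    if (is.na(coming_top) && !is.na(current_top) &&
        (!is.na(current_bottom) || grepl("^Other", coming_bottom)))
      pre <- current_top

    # Sub-question column: prefix the bottom label with the question title
    if (!is.na(current_bottom))
      nms[i] <- paste0(c(pre, current_bottom), collapse = " - ")

    # Single-column question: the top label is the whole name
    if (!is.na(current_top) && is.na(current_bottom))
      nms[i] <- current_top

  }

  nms
}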

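Late to the party, but this is still an issue, and the best workaround I have found is a function that pastes column names and sub-column names together based on a repeated value.

For example, if you export to .csv, the repeated column names are automatically replaced with X in RStudio. If you export to .xlsx, the repeated value will be "...".

Here is a base R solution: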

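# x:       data frame whose first row holds the sub-headers
# rep_val: prefix of the placeholder column names ("X" or "...")
sm_header_function <- function(x, rep_val){

  orig <- x

  # Keep only the columns that have a sub-header in the first row
  sv <- x[1,]
  sv <- sv[, sapply(sv, Negate(anyNA)), drop = FALSE]

  # One row per column: original column name and its sub-header
  sv <- t(sv)
  sv <- data.frame(name = rownames(sv), value = sv[, 1],
                   row.names = NULL, stringsAsFactors = FALSE)

  # A new group starts at every name that is not a placeholder
  sv$grp <- with(sv, ave(name, FUN = function(x) cumsum(!startsWith(name, rep_val))))
  # The first name in each group is the question title
  sv$new_value <- with(sv, ave(name, grp, FUN = function(x) head(x, 1)))
  sv$new_value <- paste0(sv$new_value, " ", sv$value)

  # Rename the affected columns and drop the sub-header row
  colnames(orig)[which(colnames(orig) %in% sv$name)] <- sv$new_value
  orig <- orig[-c(1),]
  return(orig)
}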

sm_header_function(df, "X")
sm_header_function(df, "...")
Cleaning up the export from SurveyMonkey:

> colnames(sample)
 [1] "Respondent ID"                                 "Please provide your contact information:"      "...11"                                        
 [4] "...12"                                         "...13"                                         "...14"                                        
 [7] "...15"                                         "...16"                                         "...17"                                        
[10] "...18"                                         "...19"                                         "I wish it would have snowed more this winter."
> colnames(sample_clean)
 [1] "Respondent ID"                                            "Please provide your contact information: Name"           
 [3] "Please provide your contact information: Company"         "Please provide your contact information: Address"        
 [5] "Please provide your contact information: Address 2"       "Please provide your contact information: City/Town"      
 [7] "Please provide your contact information: State/Province"  "Please provide your contact information: ZIP/Postal Code"
 [9] "Please provide your contact information: Country"         "Please provide your contact information: Email Address"  
[11] "Please provide your contact information: Phone Number"    "I wish it would have snowed more this winter. Response"  
Sample data:

structure(list(`Respondent ID` = c(NA, 11385284375, 11385273621, 
11385258069, 11385253194, 11385240121, 11385226951, 11385212508
), `Please provide your contact information:` = c("Name", "Benjamin Franklin", 
"Mae Jemison", "Carl Sagan", "W. E. B. Du Bois", "Florence Nightingale", 
"Galileo Galilei", "Albert Einstein"), ...11 = c("Company", "Poor Richard's", 
"NASA", "Smithsonian", "NAACP", "Public Health Co", "NASA", "ThinkTank"
), ...12 = c("Address", NA, NA, NA, NA, NA, NA, NA), ...13 = c("Address 2", 
NA, NA, NA, NA, NA, NA, NA), ...14 = c("City/Town", "Philadelphia", 
"Decatur", "Washington", "Great Barrington", "Florence", "Pisa", 
"Princeton"), ...15 = c("State/Province", "PA", "Alabama", "D.C.", 
"MA", "IT", "IT", "NJ"), ...16 = c("ZIP/Postal Code", "19104", 
"20104", "33321", "1230", "33225", "12345", "8540"), ...17 = c("Country", 
NA, NA, NA, NA, NA, NA, NA), ...18 = c("Email Address", "benjamins@gmail.com", 
"mjemison@nasa.gov", "stargazer@gmail.com", "dubois@web.com", 
"firstnurse@aol.com", "galileo123@yahoo.com", "imthinking@gmail.com"
), ...19 = c("Phone Number", "215-555-4444", "221-134-4646", 
"999-999-4422", "999-000-1234", "123-456-7899", "111-888-9944", 
"215-999-8877"), `I wish it would have snowed more this winter.` = c("Response", 
"Strongly disagree", "Strongly agree", "Neither agree nor disagree", 
"Strongly disagree", "Disagree", "Agree", "Strongly agree")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))

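Can you give a small example of the SurveyMonkey output that shows the problem? I can imagine a solution that uses readLines with n=2 to read (and massage) the headers, and read.csv with skip=2, header=FALSE to get the data...

Next time you run a survey, use LimeSurvey. It is open source and it has an export-to-R feature that works quite well (disclosure: I wrote the export module).

@Ben, the header in the file is two rows: the question name/number, then the sub-questions written on the row below. In general it is very cumbersome to deal with.

@Sean, within my organisation I usually use *.sav (you need a paid account) because the csv is so hard to work with. The SPSS files can have some quirks, but they are not too bad to clean up (@Andrie is also doing some work on this :))

@Ben, while trying to create a small example, I found that the second line of the SurveyMonkey CSV file seems to start with a null character, and R ignores that line when I use read.csv() or readLines(). LibreOffice can read the line! It drove me crazy for a while! Suggestions?

Since the second line starts with a null character, I'm afraid that won't work.

if(!is.null(second.line)){paste(first.line, second.line)}?

Unfortunately, the second line has useful information in it, even though it starts with a null character!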
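One way to combine the two ideas from these comments (reading the header lines separately, and coping with a null character at the start of the second line) might look like the sketch below. It is untested against a real SurveyMonkey export: the file is a made-up stand-in with a single NUL byte written in by hand, and it relies on the skipNul argument of readLines(). Instead of calling read.csv() with skip=2, it reads all lines once and parses the header and the data from that vector.

```r
# Hypothetical stand-in for an export whose second line starts with a NUL byte.
path <- tempfile(fileext = ".csv")
con <- file(path, "wb")
writeBin(c(charToRaw("Q1,Q2\n"),
           as.raw(0), charToRaw(",Other (please specify)\n"),
           charToRaw("yes,\n"),
           charToRaw("no,maybe\n")), con)
close(con)

# skipNul = TRUE drops embedded NUL bytes instead of truncating the line.
all_lines <- readLines(path, skipNul = TRUE)

# First two lines are the header rows; the rest are the responses.
hdr <- read.csv(text = all_lines[1:2], header = FALSE,
                colClasses = "character")
dat <- read.csv(text = all_lines[-(1:2)], header = FALSE,
                stringsAsFactors = FALSE)

# Paste the two header rows together, column by column.
top    <- unlist(hdr[1, ], use.names = FALSE)
bottom <- unlist(hdr[2, ], use.names = FALSE)
names(dat) <- trimws(paste(top, bottom))

print(names(dat))
```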