Importing messy data with R


Does anyone know how to import the following data into R in a proper form? I tried the strsplit function as:
test

How about this?

> nicelyFormatted
     [,1]           [,2]    [,3]    [,4]   [,5] [,6]  [,7]  [,8]  [,9]   [,10]
[1,] "Black Eagles" "01/12" "12/11" "1500" "W"  "7.0" "420" "48"  "Away" "+3" 
[2,] "Blue State"   "02/18" "04/21" "1293" "L"  "8.0" "490" "48"  "Home" "+1" 
[3,] "Hawks"        "01/13" "02/17" "1028" "L"  "4.0" "46"  "460" "Away" NA   
[4,] "New Apple"    "09/23" "11/23" "563"  "L"  "3.0" "470" "47"  "Home" "+2" 
[5,] "Black White"  "07/05" "09/26" "713"  "L"  "5.2" "500" "45"  "Home" "+4" 
[6,] "PBO"          "10/24" "10/30" "1495" "L"  "1.9" "47"  "410" "Away" NA   



Here is the code used to produce the table above:

library(stringr)

#-------------------#
#    FUNCTIONS      #
#-------------------#
# Defined first so the script below can be run from top to bottom.

# Split one line into fields, using the " - " between the two dates as an anchor.
# `start` and `end` are the character positions of " - " within `strn`.
spliceOnDash <- function(strn, start, end)  {

  # split around the dates (each date is assumed to be 5 characters, e.g. "01/12")
  pre <- substr(strn, 1, start-6)
  dates <- substr(strn, start-5, end+5)
  post <- substr(strn, end+6, str_length(strn))

  # Clean up
  pre <- str_trim(pre)

  # trim, then split the remainder on runs of one or more spaces
  post <- str_split(str_trim(post), " +")

  # if dates are one field, remove this next line
  dates <- str_split(dates, " - ")

  # return
  c(unlist(pre), unlist(dates), unlist(post))
}

# Function to clean up the list into a nice table;
# rows with fewer fields are padded with NA
formatNicely <- function(spliced)  {
  lngst <- max(sapply(spliced, length))
  t(sapply(spliced, function(x)  
      if(length(x) < lngst) c(x, rep(NA, lngst-length(x))) else x ))
}

#-------------------#
#      SCRIPT       #
#-------------------#

# Read in lines (readLines opens and closes the file itself)
pathToFile <- path.expand("~/path/to/file/myfile.txt")
rawText <- readLines(pathToFile)

# Find the dashes (first " - " on each line)
dsh <- str_locate(rawText, " - ")

# Splice, using the dashes as a guide
# (lapply always returns a list, which formatNicely expects)
spliced <- lapply(seq_along(rawText), function(i) 
  spliceOnDash(rawText[[i]], dsh[i, "start"], dsh[i, "end"])
)

# make it pretty
nicelyFormatted <- formatNicely(spliced)
nicelyFormatted
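
The result above is a character matrix, while the question mentions string, numeric, and date variables. As a minimal sketch, the matrix can be turned into a data frame with typed columns. The column names below are hypothetical: they are a guess based on the order of the variables listed in the comments, not names from the original file. The two date columns are left as character because the printed values (for example "01/12") do not say whether they mean month/day or month/year.

```r
# Character matrix copied from the first three rows of the printed output
nicelyFormatted <- rbind(
  c("Black Eagles", "01/12", "12/11", "1500", "W", "7.0", "420", "48",  "Away", "+3"),
  c("Blue State",   "02/18", "04/21", "1293", "L", "8.0", "490", "48",  "Home", "+1"),
  c("Hawks",        "01/13", "02/17", "1028", "L", "4.0", "46",  "460", "Away", NA)
)

# Hypothetical column names (assumption, not taken from the source file)
colnames(nicelyFormatted) <- c("team", "period_start", "period_end", "supporters",
                               "last_result", "budget", "friendship_rank",
                               "age_sum", "status", "average")

# Keep strings as character rather than factor
teams <- as.data.frame(nicelyFormatted, stringsAsFactors = FALSE)

# Convert the columns that hold numbers; as.numeric() accepts a leading "+"
numCols <- c("supporters", "budget", "friendship_rank", "age_sum", "average")
teams[numCols] <- lapply(teams[numCols], as.numeric)

str(teams)
```

Missing values in the matrix stay NA after the conversion, so rows that were padded by formatNicely remain usable.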
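Where does this data come from? Could the original file contain tab separators or something similar?

Yes, there are tab separators between each variable, and spaces between the two words of a name. These are just team statistics, and each column should represent one variable (9 variables). I couldn't figure it out because this dataset includes string, numeric, and date variables. Any help would be appreciated.

Then use read.table(datafile, sep='\t').

Could you post the result of dput(test) (from before you reassigned it with strsplit), so that I can post a useful answer?

There are no tabs in the data you uploaded. Because spaces are used both to separate fields and as part of the team names, the fields cannot be separated automatically unless you edit the file to add quotes around the character fields, or replace the spaces between fields with commas or tabs.

Sorry for not listing this earlier. The variables are team name, time period, number of supporters, last match result, budget, friendship rank, sum of ages, status, and average.

Awesome! Thank you all, especially RS.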
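
For reference, here is a small sketch of the read.table suggestion from the comments, assuming the file really is tab-delimited. The sample text is hypothetical: it is built from two rows of the printed output with tabs inserted between fields, and it is read through the text argument so the example does not depend on a file on disk.

```r
# Hypothetical sample: two rows with a tab between fields (assumed layout)
tabText <- paste(
  "Black Eagles\t01/12 - 12/11\t1500\tW\t7.0\t420\t48\tAway\t+3",
  "Blue State\t02/18 - 04/21\t1293\tL\t8.0\t490\t48\tHome\t+1",
  sep = "\n"
)

# sep = "\t" keeps the spaces inside "Black Eagles" and inside the date range;
# fill = TRUE pads rows that are missing trailing fields
teams <- read.table(text = tabText, sep = "\t", header = FALSE,
                    quote = "", fill = TRUE, stringsAsFactors = FALSE)

dim(teams)
teams$V1
```

With a real file, the text argument would be replaced by the file path. If the file has no tabs, as one of the comments points out for the uploaded data, this call returns a single column and the dash-based splitting is needed instead.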