How to correctly split strings in R based on position within the string?

r regex string


I have a vector of strings, as follows:

t1 <- "                                                                                                                Total"     
t2 <- "                                          Total                                                              Stock Price"  
t3 <- "                                         Dividend                              Misc Gain      MTCC Gain         Gain"      
t4 <- "                         Proportion        Gain                                Position        Position       Position"    
t5 <- "   Year   Dividend Gain    Earned        (1) x (2)   Dividend Gain Misc Gain    (4) - (5)       (3) - (4)     (6) + (7)"   
t6 <- "  –––––        –––––        –––––          –––––         –––––       –––––        –––––          –––––          –––––"     
t  <- c(t1, t2, t3, t4, t5, t6)

The longest word is Proportion, so I will try to find the start and end indices of Proportion in t4.

Another example is column 2:

Dividend Gain
     –––––

The longest word is Dividend Gain, so I will try to find the start and end indices of Dividend Gain in t5.

How can I find the required indices from t?
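
When the word to look for is already known, its start and end positions in a single row can be read off with base R's regexpr. The following is only a minimal sketch of that lookup; `row` is a made-up, shortened stand-in for t4 (an assumption for illustration), not the real data.

```r
# Hypothetical stand-in for one header row: 6 spaces, a word, 5 spaces, a word
row <- paste0(strrep(" ", 6), "Proportion", strrep(" ", 5), "Gain")

# regexpr() returns the 1-based start of the first match; the length of the
# match is stored in the "match.length" attribute of the result
m <- regexpr("Proportion", row, fixed = TRUE)
start <- as.integer(m)
end <- start + attr(m, "match.length") - 1L

c(start = start, end = end)
substr(row, start, end)
```

This only works once the longest word of a column is known, which is why the answer below works on character positions across all rows instead.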

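One solution is to match the character positions across all the vectors.

First, it helps if all the strings have the same number of characters. We can achieve this by appending blanks at the end. (tx is the question's string vector; its definition is given at the end.)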

# list string vector --
tl <- as.list(tx)

# make equal length --
tl <- lapply(tl, function(x) {
  d <- max(sapply(tl, nchar)) - nchar(x)
  if (d > 0) paste0(x, strrep(" ", d))  # append exactly d trailing spaces
  else x
})

# check equal num. of chars.
sd(sapply(tl, nchar))  
# [1] 0  # ok

Comments:

The hardest part isn't finding the longest word, it's the data structure you have. Working out which column each word belongs to is pretty tedious. Can you set character ranges to split each t# on? For example, could the first 10 characters of each t# be treated as column 1, and so on?

@svenhalvorson It would be best if he provided a dput.

If we think of each character position as a "column", it looks like there is at least one full character column of spaces between each pair of data columns. So I would start by getting the character positions of the spaces in each row, then take the intersection of those to find the breaks between data columns. Then you could use something like read.fwf to parse the contents into data columns. Trimming the whitespace and then finding the longest word in each column is quite simple. (The ragged right edge might trip up a fixed-width file parser, though, so you might just need to use strsplit or substr.)

@jay.sf It can be copied/pasted with valid syntax, so a dput doesn't seem necessary here.

@Gregor You're right, that's sufficient.

Thanks, that's exactly what I was looking for!

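
The idea in the comment above (intersect the space positions of all rows to find the column breaks) can be sketched roughly as follows. The rows here are hypothetical, shortened ASCII stand-ins for the real headers, and the sketch cuts with substring rather than read.fwf, which the comment itself suggests as a fallback.

```r
# Hypothetical, simplified header rows, padded to equal length
rows <- c("       Total",
          "Year   Div Gain   Misc",
          "----   --------   ----")
rows <- paste0(rows, strrep(" ", max(nchar(rows)) - nchar(rows)))

# Positions of the spaces in each row, then their intersection:
# these positions are blank in every row
blank_pos <- lapply(strsplit(rows, ""), function(ch) which(ch == " "))
blank_everywhere <- Reduce(intersect, blank_pos)

# All remaining positions belong to some data column
used <- setdiff(seq_len(nchar(rows[1])), blank_everywhere)

# Consecutive used positions form one data column
runs <- split(used, cumsum(c(TRUE, diff(used) > 1)))
starts <- unname(vapply(runs, min, integer(1)))
ends <- unname(vapply(runs, max, integer(1)))

# Cut every row at those positions and trim the padding
cells <- sapply(seq_along(starts),
                function(j) trimws(substring(rows, starts[j], ends[j])))
cells
```

Each column of `cells` is one data column and each row corresponds to one input string, so empty header cells show up as "".
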
# helper: split a sorted vector of positions into runs of consecutive values --
splitAtCuts <- function(x) 
  split(x, cut(x, x[which(c(2, diff(x[- length(x)]), length(x)) > 1)],
               include.lowest=TRUE, right=FALSE))
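
splitAtCuts groups sorted positions into runs of consecutive integers. As written, it appears to fold a trailing run of length one into the previous group, because the last position is always used as a closing break; the printed result suggests the data in this thread does not hit that case. A possible alternative, shown only as a sketch, starts a new group wherever the gap to the previous value exceeds 1. The name `split_runs` is hypothetical and not part of the answer.

```r
# Split a sorted numeric vector into runs of consecutive values.
# split_runs is a hypothetical helper name used for illustration only.
split_runs <- function(x) {
  # a new run starts wherever the gap to the previous value exceeds 1
  unname(split(x, cumsum(c(TRUE, diff(x) > 1))))
}

runs <- split_runs(c(3, 4, 5, 9, 10, 14))
runs
```

Because the rest of the code only indexes the groups by position (cols[[y]]), a helper like this could be swapped in without changing the later steps.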

# get character position matches --
# step 1
sl <- lapply(tl, function(x) {
  w <- which(strsplit(x, "")[[1]] != " ")
  return(splitAtCuts(w))
})
# step 2
pos <- sort(Reduce(union, unlist(sl)))

# extract column positions --
cols <- splitAtCuts(pos)

# cut into a matrix --
FUN <- Vectorize(function(x, y) 
  substring(tl[[x]], min(cols[[y]]), max(cols[[y]])))

M <- outer(seq(length(tl)), seq(length(cols)), FUN)

M <- apply(M, 2, function(x) gsub("^\\s|\\s{2,}|\\s$", "", x))
M

     [,1]    [,2]            [,3]         [,4]        [,5]            [,6]       
[1,] ""      ""              ""           ""          ""              ""         
[2,] ""      ""              ""           "Total"     ""              ""         
[3,] ""      ""              ""           "Dividend"  ""              ""         
[4,] ""      ""              "Proportion" "Gain"      ""              ""         
[5,] "Year"  "Dividend Gain" "Earned"     "(1) x (2)" "Dividend Gain" "Misc Gain"
[6,] "–––––" "–––––"         "–––––"      "–––––"     "–––––"         "–––––"    
     [,7]        [,8]        [,9]         
[1,] ""          ""          "Total"      
[2,] ""          ""          "Stock Price"
[3,] "Misc Gain" "MTCC Gain" "Gain"       
[4,] "Position"  "Position"  "Position"   
[5,] "(4) - (5)" "(3) - (4)" "(6) + (7)"  
[6,] "–––––"     "–––––"     "–––––"      

Data

tx <- c("                                                                                                                Total", 
"                                          Total                                                              Stock Price", 
"                                         Dividend                              Misc Gain      MTCC Gain         Gain", 
"                         Proportion        Gain                                Position        Position       Position", 
"   Year   Dividend Gain    Earned        (1) x (2)   Dividend Gain Misc Gain    (4) - (5)       (3) - (4)     (6) + (7)", 
"  –––––        –––––        –––––          –––––         –––––       –––––        –––––          –––––          –––––"
)
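
Since the question asks for start and end indices rather than the split text, here is a minimal sketch of how they could be derived from the grouped positions, along with the longest entry per column. It uses hypothetical, shortened ASCII rows instead of tx so that it runs on its own; with the answer's objects, sapply(cols, range) should give the same kind of table.

```r
# Hypothetical, simplified header rows, padded to equal length
rows <- c("       Total",
          "Year   Div Gain   Misc",
          "----   --------   ----")
rows <- paste0(rows, strrep(" ", max(nchar(rows)) - nchar(rows)))

# Positions holding a non-space character in at least one row
chars <- do.call(rbind, strsplit(rows, ""))
pos <- which(colSums(chars != " ") > 0)

# Runs of consecutive positions are the data columns
cols <- unname(split(pos, cumsum(c(TRUE, diff(pos) > 1))))

# Start and end index of every data column (one result column each)
idx <- sapply(cols, range)
rownames(idx) <- c("start", "end")
idx

# Trimmed cell text, then the longest entry in each data column
cells <- sapply(cols, function(p) trimws(substring(rows, min(p), max(p))))
longest <- apply(cells, 2, function(v) v[which.max(nchar(v))])
longest
```

which.max returns the first maximum, so when two entries in a column have the same length the one from the earlier row is reported.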