R: splitting merged words (using a mini dictionary)

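I have a set of words: some are merged terms, and others are just plain words. I also have a separate list of words that I will use as a dictionary, comparing it against my first list in order to "un-merge" certain words.

Here is an example: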

ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

I think the first step should be to build all pairwise combinations from ListB:
pairings <- expand.grid(ListB, ListB)
combos <- apply(pairings, 1, function(x) paste0(x[1], x[2]))
combos
#  [1] "dodo"       "minedo"     "anddo"      "thedo"      "lowerdo"    "owedo"      "swimdo"    
#  [8] "domine"     "minemine"   "andmine"    "themine"    "lowermine"  "owemine"    "swimmine"  
# [15] "doand"      "mineand"    "andand"     "theand"     "lowerand"   "oweand"     "swimand"   
# [22] "dothe"      "minethe"    "andthe"     "thethe"     "lowerthe"   "owethe"     "swimthe"   
# [29] "dolower"    "minelower"  "andlower"   "thelower"   "lowerlower" "owelower"   "swimlower" 
# [36] "doowe"      "mineowe"    "andowe"     "theowe"     "lowerowe"   "oweowe"     "swimowe"   
# [43] "doswim"     "mineswim"   "andswim"    "theswim"    "lowerswim"  "oweswim"    "swimswim"  

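Finally, you want to split the words in ListA that match a pair of elements from ListB, unless the word is already in ListB itself. I imagine there are many ways to do this, but I will use lapply and unlist: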
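# `matches` is not defined in the text above; it is assumed here to hold, for
# each element of ListA, the identical string in combos (NA when there is none).
matches <- combos[match(ListA, combos)]

newA <- unlist(lapply(seq_along(ListA), function(idx) {
  # Keep the word unchanged if it matches no pair or is itself in ListB
  if (is.na(matches[idx]) || ListA[idx] %in% ListB) {
    return(ListA[idx])
  } else {
    # Otherwise return the two ListB words that make up the pair
    return(as.vector(as.matrix(pairings[combos == matches[idx], ])))
  }
}))
newA
# [1] "dopamine"  "and"       "the"       "lower"     "swim"      "other"     "different"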
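
Building every pair grows with the square of the dictionary size, so for a larger ListB it may be worth trying each cut position inside a word instead. The following is only a minimal sketch of that idea, not the method from the answer above: `split_merged` and `newA2` are hypothetical names introduced here, and the function returns the first split whose two halves are both in the dictionary.

```r
# Hypothetical helper (not part of the original answer): split a word into two
# dictionary entries by testing every cut position from left to right.
split_merged <- function(word, dict) {
  n <- nchar(word)
  # Leave the word alone if it is already a dictionary entry or too short to cut
  if (word %in% dict || n < 2) return(word)
  for (cut in seq_len(n - 1)) {
    left  <- substr(word, 1, cut)
    right <- substr(word, cut + 1, n)
    # Accept the first cut where both halves are dictionary words
    if (left %in% dict && right %in% dict) return(c(left, right))
  }
  word  # no valid split found, so return the word unchanged
}

ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

newA2 <- unlist(lapply(ListA, split_merged, dict = ListB))
newA2
```

On the example data this should give the same vector as newA. With a dictionary where one word can be cut in more than one valid way, the two approaches can differ, because this sketch stops at the first valid cut.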

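Would some of the helper functions in stringr help? I think a few of them could get you moving quickly.

@hrbrmstr I didn't know about the stringr package. I'll look into it now! Thanks for the suggestion.

This is perfect. I spent a lot of time working with stringr trying to get this to work, and then I came back and you had produced this.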