R: Splitting merged words (using a mini dictionary)


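I have a set of words: some of them are merged words, and others are just plain words. I also have a separate word list that I want to use as a dictionary, comparing it against my first list in order to "un-merge" some of the words.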

Here is an example:

ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

I think the first step should be to build all pairwise combinations of the elements of ListB:

pairings <- expand.grid(ListB, ListB)
combos <- apply(pairings, 1, function(x) paste0(x[1], x[2]))
combos
#  [1] "dodo"       "minedo"     "anddo"      "thedo"      "lowerdo"    "owedo"      "swimdo"    
#  [8] "domine"     "minemine"   "andmine"    "themine"    "lowermine"  "owemine"    "swimmine"  
# [15] "doand"      "mineand"    "andand"     "theand"     "lowerand"   "oweand"     "swimand"   
# [22] "dothe"      "minethe"    "andthe"     "thethe"     "lowerthe"   "owethe"     "swimthe"   
# [29] "dolower"    "minelower"  "andlower"   "thelower"   "lowerlower" "owelower"   "swimlower" 
# [36] "doowe"      "mineowe"    "andowe"     "theowe"     "lowerowe"   "oweowe"     "swimowe"   
# [43] "doswim"     "mineswim"   "andswim"    "theswim"    "lowerswim"  "oweswim"    "swimswim"  
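
The final step further down refers to a vector called `matches`, which is not defined anywhere in the text as given. A minimal sketch of what it plausibly looks like, assuming it holds, for each word in ListA, the identical pasted pair from `combos` or NA when there is none (this definition is an assumption, not taken from the original answer):

```r
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")
pairings <- expand.grid(ListB, ListB)
combos <- apply(pairings, 1, function(x) paste0(x[1], x[2]))

# Assumed definition: match() gives the position of each ListA word in combos
# (NA if absent); indexing combos with it returns the matched string or NA.
matches <- combos[match(ListA, combos)]
matches
# [1] NA          "andthe"    "lowerswim" NA          NA
```

With this definition, `combos == matches[idx]` in the last step selects the row of `pairings` that produced the merged word.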
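Finally, you want to split the words in ListA that match a pair of elements from ListB, unless the word is itself already in ListB. There are probably many ways to do this, but I will use lapply and unlist: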

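# Note: `matches` is not defined in this excerpt; it is assumed to hold, for
# each element of ListA, the matching entry of combos, or NA if there is none.
newA <- unlist(lapply(seq_along(ListA), function(idx) {
  # Keep the word as-is if it matches no pair or is already a dictionary word.
  if (is.na(matches[idx]) || ListA[idx] %in% ListB) {
    return(ListA[idx])
  } else {
    # Otherwise return the two dictionary words that were pasted together.
    return(as.vector(as.matrix(pairings[combos == matches[idx], ])))
  }
}))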
newA
# [1] "dopamine"  "and"       "the"       "lower"     "swim"      "other"     "different"
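
The pairing approach above builds every combination of dictionary words, which grows quadratically with the size of ListB. As an alternative sketch (not part of the original answer), each word can instead be cut at every position and both halves looked up in the dictionary. The names `split_merged` and `newA2` are hypothetical, and only base R functions are used:

```r
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

# Hypothetical helper: split a word into two dictionary entries by trying
# every cut position; return the word unchanged if no cut works.
split_merged <- function(word, dict) {
  # Leave the word alone if it is a dictionary entry or too short to cut.
  if (word %in% dict || nchar(word) < 2) return(word)
  cuts  <- seq_len(nchar(word) - 1)
  left  <- substring(word, 1, cuts)                 # all proper prefixes
  right <- substring(word, cuts + 1, nchar(word))   # the corresponding suffixes
  hit <- which(left %in% dict & right %in% dict)
  if (length(hit) == 0) return(word)
  # If several cuts work, this sketch simply takes the first one.
  c(left[hit[1]], right[hit[1]])
}

newA2 <- unlist(lapply(ListA, split_merged, dict = ListB))
newA2
# [1] "dopamine"  "and"       "the"       "lower"     "swim"      "other"     "different"
```

Like the pairing approach, this only handles words made of exactly two dictionary entries.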

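Would some of the helper functions in stringr be useful here? I think a few of them would get you going quickly.

@hrbrmstr I did not know about the stringr package. I will look into it now! Thanks for the suggestion.

This is perfect. I spent a lot of time working with stringr trying to get this to work, and when I came back you had already produced this.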