R: How do I create a new column based on values in rows that match values in other columns?

r string

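Suppose I have a data frame with some categorical variables and some columns of string values. I want to create a new column that, for each row, pastes together the string values from other rows whose categorical values match (or do not match) certain conditions. Here is a toy example:

toy <- data.frame("id" = c(1,2,3,2), "year" = c(2000,2000,2004,2004), "words" = c("a b", "c d", "e b", "c d"))

The first two rows of the new column would be empty, because no year in the toy example is earlier than 2000. The last row would have only "a b" in the new column, because its id is repeated: the earlier row with the same id is left out.

I have tried various apply and groupby approaches, but none of them does quite what I need. Any ideas would be greatly appreciated.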

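I used the sqldf and plyr packages to implement a solution. I don't think it is an elegant solution, but it works. I would be glad to see others offer more efficient solutions.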

library(sqldf)

toy <- data.frame("id" = c(1,2,3,2), 
                   "year" = c(2000,2000,2004,2004), 
                   "words" = c("a b", "c d", "e b", "c d"))

toy

#  id year words
#1  1 2000   a b
#2  2 2000   c d
#3  3 2004   e b
#4  2 2004   c d

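# Self-join: pair each row with every row from an earlier year whose words differ
# (this compares words rather than id, which gives the same result on the toy data)
df <- sqldf('SELECT t1.*,t2.words AS word_pool FROM toy t1 LEFT JOIN toy t2 
       ON t1.year > t2.year AND
       t1.words <> t2.words')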

df
#  id year words word_pool
#1  1 2000   a b      <NA>
#2  2 2000   c d      <NA>
#3  3 2004   e b       a b
#4  3 2004   e b       c d
#5  2 2004   c d       a b

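# Collapse the joined words back into one string per original row;
# for rows without a match, paste() turns the NA into the string "NA"
result <- plyr::ddply(df,c("id","year","words"), 
                      function(dfx)paste(dfx$word_pool, 
                                         collapse = " "))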

result
#  id year words      V1
#1  1 2000   a b      NA
#2  2 2000   c d      NA
#3  2 2004   c d     a b
#4  3 2004   e b a b c d

With for and which. It is written in the same spirit as apply and uses no external libraries.
        ## Create data
        toy <-
          data.frame(
            "id" = c(1, 2, 3, 2),
            "year" = c(2000, 2000, 2004, 2004),
            "words" = c("a b", "c d", "e b", "c d")
          )

        # Initialise the new column as character, since it will hold strings
        toy$word_pool <- ""
        # Iterate over rows; length(toy) would count columns, not rows
        for (i in seq_len(nrow(toy))) {
          # Indices of rows with an earlier year and a different id
          condition_index <- which(toy$year[i] > toy$year
                                        & toy$id[i] != toy$id)
          # Assign
          if (length(condition_index) == 0) { # no matching rows
            toy$word_pool[i] <- ""
          } else { # join the matching words with single spaces
            toy$word_pool[i] <- paste(toy$words[condition_index],
                                      collapse = " ")
          }
        }
        toy
        # id year words word_pool
        # 1  2000   a b          
        # 2  2000   c d          
        # 3  2004   e b   a b c d
        # 2  2004   c d       a b
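
Since this answer describes the loop as being written "like apply", the same rule can probably be expressed with vapply directly. The following is only a sketch in base R, assuming the rule is "earlier year and different id"; the helper name pool_for_row is hypothetical and not part of the original answer.

```r
toy <- data.frame(
  id = c(1, 2, 3, 2),
  year = c(2000, 2000, 2004, 2004),
  words = c("a b", "c d", "e b", "c d"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: words of all rows with an earlier year and a different id
pool_for_row <- function(i, df) {
  keep <- df$year < df$year[i] & df$id != df$id[i]
  # An empty selection collapses to "", so no special case is needed
  paste(df$words[keep], collapse = " ")
}

# vapply checks that every row yields exactly one string
toy$word_pool <- vapply(seq_len(nrow(toy)), pool_for_row, character(1), df = toy)
toy
```

This still visits the rows one at a time, so it is mainly a tidier form of the loop rather than a faster one.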

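Thank you. This certainly works, but ideally the solution would not use loops (the actual data I am working with has thousands of rows).

This works, and I would also like to see whether others have different solutions. Do you think it will scale efficiently to thousands of rows?

sqldf is very efficient. The plyr part may become a bottleneck. However, I have used similar code on at least a million observations without any problems, so for a few thousand rows you should be fine.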
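
For readers who want to avoid the explicit loop, one possible approach is to do the pairwise comparison in a single step with outer. This is a base R sketch under the assumption that the rule is "earlier year and different id"; it is not taken from either answer. It builds an n by n logical matrix, so memory grows quadratically with the number of rows, and the final collapse still iterates over rows through apply.

```r
toy <- data.frame(
  id = c(1, 2, 3, 2),
  year = c(2000, 2000, 2004, 2004),
  words = c("a b", "c d", "e b", "c d"),
  stringsAsFactors = FALSE
)

# keep[i, j] is TRUE when row j has an earlier year than row i and a different id
keep <- outer(toy$year, toy$year, ">") & outer(toy$id, toy$id, "!=")

# For each row i, join the words of the selected rows; no match gives ""
toy$word_pool <- apply(keep, 1, function(k) paste(toy$words[k], collapse = " "))
toy
```

For a few thousand rows the matrix should be manageable, but for much larger data a join-based approach such as the sqldf one above is likely the safer choice.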