Reshape a data.frame to have additional rows based on multiple items in a single column (R)

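I have a data.frame containing a list of tweets captured with the twitteR library, and I would like to get a list of those tweets annotated with their hashtags, one row per hashtag.

For example, I start with: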

tmp=data.frame(tweets=c("this tweet with #onehashtag","#two hashtags #here","no hashtags"),dummy=c('random','other','column'))
> tmp
                       tweets  dummy
1 this tweet with #onehashtag random
2         #two hashtags #here  other
3                 no hashtags column

and would like to produce:

result=data.frame(tweets=c("this tweet with #onehashtag","#two hashtags #here","#two hashtags #here","no hashtags"),dummy=c('random','other','other','column'),tag=c('#onehashtag','#two','#here',NA))
> result
                       tweets  dummy        tag
1 this tweet with #onehashtag random #onehashtag
2         #two hashtags #here  other        #two
3         #two hashtags #here  other       #here
4                 no hashtags column        <NA>

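To extract the hashtags from each tweet into a list, I can use the following (str_extract_all comes from the stringr package):

library(stringr)
tmp$tags=sapply(tmp$tweets,function(x) str_extract_all(x,'#[a-zA-Z0-9]+'))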
> tmp
                       tweets  dummy        tags
1 this tweet with #onehashtag random #onehashtag
2         #two hashtags #here  other #two, #here
3                 no hashtags column

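But I am missing a trick somewhere and cannot see how to use this list as the basis for creating the duplicated rows.

First, let's get the matches: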

matches <- gregexpr("#[a-zA-Z0-9]+",tmp$tweets)
matches
[[1]]
[1] 17
attr(,"match.length")
[1] 11

[[2]]
[1]  1 15
attr(,"match.length")
[1] 4 5

[[3]]
[1] -1
attr(,"match.length")
[1] -1

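Now use the matches to get the start and end positions (a tweet with no hashtag is reported as position -1 with match length -1):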

starts <- unlist(matches)
ends <- starts + unlist(sapply(matches,function(x) attr(x,"match.length"))) - 1
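The start and end positions can then be turned into the desired one-row-per-tag table. The following is only a sketch of one way to finish this approach, not necessarily what the answer's author intended. It assumes the tweets are stored as character strings (hence `stringsAsFactors = FALSE`), and the names `n_matches`, `idx` and `lens` are introduced here purely for illustration.

```r
# Sample data from the question, kept as character rather than factor
tmp <- data.frame(
  tweets = c("this tweet with #onehashtag", "#two hashtags #here", "no hashtags"),
  dummy  = c("random", "other", "column"),
  stringsAsFactors = FALSE
)

matches <- gregexpr("#[a-zA-Z0-9]+", tmp$tweets)

# Each list element has one entry per match, or a single -1 when nothing matched,
# so every tweet contributes at least one row
n_matches <- sapply(matches, length)
idx <- rep(seq_len(nrow(tmp)), n_matches)

starts <- unlist(matches)
lens   <- unlist(lapply(matches, attr, "match.length"))
ends   <- starts + lens - 1

# Repeat the rows, then cut each tag out of its tweet; -1 marks "no hashtag"
result <- tmp[idx, ]
result$tag <- ifelse(starts == -1, NA_character_,
                     substring(result$tweets, starts, ends))
rownames(result) <- NULL
```

Because a tweet without hashtags still yields a single -1 entry, it keeps exactly one row, and that row gets NA as its tag.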

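Rows with tags and rows without tags behave differently, so the code is easier to understand if you handle the two cases separately.

Use str_extract_all, as before, to get the tags: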

tags <- str_extract_all(tmp$tweets, '#[a-zA-Z0-9]+')

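Use rep.int to work out how many times each row has to be repeated, once per tag it contains:

index <- rep.int(seq_len(nrow(tmp)), sapply(tags, length))

Use this index to expand tmp, and add the tags column: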

tagged <- tmp[index, ]
tagged$tags <- unlist(tags)

Rows without any tag should appear exactly once, with NA in the tags column:

has_no_tag <- sapply(tags, function(x) length(x) == 0L)
not_tagged <- tmp[has_no_tag, ]
not_tagged$tags <- NA

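Putting it all together:

library(stringr)

# Extract all hashtags from each tweet (a list with one character vector per row)
tags <- str_extract_all(tmp$tweets, '#[a-zA-Z0-9]+')

# Repeat each row number once per tag found in that row
index <- rep.int(seq_len(nrow(tmp)), sapply(tags, length))

# One row per tag
tagged <- tmp[index, ]
tagged$tags <- unlist(tags)

# Rows without tags appear once, with NA as the tag
has_no_tag <- sapply(tags, function(x) length(x) == 0L)
not_tagged <- tmp[has_no_tag, ]
not_tagged$tags <- NA

# Combine both parts (tagged rows first, untagged rows last)
all_data <- rbind(tagged, not_tagged)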
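Note that rbind places the untagged rows after all tagged ones, so the original tweet order is lost whenever an untagged tweet is not last. If the order matters, a possible variation is to give every untagged tweet a single NA tag up front, so that one index handles both cases. This is only a sketch: it swaps in base R's `regmatches` and `gregexpr` for `str_extract_all` to avoid the stringr dependency, and it names the new column `tag` to match the desired output in the question.

```r
# Sample data from the question, kept as character rather than factor
tmp <- data.frame(
  tweets = c("this tweet with #onehashtag", "#two hashtags #here", "no hashtags"),
  dummy  = c("random", "other", "column"),
  stringsAsFactors = FALSE
)

# Base R counterpart of str_extract_all: a list with one character vector per tweet
tags <- regmatches(tmp$tweets, gregexpr("#[a-zA-Z0-9]+", tmp$tweets))

# Give tweets without hashtags a single NA so they still occupy one row
tags[lengths(tags) == 0L] <- list(NA_character_)

# Repeat each row once per tag (or once for the NA placeholder)
index <- rep(seq_along(tags), lengths(tags))
result <- tmp[index, ]
result$tag <- unlist(tags)
rownames(result) <- NULL
```

Since every element of `tags` now has at least one entry, no separate handling of untagged rows and no final rbind are needed.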