Regex: creating a data frame from regmatches in R


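I have searched and searched, but I cannot figure out how to turn the output of regmatches into anything I can export. I hope this question is not so specific that it is of no value to the community. I ran into a problem similar to the one in the following link:

However, I do not know how to save, export, or build a data frame from the list that regmatches produces. Ideally, each hashtag would be saved in its own column. But every attempt gives me output like this: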

[[6267]]
character(0)

[[6268]]
[1] "#ASCO15"

[[6269]]
[1] "#FDA"        "#Fast"       "#Track"      "#AML"        "#Pancreatic"    
If I try to export the result of regmatches, I get:

Error in data.frame(character(0), character(0), character(0), character(0),  : 
  arguments imply differing number of rows: 0, 8, 2, 3, 5, 1, 4, 7, 6, 9 
Thanks

Edit: Sorry, I may not have explained this well.

dput(hi)
structure(list(text = c("Hooray ! #Wimbledon2Day has plugged its brain back in at last ! No more sub- Top Gear telly #propertenniscoverage", 
"gone but never forgotten #TopGear ", "The final episode of 'Top Gear' with Jeremy Clarkson is going to break records http://brbr.co/1JCeJYc\312"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L), .Names = "text")
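From this data, I want to pull out the hashtags (the # and the word that follows it) and assign them to columns. The code from the link above handles the first part:

test <- regmatches(hi$text, gregexpr("#(\\d|\\w)+", hi$text))
But when I try to inspect or export it, I get: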

Error in data.frame(c("#Wimbledon2Day", "#propertenniscoverage"), "#TopGear",  : 
  arguments imply differing number of rows: 2, 1, 0
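
The error arises because data.frame() treats each list element as a column, and the elements have different lengths (2, 1 and 0 here). A minimal base-R sketch of one way to get "one hashtag per column" is to pad every element with NA up to the longest length before binding the rows. The column names hashtag_1, hashtag_2 are made up for illustration, and the third tweet is shortened from the dput() output above.

```r
# Sample tweets taken from the dput() output in the question (third tweet shortened)
hi <- data.frame(
  text = c(
    "Hooray ! #Wimbledon2Day has plugged its brain back in at last ! No more sub- Top Gear telly #propertenniscoverage",
    "gone but never forgotten #TopGear ",
    "The final episode of 'Top Gear' with Jeremy Clarkson is going to break records"
  ),
  stringsAsFactors = FALSE
)

# One character vector of hashtags per tweet (possibly empty)
test <- regmatches(hi$text, gregexpr("#(\\d|\\w)+", hi$text))

# Pad every element to the length of the longest one;
# extending a vector with `length<-` fills the new slots with NA
max_tags <- max(lengths(test))
padded <- lapply(test, function(tags) {
  length(tags) <- max_tags
  tags
})

# One row per tweet, one column per hashtag position
# (the hashtag_N column names are hypothetical)
tags_df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(tags_df) <- paste0("hashtag_", seq_len(max_tags))
tags_df
```

Tweets without any hashtag end up as a row of NA values, which keeps the row count aligned with the original data.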

Using the example from the linked post:

foo <- c("RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle","BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012","BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech")

ms <- regmatches(foo, gregexpr("#(\\d|\\w)+", foo))  # extract hashtags from tweet (from other post)
cols <- unique(unlist(ms))                           # get unique hashtags

setNames(data.frame(t(sapply(ms, function(i) cols %in% i))), cols)

#   #London2012 #MullingarShuffle #london2012 #Olympics  #NBC #tech
# 1        TRUE              TRUE       FALSE     FALSE FALSE FALSE
# 2       FALSE             FALSE        TRUE     FALSE FALSE FALSE
# 3       FALSE             FALSE        TRUE      TRUE  TRUE  TRUE
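
Since the question is ultimately about exporting, here is a hedged sketch of writing such an indicator table to CSV and reading it back. It assumes shortened versions of the example tweets (same hashtags), swaps sapply for vapply so the result shape is fixed, and writes to a temporary file; the variable names tag_df, out_file and round_trip are illustrative only.

```r
# Abbreviated versions of the example tweets (same hashtags, shortened text)
foo <- c(
  "What a day at #London2012 and here in mgar #MullingarShuffle",
  "Atos completes delivery of key IT systems #london2012",
  "The #Olympics sets a ratings record for #NBC #london2012 #tech"
)

ms <- regmatches(foo, gregexpr("#(\\d|\\w)+", foo))  # hashtags per tweet
cols <- unique(unlist(ms))                           # unique hashtags

# vapply pins the result to one logical vector of length(cols) per tweet
flags <- t(vapply(ms, function(i) cols %in% i, logical(length(cols))))
tag_df <- setNames(data.frame(flags), cols)

# Write to a temporary file; replace with a real path as needed
out_file <- tempfile(fileext = ".csv")
write.csv(tag_df, out_file, row.names = FALSE)

# check.names = FALSE keeps the leading "#" in the column names when reading back
round_trip <- read.csv(out_file, check.names = FALSE)
```

Without check.names = FALSE, read.csv would rewrite the column names into syntactically valid R names, so the "#" prefix would be lost.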

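If you have a large number of tweets and unique hashtags, you should consider using a sparse matrix. The arules package provides such a sparse matrix object, itemMatrix. You can coerce the list directly into this sparse matrix, without writing out the unique and sapply steps from @LegalizeIt's answer (which is a fine base solution, and I gave it +1).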


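What do you want the data frame to look like? Do you want one column per hashtag and one row per tweet?

Please dput some data, so that a small reproducible example of the data and the code that triggers the error can be put together.

Maybe try rbindlist(df,fill=T) from the data.table package (1.9.5+), which will at least give you a data.frame, although its shape may be quite messy.

You may also want to consider making your hashtags case-insensitive, so that #London2012 and #london2012 are grouped together. Perhaps ms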
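
The case-insensitive grouping suggested in the comment could be sketched as follows. The comment does not say which function was intended, so using base R's tolower() here is an assumption, and the tweets are shortened versions of the example data.

```r
# Abbreviated versions of the example tweets (same hashtags, shortened text)
foo <- c(
  "What a day at #London2012 and here in mgar #MullingarShuffle",
  "Atos completes delivery of key IT systems #london2012",
  "The #Olympics sets a ratings record for #NBC #london2012 #tech"
)

ms <- regmatches(foo, gregexpr("#(\\d|\\w)+", foo))

# Assumption: fold every hashtag to lower case before building the columns,
# so "#London2012" and "#london2012" collapse into one column
ms_lower <- lapply(ms, tolower)
cols <- unique(unlist(ms_lower))

flags <- t(vapply(ms_lower, function(i) cols %in% i, logical(length(cols))))
tag_df <- setNames(data.frame(flags), cols)
tag_df
```

With this normalization the six columns of the original example shrink to five, and all three tweets are flagged under #london2012.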
foo <- c("RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle","BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012","BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech")

ms <- regmatches(foo, gregexpr("#(\\d|\\w)+", foo))  # extract hashtags from tweet (from other post)

library(arules)
im <- as(ms, "itemMatrix")

#you can retrieve the rows like this
as(im,"matrix")
#   #london2012 #London2012 #MullingarShuffle #NBC #Olympics #tech
# 1           0           1                 1    0         0     0
# 2           1           0                 0    0         0     0
# 3           1           0                 0    1         1     1
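
If adding a package is not an option, another compact representation (not from the answers above, just a base-R sketch) is a long table with one row per tweet and hashtag pair. The names long_df, tweet and hashtag are hypothetical, and the tweets are shortened versions of the example data.

```r
# Abbreviated versions of the example tweets (same hashtags, shortened text)
foo <- c(
  "What a day at #London2012 and here in mgar #MullingarShuffle",
  "Atos completes delivery of key IT systems #london2012",
  "The #Olympics sets a ratings record for #NBC #london2012 #tech"
)

ms <- regmatches(foo, gregexpr("#(\\d|\\w)+", foo))

# Repeat each tweet index once per hashtag found in that tweet;
# tweets without hashtags simply contribute no rows
long_df <- data.frame(
  tweet   = rep(seq_along(ms), lengths(ms)),
  hashtag = unlist(ms),
  stringsAsFactors = FALSE
)
long_df
```

This layout stores only the pairs that actually occur, so it grows with the number of hashtags used rather than with tweets times unique hashtags, and it exports cleanly with write.csv.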