Extracting hashtags from multiple tweets using R


I badly need a solution for extracting hashtags from a collection of tweets in R. For example:

[[1]]
[1] "RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle"

[[2]]
[1] "BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012"

[[3]]
[1] "BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech"

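How can I parse this to extract a list of the hashtag words from all of the tweets? A previous solution only shows the hashtags in the first tweet, and the code produces the following error messages: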

> string <-"MonicaSarkar: RT @saultracey: Sun kissed #olmpicrings at #towerbridge #london2012   @ Tower Bridge http://t.co/wgIutHUl"
> 
> [[2]]
Error: unexpected '[[' in "[["
> [1] "ccrews467: RT @BBCNews: England manager Roy Hodgson calls #London2012 a \"wake-up call\": footballers and fans should emulate spirit of #Olympics http://t.co/wLD2VA1K" 
Error: unexpected '[' in "["
> hashtag.regex <- perl("(?<=^|\\s)#\\S+")
> hashtags <- str_extract_all(string, hashtag.regex)
> print(hashtags)
[[1]]
[1] "#olmpicrings" "#towerbridge" "#london2012"
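The errors in the transcript above come from pasting printed list output (`[[2]]`, `[1] "..."`) into the console as if it were code, so only the first tweet is ever assigned to `string`. A minimal sketch of one way around this, assuming the tweets are available as a character vector or as a list of length-one character vectors: the names `tweets`, `tweet_list`, and `hashtag_regex` are illustrative, and base R's `gregexpr(..., perl = TRUE)` with `regmatches` is swapped in for the stringr call used in the question.

```r
# The tweets as a plain character vector, one element per tweet.
tweets <- c(
  "MonicaSarkar: RT @saultracey: Sun kissed #olmpicrings at #towerbridge #london2012   @ Tower Bridge http://t.co/wgIutHUl",
  "ccrews467: RT @BBCNews: England manager Roy Hodgson calls #London2012 a \"wake-up call\": footballers and fans should emulate spirit of #Olympics http://t.co/wLD2VA1K"
)

# If the data arrives as a list of length-one character vectors
# (as in the printed output), flatten it first.
tweet_list <- as.list(tweets)   # stand-in for such a list
tweets <- unlist(tweet_list)

# A '#' that is not preceded by a non-space character, followed by
# one or more non-space characters (Perl-style lookbehind).
hashtag_regex <- "(?<!\\S)#\\S+"

# One character vector of hashtags per tweet.
hashtags <- regmatches(tweets, gregexpr(hashtag_regex, tweets, perl = TRUE))
print(hashtags)
```

Because the pattern is applied to the whole vector, every tweet gets its own element in the result rather than only the first one.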


How about a strsplit and grep version:

> lapply(strsplit(x, ' '), function(w) grep('#', w, value=TRUE))
[[1]]
[1] "#London2012"       "#MullingarShuffle"

[[2]]
[1] "#london2012"

[[3]]
[1] "#Olympics"   "#NBC,"       "#london2012" "#tech"
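Note that splitting on spaces keeps any punctuation attached to a word, which is why the third tweet yields "#NBC," with a trailing comma. A small sketch of one possible cleanup, assuming `x` holds the three example tweets from the question; the trailing-punctuation strip and the stricter `^#` pattern are additions for illustration, not part of the answer above.

```r
x <- c(
  "RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle",
  "BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012",
  "BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech"
)

tags <- lapply(strsplit(x, " ", fixed = TRUE), function(w) {
  # Keep only the words that start with '#'.
  hits <- grep("^#", w, value = TRUE)
  # Drop punctuation stuck to the end of a word, e.g. "#NBC," -> "#NBC".
  gsub("[[:punct:]]+$", "", hits)
})
print(tags)
```

Anchoring on `^#` also avoids picking up words that merely contain a `#` somewhere in the middle, such as a URL with a fragment.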

I don't know how to return multiple results from each string without splitting first, but I bet there is a way!

Use regmatches and gregexpr. This gives you a list with the hashtags for each tweet, assuming a hashtag has the format # followed by any number of letters or digits (I am not that familiar with Twitter):

foo <- c("RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle","BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012","BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech")

regmatches(foo,gregexpr("#(\\d|\\w)+",foo))

[[1]]
[1] "#London2012"       "#MullingarShuffle"

[[2]]
[1] "#london2012"

[[3]]
[1] "#Olympics"   "#NBC"        "#london2012" "#tech"

If you post your previous code, we may be able to tell you where to loop or recurse to cover all the elements of yourdata[[1:n]][1]

Just saying, using a vector inside double square brackets will give you an "attempt to select more than one element" error :)

If an answer satisfactorily answers your question, please accept it, or explain in a comment on the answer why it does not.

@SachaEpskamp, yes, I was too hasty in trying to describe the range of data the OP might be searching. Sorry.
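The question asks for a list of hashtag words across all tweets, while the regmatches and gregexpr approach returns one vector per tweet. A short sketch of how the per-tweet result could be flattened and counted, assuming case should be ignored (so "#London2012" and "#london2012" count as the same tag); the names `tags_per_tweet`, `all_tags`, `unique_tags`, and `tag_counts` are illustrative. The pattern is shortened to `#\\w+` because `\\w` already matches digits.

```r
foo <- c(
  "RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle",
  "BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012",
  "BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech"
)

# One character vector of hashtags per tweet.
tags_per_tweet <- regmatches(foo, gregexpr("#\\w+", foo))

# Flatten to a single vector and normalise case (an assumption).
all_tags <- tolower(unlist(tags_per_tweet))

# Distinct hashtags, in order of first appearance.
unique_tags <- unique(all_tags)

# How often each hashtag occurs, most frequent first.
tag_counts <- sort(table(all_tags), decreasing = TRUE)
print(tag_counts)
```

One caveat: this pattern matches a `#` anywhere in the text, so a URL containing a fragment such as `page#section` would also produce a match. None of the example tweets contain one.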