Regex pattern replacement in R
I'm working with a Twitter dataset in R, and I'm finding it difficult to remove usernames from the tweets. Here are two example tweets from the tweets column of my dataset:
[1] "@danimottale: 2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said."
[2] "@FreeMktMonkey @drleegross Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"
I want to remove/replace all words that start with "@" to get this output:
[1] "2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said."
[2] "Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"
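The gsub call I tried only removes the "@" symbol itself.
What I want is to remove the "@" symbol and the characters that follow it, up to the first space or punctuation mark.
I started by trying to handle just the space case, but to no avail: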
gsub("@.*[:space:]$", "", tweetdata$tweets)
This removes the second tweet entirely.
gsub("@.*[:blank:]$", "", tweetdata$tweets)
This does not change the output at all.
I would greatly appreciate any help.

You can use the following.
\S+ matches any non-whitespace character (1 or more times), and \s then matches a single whitespace character:
gsub('@\\S+\\s', '', noRT$text)
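To see the pattern at work, here is a minimal sketch that applies it to the two sample tweets from the question. The vector name `tweets` is only a stand-in for `noRT$text` (or `tweetdata$tweets`), which is not available here.

```r
# Sample tweets copied from the question; `tweets` is a hypothetical stand-in
# for the real column (noRT$text / tweetdata$tweets).
tweets <- c(
  "@danimottale: 2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said.",
  "@FreeMktMonkey @drleegross Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"
)

# Remove "@", the run of non-whitespace characters after it,
# and the single whitespace character that follows.
cleaned <- gsub("@\\S+\\s", "", tweets)
print(cleaned)
```

Because the pattern requires a whitespace character after the username, a mention at the very end of a tweet would be left in place. If that case matters, the trailing `\\s` could be made optional with `\\s?`.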
Edit: a negated match, @[^ ]+ followed by a single space, also works fine (it uses only the space character).
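The sketch below compares the negated match with an equivalent written using a POSIX class, and reruns the first attempt from the question for contrast. It assumes R's default regex engine (no `perl = TRUE`); `tweets` is again a hypothetical stand-in for the real column.

```r
tweets <- c(
  "@danimottale: 2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said.",
  "@FreeMktMonkey @drleegross Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"
)

# Negated match: "@", one or more characters that are not a space, then a space.
negated <- gsub("@[^ ]+ ", "", tweets)

# Same idea with a POSIX class. The class name needs its own brackets
# inside the bracket list: "[[:space:]]", not "[:space:]".
posix <- gsub("@[^[:space:]]+[[:space:]]", "", tweets)

# First attempt from the question, kept for comparison. Here "[:space:]" is
# read as a list of the literal characters ":", "s", "p", "a", "c", "e".
attempt <- gsub("@.*[:space:]$", "", tweets)

print(identical(negated, posix))
print(attempt)
```

As far as I can tell, this also explains the behavior reported in the question: `.*` is greedy and `$` anchors the match at the end of the string, so the second tweet, which ends in "s" (from "thanks"), is matched from its first "@" to the end and removed entirely, while the first tweet ends in "." and is left alone.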
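The regex approach here is simple and straightforward. I'm adding a second option that uses qdap's genX function to remove the text between any two boundaries. It lets you supply the left and right boundaries: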
library(qdap)
genX(x, "@", "\\s")
## [1] "2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said."
## [2] "Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"
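For readers without qdap installed, the same "remove everything between a left and a right boundary" idea can be sketched in base R. The helper `remove_between` below is hypothetical and is not how genX is implemented; it only assumes that both boundaries are given as regex fragments, and genX may clean up its output in ways this sketch does not.

```r
# Hypothetical helper, not part of qdap: drop everything from `left` up to and
# including the nearest following `right`. Both arguments are regex fragments.
remove_between <- function(x, left, right) {
  # ".*?" is lazy, so the match stops at the first occurrence of `right`.
  pattern <- paste0(left, ".*?", right)
  gsub(pattern, "", x, perl = TRUE)
}

x <- c(
  "@danimottale: 2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said.",
  "@FreeMktMonkey @drleegross Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"
)

result <- remove_between(x, "@", "\\s")
print(result)
```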
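Thanks a lot, very helpful. It's a pity I can't upvote it because I'm new.
@user3722736 If this solution fits your needs, you can accept it by clicking the check mark on the left, below the vote count.
Use sub instead of gsub, because there is only one replacement.
There are multiple replacements in the second string.
Thanks, nice to see another solution. I wish I could upvote your answer, but I don't have the reputation yet.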
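The sub versus gsub point from the comments can be checked on the second sample tweet, which contains two usernames. This is a small sketch; `tweet` is a hypothetical variable holding that one string.

```r
# Second sample tweet from the question; it contains two "@" mentions.
tweet <- "@FreeMktMonkey @drleegross Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"

# sub() replaces only the first match, so the second username survives.
first_only <- sub("@\\S+\\s", "", tweet)

# gsub() replaces every match, so both usernames are removed.
all_matches <- gsub("@\\S+\\s", "", tweet)

print(first_only)
print(all_matches)
```

So sub is only sufficient when each tweet is known to contain at most one username.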
Negated-match version of the pattern from the first answer:
gsub('@[^ ]+ ', '', noRT$text)