Separating twitter status / hyperlink / date with R
I want to automatically split the following tweets so that the tweet text itself, the hyperlink, and the date end up in three separate columns. Can anyone help? My dataset is called DB_YS and it is a txt file. Here are some of the tweets from my data frame:
Thank you, everyone! indyref http://t.co/1kTzqjyGE7 Sep 18, 2014
As the polls close, total likes on the @YesScotland Facebook page have passed David Cameron s one. indyref voteYes http://t.co/x7IoB1EtfY Sep 18, 2014
We can be proud of indyref, which has seen a flourishing of Scotland’s self-confidence as a nation VoteYes http://t.co/1OqxvbpoS9 Sep 18, 2014
We can afford world-class public services. A Yes vote means we can strengthen our NHS. VoteYes indyref http://t.co/D9Vn5OqStV Sep 18, 2014
This is a once in a lifetime opportunity to choose a new and better path for Scotland VoteYes indyref http://t.co/9knT6Mx4vZ Sep 18, 2014
Our young people shouldn t have to leave to find decent jobs. VoteYes indyref http://t.co/vAE164f0Oy Sep 18, 2014
Here is a base-package solution using a series of regular expressions:
# Assume df is your data frame with a column called txt
# Match text until the beginning of the URL
tweet.regex <- regexpr("^.*(?=http)", df$txt, perl = TRUE)
# Extract tweet text (stop index is start + match length - 1)
tweet <- substr(df$txt, tweet.regex, tweet.regex + attr(tweet.regex, "match.length") - 1)
# Match text from the beginning of the URL to the next space
url.regex <- regexpr("http[^ ]+(?= )", df$txt, perl = TRUE)
# Extract URL
url <- substr(df$txt, url.regex, url.regex + attr(url.regex, "match.length") - 1)
# Match the date at the end of the string
date.regex <- regexpr("[A-Za-z]+ \\d+, \\d{4} *$", df$txt, perl = TRUE)
# Extract date
date <- substr(df$txt, date.regex, date.regex + attr(date.regex, "match.length") - 1)
# Combine results
tweet.df <- data.frame(tweet, url, date, stringsAsFactors = FALSE)
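As a side note, an equivalent and arguably simpler base-R idiom (a sketch, not part of the original answer) uses regmatches(), which returns the matched substring directly and avoids the substr() index arithmetic:

```r
# Minimal example data frame standing in for the asker's DB_YS data
df <- data.frame(
  txt = "Thank you, everyone! indyref http://t.co/1kTzqjyGE7 Sep 18, 2014",
  stringsAsFactors = FALSE
)
# Tweet text: everything before the " http" that starts the URL
tweet <- regmatches(df$txt, regexpr("^.*(?= http)", df$txt, perl = TRUE))
# URL: from "http" up to the next space (or end of string)
url <- regmatches(df$txt, regexpr("http[^ ]+", df$txt))
# Date: letters, day, comma, four-digit year at the end of the string
date <- regmatches(df$txt, regexpr("[A-Za-z]+ \\d+, \\d{4} *$", df$txt))
tweet.df <- data.frame(tweet, url, date, stringsAsFactors = FALSE)
```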
Here is a solution using the stringr package:
library("stringr")
dat <- c("Thank you, everyone! indyref http://t.co/1kTzqjyGE7 Sep 18, 2014 ",
"As the polls close, total likes on the @YesScotland Facebook page have passed David Cameron s one. indyref voteYes http://t.co/x7IoB1EtfY Sep 18, 2014 ",
"We can be proud of indyref, which has seen a flourishing of Scotland’s self-confidence as a nation VoteYes http://t.co/1OqxvbpoS9 Sep 18, 2014 ",
"We can afford world-class public services. A Yes vote means we can strengthen our NHS. VoteYes indyref http://t.co/D9Vn5OqStV Sep 18, 2014 ",
"This is a once in a lifetime opportunity to choose a new and better path for Scotland VoteYes indyref http://t.co/9knT6Mx4vZ Sep 18, 2014 ",
"Our young people shouldn t have to leave to find decent jobs. VoteYes indyref http://t.co/vAE164f0Oy Sep 18, 2014 ")
dates <- str_extract(dat, "[A-Z]{1}[a-z]{2} [0-9]{1,2}, [0-9]{4}")
url <- str_extract(dat, "http://t.co/[0-9A-Za-z]{10}")
text <- gsub(" indyref.+", "", dat)
df <- data.frame(dates, text, url, stringsAsFactors=F)
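The three separate extractions above can also be done in a single pass with str_match(), using one regex with three capture groups for the text, the URL, and the date. This is a sketch, not part of the original answer; the names m and df2 are illustrative:

```r
library(stringr)

# Two of the sample tweets from the question
dat <- c(
  "Thank you, everyone! indyref http://t.co/1kTzqjyGE7 Sep 18, 2014 ",
  "Our young people shouldn t have to leave to find decent jobs. VoteYes indyref http://t.co/vAE164f0Oy Sep 18, 2014 "
)
# Group 1: lazy match of the text; group 2: the URL; group 3: the date.
# str_match() returns a matrix: column 1 is the full match, columns 2-4
# are the capture groups.
m <- str_match(dat, "^(.*?) ?(http[^ ]+) ([A-Za-z]+ [0-9]{1,2}, [0-9]{4}) *$")
df2 <- data.frame(text = m[, 2], url = m[, 3], dates = m[, 4],
                  stringsAsFactors = FALSE)
```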
Here is a solution using the "stringr" package. It builds on Corey's answer, but corrects a few errors that crop up if you have non-standard tweets.
It assumes you have a .txt file named DB_YS.txt containing all the tweets in raw text format, and that the "stringr" package is already installed; otherwise you must first run install.packages("stringr").
library(stringr)
# Load the data into R
RawData
The regex you use to grab the date will also match a date that is part of the tweet text. I'd suggest adding a $ at the end to make sure only a date at the end of the string is matched.
Thanks a lot! :) I do have some cases where the date doesn't end up in the right column. For example, for this tweet everything lands in the "tweet" column: Huge congratulations to everyone at @Team_Scotland for hitting their medal target in style! And more still to come... GoScotland Jul 29, 2014. Any idea how to fix that?
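One way to handle tweets like that, which have a date but no URL, is to anchor the extraction on the trailing date, treat the URL as optional, and take whatever remains as the text. This is a sketch under those assumptions, not code from the answers above:

```r
# One tweet with a URL and one without (the @Team_Scotland case)
dat <- c(
  "Thank you, everyone! indyref http://t.co/1kTzqjyGE7 Sep 18, 2014",
  "Huge congratulations to everyone at @Team_Scotland! GoScotland Jul 29, 2014"
)
# Date: anchored to the end of the string, so dates inside the text are ignored
date <- regmatches(dat, regexpr("[A-Za-z]{3} [0-9]{1,2}, [0-9]{4} *$", dat))
# URL: optional; NA when the tweet contains no link
url <- vapply(regmatches(dat, gregexpr("http[^ ]+", dat)),
              function(x) if (length(x)) x[1] else NA_character_,
              character(1))
# Text: strip everything from the URL onward, or failing that the trailing date
text <- trimws(sub(" ?http[^ ]+.*$|[A-Za-z]{3} [0-9]{1,2}, [0-9]{4} *$", "", dat))
tweet.df <- data.frame(text, url, date, stringsAsFactors = FALSE)
```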