Separating twitter status / hyperlink / date with R
I want to automatically split the following tweets so that the tweet text itself, the hyperlink, and the date end up in three separate columns. Can anyone help? My dataset is called DB_YS and it is a txt file. Here are some of the tweets from my data frame:
Thank you, everyone! indyref http://t.co/1kTzqjyGE7 Sep 18, 2014
As the polls close, total likes on the @YesScotland Facebook page have passed David Cameron s one. indyref voteYes http://t.co/x7IoB1EtfY Sep 18, 2014
We can be proud of indyref, which has seen a flourishing of Scotland’s self-confidence as a nation VoteYes http://t.co/1OqxvbpoS9 Sep 18, 2014
We can afford world-class public services. A Yes vote means we can strengthen our NHS. VoteYes indyref http://t.co/D9Vn5OqStV Sep 18, 2014
This is a once in a lifetime opportunity to choose a new and better path for Scotland VoteYes indyref http://t.co/9knT6Mx4vZ Sep 18, 2014
Our young people shouldn t have to leave to find decent jobs. VoteYes indyref http://t.co/vAE164f0Oy Sep 18, 2014
Here is a base-package solution using a series of regular expressions:
# Assume df is your data frame with a column called txt
# Match text until the beginning of the URL
tweet.regex <- regexpr("^.*(?=http)", df$txt, perl = TRUE)
# Extract tweet text (stop index is start + match length - 1)
tweet <- substr(df$txt, tweet.regex, tweet.regex + attr(tweet.regex, "match.length") - 1)
# Match text from the beginning of the URL to the next space
url.regex <- regexpr("http[^ ]+(?= )", df$txt, perl = TRUE)
# Extract URL
url <- substr(df$txt, url.regex, url.regex + attr(url.regex, "match.length") - 1)
# Match the date at the end of the string
date.regex <- regexpr("[A-Za-z]+ \\d+, \\d{4} *$", df$txt, perl = TRUE)
# Extract date
date <- substr(df$txt, date.regex, date.regex + attr(date.regex, "match.length") - 1)
# Combine results
tweet.df <- data.frame(tweet, url, date, stringsAsFactors = FALSE)
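As a side note, an equivalent and arguably simpler base-R idiom (a sketch, not part of the original answer) uses regmatches(), which returns the matched substring directly and avoids the substr() index arithmetic:

```r
# Minimal example data frame standing in for the asker's DB_YS data
df <- data.frame(
  txt = "Thank you, everyone! indyref http://t.co/1kTzqjyGE7 Sep 18, 2014",
  stringsAsFactors = FALSE
)
# Tweet text: everything before the " http" that starts the URL
tweet <- regmatches(df$txt, regexpr("^.*(?= http)", df$txt, perl = TRUE))
# URL: from "http" up to the next space (or end of string)
url <- regmatches(df$txt, regexpr("http[^ ]+", df$txt))
# Date: letters, day, comma, four-digit year at the end of the string
date <- regmatches(df$txt, regexpr("[A-Za-z]+ \\d+, \\d{4} *$", df$txt))
tweet.df <- data.frame(tweet, url, date, stringsAsFactors = FALSE)
```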
Here is a solution using the stringr package:
library("stringr")
dat <- c("Thank you, everyone! indyref http://t.co/1kTzqjyGE7 Sep 18, 2014 ",
"As the polls close, total likes on the @YesScotland Facebook page have passed David Cameron s one. indyref voteYes http://t.co/x7IoB1EtfY Sep 18, 2014 ",
"We can be proud of indyref, which has seen a flourishing of Scotland’s self-confidence as a nation VoteYes http://t.co/1OqxvbpoS9 Sep 18, 2014 ",
"We can afford world-class public services. A Yes vote means we can strengthen our NHS. VoteYes indyref http://t.co/D9Vn5OqStV Sep 18, 2014 ",
"This is a once in a lifetime opportunity to choose a new and better path for Scotland VoteYes indyref http://t.co/9knT6Mx4vZ Sep 18, 2014 ",
"Our young people shouldn t have to leave to find decent jobs. VoteYes indyref http://t.co/vAE164f0Oy Sep 18, 2014 ")
dates <- str_extract(dat, "[A-Z]{1}[a-z]{2} [0-9]{1,2}, [0-9]{4}")
url <- str_extract(dat, "http://t.co/[0-9A-Za-z]{10}")
text <- gsub(" indyref.+", "", dat)
df <- data.frame(dates, text, url, stringsAsFactors=F)
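The three separate extractions above can also be done in a single pass with str_match(), using one regex with three capture groups for the text, the URL, and the date. This is a sketch, not part of the original answer; the names m and df2 are illustrative:

```r
library(stringr)

# Two of the sample tweets from the question
dat <- c(
  "Thank you, everyone! indyref http://t.co/1kTzqjyGE7 Sep 18, 2014 ",
  "Our young people shouldn t have to leave to find decent jobs. VoteYes indyref http://t.co/vAE164f0Oy Sep 18, 2014 "
)
# Group 1: lazy match of the text; group 2: the URL; group 3: the date.
# str_match() returns a matrix: column 1 is the full match, columns 2-4
# are the capture groups.
m <- str_match(dat, "^(.*?) ?(http[^ ]+) ([A-Za-z]+ [0-9]{1,2}, [0-9]{4}) *$")
df2 <- data.frame(text = m[, 2], url = m[, 3], dates = m[, 4],
                  stringsAsFactors = FALSE)
```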
Here is a solution using the "stringr" package. It builds on Corey's answer, but corrects a few errors that crop up if you have non-standard tweets.
It assumes you have a .txt file named DB_YS.txt containing all the tweets in raw text format, and that the "stringr" package is already installed; otherwise you must first run install.packages("stringr").
library(stringr)
# Load the data into R
RawData
The regex you use to grab the date will also match a date that is part of the tweet text. I'd suggest adding a $ at the end to make sure only a date at the end of the string is matched.
Thanks a lot! :) I do have some cases where the date doesn't end up in the right column. For example, for this tweet everything lands in the "tweet" column: Huge congratulations to everyone at @Team_Scotland for hitting their medal target in style! And more still to come... GoScotland Jul 29, 2014. Any idea how to fix that?
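One way to handle tweets like that, which have a date but no URL, is to anchor the extraction on the trailing date, treat the URL as optional, and take whatever remains as the text. This is a sketch under those assumptions, not code from the answers above:

```r
# One tweet with a URL and one without (the @Team_Scotland case)
dat <- c(
  "Thank you, everyone! indyref http://t.co/1kTzqjyGE7 Sep 18, 2014",
  "Huge congratulations to everyone at @Team_Scotland! GoScotland Jul 29, 2014"
)
# Date: anchored to the end of the string, so dates inside the text are ignored
date <- regmatches(dat, regexpr("[A-Za-z]{3} [0-9]{1,2}, [0-9]{4} *$", dat))
# URL: optional; NA when the tweet contains no link
url <- vapply(regmatches(dat, gregexpr("http[^ ]+", dat)),
              function(x) if (length(x)) x[1] else NA_character_,
              character(1))
# Text: strip everything from the URL onward, or failing that the trailing date
text <- trimws(sub(" ?http[^ ]+.*$|[A-Za-z]{3} [0-9]{1,2}, [0-9]{4} *$", "", dat))
tweet.df <- data.frame(text, url, date, stringsAsFactors = FALSE)
```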