R: Separating two words joined by a dot

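I have a large data frame containing news articles. I noticed that in some articles two words are joined by a dot, as in the following example:

    The government said the exit is important.

I am going to do some topic modeling, so I need to separate every word.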

This is the code I am using to separate these words:

    #String example
    test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences")

    #Code to separate the words
    test <- do.call(paste, as.list(strsplit(test, "\\.")[[1]]))

    #This is what I get
    > test
    [1] "i need to separate the words connected by dots  however, I need to keep having the dots separating sentences"
One last note

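My data frame consists of 17,000 articles; all of the text is lowercase. I have only given a small example of the problem I run into when trying to separate two words joined by a dot. Also, is there a way to use strsplit on a list?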

You can use

    test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences. Look at http://google.com for s.0.m.e more details.")
    # Replace each dot that is in between word characters
    gsub("\\b\\.\\b", " ", test, perl=TRUE)
    # Replace each dot that is in between letters
    gsub("(?<=\\p{L})\\.(?=\\p{L})", " ", test, perl=TRUE)
    # Replace each dot that is in between word characters, but not in URLs
    gsub("(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b", " ", test, perl=TRUE)
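
Since gsub is vectorized over its input, the same call can presumably be applied to a whole column of articles at once. Below is a minimal sketch, assuming a hypothetical data frame `articles` with a character column `text` (both names are illustrative and do not come from the question):

```r
# Hypothetical data frame standing in for the real collection of articles
articles <- data.frame(
  text = c("i need.to separate words. a new sentence",
           "see http://google.com for more.details"),
  stringsAsFactors = FALSE
)

# Skip anything that looks like a URL, then replace dots
# that sit directly between two word characters with a space
url_safe_dot <- "(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b"
articles$text <- gsub(url_safe_dot, " ", articles$text, perl = TRUE)

articles$text
```

Sentence-ending dots stay in place because they are followed by a space or the end of the string, so there is no word boundary right after them.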
Details

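  • \b\.\b
    - a dot enclosed with word boundaries, i.e. the characters immediately before and after the dot must be word characters (a letter, digit, or underscore)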

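  • (?<=\p{L})\.(?=\p{L})
    - a dot that is immediately preceded and immediately followed by a letter

Comments:

Try

    gsub("\\b\\.\\b", " ", test, perl=TRUE)

This replaces dots that sit between letters/digits/underscores with a space. If this is not exactly what you need, could you explain in more detail the context in which the dots should be removed?

It works. Is it possible to apply this code without modifying URLs? My data frame consists of different news articles that contain some URLs. I would like to keep them, but this code will certainly change them.

Please provide an example and update the question.

I will post a new question, since you have answered this one correctly. Thanks!

No, please don't, I will post the answer here.
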
The three gsub calls in the answer return, in order:

    [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s 0 m e more details."
    [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s.0.m e more details."
    [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google.com for s 0 m e more details."
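
On the strsplit part of the question: strsplit accepts a character vector and returns a list with one element per input string, so for a list of strings, wrapping it in lapply should be enough. A minimal sketch, assuming a hypothetical list `docs` of already cleaned strings (the name and contents are illustrative only):

```r
# Hypothetical list of cleaned article texts
docs <- list("i need to separate the words",
             "the dots separate sentences. like this")

# Split each string on runs of whitespace;
# [[1]] unwraps the one-element list that strsplit returns per string
tokens <- lapply(docs, function(x) strsplit(x, "\\s+")[[1]])

tokens[[1]]
```

Each element of `tokens` is then a character vector of words for one document, which is a common starting shape for topic modeling input.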