How to read a text file in R as a single line


regex r text


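I'm trying to process a text file. Overall, I have a corpus that I want to analyze. To create a corpus object with the tm package (a text-mining package for R), I need to turn this paragraph into one giant vector so that it is read correctly.

I have a paragraph: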

          Commercial exploitation over the past two hundred years drove                  
          the great Mysticete whales to near extinction.  Variation in                   
          the sizes of populations prior to exploitation, minimal                        
          population size during exploitation and current population                     
          sizes permit analyses of the effects of differing levels of                    
          exploitation on species with different biogeographical                         
          distributions and life-history characteristics.

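I used the scan and readLines methods, and they split the text like this:

[28] "Commercial exploitation over the past two hundred years drove"
[29] "the great Mysticete whales to near extinction.  Variation in"
[30] "the sizes of populations prior to exploitation, minimal"

Is there a way to get rid of the line breaks, or to read the text file as one giant vector?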

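All the solutions posted so far work well, thank you.

I ran into the same problem a while ago and found a workaround: read the individual lines, then paste them together, which removes the "\n" newline characters: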

filename <- "tmp.txt"
# Read the file line by line, then join the lines with a single space
paste0(readLines(filename), collapse = " ")
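The same idea can be tried without an existing tmp.txt. The following is only a minimal sketch: the temporary file and its two sample lines are made up for illustration, and the `trimws()` step is an addition (not part of the answer above) that strips the padding around each line before joining.

```r
# Create a small sample file so the sketch is self-contained (hypothetical content)
tmp <- tempfile(fileext = ".txt")
writeLines(c("   Commercial exploitation over the past   ",
             "   two hundred years drove the great     "), tmp)

lines <- readLines(tmp)                    # one vector element per line
lines <- trimws(lines)                     # drop leading/trailing padding on each line
one_line <- paste(lines, collapse = " ")   # join everything into a single string

length(one_line)                           # the result is a character vector of length one
```

Whether to trim depends on the goal: for building a corpus the padding is usually noise, but skip `trimws()` if the original spacing matters.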

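Specify a sufficiently large number of characters to read (100000 in this example), for instance as the nchars argument of readChar.

If you need to do a lot of processing on the file while reading it, the read may take a long time. Consider reading the file in unchanged and making the changes afterwards.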
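A minimal sketch of this approach, on the assumption that the character count is meant for `readChar()` (the function named in the comments of this thread). The temporary file and its contents are invented for the example, and replacing the line breaks afterwards is an extra step rather than something the answer prescribes.

```r
# Sample file for illustration only (hypothetical content)
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)

# nchars is an upper bound: a shorter file is simply read in full
raw_text <- readChar(tmp, nchars = 100000)

# The result still contains the line breaks; replace each run with one space
one_line <- gsub("[\r\n]+", " ", raw_text)
one_line <- trimws(one_line)   # drop the space left behind by the final newline
```

The pattern `[\r\n]+` is used so that the sketch behaves the same whether the file has Unix or Windows line endings.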

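The stringi package has a function for this specific operation. Its authors write in C, so their functions are nice and fast.

So, assuming you have read the file in and named it txt: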

library(stringi)
stri_flatten(txt)
# [1] "          Commercial exploitation over the past two hundred years drove                  \n          the great Mysticete whales to near extinction.  Variation in                   \n          the sizes of populations prior to exploitation, minimal                        \n          population size during exploitation and current population                     \n          sizes permit analyses of the effects of differing levels of                    \n          exploitation on species with different biogeographical                         \n          distributions and life-history characteristics."

The string still has the same formatting; it has just been flattened. You can check this by printing it with cat:

cat(stri_flatten(txt))
          Commercial exploitation over the past two hundred years drove                  
          the great Mysticete whales to near extinction.  Variation in                   
          the sizes of populations prior to exploitation, minimal                        
          population size during exploitation and current population                     
          sizes permit analyses of the effects of differing levels of                    
          exploitation on species with different biogeographical                         
          distributions and life-history characteristics.
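If the goal is a single line with no padding, `stri_flatten()` also takes a `collapse` argument for the separator. A small sketch, assuming the stringi package is installed; the two-element `txt` vector here is a made-up stand-in for the lines of the real file, and the trimming step is an addition to the answer above.

```r
library(stringi)

# Stand-in for lines read from a file (hypothetical content)
txt <- c("   Commercial exploitation   ", "   drove the great whales   ")

# Trim each element, then join the pieces with a single space
flat <- stri_flatten(stri_trim_both(txt), collapse = " ")

cat(flat, "\n")
```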

This will read the whole file into a character vector of length one.
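The code for this answer did not survive in the text, so the following is only a sketch of one common way to get that result, not necessarily what the answerer wrote: ask `file.info()` for the file size and pass it to `readChar()`. The temporary file and its contents are made up for the example.

```r
# Sample file for illustration only (hypothetical content)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Commercial exploitation over the past",
             "two hundred years drove"), tmp)

size <- file.info(tmp)$size                          # file size in bytes
x <- readChar(tmp, nchars = size, useBytes = TRUE)   # read exactly that many bytes

length(x)                                            # a character vector of length one
```

Using the real file size avoids having to guess an upper bound for the number of characters.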

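I'm not very familiar with R, but can't you loop over the lines and append them to a single string?

I'm a beginner in R. I know a lot of people use the apply functions for looping. I can try your suggestion. Thanks for the good idea.

@krammer So I did some more searching, and you can also use readChar.

Why would you want to do this? A single string is hard to work with, to say the least.

Thanks Richard! I didn't know there was a stringi package.

@user3426338 - I'd check it out. It takes a minute to learn the functions because there are so many of them, and they are all very fast.

Thanks. I decided to do it from the Linux command line. I have about 5700 files to preprocess, and that was simply the easiest way, but this is good to know for the future.

This solution sounds very good. But how do I write the same output to a file? I used the write command, and there is a blank line after every line.

@JaneshDevkota To write a character vector to a file, try cat, e.g. cat(charVector, file = "textfile.txt", append = F, fill = F). When append is false, it overwrites the file. When fill is false, no newlines or carriage returns are added (including at EOL and EOF), which may be a problem for some programs. But all the control is in your hands.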
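The `cat()` suggestion from the last comment can be sketched as follows. The string and the output path are placeholders invented for the example; the explicit "\n" is there because `cat()` does not add a trailing newline on its own.

```r
# Placeholder for the flattened text (hypothetical content)
one_line <- "Commercial exploitation over the past two hundred years drove"

out <- tempfile(fileext = ".txt")

# sep = "" keeps cat() from inserting a space before the newline;
# append = FALSE overwrites any existing file
cat(one_line, "\n", file = out, sep = "", append = FALSE)

readLines(out)   # the file contains exactly one line
```

Leaving out the "\n" writes the text with no end-of-line marker at all, which is what the comment warns some programs may not handle.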