R: longest line in a text dataset [r, text-mining, tm]

I am looking for a way to find the length of the longest line in a text file.

For example, consider a simple dataset from the tm package.

install.packages("tm")
library(tm)
txt <- system.file("texts", "txt", package = "tm") 

ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                readerControl = list(language = "lat"))

length(ovid)
[1] 5
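Actually, it is the fourth text that is the longest, once we remove the whitespace padding. Here is how. Note that much of the work comes from the difficulty of getting the texts out of a tm (V)Corpus object, something that has been asked about many times before.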

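Note that I am interpreting your question about "lines" as referring to the five documents, each of which consists of multiple lines (each document is a character vector of length between 16 and 18). I hope I have interpreted this correctly.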

texts <- sapply(ovid$content, "[[", "content")
str(texts)
## List of 5
## $ : chr [1:16] "    Si quis in hoc artem populo non novit amandi," "         hoc legat et lecto carmine doctus amet." "    arte citae veloque rates remoque moventur," "         arte leves currus: arte regendus amor." ...
## $ : chr [1:17] "    quas Hector sensurus erat, poscente magistro" "         verberibus iussas praebuit ille manus." "    Aeacidae Chiron, ego sum praeceptor Amoris:" "         saevus uterque puer, natus uterque dea." ...
## $ : chr [1:17] "    vera canam: coeptis, mater Amoris, ades!" "    este procul, vittae tenues, insigne pudoris," "         quaeque tegis medios, instita longa, pedes." "    nos venerem tutam concessaque furta canemus," ...
## $ : chr [1:17] "    scit bene venator, cervis ubi retia tendat," "         scit bene, qua frendens valle moretur aper;" "    aucupibus noti frutices; qui sustinet hamos," "         novit quae multo pisce natentur aquae:" ...
## $ : chr [1:18] "    mater in Aeneae constitit urbe sui." "    seu caperis primis et adhuc crescentibus annis," "         ante oculos veniet vera puella tuos:" "    sive cupis iuvenem, iuvenes tibi mille placebunt." ...
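That is a list of five character vectors, one per document. Once the padding has been trimmed and each document pasted into a single string (the code for those steps appears further down), it looks more like what we need, and we can find out which document is longest: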

nchar(texts)
## [1] 600 621 644 668 622
which.max(nchar(texts))
## [1] 4

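Comments:

which.max(nchar(ovid))?

Thanks for the suggestion, but I'm afraid it does not return the correct answer.

Do you want the maximum length of the five elements? That is, max(lengths(lapply(ovid, as.character)))?

Actually, that command returns the length of the longest character vector in the list ovid. The object ovid is a list of character vectors, and the fifth element of the list has 18 elements, three of which are empty strings. Look at lapply(ovid, as.character) and it will be clear.

@KenBenoit You are right. Now I see my mistake. Sorry, everyone, for stating the question badly. I really appreciate your efforts.

Your interpretation is correct. The fault is mine. I had confused the five documents with "lines", thinking of them as lines of length 16, 18, etc. So the real target of the question is the lengths of the character vectors that make up the five documents. Thank you for helping me understand.

You're welcome. Sounds like this is an answer worth accepting :-)

Absolutely. It makes sense, too. It helped me take a difficult first step in the field of data mining!
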
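The distinction worked out in the comments, between the number of lines in each document and the number of characters in it, can be illustrated on a small made-up list. This is only a sketch in base R: `docs` and its contents are hypothetical stand-ins for what lapply(ovid, as.character) would return, not the actual Ovid texts.

```r
# Hypothetical stand-in for lapply(ovid, as.character):
# a list of character vectors, one vector per document.
docs <- list(
  c("first line", "", "third line"),
  c("one", "two", "three", "four")
)

# Number of lines (vector elements) per document; empty strings count too.
lines_per_doc <- lengths(docs)
most_lines <- which.max(lines_per_doc)

# Number of characters per document, summed over its lines.
chars_per_doc <- sapply(docs, function(d) sum(nchar(d)))
most_chars <- which.max(chars_per_doc)

print(lines_per_doc)
print(chars_per_doc)
```

In this toy data the document with the most lines is not the one with the most characters, which is why the two readings of "longest" can give different answers.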
# need to trim leading and trailing whitespace
texts <- lapply(texts, stringi::stri_trim_both)
texts[1]
## [[1]]
## [1] "Si quis in hoc artem populo non novit amandi,"     "hoc legat et lecto carmine doctus amet."          
## [3] "arte citae veloque rates remoque moventur,"        "arte leves currus: arte regendus amor."           
## [5] ""                                                  "curribus Automedon lentisque erat aptus habenis," 
## [7] "Tiphys in Haemonia puppe magister erat:"           "me Venus artificem tenero praefecit Amori;"       
## [9] "Tiphys et Automedon dicar Amoris ego."             "ille quidem ferus est et qui mihi saepe repugnet:"
## [11] ""                                                  "sed puer est, aetas mollis et apta regi."         
## [13] "Phillyrides puerum cithara perfecit Achillem,"     "atque animos placida contudit arte feros."        
## [15] "qui totiens socios, totiens exterruit hostes,"     "creditur annosum pertimuisse senem."              

# now paste them together to make a single character vector of the five documents
texts <- sapply(texts, paste, collapse = "\n")
str(texts)
##  chr [1:5] "Si quis in hoc artem populo non novit amandi,\nhoc legat et lecto carmine doctus amet.\narte citae veloque rates remoque movent"| __truncated__ ...
cat(texts[1])
## Si quis in hoc artem populo non novit amandi,
## hoc legat et lecto carmine doctus amet.
## arte citae veloque rates remoque moventur,
## arte leves currus: arte regendus amor.
## 
## curribus Automedon lentisque erat aptus habenis,
## Tiphys in Haemonia puppe magister erat:
## me Venus artificem tenero praefecit Amori;
## Tiphys et Automedon dicar Amoris ego.
## ille quidem ferus est et qui mihi saepe repugnet:
##     
## sed puer est, aetas mollis et apta regi.
## Phillyrides puerum cithara perfecit Achillem,
## atque animos placida contudit arte feros.
## qui totiens socios, totiens exterruit hostes,
## creditur annosum pertimuisse senem.
nchar(texts)
## [1] 600 621 644 668 622
which.max(nchar(texts))
## [1] 4
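
If the question is read literally, as asking for the longest single line rather than the longest document, a similar approach works per line. The following is a minimal sketch under stated assumptions: it uses base R's trimws() in place of stringi::stri_trim_both, and `docs` is a hypothetical two-document stand-in for the untrimmed list built above, not the full corpus.

```r
# Hypothetical stand-in for the untrimmed list of documents
# (a list of character vectors with leading whitespace padding).
docs <- list(
  c("    Si quis in hoc artem populo non novit amandi,",
    "         hoc legat et lecto carmine doctus amet."),
  c("    vera canam: coeptis, mater Amoris, ades!",
    "")
)

# Trim leading and trailing whitespace from every line (base R only).
trimmed <- lapply(docs, trimws)

# Length of the longest line within each document.
longest_per_doc <- sapply(trimmed, function(d) max(nchar(d)))

# Longest line across all documents.
all_lines <- unlist(trimmed)
longest_line <- all_lines[which.max(nchar(all_lines))]

print(longest_per_doc)
print(longest_line)
```

Note that which.max() returns only the first maximum, so ties between equally long lines would need separate handling.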