Reading the first two lines of each document in a corpus in R
I'm having trouble working out how to read the first two lines of each document in a corpus in R. The first two lines contain the headlines of the news articles I want to analyze. I want to search the headlines (not the rest of each article) for the word "abortion". Here is the code I used to create the corpus:
myCorp <- corpus(readtext(file='~/R/win-library/3.3/quanteda/Abortion/1972/*'))
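The readLines function requires a connection object as its argument. Since the corpus function does not return a connection, you need to create a connection to each string in the corpus inside a loop: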
myCorp <- Corpus(quanteda::data_corpus_inaugural)
for (text in myCorp$documents$texts) {
  # readLines needs a connection, so wrap each document's text in one
  con <- textConnection(text)
  first_lines <- readLines(con, n = 2)
  close(con)
  # Test if the word "speaker" is in the two lines
  if (any(grepl(pattern = "speaker", x = first_lines, ignore.case = TRUE))) {
    print(first_lines)
  }
}
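The same check can also be written without opening a connection per document. What follows is only a minimal base R sketch, assuming the texts are available as a named character vector with lines separated by \n. The vector docs, its contents, and the helper first_n_lines are made up for illustration and are not part of the original answer.

```r
# Hypothetical example documents; the first two lines of each are the headline.
docs <- c(
  a1 = "Court hears\nabortion case\nBody text of the first article.",
  a2 = "City council\nbudget vote\nThe word abortion appears only in the body."
)

# Illustrative helper: return the first n lines of a single string.
first_n_lines <- function(x, n = 2) {
  lines <- strsplit(x, "\n", fixed = TRUE)[[1]]
  paste(head(lines, n), collapse = "\n")
}

# Headline (first two lines) of every document, keeping the document names.
headlines <- vapply(docs, first_n_lines, character(1))

# TRUE for documents whose headline mentions the search term.
has_term <- grepl("abortion", headlines, ignore.case = TRUE)
names(has_term) <- names(headlines)
has_term
```

Only a1 should be flagged here, because in a2 the term occurs in the body rather than in the first two lines.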
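@rconradin's solution works, but as you will note in ?corpus, we strongly discourage accessing the internals of a corpus object directly (since they will change soon). Avoiding the loop is also faster.
I'd suggest two options:
Regular expression replacement to keep only the first two lines
If the first two lines contain what you need, simply extract them with a regular expression. This is faster than looping.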
# test corpus for demonstration
testcorp <- corpus(c(
    d1 = "This is doc1, line 1.\nDoc1, Line 2.\nLine 3 of doc1.",
    d2 = "This is doc2, line 1.\nDoc2, Line 2.\nLine 3 of doc2."
))
summary(testcorp)
## Corpus consisting of 2 documents.
##
##  Text Types Tokens Sentences
##    d1    12     17         3
##    d2    12     17         3
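This gives an error that Corpus cannot be found. Still, docs
@ThejKiran, the problem is the use of the Corpus function, which comes from the tm package, so you need to load that library with require(tm).
I used library(tm). Corpus is the function name, not corpus (that is what I meant to say in my earlier comment).
What is data_char_inaugural?
data_char_inaugural is a dataset from the quanteda library. That name is now deprecated; use data_corpus_inaugural instead.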
# Option 1: replace each text with just its first two lines
texts(testcorp) <-
    stringi::stri_replace_all_regex(texts(testcorp), "(.*\\n.*)(\\n).*", "$1")
summary(testcorp)
## Corpus consisting of 2 documents.
##
##  Text Types Tokens Sentences
##    d1    10     12         2
##    d2    10     12         2
texts(testcorp)
##                                      d1                                      d2
## "This is doc1, line 1.\nDoc1, Line 2." "This is doc2, line 1.\nDoc2, Line 2."
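One caveat about the pattern above: in ICU regular expressions the dot does not match a newline by default, so the replacement is exact only for documents of exactly three lines, and longer documents keep some of their later lines. Below is a hedged base R sketch of a variant that keeps the first two lines whatever the document length, assuming lines are separated by \n. The vector x is purely illustrative.

```r
# Illustrative inputs with one, three and five lines.
x <- c(
  short = "Only one line",
  three = "Line 1\nLine 2\nLine 3",
  five  = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5"
)

# (?s) lets the final .* run across newlines, so everything after the
# second line is removed; [^\n]* keeps each captured line on a single line.
# Texts with fewer than three lines do not match and are returned unchanged.
first_two <- sub("(?s)^([^\\n]*\\n[^\\n]*)\\n.*$", "\\1", x, perl = TRUE)
first_two
```

The same pattern should also be usable with stringi::stri_replace_all_regex and "$1" as the replacement, since ICU accepts the (?s) flag, though that combination is not tested here.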
# Option 2: segment each document into one "document" per line
testcorp2 <- corpus_segment(testcorp, what = "other", delimiter = "\\n",
                            valuetype = "regex")
summary(testcorp2)
## Corpus consisting of 6 documents.
##
##  Text Types Tokens Sentences
##  d1.1     7      7         1
##  d1.2     5      5         1
##  d1.3     5      5         1
##  d2.1     7      7         1
##  d2.2     5      5         1
##  d2.3     5      5         1
# get the serial number from each docname
docvars(testcorp2, "sentenceno") <-
    as.integer(gsub(".*\\.(\\d+)", "\\1", docnames(testcorp2)))
summary(testcorp2)
## Corpus consisting of 6 documents.
##
##  Text Types Tokens Sentences sentenceno
##  d1.1     7      7         1          1
##  d1.2     5      5         1          2
##  d1.3     5      5         1          3
##  d2.1     7      7         1          1
##  d2.2     5      5         1          2
##  d2.3     5      5         1          3
# keep only the first two lines of each original document
testcorp3 <- corpus_subset(testcorp2, sentenceno <= 2)
texts(testcorp3)
##                    d1.1            d1.2                    d2.1            d2.2
## "This is doc1, line 1." "Doc1, Line 2." "This is doc2, line 1." "Doc2, Line 2."
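Once only the headline lines are left, the actual search is a plain grepl over them. Here is a minimal base R sketch of that last step, assuming the kept lines are a named character vector whose names follow the docname.lineno scheme shown above. The vector headline_lines is typed in by hand, and "doc2" merely stands in for the real search term.

```r
# Headline lines as they come out of the segmentation step
# (typed in by hand here; names are <docname>.<line number>).
headline_lines <- c(
  d1.1 = "This is doc1, line 1.", d1.2 = "Doc1, Line 2.",
  d2.1 = "This is doc2, line 1.", d2.2 = "Doc2, Line 2."
)

# Which individual lines contain the search term.
line_hit <- grepl("doc2", headline_lines, ignore.case = TRUE)

# Strip the ".<line number>" suffix to recover the parent document name.
parent <- sub("\\.\\d+$", "", names(headline_lines))

# A document counts as a match if any of its headline lines matches.
doc_hit <- tapply(line_hit, parent, any)
doc_hit
```

Grouping by the parent name gives one TRUE/FALSE per original article, which is usually what is wanted when the headline spans two lines.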