Reading the first two lines of each document in a corpus in R


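I'm having trouble working out how to read the first two lines of each document in a corpus in R. The first two lines contain the headline of the news article I want to analyze. I want to search the headlines (and not the rest of each article) for the word "abortion".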

Here is the code I used to create the corpus:

myCorp <- corpus(readtext(file='~/R/win-library/3.3/quanteda/Abortion/1972/*'))

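myCorp

The readLines function takes a connection object as its argument. Because the corpus function does not return a connection, you need to open a connection to each text string in the corpus inside a loop.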

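library(quanteda)

# quanteda's constructor is lowercase corpus(); Corpus() belongs to the tm package
myCorp <- corpus(quanteda::data_corpus_inaugural)

# Note: $documents$texts reads the corpus internals directly, which matches the
# older quanteda versions this was written for and may not work in newer releases
for (text in myCorp$documents$texts) {
  con <- textConnection(text)
  first_lines <- readLines(con, n = 2)
  close(con)

  # Test if the word "speaker" is in the two lines
  if (any(grepl(pattern = "speaker", x = first_lines, ignore.case = TRUE))) {
    print(first_lines)
  }
}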

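@rconradin's solution works, but as you will note in ?corpus, we strongly discourage accessing the internals of a corpus object directly (because they are going to change soon). Not looping is also faster.

I suggest two alternatives. The first is a regular expression replacement that keeps only the first two lines: if the first two lines contain what you need, a regular expression is enough to extract them. The second is to segment each document into lines and keep only the first two segments.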

# test corpus for demonstration
testcorp <- corpus(c(
    d1 = "This is doc1, line 1.\nDoc1, Line 2.\nLine 3 of doc1.",
    d2 = "This is doc2, line 1.\nDoc2, Line 2.\nLine 3 of doc2."
))

summary(testcorp)
## Corpus consisting of 2 documents.
## 
##  Text Types Tokens Sentences
##    d1    12     17         3
##    d2    12     17         3

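This gives the error: Corpus not found. Even so, docs…

@ThejKiran, the problem is the use of the Corpus function, which is a function in the tm package. So you need to load that library with require(tm).

I used library(tm). Corpus is the function name, not corpus (that is what I meant to say in my earlier comment). What is data_char_inaugural?

data_char_inaugural is a dataset in the quanteda library. That name is now deprecated; you need to use data_corpus_inaugural.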
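
# Option 1: regex replacement that keeps the first two lines of each text
# and drops everything after the second line break, however many lines follow
texts(testcorp) <- 
    stringi::stri_replace_all_regex(texts(testcorp),
                                    "^([^\\n]*\\n[^\\n]*)\\n[\\s\\S]*$", "$1")
summary(testcorp)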
## Corpus consisting of 2 documents.
## 
##  Text Types Tokens Sentences
##    d1    10     12         2
##    d2    10     12         2

texts(testcorp)
##                                     d1                                     d2 
## "This is doc1, line 1.\nDoc1, Line 2." "This is doc2, line 1.\nDoc2, Line 2." 

# Option 2: split each document into one segment per line
testcorp2 <- corpus_segment(testcorp, what = "other", delimiter = "\\n", 
                            valuetype = "regex")
summary(testcorp2)
## Corpus consisting of 6 documents.
## 
##  Text Types Tokens Sentences
##  d1.1     7      7         1
##  d1.2     5      5         1
##  d1.3     5      5         1
##  d2.1     7      7         1
##  d2.2     5      5         1
##  d2.3     5      5         1

# get the serial number from each docname
docvars(testcorp2, "sentenceno") <- 
    as.integer(gsub(".*\\.(\\d+)", "\\1", docnames(testcorp2)))
summary(testcorp2)
## Corpus consisting of 6 documents.
## 
##  Text Types Tokens Sentences sentenceno
##  d1.1     7      7         1          1
##  d1.2     5      5         1          2
##  d1.3     5      5         1          3
##  d2.1     7      7         1          1
##  d2.2     5      5         1          2
##  d2.3     5      5         1          3

# keep only the first two lines of each original document
testcorp3 <- corpus_subset(testcorp2, sentenceno <= 2)
texts(testcorp3)
##                    d1.1                    d1.2                    d2.1                    d2.2 
## "This is doc1, line 1."         "Doc1, Line 2." "This is doc2, line 1."         "Doc2, Line 2."
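
To tie this back to the original question, here is a minimal base-R sketch of the remaining step: take the first two lines of each text and flag the documents whose headline mentions "abortion". The helper name headline_matches and the sample texts are hypothetical and only for illustration. With a quanteda corpus, the input character vector would presumably come from texts(myCorp).

```r
# Hypothetical helper: returns TRUE for each text whose first `n_lines`
# lines contain `pattern` (case-insensitive), FALSE otherwise
headline_matches <- function(txt, pattern, n_lines = 2) {
  vapply(strsplit(txt, "\n", fixed = TRUE), function(lines) {
    any(grepl(pattern, head(lines, n_lines), ignore.case = TRUE))
  }, logical(1))
}

# Made-up sample texts standing in for the news articles
articles <- c(
  a1 = "Court Rules on Abortion Law\nState Capital\nBody text of the article.",
  a2 = "City Budget Approved\nLocal News\nThe debate mentioned abortion briefly."
)

# a1 matches in its headline; a2 mentions the word only in the body
hits <- headline_matches(articles, "abortion")
print(hits)
```

Because the helper works on a plain character vector, it does not depend on the internal structure of the corpus object, and the result can be used to subset the documents.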