R: Looping over a text collection to extract subchapters

As a follow-up to my earlier example, I now want to extract the subchapters of every document in a document collection in R, so that I can run further text mining on them.


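Unfortunately, my code (the function and loop further down) returns an empty data frame. What am I doing wrong? Any help is much appreciated. The expected output for the first row of the data frame looks like this: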

doc_title <- c("Example.docx")
chapter_id <- (c("1 Introduction")) 
text <- (c("He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections."))

chapter_one_df <- data.frame(doc_title, chapter_id, text)





library(stringr)  # provides str_match_all()

divideInto_subchapters <- function(doc_corpus){

  corpus_text <- doc_corpus$text

  # Replace lines starting with N.N.N+ with space
  corpus_text <- gsub("\\R\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", corpus_text, perl=TRUE)

  # Split into IDs and Texts
  data <- str_match_all(corpus_text, "(?sm)^(\\d+(?:\\.\\d+)?\\s+[A-Z][^\r\n]*)\\R(.*?)(?=\\R\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)")

  # Get the chapter ID column
  chapter_id <- trimws(data[[1]][,2])

  # Get the text column
  text <- trimws(data[[1]][,3])

  # Create the target DF (doc_title is looked up outside the function)
  corpus <- data.frame(doc_title, chapter_id, text)
  corpus
}

subchapter_corpus <- data.frame()

for (i in 1:nrow(doc_corpus)) {
  temp_corpus <- divideInto_subchapters(doc_corpus[i])
  subchapter_corpus <- rbind(subchapter_corpus, temp_corpus)
}
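
One plausible cause of the empty result is the indexing in the loop: on a data frame, `doc_corpus[i]` selects column `i`, while `doc_corpus[i, ]` selects row `i`, so the function may never see the `text` of a single document. The sketch below illustrates a row-wise version under stated assumptions: the two-document `doc_corpus`, its column names `doc_title` and `text`, and the helper name `divide_into_subchapters` are all hypothetical, and the regular expressions are copied from the question. It is not meant as a definitive solution.

```r
library(stringr)

# Hypothetical two-document corpus; the column names doc_title and text are assumptions.
doc_corpus <- data.frame(
  doc_title = c("Example.docx", "Other.docx"),
  text = c(
    "1 Introduction\nHe lay on his armour-like back.\n1.1 Background\nSome background text.\n2 Methods\nMore text.",
    "1 Overview\nA short overview.\n2 Results\nSome results."
  ),
  stringsAsFactors = FALSE
)

# Split ONE document (a one-row data frame) into its chapters and subchapters.
divide_into_subchapters <- function(doc_row) {
  corpus_text <- as.character(doc_row$text)

  # Replace heading lines numbered N.N.N or deeper with a space (pattern from the question)
  corpus_text <- gsub("\\R\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", corpus_text, perl = TRUE)

  # Capture each heading (N or N.N) and the text up to the next heading (pattern from the question)
  m <- str_match_all(
    corpus_text,
    "(?sm)^(\\d+(?:\\.\\d+)?\\s+[A-Z][^\r\n]*)\\R(.*?)(?=\\R\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)"
  )[[1]]

  # Repeat the title once per match so the columns have equal length, even with zero matches
  data.frame(
    doc_title = rep(doc_row$doc_title, nrow(m)),
    chapter_id = trimws(m[, 2]),
    text = trimws(m[, 3]),
    stringsAsFactors = FALSE
  )
}

# doc_corpus[i, ] selects row i; doc_corpus[i] would select column i instead
parts <- lapply(seq_len(nrow(doc_corpus)), function(i) divide_into_subchapters(doc_corpus[i, ]))
subchapter_corpus <- do.call(rbind, parts)
```

Taking the title from the row being processed, instead of from a variable defined outside the function, keeps each subchapter attached to its own document. Collecting the pieces in a list and binding them once with `do.call(rbind, parts)` also avoids growing a data frame inside the loop.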