R: Looping over a collection of texts to extract subchapters


r dataframe


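As a follow-up to my earlier example, I now face the following problem: I want to extract the subchapters of every document in a collection in R, for further text mining. Here is my sample data:

Unfortunately, this returns an empty data frame. What am I doing wrong? Any help is much appreciated. The expected output for the first row of the df looks like this: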

doc_title <- c("Example.docx")
chapter_id <- c("1 Introduction")
text <- c("He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.")

chapter_one_df <- data.frame(doc_title, chapter_id, text)

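For me, the loop gave a subscript out of bounds error until I changed doc_corpus[i] to doc_corpus[i, ]. With that change I do get one row in the resulting data frame.

However, that row only covers chapter 2.2. It also seems to be missing 1.1.

If this is a regex problem, then it would certainly help if you commented what you are doing with it:

Feel free to comment, and I will revise my answer as needed until it helps. I am not sure whether that is how this works, but this is only my third day answering questions.

Great, the missing semicolon was the problem!!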
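The difference between the two indexing forms can be seen on a tiny data frame. This is only a sketch: the `doc_corpus` below is a made-up stand-in with assumed column names (`doc_title`, `text`), since the real sample data is not shown in the thread.

```r
# Hypothetical two-document corpus; the column names are assumptions.
doc_corpus <- data.frame(
  doc_title = c("Example.docx", "Other.docx"),
  text = c("1 Introduction\nSome text.", "1 Overview\nMore text."),
  stringsAsFactors = FALSE
)

# A single index inside [ ] selects COLUMNS, not rows.
col_1 <- doc_corpus[1]       # data frame holding only the doc_title column
has_text <- !is.null(col_1$text)  # FALSE: the text column is gone

# With a trailing comma the index selects ROWS and keeps every column.
row_1 <- doc_corpus[1, ]
row_text <- row_1$text       # the text of the first document
```

Inside the loop, a function that reads `doc_corpus$text` therefore only sees the text when it is handed `doc_corpus[i, ]`.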
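To make the regex pieces easier to follow, here is a minimal line-by-line sketch in base R. It is not the single `str_match_all()` call used in the thread, but a swapped-in, simpler technique that applies the same heading patterns one line at a time. The sample text is invented for illustration.

```r
# Hypothetical sample text; the real documents are not shown in the thread.
sample_text <- paste(
  "1 Introduction",
  "He lay on his armour-like back.",
  "1.1 Background",
  "Some background text.",
  "1.1.1 Detail",
  "Text under the third-level heading stays with 1.1.",
  "2 Methods",
  "Method text.",
  sep = "\n"
)
text_lines <- strsplit(sample_text, "\n", fixed = TRUE)[[1]]

# \d+(?:\.\d+)?   : "N" or "N.N"
# \s+[A-Z]        : whitespace, then a capitalised title
is_heading <- grepl("^\\d+(?:\\.\\d+)?\\s+[A-Z]", text_lines, perl = TRUE)

# \d+(?:\.\d+){2,} : "N.N.N" or deeper, the heading lines that get dropped
is_deep <- grepl("^\\d+(?:\\.\\d+){2,}\\s+[A-Z]", text_lines, perl = TRUE)

# Drop the deep heading lines, then number the sections by counting headings.
text_lines <- text_lines[!is_deep]
is_heading <- is_heading[!is_deep]
section <- cumsum(is_heading)

# Headings become chapter IDs; the remaining lines are joined per section.
# (Assumes every heading is followed by at least one line of body text.)
chapter_id <- text_lines[is_heading]
body <- split(text_lines[!is_heading], section[!is_heading])
text <- unname(vapply(body, paste, character(1), collapse = " "))

result <- data.frame(chapter_id, text, stringsAsFactors = FALSE)
```

Because "1.1.1 Detail" fails the first pattern (after "1.1" it finds a dot rather than whitespace), it never starts a new row, and its body text is folded into the "1.1 Background" section.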

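library(stringr)  # provides str_match_all()

divideInto_subchapters <- function(doc_corpus){

  corpus_text <- doc_corpus$text

  # Replace heading lines numbered N.N.N or deeper with a space
  corpus_text <- gsub("\\R\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", corpus_text, perl=TRUE)

  # Split into IDs and texts: group 1 is an "N" or "N.N" heading line,
  # group 2 is everything up to the next such heading or the end of the string
  data <- str_match_all(corpus_text, "(?sm)^(\\d+(?:\\.\\d+)?\\s+[A-Z][^\r\n]*)\\R(.*?)(?=\\R\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)")

  # Get the chapter ID column
  chapter_id <- trimws(data[[1]][,2])

  # Get the text column
  text <- trimws(data[[1]][,3])

  # Create the target DF; the title comes from the current row
  # (assumes doc_corpus has a doc_title column)
  doc_title <- rep(doc_corpus$doc_title, length(chapter_id))
  corpus <- data.frame(doc_title, chapter_id, text)

  return(corpus)
}

subchapter_corpus <- data.frame()

# doc_corpus[i, ] selects row i; doc_corpus[i] would select column i instead
for (i in seq_len(nrow(doc_corpus))) {
  temp_corpus <- divideInto_subchapters(doc_corpus[i, ])
  subchapter_corpus <- rbind(subchapter_corpus, temp_corpus)
}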
