
R: Automatically extracting sections (and section titles) from a file

Tags: r, stringr, stringi, tidytext, read-text

I need to extract the sections (and their titles) from an .Rmd file (for example, from the tidy text mining book).

As far as I can tell, a section starts at a heading sign (# or ##) and runs until the next # sign, ## sign, or the end of the file.


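The entire text has already been extracted (using dt …

Here is an example using a tidyverse approach. It will not necessarily work for every file you have. If you are working with markdown, you should probably try to find a proper markdown parsing library, as Spacedman mentioned in the comments.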

library(tidyverse)

## A data frame in which each row is one line of the .Rmd file.
## tibble() is used here because data_frame() is deprecated.
raw <- tibble(
  text = read_lines("https://raw.githubusercontent.com/dgrtwo/tidy-text-mining/master/01-tidy-text.Rmd")
)

## We don't want to mark R comments as sections.
detect_codeblocks <- function(text) {
  blocks <- text %>%
    str_detect("```") %>%
    cumsum()

  blocks %% 2 != 0
}

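## Here is an example of how you can extract information, such as
## headers, using regex patterns.
df <-
  raw %>%
  mutate(
    code_block = detect_codeblocks(text),
    ## str_extract() returns a character vector (str_match() returns a matrix).
    section = text %>%
      str_extract("^# .*") %>%
      str_remove("^#+ +"),
    ## Blank out matches inside code blocks (they are R comments, not headers).
    section = ifelse(code_block, NA, section),
    subsection = text %>%
      str_extract("^## .*") %>%
      str_remove("^#+ +"),
    subsection = ifelse(code_block, NA, subsection)
  ) %>%
  ## Carry each header down to the lines that follow it.
  fill(section, subsection)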

## If you wish to glue the text together within sections/subsections,
## then just group by them and flatten the text.
df %>%
  group_by(section, subsection) %>%
  slice(-1) %>%                           # remove the header
  summarize(
    text = text %>%
      str_flatten(" ") %>%
      str_trim()
  ) %>%
  ungroup()

#> # A tibble: 7 x 3
#>   section                          subsection  text                       
#>   <chr>                            <chr>       <chr>                      
#> 1 The tidy text format {#tidytext} Contrastin… "As we stated above, we de…
#> 2 The tidy text format {#tidytext} Summary     In this chapter, we explor…
#> 3 The tidy text format {#tidytext} The `unnes… "Emily Dickinson wrote som…
#> 4 The tidy text format {#tidytext} The gutenb… "Now that we've used the j…
#> 5 The tidy text format {#tidytext} Tidying th… "Let's use the text of Jan…
#> 6 The tidy text format {#tidytext} Word frequ… "A common task in text min…
#> 7 The tidy text format {#tidytext} <NA>        "```{r echo = FALSE} libra…
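
To try the same grouping idea without a network connection, here is a minimal sketch on a made-up document. The `toy` data and the `sections` name are hypothetical, invented only for illustration; the pipeline is a condensed variant of the code above (no code-block handling), not a replacement for it.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical miniature document, invented for illustration only.
toy <- tibble(text = c(
  "# Intro",
  "First paragraph.",
  "## Details",
  "Second paragraph.",
  "Third paragraph."
))

sections <- toy %>%
  mutate(
    # Header text on header lines, NA everywhere else.
    section    = str_remove(str_extract(text, "^# .*"), "^#+ +"),
    subsection = str_remove(str_extract(text, "^## .*"), "^#+ +")
  ) %>%
  fill(section, subsection) %>%        # carry headers down to following lines
  group_by(section, subsection) %>%
  slice(-1) %>%                        # drop the header line of each group
  summarize(text = str_trim(str_flatten(text, " ")), .groups = "drop")

sections
```

Note that `fill()` carries a subsection forward until the next `##` line, so in a file with several top-level sections the text directly under a new `#` header would still carry the previous subsection label unless it is reset.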

What about code blocks that start with comment # marks? You really need to parse the markdown with a markdown parsing library.
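
The concern about # comments inside code blocks can be checked on a small made-up example. This is only a sketch: the `toy` lines are hypothetical, and the fence counting is the same idea as `detect_codeblocks()` in the answer. It assumes every line containing three backticks opens or closes a block, which a real markdown parser would not need to assume.

```r
library(stringr)

# Hypothetical miniature .Rmd content, invented for illustration only.
toy <- c(
  "# Title",
  "Intro text.",
  "```{r}",
  "# an R comment, not a heading",
  "x <- 1",
  "```",
  "## Sub",
  "More text."
)

# TRUE from an opening fence up to (not including) the closing fence.
in_code <- cumsum(str_detect(toy, "```")) %% 2 != 0

# Lines that look like level-1 headings, before and after masking.
naive    <- str_detect(toy, "^# ")
headings <- naive & !in_code

toy[naive]     # includes the R comment
toy[headings]  # only the real heading
```

The mask handles the simple case, but a literal triple backtick inside ordinary text or an indented code block would throw the count off, which is why a proper markdown parser remains the more robust option.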