R - Exporting extracted text data (each instance as a row) to data.frame format
I am trying to extract/export text from i number of standardized instances within i number of standardized .txt forms into a data frame in which each instance is its own row. I then want to export that data as an .xlsx file. So far I can extract the data successfully (although the algorithm extracts slightly more than the gregexpr() arguments specify), but I can only export the text all at once in .txt format.

How do I create a data frame from the extracted txt file text, with each instance in its own row? (Once the data is in data.frame format, I know how to export to xlsx from there.)

How do I extract only from the parameters I have set?
# Txt Data Format
txt_1 <-
"A. The First: abcdefg hijklmnop qrstuv wxyz.
B. The Second: abcdefg hijklmnop qrstuv wxyz.
C. The Third: abcdefg hijklmnop qrstuv wxyz.
D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
A. The First: abcdefg hijklmnop qrstuv wxyz.
B. The Second: abcdefg hijklmnop qrstuv wxyz.
C. The Third: abcdefg hijklmnop qrstuv wxyz.
D. The Fourth: abcdefg hijklmnop qrstuv wxyz."
txt_2 <-
"A. The First: abcdefg hijklmnop qrstuv wxyz.
B. The Second: abcdefg hijklmnop qrstuv wxyz.
C. The Third: abcdefg hijklmnop qrstuv wxyz.
D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
A. The First: abcdefg hijklmnop qrstuv wxyz.
B. The Second: abcdefg hijklmnop qrstuv wxyz.
C. The Third: abcdefg hijklmnop qrstuv wxyz.
D. The Fourth: abcdefg hijklmnop qrstuv wxyz."
#################################
# Directory and Text Extraction #
#################################
dest <- "C:/Desktop/"
docs_text <- list.files(path = dest, pattern = "txt", full.names = TRUE)
## Assumes that all the content I want to extract is between "A." and "C." in
## the text while ignoring "C." and "D." content.
docs_list <- list.files(path = dest, pattern = "txt", full.names = TRUE)
docs_doc <- lapply(docs_list, function(i) {
j <- paste0(scan(i, what = character()), collapse = " ")
regmatches(j, gregexpr("(?<=A. The First).*?(?=C. The Third)", j, perl=TRUE))
})
lapply(1:length(docs_doc), function(i) write.table(docs_doc[i], file=paste(docs_list[i], " ",
" ", sep="."), quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " " ))
I used dplyr for the convenience of tibble objects and the very efficient bind_rows command:
dest <- "~"
docs_text <- list.files(path = dest, pattern = "txt", full.names = TRUE)
library(dplyr)
docs_df <- lapply(docs_text, function(f) {
lines <- readLines(f)
tibble(
file = basename(f),
line = seq_along(lines),
text = lines
)
}) %>%
bind_rows()
For exporting to Excel, I suggest rio::export().
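I still need each gregexpr() extraction to get its own row. Can we reconcile your docs_df with my docs_doc?

I didn't notice that you only wanted certain lines. I updated my answer.

Oh, I thought I was just clarifying my question. I apologize, and I will make sure to be more careful with my SO etiquette in future. I will re-ask the same question with the added clarification. Thank you for taking the time to answer this question and for the feedback that gets the community involved! Your insight helped me find clarity!

Great! Sorry if that sounded harsh. I was busy, and it sounded a bit nicer in my head...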
docs_df %>%
  filter(grepl("^[AB]\\.", text))
#> # A tibble: 8 x 3
#> file line text
#> <chr> <int> <chr>
#> 1 txt_1.txt 1 A. The First: abcdefg hijklmnop qrstuv wxyz.
#> 2 txt_1.txt 2 B. The Second: abcdefg hijklmnop qrstuv wxyz.
#> 3 txt_1.txt 6 A. The First: abcdefg hijklmnop qrstuv wxyz.
#> 4 txt_1.txt 7 B. The Second: abcdefg hijklmnop qrstuv wxyz.
#> 5 txt_2.txt 1 A. The First: abcdefg hijklmnop qrstuv wxyz.
#> 6 txt_2.txt 2 B. The Second: abcdefg hijklmnop qrstuv wxyz.
#> 7 txt_2.txt 6 A. The First: abcdefg hijklmnop qrstuv wxyz.
#> 8 txt_2.txt 7 B. The Second: abcdefg hijklmnop qrstuv wxyz.
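The comments ask for one row per gregexpr() match rather than one row per line. A minimal base R sketch of that idea follows. It is an illustration, not the answerer's code: the helper name extract_instances, the temporary demo directory, and the instance column are all hypothetical, and the pattern is the one from the question with the dots escaped so that "A." only matches a literal period.

```r
# Sample text standing in for one standardized .txt form (two instances)
sample_text <- c(
  "A. The First: abcdefg hijklmnop qrstuv wxyz.",
  "B. The Second: abcdefg hijklmnop qrstuv wxyz.",
  "C. The Third: abcdefg hijklmnop qrstuv wxyz.",
  "D. The Fourth: abcdefg hijklmnop qrstuv wxyz.",
  "A. The First: abcdefg hijklmnop qrstuv wxyz.",
  "B. The Second: abcdefg hijklmnop qrstuv wxyz.",
  "C. The Third: abcdefg hijklmnop qrstuv wxyz.",
  "D. The Fourth: abcdefg hijklmnop qrstuv wxyz."
)

# Write two temporary files so the example does not depend on a real directory
tmp_dir <- file.path(tempdir(), "txt_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (nm in c("txt_1.txt", "txt_2.txt")) {
  writeLines(sample_text, file.path(tmp_dir, nm))
}

docs_list <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# Hypothetical helper: returns one data frame row per regex match in a file
extract_instances <- function(path) {
  # Collapse the file into a single string so a match can span several lines
  txt <- paste(readLines(path), collapse = " ")
  # Lazy match of everything after "A. The First" and before "C. The Third"
  pattern <- "(?<=A\\. The First).*?(?=C\\. The Third)"
  hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
  data.frame(
    file = rep(basename(path), length(hits)),
    instance = seq_along(hits),
    text = trimws(hits),
    stringsAsFactors = FALSE
  )
}

# Stack the per-file data frames into one: one row per extracted instance
docs_df <- do.call(rbind, lapply(docs_list, extract_instances))
print(docs_df)
```

Taking [[1]] of the regmatches() result turns the list into a plain character vector of matches, which is what lets each match become its own row. The resulting data frame can then be exported with rio::export() as suggested in the answer.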