Extracting some text from txt files in R


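I have roughly 10,000 txt files that were converted from html files. I want to extract some text from these txt files. Below is a sample from part of one txt file, from which I want to extract the required text: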

                        TABLE OF CONTENTS                        Item 2.02. 
Results of Operations and Financial Condition Item 9.01. Financial Statements and Exhibits SIGNATURES EXHIBIT INDEX EXHIBIT 99.1     
Table of Contents        Item 2.02. Results of Operations and Financial Condition         On January 27, 2005, SanDisk Corporation issued a press release to report its financial results for its fourth quarter and fiscal year ended January 2, 2005. 
The press release is attached hereto as Exhibit 99.1 and is incorporated herein in its entirety by reference.        
The information contained herein and in the accompanying Exhibit 99.1 shall be incorporated by reference into any filing of the Registrant, whether made before or after the date hereof, where such incorporation is provided for, and shall be specifically incorporated by reference into our currently effective registration statements on Form S-3 and Form S-8. 
Except as provided in the previous sentence, the information in this Item 2.02, including Exhibit 99.1 hereto, shall not be deemed to be  ;filed ; for purposes of Section 18 of the Securities Exchange Act of 1934, as amended, or otherwise subject to the liabilities of that section or Sections 11 and 12 of the Securities Act of 1933, as amended.     
Item 9.01. Financial Statements and Exhibits Exhibits 

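If you examine the sample above, you can see that there are two tables of contents: one at the beginning, written as TABLE OF CONTENTS, and another later on, written as Table of Contents. The first one is followed by text such as "Item 2.02. Results of Operations and Financial Condition Item 9.01. Financial Statements and Exhibits SIGNATURES EXHIBIT INDEX EXHIBIT 99.1", and the second one is followed by text such as "Item 2.02. Results of Operations and Financial Condition On January 27, 2005, SanDisk Corporation issued a press release to report its financial results for its fourth quarter and fiscal year ended January 2, 2005."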

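Now I want to extract the following text as a variable.

I want to extract "Item 2.02. Results of Operations and Financial Condition" after the second Table of Contents as the variable name, and the date written after it as the value of that variable; in this case it should be January 27, 2005. Note that this example has only one item, Item 2.02. My other files will contain many other items, such as Item 3.02 or Item 5.02, and each item will have a date under it, usually written in the first line after the item. One important point: I do not want to extract "Item 9.01. Financial Statements and Exhibits" from any of my txt files.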

So far, I have written the following code to read all of the text files into a single data frame:

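library(readtext)
library(tidyverse)

# Collect the paths of all .txt files under the folder (including subfolders)
list_of_files <- list.files(path = "Edgar filings_full text", recursive = TRUE,
                            pattern = "\\.txt$",
                            full.names = TRUE)

# Read every file and stack the results; FileName holds the source path
df <- list_of_files %>%
  set_names(.) %>%
  map_df(readtext, .id = "FileName")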

Thanks for any help.

# Allow whitespace between "Item" and the number, since the sample text reads "Item 2.02"
unlist(str_extract_all(df$text[1], "Item\\s*[0-9]\\.[0-9]{2}"))
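
The one-liner above only lists the item numbers and would also return Item 9.01. Below is a minimal sketch of how the full request might be approached with stringr. It rests on several assumptions that are not confirmed by the question: the helper name extract_item_dates is made up for illustration, item headings are taken to look like "Item N.NN." with a trailing period, dates are taken to be written as "Month D, YYYY", and the item number alone (not the full heading title) is used as the variable name.

```r
library(stringr)

# Hypothetical helper: returns a named character vector mapping each item
# number (other than Item 9.01) to the first date found in its section.
extract_item_dates <- function(text) {
  # Skip the leading index: keep the text after the second
  # "Table of Contents" (matched case-insensitively), if there is one.
  toc <- str_locate_all(text, regex("table of contents", ignore_case = TRUE))[[1]]
  body <- if (nrow(toc) >= 2) str_sub(text, toc[2, "end"] + 1) else text

  # Assumption: a heading is "Item N.NN." with a trailing period, which
  # avoids in-sentence mentions such as "this Item 2.02, including ...".
  pos <- str_locate_all(body, "Item\\s+[0-9]\\.[0-9]{2}\\.")[[1]]
  if (nrow(pos) == 0) return(character(0))

  # Each section runs from its heading to just before the next heading.
  starts <- pos[, "start"]
  ends <- c(starts[-1] - 1, str_length(body))
  sections <- str_sub(body, starts, ends)
  ids <- paste("Item", str_extract(sections, "[0-9]\\.[0-9]{2}"))

  # Assumption: dates look like "January 27, 2005"; take the first per section.
  date_pattern <- paste0("(", paste(month.name, collapse = "|"),
                         ")\\s+[0-9]{1,2},\\s+[0-9]{4}")
  dates <- str_extract(sections, date_pattern)

  # Drop Item 9.01, as requested in the question.
  keep <- ids != "Item 9.01"
  setNames(dates[keep], ids[keep])
}

# Shortened version of the sample text from the question
sample_text <- paste(
  "TABLE OF CONTENTS Item 2.02. Results of Operations and Financial Condition",
  "Item 9.01. Financial Statements and Exhibits SIGNATURES EXHIBIT INDEX",
  "Table of Contents Item 2.02. Results of Operations and Financial Condition",
  "On January 27, 2005, SanDisk Corporation issued a press release to report",
  "its financial results for its fourth quarter and fiscal year ended January 2, 2005.",
  "Except as provided, the information in this Item 2.02, including Exhibit 99.1",
  "hereto, shall not be deemed filed.",
  "Item 9.01. Financial Statements and Exhibits Exhibits"
)

result <- extract_item_dates(sample_text)
print(result)
```

To run this over every file, something like lapply(df$text, extract_item_dates) should give one named vector per document. Files whose layout differs from the sample (a single table of contents, dates in another format, headings without the trailing period) would need the patterns adjusted.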