How can I split a txt file by html tags or a regular expression so that the pieces can be saved as separate txt files in R?


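I have the output of a LexisNexis bulk download of news articles, in both html and txt format. The file itself contains the headlines, metadata, and body text of several different news articles, which I need to separate systematically and save as standalone txt files. The head of the txt version looks like this: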

> head(textz, 100)
 [1] "ï»¿"
 [2] "                               1 of 103 DOCUMENTS"
 [3] ""
 [4] ""
 [5] "                                Foreign Affairs"
 [6] ""
 [7] "                              May 2013 - June 2013"
 [8] ""
 [9] "Why the U.S. Army Needs Armor Subtitle: The Case for a Balanced Force"
[10] ""
[11] "BYLINE: Chris McKinney, Mark Elfendahl, and H. R. McMaster Authors BIOS: CHRIS"
[12] "MCKINNEY is a Lieutenant Colonel in the U.S. Army and an adviser to the Saudi"
[13] "Arabian National Guard. MARK ELFENDAHL is a Colonel in the U.S. Army and a"
[14] "student at the Joint Advanced Warfighting School in Norfolk, Virginia. H. R."
[15] "MCMASTER is a Major General in the U.S. Army and Commander of the Maneuver"
[16] "Center of Excellence at Fort Benning, Georgia."
[17] ""
[18] "SECTION: Vol. 92 No. 4 PAGE: 129"
[19] ""
[20] "LENGTH: 2856 words"
[21] ""
[22] ""
[23] "Ever since World War II, the United States has depended on armored forces --"
[24] "forces equipped with tanks and other protected vehicles -- to wage its wars."
....
....

A snapshot of the html version looks like this:

<DOC NUMBER=103>
<DOCFULL> -->
<br><div class="c0">
<p class="c1"><span class="c2">103 of 103 DOCUMENTS</span></p>
</div>
<br><div class="c0">
<br><p class="c1"><span class="c2">The New York Times</span></p>
</div>
<br><div class="c3">
<p class="c1"><span class="c4">July</span>
<span class="c2"> 26, 2011 Tuesday</span>
<span class="c2">Â </span>
<span class="c2">Â <br>Late Edition - Final</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c7">A Step Toward Trust With China</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">BYLINE: </span><span class="c2">By MIKE MULLEN. </span></p>
<p class="c9"><span class="c2">Mike Mullen, a </span>
<span class="c4">Navy admiral,</span><span class="c2"> is the chairman of the Joint Chiefs of Staff.
</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">SECTION: </span>
<span class="c2">Section A; Column 0; Editorial Desk; OP-ED CONTRIBUTOR; Pg. 23</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">LENGTH: </span>
<span class="c2">794 words</span></p>
</div>
<br><div class="c5">
<p class="c9"><span class="c2">Washington</span></p>
<p class="c9"><span class="c2">THE military relationship between the United States and China is one of the world's most important. And yet, clouded by some misunderstanding and suspicion, it remains among the most challenging. There are issues on which we disagree and are tempted to confront each other. But there are crucial areas where our interests coincide, on which we must work together.
</span></p>



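The individual documents within each file are delimited by a line of the form "[0-9] of [0-9] DOCUMENTS", but between the grep family and strsplit I have not been able to find a way to split the txt (or html) file in R that cleanly separates the component articles and lets me save them as standalone txt files. A thorough search of other questions either turned up nothing helpful or required Python. Any suggestions would be great.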

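The rvest library makes it easy to parse html. Your document is not very consistent about its <DOC NUMBER=...> and <DOCFULL> headers. The answer below uses an extended version of the document you provided, so that it can show a next document (104). You can use the lapply structure to do other things, such as writing a text file for each article. Note the css selector in html_nodes. There does not seem to be much structure in the html, but if you find some patterns you can use selectors to target certain parts of each article.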

library(rvest)
library(stringr)

# doc is assumed to hold the whole html file as a single string
articles <- str_replace_all(doc, "\\n", " ") %>%     # remove newlines to simplify
  str_replace_all("<DOCFULL>\\s+\\-\\->", " ") %>%   # remove redundant header
  strsplit("<DOC NUMBER=\\d+>") %>%                  # split on DOC NUMBER header
  unlist()                                           # to a vector

# drop the first empty result from the split
articles <- articles[-1]

# use lapply to traverse all articles
c2_texts <- lapply(articles, function(article) {
  article %>%
    read_html() %>%             # character input parsed as html
    html_nodes(css = ".c2") %>% # find nodes with CSS selector, e.g. c2
    html_text()                 # extract text from within the node
})

c2_texts
# [[1]]
# [1] "103 of 103 DOCUMENTS"                                                                                                                                                                                                                                                                                                                                                           
# [2] "The New York Times"                                                                                                                                                                                                                                                                                                                                                             
# [3] " 26, 2011 Tuesday"                                                                                                                                                                                                                                                                                                                                                              
# [4] "Â "                                                                                                                                                                                                                                                                                                                                                                             
# [5] "Â Late Edition - Final"                                                                                                                                                                                                                                                                                                                                                         
# [6] "By MIKE MULLEN. "                                                                                                                                                                                                                                                                                                                                                               
# [7] "Mike Mullen, a "                                                                                                                                                                                                                                                                                                                                                                
# [8] " is the chairman of the Joint Chiefs of Staff.     "                                                                                                                                                                                                                                                                                                                            
# [9] "Section A; Column 0; Editorial Desk; OP-ED CONTRIBUTOR; Pg. 23"                                                                                                                                                                                                                                                                                                                 
# [10] "794 words"                                                                                                                                                                                                                                                                                                                                                                      
# [11] "Washington"                                                                                                                                                                                                                                                                                                                                                                     
# [12] "THE military relationship between the United States and China is one of the worlds most important. And yet, clouded by some misunderstanding and suspicion, it remains among the most challenging. There are issues on which we disagree and are tempted to confront each other. But there are crucial areas where our interests coincide, on which we must work together.     "
# 
# [[2]]
# [1] "104 of 104 DOCUMENTS" "The Added Item"
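
To go from the list above to one txt file per article, something along these lines might work. It is only a sketch under assumptions: the two-article doc string, the out_dir folder, and the article_001.txt naming scheme are all made up for illustration, and it collects the text of every p node rather than only the .c2 spans.

```r
library(rvest)

# Hypothetical two-article input, shaped like the html in the question
doc <- paste0(
  '<DOC NUMBER=1>\n<DOCFULL> -->\n',
  '<div class="c5"><p class="c6"><span class="c7">First headline</span></p></div>\n',
  '<div class="c5"><p class="c9"><span class="c2">First body.</span></p></div>\n',
  '<DOC NUMBER=2>\n<DOCFULL> -->\n',
  '<div class="c5"><p class="c6"><span class="c7">Second headline</span></p></div>\n',
  '<div class="c5"><p class="c9"><span class="c2">Second body.</span></p></div>\n'
)

# Same splitting idea as in the answer, written with base R
articles <- gsub("<DOCFULL>\\s+-->", " ", doc)
articles <- unlist(strsplit(articles, "<DOC NUMBER=\\d+>"))
articles <- articles[nzchar(trimws(articles))]    # drop empty pieces

out_dir <- file.path(tempdir(), "articles_html")  # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)

# Write the paragraph text of each article to its own file
out_files <- vapply(seq_along(articles), function(i) {
  paragraphs <- articles[i] %>%
    read_html() %>%     # parse the article fragment as html
    html_nodes("p") %>% # one node per paragraph
    html_text()
  path <- file.path(out_dir, sprintf("article_%03d.txt", i))
  writeLines(trimws(paragraphs), path)
  path
}, character(1))
```

Selecting p nodes keeps one output line per paragraph; swapping in a narrower selector would restrict what ends up in each file.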

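# txt version: doc_text is assumed to hold the whole txt file as a single string
texts <- unlist(strsplit(doc_text, "\\s+\\d+\\sof\\s\\d+\\sDOCUMENTS"))
texts <- texts[-1]  # drop the first piece (whatever precedes the first header)

# write each article to its own file: file1.txt, file2.txt, ...
invisible(lapply(seq_along(texts), function(i) {
  write(texts[i], paste0("file", i, ".txt"))
}))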
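
If the txt file is read with readLines, so that it is a character vector of lines like textz in the question, a line-based variant may be easier to reason about than collapsing everything into one string. The sketch below is an assumption-laden illustration, not a tested solution for the real download: the short textz vector stands in for the actual file, and out_dir and the file names are hypothetical. It flags the "N of M DOCUMENTS" lines with grepl, numbers the articles with cumsum, and groups the lines with split.

```r
# Hypothetical stand-in for readLines() on the real LexisNexis txt file
textz <- c(
  "",  # a line before the first header (e.g. where the byte order mark sits)
  "                               1 of 2 DOCUMENTS",
  "",
  "                                Foreign Affairs",
  "Why the U.S. Army Needs Armor",
  "                               2 of 2 DOCUMENTS",
  "",
  "                              The New York Times",
  "A Step Toward Trust With China"
)

# TRUE on each "N of M DOCUMENTS" separator line
is_header <- grepl("^\\s*\\d+ of \\d+ DOCUMENTS\\s*$", textz)

# Running count of headers seen so far gives an article id for every line
article_id <- cumsum(is_header)

# Keep lines that belong to an article (id > 0) and are not the header itself
keep <- article_id > 0 & !is_header
articles <- split(textz[keep], article_id[keep])

out_dir <- file.path(tempdir(), "articles_txt")  # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)

# One file per article, keeping the original line breaks
out_files <- file.path(out_dir, sprintf("article_%03d.txt", seq_along(articles)))
for (i in seq_along(articles)) {
  writeLines(articles[[i]], out_files[i])
}
```

Anchoring the pattern to the whole line makes it less likely that a sentence in an article body that happens to mention "documents" is mistaken for a separator.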