R: Getting specific sections of a text file and their content


I have a text file that looks like this:

1 Hello
1.1 Hi
1.2 Hey
2 Next section
2.1 New section
3 thrid
4 last

I have another text file that looks like this:

1 Hello
My name is John. It was nice to meet you.
1.1 Hi
Hi again. My last name is Doe.
1.1.1 Bye
1.2 Hey
Greetings.
2 Next section
This is the second section. I am majoring in CS.
2.1 New Section
Welcome. I am an undergraduate student.
3 third
1. hi
2. hello
3. hey
4 last

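I would like to know how to read the headings from the first text file and use them to find the matching sections in the second file, together with everything that follows each heading up to the next section. So basically, I am trying to get a result like this: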

Section      Content
1 Hello      My name is John. It was nice to meet you.
1.1 Hi       Hi again. My last name is Doe. 1.1.1 Bye
1.2 Hey      Greetings.

...and so on.

I would like to know how I can do this.

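The answer to this question is yes, it can be done. The implementation will differ a lot depending on the programming language you use for the task. A high-level overview would be:

Split the first file by line into an array of strings. This is the list of keys used to search the second document.

Read the second file into a string variable.

Loop over all the keys (iterator x) and find their index in the second document. Something like:

int start = seconddocument.indexOf(keys[x])
int end = seconddocument.indexOf(keys[x + 1])

Then, using those start and end positions, you can extract the content with a substring() function:

String matchedContent = seconddocument.substring(start, end)

This works until you reach the last match, because when x is the last key, keys[x + 1] does not exist. In that case, set end to the position of the last character in the document, or use a substring method that takes only a start position.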
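Since the question is tagged R, here is a minimal sketch of the same index-and-substring idea in R. The names keys, doc, starts, ends and content are made up for illustration, and the inputs are a shortened version of the sample data. It assumes every key occurs in the document and that the keys appear in the same order there. A plain substring search can also hit a key that happens to occur inside body text, so treat this as a starting point only.

```r
# Hypothetical inputs: `keys` holds the headings, `doc` is the second file as one string.
keys <- c("1 Hello", "1.1 Hi", "1.2 Hey")
doc  <- paste(c("1 Hello", "My name is John.", "1.1 Hi", "Hi again.",
                "1.2 Hey", "Greetings."), collapse = "\n")

# Start offset of each key in the document; fixed = TRUE turns off regex
# matching, so the "." in "1.1 Hi" is matched literally.
starts <- vapply(keys, function(k) as.integer(regexpr(k, doc, fixed = TRUE)),
                 integer(1))

# Each section ends just before the next key starts; the last one runs to the end.
ends <- c(starts[-1] - 1L, nchar(doc))

# Take the text after each heading and trim the surrounding newlines.
content <- trimws(substring(doc, starts + nchar(keys), ends))
names(content) <- keys
content
```

Handling the last key is done by appending nchar(doc) to ends, which corresponds to setting end to the last character of the document.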

HTH

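The following solution could certainly be improved, but it may give you a way to approach the problem. Depending on the size and structure of the files you need to process, this approach may be workable as is, or it may need more tuning in terms of section detection and speed.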

file1 = 
"1 Hello
1.1 Hi
1.2 Hey
2 Next section
2.1 New section
3 thrid
4 last"

file2 = 
"1 Hello
My name is John. It was nice to meet you.
1.1 Hi
Hi again. My last name is Doe.
1.1.1 Bye
1.2 Hey
Greetings.
2 Next section
This is the second section. I am majoring in CS.
2.1 New Section
Welcome. I am an undergraduate student.
3 third
1. hi
2. hello
3. hey
4 last"

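# Split both inputs into character vectors with one element per line.
file1 = unlist(strsplit(file1, "\n", fixed = TRUE))
file2 = unlist(strsplit(file2, "\n", fixed = TRUE))
# Line number in file2 of each heading from file1 (headings without a match are dropped).
positions = unlist(sapply(file1, function(x) grep(paste0("^", x, "$"), file2, ignore.case = TRUE)))
# Each section ends on the line before the next heading; the last one ends with the file.
positions = cbind(positions, c(positions[-1]-1, length(file2)))
# Collect the lines of each section, then drop the heading line itself.
text = mapply(function(x, y) file2[x:y], positions[,1], positions[,2])
text = lapply(text, function(x) x[-1])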
result = cbind(positions, text)
result
# positions    text                                              
# 1 Hello         1         2  "My name is John. It was nice to meet you."       
# 1.1 Hi          3         5  Character,2                                       
# 1.2 Hey         6         7  "Greetings."                                      
# 2 Next section  8         9  "This is the second section. I am majoring in CS."
# 2.1 New section 10        15 Character,5                                       
# 4 last          16        16 Character,0  

# Note that the text column contains lists storing the individual lines.
# e.g. for "2.1 New section":
class(result[5, "text"])
# list
result[5, "text"]
# [[1]]
# [1] "Welcome. I am an undergraduate student." "3 third"  #<< note the different spelling of third                              
# [3] "1. hi"                                   "2. hello"                               
# [5] "3. hey"
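
A possible variation, sketched under a few assumptions: the headings are compared as whole lines with match() after lowercasing instead of being turned into regular expressions, so the dots in "1.1 Hi" are not treated as wildcards, and each section is collapsed into one string so the result prints like the table in the question. The names headings, lines, pos, found, start, end, content and result are illustrative, the data is shortened, and the sketch assumes the headings appear in the same order in both files.

```r
# Hypothetical inputs: both files already split into one element per line.
headings <- c("1 Hello", "1.1 Hi", "1.2 Hey", "2 Next section")
lines <- c("1 Hello", "My name is John. It was nice to meet you.",
           "1.1 Hi", "Hi again. My last name is Doe.", "1.1.1 Bye",
           "1.2 Hey", "Greetings.",
           "2 Next Section", "This is the second section.")

# Exact, case-insensitive whole-line match; NA marks headings not found in `lines`.
pos   <- match(tolower(headings), tolower(lines))
found <- !is.na(pos)
start <- pos[found]
end   <- c(start[-1] - 1L, length(lines))

# Join the lines strictly after each heading, up to the line before the next one.
content <- mapply(function(s, e) paste(lines[seq_len(e - s) + s], collapse = " "),
                  start, end)

result <- data.frame(Section = headings[found], Content = content,
                     stringsAsFactors = FALSE)
result
```

A heading with no lines below it gets an empty string as its content, and a misspelled heading such as "3 thrid" would simply be left out of the result.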

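Can your input also be a subsection? In that case, what should the output be? Everything up to the next section, or up to the next subsection?

If the input is a subsection, the output runs up to the next subsection.

Why do 1.1 Hi, 2.1 New section and 4 last show up as Character,#? Do you know what is different about these sections compared with the others?

I have updated my answer. The results in the text column are stored as lists. If a list contains more than one element, only its length and type are displayed.

Thanks for your help. I have accepted your answer. I have one more question about the strings in my text files. The output of readLines(textfile) does not work well as input for file1 and file2, and readFile does not work as input either. Is there a better way to pull the text from a text file into a string object, with exactly the same formatting and spacing as in the file?

OK, thanks for accepting. What is the output of readLines applied to your files? You can update your post with the output. Posting just the output for the shorter file may be enough, since I guess the files generally have the same structure. As for better readability, you could add an additional column that contains only the first element of each list (which is then printed visibly), along the lines of lapply(result[, "text"], function(x) x[1]). If you want longer strings, you can extract the first X elements and paste them together via paste0(..., collapse = ""). However, there is a limit on the number of characters printed per line. I have never tried it, so I am not sure how many characters would actually be displayed.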
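
To illustrate the two points from the last comment, here is a small sketch. It assumes the input is a plain text file with one heading per line; the temporary file and the names path, file1, text and preview are only there to make the example self-contained. readLines() returns one element per line with the spacing inside each line left as it is, so its result can be used directly in place of the strsplit() step.

```r
# Write a small sample to a temporary file so the sketch can run on its own.
path <- tempfile(fileext = ".txt")
writeLines(c("1 Hello", "1.1 Hi", "1.2 Hey"), path)

# readLines() already gives one element per line, so no strsplit() is needed.
file1 <- readLines(path)

# Hypothetical list column like result[, "text"]: one character vector per section.
text <- list("Greetings.", c("Hi again.", "1.1.1 Bye"), character(0))

# Collapse each section into a single string so it prints in full.
preview <- vapply(text, paste, character(1), collapse = " ")
preview
```

Sections without any content come out as empty strings, which is easier to read in printed output than Character,0.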