How to read a txt file in R and identify page numbers and the paragraphs on each page


I read a .txt file using readLines() in R. The txt file itself contains no line numbers (i.e. none are displayed). The format of the txt file is as follows:

        page1:
       paragraph1:Banks were early adopters, but now the range of applications 
            and organizations using predictive analytics successfully have multiplied. Direct marketing and sales.
     Leads coming in from a company’s website can be scored to determine the probability of a 
                            sale and to set the proper follow-up priority. 
paragraph2: Campaigns can be targeted to the candidates most 
                        likely to respond. Customer relationships.Customer characteristics and behavior are strongly 
                        predictive of attrition (e.g., mobile phone contracts and credit cards). Attrition or “churn” 
                    models help companies set strategies to reduce churn rates via communications and special offers. 
                Pricing optimization. With sufficient data, the relationship between demand and price can be modeled for 
            any product and then used to determine the best pricing strategy.

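Similarly, page 2 of the .txt file also contains paragraphs.

However, I cannot tell pages and paragraphs apart, because a .txt file has no notion of pages. Is there any way, or any suggestion, for marking page numbers and paragraphs in R?

The answer given by Edward Carney fits this exactly. But if I do not have the "paragraph(no)" markers, how can I identify paragraphs using tabs or spaces?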

This approach uses the stripWhitespace function from the tm library, but otherwise it is base R.

First, read in the text and use grep to find the page#: marker lines.

library(tm)  # provides stripWhitespace()

x <- readLines('text2.txt')
# find the `page#:` marker lines (\\d+ also matches page numbers of 10 and above)
page_locs <- grep('page\\d+:', x)
# add an element with the last line of the text plus 1
page_locs[length(page_locs)+1] <- length(x) + 1
# collapse each run of whitespace into a single space
x <- stripWhitespace(x)
# break the text into a list of pages, eliminating the `page#:` lines.
pages <- list()
# grab each page's lines into successive list elements
for (i in seq_len(length(page_locs)-1)) {
  pages[[i]] <- x[(page_locs[i]+1):(page_locs[i+1]-1)]
}
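
As a side note, the same page split can be sketched without an explicit loop, using only base R. This is a minimal sketch, not the answer's own method: the sample vector x is made up for illustration, and it assumes every page starts with a page#: marker line.

```r
# Hypothetical sample input standing in for readLines('text2.txt')
x <- c("  page1:", " paragraph1:first", "   more text",
       "page2:", "paragraph1:second")

# TRUE on every "page#:" marker line
is_page <- grepl("page\\d+:", x)
# running count of markers seen so far = page number of each line
page_id <- cumsum(is_page)
# drop the marker lines and any text that precedes the first marker
keep <- !is_page & page_id > 0
# group the remaining lines by page; the factor levels keep empty pages
pages <- unname(split(x[keep],
                      factor(page_id[keep], levels = seq_len(max(page_id)))))

length(pages)   # number of pages found
pages[[2]]      # lines belonging to the second page
```

Here cumsum() turns the logical marker vector into a page number for every line, so split() can do the grouping in one call.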

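"Page numbers" depend on the number of lines per page, the font size, and whether a ^L (form feed) character is present in the source. The only paragraph delimiter I can think of (paragraphs may span several pages) is a double line feed (two consecutive \n). Do you have any other clear way to tell one from the other?

Use grep('^\t', x) for tabs. The ^ character ensures that the pattern only "sees" tabs at the start of a line. You can use the same approach for space characters, but spaces can be problematic for other reasons.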
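
To illustrate the tab-based approach when there are no paragraph#: markers, here is a minimal base-R sketch. The sample page vector is invented for illustration, and the sketch assumes that each paragraph's first line starts with a tab and that the page has at least one line. It also has to run before any whitespace collapsing such as stripWhitespace, which would turn the leading tab into a plain space.

```r
# Hypothetical page text: each paragraph begins with a tab,
# and there are no "paragraph#:" markers
page <- c("\tBanks were early adopters,", "but now the range has multiplied.",
          "\tCampaigns can be targeted", "to likely responders.")

# TRUE where a line opens a new paragraph (tab at the start of the line)
starts <- grepl("^\t", page)
# treat the first line as a paragraph start even if it is not indented
starts[1] <- TRUE
# paragraph number of every line
para_id <- cumsum(starts)
# trim surrounding whitespace, then join each paragraph's lines with a space
paras <- unname(vapply(split(trimws(page), para_id),
                       paste, character(1), collapse = " "))

paras
```

For space-indented files the pattern could be changed to something like "^ {4,}", though, as noted above, leading spaces are a less reliable signal than tabs.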
# next, split each page into its paragraphs
for (i in seq_along(pages)) {
    # get the locations for the paragraphs
    para_locs <- grep('paragraph\\d+:', pages[[i]])
    # add an end element
    para_locs[length(para_locs) + 1] <- length(pages[[i]]) + 1
    # delete the paragraph marker
    curr_page <- gsub('paragraph\\d+:','',pages[[i]])
    curr_paras <- list()
    # step through the paragraphs in each page
    # (seq_len skips the loop when a page has no paragraph markers)
    for (j in seq_len(length(para_locs)-1)) {
        # collapse the vectors for each paragraph
        curr_paras[[j]] <- paste(curr_page[para_locs[j]:(para_locs[j+1]-1)], collapse='')
        # delete leading spaces for each paragraph if desired
        curr_paras[[j]] <- gsub('^ ','',curr_paras[[j]])
    }
    # store the list of paragraphs back into the pages list
    pages[[i]] <- curr_paras
}
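
Once the text is in this nested shape (a list of pages, each a list of paragraph strings), one possible way to label every paragraph with its page and paragraph number is to flatten it into a data frame. The sketch below is only an illustration: the pages list is a hand-made stand-in with the same shape, and the name index is an arbitrary choice.

```r
# Hypothetical result with the same shape the loop above produces:
# a list of pages, each a list of paragraph strings
pages <- list(list("First paragraph.", "Second paragraph."),
              list("Only paragraph on page two."))

# number of paragraphs on each page
n_paras <- lengths(pages)

# one row per paragraph, labelled with page and paragraph number
index <- data.frame(
  page      = rep(seq_along(pages), times = n_paras),
  paragraph = sequence(n_paras),
  text      = unlist(pages),
  stringsAsFactors = FALSE
)

index
# a single paragraph can still be reached directly in the nested list
pages[[1]][[2]]
```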