Extracting strings from an irregularly spaced PDF into a tidy R data frame
I am trying to teach myself a process for converting irregularly spaced PDF tables into tidy data frames in R. My goal is to extract population data from the recent Pakistan census, which is currently spread across 137 separate PDFs. The file imported below serves as a sample target file. I have been able to piece together some of the necessary steps from other guides to break the PDF down into text strings, but I have tangled myself up in the regular expressions that I think are needed to turn the text into a data frame.

The steps I have worked out so far:
library(pdftools) # provides pdf_text()
library(readr)    # provides read_lines()
# import file
district_import <- pdf_text("http://www.pbscensus.gov.pk/sites/default/files/bwpsr/kp/ABBOTTABAD_BLOCKWISE.pdf")
# convert text to string
data <- toString(district_import)
# convert text to character lines
data <- read_lines(data)
# clean up page headers and footers
header_row_1 <- grep("POPULATION AND HOUSEHOLD DETAIL FROM BLOCK TO DISTRICT LEVEL", data)
header_row_2 <- grep("KHYBER PAKHTUNKHWA", data)
header_row_3 <- grep("ADMIN UNIT", data)
footer_row <- grep("Page ", data)
data <- data[- c(header_row_1, header_row_2, header_row_3, footer_row)]
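One thing to watch in the cleanup step above: if none of the patterns matches any line in a file, the index vector is empty, and `data[-integer(0)]` returns an empty vector instead of the untouched data. A minimal sketch of a mask-based variant follows; `drop_noise` is a hypothetical helper name, and the toy lines are invented stand-ins for the real `read_lines()` output.

```r
# Toy lines standing in for the output of read_lines(); not real PDF text.
lines <- c(
  "POPULATION AND HOUSEHOLD DETAIL FROM BLOCK TO DISTRICT LEVEL",
  "KHYBER PAKHTUNKHWA",
  "ADMIN UNIT",
  "ABBOTTABAD DISTRICT      1,332,912   216,534",
  "        023010101      5,131   705",
  "Page 1"
)

# The four patterns from the grep() calls above, joined into one alternation.
noise <- paste(
  c("POPULATION AND HOUSEHOLD DETAIL FROM BLOCK TO DISTRICT LEVEL",
    "KHYBER PAKHTUNKHWA",
    "ADMIN UNIT",
    "Page "),
  collapse = "|"
)

# Hypothetical helper: keep every line that does NOT match the pattern.
# A logical mask stays correct even when nothing matches.
drop_noise <- function(x, pattern) x[!grepl(pattern, x)]

cleaned <- drop_noise(lines, noise)
untouched <- drop_noise(c("a", "b"), noise)
```

Because the mask is logical, a file without any header or footer line passes through unchanged, which matters when the same code loops over many district files. The province name in the pattern is specific to this file and would need adjusting for other provinces.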
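(Note that although the column breaks look regular here, the length of the leading whitespace is not consistent across the divisions in the file, and I expect it will not be consistent across the 137 districts, which I ultimately want to loop over and merge into one country-wide data frame.)

From this point, my desired output is a tidy data frame with the census block (the numeric code, nine digits in the sample rows below, which is not identified by name in the original PDF) as the basic unit of organization: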
  district            sub_lvl01         sub_lvl02             sub_lvl03    sub_lvl04    census_block population household
  <chr>               <chr>             <chr>                 <chr>        <chr>        <chr>        <chr>      <chr>
1 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01 CIRCLE NO 01 023010101    5,131      705
2 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01 CIRCLE NO 01 023010102    2,654      435
3 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01 CIRCLE NO 01 023010103    1,004      173
4 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01 CIRCLE NO 01 023010104    2,216      349
5 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01 CIRCLE NO 01 023010105    94         14
6 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01 CIRCLE NO 01 023010106    1,051      171
... etc
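I have tried using regular expressions to work out how to extract this, but I got completely lost along the way, especially given that there is no standard delimiter between the variables.

Playing around on regex101.com, I thought this code would at least let me extract the population and household figures: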
pop_hh_str <- str_match_all(data, "(?!\\d{6})(?<=\\s)\\d*[,.]*\\d*[,.]*\\d*")
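pop_hh_str

I can offer some code that gets the census blocks into a data.frame. If you can obtain a lookup table for the census blocks, you can add the rest of the data.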
Continuing from the data vector:
library(stringr)
# find the rows which have 9 digits + a space
data1 <- data[which(str_detect(data, "\\d{9} "))]
# remove spaces in front of the line
data1 <- str_remove(data1, "^ +")
# replace all other spaces with 1 space
data1 <- str_replace_all(data1, " +", " ")
# create data.frame and split the value column into 3 with new headers.
library(tidyr)
library(dplyr)
df <- data1 %>%
  as_tibble() %>% # as_data_frame() is deprecated in current tibble versions
  separate(value, into = c("census_block", "population", "household"), sep = " ")
df
# A tibble: 1,106 x 3
census_block population household
<chr> <chr> <chr>
1 023010101 5,131 705
2 023010102 2,654 435
3 023010103 1,004 173
4 023010104 2,216 349
5 023010105 94 14
6 023010106 1,051 171
7 023010201 1,352 211
8 023010202 1,019 161
9 023010203 4,079 691
10 023010204 2,171 345
# ... with 1,096 more rows
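The population and household columns above are still character strings with thousands separators. A minimal sketch of converting them to integers in base R follows; `to_count` is a hypothetical helper name, and the values are copied from the first six rows of the tibble printed above.

```r
# Values copied from the first six rows of the printed tibble.
population <- c("5,131", "2,654", "1,004", "2,216", "94", "1,051")
household  <- c("705", "435", "173", "349", "14", "171")

# Hypothetical helper: strip the thousands separators, then convert.
to_count <- function(x) as.integer(gsub(",", "", x, fixed = TRUE))

pop_int <- to_count(population)
hh_int  <- to_count(household)

# Totals for the six blocks.
pop_total <- sum(pop_int)
hh_total  <- sum(hh_int)
```

The two totals come to 12150 and 1847, which agree with the CIRCLE NO 01 row ("12,150" and "1847") shown in the other answer's output, so summing blocks per circle is a cheap consistency check on the extraction. If readr is already loaded, `parse_number()` should be a ready-made alternative to the helper.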
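(Since I don't want to install pdftools, I recreated your data by hand.)

The lines could be split with strsplit(data, "\\s+"), where "\\s+" means one or more whitespace characters. In this case, however, a single space can occur inside a value, so a column separator has to be longer than one space: "\\s{2,}" (at least two whitespace characters).

Second, the lines sometimes carry leading and/or trailing whitespace. We remove it with trimws() before splitting, which gives strsplit(trimws(data), "\\s{2,}").

We can then use Reduce() with rbind to stack the split lines into one table and turn it into a data frame.

From here you need to build helper columns containing counters that show which rows carry which type of information. Such counters let you split the data frame into sub data frames; split() is very useful for that.

I wrote some functions that help classify the "level" of a line in the data vector by looking at how much whitespace it starts with (the regex tests for k or more leading whitespace characters):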
not.more.than.k.leading.whitespaces <- function(s, k) {
  !grepl(paste0("^\\s{", k, ",}"), s)
}
leveler <- function(s, k) {
  cumsum(not.more.than.k.leading.whitespaces(s, k))
}
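To make the behaviour of leveler() concrete, here is a small sketch on invented toy lines (the indentation of four spaces is an assumption, not taken from the PDF). The two functions are repeated so the snippet runs on its own.

```r
not.more.than.k.leading.whitespaces <- function(s, k) {
  !grepl(paste0("^\\s{", k, ",}"), s)
}
leveler <- function(s, k) {
  cumsum(not.more.than.k.leading.whitespaces(s, k))
}

# Toy lines: circle headers flush left, block rows indented by 4 spaces.
toy <- c(
  "CIRCLE NO 01",
  "    023010101",
  "    023010102",
  "CIRCLE NO 02",
  "    023010201"
)

# Lines with fewer than 4 leading whitespace characters start a new group,
# so the counter increases at every circle header.
groups <- leveler(toy, 4)   # 1 1 1 2 2

# With k = 0 the regex matches every line, so the counter stays at 0.
flat <- leveler(toy, 0)     # 0 0 0 0 0

# The counter can be passed straight to split().
by.circle <- split(toy, groups)
```

Note that a line with exactly k leading whitespace characters is not counted, because the regex tests for k or more. In other words, the counter increases on lines with fewer than k leading whitespace characters.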
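The repeated split/flatten/annotate cycles of the step-by-step code further down can be written more compactly and more regularly like this: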
# abstract over the split-flatten-annotate cycle/pattern by:
spl.fl.annotate <- function(df.a.l, col, col.name) {
  df.sN.ll <- lapply(df.a.l, function(df) split(df, df[, col]))
  df.sN.l <- unlist(df.sN.ll, recursive = FALSE)
  lapply(df.sN.l, annotate.by.first.row, 1, col.name)
}
# now the cycles can be written as:
df.a0.l <- spl.fl.annotate(list(`0` = df), "level0", "district")
df.a1.l <- spl.fl.annotate(df.a0.l, "level1", "thesil")
df.a2.l <- spl.fl.annotate(df.a1.l, "level2", "cantonment")
df.a3.l <- spl.fl.annotate(df.a2.l, "level3", "charge")
df.a4.l <- spl.fl.annotate(df.a3.l, "level4", "circle")
# fuse subdata frames by `Reduce(rbind, ...)`
res.df <- Reduce(rbind, df.a4.l)
res.cleaned.df <- res.df[, c("district", "thesil", "cantonment", "charge", "circle", "V1", "V2", "V3")]
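Since every cycle feeds the result of the previous one into the next, the chain of calls can itself be folded with Reduce(). The following is only a sketch under assumptions: the toy data frame has two levels instead of five, the level columns `lvl_charge` and `lvl_circle` are filled in by hand instead of by leveler(), and `steps` is a hypothetical name. The cell values are copied from the output shown in this answer, and the two helper functions are repeated so the snippet is self-contained.

```r
annotate.by.first.row <- function(df, col, col.title) {
  info <- df[1, col]
  rowsn <- dim(df)[1]
  df.new <- df[2:rowsn, ]
  df.new[, col.title] <- info
  df.new
}
spl.fl.annotate <- function(df.a.l, col, col.name) {
  df.sN.ll <- lapply(df.a.l, function(df) split(df, df[, col]))
  df.sN.l <- unlist(df.sN.ll, recursive = FALSE)
  lapply(df.sN.l, annotate.by.first.row, 1, col.name)
}

# Toy data: one charge, two circles, three blocks.
toy <- data.frame(
  V1 = c("CHARGE NO 01", "CIRCLE NO 01", "023010101", "023010102",
         "CIRCLE NO 02", "023010201"),
  V2 = c("138,311", "12,150", "5,131", "2,654", "15,383", "1,352"),
  lvl_charge = c(1, 1, 1, 1, 1, 1),  # hand-written counters (assumption)
  lvl_circle = c(1, 2, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Each step: the level column to split on and the name of the new column.
steps <- list(c("lvl_charge", "charge"), c("lvl_circle", "circle"))

# Fold the steps over the starting list, one split/annotate cycle per step.
res.l <- Reduce(
  function(acc, step) spl.fl.annotate(acc, step[1], step[2]),
  steps,
  list(`0` = toy)
)
res <- do.call(rbind, res.l)
```

Here `res` keeps only the three block rows, each carrying its charge and circle label. Adding a level then means adding one entry to `steps` instead of another pair of lines.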
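Thanks! I still need to find a function to group the census blocks by the relevant administrative divisions, but I think I should be able to follow this.

Yes, do you know the function split()? You need a helper column that indicates which group each row belongs to... Now I have completed it! :) You may have to adjust the values passed to leveler(); I only have spaces in front of the data lines... But that is my point: use the leading whitespace as information, because the indentation of a line reflects its level. My example code should run through completely... I hope this helps!

Awesome. My own attempts at a solution would have been far clunkier. Thank you very much for your help!

@cjsc: You're welcome! It took me quite some time, although I had some intuition that this approach would lead somewhere... I'd be glad if it helps you! ;) I have now abstracted the repeated pattern into a function... I think this example shows the power of functional programming (FP). lapply() is a typical higher-order function... It is typical of FP that you write small helper functions which, combined with other functions, turn out to be very powerful. R is an FP language. It shines when you use an FP approach.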
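# naive split on any run of whitespace; this also breaks up labels such as "CIRCLE NO 01"
strsplit(data, "\\s+") # "\\s+" meaning: 1 or more white spaces
# trim each line, then split only on runs of 2 or more whitespace characters
strsplit(trimws(data), "\\s{2,}")
# stack the split lines row by row into a character matrix
df <- Reduce(rbind, strsplit(trimws(data), "\\s{2,}"))
rownames(df) <- seq_len(nrow(df)) # just give at least numbers as rownames
df <- as.data.frame(df) # the columns are named V1, V2, V3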
[,1] [,2] [,3]
1 "ABBOTTABAD DISTRICT" "1,332,912" "216,534"
2 "ABBOTTABAD TEHSIL" "981,590" "161,445"
3 "ABBOTTABAD CANTONMENT" "138,311" "21183"
4 "CHARGE NO 01" "138,311" "21183"
5 "CIRCLE NO 01" "12,150" "1847"
6 "023010101" "5,131" "705"
7 "023010102" "2,654" "435"
8 "023010103" "1,004" "173"
9 "023010104" "2,216" "349"
10 "023010105" "94" "14"
11 "023010106" "1,051" "171"
12 "CIRCLE NO 02" "15,383" "2435"
13 "023010201" "1,352" "211"
14 "023010202" "1,019" "161"
15 "023010203" "4,079" "691"
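The difference between the two separators is easiest to see on a single line. The following sketch uses one invented line (the amount of padding is an assumption; the label and numbers come from row 5 of the output above).

```r
# One toy line with leading padding and wide gaps between the columns.
line <- "        CIRCLE NO 01        12,150     1847"

# Splitting on any whitespace run tears the label apart and leaves an empty
# first element, because the line starts with a separator.
naive <- strsplit(line, "\\s+")[[1]]

# Trimming first and splitting on 2 or more whitespace characters keeps the
# label in one piece.
fields <- strsplit(trimws(line), "\\s{2,}")[[1]]
```

`naive` has six elements (an empty string followed by "CIRCLE", "NO", "01", "12,150", "1847"), while `fields` has the three intended columns. This only works as long as the gap between two columns never shrinks to a single space, which would have to be checked per file.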
df$level0 <- leveler(data, 0)
df$level1 <- leveler(data, 5)
df$level2 <- leveler(data, 11)
df$level3 <- leveler(data, 24)
df$level4 <- leveler(data, 37)
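The k values above are tied to the indentation of this particular file. As the question notes that the leading whitespace differs between files, it may help to look at the indentation widths that actually occur before choosing them. A minimal sketch, assuming invented toy lines whose widths were picked to sit just below the k values 5, 11, 24 and 37 used above:

```r
# Invented toy lines; the indent widths are assumptions, not measured values.
labels <- c("ABBOTTABAD DISTRICT", "ABBOTTABAD TEHSIL", "ABBOTTABAD CANTONMENT",
            "CHARGE NO 01", "CIRCLE NO 01", "023010101", "023010102")
widths <- c(0, 4, 10, 23, 36, 45, 45)
toy <- paste0(strrep(" ", widths), labels)

# Number of leading whitespace characters per line.
indent <- nchar(toy) - nchar(sub("^\\s+", "", toy))

# Distinct indentation widths and how often each occurs.
indent.counts <- table(indent)

# leveler(s, k) counts lines with fewer than k leading whitespace characters,
# so one more than each distinct width is a candidate value for k.
k.candidates <- sort(unique(indent)) + 1L
```

For the toy lines this yields 1, 5, 11, 24, 37 and 46. Whether every administrative level really has its own distinct indentation in each of the files is an assumption that needs checking, for example by inspecting `indent.counts` per file.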
# important helper function:
annotate.by.first.row <- function(df, col, col.title) {
  # take first row's column content and add it to the df as a column content
  info <- df[1, col]
  rowsn <- dim(df)[1]
  df.new <- df[2:rowsn, ]
  df.new[, col.title] <- info
  df.new
}
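A quick sketch of what the helper does to a single sub data frame. The toy data frame is built by hand from values in the output above, and the function is repeated so the snippet runs on its own.

```r
annotate.by.first.row <- function(df, col, col.title) {
  info <- df[1, col]
  rowsn <- dim(df)[1]
  df.new <- df[2:rowsn, ]
  df.new[, col.title] <- info
  df.new
}

# One group: a circle header row followed by two block rows.
toy <- data.frame(
  V1 = c("CIRCLE NO 01", "023010101", "023010102"),
  V2 = c("12,150", "5,131", "2,654"),
  V3 = c("1847", "705", "435"),
  stringsAsFactors = FALSE
)

# The header row is dropped and its label becomes a new column "circle".
res <- annotate.by.first.row(toy, 1, "circle")
```

`res` contains the two block rows with an extra column `circle` holding "CIRCLE NO 01". One caveat: for a group that consists of the header row only, `2:rowsn` evaluates to `2:1`, so such a group would need separate handling.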
# split data frame to a list of sub data frames
df.l0 <- split(df, df$level0)
# apply our helper function for annotation column generation
# using the information of the first row of the sub data frames
df.a0.l <- lapply(df.l0, annotate.by.first.row, 1, "district")
# cycle through: split, flatten, annotate.by.first.row
# to add next first row information as a column
df.s1.ll <- lapply(df.a0.l, function(df) split(df, df$level1))
df.s1.l <- unlist(df.s1.ll, recursive = FALSE)
df.a1.l <- lapply(df.s1.l, annotate.by.first.row, 1, "thesil")
# repeat the cycles ...
df.s2.ll <- lapply(df.a1.l, function(df) split(df, df$level2))
df.s2.l <- unlist(df.s2.ll, recursive = FALSE)
df.a2.l <- lapply(df.s2.l, annotate.by.first.row, 1, "cantonment")
df.s3.ll <- lapply(df.a2.l, function(df) split(df, df$level3))
df.s3.l <- unlist(df.s3.ll, recursive = FALSE)
df.a3.l <- lapply(df.s3.l, annotate.by.first.row, 1, "charge")
df.s4.ll <- lapply(df.a3.l, function(df) split(df, df$level4))
df.s4.l <- unlist(df.s4.ll, recursive = FALSE)
df.a4.l <- lapply(df.s4.l, annotate.by.first.row, 1, "circle")
# fuse subdata frames by `Reduce(rbind, ...)`
res.df <- Reduce(rbind, df.a4.l)
res.cleaned.df <- res.df[, c("district", "thesil", "cantonment", "charge", "circle", "V1", "V2", "V3")]
> res.cleaned.df
# district thesil cantonment charge
# 6 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01
# 7 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01
# 8 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01
# 9 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01
# 10 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01
# 11 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01
# 13 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01
# 14 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01
# 15 ABBOTTABAD DISTRICT ABBOTTABAD TEHSIL ABBOTTABAD CANTONMENT CHARGE NO 01
# circle V1 V2 V3
# 6 CIRCLE NO 01 023010101 5,131 705
# 7 CIRCLE NO 01 023010102 2,654 435
# 8 CIRCLE NO 01 023010103 1,004 173
# 9 CIRCLE NO 01 023010104 2,216 349
# 10 CIRCLE NO 01 023010105 94 14
# 11 CIRCLE NO 01 023010106 1,051 171
# 13 CIRCLE NO 02 023010201 1,352 211
# 14 CIRCLE NO 02 023010202 1,019 161
# 15 CIRCLE NO 02 023010203 4,079 691