Reading text from a PDF via OCR in R


r regex imagemagick


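I am reading PDF data in R using Tesseract OCR. The data in my PDF documents looks similar to the image above. I want to read the address, and the gender, which here is Female. How can I read this correctly? I have not found any pointers on reading check boxes via OCR in R, and when I read the document the address is not assigned correctly.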

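Here is a possible answer to the question of how to recognize the check box. Without a larger sample of the PDF I can't be 100% sure it is correct. I downloaded a copy of your image to do this.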

library(tabulizer)
library(dplyr)
library(tibble)
library(tidyr)
library(readr)
library(stringr)

# Path to your PDF
pdf_file <- "wF8L3.pdf"

# Manually locate the area of text containing the check box.
# locate_areas() is interactive: select the region and it returns its coordinates.
locate_areas(pdf_file)

# [[1]]
#      top     left   bottom    right 
#  11.4404 197.3523  56.6117 316.4253 

# Copy and paste these coordinates into the extract_text() call below.

# Extract the PDF text from that area and split it into lines
txt <- 
  extract_text(pdf_file, area = list(c(11.4404, 197.3523, 56.6117, 316.4253))) %>% 
  read_lines()

# Turn the lines into a tibble and wrangle.
# This may not be the most efficient way and may need to be adapted for your use case.
# There is no magic: wrangling PDF data is a pain in my experience and takes a lot of time.
# Whether a box is ticked is determined by inspecting the character that the
# PDF reading process generates for the tick box.
# For this to work, that character needs to be consistent across your document.

tib <- 
  tibble("gen" = txt[1], 
         "mf" = txt[2]) %>%
  separate(col = mf, into = c("m", "f"), sep = "I") %>% 
  mutate(m_tick = case_when(str_detect(m, "o") ~ FALSE,
                           TRUE ~ TRUE),
         f_tick = case_when(str_detect(f, "El") ~ TRUE,
                           TRUE ~ FALSE))


Hopefully you can adapt this to your use case.
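As one way of adapting the answer, the tick-box check can be wrapped in a small function and tried on a sample line before running it on a whole document. This is only a minimal sketch in base R: the function name `parse_gender_line`, its arguments, and the sample line are hypothetical, and the markers "o" (empty box) and "El" (ticked box) are simply the ones used in the answer's code, so other PDFs will likely need different ones.

```r
# Hypothetical helper (not part of tabulizer or any other package).
# It mirrors the answer's logic: split the line at the separator, then treat
# the male box as ticked unless the empty-box marker is present, and the
# female box as ticked only if the ticked-box marker is present.
parse_gender_line <- function(line, sep = "I",
                              empty_mark = "o", ticked_mark = "El") {
  parts <- strsplit(line, sep, fixed = TRUE)[[1]]
  if (length(parts) != 2) {
    stop("expected exactly one separator in: ", line)
  }
  data.frame(
    m_tick = !grepl(empty_mark, parts[1], fixed = TRUE),
    f_tick = grepl(ticked_mark, parts[2], fixed = TRUE)
  )
}

# Made-up sample line (an assumption, not taken from the real PDF)
res <- parse_gender_line("o Male I El Female")
print(res)
```

Stopping when the separator count is unexpected makes it easier to spot pages where the extraction produced different characters than the ones assumed here.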

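Try tabulizer::extract_text. Check boxes may show up as special characters or codes. PDFs always need a lot of data wrangling.

The problem is that the full document is not in tabular form. It starts with some data in paragraph form, about 1-2 paragraphs, and then I have tabular data. After that, some data is again written in paragraphs, followed by another table.

@Peter: can you point me to some data wrangling that would solve this?

Using OCR, I converted the PDF data to text. How do I identify the check boxes in R? I'd rather not have to use the abbyyR package. Is there another way?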
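On the "is there another way" question: when the OCR text gives no reliable character for the box, one possible alternative is to look at the pixels of the box itself and call it ticked if enough of its interior is dark. The sketch below is not from the answer above. It assumes the page has already been rasterized into a grayscale matrix with values in [0, 1] and that the box coordinates are known in pixels (neither step is shown); `box_is_ticked`, its arguments, and the thresholds are hypothetical and would need tuning on real scans.

```r
# Hypothetical helper: decide whether a check box is ticked from pixel darkness.
# `gray` is a numeric matrix indexed [row, column], 0 = black, 1 = white.
# `margin` skips the box outline so that the border is not counted as a tick.
box_is_ticked <- function(gray, top, left, bottom, right,
                          margin = 2, dark_below = 0.5,
                          min_dark_fraction = 0.05) {
  stopifnot(bottom - top > 2 * margin, right - left > 2 * margin)
  rows <- (top + margin):(bottom - margin)
  cols <- (left + margin):(right - margin)
  interior <- gray[rows, cols, drop = FALSE]
  # Fraction of interior pixels darker than the cut-off
  mean(interior < dark_below) > min_dark_fraction
}

# Synthetic 20 x 20 box with a black outline and a white interior
empty_box <- matrix(1, nrow = 20, ncol = 20)
empty_box[c(1, 20), ] <- 0
empty_box[, c(1, 20)] <- 0

# The same box with an "X" drawn inside it
ticked_box <- empty_box
for (i in 3:18) {
  ticked_box[i, i] <- 0
  ticked_box[i, 21 - i] <- 0
}

empty_result <- box_is_ticked(empty_box, 1, 1, 20, 20)
ticked_result <- box_is_ticked(ticked_box, 1, 1, 20, 20)
print(c(empty = empty_result, ticked = ticked_result))
```

This only works if the box position is consistent from page to page, which is the same consistency requirement the text-based approach has.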