R: How to copy text from a PDF file to a text file without preserving the original formatting
I have a PDF file that I want to extract text from. However, I don't want to keep the PDF's spacing. I want the text to look as if I had manually copied and pasted the lines from the PDF. That would remove some nice-looking but unnecessary tabs and spacing from my text file. For example, if I extract the text with R in the usual way, I get something formatted like this:
This is the title
of this document
1.0 Hello my name is John and blah balh blah blah blah.
1.1 blah blah blah blah
If I simply copy and paste manually, I get something like this:
This is the title of this document
1.0 Hello my name is John and blah balh blah blah blah.
1.1 blah blah blah blah blah
I would like to know whether there is a way to do this with code in R, rather than copying and pasting by hand.
A real example is this PDF:
If I manually copy and paste part of page 228 (page 3 of the PDF),
I get:
Oil and the Macroeconomy since World War 11
James D. Hamilton
University (f/' Virgiiwa
All but one of the U.S. recessions since World War II have been
preceded, typically with a lag of around three-fourths of a year, by a
dramatic increase in the price of crude petroleum. This does not
mean that oil shocks caused these recessions. Evidence is presented,
however, that even over the period 1948-72 this correlation is statistically
significant and nonspurlious, supporting the proposition that
oil shocks were a contributing factor in at least some of the U.S.
recessions prior to 1972. By extension, energy price increases may
account for much of post-OPEC macroeconomic performance.
I. Introduction
The poor performance of the U.S. economy since 1973 is well documented:
1. The rate of growth of real GNP has fallen from an average of
4.0 percent during 1960-72 to 2.4 percent for 1973-81.
2. The 7.6 percent average inflation rate during 1973-81 was
more than double the 3.1 percent realized for 1960-72.
3. The average unemployment rate over 1973-81 of 6.7 percent
was higher than in any year between 1948 and 1972 with the single
exception of the recession of 1958.
This paper is drawn from chap. 2 of my Ph.D. dissertation at the University of
California, Berkeley. Earlier versions of this paper were presented at the NBER/NSF
This format is completely different from the one in the PDF.
Bounty:
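I made a mistake in the example I posted. If I copy and paste from the PDF document in Google Chrome, I get the output above. If I copy and paste from Microsoft Edge, I get the following: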
Oil and the Macroeconomy since World War 11
James D. Hamilton
University (f/' Virgiiwa
All but one of the U.S. recessions since World War II have been preceded, typically with a lag of around three-fourths of a year, by a dramatic increase in the price of crude petroleum. This does not mean that oil shocks caused these recessions. Evidence is presented, however, that even over the period 1948-72 this correlation is statis- tically significant and nonspurlious, supporting the proposition that oil shocks were a contributing factor in at least some of the U.S. recessions prior to 1972. By extension, energy price increases may account for much of post-OPEC macroeconomic performance.
I. Introduction
The poor performance of the U.S. economy since 1973 is well docu- mented: 1. The rate of growth of real GNP has fallen from an average of 4.0 percent during 1960-72 to 2.4 percent for 1973-81. 2. The 7.6 percent average inflation rate during 1973-81 was more than double the 3.1 percent realized for 1960-72. 3. The average unemployment rate over 1973-81 of 6.7 percent was higher than in any year between 1948 and 1972 with the single exception of the recession of 1958.
This paper is drawn from chap. 2 of my Ph.D. dissertation at the University of California, Berkeley. Earlier versions of this paper were presented at the NBER/NSF
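Sorry for the mistake. The previous answers were valid for the question as I originally asked it, but this is the type of output I am trying to get.

As far as I can tell, the difference is whether there is whitespace at the beginning of each line. You can use gsub in R to remove it. For example: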
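library(pdftools)
# pdf_text() returns one character string per page
doc <- "https://www.researchgate.net/profile/James_Hamilton11/publication/24108242_Oil_and_the_Macroeconomy_since_World_War_II/links/0c9605252c0916e709000000.pdf"
text <- pdf_text(doc)[[3]]  # page 3 of the PDF
# Drop spaces at the start of the text and after every newline;
# "\\1" puts back the newline (or nothing at the very start)
text_no_ws <- gsub("(^|\n) +", "\\1", text)
cat(text_no_ws)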
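If the goal is the Microsoft Edge style of output, where each paragraph ends up on a single line, stripping the leading whitespace may not be enough on its own. Below is a minimal base-R sketch of one way to go further and write the result to a text file. The sample string, the helper names strip_leading_ws and join_wrapped_lines, and the rule that paragraphs are separated by blank lines are all assumptions made for illustration; real pdf_text() output may need a different paragraph rule.

```r
# Hypothetical sample standing in for one page returned by pdftools::pdf_text();
# the indentation and line breaks are made up for illustration.
page <- paste(
  "      This is the title",
  "      of this document",
  "",
  "   1.0 Hello my name is John",
  "   and blah blah blah.",
  sep = "\n"
)

# Remove spaces and tabs at the start of the string and after every newline.
strip_leading_ws <- function(x) {
  gsub("(^|\n)[ \t]+", "\\1", x)
}

# Join the lines of each paragraph with a single space.
# Assumption: paragraphs are separated by at least one blank line.
join_wrapped_lines <- function(x) {
  paragraphs <- strsplit(x, "\n{2,}")[[1]]
  vapply(
    paragraphs,
    function(p) gsub("\n", " ", trimws(p)),
    character(1),
    USE.NAMES = FALSE
  )
}

clean <- join_wrapped_lines(strip_leading_ws(page))

# Write one paragraph per line to a text file (a temporary file here).
out_file <- tempfile(fileext = ".txt")
writeLines(clean, out_file)
```

With a real document, page would be replaced by one element of the pdf_text() result, and out_file by the path of the text file you want to create. Note that this sketch does not rejoin words hyphenated across line breaks.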
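I think you should show a real example of what happens when you copy and paste from a real .pdf file.
@Ista I added a real example.
Is this basically what you want?
Thanks for your help. I made a mistake in the output I chose to add; that is not what I wanted. I have updated my question with the desired output.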