Python: extracting headers and footers (text repeated on every page) from a document - Fatal编程技术网


python algorithm

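I parse PDF documents with various Python libraries and can turn each document into a list of pages (a list of strings). I want to automatically remove headers and footers, that is, substrings that repeat on almost every page (not necessarily on every page). I would rather not rely too much on geometry (such as looking at fixed positions). Assume no metadata is available.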

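I know about the difflib.SequenceMatcher class and similar tools, but they mainly work on pairs of strings. I would like to exploit the fact that the document has many pages rather than only doing pairwise comparisons.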

I am interested both in efficient algorithms and in Python tools, if any exist. Thanks for any hints.
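There is a Python library, PyMuPDF, that may help here. It knows nothing about headers and footers as such, but it lets you extract a large dictionary of metadata that you can analyze. I had a similar problem: I only wanted to extract the title of each page of a PDF file. I used this metadata, which includes information about the text such as font size and font name. In my case, each title was set in a larger font than the rest of the text on the same page, so I used that to extract it.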
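The font-size idea above could look roughly like the sketch below. It assumes the dictionary layout returned by PyMuPDF's page.get_text("dict") (blocks, then lines, then spans carrying "size", "font" and "text"). The helper names largest_font_text and titles_per_page are made up for illustration, and "largest font on the page" is only the heuristic described in this answer, not a general rule.

```python
def largest_font_text(page_dict):
    """Return the text of the spans set in the largest font size on one page.

    `page_dict` is assumed to follow the layout of PyMuPDF's
    page.get_text("dict"): blocks -> lines -> spans.
    """
    spans = [
        span
        for block in page_dict.get("blocks", [])
        if block.get("type") == 0  # 0 = text block, 1 = image block
        for line in block.get("lines", [])
        for span in line.get("spans", [])
        if span.get("text", "").strip()
    ]
    if not spans:
        return ""
    max_size = max(span["size"] for span in spans)
    # Sizes are floats, so compare with a small tolerance.
    return " ".join(
        span["text"].strip() for span in spans if max_size - span["size"] < 0.1
    )


def titles_per_page(pdf_path):
    """Hypothetical helper: one candidate title per page of a PDF file."""
    import fitz  # PyMuPDF; imported lazily so largest_font_text works without it

    doc = fitz.open(pdf_path)
    try:
        return [largest_font_text(page.get_text("dict")) for page in doc]
    finally:
        doc.close()
```

Calling titles_per_page("some.pdf") would return one string per page. Pages without a text layer give an empty string.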
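Thanks, @Sharmiko. That can be useful at times, but I am mainly dealing with scanned documents that contain only images plus an OCR'ed (invisible) text layer. Font information there may depend on OCR quality and configuration, so I would not rely on it too much. I am interested in an algorithm that can find the "most frequently repeated blocks" across, say, 100 pages. For "digital" PDFs, though, your suggestion is definitely useful.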
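One possible way to find the "most frequently repeated blocks" across many pages, using only the standard library, is sketched below. It is not taken from the thread and rests on several assumptions: headers and footers sit within the first or last k non-empty lines of a page, page numbers can be masked by replacing digit runs, and OCR noise is small enough for difflib.SequenceMatcher to still score two variants above min_ratio. The function names and default thresholds are illustrative, not tuned.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Lowercase, mask digit runs (e.g. page numbers) and collapse whitespace."""
    line = re.sub(r"\d+", "#", line.lower())
    return re.sub(r"\s+", " ", line).strip()


def find_repeated_lines(pages, k=3, min_fraction=0.6, min_ratio=0.85):
    """Return normalized lines that recur near the top/bottom of many pages."""
    clusters = []  # each entry: [representative_line, set_of_page_indices]
    for idx, page in enumerate(pages):
        lines = [normalize(raw) for raw in page.splitlines()]
        lines = [line for line in lines if line]
        # Ordered de-duplication of the candidate lines at both page edges.
        candidates = dict.fromkeys(lines[:k] + lines[-k:])
        for cand in candidates:
            for cluster in clusters:
                if SequenceMatcher(None, cluster[0], cand).ratio() >= min_ratio:
                    cluster[1].add(idx)
                    break
            else:
                clusters.append([cand, {idx}])
    needed = min_fraction * len(pages)
    return [rep for rep, seen in clusters if len(seen) >= needed]


def strip_repeated_lines(pages, k=3, min_fraction=0.6, min_ratio=0.85):
    """Drop edge lines that match a repeated line; blank lines are dropped too."""
    repeated = find_repeated_lines(pages, k, min_fraction, min_ratio)

    def is_repeated(line):
        norm = normalize(line)
        return any(
            SequenceMatcher(None, rep, norm).ratio() >= min_ratio
            for rep in repeated
        )

    cleaned = []
    for page in pages:
        lines = [raw for raw in page.splitlines() if raw.strip()]
        edge = set(range(k)) | set(range(max(len(lines) - k, 0), len(lines)))
        kept = [
            raw for i, raw in enumerate(lines)
            if not (i in edge and is_repeated(raw))
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

Each candidate line is compared against one representative per cluster rather than against every other page, so the many pages are used for voting instead of for all-pairs comparison. Dropping the k restriction and treating every line as a candidate would remove the positional assumption at the cost of more comparisons.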
