Python: Is it possible to use Beautiful Soup to programmatically combine the contents of certain HTML tags?

python html pdf beautifulsoup epub

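I am using a program called Calibre to convert PDF files into EPUB files, but the result is messy and unreadable. An EPUB file is really just a collection of HTML files, and the conversion comes out messy because Calibre interprets every line of the PDF file as a <p> element, which creates a lot of ugly line breaks in the EPUB file.

Since an EPUB is really a collection of HTML files, it can be parsed with Beautiful Soup. However, the program I wrote to look for elements with the "calibre1" class (a normal paragraph) and combine them into a single element (so that there are no ugly line breaks) does not work, and I don't know why.

Can Beautiful Soup handle what I am trying to do?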

import os
from bs4 import BeautifulSoup

path = "C:\\Users\\Eunice\\Desktop\\eBook"

for pathname, directorynames, filenames in os.walk(path):
    # Get all HTML files in the target directory
    for file_name in filenames:
        # Open each HTML file, which is encoded using the "Latin1" encoding scheme
        with open(pathname + "\\" + file_name, 'r', encoding="Latin1") as file:
            # Create a list, which we will write our new HTML tags to later
            html_elem_list: list = []
            # Create a BS4 object
            soup = BeautifulSoup(file, 'html.parser')
            # Create a list of all BS4 elements, which we will traverse in the proceeding loop
            html_elements = [x for x in soup.find_all()]

            for html_element in html_elements:
                try:
                    # Find the element with a class called "calibre1," which is how Calibre designates normal body text in a book
                    if html_element.attrs['class'][0] in 'calibre1':
                        # Combine the next element with the previous element if both elements are part of the same body text
                        if html_elem_list[-1].attrs['class'][0] in 'calibre1':
                            # Remove nonbreaking spaces from this element before adding it to our list of elements
                            html_elem_list[-1].string = html_elem_list[-1].text.replace(
                                '\n', '&nbsp;') + html_element.text
                    # This element must not be of the "calibre1" class, so add it to the list of elements without combining it with the previous element
                    else:
                        html_elem_list.append(html_element)
                # This element must not have any class, so add it to the list of elements without combining it with the previous element
                except KeyError:
                    html_elem_list.append(html_element)

            # Create a string literal, which we will eventually write to our resultant file
            str_htmlfile = ''
            # For each element in the list of HTML elements, append the string representation of that element (which will be a line of HTML code) to the string literal
            for elem in html_elem_list:
                    str_htmlfile = str_htmlfile + str(elem)
        # Create a new file with a distinct variation of the name of the original file, then write the resultant HTML code to that file
        with open(pathname + "\\" + '_modified_' + file_name, 'wb') as file:
            file.write(str_htmlfile.encode('Latin1'))

Here is a sample input:

<?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">

<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body></html>



Here is what I expect to happen:

<?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">

<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip.642</p>
</body></html>



Here is the actual output:

<html lang="" xml:lang="" xmlns="http://www.w3.org/1999/xhtml">
<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body></html><body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body><p class="calibre5" id="calibre_pb_62">Note for Tyler</p>


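A likely explanation for the duplicated output, offered as a small diagnostic sketch rather than a full fix: `soup.find_all()` with no arguments returns container tags such as <html> and <body> as well as the paragraphs inside them, and `in` between two strings is a substring test, so the class "calibre" on <body> passes the `in 'calibre1'` check. The trimmed-down `html` string below is a stand-in for the real file. Tracing the original loop with these two facts in mind, <html>, <body> and the "calibre5" paragraph are appended in full, while a "calibre1" paragraph that follows a "calibre5" paragraph is never appended at all.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one of the Calibre-generated files.
html = """<html><body class="calibre">
<p class="calibre5">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# find_all() with no arguments returns every tag, including containers,
# so str(<html>) and str(<body>) each repeat all the paragraphs inside them.
tag_names = [tag.name for tag in soup.find_all()]
print(tag_names)

# `in` between two strings checks for a substring, not equality,
# so <body class="calibre"> is treated as if it were body text.
body_matches = "calibre" in "calibre1"
p5_matches = "calibre5" in "calibre1"
print(body_matches, p5_matches)
```

Restricting the search to `soup.find_all("p")` and comparing classes with `==` avoids both effects.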

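This can be done with BeautifulSoup by using extract() to remove the unwanted <p> elements and then using new_tag() to create a new <p> tag containing the text of all the removed elements. For example: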
html = """<?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">

<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler1</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>

<p class="calibre5" id="calibre_pb_62">Note for Tyler2</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>

</body></html>"""

from bs4 import BeautifulSoup
from itertools import groupby
import re

soup = BeautifulSoup(html, "html.parser")

for level, group in groupby(soup.find_all("p", class_=re.compile(r"calibre\d")), lambda x: x["class"][0]):
    if level == "calibre1":
        calibre1 = list(group)
        p_new = soup.new_tag('p', attrs={"class" : "calibre1"})
        p_new.string = ' '.join(p.get_text(strip=True) for p in calibre1)
        calibre1[0].insert_before(p_new)

        for p in calibre1:
            p.extract()

print(soup.prettify())


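This will give you HTML with the following text content (the markup is not shown here):

Note for Tyler1
In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip. 642
Note for Tyler2
In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip. 642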

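It works by finding runs of calibre1 tags. For each run, it first combines the text from all of the tags in the run and inserts a new tag before the first one. It then removes all of the old tags.
The logic may need to be modified for more complex scenarios in your EPUB files, but this should help you get started.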
Question: Programmatically combine the contents of certain HTML tags

This example uses lxml to parse the XHTML file and build a new XHTML tree.
import io, os
from lxml import etree

XHTML = b"""<?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body></html>"""

class Calibre2EPUB(etree.iterparse):
    def __init__(self, fh):
        """
        Initialize 'iterparse' to only generate 'start' and 'end' events
        :param fh: File Handle from the XHTML File to parse
        """
        super().__init__(fh, events=('start', 'end'))
        self.parse()

    def element(self, elem, parent=None):
        """
        Copy 'elem' with attributes and text to new Element
        :param elem: Source Element
        :param parent: Parent of the new Element
        :return: New Element
        """
        if parent is None:
            e  = etree.Element(elem.tag, nsmap={None: etree.QName(elem).namespace})
        else:
            e = etree.SubElement(parent, elem.tag)

        [e.set(key, elem.attrib[key]) for key in elem.attrib]

        if elem.text:
            e.text = elem.text

        return e

    def parse(self):
        """
        Parse all Elements, copy Elements 1:1 except <p class:'calibre1' Element
        Aggregate all <p class:'calibre1' text to one Element
        :return: None
        """
        self.calibre1 = None

        for event, elem in self:
            if event == 'start':
                if elem.tag.endswith('html'):
                    self._xhtml = self.element(elem)

                elif elem.tag.endswith('body'):
                    self.body = self.element(elem, parent=self._xhtml)

            if event == 'end':
                if elem.tag.endswith('p'):
                    _class = elem.attrib['class']
                    if not _class == 'calibre1':
                        p = self.element(elem, parent=self.body)
                    else:
                        if self.calibre1 is None:
                            self.calibre1 = self.element(elem, parent=self.body)
                        else:
                            self.calibre1.text += ' ' + elem.text

    @property
    def xhtml(self):
        """
        :return: The new Element Tree XHTML
        """
        return etree.tostring(self._xhtml, xml_declaration=True, encoding='Latin1', pretty_print=True)

Output:
<?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, ... (omitted for brevity)to store her slip. 642</p>
</body></html>



Tested with Python: 3.5
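As written, the class above appears to collect every calibre1 paragraph in the file into the first one, since `self.calibre1` is never reset. For a file with several chapters, a variant that merges only consecutive paragraphs may be closer to what is wanted. The following is a minimal sketch of that idea, not the method used above: it edits the parsed tree in place instead of building a new one, the function name `merge_consecutive` is invented for the illustration, and it assumes the paragraphs contain plain text with no inline child elements.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"
P_TAG = "{%s}p" % XHTML_NS


def merge_consecutive(xhtml_bytes, cls="calibre1"):
    """Merge consecutive <p class=cls> siblings into the first one of each run."""
    root = etree.fromstring(xhtml_bytes)
    # Snapshot the elements first, because the tree is modified while looping
    for parent in list(root.iter()):
        current = None  # first paragraph of the run currently being built
        for child in list(parent):
            if child.tag == P_TAG and child.get("class") == cls:
                if current is None:
                    current = child
                else:
                    current.text = (current.text or "") + " " + (child.text or "")
                    parent.remove(child)
            else:
                current = None  # any other element ends the run
    return root


src = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body class="calibre">'
    b'<p class="calibre5">Note for Tyler</p>'
    b'<p class="calibre1">In the California registry, there was</p>'
    b'<p class="calibre1">a calm breeze blowing through the room.</p>'
    b'<p class="calibre5">Another note</p>'
    b'<p class="calibre1">642</p>'
    b'</body></html>'
)
root = merge_consecutive(src)
print(etree.tostring(root, pretty_print=True).decode())
```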
Can you provide some HTML and the expected output?

Thanks for the suggestion, QHarr. I've added the input, the expected output, and the actual output.

Pardon the redundancy, I meant to tag you, @QHarr.

Can we assume that in all cases there are only class="calibre5" and class="calibre1"?

Indeed, there will always be class="calibre1", but not always class="calibre5". Sometimes, instead of a class="calibre5", there will be a class="calibreX", where X can be anything from 2 to 4.