Python: Parsing any HTML into XML with html5lib (Python, Xml, Html5lib) - Fatal编程技术网


python xml


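I need to tidy up HTML pages and convert them to XML with Python, dropping some "bad" parts if necessary.

For a while I used a tidying tool, but it does not understand the newer "article" and "footer" tags and does not like "meta" elements outside the head, which makes the resulting XML almost impossible to process.
So far I like what html5lib does, but my fifth test (a very odd one) fails. When parsing

<div attr="val"">

the result is not good for well-formed XML.
When I tried html5lib with lxml as the treebuilder, the input was converted to

<div attr="val" U00022="">

which is better, but the problem is that lxml "eats" the end tags/slashes of some tags, leaving them incomplete in the output XML.

What would you suggest using?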
You can use the method argument of etree.tostring to control whether an element is serialized as self-closing or not, like this:

>>> from lxml import etree
>>> tree = etree.Element('div', attrib={'attr': 'val', 'U00022': ''})
>>> etree.tostring(tree)
'<div U00022="" attr="val"/>'
>>> # serialize as a self-closing tag (XML)
>>> etree.tostring(tree, method='xml')
'<div U00022="" attr="val"/>'
>>> # serialize as normal HTML
>>> etree.tostring(tree, method='html')
'<div U00022="" attr="val"></div>'
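As a runnable variant of the session above, the following sketch compares the two serialization methods side by side. It assumes lxml is installed and uses encoding="unicode" (not part of the original answer) so that the results are str rather than bytes on Python 3. Attribute order in the output may differ between lxml versions, so the comments do not rely on it.

```python
from lxml import etree

# Same element as in the answer; the attribute names come from the question.
div = etree.Element("div", attrib={"attr": "val", "U00022": ""})

# encoding="unicode" asks lxml for a str instead of bytes.
as_xml = etree.tostring(div, method="xml", encoding="unicode")
as_html = etree.tostring(div, method="html", encoding="unicode")

print(as_xml)   # the empty element is written in self-closing form
print(as_html)  # the element is written with explicit start and end tags
```

The XML method is the one to pick when the output must be consumed by an XML parser, since the HTML method follows HTML serialization rules instead.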
Comment: Somehow method='html' did not help, but method='xml' works, thanks.
Comment: @alex29, that is strange! Anyway, I am glad it helped :-)

import html5lib
tree = html5lib.parse('<div attr="val" U00022="">', treebuilder='lxml', namespaceHTMLElements=False)
tree.write('yourfilename', method='html')

Output:

<html><head></head><body><div u00022="" attr="val"></div></body></html>
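Putting the pieces of this thread together, here is a minimal sketch of an HTML-to-XML round trip. It assumes html5lib and lxml are installed. The helper name html_to_xml is hypothetical and used only for illustration. The warning filter is there on the assumption that html5lib warns when it renames a name that XML cannot represent (such as the stray quote in the question). Re-parsing the result with etree.fromstring is one simple way to check that the output is well-formed.

```python
import warnings

import html5lib
from lxml import etree

BROKEN_HTML = '<div attr="val"">'  # malformed input from the question


def html_to_xml(markup):
    """Hypothetical helper: parse HTML with html5lib, return XML text."""
    with warnings.catch_warnings():
        # Silence any warning html5lib raises while coercing non-XML names.
        warnings.simplefilter("ignore")
        tree = html5lib.parse(
            markup, treebuilder="lxml", namespaceHTMLElements=False
        )
    # method="xml" serializes empty elements in self-closing form.
    return etree.tostring(tree, method="xml", encoding="unicode")


xml_text = html_to_xml(BROKEN_HTML)
root = etree.fromstring(xml_text)  # raises XMLSyntaxError if not well-formed
print(xml_text)
```

Setting namespaceHTMLElements=False, as in the answer above, keeps tag names plain ("div" rather than a namespaced name), which makes later lookups such as root.find(".//div") simpler.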


Copyright © 2024. All Rights Reserved by - Fatal编程技术网