Python: Why does Beautiful Soup add an extra XML declaration to the document, and how do I remove it?
I am trying to parse a simple XML document that already has a header. Here is the code:
str(BeautifulSoup("""
<?xml version="1.0" encoding="UTF-8"?>
<data/>
""", features='xml'))
When you pass xml as the features argument, lxml builds the XML tree itself, so you do not need to add the header yourself.
>>> str(BeautifulSoup("""
... <data/>
... """, features='xml'))
'<?xml version="1.0" encoding="utf-8"?>\n<data/>'
>>>
Is this a bug, or am I doing something wrong?
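Short answer: yes, you are doing something wrong.
How?
You get two XML declarations because you passed in the features argument, which Beautiful Soup uses to look up a tree builder and to set self.is_xml.
But that is not the whole story. self.is_xml is then used by the method that returns the string or Unicode representation of the document, and when self.is_xml is truthy that method prepends an XML declaration of its own to the output.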
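The effect of that branch can be reproduced outside the library. The following is a minimal sketch, not Beautiful Soup's code: xml_prefix and render are hypothetical helpers that mirror the "if self.is_xml:" excerpt quoted in this thread, and the "utf-8" default is an assumption chosen to match the output shown in the sessions here.

```python
def xml_prefix(is_xml, eventual_encoding="utf-8"):
    """Hypothetical helper mirroring the declaration logic quoted in this thread."""
    if not is_xml:
        return ""
    encoding_part = ""
    if eventual_encoding is not None:
        encoding_part = ' encoding="%s"' % eventual_encoding
    return '<?xml version="1.0"%s?>\n' % encoding_part


def render(body, is_xml, eventual_encoding="utf-8"):
    """Hypothetical stand-in for serialization: prefix plus the tree's own markup."""
    return xml_prefix(is_xml, eventual_encoding) + body


# If the serialized tree still carries the document's own declaration,
# the generated prefix is added on top of it and the output has two.
body_with_decl = '<?xml version="1.0" encoding="UTF-8"?><data/>'
print(render(body_with_decl, is_xml=True))

# With no declaration left in the tree, only the generated one appears.
print(render("<data/>", is_xml=True))
```

The point of the sketch is that the prefix is generated unconditionally for XML output, so whether the result contains one declaration or two depends only on whether the original declaration survived parsing as part of the tree.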
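In my application the XML already has a header. Is there an efficient way to remove it automatically, or to tell Beautiful Soup to ignore it?
I don't know; I would have to search. To remove the XML header from the string (the last line above), you could use something like:
str(soup).split("\n")[-1]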
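Taking the last line only works when the whole document sits on a single final line. As a slightly more robust alternative, here is a minimal sketch using only the standard library; strip_xml_declaration is a hypothetical helper, not part of Beautiful Soup or lxml, and it assumes the declaration, if present, is the first thing in the text apart from whitespace.

```python
import re

# One XML declaration at the very start of the text, with optional
# whitespace before and after it. The "\s" after "xml" keeps other
# processing instructions such as <?xml-stylesheet ...?> untouched.
_XML_DECL = re.compile(r'\A\s*<\?xml\s.*?\?>\s*', re.DOTALL)


def strip_xml_declaration(text):
    """Hypothetical helper: drop a leading XML declaration, if there is one."""
    return _XML_DECL.sub("", text, count=1)


print(strip_xml_declaration('<?xml version="1.0" encoding="utf-8"?>\n<data/>'))
print(strip_xml_declaration('<data/>'))
```

The same helper could be applied either to the input before parsing or to the string produced by str(soup), depending on which declaration you want to drop.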
beautifulsoup4==4.4.1
lxml==3.4.3
# Excerpt: the features argument is used to look up a tree builder,
# and is_xml is copied from that builder.
if builder is None:
    if isinstance(features, basestring):
        features = [features]
    if features is None or len(features) == 0:
        features = self.DEFAULT_BUILDER_FEATURES
    builder_class = builder_registry.lookup(*features)
    if builder_class is None:
        raise FeatureNotFound(
            "Couldn't find a tree builder with the features you "
            "requested: %s. Do you need to install a parser library?"
            % ",".join(features))
    builder = builder_class()
self.builder = builder
self.is_xml = builder.is_xml
self.builder.soup = self

# Excerpt: when is_xml is truthy, an XML declaration is generated
# as a prefix for the string representation.
if self.is_xml:
    # Print the XML declaration
    encoding_part = ''
    if eventual_encoding != None:
        encoding_part = ' encoding="%s"' % eventual_encoding
    prefix = u'<?xml version="1.0"%s?>\n' % encoding_part
...
>>> from bs4 import BeautifulSoup
>>> doc = '''<?xml version="1.0" encoding="UTF-8"?>
... <data/>'''
>>> soup = BeautifulSoup(doc, 'xml')
>>> str(soup)
'<?xml version="1.0" encoding="utf-8"?>\n<data/>'