Python BeautifulSoup: how to get tags and text from an XML file without knowing all the tag names
I have an XML file like this:
<a>
    <b>1</b>
    <c>2</c>
    <d>
        <e>3</e>
    </d>
</a>
<a>
    <c>4</c>
    <f value ="something">5</f>
    <g value = "other"></g>
</a>
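It is a big XML file and not a standard one, so all I know is that the <a> tag exists, and I want all the information contained inside that tag.
I have already tried BeautifulSoup4, but I can only retrieve the text part.
My code: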
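from bs4 import BeautifulSoup

def ProcessXml(xmlFile):
    # Read the whole file and parse it with the 'xml' parser (needs lxml).
    with open(xmlFile, 'r') as infile:
        contents = infile.read()
    soup = BeautifulSoup(contents, 'xml')
    units = soup.find_all('a')
    unitsList = []
    for i in units:
        # .text only gives the concatenated text of the unit, so the tag
        # names and attributes are lost at this point.
        resultType = i.text, i.next_sibling
        resultType = resultType[0].splitlines()
        # Drop empty lines (built as a new list rather than removing
        # items from the list while iterating over it).
        resultType = [j for j in resultType if j != '']
        unitsList.append(resultType)
    return unitsList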
My output:
[['1','2','3'],['4','5']]
This is pretty bad code, but it does the job:
from bs4 import BeautifulSoup, Tag

def len_descendant(desc):
    # Count the descendants of a node, ignoring empty strings and bare
    # newlines. Nodes without descendants count as 0.
    counter = 0
    try:
        for i in desc.descendants:
            if i != '' and i != '\n':
                counter += 1
    except Exception:
        pass
    return counter

def ProcessXml():
    with open("xmlfile.xml", 'r') as infile:
        contents = infile.read()
    soup = BeautifulSoup(contents, 'lxml')
    units = soup.find_all('a')
    unitsList = []
    for i in units:
        this_dict = {}
        for desc in i.descendants:
            print(desc, len_descendant(desc))
            # Only tags can carry attributes; text nodes cannot.
            has_attribute = isinstance(desc, Tag) and desc.has_attr('value')
            # Keep leaf tags (exactly one descendant: their text) and
            # any tag that has a 'value' attribute.
            if len_descendant(desc) == 1 or has_attribute:
                if has_attribute:
                    key = desc.name + " " + list(desc.attrs.keys())[0] + '="' + list(desc.attrs.values())[0] + '"'
                else:
                    key = desc.name
                try:
                    value = int(desc.text)
                except Exception:
                    value = None
                this_dict[key] = value
        unitsList.append(this_dict)
    return unitsList

my_dict = ProcessXml()
The result is:
[{'c': 2, 'b': 1, 'e': 3}, {'f value="something"': 5, 'c': 4, 'g value="other"': 1}]
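Note: as MYGz mentioned, the <g value = "other"></g> part is not valid, so I assumed it should be <g value = "other">1</g>. This is the XML file I tried the function on: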
<a>
    <b>1</b>
    <c>2</c>
    <d>
        <e>3</e>
    </d>
</a>
<a>
    <c>4</c>
    <f value ="something">5</f>
    <g value = "other">1</g>
</a>
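For comparison, here is a minimal sketch of the same idea using the standard library's xml.etree.ElementTree instead of BeautifulSoup. It is not the code from the answer above. The names SAMPLE, to_int_or_none and leaves_by_unit are made up for illustration, and it assumes the input has no XML declaration, so the fragment can be wrapped in a synthetic <root> element (the file has several top-level <a> elements, which a strict parser would otherwise reject).

```python
import xml.etree.ElementTree as ET

# Sample data copied from the XML shown above.
SAMPLE = """
<a>
    <b>1</b>
    <c>2</c>
    <d>
        <e>3</e>
    </d>
</a>
<a>
    <c>4</c>
    <f value ="something">5</f>
    <g value = "other">1</g>
</a>
"""

def to_int_or_none(text):
    """Return int(text) when possible, otherwise None (hypothetical helper)."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None

def leaves_by_unit(xml_fragment, unit_tag="a"):
    """Build one dict per <unit_tag> element from its leaf elements."""
    # Assumption: wrapping in a synthetic root makes the fragment well-formed.
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    units = []
    for unit in root.iter(unit_tag):
        record = {}
        for elem in unit.iter():
            # Skip the unit itself and any element that has children.
            if elem is unit or len(elem) > 0:
                continue
            # The key is the tag name followed by every attribute.
            key = elem.tag
            for name, value in elem.attrib.items():
                key += ' {}="{}"'.format(name, value)
            record[key] = to_int_or_none(elem.text)
        units.append(record)
    return units

print(leaves_by_unit(SAMPLE))
```

With this approach no tag name other than the unit tag has to be known in advance, and an element with no text, such as <g value = "other"></g>, ends up with the value None.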
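Comments:
This is possible in bs4. Which part isn't working for you?
I'm new to bs4 and Python; all I know is: units = soup.find_all('a'); for i in units: resultType = i.text, i.next_sibling. Thanks @tankorsmash
Can you edit that into your question? It will help others answer you. Unformatted code is hard to read.
Done! Thanks @tankorsmash
Have you tried it?
Thanks a lot @Stergios. One more thing: I really need to return something like {'g value="other"': None}, but I can't figure out how.
I have updated the code above. It is the worst code ever, but it gets the job done :)
Thanks a lot!!
We posted code at the same time, but yours looks better :)

This is the code I will use. It is an adaptation of the code written by @Stergios (for Python 3).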
<a>
    <b>1</b>
    <c>2</c>
    <d>
        <e>3</e>
    </d>
</a>
<a>
    <c>4</c>
    <f value ="something">5</f>
    <g value = "other">1</g>
</a>
from bs4 import BeautifulSoup

def len_descendant(desc):
    # Count the descendants of a node, ignoring empty strings and bare
    # newlines. Nodes without descendants count as 0.
    counter = 0
    try:
        for i in desc.descendants:
            #print (i)
            if i != '' and i != '\n':
                counter += 1
    except Exception:
        pass
    return counter

def ProcessXmlTwo(fileName):
    with open(fileName, 'r') as infile:
        contents = infile.read()
    soup = BeautifulSoup(contents, 'xml')
    units = soup.find_all('a')
    #print (units)
    unitsList = []
    for i in units:
        this_dict = {}
        for desc in i.descendants:
            flagNoKey = False
            # Keep leaf tags, plus anything whose markup contains a
            # self-closing value="true" or value="false" element.
            if len_descendant(desc) == 1 or 'value="true"/>' in str(desc) or 'value="false"/>' in str(desc):
                if desc.has_attr('value'):
                    #print (desc.name)
                    key = desc.name + " " + list(desc.attrs.keys())[0] + '="' + list(desc.attrs.values())[0] + '"'
                elif 'value="true"/>' in str(desc) or 'value="false"/>' in str(desc):
                    flagNoKey = True
                    key = desc.name + " " + list(desc.attrs.keys())[0] + '="' + list(desc.attrs.values())[0] + '"'
                else:
                    key = desc.name
                if not flagNoKey:
                    this_dict[key] = desc.text
                else:
                    this_dict[key] = "None"
        unitsList.append(this_dict)
    return unitsList
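
If the nesting also matters (for example, that <e> sits inside <d>), a recursive conversion may be easier to follow than string checks on the markup. The following is only a sketch with the standard library's xml.etree.ElementTree, not part of either answer. element_to_dict, parse_units and FRAGMENT are hypothetical names, the sample with self-closing value="true" elements is made up to resemble what the code above seems to expect, and the synthetic <root> wrapper again assumes there is no XML declaration in the input.

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Convert an element into a plain dict: tag, attributes, text, children."""
    text = (elem.text or "").strip()
    return {
        "tag": elem.tag,
        "attrs": dict(elem.attrib),
        "text": text or None,  # empty or self-closing elements give None
        "children": [element_to_dict(child) for child in elem],
    }

def parse_units(xml_fragment, unit_tag="a"):
    """Return one nested dict per top-level <unit_tag> element."""
    # Assumption: a synthetic root makes a multi-root fragment well-formed.
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    return [element_to_dict(unit) for unit in root.findall(unit_tag)]

# Made-up sample with self-closing elements.
FRAGMENT = '<a><c>4</c><g value="other"/><h value="true"/></a>'

units = parse_units(FRAGMENT)
for child in units[0]["children"]:
    print(child["tag"], child["attrs"], child["text"])
```

Every tag name, attribute and text value is kept without knowing the names beforehand, and elements without text report None rather than the string "None".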