Get text between two closing tags in XML (Python)
I downloaded my Foursquare data, which comes in KML format. I'm parsing it as an XML file with Python, and I can't figure out how to get the text between the closing a tag and the closing description tag. (This is the text I typed when checking in; in the example below it is "FINALLY HERE!! With Sonya and co", though it is preceded by a hyphen.)
This is a sample of what the data looks like:
<Placemark>
<name>hummus grill</name>
<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
<updated>Tue, 24 Jan 12 17:14:00 +0000</updated>
<published>Tue, 24 Jan 12 17:14:00 +0000</published>
<visibility>1</visibility>
<Point>
<extrude>1</extrude>
<altitudeMode>relativeToGround</altitudeMode>
<coordinates>-75.20104383595685,39.9528387056977</coordinates>
</Point>
</Placemark>
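I tried this:
for d in dom.getElementsByTagName('description'):
    description.append(d.firstChild.data.encode('utf-8'))
(The data begins with a header reading "foursquare checkin history", which I haven't figured out how to handle yet.)
I then tried accessing it with d.firstChild.nextSibling.firstChild.data.encode('utf-8'), but that only gave me "hummus grill", which I assume is the text inside the a tag (rather than the text from the name tag).

Have you tried using substrings? For example, suppose all of the XML is in a variable foo: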
foo = '<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>'
Once you read up on substrings, you will be able to work with the text much more easily.
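The substring idea above can be sketched as follows. This is only a minimal illustration: the helper name text_after_link is hypothetical (it is not from the thread), and plain string slicing does not decode XML entities such as &amp;, so it assumes the check-in text contains none.

```python
# The sample <description> element from the question, held as a plain string.
foo = ('<description>@<a href="https://foursquare.com/v/hummus-grill/'
       '4aab4f71f964a520625920e3">hummus grill</a>'
       '- FINALLY HERE!! With Sonya and co</description>')


def text_after_link(xml_text):
    """Return the text between '</a>' and '</description>', or None.

    Hypothetical helper for illustration; it uses only str.find and slicing.
    """
    start_marker = '</a>'
    end_marker = '</description>'
    start = xml_text.find(start_marker)
    if start == -1:
        return None  # no link in this description (e.g. a file header)
    start += len(start_marker)
    end = xml_text.find(end_marker, start)
    if end == -1:
        return None  # the closing description tag is missing
    return xml_text[start:end]


shout = text_after_link(foo)
print(shout)
```

Returning None when a marker is missing means a description without a link, such as the header at the top of the file, can be skipped instead of raising an error.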

The following works for me:
In [44]: description = []
In [45]: for d in dom.getElementsByTagName('description'):
   ....:     description.append(d.firstChild.nextSibling.nextSibling.data.encode('utf-8'))
   ....:
In [46]: description
Out[46]: ['- FINALLY HERE!! With Sonya and co']
Or, if you want the entire text inside the description tag:
from xml.dom.minidom import parseString

def getText(node, recursive=False):
    """
    Get all the text associated with this node.
    With recursive == True, all text from child nodes is retrieved
    """
    L = ['']
    for n in node.childNodes:
        # Node type constants are available on every node, so no global is needed.
        if n.nodeType in (n.TEXT_NODE, n.CDATA_SECTION_NODE):
            L.append(n.data)
        else:
            if not recursive:
                return None
            # Pass the flag down so deeper nesting is handled as well.
            L.append(getText(n, recursive=True))
    return ''.join(L)

dom = parseString("""<Placemark>
<name>hummus grill</name>
<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
<updated>Tue, 24 Jan 12 17:14:00 +0000</updated>
<published>Tue, 24 Jan 12 17:14:00 +0000</published>
<visibility>1</visibility>
<Point>
<extrude>1</extrude>
<altitudeMode>relativeToGround</altitudeMode>
<coordinates>-75.20104383595685,39.9528387056977</coordinates>
</Point>
</Placemark>""")
description = []
for d in dom.getElementsByTagName('description'):
    description.append(getText(d, recursive=True))
print(description)
This prints (under Python 2; Python 3 shows the same string without the u prefix):
[u'@hummus grill- FINALLY HERE!! With Sonya and co']
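As an alternative to the minidom approach in this answer, the standard library's xml.etree.ElementTree exposes the text that follows an element's closing tag as that element's tail attribute, which is exactly the text between the closing a tag and the closing description tag. The sketch below is a swapped-in technique, not what the answer itself uses, and it assumes the elements carry no XML namespace, as in the sample; a real KML file that declares a default namespace would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample Placemark from the question (no namespace).
SAMPLE = """<Placemark>
<name>hummus grill</name>
<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
</Placemark>"""

root = ET.fromstring(SAMPLE)

shouts = []
for desc in root.iter('description'):
    link = desc.find('a')
    # Descriptions without a link (such as a file header) are skipped.
    if link is not None and link.tail:
        # .tail is the text after </a>, up to the end of the parent element.
        shouts.append(link.tail)

print(shouts)
```

Checking that the link exists before reading its tail also sidesteps the header description, which has no a element inside it.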
So do I need to convert the DOM element to a substring? Or are you suggesting a completely different route?
Yes. Putting the whole DOM element into a variable lets you easily go back and separate out the parts you need. Substrings tend to be a simple way to parse text.
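For the conversion asked about in the comment above, one possible route is minidom's toxml() method, which serializes a node back to a string that can then be split. This is a rough sketch under the assumption that the description contains exactly one link; the variable names are illustrative only.

```python
from xml.dom.minidom import parseString

dom = parseString(
    '<description>@<a href="https://foursquare.com/v/hummus-grill/'
    '4aab4f71f964a520625920e3">hummus grill</a>'
    '- FINALLY HERE!! With Sonya and co</description>'
)
element = dom.getElementsByTagName('description')[0]

# Serialize the DOM element back into a plain string.
as_text = element.toxml()

# Split on the closing link tag, then drop the closing description tag.
closing = '</description>'
before, sep, after = as_text.partition('</a>')
if sep and after.endswith(closing):
    shout = after[:-len(closing)]
else:
    shout = None  # no link found, or an unexpected layout

print(shout)
```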