Parsing an XML sitemap with Python

python xml parsing


I have a sitemap with the following structure:

<sitemapindex>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
...

However, the list of elements comes back empty.
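One possible cause of an empty result is an XML namespace. The excerpt above does not show the root element's attributes, so this is an assumption: the sitemaps.org protocol puts `xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"` on the root, and a namespace-aware parser then stops matching unqualified tag names. The sketch below uses the standard library's `xml.etree.ElementTree` on a hypothetical inline sample (`SAMPLE` and the `sm` prefix are made up for illustration); it is not the code from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: same shape as the sitemap above, plus the xmlns
# attribute that the sitemaps.org protocol puts on the root element.
SAMPLE = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
</sitemapindex>"""

# Prefix-to-namespace mapping; the prefix name "sm" is arbitrary.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SAMPLE)

# An unqualified tag name matches nothing, because every element
# lives in the sitemap namespace.
unqualified = root.findall("sitemap")

# Qualifying the name with a prefix mapped to that namespace finds them.
qualified = root.findall("sm:sitemap", NS)

# Build {url: lastmod}; strip() removes the whitespace around the URL.
entries = {
    s.findtext("sm:loc", namespaces=NS).strip(): s.findtext("sm:lastmod", namespaces=NS)
    for s in qualified
}

print(len(unqualified), len(qualified))
print(entries)
```

If the real file has no namespace declaration, the unqualified lookup would work and the cause lies elsewhere.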

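I chose to use the requests and BeautifulSoup libraries. I created a dictionary where the key is the URL and the value is the last-modified date.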

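from bs4 import BeautifulSoup
import requests

xmlDict = {}

r = requests.get("http://www.site.co.uk/sitemap.xml")
xml = r.text

# Name the parser explicitly; the "xml" feature needs lxml installed.
soup = BeautifulSoup(xml, "xml")
sitemapTags = soup.find_all("sitemap")

print("The number of sitemaps is {0}".format(len(sitemapTags)))

# Map each sitemap URL to its last-modified date.
for sitemap in sitemapTags:
    # strip() drops the newlines and indentation around the URL inside <loc>
    xmlDict[sitemap.find("loc").text.strip()] = sitemap.find("lastmod").text

print(xmlDict)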

Alternatively, BeautifulSoup can be used to get the sitemap count and extract the text:

from bs4 import BeautifulSoup as bs

html = """
 <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
"""

soup = bs(html, "html.parser")
sitemap_count = len(soup.find_all('sitemap'))
print("sitemap count: %d" % sitemap_count)
print(soup.get_text())

Using Python 3, requests, pandas, and a list comprehension:

import requests
import pandas as pd
import xmltodict

url = "https://www.gov.uk/sitemap.xml"
res = requests.get(url)
raw = xmltodict.parse(res.text)

data = [[r["loc"], r["lastmod"]] for r in raw["sitemapindex"]["sitemap"]]
print("Number of sitemaps:", len(data))
df = pd.DataFrame(data, columns=["links", "lastmod"])
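One caveat with the list comprehension above: xmltodict returns a single dict, not a list, when an element occurs only once, and `lastmod` is optional in the sitemap protocol. A minimal sketch of a more defensive extraction follows. The helper names `as_list` and `sitemap_rows` are hypothetical, and the two input dicts are hand-written stand-ins for what `xmltodict.parse` would return, so the snippet runs without any network access.

```python
def as_list(value):
    """Wrap a lone item in a list; treat a missing value as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def sitemap_rows(parsed):
    """Return [loc, lastmod] pairs from an xmltodict-style mapping."""
    sitemaps = as_list(parsed["sitemapindex"].get("sitemap"))
    # .get() yields None instead of raising KeyError when lastmod is absent
    return [[s["loc"], s.get("lastmod")] for s in sitemaps]


# Hand-written stand-ins for xmltodict output (hypothetical URLs).
one = {"sitemapindex": {"sitemap": {"loc": "https://example.com/a.xml"}}}
many = {"sitemapindex": {"sitemap": [
    {"loc": "https://example.com/a.xml", "lastmod": "2018-11-06"},
    {"loc": "https://example.com/b.xml", "lastmod": "2018-11-06"},
]}}

print(sitemap_rows(one))
print(sitemap_rows(many))
```

The resulting rows can be passed to `pd.DataFrame(..., columns=["links", "lastmod"])` exactly as before.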

Output (the BeautifulSoup snippet first, then the first rows of the pandas DataFrame):

sitemap count: 2

    http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml

2015-07-07

    http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml

2015-07-07

    links                                       lastmod
0   https://www.gov.uk/sitemaps/sitemap_1.xml   2018-11-06T01:10:02+00:00
1   https://www.gov.uk/sitemaps/sitemap_2.xml   2018-11-06T01:10:02+00:00
2   https://www.gov.uk/sitemaps/sitemap_3.xml   2018-11-06T01:10:02+00:00
3   https://www.gov.uk/sitemaps/sitemap_4.xml   2018-11-06T01:10:02+00:00
4   https://www.gov.uk/sitemaps/sitemap_5.xml   2018-11-06T01:10:02+00:00

This function extracts all URLs from the XML:

from bs4 import BeautifulSoup
import requests

def get_urls_of_xml(xml_url):
    r = requests.get(xml_url)
    xml = r.text
    # Name the parser explicitly; the "xml" feature needs lxml installed.
    soup = BeautifulSoup(xml, "xml")

    links_arr = []
    for link in soup.find_all('loc'):
        # get_text(strip=True) returns the URL without the <loc> tags
        # or the surrounding whitespace
        links_arr.append(link.get_text(strip=True))

    return links_arr


links_data_arr = get_urls_of_xml("https://www.gov.uk/sitemap.xml")
print(links_data_arr)
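The same extraction can be sketched with only the standard library, which avoids the parser question entirely. This is an alternative to the function above, not a rewrite of it: `extract_locs` is a hypothetical name, the sample documents are made up, and the `{*}` wildcard in the path requires Python 3.8 or later. The wildcard matches `loc` whether or not the file declares a namespace.

```python
import xml.etree.ElementTree as ET


def extract_locs(xml_text):
    """Return the text of every <loc> element, in any namespace or none."""
    root = ET.fromstring(xml_text)
    # ".//{*}loc" finds loc elements at any depth, ignoring the namespace
    return [el.text.strip() for el in root.findall(".//{*}loc") if el.text]


# Hypothetical samples: one with the sitemaps.org namespace, one without.
WITH_NS = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc> https://example.com/a.xml </loc></sitemap>
  <sitemap><loc>https://example.com/b.xml</loc></sitemap>
</sitemapindex>"""

WITHOUT_NS = """<sitemapindex>
  <sitemap><loc> https://example.com/a.xml </loc></sitemap>
  <sitemap><loc>https://example.com/b.xml</loc></sitemap>
</sitemapindex>"""

print(extract_locs(WITH_NS))
print(extract_locs(WITHOUT_NS))
```

To use it on a live sitemap, pass the response body from `requests.get(...)`, for example `extract_locs(r.content)`.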


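A good StackOverflow question shows what you've already tried and how it failed. (I fully agree with Anand that lxml is the right tool for this job; if you tried it and ran into a problem, you would have grounds to ask a question here.)

Can [...] also be used, no?

@tandy, sure: it's built in, but on the other hand it doesn't have real XPath. I tend to ignore it for the latter reason.

lxml doesn't work; can anyone help me understand why?

An HTML parser for XML? I mean, it works, but it will be unnecessarily lenient.

@charlesduff I updated my answer... I had never used lxml before, so it took me a little while.

BeautifulSoup says that, since no parser was specified, it defaults to the lxml parser; changing soup = BeautifulSoup(xml) to soup = BeautifulSoup(xml, 'lxml') then works perfectly.

@Hyperion That may have changed since you wrote that comment, because as of today the default parser used by BeautifulSoup is html.parser.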
