
Parsing an XML sitemap with Python


I have a sitemap whose structure looks like this:

<sitemapindex>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
...
I chose to use the BeautifulSoup and requests libraries and built a dictionary in which the key is the URL and the value is the last-modified date, but the list of <sitemap> elements comes back empty:

from bs4 import BeautifulSoup
import requests

xmlDict = {}

r = requests.get("http://www.site.co.uk/sitemap.xml")
xml = r.text

soup = BeautifulSoup(xml)  # no parser specified; see the comments at the end
sitemapTags = soup.find_all("sitemap")

print("The number of sitemaps are {0}".format(len(sitemapTags)))

for sitemap in sitemapTags:
    xmlDict[sitemap.findNext("loc").text] = sitemap.findNext("lastmod").text

print(xmlDict)
Or with lxml:


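The lxml attempt itself is not shown above; the following is only a minimal sketch of what that approach can look like, not the original snippet. It assumes lxml.etree, and that the real file declares the standard sitemap namespace (the sample above omits the xmlns attribute):

from lxml import etree
import requests

r = requests.get("http://www.site.co.uk/sitemap.xml")
# fromstring() wants bytes when the document carries an XML encoding declaration
root = etree.fromstring(r.content)

# The sitemap schema lives in a namespace, which XPath must reference explicitly
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
locs = root.xpath("//sm:sitemap/sm:loc/text()", namespaces=ns)
lastmods = root.xpath("//sm:sitemap/sm:lastmod/text()", namespaces=ns)
for loc, lastmod in zip(locs, lastmods):
    print(loc.strip(), lastmod)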
Here, BeautifulSoup is used to get the sitemap count and extract the text:

from bs4 import BeautifulSoup as bs

html = """
 <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
"""

soup = bs(html, "html.parser")
sitemap_count = len(soup.find_all('sitemap'))
print("sitemap count: %d" % sitemap_count)
print(soup.get_text())

Using Python 3, requests, xmltodict, pandas, and a list comprehension:

import requests
import pandas as pd
import xmltodict

url = "https://www.gov.uk/sitemap.xml"
res = requests.get(url)
raw = xmltodict.parse(res.text)

data = [[r["loc"], r["lastmod"]] for r in raw["sitemapindex"]["sitemap"]]
print("Number of sitemaps:", len(data))
df = pd.DataFrame(data, columns=["links", "lastmod"])
Output of the pandas snippet:

    links                                       lastmod
0   https://www.gov.uk/sitemaps/sitemap_1.xml   2018-11-06T01:10:02+00:00
1   https://www.gov.uk/sitemaps/sitemap_2.xml   2018-11-06T01:10:02+00:00
2   https://www.gov.uk/sitemaps/sitemap_3.xml   2018-11-06T01:10:02+00:00
3   https://www.gov.uk/sitemaps/sitemap_4.xml   2018-11-06T01:10:02+00:00
4   https://www.gov.uk/sitemaps/sitemap_5.xml   2018-11-06T01:10:02+00:00

And the output of the earlier BeautifulSoup snippet:

sitemap count: 2

http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml

2015-07-07

http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml

2015-07-07
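One caveat on the xmltodict snippet, sketched here as an addition rather than part of the original answer: when the index contains exactly one <sitemap>, xmltodict parses it as a dict rather than a list, and the list comprehension above would iterate over its keys. A defensive variant:

sitemaps = raw["sitemapindex"]["sitemap"]
# xmltodict yields a dict (not a list) when there is exactly one <sitemap>
if isinstance(sitemaps, dict):
    sitemaps = [sitemaps]
# .get() tolerates entries that omit <lastmod>
data = [[s["loc"], s.get("lastmod")] for s in sitemaps]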

This function extracts all the URLs from the XML:

from bs4 import BeautifulSoup
import requests

def get_urls_of_xml(xml_url):
    """Return every <loc> URL found in the sitemap at xml_url."""
    r = requests.get(xml_url)
    # Name the parser explicitly to avoid bs4's GuessedAtParserWarning
    soup = BeautifulSoup(r.text, "html.parser")

    links_arr = []
    for link in soup.find_all("loc"):
        # .text reads the tag's content directly, instead of stripping
        # "<loc>"/"</loc>" off the string form of the tag
        links_arr.append(link.text.strip())

    return links_arr


links_data_arr = get_urls_of_xml("https://www.gov.uk/sitemap.xml")
print(links_data_arr)
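Since each <loc> in a sitemap index points at another sitemap file, the same function can be applied a second time to collect the page-level URLs. A small sketch on top of the function above (get_all_page_urls is a name introduced here, not part of the original answer):

def get_all_page_urls(index_url):
    # Fetch every sub-sitemap listed in the index and flatten
    # their <loc> entries into a single list of page URLs
    page_urls = []
    for sitemap_url in get_urls_of_xml(index_url):
        page_urls.extend(get_urls_of_xml(sitemap_url))
    return page_urls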


Comments:

A good StackOverflow question shows what you have already tried and how it failed. (I fully agree with Anand that lxml is the right tool for this job; if you had tried it and run into problems, you would have grounds for a question here.)

Couldn't you also use ElementTree? @tandy, sure; it is built in, but on the other hand it has no real XPath support, and for that reason I tend to pass it over.

lxml doesn't work; can anyone help me understand why?

An HTML parser for XML? I mean, it works, but it will be unnecessarily forgiving.

@charlesduff updated my answer... I had never used lxml before, so it took me a little while.

BeautifulSoup reported that, since none was specified, it was defaulting to the lxml parser; changing soup = BeautifulSoup(xml) to soup = BeautifulSoup(xml, 'lxml') then worked perfectly. @Hyperion it may have changed since you wrote that, because as of today the default parser BeautifulSoup uses is html.parser.
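In code, the fix described in that last exchange amounts to one changed line (a sketch; the 'lxml' parser requires the lxml package to be installed):

from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.site.co.uk/sitemap.xml")
# Pass the parser explicitly instead of relying on BeautifulSoup's default
soup = BeautifulSoup(r.text, "lxml")
print(len(soup.find_all("sitemap")))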
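For completeness, the built-in ElementTree route mentioned in the comments also works (a sketch added here; unlike lxml it needs no extra install, though it lacks full XPath). It assumes the real file declares the standard sitemap namespace:

import xml.etree.ElementTree as ET
import requests

r = requests.get("http://www.site.co.uk/sitemap.xml")
root = ET.fromstring(r.content)

# ElementTree needs the sitemap namespace spelled out in every query
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for sitemap in root.findall("sm:sitemap", ns):
    print(sitemap.findtext("sm:loc", namespaces=ns).strip(),
          sitemap.findtext("sm:lastmod", namespaces=ns))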