
Python BeautifulSoup can't find the XML tags, how do I fix this?


Trying to scrape a Shopify site with BeautifulSoup; findAll('url') returns an empty list. How can I retrieve the content I'm after?

import requests
from bs4 import BeautifulSoup as soupify
import lxml

webSite = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml')
pageSource = webSite.text
webSite.close()

pageSource = soupify(pageSource, "xml")
print(pageSource.findAll('url'))
The page I'm trying to scrape: https://launch.toytokyo.com/sitemap_pages_1.xml

What I get: an empty list

What I should get: a non-empty list

Thanks everyone for the help; the problem in my code is solved. I was using the older findAll rather than find_all. You can do the following:

import requests
from bs4 import BeautifulSoup as bs

url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'

soup = bs(requests.get(url).content, 'html.parser')

# each <loc> element in the sitemap holds one page URL
urls = [i.text for i in soup.find_all('loc')]
So basically I grab the content and target the loc tags that hold the URLs, then take their text ;)
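
For this sitemap, printing urls should give the four page addresses (the same list that the ElementTree example further down produces):

print(urls)
# ['https://launch.toytokyo.com/pages/about',
#  'https://launch.toytokyo.com/pages/help',
#  'https://launch.toytokyo.com/pages/terms',
#  'https://launch.toytokyo.com/pages/visit-us']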

Update: working with the url tags as requested and building dictionaries:

urls = [i for i in soup.find_all('url')]

# for each <url> entry, collect its child tags as {tag name: text}, skipping whitespace-only strings
s = [[{k.name: k.text} for k in urls[i] if not isinstance(k, str)] for i, _ in enumerate(urls)]
Use from pprint import pprint as print to get a pretty-printed s:

print(s)
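
For this sitemap, s should come out as one list of single-key dicts per url entry, roughly like this (tag order follows the document; the values are the same ones shown in the grouped output below):

[[{'loc': 'https://launch.toytokyo.com/pages/about'},
  {'lastmod': '2018-07-26T14:37:12-07:00'},
  {'changefreq': 'weekly'}],
 ...]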

Note: you can use the lxml parser, since it is faster than html.parser. As an alternative to BeautifulSoup, you can always parse the XML with xml.etree.ElementTree and pull the URLs out of the loc tags:

from requests import get
from xml.etree.ElementTree import fromstring, ElementTree
from pprint import pprint

url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'

req = get(url)
tree = ElementTree(fromstring(req.text))

urls = []
for outer in tree.getroot():                    # each <url> entry
    for inner in outer:                         # <loc>, <lastmod>, <changefreq>
        namespace, tag = inner.tag.split("}")   # split off the sitemap namespace prefix
        if tag == 'loc':
            urls.append(inner.text)

pprint(urls)
which gives the following URLs in a list:

['https://launch.toytokyo.com/pages/about',
 'https://launch.toytokyo.com/pages/help',
 'https://launch.toytokyo.com/pages/terms',
 'https://launch.toytokyo.com/pages/visit-us']
From that, you can group your information as follows:
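
The grouping code itself is not reproduced on the page; a minimal sketch, assuming it parses the same sitemap with ElementTree as in the snippet above, that produces the output shown below:

from collections import defaultdict
from pprint import pprint
from requests import get
from xml.etree.ElementTree import ElementTree, fromstring

url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
tree = ElementTree(fromstring(get(url).text))

grouped = defaultdict(dict)
for i, outer in enumerate(tree.getroot()):      # one <url> entry per index
    for inner in outer:                         # <loc>, <lastmod>, <changefreq>
        _, tag = inner.tag.split("}")           # drop the sitemap namespace prefix
        grouped[i][tag] = inner.text

pprint(grouped)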

It gives the following defaultdict of dictionaries, keyed by index:

defaultdict(<class 'dict'>,
            {0: {'changefreq': 'weekly',
                 'lastmod': '2018-07-26T14:37:12-07:00',
                 'loc': 'https://launch.toytokyo.com/pages/about'},
             1: {'changefreq': 'weekly',
                 'lastmod': '2018-11-26T07:58:43-08:00',
                 'loc': 'https://launch.toytokyo.com/pages/help'},
             2: {'changefreq': 'weekly',
                 'lastmod': '2018-08-02T08:57:58-07:00',
                 'loc': 'https://launch.toytokyo.com/pages/terms'},
             3: {'changefreq': 'weekly',
                 'lastmod': '2018-05-21T15:02:36-07:00',
                 'loc': 'https://launch.toytokyo.com/pages/visit-us'}})
And this builds a different structure:
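
Again, the code that builds it is not shown on the page; a minimal sketch that keys by tag name instead, appending each value to a list:

from collections import defaultdict
from pprint import pprint
from requests import get
from xml.etree.ElementTree import ElementTree, fromstring

tree = ElementTree(fromstring(get('https://launch.toytokyo.com/sitemap_pages_1.xml').text))

by_tag = defaultdict(list)
for outer in tree.getroot():
    for inner in outer:
        _, tag = inner.tag.split("}")           # drop the sitemap namespace prefix
        by_tag[tag].append(inner.text)          # collect values per tag name

pprint(by_tag)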

defaultdict(<class 'list'>,
            {'changefreq': ['weekly', 'weekly', 'weekly', 'weekly'],
             'lastmod': ['2018-07-26T14:37:12-07:00',
                         '2018-11-26T07:58:43-08:00',
                         '2018-08-02T08:57:58-07:00',
                         '2018-05-21T15:02:36-07:00'],
             'loc': ['https://launch.toytokyo.com/pages/about',
                     'https://launch.toytokyo.com/pages/help',
                     'https://launch.toytokyo.com/pages/terms',
                     'https://launch.toytokyo.com/pages/visit-us']})

Another approach, using XPath:

import requests
from lxml import html
url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
tree = html.fromstring(requests.get(url).content)
links = [link.text for link in tree.xpath('//url/loc')]   # text of every <loc> under a <url>
print(links)

I tried to present it in the way you had already attempted. The only thing you need to correct is webSite.text: if you use webSite.content instead, you get a valid response.

Here is a corrected version of your existing attempt:

import requests
from bs4 import BeautifulSoup

webSite = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml')
pageSource = BeautifulSoup(webSite.content, "xml")
for k in pageSource.find_all('url'):
    link = k.loc.text
    date = k.lastmod.text
    frequency = k.changefreq.text
    print(f'{link}\n{date}\n{frequency}\n')

Try swapping "xml" for "html.parser" and see what it does. I'm not near a computer to try it right now, but that's the first thing I would check, to see what comes back. The second thing is to check whether the page loads dynamically; if so, look at using selenium or the requests-html library. If you are parsing with lxml, you need to pass "lxml" as the argument rather than "xml", so that your bs object uses lxml. Let us know whether that is enough to get things working.

@chitown88 Trying html.parser made no difference; I'm using the requests library to fetch the page source.

@lulian In an earlier version of the code I passed in lxml, but it had no effect.
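
For reference, the parser choices discussed in these comments would be passed like this (a sketch; the "xml" and "lxml" options both require the lxml package to be installed):

import requests
from bs4 import BeautifulSoup

pageSource = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml').content

soup_html = BeautifulSoup(pageSource, 'html.parser')  # built-in parser, lowercases tag names
soup_lxml = BeautifulSoup(pageSource, 'lxml')         # lxml's HTML parser
soup_xml = BeautifulSoup(pageSource, 'xml')           # lxml's XML parser, preserves case and namespaces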