Trouble extracting img src from multiple URLs with beautifulsoup4 on Python 3

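I'm trying to build a scraper that walks a list of product page URLs, parses each page, and extracts the img src URLs from the photo reel. Those images sit inside "li" elements under a "ul" element that has the unique class "bxslider". I was simply using soup.findAll('img'['src']), but the site has many other img src values that I don't need. I also need to exclude any "li" tag with the class "bx-clone".

I'm using Selenium, BeautifulSoup, and pandas.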

The HTML I need to scrape:

<ul class="bxslider" style="width: 1315%; position: relative; left: -410px;"><li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/55/bb/55bb9511-676b-4585-8cf2-99af9ba8baca.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/06/a7/06a700dd-8350-4932-88e9-c941e73e0def.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8c/92/8c92207d-c422-4d94-894c-911a5330e227.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/e0/22/e0224832-75f5-432a-a223-177ff7ffd03c.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8a/e1/8ae1e8d4-76a1-4161-9b17-b7a97e1779fc.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/fc/d5/fcd5a35b-8fb5-463e-9a47-804850f17825.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/98/2e/982ea3c5-ce28-49c8-bef5-b0f85bd99807.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/23/e1/23e153df-75af-4f1b-a4dd-3e0fb1e5a28f.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/94/02/940268f9-04ed-4650-bd9f-01b113b5059b.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li></ul>
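I'm completely lost, and fairly new to Python and to all of the modules I'm using. I need those image links in a single cell next to the corresponding product page URL, so that each row looks like this, with exactly one comma as the delimiter:
productpagelink,imagelink|imagelink|imagelink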
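
A minimal sketch of the row format described above, using only the standard library. The product URL, the image URLs, and the helper name `build_row` are hypothetical placeholders, and an in-memory buffer stands in for the output file:

```python
import csv
import io


def build_row(product_url, image_urls):
    """Return one CSV row: the product URL, then all image URLs joined by '|'."""
    return [product_url, "|".join(image_urls)]


# Hypothetical values, for illustration only.
rows = [
    build_row(
        "https://example.com/product/1",
        ["//cdn.example.com/a.jpg", "//cdn.example.com/b.jpg"],
    )
]

buffer = io.StringIO()  # stands in for an open file object
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Joining the image URLs with "|" before writing keeps them in one cell, so the only comma in the row is the delimiter between the two columns (as long as the URLs themselves contain no commas).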

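I included the pandas part at the end because, although my imgs list looks like it is being appended correctly, I don't want to leave any more room for error, and I suspect there may be an obvious tweak. If I've left out anything you need in order to help, let me know and I'll edit. Thanks, everyone!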


Edit: I can't share the URL because it sits behind a password-protected site; Selenium loads it fine and does iterate through each URL.

Probably an unpopular approach, but there's always the re module. It's a bit more work, but a lot more fun, too.

import re

html = """
<ul class="bxslider" style="width: 1315%; position: relative; left: -410px;"><li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/55/bb/55bb9511-676b-4585-8cf2-99af9ba8baca.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/06/a7/06a700dd-8350-4932-88e9-c941e73e0def.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8c/92/8c92207d-c422-4d94-894c-911a5330e227.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/e0/22/e0224832-75f5-432a-a223-177ff7ffd03c.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8a/e1/8ae1e8d4-76a1-4161-9b17-b7a97e1779fc.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/fc/d5/fcd5a35b-8fb5-463e-9a47-804850f17825.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/98/2e/982ea3c5-ce28-49c8-bef5-b0f85bd99807.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/23/e1/23e153df-75af-4f1b-a4dd-3e0fb1e5a28f.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/94/02/940268f9-04ed-4650-bd9f-01b113b5059b.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li></ul>
"""

# Retrieve only list elements based on given <ul> class
list_section_pattern = r'(?:<ul class="bxslider" .*?>)(?P<target>.*?)(?:</ul>)'
p = re.compile(list_section_pattern, flags = re.DOTALL | re.MULTILINE)
list_section = p.search(html).group("target")



# Match pattern to get all URLs; This is pretty straightforward.
href_pattern = r'<img src="(.*?)">'
p = re.compile(href_pattern)

# This should be a list of parsed URLs
urls = p.findall(list_section)


def get_root_url(url_path):
    """Split by forward-slash; Keep everything except image filename."""
    return "/".join(url_path.split(r"/")[:-1])


# Create a dictionary of url roots and image url lists.
url_dict = {}
for url in urls:
    root = get_root_url(url)
    if root not in url_dict:
        url_dict[root] = [url]
    else:
        url_dict[root].append(url)

# Output string for csv file
csv_string = ""
for k, v in url_dict.items():
    # .join() elements with vertical bar.
    tmp = " | ".join(v)
    csv_string += f"{k}, {tmp}\n" # Add a newline character

with open(r"C:\Users\niall\.spyder-py3\didthisworklol.csv", "w", encoding="utf-8") as csvf:
    csvf.write(csv_string)

You can then write the result to a file using the same method as above.
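
For comparison, here is a minimal BeautifulSoup sketch of the same extraction. It is not the code from the answer; it assumes beautifulsoup4 4.7 or later (where select() supports :not()), uses a shortened hypothetical HTML sample, and the helper name `slider_image_urls` is made up for illustration. In the scraper, `page_source` would presumably be Selenium's driver.page_source.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the slider markup in the question.
html = """
<ul class="bxslider">
  <li class="bx-clone"><img src="//cdn.example.com/c.jpg"></li>
  <li><img src="//cdn.example.com/a.jpg"></li>
  <li><img src="//cdn.example.com/b.jpg?timestamp=1"></li>
  <li><img src="//cdn.example.com/c.jpg"></li>
  <li class="bx-clone"><img src="//cdn.example.com/a.jpg"></li>
</ul>
<img src="//cdn.example.com/unrelated.jpg">
"""


def slider_image_urls(page_source):
    """Return img src values under ul.bxslider, skipping li.bx-clone items."""
    soup = BeautifulSoup(page_source, "html.parser")
    # Direct <li> children of the slider that do not carry the bx-clone class.
    imgs = soup.select("ul.bxslider > li:not(.bx-clone) img[src]")
    return [img["src"] for img in imgs]


image_urls = slider_image_urls(html)
print("|".join(image_urls))
```

Scoping the selector to ul.bxslider keeps unrelated images elsewhere on the page out of the result, and :not(.bx-clone) drops the duplicated slides that the slider adds at either end.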

Another approach

from simplified_scrapy import SimplifiedDoc, utils
html = """
<ul class="bxslider" style="width: 1315%; position: relative; left: -410px;"><li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/55/bb/55bb9511-676b-4585-8cf2-99af9ba8baca.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/06/a7/06a700dd-8350-4932-88e9-c941e73e0def.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8c/92/8c92207d-c422-4d94-894c-911a5330e227.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/e0/22/e0224832-75f5-432a-a223-177ff7ffd03c.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8a/e1/8ae1e8d4-76a1-4161-9b17-b7a97e1779fc.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/fc/d5/fcd5a35b-8fb5-463e-9a47-804850f17825.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/98/2e/982ea3c5-ce28-49c8-bef5-b0f85bd99807.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/23/e1/23e153df-75af-4f1b-a4dd-3e0fb1e5a28f.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/94/02/940268f9-04ed-4650-bd9f-01b113b5059b.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li></ul>
"""
doc = SimplifiedDoc(html)
images = doc.select('ul.bxslider').selects('img').src
rows = [[src] for src in images] # Change [] to [[]]
utils.save2csv('didthisworklol.csv',rows,newline='') # Save data to file
There are more examples here:

Can you share the URL? Why not call boxslaider.findAll('img'['src'])?

@AndrejKesely - Updated the post, but the pages I'm scraping are all behind a password-protected site.

What is the product page? Just the main URL fragment? E.g. //d1w0x2adoh4nzy.cloudfront.net/b5/48

@D-e-N While this did save me a few lines of code (thanks!), it didn't solve my problem. To continue debugging: I think the problem with my new code is in finding the ul tag with the bxslider class. Printing that variable basically gives me all of the HTML that follows it in the document. Hmm...

Ah, this looks good - I pushed it all into a for loop that fetches each product page's HTML and runs the code inside the loop, setting the html variable to the page's source, but I get an attribute error at list_section = p.search(html).group("target"). The error is: AttributeError: 'NoneType' object has no attribute 'group'

That probably means nothing was matched. It may mean the class attribute isn't where the pattern expects it, but you could do something like r'(?:<ul .*?class="bxslider" .*?>)(?P<target>.*?)(?:</ul>)', which just adds another wildcard before the "class" attribute.

@haise0 In any case, I added a section for BeautifulSoup if you want to take a look.
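
The AttributeError discussed in the comments comes from calling .group() on the None that search() returns when nothing matches. A minimal sketch of one way to guard against that, with a rough relaxed pattern that tolerates other attributes before class. The pattern and the helper name `slider_srcs` are assumptions for illustration, not the code from the answer, and this sketch does not filter out bx-clone items:

```python
import re

# Relaxed pattern: any attributes may appear before or after class="...bxslider...".
UL_PATTERN = re.compile(
    r'<ul\b[^>]*\bclass="[^"]*\bbxslider\b[^"]*"[^>]*>(?P<target>.*?)</ul>',
    flags=re.DOTALL,
)
IMG_PATTERN = re.compile(r'<img\b[^>]*\bsrc="(.*?)"')


def slider_srcs(page_source):
    """Return img src values inside the first bxslider <ul>, or [] if none."""
    match = UL_PATTERN.search(page_source)
    if match is None:
        # No slider on this page: skip it instead of raising AttributeError.
        return []
    return IMG_PATTERN.findall(match.group("target"))


print(slider_srcs("<p>no slider on this page</p>"))
```

Returning an empty list lets a loop over many product pages carry on when one page lacks the slider, and makes it easy to log which URLs produced no images.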
//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg
//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783
//d1w0x2adoh4nzy.cloudfront.net/55/bb/55bb9511-676b-4585-8cf2-99af9ba8baca.jpg
......