Finding URLs that match a fixed string with Python + Selenium + BeautifulSoup

I have some image URLs, as shown below:

images = 
<img class="wni-logo" src="https://smtgvs.weathernews.jp/s/topics/img/wnilogo_kana@2x.png"/>
<img alt="top" id="top_img" src="//smtgvs.weathernews.jp/s/topics/img/201808/201808170115_top_img_A.jpg?1534474260" style="width: 100%;"/>
<img alt="box0" id="box_img0" src="//smtgvs.weathernews.jp/s/topics/img/201808/201808170115_box_img0_A.png?1534474573" style="width:100%"/>
<img alt="box1" class="lazy" data-original="https://smtgvs.weathernews.jp" id="box_img1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" style="width: 100%; display: none;"/>
<img alt="recommend thumb0" height="70" src="https://smtgvs.weathernews.jp/s/topics/thumb/article/201808080245_top_img_A_320x240.jpg?1534473603" width="100px"/>

I want to extract only these URLs:

['https://smtgvs.weathernews.jp/s/topics/img/201808/201808170115_top_img_A.jpg']
['https://smtgvs.weathernews.jp/s/topics/img/201808/201808170115_box_img0_A.png']

I tried the following code:

for image in images:
    imageURL = re.findall('https://smtgvs.weathernews.jp/s/topics/img/.+', urljoin(baseURL, image['src']))

    if imageURL:
        print(imageURL)

But I got these results instead. Could you help me correct it?

['https://smtgvs.weathernews.jp/s/topics/img/201808/201808170115_top_img_A.jpg?1534474260']
['https://smtgvs.weathernews.jp/s/topics/img/201808/201808170115_box_img0_A.png?1534474573']
['https://smtgvs.weathernews.jp/s/topics/img/dummy.png']

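You can simply change the regular expression to use a capture group, so that only the part before the trailing "?digits" is returned: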

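for image in images:
    # The capture group keeps everything before the "?<digits>" suffix;
    # [0-9]+/ requires a dated subfolder, which excludes dummy.png and the logo.
    imageURL = re.findall(r"(https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/.+)\?[0-9]+", urljoin(baseURL, image['src']))

    if imageURL:
        print(imageURL)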
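To make the effect of the capture group easier to check, here is a minimal standalone sketch that runs the same pattern over the src values from the question. The names BASE_URL, SRCS, PATTERN and matching_urls are illustrative, and the base URL is an assumption, since the thread never shows what baseURL was set to.

```python
import re
from urllib.parse import urljoin

# Assumed base URL (not given in the thread); it only supplies the
# "https:" scheme for the protocol-relative "//..." src values.
BASE_URL = "https://smtgvs.weathernews.jp/s/topics/"

# src values copied from the <img> tags in the question
SRCS = [
    "https://smtgvs.weathernews.jp/s/topics/img/wnilogo_kana@2x.png",
    "//smtgvs.weathernews.jp/s/topics/img/201808/201808170115_top_img_A.jpg?1534474260",
    "//smtgvs.weathernews.jp/s/topics/img/201808/201808170115_box_img0_A.png?1534474573",
    "https://smtgvs.weathernews.jp/s/topics/img/dummy.png",
    "https://smtgvs.weathernews.jp/s/topics/thumb/article/201808080245_top_img_A_320x240.jpg?1534473603",
]

# Group 1 is the URL without the trailing "?<digits>" query string.
PATTERN = re.compile(
    r"(https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/.+)\?[0-9]+"
)


def matching_urls(srcs, base_url=BASE_URL):
    """Return the captured URL for every src that matches PATTERN."""
    results = []
    for src in srcs:
        match = PATTERN.match(urljoin(base_url, src))
        if match:
            results.append(match.group(1))
    return results


if __name__ == "__main__":
    for url in matching_urls(SRCS):
        print(url)
```

Only the two images in the dated 201808 folder match; the logo, dummy.png and the thumbnail are skipped because they lack either the numeric folder or the /img/ path.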

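Edit: to get the data-original attribute instead of the src field, do the following:

from bs4 import BeautifulSoup

# html_doc is the page source, e.g. driver.page_source from Selenium
soup = BeautifulSoup(html_doc, 'html.parser')
for image in soup.find_all("img"):
    print(image.get("data-original"))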
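Since most of the tags in the question have no data-original attribute, the lookup returns None for them, so a fallback to src may be useful. The sketch below shows that fallback using the standard library's html.parser rather than BeautifulSoup, purely so it runs without extra packages; the class name, function name and example.invalid URLs are hypothetical. With BeautifulSoup, the same idea would be image.get("data-original") or image.get("src").

```python
from html.parser import HTMLParser


class ImgSourceCollector(HTMLParser):
    """Collect one URL per <img>, preferring data-original over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img .../> tags, because the default
        # handle_startendtag delegates to handle_starttag.
        if tag != "img":
            return
        attr_map = dict(attrs)
        url = attr_map.get("data-original") or attr_map.get("src")
        if url:
            self.urls.append(url)


def collect_image_urls(html):
    """Parse an HTML string and return the chosen URL of each <img> tag."""
    parser = ImgSourceCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


# Hypothetical markup modelled on the lazy-loading tags in the question
SAMPLE_HTML = """
<img alt="top" id="top_img" src="//example.invalid/top.jpg"/>
<img alt="box1" class="lazy" data-original="//example.invalid/box1.png"
     src="//example.invalid/dummy.png"/>
"""

if __name__ == "__main__":
    for url in collect_image_urls(SAMPLE_HTML):
        print(url)
```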

How can I get the URL from data-original instead of from src?

@mikezang I added an edit to my answer :) Use image.get("data-original")