Python 选择并剥离HTML字符串中的img src_Python

Python 选择并剥离HTML字符串中的img src

python

Python 选择并剥离HTML字符串中的img src,python,Python,我感兴趣的是从python中表示为字符串的文本块中的图像标记中剥离s3信条对于字符串中的每个标记（可以有很多），我希望从“.jpeg”开始，在引号的下一个实例结束，并删除这些位置之间的所有内容例如，以下字符串： <p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jp

我感兴趣的是从python中表示为字符串的文本块中的图像标记中剥离s3信条

对于字符串中的每个标记（可以有很多），我希望从“.jpeg”开始，在引号的下一个实例结束，并删除这些位置之间的所有内容

例如，以下字符串：

<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>



这是正文中的额外文本

将成为：

<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>



这是正文中的额外文本

我正在努力想办法做到这一点。任何帮助都将不胜感激

谢谢

假设字符串存储在

中：

import re

re.sub('\.jpeg[^\"]+\"', '.jpeg', s)

这将查找以“.jpeg”开头、以引号结尾的区域，并将其替换为空字符串。

使用

re

可以找到并删除

？

和

之间的所有内容。

示例代码

text = '<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'
expected_result = '<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'

import re

result = re.sub('\?[^"]+', '', text)

print(result == expected_result) # True

使用

BeautifulSoup

解析html，然后使用

urlparse

Ex:

from bs4 import BeautifulSoup
try:
    from urllib.parse import urlparse #python3
except:
    from urlparse import urlparse #python2


html = """<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>"""
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):   #Find all img tags
    o = urlparse(img["src"])       #Get URL
    print(o.scheme + "://" + o.netloc + o.path)

https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg

Regex不是该作业的工具。一个更强大的解决方案是使用HTML解析器（如BeautifulSoup）提取

img

标记的

src

属性，并使用URL解析器从URL中删除查询：

from bs4 import BeautifulSoup
from urllib.parse import urlsplit

input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'''

soup = BeautifulSoup(input_str, "html.parser")
img_url = soup.find('img')['src']
new_url = urlsplit(img_url)._replace(query=None).geturl()
soup.find('img')['src'] = new_url
print(soup)

这将更新每个

img

标签的

src

属性：

<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>
<img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/><p><br/></p><p> This is extra text in the body.</p>



这是正文中的额外文本


这是正文中的额外文本

到目前为止您尝试了什么？为什么不在“？“然后使用索引0从列表中获取第一项？我想我必须在这里拆分，这是更大的xml@JasonHoward的一部分吗？”？如果是，您可以使用xml解析器使您的生活变得轻松！不，不是。它基本上只是一篇短博客文章的内容，用字符串表示。它包含html的事实应该无关紧要。我会试试这个。有没有办法修改它，让我们先查找图像标记的存在，然后只修改它的内容？我们可以定义一个函数来检查字符串是否包含“”，然后执行替换？那是你要找的东西吗？只需要传递该函数，而不是替换字符串。@JasonHoward您关心html标记这一事实意味着html文本是相关的。@JasonHoward使用正则表达式解析html的主题已传递到堆栈溢出@glhr regex不是解析html的正确工具，但此问题不需要解析所有内容HTMLW如果正文中的文本包含

？

和

“

”，这是正确的glhr。我需要一些其他限定符，如img标记，这样我就不会拾取？或“用户想要显示的字符。您可以创建更复杂的正则表达式正则表达式，应该用于正则语言，afaik这只显示了如何分割url，然后以所需的形式打印它。我希望从字符串中剪切内容，然后保存字符串。谢谢你提供的信息。非常感谢您的时间。我真的不想在字符串中添加额外的标记（html和body）。我们如何防止这种情况？感谢我回答这两个问题。

from bs4 import BeautifulSoup
from urllib.parse import urlsplit

input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'''

soup = BeautifulSoup(input_str, "html.parser")
img_url = soup.find('img')['src']
new_url = urlsplit(img_url)._replace(query=None).geturl()
soup.find('img')['src'] = new_url
print(soup)

<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>

input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>
                <img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br><p><br></p><p> This is extra text in the body.</p>'''

soup = BeautifulSoup(input_str, "html.parser")

for img in soup.find_all('img'):
    img_url = img['src']
    new_url = urlsplit(img_url)._replace(query=None).geturl()
    img['src'] = new_url
print(soup)

<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>
<img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/><p><br/></p><p> This is extra text in the body.</p>