How to extract specific strings in Python

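I am trying to extract specific strings from a tag and save them (for more complex processing later on). For example, I have read a line from a file, and the current line is: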

<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">

How would I go about doing this in Python?

Thanks

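There are several ways to solve this, but I would recommend using an HTML parser, which is extensible and copes with many of the quirks of real-world HTML. Here is a working example using BeautifulSoup: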

>>> from bs4 import BeautifulSoup
>>> string = """<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">"""
>>> soup = BeautifulSoup(string, 'html.parser')
>>> for attr in ['width', 'height', 'alt']:
...     print('temp{} = {}'.format(attr.title(), soup.img[attr]))
...
tempWidth = 500
tempHeight = 375
tempAlt = Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road
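
Since the question asks to save the values rather than print them, the same approach can be sketched as a script. This is only an illustration: the helper name `extract_img_attrs` and the variable `line` are made up here, and the code assumes the first `<img>` on the line is the one you want.

```python
from bs4 import BeautifulSoup

# The line from the question, split across string literals for readability.
line = ('<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  '
        'WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall '
        'from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">')


def extract_img_attrs(html):
    """Return src/width/height/alt of the first <img>, or None if there is no <img>.

    Hypothetical helper, not part of BeautifulSoup.
    """
    img = BeautifulSoup(html, 'html.parser').img
    if img is None:
        return None
    # html.parser lowercases attribute names, so WIDTH="500" is read as 'width'.
    # .get() returns None for a missing attribute instead of raising KeyError.
    return {name: img.get(name) for name in ('src', 'width', 'height', 'alt')}


attrs = extract_img_attrs(line)
tempUrl = attrs['src']
tempWidth = int(attrs['width'])    # attribute values are strings until cast
tempHeight = int(attrs['height'])
tempAlt = attrs['alt']
```

Returning a dict keeps the parsing in one place. If a line may lack an `<img>` or one of the attributes, check for `None` before casting.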

And the regular expression approach:

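import re

string = "YOUR STRING"  # the line read from the file
# Each group lazily captures one quoted value; the attributes must appear in this order.
# findall() returns a list of tuples, so [0] raises IndexError when nothing matches.
matches = re.findall(r'src="(.*?)".*WIDTH="(.*?)".*HEIGHT="(.*?)".*alt="(.*?)"', string)[0]
tempUrl = matches[0]
tempWidth = matches[1]
tempHeight = matches[2]
tempAlt = matches[3]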

All of the values are strings, though, so cast them if you need to.

Be aware that copying and pasting a regex is a bad idea. It is easy to get wrong.
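
To make the casting concrete, and to show one way of reducing the fragility mentioned above, here is a hedged sketch that looks up each attribute separately instead of relying on a fixed attribute order. The helper `find_attr` and the variable `line` are hypothetical names introduced for this example. The sketch assumes double-quoted attribute values and is still not a substitute for a real HTML parser.

```python
import re

# The line from the question, split across string literals for readability.
line = ('<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  '
        'WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall '
        'from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">')


def find_attr(name, text):
    """Return the double-quoted value of attribute `name`, or None if absent.

    Hypothetical helper for illustration only.
    """
    # \b stops the name from matching in the middle of a longer word;
    # IGNORECASE lets 'width' match WIDTH.
    match = re.search(r'\b{}="([^"]*)"'.format(re.escape(name)), text, re.IGNORECASE)
    return match.group(1) if match else None


tempUrl = find_attr('src', line)
tempWidth = int(find_attr('width', line))    # int(None) raises TypeError, so check first if unsure
tempHeight = int(find_attr('height', line))
tempAlt = find_attr('alt', line)
```

Searching per attribute means the order of `src`, `WIDTH`, `HEIGHT` and `alt` no longer matters, and a missing attribute yields `None` rather than an `IndexError`.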

Let me save you the trouble and tell you that regex is not meant for this. Don't even try it; you will only bang your head against it later. If the data comes from a web source, look at BeautifulSoup, scrapy, or any other "scraping" library. If you already have the markup, you can use a parser to walk the nodes and collect the attribute information. Or, depending on your Python version, ...

After finally installing bs4, this is a beautiful solution. Thanks.
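
For readers who cannot install bs4, the "use a parser to walk the nodes" suggestion in the comments can be sketched with the standard library alone. This assumes Python 3's `html.parser` module; the comment may well have had a different parser in mind, and the class name `ImgAttrCollector` is invented for this example.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> start tag as a list of dicts."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == 'img':
            self.images.append(dict(attrs))


# The line from the question, split across string literals for readability.
line = ('<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  '
        'WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall '
        'from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">')

collector = ImgAttrCollector()
collector.feed(line)
collector.close()

first = collector.images[0]
tempUrl = first['src']
tempWidth = int(first['width'])    # values are strings until cast
tempHeight = int(first['height'])
tempAlt = first['alt']
```

The same collector can be fed a whole file, in which case `collector.images` holds one dict per `<img>` tag found.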