How to get the descriptive text above an img tag in Python

python url web-crawler

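I want to get the description (or caption) of each image. I want to process the HTML in bulk rather than use the Google inspect tool to look up the XPath of each text one by one. Because the captions and descriptions follow no common rule (some images have no description or caption at all), the only approach seems to be to locate the image and take the nearest text around it, which is most likely my target.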

data=<p style="margin-top:6pt;margin-bottom:0pt;text-indent:4.54%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">
   The following graph sets forth the cumulative total return to CECO’s shareholders during the five years ended December&nbsp;31, 2018, as well as the following indices: Russell 2000 Index, Standard and Poor’s (“S&amp;P”) 600 Small Cap Industrial Machinery Index, and S&amp;P 500 Index. Assumes $100 was invested on December&nbsp;31, 2013, including the reinvestment of dividends, in each category.
</p>
<p style="margin-top:6pt;margin-bottom:0pt;text-indent:4.54%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">
  <img src="gfsqvgqkrgf1000002.jpg" title="" alt="" style="width:649px;height:254px;">
</p>

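But this is not what I want.

You can do this by first searching for the <p> that contains the <img> and then taking the <p> before it:

from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('p:has(img)').find_previous('p').text.strip())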

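Prints:

The following graph sets forth the cumulative total return to CECO’s shareholders during the five years ended December 31, 2018, as well as the following indices: Russell 2000 Index, Standard and Poor’s (“S&P”) 600 Small Cap Industrial Machinery Index, and S&P 500 Index. Assumes $100 was invested on December 31, 2013, including the reinvestment of dividends, in each category.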

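Please add the code you have tried, and what you got as output compared with what you expected.

As I said, I want to find the node that precedes the tag, but the output is always its parent node.

Good idea, but it doesn't work! NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.

@李文举, you need to upgrade to Beautiful Soup 4.7+, or install soupsieve and then import and use its API directly:

soupsieve.select_one('p:has(img)', soup).find_previous('p').text.strip()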

@李文举 Another way (without upgrading) is to use soup.select_one("p > img").find_previous("p").find_previous("p").text. It's a bit silly, but it works.

@李文举 I am using version beautifulsoup4==4.7.1

from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
# find_previous('p') called on the <img> returns the <p> that wraps it, not the paragraph above
print(soup.select_one("p > img").find_previous('p'))
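The behaviour discussed in the comments (getting the parent node back, and needing find_previous twice) can be reproduced with a small sketch. The two-paragraph `html` string and the file name `chart.jpg` are made up for illustration, and `html.parser` is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, shortened from the structure in the question
html = '<p>Caption text.</p><p><img src="chart.jpg" alt=""></p>'

soup = BeautifulSoup(html, 'html.parser')
img = soup.select_one('p > img')

# find_previous() walks backwards in document order, so the first <p>
# it meets is the one that wraps the <img>
wrapper = img.find_previous('p')

# A second call steps back to the paragraph above the image
caption = wrapper.find_previous('p')

is_parent = wrapper is img.parent
caption_text = caption.get_text(strip=True)
print(is_parent, caption_text)
```

Because the opening tag of the wrapping <p> comes before the <img> in the document, a single find_previous('p') stops there, which appears to be why the question's attempt kept printing the parent.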

data = '''<p style="margin-top:6pt;margin-bottom:0pt;text-indent:4.54%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">
   The following graph sets forth the cumulative total return to CECO’s shareholders during the five years ended December&nbsp;31, 2018, as well as the following indices: Russell 2000 Index, Standard and Poor’s (“S&amp;P”) 600 Small Cap Industrial Machinery Index, and S&amp;P 500 Index. Assumes $100 was invested on December&nbsp;31, 2013, including the reinvestment of dividends, in each category.
</p>
<p style="margin-top:6pt;margin-bottom:0pt;text-indent:4.54%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">
  <img src="gfsqvgqkrgf1000002.jpg" title="" alt="" style="width:649px;height:254px;">
</p>'''

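from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

# Select the <p> that contains an <img> (':has()' needs Beautiful Soup 4.7+), then take the <p> before it
print(soup.select_one('p:has(img)').find_previous('p').text.strip())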
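Since the question asks for bulk processing and notes that some images have no caption, the same idea might be extended along these lines. This is only a sketch: the function name `captions_above_images`, the `sample` markup, and the assumption that the caption is the nearest earlier <p> with non-empty text are all illustrative, not part of the original answer.

```python
from bs4 import BeautifulSoup


def captions_above_images(html):
    """Map each image's src to the text of the nearest earlier <p> with text, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for img in soup.find_all('img'):
        # Start from the paragraph wrapping the image when there is one,
        # otherwise from the image itself
        start = img.find_parent('p') or img
        caption = None
        # find_all_previous() yields earlier <p> tags, closest first
        for p in start.find_all_previous('p'):
            text = p.get_text(' ', strip=True)
            if text:  # skip empty paragraphs and ones holding only an image
                caption = text
                break
        results[img.get('src')] = caption
    return results


# Hypothetical input: the first image has nothing above it
sample = (
    '<p><img src="a.jpg"></p>'
    '<p>First caption.</p>'
    '<p><img src="b.jpg"></p>'
)
print(captions_above_images(sample))
```

An image with no earlier text maps to None, so missing captions can be filtered out afterwards. The nearest preceding paragraph is a heuristic and may pick up unrelated text on some pages.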