Python: Can't extract references from HTML using BeautifulSoup
I managed to scrape the following data from a website, but I can't go further and extract the image reference on each page. Let me illustrate:
data = """
<div class="Answer">
1. Origin (O): <i>clavicular head - </i>sternal half of clavicle. <i>Sternal head - </i>sternum down to 7th rib & cartilages of true ribs & aponeurosis of EXTERNAL OBLIQUE.<div>2. Insertion (I): lateral lip of intertubercular sulcus of humerus <b><i>(TIP: 1 missus [LATISSIMUS DORSI] b/w 2 majors [PECTORALIS MAJOR & TERES MAJOR])</i></b></div><div>3. NS: medial & lateral pectoral n. </div><div>4. A: adducts & internally rotates arm; flexes shoulder. </div><div><img src="paste-7450347406b71a5e5c2e6dc2442ca630347acc64.jpg"><br></div><div><b>Image: </b>Gray, Henry. <i>Anatomy of the Human Body.</i> Philadelphia: Lea & Febiger, 1918; Bartleby.com, 2000. <a href="https://www.bartleby.com/107/">www.bartleby.com/107/</a> [Accessed 15 Nov. 2018].</div>
</div>
<div class="Answer">
1. O: outer, upper surface of ribs 3-5. <div>2. I: corocoid process of scapula. </div><div>3. NS: medial pectoral n.</div><div>4. A: lowers the lateral angle & protracts the scapula. </div><div><br></div><div><img src="paste-fbab2e102740a7713816f498946f8cd977010c8f.gif"><br></div><div><b>Image:</b> Case courtesy of Dr Sachintha Hapugoda, <a href="https://radiopaedia.org/">Radiopaedia.org</a>. From the case <a href="https://radiopaedia.org/cases/52195">rID: 52195</a></div>
</div>
"""
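But it doesn't work. What can I do?
Use the following approach:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
img_links = soup.select('div.Answer b')
for el in img_links:
    # repr() of each sibling is why the output is wrapped in quotes
    print(''.join(map(repr, el.next_siblings)))
Output: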
'Gray, Henry. '<i>Anatomy of the Human Body.</i>' Philadelphia: Lea & Febiger, 1918; Bartleby.com, 2000. '<a href="https://www.bartleby.com/107/">www.bartleby.com/107/</a>'\xa0[Accessed 15 Nov. 2018].'
'\xa0Case courtesy of Dr Sachintha Hapugoda, <a href="https://radiopaedia.org/">Radiopaedia.org</a>. From the case <a href="https://radiopaedia.org/cases/52195">rID: 52195</a>'
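The output above still carries the quotes added by repr(), and the selector also visits bold tags that are not reference labels. A minimal sketch of one way around both, assuming every reference is introduced by a <b> whose text starts with "Image:"; the `sample` string and the `image_references` helper are hypothetical stand-ins, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the `data` string in the question.
sample = """
<div class="Answer">
1. O: example text.<div><b><i>(TIP: not a reference)</i></b></div>
<div><b>Image: </b>Gray, Henry. <i>Anatomy of the Human Body.</i> <a href="https://www.bartleby.com/107/">www.bartleby.com/107/</a></div>
</div>
"""

def image_references(html):
    """Return the HTML that follows each <b>Image:</b> label."""
    soup = BeautifulSoup(html, "html.parser")
    refs = []
    for b in soup.select("div.Answer b"):
        # Skip bold tags that are not the "Image:" label (e.g. the TIP note).
        if not b.get_text(strip=True).startswith("Image:"):
            continue
        # str() on each sibling keeps tags such as <i> and <a> intact.
        refs.append("".join(str(s) for s in b.next_siblings).strip())
    return refs

refs = image_references(sample)
for ref in refs:
    print(ref)
```

Because the siblings are joined with str() rather than repr(), each entry in `refs` is an HTML fragment that could be pasted into another page.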
I'm sure this isn't clever code, but I think it will help:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
answers = soup.find_all("div", {"class": "Answer"})
# Patterns that strip the "<div><b>Image:</b>" label and the closing tag
regex1 = r"<div><b>.*?</b>"
regex2 = r"</div>"
subst = ""
for answer in answers:
    # Assumes the reference label is the last <b> in each answer
    last_b = answer.find_all('b')[-1]
    if last_b.next_element.strip() == 'Image:':
        parent_element = last_b.parent
        result = re.sub(regex1, subst, str(parent_element))
        image_link = re.sub(regex2, subst, result)
    else:
        image_link = "no link"
    print(image_link)
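The same result can probably be reached without regular expressions by removing the label tag and serialising what is left of its parent. This is only a sketch under the same assumption as the answer (the label is the last <b> in each answer block); `sample` and `extract_image_links` are hypothetical names introduced for illustration:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the `data` string in the question.
sample = """
<div class="Answer">
1. O: example text.
<div><b>Image:</b> Case courtesy of Dr Sachintha Hapugoda, <a href="https://radiopaedia.org/">Radiopaedia.org</a>.</div>
</div>
<div class="Answer">
1. O: an answer without an image reference.
</div>
"""

def extract_image_links(html):
    """Return one HTML fragment (or "no link") per div.Answer block."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for answer in soup.find_all("div", class_="Answer"):
        bold_tags = answer.find_all("b")
        # Guard against answers that contain no <b> at all.
        label = bold_tags[-1] if bold_tags else None
        if label is not None and label.get_text(strip=True) == "Image:":
            parent = label.parent
            label.extract()  # drop the "Image:" label itself
            # decode_contents() serialises the children without the outer <div>.
            links.append(parent.decode_contents().strip())
        else:
            links.append("no link")
    return links

links = extract_image_links(sample)
for link in links:
    print(link)
```

Checking for an empty `find_all("b")` result avoids the IndexError that indexing with `[-1]` would raise on an answer with no bold tag.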
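But it loses all the HTML information. I need to insert these two references into another HTML page.
@CodeMonkey, that wasn't clear from your original description. Please update your question with the appropriate details.
Nice solution, but what does "enter code here" mean?
My mistake. Thank you!
There are also some references that start with "Image:". How do I account for those?
You can use another regex, regex3 = r"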
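The last comment is cut off, so the pattern it proposed is unknown. As a rough illustration of the idea only, the sketch below strips a leading "Image:" label that may or may not be followed by a space or a non-breaking space. `LABEL`, `CLOSING_DIV`, `strip_image_label` and the two example fragments are all hypothetical and are not the commenter's regex3:

```python
import re

# Hypothetical pattern: an opening <div>, then a bold "Image:" label with
# optional whitespace or &nbsp; entities around and after it.
LABEL = re.compile(
    r"^<div>\s*<b>\s*Image\s*:(?:\s|&nbsp;)*</b>(?:\s|&nbsp;)*",
    re.IGNORECASE,
)
CLOSING_DIV = re.compile(r"</div>\s*$")

def strip_image_label(fragment):
    """Remove the leading label and the trailing </div>, keeping inner HTML."""
    without_label = LABEL.sub("", fragment.strip())
    return CLOSING_DIV.sub("", without_label)

examples = [
    '<div><b>Image: </b>Gray, Henry. <i>Anatomy of the Human Body.</i></div>',
    '<div><b>Image:</b>&nbsp;Case courtesy of <a href="https://radiopaedia.org/">Radiopaedia.org</a></div>',
]
cleaned = [strip_image_label(e) for e in examples]
for item in cleaned:
    print(item)
```

Anchoring the label pattern at the start of the fragment keeps it from touching other bold text inside the reference itself.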