Python scraping does not produce the expected result
Tags: python, web-scraping, beautifulsoup
I am trying to scrape some data from a website. The HTML looks like this:
<div class="field-wrapper field field-node--field-test-synonyms field-name-field-test-synonyms field-type-string field-label-inline clearfix">
<div class="field-label">Also Known As</div>
<div class="field-items">
<div class="field-item">KOH Prep</div>
<div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
<div class="field-item">Mycology Tests</div>
<div class="field-item">Fungal Molecular Tests</div>
<div class="field-item">Potassium Hydroxide Preparation</div>
<div class="field-item">Calcofluor White Stain</div>
</div>
</div>
import requests
from bs4 import BeautifulSoup

def get_similar_names(sub_url):
    response = requests.get(sub_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    if(soup.find('div', class_='field-label')!= None):
        other_names = [
            tag.next.next.get_text(strip=True, separator='|').split('|')
            for tag in soup.find('div', class_='field-label')
        ]
        return (other_names[0])
    else:
        return None
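The actual link to the web page is
There are different ways to get the names.
#1 - Combine all names into a single string, as in your expected output: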
soup.select_one('div.field-items').get_text(',',strip=True)
Output -> KOH Prep,Fungal Smear, Culture, Antigen and Antibody Tests,Mycology Tests,Fungal Molecular Tests,Potassium Hydroxide Preparation,Calcofluor White Stain
#2 - Get all names as a list:
[name.get_text() for name in soup.select('div.field-items > div')]
Output -> ['KOH Prep','Fungal Smear, Culture, Antigen and Antibody Tests','Mycology Tests','Fungal Molecular Tests','Potassium Hydroxide Preparation','Calcofluor White Stain']
#3 - Get only the first name, as in your code:
soup.select_one('div.field-items > div').get_text()
Output -> KOH Prep
Example
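import requests
from bs4 import BeautifulSoup

def get_similar_names(sub_url):
    response = requests.get(sub_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Join the text of every item under div.field-items with commas
    other_names = soup.select_one('div.field-items').get_text(',',strip=True)
    return other_names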
Output
KOH Prep,Fungal Smear, Culture, Antigen and Antibody Tests,Mycology Tests,Fungal Molecular Tests,Potassium Hydroxide Preparation,Calcofluor White Stain
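If the page can lack the "Also Known As" block, or contains more than one div.field-items, the one-liner above may raise an AttributeError or pick up the wrong field. The following is only a sketch under those assumptions: the helper name extract_similar_names is hypothetical, it takes the HTML as a string so it can be tried without a network request, and it scopes the search to the field-name-field-test-synonyms wrapper class shown in the question.

```python
from bs4 import BeautifulSoup

def extract_similar_names(html):
    """Return the 'Also Known As' names as a list, or None if the block is absent.

    Hypothetical helper; the wrapper class comes from the HTML in the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Only look at field-item divs inside the synonyms wrapper
    items = soup.select("div.field-name-field-test-synonyms div.field-item")
    if not items:
        return None
    return [item.get_text(strip=True) for item in items]

sample = """
<div class="field-wrapper field field-name-field-test-synonyms clearfix">
<div class="field-label">Also Known As</div>
<div class="field-items">
<div class="field-item">KOH Prep</div>
<div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
<div class="field-item">Mycology Tests</div>
</div>
</div>
"""

print(extract_similar_names(sample))
print(extract_similar_names("<p>no synonyms here</p>"))
```

Inside get_similar_names this could be called with response.text after requests.get(sub_url). Returning a list also keeps names that contain commas, such as the second one, intact.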
When I ran the core of your scraping code, as follows:
from bs4 import BeautifulSoup
html_content='''
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div class="field-wrapper field field-node--field-test-synonyms field-name-field-test-synonyms field-type-string field-label-inline clearfix">
<div class="field-label">Also Known As</div>
<div class="field-items">
<div class="field-item">KOH Prep</div>
<div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
<div class="field-item">Mycology Tests</div>
<div class="field-item">Fungal Molecular Tests</div>
<div class="field-item">Potassium Hydroxide Preparation</div>
<div class="field-item">Calcofluor White Stain</div>
</div>
</div>
</body>
</html>
'''
soup = BeautifulSoup(html_content, 'html.parser')
if(soup.find('div', class_='field-label')!= None):
    other_names = [
        tag.next.next.get_text(strip=True, separator='|').split('|')
        for tag in soup.find('div', class_='field-label')
    ]
    print (other_names)
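the output is
[['KOH Prep', 'Fungal Smear, Culture, Antigen and Antibody Tests', 'Mycology Tests', 'Fungal Molecular Tests', 'Potassium Hydroxide Preparation', 'Calcofluor White Stain']]
which matches your target result.
Since your code gives the target result, the problem is probably somewhere else, for example in the sub_url being passed in.

The items belong to the field-item class, not field-label. It is also unclear why you need tag.next.next.get_text when the text sits directly inside the div tag you found. Finally, if you only return the first element, there is no need for a list comprehension.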
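One possible reason the same code works on the pasted HTML but not on the live page is that stepping through elements depends on the whitespace between tags, whereas selecting by class does not. The sketch below illustrates this with a shortened, made-up item list. It uses next_element, which is the documented attribute that the older .next spelling appears to alias, and the helper names are hypothetical. Whether this is the actual cause on the real page is an assumption.

```python
from bs4 import BeautifulSoup

ITEMS = ["KOH Prep", "Mycology Tests", "Calcofluor White Stain"]

def build_html(joiner):
    """Build the synonyms block with `joiner` placed between the tags."""
    parts = ['<div class="field-label">Also Known As</div>',
             '<div class="field-items">']
    parts += ['<div class="field-item">%s</div>' % name for name in ITEMS]
    parts.append('</div>')
    return joiner.join(parts)

def names_by_navigation(html):
    """Mimic tag.next.next: two hops in document order from the label text."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find("div", class_="field-label")
    target = label.string.next_element.next_element
    return target.get_text(strip=True, separator="|").split("|")

def names_by_class(html):
    """Select the items directly by their class."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True)
            for div in soup.find_all("div", class_="field-item")]

spaced = build_html("\n")   # newlines between tags, as in the pasted HTML
compact = build_html("")    # no whitespace between tags

print(names_by_navigation(spaced))
print(names_by_navigation(compact))
print(names_by_class(spaced))
print(names_by_class(compact))
```

With newlines between the tags, the two hops land on the field-items div. Without them, the second hop lands on the first field-item div, so only one name comes back. The class-based version returns all names in both cases.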