Python scraping does not yield the expected result

python web-scraping

I am trying to scrape some data from a website. The HTML looks like this:

<div class="field-wrapper field field-node--field-test-synonyms field-name-field-test-synonyms field-type-string field-label-inline clearfix">
      <div class="field-label">Also Known As</div>
    <div class="field-items">
          <div class="field-item">KOH Prep</div>
          <div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
          <div class="field-item">Mycology Tests</div>
          <div class="field-item">Fungal Molecular Tests</div>
          <div class="field-item">Potassium Hydroxide Preparation</div>
          <div class="field-item">Calcofluor White Stain</div>
      </div>
</div>

def get_similar_names(sub_url):
    response = requests.get(sub_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    if(soup.find('div', class_='field-label')!= None):
        other_names = [
            tag.next.next.get_text(strip=True, separator='|').split('|')
            for tag in soup.find('div', class_='field-label')
        ]
        return (other_names[0])
    else:
        return None

The actual link to the web page is

There are different ways to get the names.

#1 - Combine all the names into a single string, as in your expected output:

soup.select_one('div.field-items').get_text(',',strip=True)

Output -> KOH Prep,Fungal Smear, Culture, Antigen and Antibody Tests,Mycology Tests,Fungal Molecular Tests,Potassium Hydroxide Preparation,Calcofluor White Stain

#2 - Get all the names as a list:
[name.get_text() for name in soup.select('div.field-items > div')]

Output -> ['KOH Prep', 'Fungal Smear, Culture, Antigen and Antibody Tests', 'Mycology Tests', 'Fungal Molecular Tests', 'Potassium Hydroxide Preparation', 'Calcofluor White Stain']

#3 - Get only the first name, as in your code:
soup.select_one('div.field-items > div').get_text()

Output -> KOH Prep

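Example
import requests
from bs4 import BeautifulSoup

def get_similar_names(sub_url):
    response = requests.get(sub_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Join the text of every item under div.field-items with commas
    other_names = soup.select_one('div.field-items').get_text(',', strip=True)

    return other_names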

Output
KOH Prep,Fungal Smear, Culture, Antigen and Antibody Tests,Mycology Tests,Fungal Molecular Tests,Potassium Hydroxide Preparation,Calcofluor White Stain
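If the real page contains more than one div.field-items block, or none at all, select_one('div.field-items') may pick up a different field or return None. The following is only a minimal sketch of a more defensive variant. It assumes the synonyms wrapper always carries the field-name-field-test-synonyms class shown in the question's HTML; the helper name parse_similar_names, the choice to take HTML text instead of a URL, and the sample markup are hypothetical and only there for illustration.

```python
from bs4 import BeautifulSoup

def parse_similar_names(html):
    """Return the 'Also Known As' names as a list, or None if the field is absent.

    Hypothetical helper: it takes HTML text so it can be tried without a network call.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # Scope the lookup to the wrapper class seen in the question's HTML (assumption)
    items = soup.select('div.field-name-field-test-synonyms div.field-item')
    if not items:
        return None
    return [item.get_text(strip=True) for item in items]

# Made-up sample with an unrelated field placed before the synonyms field
sample = '''
<div class="field-wrapper field-name-field-other">
  <div class="field-label">Other</div>
  <div class="field-items"><div class="field-item">Unrelated</div></div>
</div>
<div class="field-wrapper field-name-field-test-synonyms">
  <div class="field-label">Also Known As</div>
  <div class="field-items">
    <div class="field-item">KOH Prep</div>
    <div class="field-item">Mycology Tests</div>
  </div>
</div>
'''

print(parse_similar_names(sample))                     # names from the synonyms field only
print(parse_similar_names('<p>no synonyms here</p>'))  # None when the field is missing
```

Returning a list keeps names that contain commas, such as "Fungal Smear, Culture, Antigen and Antibody Tests", intact; join the list with ',' if you need the single-string form.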

When I test the core of your scraping code as follows:
from bs4 import BeautifulSoup

html_content='''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<div class="field-wrapper field field-node--field-test-synonyms field-name-field-test-synonyms field-type-string field-label-inline clearfix">
      <div class="field-label">Also Known As</div>
    <div class="field-items">
          <div class="field-item">KOH Prep</div>
          <div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
          <div class="field-item">Mycology Tests</div>
          <div class="field-item">Fungal Molecular Tests</div>
          <div class="field-item">Potassium Hydroxide Preparation</div>
          <div class="field-item">Calcofluor White Stain</div>
      </div>
</div>
</body>
</html>
'''

soup = BeautifulSoup(html_content, 'html.parser')
if(soup.find('div', class_='field-label')!= None):
     other_names = [
       tag.next.next.get_text(strip=True, separator='|').split('|')
       for tag in soup.find('div', class_='field-label')
     ]
     print (other_names)

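This prints

[['KOH Prep', 'Fungal Smear, Culture, Antigen and Antibody Tests', 'Mycology Tests', 'Fungal Molecular Tests', 'Potassium Hydroxide Preparation', 'Calcofluor White Stain']]

which matches your target result. Since your code already produces the target result, the problem is probably elsewhere, for example in the sub_url you pass in.

The items belong to the field-items class, not field-label. It is also unclear why you need tag.next.next.get_text when the text sits directly inside the div you found. And if you only return the first element, there is no need for a list comprehension.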
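To make the remark about tag.next.next concrete, here is a small sketch of why that chain happens to work on the HTML above and why it is fragile. Iterating over the field-label tag yields its children (a single text node), and each step then moves one element forward in document order, so the result depends on the whitespace between the two divs. The sketch uses next_element, which tag.next appears to be a shorthand for; the two sample strings and the function names via_next_next and via_sibling are made up for illustration.

```python
from bs4 import BeautifulSoup

# Same structure twice: once with a newline between the divs, once without
spaced = '''<div class="field-label">Also Known As</div>
<div class="field-items"><div class="field-item">KOH Prep</div><div class="field-item">Mycology Tests</div></div>'''
compact = spaced.replace('</div>\n<div class="field-items">',
                         '</div><div class="field-items">')

def via_next_next(html):
    """Mimic the question's approach: two document-order steps from the label text."""
    soup = BeautifulSoup(html, 'html.parser')
    label = soup.find('div', class_='field-label')
    # Iterating a Tag yields its children; here that is the single text node
    text_node = next(iter(label))
    target = text_node.next_element.next_element
    return target.get_text('|', strip=True).split('|')

def via_sibling(html):
    """Select the items container explicitly instead of counting steps."""
    soup = BeautifulSoup(html, 'html.parser')
    label = soup.find('div', class_='field-label')
    items = label.find_next_sibling('div', class_='field-items')
    return [d.get_text(strip=True) for d in items.find_all('div', class_='field-item')]

print(via_next_next(spaced))   # lands on div.field-items, so all names come back
print(via_next_next(compact))  # lands on the first div.field-item only
print(via_sibling(spaced), via_sibling(compact))  # same result for both inputs
```

With the whitespace removed, the second step lands on the first field-item instead of the container, which is one way the same code can behave differently on the live page than on a pasted snippet.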