Python scraping does not produce the expected result

I am trying to scrape some data from a website. The HTML looks like this:

<div class="field-wrapper field field-node--field-test-synonyms field-name-field-test-synonyms field-type-string field-label-inline clearfix">
      <div class="field-label">Also Known As</div>
    <div class="field-items">
          <div class="field-item">KOH Prep</div>
          <div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
          <div class="field-item">Mycology Tests</div>
          <div class="field-item">Fungal Molecular Tests</div>
          <div class="field-item">Potassium Hydroxide Preparation</div>
          <div class="field-item">Calcofluor White Stain</div>
      </div>
</div>

My code is:

import requests
from bs4 import BeautifulSoup

def get_similar_names(sub_url):
    response = requests.get(sub_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    if(soup.find('div', class_='field-label')!= None):
        other_names = [
            tag.next.next.get_text(strip=True, separator='|').split('|')
            for tag in soup.find('div', class_='field-label')
        ]
        return (other_names[0])
    else:
        return None

The actual link to the page is

There are several ways to get the names.

#1 - Join all the names into a single string, as in your expected output:

soup.select_one('div.field-items').get_text(',',strip=True)

Output -> KOH Prep,Fungal Smear, Culture, Antigen and Antibody Tests,Mycology Tests,Fungal Molecular Tests,Potassium Hydroxide Preparation,Calcofluor White Stain

#2 - Get all the names as a list:

[name.get_text() for name in soup.select('div.field-items > div')]

Output -> ['KOH Prep','Fungal Smear, Culture, Antigen and Antibody Tests','Mycology Tests','Fungal Molecular Tests','Potassium Hydroxide Preparation','Calcofluor White Stain']

#3 - Get only the first name, as your code does:

soup.select_one('div.field-items > div').get_text()

Output -> KOH Prep

Example

def get_similar_names(sub_url):
    response = requests.get(sub_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    other_names = soup.select_one('div.field-items').get_text(',',strip=True)

    return other_names

Output

KOH Prep,Fungal Smear, Culture, Antigen and Antibody Tests,Mycology Tests,Fungal Molecular Tests,Potassium Hydroxide Preparation,Calcofluor White Stain
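
One caveat with the comma-joined string: the name "Fungal Smear, Culture, Antigen and Antibody Tests" itself contains commas, so the string cannot be split back into the original names. Also, `select_one` returns `None` when the block is missing, which the example above does not guard against. A minimal sketch that returns a list and handles the missing case could look like this. The function name `names_from_html` and the shortened `SAMPLE_HTML` are made up for illustration and are not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (illustrative only).
SAMPLE_HTML = """
<div class="field-wrapper">
  <div class="field-label">Also Known As</div>
  <div class="field-items">
    <div class="field-item">KOH Prep</div>
    <div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
    <div class="field-item">Mycology Tests</div>
  </div>
</div>
"""


def names_from_html(html):
    """Return the names as a list, or None if the block is absent."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.select_one("div.field-items")
    if items is None:
        # Page has no "Also Known As" block (or the markup differs).
        return None
    # One list entry per field-item, so commas inside a name are harmless.
    return [div.get_text(strip=True) for div in items.select("div.field-item")]


names = names_from_html(SAMPLE_HTML)
print(names)
print(names_from_html("<p>no synonyms here</p>"))
```

If a single string is still needed, joining the list with a separator that cannot appear in a name (for example `' | '.join(names)`) keeps the result unambiguous.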

When I run the core of your scraping code on its own, as follows:

from bs4 import BeautifulSoup

html_content='''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<div class="field-wrapper field field-node--field-test-synonyms field-name-field-test-synonyms field-type-string field-label-inline clearfix">
      <div class="field-label">Also Known As</div>
    <div class="field-items">
          <div class="field-item">KOH Prep</div>
          <div class="field-item">Fungal Smear, Culture, Antigen and Antibody Tests</div>
          <div class="field-item">Mycology Tests</div>
          <div class="field-item">Fungal Molecular Tests</div>
          <div class="field-item">Potassium Hydroxide Preparation</div>
          <div class="field-item">Calcofluor White Stain</div>
      </div>
</div>
</body>
</html>
'''

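soup = BeautifulSoup(html_content, 'html.parser')
label = soup.find('div', class_='field-label')
if label is not None:
    # Iterating over a Tag yields its children; here that is the single
    # text node "Also Known As". From it, .next.next steps over the
    # whitespace between the two divs and lands on div.field-items.
    other_names = [
        tag.next.next.get_text(strip=True, separator='|').split('|')
        for tag in label
    ]
    print(other_names)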
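
This prints a list containing one inner list with all six names, which matches your target result.

Since your code produces the target result, the problem is probably somewhere else, for example in the sub_url you pass in.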
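
To check whether the page actually fetched for a given sub_url contains the expected markup, it can help to summarize what the parser sees. The helper below is a hypothetical sketch (`describe_page` is not from the original code); it works on an HTML string so it can be tried without any network access.

```python
from bs4 import BeautifulSoup


def describe_page(html):
    """Summarize whether the expected markup is present in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find("div", class_="field-label")
    items = soup.select("div.field-items > div.field-item")
    return {
        "has_label": label is not None,
        "label_text": label.get_text(strip=True) if label is not None else None,
        "item_count": len(items),
    }


# Markup like the one in the question (shortened for illustration).
good = (
    '<div class="field-label">Also Known As</div>'
    '<div class="field-items"><div class="field-item">KOH Prep</div></div>'
)
# A made-up stand-in for an error or block page.
bad = "<html><body>Access denied</body></html>"

print(describe_page(good))
print(describe_page(bad))
```

With a live request, one would pass `response.text` to this helper and also look at `response.status_code`. If `has_label` is false or `item_count` is zero for a particular sub_url, the server probably returned a different page than the one inspected in the browser.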

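
The items belong to the field-items class, not field-label. It is also unclear why you need tag.next.next.get_text when the text sits directly inside the div tags you found. Finally, if you only return the first element, you do not need a list comprehension.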
[['KOH Prep', 'Fungal Smear, Culture, Antigen and Antibody Tests', 'Mycology Tests', 'Fungal Molecular Tests', 'Potassium Hydroxide Preparation', 'Calcofluor White Stain']]
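
The points above can be sketched as follows. Stepping with `.next.next` only lands on the field-items div because a whitespace text node happens to sit between the two divs, so it depends on how the HTML is formatted. One alternative, assuming the label and the items are always siblings as in the posted HTML, is to start from the label and ask for its next sibling with the field-items class. The function name `names_after_label` and its `label_text` parameter are invented for this example.

```python
from bs4 import BeautifulSoup


def names_after_label(html, label_text="Also Known As"):
    """Return the field-item texts that follow the given label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for label in soup.find_all("div", class_="field-label"):
        if label.get_text(strip=True) != label_text:
            continue
        # Sibling lookup skips any whitespace between the two divs.
        items = label.find_next_sibling("div", class_="field-items")
        if items is None:
            return None
        return [
            div.get_text(strip=True)
            for div in items.find_all("div", class_="field-item")
        ]
    return None


# Same content, with and without a newline between label and items.
spaced = (
    '<div class="field-label">Also Known As</div>\n'
    '<div class="field-items">'
    '<div class="field-item">KOH Prep</div>'
    '<div class="field-item">Mycology Tests</div>'
    '</div>'
)
compact = spaced.replace("</div>\n<div", "</div><div")

print(names_after_label(spaced))
print(names_after_label(compact))
```

Both calls return the same list, because the lookup does not rely on how many text nodes sit between the label and the items.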