Web scraping: How to use Selenium in Python to search for a keyword and select a dropdown option to scrape data?_Web Scraping_Beautifulsoup_Selenium Chromedriver

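I am trying to use Selenium in Python to get information from the website below, which has a search bar and dropdown menus. I want to get the results (name, address, phone number) for clinics in a particular area. For example, enter the keyword "Frankfurt, Germany" in the "Ihr Standort" search bar and select the "Hausärzte" option in the Allgemeinmedizin dropdown menu. I can print the results for the search bar keyword "Frankfurt, Germany", but I cannot work out the code to select the dropdown option.

Can anyone help me with code that selects the "Hausärzte" option from the Allgemeinmedizin dropdown and extracts the clinic results (name, address, phone number)?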

Website:

https://www.kvwl.de/earzt/index.htm

Code:

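OK, I have written up an answer for you that uses the API, following QHarr's suggestion. The API takes latitude/longitude as input, so let's use geopy to retrieve those from a place name. We can then pass them, together with the code for Hausärzte, to the site's API in a POST request, and load the response as JSON with json.loads. I'm not sure what you want to do with the data, so for convenience I load it into a pandas dataframe. After that, a function is applied to the Id column of the dataframe: it passes each Id in a second API request to retrieve the details for that specific Id, and the details are joined to the dataframe.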

from geopy.geocoders import Nominatim
import requests
import pandas as pd
import json
import time

place = "München, Germany"
# This code is for Hausärzte; look up other codes at
# https://www.kvwl.de/DocSearchService/DocSearchService/getExpertiseAreaStructure
Fachgebiet = '12001_SID'

# Geocode the place name to latitude/longitude
geolocator = Nominatim(user_agent="KVWL_retrieval")
location = geolocator.geocode(place)
if location is None:
    raise ValueError("Could not geocode " + place)

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36", 'content-type': 'application/json; charset=UTF-8'}

# Build the request body as a dict and serialise it instead of concatenating strings
payload = {"Latitude": location.latitude, "Longitude": location.longitude, "DocGender": "", "DocNamePattern": "", "ExpertiseAreaStructureId": Fachgebiet, "ApplicableQualificationId": "", "SpecialServiceId": "", "LanguageId": "", "BarrierFreeAttributeFilter": {"ids": []}, "PageId": 0, "PageSize": 100}
response = requests.post('https://www.kvwl.de/DocSearchService/DocSearchService/searchDocs', headers=headers, data=json.dumps(payload))
r = json.loads(response.content)

df = pd.json_normalize(r['DoctorAbstracts']['DoctorAbstract'])

def get_doctor(id_nr):
    """Fetch the detail record for one doctor Id as a one-row dataframe."""
    data = json.dumps({"Id": id_nr})
    response = requests.post('https://www.kvwl.de/DocSearchService/DocSearchService/getDoctor', headers=headers, data=data)
    r = json.loads(response.content)
    time.sleep(2) # don't overload the site
    return pd.json_normalize(r)

# Request the details for every Id and join them to the search results.
# Note: to_dict() wraps each detail value as {0: value}.
df = df.join(df.apply(lambda x: pd.Series(get_doctor(x.Id).to_dict()), axis=1), rsuffix='_right')

You can browse the dataframe with df.head(), or export it to CSV or Excel with df.to_csv('filename.csv') or df.to_excel('filename.xlsx').
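The request above fetches only one page of results (PageId 0 with a PageSize of 100). If more results are needed, further pages could be requested in a loop. The following is only a minimal sketch: it assumes that PageId is a zero-based page index and that a page with fewer than PageSize entries is the last one, neither of which is confirmed here. The helper names build_payload and fetch_all_pages, and the max_pages safety cap, are made up for illustration. The post argument is any callable that sends the body and returns the decoded JSON, so the paging logic can be tried without touching the site.

```python
import json

SEARCH_URL = "https://www.kvwl.de/DocSearchService/DocSearchService/searchDocs"

def build_payload(lat, lon, expertise_id, page_id=0, page_size=100):
    """Return the JSON body for one search page as a string."""
    return json.dumps({
        "Latitude": lat,
        "Longitude": lon,
        "ExpertiseAreaStructureId": expertise_id,
        "PageId": page_id,        # assumed to be a zero-based page index
        "PageSize": page_size,
    })

def fetch_all_pages(post, lat, lon, expertise_id, page_size=100, max_pages=50):
    """Collect result entries page by page.

    `post` is a callable taking (url, body) and returning the decoded JSON dict,
    for example a thin wrapper around requests.post:
        post = lambda url, body: requests.post(url, headers=headers, data=body).json()
    """
    results = []
    for page_id in range(max_pages):  # hypothetical cap to avoid endless loops
        body = build_payload(lat, lon, expertise_id, page_id, page_size)
        response = post(SEARCH_URL, body)
        # Response shape as used in the code above:
        # {"DoctorAbstracts": {"DoctorAbstract": [...]}}
        page = (response.get("DoctorAbstracts") or {}).get("DoctorAbstract") or []
        results.extend(page)
        if len(page) < page_size:  # a short page is assumed to be the last one
            break
    return results
```

The combined list can be passed to pd.json_normalize in the same way as a single page. Keeping a pause between requests, as in get_doctor, avoids overloading the site.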

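You can mimic the page's requests, or go a step further and find a free API that returns the lat/lon for a location, then pass those in a POST request to

https://www.kvwl.de/DocSearchService/DocSearchService/searchDocs

with

headers = {'content-type': 'application/json; charset=UTF-8'}; data = {"Latitude": 50.1109221, "Longitude": 8.6821, "ExpertiseAreaStructureId": "12001_SID", "PageSize": 100}

Update lat and lon as appropriate. 100 is the maximum result set. You can add a page parameter, e.g. "PageId": 0

Can you provide any links for learning this technique?

Hi Mr. Adriaansen, you have solved a huge problem for me, just like that!! I really appreciate your help. Thank you!! I have one question about the extracted data. For some fields, the data is extracted as {0: '+49 (521) 63500'}. Is this normal and something to clean up later, or is there a way to read the clean values without the curly braces and the 0:?

These fields are formatted as dictionaries. That is apparently how the JSON returns them. You should be able to extract the phone number with: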

df['column_name'] = df['column_name'].apply(lambda x: x[0] if isinstance(x, dict) else x)

Make sure to replace column_name with the name of the relevant column.

Thank you very much!! It works great!
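When several columns contain such wrapped values, the same unwrapping can be written once and reused. This is a minimal sketch under the assumption that the wrapped values always have the form {0: value}, as in the example from the comment; the helper names unwrap_single_entry and unwrap_records and the sample record are made up for illustration.

```python
def unwrap_single_entry(value):
    """Return the inner value of a {0: value} dict, or the value unchanged."""
    if isinstance(value, dict) and list(value.keys()) == [0]:
        return value[0]
    return value

def unwrap_records(records):
    """Apply unwrap_single_entry to every field of every record (list of dicts)."""
    return [{key: unwrap_single_entry(val) for key, val in rec.items()} for rec in records]

# Hypothetical record shaped like the value quoted in the comment above
sample = [{"Name": "Dr. Example", "Phone": {0: "+49 (521) 63500"}}]
cleaned = unwrap_records(sample)

# With a dataframe, the same function can be mapped over each column, e.g.:
#     df = df.apply(lambda column: column.map(unwrap_single_entry))
```

Dictionaries with other keys are left untouched, so nested fields that really are dictionaries are not flattened by accident.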
