How to extract a table from a website using Python
I have been trying to extract a table from a website, but I'm lost. Can anyone help me? My goal is to extract the table from the Scope page.
Find the organisation ID from https://training.gov.au/Organisation/Details/31102, then find the XHR URL, which uses the POST method.
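import requests
import json
import pandas as pd
import re

def get_OrganizationID(url):
    # url = 'https://training.gov.au/Organisation/Details/31102'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36'}
    resp = requests.get(url, headers=headers)
    # The JSON results spell the key 'OrganisationId'; accept either spelling, ignoring case
    id_list = re.findall(r'Organi[sz]ationId=(.*?)&', resp.text, flags=re.IGNORECASE)
    OrganizationID = id_list[0] if id_list else None
    return OrganizationID

# First get the organisation ID
url = 'https://training.gov.au/Organisation/Details/31102'
OrganizationID = get_OrganizationID(url)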
def get_AjaxScopeQualification(OrganizationID):
    if OrganizationID:
        url = f'https://training.gov.au/Organisation/AjaxScopeQualification/{OrganizationID}?tabIndex=4'
        headers = {
            'origin': 'https://training.gov.au',
            'referer': f'https://training.gov.au/Organisation/Details/{OrganizationID}?tabIndex=4',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest'
        }
        data = {'page': '1', 'size': '100', 'orderBy': 'Code-asc', 'groupBy': '', 'filter': ''}
        r = requests.post(url, json=data, headers=headers)
        # The body contains "new Date(...)" calls, which are not valid JSON;
        # rewrite them as quoted "year-month-day" strings before parsing
        response = json.loads(re.sub(r'new Date\((\d+),(\d+),(\d+),(\d+),0,0\)', r'"\1-\2-\3"', r.text))
        return response

response = get_AjaxScopeQualification(OrganizationID)
dfn = pd.json_normalize(response, 'data', meta=['total'])
print(dfn.columns)
print(dfn[['Code', 'Title', 'Extent']])
Result:
response['data'][0]
{'Id': '5096634d-4210-4fd4-a51d-f548cd39d57b',
'NrtId': '2feb7e3f-7fc6-4719-ba66-2a066f6635c7',
'RtoId': '3fbfd9c9-3cce-4d69-973e-4e2674f8c5a9',
'TrainingComponentType': 2,
'Code': 'BSB20115',
'Title': 'Certificate II in Business',
'IsImplicit': False,
'ExtentId': '01',
'Extent': 'Deliver and assess',
'StartDate': '2015-3-3',
'EndDate': '2022-3-3',
'DeliveryNsw': True,
'DeliveryVic': True,
'DeliveryQld': True,
'DeliverySa': True,
'DeliveryWa': True,
'DeliveryTas': True,
'DeliveryNt': True,
'DeliveryAct': True,
'ScopeDecisionType': 0,
'ScopeDecision': 'Deliver and assess',
'OverseasCodeAlpha': None,
'OverseasCodeAlhpaList': [],
'OverseasCodeAlphaOutput': ''}
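The date clean-up step can be tried offline before hitting the site. This is only a sketch: the `raw` string is a made-up sample of what the endpoint might return, and the pattern assumes the dates arrive as `new Date(year,month,day,hour,0,0)`.

```python
import json
import re

# Hypothetical sample body: JSON-like text with a JavaScript Date constructor,
# which json.loads cannot parse as-is.
raw = '{"data": [{"Code": "BSB20115", "StartDate": new Date(2015,3,3,0,0,0)}], "total": 1}'

# Rewrite each constructor as a quoted "year-month-day" string.
pattern = r'new Date\((\d+),(\d+),(\d+),(\d+),0,0\)'
cleaned = re.sub(pattern, r'"\1-\2-\3"', raw)

response = json.loads(cleaned)
print(response['data'][0]['StartDate'])
```

If the values really are JavaScript `Date` arguments, keep in mind that JavaScript months are zero-based, so the month in the resulting string may be one lower than the calendar month.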
Process ->
https://training.gov.au/Search/SearchOrganisation?Name=&IncludeUnregisteredRtos=false&IncludeNotRtos=false&orgSearchByNameSubmit=Search&AdvancedSearch=&JavaScriptEnabled=true
This is the AJAX link -> https://training.gov.au/Search/AjaxGetOrganisations?implicitNrtScope=True&includeUnregisteredRtosForScopeSearch=True&includeUnregisteredRtos=False&includeNotRtos=False&orgSearchByNameSubmit=Search&JavaScriptEnabled=true
Use the AJAX link with the POST method to get the JSON data.
Change 'size': '200' to adjust the number of rows returned in the response.
url = f'https://training.gov.au/Search/AjaxGetOrganisations?implicitNrtScope=True&includeUnregisteredRtosForScopeSearch=True&includeUnregisteredRtos=False&includeNotRtos=False&orgSearchByNameSubmit=Search&JavaScriptEnabled=true'
headers = {
'origin': 'https://training.gov.au',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36',
'x-requested-with': 'XMLHttpRequest'
}
data = {'page': '1', 'size': '200', 'orderBy': 'LegalPersonName-asc', 'groupBy': '', 'filter': ''}
r = requests.post(url, json=data, headers=headers)
response = r.json()
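If one request with a large 'size' is not enough, the 'page' field can be stepped through instead. A rough sketch, assuming the search response has a 'total' count next to 'data' (the scope response does, but for the search endpoint that is a guess). `fetch_page` is a hypothetical callable that would wrap the `requests.post` call above; a fake one is used here so the example runs offline.

```python
import math

def fetch_all_rows(fetch_page, size=200):
    """Collect rows from every page; fetch_page(page, size) returns a dict."""
    first = fetch_page(1, size)
    rows = list(first.get('data', []))
    # Fall back to the first page's length if no 'total' is reported.
    total = first.get('total', len(rows))
    pages = math.ceil(total / size) if size else 1
    for page in range(2, pages + 1):
        rows.extend(fetch_page(page, size).get('data', []))
    return rows

# Stand-in for the real POST request: five fake organisations.
def fake_page(page, size):
    items = [{'Codes': str(n)} for n in range(5)]
    start = (page - 1) * size
    return {'data': items[start:start + size], 'total': len(items)}

rows = fetch_all_rows(fake_page, size=2)
print(len(rows))
```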
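Result: from the search results you can take the 'OrganisationId' value
ea38f597-077e-4c57-b7b6-7ca7dde88399
and use it directly as OrganizationID.
There is no need to use 'Codes': '6639'
to parse https://training.gov.au/Organisation/Details/6639
to get the organisation ID.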
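Pulling the IDs out of the search response could look roughly like this. The `response` dict is a shortened, hand-written copy of one search result, and `collect_organisation_ids` is just a helper name made up for the example.

```python
# Shortened sample of the search response, keeping only the fields used here.
response = {
    'data': [
        {'OrganisationId': 'ea38f597-077e-4c57-b7b6-7ca7dde88399',
         'Codes': '6639',
         'LegalPersonName': '1 EDUCATION PTY LTD'},
    ]
}

def collect_organisation_ids(search_response):
    """Map each organisation code to its OrganisationId."""
    return {
        row.get('Codes'): row['OrganisationId']
        for row in search_response.get('data', [])
        if row.get('OrganisationId')
    }

ids = collect_organisation_ids(response)
print(ids)
```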
'Codes': '6639',
https://training.gov.au/Organisation/Details/6639
https://training.gov.au/Organisation/AjaxScopeSkillSet/ea38f597-077e-4c57-b7b6-7ca7dde88399?includeImplicit=True&tabIndex=4&_=1610518795452
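This page loads its data with JavaScript, so you need a browser. Using Selenium will be the only way to get this information.
I tried Selenium too, but it didn't work for me!
But this doesn't print the table: print(df) shows it in a different format. It prints like this: 5096634d-4210-4fd4-a51d-f548cd39d57b ... 19. It doesn't print the table with the course title, extent and code.
Use df.iloc[0] or df.to_excel('file.xlsx'); what print(df) shows has nothing to do with the actual format. Or modify the function get_AjaxScopeQualification: replace return dfn with return response.
So I need to extract from this output? I want the Code, Title and Extent.
You are right, use the JSON result, because it contains more information. You don't need to parse 2 links for each code. Just use the 'OrganisationId' from the results.
OK, I will try it and let you know how it goes, but I'm very confused because there are 4k links. I want to use a for loop to process the 4k links, like this.
In that case, you need to iterate over the list of 'OrganisationId' values.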
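Iterating over the ID list could be sketched like this. `scope_table` and `fake_fetch` are names invented for the example. In real use you would pass get_AjaxScopeQualification as `fetch_scope` and probably set a small `delay` so 4k requests don't hammer the site.

```python
import time

import pandas as pd

COLUMNS = ['Code', 'Title', 'Extent']

def scope_table(organisation_ids, fetch_scope, delay=0.0):
    """Build one Code/Title/Extent table covering many organisations.

    fetch_scope is any callable that takes an organisation ID and returns
    the parsed JSON dict with a 'data' list.
    """
    frames = []
    for org_id in organisation_ids:
        response = fetch_scope(org_id)
        if delay:
            time.sleep(delay)  # pause between requests
        if not response or not response.get('data'):
            continue  # nothing on scope for this organisation
        df = pd.json_normalize(response, 'data')[COLUMNS]
        df.insert(0, 'OrganisationId', org_id)
        frames.append(df)
    if not frames:
        return pd.DataFrame(columns=['OrganisationId'] + COLUMNS)
    return pd.concat(frames, ignore_index=True)

# Stand-in for the real request so the sketch runs offline.
def fake_fetch(org_id):
    return {'data': [{'Code': 'BSB20115',
                      'Title': 'Certificate II in Business',
                      'Extent': 'Deliver and assess'}],
            'total': 1}

table = scope_table(['id-1', 'id-2'], fake_fetch)
print(table)
```

The combined frame can then be written out once with table.to_excel('file.xlsx') instead of one file per organisation.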
response['data'][0]
{'OrganisationId': 'ea38f597-077e-4c57-b7b6-7ca7dde88399',
'IsRto': True,
'IsTpd': False,
'Codes': '6639',
'LegalPersonName': '1 EDUCATION PTY LTD',
'LegalPersonNameNonCurrent': 'Brad Fenby and Associates Pty Ltd, Franklyn Scholar (Victoria) Pty Ltd',
'TradingNames': [],
'WebAddresses': ['http://www.1education.com.au'],
'GeneralEnquiriesPhone': '0478752453',
'RegistrationStatus': None,
'ValidationType': 0,
'RtoStatus': 0,
'StatusString': 'Current',
'RegistrationManagerId': '12',
'RegistrationStartDate': '/Date(1554037200000+1100)/',
'RegistrationEndDate': '/Date(1774789200000+1100)/',
'CreatedDate': '/Date(1307654398430+1000)/',
'ExternalLinks': {'ExternalLinkType': 2,
'Description': 'MySkillsRto',
'Url': 'http://www.myskills.gov.au/RegisteredTrainers/Details?rtocode={0}'},
'RtoType': '91',
'ActiveScopeAct': True,
'ActiveScopeNsw': True,
'ActiveScopeVic': True,
'ActiveScopeQld': True,
'ActiveScopeSA': True,
'ActiveScopeNT': True,
'ActiveScopeWA': True,
'ActiveScopeTas': True,
'ActiveScopeInt': True,
'RegistrationManagerShortName': 'ASQA',
'StatusSortOrder': '4',
'MySkillsLink': 'http://www.myskills.gov.au/RegisteredTrainers/Details?rtocode=6639'}
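The '/Date(1554037200000+1100)/' values look like the usual ASP.NET JSON date format. Assuming that is what they are (milliseconds since the Unix epoch in UTC, followed by an optional UTC offset), they can be turned into datetimes along these lines. `parse_json_date` is a helper written for this example, not something the site or pandas provides.

```python
import re
from datetime import datetime, timedelta, timezone

DATE_RE = re.compile(r'/Date\((-?\d+)([+-]\d{4})?\)/')
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def parse_json_date(value):
    """Convert '/Date(ms+hhmm)/' to an aware datetime, or None if no match."""
    match = DATE_RE.fullmatch(value or '')
    if match is None:
        return None
    moment = EPOCH + timedelta(milliseconds=int(match.group(1)))
    offset = match.group(2)
    if offset:
        # Express the same instant in the offset given in the string.
        sign = -1 if offset[0] == '-' else 1
        delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5]))
        moment = moment.astimezone(timezone(sign * delta))
    return moment

start = parse_json_date('/Date(1554037200000+1100)/')
print(start.isoformat())
```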