How to extract a table from a website using Python

python pandas web-scraping

I have been trying to extract a table from a website, but I'm lost. Can anyone help me? My goal is to extract the table on the scope page:

Find the OrganizationID from https://training.gov.au/Organisation/Details/31102

Find the XHR URL; the request uses the POST method.

import requests
import json
import pandas as pd
import re


def get_OrganizationID(url):
    # url = 'https://training.gov.au/Organisation/Details/31102'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36'}
    resp = requests.get(url, headers=headers)
    # The site spells the key "OrganisationId" (see the JSON keys in the results).
    id_list = re.findall(r'OrganisationId=(.*?)&', resp.text)
    OrganizationID = id_list[0] if id_list else None
    return OrganizationID


# First get the OrganizationID
url = 'https://training.gov.au/Organisation/Details/31102'
OrganizationID = get_OrganizationID(url)


def get_AjaxScopeQualification(OrganizationID):
    if OrganizationID:
        url = f'https://training.gov.au/Organisation/AjaxScopeQualification/{OrganizationID}?tabIndex=4'
        headers = {
            'origin': 'https://training.gov.au',
            'referer': f'https://training.gov.au/Organisation/Details/{OrganizationID}?tabIndex=4',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest'
        }
        data = {'page': '1', 'size': '100', 'orderBy': 'Code-asc', 'groupBy': '', 'filter': ''}
        r = requests.post(url, json=data, headers=headers)
        # The body contains JavaScript "new Date(...)" literals, which are not valid JSON;
        # rewrite them as quoted year-month-day strings before parsing.
        response = json.loads(re.sub(r'new Date\((\d+),(\d+),(\d+),(\d+),0,0\)', r'"\1-\2-\3"', r.text))
        return response


response = get_AjaxScopeQualification(OrganizationID)
dfn = pd.json_normalize(response, 'data', meta=['total'])
print(dfn.columns)
print(dfn[['Code', 'Title', 'Extent']])

Result:

response['data'][0]

{'Id': '5096634d-4210-4fd4-a51d-f548cd39d57b',
 'NrtId': '2feb7e3f-7fc6-4719-ba66-2a066f6635c7',
 'RtoId': '3fbfd9c9-3cce-4d69-973e-4e2674f8c5a9',
 'TrainingComponentType': 2,
 'Code': 'BSB20115',
 'Title': 'Certificate II in Business',
 'IsImplicit': False,
 'ExtentId': '01',
 'Extent': 'Deliver and assess',
 'StartDate': '2015-3-3',
 'EndDate': '2022-3-3',
 'DeliveryNsw': True,
 'DeliveryVic': True,
 'DeliveryQld': True,
 'DeliverySa': True,
 'DeliveryWa': True,
 'DeliveryTas': True,
 'DeliveryNt': True,
 'DeliveryAct': True,
 'ScopeDecisionType': 0,
 'ScopeDecision': 'Deliver and assess',
 'OverseasCodeAlpha': None,
 'OverseasCodeAlhpaList': [],
 'OverseasCodeAlphaOutput': ''}
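
The date fields in the result above come from JavaScript `new Date(...)` literals in the raw response, which `json.loads` cannot parse directly. The following is a minimal sketch of one way to normalise them, assuming the literals have the form `new Date(year,month,day,...)`; the helper name `parse_grid_json` and the sample string are made up for illustration and are not taken from the site.

```python
import json
import re

# Assumed literal shape: new Date(2015,2,3,0,0,0). Trailing time parts are ignored.
_JS_DATE = re.compile(r'new Date\((\d+),(\d+),(\d+)(?:,\d+)*\)')


def _to_iso(match):
    year, month, day = (int(g) for g in match.groups())
    # JavaScript months are zero-based, so add 1 to get the calendar month.
    return f'"{year:04d}-{month + 1:02d}-{day:02d}"'


def parse_grid_json(text):
    """Replace new Date(...) literals with quoted ISO dates, then parse as JSON."""
    return json.loads(_JS_DATE.sub(_to_iso, text))


# Made-up sample body, shaped like the response but not copied from the site.
sample = '{"data": [{"Code": "BSB20115", "StartDate": new Date(2015,2,3,0,0,0)}], "total": 1}'
parsed = parse_grid_json(sample)
print(parsed["data"][0]["StartDate"])
```

Zero-padded ISO strings sort correctly as text and can be passed to `pd.to_datetime` later if real date values are needed.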

Page to process ->

https://training.gov.au/Search/SearchOrganisation?Name=&IncludeUnregisteredRtos=false&IncludeNotRtos=false&orgSearchByNameSubmit=Search&AdvancedSearch=&JavaScriptEnabled=true

This is the AJAX link ->

https://training.gov.au/Search/AjaxGetOrganisations?implicitNrtScope=True&includeUnregisteredRtosForScopeSearch=True&includeUnregisteredRtos=False&includeNotRtos=False&orgSearchByNameSubmit=Search&JavaScriptEnabled=true

Use the AJAX link with the POST method to get the JSON data.

Change 'size': '200' to alter the number of rows returned in the response.

url = 'https://training.gov.au/Search/AjaxGetOrganisations?implicitNrtScope=True&includeUnregisteredRtosForScopeSearch=True&includeUnregisteredRtos=False&includeNotRtos=False&orgSearchByNameSubmit=Search&JavaScriptEnabled=true'
headers = {
    'origin': 'https://training.gov.au',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest'
}
data = {'page': '1', 'size': '200', 'orderBy': 'LegalPersonName-asc', 'groupBy': '', 'filter': ''}
r = requests.post(url, json=data, headers=headers)
response = r.json()

Result

From the search results, you can take ea38f597-077e-4c57-b7b6-7ca7dde88399 as the OrganizationID directly; there is no need to use 'Codes': '6639' to parse https://training.gov.au/Organisation/Details/6639 to get the OrganizationID.

'Codes': '6639',
https://training.gov.au/Organisation/Details/6639
https://training.gov.au/Organisation/AjaxScopeSkillSet/ea38f597-077e-4c57-b7b6-7ca7dde88399?includeImplicit=True&tabIndex=4&_=1610518795452
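
Because 'size' limits how many rows one request returns, collecting every organisation means requesting successive pages. Below is a rough sketch of such a paging loop. It assumes that 'page' is 1-based and that the search response carries a 'total' count next to 'data'; neither is confirmed for this endpoint. `fetch_all`, `fetch_page` and `make_fake_fetch` are hypothetical names, and the fake fetcher stands in for the real `requests.post` call so the loop can be tried offline.

```python
def fetch_all(fetch_page, size=200):
    """Collect records from every page.

    fetch_page(page, size) must return a dict with a 'data' list and,
    by assumption, a 'total' row count.
    """
    records = []
    page = 1
    while True:
        payload = fetch_page(page, size)
        batch = payload.get("data") or []
        records.extend(batch)
        total = payload.get("total") or 0
        # Stop on an empty page or once all rows are in hand.
        # If 'total' is missing, this stops after the first page.
        if not batch or len(records) >= total:
            break
        page += 1
    return records


def make_fake_fetch(n_rows):
    """Build an offline stand-in for the POST request (illustration only)."""
    rows = [{"OrganisationId": f"id-{i}"} for i in range(n_rows)]
    calls = []

    def fetch_page(page, size):
        calls.append(page)
        start = (page - 1) * size
        return {"data": rows[start:start + size], "total": n_rows}

    return fetch_page, calls


fetch_page, calls = make_fake_fetch(450)
records = fetch_all(fetch_page, size=200)
print(len(records), calls)
```

In real use, `fetch_page` would wrap the POST request shown earlier, putting `str(page)` and `str(size)` into the `data` dict.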

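This page loads its data with JavaScript, so you need a browser. Using Selenium would be the only way to get this information.

I tried Selenium too, but it didn't work for me!

But print(df) does not print the table; it prints in a different format, like this: 5096634d-4210-4fd4-a51d-f548cd39d57b ... 19. It does not print a table with the course name, extent and code.

Try df.iloc[0] or df.to_excel('file.xlsx'); what print(df) displays has nothing to do with the actual format. Or modify the function get_AjaxScopeQualification: replace return dfn with return response.

So I need to extract from this output? What I want is the code, title and extent.

You are right, use the JSON result, because it contains more information. You don't need to parse 2 links for each code. Just use the OrganizationID from the result.

OK, I'll try it and let you know how it goes, but I'm very confused because there are 4k links. I'd like to use a for loop to process the 4k links, like this.

In that case, you need to iterate over the list of OrganizationIDs.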
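
The loop over many IDs discussed above could look roughly like the sketch below. It is not the answerer's code: `collect_scope` and `fake_fetch_scope` are hypothetical names, the IDs "org-1" and "bad-id" are placeholders, and the fetcher is passed in as a parameter so that a function such as get_AjaxScopeQualification can be plugged in later. The fake fetcher only returns a record shaped like the scope result shown earlier.

```python
import time


def collect_scope(org_ids, fetch_scope, delay=0.0):
    """Call fetch_scope for each ID and flatten the rows into one list.

    Returns (rows, failed), where failed lists the IDs that raised an error,
    so one bad organisation does not abort the whole run.
    """
    rows, failed = [], []
    for org_id in org_ids:
        try:
            payload = fetch_scope(org_id)
        except Exception as exc:
            failed.append((org_id, repr(exc)))
            continue
        for item in (payload or {}).get("data", []):
            rows.append({
                "OrganisationId": org_id,
                "Code": item.get("Code"),
                "Title": item.get("Title"),
                "Extent": item.get("Extent"),
            })
        if delay:
            time.sleep(delay)  # optional pause between requests
    return rows, failed


def fake_fetch_scope(org_id):
    """Offline stand-in for the real request (illustration only)."""
    if org_id == "bad-id":
        raise ValueError("no such organisation")
    return {"data": [{"Code": "BSB20115",
                      "Title": "Certificate II in Business",
                      "Extent": "Deliver and assess"}],
            "total": 1}


rows, failed = collect_scope(["org-1", "bad-id"], fake_fetch_scope)
print(rows)
print(failed)
```

The resulting list of dicts can be handed to `pd.DataFrame(rows)` to get a single table with only the code, title and extent columns. With about 4k organisations, a small `delay` is a reasonable courtesy to the server.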
response['data'][0]

{'OrganisationId': 'ea38f597-077e-4c57-b7b6-7ca7dde88399',
 'IsRto': True,
 'IsTpd': False,
 'Codes': '6639',
 'LegalPersonName': '1 EDUCATION PTY LTD',
 'LegalPersonNameNonCurrent': 'Brad Fenby and Associates Pty Ltd, Franklyn Scholar (Victoria) Pty Ltd',
 'TradingNames': [],
 'WebAddresses': ['http://www.1education.com.au'],
 'GeneralEnquiriesPhone': '0478752453',
 'RegistrationStatus': None,
 'ValidationType': 0,
 'RtoStatus': 0,
 'StatusString': 'Current',
 'RegistrationManagerId': '12',
 'RegistrationStartDate': '/Date(1554037200000+1100)/',
 'RegistrationEndDate': '/Date(1774789200000+1100)/',
 'CreatedDate': '/Date(1307654398430+1000)/',
 'ExternalLinks': {'ExternalLinkType': 2,
  'Description': 'MySkillsRto',
  'Url': 'http://www.myskills.gov.au/RegisteredTrainers/Details?rtocode={0}'},
 'RtoType': '91',
 'ActiveScopeAct': True,
 'ActiveScopeNsw': True,
 'ActiveScopeVic': True,
 'ActiveScopeQld': True,
 'ActiveScopeSA': True,
 'ActiveScopeNT': True,
 'ActiveScopeWA': True,
 'ActiveScopeTas': True,
 'ActiveScopeInt': True,
 'RegistrationManagerShortName': 'ASQA',
 'StatusSortOrder': '4',
 'MySkillsLink': 'http://www.myskills.gov.au/RegisteredTrainers/Details?rtocode=6639'}
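
Since each search record already carries both 'Codes' and 'OrganisationId', the scope URL can be built without visiting the Details page. The sketch below shows this on a record trimmed from the output above. `index_by_code` and `scope_url` are hypothetical helper names; the URL pattern is the AjaxScopeQualification one used earlier on this page.

```python
def index_by_code(records):
    """Map each organisation code to its OrganisationId, skipping incomplete records."""
    return {
        rec["Codes"]: rec["OrganisationId"]
        for rec in records
        if rec.get("Codes") and rec.get("OrganisationId")
    }


def scope_url(organisation_id):
    """Build the scope AJAX URL for one organisation."""
    return ("https://training.gov.au/Organisation/AjaxScopeQualification/"
            f"{organisation_id}?tabIndex=4")


# Trimmed copy of the first search record shown above.
records = [{
    "OrganisationId": "ea38f597-077e-4c57-b7b6-7ca7dde88399",
    "Codes": "6639",
    "LegalPersonName": "1 EDUCATION PTY LTD",
}]

ids = index_by_code(records)
print(ids)
print(scope_url(ids["6639"]))
```

Keeping the code-to-ID mapping around also makes it easy to label the scope rows with the organisation code afterwards.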