How to extract a table from a website using Python (Python, Pandas, Web Scraping, BeautifulSoup, Python Requests)


I have been trying to extract a table from a website, but I'm lost. Can anyone help me? My goal is to extract the table on the Scope page:

  • Find the OrganizationID from https://training.gov.au/Organisation/Details/31102
  • Find the XHR URL (POST method)
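
    import json
    import re

    import pandas as pd
    import requests


    def get_OrganizationID(url):
        # url = 'https://training.gov.au/Organisation/Details/31102'
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36'}
        resp = requests.get(url, headers=headers)
        # The ID appears in the page source as a query parameter; take the first match.
        id_list = re.findall(r'OrganizationID=(.*?)&', resp.text)
        OrganizationID = id_list[0] if id_list else None
        return OrganizationID


    # First get the organisation ID.
    url = 'https://training.gov.au/Organisation/Details/31102'
    OrganizationID = get_OrganizationID(url)


    def get_AjaxScopeQualification(OrganizationID):
        if OrganizationID:
            url = f'https://training.gov.au/Organisation/AjaxScopeQualification/{OrganizationID}?tabIndex=4'
            headers = {
                'origin': 'https://training.gov.au',
                'referer': f'https://training.gov.au/Organisation/Details/{OrganizationID}?tabIndex=4',
                'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36',
                'x-requested-with': 'XMLHttpRequest'
            }
            data = {'page': '1', 'size': '100', 'orderBy': 'Code-asc', 'groupBy': '', 'filter': ''}
            r = requests.post(url, json=data, headers=headers)
            # The body contains JavaScript "new Date(...)" calls, which are not valid JSON.
            # Turn each one into a quoted year-month-day string before parsing.
            response = json.loads(re.sub(r'new Date\((\d+),(\d+),(\d+),(\d+),0,0\)', r'"\1-\2-\3"', r.text))
            return response


    response = get_AjaxScopeQualification(OrganizationID)
    dfn = pd.json_normalize(response, 'data', meta=['total'])
    print(dfn.columns)
    # Select several columns by passing a list of names (note the double brackets).
    print(dfn[['Code', 'Title', 'Extent']])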
    
Result:

    response['data'][0]
    
    {'Id': '5096634d-4210-4fd4-a51d-f548cd39d57b',
     'NrtId': '2feb7e3f-7fc6-4719-ba66-2a066f6635c7',
     'RtoId': '3fbfd9c9-3cce-4d69-973e-4e2674f8c5a9',
     'TrainingComponentType': 2,
     'Code': 'BSB20115',
     'Title': 'Certificate II in Business',
     'IsImplicit': False,
     'ExtentId': '01',
     'Extent': 'Deliver and assess',
     'StartDate': '2015-3-3',
     'EndDate': '2022-3-3',
     'DeliveryNsw': True,
     'DeliveryVic': True,
     'DeliveryQld': True,
     'DeliverySa': True,
     'DeliveryWa': True,
     'DeliveryTas': True,
     'DeliveryNt': True,
     'DeliveryAct': True,
     'ScopeDecisionType': 0,
     'ScopeDecision': 'Deliver and assess',
     'OverseasCodeAlpha': None,
     'OverseasCodeAlhpaList': [],
     'OverseasCodeAlphaOutput': ''}
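
The regex substitution in the answer is needed because the endpoint's body contains JavaScript new Date(...) calls, which json.loads rejects. Here is a minimal offline sketch of that cleanup. The sample payload is made up for illustration, the names DATE_CALL, date_call_to_iso and parse_payload are hypothetical, and it assumes the server follows the JavaScript convention of a zero-based month index, so the month is shifted by one. If the site counts months from one, drop the shift.

```python
import json
import re
from datetime import date

# Same pattern as in the answer: year, month, day, hour, then ",0,0".
DATE_CALL = re.compile(r'new Date\((\d+),(\d+),(\d+),(\d+),0,0\)')


def date_call_to_iso(match):
    """Turn one new Date(y,m,d,h,0,0) call into a quoted ISO date string."""
    year, month_index, day = (int(match.group(i)) for i in (1, 2, 3))
    # Assumption: month_index is zero-based, as in JavaScript's Date constructor.
    return '"' + date(year, month_index + 1, day).isoformat() + '"'


def parse_payload(text):
    """Replace every date call, then parse the now-valid JSON."""
    return json.loads(DATE_CALL.sub(date_call_to_iso, text))


# Made-up payload in the shape the answer's regex expects.
sample = '{"data":[{"Code":"BSB20115","StartDate":new Date(2015,2,3,0,0,0)}],"total":1}'
parsed = parse_payload(sample)
print(parsed['data'][0]['StartDate'])
```

Using a replacement function instead of a backreference template keeps the month arithmetic in one place and yields zero-padded dates that sort correctly as strings.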
    

Page to process (organisation search) ->
    https://training.gov.au/Search/SearchOrganisation?Name=&IncludeUnregisteredRtos=false&IncludeNotRtos=false&orgSearchByNameSubmit=Search&AdvancedSearch=&JavaScriptEnabled=true

This is the AJAX link ->
    https://training.gov.au/Search/AjaxGetOrganisations?implicitNrtScope=True&includeUnregisteredRtosForScopeSearch=True&includeUnregisteredRtos=False&includeNotRtos=False&orgSearchByNameSubmit=Search&JavaScriptEnabled=true

Use the AJAX link with the POST method to get the JSON data.

Change 'size': '200' to adjust how many rows the response returns.

    import requests

    # AJAX endpoint behind the organisation search page.
    url = 'https://training.gov.au/Search/AjaxGetOrganisations?implicitNrtScope=True&includeUnregisteredRtosForScopeSearch=True&includeUnregisteredRtos=False&includeNotRtos=False&orgSearchByNameSubmit=Search&JavaScriptEnabled=true'
    headers = {
        'origin': 'https://training.gov.au',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest'
    }
    # 'size' controls how many rows come back per page.
    data = {'page': '1', 'size': '200', 'orderBy': 'LegalPersonName-asc', 'groupBy': '', 'filter': ''}
    r = requests.post(url, json=data, headers=headers)
    response = r.json()
    

Result: from the search results you can take ea38f597-077e-4c57-b7b6-7ca7dde88399 directly as the OrganizationID. There is no need to take 'Codes': '6639' and parse https://training.gov.au/Organisation/Details/6639 to get the OrganizationID.

    'Codes': '6639',
    https://training.gov.au/Organisation/Details/6639
    https://training.gov.au/Organisation/AjaxScopeSkillSet/ea38f597-077e-4c57-b7b6-7ca7dde88399?includeImplicit=True&tabIndex=4&_=1610518795452
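
As a rough sketch of that shortcut, the scope URL can be built straight from a search record, without fetching the Details page first. The helper name scope_url and its endpoint parameter are hypothetical. The URL pattern and the 'OrganisationId' and 'Codes' keys are taken from the snippets in this answer, and the record below is trimmed to the two fields the sketch needs.

```python
def scope_url(record, endpoint='AjaxScopeQualification'):
    """Build the scope AJAX URL for one search-result record."""
    org_id = record['OrganisationId']
    return f'https://training.gov.au/Organisation/{endpoint}/{org_id}?tabIndex=4'


# Trimmed copy of the first search record shown in this answer.
search_response = {
    'data': [
        {'OrganisationId': 'ea38f597-077e-4c57-b7b6-7ca7dde88399', 'Codes': '6639'},
    ]
}

# Map each organisation code to the URL that returns its scope table.
urls = {record['Codes']: scope_url(record) for record in search_response['data']}
print(urls['6639'])
```

Keying the result by 'Codes' keeps the human-readable code next to the GUID-based URL, which helps when matching rows back to organisations later.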
    

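This page uses JavaScript to load its data, so you need a browser. Using Selenium would be the only way to get this information.

I tried Selenium too, but it didn't work for me!

But instead of printing the table, print(df) prints in a different format, like this: 5096634d-4210-4fd4-a51d-f548cd39d57b ... 19. It doesn't print a table with the course title, extent, and code.

Use df.iloc[0] or df.to_excel('file.xlsx'); what print(df) shows has nothing to do with the actual format. Or modify the function get_AjaxScopeQualification: replace return dfn with return response.

So I need to extract from this output? All I want is Code, Title, and Extent.

You are right, please use the JSON result, since it contains more information. You don't need to parse 2 links for each code. Just use the OrganizationID from the result.

OK, I'll try it and let you know how it goes, but I'm very confused because there are 4k links. I'd like to use a for loop to process the 4k links, like this.

In that case, you need to iterate over the list of OrganizationID values.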
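
One way to sketch that iteration is shown here. collect_scope_rows and fake_fetch are hypothetical names. The fetch argument is assumed to behave like get_AjaxScopeQualification from the answer: it takes one ID and returns a dict with a 'data' list, or None. The fake fetcher only stands in for the network call, so the loop can be tried offline.

```python
import time


def collect_scope_rows(org_ids, fetch, delay=0.0):
    """Call fetch(org_id) for every ID and keep only Code, Title and Extent."""
    rows, failed = [], []
    for org_id in org_ids:
        try:
            response = fetch(org_id)
        except Exception as exc:
            # Record the failure and move on, so one bad ID does not stop 4k requests.
            failed.append((org_id, repr(exc)))
            continue
        for item in (response or {}).get('data', []):
            rows.append({
                'OrganizationID': org_id,
                'Code': item.get('Code'),
                'Title': item.get('Title'),
                'Extent': item.get('Extent'),
            })
        time.sleep(delay)  # optional pause between requests
    return rows, failed


def fake_fetch(org_id):
    """Offline stand-in for the real request; raises for one ID to show error handling."""
    if org_id == 'bad':
        raise RuntimeError('boom')
    return {'data': [{'Code': 'BSB20115', 'Title': 'Certificate II in Business',
                      'Extent': 'Deliver and assess'}], 'total': 1}


rows, failed = collect_scope_rows(['a', 'bad', 'b'], fake_fetch)
print(len(rows), 'rows,', len(failed), 'failed')
```

With the real function you would call collect_scope_rows(id_list, get_AjaxScopeQualification, delay=1.0) and then pass rows to pd.DataFrame. The one-second delay is only a suggestion to avoid hammering the site.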
    
First record of the search response:

    response['data'][0]
    
    {'OrganisationId': 'ea38f597-077e-4c57-b7b6-7ca7dde88399',
     'IsRto': True,
     'IsTpd': False,
     'Codes': '6639',
     'LegalPersonName': '1 EDUCATION PTY LTD',
     'LegalPersonNameNonCurrent': 'Brad Fenby and Associates Pty Ltd, Franklyn Scholar (Victoria) Pty Ltd',
     'TradingNames': [],
     'WebAddresses': ['http://www.1education.com.au'],
     'GeneralEnquiriesPhone': '0478752453',
     'RegistrationStatus': None,
     'ValidationType': 0,
     'RtoStatus': 0,
     'StatusString': 'Current',
     'RegistrationManagerId': '12',
     'RegistrationStartDate': '/Date(1554037200000+1100)/',
     'RegistrationEndDate': '/Date(1774789200000+1100)/',
     'CreatedDate': '/Date(1307654398430+1000)/',
     'ExternalLinks': {'ExternalLinkType': 2,
      'Description': 'MySkillsRto',
      'Url': 'http://www.myskills.gov.au/RegisteredTrainers/Details?rtocode={0}'},
     'RtoType': '91',
     'ActiveScopeAct': True,
     'ActiveScopeNsw': True,
     'ActiveScopeVic': True,
     'ActiveScopeQld': True,
     'ActiveScopeSA': True,
     'ActiveScopeNT': True,
     'ActiveScopeWA': True,
     'ActiveScopeTas': True,
     'ActiveScopeInt': True,
     'RegistrationManagerShortName': 'ASQA',
     'StatusSortOrder': '4',
     'MySkillsLink': 'http://www.myskills.gov.au/RegisteredTrainers/Details?rtocode=6639'}
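
The registration dates in this record arrive as strings such as '/Date(1554037200000+1100)/'. Here is a minimal sketch for turning them into datetime objects. It assumes the number is milliseconds since the Unix epoch in UTC and the suffix is a UTC offset in +HHMM form, which is the usual reading of this format but is not confirmed by the site. MS_DATE and parse_ms_date are hypothetical names.

```python
import re
from datetime import datetime, timedelta, timezone

# Milliseconds, then a signed HHMM offset, wrapped in /Date(...)/.
MS_DATE = re.compile(r'/Date\((-?\d+)([+-])(\d{2})(\d{2})\)/')


def parse_ms_date(value):
    """Convert a '/Date(ms+HHMM)/' string into a timezone-aware datetime."""
    match = MS_DATE.fullmatch(value)
    if match is None:
        raise ValueError(f'not a /Date(...)/ string: {value!r}')
    millis, sign, hours, minutes = match.groups()
    offset = timedelta(hours=int(hours), minutes=int(minutes))
    if sign == '-':
        offset = -offset
    # Assumption: the millisecond count is relative to 1970-01-01 UTC.
    utc_moment = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=int(millis))
    return utc_moment.astimezone(timezone(offset))


start = parse_ms_date('/Date(1554037200000+1100)/')
print(start.isoformat())
```

Adding a timedelta to the epoch avoids floating-point rounding of the millisecond count, and astimezone shows the moment in the offset the server reported.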