Parsing a URL with Python


I want to parse the following URL:

I want to get the URL that corresponds to the text:

  • Battery pack with bus bar having novel structure

I am using Python, but I am not very familiar with JavaScript. How can I do this?

So far I have looked at requests-html and tried the following code:

    from requests_html import HTMLSession
    from bs4 import BeautifulSoup
    
    publication_number_to_scrape = "EP2814089"
    # Append the publication number once; the original string already ended in "ep2814089"
    url = "https://worldwide.espacenet.com/searchResults?ST=singleline&locale=fr_EP&submitted=true&DB=&query=" + publication_number_to_scrape
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    headers = {'User-Agent': user_agent}
    
    # create an HTML Session object
    session = HTMLSession()
    
    # Use the object above to connect to needed webpage
    resp = session.get(url, headers=headers)
    print(resp.content)
    
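    # Run JavaScript code on webpage; render() updates resp.html in place and returns None
    resp.html.render()
    
    # Parse the rendered HTML rather than the raw response body
    soup = BeautifulSoup(resp.html.html, 'html.parser')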
    print(soup)
    
In the printed output, I see this part:

    </li>
    <li class="bendractive"><a accesskey="b" href="">Liste de résultats</a></li>
    <li class="bendr"><a accesskey="c" class="ptn" href="/mydocumentslist?submitted=true&amp;locale=fr_EP" id="menuPnStar">Ma liste de brevets (<span id="menuPnCount"></span>)</a></li>
    <li class="bendr"><a accesskey="d" href="/queryHistory?locale=fr_EP">Historique des requêtes</a></li>
    <li class="spacer"></li>
    <li class="bendl"><a accesskey="e" href="/settings?locale=fr_EP">Paramètres</a></li>
    <li class="bendl last">
    <a accesskey="f" href="/help?locale=fr_EP&amp;method=handleHelpTopic&amp;topic=index">Aide</a>
    </li>
    
Use Selenium from PyPI and get the id or XPath of the element you are interested in.

In your case:

    id=publicationId1
    //a[@id='publicationId1']

or

    xpath=(.//*[normalize-space(text()) and normalize-space(.)='|'])[5]/following::a[2]
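
To see what the id-based locator selects without starting a browser, the sketch below runs the same XPath against a small hand-written snippet using the standard library's `xml.etree.ElementTree`. The snippet, its `href` value, and the variable names are made up for illustration; the real Espacenet markup may differ, and a live page usually needs Selenium or an HTML parser because it is rarely well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for part of the results page.
SAMPLE = """
<div>
  <a id="otherLink" href="/help">Aide</a>
  <a id="publicationId1" href="/hypothetical/detail?id=1">Example title</a>
</div>
"""

root = ET.fromstring(SAMPLE)

# ElementTree supports a subset of XPath, including attribute predicates.
link = root.find(".//a[@id='publicationId1']")

# Guard against a missing element before reading its attributes.
link_href = link.get("href") if link is not None else None
link_text = link.text if link is not None else None
print(link_href, link_text)
```

The longer `normalize-space` locator relies on XPath functions and axes that ElementTree does not implement, so it is best tried in Selenium itself.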

I think this will do the job:

    import requests
    from bs4 import BeautifulSoup

    # Cookies copied from a browser session; the session id will have expired
    cookies = {
        'JSESSIONID': '9ULYIsd9+RmCkgzGPoLdCWMP.espacenet_levelx_prod_1',
        'org.springframework.web.servlet.i18n.CookieLocaleResolver.LOCALE': 'fr_EP',
        'menuCurrentSearch': '%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline%26locale%3Dfr_EP%26query%3Dep2814089',
        'currentUrl': 'https%3A%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline%26locale%3Dfr_EP%26query%3Dep2814089',
        'PGS': '10',
    }

    # Browser-like request headers
    headers = {
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
        'Sec-Fetch-User': '?1',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-Mode': 'navigate',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'tr,tr;q=0.9',
    }

    # Query-string parameters of the search
    params = (
        ('DB', ''),
        ('ST', 'singleline'),
        ('locale', 'fr_EP'),
        ('query', 'ep2814089'),
    )

    response = requests.get('https://worldwide.espacenet.com/searchResults', headers=headers, params=params, cookies=cookies)
    soup = BeautifulSoup(response.text, 'html.parser')
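
One way to check which URL those query parameters produce, without sending a request, is to encode them with the standard library. This is only a sketch: `build_search_url` and `BASE_URL` are hypothetical names introduced here, not part of `requests`, which builds the query string from `params` on its own.

```python
from urllib.parse import urlencode

BASE_URL = "https://worldwide.espacenet.com/searchResults"


def build_search_url(publication_number, locale="fr_EP"):
    """Return the search URL for one publication number (hypothetical helper)."""
    params = (
        ("DB", ""),
        ("ST", "singleline"),
        ("locale", locale),
        ("query", publication_number),
    )
    # urlencode percent-escapes the values and joins the pairs with '&'.
    return BASE_URL + "?" + urlencode(params)


search_url = build_search_url("ep2814089")
print(search_url)
```

Building the query from key/value pairs also avoids the kind of mistake that string concatenation invites, such as appending the publication number to a URL that already contains it.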
    
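Thanks, that code also returns a result. How can I extract the URL I am looking for from it? The URL I want is the one I get when clicking the text "Battery pack with bus bar having novel structure", that is:

In the end I used F12 in Chrome while the page was loading and identified the response URL I was interested in.
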
    result = ['EP2814089 (A4)', 'EP2814089 (B1)', ....]
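
For the follow-up question of pulling a link out of the returned HTML, a minimal sketch using only the standard library might look like the following. It assumes the link carries the id `publicationId1` mentioned in the Selenium answer and that it is present in the HTML returned to the client; `LinkByIdParser`, `find_link`, and the sample markup are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkByIdParser(HTMLParser):
    """Collect the href of the first <a> tag with a given id (hypothetical helper)."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and self.href is None and attributes.get("id") == self.wanted_id:
            self.href = attributes.get("href")


def find_link(html_text, wanted_id, base_url):
    """Return the absolute URL of the matching link, or None if it is absent."""
    parser = LinkByIdParser(wanted_id)
    parser.feed(html_text)
    if parser.href is None:
        return None
    # Relative hrefs are resolved against the page they came from.
    return urljoin(base_url, parser.href)


# Hypothetical sample; the real results page may be structured differently.
SAMPLE_HTML = (
    '<ul><li><a id="publicationId1" '
    'href="/hypothetical/detail?id=1&amp;locale=fr_EP">Example title</a></li></ul>'
)

found = find_link(
    SAMPLE_HTML, "publicationId1", "https://worldwide.espacenet.com/searchResults"
)
print(found)
```

With the `soup` object from the answer above, the BeautifulSoup counterpart would be `soup.find('a', id='publicationId1')`. If the link is only added by JavaScript after the page loads, it will not appear in the HTML that `requests` receives, which is where Selenium or the browser's network tab (F12) comes in.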