Parsing a URL with Python
I want to parse the following URL:
I want to get the URL that corresponds to the text: "Battery pack with bus bar having novel structure".
I am using Python, and I am not very familiar with JavaScript. How can I do this?
So far I have looked at requests-html and tried the following code:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
publication_number_to_scrape = "EP2814089"
url = "https://worldwide.espacenet.com/searchResults?ST=singleline&locale=fr_EP&submitted=true&DB=&query=" + publication_number_to_scrape
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
resp = session.get(url, headers=headers)
print(resp.content)
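# Run the page's JavaScript; render() updates resp.html in place and returns None
resp.html.render()
# Parse the rendered markup rather than the raw response body
soup = BeautifulSoup(resp.html.html, 'html.parser')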
print(soup)
In the printed output, I see this section:
</li>
<li class="bendractive"><a accesskey="b" href="">Liste de résultats</a></li>
<li class="bendr"><a accesskey="c" class="ptn" href="/mydocumentslist?submitted=true&locale=fr_EP" id="menuPnStar">Ma liste de brevets (<span id="menuPnCount"></span>)</a></li>
<li class="bendr"><a accesskey="d" href="/queryHistory?locale=fr_EP">Historique des requêtes</a></li>
<li class="spacer"></li>
<li class="bendl"><a accesskey="e" href="/settings?locale=fr_EP">Paramètres</a></li>
<li class="bendl last">
<a accesskey="f" href="/help?locale=fr_EP&method=handleHelpTopic&topic=index">Aide</a>
</li>
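A quick way to check whether the link you want is present in the returned HTML at all is to list every anchor with its text and absolute URL. The sketch below uses only the standard library; the `AnchorCollector` class is an illustrative name, the sample fragment is two lines copied from the output above, and the base URL is assumed to be the Espacenet host from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://worldwide.espacenet.com/"  # host taken from the question


class AnchorCollector(HTMLParser):
    """Collect (text, absolute_url) pairs for every <a href=...> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open anchor.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            # Resolve relative hrefs such as "/settings?..." against the site root.
            self.links.append((text, urljoin(self.base_url, self._href)))
            self._href = None


fragment = """
<li class="bendr"><a accesskey="d" href="/queryHistory?locale=fr_EP">Historique des requêtes</a></li>
<li class="bendl"><a accesskey="e" href="/settings?locale=fr_EP">Paramètres</a></li>
"""

collector = AnchorCollector(BASE_URL)
collector.feed(fragment)
for text, url in collector.links:
    print(text, "->", url)
```

If the patent title never shows up in this list when you feed it the full response, the link is most likely inserted by JavaScript after the page loads.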
Use Selenium from PyPI
and get the id or XPath of the element you are interested in.
In your case:
id=publicationId1
or //a[@id='publicationId1']
or
xpath=(.//*[normalize-space(text()) and normalize-space(.)='|'])[5]/following::a[2]
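A minimal sketch of that approach with Selenium 4 follows. It assumes Chrome and a matching driver are installed, and that the first result link really carries the id `publicationId1` as stated above; `publication_href` and `open_search_page` are illustrative names, not part of Selenium. Selenium is imported lazily so the lookup helper can be tried without a browser.

```python
SEARCH_URL = (
    "https://worldwide.espacenet.com/searchResults"
    "?ST=singleline&locale=fr_EP&submitted=true&DB=&query=EP2814089"
)


def publication_href(driver, element_id="publicationId1"):
    """Return the href of the element with the given id, or None if it is absent."""
    # "id" is the value of Selenium's By.ID locator strategy.
    matches = driver.find_elements("id", element_id)
    return matches[0].get_attribute("href") if matches else None


def open_search_page(url=SEARCH_URL, wait_seconds=10):
    """Start Chrome, load the search page, and return the driver."""
    # Imported here so publication_href has no hard dependency on Selenium.
    from selenium import webdriver

    driver = webdriver.Chrome()
    # Give JavaScript-built elements time to appear before lookups give up.
    driver.implicitly_wait(wait_seconds)
    driver.get(url)
    return driver
```

Typical use would be `driver = open_search_page()`, then `print(publication_href(driver))`, and finally `driver.quit()` to close the browser.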
I think this will do the job:
import requests
from bs4 import BeautifulSoup
cookies = {
    'JSESSIONID': '9ULYIsd9+RmCkgzGPoLdCWMP.espacenet_levelx_prod_1',
    'org.springframework.web.servlet.i18n.CookieLocaleResolver.LOCALE': 'fr_EP',
    'menuCurrentSearch': '%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline%26locale%3Dfr_EP%26query%3Dep2814089',
    'currentUrl': 'https%3A%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline%26locale%3Dfr_EP%26query%3Dep2814089',
    'PGS': '10',
}
headers = {
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
    'Sec-Fetch-User': '?1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-Mode': 'navigate',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'tr,tr;q=0.9',
}
# Query-string parameters; requests encodes them and appends them to the URL
params = (
    ('DB', ''),
    ('ST', 'singleline'),
    ('locale', 'fr_EP'),
    ('query', 'ep2814089'),
)
response = requests.get('https://worldwide.espacenet.com/searchResults', headers=headers, params=params, cookies=cookies)
# Parse the returned HTML so it can be searched for the link
soup = BeautifulSoup(response.text, 'html.parser')
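To see which address this request actually fetches, the query parameters can be encoded by hand. The sketch below uses the standard library's `urlencode` rather than requests itself, so it only approximates what `requests.get` sends; `build_search_url` is an illustrative helper, not part of any library.

```python
from urllib.parse import unquote, urlencode

BASE = "https://worldwide.espacenet.com/searchResults"

params = (
    ("DB", ""),
    ("ST", "singleline"),
    ("locale", "fr_EP"),
    ("query", "ep2814089"),
)


def build_search_url(base, params):
    """Append the encoded query string to the base URL."""
    return base + "?" + urlencode(params)


url = build_search_url(BASE, params)
print(url)

# The percent-encoded search cookie decodes to the same address, minus the scheme.
cookie_value = (
    "%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline"
    "%26locale%3Dfr_EP%26query%3Dep2814089"
)
print(unquote(cookie_value))
```

Pasting the printed URL into a browser is an easy way to confirm that the request targets the same results page you see interactively.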
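Thanks, that code also returns a result. How do I extract the URL I am looking for from it? The URL I want is the one I get when clicking on the text "Battery pack with bus bar having novel structure", i.e.:
In the end I used F12 in Chrome while the page was loading and identified the response URL I was interested in.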
result = ['EP2814089 (A4)', 'EP2814089 (B1)', ....]