Parsing a URL with Python

python parsing

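I want to parse the following URL:

I want to get the URL that corresponds to the text:

Battery pack with bus bar having novel structure

I am using Python, and I am not very familiar with JavaScript. How can I do this?

So far I have looked at requests_html and tried the following code: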

from requests_html import HTMLSession
from bs4 import BeautifulSoup

publication_number_to_scrape = "EP2814089"
# Append the publication number once; the base URL must not already contain it
url = "https://worldwide.espacenet.com/searchResults?ST=singleline&locale=fr_EP&submitted=true&DB=&query=" + publication_number_to_scrape
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}

# create an HTML Session object
session = HTMLSession()

# Use the object above to connect to needed webpage
resp = session.get(url, headers=headers)
print(resp.content)

# Run the page's JavaScript; render() updates resp.html in place
resp.html.render()

# Parse the rendered HTML rather than the raw response body
soup = BeautifulSoup(resp.html.html, 'html.parser')
print(soup)

In the printed output I see this part:

</li>
<li class="bendractive"><a accesskey="b" href="">Liste de résultats</a></li>
<li class="bendr"><a accesskey="c" class="ptn" href="/mydocumentslist?submitted=true&amp;locale=fr_EP" id="menuPnStar">Ma liste de brevets (<span id="menuPnCount"></span>)</a></li>
<li class="bendr"><a accesskey="d" href="/queryHistory?locale=fr_EP">Historique des requêtes</a></li>
<li class="spacer"></li>
<li class="bendl"><a accesskey="e" href="/settings?locale=fr_EP">Paramètres</a></li>
<li class="bendl last">
<a accesskey="f" href="/help?locale=fr_EP&amp;method=handleHelpTopic&amp;topic=index">Aide</a>
</li>

Use Selenium from PyPI

and get the id or XPath of the content you are interested in.

In your case:

id=publicationId1

or

//a[@id='publicationId1']

or

xpath=(.//*[normalize-space(text()) and normalize-space(.)='|'])[5]/following::a[2]
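
The locators above could be used roughly as follows. This is a minimal sketch, not a tested scraper: it assumes the search results page really contains a link with id `publicationId1`, that a Chrome driver is available locally, and the helper names (`publication_href`, `publication_href_by_xpath`, `fetch_with_chrome`) are made up for illustration. The strings `"id"` and `"xpath"` are the values behind Selenium's `By.ID` and `By.XPATH`.

```python
SEARCH_URL = (
    "https://worldwide.espacenet.com/searchResults"
    "?ST=singleline&locale=fr_EP&submitted=true&DB=&query=EP2814089"
)


def publication_href(driver, element_id="publicationId1"):
    """Return the href of the result link, located by its id."""
    # "id" is the string value of Selenium's By.ID locator strategy.
    link = driver.find_element("id", element_id)
    return link.get_attribute("href")


def publication_href_by_xpath(driver, element_id="publicationId1"):
    """Same lookup, expressed with the XPath locator from the answer."""
    # "xpath" is the string value of Selenium's By.XPATH locator strategy.
    link = driver.find_element("xpath", f"//a[@id='{element_id}']")
    return link.get_attribute("href")


def fetch_with_chrome(url=SEARCH_URL):
    """Open the search page in Chrome and return the link target."""
    # Imported here so the helpers above can be used without a browser.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        # Give dynamically inserted elements up to 10 seconds to appear.
        driver.implicitly_wait(10)
        driver.get(url)
        return publication_href(driver)
    finally:
        driver.quit()
```

Calling `fetch_with_chrome()` would then return the link target if the element is found; if the page uses a different id, `find_element` raises `NoSuchElementException`.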

I think this will do the job:

import requests
from bs4 import BeautifulSoup

# Cookies copied from a browser session; the session id will differ per visit
cookies = {
    'JSESSIONID': '9ULYIsd9+RmCkgzGPoLdCWMP.espacenet_levelx_prod_1',
    'org.springframework.web.servlet.i18n.CookieLocaleResolver.LOCALE': 'fr_EP',
    'menuCurrentSearch': '%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline%26locale%3Dfr_EP%26query%3Dep2814089',
    'currentUrl': 'https%3A%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline%26locale%3Dfr_EP%26query%3Dep2814089',
    'PGS': '10',
}

# Request headers copied from the browser
headers = {
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
    'Sec-Fetch-User': '?1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-Mode': 'navigate',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'tr,tr;q=0.9',
}

# Query string parameters of the search
params = (
    ('DB', ''),
    ('ST', 'singleline'),
    ('locale', 'fr_EP'),
    ('query', 'ep2814089'),
)

response = requests.get('https://worldwide.espacenet.com/searchResults', headers=headers, params=params, cookies=cookies)
soup = BeautifulSoup(response.text, 'html.parser')
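
For reference, the `params` tuple above should correspond to a query string like the one built below. This is only a sketch using the standard library, and `build_search_url` is a hypothetical helper name; the exact encoding that `requests` sends may differ in minor details.

```python
from urllib.parse import urlencode

BASE = "https://worldwide.espacenet.com/searchResults"


def build_search_url(publication_number, locale="fr_EP"):
    """Build the search URL from the same parameters as the params tuple."""
    params = (
        ("DB", ""),
        ("ST", "singleline"),
        ("locale", locale),
        ("query", publication_number),
    )
    # urlencode keeps the empty DB value as "DB=" and escapes special characters.
    return BASE + "?" + urlencode(params)


print(build_search_url("ep2814089"))
```

Building the URL from parameters this way avoids accidentally writing the publication number into the query twice.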

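Thanks, this code also returns a result. How do I extract the URL I am looking for from the result? The URL I am looking for is the one I get when I click on the text "Battery pack with bus bar having novel structure", namely:

In the end I used F12 in Chrome while the page was loading and identified the response URL I was interested in.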

result = ['EP2814089 (A4)', 'EP2814089 (B1)', ....]
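
On the question of pulling the link out of the returned HTML, one possible approach is sketched below. It assumes the HTML contains an `<a>` tag with id `publicationId1`, as the Selenium answer suggests; if the link is only inserted by JavaScript, it will not be in `response.text` and this returns `None`. The sample markup and its `href` are invented for illustration, and the standard library parser is used here as a stand-in for BeautifulSoup, where the equivalent lookup would be `soup.find("a", id="publicationId1")`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://worldwide.espacenet.com/"


class LinkById(HTMLParser):
    """Record the href of the first <a> tag whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and self.href is None and attributes.get("id") == self.target_id:
            self.href = attributes.get("href")


def extract_link(html_text, target_id="publicationId1", base_url=BASE_URL):
    """Return the absolute URL of the matching link, or None if it is absent."""
    parser = LinkById(target_id)
    parser.feed(html_text)
    if parser.href is None:
        return None
    # Relative hrefs such as "/path?x=1" are resolved against the site root.
    return urljoin(base_url, parser.href)


# Hypothetical fragment standing in for response.text; the real markup may differ.
sample_html = '<a id="publicationId1" href="/example/details?CC=EP&amp;NR=2814089">Title</a>'
print(extract_link(sample_html))
```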