Python: extracting a URL with BeautifulSoup
Tags: python, web-scraping

I am using this code:
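import urllib.request
from bs4 import BeautifulSoup

url = "https://github.com/search?o=desc&p=1&q=stars%3A%3E1&s=stars&type=Repositories"
# Download the search results page and decode it to text
with urllib.request.urlopen(url) as response:
    html = response.read()
html = html.decode('utf-8')
# Keep a local copy of the page for inspection
with open('page_content.html', 'w', encoding='utf-8') as new_file:
    new_file.write(html)
# Parse the page and collect the repository links
soup = BeautifulSoup(html, 'lxml')
g_data = soup.find_all("a", {"class": "v-align-middle"})
print(g_data[0])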
The output is:
Get the value of the attribute, parse it with json.loads(), and use the result as a regular Python dictionary:
import json
# your other code, up to setting the g_data
data_hydro = g_data[0]['data-hydro-click']
data_hydro = json.loads(data_hydro)
print(data_hydro['payload']['result']['url'])
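The lookup above assumes every link carries a well-formed data-hydro-click attribute. A minimal sketch of a more defensive version follows; the helper name url_from_hydro and the shortened sample string are hypothetical and only serve as illustration.

```python
import json
from typing import Optional


def url_from_hydro(raw: Optional[str]) -> Optional[str]:
    """Return payload -> result -> url from a data-hydro-click value, or None."""
    if not raw:
        return None  # attribute missing or empty
    try:
        return json.loads(raw)["payload"]["result"]["url"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not valid JSON, or the expected keys are absent
        return None


# Shortened, hypothetical attribute value for illustration
sample = (
    '{"event_type": "search_result.click", '
    '"payload": {"result": {"url": "https://github.com/freeCodeCamp/freeCodeCamp"}}}'
)
print(url_from_hydro(sample))
```

With the g_data list from the question, calling url_from_hydro(a.get('data-hydro-click')) for each a in g_data should give one URL (or None) per search result.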
It is inside a JSON string, which is why it was hard to find.
import json
from bs4 import BeautifulSoup

# The attribute value is JSON, so it is wrapped in single quotes to keep the HTML valid
html = """
<h3>
<a href="/freeCodeCamp/freeCodeCamp" class="v-align-middle" data-hydro-click='{"event_type":"search_result.click","payload":{"page_number":1,"query":"stars:>1","result_position":1,"click_id":28457823,"result":{"id":28457823,"global_relay_id":"MDEwOlJlcG9zaXRvcnkyODQ1NzgyMw==","model_name":"Repository","url":"https://github.com/freeCodeCamp/freeCodeCamp"},"originating_request_id":"EB94:4DE3:1D61C50:2AEAFBA:5A8D8E31"}}' data-hydro-hmac="2b170325f8ff481731dd5f65d85e7e94a356f75bdafce1f9c5cc60d112cbc2f8">freeCodeCamp/freeCodeCamp</a>
</h3>
"""
soup = BeautifulSoup(html, 'lxml')
parsed_json = json.loads(soup.a.get('data-hydro-click'))
parsed_json['payload']['result']['url']
# returns 'https://github.com/freeCodeCamp/freeCodeCamp'
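In raw page source, double quotes inside a double-quoted attribute value have to be escaped, typically as &quot; entities, which is part of why the URL is hard to spot by eye. The sketch below shows that decoding step using only the standard library; the shortened attribute value is a hypothetical stand-in, not the real page content.

```python
import html
import json

# Shortened, hypothetical attribute value as it might appear in raw page source,
# with quotes and ">" written as HTML entities
raw_attribute = (
    "{&quot;payload&quot;:{&quot;query&quot;:&quot;stars:&gt;1&quot;,"
    "&quot;result&quot;:{&quot;url&quot;:"
    "&quot;https://github.com/freeCodeCamp/freeCodeCamp&quot;}}}"
)

# Turn the entities back into plain characters, then parse the JSON
decoded = html.unescape(raw_attribute)
data = json.loads(decoded)
print(data["payload"]["result"]["url"])
```

BeautifulSoup decodes these entities itself while parsing, so the explicit html.unescape call is only there to make the step visible.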
I would try tag['result']['url'].
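It is in the link, not in the parent h3 tag, isn't it?
I have now added an image.
Please don't post images or screenshots of code. Also, show what you have tried and the problem you are facing.
OK, I will add my code and output, thanks!
Thanks for your reply, DOR. For this line of code, data-hydro = g_data[0]['data-hydro-click'], I get the following error: can't assign to operator
The following code works:
DATAHDRO = g_data[0]['data-hydro-click']
DATAHDRO = json.loads(DATAHDRO)
print(DATAHDRO['payload']['result']['url'])
My mistake, I used an illegal character in the variable name; thanks for pointing it out.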
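The assignment error from the comment thread comes from the hyphen: Python reads a name such as data-hydro as a subtraction, which cannot be assigned to. A small sketch that checks this with the standard library; the helper name is_valid_assignment is hypothetical.

```python
def is_valid_assignment(source: str) -> bool:
    """Return True if the given source line compiles as Python code."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# A hyphen is not allowed in an identifier, an underscore is
print("data-hydro".isidentifier())             # False
print("data_hydro".isidentifier())             # True

# The hyphenated name is parsed as "data minus hydro", so assigning to it fails
print(is_valid_assignment("data-hydro = 1"))   # False
print(is_valid_assignment("data_hydro = 1"))   # True
```

The exact wording of the SyntaxError message differs between Python versions, so the sketch only checks whether the line compiles.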