Error scraping the title URLs using Python
python, web-scraping, beautifulsoup

I wrote some code to scrape the title URLs, but I am getting an error when extracting them. Could you please guide me? Here is my code:
import requests
from bs4 import BeautifulSoup
# import pandas as pd
import csv

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'html.parser')  # 1. html, 2. parser
        return soup

def get_index_data(soup):
    try:
        titles_link = soup.find_all('a', class_="body_link_11")
    except:
        titles_link = []
    # URLS = [item.get('href') for item in titles_link]
    print(titles_link)

def main():
    mainurl = "http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/" \
              "searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1"
    get_index_data(get_page(mainurl))

if __name__ == '__main__':
    main()
If you want to get all the links, try this:
import requests
from bs4 import BeautifulSoup

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'html.parser')  # 1. html, 2. parser
        return soup

def get_index_data(soup):
    try:
        titles_link = soup.find_all('a', class_="body_link_11")
    except:
        titles_link = []
    else:
        titles_link_output = []
        for link in titles_link:
            try:
                item_id = link.attrs.get('item_id', None)  # All titles with valid links will have an item_id
                if item_id:
                    titles_link_output.append("{}{}".format("http://cgsc.cdmhost.com", link.attrs.get('href', None)))
            except:
                continue
        print(titles_link_output)

def main():
    mainurl = "http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1"
    get_index_data(get_page(mainurl))

main()
Output:
['http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2653/rec/1', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2385/rec/2', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3309/rec/3', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2425/rec/4', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/150/rec/5', 'http://cgsc.cdmhost.com/cdm/compoundobject/collection/p4013coll8/id/2501/rec/6', 'http://cgsc.cdmhost.com/cdm/compoundobject/collection/p4013coll8/id/2495/rec/7', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3672/rec/8', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3407/rec/9', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/4393/rec/10', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3445/rec/11', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3668/rec/12', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3703/rec/13', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2952/rec/14', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2898/rec/15', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3502/rec/16', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3553/rec/17', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/4052/rec/18', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3440/rec/19', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3583/rec/20']
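The question imports `csv` but never uses it. If the goal is to persist the collected URLs, a minimal sketch could write them out with `csv.writer` (the filename `titles.csv` and the one-column layout are assumptions, and only a sample of the output list is shown):

```python
import csv

# A sample of the URLs printed above (truncated for brevity)
urls = [
    "http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2653/rec/1",
    "http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2385/rec/2",
]

# Write one URL per row, with a header row (assumed filename)
with open("titles.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    writer.writerows([u] for u in urls)
```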
What is your error? Please provide the full stack trace. You should check that out as well.
Thank you, sir! Could you add the main part of the website to the href?
I don't understand what you mean by adding the "main" part?
This is the main part, and this is the other part we are scraping: cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1. Just join them together like a complete url, for example:
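Joining the site's base URL with a relative href, as discussed above, can also be done with the standard library's `urllib.parse.urljoin` instead of string formatting (the sample href below is taken from the output list earlier in the thread):

```python
from urllib.parse import urljoin

base = "http://cgsc.cdmhost.com"
# Sample relative href as it appears in the page's anchor tags
href = "/cdm/singleitem/collection/p4013coll8/id/2653/rec/1"

# urljoin handles absolute paths, trailing slashes, etc.
full_url = urljoin(base, href)
print(full_url)
# http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2653/rec/1
```

Unlike plain concatenation, `urljoin` will not produce a double slash if the base ever ends with `/`.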