
Python Requests Library

Tags: python, html, web, screen-scraping

I'm new to Python and its available libraries, and I'm trying to write a script to scrape a website. I want to read all the links on a parent page, then have the script parse and read data from all of the parent page's child links.

For some reason, my code produces the following series of errors:

python ./scrape.py
/
Traceback (most recent call last):
  File "./scrape.py", line 27, in <module>
    a = requests.get(url)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 494, in request
    prep = self.prepare_request(req)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 437, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/Library/Python/2.7/site-packages/requests/models.py", line 305, in prepare
    self.prepare_url(url, params)
  File "/Library/Python/2.7/site-packages/requests/models.py", line 379, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/': No schema supplied. Perhaps you meant http:///?
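For reference, the last line of the traceback is the whole story: requests refuses any URL that lacks a scheme. A minimal sketch that reproduces the exception on its own, independent of the scraper:

import requests

# Any string without a scheme ("http://", "https://", ...) raises MissingSchema.
try:
    requests.get("/")   # the bare path the script printed above
except requests.exceptions.MissingSchema as exc:
    print(exc)          # Invalid URL '/': No schema supplied. Perhaps you meant http:///?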
From my Python script:

from bs4 import BeautifulSoup

import requests

#somesite = 'https://www.somesite.com/'

page = 'https://www.investopedia.com/terms/s/stop-limitorder.asp'

count = 0
#url = raw_input("Enter a website to extract the URL's from: ")
r = requests.get(page)              #requests html document
data = r.text                       #set data = to html text
soup = BeautifulSoup(data, "html.parser")          #parse data with BS

#count = 0;
#souplist = []

#list
A = []

#loop to search for all <a> tags that hold urls, store page data in array
for link in soup.find_all('a'):
    #print(link.get('href'))
    url = link.get('href')
    print(url)

    a = requests.get(url)


    #a = requests.get(url)
    #data1 = a.text
    #souplist.insert(0, BeautifulSoup[data1])
    #++count



#
#for link in soup.find_all('p'):
    #print(link.getText())

Some of the links on the page you are scraping are relative URLs (relative to the site itself). So you may need to prepend the site's base URL to those links before requesting them:

from urlparse import urlparse, urljoin

# Python 3
# from urllib.parse import urlparse
# from urllib.parse import urljoin

site = urlparse(page).scheme + "://" + urlparse(page).netloc   # e.g. "https://www.investopedia.com"
for link in soup.find_all('a'):
    url = link.get('href')
    if not urlparse(url).scheme:    # no scheme means the href is a relative URL
        url = urljoin(site, url)    # so resolve it against the site base
    print(url)
    a = requests.get(url)
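
As a quick illustration of what urljoin contributes here (the values below are examples, not taken from the post): it fills in the missing scheme and host for relative links and leaves absolute URLs untouched.

from urllib.parse import urljoin   # Python 3; on Python 2: from urlparse import urljoin

site = "https://www.investopedia.com"

# A relative path is resolved against the site base...
print(urljoin(site, "/terms/s/stop-limitorder.asp"))
# https://www.investopedia.com/terms/s/stop-limitorder.asp

# ...while a URL that already carries a scheme passes through unchanged.
print(urljoin(site, "https://www.example.com/other"))
# https://www.example.com/other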

Step 1: Do more debugging. If you get an error saying that http or https is missing from your URL, print the URL to find out what you are actually passing in, i.e. do something like print(url) before requests.get(url).

Your error comes from requests.get(url). Look at the printed URLs: they are not valid URLs.

print(url) is giving me valid URLs... here is one of them --

That one is invalid: it has no http://, https://, ftp://, file://, etc.

@ElliotPressman So look at that URL, and look at your error: Invalid URL '/': No schema supplied. Perhaps you meant http://? We can see that Python is right: there is no schema. Where did the http or https go? (If you don't say explicitly which schema, or protocol, to use, Python has no idea whether you want a web page, an FTP site, or maybe even an IRC server feed, etc.)
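
Putting the comments' advice together, a defensive version of the question's loop might look like the sketch below. This is a minimal sketch, not the accepted fix: it reuses the page URL from the question and additionally skips hrefs that cannot be fetched at all (missing attributes, mailto: or javascript: links).

from urllib.parse import urlparse, urljoin   # Python 3

import requests
from bs4 import BeautifulSoup

page = 'https://www.investopedia.com/terms/s/stop-limitorder.asp'
soup = BeautifulSoup(requests.get(page).text, "html.parser")

for link in soup.find_all('a'):
    url = link.get('href')
    if not url:                           # <a> tags without an href yield None; skip them
        continue
    scheme = urlparse(url).scheme
    if not scheme:                        # relative URL: resolve against the parent page
        url = urljoin(page, url)
    elif scheme not in ("http", "https"):
        continue                          # skip mailto:, javascript:, ftp:, ...
    print(url)
    a = requests.get(url)                 # every URL now has a schema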