Python 3: opening and reading a URL that has no file name

python python-3.x web-scraping


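I have looked at the related questions, but I did not find an answer to this one:

I want to open a URL and parse its contents.

When I do this with google.com, there is no problem.

When I do it with a URL that has no file name, I often find that I have read an empty string.

See the code below as an example: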

import urllib.request

#urls = ["http://www.google.com", "http://www.whoscored.com", "http://www.whoscored.com/LiveScores"]
#urls = ["http://www.whoscored.com", "http://www.whoscored.com/LiveScores"]
urls = ["http://www.whoscored.com/LiveScores"]
print("Type of urls: {0}.".format(str(type(urls))))
for url in urls:
    print("\n\n\n\n---------------------------------------------\n\nUrl is: {0}.".format(url))
    sock=urllib.request.urlopen(url)
    print("I have this sock: {0}.".format(sock))
    htmlSource = sock.read()
    print("I read the source code...")
    htmlSourceLine = sock.readlines()
    sock.close()
    htmlSourceString = str(htmlSource)
    print("\n\nType of htmlSourceString: " + str(type(htmlSourceString)))
    htmlSourceString = htmlSourceString.replace(">", ">\n")
    htmlSourceString = htmlSourceString.replace("\\r\\n", "\n")
    print(htmlSourceString)
    print("\n\nI am done with this url: {0}.".format(url))

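I don't know why, but sometimes I get an empty string back for URLs without a file name, such as "www.whoscored.com/LiveScores" in the example, whereas "google.com" and "www.whoscored.com" seem to work every time.

I hope my description is understandable...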

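It looks like the site is coded to explicitly reject requests from non-browser clients. You will have to imitate a browser: create a session and make sure cookies are passed back and forth as required. Third-party libraries can help with these tasks, but in the end you will have to learn more about how that site operates.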
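As a rough illustration of the "session plus cookies" idea, here is a minimal standard-library sketch. It assumes that a browser-like User-Agent and a cookie jar are what the site expects, which has not been verified; the user-agent string and the helper names `build_opener_with_cookies` and `fetch` are made up for this example.

```python
import http.cookiejar
import urllib.request

# Hypothetical browser-like identifier; any realistic string could be used.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_opener_with_cookies(user_agent=USER_AGENT):
    """Return (opener, jar): an opener that stores cookies and sends user_agent."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replace the default "Python-urllib/x.y" identifier.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch(opener, url, timeout=10):
    """Fetch url through the opener and return the raw body as bytes."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Calling `opener, jar = build_opener_with_cookies()` once and then `fetch(opener, url)` for each page reuses the same cookie jar, so cookies set by an earlier response are sent with later requests. Whether that is enough for this particular site is an open question.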

Your code works intermittently for me, but using requests and sending a User-Agent header works reliably:

import requests

# Send a browser-like User-Agent header with each request.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'}
urls = ["http://www.whoscored.com/LiveScores"]

print("Type of urls: {0}.".format(str(type(urls))))
for url in urls:
    print("\n\n\n\n---------------------------------------------\n\nUrl is: {0}.".format(url))
    # requests.get returns a Response object, not a socket.
    response = requests.get(url, headers=headers)
    print("I have this response: {0}.".format(response))
    # .text is the body decoded to str; str(response.content) would give "b'...'".
    htmlSourceString = response.text
    print("I read the source code...")
    print("\n\nType of htmlSourceString: " + str(type(htmlSourceString)))
    # Put each tag on its own line and normalise Windows line endings.
    htmlSourceString = htmlSourceString.replace(">", ">\n")
    htmlSourceString = htmlSourceString.replace("\r\n", "\n")
    print(htmlSourceString)
    print("\n\nI am done with this url: {0}.".format(url))
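If installing a third-party package is not an option, the same idea can be sketched with the standard library alone. This is only an illustration: the user-agent string and the helper names `make_request`, `decode_body` and `fetch_text` are assumptions introduced here, and there is no guarantee the site accepts this header.

```python
import urllib.request

# Hypothetical browser-like identifier; not taken from the original answer.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def make_request(url, user_agent=USER_AGENT):
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def decode_body(raw, charset="utf-8"):
    """Turn response bytes into text; str(raw) would give "b'...'" instead."""
    return raw.decode(charset, errors="replace")


def fetch_text(url, timeout=10):
    """Download url and return its body as decoded text."""
    with urllib.request.urlopen(make_request(url), timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return decode_body(response.read(), charset)
```

`fetch_text("http://www.whoscored.com/LiveScores")` would then return a normal string, so the `replace("\\r\\n", "\n")` workaround from the question is no longer needed.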

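There are Python libraries for parsing content, such as BeautifulSoup. I don't know what you are trying to do.

Thanks for your answer. Since you mention requests: yes, I saw something about it in another post here, and, if you will bear with me, that brings me to another question. I tried to install requests, but apparently it did not work, because I still get the "No module named requests" message. What can I do? Does it help to mention that I installed Anaconda a while ago, and that I could see from the installation that Anaconda took over python34? I hope you can give me some good ideas so that I can carry on.

Thanks for your answer; I guess this is the short version of the first answer I got on this post.

Are there really that many ways a website can operate? Could you elaborate, or point me somewhere that would help me find out? Thanks!

I think the "automation" package is a good starting point. But be careful: there are a lot of other HTTP requirements.

You can see how much of a newbie I am: what is the automation package, where do I find it, and what can I do with it?

That's not surprising, because I now see that I got the package's name wrong; what I meant was a different package. Sorry.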
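On the "No module named requests" message: one common cause, assumed here rather than confirmed for this setup, is that the package was installed for a different Python interpreter than the one running the script (for example the system Python instead of Anaconda's). The small check below is a sketch; `is_installed` and `install_hint` are names made up for this example.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if this interpreter can find a top-level module."""
    return importlib.util.find_spec(module_name) is not None


def install_hint(module_name):
    """Return a pip command that targets the interpreter running this script."""
    return '"{0}" -m pip install {1}'.format(sys.executable, module_name)


# Show which interpreter is actually running, then check for requests.
print("Interpreter:", sys.executable)
if not is_installed("requests"):
    print("requests is not visible here; try:", install_hint("requests"))
```

Running the printed command installs the package for exactly that interpreter. With Anaconda, `conda install requests` from the Anaconda prompt is another option.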