
How can I get the correct source code from a URL with my web crawler?

Tags: python, python-2.7, web-crawler

I am trying to write a web crawler in Python, using the re and requests modules. I want to collect the URLs from the first page (a forum) and then fetch information from each of those URLs.

My problem now is that I have already stored the URLs in a list, but I cannot get the correct source code back from them.

Here is my code:

import re
import requests

url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'

sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
for eachLink in allLinksinPage:
    url = 'http://bbs.skykiwi.com/' + eachLink.encode('utf-8')
    html = getsourse(url) #THIS IS WHERE I CAN'T GET THE RIGHT SOURCE CODE


#To get the source code of current url
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows  NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.text

#To get all the links in current page
def getallLinksinPage(sourceCode):
    bigClasses = re.findall('<th class="new">(.*?)</th>', sourceCode, re.S)
    allLinks = []
    for each in bigClasses:
        everylink = re.findall('</em><a href="(.*?)" onclick', each, re.S)[0]
        allLinks.append(everylink)
    return allLinks
You define your functions after you use them, so the code errors out. You also should not use re to parse HTML; use a parser as shown below. Also use urljoin to join the base URL to each link. What you actually need are the hrefs from the anchor tags inside the div with id
threadlist
:

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'



def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows  NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.content

#To get all the links in current page
def getallLinksinPage(sourceCode):
    soup = BeautifulSoup(sourceCode, "html.parser")  # explicit parser avoids bs4's warning
    return [a["href"] for a in soup.select("#threadlist a.xst")]



sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
for eachLink in allLinksinPage:
    url = 'http://bbs.skykiwi.com/'
    html = getsourse(urljoin(url, eachLink))
    print(html)
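As a side note, `urljoin` resolves a relative link against the base URL exactly as needed here (a minimal sketch; Python 3 moved it from `urlparse` to `urllib.parse`):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = 'http://bbs.skykiwi.com/'
link = 'forum.php?mod=viewthread&tid=3177846'

# joins base and relative link into an absolute URL
print(urljoin(base, link))
# http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3177846
```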
If you print
urljoin(url, eachLink)
in the loop, you will see all the correct links for the table, and the correct source code is returned. Here is a snippet of the returned links:

http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3177846&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3197510&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3201399&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3170748&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3152747&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3168498&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3176639&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203657&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3190138&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3140191&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199154&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3156814&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203435&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3089967&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199384&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3173489&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3204107&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
If you visit the links above in your browser, they work. If instead you use the URL your own code produces, e.g.
http://bbs.skykiwi.com/forum.php?mod=viewthread&amp;tid=3187289&amp;extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
you will see:

Sorry, specified thread does not exist or has been deleted or is being reviewed
[New Zealand day-dimensional network Community Home]
You can clearly see the difference in the URLs. If you want your own URLs to work, you need to do a replace in your regex:

 everylink = re.findall('</em><a href="(.*?)" onclick', each.replace("&","%26"), re.S)[0]
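To see why the encoding matters (a sketch using shortened, hypothetical query strings): the forum packs several key=value pairs into the single `extra` parameter, so the inner separators must stay percent-encoded as `%26`; with plain `&` the server splits `extra` apart:

```python
from urllib.parse import parse_qs

# inner separators encoded as %26: 'extra' survives as one value
good = 'mod=viewthread&tid=1&extra=page%3D1%26filter%3Dtypeid'
print(parse_qs(good)['extra'])  # ['page=1&filter=typeid']

# inner separator left as a plain '&': 'extra' is truncated and
# 'filter' becomes a separate top-level parameter
bad = 'mod=viewthread&tid=1&extra=page%3D1&filter=typeid'
print(parse_qs(bad)['extra'])  # ['page=1']
```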

Comments:

What do you mean by "the correct source code" of the URL? Can you clarify your question and include any errors?

Thank you for the explanation, and especially for introducing me to BeautifulSoup. It's the first time I've heard of this great tool! I am trying to finish the code with BeautifulSoup. Could you help me with the other questions? @Padraic Cunningham