How do I get the right source code from the URLs with my Python web crawler?
I am trying to write a web crawler in Python using the re and requests modules. I want to collect the urls from the first page (it is a forum) and then fetch information from each url. My problem now is that I already have the urls stored in a list, but I can't go any further and get the right source code from those urls. Here is my code:
import re
import requests

url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'
sourceCode = getsourse(url)  # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode)  # a list of the urls in the current page
for eachLink in allLinksinPage:
    url = 'http://bbs.skykiwi.com/' + eachLink.encode('utf-8')
    html = getsourse(url)  # THIS IS WHERE I CAN'T GET THE RIGHT SOURCE CODE

# To get the source code of the current url
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.text

# To get all the links in the current page
def getallLinksinPage(sourceCode):
    bigClasses = re.findall('<th class="new">(.*?)</th>', sourceCode, re.S)
    allLinks = []
    for each in bigClasses:
        everylink = re.findall('</em><a href="(.*?)" onclick', each, re.S)[0]
        allLinks.append(everylink)
    return allLinks
You define your functions after you use them, so the code will error. You also should not use re to parse html; use a parser like the one below. Also use urljoin to join the base url to the links. What you actually need are the hrefs in the anchor tags inside the div with the id threadlist:
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin  # Python 2; in Python 3 this is urllib.parse.urljoin

url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'

def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.content

# To get all the links in the current page
def getallLinksinPage(sourceCode):
    soup = BeautifulSoup(sourceCode)
    return [a["href"] for a in soup.select("#threadlist a.xst")]

sourceCode = getsourse(url)  # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode)  # a list of the urls in the current page
for eachLink in allLinksinPage:
    url = 'http://bbs.skykiwi.com/'
    html = getsourse(urljoin(url, eachLink))
    print(html)
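The CSS selector `#threadlist a.xst` picks every anchor with class `xst` inside the element whose id is `threadlist`. A minimal sketch of how that selection behaves; the HTML fragment below is a made-up example shaped like a forum thread list, not the real markup of bbs.skykiwi.com:

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like a forum thread list (an assumption,
# not the real markup of bbs.skykiwi.com).
snippet = """
<div id="threadlist">
  <a href="forum.php?mod=viewthread&amp;tid=123" class="xst">Thread A</a>
  <a href="forum.php?mod=viewthread&amp;tid=456" class="xst">Thread B</a>
  <a href="member.php?uid=7" class="author">skipped: wrong class</a>
</div>
<a href="forum.php?mod=misc" class="xst">skipped: outside #threadlist</a>
"""

soup = BeautifulSoup(snippet, "html.parser")
links = [a["href"] for a in soup.select("#threadlist a.xst")]
# Note the parser has already decoded the &amp; entities in the hrefs.
print(links)
```

Only the two anchors that match both the id scope and the class survive, and the hrefs come back with `&amp;` already decoded to `&`, which a regex on the raw source would not do.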
If you print urljoin(url, eachLink) in the loop, you will see all the correct links for the table and the correct source code returned. Here is a snippet of the links returned:
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3177846&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3197510&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3201399&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3170748&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3152747&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3168498&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3176639&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203657&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3190138&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3140191&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199154&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3156814&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203435&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3089967&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199384&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3173489&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3204107&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
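urljoin resolves a relative link against a base url, which is safer than naive string concatenation. It lives in urlparse on Python 2 and in urllib.parse on Python 3; a quick sketch of its behavior:

```python
try:
    from urlparse import urljoin       # Python 2
except ImportError:
    from urllib.parse import urljoin   # Python 3

base = 'http://bbs.skykiwi.com/'

# A relative link, as extracted from the thread list, is resolved against the base.
print(urljoin(base, 'forum.php?mod=viewthread&tid=3177846'))
# http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3177846

# An absolute link is returned unchanged.
print(urljoin(base, 'http://bbs.skykiwi.com/forum.php'))
# http://bbs.skykiwi.com/forum.php
```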
If you visit the links above in your browser, they work. Your own code, by contrast, builds urls like http://bbs.skykiwi.com/forum.php?mod=viewthread&amp;tid=3187289&amp;extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231, and from the results you will see:

Sorry, specified thread does not exist or has been deleted or is being reviewed
[New Zealand Skykiwi Community Home]

You can clearly see the difference in the urls. If you want your own version to work, you need to do a replace in your regex:
everylink = re.findall('</em><a href="(.*?)" onclick', each.replace("&","%26"), re.S)[0]
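The underlying issue is that hrefs in the raw page source contain HTML entities such as &amp;, and a regex copies them into the url verbatim, while a parser decodes them. A small sketch of that difference; the href value here is a made-up example:

```python
import re
from html import unescape  # Python 3; on Python 2 use HTMLParser.HTMLParser().unescape

# Raw attribute text as it might appear in the page source (assumed example).
raw = '</em><a href="forum.php?mod=viewthread&amp;tid=3187289" onclick'

# The regex hands back the href byte-for-byte, entity included.
link = re.findall('</em><a href="(.*?)" onclick', raw, re.S)[0]
print(link)            # forum.php?mod=viewthread&amp;tid=3187289

# Decoding the entity yields the url the server actually expects.
print(unescape(link))  # forum.php?mod=viewthread&tid=3187289
```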
What do you mean by the right source code of the URLs? Can you clarify your question and include any errors?

Thank you for your explanation, and especially for pointing me to BeautifulSoup. It's the first time I've heard of this great tool! I'm trying to finish the code with BeautifulSoup. Could you help me with my other question? @Padraic Cunningham