Python BS4 crawler IndexError

I'm trying to create a simple crawler that pulls metadata from a website and saves the information to a CSV. I've followed a few guides to get this far, but I'm now stuck on this error:

IndexError: list index out of range

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

# Copy all of the content from the provided web page
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()

# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')

# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')

# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle,webpage)
findPatLink = re.findall(patFinderLink,webpage)

# Create an iterator that will cycle through the first 16 articles and skip a few
listIterator = []
listIterator[:] = range(2,16)

# Print out the results to screen
for i in listIterator:
    print findPatTitle[i] # The title
    print findPatLink[i] # The link to the original article

    articlePage = urlopen(findPatLink[i]).read() # Grab all of the content from original article

    divBegin = articlePage.find('<div>') # Locate the div provided
    article = articlePage[divBegin:(divBegin+1000)] # Copy the first 1000 characters after the div

    # Pass the article to the Beautiful Soup Module
    soup = BeautifulSoup(article)

    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')

    # Print all of the paragraphs to screen
    for i in paragList:
        print i
        print '\n'

# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)

print soup2.findAll('title')
print soup2.findAll('link')

titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')

for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print '\n'

Thank you.

It seems you are not using all of the power that bs4 can give you.

You are getting this error because the length of findPatTitle is only one, since HTML documents usually have only one title element per document.
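To make the failure concrete: indexing any position past 0 in a one-element list raises exactly this exception. A minimal illustration (the list contents here are made up for demonstration):

titles = ['Tidy Away Today']  # re.findall() matched the single <title> element
print(titles[2])              # IndexError: list index out of range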

An easy way to get the HTML title is to use bs4 itself:

from bs4 import BeautifulSoup
from urllib import urlopen

webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)

# get the content of title
title = soup.title.text
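As a side note, if this is run on Python 3 the only change needed in the snippet above is the import, since urlopen moved into the urllib.request module:

from urllib.request import urlopen  # Python 3 equivalent of `from urllib import urlopen`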
If you try to iterate over findPatLink in the current way, you will probably get the same error, since its length is 6. It isn't entirely clear to me whether you want all the link elements or all the anchor elements, but sticking with the first idea, you can improve your code using bs4 again:

link_href_list = [link['href'] for link in soup.find_all("link")]
Finally, since you don't want some of those URLs, you can slice the link list in any way you want. An improved version of the last expression, which excludes the first and second results, could be:

link_href_list = [link['href'] for link in soup.find_all("link")[2:]]
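Since the end goal stated in the question is a CSV file, here is a minimal sketch tying these pieces together; the links.csv filename, the header row, and the [2:] slice are assumptions for illustration, not part of the original answer:

import csv
from urllib import urlopen  # Python 2; use urllib.request.urlopen on Python 3

from bs4 import BeautifulSoup

webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)

title = soup.title.text
link_href_list = [link['href'] for link in soup.find_all('link')[2:]]

# Write one row per link, pairing the page title with each href
with open('links.csv', 'wb') as f:  # on Python 3: open('links.csv', 'w', newline='')
    writer = csv.writer(f)
    writer.writerow(['title', 'href'])
    for href in link_href_list:
        writer.writerow([title, href])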

Can you narrow it down to exactly where the error occurs? At the very least, the full traceback is much more useful than just the last line. Please read it. What is the value of i, and what (if anything) is in findPatTitle?

print(len(findPatTitle)) and print(len(findPatLink)) will enlighten you.

listIterator = range(2, 16) is enough; then use BeautifulSoup to extract the titles etc.:

link_href_list = [link['href'] for link in soup.find_all("link")[2:]]