Python BS4 crawler IndexError

I'm trying to create a simple crawler that pulls metadata from a website and saves the information to a CSV. I've followed a few guides to get this far, but I'm now stuck on this error:

IndexError: list index out of range

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

# Copy all of the content from the provided web page
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()

# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')

# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')

# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle,webpage)
findPatLink = re.findall(patFinderLink,webpage)

# Create an iterator that will cycle through the first 16 articles and skip a few
listIterator = []
listIterator[:] = range(2,16)

# Print out the results to screen
for i in listIterator:
    print findPatTitle[i] # The title
    print findPatLink[i] # The link to the original article

    articlePage = urlopen(findPatLink[i]).read() # Grab all of the content from original article

    divBegin = articlePage.find('<div>') # Locate the div provided
    article = articlePage[divBegin:(divBegin+1000)] # Copy the first 1000 characters after the div

    # Pass the article to the Beautiful Soup Module
    soup = BeautifulSoup(article)

    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')

    # Print all of the paragraphs to screen
    for i in paragList:
        print i
        print '\n'

# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)

print soup2.findAll('title')
print soup2.findAll('link')

titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')

for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print '\n'

Thank you.

It seems you are not using all of the power that bs4 can give you.

You are getting this error because the length of findPatTitle is only one, since HTML documents usually have only one title element per document.
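To make the failure concrete: indexing any position past 0 in a one-element list raises exactly this exception. A minimal illustration (the list contents here are made up for demonstration):

titles = ['Tidy Away Today']  # re.findall() matched the single <title> element
print(titles[2])              # IndexError: list index out of range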

An easy way to get the HTML title is to use bs4 itself:

from bs4 import BeautifulSoup
from urllib import urlopen

webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)

# get the content of title
title = soup.title.text
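As a side note, if this is run on Python 3 the only change needed in the snippet above is the import, since urlopen moved into the urllib.request module:

from urllib.request import urlopen  # Python 3 equivalent of `from urllib import urlopen`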
If you try to iterate over findPatLink in the current way, you will probably get the same error, since its length is 6. It isn't entirely clear to me whether you want all the link elements or all the anchor elements, but sticking with the first idea, you can improve your code using bs4 again:

link_href_list = [link['href'] for link in soup.find_all("link")]
Finally, since you don't want some of those URLs, you can slice the link list in any way you want. An improved version of the last expression, which excludes the first and second results, could be:

link_href_list = [link['href'] for link in soup.find_all("link")[2:]]
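Since the end goal stated in the question is a CSV file, here is a minimal sketch tying these pieces together; the links.csv filename, the header row, and the [2:] slice are assumptions for illustration, not part of the original answer:

import csv
from urllib import urlopen  # Python 2; use urllib.request.urlopen on Python 3

from bs4 import BeautifulSoup

webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)

title = soup.title.text
link_href_list = [link['href'] for link in soup.find_all('link')[2:]]

# Write one row per link, pairing the page title with each href
with open('links.csv', 'wb') as f:  # on Python 3: open('links.csv', 'w', newline='')
    writer = csv.writer(f)
    writer.writerow(['title', 'href'])
    for href in link_href_list:
        writer.writerow([title, href])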

Can you narrow it down to exactly where the error occurs? At the very least, the full traceback is much more useful than just the last line. Please read it. What is the value of i, and what (if anything) is in findPatTitle?

print(len(findPatTitle)) and print(len(findPatLink)) will enlighten you.

listIterator = range(2, 16) is enough; then use BeautifulSoup to extract the titles etc.:

link_href_list = [link['href'] for link in soup.find_all("link")[2:]]