BeautifulSoup4 scraping does not go beyond the first page of the website (Python 3.6)


I am trying to scrape from the first page to page 14 of this website. Here is my code:

import requests as r
from bs4 import BeautifulSoup as soup
import pandas 

#make a list of all web pages' urls
webpages=[]
for i in range(15):
    root_url = 'https://cross-currents.berkeley.edu/archives?author=&title=&type=All&issue=All&region=All&page='+ str(i)
    webpages.append(root_url)
    print(webpages)

#start looping through all pages
for item in webpages:  
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')

#find targeted info and put them into a list to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
    title = [el.replace('\n', '') for el in title_list]

#export to csv file via pandas
    dataset = {'Title': title}
    df = pandas.DataFrame(dataset)
    df.index.name = 'ArticleID'
    df.to_csv('example31.csv',encoding="utf-8")

The output CSV file only contains the targeted info from the last page. When I print webpages, it shows that all of the pages' URLs have been put into the list correctly. What am I doing wrong? Thanks in advance.

You are simply overwriting the same output CSV file for all of the pages. Instead, call .to_csv() in append mode so that the new data is added to the end of the existing file:

df.to_csv('example31.csv', mode='a', encoding="utf-8", header=False)
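One caveat if you go the append route: the output file from a previous run is never cleared, and header=False means the CSV never gets a header row at all. A minimal sketch that overwrites the file with a header on the first page and appends on the rest, reusing the question's loop variables:

for i, item in enumerate(webpages):
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')

    title_list = [title.text for title in page_soup.find_all('div', {'class': 'field field-name-node-title'})]
    title = [el.replace('\n', '') for el in title_list]

    df = pandas.DataFrame({'Title': title})
    # first page: overwrite the file and write the header;
    # later pages: append rows without repeating the header.
    # index=False avoids the row numbers restarting at 0 on every page.
    df.to_csv('example31.csv', mode='w' if i == 0 else 'a',
              encoding='utf-8', header=(i == 0), index=False)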
Or, better still, collect the titles into a single list and dump it to CSV once:

#start looping through all pages
titles = []
for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')

    #find targeted info and put them into a list to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]

    titles += [el.replace('\n', '') for el in title_list]

# export to csv file via pandas
dataset = [{'Title': title} for title in titles]
df = pandas.DataFrame(dataset)
df.index.name = 'ArticleID'
df.to_csv('example31.csv', encoding="utf-8")
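Since titles is already a flat list of strings, the list-of-dicts step is optional; passing the list straight to the DataFrame constructor builds the same single-column frame:

df = pandas.DataFrame({'Title': titles})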

In addition to what Alexe posted, another way is to keep appending the per-page dataframe to a new dataframe and then write that out to CSV.

Declare finalDf as a dataframe outside the loop:

finalDf = pandas.DataFrame()
Then, later on, do the following:

for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')

    # find targeted info and put it into lists to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
    title = [el.replace('\n', '') for el in title_list]

    # build a per-page dataframe and append it to the accumulated one
    dataset = {'Title': title}
    df = pandas.DataFrame(dataset)
    finalDf = finalDf.append(df)
    #df.index.name = 'ArticleID'
    #df.to_csv('example31.csv', mode='a', encoding="utf-8", header=False)

finalDf = finalDf.reset_index(drop = True)
finalDf.index.name = 'ArticleID'
finalDf.to_csv('example31.csv', encoding="utf-8")

Note the lines that use finalDf.

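A side note for anyone reading this with a newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A sketch of the same accumulate-then-write pattern using pandas.concat instead, with the same variables as above:

frames = []
for item in webpages:
    data = r.get(item, headers={'User-Agent': 'Mozilla/5.0'})
    page_soup = soup(data.text, 'html.parser')
    title_list = [t.text.replace('\n', '') for t in page_soup.find_all('div', {'class': 'field field-name-node-title'})]
    # collect a small per-page frame in a plain list
    frames.append(pandas.DataFrame({'Title': title_list}))

# ignore_index=True renumbers the combined rows 0..N-1,
# so the separate reset_index() step is not needed
finalDf = pandas.concat(frames, ignore_index=True)
finalDf.index.name = 'ArticleID'
finalDf.to_csv('example31.csv', encoding='utf-8')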
Thank you so much!! Your first suggestion did not work (still the same result), but "collect the titles into a single list and dump it to CSV once" worked perfectly!

Thanks for your input! This worked, except that the index numbers keep going back to 0 after 19 (not 0-400, but 0-19, then 0-19 again). Any idea why that happens?

@AshleyLiu I have updated the answer by adding reset_index(); you will now get 0-400 :)
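For readers who hit the same restarting numbers: each per-page DataFrame carries its own fresh 0-based index, and concatenation keeps those indexes verbatim, so the numbering starts over with every page. A tiny demonstration with made-up data:

a = pandas.DataFrame({'Title': ['x', 'y']})   # index: 0, 1
b = pandas.DataFrame({'Title': ['z']})        # index: 0 again
stacked = pandas.concat([a, b])               # index: 0, 1, 0
fixed = stacked.reset_index(drop=True)        # index: 0, 1, 2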