
Python: How to loop and save data from each iteration

Tags: python, loops, for-loop, beautifulsoup

I'm trying to learn how to scrape data from a webpage with Python, but I'm having trouble structuring the nested loops. I got some help with how to approach this (). I'm trying to get the code to iterate through the webpages for different weeks (and eventually years). Below is what I have so far, but it isn't repeating over the two weeks I want and saving them.

import requests, re, json
from bs4 import BeautifulSoup
weeks=['1','2']
data = pd.DataFrame(columns=['Teams','Link'])

scripts_head = soup.find('head').find_all('script')
all_links = {}
for i in weeks:
    r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/2018/seasontype/2/week/'+i)
    soup = BeautifulSoup(r.text, 'html.parser')
    for script in scripts_head:
        if 'window.espn.scoreboardData' in script.text:
            json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
            for event in json_scoreboard['events']:
                name = event['name']
                for link in event['links']:
                    if link['text'] == 'Gamecast':
                        gamecast = link['href']
                all_links[name] = gamecast
                #Save data to dataframe
                data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
        #Append new data to existing data        
        data=data.append(data2,ignore_index = True)


#Save dataframe with all links to csv for future use
data.to_csv(r'game_id_data.csv')

Edit: To add some clarification: it creates a copy of one week's data and repeatedly appends it to the end. I have also edited the code to include the appropriate libraries, so it should be possible to copy, paste, and run it in Python.

The problem is in your loop logic:

    if 'window.espn.scoreboardData' in script.text:
        ...
            data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
    #Append new data to existing data        
    data=data.append(data2,ignore_index = True)

Your indentation of that last line is wrong. As written, data2 is appended whether or not there is new scoreboard data; if there is not, the if body is skipped and the previous value of data2 is simply appended again.
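
As a minimal sketch of the fix, here is the tail of the loop with the append moved inside the if (same variable names as the question's code; the elided lines are unchanged):

    for script in scripts_head:
        if 'window.espn.scoreboardData' in script.text:
            ...
            #Save data to dataframe
            data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
            #Append only when this script actually contained scoreboard data
            data=data.append(data2,ignore_index = True)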

So the workaround I came up with is below. I still get duplicate game IDs in the final dataset, but at least I am looping through the entire desired set and getting all of them. Then at the end I de-duplicate.

import requests, re, json
from bs4 import BeautifulSoup
import csv
import pandas as pd

years=['2015','2016','2017','2018']
weeks=['1','2','3','4','5','6','7','8','9','10','11','12','13','14']
data = pd.DataFrame(columns=['Teams','Link'])

all_links = {}
for year in years:
    for i in weeks:
        r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/'+ year + '/seasontype/2/week/'+i)
        soup = BeautifulSoup(r.text, 'html.parser')
        scripts_head = soup.find('head').find_all('script')
        for script in scripts_head:
            if 'window.espn.scoreboardData' in script.text:
                json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
                for event in json_scoreboard['events']:
                    name = event['name']
                    for link in event['links']:
                        if link['text'] == 'Gamecast':
                            gamecast = link['href']
                    all_links[name] = gamecast
                #Save data to dataframe
                data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
                #Append new data to existing data        
                data=data.append(data2,ignore_index = True)


#Save dataframe with all links to csv for future use
data_test=data.drop_duplicates(keep='first')
data_test.to_csv(r'all_years_deduped.csv')
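
For comparison, a minimal sketch that avoids creating the duplicates in the first place: collect one (name, link) row per game in a plain list and build the DataFrame once at the end. The range of weeks and the output filename here are just illustrative, and it assumes the pages embed window.espn.scoreboardData the same way as above:

import requests, re, json
import pandas as pd
from bs4 import BeautifulSoup

years=['2015','2016','2017','2018']
weeks=[str(i) for i in range(1,15)]

rows = []  #one (Teams, Link) tuple per game, across all pages
for year in years:
    for i in weeks:
        r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/'+ year + '/seasontype/2/week/'+i)
        soup = BeautifulSoup(r.text, 'html.parser')
        for script in soup.find('head').find_all('script'):
            if 'window.espn.scoreboardData' in script.text:
                json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
                for event in json_scoreboard['events']:
                    for link in event['links']:
                        if link['text'] == 'Gamecast':
                            rows.append((event['name'], link['href']))

#Build the dataframe once; nothing is appended twice, so there is nothing to de-duplicate
data = pd.DataFrame(rows, columns=['Teams','Link'])
data.to_csv(r'all_years.csv')

Building the DataFrame once from a list is also faster than repeated append calls, since each append copies the whole frame.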

Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. MCVE applies here. We cannot effectively help you until you post your MCVE code and accurately specify the problem. We should be able to paste your posted code into a text file and reproduce the problem you specified. "It doesn't work" is not a problem specification.

Please see the edited question and let me know if anything is still unclear.

So I tried every possible indentation of that last line, and none of them seems to return the right result. Are you saying I need to add logic to remove the previous run's data before the next iteration?

No; the immediate problem is just that the append command goes under the if rather than after it. If you have further questions, please post a new question, or update the existing one as suggested in the MCVE instructions.