Increment the row index in a loop until all rows in the dataframe are processed - Python

python dataframe indexing

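I am reading a few rows at a time from a dataframe to run text classification on about 1.3 million rows. I am using df.iloc[from_row:to_row].

I am using Colab for this. I have separate code cells that run as one flow and end by downloading the classified dataframe for the sliced rows.

After each processed dataframe is downloaded, I want the from_row:to_row numbers to be incremented automatically by 100, 500, or 1000, repeating until the last row has been processed.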

temp = temp.iloc[100:201]  # manually updating this slice and rerunning the rest of the code

test=[]

# classifying sentences in the text column
for row in tqdm(temp['comments'].values):
  res = label_classify(row)
  test.append(res)
temp['test'] = test

# mapping the right labels to the appropriate rows
list_of_rows = temp.test.to_list()
th = 0.4    #whatever threshold value you want
result = list(map(lambda x: get_label_score_dict(x, th), list_of_rows))
result_df = pd.DataFrame(result)

# concatenating the labeled df to the original df and downloading
# (to avoid losing processed data in case Colab reconnects or I lose the session)
# merging dfs
temp.reset_index(drop=True, inplace=True)
concatenated_df_new = pd.concat( [temp, result_df], axis=1)
merge_df = pd.DataFrame()
merge_df = pd.concat([merge_df,concatenated_df_new], axis=0, ignore_index=True)

# downloading the final dataframe
from google.colab import files
merge_df.to_csv('merge_nps.csv') 
files.download('merge_nps.csv')

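There may be a completely different way to do this. I have only been coding in Python for a short time.

Any help or ideas on how to write this as a function, or how to increment the counter (df.iloc[from_row:to_row]), would be very helpful.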

If you want to do some kind of batch processing, you can use something like this:

import numpy as np

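batch_size = 1024       # or whatever you want the batch size to be
num_samples = len(df)   # df is the full dataframe (about 1.3 million rows)

num_batches = int(np.ceil(num_samples / batch_size))
for i in range(num_batches):
    # slice the full dataframe each time; reassigning temp from temp itself
    # would leave nothing to slice after the first batch
    temp = df.iloc[i*batch_size : (i+1)*batch_size].copy()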

    #... rest of the code

Untested, but you get the idea.
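As a sketch of the same idea without NumPy, the from_row and to_row bounds can come directly from range() with a step. The helper name iter_batches and the toy dataframe are illustrative assumptions, not part of the original code. The last batch is simply shorter than the others.

```python
import pandas as pd


def iter_batches(df, batch_size):
    """Yield (from_row, to_row, batch) for consecutive row slices of df."""
    for from_row in range(0, len(df), batch_size):
        to_row = min(from_row + batch_size, len(df))
        # Always slice the full dataframe; copy so new columns can be added to the batch
        yield from_row, to_row, df.iloc[from_row:to_row].copy()


# Toy data standing in for the real 1.3 million rows (hypothetical)
df = pd.DataFrame({"comments": [f"comment {i}" for i in range(10)]})

bounds = [(start, stop) for start, stop, _ in iter_batches(df, batch_size=4)]
print(bounds)  # [(0, 4), (4, 8), (8, 10)]
```

Inside the loop, each batch would take the place of temp in the classification and download steps from the question.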

I don't see why you can't just use df.iloc[:] …?

@N.JonasFigge Please elaborate. I don't understand your suggestion.

Usually, you can use : to get all elements. I don't see why you have to do it sequentially…

@N.JonasFigge I have to process 1.3 million rows, and processing all of them in Colab with the .apply(lambda row: label_classify(row)) approach is estimated to take 17 hours. So I am trying to download the processed output after each iteration to avoid losing processed data when the session ends.
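One way to meet the goal in the last comment, sketched under assumptions: write one CSV per batch and skip batches whose file already exists, so a rerun after a dropped session resumes where it stopped. Here classify_stub is a hypothetical stand-in for label_classify, and the function name, file naming scheme, and temporary directory are illustrative only.

```python
import os
import tempfile

import pandas as pd


def classify_stub(text):
    """Hypothetical stand-in for label_classify; returns the text length."""
    return len(text)


def process_in_batches(df, batch_size, out_dir):
    """Classify df batch by batch, writing one CSV per batch to out_dir.

    Batches whose CSV already exists are skipped, so rerunning after a
    dropped session resumes at the first unfinished batch.
    Returns the list of files written during this call.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for from_row in range(0, len(df), batch_size):
        to_row = min(from_row + batch_size, len(df))
        path = os.path.join(out_dir, f"batch_{from_row}_{to_row}.csv")
        if os.path.exists(path):
            continue  # finished in an earlier run
        batch = df.iloc[from_row:to_row].copy()
        batch["test"] = [classify_stub(row) for row in batch["comments"].values]
        batch.to_csv(path, index=False)
        written.append(path)
    return written


df = pd.DataFrame({"comments": ["good", "bad", "fine", "great", "poor"]})
out_dir = tempfile.mkdtemp()

first_run = process_in_batches(df, batch_size=2, out_dir=out_dir)
second_run = process_in_batches(df, batch_size=2, out_dir=out_dir)  # nothing left to do

# Stitch the per-batch files back together in row order
paths = sorted(first_run, key=lambda p: int(os.path.basename(p).split("_")[1]))
merged = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
print(len(first_run), len(second_run), len(merged))  # 3 0 5
```

In Colab, out_dir would need to be somewhere that survives the session, for example a mounted Google Drive folder, or each written path could be passed to files.download as in the question.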