Python: Creating a DataFrame by appending one row at a time


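I understand that pandas is designed to load a fully populated DataFrame, but I need to create an empty DataFrame and then add rows one by one. What is the best way to do this?
I successfully created an empty DataFrame with:

res = DataFrame(columns=('lib', 'qty1', 'qty2'))

Then I can add a new row and fill a field with:

res = res.set_value(len(res), 'qty1', 10.0)

It works, but it seems very odd :-/ (and it fails when adding a string value).

How can I add a new row to my DataFrame (with different column types)?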
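You can use pandas.concat() or DataFrame.append(). For details and examples, see the pandas documentation. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions only pandas.concat() is available.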
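As a rough illustration of the pandas.concat() route, the sketch below builds one single-row DataFrame per record and concatenates them in a single call at the end. The column names follow the question; the sample records are made up for the example.

```python
import pandas as pd

# Hypothetical sample records; only the column names come from the question.
records = [("name0", 1.0, 2.0), ("name1", 3.0, 4.0), ("name2", 5.0, 6.0)]

# Build one single-row DataFrame per record.
pieces = [
    pd.DataFrame([rec], columns=["lib", "qty1", "qty2"])
    for rec in records
]

# Concatenate once; ignore_index=True renumbers the rows 0..n-1.
res = pd.concat(pieces, ignore_index=True)
print(res)
```

Calling concat once on a list of pieces avoids copying the growing frame on every iteration, which is what makes repeated appends slow.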
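If you can get all the data for the DataFrame up front, there is a much faster approach than appending to a DataFrame:

Create a list of dictionaries, in which each dictionary corresponds to one input data row.
Create a DataFrame from that list.
I had a similar task where appending to a DataFrame row by row took 30 minutes, while creating the DataFrame from a list of dictionaries completed within seconds.

import pandas as pd

rows_list = []
for row in input_rows:
    dict1 = {}
    # get input row in dictionary format
    # key = col_name
    dict1.update(row)  # assumption: row is already a {col_name: value} mapping
    rows_list.append(dict1)

df = pd.DataFrame(rows_list)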
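A self-contained version of the same pattern might look like the following. The input tuples are hypothetical; the column names are borrowed from the question.

```python
import pandas as pd

# Hypothetical input: each record is a (lib, qty1, qty2) tuple.
input_rows = [("name0", 10.0, 1), ("name1", 20.0, 2), ("name2", 30.0, 3)]

rows_list = []
for lib, qty1, qty2 in input_rows:
    # One dictionary per row, keyed by column name.
    rows_list.append({"lib": lib, "qty1": qty1, "qty2": qty2})

# The DataFrame is built once, so each column gets a sensible dtype.
df = pd.DataFrame(rows_list)
print(df.dtypes)
```

Because the frame is constructed in one step, mixed column types (strings, floats, integers) are inferred per column rather than being forced to a common type row by row.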

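For efficient appending, add rows through loc/ix on an index key that does not exist yet (pandas calls this setting with enlargement). Note that ix was removed in pandas 1.0, so use loc. E.g.:

In [1]: se = pd.Series([1,2,3])

In [2]: se
Out[2]:
0    1
1    2
2    3
dtype: int64

In [3]: se[5] = 5.

In [4]: se
Out[4]:
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64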

You can use df.loc[i], where the row with index i will be what you specify it to be in the DataFrame.

>>> import pandas as pd
>>> from numpy.random import randint
>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
...     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))
...
>>> df
     lib qty1 qty2
0  name0    3    3
1  name1    2    4
2  name2    2    8
3  name3    2    1
4  name4    9    6

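If you know the number of entries in advance, you should preallocate the space by also providing the index (taking the data example from a different answer):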
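A minimal sketch of that preallocation idea, assuming five rows and the column names from the question (the random values and the seed are only for illustration):

```python
import numpy as np
import pandas as pd

n_rows = 5
rng = np.random.default_rng(0)  # seeded generator, illustration only

# Preallocate: the index already holds every row label, cells start as NaN.
df = pd.DataFrame(index=np.arange(n_rows), columns=["lib", "qty1", "qty2"])

for i in range(n_rows):
    # Overwrite an existing row instead of enlarging the frame.
    df.loc[i] = ["name" + str(i)] + rng.integers(10, size=2).tolist()

# Columns start out as object dtype; let pandas infer better dtypes.
df = df.infer_objects()
print(df)
```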
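Speed comparison

In[30]: %timeit tryThis()   # function wrapper for this answer
In[31]: %timeit tryOther()  # function wrapper without index (see, for example, @fred)
1000 loops, best of 3: 1.23 ms per loop
100 loops, best of 3: 2.31 ms per loop

As the comments point out, at a size of 6000 the speed difference becomes even larger:

Increasing the size of the array (12) and the number of rows (500) makes the speed difference more striking: 313 ms vs 2.29 s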

You can append a single row as a dictionary by using the ignore_index option.

>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
  Animal Color
0    cow  blue
1  horse   red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
  Animal  Color
0    cow   blue
1  horse    red
2  mouse  black
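On pandas 2.0 and later, where DataFrame.append() no longer exists, a similar result can be obtained with pandas.concat(). This is a sketch of one possible equivalent, not the original answer's code:

```python
import pandas as pd

f = pd.DataFrame({"Animal": ["cow", "horse"], "Color": ["blue", "red"]})

new_row = {"Animal": "mouse", "Color": "black"}

# Wrap the dict in a list so it becomes a one-row DataFrame, then concatenate.
# ignore_index=True gives the new row the next integer label.
f = pd.concat([f, pd.DataFrame([new_row])], ignore_index=True)
print(f)
```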

This is not an answer to the OP's question, but a toy example to illustrate @ShikharDua's answer, which I found very useful.
Although this snippet is trivial, in my actual data I had thousands of rows and many columns, and I wanted to be able to group by different columns and then compute the statistics below for more than one target column. So having a reliable way to build the DataFrame one row at a time was a great convenience. Thank you @ShikharDua!

import pandas as pd

BaseData = pd.DataFrame({
    'Customer': ['Acme', 'Mega', 'Acme', 'Acme', 'Mega', 'Acme'],
    'Territory': ['West', 'East', 'South', 'West', 'East', 'South'],
    'Product': ['Econ', 'Luxe', 'Econ', 'Std', 'Std', 'Econ']})
BaseData

columns = ['Customer', 'Num Unique Products', 'List Unique Products']

rows_list = []
for name, group in BaseData.groupby('Customer'):
    RecordtoAdd = {}  # initialise an empty dict
    RecordtoAdd.update({'Customer': name})
    RecordtoAdd.update({'Num Unique Products': len(pd.unique(group['Product']))})
    RecordtoAdd.update({'List Unique Products': pd.unique(group['Product'])})
    rows_list.append(RecordtoAdd)

AnalysedData = pd.DataFrame(rows_list)
print('Base Data : \n', BaseData, '\n\n Analysed Data : \n', AnalysedData)

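Create a new record (DataFrame) and add it to the old DataFrame.
Create the new record (DataFrame) from a list of values and the corresponding column names.
Another way to do it (probably not very efficient):
You can also enhance the DataFrame class like this:

import pandas as pd

def add_row(self, row):
    # Append the row under the next integer label
    self.loc[len(self.index)] = row

pd.DataFrame.add_row = add_row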

For the sake of a Pythonic approach, here is my answer:

res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1': 10.0}], ignore_index=True)
print(res.head())

   lib  qty1  qty2
0  NaN  10.0   NaN

Keep it simple. Take a list as input, and it will be appended as a row in the DataFrame:

import pandas as pd

res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    # Read three whitespace-separated integers from standard input
    res_list = list(map(int, input().split()))
    res = res.append(pd.Series(res_list, index=['lib', 'qty1', 'qty2']), ignore_index=True)

You can also build up a list of lists and convert it to a DataFrame:

import pandas as pd

columns = ['i', 'double', 'square']
rows = []

for i in range(6):
    row = [i, i*2, i*i]
    rows.append(row)

df = pd.DataFrame(rows, columns=columns)

giving

   i  double  square
0  0       0       0
1  1       2       1
2  2       4       4
3  3       6       9
4  4       8      16
5  5      10      25
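This takes care of adding an item to an empty DataFrame. The issue is that df.index.max() == nan for the first index:

import math
import pandas as pd

df = pd.DataFrame(columns=['timeMS', 'accelX', 'accelY', 'accelZ', 'gyroX', 'gyroY', 'gyroZ'])

# Use label 0 for the first row, otherwise the current maximum label plus one
df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = [x for x in range(7)]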

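It has been a long time, but I ran into the same problem and found a lot of interesting answers here, so I was unsure which method to use.
When adding a large number of rows to a DataFrame, I care about speed, so I tried the 4 most popular methods and measured them.
Updated in 2019 using newer versions of the packages, and updated again after that.
Speed performance
Using .append()

Using .loc

Using .loc with preallocation

Using dict and creating the DataFrame at the end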
Results (in seconds):

|------------|-------------|-------------|-------------|
|  Approach  |  1000 rows  |  5000 rows  | 10 000 rows |
|------------|-------------|-------------|-------------|
| .append    |    0.69     |    3.39     |    6.78     |
|------------|-------------|-------------|-------------|
| .loc w/o   |    0.74     |    3.90     |    8.35     |
| prealloc   |             |             |             |
|------------|-------------|-------------|-------------|
| .loc with  |    0.24     |    2.58     |    8.70     |
| prealloc   |             |             |             |
|------------|-------------|-------------|-------------|
| dict       |    0.012    |    0.046    |    0.084    |
|------------|-------------|-------------|-------------|

The benchmark code:

import pandas as pd
import numpy as np
import time

numOfRows = 1000

# append (DataFrame.append requires pandas < 2.0)
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(1, numOfRows-4):
    df1 = df1.append(dict((a, np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)

# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(1, numOfRows):
    df2.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)

# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'])
startTime = time.perf_counter()
for i in range(1, numOfRows):
    df3.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)

# dict
startTime = time.perf_counter()
row_list = []
for i in range(0, 5):
    row_list.append(dict((a, np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range(1, numOfRows-4):
    dict1 = dict((a, np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)

df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)
Thanks also for the useful comment; I updated the code.
So, for my own work, I add rows through a dictionary.


P.S. I believe my implementation is not perfect, and there may be room for optimization.
I figured out a simple and nice way:

>>> df
     A  B  C
one  1  2  3
>>> df.loc["two"] = [4,5,6]
>>> df
     A  B  C
one  1  2  3
two  4  5  6

Note the performance caveat mentioned in the comments.
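Here is a way to add/append a row to a DataFrame:

def add_row(df, row):
    df.loc[-1] = row          # add the row under the temporary label -1
    df.index = df.index + 1   # shift every label up by one, so the new row gets label 0
    return df.sort_index()    # sorting by label puts the new row first

add_row(df, [1,2,3])

It can be used to insert/append a row into an empty or a populated DataFrame.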
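To make the behaviour of add_row concrete, here is a small check on a made-up two-row frame with a default integer index. It suggests that the new row ends up at the top rather than at the bottom, and that the frame passed in is modified as a side effect:

```python
import pandas as pd

def add_row(df, row):
    # Same function as in the answer; it also modifies the df passed in.
    df.loc[-1] = row
    df.index = df.index + 1
    return df.sort_index()

# Hypothetical starting data with the default 0..n-1 index.
df = pd.DataFrame({"A": [4, 7], "B": [5, 8], "C": [6, 9]})
result = add_row(df, [1, 2, 3])

print(result)               # the new row is at position 0
print(len(df), len(result))  # both now hold three rows
```

If the row should go at the end instead, df.loc[len(df)] = row is the simpler choice for a frame with a default integer index.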
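We often see the construct df.loc[subscript] = … used to assign to one DataFrame row. Mikhail_Sam posted an answer that includes this construct as well as the method that uses dict and creates the DataFrame at the end. He found the latter to be the fastest by far. But if we replace df3.loc[i] = … (with a preallocated DataFrame) in his code with df3.values[i] = …, the outcome changes significantly: that method then performs similarly to the one using dict. So we should more often consider using df.values[subscript] = …. Note, however, that .values takes a zero-based subscript, which may differ from DataFrame.index. Also note that writing through .values only changes the DataFrame when .values is a view of its data, which generally requires all columns to share a single dtype.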
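The difference between a zero-based position and an index label can be seen in a small read-only sketch. The frame and its labels 10, 20, 30 are invented for the illustration:

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2.
df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]},
    index=[10, 20, 30],
)

by_position = df.values[1]    # second row, counted from zero
by_label = df.loc[20].values  # row whose index label is 20

print(by_position, by_label)  # both refer to the same row
print(1 in df.index)          # False: there is no label 1, so df.loc[1] would fail
```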
You can create the DataFrame from a generator object, which is more memory-efficient than building a list first.

import pandas as pd

num = 10

# Generator function to generate generator object
def numgen_func(num):
    for i in range(num):
        yield ('name_{}'.format(i), (i*i), (i*i*i))

# Generator expression to generate generator object
# (it can be consumed only once and cannot be reused)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num))

df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))

To add raw data to an existing DataFrame, you can use the append method.

df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400 }])

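Instead of a list of dictionaries as in ShikharDua's answer, we can also represent the table as a dictionary of lists, where each list stores one column in row order, provided we know the columns beforehand. At the end, we construct the DataFrame once.
For c columns and n rows, this uses 1 dictionary and c lists, versus 1 list and n dictionaries. The list-of-dictionaries approach makes each dictionary store all the keys and requires creating a new dictionary for every row. Here we only append to lists, which takes amortized constant time and is theoretically very fast.

# current data
data = {"Animal": ["cow", "horse"], "Color": ["blue", "red"]}

# adding a new row (be careful to ensure every column gets another value)
data["Animal"].append("mouse")
data["Color"].append("black")

# at the end, construct our DataFrame
df = pd.DataFrame(data)
#   Animal  Color
# 0    cow   blue
# 1  horse    red
# 2  mouse  black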
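Since every column must receive exactly one value per row, a small helper can guard against lists drifting out of step. The add_row function below is a hypothetical convenience written for this illustration, not part of pandas:

```python
import pandas as pd

columns = ["Animal", "Color"]
data = {col: [] for col in columns}  # one empty list per known column

def add_row(data, row):
    """Append one value per column; reject rows with missing or extra keys."""
    if set(row) != set(data):
        raise ValueError("row must provide exactly the known columns")
    for col, value in row.items():
        data[col].append(value)

add_row(data, {"Animal": "cow", "Color": "blue"})
add_row(data, {"Animal": "horse", "Color": "red"})
add_row(data, {"Animal": "mouse", "Color": "black"})

# Build the DataFrame once, after all rows are collected.
df = pd.DataFrame(data)
print(df)
```

The key check costs a little per row, but it turns a silent length mismatch into an immediate error.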

pandas.DataFrame.append
append(self, other, ignore_index=False, verify_integrity=False, sort=Fal

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)

df.append(df2, ignore_index=True)

valuestoappend = [val1, val2, val3]
res = res.append(pd.Series(valuestoappend, index=['lib', 'qty1', 'qty2']), ignore_index=True)

df2 = df.to_dict()
values = ["s_101", "hyderabad", 10, 20, 16, 13, 15, 12, 12, 13, 25, 26, 25, 27, "good", "bad"]  # this is the whole row that we are going to add
i = 0
for x in df.columns:  # here df.columns gives us the main dictionary key
    df2[x][101] = values[i]  # here 101 is our index number; it is also the key of the sub-dictionary
    i += 1

new_dict = {put input for new row here}
new_list = [put your index here]
new_df = pd.DataFrame(data=new_dict, index=new_list)
df = pd.concat([existing_df, new_df])

# Assuming your df has 4 columns (str, int, str, bool)
df.loc[df.shape[0]] = ['col1Value', 100, 'col3Value', False]

df.loc[len(df)] = ['col1Value', 100, 'col3Value', False]

initial_data = {'lib': np.array([1,2,3,4]), 'qty1': [1,2,3,4], 'qty2': [1,2,3,4]}
df = pd.DataFrame(initial_data)
df

   lib  qty1  qty2
0    1     1     1
1    2     2     2
2    3     3     3
3    4     4     4

val_1 = [10]
val_2 = [14]
val_3 = [20]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))

   lib  qty1  qty2
0    1     1     1
1    2     2     2
2    3     3     3
3    4     4     4
0   10    14    20

val_1 = [10, 11, 12, 13]
val_2 = [14, 15, 16, 17]
val_3 = [20, 21, 22, 43]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))

   lib  qty1  qty2
0    1     1     1
1    2     2     2
2    3     3     3
3    4     4     4
0   10    14    20
1   11    15    21
2   12    16    22
3   13    17    43

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

# Creates empty DataFrame and appends
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True)
    # This is equally bad:
    # df = pd.concat(
    #     [df, pd.Series({'A': a, 'B': b, 'C': c})],
    #     ignore_index=True)

# Creates DataFrame of NaNs and overwrites values.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]

import time
import pandas as pd
import numpy as np
from string import ascii_uppercase

startTime = time.perf_counter()
numcols, numrows = 5, 10000
npdf = np.ones((numrows, numcols))
for row in range(numrows):
    # Fill one preallocated NumPy row at a time
    npdf[row, 0:] = np.random.randint(0, 100, (1, numcols))
df5 = pd.DataFrame(npdf, columns=list(ascii_uppercase[:numcols]))
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numrows))
print(df5.shape)

df.loc[len(df)] = new_list

df.append(new_df)

df.loc[len(df)]=['name5',9,0]