Python 通过每次追加一行来创建数据帧
我知道pandas设计用于加载完全填充的Python 通过每次追加一行来创建数据帧,python,pandas,dataframe,append,Python,Pandas,Dataframe,Append,我知道pandas设计用于加载完全填充的数据框,但我需要创建一个空数据框,然后逐个添加行。 最好的方法是什么 我成功创建了一个空数据帧,其中包含: res = DataFrame(columns=('lib', 'qty1', 'qty2')) 然后我可以添加一个新行并用以下内容填充字段: res = res.set_value(len(res), 'qty1', 10.0) 它可以工作,但看起来很奇怪:-/(无法添加字符串值) 如何将新行添加到数据框(具有不同的列类型)?您可以使用pand
数据框
,但我需要创建一个空数据框,然后逐个添加行。
最好的方法是什么
我成功创建了一个空数据帧,其中包含:
res = DataFrame(columns=('lib', 'qty1', 'qty2'))
然后我可以添加一个新行并用以下内容填充字段:
res = res.set_value(len(res), 'qty1', 10.0)
它可以工作,但看起来很奇怪:-/(无法添加字符串值)
如何将新行添加到数据框(具有不同的列类型)?您可以使用
pandas.concat()
或DataFrame.append()
。有关详细信息和示例,请参见。如果您可以提前获取数据帧的所有数据,则有一种比附加到数据帧快得多的方法:
rows_list = []
for row in input_rows:
dict1 = {}
# get input row in dictionary format
# key = col_name
dict1.update(blah..)
rows_list.append(dict1)
df = pd.DataFrame(rows_list)
有关有效的附加,请参见和 通过
loc/ix
在不存在的关键索引数据上添加行。e、 g:
In [1]: se = pd.Series([1,2,3])
In [2]: se
Out[2]:
0 1
1 2
2 3
dtype: int64
In [3]: se[5] = 5.
In [4]: se
Out[4]:
0 1.0
1 2.0
2 3.0
5 5.0
dtype: float64
或:
您可以使用
df.loc[i]
,其中索引为i
的行将是您在数据帧中指定的行
>>> import pandas as pd
>>> from numpy.random import randint
>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
>>> df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))
>>> df
lib qty1 qty2
0 name0 3 3
1 name1 2 4
2 name2 2 8
3 name3 2 1
4 name4 9 6
如果您事先知道条目数,则应通过提供索引预先分配空间(以不同答案中的数据示例为例): 速度比较
In[30]: %timeit tryThis() # function wrapper for this answer
In[31]: %timeit tryOther() # function wrapper without index (see, for example, @fred)
1000 loops, best of 3: 1.23 ms per loop
100 loops, best of 3: 2.31 ms per loop
从评论中可以看出,尺寸为6000时,速度差变得更大:
增加数组(12)的大小和行数(500)会使 速度差更为惊人:313毫秒对2.29秒
您可以使用
ignore\u index
选项将单行追加为字典
>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
Animal Color
0 cow blue
1 horse red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
Animal Color
0 cow blue
1 horse red
2 mouse black
这不是对OP问题的回答,而是一个玩具示例来说明@ShikharDua的答案,我发现上面的答案非常有用 虽然这个片段很简单,但在实际数据中,我有1000行和许多列,我希望能够按不同的列进行分组,然后对多个taget列执行下面的统计。因此,采用一种可靠的方法一次构建一行数据帧是一种极大的方便。谢谢你@ShikharDua
import pandas as pd
BaseData = pd.DataFrame({ 'Customer' : ['Acme','Mega','Acme','Acme','Mega','Acme'],
'Territory' : ['West','East','South','West','East','South'],
'Product' : ['Econ','Luxe','Econ','Std','Std','Econ']})
BaseData
columns = ['Customer','Num Unique Products', 'List Unique Products']
rows_list=[]
for name, group in BaseData.groupby('Customer'):
RecordtoAdd={} #initialise an empty dict
RecordtoAdd.update({'Customer' : name}) #
RecordtoAdd.update({'Num Unique Products' : len(pd.unique(group['Product']))})
RecordtoAdd.update({'List Unique Products' : pd.unique(group['Product'])})
rows_list.append(RecordtoAdd)
AnalysedData = pd.DataFrame(rows_list)
print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)
创建一个新记录(数据框)并添加到旧数据框中
通过值列表和相应的列名称创建新记录(数据框) 另一种方法(可能不是很有效): 您还可以像这样增强DataFrame类:
import pandas as pd
def add_row(self, row):
self.loc[len(self.index)] = row
pd.DataFrame.add_row = add_row
为了达到蟒蛇式的目的,这里添加我的答案:
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
print(res.head())
lib qty1 qty2
0 NaN 10.0 NaN
简单点。通过将列表作为输入,该列表将作为数据框中的行附加:-
import pandas as pd
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
res_list = list(map(int, input().split()))
res = res.append(pd.Series(res_list,index=['lib','qty1','qty2']), ignore_index=True)
您还可以建立列表列表并将其转换为数据帧-
import pandas as pd
columns = ['i','double','square']
rows = []
for i in range(6):
row = [i, i*2, i*i]
rows.append(row)
df = pd.DataFrame(rows, columns=columns)
给予
i double square
0 0 0 0
1 1 2 1
2 2 4 4
3 3 6 9
4 4 8 16
5 5 10 25
我喜欢双正方形
0 0 0 0
1 1 2 1
2 2 4 4
3 3 6 9
4 4 8 16
5 5 10 25
这将负责将项添加到空数据帧。问题是第一个索引的
df.index.max()==nan
:
df = pd.DataFrame(columns=['timeMS', 'accelX', 'accelY', 'accelZ', 'gyroX', 'gyroY', 'gyroZ'])
df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = [x for x in range(7)]
已经很久了,但我也面临着同样的问题。在这里找到了很多有趣的答案。所以我不知道该用什么方法 在向数据帧添加大量行的情况下,我对速度性能感兴趣。所以我尝试了4种最流行的方法并检查了它们的速度 于2019年更新使用新版本的软件包。 之后也进行了更新 速度性能
|------------|-------------|-------------|-------------|
| Approach | 1000 rows | 5000 rows | 10 000 rows |
|------------|-------------|-------------|-------------|
| .append | 0.69 | 3.39 | 6.78 |
|------------|-------------|-------------|-------------|
| .loc w/o | 0.74 | 3.90 | 8.35 |
| prealloc | | | |
|------------|-------------|-------------|-------------|
| .loc with | 0.24 | 2.58 | 8.70 |
| prealloc | | | |
|------------|-------------|-------------|-------------|
| dict | 0.012 | 0.046 | 0.084 |
|------------|-------------|-------------|-------------|
import pandas as pd
import numpy as np
import time
del df1, df2, df3, df4
numOfRows = 1000
# append
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)
# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
df2.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)
# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
df3.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)
# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
row_list.append(dict1)
df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)
也感谢您的有用评论-我更新了代码
所以我自己通过字典使用加法
代码:
|------------|-------------|-------------|-------------|
| Approach | 1000 rows | 5000 rows | 10 000 rows |
|------------|-------------|-------------|-------------|
| .append | 0.69 | 3.39 | 6.78 |
|------------|-------------|-------------|-------------|
| .loc w/o | 0.74 | 3.90 | 8.35 |
| prealloc | | | |
|------------|-------------|-------------|-------------|
| .loc with | 0.24 | 2.58 | 8.70 |
| prealloc | | | |
|------------|-------------|-------------|-------------|
| dict | 0.012 | 0.046 | 0.084 |
|------------|-------------|-------------|-------------|
import pandas as pd
import numpy as np
import time
del df1, df2, df3, df4
numOfRows = 1000
# append
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)
# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
df2.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)
# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
df3.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)
# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
row_list.append(dict1)
df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)
另外,我相信,我的认识并不完美,也许有一些优化 想出了一个简单而好的方法:
>>> df
A B C
one 1 2 3
>>> df.loc["two"] = [4,5,6]
>>> df
A B C
one 1 2 3
two 4 5 6
请注意注释中提到的性能注意事项以下是在数据帧中添加/追加行的方法
def add_row(df, row):
df.loc[-1] = row
df.index = df.index + 1
return df.sort_index()
add_row(df, [1,2,3])
它可用于在空数据框或填充数据框中插入/追加一行我们经常看到要分配给一个数据框行的构造
df.loc[subscript]=…
。Mikhail_Sam发布了一篇文章,其中包括这个构造以及最后使用dict和create DataFrame的方法。他发现后者是目前为止最快的。但是,如果我们用df3.values[i]=…
替换他的代码中的df3.loc[i]=…
(使用预分配的数据帧),结果会发生显著变化,因为该方法的性能与使用dict的方法类似。因此,我们应该更经常地考虑使用df.values[subscript]=…
。但是请注意,.values
采用一个从零开始的下标,它可能不同于DataFrame.index。您可以使用generator对象创建DataFrame,这将比列表更节省内存
num = 10
# Generator function to generate generator object
def numgen_func(num):
for i in range(num):
yield ('name_{}'.format(i), (i*i), (i*i*i))
# Generator expression to generate generator object (Only once data get populated, can not be re used)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num) )
df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))
要将原始数据添加到现有数据帧,可以使用append方法
df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400 }])
与ShikharDua的答案中的字典列表不同,我们还可以将表表示为列表字典,其中每个列表按行顺序存储一列,前提是我们事先知道列最后,我们构建了一次数据帧。 对于c列和n行,这使用1个字典和c列表,而不是1个列表和n个字典。“字典列表”方法使每个字典存储所有键,并要求为每一行创建一个新字典。这里我们只附加到列表,这是固定时间,理论上非常快
# current data
data = {"Animal":["cow", "horse"], "Color":["blue", "red"]}
# adding a new row (be careful to ensure every column gets another value)
data["Animal"].append("mouse")
data["Color"].append("black")
# at the end, construct our DataFrame
df = pd.DataFrame(data)
# Animal Color
# 0 cow blue
# 1 horse red
# 2 mouse black
pandas.DataFrame.append 追加(self,other,ignore\u index=False,verify\u integrity=False,sort=Fal
num = 10
# Generator function to generate generator object
def numgen_func(num):
for i in range(num):
yield ('name_{}'.format(i), (i*i), (i*i*i))
# Generator expression to generate generator object (Only once data get populated, can not be re used)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num) )
df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))
df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400 }])
# current data
data = {"Animal":["cow", "horse"], "Color":["blue", "red"]}
# adding a new row (be careful to ensure every column gets another value)
data["Animal"].append("mouse")
data["Color"].append("black")
# at the end, construct our DataFrame
df = pd.DataFrame(data)
# Animal Color
# 0 cow blue
# 1 horse red
# 2 mouse black
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
df.append(df2, ignore_index=True)
valuestoappend = [va1,val2,val3]
res = res.append(pd.Series(valuestoappend,index = ['lib', 'qty1', 'qty2']),ignore_index = True)
df2=df.to_dict()
values=["s_101","hyderabad",10,20,16,13,15,12,12,13,25,26,25,27,"good","bad"] #this is total row that we are going to add
i=0
for x in df.columns: #here df.columns gives us the main dictionary key
df2[x][101]=values[i] #here the 101 is our index number it is also key of sub dictionary
i+=1
new_dict = {put input for new row here}
new_list = [put your index here]
new_df = pd.DataFrame(data=new_dict, index=new_list)
df = pd.concat([existing_df, new_df])
# Assuming your df has 4 columns (str, int, str, bool)
df.loc[df.shape[0]] = ['col1Value', 100, 'col3Value', False]
df.loc[len(df)] = ['col1Value', 100, 'col3Value', False]
initial_data = {'lib': np.array([1,2,3,4]), 'qty1': [1,2,3,4], 'qty2': [1,2,3,4]}
df = pd.DataFrame(initial_data)
df
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
val_1 = [10]
val_2 = [14]
val_3 = [20]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
val_1 = [10, 11, 12, 13]
val_2 = [14, 15, 16, 17]
val_3 = [20, 21, 22, 43]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
1 11 15 21
2 12 16 22
3 13 17 43
data = []
for a, b, c in some_function_that_yields_data():
data.append([a, b, c])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
# Creates empty DataFrame and appends
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
df = df.append({'A': i, 'B': b, 'C': c}, ignore_index=True)
# This is equally bad:
# df = pd.concat(
# [df, pd.Series({'A': i, 'B': b, 'C': c})],
# ignore_index=True)
# Creates DataFrame of NaNs and overwrites values.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
for a, b, c in some_function_that_yields_data():
df.loc[len(df)] = [a, b, c]
import pandas as pd
import numpy as np
from string import ascii_uppercase
startTime = time.perf_counter()
numcols, numrows = 5, 10000
npdf = np.ones((numrows, numcols))
for row in range(numrows):
npdf[row, 0:] = np.random.randint(0, 100, (1, numcols))
df5 = pd.DataFrame(npdf, columns=list(ascii_uppercase[:numcols]))
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df5.shape)
df.loc[len(df)] = new_list
df.append(new_df)
df.loc[len(df)]=['name5',9,0]