Python pandas - merging a CSV with multiple date indexes into a single date index


Hi, I have data in a spreadsheet that looks like this:

|aaa-date  |aaa-val|bbb-date  |bbb-val|ccc-date  |ccc-val|
|----------|-------|----------|-------|----------|-------|
|08-04-2008|-20.943|31-03-2008|-23.869|26-03-2008|+1.401 |
|09-04-2008|-20.943|01-04-2008|-19.813|27-03-2008|+1.376 |
|10-04-2008|-18.868|02-04-2008|-18.929|28-03-2008|-0.534 |
|11-04-2008|-19.057|03-04-2008|-19.917|31-03-2008|+0.688 |
|14-04-2008|-20.000|04-04-2008|-20.125|01-04-2008|+3.336 |
|15-04-2008|-18.868|07-04-2008|-21.321|02-04-2008|+3.413 |
|16-04-2008|-16.226|08-04-2008|-22.517|03-04-2008|+4.177 |
|17-04-2008|-14.340|09-04-2008|-24.857|04-04-2008|+4.279 |
|18-04-2008|-12.830|10-04-2008|-24.701|07-04-2008|+2.445 |
|21-04-2008|-15.472|11-04-2008|-24.857|08-04-2008|+1.146 |
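I want to import this (CSV or XLSX) and end up with a DataFrame that has a single date index and the columns aaa-val, bbb-val and ccc-val.

Other than loading everything into a temporary frame and then looping over the date/value column pairs, is there a clever way to do this?

Thanks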

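I came across this article while looking for something else, and I believe it can help you.

Basically, you can read the file once per time/data range, restricting each read to the columns for that range (use the lambda method if you want to select by column name). I would then either rename the date fields to a common name or set the date field as the index, and finally chain several full outer joins to combine all of the data.

Edit - a simple concat will not work the way I originally wrote. I recommend a full outer join on the date column.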
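As a rough sketch of the "rename, then chain full outer joins" idea, something like the following may work. The two-pair CSV text and the names raw, wide, pairs and merged are made up for illustration and are not from the linked article; functools.reduce is just one way to chain the merges.

```python
from functools import reduce
import io

import pandas as pd

# Hypothetical miniature of the spreadsheet: two date/value pairs.
raw = io.StringIO(
    "aaa-date,aaa-val,bbb-date,bbb-val\n"
    "08-04-2008,-20.943,31-03-2008,-23.869\n"
    "09-04-2008,-20.943,08-04-2008,-22.517\n"
)
wide = pd.read_csv(raw)

# One two-column frame per pair, with the date column renamed to 'date'.
pairs = [
    wide[[f"{name}-date", f"{name}-val"]].rename(columns={f"{name}-date": "date"})
    for name in ("aaa", "bbb")
]

# Chain full outer joins on the shared 'date' column.
merged = reduce(lambda left, right: left.merge(right, on="date", how="outer"), pairs)
print(merged)
```

Each date ends up on exactly one row, with NaN wherever a series has no value for that date.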

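[From the link]

Another way to use a callable is to pass a lambda expression. Here is an example where we only want to include a defined list of columns. To make the comparison easier, we normalize the names by converting them to lowercase.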
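The example from the link is not reproduced here, so the following is only a guess at what it looked like. It relies on read_csv accepting a callable for usecols; the CSV text and the names raw, wanted and aaa are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV whose headers use mixed case.
raw = io.StringIO(
    "AAA-Date,AAA-Val,BBB-Date,BBB-Val\n"
    "08-04-2008,-20.943,31-03-2008,-23.869\n"
)

# Columns we want, written in lowercase for a case-insensitive comparison.
wanted = {"aaa-date", "aaa-val"}

# usecols is called once per column name and keeps the column when it returns True.
aaa = pd.read_csv(raw, usecols=lambda name: name.lower() in wanted)
print(aaa.columns.tolist())
```

The original header spelling is kept in the result; only the comparison is done in lowercase.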

Edit showing the difference between CONCAT and MERGE:

import pandas as pd

df1 = pd.DataFrame(data=[[1, 1], [2, 2]], columns=['a','b'])
print(df1)
#    a  b
# 0  1  1
# 1  2  2

df2 = pd.DataFrame(data=[[1, 1], [3, 3]], columns=['a','c'])
print(df2)
#    a  c
# 0  1  1
# 1  3  3

# no good...
df3 = pd.concat([df1, df2])
print(df3)
#    a    b    c
# 0  1  1.0  NaN
# 1  2  2.0  NaN
# 0  1  NaN  1.0
# 1  3  NaN  3.0


# good
df4 = pd.merge(df1, df2, how='outer', on='a')
print(df4)
#    a    b    c
# 0  1  1.0  1.0
# 1  2  2.0  NaN
# 2  3  NaN  3.0
Edit to verify the behaviour with an index - concat on the index (with the default axis=0) does not perform a full outer join:

import pandas as pd

df1 = pd.DataFrame(data=[[1, 1], [2, 2]], columns=['a','b'])
df1 = df1.set_index('a')
print(df1)
#    b
# a   
# 1  1
# 2  2
df2 = pd.DataFrame(data=[[1, 1], [3, 3]], columns=['a','c'])
df2 = df2.set_index('a')
print(df2)
#    c
# a   
# 1  1
# 3  3

# no good...
df3 = pd.concat([df1, df2])
print(df3)
#      b    c
# a          
# 1  1.0  NaN
# 2  2.0  NaN
# 1  NaN  1.0
# 3  NaN  3.0

# good
df4 = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
print(df4)
#      b    c
# a          
# 1  1.0  1.0
# 2  2.0  NaN
# 3  NaN  3.0

You can split the DataFrame into its pairs first and then merge them:

import io
import pandas as pd

data_csv = io.StringIO('''|aaa-date  |aaa-val|bbb-date  |bbb-val|ccc-date  |ccc-val|
|08-04-2008|-20.943|31-03-2008|-23.869|26-03-2008|+1.401 |
|09-04-2008|-20.943|01-04-2008|-19.813|27-03-2008|+1.376 |
|10-04-2008|-18.868|02-04-2008|-18.929|28-03-2008|-0.534 |
|11-04-2008|-19.057|03-04-2008|-19.917|31-03-2008|+0.688 |
|14-04-2008|-20.000|04-04-2008|-20.125|01-04-2008|+3.336 |
|15-04-2008|-18.868|07-04-2008|-21.321|02-04-2008|+3.413 |
|16-04-2008|-16.226|08-04-2008|-22.517|03-04-2008|+4.177 |
|17-04-2008|-14.340|09-04-2008|-24.857|04-04-2008|+4.279 |
|18-04-2008|-12.830|10-04-2008|-24.701|07-04-2008|+2.445 |
|21-04-2008|-15.472|11-04-2008|-24.857|08-04-2008|+1.146 |''')
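# Split on the pipes; drop the empty first and last columns they create
df = pd.read_csv(data_csv,sep=r'\s*\|\s*',engine='python').iloc[:,1:-1]
column_names = df.columns.tolist()
# Series prefixes: ['aaa', 'bbb', 'ccc']
cols = [col.split('-')[0] for col in column_names][::2]
# One two-column frame per date/value pair
dfs = [df[[col+'-date',col+'-val']] for col in cols]
new_df = pd.DataFrame({'date':[]})
# Rename each date column to 'date' and outer-merge it into the result
for dfi,col in zip(dfs,column_names[::2]):
    new_df = new_df.merge(dfi.rename(columns={col:'date'}),how='outer')
new_df
Output: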

    date        aaa-val bbb-val ccc-val
0   08-04-2008  -20.943 -22.517 1.146
1   09-04-2008  -20.943 -24.857 NaN
2   10-04-2008  -18.868 -24.701 NaN
3   11-04-2008  -19.057 -24.857 NaN
4   14-04-2008  -20.000 NaN     NaN
5   15-04-2008  -18.868 NaN     NaN
6   16-04-2008  -16.226 NaN     NaN
7   17-04-2008  -14.340 NaN     NaN
8   18-04-2008  -12.830 NaN     NaN
9   21-04-2008  -15.472 NaN     NaN
10  31-03-2008  NaN     -23.869 0.688
11  01-04-2008  NaN     -19.813 3.336
12  02-04-2008  NaN     -18.929 3.413
13  03-04-2008  NaN     -19.917 4.177
14  04-04-2008  NaN     -20.125 4.279
15  07-04-2008  NaN     -21.321 2.445
16  26-03-2008  NaN     NaN     1.401
17  27-03-2008  NaN     NaN     1.376
18  28-03-2008  NaN     NaN     -0.534
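The dates in that output are still plain strings, which is why the rows are not in chronological order. If a real date index is wanted, a follow-up step along these lines might do it. The small new_df below is a hand-typed stand-in for a few rows of the result, and by_date is a name chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the merged frame.
new_df = pd.DataFrame({
    "date": ["08-04-2008", "31-03-2008", "26-03-2008"],
    "aaa-val": [-20.943, None, None],
    "bbb-val": [-22.517, -23.869, None],
    "ccc-val": [1.146, 0.688, 1.401],
})

# The dates are day-month-year, so state the format explicitly
# instead of letting pandas guess.
new_df["date"] = pd.to_datetime(new_df["date"], format="%d-%m-%Y")

# Use the parsed dates as the index and put the rows in chronological order.
by_date = new_df.set_index("date").sort_index()
print(by_date)
```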

So FWIW this is where I ended up. My dataset is 176 columns x 3300 rows, and concat with axis=1 seems to be faster than merge:
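import pandas as pd

df = pd.read_csv('data.csv')
i = 0
new_df = pd.DataFrame()

# Walk the (date, value) column pairs; <= so the last pair is included
while 2*(i+1) <= len(df.columns):
    tmp = df.iloc[:,[2*i, 2*i+1]].dropna()
    # Give every date column the same name, then use it as the index
    tmp = tmp.rename(columns={tmp.columns[0]: 'date'}).set_index('date')
    # axis=1 aligns on the index, which acts as an outer join on date
    new_df = pd.concat([new_df, tmp], axis=1)
    i += 1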
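Observations:

  • I don't think you can avoid looping over the initial DataFrame - I couldn't find a function that helps

  • iloc[:,[2*i,2*i+1]]
    is a very useful construct for pulling out the columns of interest - this may help newcomers


  • Thanks everyone, John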
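The same idea can also be written so that concat is called only once, by collecting the per-pair frames in a list first. This is a sketch, not a measured improvement: the inline CSV and the helper pair_frame are hypothetical, and it assumes the dates within each pair are unique, which concat with axis=1 needs in order to align the rows.

```python
import io

import pandas as pd

# Hypothetical miniature of data.csv with three date/value pairs.
raw = io.StringIO(
    "aaa-date,aaa-val,bbb-date,bbb-val,ccc-date,ccc-val\n"
    "08-04-2008,-20.943,31-03-2008,-23.869,26-03-2008,1.401\n"
    "09-04-2008,-20.943,08-04-2008,-22.517,08-04-2008,1.146\n"
)
df = pd.read_csv(raw)


def pair_frame(frame, i):
    """Return the i-th (date, value) pair, indexed by its date column."""
    tmp = frame.iloc[:, [2 * i, 2 * i + 1]].dropna()
    return tmp.set_index(tmp.columns[0]).rename_axis("date")


# Still a loop over the pairs, just expressed as a comprehension.
frames = [pair_frame(df, i) for i in range(len(df.columns) // 2)]

# A single concat along axis=1 aligns all frames on their date index.
new_df = pd.concat(frames, axis=1)
print(new_df)
```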

  • Thanks Ian - basically a loop plus concat. The pbpython site also looks worth following, so thanks for that.

  • You're welcome! But see my edit, which suggests joining the data rather than using concat. If data for a particular date exists in more than one of the data segments, a simple concat will duplicate the date field.

  • I think concat will work if you make the date column the index; by default concat performs an outer join.

  • Thanks again for the clarification. If you call concat with axis=1, it performs an outer join and gives the same result as the df4 example: df3 = pd.concat([df1, df2], axis=1)

  • Thanks again - I'm new to pandas, so this is helpful.

  • Did you measure the time? Can you show the output DataFrame?

  • @adirabargil the concat implementation takes 750 ms and the merge implementation takes 1188 ms, so merge takes 58% longer.

  • You're welcome to accept your own answer...
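The point made in the comments about axis=1 can be checked on the small df1/df2 frames from the answer. This is a minimal sketch: via_merge and via_concat are names chosen here, and both results are sorted by index before comparing because row order is not the point.

```python
import pandas as pd

# The two toy frames from the answer, indexed by 'a'.
df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2]}).set_index("a")
df2 = pd.DataFrame({"a": [1, 3], "c": [1, 3]}).set_index("a")

# Full outer join on the index using merge.
via_merge = pd.merge(df1, df2, how="outer", left_index=True, right_index=True)

# concat along axis=1 aligns on the index and defaults to an outer join.
via_concat = pd.concat([df1, df2], axis=1)

# Compare the two results after putting the rows in the same order.
print(via_concat.sort_index().equals(via_merge.sort_index()))
```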