Merging two datasets in Python
I have two sets of x-y data whose x values should be merged. To illustrate, the first set looks like this:
0.5;3.4
0.8;3.8
0.9;1.2
1.3;1.1
1.9;2.3
The second set looks like this:
0.3;-0.2
0.8;-0.9
1.0;0.1
1.5;1.2
1.6;6.3
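The data are in two separate CSV files. I would like to merge both files into one, so that the x values are in order and the y values appear in two columns, with the missing values (y1 and y2) filled in by linear interpolation. The second column contains the y values of the first dataset (plus the interpolated values), the third column those of the second dataset: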
0.3;y1;-0.2
0.5;3.4;y2
0.8;3.8;-0.9
0.9;1.2;y2
1.0;y1;0.1
1.3;1.1;y2
1.5;y1;1.2
1.6;y1;6.3
1.9;2.3;y2
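So far, my only idea is to read the data into numpy arrays, concatenate them, sort the values, and compute the average of the preceding and following values wherever a value is missing.
Is there a more elegant way to do this in Python?
Edit: Here is my attempt. The script is rather long, but it works and produces the result I had in mind.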
# -*- coding: utf-8 -*-
import numpy as np
from matplotlib import pyplot as plt
from scipy.interpolate import interp1d
import csv

# Read data files and turn them into numpy array for further processing
def read_datafile(file_name):
    data = np.loadtxt(file_name, delimiter=";")
    return data

data1 = read_datafile("testcsv1.csv")
data2 = read_datafile("testcsv2.csv")

# Add empty column at the appropriate position
emptycol1 = np.empty((len(data1), 3))
emptycol1[:] = np.nan
emptycol2 = np.empty((len(data2), 3))
emptycol2[:] = np.nan
emptycol1[:,:-1] = data1
emptycol2[:,[0, 2]] = data2

# Merge and sort the data sets. Create empty array to add final results
merged_temp = np.concatenate((emptycol1, emptycol2))
merged_temp = np.array(sorted(merged_temp, key = lambda x: float(x[0])))
merged = np.empty((1, 3))

# Check for entries where the x values already match. Merge those into one row
i = 0
while i < len(merged_temp)-1:
    if merged_temp[i, 0] == merged_temp[i+1, 0]:
        newrow = np.array([merged_temp[i, 0], merged_temp[i, 1], merged_temp[i+1, 2]])
        merged = np.vstack((merged, newrow))
        i += 2
    else:
        newrow = np.array([merged_temp[i, 0], merged_temp[i, 1], merged_temp[i, 2]])
        merged = np.vstack((merged, newrow))
        i += 1

# Check for so far undefined values (gaps in the data). Interpolate between them (linearly)
for i in range(len(merged)-1):
    # First y column
    if np.isnan(merged[i, 1]) == True:
        # If only one value is missing (maybe not necessary to separate this case)
        if (np.isnan(merged[i-1, 1]) == False) and (np.isnan(merged[i+1, 1]) == False):
            merged[i, 1] = (merged[i-1, 1] + merged[i+1, 1])/2
        # If two or more values are missing
        elif np.isnan(merged[i, 1]) == True:
            l = 0
            while (np.isnan(merged[i+l, 1]) == True) and (i+l != len(merged)-1):
                l += 1
            x1 = np.array([i-1, i+l]) # endpoints
            x = np.linspace(i, i+l-1, l, endpoint=True) # missing points
            y = np.array([merged[i-1, 1], merged[i+l, 1]]) # values at endpoints
            f = interp1d(x1, y) # linear interpolation
            for k in x:
                # linspace yields floats; array indices must be integers
                merged[int(round(k)), 1] = f(k)
    # Second y column
    if np.isnan(merged[i, 2]) == True:
        # If only one value is missing
        if (np.isnan(merged[i-1, 2]) == False) and (np.isnan(merged[i+1, 2]) == False):
            merged[i, 2] = (merged[i-1, 2] + merged[i+1, 2])/2
        # If two or more values are missing
        elif np.isnan(merged[i, 2]) == True:
            l = 0
            while (np.isnan(merged[i+l, 2]) == True) and (i+l != len(merged)-1):
                l += 1
            x1 = np.array([i-1, i+l]) # endpoints
            x = np.linspace(i, i+l-1, l, endpoint=True) # missing points
            y = np.array([merged[i-1, 2], merged[i+l, 2]]) # values at endpoints
            f = interp1d(x1, y) # linear interpolation
            for k in x:
                # linspace yields floats; array indices must be integers
                merged[int(round(k)), 2] = f(k)

# Remove lines which still have "nan" values (beginning and end). This could be prevented by an extrapolation
merged = merged[~np.isnan(merged).any(axis=1)]
merged = np.delete(merged, (0), axis=0)

# Write table to new csv file in the same directory
with open("testcsv_merged.csv", "w") as mergedfile:
    writer = csv.writer(mergedfile)
    writer.writerows(merged)
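For comparison, the merge, sort, and interpolate steps of the script above can be condensed considerably. The following is only a minimal sketch, not the asker's method: the function name `merge_xy` is made up for illustration, and the sample arrays are typed in by hand rather than read from the CSV files. It assumes the x values within each dataset are unique, and it interpolates against the actual x values (via `np.interp`) rather than averaging neighbouring rows.

```python
import numpy as np

def merge_xy(data1, data2):
    """Merge two (n, 2) x-y arrays onto their combined, sorted x grid."""
    # np.interp expects increasing x values, so sort each dataset first
    data1 = data1[np.argsort(data1[:, 0])]
    data2 = data2[np.argsort(data2[:, 0])]
    # Sorted union of all x values, with duplicates removed
    x = np.union1d(data1[:, 0], data2[:, 0])
    # Linear interpolation; points outside a dataset's x range become NaN
    y1 = np.interp(x, data1[:, 0], data1[:, 1], left=np.nan, right=np.nan)
    y2 = np.interp(x, data2[:, 0], data2[:, 1], left=np.nan, right=np.nan)
    return np.column_stack((x, y1, y2))

# Sample data from the question (hand-typed here instead of np.loadtxt)
data1 = np.array([[0.5, 3.4], [0.8, 3.8], [0.9, 1.2], [1.3, 1.1], [1.9, 2.3]])
data2 = np.array([[0.3, -0.2], [0.8, -0.9], [1.0, 0.1], [1.5, 1.2], [1.6, 6.3]])

merged = merge_xy(data1, data2)
# Drop rows that still contain NaN (the edges), as the script above does
trimmed = merged[~np.isnan(merged).any(axis=1)]
print(trimmed)
```

The result could then be written with something like `np.savetxt("testcsv_merged.csv", trimmed, delimiter=";", fmt="%g")`.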
I would use pandas for this kind of processing:
import pandas as pd

# I assumed you have no headers in the data files
df1 = pd.read_csv('./dataset1.txt', sep=';', header=None)
df2 = pd.read_csv('./dataset2.txt', sep=';', header=None)

# Join the datasets using a full outer join on the first column of both datasets
df_merged = df1.merge(df2, on=0, how='outer')

# Fill the nulls with the desired values, in this case the mean of each column
# (assigning the result back avoids relying on chained inplace=True)
df_merged['1_x'] = df_merged['1_x'].fillna(df_merged['1_x'].mean())
df_merged['1_y'] = df_merged['1_y'].fillna(df_merged['1_y'].mean())
Output:
print(df_merged)
     0  1_x   1_y
0  0.5  3.4    y2
1  0.8  3.8  -0.9
2  0.9  1.2    y2
3  1.3  1.1    y2
4  1.9  2.3    y2
5  0.3   y1  -0.2
6  1.0   y1   0.1
7  1.5   y1   1.2
8  1.6   y1   6.3
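You can easily change the column names:
df_merged.columns = ['col1','col2','col3']
You can also easily sort the values with the sort_values method (it returns a new DataFrame, so assign the result):
df_merged = df_merged.sort_values('col1')
Finally, you can easily convert the final DataFrame into a numpy array with:
import numpy as np
np.array(df_merged)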
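Since the question asks for a single merged file in the same semicolon-separated layout, the final DataFrame can also be written straight back to CSV. The snippet below is a small sketch under assumptions: the three rows are a hand-made stand-in for the merged result (with NaN where no value is defined), and an in-memory buffer replaces a real file so the output can be inspected. In practice a path such as `'merged.csv'` (a hypothetical name) would be passed instead of the buffer.

```python
import io

import pandas as pd

# Hand-made stand-in for the merged result; NaN marks undefined values
merged = pd.DataFrame({
    "col1": [0.5, 0.3, 0.8],
    "col2": [3.4, float("nan"), 3.8],
    "col3": [float("nan"), -0.2, -0.9],
})

buffer = io.StringIO()
# Sort by x, then write without header or index, using ';' as the separator
merged.sort_values("col1").to_csv(
    buffer, sep=";", header=False, index=False, na_rep="nan"
)
csv_text = buffer.getvalue()
print(csv_text)
```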
In one line:
dfi = pd.merge(df1, df2, how='outer', on=0).set_index(0).sort_index().interpolate()
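One detail worth knowing about the one-liner: with its default `method='linear'`, `interpolate()` treats the rows as equally spaced and ignores the index. Because the x values here are unevenly spaced, `method='index'` is probably closer to what the question asks for. The sketch below compares the two. The sample frames are typed in by hand, the column names `y1` and `y2` are chosen here for readability, and `limit_area='inside'` is an optional extra that leaves the edges as NaN instead of padding them.

```python
import pandas as pd

# Sample data from the question, typed in by hand
df1 = pd.DataFrame({0: [0.5, 0.8, 0.9, 1.3, 1.9], 1: [3.4, 3.8, 1.2, 1.1, 2.3]})
df2 = pd.DataFrame({0: [0.3, 0.8, 1.0, 1.5, 1.6], 1: [-0.2, -0.9, 0.1, 1.2, 6.3]})

merged = pd.merge(df1, df2, how="outer", on=0).set_index(0).sort_index()
merged.columns = ["y1", "y2"]  # readable names instead of '1_x' and '1_y'

# Default: rows are treated as equally spaced, the x values are ignored
by_position = merged.interpolate(limit_area="inside")
# method='index': interpolation is weighted by the actual x distances
by_x = merged.interpolate(method="index", limit_area="inside")

print(by_position.loc[1.0, "y1"], by_x.loc[1.0, "y1"])
```

At x = 1.0, for example, the positional variant returns the plain average of the neighbouring y1 values, while the index-based variant accounts for 1.0 being closer to 0.9 than to 1.3.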
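A full pandas version plus scipy interpolation, which gives more control over the edges:
# df1 = pd.read_clipboard(header=None, sep=';')
# df2 = pd.read_clipboard(header=None, sep=';')
import pandas as pd
import pylab as pl
import scipy.interpolate

df = pd.merge(df1, df2, how='outer', on=0).sort_values(0)
# Interpolate each dataset at all merged x values, extrapolating beyond its range
df['y1'] = scipy.interpolate.interp1d(*df1.values.T, fill_value='extrapolate')(df[0])
df['y2'] = scipy.interpolate.interp1d(*df2.values.T, fill_value='extrapolate')(df[0])

# Plot the original points as markers and the interpolated columns as lines
ax = pl.gca()
df1.set_index(0).plot(lw=0, marker='o', ax=ax)
df2.set_index(0).plot(lw=0, marker='o', ax=ax)
df.set_index(0).loc[:, ['y1', 'y2']].plot(ax=ax)
pl.show()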
Data (the plot from the original answer is not reproduced here):
In [344]: df1
Out[344]:
     0    1
0  0.5  3.4
1  0.8  3.8
2  0.9  1.2
3  1.3  1.1
4  1.9  2.3

In [345]: df2
Out[345]:
     0    1
0  0.3 -0.2
1  0.8 -0.9
2  1.0  0.1
3  1.5  1.2
4  1.6  6.3
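
In [346]: df
Out[346]:
     0  1_x  1_y         y1         y2
5  0.3  NaN -0.2 -20.713281  -0.200000
0  0.5  3.4  NaN   3.400000  -3.021563
1  0.8  3.8 -0.9   3.800000  -0.900000
2  0.9  1.2  NaN   1.200000  -0.092830
6  1.0  NaN  0.1  -0.265527   0.100000
3  1.3  1.1  NaN   1.100000  -1.960323
7  1.5  NaN  1.2   3.760937   1.200000
8  1.6  NaN  6.3   4.701230   6.300000
4  1.9  2.3  NaN   2.300000  44.318059

Note that these y1 and y2 values do not match a linear interpolation, which is the default kind for interp1d: interpolating the first dataset linearly at x = 1.0, for instance, gives 1.175. They were presumably produced with a different kind setting.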
Have you tried anything yet? I would use [link missing] rather than just computing the average of the preceding and following values.