Python - Transforming a DataFrame and slicing (Python, Pandas)


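I've attached a screenshot to help explain. I loaded a data frame from the Cleveland heart dataset. It has 76 columns, but the file puts them into 7 columns and wraps the remaining ones onto the following lines. I'm trying to figure out how to transform the data frame into a readable format, like the data frame shown on the right.

The variable xyz will always be the same, but the other letter variables I listed will differ. I thought I could start with data.loc[:, :'xyz'], but I'm not sure where to go from here: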

data = pd.read_csv("../resources/cleveland.data")
data.loc[:, :'xyz']

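Then I would have to assign column names to those variables from there. Surprisingly, the train/test/validation part will be much easier once I have this solved. Thanks in advance for your help. (I'm a newbie.)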

Input data

1 a b c d xyz 2 e f g h xyz 3 i j k

Code

import pandas as pd
import numpy as np

# The initial data doesn't contain a header, so set header to None
df = pd.read_csv("../resources/cleveland.data", header=None)
cols = df.columns.tolist()

# Reset the index to get the line number in the dirty file
df = df.reset_index()

# After melting the df, every value sits in a single column.
# Sorting by original line and column position keeps the values in the right order.
df = pd.melt(df, id_vars=['index'], value_vars=cols)
df = df.sort_values(by=['index', 'variable'])

# Then set the line number: each 'xyz' closes one output line
df['line'] = np.where(df.value == 'xyz', 1, np.nan)
df.line = df.line.cumsum()
df.line = df.line.bfill()

# If the file doesn't end with 'xyz', set the remaining line numbers to df.line.max() + 1
df.loc[df.line.isna(), 'line'] = df.line.max() + 1
df.line = df.line.ffill()

# Set the column names as integers with a groupby cumsum
df['one'] = 1
df['col_name'] = df.groupby(['line'])['one'].cumsum()
df['col_name'] = "col_" + df['col_name'].astype('str')

# Then pivot the table
df = df[['value', 'line', 'col_name']]
df = df.pivot(index='line', columns='col_name', values='value')
print(df)
Output data


col_name col_1 col_2 col_3 col_4 col_5 col_6
line
1.0          1     a     b     c     d   xyz
2.0          2     e     f     g     h   xyz
3.0          3     i     j     k   NaN   NaN
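
The line-numbering step above can be tried on its own with a small in-memory example. This is only a minimal sketch, not the answer's exact code: the `tokens` list is made-up data standing in for the melted `value` column, and it swaps the `np.where` + `bfill` steps for a shifted cumulative sum, assuming that every record ends with the `xyz` marker except possibly the last one.

```python
import pandas as pd

# Hypothetical flat token stream; stands in for the melted, sorted `value` column.
tokens = ["1", "a", "b", "c", "d", "xyz",
          "2", "e", "f", "g", "h", "xyz",
          "3", "i", "j", "k"]
s = pd.Series(tokens, name="value")

# A record ends at each 'xyz'; shifting by one makes the token
# *after* the marker the first token of a new record.
is_end = s.eq("xyz")
line = is_end.shift(fill_value=False).cumsum() + 1

# 1-based position of each token inside its record.
pos = s.groupby(line).cumcount() + 1

# One row per record, one column per position.
wide = (
    pd.DataFrame({"line": line, "pos": pos, "value": s})
    .pivot(index="line", columns="pos", values="value")
)
print(wide)
```

Using integer positions as column labels keeps the columns in numeric order even when a record has ten or more fields, which string names like `col_10` would not when sorted.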

Using numpy: after forming one big array containing all the values, use a combination of np.array_split + np.where to split at the indices right after xyz:
Example data: test.csv

Code output (using @CharlesR's data):

   0  1  2  3     4     5
0  1  a  b  c     d   xyz
1  2  e  f  g     h   xyz
2  3  i  j  k  None  None
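
The code for this answer is not shown on this page. A minimal sketch of the np.where + np.array_split combination it describes might look like the following. The inline `csv_text` and the way its lines wrap are assumptions made for illustration (the contents of test.csv are not shown), and the file is parsed by hand rather than with `pd.read_csv` so that ragged lines are not a problem.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for test.csv: records wrapped over several short lines.
csv_text = "1,a,b\nc,d,xyz\n2,e,f\ng,h,xyz\n3,i,j\nk\n"

# Flatten every line into one big 1-D array of tokens, in file order.
tokens = [tok for row in csv_text.splitlines() for tok in row.split(",") if tok]
flat = np.array(tokens)

# The indices just after each 'xyz' are the split points.
split_at = np.where(flat == "xyz")[0] + 1

# Drop the empty trailing chunk that appears when the data ends with 'xyz'.
records = [r for r in np.array_split(flat, split_at) if len(r)]

# Records shorter than the longest one are padded with missing values.
df = pd.DataFrame([r.tolist() for r in records])
print(df)
```

Because the split points come from where xyz actually occurs, the records do not need to have the same length or start at fixed positions.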

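This looks useful, but a, b, c, d, e, f, g, h, i, j, k, xyz are all on one line, and then we have to parse out the next line. I always want to end the line at xyz and start a new line at the position right after it, then stop at the next xyz and start a new line after that one. Those positions are not always the same.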