
Python - transform dataframe and slice


I've attached a screenshot to help explain. I have a dataframe pulled from the Cleveland heart dataset: it takes the 76 columns, puts them into 7 columns, and wraps the rest onto the next row. I'm trying to figure out how to transform the dataframe into a readable format like the dataframe on the right.

The variable xyz will always be the same, but the other lettered variables I listed will be different. I think I can use data.loc[:, :'xyz'] to get started, but I'm not sure where to go from here:

data = pd.read_csv("../resources/cleveland.data")
data.loc[:, :'xyz']

Then I'll have to go on from there and assign column names to those variables. Amazingly, the train/test/validate part will be much easier once I've solved this. Thanks in advance for the help (I'm new to this).
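A rough sketch of those next steps (the column names, sample values and split sizes below are placeholders, not taken from the post): once the frame is reshaped, the column names can be assigned directly, and two calls to scikit-learn's train_test_split give a train/validate/test split.

import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder stand-in for the reshaped dataframe; the real 76 Cleveland
# attribute names would replace these example labels.
reshaped = pd.DataFrame([[63, 1, 0], [67, 1, 1], [41, 0, 0], [56, 1, 1],
                         [62, 0, 0], [57, 0, 1], [63, 1, 1], [53, 1, 0]])
reshaped.columns = ["age", "sex", "target"]   # example names only

# Two splits give train / validate / test (placeholder 70/15/15 proportions)
train, rest = train_test_split(reshaped, test_size=0.30, random_state=0)
validate, test = train_test_split(rest, test_size=0.50, random_state=0)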

Input data

1   a   b   c
d   xyz 2   e
f   g   h   xyz
3   i   j   k
Code

import pandas as pd
import numpy as np

# The initial data doesn't contain a header, so set header to None
df = pd.read_csv("../resources/cleveland.data", header=None)
cols = df.columns.tolist()

# Reset the index to keep the row number from the dirty file
df = df.reset_index()

# After melting the df, all of the values end up in a single column,
# in the original reading order
df = pd.melt(df, id_vars=['index'], value_vars=cols)
df = df.sort_values(by=['index', 'variable'])

# Then you can set the line number
df['line'] = np.where(df.value == 'xyz', 1, np.nan)
df.line = df.line.cumsum()
df.line = df.line.bfill()

# If the file doesn't end with 'xyz', we have to set the line number to df.line.max() + 1
df.loc[df.line.isna(), 'line'] = df.line.max() + 1
df.line = df.line.ffill()

# Number the values within each line with a groupby cumsum to build integer column positions
df['one'] = 1
df['col_name'] = df.groupby(['line'])['one'].cumsum()
df['col_name'] = "col_" + df['col_name'].astype('str')

# Then we can pivot the table
df = df[['value', 'line', 'col_name']]
df = df.pivot(index='line', columns='col_name', values='value')
print(df)
Output data

col_name col_1 col_2 col_3 col_4 col_5 col_6
line
1.0          1     a     b     c     d   xyz
2.0          2     e     f     g     h   xyz
3.0          3     i     j     k   NaN   NaN
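One likely follow-up, since the melt/pivot round trip leaves every column holding strings (object dtype): the values would usually be converted back to numbers before any training, for example (continuing from the df above):

# Convert the pivoted strings back to numbers; anything non-numeric becomes NaN
df = df.apply(pd.to_numeric, errors="coerce")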

After gathering all of the values into one big array, split it with a combination of numpy's np.array_split + np.where on the index just after each xyz:

Sample data: test.csv

Code

   0  1  2  3     4     5
0  1  a  b  c     d   xyz
1  2  e  f  g     h   xyz
2  3  i  j  k  None  None

This looks useful; however, a, b, c, d, e, f, g, h, i, j, k, xyz all end up on one line, and then we'd have to parse out the next line. I always want the line to stop at xyz and the new line to start at the position right after it, then stop at the next xyz and start a new line at the position after that. Those positions won't always be the same.