Python (Jupyter Notebook): Duplicated DataFrame index causes "Length of values does not match length of index" error


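I have a DataFrame in which some cells hold multiple values separated by ";". I am trying to split those values (within a single cell) and create a new row for each one. For example:

> In: df
> Out:
| Year | State | Ingredient | Species |
| 1998 |  CA   | egg; pork  | sp1;sp2 |
The result I am trying to get looks like this:

> In: df
> Out:
| Year | State | Ingredient | Species |
| 1998 |  CA   | egg        | sp1     |
| 1998 |  CA   | egg        | sp1     |
| 1998 |  CA   | pork       | sp2     |
| 1998 |  CA   | pork       | sp2     |
I found a way to split the DataFrame like this, but it only works once. The code I am using is shown below: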

# Split 'Species' into one row per value
sp = df['Species'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy()
df1['Species'] = sp.values
# Split 'Ingredient' the same way, starting from df1
fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
df2 = df1.loc[j].copy()
df2['Ingredient'] = fd.values
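When I first run this on the 'Species' column of the original DataFrame (df), it works.

However, when I run the same code again on df1 to split all the values in 'Ingredient', it raises an error saying that the length of values does not match the length of index, as shown below: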

fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
df2 = df1.loc[j].copy()
df2['Ingredient'] = fd.values   # ValueError: length of values does not match length of index
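I experimented a lot to work out why it returns this error, and I realized that when I run the code on df1 to create df2, the line df2 = df1.loc[j].copy() doubles the number of rows (index entries), so I end up with more rows than I need. If I replace "df1" with "df" (the original DataFrame), the error does not occur and everything works.

Is there a fix for this? Or is there another way to do the split?

Thanks, everyone.


This is my first post on Stack Overflow, and I am also new to Python. Apologies if the formatting is not right.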

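I gave your problem a try. I could not fix the issue in your approach, but since you provided the expected output, I was able to come up with a different way. Hopefully it is concise and solves your problem.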

import pandas as pd

df = pd.DataFrame(columns=['Year', 'State', 'Ingredient', 'Species'])
df.loc[0] = [1998, 'CA', 'egg; pork', 'sp1;sp2']   # same input df as in the question
print(df)
sp = df['Species'][0].split(';')                    # separate the species
df = pd.concat([df]*len(sp), ignore_index=True)     # repeat the row once per species
df['Species'] = sp
ing = df['Ingredient'][0].split(';')
df = pd.concat([df]*len(ing), ignore_index=True)
df['Ingredient'] = ing*len(sp)    # replicate the ingredients len(sp) times
print(df)
   Year State Ingredient  Species
0  1998    CA  egg; pork  sp1;sp2
   Year State Ingredient Species
0  1998    CA        egg     sp1
1  1998    CA       pork     sp2
2  1998    CA        egg     sp1
3  1998    CA       pork     sp2
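This is my first answer... please let me know if I should change anything to add more detail or improve the formatting. Thanks.

Edit: I was able to work out what goes wrong in your approach. When you create the copy of the DataFrame you have to reset its index. Otherwise every row of the copy still carries the label 0, so looking up label 0 returns multiple rows. See below.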

# Without reset_index: both rows of df1 keep the label 0
sp = df['Species'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy()
print(df1)
fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
print(j)

# With reset_index: the rows of df1 are relabelled 0 and 1
df1 = df.loc[i].copy().reset_index(drop=True)
print(df1)
fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
print(j)
Output:

   Year State Ingredient  Species
0  1998    CA  egg; pork  sp1;sp2
0  1998    CA  egg; pork  sp1;sp2
Int64Index([0, 0, 0, 0], dtype='int64')
   Year State Ingredient  Species
0  1998    CA  egg; pork  sp1;sp2
1  1998    CA  egg; pork  sp1;sp2
Int64Index([0, 0, 1, 1], dtype='int64')
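
To make the consequence of those four identical labels concrete, here is a minimal sketch that reproduces the mismatch on the one-row example. It assumes a reasonably recent pandas; the names rows_selected and mismatch are only for illustration and are not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({"Year": [1998], "State": ["CA"],
                   "Ingredient": ["egg; pork"], "Species": ["sp1;sp2"]})

# First split, without resetting the index: both rows of df1 keep label 0.
sp = df["Species"].str.split(";", expand=True).stack().reset_index(level=1, drop=True)
df1 = df.loc[sp.index].copy()
df1["Species"] = sp.values

# Second split: four stacked values, every one of them labelled 0.
fd = df1["Ingredient"].str.split(";", expand=True).stack().reset_index(level=1, drop=True)
j = fd.index

# .loc returns every row that matches each requested label,
# so 4 requested labels x 2 matching rows gives 8 rows.
rows_selected = len(df1.loc[j])
print(len(fd), rows_selected)

mismatch = None
try:
    df2 = df1.loc[j].copy()
    df2["Ingredient"] = fd.values   # 4 values for an 8-row frame
except ValueError as err:
    mismatch = str(err)
print(mismatch)
```

Once the labels are unique (after reset_index), each requested label matches exactly one row, so the selection and the stacked values have the same length.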
Original code with the fix:

import pandas as pd

df = pd.DataFrame(columns=['Year', 'State', 'Ingredient', 'Species'])
df.loc[0] = [1998, 'CA', 'egg; pork', 'sp1;sp2']
#print(df)

sp = df['Species'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy().reset_index(drop=True)   # give each row a unique label again
df1['Species'] = sp.values


fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
df2 = df1.loc[j].copy().reset_index(drop=True)
df2['Ingredient'] = fd.values
print(df2)

Hope this helps.
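
Since the same split-and-reset step is applied to each column in turn, it could be wrapped in a small function. The following is only a sketch: split_to_rows is a hypothetical name, and the dropna() and str.strip() calls are my additions (to cope with cells holding different numbers of values and with the space in 'egg; pork'), not part of the code above.

```python
import pandas as pd

def split_to_rows(frame, column, sep=";"):
    """Hypothetical helper: return one row per separated value in `column`."""
    # Reset first so that every row has a unique label to select against.
    frame = frame.reset_index(drop=True)
    # Stack the split pieces; dropna() discards padding from uneven splits.
    parts = frame[column].str.split(sep, expand=True).stack().dropna()
    # Level 0 of the stacked index is the label of the source row.
    rows = parts.index.get_level_values(0)
    out = frame.loc[rows].copy()
    out[column] = parts.str.strip().values
    return out.reset_index(drop=True)

df = pd.DataFrame({"Year": [1998], "State": ["CA"],
                   "Ingredient": ["egg; pork"], "Species": ["sp1;sp2"]})

result = split_to_rows(split_to_rows(df, "Species"), "Ingredient")
print(result)
```

Because the function resets the index on the way in and on the way out, it can be chained over any number of columns without running into duplicated labels.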

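vk's "Original code with the fix" above helped me resolve the "Length of values does not match length of index" error. The solution: I needed to call reset_index() at the right place in my code.

Original code: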

## Separate multiple entries in cells in 'Species' column to new rows:
sp = df['Species'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy()
df1['Species'] = sp.values

## Separate multiple entries in cells in 'Ingredient' column to new rows:
ing = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = ing.index.get_level_values(0)
df2 = df1.loc[j].copy()
df2['Ingredient'] = ing.values   ## fails here: df1 still has duplicated index labels
Fixed code:

## Separate multiple entries in 'Species' column cell into rows
sp = df['Species'].str.split(';', expand=True).stack()
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy().reset_index()
df1['Species'] = sp.values

del df1['index'] ## a column called "index" is generated when you execute reset_index()

## Separate multiple entries in 'Ingredient' column cell into rows:
ing = df1['Ingredient'].str.split(';', expand=True).stack()
j = ing.index.get_level_values(0)
df2 = df1.loc[j].copy()
df2['Ingredient'] = ing.values
With the "fixed code" I got the output I wanted.
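
For readers on a newer pandas, the same row expansion can probably be written without the stack/loc bookkeeping. This is a sketch of an alternative, not the method used in this thread: it assumes pandas 0.25 or later, where DataFrame.explode is available, and the str.strip() call is my addition.

```python
import pandas as pd

df = pd.DataFrame({"Year": [1998], "State": ["CA"],
                   "Ingredient": ["egg; pork"], "Species": ["sp1;sp2"]})

result = df.copy()
for column in ["Species", "Ingredient"]:
    # Turn "a;b" into ["a", "b"], then give each list element its own row.
    result[column] = result[column].str.split(";")
    # Reset the index after each explode so labels stay unique.
    result = result.explode(column).reset_index(drop=True)
    result[column] = result[column].str.strip()

print(result)
```

Like the fixed code, this yields every combination of species and ingredient for each original row.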

Thanks for your reply! I tried your code, but it did not quite work. I think your approach worked for you because the dataset is small. My dataset is large and complex, which is why it did not work for me. Thanks for the "Edit" suggestion; it really helped my thinking, and I learned a lot from your approach. I will let you know if I find a solution!

Did the fix to the original code work? I understand the earlier problem was incorrect behavior; is the problem now one of performance?

Yes! I get it now. It is exactly what you said in "Original code with the fix". I will post my answer below. Thank you, vk!