Python: How do I reshape a DataFrame with many columns into the expected DataFrame? (Python, pandas)


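I have a list of text files that I need to put into a single DataFrame, so I read the files and concatenate them into one DataFrame. However, the resulting DataFrame has many columns (452), and I want to reshape it into a custom layout. That is, I only want two columns, such as columns 0 and 1. Here is what my data looks like:

Here is what I tried on my data: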

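import glob

import pandas as pd

allfiles = glob.glob('C:\\fake\\*.txt')
# Read each file line by line, transpose it into a single row, then concatenate side by side (axis=1)
dfs = pd.concat([pd.read_csv(file, header=None, sep='\n', quoting=3, skip_blank_lines=True).T for file in allfiles], axis=1)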

Now I want to reshape this resulting DataFrame so that it has only two columns (such as 0 and 1). How can I do that? Any ideas?

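Update: desired output

Here is my expected output (example only):

Update 2: raw data

Here is what the original text files look like (I am reading multiple text files into one DataFrame with only two columns):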

Simply drop the outermost axis specification; i.e., instead of

In [44]: pd.concat([pd.read_csv(file, header = None, sep = '\n', quoting=3, skip_blank_lines = True).T for file in allfiles], axis=1)
Out[44]:
        0       1       0       1       0       1
0  test1a  test1b  test2a  test2b  test3a  test3b

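do the same call without axis=1, so that the one-row frames are stacked vertically:

pd.concat(pd.read_csv(file, header=None, sep='\n', quoting=3, skip_blank_lines=True).T for file in allfiles)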

Edit, now that the question has been updated:

For example, with the following input:

In [79]: !cat blah.test
test1a

test1b
In [80]: !cat blah2.test
test2a

test2b
In [81]: !cat blah3.test
test3a

test3b
In [82]: allfiles
Out[82]: ['blah.test', 'blah2.test', 'blah3.test']

we get the desired output:

In [83]: pd.concat(pd.read_csv(file, header=None, sep='\n', quoting=3, skip_blank_lines=True).T for file in allfiles)
Out[83]:
        0       1
0  test1a  test1b
0  test2a  test2b
0  test3a  test3b
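
As far as I know, newer pandas releases reject a newline as sep in read_csv, so the call above may raise a ValueError depending on the installed version. A minimal sketch of the same one-row-per-file idea that avoids read_csv, assuming the files are UTF-8 text; the helper names read_nonblank_lines and files_to_rows are illustrative and not part of pandas:

```python
from pathlib import Path

import pandas as pd


def read_nonblank_lines(path):
    """Return the non-empty lines of a text file, stripped of surrounding whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def files_to_rows(paths):
    """Build one row per file; column i holds the i-th non-empty line of that file."""
    # Rows of unequal length are padded with missing values by the DataFrame constructor.
    return pd.DataFrame([read_nonblank_lines(p) for p in paths])
```

Calling files_to_rows(allfiles) should give a frame with one row per file and a fresh 0..n-1 index, so no reset_index is needed afterwards.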

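Edit #2, based on the comments below:

At least one of the files contains more than two non-empty lines, so further processing is needed. In your case, I would probably do something like this: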

In [169]: df = pd.concat(pd.read_csv(file, header=None, sep='\n', quoting=3, skip_blank_lines = True).T for file in allfiles).reset_index(drop=True).fillna('')

In [170]: df_clean = pd.DataFrame({'headline': df[0], 'context': df.loc[:, 1:].apply(' '.join, axis=1)})

In [171]: df_clean.head()
Out[171]:
                                            headline                                            context
0   Alex Jones Vindicated in "Pizzagate" Controversy  "Alex Jones, purveyor of the independent inves...
1                            THE BIG DATA CONSPIRACY  Government and Silicon Valley are looking to e...
2  California Surprisingly Lenient on Auto Emissi...  Setting Up Face-Off With Trump "California's c...
3  Mexicans Are Chomping at the Bit to Stop NAFTA...  Mexico has been unfairly gaining from NAFTA as...
4  Breaking News: Snapchat to purchase Twitter fo...  Yahoo and AOL could be extremely popular over ...
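
The headline/context step can be tried on in-memory data without any files. A minimal sketch, assuming ragged rows in which the first entry is the headline and everything after it belongs to the context; the sample strings and the helper name split_headline_context are made up for illustration, and the trailing .str.strip() is an addition that removes the spaces left behind by joining empty padding cells:

```python
import pandas as pd


def split_headline_context(df):
    """Keep column 0 as the headline and join all remaining columns into one context string."""
    filled = df.fillna("")
    # Label-based slice: columns 1, 2, ... (everything after the headline).
    context = filled.loc[:, 1:].apply(" ".join, axis=1).str.strip()
    return pd.DataFrame({"headline": filled[0], "context": context})


# Hypothetical rows: the second "file" had three non-empty lines.
rows = [
    ["headline A", "first paragraph"],
    ["headline B", "part one", "part two"],
]
df_clean = split_headline_context(pd.DataFrame(rows))
print(df_clean)
```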

Comments:

Do you mean you only want one column per text file? Try df.T?

@MohitMotwani No, that is not what I want. Please see my updated post; I have put a reproducible output there.

Well, we cannot reproduce your output if we do not know what we are reshaping.

@MohitMotwani I have updated my post with reproducible input data and a reproducible expected output. Any idea?

Probably one of your files contains more than two non-empty lines. To find out which one, let df be the result of the operation above and look at df[df[2].notnull()].

The code works fine for the example given; it matches the form of the given input exactly, so at least one input file must be formatted differently. You can use df[df[2].notnull()] to determine which one.

Did you investigate the third column to see when it becomes non-null? Either do that, or provide all the input files.

I tried to understand why this happens; it may be truncating the long text in the second column and producing a NaN value. Here is a list of sample data you can look through. I do not understand why the long text gets shortened in the DataFrame and produces an extra column. How can I fix this? Any ideas?

Updated the answer accordingly.
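
The df[df[2].notnull()] check from the comments can be generalised to any number of extra columns. A minimal sketch, assuming the frame is indexed by file name so the offending files can be read off directly; the file names, sample values and the helper name rows_with_extra_lines are hypothetical:

```python
import pandas as pd


def rows_with_extra_lines(df, expected=2):
    """Return the rows that have a non-null value in any column beyond the first `expected`."""
    extra = df.iloc[:, expected:]
    return df[extra.notna().any(axis=1)]


# Hypothetical data: b.txt contributed three non-empty lines instead of two.
df = pd.DataFrame(
    [["h1", "b1"], ["h2", "b2a", "b2b"], ["h3", "b3"]],
    index=["a.txt", "b.txt", "c.txt"],
)
offenders = rows_with_extra_lines(df)
print(offenders.index.tolist())
```

If the frame has no columns beyond the expected ones, the helper simply returns an empty frame rather than raising a KeyError, which is the main difference from indexing column 2 directly.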