Python: Making a feature table with pandas


python pandas dataframe


I have two folders, and each folder contains files like these:

file1          file2
a 32           b  32
b 23           d  12
c 28           r  7

Note that not every letter has to appear in every file, and the letters can come in any order.

Now I want to create a table in this format:

 a    b    c    d   ...   r   s   t   ...   class
32   23   28    0   ...
 0   32    0   12   ...   7   0   0   ...   0
...                                         1

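Each row holds the letter values from one file, with 0 wherever a letter is absent. Class 0 marks files from the first folder, and class 1 marks files from the second folder.

My attempt: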
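A minimal sketch of the layout described above, assuming each file has already been read into a string. The `file1`/`file2` strings mirror the example in the question, while the `parse_counts` helper and the labels are hypothetical and only there for illustration. The idea is to collect one dict per file and let pandas align the keys into columns, so a missing letter becomes NaN and can then be filled with 0.

```python
import pandas as pd


def parse_counts(text):
    """Turn 'letter value' lines into a {letter: value} dict (hypothetical helper)."""
    row = {}
    for line in text.splitlines():
        parts = line.split()  # split on any run of whitespace
        if len(parts) == 2:
            row[parts[0]] = int(parts[1])
    return row


# Example contents copied from the question; both labelled 0 purely for illustration.
file1 = "a 32\nb 23\nc 28\n"
file2 = "b  32\nd  12\nr  7\n"

rows = []
for text, label in [(file1, 0), (file2, 0)]:
    row = parse_counts(text)
    row["class"] = label
    rows.append(row)

# A list of dicts becomes one row per dict; keys missing from a dict become NaN.
table = pd.DataFrame(rows).fillna(0)

# Put the letter columns in alphabetical order and 'class' last.
letters = sorted(c for c in table.columns if c != "class")
table = table[letters + ["class"]].astype(int)
print(table)
```

Building the rows first and creating the DataFrame once avoids having to add columns or rows to an existing frame while looping.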

import os
import pandas as pd



dir_list = ["........","........"] #CHANGE INPUT PATH

df = pd.DataFrame(columns=['class'])
count=0
for l in dir_list: 
    for root, dirs, files in os.walk(l):
        for name in files:
            outfile2 = open(root+"/"+name,'r')
            line = outfile2.readline()
            print(name)
            count+=1

            while line:
                words=line.split(" ")

                if words[0] not in df.columns:
                    df[words[0]]=words[1]

                elif words[0] in df.columns:
                    df.iloc[count-1][words[0]]=words[1]



                line = outfile2.readline()

            if l=="   ":
               df[count-1]['class']='M'
            else:
               df[count-1]['class']='B'


df=df.fillna(0)

print(df)

Error:

Traceback (most recent call last):
  File "E:\anaconda\lib\site-packages\pandas\core\indexes\base.py", line 2525, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "new.py", line 34, in <module>
    df[count-1]['class']='M'
  File "E:\anaconda\lib\site-packages\pandas\core\frame.py", line 2139, in __getitem__
    return self._getitem_column(key)
  File "E:\anaconda\lib\site-packages\pandas\core\frame.py", line 2146, in _getitem_column
    return self._get_item_cache(key)
  File "E:\anaconda\lib\site-packages\pandas\core\generic.py", line 1842, in _get_item_cache
    values = self._data.get(item)
  File "E:\anaconda\lib\site-packages\pandas\core\internals.py", line 3843, in get
    loc = self.items.get_loc(item)
  File "E:\anaconda\lib\site-packages\pandas\core\indexes\base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
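The traceback points at `df[count-1]['class']='M'`. Plain square brackets on a DataFrame look up a column, so `df[0]` searches for a column named `0` and raises `KeyError: 0`. The sketch below reproduces that on a small made-up frame (the data and the `raised` flag are assumptions for illustration) and shows label-based assignment with `.loc`, which takes the row label first and the column label second.

```python
import pandas as pd

# Small made-up frame with two rows, standing in for the table in the question.
df = pd.DataFrame({"class": [None, None], "a": [32, 0]})

# df[0] is a column lookup; there is no column named 0, so this raises KeyError.
try:
    df[0]["class"] = "M"
    raised = False
except KeyError:
    raised = True

# .loc[row_label, column_label] addresses a single cell directly.
df.loc[0, "class"] = "M"
df.loc[1, "class"] = "B"
print(raised)
print(df)
```

Note that `.loc` only helps here because the rows already exist. In the attempt above the frame starts with no rows, so the rows would need to be created before any cell can be assigned.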


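Can you add what is wrong with your approach? You should read your files with pd.read_csv(), then use df.transpose() to swap the index and the columns.

@CoMartel That would work if all the letters were in all the files.

@ubuntu_noob Not necessarily. You can create a temporary dataframe for each file and then concat them all.

@CoMartel All the dataframes will have different shapes. Won't that cause a problem?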
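On the last question: as far as I know, `pd.concat` aligns frames by column label and fills the gaps with NaN, so frames with different sets of letters can still be stacked. Here is a rough sketch of the approach from the comments, assuming whitespace-separated files. It uses in-memory strings instead of real paths, and the `file_to_row` helper, the column names `letter`/`value`, and the labels are hypothetical.

```python
import io

import pandas as pd


def file_to_row(text, label):
    """Read 'letter value' lines and return a one-row frame (hypothetical helper)."""
    frame = pd.read_csv(
        io.StringIO(text),      # a real file path could be passed here instead
        sep=r"\s+",             # any run of whitespace separates the two fields
        header=None,
        names=["letter", "value"],
        index_col="letter",
    )
    row = frame.transpose()     # letters become columns, values become the single row
    row["class"] = label
    return row


# Example contents from the question; the second label is 1 only for illustration.
file1 = "a 32\nb 23\nc 28\n"
file2 = "b  32\nd  12\nr  7\n"

rows = [file_to_row(file1, 0), file_to_row(file2, 1)]

# Columns are matched by name; letters missing from a file become NaN, then 0.
combined = pd.concat(rows, ignore_index=True).fillna(0)
print(combined)
```

Columns that received NaN before `fillna` come out as floats, so a final `astype(int)` may be wanted if integer counts matter.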