Python: loop through a dictionary and append its contents to a DataFrame using iterrows
I am fairly new to Python and pandas, and I need some help with the code I am working on. I have a dictionary named df that contains some files and their contents in txt format. The keys of this dictionary are the file names (date.txt) and the values are the file contents. Here is what it looks like:
{'02_01_2020': 0
0 1 229017 Cust_1 CUR ...
1 2 629324 Cust_2 CUR ...
2 3 863300 Cust_3 CUR ...
3 4 670338 Cust_4 CUR ...
4 5 987039 Cust_5 CUR ...
5 6 485912 Cust_6 CUR ...,'03_01_2020': 0
0 1 122403 Cust_1 CUR ...
1 2 779269 Cust_2 CUR ...
2 3 728965 Cust_3 CUR ...
3 4 527716 Cust_4 CUR ...
4 5 796179 Cust_5 CUR ...
5 6 027872 Cust_6 CUR ...
6 7 449767 Cust_7 CUR ...
7 8 598752 Cust_8 CUR ...
8 9 180422 Cust_9 CUR ..., .... goes until the last file ('31_01_2020')}
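As shown above, each file contains different data. File 02_01_2020.txt has 6 entries, file 03_01_2020.txt has 9 entries, and so on until the last file (31_01_2020.txt).
My goal here is to separate the necessary information into its own columns (customer name, currency, etc.) and to insert the file name into a separate column named paid_date. I loop through this dictionary using iterrows(). The code is as follows: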
    import pandas as pd

    def data_process(df):
        # dataframe that i created outside this function
        global df_data_1
        for key, value in df.items():
            df1 = pd.DataFrame(value)
            df1['Paid_date'] = key.replace('_', '/')
            # df1.insert(1, 'Paid_date', key.replace('_','/')) - another attempt to insert the col
            for index, row in df1.iterrows():
                df_Item_Num = row.str.slice(start=0, stop=2)    # entry number
                df_DUMP_1 = row.str.slice(start=0, stop=23)     # not used
                df_NAME = row.str.slice(start=23, stop=40)
                df_CURRENCY = row.str.slice(start=40, stop=54)
                df_AMOUNT = row.str.slice(start=54, stop=66)
                df_DATE = row.str.slice(start=68, stop=86)
                df_DUMP_2 = row.str.slice(start=87, stop=-1)    # not used
                df_ALL_ITEMS = pd.concat([df_Item_Num, df_NAME, df_CURRENCY, df_AMOUNT, df_DATE], ignore_index=True)
                df_data_1 = df_data_1.append(df_ALL_ITEMS, ignore_index=True)
        return df_data_1
When I disable the code that creates the column from the key, the result looks like this:
0 1 2 3 4
0 1 Cust_1 CUR Amount Date_Time
1 2 Cust_2 CUR Amount Date_Time
2 3 Cust_3 CUR Amount Date_Time
3 4 Cust_4 CUR Amount Date_Time
4 5 Cust_5 CUR Amount Date_Time
.. .. ... ... ... ...
185 10 Cust_6 CUR Amount Date_Time
186 11 Cust_7 CUR Amount Date_Time
187 12 Cust_8 CUR Amount Date_Time
188 13 Cust_9 CUR Amount Date_Time
189 14 Cust_10 CUR Amount Date_Time
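This is exactly what I need, except that it lacks the paid_date column. I need the file name stored in every row that belongs to that file: for example, 02_01_2020 should appear in 6 rows, 03_01_2020 in 9 rows, and so on. However, when I enable the column creation code, the result is: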
0 1 2 3 ... 6 7 8 9
0 1 02 Cust_1 ... Amount Date_Time
1 2 02 Cust_2 ... Amount Date_Time
2 3 02 Cust_3 ... Amount Date_Time
3 4 02 Cust_4 ... Amount Date_Time
4 5 02 Cust_5 ... Amount Date_Time
.. .. .. ... .. ... ... .. ... ..
185 10 31 Cust_6 ... Amount Date_Time
186 11 31 Cust_7 ... Amount Date_Time
187 12 31 Cust_8 ... Amount Date_Time
188 13 31 Cust_9 ... Amount Date_Time
189 14 31 Cust_10 ... Amount Date_Time
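I get two new empty columns, and the key (file name) is evidently not inserted completely: only the day ends up in the new column, without the month and year. What is the most efficient way to solve this? Any help would be appreciated. Thank you.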
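One possible reading of the symptom, offered as a sketch rather than a confirmed diagnosis: Paid_date is added to df1 before the loop, so every row yielded by iterrows() holds two strings (the raw line and the date), and row.str.slice cuts both of them. Slicing '02/01/2020' with start=0, stop=2 gives '02', and the slices that start at 23 or later give empty strings, which would match the day-only value and the empty columns. The example below slices the raw column in one vectorized step and adds the date afterwards. The sample lines, the column names, and the helpers make_line and split_fixed_width are hypothetical; only the slice offsets come from the question, and it assumes each dictionary value is a DataFrame whose column 0 holds the raw lines.

```python
import pandas as pd

def make_line(num, name, cur, amount, date):
    """Build a hypothetical fixed-width line laid out on the question's offsets."""
    return f"{num:<23}{name:<17}{cur:<14}{amount:<12}  {date:<18}"

def split_fixed_width(lines: pd.Series) -> pd.DataFrame:
    """Slice a Series of fixed-width strings into named columns, without a row loop."""
    return pd.DataFrame({
        "Item_Num": lines.str.slice(0, 2).str.strip(),
        "Name": lines.str.slice(23, 40).str.strip(),
        "Currency": lines.str.slice(40, 54).str.strip(),
        "Amount": lines.str.slice(54, 66).str.strip(),
        "Date": lines.str.slice(68, 86).str.strip(),
    })

def data_process(frames: dict) -> pd.DataFrame:
    """Slice every file's lines first, then tag the rows with the file's date."""
    parts = []
    for key, value in frames.items():
        part = split_fixed_width(value[0])          # column 0 holds the raw lines
        part["Paid_date"] = key.replace("_", "/")   # added after slicing, so it stays whole
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

# Made-up sample data: two "files" with 2 and 1 entries.
frames = {
    "02_01_2020": pd.DataFrame([
        make_line(1, "Cust_1", "CUR", "229017", "2020-01-02 10:15"),
        make_line(2, "Cust_2", "CUR", "629324", "2020-01-02 11:30"),
    ]),
    "03_01_2020": pd.DataFrame([
        make_line(1, "Cust_1", "CUR", "122403", "2020-01-03 09:00"),
    ]),
}
result = data_process(frames)
print(result)

# Reproducing the symptom: a row that already contains Paid_date gets sliced too.
row = pd.Series({0: frames["02_01_2020"].loc[0, 0], "Paid_date": "02/01/2020"})
print(row.str.slice(start=0, stop=2))    # the Paid_date entry becomes '02'
print(row.str.slice(start=23, stop=40))  # the Paid_date entry becomes ''
```

Collecting the per-file pieces in a list and calling pd.concat once also avoids growing a DataFrame inside the loop, which copies the data on every iteration.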
Edit 1
The entries in each txt file I am working with look like this:
1 CUST_NAME_1 CURRENCY AMOUNT DATE_TIME
2 CUST_NAME_2 CURRENCY AMOUNT DATE_TIME
3 CUST_NAME_3 CURRENCY AMOUNT DATE_TIME
4 CUST_NAME_4 CURRENCY AMOUNT DATE_TIME
5 CUST_NAME_5 CURRENCY AMOUNT DATE_TIME
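In the txt files, a lot of whitespace separates the pieces of information, as shown above. The first thing my code does is walk through the directory on my computer where all the files are stored and append them to two lists. The code is as follows: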
    import os

    # Set up empty lists and dictionary
    filelist = []
    filename = []
    df = {}

    def file_process(mydir):
        for path, dirs, files in os.walk(mydir):
            for file in files:
                if file.endswith('.txt'):
                    filelist.append(file)
                    filename.append(file[0:10])
        return filelist, filename
The code above returns two lists.
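The question does not show how the two lists become the dictionary, so the following is only a sketch of one way it could be done. It assumes each file is plain text with one record per line, and that the first 10 characters of the file name form the date key, as in file[0:10] above. The function name load_frames and the temporary demo files are hypothetical.

```python
import os
import tempfile

import pandas as pd

def load_frames(mydir: str) -> dict:
    """Map each .txt file's 10-character date prefix to a one-column DataFrame of its lines."""
    frames = {}
    for path, dirs, files in os.walk(mydir):
        for file in sorted(files):
            if file.endswith(".txt"):
                with open(os.path.join(path, file), encoding="utf-8") as handle:
                    lines = handle.read().splitlines()
                frames[file[0:10]] = pd.DataFrame(lines)  # raw lines land in column 0
    return frames

# Demo with two made-up files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    samples = {"02_01_2020.txt": "line a\nline b\n", "03_01_2020.txt": "line c\n"}
    for name, body in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write(body)
    frames = load_frames(tmp)

print({key: len(value) for key, value in frames.items()})
```

Returning the dictionary from the function, instead of appending to module-level lists, keeps repeated calls from accumulating stale entries.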
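In the loop for index, row in df1.iterrows():, what I do (or think I do) is slice each Series returned by iterrows(), keep only the information I want, and concatenate the pieces into an empty DataFrame. Is this efficient? Or is there another way to do it?

"I have a dictionary named df", and then you show something that is not a dictionary, although it is wrapped in {} (and df is a name best reserved for an actual DataFrame, to avoid confusion). You have clearly put a lot of work into this (great first post!), but I can't quite follow the format you are dealing with. Likewise, everything inside for index, row in df1.iterrows() is pointless; you just discard the result and overwrite the variables on every loop. I suspect you are actually dealing with a tab-delimited file.

@roganjosh Hello, thank you for the reply and the suggestions (I will keep them in mind for future code). My idea is to take all the data from one file, move the important information into its own columns (name, currency, etc.), then store the file name in a column of its own, repeated for each of the file's entries. When I run print(type(df)), it tells me the type. It is a dictionary, isn't it? Or could it be something else? I am still fairly new to Python and its data structures :D

It can be a dictionary, but what you posted is not valid syntax. I assumed it was one big string literal, but either way you would need quotes. If the data is tab-delimited, then I think your attempt at slicing strings is very inefficient; you can specify a delimiter when reading CSV data (probably).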
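The last comment suggests letting pandas parse the files instead of slicing strings by hand. If the columns turn out to be fixed-width rather than tab-delimited (the slice offsets in the question hint at that, but it is an assumption), pandas.read_fwf with explicit colspecs is one option; for a truly tab-delimited file, pd.read_csv(path, sep="\t", header=None) would be the analogous call. The sample text, the column names, and the make_line helper below are made up for illustration.

```python
import io

import pandas as pd

# Column boundaries taken from the slice offsets in the question (end-exclusive).
COLSPECS = [(0, 2), (23, 40), (40, 54), (54, 66), (68, 86)]
NAMES = ["Item_Num", "Name", "Currency", "Amount", "Date"]  # hypothetical names

def make_line(num, name, cur, amount, date):
    """Build a hypothetical fixed-width line laid out on the offsets above."""
    return f"{num:<23}{name:<17}{cur:<14}{amount:<12}  {date:<18}"

# Stand-in for the contents of 02_01_2020.txt.
sample = "\n".join([
    make_line(1, "Cust_1", "CUR", "229017", "2020-01-02 10:15"),
    make_line(2, "Cust_2", "CUR", "629324", "2020-01-02 11:30"),
]) + "\n"

# read_fwf cuts each line at the given positions and strips the padding.
frame = pd.read_fwf(
    io.StringIO(sample),
    colspecs=COLSPECS,
    names=NAMES,
    header=None,
    dtype=str,  # keep values as text so nothing is reinterpreted as a number
)
frame["Paid_date"] = "02_01_2020".replace("_", "/")  # file name becomes the date column
print(frame)
```

With a real directory, the same call would take the file path in place of the io.StringIO buffer, and the per-file frames could then be combined with pd.concat.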