Python: preparing a DataFrame for writing to Excel
I have started editing the output of read_excel, and the result is shown in the table below:

| descr | serial | ref | type | val | qty | uom |
|-----------|--------|----------------------------------|--------|-----|-----|-----|
| Product 1 | NaN | 12345 | type 1 | NaN | 6 | PCS |
| Product 2 | NaN | 23456 | NaN | NaN | 4 | PCS |
| Product 3 | NaN | 66778 MAKER: MANUFACTURER 1 ... | type 2 | NaN | 4 | PCS |
| Product 4 | NaN | 88776 MAKER: MANUFACTURER 2 ... | NaN | NaN | 2 | PCS |
| Product 5 | 500283 | 99117 MAKER: MANUFACTURER 1 ... | NaN | NaN | 12 | PCS |
| Product 6 | 500283 | 00116 MAKER: MANUFACTURER 1 ... | NaN | NaN | 12 | PCS |
| Product 7 | 900078 | 307128 MAKER: MANUFACTURER 3 ... | NaN | NaN | 12 | PCS |
| Product 8 | 900078 | 411354 MAKER: MANUFACTURER 3 ... | NaN | NaN | 2 | PCS |

I now have two questions.
descr,serial,ref,type,val,qty,uom
Product 1,,12345,type 1,,6,PCS
Product 2,,23456,,,4,PCS
Product 3,,66778 MAKER: MANUFACTURER 1,type 2,,4,PCS
Product 4,,88776 MAKER: MANUFACTURER 2,,,2,
Load the data and create a new DataFrame named cleaned, which will be manipulated and massaged into the desired output:
import pandas as pd
import numpy as np
raw = pd.read_csv("data.csv")  # read in the example file
cleaned = pd.DataFrame()  # create a new, empty DataFrame
cleaned['ref (int)'] = raw['ref'].str.split(' ').str[0].copy()  # ref (int) is just the first part of the ref column
# move the rest of the data over
cleaned['description'] = raw['descr']
cleaned['ref_maker'] = raw['ref'].str.split(' ').str[1:].apply(' '.join)  # the rest of ref, if there is text after the integer (empty string otherwise)
cleaned['type_full'] = raw['type']
cleaned['qty'] = raw['qty']
print(cleaned)
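As an alternative to calling str.split twice, the number and the trailing text can be pulled apart in one pass with Series.str.extract. This is only a sketch: the two-row sample series and the group names ref_int and ref_maker are made up for illustration, and it assumes every ref value starts with digits.

```python
import pandas as pd

# Hypothetical sample mirroring the `ref` column from the question.
ref = pd.Series(["12345", "66778 MAKER: MANUFACTURER 1"], name="ref")

# One regex pass: leading digits go to `ref_int`,
# whatever follows the optional whitespace goes to `ref_maker`.
parts = ref.str.extract(r"^(?P<ref_int>\d+)\s*(?P<ref_maker>.*)$")

print(parts)
```

If every ref value in a file were purely numeric, read_csv would infer an integer column and the .str accessor would raise, so casting with astype(str) first is a reasonable guard.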
Now we have a DataFrame (cleaned) that looks like this:
   ref (int)  description              ref_maker  type_full  qty
0      12345    Product 1                            type 1    6
1      23456    Product 2                               NaN    4
2      66778    Product 3  MAKER: MANUFACTURER 1     type 2    4
3      88776    Product 4  MAKER: MANUFACTURER 2        NaN    2
Now we need to clean it up:
cleaned.replace('', np.nan, inplace=True)  # replace empty strings with NaN (np.NaN was removed in NumPy 2.0)
cleaned.set_index(['ref (int)', 'qty'], inplace=True)  # pin ref and qty as the index so they are repeated on every stacked row
cleaned = cleaned.stack().to_frame().reset_index()  # stack the remaining columns into rows, then turn the result back into a DataFrame
(For reference, the .stack() call on its own already gives you almost what you want.)
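To see what .stack() does in isolation, here is a minimal sketch on a made-up two-row frame (the values are copied from the sample data, but the frame itself is only an illustration). Every non-index column becomes its own row, and the index values are repeated for each of them.

```python
import pandas as pd

# Hypothetical two-row frame; ref and qty are pinned as the index.
wide = pd.DataFrame(
    {
        "ref": ["12345", "66778"],
        "qty": [6, 4],
        "description": ["Product 1", "Product 3"],
        "type_full": ["type 1", "type 2"],
    }
).set_index(["ref", "qty"])

# stack() moves the column labels into a new innermost index level,
# so each (ref, qty) pair now spans one row per original column.
long = wide.stack().to_frame(name="desc").reset_index()

print(long)
```

The unnamed index level created by the stack comes back from reset_index() as a column called level_2, which is why the answer deletes that column afterwards.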
Now we do a bit more cleanup:
del cleaned['level_2']  # remove the leftover from the stack (level_2 holds the old column names, which you don't want in the final output)
cleaned = cleaned.dropna().reset_index(drop=True)  # drop rows that have no value; dropna() returns a new DataFrame, so assign it back
cleaned.columns = ['ref', 'qty', 'desc']  # rename the columns for clarity
Now it looks like this:
     ref  qty                   desc
0  12345    6              Product 1
1  12345    6                 type 1
2  23456    4              Product 2
3  66778    4              Product 3
4  66778    4  MAKER: MANUFACTURER 1
5  66778    4                 type 2
6  88776    2              Product 4
7  88776    2  MAKER: MANUFACTURER 2
The last step is to replace the duplicate values with empty strings so that the result matches the desired output:
clear_mask = cleaned.duplicated(['ref', 'qty'], keep='first')  # boolean Series: True for rows whose ref and qty repeat an earlier row
cleaned['qty'] = cleaned['qty'].astype(object)  # let the integer column hold empty strings as well
cleaned.loc[clear_mask, 'qty'] = ''  # set the duplicates to empty strings
cleaned.loc[clear_mask, 'ref'] = ''
cols = cleaned.columns.tolist()  # rearrange the columns so that qty is at the end
cols.append(cols.pop(cols.index('qty')))
cleaned = cleaned[cols]
print(cleaned)
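The blanking step relies on what DataFrame.duplicated returns, which is easier to see on a tiny frame. The three rows below are a made-up illustration, not the full data set.

```python
import pandas as pd

# Hypothetical three-row frame in the stacked layout.
long = pd.DataFrame(
    {
        "ref": ["12345", "12345", "23456"],
        "qty": [6, 6, 4],
        "desc": ["Product 1", "type 1", "Product 2"],
    }
)

# keep="first" marks every repeat of a (ref, qty) pair except its first occurrence.
mask = long.duplicated(["ref", "qty"], keep="first")
print(mask.tolist())

# Cast to object first so an empty string can sit next to integers.
long["qty"] = long["qty"].astype(object)
long.loc[mask, ["ref", "qty"]] = ""
print(long)
```

Only the second row is flagged, so the first line of each product keeps its ref and qty while the continuation lines are blanked.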
Here is the final output:
     ref                   desc  qty
0  12345              Product 1    6
1                        type 1
2  23456              Product 2    4
3  66778              Product 3    4
4         MAKER: MANUFACTURER 1
5                        type 2
6  88776              Product 4    2
7         MAKER: MANUFACTURER 2
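For convenience, the steps above can be collected into a single function. This is a sketch under assumptions rather than the definitive version: the helper name prepare and the inline CSV_TEXT sample are hypothetical, and the data is read from a string instead of data.csv so the example is self-contained.

```python
import io

import numpy as np
import pandas as pd

# Two sample rows copied from the CSV shown earlier.
CSV_TEXT = """descr,serial,ref,type,val,qty,uom
Product 1,,12345,type 1,,6,PCS
Product 3,,66778 MAKER: MANUFACTURER 1,type 2,,4,PCS
"""


def prepare(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: reshape `raw` into one description line per row."""
    ref = raw["ref"].astype(str)  # guard against a purely numeric ref column
    wide = pd.DataFrame(
        {
            "ref": ref.str.split(" ").str[0],
            "qty": raw["qty"],
            "description": raw["descr"],
            "ref_maker": ref.str.split(" ").str[1:].apply(" ".join),
            "type_full": raw["type"],
        }
    )
    # Empty strings become NaN so they can be dropped after stacking.
    wide = wide.replace("", np.nan).set_index(["ref", "qty"])
    long = wide.stack().to_frame(name="desc").reset_index()
    long = long.drop(columns="level_2").dropna().reset_index(drop=True)

    # Blank out repeated ref/qty pairs so each shows once per product.
    repeated = long.duplicated(["ref", "qty"], keep="first")
    long["qty"] = long["qty"].astype(object)
    long.loc[repeated, ["ref", "qty"]] = ""
    return long[["ref", "desc", "qty"]]


final = prepare(pd.read_csv(io.StringIO(CSV_TEXT)))
print(final)
```

The returned frame could then be written out with final.to_excel("output.xlsx", index=False), where the file name is only a placeholder. pandas needs an Excel engine such as openpyxl installed for that call.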
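Please post the sample data as text rather than as an image.
Thank you, sir! Just one thing: cleaned.set_index(['ref (int)'), could you tell me how (and where in the code) to implement the following? * The ref (int) row should stay as it is. I found out how to change "MAKER:" to "MKR:". For example, if the other columns are no longer than 30 characters, I would like to stitch them together. I don't want the columns to be cut off when they are concatenated. Anyway, thanks.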