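Python, pandas, DataFrame: a two-part question. 1. How do I add columns to a list based on a specific condition? 2. How do I delete those columns from the df?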
The real question is further down, but here is the background.
My end goal is a general-purpose Python file (a script, i.e. a .py file) that opens an Excel file, works out how the data is organized, "cleans" the unusable data, and then runs a multiple linear regression analysis. I am stuck on the "cleaning" part: I have ideas, but I don't know how to implement them.
This is what the data looks like (in the Excel file):
Here is my code so far (with explanations):
So far, all I have done is determine the orientation of the data. Now I need to "clean" the data.
######this is the real work indent...##########################
# count how many empty values are in each column
# https://datascience.stackexchange.com/questions/12645/how-to-count-the-number-of-missing-values-in-each-row-in-pandas-dataframe
num_of_empty_cells_in_columns = df.isnull().sum(axis=0)
# sort the columns by how many empty values they have
# https://data-flair.training/blogs/sort-pandas-dataframes-series-array/
columns_pd_sorted = num_of_empty_cells_in_columns.sort_values(ascending=True)
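Now that the columns are "sorted" (by how many empty cells each column contains), I only need to take the first entry as the "lowest value". That column is either the Y value (for the multiple linear regression analysis later on) or a data field that has as many data points as the Y value.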
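# find the lowest value (the first entry of the already sorted Series)
# .iloc[0] selects by position; plain [0] would be treated as a label lookup on the column names
lowest_value = columns_pd_sorted.iloc[0]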
I also want to take the mean of the empty-cell counts across all columns. That mean will be used later on (I think).
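The counting, sorting, and averaging steps described so far can be sketched end to end as follows. This is a minimal sketch, not the asker's actual script: it assumes the Excel file has already been loaded into a DataFrame, and it rebuilds the sample frame shown in the answer by hand. The names `empty_counts`, `sorted_counts`, and `mean_empty_cells` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample frame copied from the answer's "initial data frame" printout.
df = pd.DataFrame({
    "y_value": [282, 148, 351, 31, 33, 973, 329, 271, 814, 274, 425],
    "data_1": [1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    "data_2": [215, 250, 191, 32, 503, 651, 300, 543, 544, 349, nan],
    "data_3": [169, 307, 343, 327, 484, 134, 186, 18, 123, 209, nan],
    "data_4": [nan] * 11,
    "data_5": [14, 232, 189, 8, 85, 128, 195, nan, nan, nan, nan],
    "data_6": [147, 134, 9, 201, 166, nan, nan, nan, nan, nan, nan],
})

# Number of empty cells per column (a Series indexed by column name).
empty_counts = df.isnull().sum(axis=0)

# Columns ordered from fewest to most empty cells.
sorted_counts = empty_counts.sort_values(ascending=True)

# Smallest count, selected by position rather than by label.
lowest_value = sorted_counts.iloc[0]

# Mean number of empty cells per column.
mean_empty_cells = empty_counts.mean()

print(sorted_counts)
print("lowest:", lowest_value, "mean:", round(mean_empty_cells, 2))
```

On this sample the lowest count is 0 (shared by `y_value` and `data_1`) and the mean is 23/7, roughly 3.29 empty cells per column.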
My end goal is:
y value   data 1   data 2   data 3   data 5   data 6
282       1        215      169      14       147
148       0        250      307      232      134
351       1        191      343      189      9
31        0        32       327      8        201
33        0        503      484      85       166
I want to run a multiple linear regression analysis on the expected final data above.
Any help is much appreciated, thank you.
Here is one way to do it, using the dropna() function. First, we have the initial data frame:
print(df) # initial data frame
y_value data_1 data_2 data_3 data_4 data_5 data_6
0 282 1 215.0 169.0 NaN 14.0 147.0
1 148 0 250.0 307.0 NaN 232.0 134.0
2 351 1 191.0 343.0 NaN 189.0 9.0
3 31 0 32.0 327.0 NaN 8.0 201.0
4 33 0 503.0 484.0 NaN 85.0 166.0
5 973 0 651.0 134.0 NaN 128.0 NaN
6 329 0 300.0 186.0 NaN 195.0 NaN
7 271 1 543.0 18.0 NaN NaN NaN
8 814 1 544.0 123.0 NaN NaN NaN
9 274 1 349.0 209.0 NaN NaN NaN
10 425 1 NaN NaN NaN NaN NaN
Next, (a) drop a column if all of its values are NaN, and (b) drop a row if any of its values is NaN:
# un-comment the next line to transpose the data frame (e.g., based on user input / user confirmation)
# df = df.transpose()
# delete columns with all NaN
df = df.dropna(axis=1, how='all')
# delete rows with 1 or more NaN
df = df.dropna(axis=0, how='any')
print(df)
y_value data_1 data_2 data_3 data_5 data_6
0 282 1 215.0 169.0 14.0 147.0
1 148 0 250.0 307.0 232.0 134.0
2 351 1 191.0 343.0 189.0 9.0
3 31 0 32.0 327.0 8.0 201.0
4 33 0 503.0 484.0 85.0 166.0
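Once the frame is cleaned, the regression the question is aiming for could be sketched with NumPy's least-squares solver. This is only an illustration under stated assumptions: the helper `fit_linear_regression` is a hypothetical name introduced here, the cleaned frame is typed in by hand from the printout above, and ordinary least squares with an intercept is assumed to be the intended model.

```python
import numpy as np
import pandas as pd

# Cleaned frame copied from the printout above.
cleaned = pd.DataFrame({
    "y_value": [282, 148, 351, 31, 33],
    "data_1": [1, 0, 1, 0, 0],
    "data_2": [215.0, 250.0, 191.0, 32.0, 503.0],
    "data_3": [169.0, 307.0, 343.0, 327.0, 484.0],
    "data_5": [14.0, 232.0, 189.0, 8.0, 85.0],
    "data_6": [147.0, 134.0, 9.0, 201.0, 166.0],
})


def fit_linear_regression(frame, y_col):
    """Hypothetical helper: least-squares fit of y_col on all other columns."""
    y = frame[y_col].to_numpy(dtype=float)
    predictors = frame.drop(columns=[y_col])
    # Prepend a column of ones so the model includes an intercept.
    X = np.column_stack([np.ones(len(frame)), predictors.to_numpy(dtype=float)])
    coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    names = ["intercept"] + list(predictors.columns)
    return pd.Series(coef, index=names), rank


coefficients, rank = fit_linear_regression(cleaned, "y_value")
print(coefficients)
print("rank:", rank, "coefficients:", len(coefficients))
```

Note that this sample leaves five rows for six coefficients, so the system is underdetermined and `np.linalg.lstsq` returns the minimum-norm solution. A meaningful fit would need more complete rows than coefficients.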
# columns tied for the fewest empty cells are candidates for the y variable
y_variable_candidates = []
for col, n_empty in num_of_empty_cells_in_columns.items():
    if n_empty == lowest_value:
        y_variable_candidates.append(col)
# default to the first candidate until the user says otherwise
y_variable = y_variable_candidates[0]
y_variable_confirmation = input('currently your y variable is ' + str(y_variable) + '; it appears that there are several y variable candidates, such as ' + str(y_variable_candidates) + '. press enter if the current y variable is okay, otherwise press a number key to indicate which column should be the y variable: ')
#... more code later on
# columns with more empty cells than the mean count are treated as mostly empty
mean_empty_cells = num_of_empty_cells_in_columns.mean()
mostly_empty_columns = []
for col, n_empty in num_of_empty_cells_in_columns.items():
    if n_empty > mean_empty_cells:
        mostly_empty_columns.append(col)
#some code to get user to confirm to delete all the selected columns
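Both parts of the question (collecting column names by condition, then deleting them) can also be written without explicit loops. The following is a sketch under the same assumptions as the loops above. It uses boolean indexing on the per-column counts and `DataFrame.drop`, and rebuilds the answer's sample frame by hand. The names `counts` and `cleaned` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample frame copied from the answer's "initial data frame" printout.
df = pd.DataFrame({
    "y_value": [282, 148, 351, 31, 33, 973, 329, 271, 814, 274, 425],
    "data_1": [1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    "data_2": [215, 250, 191, 32, 503, 651, 300, 543, 544, 349, nan],
    "data_3": [169, 307, 343, 327, 484, 134, 186, 18, 123, 209, nan],
    "data_4": [nan] * 11,
    "data_5": [14, 232, 189, 8, 85, 128, 195, nan, nan, nan, nan],
    "data_6": [147, 134, 9, 201, 166, nan, nan, nan, nan, nan, nan],
})

# Empty cells per column, indexed by column name.
counts = df.isnull().sum(axis=0)

# Part 1: collect column names that satisfy a condition.
y_variable_candidates = counts[counts == counts.min()].index.tolist()
mostly_empty_columns = counts[counts > counts.mean()].index.tolist()

# Part 2: delete those columns from the frame (returns a new frame).
cleaned = df.drop(columns=mostly_empty_columns)

print(y_variable_candidates)
print(mostly_empty_columns)
print(list(cleaned.columns))
```

On this sample the mean rule flags `data_4`, `data_5`, and `data_6`, so it removes more than the all-NaN column that `dropna(axis=1, how='all')` drops. Which rule fits best depends on how much partially filled data should be kept.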