Python 3.x: I want to split data and get values by rows and columns
I want to split my dataset, rows and columns together, in an 80:20 ratio, where 80% is training data and 20% is test data. I can split off the 80% part, but not the 20% part.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
city_attributes = pd.read_csv('./input/city_attributes.csv')
humidity = pd.read_csv('./input/humidity.csv')
pressure = pd.read_csv('./input/pressure.csv')
temperature = pd.read_csv('./input/temperature.csv')
weather_description = pd.read_csv('./input/weather_description.csv')
wind_direction = pd.read_csv('./input/wind_direction.csv')
wind_speed = pd.read_csv('./input/wind_speed.csv')
# we can reshape these using pd.melt
humidity = pd.melt(humidity, id_vars = ['datetime'], value_name = 'humidity', var_name = 'City')
pressure = pd.melt(pressure, id_vars = ['datetime'], value_name = 'pressure', var_name = 'City')
temperature = pd.melt(temperature, id_vars = ['datetime'], value_name = 'temperature', var_name = 'City')
weather_description = pd.melt(weather_description, id_vars = ['datetime'], value_name = 'weather_description', var_name = 'City')
wind_direction = pd.melt(wind_direction, id_vars = ['datetime'], value_name = 'wind_direction', var_name = 'City')
wind_speed = pd.melt(wind_speed, id_vars = ['datetime'], value_name = 'wind_speed', var_name = 'City')
# combine all of the dataframes created above
weather = pd.concat([humidity, pressure, temperature, wind_direction, wind_speed, weather_description], axis = 1)
weather = weather.loc[:,~weather.columns.duplicated()] # indexing: every row, only the columns that aren't duplicates
# now we can merge this with the city attributes
weather = pd.merge(city_attributes,weather, on = 'City')
weather = weather.dropna()
first = pd.DataFrame()
rest = pd.DataFrame()
total_size = weather.shape[0]
train_size = 1277055
test_size = 319264
if len(weather) > train_size:
    first = weather[:1277055]
    rest = weather[319264:]
print(rest)
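Currently, your code
train_size = 1277055
test_size = 319264
if len(weather) > train_size:
    first = weather[:1277055]
    rest = weather[319264:]
defines rest as every row from position 319264 onward, while first is correctly defined as the first 1277055 rows. Perhaps what you want is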
train_size = 1277055
test_size = 319264
if len(weather) >= (train_size + test_size):
    first = weather.iloc[:train_size, :]
    rest = weather.iloc[train_size:(train_size + test_size), :]  # same as weather.iloc[1277055:1596319, :]
Or use sklearn's train_test_split:
train_size = 1277055
test_size = 319264
train_idx, test_idx = train_test_split(weather.index, train_size = train_size, test_size = test_size)
# train_test_split returns index labels here, so select with .loc rather than .iloc
df_train = weather.loc[train_idx, :]
df_test = weather.loc[test_idx, :]
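As a hedged sketch of the same idea: train_test_split also accepts the DataFrame itself, and test_size may be a fraction, so the row counts need not be hard-coded. The small random DataFrame toy below is only a stand-in for weather (an assumption for illustration), and random_state is set just to make the shuffle repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `weather`: 1000 rows, 4 columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 100, size=(1000, 4)), columns=list('ABCD'))

# Rows are split 80:20; every column is kept in both parts.
df_train, df_test = train_test_split(toy, test_size=0.2, random_state=0)

print(df_train.shape)  # (800, 4)
print(df_test.shape)   # (200, 4)
```

Passing shuffle=False should keep the original row order, so the first 80% of rows become the training set.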
Usage example:
In [1]: import numpy as np
...: import pandas as pd
...: train_size = 1277055
...: test_size = 319264
...: weather = pd.DataFrame(np.random.randint(0,100,size=(train_size+test_size, 4)), columns=list('ABCD'))
...: print(weather.head())
    A   B   C   D
0  13  91  68  35
1  52  30  52  59
2  16  22  73  24
3  62  86  27  96
4  88  54  23   4
In [2]: if len(weather) >= (train_size + test_size):
   ...:     print('subsetting')
   ...:     first = weather.iloc[:train_size, :]
   ...:     rest = weather.iloc[train_size:(train_size + test_size), :]
   ...:
   ...: print(first.shape)
   ...: print(rest.shape)
   ...:
subsetting
(1277055, 4)
(319264, 4)
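The sizes in this answer are hard-coded. A minimal sketch, assuming the goal is an 80:20 split by row position, computes the split point from the length of the DataFrame instead. The names toy, train_fraction and split_at are illustrative and do not appear in the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `weather`: 10 rows, 4 columns.
toy = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list('ABCD'))

train_fraction = 0.8                       # 80:20 split
split_at = int(len(toy) * train_fraction)  # position of the first test row

first = toy.iloc[:split_at, :]  # rows 0 .. split_at - 1
rest = toy.iloc[split_at:, :]   # rows split_at .. end

print(first.shape)  # (8, 4)
print(rest.shape)   # (2, 4)
```

Because both slices use the same position, no row is skipped or counted twice.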
To split an array at position x, use
left = array[:x]
right = array[x:]
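with the same x in both slices, because x is a position, not a count.
What error or unexpected result do you get? You imported train_test_split but never use it. That function should do exactly what you need.
With train_test_split the data gets split by columns, not by rows; I have tested it.
For the first result you mentioned, I get the following output: Empty DataFrame Columns: [] Index: []
Can you tell me which of the two suggested versions you used? Also, did you insert the code into your original code, replacing the if len(weather)... block?
I used this block, and yes, I inserted the code into my original code; it may need some corrections: first = pd.DataFrame() rest = pd.DataFrame() if len(weather) > (train_size + test_size):
OK. As a side note, keep in mind that `first = pd.DataFrame() rest = pd.DataFrame()` is unnecessary, since subsetting the weather df returns a different df object. What is the total size of your weather df? Are you sure the weather df has at least (train_size + test_size) rows? What is the output of weather.shape?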
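One possible explanation for the empty DataFrame reported above (an assumption, since the actual weather.shape was never posted): if weather has no more than train_size + test_size rows, a strict > guard is false, the slicing is skipped, and first and rest keep their initial empty values. A sketch with a small stand-in DataFrame, where toy and the tiny sizes are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `weather` with exactly train_size + test_size rows.
train_size, test_size = 8, 2
toy = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list('ABCD'))

first = pd.DataFrame()
rest = pd.DataFrame()
if len(toy) > (train_size + test_size):  # 10 > 10 is False, so nothing is assigned
    first = toy.iloc[:train_size, :]
    rest = toy.iloc[train_size:train_size + test_size, :]

print(toy.shape)   # (10, 4)
print(rest.empty)  # True
```

Printing weather.shape before the guard shows directly whether the condition can be met.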