Python: sampling data into different ratios using custom logic
I have a DataFrame like the one shown below, and I want to sample the data so that each customer's data is split into three parts by order id: train (70%), validation (15%), and test (15%). Every customer should appear in all three parts. The number of order ids and items can differ from customer to customer.

DataFrame:
Customer Orderid item_name
A 1 orange
A 1 apple
A 1 banana
A 2 apple
A 2 carrot
A 3 orange
A 4 grape
A 4 watermelon
A 4 banana
B 1 pineapple
B 2 banana
B 3 papaya
B 3 Lime
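After sampling, all three datasets (train, validation, and test) should contain the same set of customers, and the items in validation and test should be a subset of the items in train.
Expected result: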
train: should contain all customers and all item_names (70% of complete data)
train:
customer item
A orange
A apple
A banana
A carrot
A grape
A watermelon
B pineapple
B banana
B papaya
B Lime
validation: should contain all customers, and item_names can be a subset of train (15% of complete data)
customer item
A orange
A apple
A banana
B pineapple
B banana
B papaya
B Lime
test: should contain all customers, and item_names can be a subset of train (15% of complete data)
Customer item
A carrot
A grape
A watermelon
B papaya
B Lime
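The two requirements stated above (same customers in every split, and validation/test items being a subset of train) can be written down as a small check. This is only a sketch: the helper names `check_split` and `frame` are hypothetical, and the frames simply restate the expected result listed above with columns named "Customer" and "item".

```python
import pandas as pd


def check_split(train, validation, test):
    """Return (same_customers, items_are_subset) for three split frames."""
    customers = [set(part["Customer"]) for part in (train, validation, test)]
    same_customers = customers[0] == customers[1] == customers[2]

    def pairs(part):
        # (customer, item) combinations present in one split
        return set(zip(part["Customer"], part["item"]))

    items_are_subset = (pairs(validation) <= pairs(train)
                        and pairs(test) <= pairs(train))
    return same_customers, items_are_subset


def frame(rows):
    # Hypothetical convenience wrapper for building the small example frames
    return pd.DataFrame(rows, columns=["Customer", "item"])


train = frame([("A", "orange"), ("A", "apple"), ("A", "banana"), ("A", "carrot"),
               ("A", "grape"), ("A", "watermelon"), ("B", "pineapple"),
               ("B", "banana"), ("B", "papaya"), ("B", "Lime")])
validation = frame([("A", "orange"), ("A", "apple"), ("A", "banana"),
                    ("B", "pineapple"), ("B", "banana"), ("B", "papaya"),
                    ("B", "Lime")])
test = frame([("A", "carrot"), ("A", "grape"), ("A", "watermelon"),
              ("B", "papaya"), ("B", "Lime")])

print(check_split(train, validation, test))
```

Running the same check on the output of any splitting approach is a cheap way to confirm that it meets both conditions.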
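As @Parth mentioned in the comments, you first need a dataset that satisfies the conditions for this kind of stratified split. You can then create a new column that combines "Customer" and "item_name" and pass it as the "stratify" parameter of the "train_test_split" method, which is part of sklearn. You can find an example below.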
import pandas as pd
from sklearn.model_selection import train_test_split
# Create sample data
data = {
    "Customer": ["A", "A", "A", "A", "A", "A", "A", "A", "A",
                 "B", "B", "B", "B", "B", "B", "B", "B", "B"],
    "Orderid": [1, 1, 1, 2, 2, 2, 2, 3, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2],
    "item_name": [
        "orange", "apple", "orange", "apple", "orange", "apple",
        "orange", "apple", "orange", "apple", "orange", "apple",
        "orange", "apple", "orange", "apple", "orange", "apple",
    ],
}
# Convert data to dataframe
df = pd.DataFrame(data)
# Create a new column combining "Customer" and "item_name" to pass as the "stratify" parameter
# of the train_test_split method, which is part of "sklearn.model_selection"
df["CustAndItem"] = df["Customer"]+"_"+df["item_name"]
# First split the data into "train" and "test" sets. In this example, 40% of the data goes to "test"
# and 60% of the data goes to "train"
X_train, X_test, y_train, y_test = train_test_split(df.index,
                                                    df["CustAndItem"],
                                                    test_size=0.4,
                                                    stratify=df["CustAndItem"])
# Get actual data after split operation
df_train = df.loc[X_train].copy(True)
df_test = df.loc[X_test].copy(True)
# Now split the "test" set into "validation" and "test" sets. In this example they are split equally
# (test_size=0.5), so each one contains 20% of the original data.
X_validate, X_test, y_validate, y_test = train_test_split(df_test.index,
                                                          df_test["CustAndItem"],
                                                          test_size=0.5,
                                                          stratify=df_test["CustAndItem"])
# Get actual data after split
df_validate = df_test.loc[X_validate]
df_test = df_test.loc[X_test]
# Print results
print(df_train)
print(df_validate)
print(df_test)
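Before running a stratified split like the one above, it can help to check whether the dataset actually meets the condition, that is, whether every stratify label has enough rows. The sketch below counts rows per "CustAndItem" label; the helper name `rare_strata`, the `min_count` threshold, and the toy frame are assumptions made for illustration. A single stratified split in sklearn needs at least 2 rows per label, and a two-stage split into three sets needs more than that.

```python
import pandas as pd


def rare_strata(labels: pd.Series, min_count: int = 2) -> pd.Series:
    """Return the labels that occur fewer than min_count times, with their counts."""
    counts = labels.value_counts()
    return counts[counts < min_count]


# Small hypothetical frame: customer B has only one row per item
df = pd.DataFrame({
    "Customer": ["A", "A", "A", "A", "B", "B"],
    "item_name": ["orange", "orange", "apple", "apple", "pineapple", "banana"],
})
df["CustAndItem"] = df["Customer"] + "_" + df["item_name"]

# Labels listed here would make a stratified train_test_split fail
too_small = rare_strata(df["CustAndItem"])
print(too_small)
```

Any label reported by this check has to be handled separately (for example by adding rows or by keeping it in train only) before stratifying on that column.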
For example, customer A and item orange have only 2 entries. In that case it is impossible to split them into 3 buckets. It would be very helpful if you could post sample expected data for the 3 buckets.

@parth, I have modified the question. Any input on the above?

@Serdar ERİİÇ's answer seems to be the simplest way to achieve what you need. However, it will fail if some (customer, item) combination has very few examples. If you know that this is not the case in your actual data, you can go ahead; otherwise you need to write custom code that samples each (customer, item) combination randomly.

Thanks for the reply. If I reduce the test size from 0.4 to 0.3, I get the following error. ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2. May I know why?

This is because of the size of the data. For example, imagine a DataFrame with 5 rows. Two of them share one label, another two share a different label, and the last one has a label of its own. If you want to split this data evenly into 2 DataFrames based on the label, it is not possible, because one of the labels has only a single row. You are probably getting this error when splitting the validation and test sets. So you can print df_test and look at the rows with unique CustAndItem values. Then you can append more rows to balance the data.
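The comments above suggest custom code for cases where stratifying fails, and the question asks for a split by order id within each customer. The following is a minimal sketch of one such approach, not a definitive implementation: the function name `split_orders_per_customer`, its parameters, and the rule "at least one order per split" are assumptions. It shuffles each customer's order ids and assigns whole orders to train, validation, or test.

```python
import numpy as np
import pandas as pd


def split_orders_per_customer(df, val_frac=0.15, test_frac=0.15, seed=0):
    """Label every row 'train', 'validation' or 'test', keeping orders intact.

    Assumption: each customer needs at least 3 distinct order ids so that
    every split can receive at least one order.
    """
    rng = np.random.default_rng(seed)
    parts = []
    for customer, group in df.groupby("Customer"):
        orders = rng.permutation(group["Orderid"].unique())
        if len(orders) < 3:
            raise ValueError(f"customer {customer!r} has fewer than 3 orders")
        # At least one order each for validation and test
        n_val = max(1, round(val_frac * len(orders)))
        n_test = max(1, round(test_frac * len(orders)))
        if n_val + n_test >= len(orders):
            raise ValueError(f"no orders left for train for customer {customer!r}")
        labels = np.full(len(orders), "train", dtype=object)
        labels[:n_val] = "validation"
        labels[n_val:n_val + n_test] = "test"
        lookup = pd.Series(labels, index=orders)  # order id -> split name
        parts.append(group.assign(split=group["Orderid"].map(lookup)))
    return pd.concat(parts)


# Data from the question
df = pd.DataFrame({
    "Customer": ["A"] * 9 + ["B"] * 4,
    "Orderid": [1, 1, 1, 2, 2, 3, 4, 4, 4, 1, 2, 3, 3],
    "item_name": ["orange", "apple", "banana", "apple", "carrot", "orange",
                  "grape", "watermelon", "banana", "pineapple", "banana",
                  "papaya", "Lime"],
})

labeled = split_orders_per_customer(df)
df_train = labeled[labeled["split"] == "train"]
df_validate = labeled[labeled["split"] == "validation"]
df_test = labeled[labeled["split"] == "test"]
print(labeled)
```

With only a handful of orders per customer, the 70/15/15 ratios can only be approximated. This sketch also does not guarantee that the items in validation and test are a subset of the items in train; that would need an extra filtering or re-sampling step.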