Hadoop Pig script sampling 10 chunks of training data gets stuck
Background

I have a binary classification task where the data is highly imbalanced. Specifically, there is far more data with label 0 than with label 1. To address this, I plan to subsample the label-0 data so that its size roughly matches that of the label-1 data. I wrote this as a Pig script. Instead of sampling only one chunk of training data, I do this 10 times, generating 10 data chunks to train 10 classifiers, similar to bagging, in order to reduce variance.

Sample Pig script
---------------------------------
-- generate training chunk i
---------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
-- join two tables to get all the features back from table 'dataFeatures'
trainingChunkiFeatures = JOIN trainingChunkiRaw BY id, dataFeatures BY id;
-- in order to shuffle data, I give a random number to each data
trainingChunki = FOREACH trainingChunkiFeatures GENERATE
    trainingChunkiRaw::id AS id,
    trainingChunkiRaw::label AS label,
    dataFeatures::features AS features,
    RANDOM() AS r;
-- shuffle the data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
-- store this chunk of data into s3
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
    id AS id,
    label AS label,
    features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
In my real Pig script, I do this 10 times to generate the 10 data chunks.
Problem
The problem I ran into is that when I generate 10 data chunks, there are far too many mapper/reducer tasks, more than 10K. Most of the mappers do very little work (running for less than 1 minute). At some point, the whole Pig script gets stuck: only one mapper/reducer task can run, and all the others are blocked.
What I tried

To figure out which part made the script get stuck, I simplified it. The first version only samples, unions, and stores:
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
STORE trainingChunkiRaw INTO '$training_data_i_s3_path' USING PigStorage(',');
The second version adds the random shuffle back:

--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
trainingChunki = FOREACH trainingChunkiRaw GENERATE
    id,
    label,
    features,
    RANDOM() AS r;
-- shuffle data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
    id AS id,
    label AS label,
    features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
After a while, I think I figured it out. The problem is probably that there are multiple STORE statements. By default, a Pig script runs in batch (multiquery) mode, so one job runs for each data chunk, and together they exhaust resources such as mapper and reducer slots. None of the jobs can finish, because each of them is waiting for more mapper/reducer slots.
Solution

- Run pig -M -f pig_script.pig. With -M, Pig executes one statement at a time without any multiquery optimization. This may not be ideal, since no optimization is performed, but it was acceptable for me.
- Use EXEC to force a certain execution order, which is very useful in this case.
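For the second option, here is a minimal sketch of how EXEC could be placed (a hedged illustration reusing the alias and path names from the question, not the author's exact script). EXEC forces all statements defined so far to run to completion before Pig moves on, so only one chunk's jobs occupy mapper/reducer slots at a time:

```pig
-- generate and store chunk 1
labelZeroTrainingDataChunk1 = SAMPLE labelZeroTrainingData $RATIO;
trainingChunk1Raw = UNION labelZeroTrainingDataChunk1, labelOneTrainingData;
STORE trainingChunk1Raw INTO '$training_data_1_s3_path' USING PigStorage(',');

-- run everything above before continuing
EXEC;

-- generate and store chunk 2
labelZeroTrainingDataChunk2 = SAMPLE labelZeroTrainingData $RATIO;
trainingChunk2Raw = UNION labelZeroTrainingDataChunk2, labelOneTrainingData;
STORE trainingChunk2Raw INTO '$training_data_2_s3_path' USING PigStorage(',');

EXEC;

-- ... repeat for chunks 3 through 10
```

The trade-off is the same as with -M: chunks are produced sequentially instead of in parallel, so the script takes longer overall but no longer deadlocks on slots.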