Hadoop: ORDER BY job in Pig script fails when run via Java EmbeddedPig


I have the following Pig script, which works perfectly from the Grunt shell (it stores the results into HDFS without any problem); however, if I run the same script through Java EmbeddedPig, the last job (the ORDER BY) fails. If I replace the ORDER BY job with something else, such as GROUP or FOREACH ... GENERATE, the whole script succeeds under Java EmbeddedPig. So I believe it is the ORDER BY that causes the problem. Has anyone run into this before? Any help would be much appreciated.

The Pig script:

    REGISTER pig-udf-0.0.1-SNAPSHOT.jar;
    user_similarity = LOAD '/tmp/sample-sim-score-results-31/part-r-00000' USING PigStorage('\t') AS (user_id: chararray, sim_user_id: chararray, basic_sim_score: float, alt_sim_score: float);
    simplified_user_similarity = FOREACH user_similarity GENERATE $0 AS user_id, $1 AS sim_user_id, $2 AS sim_score;
    grouped_user_similarity = GROUP simplified_user_similarity BY user_id;
    ordered_user_similarity = FOREACH grouped_user_similarity {                           
        sorted = ORDER simplified_user_similarity BY sim_score DESC;
        top    = LIMIT sorted 10;
        GENERATE group, top;
    };
    top_influencers = FOREACH ordered_user_similarity GENERATE com.aol.grapevine.similarity.pig.udf.AssignPointsToTopInfluencer($1, 10);
    all_influence_scores = FOREACH top_influencers GENERATE FLATTEN($0);
    grouped_influence_scores = GROUP all_influence_scores BY bag_of_topSimUserTuples::user_id;
    influence_scores = FOREACH grouped_influence_scores GENERATE group AS user_id, SUM(all_influence_scores.bag_of_topSimUserTuples::points) AS influence_score;
    ordered_influence_scores = ORDER influence_scores BY influence_score DESC;
    STORE ordered_influence_scores INTO '/tmp/cc-test-results-1' USING PigStorage();
The error log from Pig:

12/04/05 10:00:56 INFO pigstats.ScriptState: Pig script settings are added to the job
12/04/05 10:00:56 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
12/04/05 10:00:58 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
12/04/05 10:00:58 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
12/04/05 10:00:58 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
12/04/05 10:00:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/04/05 10:00:58 INFO input.FileInputFormat: Total input paths to process : 1
12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths to process : 1
12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths (combined) to process : 1
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating tmp-1546565755 in /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134-work-6955502337234509704 with rwxr-xr-x
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755
12/04/05 10:00:58 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir.
12/04/05 10:00:58 INFO mapred.TaskRunner: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/pigsample_854728855_1333645258470
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.jar.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.jar.crc
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.split.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.split.crc
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.splitmetainfo.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.splitmetainfo.crc
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.xml.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.xml.crc
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.jar <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.jar
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.split <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.split
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.splitmetainfo <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.splitmetainfo
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.xml <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.xml
12/04/05 10:00:59 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
12/04/05 10:00:59 INFO mapred.MapTask: io.sort.mb = 100
12/04/05 10:00:59 INFO mapred.MapTask: data buffer = 79691776/99614720
12/04/05 10:00:59 INFO mapred.MapTask: record buffer = 262144/327680
12/04/05 10:00:59 WARN mapred.LocalJobRunner: job_local_0004
java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:139)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:560)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigFileInputFormat.listStatus(PigFileInputFormat.java:37)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
    at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:153)
    at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:112)
    ... 6 more
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Deleted path /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755
12/04/05 10:00:59 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0004
12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: job job_local_0004 has failed! Stop running all dependent jobs
12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: 100% complete
12/04/05 10:01:04 ERROR pigstats.PigStatsUtil: 1 map reduce job(s) failed!
12/04/05 10:01:04 INFO pigstats.PigStats: Script Statistics:

HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
0.20.2-cdh3u3    0.8.1-cdh3u3    cchuang    2012-04-05 10:00:34    2012-04-05 10:01:04    GROUP_BY,ORDER_BY

Some jobs have failed! Stop running all dependent jobs

Job Stats (time in seconds):
JobId    Maps    Reduces    MaxMapTime    MinMapTIme    AvgMapTime    MaxReduceTime    MinReduceTime    AvgReduceTime    Alias    Feature    Outputs
job_local_0001    0    0    0    0    0    0    0    0    all_influence_scores,grouped_user_similarity,simplified_user_similarity,user_similarity    GROUP_BY   
job_local_0002    0    0    0    0    0    0    0    0    grouped_influence_scores,influence_scores    GROUP_BY,COMBINER   
job_local_0003    0    0    0    0    0    0    0    0    ordered_influence_scores    SAMPLER   

Failed Jobs:
JobId    Alias    Feature    Message    Outputs
job_local_0004    ordered_influence_scores    ORDER_BY    Message: Job failed! Error - NA    /tmp/cc-test-results-1,

Input(s):
Successfully read 0 records from: "/tmp/sample-sim-score-results-31/part-r-00000"

Output(s):
Failed to produce result in "/tmp/cc-test-results-1"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_local_0001    ->    job_local_0002,
job_local_0002    ->    job_local_0003,
job_local_0003    ->    job_local_0004,
job_local_0004


12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: Some jobs have failed! Stop running all dependent jobs

Make sure the PIG_HOME environment variable is set to your Pig installation.
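If you go that route, it may help to fail fast in the embedding code before the job even starts. The sketch below is illustrative only, not part of the Pig API (the `resolvePigHome` helper and its messages are made up for this example): it just validates `PIG_HOME` before you would construct the embedded Pig server.

```java
import java.io.File;

public class PigHomeCheck {
    // Illustrative helper (not a Pig API): validate a PIG_HOME value
    // before launching embedded Pig. Returns the directory if usable,
    // otherwise throws with a descriptive message.
    static File resolvePigHome(String value) {
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException(
                "PIG_HOME is not set; embedded Pig needs it to locate its installation");
        }
        File dir = new File(value);
        if (!dir.isDirectory()) {
            throw new IllegalStateException(
                "PIG_HOME does not point at a directory: " + value);
        }
        return dir;
    }

    public static void main(String[] args) {
        String env = System.getenv("PIG_HOME");
        if (env == null) {
            System.out.println("PIG_HOME is not set; the embedded run would likely fail");
        } else {
            System.out.println("Using Pig installation at " + resolvePigHome(env));
        }
        // ... only then construct the embedded Pig server and run the script ...
    }
}
```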

Chuang, I ran into the same problem on a much simpler query, with the same error. Did you ever find a solution, or the root cause of the problem? Thanks.

I'm hitting the same problem. These two lines must be related to the error, but I can't figure out how to fix it :-(
--
12/04/05 10:00:58 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir.
--
Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470
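For what it's worth, those two lines do point at the likely mechanism: Pig compiles ORDER BY into an extra sampling job whose output (the pigsample_* file) the final sort job reads back through the distributed cache. LocalJobRunner cannot symlink that file into the working directory, so when the file is then referenced by its bare name, the JVM resolves it against the current working directory and does not find it there, which matches the file:/Users/... path in the error. A minimal sketch of that resolution rule (plain Java, no Pig dependency; the file name is copied from the log above):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class RelativeLookup {
    // Resolve a bare file name the way a local job would:
    // against the JVM's current working directory (user.dir).
    static Path resolveAgainstCwd(String bareName) {
        return Paths.get("").toAbsolutePath().resolve(bareName);
    }

    public static void main(String[] args) {
        // The sampler output is referenced by bare name, so a local
        // runner looks for it under the working directory, not in HDFS.
        Path p = resolveAgainstCwd("pigsample_854728855_1333645258470");
        System.out.println(p);
    }
}
```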