
Hadoop Pig: how to load the output of hdfs ls into an alias?


I'm trying to look at the files in HDFS and evaluate which ones are older than a certain date. I'd like to do an hdfs ls and pass its output into a Pig LOAD command.

An answer by @DonaldMiner includes a shell script that outputs file names; I borrowed that approach to pass in a list of file names. However, I don't want to load the contents of the files; I only want to load the output of the ls command and treat the file names as text.

Here is myfirstscript.pig:

test = LOAD '$files' as (moddate:chararray, modtime:chararray, filename:chararray);

illustrate test;
I call it like this:

pig -p files="`./filesysoutput.sh`" myfirstscript.pig 
where filesysoutput.sh contains:

hadoop fs -ls -R /hbase/imagestore | grep '\-rw' | awk 'BEGIN { FS = ",[ \t]*|[ \t]+" } {print $6, $7, $8}' | tr '\n' ','
This generates output like:

2012-07-27 17:56 /hbase/imagestore/.tableinfo.0000000001,2012-04-23 19:27 /hbase/imagestore/08e36507d743367e1de57c359360b64c/.regioninfo,2012-05-10 12:13 /hbase/imagestore/08e36507d743367e1de57c359360b64c/0/7818124910159371133,2012-05-10 15:01 /hbase/imagestore/08e36507d743367e1de57c359360b64c/1/5537238047267916113,2012-05-09 19:40 /hbase/imagestore/08e36507d743367e1de57c359360b64c/2/6836317764645542272,2012-05-10 07:04 /hbase/imagestore/08e36507d743367e1de57c359360b64c/3/7276147895747401630,...
Since I only need the date, time, and file name, those are the only fields I include in the output that feeds the Pig script. When I try to run this, it is clearly trying to load the actual files into the test alias:

 $ pig -p files="`./filesysoutput.sh`" myfirstscript.pig 
2013-05-29 17:40:10.773 java[50830:1203] Unable to load realm info from SCDynamicStore
2013-05-29 17:40:10.827 java[50830:1203] Unable to load realm info from SCDynamicStore
2013-05-29 17:40:20,570 [main] INFO  org.apache.pig.Main - Logging error messages to: /Users/username/Environment/pig-0.9.2-cdh4.0.1/scripts/test/pig_1369863620569.log
2013-05-29 17:40:20,769 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://stage-hadoop101.cluster:8020
2013-05-29 17:40:20,771 [main] WARN  org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
2013-05-29 17:40:20,773 [main] WARN  org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2013-05-29 17:40:20.836 java[50847:1203] Unable to load realm info from SCDynamicStore
2013-05-29 17:40:20.879 java[50847:1203] Unable to load realm info from SCDynamicStore
2013-05-29 17:40:21,138 [main] WARN  org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2013-05-29 17:40:21,452 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: 
<file myfirstscript.pig, line 3, column 7> pig script failed to validate: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2012-07-27 17:56%20/hbase/imagestore/.tableinfo.0000000001
Details at logfile: /Users/username/Environment/pig-0.9.2-cdh4.0.1/scripts/test/pig_1369863620569.log
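
The parse failure makes sense once you look at what the parameter substitution produces. Pig treats the LOAD location string as a comma-separated list of paths, so each "date time path" chunk has to parse as a URI, and the embedded spaces break that (note the %20 in the error message). Schematically, the script that actually runs after substitution is:

test = LOAD '2012-07-27 17:56 /hbase/imagestore/.tableinfo.0000000001,2012-04-23 19:27 /hbase/imagestore/08e36507d743367e1de57c359360b64c/.regioninfo,...' as (moddate:chararray, modtime:chararray, filename:chararray);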

You could try another approach: use a dummy.txt input file (with just a single line in it), and then use a STREAM alias THROUGH command to process the output of hadoop fs -ls, much as you're doing currently:

grunt> dummy = load '/tmp/dummy.txt';   
grunt> fs -cat /tmp/dummy.txt;
dummy
grunt> files = STREAM dummy THROUGH 
    `hadoop fs -ls -R /hbase/imagestore | grep '\-rw' | awk 'BEGIN { OFS="\t"; FS = ",[ \t]*|[ \t]+" } {print $6, $7, $8}'` 
    AS (moddate:chararray, modtime:chararray, filename:chararray);
Note: the above is untested. I mocked up something similar in local-mode Pig and it worked (note: I added an OFS setting to the awk and had to change the grep slightly); see the local-mode simulation after the comments below.


What about using embedded Pig, based on Java or Python?


That's a really cool idea, but after repeated attempts I don't think it works. I was able to get it to run without failing; however, although it claims to have "Successfully stored records in: 'file:/tmp/temp168676408/tmp-77338624'", that location doesn't exist on my system. I get the same result in both local and mapreduce mode. I guess I'll have to try creating my own external script with Perl.

I hadn't understood that whatever is in dummy.txt gets streamed through the script (or through whatever is in the backticks); I had an empty dummy.txt file, so there was nothing to stream through the script, and therefore nothing from the script got assigned to the alias. I did see your file contents, but I didn't understand why they mattered, so I didn't include them. Once I put a line in, I got the contents back from the script. Hope that makes sense. Thanks @ChrisWhite, a winnar is you!

Please help!

Sorry, I haven't used Pig in three years, so I don't know how much help I can be.
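To make the fix from that comment thread concrete, here is a minimal sketch of seeding the one-line dummy file (the paths are illustrative; in mapreduce mode the file needs to be in HDFS, while local mode reads it from the local filesystem):

# STREAM only runs the backticked command against input records,
# so dummy.txt must contain at least one line or the alias stays empty
echo "dummy" > dummy.txt
hadoop fs -put dummy.txt /tmp/dummy.txt
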
grunt> files = STREAM dummy THROUGH
    `ls -l | grep "\\-rw" | awk 'BEGIN { OFS = "\t"; FS = ",[ \t]*|[ \t]+" } {print $6, $7, $9}'`
     AS (month:chararray, day:chararray, file:chararray);

grunt> dump files;

(Dec,31,CTX.DAT)
(Dec,23,examples.desktop)
(Feb,8,installer.img.gz)
(Feb,8,install.py)
(Apr,25,mapred-site.xml)
(Apr,14,sqlite)
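
With the real HDFS listing streamed in, a simple FILTER would finish the original task of finding files older than a given date. A minimal sketch, assuming the files alias from the answer above and the yyyy-MM-dd dates that hadoop fs -ls prints (for that format, plain chararray comparison matches chronological order; the cutoff date is illustrative):

-- keep only files last modified before the cutoff date
old_files = FILTER files BY moddate < '2012-06-01';
DUMP old_files;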