Loading a text file into Pig through a Python UDF


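I am trying to load a file into Pig while using a Python UDF, and I have tried two approaches:

• (myudf1, sample1.pig): read the file from within Python; the file is located on my client server.

• (myudf2, sample2.pig): first load the file from HDFS in the grunt shell, then pass it to the Python UDF as an argument.

myudf1.py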

from __future__ import with_statement
def get_words(dir):
    stopwords=set()
    with open(dir) as f1:
        for line1 in f1:
            stopwords.update([line1.decode('ascii','ignore').split("\n")[0]])
    return stopwords

stopwords=get_words("/home/zhge/uwc/mappings/english_stop.txt")

@outputSchema("findit: int")
def findit(stp):
    stp=str(stp)
    if stp in stopwords:
        return 1
    else:
        return 0
sample1.pig:

REGISTER '/home/zhge/uwc/scripts/myudf1.py' USING jython as pyudf;
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',')  AS (title:chararray);

T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(title);
DUMP S
I get: IOError: (2, 'No such file or directory', '/home/zhge/uwc/mappings/english_stop.txt')

For approach 2:

myudf2:

def get_wordlists(wordbag):
    stopwords=set()
    for t in wordbag:
        stopwords.update(t.decode('ascii','ignore'))
    return stopwords


@outputSchema("findit: int")
def findit(stopwordbag, stp):
    stopwords=get_wordlists(stopwordbag)
    stp=str(stp)
    if stp in stopwords:
        return 1
    else:
        return 0
sample2.pig:

REGISTER '/home/zhge/uwc/scripts/myudf2.py' USING jython as pyudf;

stops = load '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
-- this step works fine and I can see the "stops" object is loaded into Pig
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',')  AS (title:chararray);
T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(stops.stop_w, title);
DUMP S;
Then I get:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias S. Backend error : Scalar has more than one row in the output. 1st : (a), 2nd :(as

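The second example should work, although you are applying LIMIT to the wrong relation: it should be on the stops relation. A scalar projection such as stops.stop_w requires the relation to contain exactly one row, which is what the "Scalar has more than one row" error is complaining about. So it should be: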

stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);

item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = LIMIT stops 1;
S = FOREACH item_title GENERATE pyudf.findit(T.stop_w, title);

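However, since it seems you need to process all of the stop words first, this is not enough. You will need to GROUP all of the stop words and then pass the result to your get_wordlists function: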

stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);

item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = FOREACH (GROUP stops ALL) GENERATE pyudf.get_wordlists(stops) AS ready;
S = FOREACH item_title GENERATE pyudf.findit(T.ready, title);

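You will have to update your UDF for this method to work: it must accept the bag, which arrives in Jython as a list of tuples rather than as a list of strings.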
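As a rough sketch of what the updated myudf2.py might look like (not tested on a cluster): it assumes each element of the bag is a one-field tuple holding a stop word, and that get_wordlists returns a bag so Pig can pass it on as T.ready. The schema string on get_wordlists and the _words helper are my own assumptions. The fallback outputSchema only exists so the file can also be imported outside Pig, where the decorator is not injected.

```python
try:
    outputSchema  # injected by Pig when the script is registered USING jython
except NameError:
    def outputSchema(schema):
        """Stand-in decorator so the module also imports outside Pig."""
        def wrap(func):
            return func
        return wrap


def _words(bag):
    """Hypothetical helper: collect the first field of every tuple into a set."""
    words = set()
    for t in bag or []:
        if t and t[0] is not None:
            words.add(t[0].strip())
    return words


# Assumed schema: a bag of one-field tuples, so the result stays a Pig bag.
@outputSchema("ready: {(stop_w: chararray)}")
def get_wordlists(wordbag):
    # De-duplicate the stop words and return them as a list of tuples.
    return [(w,) for w in sorted(_words(wordbag))]


@outputSchema("findit: int")
def findit(stopwordbag, stp):
    # stopwordbag is the bag produced by get_wordlists (T.ready in the script).
    if stp is None:
        return 0
    if stp.strip() in _words(stopwordbag):
        return 1
    return 0
```

Rebuilding the set on every call to findit is wasteful for a large stop-word list, but it keeps the sketch stateless.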
