
Splitting the input to the map function in Python Hadoop


This is my first implementation in Hadoop. I am trying to implement my algorithm for probabilistic datasets in Map Reduce. In my dataset, the last column holds an id (the number of unique ids in the dataset equals the number of nodes in the cluster). I have to partition the dataset based on this column's value, and each node in the cluster should process one set of records.

For example, if there are three nodes in the cluster, then for the dataset below one node should process all records with id=1, another the records with id=2, and another the records with id=3.

name time  dept  id
--------------------
 b1  2:00pm z1   1
 b2  3:00pm z2   2
 c1  4:00pm y2   1
 b3  3:00pm z3   3
 c4  4:00pm x2   2
My map function should take each split as input and process it in parallel in each node.

I am just trying to understand which approach is feasible in Hadoop: either feed this dataset as input to my map function and pass an additional argument to the map step to split the data based on the id value, or split the data beforehand into "n" (number of nodes) subsets and load those onto the nodes. If the latter is the right approach, how can I split the data based on the value and load it onto different nodes? From what I have read, Hadoop splits the data into blocks based on a specified size. How can I specify a particular condition while loading? Just to add: I am writing my program in Python.


Can someone please advise? Thanks.
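To make the target partitioning concrete, here is a minimal plain-Python sketch (not Hadoop code) of the grouping the question asks for: one bucket of records per id value, where each bucket would be handled by one node.

```python
from collections import defaultdict

# Sample records from the question; the last column is the id.
records = [
    ("b1", "2:00pm", "z1", 1),
    ("b2", "3:00pm", "z2", 2),
    ("c1", "4:00pm", "y2", 1),
    ("b3", "3:00pm", "z3", 3),
    ("c4", "4:00pm", "x2", 2),
]

# Bucket records by their id; in the cluster each bucket would go to one node.
buckets = defaultdict(list)
for rec in records:
    buckets[rec[-1]].append(rec)

for node_id in sorted(buckets):
    print(node_id, buckets[node_id])
```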

If I understood your question, the best approach is to load the dataset into a hive table, then write a UDF in python. After that, do something like:

select your_python_udf(name, time, dept, id) from table group by id;
This looks like a reduce phase, so you may need the following before launching the query:

set mapred.reduce.tasks=50;
How to create a custom UDF:


Probably the easiest thing for you is to have the mapper output the data with the id as the key. That guarantees that a single reducer will get all the records for a particular id, and you can then do your processing in the reducer phase.

For example,

Input data:

 b1  2:00pm z1   1
 b2  3:00pm z2   2
 c1  4:00pm y2   1
 b3  3:00pm z3   3
 c4  4:00pm x2   2

Mapper code:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    cols = line.split("\t")
    key = cols[-1]              # the id is in the last column
    print(key + "\t" + line)

Map output:

 1  b1  2:00pm z1   1
 2  b2  3:00pm z2   2
 1  c1  4:00pm y2   1
 3  b3  3:00pm z3   3
 2  c4  4:00pm x2   2

Reducer 1 input:

 1  b1  2:00pm z1   1
 1  c1  4:00pm y2   1

Reducer 2 input:

 2  b2  3:00pm z2   2
 2  c4  4:00pm x2   2

Reducer 3 input:

 3  b3  3:00pm z3   3

Reducer code:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    cols = line.split("\t")
    orig_line = "\t".join(cols[1:])   # drop the key the mapper prepended
    # do stuff...
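The map, shuffle/sort, and reduce steps above can be simulated locally to check that all records sharing an id end up at the same "reducer". This is a plain-Python sketch of the data flow, not the streaming job itself:

```python
from itertools import groupby

# Input lines, tab-separated, id in the last column (as in the example data).
lines = [
    "b1\t2:00pm\tz1\t1",
    "b2\t3:00pm\tz2\t2",
    "c1\t4:00pm\ty2\t1",
    "b3\t3:00pm\tz3\t3",
    "c4\t4:00pm\tx2\t2",
]

# Map step: emit (key, line) pairs with the id as the key.
mapped = [(line.split("\t")[-1], line) for line in lines]

# Shuffle/sort step: Hadoop sorts map output by key before the reducers run.
mapped.sort(key=lambda kv: kv[0])

# Reduce step: every record sharing a key is seen by exactly one "reducer".
reduced = {k: [v for _, v in group]
           for k, group in groupby(mapped, key=lambda kv: kv[0])}

print(reduced)
```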
Note that with this approach a single reducer may get more than one key, but the data will be sorted, and you can control the number of reducers with the mapred.reduce.tasks option.

EDIT: If you want to collect the data by key in your reducer, you can do something like this (not sure it will run as-is, but you get the idea).

If you aren't worried about running out of memory in the reducer step, you can simplify the code like this:

#!/usr/bin/env python
import sys
from collections import defaultdict

def process_data(key_id, data_list):
    # data_list has all the lines for key_id
    pass

all_data = defaultdict(list)
for line in sys.stdin:
    line = line.strip()
    cols = line.split("\t")
    key = cols[0]                     # the mapper put the id first
    orig_line = "\t".join(cols[1:])
    all_data[key].append(orig_line)

for key, data in all_data.items():
    process_data(key, data)
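The memory-friendlier variant mentioned above (process one key's lines as soon as the key changes, relying on the reducer input being sorted) can be sketched with itertools.groupby. The function names here are illustrative, not from the original answer:

```python
from itertools import groupby

def process_data(key_id, data_list):
    # Placeholder for the per-key algorithm.
    print(key_id, data_list)

def reduce_stream(lines):
    # Input must be sorted by key, which Hadoop guarantees between map and reduce.
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    results = {}
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        results[key] = [orig for _, orig in group]  # drop the key column
        process_data(key, results[key])
    return results

# In the real job this would be reduce_stream(sys.stdin); sample lines for illustration:
sample = [
    "1\tb1\t2:00pm\tz1\t1",
    "1\tc1\t4:00pm\ty2\t1",
    "2\tb2\t3:00pm\tz2\t2",
]
grouped = reduce_stream(sample)
```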

Hi, thanks for the explanation, but can you clarify how I can use this? This should be my map step, because I have to run my algorithm on each subset (grouped by id) in a separate node; everything should not run in the same node. The outputs from all these nodes would then be used in the reduce phase to produce the combined output. A single reducer should not get more than one key, because I want my algorithm to run on all the records belonging to one key at once; anything else would affect my output. After getting the outputs for all keys from my algorithm, I need to reduce them further by finding the minimal subset of the outputs. How can I do that?

Even if a reducer gets more than one key, you can handle that however you want. For example, collect data until you reach the next key (remember, the input is sorted) and then process the previous key's data. With this approach, to reduce further you would run another map/reduce job on the output, or just a python script over the output data (if it is small enough).

Cool. Can a single key be split across multiple reducers? I need all records of a single key at once to run my algorithm.

No, hadoop guarantees that all records for a single key will be processed by the same reducer.

That's great. Once I get the reducer's input, I only need "b1 2:00pm z1" out of the input line "1 b1 2:00pm z1" (I added the 1 to denote my dataset number). How do I handle that in python? I am not sure what format the reducer gets its input in and how to parse it.
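To answer that last comment concretely: each reducer input line is exactly what the mapper printed, i.e. the key, a tab, then the original line. Splitting once on the first tab recovers the original line, and dropping the last field removes the trailing id as well. A small sketch:

```python
# Each reducer input line looks like: <key>\t<original line>, as printed by the mapper.
line = "1\tb1\t2:00pm\tz1\t1"

key, orig_line = line.split("\t", 1)   # split once, on the first tab only
fields = orig_line.split("\t")

record = "\t".join(fields[:-1])        # also drop the trailing id column, if unwanted
print(key)      # "1"
print(record)   # tab-separated "b1  2:00pm  z1"
```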