MapReduce reducer with two keys (Python)

python hadoop mapreduce


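This should be simple, but I have already spent hours on it.

Sample data (name, binary, count):

Desired sample output (name, binary, count):

Each name needs its own binary key, 0 or 1. The count column is summed per binary key. Note the "reduce" in the desired output.

I have included some of my code. I am trying to do this in the reducer without using lists or dictionaries.
""" Reducer: combines each name with its binary key and sums the partial counts
Input: name\tbinary\tcount
Output: name\tbinary\tcount
"""
For some reason it does not print correctly (the first name that comes through is funky). I am also not sure of the best way to print all of the one counts and zero counts, with the binary label shown as well.

Any help would be appreciated. Thanks!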
I think it is best to use the pandas library.

    import pandas as pd
    from io import StringIO

    a = """Adam 0 1
    Adam 1 1
    Adam 0 1
    Mike 1 1
    Mike 0 1
    Mike 1 1"""

    text = StringIO(a)
    name, binary, count = [], [], []

    # split each line into its three space-separated fields
    for line in text.readlines():
        fields = line.strip().split(" ")
        name.append(fields[0])
        binary.append(fields[1])
        count.append(fields[2])

    df = pd.DataFrame({'name': name, "binary": binary, "count": count})
    df['count'] = df['count'].astype(int)

    # sum the counts for every (name, binary) pair
    df = df.groupby(['name', 'binary'])['count'].sum().reset_index()
    print(df)

Output:

       name binary  count
    0  Adam      0      2
    1  Adam      1      1
    2  Mike      0      1
    3  Mike      1      2

If your data already exists in a CSV or text file, it can be read with pandas:

df = pd.read_csv('path to your file')

Your indentation is off and the cases are not handled properly.

    import sys

    current_name = None
    zero_count, one_count = 0, 0
    i = 0

    for line in sys.stdin:
        # parse the input (strip the trailing newline first)
        name, binary, count = line.strip().split('\t')

        # first line only: there is no current_name yet
        if i == 0:
            current_name = name
            i = i + 1

        if name == current_name:
            if int(binary) == 0:
                zero_count += int(count)
            elif int(binary) == 1:
                one_count += int(count)
        else:
            # the name changed: print the totals for the previous name
            print(f'{current_name}\t{0}\t{zero_count}')
            print(f'{current_name}\t{1}\t{one_count}')
            current_name = name
            # reset the counters, then count the current line for the new name
            zero_count, one_count = 0, 0
            if int(binary) == 0:
                zero_count += int(count)
            elif int(binary) == 1:
                one_count += int(count)

    # print the totals for the last name (skipped if there was no input)
    if current_name is not None:
        print(f'{current_name}\t{0}\t{zero_count}')
        print(f'{current_name}\t{1}\t{one_count}')

"i" handles the first line of input, where there is no current_name yet (that branch runs only once).
In the else block, you have to reinitialize zero_count and one_count and then do the calculation for the new current_name.
Output of my code:

    Adam 0 2
    Adam 1 1
    Mike 0 1
    Mike 1 2
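
As an alternative, here is a minimal sketch of the same reducer logic built on itertools.groupby, which avoids the first-line special case and still keeps only two integer counters per name (no lists or dictionaries). It assumes the lines reach the reducer already sorted by name, which is the usual situation after the shuffle phase in Hadoop Streaming, and that every line has exactly three tab-separated fields. The function names parse and reduce_counts are hypothetical, and the sample rows are the ones used in the pandas answer, rewritten with tabs.

```python
from itertools import groupby


def parse(lines):
    """Yield (name, binary, count) tuples from tab-separated lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, binary, count = line.split("\t")
        yield name, int(binary), int(count)


def reduce_counts(lines):
    """Yield two output lines per name: the totals for binary 0 and binary 1.

    Assumption: the input is already grouped (sorted) by name.
    """
    for name, group in groupby(parse(lines), key=lambda rec: rec[0]):
        zero_count = one_count = 0
        for _, binary, count in group:
            if binary == 0:
                zero_count += count
            elif binary == 1:
                one_count += count
        yield f"{name}\t0\t{zero_count}"
        yield f"{name}\t1\t{one_count}"


# In-memory demonstration; the rows are sorted by name, not by binary.
sample = [
    "Adam\t0\t1\n", "Adam\t1\t1\n", "Adam\t0\t1\n",
    "Mike\t1\t1\n", "Mike\t0\t1\n", "Mike\t1\t1\n",
]
for out in reduce_counts(sample):
    print(out)
```

In an actual reducer script you would pass sys.stdin instead of sample, for example `for out in reduce_counts(sys.stdin): print(out)`. Because the grouping is on name only, the order of the 0 and 1 rows within a name does not matter.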

I think the OP wants to use Hadoop.