MapReduce reducer with two keys - Python
This should be simple, but I have already spent hours on it. My sample data and the desired sample output both have the columns (name, binary, count). Each name needs its own binary key, 0 or 1, and the count column should be summed per binary key. Note the "reduce" in the desired output.

I have included some of my code. I am trying to do this in the reducer without using lists or dictionaries.

"""
Reducer that combines each name with its binary key and partially sums the counts
Input: name \t binary \t count
Output: name \t binary \t count
"""

For some reason it does not print correctly (the first name that comes through is funky). I am also not sure of the best way to print all of the one counts and zero counts, with the binary label shown as well.

Any help would be appreciated. Thanks.

I think it is better to use the pandas library:
import pandas as pd
from io import StringIO
a ="""Adam 0 1
Adam 1 1
Adam 0 1
Mike 1 1
Mike 0 1
Mike 1 1"""
text = StringIO(a)
name, binary, count = [],[],[]
for line in text.readlines():
    # split each line into its three space-separated fields
    fields = line.strip().split(" ")
    name.append(fields[0])
    binary.append(fields[1])
    count.append(fields[2])
df = pd.DataFrame({'name': name, "binary": binary, "count": count})
df['count'] = df['count'].astype(int)
df = df.groupby(['name', 'binary'])['count'].sum().reset_index()
print(df)
   name binary  count
0  Adam      0      2
1  Adam      1      1
2  Mike      0      1
3  Mike      1      2
If your data already exists in a CSV or text file, it can be read with pandas:
df = pd.read_csv('path to your file')
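As a rough sketch of that file-based route, assuming the file is tab-separated and has no header row (both are assumptions, since the file layout is not shown), the column names can be passed to `read_csv` directly. The `StringIO` object below only stands in for a real file path.

```python
from io import StringIO

import pandas as pd

# Stand-in for a file on disk; a path string could be passed to read_csv instead
data = StringIO(
    "Adam\t0\t1\n"
    "Adam\t1\t1\n"
    "Adam\t0\t1\n"
    "Mike\t1\t1\n"
    "Mike\t0\t1\n"
    "Mike\t1\t1\n"
)

# Assumed layout: tab-separated, no header row, columns name/binary/count
df = pd.read_csv(data, sep="\t", header=None, names=["name", "binary", "count"])

# Sum the counts for every (name, binary) pair
totals = df.groupby(["name", "binary"], as_index=False)["count"].sum()
print(totals)
```

With `read_csv` the count column is parsed as integers, so the separate `astype(int)` step is not needed here.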
Your indentation is off and the cases are not handled properly:
import sys

current_name = None
zero_count, one_count = 0, 0
i = 0

for line in sys.stdin:
    # parse the input: name \t binary \t count
    name, binary, count = line.strip().split('\t')

    # first line only: there is no current_name yet
    if i == 0:
        current_name = name
        i = i + 1

    if name == current_name:
        # same name: keep adding to the running totals
        if int(binary) == 0:
            zero_count += int(count)
        elif int(binary) == 1:
            one_count += int(count)
    else:
        # new name: emit the totals of the previous name
        print(f'{current_name}\t0\t{zero_count}')
        print(f'{current_name}\t1\t{one_count}')
        current_name = name
        # reset the counters, then count the current line
        zero_count, one_count = 0, 0
        if int(binary) == 0:
            zero_count += int(count)
        elif int(binary) == 1:
            one_count += int(count)

# emit the totals of the last name (skipped if there was no input)
if current_name is not None:
    print(f'{current_name}\t0\t{zero_count}')
    print(f'{current_name}\t1\t{one_count}')
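The variable "i" handles the first line of input, when there is no "current_name" yet (that branch runs only once). In the else block you have to reinitialize "zero_count" and "one_count" and then do the counting for the new "current_name".

Output of my code: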
Adam 0 2
Adam 1 1
Mike 0 1
Mike 1 2
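A minimal sketch of the same idea, assuming the reducer receives lines already grouped by name (as the sort step before a streaming reducer would normally arrange them): `itertools.groupby` detects the change of name, so the first-line flag and the repeated counting branch are not needed, and no list or dictionary is built. The function names `parse` and `reduce_counts` are hypothetical, and the sample lines are the ones from the pandas answer.

```python
from itertools import groupby


def parse(lines):
    """Yield (name, binary, count) tuples from tab-separated lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, binary, count = line.split("\t")
        yield name, int(binary), int(count)


def reduce_counts(lines):
    """Yield two tab-separated output lines per name: the 0 total, then the 1 total.

    Assumption: all lines for one name are adjacent in the input.
    """
    for name, records in groupby(parse(lines), key=lambda rec: rec[0]):
        zero_count, one_count = 0, 0
        for _, binary, count in records:
            if binary == 0:
                zero_count += count
            elif binary == 1:
                one_count += count
        yield f"{name}\t0\t{zero_count}"
        yield f"{name}\t1\t{one_count}"


# In-memory sample standing in for the reducer's input stream
sample = [
    "Adam\t0\t1\n",
    "Adam\t1\t1\n",
    "Adam\t0\t1\n",
    "Mike\t1\t1\n",
    "Mike\t0\t1\n",
    "Mike\t1\t1\n",
]

for out in reduce_counts(sample):
    print(out)
```

In an actual reducer script, `sys.stdin` would be passed to `reduce_counts` in place of `sample`.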
I think the OP wants to use Hadoop.