Python: Hadoop MapReduce job on files containing HTML tags

python hadoop mapreduce

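I have a whole bunch of large HTML files, and I want to run a Hadoop MapReduce job on them to find the most frequently used words. I wrote both the mapper and the reducer in Python and run them with Hadoop streaming.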

Here is my mapper:

#!/usr/bin/env python

import sys
import re
import string

def remove_html_tags(in_text):
    '''
    Remove any HTML tags that are found.
    '''
    global flag
    in_text=in_text.lstrip()
    in_text=in_text.rstrip()
    in_text=in_text+"\n"

    if flag==True: 
        in_text="<"+in_text
        flag=False
    if re.search('^<',in_text)!=None and re.search('(>\n+)$', in_text)==None: 
        in_text=in_text+">"
        flag=True
    p = re.compile(r'<[^<]*?>')
    in_text=p.sub('', in_text)
    return in_text

# input comes from STDIN (standard input)
global flag
flag=False
for line in sys.stdin:
    # remove leading and trailing whitespace, set to lowercase and remove HTML tags
    line = line.strip().lower()
    line = remove_html_tags(line)
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
       # write the results to STDOUT (standard output);
       # what we output here will be the input for the
       # Reduce step, i.e. the input for reducer.py
       #
       # tab-delimited; the trivial word count is 1
       if word =='': continue
       for c in string.punctuation:
           word= word.replace(c,'')

       print '%s\t%s' % (word, 1)

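Whenever I pipe in a small sample string such as "hello world hello world ...", I get the correct output: a ranked list of words. However, when I take a small HTML file and use cat to pipe its contents into my mapper, I get the following error (input2 contains some HTML code):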

rohanbk@hadoop:~$ cat input2 | /home/rohanbk/mapper.py | sort | /home/rohanbk/reducer.py
Traceback (most recent call last):
  File "/home/rohanbk/reducer.py", line 15, in <module>
    word, count = line.split('\t', 1)
ValueError: need more than 1 value to unpack

Can anyone explain why I am getting this? Also, what is a good way to debug a MapReduce job?

You can reproduce the error with nothing more than this:

echo "hello - world" | ./mapper.py  | sort | ./reducer.py

The problem is here:

if word =='': continue
for c in string.punctuation:
    word= word.replace(c,'')

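If word is a single punctuation character, as it is for the input above (after splitting), it gets converted to an empty string, and the mapper then prints a record with an empty key. So just move the empty-string check to after the replacement.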
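The reordering described above could look roughly like the following. This is only a sketch: clean_words and map_line are hypothetical helper names that are not in the original mapper, and the code avoids the print statement so that it runs under both Python 2 and Python 3.

```python
import string

def clean_words(line):
    """Yield the words of a line with punctuation removed, skipping empties."""
    for word in line.split():
        for c in string.punctuation:
            word = word.replace(c, '')
        # Test for emptiness only after the punctuation has been stripped,
        # so a token such as "-" is dropped instead of being emitted as ''.
        if word == '':
            continue
        yield word

def map_line(line):
    """Return the tab-delimited 'word<TAB>1' records for one input line."""
    return ['%s\t%s' % (word, 1) for word in clean_words(line)]
```

With this ordering, the input "hello - world" should produce records only for hello and world, with no record for the lone dash.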

Is it safe to assume that if I use cat and get the desired output, the MapReduce steps will work?

For a more satisfying Python/Hadoop integration experience, you might consider using Dumbo.
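On the debugging side, one option is to mimic the sort and reduce stages of the cat | mapper | sort | reducer pipeline in plain Python, so that malformed mapper records show up before a job is submitted. The sketch below rests on assumptions: reducer.py is not shown in the question, so parse_record, reduce_counts and run_local are hypothetical stand-ins that assume 'word<TAB>count' records. Passing a local check like this does not by itself prove that the job will behave the same way on a cluster.

```python
from itertools import groupby

def parse_record(line):
    """Split one 'word<TAB>count' line; return None if it is malformed."""
    parts = line.rstrip('\n').split('\t', 1)
    if len(parts) != 2 or parts[0] == '' or not parts[1].isdigit():
        return None
    return parts[0], int(parts[1])

def reduce_counts(sorted_lines):
    """Sum the counts per word; the input lines must already be sorted."""
    records = (parse_record(line) for line in sorted_lines)
    # Skip malformed records (for example an empty key) instead of crashing.
    records = (r for r in records if r is not None)
    for word, group in groupby(records, key=lambda r: r[0]):
        yield word, sum(count for _, count in group)

def run_local(mapper_output):
    """Imitate the `sort | reducer` stages on a list of mapper output lines."""
    return list(reduce_counts(sorted(mapper_output)))
```

Feeding run_local a list that includes a bad record such as '\t1' shows whether the reduce logic tolerates it; here the record is silently dropped, though logging it to stderr might be more useful when hunting for mapper bugs.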