Hadoop program written in Python: reading a file with a generator


I am working through a tutorial to understand how to write Hadoop programs in Python.

This is mapper.py:

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print('%s%s%d' % (word, separator, 1))

if __name__ == "__main__":
    main()

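I don't understand the use of yield. read_input generates one line at a time. However, main calls read_input only once, which corresponds to the first line of the file. How do the remaining lines get read?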

Actually, the body of read_input is resumed many times, even though main calls read_input only once:

data = read_input(sys.stdin)
# Causes a generator to be assigned to data.
for words in data:

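On each pass through the for loop, data, the generator returned by read_input, is advanced to its next yield. The value it yields is assigned to words.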
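A minimal sketch of this behaviour, assuming io.StringIO as a stand-in for sys.stdin and a made-up two-line input (neither is part of the original mapper). Calling read_input only creates the generator; nothing is read until next() is called, and each next() runs the body up to the following yield.

```python
import io


def read_input(file):
    # Same generator as in mapper.py: one list of words per input line.
    for line in file:
        yield line.split()


# Assumption for illustration: an in-memory file replaces sys.stdin.
fake_stdin = io.StringIO("foo bar\nbaz\n")

data = read_input(fake_stdin)    # creates the generator; the body has not started
pos_before = fake_stdin.tell()   # still 0, no line has been consumed yet

first = next(data)               # runs the body up to the first yield
second = next(data)              # resumes after the yield and reads the next line

try:
    next(data)                   # no lines left: the generator finishes
    finished = False
except StopIteration:
    finished = True

print(first, second, finished)
```

The for loop in main performs these next() calls automatically and stops when StopIteration is raised.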

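Basically, for words in data is shorthand for "get the next value from data, assign it to words, then execute the loop body", repeated until data is exhausted.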
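To make the shorthand concrete, here is a rough sketch of the same loop written twice: once with for, and once with explicit iter() and next() calls. The helper names collect_with_for and collect_with_next and the sample text are hypothetical, introduced only for this comparison, and io.StringIO again stands in for sys.stdin.

```python
import io


def read_input(file):
    # Same generator as in mapper.py.
    for line in file:
        yield line.split()


def collect_with_for(file):
    # The form used in main(): a plain for loop over the generator.
    result = []
    for words in read_input(file):
        result.append(words)
    return result


def collect_with_next(file):
    # Roughly what the for loop does behind the scenes.
    result = []
    iterator = iter(read_input(file))   # obtained once, before the loop starts
    while True:
        try:
            words = next(iterator)      # one call per pass through the loop
        except StopIteration:           # raised when the generator is exhausted
            break
        result.append(words)
    return result


text = "a b\nc\n"
same = collect_with_for(io.StringIO(text)) == collect_with_next(io.StringIO(text))
print(same)
```

Both helpers read every line, which is why a single call to read_input is enough to process the whole file.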

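Take a look at this answer:

+1. Another way to think about it is for words in read_input(sys.stdin), where read_input(sys.stdin) behaves like a list that is created on the fly.

@mr2ert - Indeed, agreed.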
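The list analogy from the comments holds for a single pass, with one difference worth seeing: a generator produces its items lazily and can be consumed only once. The sketch below compares the generator with a hypothetical eager variant, read_input_as_list, which is not part of the original mapper; the input text is made up.

```python
import io


def read_input(file):
    # Lazy: yields one list of words per line, on demand.
    for line in file:
        yield line.split()


def read_input_as_list(file):
    # Hypothetical eager variant: reads every line before returning.
    return [line.split() for line in file]


text = "x y\nz\n"

eager = read_input_as_list(io.StringIO(text))
lazy = read_input(io.StringIO(text))

first_pass = list(lazy)    # consumes the generator completely
second_pass = list(lazy)   # the generator is exhausted, so this is empty

print(first_pass == eager)
print(second_pass)
```

For a mapper that streams large inputs from sys.stdin, the lazy form has the advantage of holding only one line in memory at a time.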