Python: how to iterate inside a flatMap function


I'm reading a text file with multi-line records into an RDD. The underlying data looks like this:

Time    MHist::852-YF-007   
2016-05-10 00:00:00 0
2016-05-09 23:59:00 0
2016-05-09 23:58:00 0
Time    MHist::852-YF-008   
2016-05-10 00:00:00 0
2016-05-09 23:59:00 0
2016-05-09 23:58:00 0
Now I want to transform the RDD so that I get a mapping of key to (timestamp, value) pairs. This could be done in several steps, but I want to extract the information in a single call (in Python 2.7, not Python 3).

The RDD looks like this:

[(0, u''),
 (12,
  u'852-YF-007\t\r\n2016-05-10 00:00:00\t0\r\n2016-05-09 23:59:00\t0\r\n2016-05-09 23:58:00\t0\r\n2016-05-09 23:57:00\t0\r\n2016-05-09 23:56:00\t0\r\n2016-05-09 23:55:00\t0\r\n2016-05-09 23:54:00\t0\r\n2016-05-09 23:53:00\t0\r\n2016-05-09 23:52:00\t0\r\n2016-05-09 23:51:00\t0\r\n2016-05-09 23:50:00\t0\r\n2016-05-09 23:49:00\t0\r\n2016-05-09 23:48:00\t0\r\n2016-05-09 23:47:00\t0\r\n2016-05-09 23:46:00\t0\r\n2016-05-09 23:45:00\t0\r\n2016-05-09 23:44:00\t0\r\n2016-05-09 23:43:00\t0\r\n2016-05-09 23:42:00\t0\n'),
 (473,
  u'852-YF-008\t\r\n2016-05-10 00:00:00\t0\r\n2016-05-09 23:59:00\t0\r\n2016-05-09 23:58:00\t0\r\n2016-05-09 23:57:00\t0\r\n2016-05-09 23:56:00\t0\r\n2016-05-09 23:55:00\t0\r\n2016-05-09 23:54:00\t0\r\n2016-05-09 23:53:00\t0\r\n2016-05-09 23:52:00\t0\r\n2016-05-09 23:51:00\t0\r\n2016-05-09 23:50:00\t0\r\n2016-05-09 23:49:00\t0\r\n2016-05-09 23:48:00\t0\r\n2016-05-09 23:47:00\t0\r\n2016-05-09 23:46:00\t0\r\n2016-05-09 23:45:00\t0\r\n2016-05-09 23:44:00\t0\r\n2016-05-09 23:43:00\t0\r\n2016-05-09 23:42:00\t0')]
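The leading (0, u'') pair in this dump is an artifact of the record delimiter: the file itself starts with the delimiter string. A minimal plain-Python sketch (no Spark; the sample text is abbreviated from the data above) of how splitting on 'Time\tMHist::' produces these chunks:

```python
# Minimal sketch: splitting the raw file text on the custom record
# delimiter 'Time\tMHist::' reproduces the chunks shown in the RDD dump.
raw = (
    "Time\tMHist::852-YF-007\t\r\n"
    "2016-05-10 00:00:00\t0\r\n"
    "2016-05-09 23:59:00\t0\r\n"
    "Time\tMHist::852-YF-008\t\r\n"
    "2016-05-10 00:00:00\t0\r\n"
)

records = raw.split("Time\tMHist::")

# The first chunk is empty because the file begins with the delimiter --
# exactly the (0, u'') pair seen above.
print(repr(records[0]))      # ''
print(records[1][:10])       # '852-YF-007'
print(records[2][:10])       # '852-YF-008'
```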
For each pair, the interesting part is the value (the content). Within that value, the first item is the key/name and the remaining items are the timestamped values. So I tried the following:

sheet = sc.newAPIHadoopFile(
    'sample.txt',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': 'Time\tMHist::'}
)

from operator import itemgetter

def process(pair):
    _, content = pair
    if not content: 
        pass

    lines = content.splitlines();
    #k = lines[0].strip()
    #vs =lines[1:]
    k, vs = itemgetter(0, slice(1, None), lines)
    #k, *vs = [x.strip() for x in content.splitlines()]  # Python 3 syntax

    for v in vs:
        try:
            ds, x = v.split("\t")
            yield k, (dateutil.parser.parse(ds), float(x))  # or int(x)
            return
        except ValueError:
            pass

sheet.flatMap(process).take(5)
But I get this error:

TypeError: 'operator.itemgetter' object is not iterable

The pair that enters the function holds the char position (which I can ignore) and the content. The content should be split at \r\n; the first item of the resulting line array is the key, while the remaining items become the key/(timestamp, value) pairs for flatMap.

So, what am I doing wrong in my process method?

Meanwhile, with the help of Stack Overflow and others, I came up with this solution. It works fine:

# reads a text file in TSV notation where the key/variable name is not a first column but
# a randomly occurring line followed by its values. Remark: a variable might occur in several files

#Time    MHist::852-YF-007   
#2016-05-10 00:00:00 0
#2016-05-09 23:59:00 0
#2016-05-09 23:58:00 0
#Time    MHist::852-YF-008   
#2016-05-10 00:00:00 0
#2016-05-09 23:59:00 0
#2016-05-09 23:58:00 0

#imports
from operator import itemgetter
from datetime import datetime

#read the text file with special record-delimiter --> all lines after Time\tMHist:: are the values for that variable
sheet = sc.newAPIHadoopFile(
    'sample.txt',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': 'Time\tMHist::'}
)

#this avoids multiple map/flatMap/mapValues/flatMapValues calls by extracting the values at once
def process_and_extract(pair):
    # first part is the char-position within the file, which we can ignore
    # second is the real content as one string, not yet split
    _, content = pair
    if not content:
        return

    try:
        # once the content is split into lines:
        # 1. the first line holds the bare variable name, since we removed the preceding
        #    part when opening the file (see delimiter above)
        # 2. the second line until the end holds the values for the current variable

        clean = [x.strip() for x in content.splitlines()]
        k, vs = clean[0], clean[1:]

        # Python 3 alternative:
        #k, *vs = [x.strip() for x in content.splitlines()]

        for v in vs:
            try:
                # split timestamp and value and convert (cast) them from string to the correct data type
                ds, x = v.split("\t")
                yield k, (datetime.strptime(ds, "%Y-%m-%d %H:%M:%S"), float(x))
            except ValueError:
                # might occur if a line's format is corrupt
                pass
    except IndexError:
        # might occur if content is empty or irregular
        pass

# read, flatten, extract and reduce the file at once        
sheet.flatMap(process_and_extract) \
    .reduceByKey(lambda x, y: x + y) \
    .take(5)
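Since the extraction happens entirely inside the callback, it can be checked locally without a Spark context by calling the generator directly on one (offset, content) pair. A minimal sketch (the sample record is shortened from the RDD dump above):

```python
from datetime import datetime

def process_and_extract(pair):
    # Same extraction logic as the flatMap callback above, runnable without Spark.
    _, content = pair
    if not content:
        return
    clean = [x.strip() for x in content.splitlines()]
    if len(clean) < 2:
        return
    k, vs = clean[0], clean[1:]
    for v in vs:
        try:
            ds, x = v.split("\t")
            yield k, (datetime.strptime(ds, "%Y-%m-%d %H:%M:%S"), float(x))
        except ValueError:
            pass

pair = (12, u'852-YF-007\t\r\n2016-05-10 00:00:00\t0\r\n2016-05-09 23:59:00\t0')
result = list(process_and_extract(pair))
print(result[0])  # ('852-YF-007', (datetime.datetime(2016, 5, 10, 0, 0), 0.0))
```

The empty first record yields nothing, which is exactly why flatMap (rather than map) fits here: empty generators simply disappear from the output.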
This second version avoids the per-record for loop in the callback and ended up about 20% faster:

import time

start_time = time.time()

#read the text file with special record-delimiter --> all lines after Time\tMHist:: are the values for that variable
sheet = sc.newAPIHadoopFile(
    'sample.txt',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': 'Time\tMHist::'}
)

def extract_blob(pair):
    if not pair:
        return None

    try:
        offset, content = pair
        if not content:
            return None

        clean = [x.strip() for x in content.splitlines()]
        if not clean or len(clean) < 2:
            return None

        k, vs = clean[0], clean[1:]
        if not k:
            return None

        return k.strip(), vs
    except IndexError:
        # might occur if content is empty or malformed
        return None

def extract_line(pair):
    if not pair:
        return None

    key, line = pair
    if not key or not line:
        return None

    # split timestamp and value and convert (cast) them from string to the correct data type
    content = line.split("\t")
    if not content or len(content) < 2:
        return None

    try:
        ds, x = content
        if not ds or not x:
            return None

        return (key, datetime.strptime(ds, "%Y-%m-%d %H:%M:%S"), float(x))
    except ValueError:
        # might occur if a line's format is corrupt
        return None

def check_empty(x):
    return x is not None

#keep only pairs whose content is non-empty (drops the leading empty record)
non_empty = sheet.filter(lambda (k, v): v)

#group lines having variable name at first line
grouped_lines = non_empty.map(extract_blob)

#extract variable name and split it from the variable values
flat_lines = grouped_lines.flatMapValues(lambda x: x)

#extract the values from the value
flat_triples = flat_lines.map(extract_line).filter(check_empty)

#convert to dataframe
df = flat_triples.toDF(["Variable", "Time", "Value"])

df.write \
    .partitionBy("Variable") \
    .saveAsTable('Observations', format='parquet', mode='overwrite', path=output_hdfs_filepath)

print("loading and saving done in {} seconds".format(time.time() - start_time))
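The two helpers can likewise be exercised locally, simulating what map and flatMapValues do with them. A sketch with simplified, guard-by-return versions (assuming None results are filtered out afterwards, as check_empty does in the pipeline):

```python
from datetime import datetime

def extract_blob(pair):
    # Split one (offset, content) record into (variable name, raw value lines).
    if not pair or not pair[1]:
        return None
    clean = [x.strip() for x in pair[1].splitlines()]
    if len(clean) < 2:
        return None
    return clean[0], clean[1:]

def extract_line(pair):
    # Turn (variable name, 'timestamp\tvalue') into a typed triple, or None.
    key, line = pair
    parts = line.split("\t")
    if len(parts) != 2:
        return None
    try:
        return (key, datetime.strptime(parts[0], "%Y-%m-%d %H:%M:%S"), float(parts[1]))
    except ValueError:
        return None

blob = (12, u'852-YF-007\t\r\n2016-05-10 00:00:00\t0\r\n2016-05-09 23:59:00\t0')
k, vs = extract_blob(blob)
# flatMapValues(lambda x: x) pairs the key with each value line; map applies extract_line.
triples = [t for t in (extract_line((k, v)) for v in vs) if t is not None]
print(triples[0][0])  # '852-YF-007'
```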

itemgetter returns a function that takes an object and calls __getitem__ on it for each argument that was passed to itemgetter. So you have to call it on lines:

itemgetter(0, slice(1, None))(lines)

which is roughly equivalent to

[lines[i] for i in [0, slice(1, None)]]

where lines[slice(1, None)] is essentially lines[1:].

This means you first have to make sure lines is not empty, otherwise lines[0] will fail:

if lines:  # bool(empty_sequence) is False
    k, vs = itemgetter(0, slice(1, None))(lines)
    for v in vs:
        ...
All together, including doctests:

def process(pair):
    r"""
    >>> list(process((0, u'')))
    []
    >>> kvs = list(process((
    ... 12,
    ... u'852-YF-007\t\r\n2016-05-10 00:00:00\t0\r\n2016-05-09 23:59:00\t0')))
    >>> kvs[0] 
    (u'852-YF-007', (datetime.datetime(2016, 5, 10, 0, 0), 0.0))
    >>> kvs[1]
    (u'852-YF-007', (datetime.datetime(2016, 5, 9, 23, 59), 0.0))
    >>> list(process((
    ... 10,
    ... u'852-YF-007\t\r\n2ad-05-10 00')))
    []
    """ 
    _, content = pair
    clean = [x.strip() for x in content.strip().splitlines()]

    if clean:
        k, vs = itemgetter(0, slice(1, None))(clean)
        for v in vs:
            try:
                ds, x = v.split("\t")
                yield k, (datetime.strptime(ds, "%Y-%m-%d %H:%M:%S"), float(x))
            except ValueError:
                pass 
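The difference between calling the getter and merely constructing it can be checked on a plain list; a small sketch reproducing the error from the question:

```python
from operator import itemgetter

lines = ['852-YF-007', '2016-05-10 00:00:00\t0', '2016-05-09 23:59:00\t0']

# itemgetter(0, slice(1, None)) builds a callable; calling it with `lines`
# applies lines.__getitem__ once per argument and returns a tuple.
k, vs = itemgetter(0, slice(1, None))(lines)
print(k)   # '852-YF-007'
print(vs)  # ['2016-05-10 00:00:00\t0', '2016-05-09 23:59:00\t0']

# Passing `lines` as a third argument instead of calling the getter just
# builds a bigger itemgetter object; trying to unpack that object is what
# raises the TypeError from the question, since it is not iterable.
try:
    k2, vs2 = itemgetter(0, slice(1, None), lines)
except TypeError:
    print("TypeError: the itemgetter object is not iterable")
```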


Possible duplicate? — No, it's not. This is about the flatMap function and how to pass your own function as the callback instead of a lambda, using yield inside that callback. But I got the error TypeError: 'operator.itemgetter' object is not iterable, so how is it not different? Yes, that error was frustrating. I do now know how to use the pair argument correctly, and the updated version works. I then got an IndexError until I realized that the first record read from the file contains an empty string as its content.

Yes, because it simply didn't work: with `if not content: ...` I still got the IndexError and hadn't found a workaround. Even testing for len(lines) < 2 and passing didn't help. — I hate to say it, but the simple truthiness check on lines works for me and passes the doctests. — I'm still trying to change it, because the for loop makes it very slow. — I seriously doubt the for loop is the problem here. You could implement a custom input format or a small parser to avoid `[x.strip() for x in content.strip().splitlines()]`, but the sequential loop itself cannot be avoided here. What makes you think it is slow? Have you measured it?