Python: how to iterate inside a flatMap function


I'm reading a text file with multi-line records into an RDD. The underlying data looks like this:

Time    MHist::852-YF-007   
2016-05-10 00:00:00 0
2016-05-09 23:59:00 0
2016-05-09 23:58:00 0
Time    MHist::852-YF-008   
2016-05-10 00:00:00 0
2016-05-09 23:59:00 0
2016-05-09 23:58:00 0
Now I want to transform the RDD so that I get a mapping of key to (timestamp, value) pairs. This could be done in several steps, but I want to extract the information in a single call (in Python 2.7, not Python 3).

The RDD looks like this:

[(0, u''),
 (12,
  u'852-YF-007\t\r\n2016-05-10 00:00:00\t0\r\n2016-05-09 23:59:00\t0\r\n2016-05-09 23:58:00\t0\r\n2016-05-09 23:57:00\t0\r\n2016-05-09 23:56:00\t0\r\n2016-05-09 23:55:00\t0\r\n2016-05-09 23:54:00\t0\r\n2016-05-09 23:53:00\t0\r\n2016-05-09 23:52:00\t0\r\n2016-05-09 23:51:00\t0\r\n2016-05-09 23:50:00\t0\r\n2016-05-09 23:49:00\t0\r\n2016-05-09 23:48:00\t0\r\n2016-05-09 23:47:00\t0\r\n2016-05-09 23:46:00\t0\r\n2016-05-09 23:45:00\t0\r\n2016-05-09 23:44:00\t0\r\n2016-05-09 23:43:00\t0\r\n2016-05-09 23:42:00\t0\n'),
 (473,
  u'852-YF-008\t\r\n2016-05-10 00:00:00\t0\r\n2016-05-09 23:59:00\t0\r\n2016-05-09 23:58:00\t0\r\n2016-05-09 23:57:00\t0\r\n2016-05-09 23:56:00\t0\r\n2016-05-09 23:55:00\t0\r\n2016-05-09 23:54:00\t0\r\n2016-05-09 23:53:00\t0\r\n2016-05-09 23:52:00\t0\r\n2016-05-09 23:51:00\t0\r\n2016-05-09 23:50:00\t0\r\n2016-05-09 23:49:00\t0\r\n2016-05-09 23:48:00\t0\r\n2016-05-09 23:47:00\t0\r\n2016-05-09 23:46:00\t0\r\n2016-05-09 23:45:00\t0\r\n2016-05-09 23:44:00\t0\r\n2016-05-09 23:43:00\t0\r\n2016-05-09 23:42:00\t0')]
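The leading (0, u'') pair in this dump is an artifact of the record delimiter: the file itself starts with the delimiter string. A minimal plain-Python sketch (no Spark; the sample text is abbreviated from the data above) of how splitting on 'Time\tMHist::' produces these chunks:

```python
# Minimal sketch: splitting the raw file text on the custom record
# delimiter 'Time\tMHist::' reproduces the chunks shown in the RDD dump.
raw = (
    "Time\tMHist::852-YF-007\t\r\n"
    "2016-05-10 00:00:00\t0\r\n"
    "2016-05-09 23:59:00\t0\r\n"
    "Time\tMHist::852-YF-008\t\r\n"
    "2016-05-10 00:00:00\t0\r\n"
)

records = raw.split("Time\tMHist::")

# The first chunk is empty because the file begins with the delimiter --
# exactly the (0, u'') pair seen above.
print(repr(records[0]))      # ''
print(records[1][:10])       # '852-YF-007'
print(records[2][:10])       # '852-YF-008'
```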
For each pair, the interesting part is the value (the content). Within that value, the first item is the key/name and the remaining items are the timestamped values. So I tried the following:

sheet = sc.newAPIHadoopFile(
    'sample.txt',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': 'Time\tMHist::'}
)

from operator import itemgetter

def process(pair):
    _, content = pair
    if not content: 
        pass

    lines = content.splitlines();
    #k = lines[0].strip()
    #vs =lines[1:]
    k, vs = itemgetter(0, slice(1, None), lines)
    #k, *vs = [x.strip() for x in content.splitlines()]  # Python 3 syntax

    for v in vs:
        try:
            ds, x = v.split("\t")
            yield k, (dateutil.parser.parse(ds), float(x))  # or int(x)
            return
        except ValueError:
            pass

sheet.flatMap(process).take(5)
But I get this error:

TypeError: 'operator.itemgetter' object is not iterable

The pair that enters the function holds the char position (which I can ignore) and the content. The content should be split at \r\n; the first item of the resulting line array is the key, while the remaining items become the key/(timestamp, value) pairs for flatMap.

So, what am I doing wrong in my process method?

Meanwhile, with the help of Stack Overflow and others, I came up with this solution. It works fine:

# reads a text file in TSV notation where the key/variable name is not a first column but
# a randomly occurring line followed by its values. Remark: a variable might occur in several files

#Time    MHist::852-YF-007   
#2016-05-10 00:00:00 0
#2016-05-09 23:59:00 0
#2016-05-09 23:58:00 0
#Time    MHist::852-YF-008   
#2016-05-10 00:00:00 0
#2016-05-09 23:59:00 0
#2016-05-09 23:58:00 0

#imports
from operator import itemgetter
from datetime import datetime

#read the text file with special record-delimiter --> all lines after Time\tMHist:: are the values for that variable
sheet = sc.newAPIHadoopFile(
    'sample.txt',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': 'Time\tMHist::'}
)

#this avoids multiple map/flatMap/mapValues/flatMapValues calls by extracting the values at once
def process_and_extract(pair):
    # first part is the char-position within the file, which we can ignore
    # second is the real content as one string, not yet split
    _, content = pair
    if not content:
        return

    try:
        # once the content is split into lines:
        # 1. the first line holds the bare variable name, since we removed the preceding
        #    part when opening the file (see delimiter above)
        # 2. the second line until the end holds the values for the current variable

        clean = [x.strip() for x in content.splitlines()]
        k, vs = clean[0], clean[1:]

        # Python 3 alternative:
        #k, *vs = [x.strip() for x in content.splitlines()]

        for v in vs:
            try:
                # split timestamp and value and convert (cast) them from string to the correct data type
                ds, x = v.split("\t")
                yield k, (datetime.strptime(ds, "%Y-%m-%d %H:%M:%S"), float(x))
            except ValueError:
                # might occur if a line's format is corrupt
                pass
    except IndexError:
        # might occur if content is empty or irregular
        pass

# read, flatten, extract and reduce the file at once        
sheet.flatMap(process_and_extract) \
    .reduceByKey(lambda x, y: x + y) \
    .take(5)
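Since the extraction happens entirely inside the callback, it can be checked locally without a Spark context by calling the generator directly on one (offset, content) pair. A minimal sketch (the sample record is shortened from the RDD dump above):

```python
from datetime import datetime

def process_and_extract(pair):
    # Same extraction logic as the flatMap callback above, runnable without Spark.
    _, content = pair
    if not content:
        return
    clean = [x.strip() for x in content.splitlines()]
    if len(clean) < 2:
        return
    k, vs = clean[0], clean[1:]
    for v in vs:
        try:
            ds, x = v.split("\t")
            yield k, (datetime.strptime(ds, "%Y-%m-%d %H:%M:%S"), float(x))
        except ValueError:
            pass

pair = (12, u'852-YF-007\t\r\n2016-05-10 00:00:00\t0\r\n2016-05-09 23:59:00\t0')
result = list(process_and_extract(pair))
print(result[0])  # ('852-YF-007', (datetime.datetime(2016, 5, 10, 0, 0), 0.0))
```

The empty first record yields nothing, which is exactly why flatMap (rather than map) fits here: empty generators simply disappear from the output.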
This second version avoids the per-record for loop in the callback and ended up about 20% faster:

import time

start_time = time.time()

#read the text file with special record-delimiter --> all lines after Time\tMHist:: are the values for that variable
sheet = sc.newAPIHadoopFile(
    'sample.txt',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': 'Time\tMHist::'}
)

def extract_blob(pair):
    if not pair:
        return None

    try:
        offset, content = pair
        if not content:
            return None

        clean = [x.strip() for x in content.splitlines()]
        if not clean or len(clean) < 2:
            return None

        k, vs = clean[0], clean[1:]
        if not k:
            return None

        return k.strip(), vs
    except IndexError:
        # might occur if content is empty or malformed
        return None

def extract_line(pair):
    if not pair:
        return None

    key, line = pair
    if not key or not line:
        return None

    # split timestamp and value and convert (cast) them from string to the correct data type
    content = line.split("\t")
    if not content or len(content) < 2:
        return None

    try:
        ds, x = content
        if not ds or not x:
            return None

        return (key, datetime.strptime(ds, "%Y-%m-%d %H:%M:%S"), float(x))
    except ValueError:
        # might occur if a line's format is corrupt
        return None

def check_empty(x):
    return x is not None

#keep only pairs whose content is non-empty (drops the leading empty record)
non_empty = sheet.filter(lambda (k, v): v)

#group lines having variable name at first line
grouped_lines = non_empty.map(extract_blob)

#extract variable name and split it from the variable values
flat_lines = grouped_lines.flatMapValues(lambda x: x)

#extract the values from the value
flat_triples = flat_lines.map(extract_line).filter(check_empty)

#convert to dataframe
df = flat_triples.toDF(["Variable", "Time", "Value"])

df.write \
    .partitionBy("Variable") \
    .saveAsTable('Observations', format='parquet', mode='overwrite', path=output_hdfs_filepath)

print("loading and saving done in {} seconds".format(time.time() - start_time))
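The two helpers can likewise be exercised locally, simulating what map and flatMapValues do with them. A sketch with simplified, guard-by-return versions (assuming None results are filtered out afterwards, as check_empty does in the pipeline):

```python
from datetime import datetime

def extract_blob(pair):
    # Split one (offset, content) record into (variable name, raw value lines).
    if not pair or not pair[1]:
        return None
    clean = [x.strip() for x in pair[1].splitlines()]
    if len(clean) < 2:
        return None
    return clean[0], clean[1:]

def extract_line(pair):
    # Turn (variable name, 'timestamp\tvalue') into a typed triple, or None.
    key, line = pair
    parts = line.split("\t")
    if len(parts) != 2:
        return None
    try:
        return (key, datetime.strptime(parts[0], "%Y-%m-%d %H:%M:%S"), float(parts[1]))
    except ValueError:
        return None

blob = (12, u'852-YF-007\t\r\n2016-05-10 00:00:00\t0\r\n2016-05-09 23:59:00\t0')
k, vs = extract_blob(blob)
# flatMapValues(lambda x: x) pairs the key with each value line; map applies extract_line.
triples = [t for t in (extract_line((k, v)) for v in vs) if t is not None]
print(triples[0][0])  # '852-YF-007'
```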

itemgetter returns a function that takes an object and calls __getitem__ on it for each argument that was passed to itemgetter. So you have to call it on lines:

itemgetter(0, slice(1, None))(lines)

which is roughly equivalent to

[lines[i] for i in [0, slice(1, None)]]

where lines[slice(1, None)] is essentially lines[1:].

This means you first have to make sure lines is not empty, otherwise lines[0] will fail:

if lines:  # bool(empty_sequence) is False
    k, vs = itemgetter(0, slice(1, None))(lines)
    for v in vs:
        ...
All together, including doctests:

def process(pair):
    r"""
    >>> list(process((0, u'')))
    []
    >>> kvs = list(process((
    ... 12,
    ... u'852-YF-007\t\r\n2016-05-10 00:00:00\t0\r\n2016-05-09 23:59:00\t0')))
    >>> kvs[0] 
    (u'852-YF-007', (datetime.datetime(2016, 5, 10, 0, 0), 0.0))
    >>> kvs[1]
    (u'852-YF-007', (datetime.datetime(2016, 5, 9, 23, 59), 0.0))
    >>> list(process((
    ... 10,
    ... u'852-YF-007\t\r\n2ad-05-10 00')))
    []
    """ 
    _, content = pair
    clean = [x.strip() for x in content.strip().splitlines()]

    if clean:
        k, vs = itemgetter(0, slice(1, None))(clean)
        for v in vs:
            try:
                ds, x = v.split("\t")
                yield k, (datetime.strptime(ds, "%Y-%m-%d %H:%M:%S"), float(x))
            except ValueError:
                pass 
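The difference between calling the getter and merely constructing it can be checked on a plain list; a small sketch reproducing the error from the question:

```python
from operator import itemgetter

lines = ['852-YF-007', '2016-05-10 00:00:00\t0', '2016-05-09 23:59:00\t0']

# itemgetter(0, slice(1, None)) builds a callable; calling it with `lines`
# applies lines.__getitem__ once per argument and returns a tuple.
k, vs = itemgetter(0, slice(1, None))(lines)
print(k)   # '852-YF-007'
print(vs)  # ['2016-05-10 00:00:00\t0', '2016-05-09 23:59:00\t0']

# Passing `lines` as a third argument instead of calling the getter just
# builds a bigger itemgetter object; trying to unpack that object is what
# raises the TypeError from the question, since it is not iterable.
try:
    k2, vs2 = itemgetter(0, slice(1, None), lines)
except TypeError:
    print("TypeError: the itemgetter object is not iterable")
```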


Possible duplicate? — No, it's not. This is about the flatMap function and how to pass your own function as the callback instead of a lambda, using yield inside that callback. But I got the error TypeError: 'operator.itemgetter' object is not iterable, so how is it not different? Yes, that error was frustrating. I do now know how to use the pair argument correctly, and the updated version works. I then got an IndexError until I realized that the first record read from the file contains an empty string as its content.

Yes, because it simply didn't work: with `if not content: ...` I still got the IndexError and hadn't found a workaround. Even testing for len(lines) < 2 and passing didn't help. — I hate to say it, but the simple truthiness check on lines works for me and passes the doctests. — I'm still trying to change it, because the for loop makes it very slow. — I seriously doubt the for loop is the problem here. You could implement a custom input format or a small parser to avoid `[x.strip() for x in content.strip().splitlines()]`, but the sequential loop itself cannot be avoided here. What makes you think it is slow? Have you measured it?