Python PySpark XML to JSON with Time Series Data


I have nearly 500k XML files containing time series data, each about 2-3 MB and each holding roughly 10k rows of time series data. The idea is to convert the XML file for each unique ID into JSON. However, the time series data for each ID needs to be broken into batches of 10 rows, converted to JSON, and written to a NoSQL database. Initially, the code was written to iterate over a monolithic DataFrame for each ID, stepping by a row size of 10, and then write the document to the db.

def resample_idx(X, resample_rate):
    # Yield consecutive chunks of `resample_rate` rows from the DataFrame
    for idx in range(0, len(X), resample_rate):
        yield X.iloc[idx:idx + resample_rate, :]

# Batch documents
for idx, df_batch in enumerate(resample_idx(df, 10)):
    dict_ = {}
    dict_['id'] = soup.find('id').contents[0]
    dict_['data'] = [v for k, v in pd.DataFrame.to_dict(df_batch.T).items()]
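
Each batch dict is then converted to JSON and written to the NoSQL store, roughly along these lines (the database client call below is just a hypothetical placeholder):

import json

# Serialize the batch; default=str renders the pandas Timestamps as strings
doc = json.dumps(dict_, default=str)
# db_collection.insert_one(dict_)  # hypothetical NoSQL write (e.g. a Mongo-style client)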
An example of the JSON documents looks like this:

{'id':123456A,
'data': [{'A': 251.23,
          'B': 130.56,
          'dtim': Timestamp('2011-03-24 11:18:13.350000')
         },
         {
          'A': 253.23,
          'B': 140.56,
          'dtim': Timestamp('2011-03-24 11:19:21.310000')
         },
         .........
        ]
},
{'id':123593X,
'data': [{'A': 641.13,
          'B': 220.51,
          'C': 10.45,
          'dtim': Timestamp('2011-03-26 12:11:13.350000')
         },
         {
          'A': 153.25,
          'B': 810.16,
          'C': 12.5,
          'dtim': Timestamp('2011-03-26 12:11:13.310000')
         },
         .........
        ]
}
This works fine for a small sample, but I quickly realized it will not scale when creating the batches. Hence, I would like to replicate this in Spark. I have limited experience with Spark, but here is what I have tried so far:

First, get all the time series data for all IDs:

df = sqlContext.read.format("com.databricks.spark.xml").options(rowTag='log').load("dbfs:/mnt/timedata/")
XML schema:

 |-- _id: string (nullable = true)   
 |-- collect_list(TimeData): array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- data: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- ColNames: string (nullable = true)
 |    |    |-- Units: string (nullable = true)
Query to get the Spark DataFrame:

d = df.select("_id", "TimeData.data", "TimeData.ColNames")

Current Spark DataFrame:

+--------------------+--------------------+--------------------+
|                id  |                data|            ColNames|
+--------------------+--------------------+--------------------+
|123456A             |[2011-03-24 11:18...|dTim,A,B            |
|123456A             |[2011-03-24 11:19...|dTim,A,B            |
|123593X             |[2011-03-26 12:11...|dTim,A,B,C          |
|123593X             |[2011-03-26 12:11...|dTim,A,B,C          |
+--------------------+--------------------+--------------------+
Expected Spark DataFrame:

+--------------------+--------------------+----------+----------+
|                id  |               dTime|         A|         B|
+--------------------+--------------------+----------+----------+
|123456A             |2011-03-24 11:18... |    251.23|    130.56|
|123456A             |2011-03-24 11:19... |    253.23|    140.56|
+--------------------+--------------------+----------+----------+

+--------------------+--------------------+----------+----------+----------+
|                id  |               dTime|         A|         B|         C|
+--------------------+--------------------+----------+----------+----------+
|123593X             |2011-03-26 12:11... |    641.13|    220.51|     10.45|
|123593X             |2011-03-26 12:11... |    153.25|    810.16|      12.5|
+--------------------+--------------------+----------+----------+----------+
I have only shown data for two timestamps here, but how can I turn the DataFrame above into batched JSON files of every N rows (per id), similar to the way it was done with Pandas above? The initial thought was to do a groupBy and apply a UDF to each ID? The output would look like the JSON structure above.
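
Something along these lines is roughly what I have in mind (untested sketch; it numbers rows with a window function rather than a UDF, and the columns dTim/A/B are hardcoded purely for illustration, assuming the expected DataFrame above):

from pyspark.sql import functions as F, Window

# Assign a batch number per id every 10 rows, then collect each batch into one document
w = Window.partitionBy("id").orderBy("dTim")
batched = (df
    .withColumn("batch_id", F.floor((F.row_number().over(w) - 1) / 10))
    .groupBy("id", "batch_id")
    .agg(F.collect_list(F.struct("dTim", "A", "B")).alias("data")))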

XML structure:

<log>
   <id>"ABC"</id>
   <TimeData>
      <colNames>dTim,colA,colB,colC,</colNames>
      <data>2011-03-24T11:18:13.350Z,0.139,38.988,0,110.307</data>
      <data>2011-03-24T11:18:43.897Z,0.138,39.017,0,110.307</data>
  </TimeData>
</log>


Please note that there is not a fixed number of colNames per ID; this can range between 5 and 30, depending on the data sources collected for that ID.

Based on the information, this could be a solution. Unfortunately my Python is a bit rusty, but there should be equivalents for all the Scala functions here:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Assume nth is based on dTim ordering
val windowSpec = Window
  .partitionBy($"_id")
  .orderBy($"dTim".desc)

val nthRow = 2  // define the nth item to be fetched

df.select(
  $"_id",
  $"TimeData.data".getItem(0).getItem(0).cast(TimestampType).alias("dTim"),
  $"TimeData.data".getItem(0).getItem(1).cast(DoubleType).alias("A"),
  $"TimeData.data".getItem(0).getItem(2).cast(DoubleType).alias("B"),
  $"TimeData.data".getItem(0).getItem(3).cast(DoubleType).alias("C")
).withColumn("n", row_number().over(windowSpec))
  .filter(col("n") === nthRow)
  .drop("n")
  .show()
Will output something like:

+-------+--------------------+------+------+-----+
|    _id|                dTim|     A|     B|    C|
+-------+--------------------+------+------+-----+
|123456A|2011-03-24 11:18:...|251.23|130.56| null|
|123593X|2011-03-26 12:11:...|641.13|220.51|10.45|
+-------+--------------------+------+------+-----+
If I knew a bit more, I would improve the answer.
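
For reference, a rough PySpark equivalent of the Scala above might look like this (untested sketch, assuming the same schema and column positions):

from pyspark.sql import functions as F, Window
from pyspark.sql.types import TimestampType, DoubleType

window_spec = Window.partitionBy("_id").orderBy(F.col("dTim").desc())
nth_row = 2  # define the nth item to be fetched

(df.select(
        F.col("_id"),
        F.col("TimeData.data").getItem(0).getItem(0).cast(TimestampType()).alias("dTim"),
        F.col("TimeData.data").getItem(0).getItem(1).cast(DoubleType()).alias("A"),
        F.col("TimeData.data").getItem(0).getItem(2).cast(DoubleType()).alias("B"),
        F.col("TimeData.data").getItem(0).getItem(3).cast(DoubleType()).alias("C"))
    .withColumn("n", F.row_number().over(window_spec))
    .filter(F.col("n") == nth_row)
    .drop("n")
    .show())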


Update

I liked the puzzle, so if I understood the problem correctly, this could be a solution:

I created 3 XML files, each with 2 data records, across 2 different IDs in total.

val df = spark
  .sqlContext
  .read
  .format("com.databricks.spark.xml")
  .option("rowTag", "log")
  .load("src/main/resources/xml")

// Could be computationally expensive; if possible cache df first, otherwise run it on a sample, or possibly hardcode the list
val colNames = df
  .select(explode(split($"TimeData.colNames", ",")).as("col"))
  .distinct()
  .filter($"col" =!= lit("dTim") && $"col" =!= "")
  .collect()
  .map(_.getString(0))
  .toList
  .sorted

// Or list all possible columns explicitly
// val colNames = List("colA", "colB", "colC")

// Based on the XML, colNames and data are comma-separated strings that have to be split. This could be done with the sql split function, but this UDF maps the columns to the correct fields
def mapColsToData = udf((cols: String, data: Seq[String]) =>
  if (cols == null || data == null) Seq.empty[Map[String, String]]
  else {
    data.map(str => (cols.split(",") zip str.split(",")).toMap)
  }
)

// The result is one record per data point across all XMLs. Each data record is a key -> value map of colName -> data
val denorm = df.select($"_id".as("id"), explode(mapColsToData($"TimeData.colNames", $"TimeData.data")).as("data"))

denorm.show(false)
Output:

+-------+-------------------------------------------------------------------------------+
|id     |data                                                                           |
+-------+-------------------------------------------------------------------------------+
|123456A|Map(dTim -> 2011-03-24T11:18:13.350Z, colA -> 0.139, colB -> 38.988, colC -> 0)|
|123456A|Map(dTim -> 2011-03-24T11:18:43.897Z, colA -> 0.138, colB -> 39.017, colC -> 0)|
|123593X|Map(dTim -> 2011-03-26T11:20:13.350Z, colA -> 1.139, colB -> 28.988)           |
|123593X|Map(dTim -> 2011-03-26T11:20:43.897Z, colA -> 1.138, colB -> 29.017)           |
|123456A|Map(dTim -> 2011-03-27T11:18:13.350Z, colA -> 0.129, colB -> 35.988, colC -> 0)|
|123456A|Map(dTim -> 2011-03-27T11:18:43.897Z, colA -> 0.128, colB -> 35.017, colC -> 0)|
+-------+-------------------------------------------------------------------------------+
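
On the PySpark side, a rough equivalent of the steps so far might be (untested sketch; the path and the _id-to-id alias are assumptions based on the question):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, MapType, StringType

df = (sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "log")
      .load("dbfs:/mnt/timedata/"))

# Zip the comma-separated colNames with each comma-separated data row
def map_cols_to_data(cols, data):
    if cols is None or data is None:
        return []
    names = cols.split(",")
    return [dict(zip(names, row.split(","))) for row in data]

map_cols_to_data_udf = F.udf(map_cols_to_data, ArrayType(MapType(StringType(), StringType())))

# One record per data point; each record is a colName -> value map
denorm = df.select(
    F.col("_id").alias("id"),
    F.explode(map_cols_to_data_udf(F.col("TimeData.colNames"), F.col("TimeData.data"))).alias("data"))
denorm.show(truncate=False)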
// Now create columns for each map value, based on the predefined/found columnNames
val columized = denorm.select(
  $"id",
  $"data.dTim".cast(TimestampType).alias("dTim"),
  $"data"
)
columized.show()
Output:

+-------+--------------------+--------------------+
|     id|                dTim|                data|
+-------+--------------------+--------------------+
|123456A|2011-03-24 12:18:...|Map(dTim -> 2011-...|
|123456A|2011-03-24 12:18:...|Map(dTim -> 2011-...|
|123593X|2011-03-26 12:20:...|Map(dTim -> 2011-...|
|123593X|2011-03-26 12:20:...|Map(dTim -> 2011-...|
|123456A|2011-03-27 13:18:...|Map(dTim -> 2011-...|
|123456A|2011-03-27 13:18:...|Map(dTim -> 2011-...|
+-------+--------------------+--------------------+
// Create the window over which to resample
val windowSpec = Window
  .partitionBy($"id")
  .orderBy($"dTim".desc)

val resampleRate = 2

// Add a batchId based on the resample rate, then group per batch and collect the data
val batched = columized
  .withColumn("batchId", floor((row_number().over(windowSpec) - lit(1)) / lit(resampleRate)))
  .groupBy($"id", $"batchId")
  .agg(collect_list($"data").as("data"))
  .drop("batchId")

batched.show(false)
Output:

+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id     |data                                                                                                                                                              |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|123593X|[Map(dTim -> 2011-03-26T11:20:43.897Z, colA -> 1.138, colB -> 29.017), Map(dTim -> 2011-03-26T11:20:13.350Z, colA -> 1.139, colB -> 28.988)]                      |
|123456A|[Map(dTim -> 2011-03-27T11:18:43.897Z, colA -> 0.128, colB -> 35.017, colC -> 0), Map(dTim -> 2011-03-27T11:18:13.350Z, colA -> 0.129, colB -> 35.988, colC -> 0)]|
|123456A|[Map(dTim -> 2011-03-24T11:18:43.897Z, colA -> 0.138, colB -> 39.017, colC -> 0), Map(dTim -> 2011-03-24T11:18:13.350Z, colA -> 0.139, colB -> 38.988, colC -> 0)]|
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
// Store as 1 huge json file (remove the repartition if you can handle multiple jsons, which is also better for the master)
batched.repartition(1).write.mode(SaveMode.Overwrite).json("/tmp/xml")
Output JSON:

{"id":"123593X","data":[{"dTim":"2011-03-26T12:20:43.897+01:00","colA":"1.138","colB":"29.017"},{"dTim":"2011-03-26T12:20:13.350+01:00","colA":"1.139","colB":"28.988"}]}
{"id":"123456A","data":[{"dTim":"2011-03-27T13:18:43.897+02:00","colA":"0.128","colB":"35.017","colC":"0"},{"dTim":"2011-03-27T13:18:13.350+02:00","colA":"0.129","colB":"35.988","colC":"0"}]}
{"id":"123456A","data":[{"dTim":"2011-03-24T12:18:43.897+01:00","colA":"0.138","colB":"39.017","colC":"0"},{"dTim":"2011-03-24T12:18:13.350+01:00","colA":"0.139","colB":"38.988","colC":"0"}]}
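
The remaining steps translate fairly directly to PySpark as well; a rough, untested sketch continuing from the denorm sketch above:

from pyspark.sql import functions as F, Window
from pyspark.sql.types import TimestampType

# Pull dTim out of the map so it can be used for ordering
columized = denorm.select(
    "id",
    F.col("data").getItem("dTim").cast(TimestampType()).alias("dTim"),
    "data")

resample_rate = 2
window_spec = Window.partitionBy("id").orderBy(F.col("dTim").desc())

# Number the rows per id, derive a batchId, then collect each batch's maps into a list
batched = (columized
    .withColumn("batchId", F.floor((F.row_number().over(window_spec) - 1) / resample_rate))
    .groupBy("id", "batchId")
    .agg(F.collect_list("data").alias("data"))
    .drop("batchId"))

# Write one JSON document per batch; drop the repartition if multiple output files are fine
batched.repartition(1).write.mode("overwrite").json("/tmp/xml")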

Here is another approach that does not rely on hardcoded column names. Basically, the idea is to explode the data and ColNames columns to get a "melted" DF, which we can then pivot to get the form you want:

import pyspark.sql.functions as f

# define a function that processes elements of the rdd
# underlying the DF to get a melted RDD
def process(row, cols):
    """cols is list of target columns to explode"""
    row=row.asDict()
    exploded=[[row['id']]+list(elt) for elt in zip(*[row[col] for col in cols])]    
    return(exploded)


#Now split ColNames:
df=df.withColumn('col_split', f.split('ColNames',","))

# define target cols to explode, each element of each col 
# can be of different length
cols=['data', 'col_split']

# apply function and flatmap the results to get melted RDD/DF
df=df.select(['id']+cols).rdd\
    .flatMap(lambda row: process(row, cols))\
    .toDF(schema=['id', 'value', 'name'])

# Pivot to get the required form
df.groupby('id').pivot('name').agg(f.max('value')).show()

Do you have some sample XML files? And the output should end up as JSON like shown above, correct?
@TomLous Yes, the output should be the same as shown above. Will update with a snippet of an XML file.
Is there an unmentioned .groupBy("id").agg(collect_list($"TimeData")) that explains the XML schema?
Thanks for the reply. I updated the question above to make it clearer. If possible I'd like to avoid hardcoding the column names, since they will vary for each ID.
@TraceSmith Is this what you intended?