How do I create a nested structure with reduceByKey (PySpark)?


python pyspark


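I am processing a dataset with Spark (PySpark), and I want to partition it by three values and write it back to S3. The dataset looks like this:

customerId, productId, createDate

I want to partition the data by customerId, productId, and createDate, so that when I write the partitioned data to S3 it has the following structure:

    customerId=1
        productId='A1'
            createDate=2019-10
            createDate=2019-11
            createDate=2019-12
        productId='A2'
            createDate=2019-10
            createDate=2019-11
            createDate=2019-12

Below is the code I use to create the partitions: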

    import json

    # sc is the SparkContext; each line of data.json is parsed as one JSON record
    rdd = sc.textFile("data.json").map(json.loads)

    (rdd.map(lambda r: (r["customerId"], r["productId"], r["createDate"]))
        .distinct()
        .map(lambda r: (r[0], ([r[1]], [r[2]])))
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        .collect())

    [('1', ([A1, A2], ['2019-12', '2019-11', '2019-10', '2019-12', '2019-11', '2019-10']))]

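This code gives me a flat structure rather than the nested structure I described. Is it possible to transform the data the way I described? Any pointers are highly appreciated.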
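If staying on the RDD route, one possible way to get a nested result out of reduceByKey is to make each value a dict keyed by productId and to merge dicts instead of concatenating two parallel lists. The sketch below is only an illustration: `to_nested`, `merge_nested`, and the sample rows are hypothetical names and data, and a plain Python loop stands in for reduceByKey so the merge logic can be checked without a cluster.

```python
def to_nested(record):
    """Turn one (customerId, productId, createDate) tuple into (key, nested value)."""
    customer_id, product_id, create_date = record
    return customer_id, {product_id: [create_date]}


def merge_nested(a, b):
    """Merge two {productId: [createDate, ...]} dicts without mutating either input."""
    merged = {product: list(dates) for product, dates in a.items()}
    for product, dates in b.items():
        merged.setdefault(product, []).extend(dates)
    return merged


# Hypothetical sample rows, standing in for the output of distinct() on the RDD.
rows = [
    ("1", "A1", "2019-10"),
    ("1", "A1", "2019-11"),
    ("1", "A2", "2019-10"),
    ("1", "A2", "2019-12"),
]

# Local stand-in for reduceByKey: fold together the values that share a key.
nested = {}
for key, value in map(to_nested, rows):
    nested[key] = merge_nested(nested[key], value) if key in nested else value

print(nested)

# With Spark, the same two functions would plug in roughly as:
# rdd.map(to_nested).reduceByKey(merge_nested).collect()
```

Because `merge_nested` is associative and does not depend on the order in which pairs arrive, it should be safe to use as a reduceByKey function. The order of dates inside each list is not guaranteed on a real cluster.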
First, read the JSON file into a DataFrame:

    # spark is the SparkSession; spark.read.json accepts a file path directly
    df = spark.read.json("/data.json")
Then use groupby and collect_list to get the desired format:

    import pyspark.sql.functions as func

    # Gather every createDate for each (customerId, productId) pair into one list
    df.groupby('customerId', 'productId').agg(func.collect_list('createDate')).collect()
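For the original goal of writing the data to S3 in a customerId / productId / createDate directory layout, the DataFrame writer's `partitionBy` may be a more direct route than building the nesting by hand. The following is a minimal sketch under stated assumptions: the Parquet format, the overwrite mode, and the helper names `write_partitioned` and `partition_dir` are not from the question, and the bucket path in the usage comment is hypothetical.

```python
def write_partitioned(df, output_path):
    """Write df so that each partition column becomes one directory level."""
    (
        df.write
        .partitionBy("customerId", "productId", "createDate")
        .mode("overwrite")     # assumption: replacing earlier output is acceptable
        .parquet(output_path)  # assumption: Parquet; the question names no format
    )


def partition_dir(customer_id, product_id, create_date):
    """Relative key=value directory expected for one combination of values."""
    return f"customerId={customer_id}/productId={product_id}/createDate={create_date}"


print(partition_dir(1, "A1", "2019-10"))

# Usage with a real DataFrame (hypothetical bucket name):
# write_partitioned(df, "s3a://my-bucket/output")
```

Spark writes partition values into directory names without quotes, so the directory would read `productId=A1` rather than `productId='A1'`.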
