Dataflow: single-element PCollection with the Python SDK

I'm looking at the word_counting.py example in the incubator-beam repository (linked from the Dataflow documentation), and I want to modify it to get the top n most frequent words. Here is my pipeline:
counts = (lines
          | 'split' >> (beam.ParDo(WordExtractingDoFn())
                        .with_output_types(unicode))
          | 'pair_with_one' >> beam.Map(lambda x: (x, 1))
          | 'group' >> beam.GroupByKey()
          | 'count' >> beam.Map(lambda (word, ones): (word, sum(ones)))
          | 'top' >> beam.combiners.Top.Of('top', 10, key=lambda (word, c): c))  # 'top' is the only added line
output = counts | 'format' >> beam.Map(lambda (word, c): '%s: %s' % (word, c))
output | 'write' >> beam.io.Write(beam.io.TextFileSink(known_args.output))
I added one line using the Top.Of() method, but it seems to return a PCollection whose single element is the whole array of top pairs. (I was expecting an ordered PCollection, but from the documentation a PCollection appears to be an unordered collection.) When the pipeline runs, beam.Map iterates over just that one element (the entire array), and in 'format' the lambda raises an error because it cannot unpack the whole array into the tuple (word, c). How should I handle this single-element PCollection at this step without breaking the pipeline?

To expand a PCollection of iterables into a PCollection of the elements of those iterables, you can use FlatMap, whose argument is a function from an element to the resulting iterable. In our case each element is itself iterable, so we use the identity function:
counts = ...
         | 'top' >> beam.combiners.Top.Of('top', 10, key=lambda (word, c): c)
         | 'expand' >> beam.FlatMap(lambda word_counts: word_counts)  # sic!
output = counts | 'format' >> beam.Map(lambda (word, c): '%s: %s' % (word, c))
...
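To see why the identity function is enough here, a minimal plain-Python sketch of the FlatMap semantics may help (the `flat_map` helper and the sample data below are hypothetical stand-ins, not Beam APIs): after Top.Of, the collection holds a single element, the whole list of top pairs, so a per-element format step cannot unpack it; flattening first restores one pair per element.

```python
# Hypothetical stand-in for the PCollection produced by Top.Of:
# one element, which is itself the list of (word, count) pairs.
top_result = [[("the", 42), ("and", 17), ("of", 9)]]

# A per-element 'format' step (like beam.Map) tries to unpack each
# element as (word, c) -- but the only element is the whole list.
try:
    formatted = ["%s: %s" % (word, c) for word, c in top_result]
except ValueError as err:
    unpack_error = str(err)  # "too many values to unpack ..."

def flat_map(fn, collection):
    # Simplified model of beam.FlatMap: fn maps each element to an
    # iterable, and all resulting iterables are concatenated.
    return [out for element in collection for out in fn(element)]

# With the identity function, the nested list is expanded into one
# (word, count) pair per element, and formatting now works.
expanded = flat_map(lambda x: x, top_result)
formatted = ["%s: %s" % (word, c) for word, c in expanded]
```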