Apache Spark: Controlling how many files Spark Streaming reads (apache-spark, spark-streaming, spark-dataframe) - Fatal编程技术网

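I am using Spark to read text files from a folder and load them into Hive.

The Spark Streaming batch interval is 1 minute. In rare cases, the source folder may contain 1000 fairly large files.

How can I control Spark Streaming so that it limits the number of files the program reads? Currently, my program reads all files generated in the last minute, but I want to control how many files it reads.

I am using the textFileStream API: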

    JavaDStream<String> lines = jssc.textFileStream("C:/Users/abcd/files/");

Is there any way to control the rate at which the file stream is read?

I'm afraid not. Spark Streaming is time-driven. You could use Flink instead, which offers a data-driven model.

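You can use "spark.streaming.backpressure.enabled" and "spark.streaming.backpressure.initialRate" to control the rate at which data is received.

If your files are CSV files, you can use Structured Streaming to read them into a streaming DataFrame with maxFilesPerTrigger, like this: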

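    import org.apache.spark.sql.types._

    // Read at most 10 new files per micro-batch from the source directory.
    val streamDf = spark.readStream
      .option("maxFilesPerTrigger", "10")
      .schema(StructType(Seq(StructField("some_field", StringType))))
      .csv("/directory/of/files")

Isn't backpressure meant for Kafka streaming? Can you give me an example of how to do this?

These are Spark configurations, as the name "spark.streaming.backpressure" makes clear. It works well with Kafka; I have not tested other sources. spark.streaming.kafka.maxRatePerPartition is the Kafka-specific configuration.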
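
The idea behind maxFilesPerTrigger, capping how many new files each batch picks up, can also be sketched outside Spark. The following is a minimal plain-Java sketch and not Spark's implementation: the class name BoundedFileBatcher, its nextBatch method, the oldest-first ordering, and the in-memory set of already returned files are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: hands out the files of a directory in bounded batches.
final class BoundedFileBatcher {
    private final Path sourceDir;
    private final int maxFilesPerBatch;
    // Files already handed out; kept in memory only, so it is lost on restart.
    private final Set<Path> alreadyReturned = new HashSet<>();

    BoundedFileBatcher(Path sourceDir, int maxFilesPerBatch) {
        if (maxFilesPerBatch <= 0) {
            throw new IllegalArgumentException("maxFilesPerBatch must be positive");
        }
        this.sourceDir = sourceDir;
        this.maxFilesPerBatch = maxFilesPerBatch;
    }

    // Returns up to maxFilesPerBatch regular files that were not returned before,
    // oldest modification time first, with the path as a tie-breaker.
    List<Path> nextBatch() throws IOException {
        Comparator<Path> oldestFirst = (a, b) -> {
            int byTime = Long.compare(lastModified(a), lastModified(b));
            return byTime != 0 ? byTime : a.compareTo(b);
        };
        try (Stream<Path> files = Files.list(sourceDir)) {
            List<Path> batch = files
                    .filter(Files::isRegularFile)
                    .filter(p -> !alreadyReturned.contains(p))
                    .sorted(oldestFirst)
                    .limit(maxFilesPerBatch)
                    .collect(Collectors.toList());
            alreadyReturned.addAll(batch);
            return batch;
        }
    }

    private static long lastModified(Path p) {
        try {
            return Files.getLastModifiedTime(p).toMillis();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Usage: pass a directory as the first argument (defaults to the current one).
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        BoundedFileBatcher batcher = new BoundedFileBatcher(dir, 10);
        List<Path> batch;
        while (!(batch = batcher.nextBatch()).isEmpty()) {
            System.out.println("Batch of " + batch.size() + " file(s): " + batch);
        }
    }
}
```

One possible use, under the same assumptions, is to let files arrive in a separate landing folder and move only one such batch per interval into the folder that textFileStream watches, so that each 1-minute batch sees a bounded number of new files.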
