
Does Spark with Cassandra create two datasets or stages that work on the same logic?


I am trying to read entries from a CSV file and insert them into a database. I found that Spark internally created two RDD partitions, i.e. rdd_0_0 and rdd_0_1, which work on the same data and both do all the processing. Can anyone help me figure out why the call method is invoked twice over different datasets?

If two datasets/stages are created, why do they both run the same logic? Please help me confirm whether this is valid Spark behavior.

    public final class TestJavaAggregation1 implements Serializable {
        private static final long serialVersionUID = 1L;
        static CassandraConfig config = null;
        static PreparedStatement statement = null;
        private transient SparkConf conf;
        private PersonAggregationRowWriterFactory aggregationWriter = new PersonAggregationRowWriterFactory();
        public Session session;

        private TestJavaAggregation1(SparkConf conf) {
            this.conf = conf;
        }

        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("ReadFromCSVFile").setMaster("local[1]").set("spark.executor.memory", "1g");
            conf.set("spark.cassandra.connection.host", "localhost");
            TestJavaAggregation1 app = new TestJavaAggregation1(conf);
            app.run();
        }

        private void run() {
            JavaSparkContext sc = new JavaSparkContext(conf);
            aggregateData(sc);
            sc.stop();
        }

        private JavaRDD<String> sparkConfig(JavaSparkContext sc) {
            JavaRDD<String> lines = sc.textFile("PersonAggregation1_500.csv", 1);
            System.out.println(lines.getCheckpointFile());
            lines.cache();
            final String heading = lines.first();
            System.out.println(heading);
            String headerValues = heading.replaceAll("\t", ",");
            System.out.println(headerValues);
            CassandraConnector connector = CassandraConnector.apply(sc.getConf());
            session = connector.openSession();
            try {
                session.execute("DROP KEYSPACE IF EXISTS java_api5");
                session.execute("CREATE KEYSPACE java_api5 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
                session.execute("CREATE TABLE java_api5.person (hashvalue INT, id INT, state TEXT, city TEXT, country TEXT, full_name TEXT, PRIMARY KEY((hashvalue), id, state, city, country, full_name)) WITH CLUSTERING ORDER BY (id DESC);");
            } catch (Exception e) {
                e.printStackTrace();
            }
            return lines;
        }

        @SuppressWarnings("serial")
        public void aggregateData(JavaSparkContext sc) {
            JavaRDD<String> lines = sparkConfig(sc);
            System.out.println("FirstRDD" + lines.partitions().size());
            JavaRDD<PersonAggregation> result = lines.map(new Function<String, PersonAggregation>() {
                int i = 0;
                public PersonAggregation call(String line) {
                    PersonAggregation aggregate = new PersonAggregation();
                    line = line + "," + this.hashCode();
                    String[] parts = line.split(",");
                    aggregate.setId(Integer.valueOf(parts[0]));
                    aggregate.setFull_name(parts[1]);
                    aggregate.setState(parts[4]);
                    aggregate.setCity(parts[5]);
                    aggregate.setCountry(parts[6]);
                    aggregate.setHashValue(Integer.valueOf(parts[7]));
                    // *The save below inserts 200 entries into the database, while the CSV file has only 100 records*
                    saveToJavaCassandra(aggregate);
                    return aggregate;
                }
            });
            System.out.println(result.collect().size());
            List<PersonAggregation> personAggregationList = result.collect();
            JavaRDD<PersonAggregation> aggregateRDD = sc.parallelize(personAggregationList);
            javaFunctions(aggregateRDD).writerBuilder("java_api5", "person",
                aggregationWriter).saveToCassandra();
        }
    }

Please see the logs below:

    15/05/29 12:40:37 INFO FileInputFormat: Total input paths to process : 1
    15/05/29 12:40:37 INFO SparkContext: Starting job: first at TestJavaAggregation1.java:89
    15/05/29 12:40:37 INFO DAGScheduler: Got job 0 (first at TestJavaAggregation1.java:89) with 1 output partitions (allowLocal=true)
    15/05/29 12:40:37 INFO DAGScheduler: Final stage: Stage 0(first at TestJavaAggregation1.java:89)
    15/05/29 12:40:37 INFO DAGScheduler: Parents of final stage: List()
    15/05/29 12:40:37 INFO DAGScheduler: Missing parents: List()
    15/05/29 12:40:37 INFO DAGScheduler: Submitting Stage 0 (PersonAggregation_5.csv MappedRDD[1] at textFile at TestJavaAggregation1.java:84), which has no missing parents
    15/05/29 12:40:37 INFO MemoryStore: ensureFreeSpace(2560) called with curMem=157187, maxMem=1009589944
    15/05/29 12:40:37 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.5 KB, free 962.7 MB)
    15/05/29 12:40:37 INFO MemoryStore: ensureFreeSpace(1897) called with curMem=159747, maxMem=1009589944
    15/05/29 12:40:37 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1897.0 B, free 962.7 MB)
    15/05/29 12:40:37 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54664 (size: 1897.0 B, free: 962.8 MB)
    15/05/29 12:40:37 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
    15/05/29 12:40:37 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
    15/05/29 12:40:37 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PersonAggregation_5.csv MappedRDD[1] at textFile at TestJavaAggregation1.java:84)
    15/05/29 12:40:37 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
    15/05/29 12:40:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1326 bytes)
    15/05/29 12:40:37 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
    15/05/29 12:40:37 INFO CacheManager: Partition rdd_1_0 not found, computing it
    15/05/29 12:40:37 INFO HadoopRDD: Input split: file:/F:/workspace/apoorva/TestProject/PersonAggregation_5.csv:0+230
    15/05/29 12:40:37 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
    15/05/29 12:40:37 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
    15/05/29 12:40:37 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
    15/05/29 12:40:37 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
    15/05/29 12:40:37 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
    15/05/29 12:40:37 INFO MemoryStore: ensureFreeSpace(680) called with curMem=161644, maxMem=1009589944
    15/05/29 12:40:37 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 680.0 B, free 962.7 MB)
    15/05/29 12:40:37 INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:54664 (size: 680.0 B, free: 962.8 MB)
    15/05/29 12:40:37 INFO BlockManagerMaster: Updated info of block rdd_1_0
    15/05/29 12:40:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2335 bytes result sent to driver
    15/05/29 12:40:37 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 73 ms on localhost (1/1)
    15/05/29 12:40:37 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
    15/05/29 12:40:37 INFO DAGScheduler: Stage 0 (first at TestJavaAggregation1.java:89) finished in 0.084 s
    15/05/29 12:40:37 INFO DAGScheduler: Job 0 finished: first at TestJavaAggregation1.java:89, took 0.129536 s
    1,FName1,MName1,LName1,state1,city1,country1
    1,FName1,MName1,LName1,state1,city1,country1
    15/05/29 12:40:37 INFO Cluster: New Cassandra host localhost/127.0.0.1:9042 added
    15/05/29 12:40:37 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
    FirstRDD1
    SecondRDD1
    15/05/29 12:40:47 INFO SparkContext: Starting job: collect at TestJavaAggregation1.java:147
    15/05/29 12:40:47 INFO DAGScheduler: Got job 1 (collect at TestJavaAggregation1.java:147) with 1 output partitions (allowLocal=false)
    15/05/29 12:40:47 INFO DAGScheduler: Final stage: Stage 1(collect at TestJavaAggregation1.java:147)
    15/05/29 12:40:47 INFO DAGScheduler: Parents of final stage: List()
    15/05/29 12:40:47 INFO DAGScheduler: Missing parents: List()
    15/05/29 12:40:47 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[2] at map at TestJavaAggregation1.java:117), which has no missing parents
    15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(3872) called with curMem=162324, maxMem=1009589944
    15/05/29 12:40:47 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.8 KB, free 962.7 MB)
    15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(2604) called with curMem=166196, maxMem=1009589944
    15/05/29 12:40:47 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.5 KB, free 962.7 MB)
    15/05/29 12:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54664 (size: 2.5 KB, free: 962.8 MB)
    15/05/29 12:40:47 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
    15/05/29 12:40:47 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
    15/05/29 12:40:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[2] at map at TestJavaAggregation1.java:117)
    15/05/29 12:40:47 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
    15/05/29 12:40:47 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1326 bytes)
    15/05/29 12:40:47 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
    15/05/29 12:40:47 INFO BlockManager: Found block rdd_1_0 locally
    com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state1,city1,country1
    15/05/29 12:40:47 INFO DCAwareRoundRobinPolicy: Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
    15/05/29 12:40:47 INFO Cluster: New Cassandra host localhost/127.0.0.1:9042 added
    Connected to cluster: Test Cluster
    Datacenter: datacenter1; Host: localhost/127.0.0.1; Rack: rack1
    com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state2,city2,country1
    com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state3,city3,country1
    com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state4,city4,country1
    com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state5,city5,country1
    15/05/29 12:40:47 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2343 bytes result sent to driver
    15/05/29 12:40:47 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 184 ms on localhost (1/1)
    15/05/29 12:40:47 INFO DAGScheduler: Stage 1 (collect at TestJavaAggregation1.java:147) finished in 0.185 s
    15/05/29 12:40:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
    15/05/29 12:40:47 INFO DAGScheduler: Job 1 finished: collect at TestJavaAggregation1.java:147, took 0.218779 s
    ______________________________5_______________________________
    15/05/29 12:40:47 INFO SparkContext: Starting job: collect at TestJavaAggregation1.java:150
    15/05/29 12:40:47 INFO DAGScheduler: Got job 2 (collect at TestJavaAggregation1.java:150) with 1 output partitions (allowLocal=false)
    15/05/29 12:40:47 INFO DAGScheduler: Final stage: Stage 2(collect at TestJavaAggregation1.java:150)
    15/05/29 12:40:47 INFO DAGScheduler: Parents of final stage: List()
    15/05/29 12:40:47 INFO DAGScheduler: Missing parents: List()
    15/05/29 12:40:47 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[2] at map at TestJavaAggregation1.java:117), which has no missing parents
    15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(3872) called with curMem=168800, maxMem=1009589944
    15/05/29 12:40:47 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.8 KB, free 962.7 MB)
    15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(2604) called with curMem=172672, maxMem=1009589944
    15/05/29 12:40:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.5 KB, free 962.7 MB)
    15/05/29 12:40:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:54664 (size: 2.5 KB, free: 962.8 MB)
    15/05/29 12:40:47 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
    15/05/29 12:40:47 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:838
    15/05/29 12:40:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 2 (MappedRDD[2] at map at TestJavaAggregation1.java:117)
    15/05/29 12:40:47 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
    15/05/29 12:40:47 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1326 bytes)
    15/05/29 12:40:47 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
    15/05/29 12:40:47 INFO BlockManager: Found block rdd_1_0 locally
    com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state1,city1,country1
    com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state2,city2,country1
    com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state3,city3,country1
    com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state4,city4,country1
    com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state5,city5,country1
    15/05/29 12:40:47 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 2343 bytes result sent to driver
    15/05/29 12:40:47 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 16 ms on localhost (1/1)
    15/05/29 12:40:47 INFO DAGScheduler: Stage 2 (collect at TestJavaAggregation1.java:150) finished in 0.016 s
    15/05/29 12:40:47 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
    15/05/29 12:40:47 INFO DAGScheduler: Job 2 finished: collect at TestJavaAggregation1.java:150, took 0.026302 s

When you run a Spark cluster and submit a Spark job, Spark distributes the data across the cluster in the form of RDDs, and the data partitioning is handled by Spark. When you create the lines RDD in your sparkConfig method by reading a file, Spark partitions the data and internally creates RDD partitions, so that in-memory computation is performed on the distributed data across the RDD partitions in the cluster. Your JavaRDD lines is therefore internally a union of individual RDD partitions. When you run a map job on the JavaRDD lines, it runs over all the data divided among the internal RDD partitions of the JavaRDD on which the map function is called. In your case, Spark created two internal partitions of the JavaRDD, which is why the map function is called twice, once for each internal partition. Let me know if you have any further questions.
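The idea above can be sketched in plain Java without Spark (a minimal illustration, not Spark's actual scheduler): a driver splits one logical dataset into partitions and applies the same map function independently to each partition, the way Spark runs one task per partition of a single RDD.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Illustrative sketch only: one logical dataset, several partitions,
// and the same map logic applied independently to each partition.
public class PartitionMapDemo {
    static <T, R> List<R> mapPerPartition(List<List<T>> partitions, Function<T, R> fn) {
        List<R> out = new ArrayList<>();
        for (List<T> partition : partitions) {   // one "task" per partition
            for (T record : partition) {
                out.add(fn.apply(record));       // fn is invoked once per record
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two partitions of one logical dataset, like rdd_0_0 and rdd_0_1.
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("row1", "row2"),
            Arrays.asList("row3", "row4"));
        List<Integer> lengths = mapPerPartition(partitions, String::length);
        // The same logic ran over both partitions, but there is still
        // only one dataset and one set of results.
        System.out.println(lengths.size()); // 4
    }
}
```

Each partition sees the same map logic, but each record is processed exactly once, so having two partitions by itself does not duplicate work.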

Spark created one RDD (rdd_0) containing two partitions, rdd_0_0 and rdd_0_1.
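There may also be a second effect visible in the logs: the question's aggregateData calls collect() twice on result, and the logs show two separate jobs (Stage 1 and Stage 2) both computing MappedRDD[2]. Because RDDs are lazy, each action recomputes the lineage unless the result is cached, so a side effect inside map (the save call) runs once per action. A minimal sketch in plain Java (not Spark itself; the names are hypothetical) reproduces the doubled saves:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Illustrative sketch only: a lazy pipeline whose map function has a side
// effect (standing in for saveToJavaCassandra). Every "action" re-runs the
// whole pipeline, so two collect()-style traversals double the saves.
public class LazyPipelineDemo {
    static AtomicInteger saves = new AtomicInteger();

    static List<String> runPipeline(List<String> input, Function<String, String> fn) {
        List<String> out = new ArrayList<>();
        for (String record : input) {
            saves.incrementAndGet();   // side effect inside map
            out.add(fn.apply(record));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList("r1", "r2", "r3"); // pretend CSV records
        Function<String, String> mapFn = s -> s.toUpperCase();

        // Un-cached: every action recomputes the pipeline from scratch.
        runPipeline(csv, mapFn);            // first collect()
        runPipeline(csv, mapFn);            // second collect() -> saves doubled
        System.out.println(saves.get());    // 6, i.e. 2x the record count

        // Computing once and reusing the materialized result (what caching
        // the mapped RDD would do) keeps the saves at the record count.
        saves.set(0);
        List<String> cached = runPipeline(csv, mapFn);
        System.out.println(saves.get());    // 3
    }
}
```

This matches the symptom in the question's comment (200 inserts for 100 CSV records): if the side effect must stay inside map, caching the mapped RDD before the first action, or collecting only once, would avoid the recomputation.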