Scala: read a CSV file in Spark and insert it into HBase using the resulting RDD

Tags: scala, apache-spark, hbase, classpath, urlclassloader

I am able to insert data into an HBase table using the Put method. However, I now want to read the data from a CSV file and write it into the HBase table.

I wrote this code:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object Read_CSV_Method2 {

  case class HotelBooking(hotelName: String, is_cancelled: String, reservation_status: String,
                          reservation_status_date: String)

  def main(args: Array[String]) {

    Logger.getLogger("org").setLevel(Level.ERROR)
    val sparkConf = new SparkConf().setAppName("Spark_Hbase_Connection").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()

    // establish a connection
    val connection: Connection = ConnectionFactory.createConnection(conf)

    // Table on which different commands have to be run.
    val tableName = connection.getTable(TableName.valueOf("hotelTable"))

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val textRDD = sc.textFile("/Users/sukritmehta/Documents/hotel_bookings.csv")

    var c = 1
    textRDD.foreachPartition { iter =>
      // HBase connections are not serializable, so open one per partition on the executor.
      val hbaseConf = HBaseConfiguration.create()
      val connection: Connection = ConnectionFactory.createConnection(hbaseConf)
      val myTable = connection.getTable(TableName.valueOf("hotelTable"))
      iter.foreach { f =>
        val col = f.split(",")
        val insertData = new Put(Bytes.toBytes(c.toString()))
        insertData.addColumn(Bytes.toBytes("hotelFam"), Bytes.toBytes("hotelName"), Bytes.toBytes(col(0).toString()))
        insertData.addColumn(Bytes.toBytes("hotelFam"), Bytes.toBytes("is_canceled"), Bytes.toBytes(col(1).toString()))
        insertData.addColumn(Bytes.toBytes("hotelFam"), Bytes.toBytes("reservation_status"), Bytes.toBytes(col(30).toString()))
        insertData.addColumn(Bytes.toBytes("hotelFam"), Bytes.toBytes("reservation_status_date"), Bytes.toBytes(col(31).toString()))
        c = c + 1
        myTable.put(insertData)
      }
    }
  }
}
But after running this code, I get the following error:

Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/InputSplitWithLocationInfo
    at org.apache.spark.rdd.HadoopRDD.getPreferredLocations(HadoopRDD.scala:356)
    at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:275)
    at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:275)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:274)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:2003)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:2014)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:2013)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:2013)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:2013)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:2011)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:2011)
    at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1977)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$17.apply(DAGScheduler.scala:1112)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$17.apply(DAGScheduler.scala:1110)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1110)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1069)
    at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1013)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2065)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapred.InputSplitWithLocationInfo
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 32 more
My pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>Spark_Hbase</groupId>
  <artifactId>Spark_Hbase</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.6.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <dependencies>
        <!-- Scala and Spark dependencies -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.11</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-avro_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
            <scope>provided</scope>
        </dependency>
        <!-- <dependency> <groupId>org.codehaus.janino</groupId> <artifactId>commons-compiler</artifactId> 
            <version>3.0.7</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> 
            <artifactId>spark-network-common_2.11</artifactId> <version>2.1.1</version> 
            </dependency> -->
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>2.9.9</version>
        </dependency>

        <!-- <dependency> <groupId>org.mongodb.spark</groupId> <artifactId>mongo-spark-connector_2.10</artifactId> 
            <version>2.1.1</version> </dependency> -->
        <!-- <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-mllib_2.10</artifactId> 
            <version>2.1.1</version> </dependency> -->

        <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase -->


        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.2.4</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.2.1</version>
        </dependency>



        <dependency>
              <groupId>org.apache.hbase</groupId>
              <artifactId>hbase-spark</artifactId>
              <version>2.0.0-alpha4</version> <!-- Hortonworks Latest -->
        </dependency>


        <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-mapreduce -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-mapreduce</artifactId>
            <version>2.2.4</version>
        </dependency>


    </dependencies>
</project>

It would be great if someone could help me resolve this issue.

I think you are missing the following dependency, which provides the missing class org.apache.hadoop.mapred.InputSplitWithLocationInfo:

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.6.5</version>
            <scope>provided</scope>
        </dependency>


Thanks for the quick reply. After adding this dependency, my previous error goes away, but now I get a new error in the console: ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NoSuchMethodError: org.apache.hadoop.mapred.TaskID.<init>(Lorg/apache/hadoop/mapreduce/JobID;Lorg/apache/hadoop/mapreduce/TaskType;I)V at org.apache.spark.rdd.HadoopRDD$.addLocalConfiguration(HadoopRDD.scala:402)

Please upgrade all dependencies of the group org.apache.hadoop to the same version, 2.6.5.
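
One way to follow that advice, as a minimal sketch: assuming Spark 2.4.0 here is built against Hadoop 2.6.x, remove the old hadoop-core 1.2.1 dependency (its outdated org.apache.hadoop.mapred.TaskID class is the most likely source of the NoSuchMethodError) and pin every org.apache.hadoop artifact to one version through a property. The hadoop.version property name is only illustrative:

<!-- Sketch: keep all org.apache.hadoop artifacts on a single version.
     Assumes hadoop-core 1.2.1 has been removed from <dependencies>. -->
<properties>
    <hadoop.version>2.6.5</hadoop.version>
</properties>

<!-- Inside the existing <dependencies> section, in place of hadoop-core: -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>${hadoop.version}</version>
    <scope>provided</scope>
</dependency>

Running mvn dependency:tree -Dincludes=org.apache.hadoop afterwards shows whether hbase-client, hbase-spark or hbase-mapreduce still pull in a different Hadoop version transitively.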