
Scala Spark Streaming: error in fileStream()


I am trying to implement a Spark Streaming application in Scala. I want to use the fileStream() method to process newly arriving files as well as the old files already present in a Hadoop directory.

I learned about the fileStream() implementation from the following two StackOverflow threads:

I am using fileStream() as below:

val linesRDD = ssc.fileStream[LongWritable, Text, TextInputFormat](inputDirectory, (t: org.apache.hadoop.fs.Path) => true, false).map(_._2.toString)

But I am getting the error message below:

type arguments [org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,
org.apache.hadoop.mapred.TextInputFormat] conform to the bounds of none of the overloaded alternatives of value fileStream: [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String, filter: org.apache.hadoop.fs.Path ⇒ Boolean, newFilesOnly: Boolean, conf: org.apache.hadoop.conf.Configuration)(implicit evidence$12: scala.reflect.ClassTag[K], implicit evidence$13: scala.reflect.ClassTag[V], implicit evidence$14: scala.reflect.ClassTag[F])
org.apache.spark.streaming.dstream.InputDStream[(K, V)] <and> 
[K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory:
String, filter: org.apache.hadoop.fs.Path ⇒ Boolean, newFilesOnly: Boolean)(implicit evidence$9: scala.reflect.ClassTag[K], implicit evidence$10: scala.reflect.ClassTag[V], 
implicit evidence$11: scala.reflect.ClassTag[F])
org.apache.spark.streaming.dstream.InputDStream[(K, V)] <and> [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String)(implicit evidence$6: scala.reflect.ClassTag[K], implicit evidence$7: scala.reflect.ClassTag[V], implicit evidence$8: scala.reflect.ClassTag[F])
org.apache.spark.streaming.dstream.InputDStream[(K, V)]

wrong number of type parameters for overloaded method value fileStream with alternatives: 
[K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String, filter: org.apache.hadoop.fs.Path ⇒ Boolean, newFilesOnly: Boolean, conf: org.apache.hadoop.conf.Configuration)(implicit evidence$12: scala.reflect.ClassTag[K], implicit evidence$13: scala.reflect.ClassTag[V], implicit evidence$14: scala.reflect.ClassTag[F])
org.apache.spark.streaming.dstream.InputDStream[(K, V)] <and> [K, V, F <:     org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String, filter: org.apache.hadoop.fs.Path ⇒ Boolean, newFilesOnly: Boolean)(implicit evidence$9: scala.reflect.ClassTag[K], implicit evidence$10: scala.reflect.ClassTag[V], implicit evidence$11: scala.reflect.ClassTag[F])
org.apache.spark.streaming.dstream.InputDStream[(K, V)] <and> 
[K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String)(implicit evidence$6: scala.reflect.ClassTag[K], implicit evidence$7: scala.reflect.ClassTag[V], implicit evidence$8: scala.reflect.ClassTag[F])
org.apache.spark.streaming.dstream.InputDStream[(K, V)] 
Please find the sample Java code below; with the correct imports it works fine for me:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

JavaStreamingContext jssc = SparkUtils.getStreamingContext("key", jsc);
//      JavaDStream<String> rawInput = jssc.textFileStream(inputPath);

        JavaPairInputDStream<LongWritable, Text> inputStream = jssc.fileStream(
                inputPath, LongWritable.class, Text.class,
                TextInputFormat.class, new Function<Path, Boolean>() {
                    @Override
                    public Boolean call(Path v1) throws Exception {
                        if ( v1.getName().contains("COPYING") ) {
                            // This eliminates staging files.
                            return Boolean.FALSE;
                        }
                        return Boolean.TRUE;
                    }
                }, true); // newFilesOnly = true: only files arriving after the stream starts are processed
        JavaDStream<String> rawInput = inputStream.map(
                  new Function<Tuple2<LongWritable, Text>, String>() {
                    @Override
                    public String call(Tuple2<LongWritable, Text> v1) throws Exception {
                      return v1._2().toString();
                    }
                });
        log.info(tracePrefix + "Created the stream, Window Interval: " + windowInterval + ", Slide interval: " + slideInterval);
        rawInput.print();
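
For comparison, the same approach in Scala (the language the question was asked in) might look roughly like the sketch below. The application name, batch interval and input path are placeholders, not taken from the question; the important detail is that TextInputFormat must come from org.apache.hadoop.mapreduce.lib.input rather than org.apache.hadoop.mapred, because fileStream bounds its F type parameter by org.apache.hadoop.mapreduce.InputFormat[K,V] (the bound visible in the error message above).

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat // new-API input format, not org.apache.hadoop.mapred
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileStreamExample") // placeholder app name
    val ssc = new StreamingContext(conf, Seconds(30))          // placeholder batch interval
    val inputDirectory = "hdfs:///path/to/input"               // placeholder HDFS directory

    // Skip HDFS staging files, and pass newFilesOnly = false so that files already
    // present in the directory are picked up along with newly arriving ones.
    val filter = (path: Path) => !path.getName.contains("COPYING")
    val linesRDD = ssc
      .fileStream[LongWritable, Text, TextInputFormat](inputDirectory, filter, newFilesOnly = false)
      .map(_._2.toString)

    linesRDD.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Note that, depending on the Spark version, newFilesOnly = false may still only consider files whose modification time falls within a limited look-back window, so very old files are not guaranteed to be picked up.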

Thanks Lokesh, it worked for me too. I had imported the wrong package for TextInputFormat, i.e. import org.apache.hadoop.mapred.TextInputFormat; the correct one is import org.apache.hadoop.mapreduce.lib.input.TextInputFormat. I need one more piece of help from you: in your question you were trying to get the names of the files processed by Spark. Could you guide me? I am trying to get the names of the files processed by Spark so that I can delete them or move them to a different directory.

Hi, unfortunately Spark RDDs don't provide an API mechanism for getting the file names. The only suggestion I got from StackOverflow was to print the RDD as a debugString and parse the file names out of it, but that is a rather dirty hack.

Not to dig up an old question, but the answer here seems to work, though I'm not sure about its performance implications.
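
For what it's worth, the debugString hack mentioned in the comments could be sketched roughly as below, assuming linesRDD is the DStream built in the question. It relies on the input file paths showing up in the RDD lineage string, which is version-dependent behaviour and not a supported API, so the hdfs:// pattern used here is an assumption:

linesRDD.foreachRDD { rdd =>
  // Dirty hack: the lineage string of a fileStream batch usually embeds the
  // HDFS paths of the files read in that batch (not a supported API).
  val lineage = rdd.toDebugString
  val pathPattern = "hdfs://[^\\s\\[\\]]+".r // assumed shape of the file paths
  val processedFiles = pathPattern.findAllIn(lineage).toSet
  processedFiles.foreach(f => println("Processed file: " + f))
  // The paths could then be moved or deleted with the Hadoop FileSystem API.
}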