Apache Spark: skip the first few lines of a file in Spark

I have Spark 2.0 code that reads .gz (text) files and writes them to a Hive table. How can I ignore the first two lines of every file? I just want to skip the first two lines.
SparkSession spark = SparkSession
        .builder()
        .master("local")
        .appName("SparkSessionFiles")
        .config("spark.some.config.option", "some-value")
        .enableHiveSupport()
        .getOrCreate();

JavaRDD<mySchema> peopleRDD = spark.read()
        .textFile("file:///app/home/emm/zipfiles/myzips/")
        .javaRDD()
        .map(new Function<String, mySchema>()
        {
            @Override
            public mySchema call(String line) throws Exception
            {
                String[] parts = line.split(";");
                mySchema mySchema = new mySchema();
                mySchema.setCFIELD1(parts[0]);
                mySchema.setCFIELD2(parts[1]);
                mySchema.setCFIELD3(parts[2]);
                mySchema.setCFIELD4(parts[3]);
                mySchema.setCFIELD5(parts[4]);
                return mySchema;
            }
        });

// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> myDF = spark.createDataFrame(peopleRDD, mySchema.class);
myDF.createOrReplaceTempView("myView");
spark.sql("INSERT INTO myHIVEtable SELECT * FROM myView");
UPDATE: modified code

Lambdas were not working in my Eclipse, so I used regular Java syntax instead. I am now getting an exception:
.....
Function2<Integer, Iterator<String>, Iterator<String>> removeHeader =
        new Function2<Integer, Iterator<String>, Iterator<String>>() {
            public Iterator<String> call(Integer ind, Iterator<String> iterator) throws Exception {
                System.out.println("ind=" + ind);
                if ((ind == 0) && iterator.hasNext()) {
                    iterator.next();
                    iterator.next();
                    return iterator;
                } else
                    return iterator;
            }
        };
JavaRDD<mySchema> peopleRDD = spark.read()
        .textFile(path) // file:///app/home/emm/zipfiles/myzips/
        .javaRDD()
        .mapPartitionsWithIndex(removeHeader, false)
        .map(new Function<String, mySchema>()
        {
        ........
java.util.NoSuchElementException
at java.util.LinkedList.removeFirst(LinkedList.java:268)
at java.util.LinkedList.remove(LinkedList.java:683)
at org.apache.spark.sql.execution.BufferedRowIterator.next(BufferedRowIterator.java:49)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.next(WholeStageCodegenExec.scala:374)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.next(WholeStageCodegenExec.scala:368)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at com.comcast.emm.vodip.SparkSessionFiles.SparkSessionFiles$1.call(SparkSessionFiles.java:2480)
at com.comcast.emm.vodip.SparkSessionFiles.SparkSessionFiles$1.call(SparkSessionFiles.java:2476)
You can do it like this:
JavaRDD<mySchema> peopleRDD = spark.read()
        .textFile("file:///app/home/emm/zipfiles/myzips/")
        .javaRDD()
        .mapPartitionsWithIndex((index, iter) -> {
            if (index == 0 && iter.hasNext()) {
                iter.next();
                if (iter.hasNext()) {
                    iter.next();
                }
            }
            return iter;
        }, true);
...
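The NoSuchElementException in the updated question comes from calling iterator.next() twice after checking hasNext() only once: if a partition holds a single line, the second next() fails. The guarded-skip pattern used in the answer can be sketched with plain Java iterators, no Spark required; dropLeading is a hypothetical helper name for illustration:

```java
import java.util.Arrays;
import java.util.Iterator;

public class DropLeading {
    // Drop up to n leading elements, checking hasNext() before every next()
    // so a short (or empty) iterator never throws NoSuchElementException.
    public static <T> Iterator<T> dropLeading(Iterator<T> it, int n) {
        for (int i = 0; i < n && it.hasNext(); i++) {
            it.next();
        }
        return it;
    }

    public static void main(String[] args) {
        Iterator<String> it =
                dropLeading(Arrays.asList("header1", "header2", "row1").iterator(), 2);
        while (it.hasNext()) {
            System.out.println(it.next()); // prints only "row1"
        }
        // A one-line partition survives: both headers are skipped safely.
        Iterator<String> shortIt = dropLeading(Arrays.asList("header1").iterator(), 2);
        System.out.println(shortIt.hasNext()); // prints false
    }
}
```

The same guard is what the answer's lambda does inline with its nested hasNext() check.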
EDIT: I modified the code to avoid the exception.

Note that this code only removes the first 2 lines of the RDD, not the first 2 lines of each file. If you want to remove the first 2 lines of every file, I suggest building one RDD per file, applying .mapPartitionsWithIndex(...) to each of them, and then taking the union of the resulting RDDs.

Comments: Please check my updated code. Does index == 0 refer to the RDD partition or to each file? I actually want to remove the first 2 lines from every file. / I updated my answer after your comment; I hope it helps. / Which performs better: applying a filter transformation (the first two lines contain some unique values I could use as filter conditions), or mapPartitionsWithIndex followed by a union of the RDDs? / No idea... maybe try both solutions and compare. I would be interested in the comparison!
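The per-file approach suggested above (one RDD per file, skip the header in each, then union) can be simulated with plain Java collections; in real Spark code the concatenation step would be JavaSparkContext.union, and the line lists below merely stand in for per-file RDDs in this Spark-free sketch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PerFileSkip {
    // Skip the first n lines of one "file" (a list of lines stands in
    // for a per-file RDD; capped so short files do not throw).
    public static List<String> skipHeader(List<String> lines, int n) {
        return new ArrayList<>(lines.subList(Math.min(n, lines.size()), lines.size()));
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("h1", "h2", "a;b;c");
        List<String> file2 = Arrays.asList("h1", "h2", "d;e;f", "g;h;i");

        // "Union" of the cleaned files, analogous to sc.union(rdd1, rdd2).
        List<String> union = new ArrayList<>();
        union.addAll(skipHeader(file1, 2));
        union.addAll(skipHeader(file2, 2));
        System.out.println(union); // prints [a;b;c, d;e;f, g;h;i]
    }
}
```

This guarantees the skip happens per file rather than per partition, at the cost of building one RDD per input path.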
In Scala, the same first-partition skip is a one-liner (Scala's Iterator.drop(2) is already safe on iterators shorter than 2):

rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(2) else iter }