
Filtering Large Data in Java Spark

Tags: java, apache-spark, filter, apache-spark-sql, apache-spark-dataset

I have some records that contain a creation time in JSON format. Before doing the actual processing, I want to filter out the data created after December 31, 11:59:59 PM (1577836740000). But the filter query doesn't work on data this large. Any suggestions on how to filter it?

Here is what I am doing now:

 Dataset<Row> testData = rawDataSet.select(cols)
                .filter(col("name").equalTo("dummy string"))
                ...
                ... // some filters
                ...
                .filter(col("creationTime").gt(1577836740000L));

Please use the following code as an example:

import spark.implicits._

val inputData = """[{ "name":"vaquar khan","RecordNumber": 2, "Zipcode": 704, "ZipCodeType": "STANDARD", "City": "PASEO COSTA DEL SUR", "State": "PR"},{"name":"Zidan khan", "RecordNumber": 10, "Zipcode": 709, "ZipCodeType": "STANDARD", "City": "BDA SAN LUIS", "State": "PR"}]"""

val inputDataDf = spark.read.json(Seq(inputData).toDS)

inputDataDf.show()

scala> inputDataDf.show
+-------------------+------------+-----+-----------+-------+-----------+
|               City|RecordNumber|State|ZipCodeType|Zipcode|       name|
+-------------------+------------+-----+-----------+-------+-----------+
|PASEO COSTA DEL SUR|           2|   PR|   STANDARD|    704|vaquar khan|
|       BDA SAN LUIS|          10|   PR|   STANDARD|    709| Zidan khan|
+-------------------+------------+-----+-----------+-------+-----------+


inputDataDf.filter("Zipcode  not like '704%' ").show

scala> inputDataDf.filter("Zipcode not like '704%' ").show
+------------+------------+-----+-----------+-------+----------+
|        City|RecordNumber|State|ZipCodeType|Zipcode|      name|
+------------+------------+-----+-----------+-------+----------+
|BDA SAN LUIS|          10|   PR|   STANDARD|    709|Zidan khan|
+------------+------------+-----+-----------+-------+----------+
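Note that `Zipcode` is a numeric column, so `not like '704%'` implicitly casts each value to a string before the prefix match. The predicate's logic can be sketched off-cluster in plain Java (the `keep` helper and the sample values are only illustrations, not part of the original code):

```java
public class ZipFilterSketch {
    // Mirrors the SQL predicate: Zipcode NOT LIKE '704%'
    static boolean keep(int zipcode) {
        return !String.valueOf(zipcode).startsWith("704");
    }

    public static void main(String[] args) {
        System.out.println(keep(704)); // false -> the 704 row is dropped
        System.out.println(keep(709)); // true  -> the 709 row survives, as shown above
    }
}
```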

What error are you getting, and what is the schema of your data? I tried it with your data and it works fine; please provide more details.
The Unix time you are using (1577836740000L) is not December 31 23:59:59, it is December 31 23:59:00. Use 1577836799000 in the filter for December 31 23:59:59. Since the dates you show are before December 31 23:59:59, the filter returns no rows.
@Nikk My bad, it was a caching issue.
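The off-by-59-seconds point from the comments can be verified off-cluster with `java.time`; a minimal check (this snippet is not from the original post):

```java
import java.time.Instant;

public class EpochCheck {
    public static void main(String[] args) {
        // The literal from the question: one minute before midnight UTC, not 23:59:59
        System.out.println(Instant.ofEpochMilli(1577836740000L)); // 2019-12-31T23:59:00Z
        // Last second of December 31, 2019 (UTC), as the comment suggests
        System.out.println(Instant.ofEpochMilli(1577836799000L)); // 2019-12-31T23:59:59Z
    }
}
```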