Filtering large data in Java Spark
I have some records in JSON format that include a creation time. Before doing any real processing on the data, I want to filter out the records created after December 31 at 11:59:59 PM (1577836740000). However, the filter query does not work on data this large. Any suggestions on how to filter it? What I am doing right now is the following:
Dataset<Row> testData = rawDataSet.select(cols)
    .filter(col("name").equalTo("dummy string"))
    ...
    ... // some filter
    ...
    .filter(col("creationTime").gt(1577836740000L));
Please use the following code as an example:
import spark.implicits._
val inputData = """[{ "name":"vaquar khan","RecordNumber": 2, "Zipcode": 704, "ZipCodeType": "STANDARD", "City": "PASEO COSTA DEL SUR", "State": "PR"},{"name":"Zidan khan", "RecordNumber": 10, "Zipcode": 709, "ZipCodeType": "STANDARD", "City": "BDA SAN LUIS", "State": "PR"}]"""
val inputDataDf = spark.read.json(Seq(inputData).toDS)
inputDataDf.show()
+-------------------+------------+-----+-----------+-------+-----------+
| City|RecordNumber|State|ZipCodeType|Zipcode| name|
+-------------------+------------+-----+-----------+-------+-----------+
|PASEO COSTA DEL SUR| 2| PR| STANDARD| 704|vaquar khan|
| BDA SAN LUIS| 10| PR| STANDARD| 709| Zidan khan|
+-------------------+------------+-----+-----------+-------+-----------+
inputDataDf.filter("Zipcode not like '704%'").show()
+------------+------------+-----+-----------+-------+----------+
| City|RecordNumber|State|ZipCodeType|Zipcode| name|
+------------+------------+-----+-----------+-------+----------+
|BDA SAN LUIS| 10| PR| STANDARD| 709|Zidan khan|
+------------+------------+-----+-----------+-------+----------+
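Since the question is tagged Java, a rough Java equivalent of the Scala snippet above might look like this (the local-mode SparkSession and the class name are added only for illustration; the sample JSON and the not like '704%' filter are the same as in the Scala version):

import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonFilterExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("json-filter-example")
                .master("local[*]")
                .getOrCreate();

        // Same two sample records as in the Scala snippet.
        String inputData = "[{\"name\":\"vaquar khan\",\"RecordNumber\":2,\"Zipcode\":704,"
                + "\"ZipCodeType\":\"STANDARD\",\"City\":\"PASEO COSTA DEL SUR\",\"State\":\"PR\"},"
                + "{\"name\":\"Zidan khan\",\"RecordNumber\":10,\"Zipcode\":709,"
                + "\"ZipCodeType\":\"STANDARD\",\"City\":\"BDA SAN LUIS\",\"State\":\"PR\"}]";

        // Read the JSON array from an in-memory Dataset<String>, as the Scala example does.
        Dataset<Row> inputDataDf = spark.read().json(
                spark.createDataset(Collections.singletonList(inputData), Encoders.STRING()));

        inputDataDf.show();
        inputDataDf.filter("Zipcode not like '704%'").show();

        spark.stop();
    }
}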
What error are you getting, and what is the schema of your data? I tried it with your data and it works fine; please provide more details.
The Unix time you are using (1577836740000L) is not December 31 23:59:59 but December 31 23:59:00; for a December 31 23:59:59 cutoff use 1577816999. Since the dates you showed are all before December 31 23:59:59, the filter will not return any rows.
@Nikk My bad, it was a caching issue.
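To double-check which instant a given epoch-millisecond value actually represents before putting it into the filter, a small java.time check can help (UTC is assumed below, since the comments do not state a timezone):

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class EpochCheck {
    public static void main(String[] args) {
        // The value used in the question's filter: prints 2019-12-31T23:59:00Z.
        System.out.println(Instant.ofEpochMilli(1577836740000L));

        // Build the intended cutoff explicitly instead of hard-coding a magic number
        // (UTC assumed; adjust the zone if creationTime is recorded in local time).
        ZonedDateTime cutoff = ZonedDateTime.of(2019, 12, 31, 23, 59, 59, 0, ZoneOffset.UTC);
        System.out.println(cutoff.toInstant().toEpochMilli()); // prints 1577836799000
    }
}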