Apache Spark RAPIDS: operation not replaced with GPU version

I am new to RAPIDS and am having trouble understanding which operations are supported. I have data in the following format:
+------------+----------+
| kmer|source_seq|
+------------+----------+
|TGTCGGTTTAA$| 4|
|ACCACCACCAC$| 8|
|GCATAATTTCC$| 1|
|CCGTCAAAGCG$| 7|
|CCGTCCCGTGG$| 6|
|GCGCTGTTATG$| 2|
|GAGCATAGGTG$| 5|
|CGGCGGATTCT$| 0|
|GGCGCGAGGGT$| 3|
|CCACCACCAC$A| 8|
|CACCACCAC$AA| 8|
|CCCAAAAAAAAA| 0|
|AAGAAAAAAAAA| 5|
|AAGAAAAAAAAA| 0|
|TGTAAAAAAAAA| 0|
|CCACAAAAAAAA| 8|
|AGACAAAAAAAA| 7|
|CCCCAAAAAAAA| 0|
|CAAGAAAAAAAA| 5|
|TAAGAAAAAAAA| 0|
+------------+----------+
I am trying to find out which "source_seq" values each "kmer" appears in, using the following code:
val w = Window.partitionBy("kmer")
x.withColumn("source_seqs", collect_list("source_seq").over(w))
// Result is something like this:
+------------+----------+-----------+
| kmer|source_seq|source_seqs|
+------------+----------+-----------+
|AAAACAAGACCA| 2| [2]|
|AAAACAAGCAGC| 4| [4]|
|AAAACCACGAGC| 3| [3]|
|AAAACCGCCAAA| 7| [7]|
|AAAACCGGTGTG| 1| [1]|
|AAAACCTATATC| 5| [5]|
|AAAACGACTTCT| 6| [6]|
|AAAACGCGCAAG| 3| [3]|
|AAAAGGCCTATT| 7| [7]|
|AAAAGGCGTTCG| 3| [3]|
|AAAAGGCTGTGA| 1| [1]|
|AAAAGGTCTACC| 2| [2]|
|AAAAGTCGAGCA| 7| [7, 0]|
|AAAAGTCGAGCA| 0| [7, 0]|
|AAAATCCGATCA| 0| [0]|
|AAAATCGAGCGG| 0| [0]|
|AAAATCGTTGAA| 7| [7]|
|AAAATGGACAAG| 1| [1]|
|AAAATTGCACCA| 3| [3]|
|AAACACCGCCGT| 3| [3]|
+------------+----------+-----------+
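For intuition, the window version of collect_list attaches the full per-partition list to every row of that partition (note how both AAAAGTCGAGCA rows carry [7, 0] above). A tiny pure-Python sketch, purely illustrative and with no Spark involved, mimics that behavior on a few toy rows:

```python
from collections import defaultdict

# Toy rows of (kmer, source_seq), mirroring a slice of the DataFrame above.
rows = [
    ("AAAAGTCGAGCA", 7),
    ("AAAAGTCGAGCA", 0),
    ("AAAATCCGATCA", 0),
]

# Rough equivalent of Window.partitionBy("kmer") + collect_list("source_seq"):
# first gather every source_seq per kmer...
groups = defaultdict(list)
for kmer, src in rows:
    groups[kmer].append(src)

# ...then attach the full per-kmer list to each original row.
result = [(kmer, src, groups[kmer]) for kmer, src in rows]
```

Unlike a plain groupBy aggregation, which would yield one row per kmer, the window form keeps every input row and duplicates the collected list across them.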
The supported-operations documentation mentions that collect_list is supported only for windowing, which, as far as I can tell, is exactly what my code does. However, looking at the query plan, it is easy to see that collect_list is not executed by the GPU:
scala> x.withColumn("source_seqs", collect_list("source_seq").over(w)).explain
== Physical Plan ==
Window [collect_list(source_seq#302L, 0, 0) windowspecdefinition(kmer#301, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS max_source#658], [kmer#301]
+- GpuColumnarToRow false
+- GpuSort [kmer#301 ASC NULLS FIRST], false, RequireSingleBatch, 0
+- GpuCoalesceBatches RequireSingleBatch
+- GpuShuffleCoalesce 2147483647
+- GpuColumnarExchange gpuhashpartitioning(kmer#301, 200), ENSURE_REQUIREMENTS, [id=#1496]
+- GpuFileGpuScan csv [kmer#301,source_seq#302L] Batched: true, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/cloud-user/phase1/example/1620833755/part-00000], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<kmer:string,source_seq:bigint>
This is unlike a similar query using a different function, where we can see the window being executed on the GPU:
scala> x.withColumn("min_source", min("source_seq").over(w)).explain
== Physical Plan ==
GpuColumnarToRow false
+- GpuWindow [gpumin(source_seq#302L) gpuwindowspecdefinition(kmer#301, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(unboundedfollowing$()))) AS max_source#648L], [kmer#301], false
+- GpuSort [kmer#301 ASC NULLS FIRST], false, RequireSingleBatch, 0
+- GpuCoalesceBatches RequireSingleBatch
+- GpuShuffleCoalesce 2147483647
+- GpuColumnarExchange gpuhashpartitioning(kmer#301, 200), ENSURE_REQUIREMENTS, [id=#1431]
+- GpuFileGpuScan csv [kmer#301,source_seq#302L] Batched: true, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/cloud-user/phase1/example/1620833755/part-00000], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<kmer:string,source_seq:bigint>
Is my understanding of the supported-operations documentation wrong, or have I written the code the wrong way? Any help would be appreciated.

Kenny, may I ask which version of the rapids-4-spark plugin you are using, and which version of Spark?
The initial GPU implementation of COLLECT_LIST() is disabled by default, because its behavior does not match Spark's with respect to null values. (The GPU version retains nulls in the aggregated array rows, whereas Spark removes them.) Edit: this behavior was corrected in the 0.5 release.
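To make that null-handling difference concrete, here is a small pure-Python illustration of the two behaviors described above (not the plugin's actual code; None stands in for SQL NULL):

```python
def collect_list_spark(values):
    # Spark's collect_list drops nulls from the aggregated array.
    return [v for v in values if v is not None]

def collect_list_gpu_pre_0_5(values):
    # The pre-0.5 GPU implementation kept nulls in the array.
    return list(values)

vals = [7, None, 0]
```

On a column with no nulls, the two functions agree, which is why enabling the operator was considered safe in that case.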
If there are no nulls in the column being aggregated (and you are on rapids-4-spark 0.4), you can try enabling the operator by setting spark.rapids.sql.expression.CollectList=true.
In general, you can check why an operator is not running on the GPU by setting spark.rapids.sql.explain=NOT_ON_GPU; this prints the reasons to the console.
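For example (a sketch; adapt the launch command to your own deployment), both settings can be passed as configs when starting the shell:

```shell
# Enable the GPU collect_list implementation (only safe without nulls on 0.4)
# and ask the plugin to log why any operator falls back to the CPU.
spark-shell \
  --conf spark.rapids.sql.expression.CollectList=true \
  --conf spark.rapids.sql.explain=NOT_ON_GPU
```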
If you still run into difficulty or incorrect behavior while using the rapids-4-spark plugin, please feel free to file a bug; we would be happy to investigate further.

Yes, as Mithun mentioned, spark.rapids.sql.expression.CollectList defaults to true starting with the 0.5 release.
However, on the 0.4 release it defaults to false.

Here is the plan from my test on 0.5+:
val w = Window.partitionBy("name")
val resultdf = dfread.withColumn("values", collect_list("value").over(w))
resultdf.explain
== Physical Plan ==
GpuColumnarToRow false
+- GpuWindow [collect_list(value#134L, 0, 0) gpuwindowspecdefinition(name#133, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(unboundedfollowing$()))) AS values#138], [name#133], false
+- GpuCoalesceBatches RequireSingleBatch
+- GpuSort [name#133 ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@28e73bd1
+- GpuShuffleCoalesce 2147483647
+- GpuColumnarExchange gpuhashpartitioning(name#133, 200), ENSURE_REQUIREMENTS, [id=#563]
+- GpuFileGpuScan csv [name#133,value#134L] Batched: true, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/tmp/df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:string,value:bigint>