Clojure: string concat and group-by over a sequence of maps

Given input data from a JDBC source like the following:

  (def input-data
    [{:doc_id 1 :doc_seq 1 :doc_content "this is a very long "}
     {:doc_id 1 :doc_seq 2 :doc_content "sentence from a mainframe "}
     {:doc_id 1 :doc_seq 3 :doc_content "system that was built before i was "}
     {:doc_id 1 :doc_seq 4 :doc_content "born."}
     {:doc_id 2 :doc_seq 1 :doc_content "this is a another very long "}
     {:doc_id 2 :doc_seq 2 :doc_content "sentence from the same mainframe "}
     {:doc_id 3 :doc_seq 1 :doc_content "Ok here we are again. "}
     {:doc_id 3 :doc_seq 2 :doc_content "The mainframe only had 40 char per field so"}
     {:doc_id 3 :doc_seq 3 :doc_content "they broke it into multiple rows "}
     {:doc_id 3 :doc_seq 4 :doc_content "which seems to be common"}
     {:doc_id 3 :doc_seq 5 :doc_content " for the time. "}
     {:doc_id 3 :doc_seq 6 :doc_content "thanks for your help."}])
I want to group by doc_id and string-concat the doc_content, so that my output looks like this:

  [{:doc_id 1 :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
   {:doc_id 2 :doc_content "this is a another very long sentence ... clip..."}
   {:doc_id 3 :doc_content "... clip..."}]
I was thinking of using group-by, but that outputs a map, and I need the output to be lazy since the input data set could be very large. Maybe I can run group-by with some combination of reduce-kv to get what I want... or perhaps something with frequencies, if I can force it to be lazy.
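
For contrast, here is a minimal sketch (not from the original question) of that eager group-by route; it builds the entire result map in memory up front, which is exactly the laziness problem described above:

  ;; eager alternative: group-by realizes a {doc_id [rows ...]} map first
  (->> input-data
       (group-by :doc_id)
       (map (fn [[id rows]]
              {:doc_id id
               :doc_content (apply str (map :doc_content rows))})))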

I can guarantee the input will be sorted; I will ORDER BY doc_id, doc_seq (via SQL), so this program is only responsible for the aggregation/string-concat part. I may have a large amount of input data over the whole sequence, but any particular doc_id within it should only have a few dozen doc_seq rows.
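
To illustrate the SQL side, a hypothetical sketch assuming clojure.java.jdbc; the db-spec and the docs table name are placeholders, not from the question. reducible-query returns a reducible that streams rows instead of realizing the whole result set, which fits the "very large input" constraint:

  (require '[clojure.java.jdbc :as jdbc])

  ;; db-spec and the docs table are illustrative placeholder names;
  ;; consume the result with reduce/transduce/run! rather than seq fns
  (defn input-rows [db-spec]
    (jdbc/reducible-query
      db-spec
      ["SELECT doc_id, doc_seq, doc_content FROM docs ORDER BY doc_id, doc_seq"]))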


Any hints appreciated, thanks.

partition-by is lazy, so as long as each group of documents fits in memory, this should work:

(defn collapse-docs [docs]
  (apply merge-with
         (fn [l r]
           (if (string? r)
             (str l r) ;; concatenate the :doc_content strings in order
             r))       ;; for non-string keys (:doc_id, :doc_seq) keep the latest value
         docs))

(sequence ;; you may want to use eduction here, depending on use case
  (comp
    (partition-by :doc_id)
    (map collapse-docs))
  input-data)
=>
({:doc_id 1,
  :doc_seq 4,
  :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
  {:doc_id 2, :doc_seq 2, :doc_content "this is a another very long sentence from the same mainframe "}
  {:doc_id 3,
   :doc_seq 6,
   :doc_content "Ok here we are again. The mainframe only had 40 char per field sothey broke it into multiple rows which seems to be common for the time. thanks for your help."})
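
As the comment in the code above suggests, if the result will be consumed exactly once (e.g. written straight to output), an eduction avoids caching the intermediate lazy sequence; a small sketch:

(def collapsed
  (eduction
    (comp (partition-by :doc_id)
          (map collapse-docs))
    input-data))

;; consume once, e.g. print each collapsed doc as it is produced
(run! println collapsed)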

Why concatenate it (in memory) at all, rather than just sending it wherever it needs to go? (stdout?)

I'm trying to convert a mainframe system that couldn't fit many characters into its last column into something like a sequence of many JSON values... does that make sense?

Great, thanks; partition-by is what I was looking for. FYI, my docs will each fit in memory (maybe up to 5 MB apiece), but I may have a few million of them, so I'll have to watch how memory is allocated when this ends up on Spark.
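
Tying the comment thread together, a minimal sketch of streaming each collapsed document to stdout as a JSON value, assuming the cheshire library for encoding (the thread does not name a JSON library):

(require '[cheshire.core :as json])

;; drop :doc_seq so the output matches the shape asked for above, then
;; emit one JSON object per line without holding them all in memory
(run! #(println (json/generate-string (dissoc % :doc_seq)))
      (eduction
        (comp (partition-by :doc_id)
              (map collapse-docs))
        input-data))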