Ruby - Show the delta between 2 arrays of hashes based on a subset of hash keys

Tags: ruby, set, data-comparison

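I'm trying to compare two arrays of hashes whose hash structures are very similar (identical keys that are always present) and return the deltas between the two. Specifically, I'd like to capture the following:

  • Hashes that are part of array1 but do not exist in array2
  • Hashes that are part of array2 but do not exist in array1
  • Hashes that appear in both data sets

This can normally be achieved by simply doing the following: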

deltas_old_new = (array1-array2)
deltas_new_old = (array2-array1)
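
My problem (which has turned into a 2-3 hour struggle!) is that I need to identify the deltas based on the values of 3 keys within the hash ('id', 'ref', 'name'). The values of these 3 keys are what actually make up a unique entry in my data, but I must retain the other key/value pairs of the hash (e.g. 'extra', plus many other key/value pairs not shown for brevity).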

Example data:

array1 = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
          {'id' => '2', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
          {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
          {'id' => '7', 'ref' => '1007', 'name' => 'OR', 'extra' => 'Not Sorted On 11'}]

array2 = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
          {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
          {'id' => '8', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
          {'id' => '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 10'},
          {'id' => '12', 'ref' => '1012', 'name' => 'TX', 'extra' => 'Not Sorted On 85'}]

Expected results (3 separate arrays of hashes):

  • An object containing the data that is in array1 but not in array2
  • An object containing the data that is in array2 but not in array1
  • An object containing the data that is in both array1 and array2

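I've tried multiple iterations of comparing the arrays and using Hash#keep_if based on the 3 keys, as well as merging both data sets into a single array and then attempting to de-dup based on array1, but I keep coming up empty-handed. Thank you in advance for your time and assistance!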


Since you've tagged this question with set, you can use sets in a similar way:

require 'set'

set1 = array1.to_set
set2 = array2.to_set

set1 - set2
set2 - set1
set1 & set2
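
Note that Set#- and Set#& compare whole hashes, so a difference in any key (such as 'extra') makes two records distinct. The following is a minimal sketch of how the same idea might be adapted to a subset of keys. The helper name `key_of`, the variable `keys`, and the shortened sample records are illustrative assumptions, not part of the answer above.

```ruby
require 'set'

keys = %w[id ref name] # assumed: the keys that identify a record

# Hypothetical helper: the tuple of identifying values for one record.
def key_of(record, keys)
  record.values_at(*keys)
end

# Shortened stand-in data; note that 'extra' differs for id 1.
array1 = [{ 'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'a' },
          { 'id' => '2', 'ref' => '1002', 'name' => 'NY', 'extra' => 'b' }]
array2 = [{ 'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'changed' },
          { 'id' => '8', 'ref' => '1002', 'name' => 'NY', 'extra' => 'b' }]

# Sets of key tuples give fast membership checks.
keys1 = array1.map { |r| key_of(r, keys) }.to_set
keys2 = array2.map { |r| key_of(r, keys) }.to_set

# Full records are kept; only the key tuples are compared.
only_in_1 = array1.reject { |r| keys2.include?(key_of(r, keys)) }
only_in_2 = array2.reject { |r| keys1.include?(key_of(r, keys)) }
in_both   = array1.select { |r| keys2.include?(key_of(r, keys)) }
```

In this sketch the records reported as common are the copies from array1, so any differing 'extra' value in array2 is not shown.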

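This isn't very pretty, but it works. It creates a third array containing all the unique values from both array1 and array2 and iterates over that.

Then, since include? doesn't allow a custom matcher, we can fake it by using Array#detect and looking for an item in the array whose custom sub-hash matches. We'll wrap that in a custom method so we can call it passing in array1 or array2 instead of writing it twice.

Finally, we loop through array3, determine whether the item came from array1, array2, or both, and add it to the corresponding output array.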

array1 = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
          {'id' => '2', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
          {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
          {'id' => '7', 'ref' => '1007', 'name' => 'OR', 'extra' => 'Not Sorted On 11'}]

array2 = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
          {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
          {'id' => '8', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
          {'id' => '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 10'},
          {'id' => '12', 'ref' => '1012', 'name' => 'TX', 'extra' => 'Not Sorted On 85'}]

# combine the arrays into 1 array that contains items in both array1 and array2 to loop through
array3 = (array1 + array2).uniq { |item| { 'id' => item['id'], 'ref' => item['ref'], 'name' => item['name'] } }

# Array#include? doesn't allow a custom matcher, so we can fake it by using Array#detect
def is_included_in(array, object)
  object_identifier = { 'id' => object['id'], 'ref' => object['ref'], 'name' => object['name'] }

  array.detect do |item|
    { 'id' => item['id'], 'ref' => item['ref'], 'name' => item['name'] } == object_identifier
  end
end

# output array initializing
array1_only = []
array2_only = []
array1_and_array2 = []

# loop through all items in both array1 and array2 and check if it was in array1 or array2
# if it was in both, add to array1_and_array2, otherwise, add it to the output array that
# corresponds to the input array
array3.each do |item|
  in_array1 = is_included_in(array1, item)
  in_array2 = is_included_in(array2, item)

  if in_array1 && in_array2
    array1_and_array2.push item
  elsif in_array1
    array1_only.push item
  else
    array2_only.push item
  end
end


puts array1_only.inspect        # => [{"id"=>"2", "ref"=>"1002", "name"=>"NY", "extra"=>"Not Sorted On 7"}, {"id"=>"7", "ref"=>"1007", "name"=>"OR", "extra"=>"Not Sorted On 11"}]
puts array2_only.inspect        # => [{"id"=>"8", "ref"=>"1002", "name"=>"NY", "extra"=>"Not Sorted On 7"}, {"id"=>"5", "ref"=>"1005", "name"=>"MT", "extra"=>"Not Sorted On 10"}, {"id"=>"12", "ref"=>"1012", "name"=>"TX", "extra"=>"Not Sorted On 85"}]
puts array1_and_array2.inspect  # => [{"id"=>"1", "ref"=>"1001", "name"=>"CA", "extra"=>"Not Sorted On 5"}, {"id"=>"3", "ref"=>"1003", "name"=>"WA", "extra"=>"Not Sorted On 9"}]
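
As a later comment in this thread points out, Hash#values_at can stand in for the repeated `{ 'id' => ..., 'ref' => ..., 'name' => ... }` sub-hashes. The following is a condensed sketch along those lines. The method name `split_by_keys` is hypothetical, and unlike the code above it does not call uniq, so records that share a key tuple within one array are all kept.

```ruby
# Hypothetical helper: returns [only in first, only in second, in both],
# comparing records on the values of `keys` only.
def split_by_keys(first, second, keys)
  first_ids  = first.map  { |h| h.values_at(*keys) }
  second_ids = second.map { |h| h.values_at(*keys) }

  # partition yields [matching, non-matching]
  in_both, first_only = first.partition { |h| second_ids.include?(h.values_at(*keys)) }
  second_only = second.reject { |h| first_ids.include?(h.values_at(*keys)) }

  [first_only, second_only, in_both]
end

array1 = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
          {'id' => '2', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
          {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
          {'id' => '7', 'ref' => '1007', 'name' => 'OR', 'extra' => 'Not Sorted On 11'}]

array2 = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
          {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
          {'id' => '8', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
          {'id' => '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 10'},
          {'id' => '12', 'ref' => '1012', 'name' => 'TX', 'extra' => 'Not Sorted On 85'}]

array1_only, array2_only, array1_and_array2 =
  split_by_keys(array1, array2, %w[id ref name])
```

Each include? call scans an array, so this stays quadratic like the detect-based version; for large inputs, building a Set of the key tuples first would likely be the better choice.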

For this kind of problem it is usually easiest to work with indices.

Code

def keepers(array1, array2, keys)
  a1 = make_hash(array1, keys)
  a2 = make_hash(array2, keys)
  common_keys_of_a1_and_a2 = a1.keys & a2.keys
  [keeper_idx(array1, a1, common_keys_of_a1_and_a2),
   keeper_idx(array2, a2, common_keys_of_a1_and_a2)]
end

def make_hash(arr, keys)
  arr.each_with_index.with_object({}) do |(g,i),h|
    (h[g.values_at(*keys)] ||= []) << i
  end
end

def keeper_idx(arr, a, common_keys_of_a1_and_a2)
  arr.size.times.to_a - a.values_at(*common_keys_of_a1_and_a2).flatten
end

Example

array1 =
  [{'id' =>  '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
   {'id' =>  '2', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
   {'id' =>  '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
   {'id' =>  '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 8'},
   {'id' =>  '7', 'ref' => '1007', 'name' => 'OR', 'extra' => 'Not Sorted On 11'}]

array2 =
  [{'id' =>  '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
   {'id' =>  '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
   {'id' =>  '8', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
   {'id' =>  '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 10'},
   {'id' =>  '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 12'},
   {'id' => '12', 'ref' => '1012', 'name' => 'TX', 'extra' => 'Not Sorted On 85'}]
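
Note that these two arrays differ slightly from those given in the question. The question did not specify whether an array could contain two hashes with the same values for the specified keys. I therefore added one hash to each array to show that this case is handled.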

keys = ['id', 'ref', 'name']

idx1, idx2 = keepers(array1, array2, keys)
  #=> [[1, 4], [2, 3, 4, 5]]
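
idx1 (idx2) holds the indices of the elements of array1 (array2) that remain after matches have been removed. (array1 and array2 are not modified.)

It follows that the two arrays map to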

array1.values_at(*idx1)
  #=> [{"id"=> "2", "ref"=>"1002", "name"=>"NY", "extra"=>"Not Sorted On 7"},
  #    {"id"=> "7", "ref"=>"1007", "name"=>"OR", "extra"=>"Not Sorted On 11"}]
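
The same call can be applied to array2, and the complement of idx1 gives the records of array1 that have a match in array2, which covers the third result the question asks for. The following is a small sketch of that. The index lists are hard-coded from the keepers result shown earlier so that the block runs on its own, and the names `only_in_array1`, `only_in_array2` and `in_both` are illustrative assumptions.

```ruby
array1 =
  [{'id' =>  '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
   {'id' =>  '2', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
   {'id' =>  '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
   {'id' =>  '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 8'},
   {'id' =>  '7', 'ref' => '1007', 'name' => 'OR', 'extra' => 'Not Sorted On 11'}]

array2 =
  [{'id' =>  '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
   {'id' =>  '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
   {'id' =>  '8', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
   {'id' =>  '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 10'},
   {'id' =>  '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 12'},
   {'id' => '12', 'ref' => '1012', 'name' => 'TX', 'extra' => 'Not Sorted On 85'}]

# Hard-coded here; in practice these come from keepers(array1, array2, keys).
idx1 = [1, 4]
idx2 = [2, 3, 4, 5]

only_in_array1 = array1.values_at(*idx1)
only_in_array2 = array2.values_at(*idx2)

# Every index of array1 not in idx1 matched something in array2.
in_both = array1.values_at(*(array1.each_index.to_a - idx1))
```

Here in_both is taken from array1, so both 'WA' records from array1 appear in it, each with its own 'extra' value.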

The indices of the hashes that were removed are given as follows.

array1.size.times.to_a - idx1
  #=> [0, 2, 3]
array2.size.times.to_a - idx2
  #=> [0, 1]

Explanation

The steps are as follows.

a1 = make_hash(array1, keys)
  #=> {["1", "1001", "CA"]=>[0], ["2", "1002", "NY"]=>[1],
  #    ["3", "1003", "WA"]=>[2, 3], ["7", "1007", "OR"]=>[4]}    
a2 = make_hash(array2, keys)
  #=> {["1", "1001", "CA"]=>[0], ["3", "1003", "WA"]=>[1],
  #    ["8", "1002", "NY"]=>[2], ["5", "1005", "MT"]=>[3, 4],
  #    ["12", "1012", "TX"]=>[5]}
common_keys_of_a1_and_a2 = a1.keys & a2.keys
  #=> [["1", "1001", "CA"], ["3", "1003", "WA"]]
keeper_idx(array1, a1, common_keys_of_a1_and_a2)
  #=> [1, 4] (for array1)
keeper_idx(array2, a2, common_keys_of_a1_and_a2)
  #=> [2, 3, 4, 5] (for array2)
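
To make the first step more concrete, the index hash that make_hash builds can also be produced by grouping indices on the tuple of key values. This is an alternative sketch, not the answer's code; the method name `make_hash_alt` and the three-record sample are assumptions for illustration.

```ruby
# Hypothetical alternative to make_hash: maps each tuple of key values
# to the list of indices at which it occurs in arr.
def make_hash_alt(arr, keys)
  arr.each_index.group_by { |i| arr[i].values_at(*keys) }
end

records = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
           {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
           {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 8'}]

index = make_hash_alt(records, ['id', 'ref', 'name'])
  #=> {["1", "1001", "CA"]=>[0], ["3", "1003", "WA"]=>[1, 2]}
```

Records that share the same three key values end up under one entry, which is why a single match in the other array removes all of them.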

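This doesn't work for my data, because I need to de-dup based on specific hash keys ('id', 'ref', 'name') that exist in both arrays. I don't want to de-dup on all of the key/value pairs, as that would produce too many false positives in my data. Could you revise or retract your answer? Thanks in advance!

@KurtW I'll leave this answer up for a while, if only to warn others against posting the same thing. Sorry I misunderstood your question.

@Cary Swoveland, does your answer fit my question exactly? Thanks again for your help over the past few months!

Thank you so much for taking the time to help me! It's not as pretty as some things, but I haven't found a better solution for A->B, B->A and the overlap. Curious whether this would meet my needs? Thanks either way, marked as accepted!

Kurt, array1[0] is currently not in the "object containing data in array1 but not in array2", because array2 contains a hash (array2[0]) whose values for the first three keys equal the values of the corresponding keys in array1[0]. I assume that would still be the case if array2[0]['extra'] were 'Not Sorted On 99' rather than '...Sorted On 5'. If so, yes, I believe the other answer applies here as well.

@CarySwoveland, thank you so much for getting back to me. Could you clarify a little? Are you asking whether array1 and array2 both have the extra key/value pair, and whether it should be taken into account in the calculation? If so, extra exists in both sets and has the same value in all cases (e.g., if 'id' => '2', 'ref' => '1002', 'name' => 'NY', then extra will always equal 'Not Sorted On 7'). Thanks for the clarification. Also, would Simple Lime's answer be a better fit for my needs? If so, you seem to be a master at condensing things, so perhaps you have suggestions for improving it? Thanks again!

@KurtW Taking a quick look at Cary's other post, I had definitely forgotten about values_at, which could replace all of the {'id' => item['id'], 'ref' => item['ref'], 'name' => item['name']} hashes used to check uniqueness, as a quick way to condense things. I'll take a closer look at the rest of his post to see whether anything else could easily condense some of this.

Kurt, I concluded that my answer was not the best one for this question, so I've offered a different solution.

Cary, you've outdone yourself once again! Thank you so much for your detailed and clear answer.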