Wolfram Mathematica: an algorithm for picking pattern-free values out of a sparse list of definitions

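I have the following problem.

I am developing a stochastic simulator that samples configurations of the system at random and stores statistics on how many times each configuration has been visited at particular time instances. Roughly, the code works like this: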

f[_Integer][{_Integer..}] :=0
...
someplace later in the code, e.g.,
index = get index;
c = get random configuration (i.e. a tuple of integers, say a pair {n1, n2}); 
f[index][c] = f[index][c] + 1;
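which records that configuration c has occurred once more in the simulation at time instance index.

Once the code has finished, there is a list of definitions for f that looks something like the following (I typed it by hand just to emphasize the most important parts)

Note that the pattern-free definitions, which come first, can be rather sparse. Also, one cannot know in advance which indices and configurations will be picked.

The problem is to efficiently extract the values for a desired index; for example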

result = ExtractConfigurationsAndOccurences[f, 2] 
should give a list with the structure

result = {list1, list2}
where

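The catch is that ExtractConfigurationsAndOccurences has to be very fast. The only solution I could come up with is to use SubValues[f] (which gives the full list) and filter it with a Cases statement. I realize that this procedure should be avoided at any cost, since the number of configurations (definitions) to test multiplies rapidly, which slows the code down considerably.

Is there a natural way to do this quickly in Mathematica?

I was hoping that Mathematica would treat f[2] as a single head with many down values, but DownValues[f[2]] returns nothing, and SubValues[f[2]] results in an error.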

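This is a complete rewrite of my previous answer. It turns out that in my earlier attempts I overlooked a much simpler method based on a combination of packed arrays and sparse arrays. It is much faster and more memory-efficient than all the previous methods (at least in the range of sample sizes I tested), while changing the original SubValues-based approach only minimally. Since the question asks for the most efficient method, I will remove the other methods from the answer (they are considerably more complex and take up a lot of space; those who would like to see them can inspect the past revisions of this answer).

The original SubValues-based approach

We start by introducing a function that generates test samples of configurations for us. Here it is: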

Clear[generateConfigurations];
generateConfigurations[maxIndex_Integer, maxConfX_Integer, maxConfY_Integer, 
  nconfs_Integer] :=
Transpose[{
  RandomInteger[{1, maxIndex}, nconfs],
  Transpose[{
     RandomInteger[{1, maxConfX}, nconfs],
     RandomInteger[{1, maxConfY}, nconfs]
  }]}]; 
We can generate a small sample to illustrate:

In[3]:= sample  = generateConfigurations[2,2,2,10]
Out[3]= {{2,{2,1}},{2,{1,1}},{1,{2,1}},{1,{1,2}},{1,{1,2}},
          {1,{2,1}},{2,{1,2}},{2,{2,2}},{1,{2,2}},{1,{2,1}}}
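Here we have only 2 indices, and configurations in which both the "x" and the "y" numbers vary only from 1 to 2, with 10 such configurations in total.

The following function will help us imitate the accumulation of frequencies for the configurations, by incrementing SubValues-based counters for the ones that occur repeatedly: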

Clear[testAccumulate];
testAccumulate[ff_Symbol, data_] :=
  Module[{},
   ClearAll[ff];
   ff[_][_] = 0;
   Do[
     doSomeStuff;
     ff[#1][#2]++ & @@ elem;
     doSomeMoreStaff;
   , {elem, data}]];
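The doSomeStuff and doSomeMoreStaff symbols here stand for some code that may precede or follow the counting code. The data parameter should be a list of the form produced by generateConfigurations. For example: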

In[6]:= 
testAccumulate[ff,sample];
SubValues[ff]

Out[7]= {HoldPattern[ff[1][{1,2}]]:>2,HoldPattern[ff[1][{2,1}]]:>3,
   HoldPattern[ff[1][{2,2}]]:>1,HoldPattern[ff[2][{1,1}]]:>1,
   HoldPattern[ff[2][{1,2}]]:>1,HoldPattern[ff[2][{2,1}]]:>1,
   HoldPattern[ff[2][{2,2}]]:>1,HoldPattern[ff[_][_]]:>0}
The following function extracts the resulting data (indices, configurations and their frequencies) from the list of SubValues:

Clear[getResultingData];
getResultingData[f_Symbol] :=
   Transpose[{#[[All, 1, 1, 0, 1]], #[[All, 1, 1, 1]], #[[All, 2]]}] &@
        Most@SubValues[f, Sort -> False];
For example:

In[10]:= result = getResultingData[ff]
Out[10]= {{2,{2,1},1},{2,{1,1},1},{1,{2,1},3},{1,{1,2},2},{2,{1,2},1},
{2,{2,2},1},{1,{2,2},1}}
To complete the data-processing cycle, here is a straightforward function, based on Select, that extracts the data for a fixed index:

Clear[getResultsForFixedIndex];
getResultsForFixedIndex[data_, index_] := 
  If[# === {}, {}, Transpose[#]] &[
    Select[data, First@# == index &][[All, {2, 3}]]];
For our test example:

In[13]:= getResultsForFixedIndex[result,1]
Out[13]= {{{2,1},{1,2},{2,2}},{3,2,1}}
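This is presumably close to what @zorank tried in code.

A fast solution based on packed arrays and sparse arrays

As @zorank noted, this becomes slow for larger samples with more indices and configurations. We will now generate a large sample to illustrate that (note: this requires about 4-5 GB of RAM, so you may want to reduce the number of configurations if that exceeds your available RAM).

Now, we will extract the full data from the SubValues of ff: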

In[16]:= (largeres = getResultingData[ff]); // Timing
Out[16]= {10.844, Null}
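This takes some time, but it only needs to be done once. However, when we start extracting the data for a fixed index, we see that it is quite slow: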

In[24]:= getResultsForFixedIndex[largeres,10]//Short//Timing
Out[24]= {2.687,{{{196,26},{53,36},{360,43},{104,144},<<157674>>,{31,305},{240,291},
 {256,38},{352,469}},{<<1>>}}}
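The extraction code further down operates on three arrays named subIndicesPacked, subCombsPacked and subFreqsPacked, whose construction is not shown here. A minimal sketch of one way to build them follows, under the assumption that they are simply the three columns of the extracted data converted to packed arrays with Developer`ToPackedArray. The small data list is a stand-in copied from Out[10] so that the snippet runs on its own; in the real run one would use largeres in its place.

```mathematica
(* Stand-in for largeres (copied from Out[10]); an assumption for illustration only *)
data = {{2, {2, 1}, 1}, {2, {1, 1}, 1}, {1, {2, 1}, 3}, {1, {1, 2}, 2},
        {2, {1, 2}, 1}, {2, {2, 2}, 1}, {1, {2, 2}, 1}};

(* Split the {index, configuration, frequency} triples into three packed arrays *)
subIndicesPacked = Developer`ToPackedArray[data[[All, 1]]];
subCombsPacked   = Developer`ToPackedArray[data[[All, 2]]];
subFreqsPacked   = Developer`ToPackedArray[data[[All, 3]]];

(* Check that all three arrays are actually packed *)
Developer`PackedArrayQ /@ {subIndicesPacked, subCombsPacked, subFreqsPacked}
```

Packing only works when every column is a rectangular array of machine integers, which holds here because all configurations are pairs of integers.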
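This also takes some time, but again it is a one-time operation.

The following functions will then be used to extract the results for a fixed index much more efficiently: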

Clear[extractPositionFromSparseArray];
extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]]

Clear[getCombinationsAndFrequenciesForIndex];
getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_, 
    packedFreqs_, index_Integer] :=
With[{positions = 
         extractPositionFromSparseArray[
               SparseArray[1 - Unitize[packedIndices - index]]]},
  {Extract[packedCombs, positions],Extract[packedFreqs, positions]}];
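To make the mask trick above concrete, here is a small worked sketch on toy data. The names indices, payload, mask and positions are made up for illustration. It also swaps in the "NonzeroPositions" property of SparseArray in place of the {u}[[4, 2, 2]] access to the internal structure; that is a substitution on my part, not the code used in the answer.

```mathematica
(* Toy data: one index per stored record, plus a parallel list of payloads *)
indices = Developer`ToPackedArray[{2, 1, 2, 3, 2}];
payload = {"a", "b", "c", "d", "e"};

(* Unitize maps every nonzero entry to 1, so the mask is 1 exactly where indices == 2 *)
mask = 1 - Unitize[indices - 2];

(* The nonzero positions of the sparse mask are the matching record numbers *)
positions = SparseArray[mask]["NonzeroPositions"];

(* Pick the payload entries at those positions *)
Extract[payload, positions]
(* {"a", "c", "e"} *)
```

Everything here is vectorized arithmetic on a packed array followed by a single Extract, so no per-element test function is evaluated, unlike in the Select-based version.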
Now, we have:

In[25]:=  
getCombinationsAndFrequenciesForIndex[subIndicesPacked,subCombsPacked,subFreqsPacked,10]
  //Short//Timing

Out[25]= {0.094,{{{196,26},{53,36},{360,43},{104,144},<<157674>>,{31,305},{240,291},
{256,38},{352,469}},{<<1>>}}}

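That is roughly a 30-fold speed-up (0.094 s versus 2.687 s). Moreover, for sparser indices (for example, when the sample-generating function is called with parameters such as generateConfigurations[2000,500,500,5000000]), the speed-up with respect to the Select-based function is about 100 times.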

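I would probably use a SparseArray here (see the update below), but if you insist on using functions and *Values to store and retrieve values, one approach is to replace the first part (f[2] etc.) with a symbol that you create on the fly, like: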

Table[Symbol["f" <> IntegerString[i, 10, 3]], {i, 11}]
(* ==> {f001, f002, f003, f004, f005, f006, f007, f008, f009, f010, f011} *)

Symbol["f" <> IntegerString[56, 10, 3]]
(* ==> f056 *)

Symbol["f" <> IntegerString[56, 10, 3]][{3, 4}] = 12;
Symbol["f" <> IntegerString[56, 10, 3]][{23, 18}] = 12;

Symbol["f" <> IntegerString[56, 10, 3]] // Evaluate // DownValues
(* ==> {HoldPattern[f056[{3, 4}]] :> 12, HoldPattern[f056[{23, 18}]] :> 12} *)

f056 // DownValues
(* ==> {HoldPattern[f056[{3, 4}]] :> 12, HoldPattern[f056[{23, 18}]] :> 12} *)
As you can see, ArrayRules gives a nice list of contributions and counts. This can be done for each f[i] separately or for the entire set of f[i] at once (last line).
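As a hedged sketch of how ArrayRules output could be reshaped into the {configurations, counts} pair that the question asks for: the names counts, rules and result are illustrative, and the tiny array is built directly from rules rather than by a simulation.

```mathematica
(* A small sparse counter array for a single index, with two visited configurations *)
counts = SparseArray[{{1, 2} -> 2, {42, 64} -> 1}, {100, 110}];

(* ArrayRules lists the explicit entries followed by the default rule {_, _} -> 0;
   Most drops that trailing default rule *)
rules = Most[ArrayRules[counts]];

(* Separate left-hand sides (configurations) from right-hand sides (counts) *)
result = {rules[[All, 1]], rules[[All, 2]]}
(* {{{1, 2}, {42, 64}}, {2, 1}} *)
```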

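In some cases (depending on the performance needed while generating the values), the following simple solution using an auxiliary list (f[i,0]) may be useful: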

f[_Integer][{_Integer ..}] := 0;
f[_Integer, 0] := Sequence @@ {};

Table[
  r = RandomInteger[1000, 2];
  f[h = RandomInteger[100000]][r] = RandomInteger[10];
  f[h, 0] = Union[f[h, 0], {r}];
  , {i, 10^6}];

ExtractConfigurationsAndOccurences[f_, i_] := {f[i, 0], f[i][#] & /@ f[i, 0]};

Timing@ExtractConfigurationsAndOccurences[f, 10]

Out[252]= {4.05231*10^-15, {{{172, 244}, {206, 115}, {277, 861}, {299,
 862}, {316, 194}, {361, 164}, {362, 830}, {451, 306}, {614, 
769}, {882, 159}}, {5, 2, 1, 5, 4, 10, 4, 4, 1, 8}}}

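Many thanks to everyone who helped. I have been thinking about everybody's input, and I believe that in a simulation setting the following is the best solution: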

SetAttributes[linkedList, HoldAllComplete];

temporarySymbols = linkedList[];

SetAttributes[bookmarkSymbol, Listable];

bookmarkSymbol[symbol_]:= 
   With[{old = temporarySymbols}, temporarySymbols= linkedList[old,symbol]];

registerConfiguration[index_]:=registerConfiguration[index]=
  Module[
   {
    cs = linkedList[],
    bookmarkConfiguration,
    accumulator
    },
    (* remember the symbols we generate so we can remove them later *)
   bookmarkSymbol[{cs,bookmarkConfiguration,accumulator}];
   getCs[index] := List @@ Flatten[cs, Infinity, linkedList];
   getCsAndFreqs[index] := {getCs[index],accumulator /@ getCs[index]};
   accumulator[_]=0;
   bookmarkConfiguration[c_]:=bookmarkConfiguration[c]=
     With[{oldCs=cs}, cs = linkedList[oldCs, c]];
   Function[c,
    bookmarkConfiguration[c];
    accumulator[c]++;
    ]
   ]

pattern = Verbatim[RuleDelayed][Verbatim[HoldPattern][HoldPattern[registerConfiguration [_Integer]]],_];

clearSimulationData :=
 Block[{symbols},
  DownValues[registerConfiguration]=DeleteCases[DownValues[registerConfiguration],pattern];
  symbols = List @@ Flatten[temporarySymbols, Infinity, linkedList];
  (*Print["symbols to purge: ", symbols];*)
  ClearAll /@ symbols;
  temporarySymbols = linkedList[];
  ]
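It is based on the solution Leonid proposed in an earlier post, combined with belisarius's suggestion to keep an extra index of the configurations that have already been processed. The earlier approaches are adapted so that configurations can be registered and extracted naturally with more or less the same code. This kills two birds with one stone, since bookkeeping and retrieval are closely related.

This approach works better when simulation data has to be added incrementally (the curves are usually noisy, so runs have to be added incrementally to obtain good plots). When the data is generated in one go and analyzed afterwards, the sparse array approach will work better.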
f = SparseArray[{_} -> 0, 100000];
f // ByteCount
(* ==> 672 *)

(* initialize f with sparse arrays, takes a few seconds with f this large *)
Do[  f[[i]] = SparseArray[{_} -> 0, {100, 110}], {i,100000}] // Timing//First
(* ==> 18.923 *)

(* this takes about 2.5% of the memory that a normal array would take: *)
f // ByteCount
(* ==>  108000040 *)

ConstantArray[0, {100000, 100, 100}] // ByteCount
(* ==> 4000000176 *)

(* counting phase *)
f[[1]][[1, 2]]++;
f[[1]][[1, 2]]++;
f[[1]][[42, 64]]++;
f[[2]][[100, 11]]++;

(* reporting phase *)
f[[1]] // ArrayRules
f[[2]] // ArrayRules
f // ArrayRules

(* 
 ==>{{1, 2} -> 2, {42, 64} -> 1, {_, _} -> 0}
 ==>{{100, 11} -> 1, {_, _} -> 0}
 ==>{{1, 1, 2} -> 2, {1, 42, 64} -> 1, {2, 100, 11} ->  1, {_, _, _} -> 0}
*)
fillSimulationData[sampleArg_] :=MapIndexed[registerConfiguration[#2[[1]]][#1]&, sampleArg,{2}];

sampleForIndex[index_]:=
  Block[{nsamples,min,max},
   min = Max[1,Floor[(9/10)maxSamplesPerIndex]];
   max =  maxSamplesPerIndex;
   nsamples = RandomInteger[{min, max}];
   RandomInteger[{1,10},{nsamples,ntypes}]
   ];

generateSample := 
  Table[sampleForIndex[index],{index, 1, nindexes}];

measureGetCsTime :=((First @ Timing[getCs[#]])& /@ Range[1, nindexes]) // Max

measureGetCsAndFreqsTime:=((First @ Timing[getCsAndFreqs[#]])& /@ Range[1, nindexes]) // Max

reportSampleLength[sampleArg_] := StringForm["Total number of confs = ``, smallest accumulator length ``, largest accumulator length = ``", Sequence@@ {Total[#],Min[#],Max[#]}& [Length /@ sampleArg]]
clearSimulationData;

nindexes=100;maxSamplesPerIndex = 1000; ntypes = 2;

largeSample1 = generateSample;

reportSampleLength[largeSample1];

Total number of confs = 94891, smallest accumulator length 900, largest accumulator length = 1000;

First @ Timing @ fillSimulationData[largeSample1] 
With[{times = Table[measureGetCsTime, {50}]}, 
 ListPlot[times, Joined -> True, PlotRange -> {0, Max[times]}]]
With[{times = Table[measureGetCsAndFreqsTime, {50}]}, 
 ListPlot[times, Joined -> True, PlotRange -> {0, Max[times]}]]
nindexes = 10; maxSamplesPerIndex = 100000; ntypes = 10;
largeSample3 = generateSample;
largeSample3 // Short
{{{2,2,1,5,1,3,7,9,8,2},92061,{3,8,6,4,9,9,7,8,7,2}},8,{{4,10,1,5,9,8,8,10,8,6},95498,{3,8,8}}}
Total number of confs = 933590, smallest accumulator length 90760, largest accumulator length = 96876