Identifying related value pairs in a dataset using SAS

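I have a multi-row dataset containing word/synonym information. A short sample of the dataset is shown below; each row gives one synonym for a word.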

Word Synonym
C01  C02
C01  C05 
C02  C02
C02  C05
C03  C04
C05  C06
C11  C12
..   ..

From the dataset above, the synonym relationships can be determined as follows:

C01-C02-C05-C06
C03-C04
C11-C12

After running the SAS code, I would like a dataset of the following form:

Word Synonym1 Synonym2 Synonym3
C01  C02      C05      C06
C03  C04
C11  C12

I tried a series of repeated inner-join steps, but that seems to involve a lot of unnecessary processing.

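I could hardly find a good solution for this in SAS (in other languages this is easier to solve). The approach below is not great, because it packs each group into a single character variable (at most 32,767 characters, and at most 100 groups as written), so it will run out of room quickly if you have a large number of records. It also relies on "#" as a delimiter; if your words can contain that character, you may want to change it to something else.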
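The DATA step below reads its input from a dataset named testData, which is not defined anywhere in the post. As a minimal sketch, the sample rows from the question could be loaded like this; the variable lengths ($8) are an assumption, not something the question specifies.

```sas
/* Sketch: load the question's sample rows into testData.            */
/* The $8 lengths are an assumption; adjust them to the real data.   */
data testData;
    length word synonym $8;
    input word $ synonym $;
    datalines;
C01 C02
C01 C05
C02 C02
C02 C05
C03 C04
C05 C06
C11 C12
;
run;
```

With these rows, the code that follows should produce one row per group, keyed by the first word seen in that group.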

data groups;
    set testData nObs=numObs;

    array groups [*] $32767 group1-group100;
    retain groupN 0 group1-group100;

    categorized = 0;

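    * Search for the word or synonym in the existing groups.
      Each group is stored as a delimited string of its members;
    * Limitation: a pair that links two existing groups is added to
      both of them, because the groups are never merged;
    if (groupN >= 1) then do;
        do currentGroup = 1 to groupN;
            * Word is already in the group, synonym is not: append the synonym;
            if (index(groups[currentGroup], "#"||strip(word)||"#") and index(groups[currentGroup], "#"||strip(synonym)||"#") = 0) then do;
                groups[currentGroup] = strip(groups[currentGroup])||strip(synonym)||"#";
                categorized = 1;
            end;
            * Synonym is already in the group, word is not: append the word;
            if (index(groups[currentGroup], "#"||strip(word)||"#") = 0 and index(groups[currentGroup], "#"||strip(synonym)||"#")) then do;
                groups[currentGroup] = strip(groups[currentGroup])||strip(word)||"#";
                categorized = 1;
            end;
            * Both are already in the group: nothing to add;
            if (index(groups[currentGroup], "#"||strip(word)||"#") and index(groups[currentGroup], "#"||strip(synonym)||"#")) then do;
                categorized = 1;
            end;

        end;
    end;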

    * If the word and synonym were not found in the existing groups, create a new one;
    if (categorized = 0) then do;
        groups[groupN + 1]  = "#"||strip(word)||"#"||strip(synonym)||"#";
        groupN = groupN + 1;
    end;

    * Split the groups into unique key/value pairs;
    if (_n_ = numObs) then do;
        length key value $200;
        keep key value;
        do currentGroup = 1 to groupN;
            if (not missing(groups[currentGroup])) then do;
                key = scan(groups[currentGroup], 1, '#');
                do j = 2 to countC(groups[currentGroup],'#');
                    value = scan(groups[currentGroup], j, '#');
                    if (not missing(value)) then do;
                        output;
                    end;
                end;
            end;
        end;
    end;
run;

proc sort data = groups;
    by key;
run;

proc transpose data = groups out=result(drop = _:) prefix=synonym;
    by key;
    var value;
run;

Do you have a SAS/OR license? It has several procedures for finding connected subgraphs (connected components) in data like yours.
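As a rough sketch of that suggestion, assuming SAS/OR is licensed and the pairs are in the testData dataset used above: PROC OPTNET's CONCOMP statement labels each node with a connected-component number, and that output can then be transposed into one row per group. The output dataset names (components, result) and the member prefix are illustrative choices, not anything from the original posts.

```sas
/* Sketch, assuming SAS/OR: treat each word/synonym pair as an       */
/* undirected link and label every word with its component number.   */
proc optnet data_links=testData out_nodes=components;
    data_links_var from=word to=synonym;
    concomp;
run;

/* OUT_NODES= holds one row per word, with variables node and concomp */
proc sort data=components;
    by concomp node;
run;

/* One row per component: member1 plays the role of Word,            */
/* member2 onward are its synonyms                                    */
proc transpose data=components out=result(drop=_name_) prefix=member;
    by concomp;
    var node;
run;
```

Unlike the string-based DATA step, this treats the problem as a graph, so a pair that links two previously separate groups ends up joining them into one component. Rows such as C02/C02 are self-links; it is worth checking how your SAS/OR release handles them before relying on the output.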