How to get a similarity score for observations in a data frame in R


I have a dataset of survey results collected from the same people on multiple occasions. Here is a sample dataset:

library(dplyr)
DATA <- data.frame(ID = c(1,22,22,333,333,333,4444,4444,4444,4444),
               Gender = c("M","F","F","M","M","NotAvailable","M","M","F","NotAvailable"),
               MaritalStatus = c("W","M","M","UM","NotAvailable","UM","M","UM","W","NotAvaiable"),
               Name = c("Available","NotAvailable","NotAvailable","Available","Available","Available","Available","NotAvailable",
                        "Available","NotAvailable"),
               Age = c(20,30,30,21,22,23,33,33,33,34),
               EmailIND = c(0,1,1,0,0,1,1,1,1,1),
               Irrelevant = c(12,3123,312,343,554,66,67,56,123,434)
    )
> DATA
     ID       Gender MaritalStatus         Name Age EmailIND Irrelevant
1     1            M             W    Available  20        0         12
2    22            F             M NotAvailable  30        1       3123
3    22            F             M NotAvailable  30        1        312
4   333            M            UM    Available  21        0        343
5   333            M  NotAvailable    Available  22        0        554
6   333 NotAvailable            UM    Available  23        1         66
7  4444            M             M    Available  33        1         67
8  4444            M            UM NotAvailable  33        1         56
9  4444            F             W    Available  33        1        123
10 4444 NotAvailable   NotAvaiable NotAvailable  34        1        434
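My goal is to create two variables:

  • SimilarResp_Flag: 1 if the information an individual provided is identical in every survey, 0 otherwise

  • Magnitude_Similarity: a numeric score for how similar the information an individual provided is across surveys

Here is my solution: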

    getSimRespFlag <- function(x){
      return(as.numeric(length(unique(x)) == 1))
    }
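
To make the behaviour of the helper concrete, here is a small sketch that calls it on a few vectors taken from the sample data. The `NA` case is an added assumption for illustration (the sample data uses the string "NotAvailable" rather than `NA`).

```r
# Same helper as above: 1 if every value in x is identical, otherwise 0
getSimRespFlag <- function(x){
  return(as.numeric(length(unique(x)) == 1))
}

getSimRespFlag(c("F", "F"))                  # 1: both surveys agree
getSimRespFlag(c("M", "M", "NotAvailable"))  # 0: "NotAvailable" counts as a different answer
getSimRespFlag(c(21, 22, 23))                # 0: numeric columns are compared the same way
getSimRespFlag(c("M", NA))                   # 0: unique() keeps NA as its own value
```

Because the helper only asks whether all values are identical, a group with one mismatch and a group with no agreement at all both score 0 for that column.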
    
    
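Is there a better way to compute the similarity flag and the similarity magnitude, something like cosine similarity for documents? Cosine similarity works on numeric values, but my data mixes categorical and numeric columns. My dataset is large and this operation takes time, so any faster solution would also help.

The code that applies the function per ID, with its output: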

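    numberOfCols <- ncol(DATA)
    similarity_DATA <- DATA %>%
        select(-Irrelevant) %>%    # drop the column that should not be compared
        group_by(ID) %>%
        # per column: 1 if all of the ID's responses agree, otherwise 0
        # (the function is passed directly because funs() is deprecated)
        summarise_all(getSimRespFlag) %>%
        # columns 2:(numberOfCols-1) hold the compared features; column 1 is ID
        mutate(Magnitude_Similarity = rowSums(.[2:(numberOfCols-1)])/(numberOfCols-2),
               SimilarResp_Flag = as.numeric(Magnitude_Similarity == 1)) %>%
        select(ID, SimilarResp_Flag, Magnitude_Similarity)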
    
    > similarity_DATA
    # A tibble: 4 × 3
         ID SimilarResp_Flag Magnitude_Similarity
      <dbl>            <dbl>                <dbl>
    1     1                1                  1.0
    2    22                1                  1.0
    3   333                0                  0.2
    4  4444                0                  0.2
    
    DATA <- left_join(DATA,similarity_DATA,by ="ID")
    
    > DATA
         ID       Gender MaritalStatus         Name Age EmailIND Irrelevant SimilarResp_Flag Magnitude_Similarity
    1     1            M             W    Available  20        0         12                1                  1.0
    2    22            F             M NotAvailable  30        1       3123                1                  1.0
    3    22            F             M NotAvailable  30        1        312                1                  1.0
    4   333            M            UM    Available  21        0        343                0                  0.2
    5   333            M  NotAvailable    Available  22        0        554                0                  0.2
    6   333 NotAvailable            UM    Available  23        1         66                0                  0.2
    7  4444            M             M    Available  33        1         67                0                  0.2
    8  4444            M            UM NotAvailable  33        1         56                0                  0.2
    9  4444            F             W    Available  33        1        123                0                  0.2
    10 4444 NotAvailable   NotAvaiable NotAvailable  34        1        434                0                  0.2
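
One possible refinement, offered as a sketch rather than a definitive answer: the score above gives 0.2 to both ID 333 and ID 4444 because each column is scored all-or-nothing. A graded alternative is to score each column by the share of responses that match the most common response, then average across columns. The helper `modal_share` and the column `Graded_Similarity` are hypothetical names introduced here, and "NotAvailable" is treated as an ordinary response, as in the original code. The sketch uses base R only so that it runs without extra packages.

```r
# Sample data copied from the question
DATA <- data.frame(
  ID = c(1,22,22,333,333,333,4444,4444,4444,4444),
  Gender = c("M","F","F","M","M","NotAvailable","M","M","F","NotAvailable"),
  MaritalStatus = c("W","M","M","UM","NotAvailable","UM","M","UM","W","NotAvaiable"),
  Name = c("Available","NotAvailable","NotAvailable","Available","Available",
           "Available","Available","NotAvailable","Available","NotAvailable"),
  Age = c(20,30,30,21,22,23,33,33,33,34),
  EmailIND = c(0,1,1,0,0,1,1,1,1,1),
  Irrelevant = c(12,3123,312,343,554,66,67,56,123,434)
)

# Hypothetical helper: share of responses equal to the most common response
modal_share <- function(x) max(table(x)) / length(x)

# Columns to compare: everything except the key and the irrelevant column
feature_cols <- setdiff(names(DATA), c("ID", "Irrelevant"))

# One row per ID, one modal share per feature column
per_id <- aggregate(DATA[feature_cols], by = list(ID = DATA$ID), FUN = modal_share)

# Average the per-column shares; the flag is 1 only when every column agrees fully
per_id$Graded_Similarity <- rowMeans(per_id[feature_cols])
per_id$SimilarResp_Flag  <- as.numeric(per_id$Graded_Similarity == 1)

print(per_id[c("ID", "SimilarResp_Flag", "Graded_Similarity")])
# IDs 1 and 22 score 1; ID 333 scores about 0.667; ID 4444 scores 0.6
```

With this scoring, ID 333 (mostly consistent answers) ranks above ID 4444, which the all-or-nothing score cannot distinguish. Whether the modal share is the right per-column measure depends on the data; for numeric columns such as Age, a distance-based measure may be more appropriate.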