Python: How can a dataframe be uniquely defined against other dataframes using the fewest column values?


python pandas dataframe


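I'm new to Python and dataframes. I'm trying to solve a machine learning problem and have run into an issue that I really need to find a way around.

I have 3 binary dataframes, each 15*40.

Iterating over each dataframe, I need to find, for each row, the minimum number of columns that uniquely distinguishes that row of that dataframe from the rows of the other dataframes.

Once a row can be uniquely identified against the other dataframes by the smallest possible number of columns, I look for rows in the same dataframe that share those column values and remove them (this generates a rule).

This way, I believe I can find the minimum number of columns, and their values, that define the entries of that dataframe against the other dataframes.

Is there any easy way to do this in Python or pandas?

I've been stuck on this and have had no success so far.

For example: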

Dataframe 1:

1 0 1 0
0 1 1 0
1 0 1 1

Dataframe 2:

1 1 1 0
1 1 1 0
1 1 1 1

Dataframe 3:

0 0 1 0
0 0 1 0
1 1 0 1

The expected output is as follows:

2 rules that uniquely define dataframe 1:

Rule 1: the first two columns with values 1, 0 define the first and third rows
Rule 2: the first two columns with values 0, 1 define the second row

2 rules that uniquely define dataframe 2:

Rule 1: the first 3 columns with values 1, 1, 1 define the first and second rows
Rule 2: the first 4 columns with values 1, 1, 1, 1 define the third row

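2 rules that uniquely define dataframe 3:

Rule 1: the first two columns with values 0, 0 define the first and second rows
Rule 2: the last two columns with values 0, 1 define the third row

This is how I want to define rules based on column values, so that each dataframe is uniquely identified using the fewest columns.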
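To make the idea of a rule concrete, here is a minimal sketch that checks the rules listed above against the three example dataframes. It assumes a rule is written as a {column: value} dictionary with 0-based integer column labels (so "the first two columns" are columns 0 and 1). The helper names matches and is_unique_rule are made up for illustration.

```python
import pandas as pd

# The three example dataframes from the question.
df1 = pd.DataFrame([[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 1, 1]])
df2 = pd.DataFrame([[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1]])
df3 = pd.DataFrame([[0, 0, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1]])


def matches(df, rule):
    """Return a boolean mask of the rows of df that satisfy every column/value pair."""
    mask = pd.Series(True, index=df.index)
    for col, value in rule.items():
        mask = mask & (df[col] == value)
    return mask


def is_unique_rule(rule, target, others):
    """True if the rule matches a row of target and no row of any other dataframe."""
    hits_target = bool(matches(target, rule).any())
    hits_others = any(bool(matches(other, rule).any()) for other in others)
    return hits_target and not hits_others


# Rule 1 for dataframe 1: first two columns are 1, 0.
rule = {0: 1, 1: 0}
print(matches(df1, rule).tolist())            # [True, False, True]
print(is_unique_rule(rule, df1, [df2, df3]))  # True
```

A rule such as {2: 1} would fail this check for dataframe 1, because the third column is also 1 in rows of dataframe 2.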

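The pseudocode I am trying to follow:

For each row i in the dataframe

Count the occurrences of each column value in the other dataframes

Order all the columns of row i by those occurrence counts

Find the minimum number of columns in the ordered column list that uniquely distinguishes row i from the rows in the other dataframes

Delete all rows in this dataframe that also match the columns and values that were found

If the length of this dataframe is not 0, continue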
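The counting and ordering steps of the pseudocode could look roughly like the sketch below. It uses dataframe 1 as the target and the other two example dataframes stacked together. The function name columns_by_rarity is hypothetical, and ordering by rarest value first is only a heuristic for which columns to try first.

```python
import pandas as pd

target = pd.DataFrame([[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 1, 1]])
others = pd.concat(
    [
        pd.DataFrame([[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1]]),
        pd.DataFrame([[0, 0, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1]]),
    ],
    ignore_index=True,
)


def columns_by_rarity(row, others):
    """Pair each column with how often the row's value occurs in that column of others,
    sorted so the rarest values come first."""
    counts = {col: int((others[col] == row[col]).sum()) for col in others.columns}
    return sorted(counts.items(), key=lambda item: item[1])


ordered = columns_by_rarity(target.iloc[0], others)
print(ordered)  # [(1, 2), (0, 4), (3, 4), (2, 5)]
```

For the first row (1 0 1 0), the value in column 1 is the rarest in the other dataframes, so it is the most promising column to start a rule with.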

Is there any library or easy way to do this?

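I solved this particular problem in the following way. This is a working solution. At the end it prints an array of rules, rules, which contains one array per dataframe. Each of those arrays consists of dictionaries of the form {columnName: columnValue}.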

import pandas as pd
import itertools

df0 = pd.DataFrame([[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 1, 1]])
df1 = pd.DataFrame([[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1]])
df2 = pd.DataFrame([[0, 0, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1]])

print(df0)
print(df1)
print(df2)

list_dfs = [df0, df1, df2]


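def find_rules(list_dfs):
    # For each dataframe, collect rules ({column: value} dicts) that match
    # its rows but no row of any other dataframe in list_dfs.
    rules_sets = []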
    for idx, df in enumerate(list_dfs):
        trgt_df = df
        other_df = [x for i, x in enumerate(list_dfs) if i != idx]
        other_df = pd.concat(other_df, ignore_index=True)

        def count_occur(value, col_name):
            return other_df[col_name].value_counts().get(value, 0)

        df_dict = []

        # The row label is not needed here; "_" avoids shadowing the outer idx.
        for _, row in trgt_df.iterrows():
            listz = {}
            for col_name in list(trgt_df.columns):
                listz[col_name] = [row[col_name],
                                   count_occur(row[col_name], col_name)]
            df_dict.append(sorted(listz.items(), key=lambda x: x[1][1]))

        rules = []

        def check_for_uniquness(list_of_attr):
            for row in other_df.itertuples(index=False):
                conditions = len(list_of_attr)
                for atr in list_of_attr:
                    if row[atr[0]] == atr[1][0]:
                        conditions = conditions-1
                if conditions == 0:
                    return False
            return True

        def find_col_val(row, val):
            for r in row:
                if r[0] == val:
                    return r[1][0]

        def mark_similar(df_cur, list_of_attr):
            new = []
            for idx, row in enumerate(df_cur):
                combinations = len(list_of_attr)
                for atr in list_of_attr:
                    if find_col_val(row, atr[0]) == atr[1][0]:
                        combinations = combinations-1
                if combinations == 0:
                    new.append(idx)
            return [x for i, x in enumerate(df_cur) if i not in new]

        def return_dictionary(list_of_attr):
            dic = {}
            for idx, el in enumerate(list_of_attr):
                dic[el[0]] = el[1][0]
            return dic

        def possible_combinations(stuff):
            # All non-empty subsets of stuff, smallest subsets first.
            lists = []
            for L in range(1, len(stuff)+1):
                for subset in itertools.combinations(stuff, L):
                    lists.append(list(subset))
            return lists

        def X2R(df_dict):
            for elm in df_dict:
                combinations = possible_combinations(list(range(0, len(elm))))
                for combin in combinations:
                    column_combinations = []
                    for i in combin:
                        column_combinations.append(elm[i])
                    if check_for_uniquness(column_combinations):
                        rules.append(return_dictionary(
                            column_combinations))
                        return mark_similar(df_dict, column_combinations)

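        while len(df_dict):
            remaining = X2R(df_dict)
            if remaining is None:
                # X2R found no column combination that separates the
                # remaining rows from the other dataframes (for example an
                # identical row exists elsewhere), so stop here.
                break
            df_dict = remaining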

        rules_sets.append(rules)
    return rules_sets


rules = find_rules(list_dfs)
print(rules)
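For comparison, here is a shorter sketch of the same greedy idea. It is not the code above rewritten line for line: it tries column combinations in plain column order rather than by rarity, and it skips any row that cannot be separated at all. The function name minimal_rules is made up for illustration.

```python
import itertools

import pandas as pd


def minimal_rules(target, others):
    """Greedily cover the rows of target with the smallest {column: value}
    rules that match no row of others."""
    remaining = target.copy()
    other_values = others
    rules = []
    while not remaining.empty:
        row = remaining.iloc[0]
        rule = None
        # Try 1 column, then 2 columns, and so on, so the first hit is minimal.
        for size in range(1, len(target.columns) + 1):
            for cols in itertools.combinations(target.columns, size):
                cols = list(cols)
                same = other_values[cols].to_numpy() == row[cols].to_numpy()
                if not same.all(axis=1).any():
                    rule = {col: int(row[col]) for col in cols}
                    break
            if rule is not None:
                break
        if rule is None:
            # An identical row exists in the other dataframes; skip it.
            remaining = remaining.iloc[1:]
            continue
        rules.append(rule)
        # Drop every remaining row that the new rule already covers.
        values = list(rule.values())
        covered = (remaining[list(rule)].to_numpy() == values).all(axis=1)
        remaining = remaining[~covered]
    return rules


df0 = pd.DataFrame([[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 1, 1]])
df1 = pd.DataFrame([[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1]])
df2 = pd.DataFrame([[0, 0, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1]])

rules_df0 = minimal_rules(df0, pd.concat([df1, df2], ignore_index=True))
print(rules_df0)  # [{0: 1, 1: 0}, {0: 0, 1: 1}]
```

Like the original, this checks every column subset in the worst case, so it is only practical for a small number of columns. With 40 columns, the search would likely need pruning.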