Check the type of relationship between columns in Python/pandas (one-to-one, one-to-many, or many-to-many)

python python-3.x pandas


Suppose I have a DataFrame with 5 columns:

pd.DataFrame({
'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})

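Is there a function that tells me the type of relationship for each pair of columns (one-to-one, one-to-many, many-to-one, many-to-many)?

The output should look like this: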

Column1 Column2 one-to-many
Column1 Column3 one-to-many
Column1 Column4 one-to-one
Column1 Column5 one-to-many
Column2 Column3 many-to-many
...
Column4 Column5 one-to-many

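This may not be a perfect answer, but it is a starting point that can be modified further:

import numpy as np
import pandas as pd

# number of distinct values per column, as a NumPy array so that
# [:, None] / [None, :] broadcasting works (recent pandas Series do not support it)
a = df.nunique().to_numpy()
# is9: every value is unique (df has 9 rows); is1: the column is constant
is9, is1 = a == len(df), a == 1
one_one = is9[:, None] & is9
one_many = is1[:, None]
many_one = is1[None, :]
many_many = (~is9[:, None]) & (~is9)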

pd.DataFrame(np.select([one_one, one_many, many_one],
                       ['one-to-one', 'one-to-many', 'many-to-one'],
                       'many-to-many'),
             df.columns, df.columns)

Output:

              Column1       Column2       Column3       Column4      Column5
Column1    one-to-one  many-to-many  many-to-many    one-to-one  many-to-one
Column2  many-to-many  many-to-many  many-to-many  many-to-many  many-to-one
Column3  many-to-many  many-to-many  many-to-many  many-to-many  many-to-one
Column4    one-to-one  many-to-many  many-to-many    one-to-one  many-to-one
Column5   one-to-many   one-to-many   one-to-many   one-to-many  one-to-many

This should work for you:

df = pd.DataFrame({
'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})

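def get_relation(df, col1, col2):
    # largest number of rows sharing a single value of col1 (resp. col2);
    # .iloc[0] selects by position, which [0] no longer does reliably on a labelled Series
    first_max = df[[col1, col2]].groupby(col1).count().max().iloc[0]
    second_max = df[[col1, col2]].groupby(col2).count().max().iloc[0]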
    if first_max==1:
        if second_max==1:
            return 'one-to-one'
        else:
            return 'one-to-many'
    else:
        if second_max==1:
            return 'many-to-one'
        else:
            return 'many-to-many'

from itertools import product
for col_i, col_j in product(df.columns, df.columns):
    if col_i == col_j:
        continue
    print(col_i, col_j, get_relation(df, col_i, col_j))

Output:

Column1 Column2 one-to-many
Column1 Column3 one-to-many
Column1 Column4 one-to-one
Column1 Column5 one-to-many
Column2 Column1 many-to-one
Column2 Column3 many-to-many
Column2 Column4 many-to-one
Column2 Column5 many-to-many
Column3 Column1 many-to-one
Column3 Column2 many-to-many
Column3 Column4 many-to-one
Column3 Column5 many-to-many
Column4 Column1 one-to-one
Column4 Column2 one-to-many
Column4 Column3 one-to-many
Column4 Column5 one-to-many
Column5 Column1 many-to-one
Column5 Column2 many-to-many
Column5 Column3 many-to-many
Column5 Column4 many-to-one
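The question's expected output lists each pair of columns only once, whereas the loop above prints both directions. A minimal sketch of that variant, assuming `itertools.combinations` is an acceptable replacement for `product`; the `relations` dictionary is a name introduced here for illustration, and `get_relation` is a condensed restatement of the function above rather than the answerer's exact code:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
    'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
    'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})


def get_relation(df, col1, col2):
    # Largest number of rows that share one value of col1 / of col2.
    first_max = df.groupby(col1)[col2].count().max()
    second_max = df.groupby(col2)[col1].count().max()
    if first_max == 1:
        return 'one-to-one' if second_max == 1 else 'one-to-many'
    return 'many-to-one' if second_max == 1 else 'many-to-many'


# combinations() yields every unordered pair of columns exactly once.
relations = {(a, b): get_relation(df, a, b)
             for a, b in combinations(df.columns, 2)}

for (a, b), relation in relations.items():
    print(a, b, relation)
```

The reverse direction of any pair can be read off by swapping one-to-many and many-to-one.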

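First, we get all combinations of the columns.

Finally, we use pd.merge with the validate argument to check which relationship passes the test, using try/except.

Note that we leave out many_to_many, because this relationship is not "checked". Quoting the docs:

"many_to_many" or "m:m": allowed, but does not result in checks.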
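The steps described above can be sketched roughly as follows. This is an illustrative reconstruction, not the answerer's original code: the function name `merge_cardinality` and the choice to stop at the first relationship that validates are assumptions made here. It relies on `pd.merge` raising `pandas.errors.MergeError` when the `validate` check fails.

```python
from itertools import product

import pandas as pd
from pandas.errors import MergeError

df = pd.DataFrame({
    'Column1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Column2': [4, 3, 6, 8, 3, 4, 1, 4, 3],
    'Column3': [7, 3, 3, 1, 2, 2, 3, 2, 7],
    'Column4': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'Column5': [1, 1, 1, 1, 1, 1, 1, 1, 1]})


def merge_cardinality(df):
    # Ordered from strictest to loosest; many_to_many is skipped because
    # validate performs no check for it.
    relations = ['one_to_one', 'one_to_many', 'many_to_one']
    rows = []
    for left, right in product(df.columns, repeat=2):
        for relation in relations:
            try:
                # validate checks whether the merge keys are unique on the
                # left and/or right side and raises MergeError otherwise.
                pd.merge(df[[left]], df[[right]],
                         left_on=left, right_on=right, validate=relation)
            except MergeError:
                continue
            rows.append((left, right, relation))
            break  # keep only the strictest relationship that passed
    return pd.DataFrame(
        rows, columns=['first_column', 'second_column', 'cardinality'])


result = merge_cardinality(df)
print(result)
```

Pairs for which none of the three checks pass (the many-to-many ones) simply do not appear in the result.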

Output:

   first_column second_column  cardinality
0       Column1       Column1   one_to_one
1       Column1       Column2  one_to_many
2       Column1       Column3  one_to_many
3       Column1       Column4   one_to_one
4       Column1       Column5  one_to_many
5       Column2       Column1  many_to_one
6       Column2       Column4  many_to_one
7       Column3       Column1  many_to_one
8       Column3       Column4  many_to_one
9       Column4       Column1   one_to_one
10      Column4       Column2  one_to_many
11      Column4       Column3  one_to_many
12      Column4       Column4   one_to_one
13      Column4       Column5  one_to_many
14      Column5       Column1  many_to_one
15      Column5       Column4  many_to_one

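I tried using Andrea's answer to investigate some huge CSV files, and almost everything came out as many-to-many, even columns I was certain were 1-1. The problem was duplicates.

Here is a slightly modified version, with a demo, formatted to match database terminology (plus a description to disambiguate).

First, a clearer example: doctors issue many prescriptions, each prescription can contain several drugs, but each drug is made by one producer, and each producer makes only one drug.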

       doctor  prescription         drug producer
0  Doctor Who             1      aspirin    Bayer
1    Dr Welby             2      aspirin    Bayer
2       Dr Oz             3      aspirin    Bayer
3  Doctor Who             4  paracetamol  Tylenol
4    Dr Welby             5  paracetamol  Tylenol
5       Dr Oz             6  antibiotics    Merck
6  Doctor Who             7      aspirin    Bayer

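The correct results from my function are shown in the first cardinality table further down. Main changes from Andrea's version:

- Drop duplicates on the column pairs, so that a 1-1 relationship is not seen as "many".
- Put the results in a DataFrame (see `report_df` in the function) to make them easier to read.
- Invert the logic to match UML terms (I am not getting into the set vs. UML debate; this is simply the way I wanted it).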

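The wrong results, obtained without dropping duplicates, are shown in the second cardinality table further down. They are based on my modified copy of Andrea's algorithm without the drop_duplicates step.

You can see that the last row, from producer to drug, is many-to-many when it should be 1-1. This explains my initial results (which were hard to debug with thousands of records).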
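To make the effect of duplicates concrete, here is a small sketch on a hypothetical three-row frame (the `pairs` data and the variable names are made up for illustration). Counting rows per group reports 2 for a drug that appears twice with the same producer, while counting distinct pairs reports 1:

```python
import pandas as pd

# Hypothetical data: aspirin/Bayer appears twice, but it is still one pair.
pairs = pd.DataFrame({
    'drug': ['aspirin', 'aspirin', 'paracetamol'],
    'producer': ['Bayer', 'Bayer', 'Tylenol'],
})

# Counts rows, so the repeated aspirin/Bayer row inflates the result.
rows_per_drug = pairs.groupby('drug')['producer'].count().max()

# Counts distinct (drug, producer) pairs after removing repeated rows.
pairs_per_drug = (pairs.drop_duplicates()
                       .groupby('drug')['producer'].count().max())

# nunique() gives the same answer without an explicit drop_duplicates().
producers_per_drug = pairs.groupby('drug')['producer'].nunique().max()

print(rows_per_drug, pairs_per_drug, producers_per_drug)
```

Only the row count exceeds 1 here, which is why the version without `drop_duplicates` classifies a genuinely 1-1 pair as "many".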

The new function is listed at the end.

Thanks, Andrea. This helped me a lot.

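Wait, but how can one relationship have two types? Maybe for the purposes of a merge it does not matter whether it is 1:1 or m:m?

Found the reason, give me a moment and I will update my answer to correct it.

@Italot If I am not mistaken, the output is incorrect. Is Column1 -> Column2 really one-to-many?

Well, I might agree with you, but this way it follows the convention used in the question. If you prefer it the other way round, you can swap many-to-one and one-to-many.

Yes, I agree, but for the sake of correctness I would change it to the correct output; maybe the OP made a mistake. By the way, upvoted, nice answer, +1.

You are right, I swapped it. I also requested an edit to the original post to fix this. Thanks @italo

You are confusing set relationships with entity-diagram relationships; it is the other way round. For example, "many employees work for one department" => Emp (N…1) Dept relationship, and inside that department there is a one-to-many set relationship.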

Correct results (duplicate pairs dropped):

        column 1      column 2   cardinality                                        description
0         doctor  prescription     1-to-many   each doctor has many prescriptions (some  had 3)
1         doctor          drug  many-to-many  doctors had up to 2 drugs, and drugs up to 3 d...
2         doctor      producer  many-to-many  doctors had up to 2 producers, and producers u...
3   prescription        doctor     many-to-1             many prescriptions (max 3) to 1 doctor
4   prescription          drug     many-to-1               many prescriptions (max 4) to 1 drug
5   prescription      producer     many-to-1           many prescriptions (max 4) to 1 producer
6           drug        doctor  many-to-many  drugs had up to 3 doctors, and doctors up to 2...
7           drug  prescription     1-to-many     each drug has many prescriptions (some  had 4)
8           drug      producer        1-to-1               1 drug has 1 producer and vice versa
9       producer        doctor  many-to-many  producers had up to 3 doctors, and doctors up ...
10      producer  prescription     1-to-many  each producer has many prescriptions (some  ha...
11      producer          drug        1-to-1               1 producer has 1 drug and vice versa

Wrong results (duplicate pairs not dropped):

        column 1      column 2   cardinality                                        description
0         doctor  prescription     1-to-many   each doctor has many prescriptions (some  had 3)
1         doctor          drug  many-to-many  doctors had up to 3 drugs, and drugs up to 4 d...
2         doctor      producer  many-to-many  doctors had up to 3 producers, and producers u...
3   prescription        doctor     many-to-1             many prescriptions (max 3) to 1 doctor
4   prescription          drug     many-to-1               many prescriptions (max 4) to 1 drug
5   prescription      producer     many-to-1           many prescriptions (max 4) to 1 producer
6           drug        doctor  many-to-many  drugs had up to 4 doctors, and doctors up to 3...
7           drug  prescription     1-to-many     each drug has many prescriptions (some  had 4)
8           drug      producer  many-to-many  drugs had up to 4 producers, and producers up ...
9       producer        doctor  many-to-many  producers had up to 4 doctors, and doctors up ...
10      producer  prescription     1-to-many  each producer has many prescriptions (some  ha...
11      producer          drug  many-to-many  producers had up to 4 drugs, and drugs up to 4...

from itertools import product
import pandas as pd

def get_relation(df, col1, col2):
    # pair columns, drop duplicates (for proper 1-1), group by each column with 
    # the count of entries from the other column associated with each group 
    # .iloc[0] selects by position; plain [0] on a labelled Series is deprecated
    first_max = df[[col1, col2]].drop_duplicates().groupby(col1).count().max().iloc[0]
    second_max = df[[col1, col2]].drop_duplicates().groupby(col2).count().max().iloc[0]
    if first_max==1:
        if second_max==1:
            return '1-to-1', f'1 {col1} has 1 {col2} and vice versa'
        else:
            return 'many-to-1',f'many {col1}s (max {second_max}) to 1 {col2}'
    else:
        if second_max==1:
            return '1-to-many', f'each {col1} has many {col2}s (some  had {first_max})'
        else:
            return 'many-to-many', f'{col1}s had up to {first_max} {col2}s, and {col2}s up to {second_max} {col1}s'

def report_relations(df):
    report = [] 
    for col_i, col_j in product(df.columns, df.columns):
        if col_i == col_j:
            continue
        relation = get_relation(df, col_i, col_j)
        report.append([col_i, col_j, *relation])
    report_df = pd.DataFrame(report, columns=["column 1", "column 2", "cardinality", "description"])
    # formatting: widen the display so rows and columns are not wrapped or hidden
    pd.set_option('display.max_columns', 1000, 'display.width', 1000, 'display.max_rows', 1000)
    # print() works everywhere; in Jupyter, display(report_df) gives a nicer table
    print(report_df)
    return report_df

test_df = pd.DataFrame({
    'doctor': ['Doctor Who', 'Dr Welby', 'Dr Oz', 'Doctor Who', 'Dr Welby', 'Dr Oz', 'Doctor Who'],
    'prescription': [1, 2, 3, 4, 5, 6, 7],
    'drug': ['aspirin', 'aspirin', 'aspirin', 'paracetamol', 'paracetamol', 'antibiotics', 'aspirin'],
    'producer': ['Bayer', 'Bayer', 'Bayer', 'Tylenol', 'Tylenol', 'Merck', 'Bayer']
})

# display() only exists in Jupyter/IPython, so print() is used here
print(test_df)
report_relations(test_df)
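For a quick overview, the same classification can also be laid out as a square matrix, similar to the first answer's output. The sketch below is an assumption-laden variant, not part of the answer above: `cardinality_matrix` is a name invented here, and it uses `groupby(...).nunique()` in place of `drop_duplicates()` followed by `count()`, which should agree as long as the columns contain no missing values.

```python
import pandas as pd


def cardinality_matrix(df):
    cols = list(df.columns)
    out = pd.DataFrame('', index=cols, columns=cols)
    for a in cols:
        for b in cols:
            if a == b:
                out.loc[a, b] = '-'
                continue
            # Does some value of a go with more than one distinct value of b?
            many_b_per_a = df.groupby(a)[b].nunique().max() > 1
            # Does some value of b go with more than one distinct value of a?
            many_a_per_b = df.groupby(b)[a].nunique().max() > 1
            left = 'many' if many_a_per_b else '1'
            right = 'many' if many_b_per_a else '1'
            # Row label a, column label b, read as "a is <left>-to-<right> b".
            out.loc[a, b] = f'{left}-to-{right}'
    return out


test_df = pd.DataFrame({
    'doctor': ['Doctor Who', 'Dr Welby', 'Dr Oz', 'Doctor Who',
               'Dr Welby', 'Dr Oz', 'Doctor Who'],
    'prescription': [1, 2, 3, 4, 5, 6, 7],
    'drug': ['aspirin', 'aspirin', 'aspirin', 'paracetamol',
             'paracetamol', 'antibiotics', 'aspirin'],
    'producer': ['Bayer', 'Bayer', 'Bayer', 'Tylenol',
                 'Tylenol', 'Merck', 'Bayer'],
})

matrix = cardinality_matrix(test_df)
print(matrix)
```

The labels follow the same UML-style convention as the report, so the row `doctor`, column `prescription` cell reads 1-to-many.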