
Python: Merging DataFrames on multiple conditions, not only on equal values

python pandas merge


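First, apologies if this is a little long, but I want to fully describe the problem I'm having and what I have already tried.

I'm trying to merge two DataFrame objects on multiple conditions. I know how to do this when all of the conditions are "equals" operators, but I also need to use less-than and greater-than comparisons.

The DataFrames hold genetic information: one is a list of mutations in the genome (known as SNPs), and the other gives the locations of genes on the human genome. Calling df.head() on each returns the following: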


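SNP DataFrame (snp_df): this shows the SNP reference IDs and their locations. "BP" stands for "base-pair" position.

Gene DataFrame (gene_df): this shows the locations of all the genes of interest.

What I want to find is every SNP that falls within a gene region of the genome, discarding the SNPs that fall outside those regions.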

If I wanted to merge two DataFrames on multiple equality conditions, I would do the following:

merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])

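In this case, however, I need to find the SNPs whose chromosome value matches the one in the gene DataFrame and whose BP value lies between 'chr_start' and 'chr_stop'. What makes this challenging is that the DataFrames are very large: in the current dataset, snp_df has 6795021 rows and gene_df has 34362 rows.

I have tried to tackle this by looking at either chromosomes or genes separately. Since the sex chromosomes are not used, there are 22 different chromosome values (ints 1-22). Both approaches take an extremely long time. One uses the pandasql module, while the other loops through the individual genes.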

These are the SQL method and the gene-iteration method.
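The SQL query itself is not reproduced here. As a rough illustration of what a range join of this kind can look like, here is a minimal sketch that uses Python's built-in sqlite3 module rather than pandasql (that substitution is an assumption, as are the table names snp and gene and the toy rows). It is not the query from the question.

```python
import sqlite3

import pandas as pd

# Toy data (illustrative rows only, not the full dataset)
snp_df = pd.DataFrame({
    'chromosome': [1, 1, 1],
    'SNP': ['rs3094315', 'rs3131972', 'rs2073814'],
    'BP': [752566, 30400, 753474],
})
gene_df = pd.DataFrame({
    'chromosome': [1, 1],
    'chr_start': [14362, 30366],
    'chr_stop': [29370, 30503],
    'feature_id': ['GeneID:653635', 'GeneID:100302278'],
})

# Load both frames into an in-memory SQLite database
conn = sqlite3.connect(':memory:')
snp_df.to_sql('snp', conn, index=False)
gene_df.to_sql('gene', conn, index=False)

# Equality on chromosome plus an inclusive range condition on BP
query = """
    SELECT snp.SNP, gene.feature_id
    FROM snp
    JOIN gene
      ON snp.chromosome = gene.chromosome
     AND snp.BP BETWEEN gene.chr_start AND gene.chr_stop
"""
sql_result = pd.read_sql_query(query, conn)
conn.close()
print(sql_result)
```

On data of the size described above, a query like this would probably need an index on the join columns to be usable; that is untested here.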

Can anyone suggest a more efficient way of doing this?

You can accomplish what you want as follows:

# Equality join on chromosome, then keep rows whose BP lies inside the gene
merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) &
                      (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]

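I've just thought of a way to solve this, by combining my two approaches:

First, focus on individual chromosomes, and then loop through the genes within those smaller DataFrames. This also avoids any SQL queries. I've also included a section that immediately identifies redundant genes that have no SNPs falling within their range. This uses a double for-loop, which I normally try to avoid, but in this case it works quite well.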

import pandas as pd

all_dfs = []

for chromosome in snp_df['chromosome'].unique():
    # Restrict both frames to the current chromosome
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]

    # Getting rid of redundant genes: keep only genes that overlap the SNP
    # range. The bounds are inclusive so that a gene starting exactly at
    # max_bp (or ending exactly at min_bp) is not dropped by mistake.
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[(this_genes['chr_start'] <= max_bp) &
                                (this_genes['chr_stop'] >= min_bp)]

    for _, info in this_genes.iterrows():  # info is the gene's row as a Series
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if not this_snp.empty:
            # Keep the SNP IDs and tag them with the gene they fall in
            this_snp = this_snp[['SNP']].assign(feature_id=info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)

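Although this does not run especially fast, it runs fast enough for me to actually get some answers. I would still like to know whether anyone has suggestions for making it run more efficiently.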
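One possible way to remove the inner loop over genes is to sort the SNPs of each chromosome by BP and locate each gene's range with a binary search. The following is a minimal sketch under assumptions, not a tested replacement: the function name snps_in_genes and the toy rows are hypothetical, the column names are the ones used above, and the interval bounds are treated as inclusive.

```python
import numpy as np
import pandas as pd


def snps_in_genes(snp_df, gene_df):
    """Return (SNP, feature_id) pairs where chr_start <= BP <= chr_stop."""
    pieces = []
    for chromosome, snps in snp_df.groupby('chromosome'):
        genes = gene_df.loc[gene_df['chromosome'] == chromosome]
        if genes.empty:
            continue
        snps = snps.sort_values('BP')
        bp = snps['BP'].to_numpy()
        # For every gene, find the slice [lo, hi) of sorted SNPs inside it
        lo = np.searchsorted(bp, genes['chr_start'].to_numpy(), side='left')
        hi = np.searchsorted(bp, genes['chr_stop'].to_numpy(), side='right')
        counts = np.maximum(hi - lo, 0)
        # Expand each slice into explicit row positions
        gene_pos = np.repeat(np.arange(len(genes)), counts)
        slice_starts = np.repeat(np.cumsum(counts) - counts, counts)
        snp_pos = np.repeat(lo, counts) + np.arange(counts.sum()) - slice_starts
        pieces.append(pd.DataFrame({
            'SNP': snps['SNP'].to_numpy()[snp_pos],
            'feature_id': genes['feature_id'].to_numpy()[gene_pos],
        }))
    if not pieces:
        return pd.DataFrame(columns=['SNP', 'feature_id'])
    return pd.concat(pieces, ignore_index=True)


# Toy data (hypothetical), including two overlapping genes on chromosome 1
toy_snp = pd.DataFrame({
    'chromosome': [1, 1, 1, 2],
    'SNP': ['a', 'b', 'c', 'd'],
    'BP': [752566, 30400, 753474, 100],
})
toy_gene = pd.DataFrame({
    'chromosome': [1, 1, 1, 2],
    'chr_start': [14362, 30366, 30000, 50],
    'chr_stop': [29370, 30503, 31000, 100],
    'feature_id': ['G1', 'G2', 'G3', 'G4'],
})
toy_result = snps_in_genes(toy_snp, toy_gene)
print(toy_result)
```

Because each gene is matched against the sorted positions independently, overlapping genes are handled: a SNP that lies in two genes appears once per gene. The output still grows with the number of matching pairs, so memory use depends on how many SNPs are genic.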

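I did consider merging on chromosome and then filtering. The problem is that the merge on the full DataFrames produces an enormous intermediate result. As an example, for chromosome 1 alone there are 3511 entries in gene_df and 528381 entries in snp_df, so an inner join on just this chromosome yields 1855145691 rows! Also, the DataFrames shown in my original question are only the output of the head method. So although there are no matches in those rows, there should be plenty in the full DataFrames.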
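The size of that intermediate result can be estimated before attempting the merge: for each chromosome it is the number of SNP rows multiplied by the number of gene rows. A small sketch, with a hypothetical helper name (merge_size_per_chromosome) and toy data:

```python
import pandas as pd


def merge_size_per_chromosome(snp_df, gene_df):
    """Rows an equality-only merge on 'chromosome' would produce, per chromosome."""
    snp_counts = snp_df['chromosome'].value_counts()
    gene_counts = gene_df['chromosome'].value_counts()
    # A chromosome missing from either frame contributes 0 rows to an inner join
    return snp_counts.mul(gene_counts, fill_value=0).astype('int64').sort_index()


# Toy data (hypothetical): 3 SNPs and 2 genes on chromosome 1
toy_snp = pd.DataFrame({'chromosome': [1, 1, 1, 2]})
toy_gene = pd.DataFrame({'chromosome': [1, 1, 3]})
sizes = merge_size_per_chromosome(toy_snp, toy_gene)
print(sizes)

# The chromosome 1 figures quoted above: 528381 SNPs x 3511 genes
chr1_rows = 528381 * 3511
print(chr1_rows)  # 1855145691
```

Summing the returned Series gives the total row count of the merged frame before any range filter is applied.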

# Gene-iteration method: filter the full snp_df once per gene
all_dfs = []
for _, info in gene_df.iterrows():  # info is the gene's row as a Series
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) &
                          (snp_df['BP'] <= info['chr_stop'])]
    if not this_snp.empty:
        this_snp = this_snp[['SNP']].assign(feature_id=info['feature_id'])
        all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)

Example input and output for the merge-then-filter answer:

snp_df
Out[193]: 
   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972   30400
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144

gene_df
Out[194]: 
   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520

merged_df
Out[195]: 
         SNP        feature_id
8  rs3131972  GeneID:100302278
