Python: Pandas merge intervals by range
Tags: python, pandas, bioinformatics


I have a pandas dataframe that looks like this:

  chrom  start  end  probability   read
0  chr1      1   10         0.99  read1
1  chr1      5   25         0.99  read2
2  chr1     15   25         0.99  read2
3  chr1     30   40         0.75  read4
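What I want to do is merge the intervals that are on the same chromosome (chrom column) and whose coordinates (start, end) overlap. In some cases, when several intervals overlap one another, some intervals should be merged even though they do not overlap each other directly. See rows 0 and 2 in the example above and the merged output below.

For the rows that get merged, I want to sum their probabilities (probability column) and count the unique elements in the "read" column.

Using the example above, this would give the following output. Note that rows 0, 1 and 2 have been merged: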

 chrom  start  end  probability  read
0  chr1      1   20         2.97     2
1  chr1     30   40         0.75     1
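So far I have been doing this with pybedtools merge, but it turns out to be slow when it has to run millions of times (my case). So I am looking for alternatives, and pandas is the obvious one. I know that with pandas groupby I can apply different operations to the columns being merged, such as nunique and sum, which are the ones I need. However, pandas groupby only merges rows that have exactly the same "chrom", "start" and "end" coordinates.

My problem is that I don't know how to use pandas to merge rows based on their coordinates (chrom, start, end) and then apply the sum and nunique operations.

Is there a fast way to do this?

Thanks!

PS: As I said in the question, I am doing this millions of times, so speed is a big issue. That is why I cannot use pybedtools or pure Python, which are too slow for my purpose.

Thanks!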

IIUC

df.groupby((df.end.shift()-df.start).lt(0).cumsum()).agg({'chrom':'first','start':'first','end':'last','probability':'sum','read':'nunique'})
Out[417]: 
  chrom  start  end  probability  read
0  chr1      1   20         2.97     2
1  chr1     30   40         0.75     1
More info: creating the group key

(df.end.shift()-df.start).lt(0).cumsum()
Out[418]: 
0    0
1    0
2    0
3    1
dtype: int32
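
To make the group key easier to follow, here is the same expression split into named steps. This is only a sketch on the question's example frame; the intermediate names (prev_end, gap, new_group, key) are made up for illustration, and it assumes the rows are already sorted by start within a single chromosome.

```python
import pandas as pd

# The example frame from the question, as given.
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr1"],
    "start": [1, 5, 15, 30],
    "end": [10, 25, 25, 40],
    "probability": [0.99, 0.99, 0.99, 0.75],
    "read": ["read1", "read2", "read2", "read4"],
})

# End of the previous row (NaN for the first row).
prev_end = df["end"].shift()

# Negative when a row starts after the previous row has ended.
gap = prev_end - df["start"]

# True exactly where a new, non-overlapping block begins.
# NaN compares as False, so the first row does not open a new block.
new_group = gap.lt(0)

# Running count of block starts; rows sharing a value are merged together.
key = new_group.cumsum()
print(key.tolist())
```

Note that each row is only compared with the row directly before it, which matters for nested intervals.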

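As @root pointed out, the accepted answer does not generalize to similar cases. For example, if we add an extra row with the range 2-3 to the example from the question: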

df = pd.DataFrame({'chrom': ['chr1','chr1','chr1','chr1','chr1'], 
    'start': [1, 2, 5, 15, 30],
    'end': [10, 3, 20, 25, 40],
    'probability': [0.99, 0.99, 0.99, 0.99, 0.75],
    'read': ['read1','read2','read2','read2','read4']})
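...the suggested aggregation function outputs the following dataframe. Note that position 4 lies within the range 1-10, but it is no longer covered. The ranges 1-10, 2-3, 5-20 and 15-25 all overlap, so they should be grouped together.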
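
The behaviour described here can be reproduced with a short sketch that applies the accepted answer's group key and aggregation to the five-row frame. The variable names key and out are illustrative, and the values in the comments were worked out by hand from the expression.

```python
import pandas as pd

df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr1", "chr1"],
    "start": [1, 2, 5, 15, 30],
    "end": [10, 3, 20, 25, 40],
    "probability": [0.99, 0.99, 0.99, 0.99, 0.75],
    "read": ["read1", "read2", "read2", "read2", "read4"],
})

# Group key from the accepted answer: compares each row only with the previous one.
key = (df["end"].shift() - df["start"]).lt(0).cumsum()
# key is [0, 0, 1, 1, 2]: the row 5-20 is compared with 2-3, not with 1-10.

out = df.groupby(key).agg({
    "chrom": "first",
    "start": "first",
    "end": "last",
    "probability": "sum",
    "read": "nunique",
})
print(out)
# The merged ranges come out as 1-3, 5-25 and 30-40, so position 4 falls in a gap.
```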

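One solution is the following approach (using the aggregation function suggested by @W-B together with a method for combining intervals).

...which outputs the following dataframe. The total probability in the first row is 3.96 because we are combining four rows instead of three.

While this approach should be more general, it is not necessarily fast! Hopefully others can suggest faster alternatives.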
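
For reference, here is a minimal sketch of one way to group transitively overlapping intervals in plain pandas. It is not the code from the answer above. It assumes that comparing each start with the running maximum of all earlier ends on the same chromosome is an acceptable definition of "overlapping", and that an interval starting exactly where the previous block ends is merged into it (the same convention as the lt(0) test in the accepted answer). The helper name merge_overlapping is hypothetical.

```python
import pandas as pd

def merge_overlapping(df):
    """Hypothetical helper: merge overlapping intervals per chromosome."""
    df = df.sort_values(["chrom", "start", "end"]).reset_index(drop=True)

    # Furthest end seen so far within each chromosome.
    running_end = df.groupby("chrom")["end"].cummax()

    # The same value shifted down one row, so each row sees only earlier rows.
    prev_running_end = running_end.groupby(df["chrom"]).shift()

    # A new cluster starts on the first row of a chromosome, or when a row
    # starts after every earlier interval on that chromosome has ended.
    new_cluster = prev_running_end.isna() | (df["start"] > prev_running_end)
    cluster = new_cluster.cumsum().rename("cluster")

    return (
        df.groupby(cluster)
          .agg(
              chrom=("chrom", "first"),
              start=("start", "min"),
              end=("end", "max"),
              probability=("probability", "sum"),
              read=("read", "nunique"),
          )
          .reset_index(drop=True)
    )

df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr1", "chr1"],
    "start": [1, 2, 5, 15, 30],
    "end": [10, 3, 20, 25, 40],
    "probability": [0.99, 0.99, 0.99, 0.99, 0.75],
    "read": ["read1", "read2", "read2", "read2", "read4"],
})

merged = merge_overlapping(df)
print(merged)
```

On the five-row example this puts the first four rows into one block (1-25, probability sum 3.96) and leaves 30-40 on its own. Everything is vectorized, but no claim is made here about how it compares in speed with the other answers.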

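Here is an answer using pyranges and pandas. Its advantage is that it does the merging very quickly, is easy to parallelize, and is super fast even in single-core mode.

Setup: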

import pandas as pd
import pyranges as pr
import numpy as np

rows = int(1e7)
gr = pr.random(rows)
gr.probability = np.random.rand(rows)
gr.read = np.arange(rows)
print(gr)

# +--------------+-----------+-----------+--------------+----------------------+-----------+
# | Chromosome   | Start     | End       | Strand       | probability          | read      |
# | (category)   | (int32)   | (int32)   | (category)   | (float64)            | (int64)   |
# |--------------+-----------+-----------+--------------+----------------------+-----------|
# | chr1         | 149953099 | 149953199 | +            | 0.7536048547309669   | 0         |
# | chr1         | 184344435 | 184344535 | +            | 0.9358130407479777   | 1         |
# | chr1         | 238639916 | 238640016 | +            | 0.024212603310159064 | 2         |
# | chr1         | 95180042  | 95180142  | +            | 0.027139751993808026 | 3         |
# | ...          | ...       | ...       | ...          | ...                  | ...       |
# | chrY         | 34355323  | 34355423  | -            | 0.8843190383030953   | 999996    |
# | chrY         | 1818049   | 1818149   | -            | 0.23138017743097572  | 999997    |
# | chrY         | 10101456  | 10101556  | -            | 0.3007915302642412   | 999998    |
# | chrY         | 355910    | 356010    | -            | 0.03694752911338561  | 999999    |
# +--------------+-----------+-----------+--------------+----------------------+-----------+
# Stranded PyRanges object has 1,000,000 rows and 6 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
Execution:

def praderas(df):
    grpby = df.groupby("Cluster")
    prob = grpby.probability.sum()
    prob.name = "ProbSum"
    n = grpby.read.count()
    n.name = "Count"

    return df.merge(prob, on="Cluster").merge(n, on="Cluster")

%time result = gr.cluster().apply(praderas)
# 11.4s !
result[result.Count > 2]
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# | Chromosome   | Start     | End       | Strand       | probability          | read      | Cluster   | ProbSum            | Count     |
# | (category)   | (int32)   | (int32)   | (category)   | (float64)            | (int64)   | (int32)   | (float64)          | (int64)   |
# |--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------|
# | chr1         | 52952     | 53052     | +            | 0.7411051557901921   | 59695     | 70        | 2.2131010082513884 | 3         |
# | chr1         | 52959     | 53059     | +            | 0.9979036360671423   | 356518    | 70        | 2.2131010082513884 | 3         |
# | chr1         | 53029     | 53129     | +            | 0.47409221639405397  | 104776    | 70        | 2.2131010082513884 | 3         |
# | chr1         | 64657     | 64757     | +            | 0.32465233067499366  | 386140    | 88        | 1.3880589602361695 | 3         |
# | ...          | ...       | ...       | ...          | ...                  | ...       | ...       | ...                | ...       |
# | chrY         | 59356855  | 59356955  | -            | 0.3877207561218887   | 9966373   | 8502533   | 1.182153891322546  | 4         |
# | chrY         | 59356865  | 59356965  | -            | 0.4007557656399032   | 9907364   | 8502533   | 1.182153891322546  | 4         |
# | chrY         | 59356932  | 59357032  | -            | 0.33799123310907786  | 9978653   | 8502533   | 1.182153891322546  | 4         |
# | chrY         | 59356980  | 59357080  | -            | 0.055686136451676305 | 9994845   | 8502533   | 1.182153891322546  | 4         |
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# Stranded PyRanges object has 606,212 rows and 9 columns from 24 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.

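Comments:

No, but it can easily be sorted with pandas.

I think that would help with the "merge" part, right?

In your example output dataframe, did you mean for the first row to end at "25" rather than "20"?

Yes, I did. Edited :)

I think you edited the wrong cell! I expected the edit to be in the first row of the second dataframe, to indicate the range 1 (start) to 25 (end).

I don't understand the (df.end.shift()-df.start).lt(0).cumsum() part. Could you explain it? It seems to be crucial to solving my problem.

@Praderas It checks for overlap, e.g. for [1,10] and [5,15] we use 5...

@Praderas In that case, do you want to merge the first two or the last two? (It will be treated as one interval and all 3 will be merged.) Since you mention this case, I think you need to clarify.

This does not work in some cases with nested intervals, e.g. [[1, 10], [2, 3], [5, 6]] gets grouped as [[1, 3], [5, 6]].

As @root pointed out, the accepted solution is misleading. Try:
df = pd.DataFrame({'chrom': ['chr1','chr1','chr1'], 'start': [1, 2, 5], 'end': [10, 3, 6], 'probability': [0.99, 0.99, 0.99], 'read': ['read1','read2','read3']})
df.groupby((df.end.shift()-df.start).lt(0).cumsum()).agg({'chrom':'first','start':'first','end':'last','probability':'sum','read':'nunique'})

@Praderas For large datasets my answer is (probably) >100x faster. Also, the answer you accepted is wrong in the general case, because it only considers pairs of adjacent rows.

Where is the Cluster column in the input df given by the OP? Trying to understand your answer.

It is produced by the function gr.cluster() :)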