Python: a better solution than iterrows for comparing values between different datasets
I have been using iterrows() to compare column values between two datasets and to merge rows when certain conditions are met, but this takes a very long time. Is there a better way to do this without iterating?
Here is the full function:
import logging

import pandas as pd

def find_peak_matches(lncRNA, CAGE):
    """isolates CAGE peaks that match an lncRNA"""
    lncRNA['promoter_start'] = lncRNA['promoter_start'].apply(pd.to_numeric).astype('int32')
    lncRNA['promoter_stop'] = lncRNA['promoter_stop'].apply(pd.to_numeric).astype('int32')
    CAGE['peak_start'] = CAGE['peak_start'].apply(pd.to_numeric).astype('int32')
    CAGE['peak_stop'] = CAGE['peak_stop'].apply(pd.to_numeric).astype('int32')
    peak_matches = pd.DataFrame()
    for i, row in lncRNA.iterrows():
        mask = (
            (CAGE['chr'] == row['chr']) &
            (row['promoter_start'] <= CAGE['peak_start']) &
            (row['promoter_stop'] >= CAGE['peak_stop'])
        )  # finds peaks in lncRNA promoters
        matches = CAGE[mask].dropna()  # isolates only the peak matches
        if len(matches) == 0:  # if no matches found continue
            continue
        merged = pd.merge(
            row.to_frame().T, matches,
            on=['chr']
        )  # merges rows that meet mask conditions
        peak_matches = pd.concat(
            [peak_matches, merged],
            ignore_index=True
        )  # creates a new df from all the merged rows
    logging.debug('found peak matches')
    return peak_matches
Sample CAGE:
ID chr peak_start peak_stop
peak1 1 3 7
peak2 1 15 17
peak3 1 4 6
peak4 2 6 9
Desired output:
name chr promoter_start promoter_stop info ID peak_start peak_stop
lnc1 1 1 10 x peak1 3 7
lnc1 1 1 10 x peak3 4 6
lnc2 1 11 20 y peak2 15 17
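The real datasets each contain about 200,000 rows, so my current code takes far too long. I am trying to merge rows where the peak and the promoter have the same chr value and the peak start/stop falls between the promoter start/stop. Any suggestions for optimizing this? I am fairly new to Python, so I don't know what the best approach is.
You can merge the whole lncRNA and CAGE dataframes and then filter the result with df.query. For example: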
df = lncRNA.merge(CAGE, on='chr')
df = df.query('(promoter_start <= peak_start) & (promoter_stop >= peak_stop)')
print(df)
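To make the merge-then-filter idea easy to try, here is a minimal self-contained sketch. The promoter and peak values are the small samples shown in this thread; the variable names `merged` and `matches` are only illustrative. It also prints the size of the intermediate frame, since every promoter is paired with every peak on the same chr before filtering, and that intermediate result is what can exhaust memory on large inputs.

```python
import pandas as pd

# Sample promoters and peaks; values taken from the tables in this thread.
lncRNA = pd.DataFrame({
    'name': ['lnc1', 'lnc2', 'lnc3'],
    'chr': [1, 1, 1],
    'promoter_start': [1, 11, 21],
    'promoter_stop': [10, 20, 30],
    'info': ['x', 'y', 'z'],
})
CAGE = pd.DataFrame({
    'ID': ['peak1', 'peak2', 'peak3', 'peak4'],
    'chr': [1, 1, 1, 2],
    'peak_start': [3, 15, 4, 6],
    'peak_stop': [7, 17, 6, 9],
})

# Pair every promoter with every peak on the same chr ...
merged = lncRNA.merge(CAGE, on='chr')
# ... then keep only the peaks that lie inside the promoter interval.
matches = merged.query('(promoter_start <= peak_start) & (promoter_stop >= peak_stop)')

print(len(merged), len(matches))  # 9 candidate pairs, 3 of them match
print(matches)
```

On real data the intermediate frame grows with the product of the per-chr row counts, so it is worth estimating that size before running the merge on the full datasets.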
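You may have better luck doing all the heavy lifting before you put the data into pandas, where you can be more selective about the number of comparisons that need to be made, although you give up some of the numpy acceleration that pandas provides. For convenience, I wrote the example below using namedtuples and do all of the comparisons before creating the dataframe. With 200K x 200K rows of fake data it finished on my machine in about 30 seconds and produced roughly 10 million matching rows, a count that depends entirely on the diversity of the random data I used. YMMV.
There is probably more "left on the floor" here. Some smart sorting (beyond the binning by 'chr' that I used) could take it further.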
import pandas as pd
from collections import namedtuple, defaultdict
from random import randint

# structures
rna = namedtuple('rna', 'name chr promoter_start promoter_stop info')
cage = namedtuple('cage', 'ID chr peak_start peak_stop')
row = namedtuple('row', 'name chr promoter_start promoter_stop info ID peak_start peak_stop')

# some data entry from post to check...
rnas = [rna('inc1',1,1,10,'x'), rna('inc2',1,11,20,'y'), rna('inc1',1,21,30,'z')]
cages = [cage('peak1',1,3,7), cage('peak2',1,15,17), cage('peak3',1,4,6), cage('peak4',2,6,9)]

result_rows = [row(r.name, r.chr, r.promoter_start, r.promoter_stop, r.info, c.ID, c.peak_start, c.peak_stop)
               for r in rnas for c in cages if
               r.chr == c.chr and
               r.promoter_start <= c.peak_start and
               r.promoter_stop >= c.peak_stop]

df = pd.DataFrame(data=result_rows)
print(df)
print()

# stress test
# big fake data
rnas = [rna('xx', randint(1,1000), randint(1,50), randint(10,150), 'yy') for t in range(200_000)]
cages = [cage('pk', randint(1,1000), randint(1,50), randint(10,150)) for t in range(200_000)]

# group by chr to expedite comparisons
rna_dict = defaultdict(list)
cage_dict = defaultdict(list)
for r in rnas:
    rna_dict[r.chr].append(r)
for c in cages:
    cage_dict[c.chr].append(c)
print('fake data made')

# use the chr's that are keys in the rna dictionary and make all comparisons...
result_rows = []
for k in rna_dict.keys():
    # .get(k, []) skips any chr that has promoters but no peaks
    result_rows.extend([row(r.name, r.chr, r.promoter_start, r.promoter_stop, r.info, c.ID, c.peak_start, c.peak_stop)
                        for r in rna_dict.get(k) for c in cage_dict.get(k, []) if
                        r.promoter_start <= c.peak_start and
                        r.promoter_stop >= c.peak_stop])

df = pd.DataFrame(data=result_rows)
print(df.head(5))
print(df.info())
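The remark about smarter sorting could be sketched roughly as follows. This is not part of the answer's code: the names `Promoter`, `Peak` and `match_sorted` are made up for illustration, and the sketch assumes every peak is well formed (peak_start <= peak_stop), which the random stress-test data above does not guarantee. Within each chr the peaks are sorted by peak_start, and the standard-library `bisect_left` finds the first peak that can possibly fit, so each promoter only scans the peaks that start inside its interval.

```python
from bisect import bisect_left
from collections import defaultdict, namedtuple

# Hypothetical record types mirroring the columns used in this thread.
Promoter = namedtuple('Promoter', 'name chr promoter_start promoter_stop info')
Peak = namedtuple('Peak', 'ID chr peak_start peak_stop')

def match_sorted(promoters, peaks):
    """Return (promoter, peak) pairs where the peak lies inside the promoter.

    Assumes peak_start <= peak_stop for every peak.
    """
    by_chr = defaultdict(list)
    for p in peaks:
        by_chr[p.chr].append(p)

    # Sort each bin once and keep a parallel list of start positions for bisect.
    starts = {}
    for chrom, plist in by_chr.items():
        plist.sort(key=lambda p: p.peak_start)
        starts[chrom] = [p.peak_start for p in plist]

    matches = []
    for r in promoters:
        plist = by_chr.get(r.chr)
        if not plist:
            continue
        # First peak whose start is not before the promoter start.
        i = bisect_left(starts[r.chr], r.promoter_start)
        # A peak starting after the promoter stop cannot fit, so stop there.
        while i < len(plist) and plist[i].peak_start <= r.promoter_stop:
            if plist[i].peak_stop <= r.promoter_stop:
                matches.append((r, plist[i]))
            i += 1
    return matches

promoters = [Promoter('lnc1', 1, 1, 10, 'x'), Promoter('lnc2', 1, 11, 20, 'y'),
             Promoter('lnc3', 1, 21, 30, 'z')]
peaks = [Peak('peak1', 1, 3, 7), Peak('peak2', 1, 15, 17),
         Peak('peak3', 1, 4, 6), Peak('peak4', 2, 6, 9)]
pairs = [(r.name, c.ID) for r, c in match_sorted(promoters, peaks)]
print(pairs)
```

How much this helps depends on how the intervals are distributed; when promoters are short relative to the span of a chr, most peaks in the bin are never visited.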
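Merging the full dfs takes up a lot of memory. In my code, a similar merge-then-filter approach got the process killed by the server, which is why I started using masks in the first place.
@keenan Maybe you could merge/filter step by step? First merge/filter only chr=1, then merge/filter chr=2, and so on.
Is there an easy way to convert a dataframe to namedtuples? I did a lot of modification in pandas before getting to this point, but I like this approach.
Added to the answer. I don't have much experience with it, but I found a few things...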
Output of the merge and query example from the first answer:
name chr promoter_start promoter_stop info ID peak_start peak_stop
0 lnc1 1 1 10 x peak1 3 7
2 lnc1 1 1 10 x peak3 4 6
4 lnc2 1 11 20 y peak2 15 17
Output of the namedtuple script:
name chr promoter_start promoter_stop info ID peak_start peak_stop
0 inc1 1 1 10 x peak1 3 7
1 inc1 1 1 10 x peak3 4 6
2 inc2 1 11 20 y peak2 15 17
fake data made
name chr promoter_start promoter_stop info ID peak_start peak_stop
0 xx 804 34 35 yy pk 36 11
1 xx 804 34 35 yy pk 39 11
2 xx 804 34 35 yy pk 37 14
3 xx 804 34 35 yy pk 34 28
4 xx 804 34 35 yy pk 39 20
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10280046 entries, 0 to 10280045
Data columns (total 8 columns):
name object
chr int64
promoter_start int64
promoter_stop int64
info object
ID object
peak_start int64
peak_stop int64
dtypes: int64(5), object(3)
memory usage: 627.4+ MB
None
[Finished in 35.4s]
Converting a dataframe to namedtuples (IPython session added to the answer):
In [22]: df
Out[22]:
name chr promoter_start promoter_stop info
0 lnc1 1 1 10 x
1 lnc2 1 11 20 y
2 lnc3 1 21 30 z
In [23]: rna = namedtuple('rna', 'name chr promoter_start promoter_stop info')
In [24]: rows = [rna(*t) for t in df.itertuples(index=False)]
In [25]: rows
Out[25]:
[rna(name='lnc1', chr=1, promoter_start=1, promoter_stop=10, info='x'),
rna(name='lnc2', chr=1, promoter_start=11, promoter_stop=20, info='y'),
rna(name='lnc3', chr=1, promoter_start=21, promoter_stop=30, info='z')]
In [26]: rna = namedtuple('rna', 'name chr info promoter_start promoter_stop') # note: wrong
In [27]: rows = [rna(*t) for t in df.itertuples(index=False)]
In [28]: rows
Out[28]:
[rna(name='lnc1', chr=1, info=1, promoter_start=10, promoter_stop='x'),
rna(name='lnc2', chr=1, info=11, promoter_start=20, promoter_stop='y'),
rna(name='lnc3', chr=1, info=21, promoter_start=30, promoter_stop='z')]
In [29]: # note the above is mis-aligned!!!
In [32]: rows = [t for t in df.itertuples(name='row', index=False)]
In [33]: rows
Out[33]:
[row(name='lnc1', chr=1, promoter_start=1, promoter_stop=10, info='x'),
row(name='lnc2', chr=1, promoter_start=11, promoter_stop=20, info='y'),
row(name='lnc3', chr=1, promoter_start=21, promoter_stop=30, info='z')]
In [34]: type(rows[0])
Out[34]: pandas.core.frame.row
In [35]: rows[0].chr
Out[35]: 1
In [36]: rows[0].info
Out[36]: 'x'
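Putting the two pieces together, a sketch of how the itertuples conversion might feed the pre-pandas comparison is shown below. The function name `find_peak_matches_tuples` is made up for this example, and it assumes the column names are valid Python identifiers so that `itertuples` can expose them as attributes. Because `itertuples(name=...)` takes the field order from the dataframe itself, it avoids the misalignment shown in the session above.

```python
import pandas as pd
from collections import defaultdict

def find_peak_matches_tuples(lncRNA, CAGE):
    """Compare rows as namedtuples, binned by chr, then rebuild a dataframe."""
    # Bin the peaks by chr so each promoter only sees peaks on its own chr.
    cage_by_chr = defaultdict(list)
    for c in CAGE.itertuples(index=False, name='cage'):
        cage_by_chr[c.chr].append(c)

    out_columns = list(lncRNA.columns) + [col for col in CAGE.columns if col != 'chr']
    rows = []
    for r in lncRNA.itertuples(index=False, name='rna'):
        for c in cage_by_chr.get(r.chr, []):
            if r.promoter_start <= c.peak_start and r.promoter_stop >= c.peak_stop:
                # Promoter fields followed by the peak fields, minus the shared chr.
                peak_part = tuple(v for f, v in zip(c._fields, c) if f != 'chr')
                rows.append(tuple(r) + peak_part)
    return pd.DataFrame(rows, columns=out_columns)

lncRNA = pd.DataFrame({
    'name': ['lnc1', 'lnc2', 'lnc3'],
    'chr': [1, 1, 1],
    'promoter_start': [1, 11, 21],
    'promoter_stop': [10, 20, 30],
    'info': ['x', 'y', 'z'],
})
CAGE = pd.DataFrame({
    'ID': ['peak1', 'peak2', 'peak3', 'peak4'],
    'chr': [1, 1, 1, 2],
    'peak_start': [3, 15, 4, 6],
    'peak_stop': [7, 17, 6, 9],
})
result = find_peak_matches_tuples(lncRNA, CAGE)
print(result)
```

This keeps the existing pandas preprocessing intact: the frames are only converted to tuples for the comparison step, and the result comes back as a dataframe with the same columns as the desired output.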