Python: counting rows that match both a string and a number
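The SAMPLE column contains the numbers 1-12, and I am trying to count the mutations of each type (A:T, C:G, etc.) for each of those numbers. The code below works, but how can I modify it to cover all 12 samples for each mutation, rather than writing the same code 12 times per mutation?
In this example, AT gives me the count for SAMPLE == 1. I want the AT count for every sample number (1, 2, ... 12). How should I modify the code? I would appreciate your help. Thanks, everyone.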
    SAMPLE                  MUT
0       11   chr1:100154376:G:A
1        2   chr1:100177723:C:T
2        9   chr1:100177723:C:T
3        1  chr1:100194200:-:AA
4        8    chr1:10032249:A:G
5        2   chr1:100340787:G:A
6        1   chr1:100349757:A:G
7        3    chr1:10041186:C:A
8       10   chr1:100476986:G:C
9        4   chr1:100572459:C:T
10       5   chr1:100572459:C:T
...    ...                  ...
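# Select both columns; a list of names is needed inside the brackets.
d = df[["SAMPLE", "MUT"]]
chars1 = "TGC-"
number = {}
for item in chars1:
    # Rows whose MUT contains "A:<item>" and whose SAMPLE is 1.
    dm = d[(d["MUT"].str.contains("A:" + item)) & (d["SAMPLE"].isin([1]))]
    num1 = dm.count()
    number[item] = num1
AT = number["T"]
AG = number["G"]
AC = number["C"]
A_ = number["-"]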
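You can use a regular-expression substitution to create a column holding the mutation type (A->T, G->C, and so on), and then apply a pandas groupby to count: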
import pandas as pd
import re
df = pd.read_table('df.tsv')
df['mutation_type'] = df['MUT'].apply(lambda x: re.sub(r'^.*?:([^:]+:[^:]+)$', r'\1', x))
df.groupby(['SAMPLE','mutation_type']).agg('count')['MUT']
The output for the sample data looks like this:
SAMPLE  mutation_type
1       -:AA             1
        A:G              1
2       C:T              1
        G:A              1
3       C:A              1
4       C:T              1
5       C:T              1
8       A:G              1
9       C:T              1
10      G:C              1
11      G:A              1
Name: MUT, dtype: int64
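The substitution-plus-groupby idea can also be tried without a file on disk. The following is a minimal sketch, assuming the eleven rows shown in the question are the whole data set; the names `counts` and `ct_per_sample` are illustrative and not part of the original answer, and `size()` is used here in place of `agg('count')` purely for brevity.

```python
import re
import pandas as pd

# Rows copied from the question; real data would come from a file instead.
df = pd.DataFrame({
    "SAMPLE": [11, 2, 9, 1, 8, 2, 1, 3, 10, 4, 5],
    "MUT": [
        "chr1:100154376:G:A", "chr1:100177723:C:T", "chr1:100177723:C:T",
        "chr1:100194200:-:AA", "chr1:10032249:A:G", "chr1:100340787:G:A",
        "chr1:100349757:A:G", "chr1:10041186:C:A", "chr1:100476986:G:C",
        "chr1:100572459:C:T", "chr1:100572459:C:T",
    ],
})

# Keep only the trailing "ref:alt" part of each MUT string.
df["mutation_type"] = df["MUT"].apply(
    lambda x: re.sub(r"^.*?:([^:]+:[^:]+)$", r"\1", x)
)

# Number of rows for every (SAMPLE, mutation_type) pair.
counts = df.groupby(["SAMPLE", "mutation_type"]).size()

# Counts of one mutation type across all samples that contain it.
ct_per_sample = counts.xs("C:T", level="mutation_type")
print(ct_per_sample)
```

Selecting one level with `xs` gives the per-sample counts for a single mutation type, which is the shape of result the question asks for.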
My answer is similar to a.p.'s:
import pandas as pd
df = pd.DataFrame(data={'SAMPLE': [11,2,9,1,8,2,1,3,10,4,5], 'MUT': ['chr1:100154376:G:A', 'chr1:100177723:C:T', 'chr1:100177723:C:T', 'chr1:100194200:-:AA', 'chr1:10032249:A:G', 'chr1:100340787:G:A', 'chr1:100349757:A:G', 'chr1:10041186:C:A', 'chr1:100476986:G:C', 'chr1:100572459:C:T', 'chr1:100572459:C:T']}, columns=['SAMPLE', 'MUT'])
# Strip the leading "chromosome:position:" prefix, leaving e.g. "G:A".
df['Sequence'] = df['MUT'].str.replace(r'^\w+:\d+:', '', regex=True)
df.groupby(['SAMPLE', 'Sequence']).count()
which produces:
                 MUT
SAMPLE Sequence
1      -:AA        1
       A:G         1
2      C:T         1
       G:A         1
3      C:A         1
4      C:T         1
5      C:T         1
8      A:G         1
9      C:T         1
10     G:C         1
11     G:A         1
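If a wide table (one row per sample, one column per mutation type) is easier to read than the stacked output, `pd.crosstab` is one way to get it. This is a variation rather than what either answer above wrote; the sketch assumes the same eleven rows from the question, and the name `table` is illustrative.

```python
import pandas as pd

# Rows copied from the question.
df = pd.DataFrame({
    "SAMPLE": [11, 2, 9, 1, 8, 2, 1, 3, 10, 4, 5],
    "MUT": [
        "chr1:100154376:G:A", "chr1:100177723:C:T", "chr1:100177723:C:T",
        "chr1:100194200:-:AA", "chr1:10032249:A:G", "chr1:100340787:G:A",
        "chr1:100349757:A:G", "chr1:10041186:C:A", "chr1:100476986:G:C",
        "chr1:100572459:C:T", "chr1:100572459:C:T",
    ],
})

# Remove the "chromosome:position:" prefix; regex=True is passed explicitly
# so the pattern is treated as a regular expression.
df["Sequence"] = df["MUT"].str.replace(r"^\w+:\d+:", "", regex=True)

# Cross-tabulate: rows are samples, columns are mutation types, cells are counts.
table = pd.crosstab(df["SAMPLE"], df["Sequence"])
print(table)
```

Combinations that never occur show up as 0 in the table instead of being absent.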
I would use the native string extract method in pandas:
df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)')
This returns the matches for the different groups:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN G NaN NaN
5 NaN NaN NaN NaN
6 NaN G NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
Then I would use pd.isnull to convert this to True or False, and use ~ to invert it. That gives True where there is a match and False where there is none:
~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
0 1 2 3
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False True False False
5 False False False False
6 False True False False
7 False False False False
8 False False False False
9 False False False False
10 False False False False
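The invert-the-null-mask step can be checked on a few values. The sketch below uses three MUT strings taken from the question and compares `~pd.isnull(...)` with `DataFrame.notna()`, which expresses the same thing in one call; the variable names are illustrative.

```python
import pandas as pd

# Three MUT values taken from the question's data.
mut = pd.Series(["chr1:10032249:A:G", "chr1:100177723:C:T", "chr1:100349757:A:G"])

# One capture group per alternative; unmatched groups come back as NaN.
groups = mut.str.extract(r"A:(T)|A:(G)|A:(C)|A:(-)")

# Inverting the null mask marks the cells where a group matched.
inverted = ~pd.isnull(groups)

# notna() builds the same boolean frame directly.
direct = groups.notna()

print(inverted.equals(direct))
print(direct)
```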
Then assign the result to the DataFrame:
df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
SAMPLE MUT T G C -
0 11 chr1:100154376:G:A False False False False
1 2 chr1:100177723:C:T False False False False
2 9 chr1:100177723:C:T False False False False
3 1 chr1:100194200:-:AA False False False False
4 8 chr1:10032249:A:G False True False False
5 2 chr1:100340787:G:A False False False False
6 1 chr1:100349757:A:G False True False False
7 3 chr1:10041186:C:A False False False False
8 10 chr1:100476986:G:C False False False False
9 4 chr1:100572459:C:T False False False False
10 5 chr1:100572459:C:T False False False False
Now we can simply sum the columns:
df[["T","G","C","-"]].sum()
T 0
G 2
C 0
- 0
But wait, we have not restricted this to SAMPLE == 1 yet. We can do that easily with a mask:
sample_one_mask = df.SAMPLE == 1
df[sample_one_mask][["T","G","C","-"]].sum()
T 0
G 1
C 0
- 0
If you want this count per sample instead, you can use the groupby function:
df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").sum().astype(int)
        T  G  C  -
SAMPLE
1       0  1  0  0
2       0  0  0  0
3       0  0  0  0
4       0  0  0  0
5       0  0  0  0
8       0  1  0  0
9       0  0  0  0
10      0  0  0  0
11      0  0  0  0
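Note that the table above only lists samples that actually occur in the data (6, 7 and 12 are missing from the excerpt). If a row for every sample from 1 to 12 is wanted, one option is to reindex the grouped result. This is a sketch under the assumption that the sample numbers are exactly 1-12, as the question states; `flags`, `per_sample` and `all_samples` are illustrative names.

```python
import pandas as pd

# Rows copied from the question.
df = pd.DataFrame({
    "SAMPLE": [11, 2, 9, 1, 8, 2, 1, 3, 10, 4, 5],
    "MUT": [
        "chr1:100154376:G:A", "chr1:100177723:C:T", "chr1:100177723:C:T",
        "chr1:100194200:-:AA", "chr1:10032249:A:G", "chr1:100340787:G:A",
        "chr1:100349757:A:G", "chr1:10041186:C:A", "chr1:100476986:G:C",
        "chr1:100572459:C:T", "chr1:100572459:C:T",
    ],
})

# True where the corresponding "A:<base>" alternative matched.
flags = df["MUT"].str.extract(r"A:(T)|A:(G)|A:(C)|A:(-)").notna()
flags.columns = ["T", "G", "C", "-"]

# Sum the boolean flags within each sample.
per_sample = flags.groupby(df["SAMPLE"]).sum().astype(int)

# Assumption: samples are numbered 1..12; absent samples get a row of zeros.
all_samples = per_sample.reindex(range(1, 13), fill_value=0).rename_axis("SAMPLE")
print(all_samples)
```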
TLDR
Do this:
df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").sum().astype(int)
Do you only want to match "T", "G", "C" and "-", or anything that follows "A:"?
This is exactly what I wanted, thank you very much :)