Python: How to extract information from one column of a data frame and insert it into a column on the right
Tags: python, pandas, bash
I have a tab-delimited table whose first three lines look like this (a header row and the first two entries):
Geneid Chr Start End Strand Length Feature_count contig_ID MAG_id RPKM
ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical protein G1_719_cleanedcontig_v2_1580 346495 347049 + 555 68733 NODE_28_length_349332_cov_12.741083 ag0r3_bin.39 11455.58033225708
ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical protein G1_719_cleanedcontig_v2_1582 147164 151051 - 3888 61026 NODE_113_length_189623_cov_11.186889 ag0r3_bin.39 1451.8890393965803
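For each row, I would like to extract the information between "ID=" and the first semicolon (for example, "G1_719_cleanedcontig_v2_1580_319" for the first row) and put it in a new column on the right. How can I do this with Bash, Python, or a combination of the two?
Suppose the dataframe is: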
text
0 Geneid Chr Start End Strand Length Featur...
1 ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=...
2 ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=...
Just extract the characters between "ID=" and the first ";":
df['newcolumn'] = df['text'].str.extract(r'(?<=ID=)(.*?)(?=;)', expand=False)
text \
0 Geneid Chr Start End Strand Length Featur...
1 ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=...
2 ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=...
newcolumn
0 NaN
1 G1_719_cleanedcontig_v2_1580_319
2 G1_719_cleanedcontig_v2_1582_130
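For a self-contained check of the approach above, here is a minimal sketch that reads a tab-separated table and appends the extracted ID as a new rightmost column. The inline sample (shortened to three columns), the column name `gene_id`, and the helper `add_gene_id` are illustrative assumptions, not part of the original post.

```python
import io
import pandas as pd

# Shortened stand-in for the tab-delimited file in the question (assumption:
# the real file has more columns, but only Geneid matters here).
SAMPLE_TSV = (
    "Geneid\tChr\tStart\n"
    "ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;"
    "product=hypothetical protein\tG1_719_cleanedcontig_v2_1580\t346495\n"
    "ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;"
    "product=hypothetical protein\tG1_719_cleanedcontig_v2_1582\t147164\n"
)


def add_gene_id(df, source="Geneid", target="gene_id"):
    """Return a copy of df with the text between 'ID=' and the first ';'
    appended as a new last column."""
    out = df.copy()
    # ^ID= anchors at the start of the string; [^;]* stops at the first semicolon.
    out[target] = out[source].str.extract(r"^ID=([^;]*)", expand=False)
    return out


table = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")
result = add_gene_id(table)
print(result[["Chr", "gene_id"]])
```

Anchoring the pattern on the literal `ID=` keeps it from matching other attributes that merely end in `D=` or `I=`; rows without an `ID=` prefix get `NaN` in the new column.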
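This is rather broad/vague.
I don't see how this is broad or vague; I am asking how to take the string that follows "ID=" and precedes the first semicolon in the first column, and place it in a column to the right of the data frame listed here.
Still, no specific problem is mentioned.
Can you provide more context for your answer? Some brief information on how it works would help.

We need to work on the index. I used a lambda function that loops over every value in the index, drops everything up to and including "ID=", and then removes everything from the first ";" onward, which leaves the value we want.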
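The slicing described here can be traced on a single string. This is only an illustrative sketch: the value is shortened from the first data row, and the intermediate names `start` and `after_id` are hypothetical.

```python
# One attribute string, shortened from the question's first data row.
value = "ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;partial=00"

start = value.index("ID=") + 3    # position just past "ID="
after_id = value[start:]          # drop everything up to and including "ID="
gene_id = after_id.split(";")[0]  # keep only the part before the first ";"

print(gene_id)
```

Note that `str.index` raises `ValueError` when "ID=" is absent, so this form assumes every value carries that prefix.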
In [4]: df = pd.DataFrame({'Geneid': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'protein',
...: 'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'protein'},
...: 'Chr': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'G1_719_cleanedcontig_v2_1580',
...: 'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'G1_719_cleanedcontig_v2_1582'},
...: 'Start': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 346495,
...: 'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 147164},
...: 'End': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 347049,
...: 'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 151051},
...: 'Strand': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': '+',
...: 'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': '-'},
...: 'Length': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 555,
...: 'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 3888},
...: 'Feature_count': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 68733,
...: 'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 61026},
...: 'contig_ID': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'NODE_28_length_349332_cov_12.741083',
...: 'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'NODE_113_length_189623_cov_11.186889'},
...: 'MAG_id': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'ag0r3_bin.39',
...: 'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'ag0r3_bin.39'},
...: 'RPKM': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 11455.580332257081,
...: 'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 1451.8890393965805}})
In [5]: df
Out[5]:
Geneid Chr Start End Strand Length Feature_count contig_ID MAG_id RPKM
ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G... protein G1_719_cleanedcontig_v2_1580 346495 347049 + 555 68733 NODE_28_length_349332_cov_12.741083 ag0r3_bin.39 11455.580332
ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G... protein G1_719_cleanedcontig_v2_1582 147164 151051 - 3888 61026 NODE_113_length_189623_cov_11.186889 ag0r3_bin.39 1451.889039
In [6]: pd.Series(df.index).apply(lambda x:x[x.index("ID=")+3:].split(";")[0])
Out[6]:
0 G1_719_cleanedcontig_v2_1580_319
1 G1_719_cleanedcontig_v2_1582_130
dtype: object
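The result above is a standalone Series numbered 0, 1, ..., so assigning it directly to a frame indexed by the attribute strings would not line up row by row. One possible way to store it as a new rightmost column is sketched below; the shortened two-column frame and the column name `gene_id` are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in frame whose index holds the attribute strings (assumption:
# shortened values and only two columns, for readability).
df = pd.DataFrame(
    {
        "Chr": ["G1_719_cleanedcontig_v2_1580", "G1_719_cleanedcontig_v2_1582"],
        "Start": [346495, 147164],
    },
    index=[
        "ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319",
        "ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130",
    ],
)

# to_series() keeps the original index labels, so the assignment below
# aligns each extracted ID with the row it came from.
ids = df.index.to_series().str.extract(r"ID=([^;]*)", expand=False)
df["gene_id"] = ids

print(df["gene_id"].tolist())
```

If the lambda version is preferred, passing `.to_numpy()` (or `.values`) on its result before assignment sidesteps index alignment in the same way.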