Python: How to extract information from one column of a data frame and insert it into a column on the right

python pandas bash

I have a tab-delimited table whose first three lines look like this (a header row and the first two entries):

Geneid  Chr Start   End Strand  Length  Feature_count   contig_ID   MAG_id  RPKM
ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical protein   G1_719_cleanedcontig_v2_1580    346495  347049  +   555 68733   NODE_28_length_349332_cov_12.741083 ag0r3_bin.39    11455.58033225708
ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical protein  G1_719_cleanedcontig_v2_1582    147164  151051  -   3888    61026   NODE_113_length_189623_cov_11.186889    ag0r3_bin.39    1451.8890393965803

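For each row, I want to extract the information between "ID=" and the first semicolon (for example, "G1_719_cleanedcontig_v2_1580_319" for the first row) and put it in a column on the right. How can I do this using Bash, Python, or a combination of the two?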

Suppose the dataframe is

                                             text
0  Geneid  Chr Start   End Strand  Length  Featur...
1  ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=...
2  ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=...

Simply extract the characters between ID= and the first semicolon (;):

df['newcolumn'] = df.text.str.extract(r'ID=([^;]+)', expand=False)




                                            text  \
0  Geneid  Chr Start   End Strand  Length  Featur...   
1  ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=...   
2  ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=...   

                          newcolumn  
0                               NaN  
1  G1_719_cleanedcontig_v2_1580_319  
2  G1_719_cleanedcontig_v2_1582_130
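
The same idea can be applied directly to the tab-delimited file from the question. The following is a minimal sketch under some assumptions: the sample text stands in for the real file (its attribute strings are shortened), the new column name gene_id is made up for illustration, and the pattern is anchored with ^ on the assumption that every Geneid value starts with ID=. Reading with sep="\t" keeps "hypothetical protein" inside the Geneid field, and a newly assigned column is appended as the rightmost column.

```python
import io

import pandas as pd

# Stand-in for the real file: same tab-separated layout, attributes shortened.
tsv = (
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tFeature_count\tcontig_ID\tMAG_id\tRPKM\n"
    "ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;"
    "product=hypothetical protein\tG1_719_cleanedcontig_v2_1580\t346495\t347049\t+\t555\t"
    "68733\tNODE_28_length_349332_cov_12.741083\tag0r3_bin.39\t11455.58033225708\n"
    "ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;"
    "product=hypothetical protein\tG1_719_cleanedcontig_v2_1582\t147164\t151051\t-\t3888\t"
    "61026\tNODE_113_length_189623_cov_11.186889\tag0r3_bin.39\t1451.8890393965803\n"
)

# Split on tabs only, so spaces inside a field do not create extra columns.
df = pd.read_csv(io.StringIO(tsv), sep="\t")

# Capture everything after the leading "ID=" up to, but not including, the first ";".
# expand=False returns a Series, which is assigned as a new rightmost column.
df["gene_id"] = df["Geneid"].str.extract(r"^ID=([^;]+)", expand=False)

print(df[["Chr", "gene_id"]])
```

For a real file, the io.StringIO object would be replaced by the file path, and df.to_csv(path, sep="\t", index=False) would write the result back out as a tab-delimited table.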

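This is rather broad/vague.

I don't see how this is broad or vague; I'm asking how to take the string after "ID=" and before the first semicolon in the first column, and paste it into a column to the right of the data frame listed here.

Still, no specific problem is mentioned.

Can you provide more context for your answer? Some brief information about how it works would help.

We need to work on the index. I used a lambda function that goes over each value in the index, drops everything up to and including "ID=", and then drops everything from the first semicolon onward. What remains is the value we want.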

In [3]: import pandas as pd

In [4]: df = pd.DataFrame({'Geneid': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'protein',
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'protein'},
   ...:  'Chr': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'G1_719_cleanedcontig_v2_1580',
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'G1_719_cleanedcontig_v2_1582'},
   ...:  'Start': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 346495,
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 147164},
   ...:  'End': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 347049,
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 151051},
   ...:  'Strand': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': '+',
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': '-'},
   ...:  'Length': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 555,
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 3888},
   ...:  'Feature_count': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 68733,
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 61026},
   ...:  'contig_ID': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'NODE_28_length_349332_cov_12.741083',
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'NODE_113_length_189623_cov_11.186889'},
   ...:  'MAG_id': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'ag0r3_bin.39',
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'ag0r3_bin.39'},
   ...:  'RPKM': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 11455.580332257081,
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 1451.8890393965805}})

In [5]: df
Out[5]: 
                                                     Geneid                           Chr   Start     End Strand  Length  Feature_count                             contig_ID        MAG_id          RPKM
ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G...  protein  G1_719_cleanedcontig_v2_1580  346495  347049      +     555          68733   NODE_28_length_349332_cov_12.741083  ag0r3_bin.39  11455.580332
ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G...  protein  G1_719_cleanedcontig_v2_1582  147164  151051      -    3888          61026  NODE_113_length_189623_cov_11.186889  ag0r3_bin.39   1451.889039

In [6]: pd.Series(df.index).apply(lambda x:x[x.index("ID=")+3:].split(";")[0])
Out[6]: 
0    G1_719_cleanedcontig_v2_1580_319
1    G1_719_cleanedcontig_v2_1582_130
dtype: object
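
The result above is a separate Series with its own 0..n-1 index, so it still has to be attached to the frame. A minimal sketch of one way to do that, assuming a small stand-in frame (shortened index labels, a single Chr column) and reusing the column name newcolumn from the earlier answer: building a plain list assigns the values by position, whereas assigning pd.Series(df.index) directly would be aligned against the string index and is expected to produce NaN.

```python
import pandas as pd

# Small stand-in for the frame above: the long attribute string is the index.
df = pd.DataFrame(
    {"Chr": ["G1_719_cleanedcontig_v2_1580", "G1_719_cleanedcontig_v2_1582"]},
    index=[
        "ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319",
        "ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130",
    ],
)

# Same slicing as the lambda: skip past "ID=", then keep the text before the first ";".
ids = [label[label.index("ID=") + 3:].split(";")[0] for label in df.index]

# A plain list is assigned by position, so no index alignment takes place.
df["newcolumn"] = ids

print(df)
```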