Python script for a protein database works on OS X and Raspbian but not on Ubuntu
For some reason, my Python script works on both Mac OS X and Raspbian Buster (yes, I tried it on a Raspberry Pi in a moment of desperation), but it does not work on Ubuntu 18, which is what I use on my main PC. I even tried a fresh install of Ubuntu MATE 20 on another PC, and it still does not work. Here is the script:
import sys
import csv
from http.client import IncompleteRead
import pandas as pd
from Bio import Entrez
Entrez.email = ""
# get from WPs accession, corresponding assembly, NC IDs, strains names. Write a csv table with all these as final data table,
#+ a table with WPs and Assembly IDs for inputting in FLAG
list_of_accession = []
with open (sys.argv[1], 'r') as csvfile:
    efetchin=csv.reader(csvfile, delimiter = ',')
    for row in efetchin:
        list_of_accession.append(str(row[0]))
with open('efetch_output.txt', mode = 'w') as efetch_output:
    efetch_output = csv.writer(efetch_output, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    efetch_output.writerow(['ID','Source', 'Nucleotide Accession', 'Start', 'Stop', 'Strand', 'Protein', 'Protein Name', 'Organism', ' Strain', 'Assembly'])
input_handle = Entrez.efetch(db="protein", id= list_of_accession, rettype="ipg", retmode="tsv")
for line in input_handle:
    print(line, file=open('efetch_output.txt','a'))
input_handle.close()
#process file in pandas
file_name = "efetch_output.txt"
file_name_output = "final_output.tsv"
df = pd.read_csv(file_name, sep="\t", low_memory=False)
# Get names of indexes for which rows have to be dropped
indexNames = df[ df['Source'] == 'INSDC'].index
# Delete these row indexes from dataFrame
df.drop(indexNames , inplace=True)
#rearrange table columns
df = df[['ID', 'Source', 'Nucleotide Accession', 'Protein', 'Protein Name', 'Start', 'Stop', 'Strand', 'Organism',' Strain', 'Assembly']]
#Sort table on Assembly number ignoring GCF_
df['sort'] = df['Assembly'].str.extract('(\d+)', expand=False).astype(str)
df.sort_values('sort',inplace=True, ascending=True)
df = df.drop('sort', axis=1)
#drop all duplicates that're similar in indicated subset fields
df3=df.drop_duplicates(subset=['Start', 'Stop', 'Strand', 'Organism',' Strain', 'Assembly'],keep='first')
#sorts dataframe alphabetically by Organism and writes to csv
df3.sort_values(by = "Organism", axis=0, ascending=True, inplace=False).to_csv("final_parsed_output.tsv", "\t", index=False)
#get WP_X and GFC_X IDs in a tsv to input in FLAGs
new_dataframe1 = df3[['Assembly', 'Protein']]
new_dataframe2 = df3[['Organism',' Strain', 'Assembly', 'Protein']]
new_dataframe1.sort_values(by = "Protein", axis=0, ascending=True, inplace=False).to_csv('flags_input.tsv', '\t', header=False, columns = ['Assembly', 'Protein'])
new_dataframe2.sort_values(by = "Organism", axis=0, ascending=True, inplace=False).to_csv('flags_input_wstrains.tsv', '\t', header=False, columns = ['Organism',' Strain', 'Assembly', 'Protein'])
print ('program finished')
I don't know whether I can upload a CSV here as an example for you to use, but the input is basically a list of proteins in a CSV, like this:
WP_047566605.1 WP_043586512.1 WP_086526429.1 WP_043669791.1
WP_086513259.1 WP_086518190.1 WP_053774664.1 WP_012298127.1
WP_063071144.1 WP_012038522.1 WP_066595335.1 WP_088456184.1
WP_058743206.1 WP_042537210.1 WP_058724426.1
The error I get on Ubuntu MATE 20 is:
jj@p4:~/Documents/Bioinformatica/Bioinformatic/August/Codes/Etna$ python3 etna.py JJTEST.csv
/usr/local/lib/python3.8/dist-packages/pandas/core/computation/expressions.py:68: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  return op(a, b)
Traceback (most recent call last):
  File "etna.py", line 44, in <module>
    df['sort'] = df['Assembly'].str.extract('(\d+)', expand=False).astype(str)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 5126, in __getattr__
    return object.__getattribute__(self, name)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/accessor.py", line 187, in __get__
    accessor_obj = self._accessor(obj)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/strings.py", line 2100, in __init__
    self._inferred_dtype = self._validate(data)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/strings.py", line 2157, in _validate
    raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!
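
pandas raises this AttributeError when `.str` is used on a column whose dtype is not string-like. The following is a minimal sketch of that behaviour, assuming (the post does not confirm this) that the `Assembly` column was parsed as nothing but missing values and therefore stored as floats. The two-row frame is a hypothetical stand-in for the real table.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed table: the column holds only NaN,
# so pandas stores it as float64 rather than as strings.
df = pd.DataFrame({"Assembly": [np.nan, np.nan]})

try:
    df["Assembly"].str.extract(r"(\d+)", expand=False)
    raised = False
except AttributeError as exc:
    raised = True
    print(exc)

# Casting to str first makes the accessor available; with no digits to
# match, every extracted value is missing.
digits = df["Assembly"].astype(str).str.extract(r"(\d+)", expand=False)
print(digits.isna().all())
```

If this is what is happening, printing `df.dtypes` and `df.head()` right after `pd.read_csv` on the failing machine should show whether the columns were read as expected.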
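I don't fully understand what the problem was, but I changed the output file from txt to csv and changed str to float. Now it works.

Does this answer your question?

I tried changing line 44 to df['sort'] = df['Assembly'].astype(str).str.extract('(\d+)', expand=False).astype(float), and the new error is: /usr/local/lib/python3.8/dist-packages/pandas/core/computation/expressions.py:68: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison return op(a, b) program finished

If I do this instead, I get the same error: df['sort'] = df['Assembly'].astype(str).str.extract('(\d+)', expand=False).astype(str)

Glad you solved your problem. Next time, please extract a minimal reproducible example first, as part of your question. As a new user here, please also read the guidance on asking questions.

@UlrichEckhardt I thought that including the quote with the WP_ numbers was enough for a minimal reproducible example. I don't know how to upload a CSV file. The code is also the minimum possible. Sorry if I didn't ask the question properly; I tried to follow all the rules. I can delete the post if it is inappropriate.

You can inline the data in the Python code; there is no need for a second file. Also, any lines after the one that throws the error are irrelevant to the example. You need to check whether any other code can be removed or simplified.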
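
As the last comment suggests, the data can be inlined so that the example needs no second file and stops at the line that failed. The following is only a sketch of what that might look like: the rows are made-up placeholders rather than real records, only three of the script's columns are kept, and the `astype(str)` / `astype(float)` variant from the comments is used for the sort key.

```python
import io
import pandas as pd

# Made-up rows standing in for efetch_output.txt; the accessions and
# assembly IDs are placeholders, not real records.
data = io.StringIO(
    "Source\tProtein\tAssembly\n"
    "RefSeq\tWP_000000001.1\tGCF_000000020.1\n"
    "INSDC\tXX_000000002.1\tGCA_000000020.1\n"
    "RefSeq\tWP_000000003.1\tGCF_000000003.1\n"
)
df = pd.read_csv(data, sep="\t")

# Same steps as the script, up to the line that failed.
df = df[df["Source"] != "INSDC"].copy()
df["sort"] = (
    df["Assembly"].astype(str).str.extract(r"(\d+)", expand=False).astype(float)
)
df = df.sort_values("sort").drop(columns="sort")
print(df)
```

With well-formed rows like these the line runs, so a useful next step would be to compare them with the actual contents of `efetch_output.txt` on the machine where the script fails.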