Python: predicting k-mers with MultinomialNB (scikit-learn)
I am trying to build a classifier with scikit-learn in Python to predict whether a virus's nucleotide sequence is potentially pathogenic to humans. Each sequence is labeled 0 or 1 according to whether it has been identified as pathogenic. The sequences all differ from one another and are listed like this:
ATCGATCGAATCGGATC 1
ATCGGGGGATATATAAATATATATATATTGTTGTATG 1
ATCGTAT 0
ataaatattgcg 0
…
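I essentially based my work on that of Krish Naik, who was trying to predict protein classes. I only adapted it to my goal, but the problem is that I have not found any way to predict the pathogenicity of a new sequence.
You can find the data I used on GitLab. Below is Krish Naik's code as I applied it to my data (this part seems to work for building the model):

To predict new sequences, I decided to take the same approach, extracting and counting the k-mer words of my new sequences: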
#NEW SEQUENCE
new_seq = pd.read_table('sequence.txt')
new_seq.head()
# function to convert sequence strings into k-mer words, default size = 6 (hexamer words)
def getKmers(sequence, size=6):
    return [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
new_seq['words'] = new_seq.apply(lambda x: getKmers(x['sequence']), axis=1)
new_seq = new_seq.drop('sequence', axis=1)
new_seq.head()
new_seqtext = list(new_seq['words'])
for item in range(len(new_seqtext)):
    new_seqtext[item] = ' '.join(new_seqtext[item])
y_data = new_seq.iloc[:, 0].values
y_data
# Creating the Bag of Words model using CountVectorizer()
# This is equivalent to k-mer counting
# The n-gram size of 4 was previously determined by testing
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(4,4))
U = cv.fit_transform(new_seqtext)
u_pred = classifier.predict(U)
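As a quick illustration of the k-mer step above, here is a minimal sketch that applies getKmers to the first example sequence from the list and joins the words into one "sentence", which is the form CountVectorizer expects. The variable names words and sentence are only for this illustration.

```python
def getKmers(sequence, size=6):
    """Return all overlapping k-mers of `sequence`, lower-cased."""
    return [sequence[x:x + size].lower() for x in range(len(sequence) - size + 1)]

# First example sequence from the list above (17 nucleotides).
words = getKmers("ATCGATCGAATCGGATC")
sentence = " ".join(words)

print(len(words))   # 12, i.e. 17 - 6 + 1 overlapping hexamers
print(words[:3])    # ['atcgat', 'tcgatc', 'cgatcg']

# A sequence shorter than the k-mer size yields no words at all.
print(getKmers("ATCG"))  # []
```

Note that very short sequences produce few or no words, so they contribute little or nothing once 4-grams of words are counted.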
However, the prediction with classifier.predict does not work as expected, even though the sequence has been split into k-mers and counted:
Traceback (most recent call last):
File "Original.py", line 89, in <module>
new_seq['words'] = new_seq.apply(lambda x: getKmers(x['sequence']), axis=1)
File "/home/name/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 6878, in apply
return op.get_result()
File "/home/name/.local/lib/python3.8/site-packages/pandas/core/apply.py", line 186, in get_result
return self.apply_standard()
File "/home/name/.local/lib/python3.8/site-packages/pandas/core/apply.py", line 295, in apply_standard
result = libreduction.compute_reduction(
File "pandas/_libs/reduction.pyx", line 618, in pandas._libs.reduction.compute_reduction
File "pandas/_libs/reduction.pyx", line 128, in pandas._libs.reduction.Reducer.get_result
File "Original.py", line 89, in <lambda>
new_seq['words'] = new_seq.apply(lambda x: getKmers(x['sequence']), axis=1)
File "/home/name/.local/lib/python3.8/site-packages/pandas/core/series.py", line 871, in __getitem__
result = self.index.get_value(self, key)
File "/home/name/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4419, in get_value
raise e1
File "/home/name/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4405, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas/_libs/index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'sequence'
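A note on the traceback: KeyError: 'sequence' is raised by the getKmers line, before any vectorizing or prediction happens, and it means the DataFrame read from sequence.txt has no column named sequence. One possible cause, assuming the file holds one sequence per line with no header row, is that read_table consumes the first line as the header. A minimal sketch under that assumption (the file content below is made up and only stands in for sequence.txt):

```python
import io
import pandas as pd

# Hypothetical stand-in for sequence.txt: one sequence per line, no header row.
raw = "ATCGATCGAATCGGATC\nATCGTATTTGACCA\n"

# By default the first line becomes the header, so there is no 'sequence' column
# and x['sequence'] would raise KeyError.
no_header = pd.read_table(io.StringIO(raw))
print(list(no_header.columns))  # ['ATCGATCGAATCGGATC']

# Naming the column explicitly keeps every line as data.
new_seq = pd.read_table(io.StringIO(raw), header=None, names=["sequence"])
print(list(new_seq.columns))    # ['sequence']
print(len(new_seq))             # 2
```

If the real file does have a header row, checking new_seq.columns for the exact column name (spelling, case, stray whitespace) is the equivalent first step.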
Is my approach too naive for getting the k-mers of a new sequence?

I found a solution thanks to a teammate. Instead of using
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(4,4))
U = cv.fit_transform(new_seqtext)
you can use this:
U = cv.transform(new_seqtext)
As far as I understand so far, fit_transform is what is used when building the model, so the new data should only go through transform. There is no need to import CountVectorizer again or to create a new one: cv stays the object that was declared and fitted when the model was built.
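To make the difference concrete, here is a minimal end-to-end sketch. The training and new sequences are made up for illustration, the helper to_sentence is hypothetical, and MultinomialNB is used with default settings, which may differ from the original model. The point is only that transform reuses the vocabulary learned during training, while a freshly fitted vectorizer learns a different one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def getKmers(sequence, size=6):
    return [sequence[x:x + size].lower() for x in range(len(sequence) - size + 1)]

def to_sentence(sequence):
    # Hypothetical helper: join the k-mer words into one space-separated string.
    return " ".join(getKmers(sequence))

# Toy training data, made up for illustration only.
train_seqs = [
    "ATCGATCGAATCGGATC",
    "ATCGGGGGATATATAAATATATATATATTGTTGTATG",
    "ATCGTATTTGACCA",
    "ATAAATATTGCGTTA",
]
train_labels = [1, 1, 0, 0]

# Fit the vectorizer once, on the training data, and train the classifier.
cv = CountVectorizer(ngram_range=(4, 4))
X = cv.fit_transform([to_sentence(s) for s in train_seqs])
classifier = MultinomialNB().fit(X, train_labels)

# New sequences: reuse the fitted vectorizer with transform only.
new_seqtext = [to_sentence(s) for s in ["ATCGATCGAATCGGTTT", "ATAAATATTGCGAAA"]]
U = cv.transform(new_seqtext)
u_pred = classifier.predict(U)
print(U.shape[1] == X.shape[1])  # True: same feature columns as in training

# Re-fitting a new vectorizer on the new data learns a different vocabulary,
# so its columns no longer line up with what the classifier was trained on.
U_refit = CountVectorizer(ngram_range=(4, 4)).fit_transform(new_seqtext)
print(U_refit.shape[1] == X.shape[1])  # False
```

Any 4-gram in a new sequence that never appeared in training is simply ignored by transform, so a new sequence with no overlap at all becomes an all-zero row and its prediction carries little information.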