Python: predicting with k-mers using MultinomialNB (scikit-learn)


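I am trying to build a classifier in Python with scikit-learn that predicts whether a viral nucleotide sequence is potentially pathogenic to humans. I have labeled some sequences as pathogenic or not using 1 and 0; the sequences differ from one another, and the list looks like this: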
ATCGATCGAATCGGATC 1
ATCGGGGGATATATAAATATATATATATTGTTGTATG 1
ATCGTAT 0
ataaatattgcg 0

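I based my work largely on Krish Naik's example, in which he predicts protein classes. I only adapted it to my goal, but the problem is that I have not found any way to predict the pathogenicity of a new sequence.
You can find the data I use on GitLab. Below is the code from Krish Naik that I applied to my data (this part seems to work for building a model):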

To predict new sequences, I decided to take the same approach: extract and count the k-mer words in my new sequences:

#NEW SEQUENCE
new_seq = pd.read_table('sequence.txt')
new_seq.head()

# function to convert sequence strings into k-mer words, default size = 6 (hexamer words)
def getKmers(sequence, size=6):
    return [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]

new_seq['words'] = new_seq.apply(lambda x: getKmers(x['sequence']), axis=1)
new_seq = new_seq.drop('sequence', axis=1)

new_seq.head()

new_seqtext = list(new_seq['words'])
for item in range(len(new_seqtext)):
    new_seqtext[item] = ' '.join(new_seqtext[item])
y_data = new_seq.iloc[:, 0].values


y_data

# Creating the Bag of Words model using CountVectorizer()
# This is equivalent to k-mer counting
# The n-gram size of 4 was previously determined by testing
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(4,4))
U = cv.fit_transform(new_seqtext)

u_pred = classifier.predict(U)
However, the prediction with classifier.predict does not work as expected, even though the sequences have been split into k-mers and counted:

Traceback (most recent call last):
  File "Original.py", line 89, in <module>
    new_seq['words'] = new_seq.apply(lambda x: getKmers(x['sequence']), axis=1)
  File "/home/name/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 6878, in apply
    return op.get_result()
  File "/home/name/.local/lib/python3.8/site-packages/pandas/core/apply.py", line 186, in get_result
    return self.apply_standard()
  File "/home/name/.local/lib/python3.8/site-packages/pandas/core/apply.py", line 295, in apply_standard
    result = libreduction.compute_reduction(
  File "pandas/_libs/reduction.pyx", line 618, in pandas._libs.reduction.compute_reduction
  File "pandas/_libs/reduction.pyx", line 128, in pandas._libs.reduction.Reducer.get_result
  File "Original.py", line 89, in <lambda>
    new_seq['words'] = new_seq.apply(lambda x: getKmers(x['sequence']), axis=1)
  File "/home/name/.local/lib/python3.8/site-packages/pandas/core/series.py", line 871, in __getitem__
    result = self.index.get_value(self, key)
  File "/home/name/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4419, in get_value
    raise e1
  File "/home/name/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4405, in get_value
    return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  File "pandas/_libs/index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'sequence'

Is my approach too naive to get the k-mers of a new sequence?
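The KeyError: 'sequence' at the bottom of the traceback is raised by x['sequence'] inside the lambda, which means the DataFrame returned by pd.read_table('sequence.txt') has no column named sequence. One plausible cause, assumed here because the file itself is not shown, is that sequence.txt has no header row, so pandas takes the first sequence as the column name. A minimal sketch under that assumption, with an in-memory string standing in for the file:

```python
import io
import pandas as pd

# Hypothetical stand-in for sequence.txt: one sequence per line, no header row (assumption).
raw = "ATCGATCGAATCGGATC\nATCGGGGGATATATAAATATATATATATTGTTGTATG\n"

# Without header=None, pandas uses the first line as the header,
# so no 'sequence' column exists and x['sequence'] raises KeyError.
bad = pd.read_table(io.StringIO(raw))
print(list(bad.columns))

# Naming the column explicitly makes the later 'sequence' lookup valid.
new_seq = pd.read_table(io.StringIO(raw), header=None, names=["sequence"])

def getKmers(sequence, size=6):
    # Overlapping k-mers, lower-cased, as in the question.
    return [sequence[x:x + size].lower() for x in range(len(sequence) - size + 1)]

new_seq["words"] = new_seq["sequence"].apply(getKmers)
new_seqtext = [" ".join(words) for words in new_seq["words"]]
```

Printing new_seq.columns right after reading the real file is a quick way to check whether this assumption applies.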

Found a solution with a teammate. Instead of using

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(4,4))
U = cv.fit_transform(new_seqtext)
you can use this:

U = cv.transform(new_seqtext)
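As far as I understand so far, fit_transform is used to build the model, so there is no need to import CountVectorizer again; cv remains the one declared as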

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(4,4))
U = cv.fit_transform(new_seqtext)
U = cv.transform(new_seqtext)
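The distinction between fit_transform and transform can be sketched end to end. The idea is that the vectorizer learns its vocabulary once, from the training text, and new sequences are then mapped onto that same vocabulary with transform, so the feature columns match what the classifier was trained on. The training sequences and labels below are the four examples from the question; new_seqs, to_text and the choice of MultinomialNB with default settings are assumptions made for illustration, not the original training code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def getKmers(sequence, size=6):
    # Overlapping k-mers, lower-cased, as in the question.
    return [sequence[x:x + size].lower() for x in range(len(sequence) - size + 1)]

def to_text(sequences):
    # Hypothetical helper: one space-separated "sentence" of k-mers per sequence.
    return [" ".join(getKmers(s)) for s in sequences]

# Toy training data taken from the examples in the question.
train_seqs = [
    "ATCGATCGAATCGGATC",
    "ATCGGGGGATATATAAATATATATATATTGTTGTATG",
    "ATCGTAT",
    "ataaatattgcg",
]
train_labels = [1, 1, 0, 0]

# Hypothetical new sequences to classify (assumption).
new_seqs = ["ATCGATCGAATCGGATCGGA", "ataaatattgcgtt"]

# Fit the vocabulary once, on the training text only.
cv = CountVectorizer(ngram_range=(4, 4))
X = cv.fit_transform(to_text(train_seqs))
classifier = MultinomialNB().fit(X, train_labels)

# Reuse the fitted vectorizer: transform keeps the training vocabulary,
# so U has the same number of columns as X.
U = cv.transform(to_text(new_seqs))
u_pred = classifier.predict(U)
print(X.shape[1], U.shape[1])
```

Calling fit_transform on the new sequences would instead build a fresh vocabulary from them, and the resulting columns would no longer correspond to the features the classifier was trained on.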