Python: Predicting k-mers with MultinomialNB (scikit-learn)

python machine-learning scikit-learn

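I'm trying to build a classifier in Python with scikit-learn to predict whether a viral nucleotide sequence is potentially pathogenic to humans. Each of my sequences carries a binary label (1 or 0) marking whether it has been identified as pathogenic. The sequences vary in length and are listed like this: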
ATCGATCGAATCGGATC 1
ATCGGGGGATATATAAATATATATATATTGTTGTATG 1
ATCGTAT 0
ataaatattgcg 0
…
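My work is essentially derived from Krish Naik's, who was trying to predict protein families. I only adapted it to my goal, but the problem is that I haven't found any way to predict the pathogenicity of a new sequence.
The data I use is available on GitLab. Below is Krish Naik's code as applied to my data (this part seems to work for building the model):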

To predict new sequences, I decided to follow the same approach: read my new sequence, split it into k-mer words, and count them:

#NEW SEQUENCE
new_seq = pd.read_table('sequence.txt')
new_seq.head()

# function to convert sequence strings into k-mer words, default size = 6 (hexamer words)
def getKmers(sequence, size=6):
    return [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]

new_seq['words'] = new_seq.apply(lambda x: getKmers(x['sequence']), axis=1)
new_seq = new_seq.drop('sequence', axis=1)

new_seq.head()

new_seqtext = list(new_seq['words'])
for item in range(len(new_seqtext)):
    new_seqtext[item] = ' '.join(new_seqtext[item])
y_data = new_seq.iloc[:, 0].values


y_data

# Creating the Bag of Words model using CountVectorizer()
# This is equivalent to k-mer counting
# The n-gram size of 4 was previously determined by testing
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(4,4))
U = cv.fit_transform(new_seqtext)

u_pred = classifier.predict(U)

However, the prediction with classifier.predict does not work as expected, even though the sequences have been cut into k-mers and counted:

Traceback (most recent call last):
  File "Original.py", line 89, in <module>
    new_seq['words'] = new_seq.apply(lambda x: getKmers(x['sequence']), axis=1)
  File "/home/name/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 6878, in apply
    return op.get_result()
  File "/home/name/.local/lib/python3.8/site-packages/pandas/core/apply.py", line 186, in get_result
    return self.apply_standard()
  File "/home/name/.local/lib/python3.8/site-packages/pandas/core/apply.py", line 295, in apply_standard
    result = libreduction.compute_reduction(
  File "pandas/_libs/reduction.pyx", line 618, in pandas._libs.reduction.compute_reduction
  File "pandas/_libs/reduction.pyx", line 128, in pandas._libs.reduction.Reducer.get_result
  File "Original.py", line 89, in <lambda>
    new_seq['words'] = new_seq.apply(lambda x: getKmers(x['sequence']), axis=1)
  File "/home/name/.local/lib/python3.8/site-packages/pandas/core/series.py", line 871, in __getitem__
    result = self.index.get_value(self, key)
  File "/home/name/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4419, in get_value
    raise e1
  File "/home/name/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4405, in get_value
    return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  File "pandas/_libs/index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'sequence'

Is my approach too naive for getting the k-mers of a new sequence?

Found a solution with a teammate. Instead of using

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(4,4))
U = cv.fit_transform(new_seqtext)

you can use this:

U = cv.transform(new_seqtext)
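To make the difference between the two calls concrete, here is a minimal sketch. It is not the original model: the toy sequences and labels are made up, `get_kmers` and `to_sentence` are hypothetical helpers that mirror the question's `getKmers` and join loop, and `MultinomialNB` with default settings is assumed as the classifier because the training code is not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def get_kmers(sequence, size=6):
    """Split a sequence into overlapping lowercase k-mer words."""
    return [sequence[i:i + size].lower() for i in range(len(sequence) - size + 1)]


def to_sentence(sequence, size=6):
    """Join the k-mers of one sequence into a space-separated 'sentence'."""
    return " ".join(get_kmers(sequence, size))


# Toy training data, invented for illustration only (not the real dataset).
train_seqs = [
    "ATCGATCGAATCGGATCGATCG",
    "ATCGGGGGATATATAAATATATATATATTGTTGTATG",
    "GGCCGGCCGGCCTTGGCCGGCC",
    "ataaatattgcgataaatattgcg",
]
train_labels = [1, 1, 0, 0]

# Fit the vectorizer once, on the training text, and train the classifier on it.
cv = CountVectorizer(ngram_range=(4, 4))
X_train = cv.fit_transform([to_sentence(s) for s in train_seqs])
classifier = MultinomialNB().fit(X_train, train_labels)

# New sequences go through the same k-mer step.
new_seqs = ["ATCGATCGAATCGG", "GGCCGGCCTTGGCC"]
new_text = [to_sentence(s) for s in new_seqs]

# Refitting learns a vocabulary from the new text alone, so the number of
# columns no longer matches what the classifier was trained on.
refit = CountVectorizer(ngram_range=(4, 4)).fit_transform(new_text)

# transform() reuses the vocabulary learned from the training text.
X_new = cv.transform(new_text)

print("training columns:", X_train.shape[1])
print("refit columns:   ", refit.shape[1])
print("transform columns:", X_new.shape[1])

predictions = classifier.predict(X_new)
print(predictions)
```

Passing `refit` to `classifier.predict` would fail or be meaningless because its columns describe a different vocabulary, whereas `X_new` has exactly the columns the classifier expects. Note that with `ngram_range=(4, 4)` and 6-mers, a sequence shorter than 9 nucleotides yields no features at all.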

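As far as I understand it, fit_transform had already been used on the training data when the model was built, so cv was already declared and fitted, and there is no need to import or create CountVectorizer again. Calling fit_transform on the new sequences learns a fresh vocabulary from them alone, so the columns of U no longer line up with the features the classifier was trained on; transform reuses the fitted vocabulary.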
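Separately, the traceback in the question ends in KeyError: 'sequence', raised inside the getKmers lambda before any vectorizer is called. That means the DataFrame read from sequence.txt has no column named sequence. The contents of that file are not shown, so the following is only one plausible cause, sketched under the assumption that the file holds raw sequences with no header row; the in-memory text stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical file contents: one raw sequence per line, no header row.
raw = "ATCGATCGAATCGGATC\nATCGGGGGATATATAAATATATATATATTGTTGTATG\n"

# With default settings, read_table consumes the first line as column names,
# so there is no 'sequence' column and the first sequence is lost as data.
no_header = pd.read_table(io.StringIO(raw))
print(list(no_header.columns))

# Declaring that there is no header and naming the column keeps every line
# as data and makes x['sequence'] resolvable.
with_names = pd.read_table(io.StringIO(raw), header=None, names=["sequence"])
print(with_names)
```

If the file does have a header under a different name, checking `new_seq.columns` right after reading it shows which key to use instead.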