How do I normalize ranking data in scikit-learn?


python algorithm machine-learning artificial-intelligence scikit-learn


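I am doing some machine learning and need help with one aspect of the coding. In my training data, I have the URLs of a number of web pages along with some features of those pages. I run TF-IDF on the page text to create more features.

One of the features I extract is the Google page rank, which I retrieve for each URL. This value can be any position in the worldwide ranking, but the lower the rank, the "better quality" Google considers the page to be.

Given that I have 7,000 URLs and the ranks vary widely (for example, www.google.com might be ranked 1 while www.bbc.co.uk might be ranked 1117, and other ranks will go far beyond our 7,000 URLs), how do I normalize this number?

How can I use scikit-learn to normalize this data effectively so that I can use it in my machine learning algorithm? I am running a logistic regression that simply tries to predict whether a web page is "good". At the moment the only features I use are the ones created by TF-IDF on the page text. Ideally, I would like to incorporate the page rank feature in whatever way gives me the highest cross-validation score.

Thanks very much.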

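So, we can assume my data is a TSV of the following form:

URL GooglePageRank WebsiteText

An example row:

http://www.google.com 1 This would be the text of the google webpage.

I want to normalize my rank data and use it in the logistic regression. Currently I use only the "WebsiteText" column: I run TF-IDF on it and feed the result into my logistic regression. I would like to learn how to combine this column with my normalized GooglePageRank column and use both in the logistic regression. How would I do that?

Here is my code so far: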

  import numpy as np
  from sklearn import metrics,preprocessing,cross_validation
  from sklearn.feature_extraction.text import TfidfVectorizer
  import sklearn.linear_model as lm
  import pandas as p
  loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')

  print "loading data.."
  traindata = list(np.array(p.read_table('train.tsv'))[:,2])
  testdata = list(np.array(p.read_table('test.tsv'))[:,2])
  y = np.array(p.read_table('train.tsv'))[:,-1]

  tfv = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode',  
        analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1)

  rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
                             C=1, fit_intercept=True, intercept_scaling=1.0, 
                             class_weight=None, random_state=None)

  X_all = traindata + testdata
  lentrain = len(traindata)

  print "fitting pipeline"
  tfv.fit(X_all)
  print "transforming data"
  X_all = tfv.transform(X_all)

  X = X_all[:lentrain]
  X_test = X_all[lentrain:]

  print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

  print "training on full data"
  rd.fit(X,y)
  pred = rd.predict_proba(X_test)[:,1]
  testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
  pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
  pred_df.to_csv('benchmark.csv')
  print "submission file created.."

*Edit:*

This is what I am currently running:

from sklearn import metrics,preprocessing,cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
import sklearn.preprocessing
import sklearn.linear_model as lm
import pandas as p
loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=',')
print "loading data.."

#load train/test data for TF-IDF -- I know this is bad practice, but keeping it this way for the moment!
traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])

#load labels
y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]

#Load Integer values and append together
AllAlexaInfo = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-1]

#make tfidf object
tfv = TfidfVectorizer(min_df=1, max_features=None, strip_accents='unicode',  
                      analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), 
                      use_idf=1,smooth_idf=1,sublinear_tf=1)
div = DictVectorizer()
X = []
X_all = traindata + testdata
lentrain = len(traindata)
# fit/transform the TfidfVectorizer on the training data
vect = tfv.fit_transform(X_all) #bad practice, but using this for the moment!

for i, alexarank in enumerate(AllAlexaInfo):
    feature_dict = {'alexarank': AllAlexaInfo}
    # get ith row of the tfidf matrix (corresponding to sample)
    row = vect.getrow(i)    

    # filter the feature names corresponding to the sample
    all_words = tfv.get_feature_names()
    words = [all_words[ind] for ind in row.indices] 

    # associate each word (feature) with its corresponding score
    word_score = dict(zip(words, row.data)) 

    # concatenate the word feature/score with the datamining feature/value
    X.append(dict(word_score.items() + feature_dict.items()))

div.fit_transform(X)  # training data based on both Tfidf features and pagerank
sc = preprocessing.StandardScaler().fit(X)
X = sc.transform(X)
X_test = X_all[lentrain:]
X_test = sc.transform(X_test)

print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

print "training on full data"
rd.fit(X,y)
pred = rd.predict_proba(X_test)[:,1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."

This seems to run forever, and I believe I have a problem with the "alexarank" values being fed in incorrectly. How do I fix this?

Based on your answer to my comment, I would proceed as follows:

tfv = TfidfVectorizer(
    min_df=3,
    max_features=None,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 2),
    use_idf=1,
    smooth_idf=1,
    sublinear_tf=1)
div = DictVectorizer()

X = []

# fit/transform the TfidfVectorizer on the training data
vect = tfv.fit_transform(traindata)

# look up the feature names once, outside the loop
# (use get_feature_names_out() on scikit-learn >= 1.0)
all_words = tfv.get_feature_names()

for i, pagerank in enumerate(pageranks):
    feature_dict = {'pagerank': pagerank}
    # get ith row of the tfidf matrix (corresponding to sample)
    row = vect.getrow(i)

    # filter the feature names corresponding to the sample
    words = [all_words[ind] for ind in row.indices]

    # associate each word (feature) with its corresponding score
    word_score = dict(zip(words, row.data))

    # concatenate the word feature/score with the datamining feature/value
    word_score.update(feature_dict)
    X.append(word_score)

# training data based on both Tfidf features and pagerank
X = div.fit_transform(X)
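As an alternative to building one dict per sample, the rank column can be scaled on its own and appended to the sparse TF-IDF matrix. The following is a minimal sketch under stated assumptions, not the method from the answer: the toy `texts`, `ranks`, and `labels` are made up, and compressing the rank with `log1p` before standardizing is an assumption motivated by the wide range of rank values described in the question.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the WebsiteText and rank columns.
texts = [
    "search engine home page",
    "news and weather headlines",
    "cheap pills buy now",
    "win money click here now",
]
ranks = np.array([1.0, 1117.0, 250000.0, 9000000.0])
labels = np.array([1, 1, 0, 0])

# Sparse TF-IDF features, one row per page.
tfv = TfidfVectorizer()
X_text = tfv.fit_transform(texts)

# Ranks span several orders of magnitude, so compress them with log1p
# (an assumption, not part of the original answer), then standardize.
scaler = StandardScaler()
rank_scaled = scaler.fit_transform(np.log1p(ranks).reshape(-1, 1))

# Append the scaled rank as one extra column; the result stays sparse.
X = hstack([X_text, csr_matrix(rank_scaled)]).tocsr()

clf = LogisticRegression()
clf.fit(X, labels)
print(X.shape, clf.predict_proba(X)[:, 1])
```

For test data, the fitted `tfv` and `scaler` would be reused through `transform` rather than refitted, so that both sets share the same vocabulary and scaling.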

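IIRC, you want to combine the features from the TfidfVectorizer with the pagerank values, so that your logistic regression classifier decides based on both the text features and the pagerank value?

@BalthazarRouberol That is correct, yes :) Thank you very much for your reply. In this example, how do you enumerate the page ranks? How are you reading them in? Your answer is very helpful, I am just struggling to get it running at the moment. I am a beginner in Python, so please bear with me! :) Thanks :)

I have updated my question to show the additions I made to my code using your suggestion. Unfortunately, I still cannot get it to run :(

In your original question, you stated that GooglePageRank and WebsiteText are on the same line, separated by a tab. In my answer, I assumed you had already loaded the PageRank values into memory. You can use a list comprehension: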

pageranks = [line.split('\t')[1] for line in my_file]
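Splitting each line this way yields strings, so the values need converting before they can serve as a numeric feature. A small sketch, assuming a tab-separated layout with a header row like the one in the question; the in-memory `sample_tsv` text (including the second row) is made up for illustration:

```python
import csv
import io

# Hypothetical stand-in for the TSV file described in the question.
sample_tsv = (
    "URL\tGooglePageRank\tWebsiteText\n"
    "http://www.google.com\t1\tThis would be the text of the google webpage.\n"
    "http://www.bbc.co.uk\t1117\tThis would be the text of the bbc webpage.\n"
)

reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t")
next(reader)  # skip the header row
# Column 1 holds the rank; convert it from str to float.
pageranks = [float(row[1]) for row in reader]
print(pageranks)
```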

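Ah yes, I understand now. However, I am still having some problems when I try to run your code. I have updated my edit above to show this. I am reading in the PageRank column using pandas, but I seem to get a ValueError: max_df corresponds to ... whenever I try to run this code. Sorry for the trouble, but I would greatly appreciate any help you can give! Thanks :)

Have you tried increasing the max_df value in the TfidfVectorizer constructor? Its default value is 1.0.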
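For context, scikit-learn raises a `ValueError` involving `max_df` and `min_df` when the document-count ceiling implied by `max_df` falls below the floor implied by `min_df`, which can happen when `min_df=3` is applied to a very small corpus. A minimal sketch with a made-up two-document corpus; whether this is what happened in the asker's run is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus, smaller than min_df=3.
docs = ["good page about news", "bad page about pills"]

try:
    TfidfVectorizer(min_df=3).fit_transform(docs)
    raised = False
except ValueError as err:
    # max_df=1.0 allows at most 2 documents, fewer than min_df=3.
    raised = True
    print("ValueError:", err)

# Lowering min_df keeps the two thresholds consistent.
X = TfidfVectorizer(min_df=1).fit_transform(docs)
print(X.shape)
```

Since a float `max_df` of 1.0 is already the maximum, this sketch lowers `min_df` instead; checking how many documents are actually passed to the vectorizer is another thing worth trying.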