I. Introduction
RankLib.jar is a library of learning-to-rank algorithms. It currently implements the following:
- MART
- RankNet
- RankBoost
- AdaRank
- Coordinate Ascent
- LambdaMART
- ListNet
- Random Forests
- Linear regression
II. The jar package: https://sourceforge.net/p/lemur/wiki/RankLib%20How%20to%20use/
- Usage: java -jar RankLib.jar <Params>
- Params:
- [+] Training (+ tuning and evaluation)
- # Training data file
- -train <file> Training data
- # Ranking algorithm to use
- -ranker <type> Specify which ranking algorithm to use
- 0: MART (gradient boosted regression tree)
- 1: RankNet
- 2: RankBoost
- 3: AdaRank
- 4: Coordinate Ascent
- 6: LambdaMART
- 7: ListNet
- 8: Random Forests
- 9: Linear regression (L2 regularization)
- # Feature description file: lists the features to learn from, one per line; all features are used by default
- [ -feature <file> ] Feature description file: list features to be considered by the learner, each on a separate line
- If not specified, all features will be used.
- # Metric to optimize on the training data
- [ -metric2t <metric> ] Metric to optimize on the training data. Supported: MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k (default=ERR@10)
- [ -gmax <label> ] Highest judged relevance label. It affects the calculation of ERR (default=4, i.e. 5-point scale {0,1,2,3,4})
- [ -silent ] Do not print progress messages (which are printed by default)
- # Validation data used to tune the model
- [ -validate <file> ] Specify if you want to tune your system on the validation data (default=unspecified)
- If specified, the final model will be the one that performs best on the validation data
- # Train-validation split ratio
- [ -tvs <x \in [0..1]> ] If you don't have separate validation data, use this to set train-validation split to be (x)(1.0-x)
- # Save the learned model to the given file
- [ -save <model> ] Save the model learned (default=not-save)
- # Test data on which to evaluate the trained model
- [ -test <file> ] Specify if you want to evaluate the trained model on this data (default=unspecified)
- # Train-test split ratio
- [ -tts <x \in [0..1]> ] Set train-test split to be (x)(1.0-x). -tts will override -tvs
- # Defaults to the same metric as -metric2t
- [ -metric2T <metric> ] Metric to evaluate on the test data (default to the same as specified for -metric2t)
- # Normalize feature vectors; methods: sum, mean/standard deviation (zscore), min/max (linear)
- [ -norm <method>] Normalize all feature vectors (default=no-normalization). Method can be:
- sum: normalize each feature by the sum of all its values
- zscore: normalize each feature by its mean/standard deviation
- linear: normalize each feature by its min/max values
- # Run k-fold cross-validation on the training data
- [ -kcv <k> ] Specify if you want to perform k-fold cross validation using the specified training data (default=NoCV)
- -tvs can be used to further reserve a portion of the training data in each fold for validation
- # Directory in which to save the models trained via cross-validation
- [ -kcvmd <dir> ] Directory for models trained via cross-validation (default=not-save)
- [ -kcvmn <model> ] Name for model learned in each fold. It will be prefix-ed with the fold-number (default=empty)
- [-] RankNet-specific parameters
- # Number of training epochs
- [ -epoch <T> ] The number of epochs to train (default=100)
- # Number of hidden layers
- [ -layer <layer> ] The number of hidden layers (default=1)
- # Number of hidden nodes per layer
- [ -node <node> ] The number of hidden nodes per layer (default=10)
- # Learning rate
- [ -lr <rate> ] Learning rate (default=0.00005)
- [-] RankBoost-specific parameters
- # Number of training rounds
- [ -round <T> ] The number of rounds to train (default=300)
- # Number of threshold candidates to search
- [ -tc <k> ] Number of threshold candidates to search. -1 to use all feature values (default=10)
- [-] AdaRank-specific parameters
- # Number of training rounds
- [ -round <T> ] The number of rounds to train (default=500)
- # Do not enqueue too-strong features during training
- [ -noeq ] Train without enqueuing too-strong features (default=unspecified)
- # Tolerance between two consecutive rounds of learning
- [ -tolerance <t> ] Tolerance between two consecutive rounds of learning (default=0.002)
- # Maximum number of times a feature can be selected consecutively without changing performance
- [ -max <times> ] The maximum number of times can a feature be consecutively selected without changing performance (default=5)
- [-] Coordinate Ascent-specific parameters
- [ -r <k> ] The number of random restarts (default=5)
- [ -i <iteration> ] The number of iterations to search in each dimension (default=25)
- [ -tolerance <t> ] Performance tolerance between two solutions (default=0.001)
- [ -reg <slack> ] Regularization parameter (default=no-regularization)
- [-] {MART, LambdaMART}-specific parameters
- # Number of trees
- [ -tree <t> ] Number of trees (default=1000)
- # Number of leaves per tree
- [ -leaf <l> ] Number of leaves for each tree (default=10)
- # Learning rate
- [ -shrinkage <factor> ] Shrinkage, or learning rate (default=0.1)
- # Number of threshold candidates for tree splitting
- [ -tc <k> ] Number of threshold candidates for tree spliting. -1 to use all feature values (default=256)
- # Minimum number of samples per leaf
- [ -mls <n> ] Min leaf support -- minimum #samples each leaf has to contain (default=1)
- [ -estop <e> ] Stop early when no improvement is observed on validaton data in e consecutive rounds (default=100)
- [-] ListNet-specific parameters
- [ -epoch <T> ] The number of epochs to train (default=1500)
- [ -lr <rate> ] Learning rate (default=0.00001)
- [-] Random Forests-specific parameters
- [ -bag <r> ] Number of bags (default=300)
- # Sub-sampling rate
- [ -srate <r> ] Sub-sampling rate (default=1.0)
- # Feature sampling rate
- [ -frate <r> ] Feature sampling rate (default=0.3)
- [ -rtype <type> ] Ranker to bag (default=0, i.e. MART)
- # Number of trees in each bag
- [ -tree <t> ] Number of trees in each bag (default=1)
- # Number of leaves per tree
- [ -leaf <l> ] Number of leaves for each tree (default=100)
- # Learning rate
- [ -shrinkage <factor> ] Shrinkage, or learning rate (default=0.1)
- # Number of threshold candidates for tree splitting
- [ -tc <k> ] Number of threshold candidates for tree spliting. -1 to use all feature values (default=256)
- [ -mls <n> ] Min leaf support -- minimum #samples each leaf has to contain (default=1)
- [-] Linear Regression-specific parameters
- [ -L2 <reg> ] L2 regularization parameter (default=1.0E-10)
- [+] Testing previously saved models
- # Model to load
- -load <model> The model to load
- Multiple -load can be used to specify models from multiple folds (in increasing order),
- in which case the test/rank data will be partitioned accordingly.
- # Test data
- -test <file> Test data to evaluate the model(s) (specify either this or -rank but not both)
- # Rank the samples in the given file; cannot be combined with -test
- -rank <file> Rank the samples in the specified file (specify either this or -test but not both)
- [ -metric2T <metric> ] Metric to evaluate on the test data (default=ERR@10)
- [ -gmax <label> ] Highest judged relevance label. It affects the calculation of ERR (default=4, i.e. 5-point scale {0,1,2,3,4})
- [ -score <file>] Store ranker's score for each object being ranked (has to be used with -rank)
- # Save performance on individual ranked lists (must be used with -test)
- [ -idv <file> ] Save model performance (in test metric) on individual ranked lists (has to be used with -test)
- # Feature normalization
- [ -norm ] Normalize feature vectors (similar to -norm for training/tuning)
- 1. -train <file>
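Specifies the training data file. Each line has the following format:
label qid:$id $featureid:$featurevalue $featureid:$featurevalue ... # description
Each line represents one sample. Samples that belong to the same query share the same qid. label is the relevance of the sample to that query. description is free text and takes no part in training.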
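To make the line format above concrete, here is a minimal Python sketch that parses one training line. The names `Sample` and `parse_line` are made up for this illustration and are not part of RankLib; the sketch assumes integer feature ids and whitespace-separated tokens.

```python
from typing import Dict, NamedTuple


class Sample(NamedTuple):
    label: float
    qid: str
    features: Dict[int, float]
    description: str


def parse_line(line: str) -> Sample:
    """Parse one 'label qid:<id> <fid>:<value> ... # description' line."""
    # Everything after the first '#' is the free-text description.
    body, _, description = line.partition("#")
    tokens = body.split()
    label = float(tokens[0])
    key, _, qid = tokens[1].partition(":")
    if key != "qid":
        raise ValueError(f"expected 'qid:<id>' as second token, got {tokens[1]!r}")
    features: Dict[int, float] = {}
    for token in tokens[2:]:
        fid, _, value = token.partition(":")
        features[int(fid)] = float(value)  # assumption: feature ids are integers
    return Sample(label, qid, features, description.strip())


# Hypothetical example line, not taken from MQ2008.
sample = parse_line("2 qid:10 1:0.5 2:0.0 3:1.25 # doc-42")
print(sample)
```

Grouping the parsed samples by `qid` gives the per-query lists that the ranking algorithms operate on.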
2. -ranker <type>
Specifies the ranking algorithm:
- MART (Multiple Additive Regression Trees), also known as GBDT (Gradient Boosting Decision Tree), GBRT (Gradient Boosting Regression Tree), or TreeNet
- RankNet
- RankBoost
- AdaRank
- Coordinate Ascent
- LambdaMART
- ListNet
- Random Forests
- Linear regression
3. -feature <file>
Specifies the feature description file for the samples, in the following format:
- feature1
- feature2
- ...
- # featureK (this feature is excluded from learning)
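As an illustration of the format above, the following Python sketch builds the text of a feature description file and comments out the features to exclude. `feature_file_text` is a hypothetical helper, and the use of integer feature ids (matching the ids in the training data) is an assumption of this sketch.

```python
from typing import Iterable, Set


def feature_file_text(all_features: Iterable[int], excluded: Set[int]) -> str:
    """Return one feature id per line; excluded ids are prefixed with '# '."""
    lines = []
    for fid in all_features:
        lines.append(f"# {fid}" if fid in excluded else str(fid))
    return "\n".join(lines) + "\n"


# Hypothetical example: features 1..5, with feature 3 left out of learning.
text = feature_file_text(range(1, 6), excluded={3})
print(text, end="")
```

Writing `text` to a file and passing that file to -feature would then restrict learning to the uncommented features.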
4. -metric2t <metric>
Specifies the information-retrieval evaluation metric to optimize. Supported metrics:
MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k
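For readers unfamiliar with these metrics, here is a small Python sketch of NDCG@k and ERR@k using common textbook definitions (gain 2^label - 1, log2 rank discount, and the cascade model for ERR). It is meant as an illustration only; RankLib's own implementation may differ in details, and the function names are invented for this example. The `gmax` argument plays the role of the -gmax option: it is the highest relevance label and scales the stop probability in ERR.

```python
import math
from typing import Sequence


def dcg_at_k(labels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the top k labels, in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(labels[:k], start=1))


def ndcg_at_k(labels: Sequence[int], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideal (label-sorted) ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


def err_at_k(labels: Sequence[int], k: int, gmax: int = 4) -> float:
    """Expected reciprocal rank over the top k labels (cascade user model)."""
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, rel in enumerate(labels[:k], start=1):
        p_stop = (2 ** rel - 1) / (2 ** gmax)  # chance the user is satisfied here
        err += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return err


# Labels of one ranked list, best-ranked document first (hypothetical data).
ranked_labels = [0, 1]
print(ndcg_at_k(ranked_labels, 10), err_at_k(ranked_labels, 10))
```

Both functions take the relevance labels in the order the model ranked the documents, so a perfect ordering gives NDCG@k = 1.0.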
5. Example
- java -jar bin/RankLib.jar -train MQ2008/Fold1/train.txt -test MQ2008/Fold1/test.txt -validate MQ2008/Fold1/vali.txt -ranker 6 -metric2t NDCG@10 -metric2T ERR@10 -save mymodel.txt
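Explanation of the command:
Training data: MQ2008/Fold1/train.txt
Test data: MQ2008/Fold1/test.txt
Validation data: MQ2008/Fold1/vali.txt
Ranking algorithm: 6 (LambdaMART)
Training metric: NDCG@10, computed over the top 10 results
Test metric: ERR@10, computed over the top 10 results
Saved model: mymodel.txt
The -validate parameter is optional, but it usually yields a better model and is especially important for RankNet, MART, and LambdaMART.
-metric2t applies only to list-wise algorithms (AdaRank, Coordinate Ascent, and LambdaMART). Point-wise and pair-wise algorithms (MART, RankNet, RankBoost) use their own internal RMSE or pair-wise loss as the training objective. ListNet is a list-wise algorithm, but it does not use the metric given by -metric2t either.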
6. k-fold cross validation
Sequential partitioning
java -jar bin/RankLib.jar -train MQ2008/Fold1/train.txt -ranker 4 -kcv 5 -kcvmd models/ -kcvmn ca -metric2t NDCG@10 -metric2T ERR@10
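The training data is split sequentially into 5 equal parts. Part i serves as the test data of fold i, and the training data of fold i consists of the remaining parts.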
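The sequential partitioning described above can be sketched in Python as follows. This is an illustration under assumptions, not RankLib's code: it splits at the query level (all samples of one qid stay together), and the way leftover queries are distributed when the count is not divisible by k is a choice made for this sketch. `sequential_folds` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def sequential_folds(qids: Sequence[str], k: int) -> List[Tuple[List[str], List[str]]]:
    """Return k (train, test) pairs; fold i tests on the i-th contiguous slice."""
    n = len(qids)
    folds = []
    for i in range(k):
        start = i * n // k
        end = (i + 1) * n // k
        test = list(qids[start:end])
        # Training data for this fold is everything outside the test slice.
        train = list(qids[:start]) + list(qids[end:])
        folds.append((train, test))
    return folds


# Hypothetical query ids in file order.
folds = sequential_folds([f"q{i}" for i in range(10)], k=5)
for train, test in folds:
    print("test:", test, "train size:", len(train))
```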
Random partitioning
java -cp bin/RankLib.jar ciir.umass.edu.features.FeatureManager -input MQ2008/Fold1/train.txt -output mydata/ -shuffle
This shuffles the training data train.txt and stores the result as train.txt.shuffled in the mydata/ directory.
Generate the data for each fold:
java -cp bin/RankLib.jar ciir.umass.edu.features.FeatureManager -input mydata/train.txt.shuffled -output mydata/ -k 5
7. Evaluating a trained model
java -jar bin/RankLib.jar -load mymodel.txt -test MQ2008/Fold1/test.txt -metric2T ERR@10
8. Comparing models
- java -jar bin/RankLib.jar -test MQ2008/Fold1/test.txt -metric2T NDCG@10 -idv output/baseline.ndcg.txt
- java -jar bin/RankLib.jar -load ca.model.txt -test MQ2008/Fold1/test.txt -metric2T NDCG@10 -idv output/ca.ndcg.txt
- java -jar bin/RankLib.jar -load lm.model.txt -test MQ2008/Fold1/test.txt -metric2T NDCG@10 -idv output/lm.ndcg.txt
Each output file contains the NDCG@10 value of every query, followed by the overall value across all queries, for example:
- NDCG@10 170 0.0
- NDCG@10 176 0.6722390270733757
- NDCG@10 177 0.4772656487866462
- NDCG@10 178 0.539003131276382
- NDCG@10 185 0.6131471927654585
- NDCG@10 189 1.0
- NDCG@10 191 0.6309297535714574
- NDCG@10 192 1.0
- NDCG@10 194 0.2532778777010656
- NDCG@10 197 1.0
- NDCG@10 200 0.6131471927654585
- NDCG@10 204 0.4772656487866462
- NDCG@10 207 0.0
- NDCG@10 209 0.123151194370365
- NDCG@10 221 0.39038004999210174
- NDCG@10 all 0.5193204478059303
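In the sample above, the `all` value equals the arithmetic mean of the per-query values. The Python sketch below parses such a file and recomputes that mean. `parse_idv` is a hypothetical helper, and the sketch assumes each line holds three whitespace-separated fields (metric, query id, value) without the leading list marker shown in this post.

```python
from typing import Dict, Optional, Tuple


def parse_idv(text: str) -> Tuple[Dict[str, float], Optional[float]]:
    """Return (per-query values, overall value) from -idv style output."""
    per_query: Dict[str, float] = {}
    overall: Optional[float] = None
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _metric, qid, value = parts
        if qid == "all":
            overall = float(value)
        else:
            per_query[qid] = float(value)
    return per_query, overall


# Hypothetical three-query file, shortened for the example.
example = """NDCG@10 170 0.0
NDCG@10 189 1.0
NDCG@10 200 0.5
NDCG@10 all 0.5
"""
per_query, overall = parse_idv(example)
mean = sum(per_query.values()) / len(per_query)
print(per_query, overall, mean)
```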
Then compare them:
java -cp RankLib.jar ciir.umass.edu.eval.Analyzer -all output/ -base baseline.ndcg.txt > analysis.txt
The comparison result in analysis.txt looks like this:
- Overall comparison
- ------------------------------------------------------------------------
- System                        Performance  Improvement         Win  Loss  p-value
- baseline_ndcg.txt [baseline]  0.093
- LM_ndcg.txt                   0.2863       +0.1933 (+207.8%)   9    1     0.03
- CA_ndcg.txt                   0.5193       +0.4263 (+458.26%)  12   0     0.0
- Detailed break down
- ------------------------------------------------------------------------
-              [ <-100%)  [-100%,-75%)  [-75%,-50%)  [-50%,-25%)  [-25%,0%)  (0%, 25%]  (+25%, 50%]  (+50%, 75%]  (+75%, 100%]  (> +100%]
- LM_ndcg.txt  0          0             1            0            0          4          2            2            1             0
- CA_ndcg.txt  0          0             0            0            0          1          6            2            3             0
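The Improvement, Win, and Loss columns can be understood through a small Python sketch that compares two sets of per-query scores. This is a guess at how such numbers can be derived, not the Analyzer's actual code, and it does not reproduce the p-value, which comes from a significance test. `compare` is a hypothetical function name.

```python
from typing import Dict, Tuple


def compare(baseline: Dict[str, float],
            system: Dict[str, float]) -> Tuple[float, float, int, int]:
    """Return (absolute improvement, relative improvement in %, wins, losses)."""
    shared = [q for q in baseline if q in system]  # only queries scored by both
    mean_base = sum(baseline[q] for q in shared) / len(shared)
    mean_sys = sum(system[q] for q in shared) / len(shared)
    wins = sum(1 for q in shared if system[q] > baseline[q])
    losses = sum(1 for q in shared if system[q] < baseline[q])
    improvement = mean_sys - mean_base
    relative = improvement / mean_base * 100 if mean_base else float("nan")
    return improvement, relative, wins, losses


# Hypothetical per-query NDCG@10 values for a baseline and one system.
base_scores = {"1": 0.2, "2": 0.5, "3": 0.4}
sys_scores = {"1": 0.4, "2": 0.5, "3": 0.3}
improvement, relative, wins, losses = compare(base_scores, sys_scores)
print(improvement, relative, wins, losses)
```

Queries on which both systems score the same count as neither a win nor a loss in this sketch.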
9. Re-ranking with a trained model
java -jar RankLib.jar -load mymodel.txt -rank myResultLists.txt -score myScoreFile.txt
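myScoreFile.txt only adds one column, the newly computed ranking score. The file is not sorted, so you have to sort by this score yourself to obtain the new ranking order.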
- 1 0 -7.528650760650635
- 1 1 2.9022061824798584
- 1 2 -0.700125515460968
- 1 3 2.376657485961914
- 1 4 -0.29666265845298767
- 1 5 -2.038628101348877
- 1 6 -5.267711162567139
- 1 7 -2.022146463394165
- 1 8 0.6741248369216919
- ...
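Since the score file is not sorted, a short Python sketch can produce the new order per query. It assumes that the three columns are query id, the document's position within that query in the input file, and the model score; the post itself only states that a score column is added, so treat the column meaning as an assumption. `rerank` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rerank(score_text: str) -> Dict[str, List[int]]:
    """Return, per query id, the document indices sorted by descending score."""
    by_query: Dict[str, List[Tuple[float, int]]] = defaultdict(list)
    for line in score_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        qid, doc_index, score = parts
        by_query[qid].append((float(score), int(doc_index)))
    return {
        qid: [doc for _, doc in sorted(items, key=lambda pair: pair[0], reverse=True)]
        for qid, items in by_query.items()
    }


# The sample scores listed above, without the leading list marker.
scores = """1 0 -7.528650760650635
1 1 2.9022061824798584
1 2 -0.700125515460968
1 3 2.376657485961914
1 4 -0.29666265845298767
1 5 -2.038628101348877
1 6 -5.267711162567139
1 7 -2.022146463394165
1 8 0.6741248369216919
"""
new_order = rerank(scores)
print(new_order["1"])
```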
References
RankLib wiki https://sourceforge.net/p/lemur/wiki/RankLib/
Source: https://www.cnblogs.com/memento/p/9398047.html