Cross-validation

Learn provides several cross-validation strategies, an integrated scoring function called cross_val_score!, and GridSearchCV. The strategies are kfold, stratified_kfold, and shufflesplit. The function cross_val_score! combines fit!, predict, and score with cross-validation to obtain a robust score for a set of parameters. Finally, GridSearchCV searches for the best parameter set among all combinations of the given parameter values, using cross-validation.

K-fold

K-fold cross-validation splits the observations into k folds and uses each fold once as the test set. The function kfold returns an iterable that provides pairs of training and test indices.

kfold(n_observations::Int; k::Integer=10)
kfold(y::Vector; k::Integer=10) = kfold(length(y); k=k)
kfold(X::Matrix; k::Integer=10) = kfold(size(X, 1); k=k)
# Example: evaluate an SVC on each of the 3 folds
kf = kfold(y; k=3)
for (idx_tr, idx_test) in kf
    X_tr = X[idx_tr, :]
    y_tr = y[idx_tr]  # y is a Vector, so it takes a single index
    X_test = X[idx_test, :]
    y_test = y[idx_test]
    svc = SVC()
    fit!(svc, X_tr, y_tr)
    score(svc, X_test, y_test)
end
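
To make the shape of the returned index pairs concrete, here is a minimal sketch of how a k-fold split can be computed. It is not Learn's implementation: sketch_kfold is a hypothetical helper, it uses contiguous folds without shuffling (Learn's kfold may behave differently), and it is written in current Julia (1.x) syntax rather than the older syntax of the signatures above.

```julia
# Hypothetical helper, not part of Learn: split 1:n into k contiguous folds
# and return a vector of (training indices, test indices) pairs.
function sketch_kfold(n::Int; k::Int=10)
    1 < k <= n || throw(ArgumentError("need 1 < k <= n"))
    # Every fold gets `base` observations; the first `extra` folds get one more.
    base, extra = divrem(n, k)
    folds = Tuple{Vector{Int}, Vector{Int}}[]
    start = 1
    for i in 1:k
        stop = start + base - 1 + (i <= extra ? 1 : 0)
        test = collect(start:stop)
        # Everything outside the test range is used for training.
        train = vcat(collect(1:start-1), collect(stop+1:n))
        push!(folds, (train, test))
        start = stop + 1
    end
    return folds
end

folds = sketch_kfold(10; k=3)
for (idx_tr, idx_test) in folds
    println(length(idx_tr), " training / ", length(idx_test), " test")
end
```

Each observation lands in exactly one test set, so the k test sets together cover all n observations.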

Stratified K-fold

Stratified version of K-fold cross-validation, which keeps the class proportions of y roughly equal in every fold. The function stratified_kfold returns an iterable that provides pairs of training and test indices.

stratified_kfold(y::Vector{Int}; k::Integer=2)
# Example: evaluate an SVC on each of the 3 stratified folds
skf = stratified_kfold(y; k=3)
for (idx_tr, idx_test) in skf
    X_tr = X[idx_tr, :]
    y_tr = y[idx_tr]  # y is a Vector, so it takes a single index
    X_test = X[idx_test, :]
    y_test = y[idx_test]
    svc = SVC()
    fit!(svc, X_tr, y_tr)
    score(svc, X_test, y_test)
end
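
The idea behind stratification can be sketched as follows. This is an illustration under assumptions, not Learn's code: sketch_stratified_kfold is a hypothetical helper that deals the members of each class round-robin across the folds, and it is written in current Julia (1.x) syntax.

```julia
# Hypothetical helper, not part of Learn: deal the observations of each class
# round-robin across k folds so that every fold sees every class.
function sketch_stratified_kfold(y::Vector{Int}; k::Int=2)
    fold_of = zeros(Int, length(y))
    for label in unique(y)
        members = findall(v -> v == label, y)
        for (j, idx) in enumerate(members)
            fold_of[idx] = mod1(j, k)  # cycles through 1, 2, ..., k
        end
    end
    # Fold f is the test set; all other folds form the training set.
    return [(findall(v -> v != f, fold_of), findall(v -> v == f, fold_of))
            for f in 1:k]
end

y = [1, 1, 1, 1, 2, 2]
splits = sketch_stratified_kfold(y; k=2)
for (idx_tr, idx_test) in splits
    println("test labels: ", y[idx_test])
end
```

With four observations of class 1 and two of class 2, each of the two test sets contains two observations of class 1 and one of class 2, which matches the 2:1 ratio of the full label vector.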

Shuffle-Split

Shuffle-split chooses observations at random without repetition. The function shufflesplit returns an iterable that provides pairs of training and test indices.

shufflesplit(n_observations::Int; k::Integer=10)
shufflesplit(y::Vector; k::Integer=10) = shufflesplit(length(y); k=k)
shufflesplit(X::Matrix; k::Integer=10) = shufflesplit(size(X, 1); k=k)
ss = shufflesplit(X; k=3)
for (idx_tr, idx_test) in ss
    X_tr = X[idx_tr, :]
    X_test = X[idx_test, :]
    km = Kmeans()
    fit!(km, X_tr)
    score(km, X_test)
end
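
A random split of this kind can be sketched as below. The sketch rests on several assumptions: sketch_shufflesplit is a hypothetical helper, k is taken to mean the number of random splits, and the test_fraction and rng keywords are invented for illustration and do not appear in the shufflesplit signatures above. It uses current Julia (1.x) syntax.

```julia
using Random

# Hypothetical helper, not part of Learn: draw k independent random splits.
# `test_fraction` and `rng` are illustrative parameters, not Learn's API.
function sketch_shufflesplit(n::Int; k::Int=10, test_fraction::Float64=0.25,
                             rng::AbstractRNG=MersenneTwister(0))
    n_test = max(1, round(Int, test_fraction * n))
    n_test < n || throw(ArgumentError("test set would use all observations"))
    splits = Tuple{Vector{Int}, Vector{Int}}[]
    for _ in 1:k
        # A fresh permutation per split: no index repeats within a split,
        # but the same index may appear in the test sets of different splits.
        perm = randperm(rng, n)
        push!(splits, (perm[n_test+1:end], perm[1:n_test]))
    end
    return splits
end

ss = sketch_shufflesplit(8; k=3)
for (idx_tr, idx_test) in ss
    println(length(idx_tr), " training / ", length(idx_test), " test")
end
```

Unlike K-fold, the test sets of different splits are drawn independently, so they are not guaranteed to cover every observation.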

cross_val_score!

Run fit!, predict, and score on different folds, provided via a cross-validation strategy, then compute the mean of the scores across all folds. Works for regression, classification, and clustering estimators, and for pipelines using those estimators.

cross_val_score!{T<:Classifier}(estimator::Union{T, Pipeline{T}}, X::Matrix{Float64}, y::Vector; cv::Function=stratified_kfold, scoring::Union{Function, Void}=nothing)
cross_val_score!{T<:AbstractFloat, S<:Regressor}(estimator::Union{S, Pipeline{S}}, X::Matrix{T}, y::Vector; cv::Function=kfold, scoring::Union{Function, Void}=nothing)
cross_val_score!{T<:AbstractFloat, S<:Cluster}(estimator::Union{S, Pipeline{S}}, X::Matrix{T}; cv::Function=kfold, scoring::Union{Function, Void}=nothing)
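
The fit, score, and average loop described above can be sketched generically. This is not Learn's implementation: sketch_cross_val_score is a hypothetical helper that takes the fitting and scoring step as a function, and the toy model (predict the training mean, score by mean absolute error) stands in for a real estimator. It uses current Julia (1.x) syntax.

```julia
using Statistics

# Hypothetical helper, not part of Learn: call `fit_and_score` on every fold
# and return the mean of the per-fold scores.
function sketch_cross_val_score(fit_and_score::Function, X::Matrix, y::Vector, folds)
    scores = Float64[]
    for (idx_tr, idx_test) in folds
        s = fit_and_score(X[idx_tr, :], y[idx_tr], X[idx_test, :], y[idx_test])
        push!(scores, s)
    end
    return mean(scores)
end

# Toy data and two hand-written folds, for illustration only.
X = reshape(collect(1.0:4.0), 4, 1)
y = [1.0, 2.0, 3.0, 4.0]
folds = [([1, 2], [3, 4]), ([3, 4], [1, 2])]

# Toy "estimator": predict the training mean, score by mean absolute error.
result = sketch_cross_val_score(X, y, folds) do X_tr, y_tr, X_test, y_test
    prediction = mean(y_tr)
    mean(abs.(y_test .- prediction))
end
println(result)
```

Averaging over folds is what makes the score less dependent on any single train/test split.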

GridSearchCV

Run cross_val_score! on different sets of parameters and return the best estimator, its parameters, and its score. GridSearchCV works on individual estimators as well as on Pipeline objects. Running a grid search with cross-validation on a pipeline is probably this package’s most advanced feature.

GridSearchCV(estimator::T, param_grid::Dict{ASCIIString, Vector}; scoring=nothing, cv=stratified_kfold)
GridSearchCV{S<:Cluster}(estimator::Union{S, Pipeline{S}}, param_grid::Dict{ASCIIString, Vector}; scoring=nothing, cv=kfold)
GridSearchCV{S<:Regressor}(estimator::Union{S, Pipeline{S}}, param_grid::Dict{ASCIIString, Vector}; scoring=nothing, cv=shufflesplit)

Create a new instance for a grid search. For individual estimators the parameters are provided as a dictionary, with the parameter name as key and the parameter values as a vector. For pipelines, each key combines the name of the pipeline stage and the parameter name, joined by a double underscore (for example svc__C). See the examples below to understand how to prepare the parameters for individual estimators and for pipelines.

# Grid search on an individual estimator
params = Dict{ASCIIString, Vector}("C"=>[0.01, 0.1, 1., 10., 100., 1000., 10000., 100000., 1000000.], "kernel"=>["rbf", "linear", "polynomial", "sigmoid"])
gs = GridSearchCV{SVC}(SVC(), params; scoring=f1_score, cv=y->stratified_kfold(y; k=10))
fit!(gs, X, y)
@show gs.best_score
@show gs.best_estimator
@show gs.best_params
# Grid search on a pipeline: keys are prefixed with the stage name
pipe = Pipeline([("mms", MinMaxScaler()), ("ss", StandardScaler())], ("svc", SVC()))
params = Dict{ASCIIString, Vector}("mms__range_min"=>[0.0], "mms__range_max"=>[1.0], "svc__C"=>[0.01, 0.1, 1., 10., 100., 1000., 10000., 100000., 1000000.], "svc__kernel"=>["rbf", "linear", "polynomial", "sigmoid"])
gs = GridSearchCV{Pipeline}(pipe, params)
fit!(gs, X, y)
@show gs.best_score
@show gs.best_params
@show gs.best_estimator.preprocessors
@show gs.best_estimator.estimator
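
What a grid search does with such a parameter dictionary can be sketched as follows. All three functions are hypothetical helpers, not Learn's API: sketch_expand_grid enumerates every combination of values, sketch_grid_search keeps the combination with the highest score, and split_stage_param separates a pipeline key at the double underscore seen in the example above. The evaluate function stands in for a call to cross_val_score!, and the code uses current Julia (1.x) syntax, hence String instead of ASCIIString.

```julia
# Hypothetical helper: one Dict per combination of the listed parameter values.
function sketch_expand_grid(param_grid::Dict)
    param_names = sort(collect(keys(param_grid)))
    combos = Dict{String, Any}[]
    for vals in Iterators.product((param_grid[n] for n in param_names)...)
        push!(combos, Dict{String, Any}(zip(param_names, vals)))
    end
    return combos
end

# Hypothetical helper: score every combination and keep the best one.
# `evaluate` stands in for a cross-validated score of the estimator.
function sketch_grid_search(evaluate::Function, param_grid::Dict)
    best_score, best_params = -Inf, Dict{String, Any}()
    for params in sketch_expand_grid(param_grid)
        s = evaluate(params)
        if s > best_score
            best_score, best_params = s, params
        end
    end
    return best_params, best_score
end

# Hypothetical helper: "svc__C" -> ("svc", "C"); plain keys have no stage.
function split_stage_param(key::String)
    parts = split(key, "__"; limit=2)
    return length(parts) == 2 ? (String(parts[1]), String(parts[2])) : (nothing, key)
end

grid = Dict{String, Vector}("C" => [0.1, 1.0, 10.0], "kernel" => ["rbf", "linear"])
# Toy score, for illustration only: prefers C close to 1 and the linear kernel.
toy_score(p) = -abs(log10(p["C"])) + (p["kernel"] == "linear" ? 0.5 : 0.0)
best_params, best_score = sketch_grid_search(toy_score, grid)
println(best_params, " scored ", best_score)
println(split_stage_param("svc__C"))
```

The number of combinations is the product of the lengths of all value vectors, so the two example grids above, with 9 values of C and 4 kernels, each require 36 cross-validated fits.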