Linear regression
In linear regression, both Ein and Eout converge to the noise level σ² as N grows; the expected (squared-error) gap between them is shown in the learning-curve figure.
d: VC dimension
N: number of data points
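As a sketch, the standard learning-curve result behind that figure (assuming noise level σ² and a linear model with d+1 parameters) is:

```latex
% expected in-sample / out-of-sample squared error for linear regression
\mathbb{E}[E_{\mathrm{in}}]  = \sigma^2\left(1 - \frac{d+1}{N}\right), \qquad
\mathbb{E}[E_{\mathrm{out}}] \approx \sigma^2\left(1 + \frac{d+1}{N}\right)
% so the expected generalization gap shrinks like 2 sigma^2 (d+1) / N
\mathbb{E}[E_{\mathrm{out}} - E_{\mathrm{in}}] \approx \frac{2\sigma^2 (d+1)}{N}
```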
logistic regression
Error definition
The output of logistic regression is produced by the sigmoid function applied to the linear score wᵀx.
The logistic regression error is obtained by maximizing the likelihood that h equals the target f; the resulting measure is called the cross-entropy error. It depends on w, xn and yn.
The mathematical form of the cross-entropy error used for calculation is given below.
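As a sketch of that form (assuming labels yn ∈ {−1, +1}):

```latex
% sigmoid output and cross-entropy (negative log-likelihood) error
h(\mathbf{x}) = \theta(\mathbf{w}^{T}\mathbf{x}), \qquad \theta(s) = \frac{1}{1+e^{-s}}
% minimizing E_in is equivalent to maximizing the likelihood of the data
E_{\mathrm{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \ln\bigl(1 + \exp(-y_n\,\mathbf{w}^{T}\mathbf{x}_n)\bigr)
```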
Gradient descent and learning rate
The optimal gradient-descent direction is the opposite of the gradient ∇Ein.
The purple η (eta) is the learning rate.
The training process is described in the chart; a small Python sketch of the update loop follows.
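A minimal sketch of batch gradient descent for the cross-entropy error (the function names and the eta / num_iters defaults are illustrative assumptions, not from the note):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gradient_Ein(w, X, y):
    # gradient of the cross-entropy error:
    # (1/N) * sum_n sigmoid(-y_n * w.x_n) * (-y_n * x_n)
    s = -y * (X @ w)
    return np.mean((sigmoid(s) * -y)[:, None] * X, axis=0)

def gradient_descent(X, y, eta=0.1, num_iters=1000):
    # each step moves opposite to the gradient, scaled by the learning rate eta
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        w = w - eta * gradient_Ein(w, X, y)
    return w
```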
Linear models
advantages and disadvantages
Comparison of the advantages and disadvantages of the three linear models (PLA, linear regression, logistic regression).
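As a sketch of what the comparison rests on, the pointwise error measures of the three models can be written in terms of the score s = wᵀx (standard forms, with y ∈ {−1, +1}):

```latex
% pointwise errors of the three linear models
\mathrm{err}_{0/1}(s, y) = [\![\,\mathrm{sign}(s) \neq y\,]\!], \qquad
\mathrm{err}_{\mathrm{sqr}}(s, y) = (s - y)^2, \qquad
\mathrm{err}_{\mathrm{ce}}(s, y) = \ln\bigl(1 + \exp(-ys)\bigr)
% err_sqr and err_ce / ln 2 both upper-bound err_0/1, so a small regression or
% cross-entropy error also keeps the classification error small
```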
optimization process
What does stochastic gradient descent (SGD) mean? Instead of averaging the gradient over all N points, each update uses the gradient of a single randomly chosen example.
SGD logistic regression can be seen as a "soft" PLA (a sketch follows below).
Stopping criteria during training:
- a fixed number of iterations
η (the learning rate) is set to 0.1 (an experience value)
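A minimal Python sketch of SGD logistic regression under these choices (fixed η = 0.1, stop after a fixed number of iterations; the data handling and defaults are assumptions):

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, num_iters=2000, seed=0):
    # stochastic gradient descent: one randomly picked example per update;
    # eta = 0.1 is the rule-of-thumb value, and we simply stop after a
    # fixed number of iterations
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        n = rng.integers(len(y))
        # sigmoid(-y_n * w.x_n) acts as a "soft" version of PLA's mistake
        # indicator: close to 1 when the example is badly classified
        soft_mistake = 1.0 / (1.0 + np.exp(y[n] * (X[n] @ w)))
        w = w + eta * soft_mistake * y[n] * X[n]
    return w
```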
multiclass classification
Two methods for handling multiclass classification:
- OVA (one versus all) [not recommended when the classes are not clearly separable: some regions end up ambiguous]
- OVO (one versus one)
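A rough Python sketch of the two decompositions (assuming a helper train_binary(X, y) that returns a linear weight vector, e.g. the logistic-regression trainer above; all names are illustrative):

```python
import numpy as np
from itertools import combinations

def train_ova(X, y, classes, train_binary):
    # one-versus-all: one binary problem per class (that class = +1, rest = -1)
    return {c: train_binary(X, np.where(y == c, 1.0, -1.0)) for c in classes}

def train_ovo(X, y, classes, train_binary):
    # one-versus-one: one binary problem per pair of classes, trained only
    # on the data points belonging to that pair
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = train_binary(X[mask], np.where(y[mask] == a, 1.0, -1.0))
    return models

def predict_ovo(models, x):
    # each pairwise classifier votes for one of its two classes
    votes = {}
    for (a, b), w in models.items():
        winner = a if (w @ x) > 0 else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```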
nonlinear transformation
The principle for dealing with a nonlinear problem is to transform the data from the original (nonlinear) space into a feature space where it becomes linear, and then apply a linear model there. However, this leads to a more complex model, so parameters such as C or λ are used to restrict the model complexity (regularization).
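A minimal sketch of the transform-then-linear-model idea, using a 2nd-order polynomial transform as the illustrative Φ (the feature choice is an assumption):

```python
import numpy as np

def phi_2nd_order(X):
    # nonlinear transform Phi: (x1, x2) -> (1, x1, x2, x1^2, x1*x2, x2^2)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])
```

Any linear model (PLA, linear or logistic regression) can then be run on Z = phi_2nd_order(X); it is linear in Z but nonlinear in the original x.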
overfitting
Using a more complex model is not necessarily better, especially when the number of data points is small.
Four causes of overfitting:
- data size
- stochastic noise
- deterministic noise
- excessive power (model complexity)
suggested ways to avoid overfitting:
regularization
Regularization loosens and softens the hard constraints on the hypothesis set, so the (nonlinear) model does not become overly complex.
Definitions of the augmented error and of the regularizer (sketched below).
Regularization typically needs only a small λ, and with weight decay the weights w are shrunk a little at every training step.
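As a sketch of those definitions, the augmented error with the weight-decay (L2) regularizer, and the closed-form ridge solution it yields for linear regression:

```latex
% augmented error = in-sample error + weight-decay regularizer
E_{\mathrm{aug}}(\mathbf{w}) = E_{\mathrm{in}}(\mathbf{w}) + \frac{\lambda}{N}\,\mathbf{w}^{T}\mathbf{w}
% for linear regression, minimizing E_aug has the closed-form (ridge) solution
\mathbf{w}_{\mathrm{reg}} = \bigl(X^{T}X + \lambda I\bigr)^{-1} X^{T}\mathbf{y}
```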
The transformation is more effective with Legendre polynomials than with naive polynomials.
Regularization still conforms to the VC guarantee, though with a somewhat enlarged error bound.
The effective model complexity is reduced by regularization.
As λ increases, the effective VC dimension decreases.
principles for designing a regularizer:
- target dependent
- plausible
- friendly
L1 regularizer: computation-saving because the solution is sparse, but the optimization is harder (the objective is not differentiable everywhere)
L2 regularizer: easy to optimize (differentiable everywhere) and gives a precise solution
The noisier the data, the more regularization (larger λ) is needed.
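The two regularizers written out (standard forms):

```latex
% L1 (sparsity-inducing) and L2 (weight-decay) regularizers
\Omega_{1}(\mathbf{w}) = \sum_{i} |w_i|, \qquad
\Omega_{2}(\mathbf{w}) = \sum_{i} w_i^{2} = \mathbf{w}^{T}\mathbf{w}
% the corners of the L1 constraint region make sparse solutions likely;
% the L2 regularizer is differentiable everywhere, so it is easier to optimize
```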
validation
There are many hyper-parameters to choose from during model selection; the selection is still covered by the Hoeffding-style guarantee (over finitely many candidates).
The data set should be split into a training set and a validation set; the validation set must not be used for training (except that every point eventually plays both roles in cross validation).
The training set is used to train all candidate models (g), and the validation set is used to select the best one according to its validation error.
Rule of thumb for splitting the data into training and validation sets: commonly about one fifth of the data goes to validation.
The error estimated on the validation set tracks Eout even better than Ein alone.
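A minimal Python sketch of this selection procedure (the split ratio, candidate format, and helper names are illustrative assumptions):

```python
import numpy as np

def select_by_validation(X, y, candidates, train, error, val_fraction=0.2, seed=0):
    # split once into training and validation parts; train every candidate on
    # the training part only, then pick the candidate with the lowest
    # validation error (the validation set is never used for training)
    idx = np.random.default_rng(seed).permutation(len(y))
    n_val = int(val_fraction * len(y))
    val, tr = idx[:n_val], idx[n_val:]
    models = {name: train(X[tr], y[tr], **params) for name, params in candidates.items()}
    best = min(models, key=lambda name: error(models[name], X[val], y[val]))
    # in practice the winning setting is often retrained on all the data afterwards
    return best, models[best]
```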
cross validation:
V-fold cross validation is preferred over a single validation split if the computation budget allows.
5-fold and 10-fold are good practical choices.
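A minimal sketch of v-fold cross validation (v = 10 by default; the helper names are assumptions):

```python
import numpy as np

def cv_error(X, y, train, error, v=10, seed=0):
    # v-fold cross validation: each fold serves once as the validation set,
    # the model is trained on the remaining v-1 folds, and the v validation
    # errors are averaged into a single estimate
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, v)
    errs = []
    for k in range(v):
        val = folds[k]
        tr = np.concatenate(folds[:k] + folds[k + 1:])
        errs.append(error(train(X[tr], y[tr]), X[val], y[val]))
    return float(np.mean(errs))
```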
Some general suggestions for machine learning practice:
- start with a simple model to avoid overfitting
- avoid biased (unrepresentative) data
- avoid manual data snooping (it weakens the generalization guarantee)