machine learning basic notes

from the Coursera course by Hsuan-Tien Lin

Posted by neverset on March 26, 2020

Linear regression

During linear regression, both E_in and E_out converge to the noise level σ² as N grows. On average, E_in = σ²(1 − (d+1)/N) and E_out = σ²(1 + (d+1)/N), so the expected gap (in squared error) between them is 2σ²(d+1)/N. Here d+1 is the number of parameters (the VC dimension of the linear model) and N is the number of data points.
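As a minimal sketch (not from the notes), this is how the least-squares solution w_lin = X†y can be computed with NumPy; the data shape and the σ = 0.1 noise level are illustrative assumptions:

```python
import numpy as np

def linreg(X, y):
    # least-squares solution: w = pseudo-inverse(X) @ y
    return np.linalg.pinv(X) @ y

# toy usage with synthetic noisy data
rng = np.random.default_rng(0)
N = 100
x = rng.uniform(-1, 1, N)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, N)   # noise level sigma = 0.1
X = np.column_stack([np.ones(N), x])        # add bias feature x0 = 1
w = linreg(X, y)
E_in = np.mean((X @ w - y) ** 2)            # in-sample squared error
```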

logistic regression

Error definition

The output of logistic regression is generated by the sigmoid function θ(s) = 1/(1 + e^(−s)). The logistic regression error is derived by maximizing the likelihood that h equals f, which yields the cross-entropy error; this error depends on w, x_n and y_n.

The mathematical form of the cross-entropy error used for calculation is E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + exp(−y_n wᵀx_n)).
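A small sketch of this formula in Python (assuming NumPy arrays, labels y_n ∈ {−1, +1}, and a bias column already included in X):

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) * sum ln(1 + exp(-y_n * w.x_n)), y_n in {-1, +1}."""
    scores = y * (X @ w)                     # y_n * w^T x_n for every n
    # logaddexp(0, -s) = ln(1 + exp(-s)), computed in a numerically stable way
    return np.mean(np.logaddexp(0.0, -scores))
```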

Gradient descent and learning rate

The optimal gradient descent direction is the opposite direction of the gradient ∇E_in.

η (eta) is the learning rate.

The training process: starting from some w(0), repeat the update w(t+1) ← w(t) − η∇E_in(w(t)) until the gradient is approximately zero or an iteration limit is reached.
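A hedged sketch of this training loop for logistic regression (fixed η and a fixed step count are illustrative defaults, not prescribed by the course):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logreg_gd(X, y, eta=0.1, steps=1000):
    """Fixed-learning-rate gradient descent on the cross-entropy error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # grad E_in(w) = (1/N) * sum theta(-y_n w.x_n) * (-y_n x_n)
        grad = np.mean(sigmoid(-y * (X @ w))[:, None] * (-y[:, None] * X),
                       axis=0)
        w -= eta * grad                      # w(t+1) = w(t) - eta * grad
    return w
```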

Linear models

advantages and disadvantages

Comparison of advantages and disadvantages among the three linear models: PLA/pocket optimizes the 0/1 error directly but is hard to solve unless the data is linearly separable; linear regression has a closed-form solution, but its squared error is a loose upper bound on the 0/1 error when |s| is large; logistic regression is easy to optimize with gradient descent, but its cross-entropy bound is also loose for very negative y·s.

optimization process

What does stochastic gradient descent (SGD) mean? Instead of averaging the gradient over all N examples per update, SGD picks one random example and uses its single-point gradient as an unbiased estimate of the true gradient, making each update much cheaper.

SGD logistic regression can be seen as a soft PLA: the update θ(−y_n wᵀx_n)(y_n x_n) approaches the PLA update when the example is badly misclassified and vanishes when it is confidently correct. Stopping criterion during training: a fixed number of iterations (trusting that w is close to optimal after enough steps). η (the learning rate) is commonly set to 0.1 as an empirical value.
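A matching SGD sketch (same conventions as the gradient descent sketch above; the step count and seed are arbitrary choices):

```python
import numpy as np

def logreg_sgd(X, y, eta=0.1, steps=10000, seed=0):
    """One-example-at-a-time updates: a 'soft' PLA."""
    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        n = rng.integers(len(y))             # pick one random example
        # theta(-y_n w.x_n) scales the PLA-like step y_n * x_n
        w += eta * sigmoid(-y[n] * (X[n] @ w)) * y[n] * X[n]
    return w
```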

multiclass classification

Two methods to deal with multiclass classification (see the sketch after this list):

  • OVA (one versus all): train one soft binary classifier per class against all the rest, then predict the most confident class; with hard binary classifiers the regions are not clearly separable (ambiguous or empty zones), so soft outputs are recommended
  • OVO (one versus one): train one binary classifier per pair of classes and combine them by voting
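A small OVA sketch under these assumptions: `train_soft_binary` is a hypothetical stand-in for any soft binary trainer (e.g. the logistic regression sketch above), and labels Y are integers 0..K−1:

```python
import numpy as np

def ova_train(X, Y, K, train_soft_binary):
    """Train K soft binary classifiers, one per class k vs. the rest."""
    return [train_soft_binary(X, np.where(Y == k, 1.0, -1.0))
            for k in range(K)]

def ova_predict(ws, x):
    # pick the class whose classifier is most confident (largest score)
    return int(np.argmax([w @ x for w in ws]))
```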

nonlinear transformation

The principle for dealing with a nonlinear problem is to transform the data from the nonlinear input space into a feature space where a linear model can handle it. However, the transform makes the model more complex, so parameters (C, λ) are introduced to restrict the model's complexity (regularization, covered below).
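A minimal sketch of such a transform, assuming scalar inputs and a degree-Q polynomial feature space:

```python
import numpy as np

def poly_transform(x, Q):
    """Map scalar inputs x to (1, x, x^2, ..., x^Q) feature vectors."""
    return np.column_stack([x ** q for q in range(Q + 1)])

# a linear model trained on z = poly_transform(x, Q) is a degree-Q
# polynomial model in the original x space
```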

overfitting

Using a more complex model is not necessarily good, especially when the number of data points is small.

Four factors that lead to overfitting:

  • data size
  • stochastic noise
  • deterministic noise
  • excessive power (model complexity)

Suggested ways to avoid overfitting: start from a simple model, clean/prune the data, add data hints (virtual examples), regularize, and validate. Regularization and validation are covered next.

regularization

Regularization loosens and softens a hard constraint to deal with the overfitting problem: instead of forcing high-order weights to be exactly zero, constrain the overall weight magnitude, e.g. Σ w_q² ≤ C.

Augmented error definition: E_aug(w) = E_in(w) + (λ/N) wᵀw, where the regularizer is defined as Ω(w) = wᵀw (weight decay).
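For squared error, minimizing E_aug has a closed form (ridge regression); a sketch assuming NumPy and λ > 0:

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize E_aug(w) = E_in(w) + (lambda/N) * w.w with squared-error E_in."""
    d = X.shape[1]
    # closed form: w_reg = (X^T X + lambda * I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```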

Regularization prefers a small λ (a little regularization goes a long way), and under gradient descent it shrinks w slightly at every training step, hence the name weight decay: w(t+1) ← (1 − 2ηλ/N)w(t) − η∇E_in(w(t)).

The transformation is more effective with Legendre polynomials than with naive polynomials x^q: the Legendre basis is orthogonal on [−1, 1], so the regularizer penalizes each order independently.
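NumPy ships a Legendre basis; a short sketch assuming inputs are already scaled into [−1, 1]:

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_transform(x, Q):
    """Map x in [-1, 1] to (L_0(x), ..., L_Q(x)) Legendre features."""
    return legendre.legvander(x, Q)   # shape (N, Q+1)
```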

Regularization still conforms to the VC guarantee, with a somewhat enlarged error range on paper, but in practice minimizing E_aug gives a better E_out because the model complexity is reduced by regularization.

If λ increases, the effective VC dimension d_eff(H, A) decreases, even though the nominal d_vc(H) stays the same.

Principles for designing a regularizer:

  • target-dependent: encode known properties of the target function (e.g. penalize high-order terms for a smooth target)
  • plausible: prefer smooth or simple hypotheses, since noise is not smooth
  • friendly: easy to optimize (e.g. convex and differentiable)

L1 regularizer (Σ|w_q|): the solution is sparse, which saves calculation time at prediction, but it is not differentiable everywhere and sits on corners of the constraint rather than a smooth optimum. L2 regularizer (Σ w_q²): convex and differentiable everywhere, so it is easy to optimize and gives a precise (dense) solution.

The larger the noise (stochastic or deterministic), the more regularization is needed, i.e. the larger λ should be.
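A hedged comparison sketch using scikit-learn (assuming it is installed; `alpha` plays the role of λ, and the synthetic data with only two relevant features is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)  # 2 relevant features

l1 = Lasso(alpha=0.1).fit(X, y)
l2 = Ridge(alpha=0.1).fit(X, y)
print("L1 nonzero weights:", np.sum(l1.coef_ != 0))  # sparse solution
print("L2 nonzero weights:", np.sum(l2.coef_ != 0))  # dense solution
```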

validation

There are many hyper-parameters to choose among during model selection; choosing by validation is still guaranteed by Hoeffding's inequality, since the validation set acts like fresh test data for a finite number of candidate models.

The data set should be divided into a training set and a validation set. The validation set should not be used for training purposes; cross validation is the exception, where each point serves both roles in different folds.

The training set is used to train all candidate models (g⁻), and the validation set is used to select the best model according to its validation error; the selected model is then retrained on the full data set.

Rule of thumb for dividing the data into training and validation sets: use about N/5 of the points for validation.

The expected error estimated on the validation set tracks E_out better than E_in alone does, because the validation data was never touched during training.

cross validation:

V-fold cross validation is preferred over a single validation split if computation allows; V = 5 or V = 10 are good choices in practice.
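A from-scratch sketch of V-fold cross validation; `train` and `error` are hypothetical user-supplied callables, not a fixed API:

```python
import numpy as np

def v_fold_cv_error(X, y, train, error, V=10, seed=0):
    """Average validation error over V folds of a shuffled data set."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, V)
    errs = []
    for i in range(V):
        val = folds[i]                      # fold i is held out
        trn = np.concatenate([folds[j] for j in range(V) if j != i])
        g = train(X[trn], y[trn])           # train on the other V-1 folds
        errs.append(error(g, X[val], y[val]))
    return np.mean(errs)
```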

some good suggestions in machine learning

Start with a simple model to avoid overfitting.

Avoid biased data (sampling bias): training and testing should come from the same distribution.

Avoid manual data snooping: looking at the data before or while modeling weakens the generalization guarantee.

These are the three learning principles: Occam's razor, sampling bias, and data snooping.