1. Eager vs Lazy Classifiers
Eager Learning Classification Strategy
- Classifier builds a full model during an initial training phase, to use later when new query examples arrive.
- More offline setup work, less work at run-time.
- Generalise before seeing the query example.
Lazy Learning Classification Strategy
- Classifier keeps all the training examples for later use.
- Little work is done offline, wait for new query examples.
- Focus on the local space around the examples.
2. Noisy data in KNN
A simple 1-NN classifier is easy to implement, but it is very sensitive to "noise" in the data: a misclassification occurs every time a single noisy example is retrieved. Using a larger neighbourhood size (e.g. k = 3 or 5) can sometimes make the classifier more robust and overcome this problem. But when k is large (k→N) and the classes are unbalanced, we always predict the majority class.
3. Why odd values are generally preferred over even values in KNN.
When assigning a label to a query with kNN, the decision is based on the 'votes' of the query's k nearest neighbours. If k is even, ties can occur, so a tie-breaking step must be incorporated into the algorithm. For a two-class problem this can never occur when we use an odd value of k. Distance-based resolution of ties can also be used - this is equivalent to a weighted kNN classifier.
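As a rough illustration, the sketch below (plain Python; the toy data and the knn_predict helper are made up for this example) shows majority voting with an odd k, plus distance-weighted voting as a tie-breaker:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, examples, k=3, weighted=False):
    """Classify `query` from labelled `examples` = [(features, label), ...]."""
    # Find the k nearest neighbours of the query.
    neighbours = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    if not weighted:
        # Simple majority vote; an odd k avoids ties for two classes.
        return Counter(label for _, label in neighbours).most_common(1)[0][0]
    # Distance-weighted vote: closer neighbours count more (also resolves ties).
    votes = {}
    for features, label in neighbours:
        d = euclidean(query, features)
        votes[label] = votes.get(label, 0.0) + 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)

# Toy usage: two classes in 2D.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_predict((1.1, 1.0), train, k=3))                  # plain majority vote
print(knn_predict((2.5, 2.5), train, k=4, weighted=True))   # even k: tie broken by distance
```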
4. Decision tree basic idea
Build a tree model that splits the training set into subsets in which all examples have the same class. Each rule is based on a feature, and the split corresponds to the values it can take. If necessary, each subset can be split again using another feature, and so on until all examples have the same class. Once the tree is built, we can use it to quickly classify new input examples - this is an eager learning strategy.
Initially, all examples in the training set are placed at the root node of the tree. One of the available features (X) is now used to split the examples at the root node into two or more child nodes containing subsets of examples. The same process is now applied to each child node, except for any child node at which all examples have the same class.
5. Node purity
A tree node is pure if all examples at that node have the same class label. A decision tree in which all the leaf nodes are pure can always be constructed, provided there are no clashes in the data (i.e. examples having the same "description" in terms of features, but with different class labels). Most decision tree algorithms use some measure of node (im)purity to choose which feature to split on when building the tree. This measure guides the learning process.
6. What is a good tree
A "good" decision tree will classify all examples correctly using as few tree nodes as possible.Objective in building a decision tree is to choose good features so as to minimise the depth of the tree. (A perfect feature divides examples into categories of one class with high purity).
7. Common selection strategy in decision trees
Selection algorithms use impurity as an objective to guide feature selection. For each candidate feature, some measure of impurity is applied to the child nodes that the split would produce. The feature that maximises the reduction in impurity is selected as the next most useful feature.
8. ID3 vs C4.5
ID3 Algorithm: Popular algorithm which recursively builds a decision tree from the top down (Quinlan, 1986). Start with an empty tree and the set of all training examples S.
(N.B. it does not handle numeric data.)
The selection criterion is information gain, a popular information-theoretic measure based on entropy. In general, ID3 repeats the feature selection + splitting process until all examples at a node have the same class, or no features are left to split on.
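A small sketch of the entropy and information gain calculations (plain Python; the toy weather-style dataset is invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c) over the class proportions p_c."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature):
    """Gain(S, X) = H(S) - sum_v (|S_v| / |S|) * H(S_v), splitting on feature X."""
    labels = [label for _, label in examples]
    base = entropy(labels)
    remainder = 0.0
    for v in set(feats[feature] for feats, _ in examples):
        subset = [label for feats, label in examples if feats[feature] == v]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return base - remainder

# Toy training set: each example is ({feature: value}, class_label).
data = [
    ({"outlook": "sunny", "windy": False}, "no"),
    ({"outlook": "sunny", "windy": True}, "no"),
    ({"outlook": "rain", "windy": False}, "yes"),
    ({"outlook": "rain", "windy": True}, "no"),
    ({"outlook": "overcast", "windy": False}, "yes"),
]
# ID3 would choose the feature with the largest gain at each node.
for f in ("outlook", "windy"):
    print(f, round(information_gain(data, f), 3))
```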
C4.5 is an improved version of ID3 algorithm to overcome some of its disadvantages (Quinlan, 1993). It contains several improvements to make it "an industrial strength" decision tree learner, including:
- Handling continuous numeric features.
- Handling training data with missing values.
- Choosing an appropriate feature selection measure.
- Providing an option for pruning trees after creation to reduce likelihood of overfitting.
9. Inconsistent data handling
Inconsistent training data occurs when two identical examples in the training set have different labels (i.e. the clashes mentioned above). When building a tree, this can result in cases where a leaf node is not "pure" and no features are left to split on.
Simple solution: for the label, take the majority vote at the leaf node; if there is a tie, randomly choose one of the class labels.
10. Naïve Bayes classifier
Step 1: Calculate the class prior probabilities.
Step 2: Calculate the conditional probabilities for each feature value given each class.
Step 3: For a query, multiply each class prior by the conditional probabilities of the query's feature values (assuming the features are independent) and predict the class with the largest result.
However, in certain domains, strong violations of the independence assumptions can lead to poor performance by Naive Bayes classifiers.
Always keep in mind the type of data and the type of problem to be solved when choosing a classification model.
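Putting the steps together, a minimal sketch for categorical features (plain Python; the toy data and the add-one smoothing are assumptions made for this example, not part of the notes above):

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (feature_dict, class_label). Returns priors and value counts."""
    labels = [y for _, y in examples]
    priors = {c: n / len(examples) for c, n in Counter(labels).items()}   # Step 1
    counts = defaultdict(Counter)          # counts[(class, feature)][value]
    for feats, y in examples:
        for f, v in feats.items():
            counts[(y, f)][v] += 1
    return priors, counts

def predict_nb(feats, priors, counts):
    """Step 3: pick the class maximising prior * product of conditional probabilities."""
    best, best_score = None, 0.0
    for c, prior in priors.items():
        score = prior
        for f, v in feats.items():
            fc = counts[(c, f)]
            # Step 2, with add-one smoothing so unseen values do not give zero probability.
            score *= (fc[v] + 1) / (sum(fc.values()) + len(fc) + 1)
        if best is None or score > best_score:
            best, best_score = c, score
    return best

data = [({"outlook": "sunny"}, "no"), ({"outlook": "rain"}, "yes"),
        ({"outlook": "overcast"}, "yes"), ({"outlook": "sunny"}, "no")]
priors, counts = train_nb(data)
print(predict_nb({"outlook": "sunny"}, priors, counts))   # expected "no"
```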
11. A robust evaluation process for comparing the performance of a pair of classifiers on one or more datasets
A robust evaluation process comparing 2 classifiers would involve applying k-fold cross validation with a three-way hold-out strategy to each dataset as follows:
Divide data set into k folds.
FOR EACH of the k folds:
- Create test set T from the k-th fold.
- Create training set R from the remaining examples.
- Divide R into a reduced training set R1 and a validation set V.
- FOR EACH classifier:
  - Use V to tune parameters on a model trained with R1.
  - Use the selected parameters to train a model on R.
  - Measure accuracy on T.
- Collate results, assess significance of differences.
The significance of the proportions of wins and losses across all experiments can be measured statistically (e.g. McNemar's test). We can repeat the entire process multiple times to further reduce random variance - e.g. 10 x 10-fold cross validation.
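A compact sketch of this protocol, assuming scikit-learn is available; the dataset, the two classifiers, and the parameter grids are placeholders:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.base import clone

X, y = load_iris(return_X_y=True)                     # placeholder dataset
candidates = {
    "kNN":  (KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}),
    "Tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 8]}),
}
scores = {name: [] for name in candidates}

for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    X_R, y_R = X[train_idx], y[train_idx]             # training set R
    X_T, y_T = X[test_idx], y[test_idx]               # test set T (the k-th fold)
    # Divide R into a reduced training set R1 and a validation set V.
    X_R1, X_V, y_R1, y_V = train_test_split(X_R, y_R, test_size=0.2, random_state=0)
    for name, (clf, grid) in candidates.items():
        # Use V to tune the single hyper-parameter on models trained with R1.
        param, values = next(iter(grid.items()))
        best_v = max(values, key=lambda v: clone(clf).set_params(**{param: v})
                     .fit(X_R1, y_R1).score(X_V, y_V))
        # Retrain on all of R with the selected parameter, then measure accuracy on T.
        model = clone(clf).set_params(**{param: best_v}).fit(X_R, y_R)
        scores[name].append(model.score(X_T, y_T))

for name, accs in scores.items():
    print(name, round(float(np.mean(accs)), 3))
```

The inner train/validation split plays the role of the three-way hold-out within each fold; only T is ever used to report the final accuracies.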
12. Overfitting problems
Generally, overfitting happens when the model learns the relationship between features and labels too closely, including the noise. In other words, the model fits the training set as well as possible and loses the ability to generalise to unseen data. It is more likely when the training set is small, noisy, or high-dimensional.
Approaches used to address this problem include:
- Using evaluation methods, such as the hold-out strategy and cross validation, to split the original dataset into training and test sets, and using the test set to evaluate the model as unseen data.
- When overfitting is caused by a small training set, try to obtain more clean, relevant training data.
- When overfitting is caused by high dimensionality, use dimensionality reduction strategies (such as feature selection and feature transformation) to remove irrelevant features or transform the data into a more compact format.
Overall, a good model must not only fit the training data well, but also accurately classify examples that it has never seen before.
13. How skewed class distributions (imbalanced data) affect evaluation, and how to tackle this.
Imbalanced data: Refers to a problem in classification where the classes in the data are not represented equally (i.e. the distribution of class sizes is skewed).
Example: In a binary classification problem, we have a dataset of 100 items. 80 items belong to Class A, 20 belong to Class B. This is an imbalanced dataset, where the ratio A:B is 4:1. We call A the majority class and B the minority class.
High accuracy can be achieved by a biased classifier that simply predicts the majority class. To deal with skewed classes, use a balanced evaluation measure. Such measures include:
Balanced Accuracy Rate (BAR): mean of the TP rate and TN rate;
Balanced Error Rate (BER): mean of the FP rate and FN rate.
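A quick sketch of these measures computed from a binary confusion matrix, reusing the 80:20 example above (plain Python; the biased-classifier counts are implied by that example):

```python
def balanced_rates(tp, fn, fp, tn):
    """Balanced accuracy/error rates from a binary confusion matrix."""
    tpr = tp / (tp + fn)          # true positive rate
    tnr = tn / (tn + fp)          # true negative rate
    fpr = fp / (fp + tn)          # false positive rate
    fnr = fn / (fn + tp)          # false negative rate
    bar = (tpr + tnr) / 2         # Balanced Accuracy Rate
    ber = (fpr + fnr) / 2         # Balanced Error Rate (= 1 - BAR)
    return bar, ber

# Biased classifier on the 80:20 example: it predicts the majority class A for everything.
# Treating the minority class B as "positive": TP=0, FN=20, FP=0, TN=80.
print(balanced_rates(tp=0, fn=20, fp=0, tn=80))   # plain accuracy is 0.8, but BAR is only 0.5
```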
14. Approach for hypothesis testing
- Calculate a test statistic on sample data that is relevant to the hypothesis being examined.
- Convert the result to a p-value by comparing its value to the distribution of test statistics under the null hypothesis.
- Decide, for a specific level of significance, whether we should reject or not reject the null hypothesis, based on the p-value: if p < alpha, then reject H0 at level alpha.
Very small p-values (e.g. < 0.001) indicate highly significant results; results with a p-value near 1.0 are unlikely to be meaningful.
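For example, a sketch of this procedure using McNemar's test to compare two classifiers on the same test set (SciPy assumed; the disagreement counts b and c are made up):

```python
from scipy.stats import chi2

def mcnemar_p(b, c):
    """McNemar's test: b = cases only classifier 1 got right, c = cases only classifier 2 got right.
    The statistic (with continuity correction) follows a chi-square with 1 d.f. under the
    null hypothesis that both classifiers have the same error rate."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)      # convert the test statistic to a p-value

stat, p = mcnemar_p(b=8, c=22)            # hypothetical disagreement counts
alpha = 0.05
print(f"statistic={stat:.2f}, p={p:.4f}, reject H0 at {alpha}: {p < alpha}")
```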
15. Short description about Evaluation
Evaluation is a central task in machine learning. Two common types of evaluation experiment:
1. Verify a trained classifier on unseen data.
2. Compare the performance of two or more classifiers on similar problems.
Generally, an evaluation includes the following steps:
- Choose one or more appropriate evaluation measures.
- Run the classifier one or more times on a dataset or selection of datasets.
- Compare the performance of the classifier with existing benchmark classifiers, based on the evaluation measure(s).
16. ROC Curve
We often want to compare the performance of classifiers at many different decision thresholds (i.e. summarise many confusion matrices). A Receiver Operating Characteristic (ROC) curve is a graphical plot of how the true positive rate and false positive rate change over many different thresholds. The curve is drawn by plotting a point for each feasible threshold and joining them. A trained classifier should always be above the "random" reference line. The strength of the classifier increases as the ROC curve moves further from the line (i.e. closer to the top-left corner).
To compare the performance of two classifiers across different thresholds, make comparisons based on the Area Under the Curve (AUC). A better classifier will have a ROC curve closer to the top-left corner, giving a larger area under the curve.
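A short sketch that builds the ROC points by sweeping the decision threshold over predicted scores and computes the AUC with the trapezoidal rule (NumPy assumed; the labels and scores are toy values):

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays with one point per feasible decision threshold."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    order = np.argsort(-s)                              # sort examples by decreasing score
    y = y[order]
    tpr = np.cumsum(y == 1) / max((y == 1).sum(), 1)    # TP rate as the threshold drops
    fpr = np.cumsum(y == 0) / max((y == 0).sum(), 1)    # FP rate as the threshold drops
    # Prepend (0, 0), the point for the strictest possible threshold.
    return np.r_[0.0, fpr], np.r_[0.0, tpr]

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                       # toy ground-truth labels
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1]     # toy classifier scores
fpr, tpr = roc_points(y_true, scores)
# Area under the curve via the trapezoidal rule; 0.5 matches the "random" reference line.
auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))
print(auc)
```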
17. Problems with simple hold-out strategy
- Sometimes we cannot afford the "luxury" of setting aside data for testing.
- Sometimes in a single experiment, the hold-out estimate of error rate can be misleading if we get an "unfortunate" split of the data.
- Even if we use multiple splits, some examples will never be included for training or testing, while others might be selected many times.
18. Would leave-one-out cross validation be an appropriate evaluation strategy on a relatively large dataset?
Leave-one-out cross validation is an extreme case of k-fold cross validation where k is set to the total number of examples in the dataset. The procedure is:
- For a dataset with n examples, perform n experiments.
- For each experiment, use n-1 examples for training and the single remaining example (i.e. each fold contains one example) for testing.
- Average the accuracy/error rates over all n experiments.
So here, leave-one-out would require running 5000 experiments, leaving one example out each time. This could be computationally impractical, so k-fold cross validation would be more suitable.
19. Covariance and Correlation in Regression
If Y increases as X increases, the relationship is positive, and vice versa. The covariance between Y and X indicates the direction of the relationship, but it doesn't tell us about the strength of the relationship.
When we want to measure the strength of the linear relationship, we use the coefficient of determination, R². R² is the fraction of the variation which is explained by the linear regression model. It measures the explanatory power of the model and tells us how well our regression line matches the real data. The closer R² is to 1, the better the fit. Note: correlation does not imply causality, i.e. it does not tell us whether X is the cause of changes in Y.
If R² is near 1, it means that X accounts for a large part of the variation in Y; it gives us an idea of how well the predictor variable X accounts for, or determines, the response variable Y.
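A numerical sketch (NumPy assumed; the x and y values are made up) showing covariance for direction, correlation for strength, and R² for a fitted simple linear regression:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # roughly y = 2x

cov = np.mean((x - x.mean()) * (y - y.mean()))    # sign gives the direction of the relationship
corr = cov / (x.std() * y.std())                  # strength of the linear relationship, in [-1, 1]

# Fit a simple linear regression y = b0 + b1*x by least squares.
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)                 # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)              # total variation in y
r_squared = 1 - ss_res / ss_tot                   # fraction of variation explained

print(round(cov, 3), round(corr, 3), round(r_squared, 3))
```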
20. Error Based Learning and Gradient Descent
Define a loss function, or error function - e.g. the sum of squared errors, L(β) = Σ (y_i - h(x_i))^2.
The task is now to find the set of parameters β which minimise this loss function. This is a process of minimising the distance between our hypothesis h(x) and the data y.
Gradient descent is an algorithm that makes small steps along a function to find a local minimum. We start at some point and find the gradient (slope). Then take steps in the opposite direction to the gradient (i.e. downhill). The size of the step is controlled by an adjustable parameter. This algorithm gets us closer and closer to the local minimum.
General process:
- Define a loss function
- Start with some initial set of parameters
- Repeat until convergence:
- Update the parameters in the opposite direction to the gradient
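A minimal sketch of this loop for simple linear regression with a squared-error loss (NumPy assumed; the data, learning rate and iteration count are arbitrary choices):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 8.8])          # roughly y = 1 + 2x

def loss(b0, b1):
    """Sum of squared errors between the hypothesis h(x) = b0 + b1*x and the data y."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

b0, b1 = 0.0, 0.0        # initial parameter values
alpha = 0.01             # learning rate (step size)
for _ in range(5000):    # repeat until (approximate) convergence
    residuals = (b0 + b1 * x) - y
    grad_b0 = 2 * np.sum(residuals)          # dL/db0
    grad_b1 = 2 * np.sum(residuals * x)      # dL/db1
    # Step in the opposite direction to the gradient (downhill).
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(round(b0, 3), round(b1, 3), round(float(loss(b0, b1)), 4))
```

Because the whole dataset is used to compute every gradient, this is the "batch" version discussed in section 23.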
21. How to update in Gradient descent
We start with some initial value of the parameters (could be randomly chosen, or a reasonable first guess). Then we compute the gradient of the loss function with respect to each parameter. Then we adjust the parameter by a small amount (controlled by α), in the opposite direction to the gradient (hence the minus sign in the update β_j := β_j - α ∂L/∂β_j). By adjusting α, we can change how quickly we converge to the minimum. Large α -> risk of overshooting the minimum; small α -> convergence on the local minimum may be very slow.
22. The learning rate in GD, and how best to choose it
The learning rate, α, determines the size of the adjustment made to each weight at each step in the process. Most of the time we use rules of thumb, and also trial and error to choose the learning rate. Similarly, for the weights, we might start with initial values for β0, β1.
23. Batch GD and Stochastic GD
Method is called "batch" gradient descent because we use the entire batch of points x to calculate each gradient; Stochastic gradient descent uses a sample of points at each step and is faster for large dataset.
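A sketch of the stochastic variant, continuing the toy linear-regression example and using one randomly chosen point per update:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 8.8])

b0, b1, alpha = 0.0, 0.0, 0.01
for _ in range(20000):
    i = rng.integers(len(x))                 # pick a single random training point
    residual = (b0 + b1 * x[i]) - y[i]       # gradient estimated from that one point only
    b0 -= alpha * 2 * residual
    b1 -= alpha * 2 * residual * x[i]

print(round(b0, 2), round(b1, 2))            # noisier path, but ends up close to batch GD
```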
24. Partition clustering (K-means)
General goal: assign similar items to the same cluster and keep dissimilar items apart - high intra-cluster similarity, low inter-cluster similarity. Minimise the distances between the items and their nearest centroid (i.e. minimisation of the sum-of-squared error (SSE)). k-Means tries to reduce SSE via a two-step iterative process:
1) Reassign items to their nearest cluster centroid
2) Update the centroids based on the new assignments
Repeatedly apply these two steps until the algorithm converges to a final result.
Algorithm Steps:
- Initialisation: Pre-specify k and select k initial cluster centroids randomly.
- Assignment step: Assign every item to its nearest cluster centroid (e.g. using Euclidean distance).
- Update step: Re-compute the centroids of the clusters based on the new cluster assignments, where a centroid is the mean point of its cluster.
- Go back to the Assignment step, until no reassignments occur (or a maximum number of iterations is reached).
Results produced by k-Means are often highly dependent on the initial solution. Different starting positions can lead to different local optima, i.e. different clusterings of the same data. Common strategy: run the algorithm multiple times and select the solution(s) that score well according to some validation measure.
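A compact sketch of the assignment/update loop above, with multiple random restarts scored by SSE (NumPy assumed; the toy data are made up):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialisation: pick k distinct items as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each item goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned items.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):        # converged: centroids stopped moving
            break
        centroids = new
    sse = float(np.sum((X - centroids[labels]) ** 2))   # sum-of-squared-error objective
    return labels, centroids, sse

# Toy data: two well-separated 2-D groups.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
# Common strategy in practice: run from several random starts, keep the lowest-SSE solution.
labels, centroids, sse = min((kmeans(X, k=2, seed=s) for s in range(10)), key=lambda r: r[2])
print("SSE:", round(sse, 2))
```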
25. Limitations of k-Means
Advantages:
- Fast, easy to implement.
- "Good enough" in a wide variety of tasks and domains.
Disadvantages:
- Must pre-specify number of clusters k.
- Highly sensitive to the choice of initial centroids.
- Assumes that each cluster is spherical in shape and data examples are largely concentrated near its centroid.
- Traditional objective can give undue influence to outliers.
- Iterative process can lead to empty clusters, particularly for higher values of k.
26. Two distinct categories of hierarchical clustering algorithm
Agglomerative: Begin with each item assigned to its own cluster. Apply a bottom-up strategy where, at each step, the most similar pair of clusters are merged.
Divisive: Begin with a single cluster containing all items. Apply a top-down strategy where, at each step, a chosen cluster is split into two sub-clusters.
27. Cluster Metrics
- Single linkage: Define cluster distance as the smallest pairwise distance between items from each cluster.
- Complete linkage: Define cluster distance as the largest pairwise distance between items from each cluster.
- Average linkage: Define cluster distance as the average of all pairwise distances between items from each cluster.
The choice of cluster distance metric can substantially affect the resulting clustering. Complete linkage is sensitive to outliers. Single linkage tends to produce long chains, not cohesive clusters.
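For example, assuming SciPy is available, agglomerative clustering with each of the three linkage criteria can be sketched as follows:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two loose groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (10, 2)), rng.normal(6, 1.0, (10, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # bottom-up merge tree (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```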
28. Bisecting K-means algorithm
Key idea: apply a standard partition algorithm (e.g. k-Means with k = 2) to split a single cluster into two sub-clusters; repeat this until the desired number of clusters has been produced.
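A rough sketch of the idea, reusing scikit-learn's KMeans with k = 2 as the splitting step; choosing the cluster with the largest SSE to split next is one common heuristic, not the only option:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    clusters = [np.arange(len(X))]                     # start with one cluster of all items
    while len(clusters) < k:
        # Choose the cluster with the largest SSE to bisect next.
        sses = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        idx = clusters.pop(int(np.argmax(sses)))
        # Split it into two sub-clusters with standard k-Means (k = 2).
        sub = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        clusters += [idx[sub == 0], idx[sub == 1]]
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels

# Toy data: three groups of 2-D points.
X = np.vstack([np.random.default_rng(i).normal(5 * i, 0.5, (15, 2)) for i in range(3)])
print(bisecting_kmeans(X, k=3))
```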
29. Pros and cons of Hierarchical clustering
Advantages:
- Allows for multiple levels of granularity, from broad clusters to niche clusters.
- No requirement to pre-specify the "correct" number of clusters k in advance.
Disadvantages:
- Poor decisions made early in the clustering process can greatly influence the quality of the final clustering.
- Once a merging or splitting decision has been made, there exists no facility to rectify a mistake at a later stage (cannot undo).
- More computationally expensive than partition methods, particularly for agglomerative clustering.
30. Cluster Validation
Cluster validation measures automatically produce a quantitative evaluation of the quality of a clustering. Common motivation: "good" clusters have the property that cluster members are close to each other and far from members of other clusters. Cluster validation is often applied for parameter selection (e.g. to select an appropriate value of k for the k-Means algorithm).
Typical Strategy:
- Apply k-Means for each value of k from k_min to k_max.
- Calculate score for each clustering using a cluster validation measure.
- Examine plot of scores to identify a peak for the best value for k.
The silhouette width for an item x_i is given by s_i = (b_i - a_i) / max(a_i, b_i). Values are in the range [-1, 1]; a larger value is better.
Here, a_i is the average distance from x_i to all other items in its own cluster, and b_i is the average distance from x_i to the items in the nearest competing cluster.
Average Silhouette Width (ASW): Calculate overall score for a clustering by averaging the silhouette widths for all n items.
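A sketch of the silhouette width for one item and the ASW for a whole clustering (NumPy assumed; the tiny dataset and labels are made up):

```python
import numpy as np

def silhouette_width(X, labels, i):
    """s_i = (b_i - a_i) / max(a_i, b_i) for item i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = (labels == labels[i])
    own[i] = False                                   # exclude the item itself
    a_i = dists[own].mean()                          # average distance within own cluster
    # b_i: smallest average distance to any other (competing) cluster.
    b_i = min(dists[labels == c].mean() for c in set(labels) - {labels[i]})
    return (b_i - a_i) / max(a_i, b_i)

def average_silhouette_width(X, labels):
    return float(np.mean([silhouette_width(X, labels, i) for i in range(len(X))]))

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
print(round(average_silhouette_width(X, labels), 3))   # close to 1 => compact, well-separated
```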
31. Curse of Dimensionality
Intuitively, adding more features (dimensions) to a dataset should provide more information about each example, making prediction easier. In reality, we often reach a point where adding more features no longer helps, or can even reduce predictive power. Curse of dimensionality: the phenomenon whereby many machine learning algorithms can perform poorly on high-dimensional data (Bellman, 1961). In practice, to build a model, the number of examples required per feature increases exponentially with the number of features. Also, for a given number of examples, there is a maximum number of features beyond which the performance of a classifier will degrade rather than improve.
There are often other reasons why we might want to reduce the number of features/dimensions used to represent data:
- Computational cost: For many algorithms, having a large number of features can significantly increase running times and memory usage.
- Financial cost: In certain domains (e.g. clinical medicine, manufacturing), running experiments to generate feature values can be expensive. In such cases, we only want to generate the minimum number of features required to build a good model.
- Interpretability: A feature set which is more compact can help to give a better understanding of the underlying process that generated the data.
(Understanding which features in our data are informative and which are not is an important knowledge discovery task.)
32. Strategies for dimension reduction
1. Feature Transformation (Feature Extraction)
Transforms the original features of a dataset to a completely new, smaller, more compact feature set, while retaining as much information as possible.
e.g. Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA)
2. Feature Selection
Tries to find a minimum subset of the original features that optimises one or more criteria, rather than producing an entirely new set of dimensions for the data.
e.g. Information Gain filter, Wrapper with sequential forward selection
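As an illustration of feature transformation, a PCA sketch via NumPy's SVD (the toy correlated data are made up; scikit-learn's PCA class would produce the same projection):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (directions of maximum variance)."""
    Xc = X - X.mean(axis=0)                  # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T          # new, smaller, transformed feature set

# Toy data: 3 correlated features that mostly vary along one underlying direction.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t,
               2 * t + 0.1 * rng.normal(size=(100, 1)),
               -t + 0.1 * rng.normal(size=(100, 1))])
Z = pca(X, n_components=1)                   # 3 original features -> 1 transformed feature
print(X.shape, "->", Z.shape)
```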
33. Feature selection and why?
Feature subset selection aims to find the best subset of all available features: the smallest number of dimensions that contribute most to accuracy. Discard the remaining, unimportant dimensions.
- Building a better classifier - Redundant or noisy features can damage accuracy.
- Knowledge discovery - Identifying useful features helps us learn more about the data.
- Features expensive to obtain - Test a large number of features, select a few for the final system (e.g. sensors, manufacturing).
- Interpretability - Selected features still have meaning. We can extract meaningful rules from the classifier.
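A sketch of a wrapper with sequential forward selection, greedily adding the feature that most improves cross-validated accuracy (scikit-learn assumed; the dataset and classifier are placeholders):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)                     # placeholder dataset
clf = KNeighborsClassifier(n_neighbors=3)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Try adding each remaining feature and keep the one giving the best CV accuracy.
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean() for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:          # stop when no single extra feature improves accuracy
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("selected feature indices:", selected, "CV accuracy:", round(best_score, 3))
```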