1. Eager vs Lazy Classifiers

Eager Learning Classification Strategy: a model is built from the training data in advance (at training time), and new queries are then classified using this pre-built model (e.g. decision trees, Naive Bayes).

Lazy Learning Classification Strategy: no model is built in advance; the training examples are simply stored, and the work of generalisation is deferred until a query needs to be classified (e.g. k-nearest neighbours).

2. Noisy data in KNN

A simple 1-NN classifier is easy to implement, but it is very sensitive to "noise" in the data: a misclassification will occur every time a single noisy example is retrieved as the nearest neighbour. Using a larger neighbourhood size (e.g. k > 2) can make the classifier more robust and overcome this problem. However, when k is very large (k → N) and the classes are unbalanced, we will always predict the majority class.
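A minimal sketch of this effect in Python (assuming scikit-learn is available; the synthetic dataset and the values of k are purely illustrative) compares 1-NN with larger neighbourhood sizes on data containing label noise:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data; flip_y injects label noise into the examples
X, y = make_classification(n_samples=500, n_features=5, flip_y=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}  test accuracy={knn.score(X_test, y_test):.3f}")
```

With noisy labels, accuracy for k = 1 is typically lower than for a moderate k, while a very large k drifts towards always predicting the majority class.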

3. Why odd values are generally preferred over even values in KNN.

When assigning a label to a query with kNN, the decision is based on the 'votes' of the query's k nearest neighbours. If k is even, it is possible for ties to occur, so a tie-breaking step must be incorporated into the algorithm. Ties can never occur (for a two-class problem) when we use an odd value of k. Distance-based resolution of ties can also be used; this is equivalent to a weighted kNN classifier.

4. Decision tree basic idea

Build a tree model that splits the training set into subsets in which all examples have the same class. Each rule is based on a feature, and the split corresponds to the values it can take. If necessary, each subset can be split again using another feature, and so on until all examples have the same class. Once the tree is built, we can use it to quickly classify new input examples - this is an eager learning strategy.

Initially, all examples in the training set are placed at the root node of the tree. One of the available features (X) is then used to split the examples at the root node into two or more child nodes containing subsets of examples. The same process is applied to each child node, except for any child node at which all examples already have the same class.
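As a minimal illustration of this eager strategy (assuming scikit-learn and its bundled iris dataset; the parameter choices are arbitrary), the tree is built once from the training data and then reused to classify new queries:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

# Eager learning: the model (the tree) is built up front at training time...
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)

# ...and new examples are then classified quickly by walking the tree
print(tree.predict(X_test[:5]))
print(export_text(tree, feature_names=iris.feature_names))
```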

5. Node purity

A tree node is pure if all examples at that node have the same class label. A decision tree in which all leaf nodes are pure can always be constructed, provided there are no clashes in the data (i.e. examples having the same "description" in terms of features, but different class labels). Most decision tree algorithms use some measure of node (im)purity to choose the feature to split on when building the tree; this measure guides the learning process.

6. What is a good tree

A "good" decision tree will classify all examples correctly using as few tree nodes as possible.Objective in building a decision tree is to choose good features so as to minimise the depth of the tree. (A perfect feature divides examples into categories of one class with high purity).

7. Common selection strategy in decision trees

Selection strategies use impurity as an objective to guide feature selection. For each candidate feature, some measure of impurity is applied to the child nodes that the split would produce. The feature that maximises the reduction in impurity is selected as the next most useful feature.

8. ID3 vs C45

ID3 Algorithm: a popular algorithm which recursively builds a decision tree from the top down (Quinlan, 1986). Start with an empty tree and the set of all training examples S.

(N.B. ID3 does not handle numeric data.)

The selection criterion is information gain, a popular information-theoretic approach for selecting features in decision trees, based on entropy. In general, ID3 repeats the feature selection + splitting process until all examples at a node have the same class, or no features are left to split on.
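A minimal sketch of the entropy and information gain calculation (assuming pandas and NumPy; the tiny weather-style dataset and column names are invented for illustration):

```python
import numpy as np
import pandas as pd

def entropy(labels):
    """Entropy (in bits) of a set of class labels."""
    probs = labels.value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

def information_gain(df, feature, target="class"):
    """Reduction in entropy obtained by splitting df on the given feature."""
    before = entropy(df[target])
    after = sum(len(sub) / len(df) * entropy(sub[target])
                for _, sub in df.groupby(feature))
    return before - after

# Hypothetical toy training set
df = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
    "windy":   ["no",    "yes",   "no",       "no",   "yes",  "yes"],
    "class":   ["no",    "no",    "yes",      "yes",  "no",   "yes"],
})
for f in ("outlook", "windy"):
    print(f, round(information_gain(df, f), 3))
```

ID3 would pick the feature with the largest information gain to split on, then repeat the calculation within each child node.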

C4.5 is an improved version of the ID3 algorithm that overcomes some of its disadvantages (Quinlan, 1993). It contains several improvements that make it an "industrial strength" decision tree learner, including support for numeric (continuous) features, handling of missing values, and pruning of the tree.

9. Inconsistent data handling

Inconsistent training data occurs when two identical examples in the training set have different labels (the clashes mentioned earlier). When building a tree, this can result in leaf nodes that are not "pure" even though no features are left to split on.

Simple solution: for the label, take the majority vote at the leaf node; if there is a tie, randomly choose one of the class labels.

10. Naïve bayes classifier

Step 1: Calculate class prior probabilities

Step 2: Calculate the conditional probability of each feature value given each class.

Step 3: For a new example, combine the prior and the conditional probabilities using Bayes' theorem (under the naive independence assumption) and predict the class with the highest posterior probability.

However, in certain domains, strong violations of the independence assumptions can lead to poor performance by Naive Bayes classifiers.

We always need to keep in mind the type of data and the type of problem to be solved when choosing a classification model.
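A minimal sketch of these steps on a hypothetical categorical dataset (assuming pandas; the columns and query are invented, and no smoothing is applied, so zero probabilities can occur):

```python
import pandas as pd

df = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
    "windy":   ["no", "yes", "no", "no", "yes", "yes"],
    "play":    ["no", "no", "yes", "yes", "no", "yes"],
})

# Step 1: class prior probabilities P(class)
priors = df["play"].value_counts(normalize=True)

# Step 2: conditional probabilities P(feature value | class)
def conditional(feature, value, cls):
    subset = df[df["play"] == cls]
    return (subset[feature] == value).mean()

# Step 3: score each class for a new query and predict the maximum
query = {"outlook": "sunny", "windy": "yes"}
scores = {cls: priors[cls]
               * conditional("outlook", query["outlook"], cls)
               * conditional("windy", query["windy"], cls)
          for cls in priors.index}
print(scores, "->", max(scores, key=scores.get))
```

In practice, Laplace smoothing is usually added so that a single unseen feature value does not force a probability of zero.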

11. A robust evaluation process for comparing the performance of a pair of classifiers on one or more datasets

A robust evaluation process for comparing two classifiers would involve applying k-fold cross-validation with a three-way hold-out strategy to each dataset as follows:

  1. Divide the dataset into k folds.

  2. FOR EACH of the k folds: train both classifiers on the remaining k-1 folds, evaluate both on the held-out fold, and record which classifier performs better (a "win", "loss" or tie).

The significance of the proportions of wins and losses across all experiments can be measured statistically (e.g. with McNemar's test). We can repeat the entire process multiple times to further reduce random variance, e.g. 10 x 10-fold cross-validation.
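A minimal sketch of this process on a single dataset (assuming scikit-learn; the dataset and the two classifiers are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=1)

# Evaluate both classifiers on exactly the same folds
scores_knn = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)
scores_tree = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=cv)

# Count per-fold "wins" for the first classifier
wins_knn = sum(a > b for a, b in zip(scores_knn, scores_tree))
print(f"kNN wins {wins_knn} of {cv.get_n_splits()} folds")
```

The per-fold wins and losses (or the paired scores themselves) can then be fed into a significance test, and the whole procedure repeated with different random fold assignments.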

12. Overfitting problems

Generally, overfitting happens when a model learns the relationship between the features and the labels too closely, including the noise in the training data. In other words, the model fits the training set as well as possible and loses the ability to generalise to unseen data. It is more likely to occur when the training set is small, noisy, or high-dimensional.

Approaches used to address this problem include:

  1. Using evaluation methods, such as a hold-out strategy or cross-validation, to split the original data into training and test sets, and using the test set to evaluate the model as unseen data.
  2. When overfitting is caused by a small training set, try to obtain more clean, relevant training data.
  3. When overfitting is caused by high dimensionality, use dimensionality reduction strategies (such as feature selection and feature transformation) to remove irrelevant features or transform the data into a more compact format.

Overall, a good model must not only fit the training data well, but also accurately classify examples that it has never seen before.

13. Skewed class distributions (Imbalanced data) affect the evaluation, how to tackle.

Imbalanced data: Refers to a problem in classification where the classes in the data are not represented equally (i.e. the distribution of class sizes is skewed).

Example: In a binary classification problem, we have a dataset of 100 items. 80 items belong to Class A, 20 belong to Class B. This is an imbalanced dataset, where the ratio A:B is 4:1. We call A the majority class and B the minority class.

High accuracy can be achieved by a biased classifier which simply predicts the majority class. To deal with skewed classes, use a balanced evaluation measure, such as balanced accuracy, precision/recall and the F1 score, or the area under the ROC curve (AUC).
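A minimal sketch of why plain accuracy is misleading here (assuming scikit-learn; the 80/20 split mirrors the example above and the predictions are invented):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical imbalanced test set: 80 items of class 0 (majority), 20 of class 1
y_true = [0] * 80 + [1] * 20
y_pred = [0] * 100            # a biased classifier that always predicts the majority class

print("accuracy:         ", accuracy_score(y_true, y_pred))            # 0.80, looks good
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))   # 0.50, exposes the bias
print("F1 (minority):    ", f1_score(y_true, y_pred, zero_division=0)) # 0.0, minority class ignored
```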

14. Approach for hypothesis testing

  1. Calculate a test statistic on sample data that is relevant to the hypothesis being examined.
  2. Convert the result to a p-value by comparing its value to the distribution of test statistics under the null hypothesis.
  3. Decide, for a specific level of significance, whether to reject the null hypothesis based on the p-value: if p < alpha, then reject H0 at level alpha.

The smaller the p-value (e.g. < 0.001), the more likely the results are to be highly significant; results with a p-value near 1.0 are unlikely to be meaningful.
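A minimal sketch of these steps (assuming SciPy; the paired fold scores are invented, and a paired t-test is used here purely as an example of producing a test statistic and p-value):

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical paired accuracy scores for two classifiers on the same 10 folds
scores_a = np.array([0.91, 0.89, 0.93, 0.90, 0.92, 0.88, 0.94, 0.91, 0.90, 0.93])
scores_b = np.array([0.87, 0.88, 0.90, 0.86, 0.89, 0.87, 0.91, 0.88, 0.86, 0.90])

stat, p_value = ttest_rel(scores_a, scores_b)   # steps 1-2: test statistic and p-value
alpha = 0.05
print(f"t = {stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:                             # step 3: decision at significance level alpha
    print("Reject H0 at the 0.05 level")
else:
    print("Fail to reject H0")
```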

15. Short description about Evaluation

Evaluation is a central task in machine learning. Two common types of evaluation experiment:

  1. Verify a trained classifier on unseen data.
  2. Compare the performance of two or more classifiers on similar problems.

Generally, an evaluation includes the following steps:

16. ROC Curve

We often want to compare the performance of classifiers at many different decision thresholds (i.e. to summarise many confusion matrices). A Receiver Operating Characteristic (ROC) curve is a graphical plot of how the true positive rate and false positive rate change over many different thresholds. The curve is drawn by plotting a point for each feasible threshold and joining them. A trained classifier should always be above the "random" reference line; the strength of the classifier increases as the ROC curve moves further from this line (i.e. closer to the top-left corner).

When we want to compare the performance of two classifiers at different thresholds, we can make comparisons based on the Area Under the Curve (AUC). A better classifier will have a ROC curve closer to the top-left corner, giving a larger area under the curve.
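A minimal sketch of computing the ROC points and the AUC (assuming scikit-learn; the dataset and classifier are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) point per threshold
print("AUC =", round(roc_auc_score(y_te, scores), 3))
```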

17. Problems with simple hold-out strategy

With a single random train/test split, the performance estimate depends heavily on which examples happen to fall into the test set, so results can vary considerably from one split to another, and part of the data is never used for training. Repeating the split, or using cross-validation, gives a more reliable estimate.

18. Would leave-one-out cross validation be an appropriate evaluation strategy on relatively large dataset

Leave-one-out cross-validation is an extreme case of k-fold cross-validation where k is set to the total number of examples n in the dataset: for each example, the model is trained on the other n-1 examples and tested on the single held-out example.

So, for a dataset of 5000 examples, leave-one-out would require running 5000 experiments, with 1 example left out each time. This could be computationally impractical, so k-fold cross-validation (e.g. 10-fold) would be more suitable.
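A minimal sketch of the difference in cost (assuming scikit-learn; the 5000 x 10 array is just a placeholder for a dataset of that size):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.zeros((5000, 10))                      # placeholder for 5000 examples

print(LeaveOneOut().get_n_splits(X))          # 5000 train/test experiments
print(KFold(n_splits=10).get_n_splits(X))     # only 10 experiments
```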

19. Covariance and Correlation in Regression

If Y increases as X increases, the relationship is positive, and vice versa. The covariance between Y and X indicates the direction of the relationship, but it does not tell us about the strength of the relationship.

When we want to measure the strength of the linear relationship, we use the coefficient of determination, R². R² is the fraction of the variation which is explained by the linear regression model. It measures the explanatory power of the model and tells us how well our regression line matches the real data. The closer R² is to 1, the better the fit. Note: correlation does not imply causality; it does not tell us whether X is the cause of changes in Y.

If R² is near 1, then X accounts for a large part of the variation in Y; it gives us an idea of how much of the response variable Y is accounted for, or determined, by the predictor variable X.
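A minimal sketch of these quantities on synthetic data (assuming NumPy; the linear relationship is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 5 + rng.normal(0, 2, 100)       # roughly linear relationship with noise

cov = np.cov(x, y)[0, 1]                    # sign gives the direction of the relationship
r = np.corrcoef(x, y)[0, 1]                 # correlation measures its strength

# Fit a simple linear regression and compute R^2 = 1 - SSE/SST
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"cov={cov:.2f}  r={r:.3f}  R^2={r2:.3f}")
```

For simple linear regression with an intercept, R² equals the square of the correlation coefficient between X and Y.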

20. Error Based Learning and Gradient Descent

Defining a loss function, or error function, e.g. the sum of squared errors: L(β) = Σi (h(xi) - yi)², summed over all training examples i.

The task is now to find the set of parameters β which minimise this loss function. This is a process of minimising the distance between our hypothesis h(x) and the data y.

Gradient descent is an algorithm that makes small steps along a function to find a local minimum. We start at some point and find the gradient (slope). Then we take steps in the opposite direction to the gradient (i.e. downhill). The size of each step is controlled by an adjustable parameter (the learning rate). This process gets us closer and closer to the local minimum.

General process:

21. How to update in Gradient descent

We start with some initial value of the parameters (randomly chosen or a reasonable first guess). We then compute the gradient of the loss function with respect to each parameter and adjust that parameter by a small amount (controlled by α) in the opposite direction to the gradient (hence the minus sign). By adjusting α, we can change how quickly we converge to the minimum: a large α risks overshooting the minimum, while a small α may be very slow to converge on the local minimum.
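A minimal sketch of these updates for simple linear regression with a mean squared error loss (assuming NumPy; the data, learning rate, and number of steps are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = 4 * x + 2 + rng.normal(0, 0.1, 200)   # underlying relationship: beta1=4, beta0=2

b0, b1 = 0.0, 0.0                         # initial guesses for the parameters
alpha = 0.1                               # learning rate

for step in range(2000):
    error = b0 + b1 * x - y
    # Gradients of the mean squared error with respect to b0 and b1
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * x).mean()
    # Adjust each parameter a small amount in the opposite direction to its gradient
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(round(b0, 2), round(b1, 2))         # should approach 2 and 4
```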

22. The learning rate in GD, best choose

The learning rate, α, determines the size of the adjustment made to each weight at each step in the process. Most of the time we use rules of thumb, and also trial and error to choose the learning rate. Similarly, for the weights, we might start with initial values for β0, β1.

23. Batch GD and Stochastic GD

The method is called "batch" gradient descent because we use the entire batch of points x to calculate each gradient. Stochastic gradient descent instead uses a sample of points (often a single point or a small mini-batch) at each step, and is faster for large datasets.
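A minimal sketch of the stochastic (mini-batch) variant (assuming NumPy; it reuses the same hypothetical regression setup as the earlier example, with a random sample of 32 points per step instead of the full batch):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100_000)                 # a larger dataset, where batch GD becomes slow
y = 4 * x + 2 + rng.normal(0, 0.1, x.size)

b0, b1, alpha = 0.0, 0.0, 0.1
for step in range(2000):
    idx = rng.integers(0, x.size, 32)          # random mini-batch instead of all points
    error = b0 + b1 * x[idx] - y[idx]
    b0 -= alpha * 2 * error.mean()
    b1 -= alpha * 2 * (error * x[idx]).mean()

print(round(b0, 2), round(b1, 2))              # approximately 2 and 4
```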

24. Partition clustering (K-means)

General goal: assign similar items to the same cluster and keep dissimilar items apart (high intra-cluster similarity, low inter-cluster similarity). Minimise the distances between items and their nearest centroid (i.e. minimise the sum-of-squared error, SSE). k-Means tries to reduce the SSE via a two-step iterative process:

1) Reassign items to their nearest cluster centroid

2) Update the centroids based on the new assignments

Repeatedly apply these two steps until the algorithm converges to a final result.

Algorithm Steps:

  1. Initialisation: Pre-specify k and select k initial cluster centroids randomly.
  2. Assignment step: Assign every item to its nearest cluster centroid (e.g. using Euclidean distance).
  3. Update step: Re-compute the centroids of the clusters based on the new cluster assignments, where a centroid is the mean point of its cluster.
  4. Go back to Step 2, until no reassignments occur (or until a maximum number of iterations is reached).

Results produced by k-Means are often highly dependent on the initial solution. Different starting positions can lead to different local optima, which means different clusterings of the same data. Common strategy: run the algorithm multiple times and select the solution that scores best according to some validation measure.
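A minimal sketch of this strategy (assuming scikit-learn and synthetic data; n_init controls the number of random restarts):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# n_init=10 runs k-Means from 10 random initialisations and keeps the
# solution with the lowest SSE (called "inertia" in scikit-learn)
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
print("SSE (inertia):", round(km.inertia_, 2))
print("cluster sizes:", [int((km.labels_ == c).sum()) for c in range(3)])
```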

25. Limitations of k-Means

Advantages:

Disadvantages:

26. Two distinct categories of hierarchical clustering algorithm

  1. Agglomerative: Begin with each item assigned to its own cluster. Apply a bottom-up strategy where, at each step, the most similar pair of clusters are merged.

  2. Divisive: Begin with a single cluster containing all items. Apply a top-down strategy where, at each step, a chosen cluster is split into two sub-clusters.

27. Cluster Metrics

The choice of linkage criterion (cluster distance measure) can substantially affect the resulting clustering. Complete linkage is sensitive to outliers, while single linkage tends to produce long chains rather than cohesive clusters.
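A minimal sketch of bottom-up (agglomerative) clustering with different linkage criteria (assuming scikit-learn; the overlapping synthetic blobs are chosen to make the chaining behaviour visible):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=2)

for linkage in ("single", "complete"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    sizes = sorted(int((labels == c).sum()) for c in range(3))
    # Single linkage often chains overlapping items into one large cluster,
    # giving very unbalanced sizes; complete linkage is more compact but
    # can be pulled around by outliers.
    print(linkage, sizes)
```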

28. Bisecting K-means algorithm

Key idea: apply a standard partitional algorithm (i.e. k-Means with k = 2) to split a single cluster into two sub-clusters, repeating until the desired number of clusters is reached.
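A minimal sketch of the bisecting idea (assuming scikit-learn and NumPy; using the largest-SSE cluster as the one to split is one common heuristic, and the helper below is illustrative rather than an optimised implementation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def bisecting_kmeans(X, k):
    """Repeatedly split the cluster with the largest SSE using 2-means."""
    clusters = [np.arange(len(X))]                 # start with one cluster holding all items
    while len(clusters) < k:
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))    # pick the "worst" cluster to bisect
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

X, _ = make_blobs(n_samples=300, centers=4, random_state=3)
print([len(c) for c in bisecting_kmeans(X, 4)])
```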

29. Pros and cons of Hierarchical clustering

Advantages:

Disadvantages:

30. Cluster Validation

Cluster validation measures automatically produce a quantitative evaluation of the quality of a clustering. A common motivation: "good" clusters have the property that cluster members are close to each other and far from members of other clusters. Cluster validation is often applied for parameter selection (e.g. selecting an appropriate value of k for the k-Means algorithm).

Typical Strategy:

  1. Apply k-Means for each value of k from k_min to k_max.
  2. Calculate score for each clustering using a cluster validation measure.
  3. Examine plot of scores to identify a peak for the best value for k.

The silhouette width for an item xi is given by si = (bi - ai) / max(ai, bi). Values are in the range [-1, 1]; a larger value is better.

where ai is the average distance from xi to all other items in its own cluster, and bi is the average distance from xi to the items in the nearest competing cluster.

Average Silhouette Width (ASW): Calculate overall score for a clustering by averaging the silhouette widths for all n items.
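A minimal sketch of the typical strategy above, using the average silhouette width to pick k (assuming scikit-learn and synthetic data with four true clusters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=5)

# Try each candidate k and keep the one with the highest ASW
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=5).fit_predict(X)
    print(f"k={k}  ASW={silhouette_score(X, labels):.3f}")
```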

31. Curse of Dimensionality

Intuitively, adding more features (dimensions) to a dataset should provide more information about each example, making prediction easier. In reality, we often reach a point where adding more features no longer helps, or can even reduce predictive power. The curse of dimensionality is the phenomenon whereby many machine learning algorithms perform poorly on high-dimensional data (Bellman, 1961). In practice, the number of examples required to build a model increases exponentially with the number of features. Also, for a given number of examples, there is a maximum number of features beyond which the performance of a classifier will degrade rather than improve.

There are often other reasons why we might want to reduce the number of features/dimensions used to represent data:

  1. Computational cost: For many algorithms, having a large number of features can significantly increase running times and memory usage.
  2. Financial cost: In certain domains (e.g. clinical medicine, manufacturing), running experiments to generate feature values can be expensive. In such cases, we only want to generate the minimum number of features required to build a good model.
  3. Interpretability: A feature set which is more compact can help to give a better understanding of the underlying process that generated the data.

(Understanding which features in our data are informative and which are not is an important knowledge discovery task.)

32. Strategies for dimension reduction

1. Feature Transformation (Feature Extraction)

Transforms the original features of a dataset to a completely new, smaller, more compact feature set, while retaining as much information as possible.


e.g. Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA)

2. Feature Selection

Tries to find a minimum subset of the original features that optimises one or more criteria, rather than producing an entirely new set of dimensions for the data.

e.g. Information Gain filter, Wrapper with sequential forward selection
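A minimal sketch of both strategies side by side (assuming scikit-learn; the dataset and the choice of 5 dimensions are arbitrary, and mutual information is used as an information-gain-style filter):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)      # 30 original features

# Feature transformation: project the data onto 5 new principal components
X_pca = PCA(n_components=5).fit_transform(X)

# Feature selection: keep the 5 original features that score highest
# against the class label, leaving their meaning intact
X_sel = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

print(X.shape, "->", X_pca.shape, "and", X_sel.shape)
```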

33. Feature selection and why?

Feature subset selection aims to find the best subset of all available features: the smallest number of dimensions that contribute most to accuracy. The remaining, unimportant dimensions are discarded.

  1. Building a better classifier - Redundant or noisy features can damage accuracy.
  2. Knowledge discovery - Identifying useful features helps us learn more about the data.
  3. Features expensive to obtain - Test a large number of features, select a few for the final system (e.g. sensors, manufacturing).
  4. Interpretability - Selected features still have meaning. We can extract meaningful rules from the classifier.