Machine Learning: Hugh Donnelly
This article aims to introduce users to basic ML concepts and lay the foundation for future learning and exploration of ML. We will discuss common ML terminology and then cover three Python packages that are used in ML. The article concludes with additional resources for self-study.
Terminology
Jargon is one of the first obstacles for beginners in ML. This section explains some of the most common terms you need to be familiar with.
Statistics versus Machine Learning
Statistical approaches and ML techniques both analyze observations to reveal some underlying process; however, they diverge in their assumptions, terminology, and techniques. Statistical approaches rely on foundational assumptions and explicit models of structure, such as observed samples that are assumed to be drawn from a specified underlying probability distribution. These restrictive a priori assumptions can fail in reality.
In contrast, ML seeks to extract knowledge from large amounts of data with no such restrictions. The goal of ML algorithms is to automate decision-making processes by generalizing (i.e., “learning”) from known examples to determine an underlying structure in the data. The emphasis is on the ability of the algorithm to generate structure or predictions from data without any human help. An elementary way to think of ML algorithms is to “find the pattern, apply the pattern.”
ML techniques are better able than statistical approaches to handle problems with many variables (high dimensionality) or with a high degree of non-linearity. ML algorithms are particularly good at detecting change, even in highly non-linear systems, because they can detect the preconditions of a model’s break or anticipate the probability of a regime switch.
Machine Learning
“Machine Learning” has exploded in popularity over the last few years. But what exactly does it mean and how is it related to other terms like “Artificial Intelligence” or “Deep Learning”? The simplest explanation is to view each of the terms as concentric squares (see image below).
Starting with the outermost square, we have Artificial Intelligence (AI). Put simply, AI is defined as human intelligence exhibited by machines. When most people talk about AI, they're referring to an artificial general intelligence (AGI), which is currently out of reach. Machine learning research is generally pegged as starting with Arthur Samuel's 1959 IBM Journal article "Some Studies in Machine Learning Using the Game of Checkers." Samuel's aim was to write a program that played checkers better than the person who wrote the program. He succeeded and created one of the world's first self-learning programs.
Timeline of developments in AI from Nvidia’s explanation of Deep Learning.
The next level down is ML which is one approach to achieving AI. ML parses data with an algorithm, learns from it, and then uses those learnings to make a prediction about something. Instead of hard-coding software routines with a specific set of instructions, the computer is trained using large amounts of data and generalizes what it learned from the training data to data it hasn’t seen before.
Finally, we get down to Deep Learning, which is a technique for implementing ML. Deep Learning utilizes Artificial Neural Networks (ANNs), which are inspired by the biology of the human brain. ANNs define discrete layers of "neurons" that connect to each other much like biological neurons. Each neuron takes in inputs from the neurons that connect to it, applies a weight to each input, sums the weighted signals, and passes the result on to the neurons in the next layer that it connects to. The network "learns" by adjusting the weights of all the neurons so that the network as a whole makes the most accurate predictions. Recent breakthroughs in Deep Learning have been made possible by (1) cheaper access to powerful compute resources and (2) the growth of big data.
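To make the mechanics concrete, here is a minimal NumPy sketch of a single layer of neurons: each neuron applies a weight to every input it receives, sums the weighted signals, adds a bias, and passes the result through an activation function. The inputs, weights, and biases below are made-up numbers chosen purely for illustration; in a real network they would be learned during training.

```python
import numpy as np

# Hypothetical inputs from three neurons in the previous layer.
inputs = np.array([0.5, -1.2, 3.0])

# One row of weights per neuron in the current layer (two neurons here),
# plus a bias for each. These values are arbitrary; in practice they are
# what the network "learns" by adjusting them during training.
weights = np.array([[0.2, -0.5, 1.0],
                    [0.7,  0.1, -0.3]])
biases = np.array([0.1, -0.2])

# Weighted sum of inputs for each neuron, then a non-linear activation
# (ReLU here) before the signal is passed to the next layer.
z = weights @ inputs + biases
activations = np.maximum(z, 0)  # ReLU: max(0, z)
print(activations)
```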
Summary of ML Algorithms and How to Choose Among Them
ML is broadly divided into three distinct classes of techniques: supervised learning, unsupervised learning, and deep learning. The chart below is a guide to the various ML algorithms organized by algorithm type (supervised or unsupervised) and by type of variables (continuous, categorical, or both).
The graph below presents a stylized decision flowchart for choosing among the ML algorithms shown in the above chart. The dark-shaded ovals contain the supervised ML algorithms; the light-shaded ovals contain the unsupervised ML algorithms; and the key questions to consider are shown in the unshaded boxes.
Observations and Features
Data used in ML is typically referred to as a dataset. You can think of a dataset simply as a table of data. As an example, consider a dataset about video game characters. Each row, sometimes called an “observation”, corresponds to information about a specific character. Each column, sometimes called a “feature”, contains a specific attribute, e.g., hit points, attack, defense.
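As a concrete illustration, the (entirely hypothetical) video game dataset described above might look like this as a pandas DataFrame, with one row per observation and one column per feature:

```python
import pandas as pd

# Hypothetical dataset: each row is an observation (a character),
# each column is a feature (an attribute of that character).
characters = pd.DataFrame({
    "name":       ["Knight", "Mage", "Rogue"],
    "hit_points": [120, 70, 90],
    "attack":     [15, 25, 20],
    "defense":    [20, 5, 10],
})
print(characters.shape)   # (3 observations, 4 features)
print(characters.head())
```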
Feature Selection
An important part of the ML process is "feature selection", or deciding which columns in your dataset should be included in your ML process. Sometimes a dataset is very large, e.g., hundreds of columns, and blindly passing the entire dataset to an ML algorithm will overcomplicate your analysis and erode the predictive ability of your model. Other times, you may have features that are completely irrelevant or even duplicates of other features. In each case, you should aim to remove features that don't help to solve the problem, or use other techniques like Principal Component Analysis (PCA) to reduce the number of features in your dataset.
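In the simplest case, feature selection is just dropping columns that you know are irrelevant or duplicated before modeling. A minimal sketch, using a hypothetical version of the character dataset with an ID column and a duplicate column added:

```python
import pandas as pd

# Hypothetical dataset with an irrelevant ID column and a duplicate column.
df = pd.DataFrame({
    "player_id":  [101, 102, 103],   # irrelevant to the prediction task
    "hit_points": [120, 70, 90],
    "attack":     [15, 25, 20],
    "defense":    [20, 5, 10],
    "defence":    [20, 5, 10],       # exact duplicate of "defense"
})

# Drop features that add no information before passing the data to an ML model.
features = df.drop(columns=["player_id", "defence"])
print(features.columns.tolist())  # ['hit_points', 'attack', 'defense']
```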
PCA is used to summarize or reduce highly correlated features of data into a few main, uncorrelated composite variables. A composite variable is a variable that combines two or more variables that are statistically strongly related to each other. Informally, PCA involves transforming the covariance matrix of the features and involves two key concepts: eigenvectors and eigenvalues. The eigenvectors define new, mutually uncorrelated composite variables that are linear combinations of the original features. As a vector, an eigenvector also represents a direction. Associated with each eigenvector is an eigenvalue. An eigenvalue gives the proportion of total variance in the initial data that is explained by each eigenvector. The PCA algorithm orders the eigenvectors from highest to lowest according to their eigenvalues — that is, in terms of their usefulness in explaining the total variance in the initial data. PCA selects as the first principal component the eigenvector that explains the largest proportion of variation in the data set (the eigenvector with the largest eigenvalue). The second principal component explains the next largest proportion of variation remaining after the first principal component, and this continues for the third, fourth, and subsequent principal components. Because the principal components are linear combinations of the initial feature set, only a few principal components are typically required to explain most of the total variance in the initial feature covariance matrix.
It is important to know how many principal components to retain, because there is a trade-off: keeping only a few components yields a lower dimensional, more manageable view of a complex data set, but at the cost of some loss of information. Scree plots, which show the proportion of total variance in the data explained by each principal component, can be helpful in this regard (see charts below). In practice, the smallest number of principal components that should be retained is the number the scree plot shows as explaining 85% to 95% of total variance in the initial data set.
Scree Plots of Percent of Total Variance Explained by Each Principal Component for Hypothetical DLC 500 and VLC 30 Equity Indexes.
In this illustration, researchers use scree plots and decide that three principal components are sufficient for explaining the returns to the hypothetical DLC 500 and VLC 30 equity indexes over the last 10-year period. The data set consists of index prices and more than 2,000 fundamental and technical features. Multi-collinearity among the features is a typical problem because that many features or combinations of features would tend to have overlaps. To mitigate the problem, PCA can be used to capture the information and variance in the data. The scree plots above show that of the 20 principal components generated, the first 3 together explain about 90% and 86% of the variance in the value of the DLC 500 and VLC 30 indexes, respectively. The scree plots indicate that for each of these indexes, the incremental contribution to explaining the variance structure of the data is quite small after about the fifth principal component. Therefore, these less useful principal components can be ignored without much loss of information.
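The same logic can be reproduced with scikit-learn's PCA on any numeric feature matrix. The sketch below uses randomly generated data purely for illustration: explained_variance_ratio_ gives each principal component's share of total variance (the values a scree plot displays), and its cumulative sum tells you how many components are needed to reach a chosen threshold such as 90%.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustration only: 500 observations of 20 correlated, synthetic features
# driven by 3 underlying latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 20))
X = latent @ loadings + 0.1 * rng.normal(size=(500, 20))

pca = PCA().fit(X)
explained = pca.explained_variance_ratio_   # the scree plot values
cumulative = explained.cumsum()

# Smallest number of components explaining at least 90% of total variance.
n_components = int(np.argmax(cumulative >= 0.90)) + 1
print(n_components)                         # should be about 3 for this data
```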
The feature selection process removes low-value/duplicate features from the dataset before passing it to the ML model.
The main drawback of PCA is that since the principal components are combinations of the data set’s initial features, they typically cannot be easily labeled or directly interpreted by the analyst. Compared to modeling data with variables that represent well-defined concepts, the end user of PCA may perceive PCA as something of a “black box.”
Reducing the number of features to the most relevant predictors is very useful, even when working with data sets having as few as ten or so features. Notably, dimension reduction facilitates visually representing the data in two or three dimensions. It is typically performed as part of exploratory data analysis, before training another supervised or unsupervised learning model. ML models are quicker to train, tend to reduce overfitting (by avoiding the curse of dimensionality), and are easier to interpret if provided with lower dimensional data sets.
Supervised vs Unsupervised Learning
There are broadly two types of machine learning: supervised and unsupervised. Supervised learning uses a labeled dataset to train a model, i.e., we know what the output values for our model should be. Unsupervised learning is the opposite: we don’t know what the output values for our model should be and want to see if we can uncover a hidden structure to the dataset.
Orchestrating the Development Lifecycle of Machine Learning-Based IoT Applications: A Taxonomy and Survey — Scientific Figure on ResearchGate.
In supervised machine learning, the dependent variable (Y) is the target and the independent variables (X’s) are known as features. The labeled data (training data set) is used to train the supervised ML algorithm to infer a pattern-based prediction rule. The fit of the ML model is evaluated using labeled test data in which the predicted targets (Y Predict) are compared to the actual targets (Y Actual).
Supervised learning can be divided into two categories of problems, regression problems and classification problems, with the distinction between them being determined by the nature of the target (Y) variable. If the target variable is continuous, then the task is one of regression (even if the ML technique used is not “regression,” note this nuance of ML terminology). If the target variable is categorical or ordinal (i.e., a ranked category), then it is a classification problem. Regression and classification use different ML techniques.
Regression focuses on making predictions of continuous target variables. Most people are already familiar with multiple linear regression (e.g., ordinary least squares) models, but other supervised learning techniques exist that include non-linear models. These non-linear models are useful for problems involving large data sets with large numbers of features, many of which may be correlated. Some examples of problems belonging to the regression category are using historical stock market returns to forecast stock price performance or using historical corporate financial ratios to forecast the probability of bond default.
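As a sketch of a regression task, the snippet below fits an ordinary least squares model on synthetic data with a continuous target and then compares predicted to actual target values on held-out test data, mirroring the supervised workflow described above. The data are invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic, illustrative data: two features and a continuous target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Train on labeled data, then compare predicted to actual targets on test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred))
```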
Classification focuses on sorting observations into distinct categories. In a regression problem, when the dependent variable (target) is categorical, the model relating the outcome to the independent variables (features) is called a “classifier.” Many classification models are binary classifiers, as in the case of fraud detection for credit card transactions. Multi-category classification is not uncommon, as in the case of classifying firms into multiple credit rating categories. In assigning ratings, the outcome variable is ordinal, meaning the categories have a distinct order or ranking (e.g., from low to high creditworthiness). Ordinal variables are intermediate between categorical variables and continuous variables on a scale of measurement.
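A classification task looks almost identical in code; only the type of target variable and the estimator change. A hedged sketch of a binary classifier on made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic, illustrative data: two features and a binary (0/1) target.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```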
Unsupervised learning is machine learning that does not make use of labeled data. More formally, in unsupervised learning, we have inputs (X’s) that are used for analysis without any target (Y) being supplied. In unsupervised learning, because the ML algorithm is not given labeled training data, the algorithm seeks to discover structure within the data themselves. As such, unsupervised learning is useful for exploring new data sets as it can provide human experts with insights into a data set too big or too complex to visualize.
Two important types of problems that are well suited to unsupervised machine learning are reducing the dimension of data and sorting data into clusters, known as dimension reduction and clustering, respectively.
Dimension reduction focuses on reducing the number of features while retaining variation across observations to preserve the information contained in that variation. Dimension reduction may have several purposes. It may be applied to data with a large number of features to produce a lower dimensional representation (i.e., with fewer features) that can fit, for example, on a computer screen. Dimension reduction is also used in many quantitative investment and risk management applications where it is critical to identify the major factors underlying asset price movements.
Clustering focuses on sorting observations into groups (clusters) such that observations in the same cluster are more similar to each other than they are to observations in other clusters. Groups are formed based on a set of criteria that may or may not be prespecified (such as the number of groups). Clustering has been used by asset managers to sort companies into empirically determined groupings (e.g., based on their financial statement data) rather than conventional groupings (e.g., based on sectors or countries).
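A minimal clustering sketch using scikit-learn's KMeans, with synthetic, unlabeled data and a prespecified number of clusters chosen purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled, synthetic data: no target variable is supplied.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=6.0, scale=0.5, size=(100, 2)),
])

# Here the number of clusters is prespecified as 3.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(np.bincount(kmeans.labels_))  # observations assigned to each cluster
```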
Overfitting
Overfitting is the most common mistake made in ML. The point of training ML models is to generate useful predictions about data we haven’t seen — that is, we want our models to learn from the training data we give them and generalize to new data. It’s useful to consider a dataset as containing signal, the true underlying pattern or relationship we’re interested in, and noise. If we allow our models to strictly fit training data (which in the real world is often quite noisy), predictions for our training dataset will be very accurate but predictions for new data will be very inaccurate. In other words, our model will have overfitted to noise in the training dataset and underfitted to the underlying signal.
The following image illustrates the danger of overfitting. Our model (the blue line) perfectly accounts for every observation (black dot). However, the model is a poor match for the true linear relationship inherent in the data (black line). If we were to give the model a new observation with an input value of -5, the model would predict an output of 10. This is way off from the true value which should be closer to -12.
Ghiles, CC BY-SA 4.0, via Wikimedia Commons.
There are many ways to combat overfitting. It's good practice to split your initial dataset into training and testing sets. We train the model only on the training dataset and then measure how well it generalizes using the testing dataset. Another common technique for combatting overfitting is to penalize the model for complexity. As it learns, the model has to balance the increase in accuracy on the training dataset it gains from added complexity against the extra cost of that complexity.
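One way to see overfitting directly is to compare training and testing accuracy for a deliberately over-complex model. In the sketch below (synthetic data, arbitrary parameters), an unconstrained decision tree typically scores near 100% on the data it memorized but noticeably worse on held-out data, while a depth-limited tree, which is effectively penalized for complexity, generalizes better.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Noisy, synthetic data for illustration.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize the training data (overfitting) ...
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# ... while limiting complexity (max_depth) acts like a penalty on complexity.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("deep tree", deep), ("shallow tree", shallow)]:
    print(name,
          "train:", round(model.score(X_train, y_train), 2),
          "test:", round(model.score(X_test, y_test), 2))
```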
The concepts of underfitting, overfitting, and good (or robust) fitting are illustrated in the graph below. Underfitting means the model does not capture the relationships in the data. The left graph shows four errors in this underfit model (three misclassified circles and one misclassified triangle). Overfitting means training a model to such a degree of specificity to the training data that the model begins to incorporate noise coming from quirks or spurious correlations; it mistakes randomness for patterns and relationships. The algorithm may have memorized the data, rather than learned from it, so it has perfect hindsight but no foresight. The main contributors to overfitting are thus high noise levels in the data and too much complexity in the model. The middle graph shows no errors in this overfit model. Complexity refers to the number of features, terms, or branches in the model and to whether the model is linear or non-linear (non-linear is more complex). As models become more complex, overfitting risk increases. A good fit/robust model fits the training (in-sample) data well and generalizes well to out-of-sample data, both within acceptable degrees of error. The right graph shows that the good fitting model has only one error, the misclassified circle.
Regularization refers to techniques that help guard against overfitting your model to noise in the training data by penalizing complexity. In the regression setting, examples are Ridge Regression and LASSO (least absolute shrinkage and selection operator) Regression. Essentially, they shrink the estimated coefficients toward 0, penalizing a more complicated model, which gives you a better chance of arriving at a simpler, more robust model.
Ridge Regression is a popular type of regularized linear regression that includes an L2 penalty. An L2 penalty shrinks the size of all coefficients, but it never drives a coefficient exactly to zero, so no feature is removed from the model entirely. This has the effect of shrinking the coefficients of those input variables that do not contribute much to the prediction task.
LASSO is a popular type of penalized regression where the penalty term involves summing the absolute values of the regression coefficients. The greater the number of included features, the larger the penalty. So, a feature must make a sufficient contribution to model fit to offset the penalty from including it.
Therefore, penalized regression ensures that a feature is included only if the sum of squared residuals declines by more than the penalty term increases. All types of penalized regression involve a trade-off of this type. Also, since LASSO eliminates the least important features from the model, it automatically performs feature selection.
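A hedged sketch of penalized regression with scikit-learn: Ridge applies the L2 penalty and shrinks all coefficients, while LASSO's L1 penalty can drive the coefficients of unhelpful features all the way to zero, performing feature selection automatically. The data are synthetic and the alpha (penalty strength) values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: only the first 2 of 10 features actually matter.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set coefficients exactly to zero

print("OLS  :", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
print("Lasso:", np.round(lasso.coef_, 2))  # irrelevant features usually end up at 0.0
```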
Regularization methods can also be applied to non-linear models. A long-term challenge of the asset management industry in applying mean–variance optimization has been the estimation of stable covariance matrixes and asset weights for large portfolios. Asset returns typically exhibit strong multi-collinearity, making the estimation of the covariance matrix highly sensitive to noise and outliers, so the resulting optimized asset weights are highly unstable. Regularization methods have been used to address this problem.
Errors
To capture these effects and calibrate the degree of fit, data scientists compare error rates in- and out-of-sample as a function of both the data and the algorithm. Total in-sample error (E_in) is generated by the predictions of the fitted relationship relative to actual target outcomes on the training sample. Total out-of-sample error (E_out) comes from either the validation or test samples. Low or no in-sample error combined with large out-of-sample error is indicative of poor generalization. Data scientists decompose the total out-of-sample error into three sources:
Bias error, or the degree to which a model fits the training data. Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and high in-sample error.
Variance error, or how much the model's results change in response to new data from validation and test samples. Unstable models pick up noise and produce high variance, causing overfitting and high out-of-sample error.
Base error due to randomness in the data.
A learning curve plots the accuracy rate (= 1 − error rate) in the validation or test samples (i.e., out-of-sample) against the amount of data in the training sample, so it is useful for describing under- and overfitting as a function of bias and variance errors. If the model is robust, out-of-sample accuracy increases as the training sample size increases. This implies that error rates experienced in the validation or test samples and in the training sample converge toward each other and toward a desired error rate (or, alternatively, the base error). In an underfitted model with high bias error, shown in the left panel of the graph below, high error rates cause the curves to converge at a level below the desired accuracy rate; adding more training samples will not improve the model. In an overfitted model with high variance error, shown in the middle panel of the below graph, the validation sample and training sample error rates fail to converge. In building models, data scientists try to simultaneously minimize both bias and variance errors while selecting an algorithm with good predictive or classifying power, as seen in the right panel of the below graph.
Learning Curves: Accuracy in Validation and Training Samples versus Training Sample Size.
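Learning curves of this kind can be generated directly with scikit-learn's learning_curve helper. A sketch on synthetic data with an arbitrary estimator:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic, illustrative classification data.
rng = np.random.default_rng(11)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=1000) > 0).astype(int)

# Accuracy in the training and cross-validated samples at several training sizes.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

# In a robust model, validation accuracy rises toward training accuracy
# as the training sample grows.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(int(n), "train:", round(float(tr), 2), "validation:", round(float(va), 2))
```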
Out-of-sample error rates are also a function of model complexity. As complexity increases, error rates on the training set (E_in) fall and bias error shrinks. As complexity increases, however, error rates on the test set (E_out) rise and variance error rises. Typically, linear functions are more susceptible to bias error and underfitting, while non-linear functions are more prone to variance error and overfitting. Therefore, an optimal point of model complexity exists where the bias and variance error curves intersect and in- and out-of-sample error rates are minimized. A fitting curve, which shows in- and out-of-sample error rates (E_in and E_out) on the y-axis plotted against model complexity on the x-axis, is presented in the below graph and illustrates this trade-off.
Fitting Curve Shows Trade-Off Between Bias and Variance Errors and Model Complexity.
Finding the optimal point (managing overfitting risk) — the sweet spot just before the total error rate starts to rise (due to increasing variance error) — is a core part of the ML process and the key to successful generalization. Data scientists express the trade-off between overfitting and generalization as a trade-off between cost (the difference between in- and out-of-sample error rates) and complexity. They use the trade-off between cost and complexity to calibrate and visualize under- and overfitting and to optimize their models.
We have seen that overfitting impairs generalization, but overfitting potential is endemic to the supervised machine learning process due to the presence of noise. So, how do data scientists combat this risk? Two common guiding principles and two methods are used to reduce overfitting:
Preventing the algorithm from getting too complex during selection and training, which requires estimating an overfitting penalty.
Proper data sampling achieved by using cross-validation, a technique for estimating out-of-sample error directly by determining the error in validation samples.
Mitigating overfitting risk by avoiding excessive out-of-sample error is critical to creating a supervised machine learning model that generalizes well to fresh data sets drawn from the same domain. The main techniques used to mitigate overfitting risk in model construction are complexity reduction and cross-validation.
Python Libraries
The Python ML ecosystem is extensive. Here we focus on three open-source libraries for demonstrating and exploring ML techniques:
NumPy describes itself as "the fundamental package for scientific computing with Python." Its implementation of N-dimensional arrays makes it the building block of most numerical Python packages. You could even implement many ML techniques using just NumPy arrays!
The SciPy library (not to be confused with the SciPy organization under which much of the python scientific ecosystem is organized) is “one of the core packages that make up the SciPy stack. It provides many user-friendly and efficient numerical routines, such as routines for numerical integration, interpolation, optimization, linear algebra, and statistics.”
Scikit-learn (also known as sklearn) offers a simple, consistent, well-documented interface for doing ML in Python that's very accessible. It's designed to interoperate with NumPy and SciPy.
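The three libraries are designed to work together: NumPy supplies the array container, SciPy builds numerical and statistical routines on top of it, and scikit-learn consumes NumPy arrays directly through its estimator interface. A small, purely illustrative sketch using all three on random data:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# NumPy: generate and hold the data as an N-dimensional array.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

# SciPy: a classical statistical routine (simple linear regression with a p-value).
slope, intercept, r_value, p_value, std_err = stats.linregress(X[:, 0], y)
print("SciPy slope:", round(slope, 2), "p-value:", p_value)

# scikit-learn: the same fit through its consistent estimator interface.
model = LinearRegression().fit(X, y)
print("sklearn slope:", round(float(model.coef_[0]), 2))
```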
Cheat sheets are available for each of these packages.
Additional Resources
Below are some useful resources to review as you embark on your ML journey.
Intro
ML Cheat Sheet: Simple explanations of common ML terms.
How Machines Learn: An 8-minute video explaining one way (genetic breeding) that machines learn. It’s heavily simplified but still a great summary of some of the core concepts in ML.
Intermediate
An Introduction to Statistical Learning: One of the best introductions to the main concepts of ML. The book also offers code examples, although they’re in R.
Google Machine Learning Crash Course: Google’s self-study, “fast-paced, practical introduction to machine learning.”
Hands-On Machine Learning with Scikit-Learn, Keras, and Tensorflow: Concepts, Tools, and Techniques to Build Intelligent Systems: As one review of this book puts it: “This book really showed me what I was missing: context. It doesn’t just demonstrate different tools, it gives you a framework that you can apply to any problem (chapter 2) and how to think about what you’re doing in each phase of an ML project.” This book offers actual python code examples for many ML models.
Learning from Data: A Short Course: Another excellent book that dives into the math and theory underlying ML. More difficult than the above two books.
Read more posts from this writer at Hugh Donnelly's Medium.