With so many developments to learn and follow in the field of data science each day, a set of subjects remains essential. These subjects underpin every new concept and must be understood thoroughly. Many of them are presented here and are worth reviewing when preparing for a job interview or continuing your studies in the field of artificial intelligence.
As the name implies, data science is a branch of science that applies the scientific method to data with the goal of studying the relationships between different features and drawing out meaningful conclusions based on these relationships.
Data is the key component in data science. A dataset is a particular instance of data that is used for analysis or model building at any given time. It contains observations made from a specific phenomenon or context and is usually presented in tabular form, where the rows represent the observations and the columns represent their features or aspects.
There are many different sorts of datasets, such as numerical data, categorical data, text data, image data, voice data, and video data. A dataset can be static (not changing) or dynamic (changing with time). Dynamic datasets can also be described as action datasets, such as transactional datasets (e.g. credit card purchases or phone calls made) or financial datasets (e.g. stock price records).
For beginning data science projects, the most popular type of dataset is a dataset containing numerical data, typically stored in a comma-separated values (CSV) file format.
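As a minimal sketch of the tabular layout described above, the snippet below reads a small CSV into rows (observations) and columns (features) using only Python's standard library. The column names and values are hypothetical, and the inline string stands in for a file on disk.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file on disk.
raw_csv = """height_cm,weight_kg,age
170.0,65.5,34
182.5,80.1,41
158.2,52.3,29
"""

reader = csv.DictReader(io.StringIO(raw_csv))

# The header row gives the feature (column) names.
feature_names = reader.fieldnames

# Each remaining row is one observation; convert every field to a float.
rows = [{name: float(value) for name, value in row.items()} for row in reader]

print(feature_names)
print(len(rows), "observations")
```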
Businesses are becoming much more data-oriented over time, so data is bigger, messier, and more complex now than it has ever been.
Data wrangling is the process of converting data from its raw form to a tidy form ready for analysis. It includes several processes like data importing, data cleansing, data structuring, string processing, HTML parsing, handling dates and times, handling missing data and outliers, and text mining. The raw data is most likely to be in a file, in a database, or extracted from documents such as web pages, tweets, or PDFs.
The process of data wrangling is a critical step for any data scientist. Having perfect data to use in analytics is not easy. Very rarely is the data in a data science project uniform and readily accessible.
Knowing how to wrangle, clean, understand, and prepare data will enable you to derive critical insights from your data that would otherwise be hidden.
Dense and complex information is portrayed in a graphical form designed to make it easy to compare data and use it to tell a story, which can help users make decisions.
Data Visualization is one of the most important branches of data science. It is one of the main tools used to show the relationships between different variables.
Data can be visualized in a number of different forms. Charts are a common way of expressing data, as they depict different data varieties and allow data comparison. The type of chart you use depends primarily on two things: the data you want to communicate, and what you want to convey about that data.
Scatter plots, line graphs, bar plots, histograms, Q-Q plots, smooth densities, box plots, pair plots, and heat maps are some of the most used charts for descriptive analytics.
Data visualization is also used in machine learning for data preprocessing and analysis, feature selection, model building, model testing, and model evaluation.
When preparing a data visualization, keep in mind that it is more of an art than a science. Producing a good visualization usually means putting several pieces of code together to reach an excellent end result. Custom styles and shapes make data easier to understand at a glance, in ways that suit the user’s needs and context.
An outlier is a data point that is very different from the rest of the dataset. Outliers are often just bad data, e.g. due to a malfunctioning sensor, a contaminated experiment, or human error in recording data. However, outliers could indicate something real, such as a malfunction in a system.
Outliers are very common and are expected in large datasets. One common way to detect outliers in a dataset is by using a box plot. Generally speaking, points located more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3 are treated as outliers, where Q1 and Q3 are the first and third quartiles and IQR = Q3 − Q1 is the interquartile range.
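The box plot rule can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the function name `iqr_outlier_mask` and the sensor readings are hypothetical, and quartiles are computed with NumPy's default linear interpolation.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical sensor readings with one suspicious value.
readings = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0])
mask = iqr_outlier_mask(readings)
outliers = readings[mask]

print(outliers)  # [25.]
```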
Outliers can significantly degrade the predictive power of a machine learning model. Models absorb outliers as valid values and apply their effect when predicting unseen occurrences.
A common way to deal with outliers is to simply omit the data points or to replace them with the median. However, removing outliers that are real data can make the model too optimistic, leading to unrealistic models. Advanced methods for dealing with outliers include the RANSAC method.
Most datasets contain missing values. The easiest way to deal with missing data is simply to throw away the data point. However, removing samples or dropping entire feature columns is often not feasible because we might lose too much valuable data. In this case, we can use different interpolation techniques to estimate the missing values from the other training samples in our dataset.
One of the most common interpolation techniques is mean imputation, where we simply replace the missing value with the mean value of the entire feature column. Other options for imputing missing values are median or most frequent (mode), where the latter replaces the missing values with the most frequent values.
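The three strategies just described can be sketched for a single feature column as follows. This is an illustrative sketch with NumPy, where missing values are assumed to be encoded as NaN; the function name `impute_column` and the sample values are hypothetical.

```python
import numpy as np

def impute_column(column, strategy="mean"):
    """Replace NaN entries of one feature column with a summary statistic."""
    column = np.asarray(column, dtype=float)
    observed = column[~np.isnan(column)]
    if strategy == "mean":
        fill = observed.mean()
    elif strategy == "median":
        fill = np.median(observed)
    elif strategy == "most_frequent":
        # Mode: the observed value with the highest count
        # (ties resolve to the smallest such value).
        unique_values, counts = np.unique(observed, return_counts=True)
        fill = unique_values[np.argmax(counts)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.where(np.isnan(column), fill, column)

# Hypothetical feature column with two missing entries.
ages = np.array([22.0, np.nan, 30.0, 22.0, np.nan, 50.0])

print(impute_column(ages, "mean"))           # missing entries become 31.0
print(impute_column(ages, "median"))         # missing entries become 26.0
print(impute_column(ages, "most_frequent"))  # missing entries become 22.0
```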
Whatever imputation method you employ in your model, you have to keep in mind that imputation is only an approximation, and hence can produce an error in the final model.
If the data supplied was already preprocessed, it's important to find out how missing values were handled, including what percentage of the original data was discarded and what imputation method was used to estimate missing values.
Scaling your features will help improve the quality and predictive power of your model. In order to bring features to the same scale, we could decide to use either normalization or standardization of features.
Most often, we assume data is normally distributed and default towards standardization, but that is not always the case. Before deciding whether to use standardization or normalization, first take a look at how your features are statistically distributed.
If the feature tends to be uniformly distributed, then we may use normalization (MinMaxScaler), which rescales values to a fixed range, typically [0, 1]. This can be useful in algorithms that do not assume any distribution of the data, such as K-Nearest Neighbors and Neural Networks. If the feature is approximately Gaussian, then we can use standardization (StandardScaler), which rescales values to zero mean and unit variance.
There is no hard and fast rule to tell you when to normalize or standardize your data. You can always start by fitting your model to raw, normalized and standardized data and compare the performance for best results.
It is a good practice to fit the scaler on the training data and then use this same scaler to transform the testing data. This would avoid any data leakage during the model testing process.
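A minimal sketch of this practice, written with plain NumPy rather than a scaler class: the scaling statistics are computed on the training rows only and then reused unchanged on the testing rows. The synthetic data and the 80/20 split are assumptions made for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 100 samples, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
X_train, X_test = X[:80], X[80:]

# Standardization: statistics are learned from the training set only.
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)
X_train_std = (X_train - train_mean) / train_std
X_test_std = (X_test - train_mean) / train_std  # reuse training statistics

# Normalization (min-max): again fitted on the training set only.
train_min = X_train.min(axis=0)
train_max = X_train.max(axis=0)
X_train_norm = (X_train - train_min) / (train_max - train_min)
X_test_norm = (X_test - train_min) / (train_max - train_min)
```

Because the testing rows are scaled with training statistics, their standardized mean is generally not exactly zero and their normalized values can fall slightly outside [0, 1]. That is expected and is the price of avoiding leakage.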
Again, note that whether you employ normalization or standardization, these are also approximate methods and are bound to contribute to the overall error of the model.
When our original continuous data do not follow a normal distribution, we can log transform the data to make it as “normal” as possible, so that many statistical analyses of the data become valid or much more significant.
The normal distribution, also known as the Gaussian distribution, is the most important probability distribution in statistics for independent, random variables, and is recognized by its familiar bell-shaped curve in statistical reports. It's a continuous probability distribution that is symmetrical around its mean and most of the observations cluster around the central peak.
It's widely used in scientific studies for its many benefits. The normal distribution is simple. Its mean, median and mode have the same value and it can be defined with just two parameters: mean and variance. It also has important mathematical implications such as the Central Limit Theorem.
The log transformation therefore reduces or removes the skewness of our original data. The important caveat here is that the original data has to follow, or approximately follow, a log-normal distribution, which also means its values must be positive. Otherwise, the log transformation won’t produce approximately normal data.
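The effect on skewness can be checked numerically. The sketch below draws synthetic log-normal data (an assumption standing in for a real skewed feature such as income) and compares the sample skewness before and after taking the logarithm; the helper `skewness` is a hypothetical name for the standard third-moment formula.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment divided by variance**1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic, strictly positive, right-skewed data (assumed log-normal).
rng = np.random.default_rng(42)
incomes = rng.lognormal(mean=10.0, sigma=1.0, size=5000)

log_incomes = np.log(incomes)

print("skewness before:", skewness(incomes))      # strongly positive
print("skewness after: ", skewness(log_incomes))  # close to zero
```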
Large datasets with hundreds or thousands of features often contain redundancy, especially when features are correlated with each other.
Training a model on a high-dimensional dataset having too many features can sometimes lead to overfitting (the model captures both real and random effects). In addition, an overly complex model having too many features can be hard to interpret.
One way to solve the problem of redundancy is to use feature selection and dimensionality reduction techniques such as PCA.
Principal Component Analysis (PCA) is a statistical method used for feature extraction. It is suited to high-dimensional and correlated data. The basic idea of PCA is to transform the original feature space into the space of the principal components.
The PCA transformation reduces the number of features to be used in the final model by focusing only on the components accounting for the majority of the variance in the dataset, and also removes the correlation between features.
PCA achieves dimensionality reduction by transforming features into orthogonal component axes of maximum variance in a dataset.
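One way to sketch this idea is through the eigendecomposition of the covariance matrix. The function `pca` and the synthetic three-feature dataset below are illustrative assumptions; library implementations typically use the SVD instead, which gives the same components up to sign.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (covariance eigenvectors)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is appropriate for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return X_centered @ components, explained_ratio

# Synthetic data: the second feature is almost a multiple of the first,
# and the third is low-variance noise.
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2.0 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(scale=0.1, size=(200, 1)),
])

Z, ratio = pca(X, n_components=2)
print(Z.shape)  # (200, 2)
print(ratio)    # the first component carries most of the variance
```

The projected columns of `Z` are uncorrelated with each other, which is the decorrelation property mentioned above.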
Along with PCA, Linear Discriminant Analysis (LDA) is a linear transformation technique for data preprocessing that is often used for dimensionality reduction. It projects the data onto a lower-dimensional feature subspace that can be used in the final machine learning algorithm.
While PCA is an unsupervised algorithm used for feature extraction in high-dimensional and correlated data, LDA is a supervised algorithm used to find the feature subspace that optimizes class separability and reduces dimensionality.
In machine learning, the dataset is often partitioned into training and testing sets. The model is trained on the training dataset and then tested on the testing dataset.
The testing dataset thus acts as the unseen dataset, which can be used to estimate a generalization error (the error expected when the model is applied to a real-world dataset after the model has been deployed).
Many modeling procedures also use a validation set, a separate portion of the data held out for tuning the model before the final test.
Supervised learning algorithms perform learning by studying the relationship between the feature variables and the known target variable.
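A partition into training, validation, and testing sets might be sketched as below. The 60/20/20 proportions, the function name `train_val_test_split`, and the toy arrays are assumptions for illustration; the source does not prescribe particular fractions.

```python
import numpy as np

def train_val_test_split(X, y, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the row indices once, then slice them into three disjoint sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    n_val = int(len(X) * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return (
        (X[train_idx], y[train_idx]),
        (X[val_idx], y[val_idx]),
        (X[test_idx], y[test_idx]),
    )

# Toy dataset: 50 observations, 2 features.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 30 10 10
```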
Supervised learning has two subcategories:
a) Continuous Target Variables: Algorithms for predicting continuous target variables include Linear Regression, K-Neighbors Regression (KNR), and Support Vector Regression (SVR).
b) Discrete Target Variables: Algorithms for predicting discrete target variables include the Perceptron Classifier, Logistic Regression Classifier, Support Vector Machines (SVM), Decision Tree Classifier, K-Nearest Neighbors Classifier, and Naive Bayes Classifier.
In unsupervised learning, we are dealing with unlabeled data or data of unknown structure.
Using unsupervised learning techniques, we are able to explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function.
K-Means clustering is an example of an unsupervised learning algorithm.
In reinforcement learning, the goal is to develop a system (agent) that improves its performance based on interactions with the environment.
Since the information about the current state of the environment typically includes a so-called reward signal, we can think of reinforcement learning as a field related to supervised learning.
However, in reinforcement learning, this feedback is not the correct ground truth label or value, but a measure of how good the action was, as scored by a reward function. Through its interaction with the environment, an agent can then use reinforcement learning to learn a series of actions that maximize this reward.
In a machine learning model, there are two types of parameters:
a) Model Parameters: These are the parameters in the model that must be determined using the training data set. These are the fitted parameters.
b) Hyperparameters: These are adjustable parameters that must be tuned to obtain a model with optimal performance. It is important that during training, the hyperparameters be tuned to obtain the model with the best performance (with the best-fitted parameters).
Cross-validation is a method of evaluating a machine learning model's performance across random samples of the dataset. This helps ensure that the evaluation does not depend on the biases of a single train/test split.
This procedure helps obtain reliable estimates of the model's generalization error, that is, how well the model performs on unseen data.
In k-fold cross-validation, the dataset is randomly partitioned into k folds. In each round, one fold is used as the testing set and the remaining k − 1 folds as the training set; the model is trained on the training set and evaluated on the testing set. The process is repeated k times, so that each fold serves as the testing set once. The average training and testing scores are then calculated by averaging over the k folds.
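The k-fold procedure can be sketched as follows. As an assumption for illustration, the model is an ordinary least-squares line fitted with `np.linalg.lstsq`, the score is the mean squared error, and the data is synthetic; any model and metric could be substituted.

```python
import numpy as np

def k_fold_scores(X, y, k=5, seed=0):
    """Average training and testing MSE of a linear model over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    train_scores, test_scores = [], []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add a column of ones so the model has an intercept.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        train_scores.append(np.mean((A_train @ coef - y[train_idx]) ** 2))
        test_scores.append(np.mean((A_test @ coef - y[test_idx]) ** 2))
    return float(np.mean(train_scores)), float(np.mean(test_scores))

# Synthetic data: y = 3x + 2 plus unit-variance noise.
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 10.0, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=1.0, size=100)

train_mse, test_mse = k_fold_scores(X, y, k=5)
print(train_mse, test_mse)  # both should be near the noise variance of 1
```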
In statistics and machine learning, the bias-variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples and vice versa.
The bias-variance dilemma or problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set.
The bias is an error from erroneous assumptions in the learning algorithm. High bias (overly simple) can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
The variance is an error from sensitivity to small fluctuations in the training set. High variance (overly complex) can cause an algorithm to model the random noise in the training data rather than the intended outputs (overfitting). It is important to find the right balance between model simplicity and complexity.
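A common way to illustrate this tradeoff is to fit polynomials of increasing degree to noisy data. The sketch below is one such illustration under stated assumptions: the sine-shaped target, the noise level, and the degrees 1, 3, and 9 are all arbitrary choices, not something taken from a particular dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine curve on [0, 1] (synthetic, for illustration)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The straight line (degree 1) underfits, so both errors are high. Training error keeps falling as the degree grows, while testing error typically stops improving, and may get worse, once the polynomial starts fitting the noise.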
In machine learning (predictive analytics), there are several metrics that can be used for model evaluation.
For example, a continuous target supervised learning model can be evaluated using metrics such as the R2 score, mean square error (MSE), or mean absolute error (MAE).
A discrete target supervised learning model, also referred to as a classification model, can be evaluated using metrics such as accuracy, precision, recall, F1 score, and the area under the ROC curve (AUC).
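Most of these metrics follow directly from their definitions. The sketch below computes them with NumPy on small hypothetical label arrays; the function names are illustrative, AUC is omitted because it needs predicted scores rather than hard labels, and the classification helper assumes binary labels with at least one predicted and one actual positive.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, MAE, and R2 for continuous targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "mse": float(np.mean(residuals ** 2)),
        "mae": float(np.mean(np.abs(residuals))),
        "r2": float(1.0 - ss_res / ss_tot),
    }

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(2 * precision * recall / (precision + recall)),
    }

reg = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
clf = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 0, 1, 0, 1, 1, 1, 0])
print(reg)  # mse 0.5, mae 0.5, r2 0.9
print(clf)  # accuracy 0.625, precision 0.6, recall 0.75, f1 about 0.667
```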
a) Linear Algebra: This is the core skill in machine learning, from which all other concepts derive. A dataset is represented as a matrix. Linear Algebra is used in data preprocessing, data transformation, dimensionality reduction, and model creation and evaluation. Main topics include: Vectors, Vector Fields, Fundamental Theorem of Linear Algebra, Matrices, Dot Product, Cross Product, Eigenvalues, Eigenvectors, Eigendecomposition, LU Decomposition, Principal Axes Theorem, SVD.
b) Calculus: Most machine learning models are built with a dataset having several features or predictors. Hence, familiarity with multivariable calculus is extremely important for building a machine learning model. Also, models perform predictive modeling by minimizing an objective function, thereby learning the weights that must be applied to the testing data in order to obtain the predicted labels. Main topics include: Functions of several variables, Minimum and Maximum values of a function, Derivatives and Gradients, Step function, Sigmoid function, Logit function, ReLU (Rectified Linear Unit) function, Cost/Loss/Objective function, Likelihood function, Error function, Gradient Descent, Stochastic Gradient Descent, Batch Gradient Descent.
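To make the link between calculus and model fitting concrete, here is a minimal sketch of batch gradient descent minimizing a mean squared error objective for a linear model. The learning rate, step count, function name, and synthetic data are assumptions chosen so that this small example converges; they are not general recommendations.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_steps=500):
    """Learn [intercept, slope...] by following the negative MSE gradient."""
    A = np.column_stack([np.ones(len(X)), X])  # column of ones = intercept term
    weights = np.zeros(A.shape[1])
    for _ in range(n_steps):
        residuals = A @ weights - y
        # Gradient of mean((A w - y)^2) with respect to w.
        gradient = 2.0 * A.T @ residuals / len(y)
        weights -= learning_rate * gradient
    return weights

# Synthetic data: y = 4x - 1 plus a little noise.
rng = np.random.default_rng(7)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 4.0 * X[:, 0] - 1.0 + rng.normal(scale=0.1, size=200)

weights = gradient_descent(X, y)
print(weights)  # approximately [-1, 4]
```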
Statistics and Probability are used for visualization of features, data preprocessing, feature transformation, data imputation, dimensionality reduction, feature engineering, model evaluation, etc. Here are the topics you need to be familiar with:
Mean, Median, Mode, Variance, Standard Deviation, Correlation Coefficient, Covariance Matrix, Probability distributions (Binomial, Poisson, Normal), Central Limit Theorem, p-value, Bayes Theorem (Confusion Matrix, Precision, Recall, Positive Predictive Value, Negative Predictive Value, ROC curve), R2 score, Mean Square Error (MSE), A/B Testing, Monte Carlo Simulation.
A typical data analysis project may involve several parts, each including several data files and different scripts with code. Keeping all these organized can be challenging. Productivity tools help you to keep projects organized and to maintain a record of your completed projects. Some essential productivity tools for practicing data scientists include Unix/Linux, git and GitHub, RStudio, and Jupyter Notebook.