A structured approach to solving data science problems!

Amit Gairola
7 min read · Oct 3, 2021

While working on numerous data science projects, I have seen that most data scientists adopt a haphazard approach when they work on a data science problem. While it is understandable that data science is as much art as it is science, it is important to have some method to the madness. Most people I see simply take the data and start throwing algorithms at it, hoping to achieve success through brute force. In most cases, this does not result in a favorable outcome. The business users get frustrated, and it then appears that data science is nothing more than a fad.

With experience, I have tried to come up with a 16-step approach (it could be more, it could be less) to bring structure to solving a data science problem.

I will give an overview of the approach, and no, nothing is written in stone. The important thing is that this approach can be adopted, modified and tailored to the needs of the problem at hand.

Step 1: Formulate the business problem

It is extremely important that the data scientist formulates the business problem and identifies the primary objective! This primary objective could be as simple as "I want to forecast sales for the next 6 months", or as complex as finding out why employees leave an organization. Either way, the data scientist must be clear about what he/she is going to solve.

It is equally important to clearly define, and agree with business users on, how the success of the solution will be measured. The measure can be qualitative, though in most cases it is quantitative. An example would be: the accuracy of the weekly sales forecast should be at least 80% averaged over an 8-week period. Another example could be identifying the primary reasons behind employee attrition or customer churn. While the first example is a quantitative measure, the second is qualitative and should be addressed by formulating appropriate hypotheses.

It is extremely important to get a sign-off from business users at this stage to avoid issues later.

Step 2: Identify the right infrastructure and the libraries that are going to be used

It is important to identify the right infrastructure, as this becomes crucial not only in achieving the outcomes but also in ensuring that the machine learning model trains in a reasonable amount of time. For example, a SageMaker environment on AWS, Azure Machine Learning or Google Cloud could be chosen, and the infrastructure planned based on the compute and storage required.

It is also important to identify the libraries and the versions that are going to be used for the project. If the project is going to be built using TensorFlow or scikit-learn on Python, then it is important to pin the right versions of the libraries being used.
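As a simple illustration, a short script run at the start of the project can record the exact versions in use. The specific libraries below (numpy, pandas, scikit-learn) are just placeholders for whatever your project actually depends on.

```python
# Record the Python and library versions used in the project.
# The listed libraries are illustrative; substitute your own dependencies.
import sys

import numpy as np
import pandas as pd
import sklearn

print("Python      :", sys.version.split()[0])
print("numpy       :", np.__version__)
print("pandas      :", pd.__version__)
print("scikit-learn:", sklearn.__version__)
```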

Step 3: Identify the right datasets and define the dependent and independent variables

While we may fancy solving the most complex problems using data science, the core of the solution lies in the data. Most data scientists spend at least 70% of their time working on data, so it is important to identify the right datasets or data sources to work on.

It is also important to identify the "target" variable. In data science, the target variable is what you are trying to predict. Hence it is important to identify the variable/column in the dataset that is going to be defined as the target. The rest of the columns/features become independent variables, which will be used to predict the target variable.
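A minimal sketch of this separation in Python, using a made-up dataset and an assumed target column called weekly_sales:

```python
import pandas as pd

# Toy dataset standing in for a real data source; the column names are illustrative.
df = pd.DataFrame({
    "store_size":   [120, 80, 200, 150],
    "promotion":    ["yes", "no", "yes", "no"],
    "weekly_sales": [2300, 1500, 4100, 2800],
})

target_column = "weekly_sales"           # the dependent / target variable
y = df[target_column]                    # what the model will predict
X = df.drop(columns=[target_column])     # the remaining columns are independent variables

print(X.columns.tolist(), "->", target_column)
```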

Step 4: Identify the subset of data that is going to be used for training of the model

It is important to identify the data that will be used to train the model. As an example, all data up to 30-Aug-2021 will be used for training the model, and the data after that will be used for testing the model's accuracy. Usually the dataset is partitioned into a training set and a testing set, either through random sampling or stratified sampling. Other techniques such as K-fold cross validation or leave-one-out cross validation can also be used to define how the model is trained.
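A rough sketch of both options using scikit-learn, with randomly generated data standing in for the real dataset:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy feature matrix and binary target standing in for the real data.
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# Random 80/20 split; stratify=y keeps the class balance (stratified sampling).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), "training rows,", len(X_test), "testing rows")

# Alternatively, 5-fold cross validation defines how the model is trained and evaluated.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```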

Step 5: Data Preparation and Cleaning

Data preparation and cleaning is a mandatory step that gets the data ready for the model to learn from. Computers can only understand numerical data, so conversions such as one hot encoding of categorical data are important (as an example, if the Gender column has Male, Female and Others, this column is one hot encoded to convert it to numerical values). Another example is the handling of missing values. It is important to define a strategy for missing values, as the data scientist is making a best-guess estimate of each missing value; the median and the mode are typically used for numerical and categorical values respectively.
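As a small illustration of a missing value strategy in pandas (the columns and values are made up):

```python
import numpy as np
import pandas as pd

# Toy data with missing values; column names and values are illustrative.
df = pd.DataFrame({
    "age":    [25.0, np.nan, 40.0, 35.0],
    "gender": ["Male", "Female", np.nan, "Others"],
})

# Median for the numerical column, mode (most frequent value) for the categorical column.
df["age"] = df["age"].fillna(df["age"].median())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
print(df)
```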

Step 6: Exploratory Data Analysis

While Data Science talks about algorithms to solve problems, nothing beats a good exploratory data analysis. An Exploratory Data Analysis or EDA involves visual inspection of data through tools such as Heatmaps for correlation analysis, Bar charts and Scatter plots, Frequency charts or Histograms, Box Plots (Outlier detection) and so on.

EDA gives important insights into relationships that exist within data and is key to feature engineering.
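For example, a few lines with pandas, seaborn and matplotlib can produce a correlation heatmap, a histogram and a box plot; the data below is randomly generated purely to show the calls.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random numeric data standing in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["height", "weight", "sales"])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
sns.heatmap(df.corr(), annot=True, ax=axes[0])   # correlation analysis
sns.histplot(df["sales"], ax=axes[1])            # frequency chart / histogram
sns.boxplot(y=df["sales"], ax=axes[2])           # box plot for outlier detection
plt.tight_layout()
plt.show()
```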

Step 7: Scaling for numerical data

Most data science algorithms are sensitive to the scale of the data. For example, if the dataset has two columns, height in centimeters and weight in kilograms, these are on different scales and need to be brought to the same scale. Most commonly, StandardScaler is used. It computes a z-value (the number of standard deviations a data point is away from the mean): z = (x - mean) / sigma, where x is the data point.

Other types of scaling include min-max scaling (MinMaxScaler) and robust scaling (RobustScaler).
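A short sketch of these three scalers in scikit-learn, using made-up height and weight values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy data: height in centimeters and weight in kilograms, on very different scales.
X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0], [175.0, 72.0]])

X_std = StandardScaler().fit_transform(X)     # z = (x - mean) / sigma
X_minmax = MinMaxScaler().fit_transform(X)    # rescales each column to [0, 1]
X_robust = RobustScaler().fit_transform(X)    # uses median and IQR, less sensitive to outliers

print(X_std.round(2))
```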

Step 8: Categorical feature handling

Categorical features are class values (for example Gender: Male and Female; Income group: High and Low). Algorithms cannot work with such variables directly, so one hot encoding is required. In R, one hot encoding is done by default by most algorithms.
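For illustration, one hot encoding with pandas on made-up gender and income-group columns:

```python
import pandas as pd

# Toy categorical features; the values are illustrative.
df = pd.DataFrame({
    "gender":       ["Male", "Female", "Others", "Female"],
    "income_group": ["High", "Low", "High", "Low"],
})

# One hot encoding turns each class value into its own 0/1 column.
encoded = pd.get_dummies(df, columns=["gender", "income_group"])
print(encoded)
```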

Step 9: Multi-collinearity and Curse of Dimensionality

Multi-collinearity arises when independent variables/columns/features are not truly independent. This essentially means that there is an inherent relationship between columns that are assumed to be independent.

There could also be a case where, after all the steps above, the number of features/dimensions increases dramatically. As an example, the data scientist initially started with 10–15 variables, and after all the feature engineering and data pre-processing this grows to, say, more than 100 variables. Hence it becomes very important to reduce the number of dimensions.

Typically this is done through either Principal Component Analysis (PCA, unsupervised) or Linear Discriminant Analysis (LDA, supervised).

This not only removes multicollinearity but also simplifies the processing of the data by the algorithm.
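A minimal PCA sketch in scikit-learn, assuming a toy feature matrix with 20 columns and keeping enough components to explain 95% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy feature matrix with 20 (possibly correlated) features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print("reduced from", X.shape[1], "to", X_reduced.shape[1], "components")
```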

Step 10: Train the model using an algorithm

This step involves using a library in Python, R or another programming language to train the model on the training data.
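As an illustrative sketch: the dataset here is synthetic, and Logistic Regression is only one possible choice of algorithm.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification dataset standing in for the prepared data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the chosen algorithm on the training set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```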

Step 11: Measure the outcome on the Training Set

The data scientist should measure the prediction accuracy on the training set. This step determines how well the model is able to learn the patterns in the data, usually using the metric defined in Step 1.

Step 12: Predicting the outcome using the Testing Set

The data scientist then uses the testing set to predict outcomes and measures the accuracy, or whichever metric was defined in Step 1 as the measure of success for the project.
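Continuing the sketch from Step 10 (and assuming accuracy is the Step 1 metric), the same measure can be computed on both sets for comparison:

```python
from sklearn.metrics import accuracy_score

# Compare the metric on the training set (Step 11) and the testing set (Step 12).
train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"training accuracy: {train_accuracy:.2%}")
print(f"testing accuracy : {test_accuracy:.2%}")
```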

Step 13: Overfitting?

If the accuracy on the Training set is high (say 95%) and the accuracy on the testing set is poor (say 60%), then it is possible that the model is overfitting (it is not able to generalize and is failing when it encounters unseen data from the testing set).

In such cases it is important to look at:

a) how the training and testing sets were formed, and whether cross validation is necessary;

b) whether L1/L2 regularization can be used to reduce overfitting (see the sketch below).
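A small sketch of both ideas with scikit-learn: 5-fold cross validation combined with the C parameter of Logistic Regression, where a smaller C means stronger L2 regularization (the dataset is synthetic).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic dataset with many features, only a few of which are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Smaller C = stronger L2 regularization; penalty="l1" with solver="liblinear"
# would give L1 regularization instead.
for C in (10.0, 1.0, 0.1):
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"C={C}: mean CV accuracy {scores.mean():.3f}")
```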

Step 14: Explore other algorithms

The data scientist can also explore other algorithms of a similar nature to improve the efficiency and accuracy of the model. If the algorithm in Step 10 was, say, Logistic Regression, then the data scientist can also look at alternatives such as Decision Trees, Gaussian Naive Bayes or Support Vector Machines.

If the data scientist chooses to explore other algorithms, then Steps 10 to 13 need to be repeated for every algorithm considered.
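For illustration, a simple loop over a few candidate algorithms on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset standing in for the prepared data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Repeat Steps 10-13 for each candidate algorithm and compare the Step 1 metric.
candidates = {
    "Logistic Regression":    LogisticRegression(max_iter=1000),
    "Decision Tree":          DecisionTreeClassifier(random_state=42),
    "Gaussian Naive Bayes":   GaussianNB(),
    "Support Vector Machine": SVC(),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy {model.score(X_test, y_test):.3f}")
```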

Step 15: Consider Ensemble techniques

The data scientist can also consider ensemble techniques such as bagging, boosting or stacking to improve the accuracy of the model. Random Forest (bagging), AdaBoost, Gradient Boosting Machines and XGBoost (boosting) can all be considered.
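A quick sketch comparing a few ensemble models from scikit-learn on synthetic data; XGBoost would need the separate xgboost package and is omitted here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic dataset standing in for the prepared data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

ensembles = {
    "Random Forest (bagging)":       RandomForestClassifier(random_state=42),
    "AdaBoost (boosting)":           AdaBoostClassifier(random_state=42),
    "Gradient Boosting (boosting)":  GradientBoostingClassifier(random_state=42),
}
for name, model in ensembles.items():
    print(f"{name}: mean CV accuracy {cross_val_score(model, X, y, cv=5).mean():.3f}")
```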

Step 16: Explainability and Visual Output

Modern times require data scientists to be able to explain their models. Regulations such as GDPR grant a person the right to ask for an explanation of an outcome that was determined by a data science model. For example, if the model predicts a high likelihood of a loan application defaulting and hence recommends rejecting the application, such laws grant the loan applicant the right to ask the bank why the application was rejected, and the bank is required by law to provide an explanation.

Hence explainability of the model becomes very important and it is an arduous exercise to ensure that the model is explainable.

Most business users do not understand statistical explanations, so it is important that the outcome of the model is explained in an intuitive manner. This involves building dashboards with intuitive visuals to explain the model and its outcomes. Usually these are developed in specialized tools such as Power BI, Tableau or Qlik; alternatively, open-source libraries such as Plotly can be used to develop such dashboards.
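One simple, model-agnostic starting point (by no means the only option) is permutation importance from scikit-learn, sketched here on a synthetic dataset; in practice the resulting feature importances would feed the dashboards mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic dataset and model standing in for the real ones.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Permutation importance: how much the test score drops when each feature is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
for i, importance in enumerate(result.importances_mean):
    print(f"feature_{i}: {importance:.3f}")
```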

Note: The steps above are an attempt to bring structure to solving a data science problem. And as mentioned earlier, this is not etched in stone. Building a data science model is a very complex exercise, and these steps are an attempt to put a structured approach around it.

