Important Checklist for Any Machine Learning Project

There are eight key steps to consider in any machine learning project.

  1. Frame the problem and understand the big picture.
  2. Get relevant data.
  3. Explore the dataset and get insights.
  4. Prepare and clean the dataset to expose the underlying patterns to the algorithms.
  5. Explore different models and identify the best ones.
  6. Fine-tune the models and combine them into a great solution.
  7. Present the solution.
  8. Launch, monitor, and maintain the system.

Let’s look at each step in more detail.

Frame the Problem and Understand the Big Picture

  1. Define the business objective.
  2. How will the model be used?
  3. What are the current solutions (if any) for the problem we want to solve?
  4. Which machine learning approach should we choose (supervised, unsupervised, reinforcement learning; online or offline, etc.)?
  5. How should we measure model performance? Does that measure align with our objectives?
  6. What is the minimum performance needed to reach the business objective?
  7. What are comparable problems and use cases? Can we reuse experience or tools?
  8. Is human expertise available, and would it outperform a computer algorithm?
  9. List all the assumptions made so far and verify them.

Note: automate as much as possible at every step of the process.

Get Relevant Data

Note: automate as much as possible so we can easily get fresh data.

  1. List the data you need and how much of it you need.
  2. Identify the data sources: where can you get the data?
  3. Check how much storage is required and create a workspace.
  4. Check for legal obligations before accessing any data store. Get authorization if necessary.
  5. Convert the data to a format you can manipulate easily.
  6. Ensure sensitive information is protected or anonymized.
  7. Check the data type (time series, sample, geographical, etc.) and its size.
  8. Sample a test set, put it aside, and never look at it (a minimal split sketch follows this list).
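
A minimal sketch of step 8 using scikit-learn’s train_test_split; the file name and split ratio are placeholder assumptions:

```python
# Minimal sketch: split off a test set once and set it aside.
# The file name "dataset.csv" is a hypothetical placeholder.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("dataset.csv")

# A fixed random_state makes the split reproducible, so the same rows
# land in the test set on every run.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

test_set.to_csv("test_set.csv", index=False)  # put it aside; never look at it
```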

Explore the Dataset and Get Insights

Note: Getting a domain expert’s opinions and insights is always beneficial.

  1. Create a copy of the dataset; sampling it down to a manageable size makes the exploration process much easier.
  2. Keep a record of our data exploration. A Jupyter notebook (or similar) works well for machine learning projects.
  3. Study each attribute and its characteristics.
  4. For supervised learning tasks, identify the target attribute(s).
  5. Visualize the data (see the sketch at the end of this section).
  6. Study the correlations between attributes.
  7. Identify promising transformations that may be useful.
  8. Identify and collect extra data that would be useful.
  9. Document what we have learned.
In particular, record the following for each attribute:

  • Name
  • Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
  • % of missing values
  • Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
  • Possible usefulness for the task
  • Type of distribution (Gaussian, uniform, logarithmic, etc.)
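
The exploration steps above might look like the following pandas/matplotlib sketch; the file name and the "target" column are hypothetical stand-ins:

```python
# Minimal exploration sketch; explore the training copy, never the test set.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train_set.csv")  # hypothetical file name

df.info()             # attribute types and missing-value counts
print(df.describe())  # summary statistics for numeric attributes

df.hist(bins=50, figsize=(12, 8))  # distribution of each numeric attribute
plt.show()

# Correlation of every numeric attribute with a hypothetical "target" column.
corr = df.corr(numeric_only=True)
print(corr["target"].sort_values(ascending=False))
```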

Prepare and Clean the Dataset

Notes: Keep the original dataset intact and work on copies, so the original data stays safe.

Write functions for all data transformations (a pipeline sketch follows the list below), so we can:

  • Easily prepare the dataset the next time we get fresh data.
  • Apply the same transformations in future projects.
  • Clean and prepare the test set.
  • Clean and prepare new data instances once our solution is live in production.
  • Easily treat our preparation choices as hyperparameters.

  1. Data cleaning: fix or remove outliers (optional but often important). Fill in missing values (e.g., with zero, the mean, or the median) or drop the affected rows or columns.
  2. Feature selection (optional but highly recommended): drop the attributes (features) that provide no useful information for the task.
  3. Feature engineering, where appropriate: discretize continuous features; decompose features (e.g., categorical, date/time, etc.); add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.); aggregate features into promising new features.
  4. Feature scaling: standardize or normalize features.
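
One way to write these transformations as reusable functions is a scikit-learn Pipeline with a ColumnTransformer. This sketch assumes hypothetical column names and a median-impute-then-scale strategy:

```python
# Minimal sketch of a reusable preparation pipeline with scikit-learn.
# File and column names are hypothetical stand-ins.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train_set = pd.read_csv("train_set.csv")
num_cols = ["rooms", "income"]  # hypothetical numeric attributes
cat_cols = ["category"]         # hypothetical categorical attribute

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # feature scaling
])

preprocess = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Fit on the training data only; reuse transform() on the test set and on
# new instances in production.
X_train_prepared = preprocess.fit_transform(train_set[num_cols + cat_cols])
```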

Explore Different Models

Notes: If we have a huge dataset, it is a good idea to sample smaller training sets so we can train many different models in a reasonable time (note that this penalizes complex models such as large neural nets or random forests).

  1. Train many quick models from different categories (e.g., linear, naive Bayes, SVM, random forests, neural nets, etc.) using default parameters.
  2. Measure and compare their performance: using N-fold cross-validation, compute the mean and standard deviation of the performance measure across the N folds (see the sketch after this list).
  3. Analyze the types of errors the models make. What data would a human have used to avoid these errors?
  4. Do a quick round of feature selection and engineering.
  5. Identify the most promising models.
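
A minimal sketch of steps 1–2, using synthetic data in place of a real training set:

```python
# Compare several quick baseline models with 5-fold cross-validation.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the real training data.
X_train, y_train = make_regression(n_samples=500, n_features=8, noise=10,
                                   random_state=42)

models = {
    "linear": LinearRegression(),
    "svm": SVR(),
    "forest": RandomForestRegressor(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train,
                             scoring="neg_root_mean_squared_error", cv=5)
    # Report the mean and standard deviation across the 5 folds.
    print(f"{name}: RMSE {-scores.mean():.2f} (+/- {scores.std():.2f})")
```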

Fine-Tune the System

Notes: Use as much data as possible as you move toward the end of fine-tuning.

Don’t tweak the model after measuring the generalization error: it would just start overfitting the test set.

  1. Fine-tune the hyperparameters using cross-validation.
  2. Try ensemble methods. Combining your best models will often perform better than running them individually.
  3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error (sketched below).
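
A minimal sketch of steps 1 and 3 with scikit-learn’s GridSearchCV, again on synthetic stand-in data:

```python
# Tune hyperparameters with cross-validated grid search, then measure the
# generalization error on the held-out test set exactly once.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

param_grid = {"n_estimators": [100, 300], "max_features": [2, 4, 8]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_train, y_train)
final_model = search.best_estimator_

# One-time estimate of the generalization error.
rmse = np.sqrt(mean_squared_error(y_test, final_model.predict(X_test)))
print(f"test RMSE: {rmse:.2f}")
```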

Present the Solution

  1. Document everything we have done.
  2. Create a presentation, making sure to highlight the big picture.
  3. Explain the business objective. Report the final model’s performance and also show the results of the other models tried.
  4. Present key learnings with clear visualizations. Describe what worked and what did not, and list the assumptions and limitations of the model.

Launch the Model

  1. Test thoroughly, then launch the model in production with production data inputs.
  2. Monitor system performance at regular intervals and trigger alerts when it drops (a minimal monitoring sketch follows this list).
    • As the data evolves, model performance will be affected; beware of slow degradation too.
    • Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
    • Also monitor the quality of the inputs.
  3. Retrain models on fresh data at regular intervals.
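
A minimal monitoring sketch; the baseline value, alert threshold, and alerting channel are all hypothetical placeholders:

```python
# Compare recent live performance against a launch-time baseline and
# trigger an alert on degradation. All numbers here are hypothetical.
BASELINE_RMSE = 1.0  # performance measured at launch
ALERT_FACTOR = 1.2   # alert if error grows by more than 20%

def send_alert(message: str) -> None:
    # Placeholder for a real alerting channel (email, Slack, PagerDuty, ...).
    print("ALERT:", message)

def check_model_health(recent_rmse: float) -> None:
    """Alert when the error on recent labeled data drifts above baseline."""
    if recent_rmse > BASELINE_RMSE * ALERT_FACTOR:
        send_alert(f"Model degradation: RMSE {recent_rmse:.2f} "
                   f"vs baseline {BASELINE_RMSE:.2f}")

# Example: feed in the RMSE computed over the most recent labeled batch.
check_model_health(recent_rmse=1.3)
```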


Learning resources:

  • Hands‑On Machine Learning with Scikit‑Learn, Keras, and TensorFlow by Aurélien Géron