There are eight key steps to consider in a machine learning project.
- Frame the problem and understand the big picture.
- Get relevant data.
- Explore the data set and get insights.
- Prepare and clean the data set to expose the underlying patterns to the algorithms.
- Explore different models and identify the best ones.
- Fine-tune the models and combine them into a great solution.
- Present the solution.
- Launch, monitor, and maintain the system.
Let’s look at each step in more detail.
Frame the Problem and Understand the Big Picture
- Define the business objective.
- How will the model be used?
- Identify existing solutions for the problem we want to solve.
- Which machine learning approach should we choose (supervised, unsupervised, reinforcement learning; online or offline; etc.)?
- How should we measure model performance? Is that measure aligned with our objectives?
- Identify the minimum performance needed to reach the business objective.
- What are similar problems and use cases? Can we reuse experience or tools?
- Would human expertise do better than a computer algorithm?
- List all the possible assumptions and verify them.
Note: automate as much as possible at every step in the process.
Get Relevant Data
Note: automate as much as possible so we can easily get fresh data.
- List the data you need and how much you need.
- Identify the data sources and where you can get the data.
- Check how much storage is required and create a workspace.
- Check for legal obligations before accessing any data stores. Get authorization if necessary.
- Convert the data to a format we can manipulate easily.
- Ensure sensitive information is protected or anonymized.
- Check the data type (time series, sample, geographical, etc.) and its size.
- Sample a test set, put it aside, and never look at it.
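For example, a minimal sketch of sampling a test set with scikit-learn, assuming the data has been loaded into a pandas DataFrame named `df` (a hypothetical name):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a test set. A fixed random_state makes the
# split reproducible, so the same rows stay in the test set across runs.
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

# Put test_set aside and do not touch it until the final evaluation.
```

If the dataset will grow over time, splitting on a hash of a stable identifier keeps test instances stable across data refreshes.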
Explore the Dataset and Get Insights
Note: Getting opinions and insights from industry experts is always beneficial.
- Create a copy of the dataset; sampling it down to a manageable size greatly helps the data exploration process.
- Keep a record of our data exploration; a Jupyter notebook (or similar) works well for machine learning projects.
- Study each attribute and its characteristics.
- Identify the target attributes if the model is supervised learning.
- Visualize the data.
- Study the correlations between attributes.
- Identify promising transformations that may be useful.
- Identify and collect extra data that would be useful.
- Document what we have learned.
| Name | Type | % of missing values | Noisiness and type of noise | Possibly useful for the task? | Type of distribution |
|------|------|---------------------|-----------------------------|-------------------------------|----------------------|
| – | categorical | – | stochastic | – | Gaussian |
| – | int/float | – | outliers | – | uniform |
| – | bounded/unbounded | – | rounding errors | – | logarithmic |
| – | text | – | – | – | – |
| – | structured | – | – | – | – |
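The exploration steps above map to a handful of pandas calls. A minimal sketch, assuming a CSV file `data.csv` and a numeric target column named `target` (both hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")      # hypothetical file name

df.info()                         # attribute types and missing-value counts
print(df.describe())              # basic statistics for numeric attributes

# Distribution of each numeric attribute (Gaussian? uniform? skewed?)
df.hist(bins=50, figsize=(12, 8))
plt.show()

# Correlation of each numeric attribute with the target
corr = df.corr(numeric_only=True)
print(corr["target"].sort_values(ascending=False))
```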
Prepare and Clean the Dataset
Note: Keep the original dataset intact and work with copies, so the raw data stays safe.
Write functions for all data transformations, so we can:
- Easily prepare a dataset for fresh data.
- Apply these transformations in future projects.
- Clean and prepare test set.
- Clean and prepare new data instances when our solution is live in production.
- Treat our preparation choices as hyperparameters.
- Data cleaning: fix or remove outliers (optional but often important), and fill in missing values (e.g., with zero, the mean, or the median) or drop the affected rows or columns.
- Feature selection (optional but highly recommended): drop the attributes (features) that are not useful for the task.
- Feature engineering, where appropriate: discretize continuous features; decompose features (e.g., categorical, date/time); add promising transformations of features (e.g., log(x), sqrt(x), x^2); aggregate features into promising new features.
- Feature scaling: standardize or normalize features.
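These steps compose naturally into a single scikit-learn pipeline. A minimal sketch, assuming hypothetical lists `num_features` and `cat_features` naming the numeric and categorical columns of a training frame `X_train`:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numeric attributes: fill missing values with the median, then standardize.
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical attributes: fill missing values with the most frequent
# category, then one-hot encode.
cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_features),   # hypothetical column lists
    ("cat", cat_pipeline, cat_features),
])

X_prepared = preprocessing.fit_transform(X_train)
```

Because the whole transformation is one object, the same `preprocessing` step can be reapplied unchanged to the test set and to fresh data in production.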
Explore Different Models
Note: If we have a huge dataset, it is a good idea to sample smaller training sets so we can train many different models in a reasonable time (note that this penalizes complex models such as large neural networks or random forests).
- Train many quick models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
- Measure and compare their performance: using N-fold cross-validation, compute the mean and standard deviation of the performance measure across the N folds.
- Analyze the types of errors that the models make. What data would a human have used to avoid these errors?
- Have a quick round of feature selection and engineering.
- Identify most promising models.
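A minimal sketch of this comparison with scikit-learn, assuming a prepared training set `X_train`, `y_train` and a classification task:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

models = {
    "linear": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(random_state=42),
}

# 5-fold cross-validation: report the mean and standard deviation of the
# performance measure across the folds for each model.
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```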
Fine-Tune the System
Notes: Use as much data as possible as you move toward the end of fine-tuning.
Don’t tweak the model after measuring the generalization error: It will start overfitting the test set.
- Fine-tune the hyperparameters using cross-validation.
- Try ensemble methods: combining your best models will often perform better than running them individually.
- Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.
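A minimal sketch of hyperparameter tuning with scikit-learn's `GridSearchCV`; the random forest and its parameter grid are illustrative, not a recommendation:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 30],
}

# Search the grid with 5-fold cross-validation on the training set only;
# the test set stays untouched until the final generalization estimate.
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
final_model = search.best_estimator_   # evaluate this once on the test set
```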
Present the Solution
- Document everything we have done.
- Create a presentation. Highlighting the big picture is important.
- Explain the business objective. Mention the model's performance and also show other models' results.
- Present key learnings with clear visualizations. Describe what worked and what did not. List the assumptions and limitations of the model.
Launch the Model
- Do proper testing and launch the model in production with production data inputs.
- Monitor system performance at regular intervals and trigger alerts when it drops.
- As the data evolves, model performance will be affected; beware of slow degradation too.
- Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
- Also monitor the quality of the inputs.
- Retrain models on a regular basis on fresh data.
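As an illustration, a minimal monitoring sketch; the threshold, metric, and alert channel are all hypothetical stand-ins for whatever the business objective dictates:

```python
from sklearn.metrics import accuracy_score

ALERT_THRESHOLD = 0.85   # hypothetical minimum acceptable accuracy

def send_alert(message: str) -> None:
    # Stand-in for a real alerting channel (email, pager, dashboard).
    print("ALERT:", message)

def check_model_performance(y_true, y_pred) -> float:
    """Compare live predictions against ground-truth labels collected since
    the last check (possibly labeled by humans) and alert on degradation."""
    score = accuracy_score(y_true, y_pred)
    if score < ALERT_THRESHOLD:
        send_alert(f"model accuracy dropped to {score:.3f}")
    return score
```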
Learning resources:
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron