Training Data

  • We must have good data
  • Garbage in => garbage out
  • Good data is a critical part of building a good model
  • There are several ways to categorize our data:
  • Labeled vs. unlabeled data
  • Structured vs. unstructured data

Labeled data

Data that has been tagged with one or more labels, for example emails marked as spam or not spam. Used to train supervised machine learning models.

Unlabeled data

Data that has not been tagged with any labels. Used to train unsupervised machine learning models, which look for structure in the features alone.

Structured data

Data that is organized in a specific format, such as a table or a database, so that every record follows the same schema. Examples:

  • Tabular data
  • Time series data

Unstructured data

Data that does not have a specific format, such as text, images, or audio. It usually has to be converted into numerical features before a model can be trained on it.
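
Text is a common example of unstructured data. One simple way to give it a fixed, numeric shape is a bag-of-words count vector. The sketch below is only an illustration: the `bag_of_words` helper, the vocabulary, and the sample sentence are assumptions made up for this example.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # One number per vocabulary word, always in the same order.
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and message, chosen only for illustration.
vocabulary = ["free", "meeting", "offer", "tomorrow"]
vector = bag_of_words("Free offer! Claim your FREE prize", vocabulary)
print(vector)  # [2, 0, 1, 0]
```

Every message now maps to a list of the same length, which is the kind of structured input most models expect.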

Types of Supervised Learning

  • Regression: Predict a continuous value, e.g. the price of a house
  • Classification: Predict a category, e.g. whether an email is spam or not
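
The difference between the two tasks shows up in the kind of output the model produces. The following minimal sketch uses tiny made-up datasets and two hypothetical helpers (`fit_line`, `fit_threshold`); they stand in for real models and are not a recommended implementation.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit for one feature: y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

def fit_threshold(values, labels):
    """Learn a cut-off halfway between the two class means."""
    spam = [v for v, y in zip(values, labels) if y == "spam"]
    ham = [v for v, y in zip(values, labels) if y == "not spam"]
    return (mean(spam) + mean(ham)) / 2

# Regression: made-up house sizes (m^2) and prices (in thousands).
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 90 + intercept  # a continuous number

# Classification: made-up count of links per email and its label.
link_counts = [0, 1, 7, 9]
labels = ["not spam", "not spam", "spam", "spam"]
threshold = fit_threshold(link_counts, labels)
predicted_label = "spam" if 6 > threshold else "not spam"  # a category

print(predicted_price, predicted_label)
```

The regression output can be any number on a scale, while the classification output is always one of a fixed set of labels.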

Training vs Validation vs Test set

  • Training set: Data used to train the model
  • Validation set: Data used to tune the model's hyperparameters
  • Test set: Data held back to evaluate the final model's performance on unseen data
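
A three-way split can be sketched as follows. The `split_dataset` helper and the 70/15/15 proportions are assumptions chosen for illustration; the notes do not prescribe specific ratios.

```python
import random

def split_dataset(rows, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the rows, then cut them into train, validation, and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # shuffle so each part is a random sample
    n_test = round(len(rows) * test_fraction)
    n_val = round(len(rows) * val_fraction)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

# 100 placeholder rows standing in for real examples.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Each row ends up in exactly one part, so the test set stays untouched while the model is trained and tuned.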

Feature Engineering

The process of using domain knowledge to select and transform raw data into meaningful features.

  • We can do feature engineering on structured data
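
As a small illustration, a raw transaction record can be turned into features that a model may find more useful than the raw values. The field names (`timestamp`, `amount`) and the `engineer_features` helper are hypothetical, and the chosen features are just one plausible option.

```python
import math
from datetime import datetime

def engineer_features(record):
    """Derive model-friendly features from a raw structured record."""
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        "hour": ts.hour,                              # time of day
        "is_weekend": int(ts.weekday() >= 5),         # Saturday or Sunday
        "log_amount": math.log1p(record["amount"]),   # compress large amounts
    }

# Made-up record; 2024-03-09 falls on a Saturday.
raw = {"timestamp": "2024-03-09T23:15:00", "amount": 1200.0}
features = engineer_features(raw)
print(features)
```

Domain knowledge drives the choices here: for instance, late-night or weekend activity may matter more than the exact timestamp.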

Unsupervised Learning (K-means)

  • No labeled data
  • Data set contains only features
  • Usage: market basket analysis, fraud detection (e.g. anomaly detection with Isolation Forest)
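
To make the clustering idea concrete, here is a minimal K-means sketch in plain Python. It is a teaching sketch under simplifying assumptions: the starting centroids are simply the first `k` points, the iteration count is fixed, and the data points are made up. A real project would more likely use a library implementation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, k, iterations=10):
    centroids = list(points[:k])  # naive initialization (assumption)
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Made-up features only; note that no labels are provided.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The algorithm recovers the two groups from the features alone, which is what "no labeled data" means in practice.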

Semi-supervised Learning

  • Small amount of labeled data
  • Huge amount of unlabeled data
  • Generate labels for the unlabeled data without human annotators (pseudo-labeling)
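
One common way to generate labels without humans is pseudo-labeling: train on the small labeled set, label the unlabeled points the model is confident about, and retrain on both. The sketch below assumes a toy one-dimensional nearest-centroid model; the helper names, the data, and the confidence margin of 2.0 are all illustrative assumptions.

```python
from statistics import mean

def fit_centroids(xs, ys):
    """Toy model: one mean value per class."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)}

def predict(centroids, x):
    """Return (label, margin); a larger margin means a more confident prediction."""
    ranked = sorted(centroids, key=lambda label: abs(x - centroids[label]))
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[ranked[0]])
    return ranked[0], margin

# Small labeled set and a larger unlabeled set (both made up).
labeled_x = [1.0, 2.0, 9.0, 10.0]
labeled_y = ["low", "low", "high", "high"]
unlabeled_x = [1.5, 2.5, 5.4, 8.5, 9.5]

model = fit_centroids(labeled_x, labeled_y)

# Keep only confident predictions as pseudo-labels.
pseudo = []
for x in unlabeled_x:
    label, margin = predict(model, x)
    if margin >= 2.0:
        pseudo.append((x, label))

# Retrain on the human labels plus the pseudo-labels.
model = fit_centroids(
    labeled_x + [x for x, _ in pseudo],
    labeled_y + [y for _, y in pseudo],
)
print(pseudo)
```

The ambiguous point 5.4 sits between the two classes, so it is left unlabeled rather than risk teaching the model a wrong label.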

Reinforcement Learning

  • An agent learns by interacting with the environment
  • The environment gives feedback on each action
  • The feedback is a reward that the agent tries to maximize
  • Usage: game playing, robotics, self-driving cars
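
The environment, feedback, and reward loop can be sketched with tabular Q-learning on a toy problem. Everything below is an assumption made for illustration: a five-cell corridor with the goal in the last cell, a reward of 1.0 only on reaching it, and hand-picked values for `alpha`, `gamma`, and `epsilon`.

```python
import random

N_STATES = 5                  # corridor cells 0..4; cell 4 is the goal
ACTIONS = ("left", "right")   # action 0 and action 1

def step(state, action):
    """Environment: move one cell; reward 1.0 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy choice; ties are broken at random."""
    if rng.random() < epsilon or q_row[0] == q_row[1]:
        return rng.randrange(2)
    return 0 if q_row[0] > q_row[1] else 1

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # value estimate per (state, action)
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # step cap so an episode always ends
            action = choose_action(q[state], epsilon, rng)
            next_state, reward = step(state, action)
            # Q-learning update: nudge the estimate toward reward + discounted future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if state == N_STATES - 1:
                break
    return q

q_table = train()
policy = [ACTIONS[0 if row[0] > row[1] else 1] for row in q_table[:-1]]
print(policy)
```

The agent is never told that "right" is correct; it only sees rewards, and the greedy policy read from the learned table ends up moving toward the goal.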