Machine Learning Fundamentals

How does a machine learn?

Machine Learning is usually divided into three types, depending on the kind of feedback the model learns from:

Supervised: It is like teaching with examples. Models learn from labeled data (inputs paired with the correct answer) to make predictions. If you want to predict how many goals a team will score based on shots on goal, you can use Linear Regression. If you would rather know the probability of victory, you can use Logistic Regression.
Unsupervised: Models find patterns in data that has no labels, for example by grouping players with similar playing styles.
Reinforcement: This is the classic "reward and punishment" system. The model learns by trial and error: it is rewarded when an action works out and penalized when it does not.
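
To make the supervised case concrete, here is a minimal sketch of Linear Regression with a single feature, fitted by ordinary least squares in plain Python. The numbers are invented toy data, not real match statistics, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented toy data: the labels (goals) are what makes this "supervised"
shots_on_goal = [2, 4, 6, 8]
goals = [0, 1, 2, 3]

slope, intercept = fit_line(shots_on_goal, goals)
predicted = slope * 10 + intercept
print(f"Predicted goals for 10 shots on goal: {predicted:.1f}")  # 4.0
```

In a real project a library such as scikit-learn would normally do the fitting. The hand-written version is only meant to show that "learning" here means finding the line that best matches the labeled examples.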

Data Cleaning

This is a key stage in Machine Learning. Before training a model, it is important to prepare the data, for example by encoding categorical variables, scaling numeric ones, and visualizing how the values are distributed.
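
As an illustration of the encoding step, here is a small sketch of one-hot encoding, one common way to turn text categories into numbers a model can use. The `positions` list and the `one_hot_encode` helper are made up for this example.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row has a single 1 marking its category."""
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows

# Hypothetical categorical column: player positions
positions = ["forward", "defender", "midfielder", "forward"]

categories, encoded = one_hot_encode(positions)
print(categories)  # ['defender', 'forward', 'midfielder']
for position, row in zip(positions, encoded):
    print(position, row)
```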

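If one variable measures goals (0–5) and another measures passes (0–500), the variable with the larger range can dominate the model simply because its numbers are bigger. We use Normalization (rescaling everything to between 0 and 1) or Standardization (rescaling to mean 0 and standard deviation 1) to put all variables on a comparable scale.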
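
Both rescalings fit in a few lines. This is a sketch using only the standard library, with invented sample values. It assumes the values are not all identical, since that would cause a division by zero.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Invented sample values on very different scales
goals = [0, 1, 2, 5]
passes = [100, 250, 400, 500]

print(min_max_normalize(goals))   # [0.0, 0.2, 0.4, 1.0]
print(min_max_normalize(passes))  # [0.0, 0.375, 0.75, 1.0]
print([round(z, 2) for z in standardize(passes)])
```

After normalization both columns live in the same 0 to 1 range, so neither one outweighs the other just because of its units.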

Aspect	Normalization (Min-Max)	Standardization (Z-score)
Output Range	Fixed, usually [0, 1] (sometimes [−1, 1])	No fixed bounds (mostly between −3 and 3 when the data is roughly normal)
Outlier Sensitivity	Highly sensitive (extreme values dictate the range)	Less sensitive, although extreme values still shift the mean and standard deviation
Commonly Used With	KNN, Neural Networks	Linear Regression, Logistic Regression, PCA

Outliers are atypical data points. For example, if you have 100 players and one runs 100 km while the others run between 10 and 20 km, that specific player would be an outlier.
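
One common rule of thumb for flagging outliers is the interquartile range (IQR) rule: anything more than 1.5 × IQR below the first quartile or above the third quartile is treated as suspicious. This rule is swapped in here as an illustration and is not something the running example prescribes. The sketch below uses a shortened, invented list of distances.

```python
from statistics import quantiles

def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Invented distances in km: most players between 10 and 20, one at 100
distances_km = [10, 12, 14, 15, 16, 18, 20, 100]

print(find_outliers(distances_km))  # [100]
```

Whether a flagged point should be removed, capped, or kept is a judgment call. It may be a measurement error or a genuinely exceptional player.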

Visualization helps here too: histograms show how the values of a variable are distributed, and correlation heatmaps show how strongly each pair of variables is related.
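
The numbers behind a correlation heatmap are usually Pearson correlation coefficients, one per pair of variables, ranging from −1 (perfect inverse relationship) to +1 (perfect direct relationship). Here is a sketch that computes that matrix by hand. The `stats` dictionary holds invented values, and a plotting library would then color each cell.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented per-match statistics
stats = {
    "shots": [2, 4, 6, 8],
    "goals": [0, 1, 1, 3],
    "fouls": [8, 6, 4, 2],
}

# One row per variable; each cell is what a heatmap would color
for name_a, col_a in stats.items():
    row = [f"{pearson(col_a, col_b):+.2f}" for col_b in stats.values()]
    print(f"{name_a:>5}", row)
```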

Feature Selection

Not every variable is useful for the prediction we want to make, so the set of features needs to be refined. To do this, we can use:

Feature Engineering: Creating new variables from existing ones to help the model discover deeper patterns.
Selection: Simplifying the model by keeping only the variables that provide real value.

Here is an example of transforming raw data into more useful information for models:

Raw Input	Engineered Feature	Why is it useful?
Goals scored + Minutes played	Goals per 90 minutes	Allows for a fair efficiency comparison between a starter and a substitute.
Shots on goal + Goals	Shooting accuracy (%)	Indicates how lethal a player is, rather than just how often they shoot.
Match date	Is it the weekend?	Helps determine if team performance changes depending on the day of the week.
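
The three rows of the table could be computed roughly like this. The function names are made up for the example, and the accuracy formula (goals divided by shots on goal) is my reading of the table rather than a standard definition.

```python
from datetime import date

def goals_per_90(goals, minutes_played):
    """Goals scaled to a full 90-minute match."""
    return goals * 90 / minutes_played

def shooting_accuracy(goals, shots_on_goal):
    """Percentage of shots on goal that ended in a goal (assumed formula)."""
    return goals * 100 / shots_on_goal

def is_weekend(match_date):
    """True for Saturday or Sunday (weekday() returns 5 and 6 for those days)."""
    return match_date.weekday() >= 5

# Invented players: a starter and a substitute
print(goals_per_90(10, 1800))            # 0.5
print(goals_per_90(3, 270))              # 1.0
print(shooting_accuracy(5, 20))          # 25.0
print(is_weekend(date(2024, 6, 15)))     # True (a Saturday)
print(is_weekend(date(2024, 6, 12)))     # False (a Wednesday)
```

Note how the substitute has fewer total goals but the better rate per 90 minutes, which is exactly the comparison the raw totals would hide.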

How do we know if our model is good?

It is recommended to split the data into a Training set (Train), which the model learns from, and a Testing set (Test), typically around 20% of the data, which is held back and used only to verify performance on examples the model has never seen. This helps us see if the model truly learned or simply memorized the data.
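
A hand-rolled version of that split might look like the sketch below. The `split_train_test` helper, the fixed seed, and the list of 100 match IDs are all assumptions for illustration.

```python
import random

def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and hold back test_ratio of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Invented dataset: 100 match IDs
matches = list(range(100))

train, test = split_train_test(matches)
print(len(train), len(test))  # 80 20
```

Shuffling before splitting matters when the rows are ordered (for example by date), although for time-dependent data a chronological split is often the safer choice. In practice, scikit-learn's `train_test_split` covers the same ground.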

Overfitting

This occurs when an AI model becomes "too expert" at recognizing the training data, to the point where it loses the ability to generalize to new data. The telltale sign is a high score on the training set and a much lower score on the test set.

Imagine a student preparing for a math exam. Instead of learning the formulas and the underlying logic, they decide to memorize every exercise and answer in their textbook.

If you give them a practice question that is identical to the one in the book, they will get a perfect score. However, if the actual exam changes a single number or presents a slightly different scenario, the student will fail completely. They didn't learn how to reason; they only learned how to repeat what they memorized.
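
The student analogy can be caricatured in code. In this deliberately extreme sketch, the "memorizer" only looks up answers it has seen, while the "generalizer" has learned the underlying rule. The data (answer = 2 × question) is invented.

```python
# "Textbook" exercises the student studied, and the unseen "exam"
textbook = {1: 2, 2: 4, 3: 6}
exam = {4: 8, 5: 10}

def memorizer(question):
    """Overfit model: only knows the exact exercises it saw."""
    return textbook.get(question)  # None for anything unseen

def generalizer(question):
    """Model that learned the underlying rule instead."""
    return 2 * question

def accuracy(model, data):
    """Fraction of questions in data that the model answers correctly."""
    correct = sum(1 for question, answer in data.items() if model(question) == answer)
    return correct / len(data)

print(accuracy(memorizer, textbook), accuracy(memorizer, exam))      # 1.0 0.0
print(accuracy(generalizer, textbook), accuracy(generalizer, exam))  # 1.0 1.0
```

Real overfitting is rarely this absolute, but the pattern is the same: perfect results on the training data tell you little until you check them against data that was held back.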

Want to see this in code? I've documented a step-by-step project where I apply all of this to predict soccer results: https://byandrev.dev/en/blog/machine-learning-football-project
