Simple Linear Regression: Supervised Learning Algorithm

Linear Regression is a machine learning algorithm, which is based on the principle of supervised learning.

As we know, supervised learning covers two types of tasks: classification and regression. Linear regression, as the name suggests, performs a regression task.

Regression is used to understand the relationship between dependent and independent variables. It is commonly used to make projections, such as for sales revenue for a given business. Linear regression and polynomial regression are popular regression algorithms.

Linear regression predicts a dependent variable value (y) based on a given independent variable (x). In other words, it finds a linear relationship between x (input) and y (output), hence the name Linear Regression.
In the figure, the linear relationship between X (input) and Y (output) is shown. The regression line (the red line) is the best fit line for our model.

The relationship between the variables is exactly what you saw in high school algebra (the equation of a straight line):

y = θ0 + θ1 * x

x: input training data (univariate – one input variable(parameter))
y: labels to data (supervised learning)

While we train the model, we fit the line that best predicts the value of y for a given value of x. The model gets the best regression fit line by finding the best θ0 and θ1 values.
θ0: intercept
θ1: coefficient of x

As soon as we find the best values for θ0 and θ1, we obtain the best fit line.

The next step is to finally make predictions using the model: it will predict the value of y for a given input value of x.

How do we update the θ0 and θ1 values to get the best fit line? Important question, right? This is where the cost function comes into play.

Cost Function (J):

The cost function is what returns the error value between actual and predicted outputs.
With the best-fit regression line, the model predicts y such that the difference between the predicted value and the true value is as small as possible. That is why it is so important to keep updating θ0 and θ1 until we reach the values that minimise the error between the predicted y value (pred) and the actual y value (y).

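The cost function (J) of linear regression is the Root Mean Squared Error (RMSE) between the predicted y value (pred) and the actual y value (y). In practice the Mean Squared Error (MSE) is often minimised instead; since RMSE is just its square root, both are smallest at the same θ0 and θ1.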
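Written out, the RMSE cost could look like the formula below. The number of training pairs m and the index i are my own notation here, not something fixed by the text above; pred_i is the model output θ0 + θ1 * x_i for the i-th input.

```latex
J(\theta_0, \theta_1)
  = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( \mathrm{pred}_i - y_i \right)^2},
\qquad
\mathrm{pred}_i = \theta_0 + \theta_1 x_i
```

Squaring J gives the mean squared error, which has its minimum at the same θ0 and θ1.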

To update the θ0 and θ1 values so that the cost function decreases (which essentially means minimising the RMSE value) and reach the best fit line, the model uses Gradient Descent. The overall idea is to begin with random θ0 and θ1 values and then iteratively update them until the cost reaches its minimum.
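Here is a minimal sketch of that loop in plain Python. Several things are my assumptions rather than part of the description above: the starting values are zeros instead of random numbers, the learning rate and number of steps are arbitrary, the data is made up, and the gradient is taken of the mean squared error (which has the same minimiser as the RMSE).

```python
def fit_line(xs, ys, learning_rate=0.01, steps=5000):
    """Fit y ~ theta0 + theta1 * x by gradient descent on the mean squared error."""
    theta0, theta1 = 0.0, 0.0  # starting guess (assumption: zeros, not random)
    m = len(xs)
    for _ in range(steps):
        # Prediction error for every training pair with the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad0 = (2.0 / m) * sum(errors)
        grad1 = (2.0 / m) * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


def rmse(xs, ys, theta0, theta1):
    """Root mean squared error of the line on the given data."""
    m = len(xs)
    return (sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / m) ** 0.5


# Made-up data that roughly follows y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

theta0, theta1 = fit_line(xs, ys)
print("intercept:", round(theta0, 3), "slope:", round(theta1, 3))
print("RMSE:", round(rmse(xs, ys, theta0, theta1), 3))
```

If the learning rate is too large for your data the updates can diverge, so treat the defaults as placeholders to tune.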

How to select the values of θ0 and θ1 to begin with?

The idea is to start from a random or rough guess for the intercept and the slope (θ0 and θ1), given x, which is the input, and y, which is the actual output; that’s something the code takes care of. You then keep updating the values in order to achieve the least difference between the actual and predicted values.

Some steps you take to optimise your results in linear regression

So let’s begin with data, as a lot depends on the quality of the data you are working with and how many holes (missing values) it has. Start by describing the data, fill the holes with mean, median or mode values or some other data imputation method, then find the outliers and either treat them the same way or remove them.

Outliers are the values that lie far away from the majority of values.
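As a rough sketch of the outlier step, the snippet below flags values outside the usual 1.5 × IQR fences and replaces them with the median. The 1.5 × IQR rule is my assumption (it is a common convention, not something prescribed above), and the numbers are made up.

```python
import statistics


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    median = statistics.median(values)
    return [v if low <= v <= high else median for v in values]


data = [10, 12, 11, 13, 12, 95, 11, 10]  # 95 lies far from the rest
cleaned = replace_outliers_with_median(data)
print(cleaned)
```

The median is used for the replacement because, unlike the mean, it is barely affected by the outlier itself.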

Then identify which columns have the most impact on the column to be predicted. In a general sense, these can be found by looking at the correlation between each input column and the column to be predicted.
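One simple way to do this is to rank the input columns by the absolute value of their Pearson correlation with the target. The sketch below does that by hand; the column names and numbers are hypothetical, purely for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


# Hypothetical input columns and target column.
columns = {
    "ad_spend": [1, 2, 3, 4, 5],
    "store_id": [3, 1, 4, 1, 5],
}
revenue = [2, 4, 6, 8, 10]

# Strongest (absolute) correlation with the target first.
ranked = sorted(
    ((name, pearson(values, revenue)) for name, values in columns.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(name, round(r, 3))
```

Keep in mind that correlation only captures linear relationships, so a low value does not prove a column is useless.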

Root Mean Square Error (RMSE) and Relative Absolute Error (RAE) are used to measure the error. RMSE gives the typical size of the error the model makes when predicting on the given dataset, in the same units as y, so whether a given value counts as high depends on the scale of the data.

Every dataset has some noise, which causes inherent error in every model. Still, if we get high errors on the dataset, we can try some of the following:

– Remove outliers in the data

– Do feature selection; some of the features may not be informative.

– Try to combine some features to make them more meaningful.

– Maybe the linear regression is underfitting or overfitting the data. You can check this by comparing the error on the training data with the error on held-out data (ROC curves, by contrast, are a tool for classifiers), and then try a more complex model like polynomial regression for underfitting, or regularization for overfitting (which we will discuss in the coming days).

So it depends on a whole lot of parameters that you get to fine-tune when you are running the model. That impacts the results.

Shoot any questions you may have! Tack så mycket (that’s just Swedish for thank you so much, no AI involved there, haha!)

What is Supervised Learning?

Supervised learning is a type of learning method in Machine Learning or Artificial Intelligence. In supervised learning, we use labeled datasets to train algorithms to classify data or predict outcomes accurately. As we feed labelled input data into the model, it adjusts its weights and biases iteratively until the model has been fitted appropriately.

Supervised learning is used in organisations to solve a variety of real-world problems at scale, such as classifying spam in a separate folder from your inbox.

Basic working of Supervised Learning

Supervised learning employs a training set to teach models how to spit out the desired output. The training dataset consists of inputs and correct outputs, which allows the model to learn better over a period of time. The quality of the fit is measured using the loss function, and we keep on adjusting the weights until the error has been sufficiently minimised.

Supervised learning can be separated into two types of problems :

  • Classification, which clearly classifies the data into categories. An algorithm is implemented to accurately assign test data into specific categories. It recognizes specific entities within the dataset and attempts to draw some conclusions on how those entities should be labeled or defined. Common classification algorithms are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbor, and random forest.
  • Regression is used to understand the relationship between dependent and independent variables. It is commonly used to make projections, such as for sales revenue for a given business. Linear regression and polynomial regression are popular regression algorithms (logistic regression, despite its name, is typically used for classification).

Difference between Supervised Learning and Unsupervised Learning

Difference between supervised and unsupervised learning in details:

Shoot any questions you have!

Wanna know how Skydiving feels?

Freedom. Breathlessness. Strong forces of wind. The feeling closest to flying and seeing the world with a whole new perspective. 

That’s how I would like to describe skydiving. 

To be honest, I was always jealous of birds because they can fly, and I can’t. 

But when I tried skydiving, it gave me that perspective I had been looking for, for so long. 

Was I scared the first time I decided to jump? Maybe a little, but I knew nirvana was just a step away, so I gathered all the courage and made that jump. The jump I was yearning for, for years. To see the beauty of the world, and how small a part of the world we are, yet we hold the power to change things with our mind. Such an amazing feeling it was! 

I guess I already answered how I felt while skydiving, and no it does not feel like falling at all. All I experienced was the strongest force of wind and saw breathtakingly stunning views. The beauty I hadn’t imagined ever. The kind of beauty that inspires you. 

When you are free falling at around 120 miles per hour (193 km per hour), your brain is mostly engaged in trying to comprehend what you are seeing and feeling, and tries to communicate that to you. Isn’t it extremely intriguing? The brain is trying to connect the feelings and the amazing views you gather at such an amazingly fast pace, and has some trouble doing it comprehensively.

I felt the freefall – to me it was something magical. Like my brain, unable to comprehend everything is stuck somewhere and my body is free. It’s something so difficult to explain in mere words, but let’s try, shall we? It’s like the brain is staring down at the ground trying to figure out what’s really happening as you fall from the airplane, and your body has already exited, leaving your mind some place else. 

Your body is free and you experience just wind and the adrenaline rushing all inside your body. 

It was only when the parachute opened that my brain could start to get a sense of what was really happening, in a very very amazing way!

Hold on a moment, and imagine this experience playing in your head – like maybe a scene from a movie? Man, isn’t it amazing? 

Those 60 seconds, and 14,000 ft, were way beyond any description. No words can do them any justice. The gazillion sensations in your body and the adrenaline are not something words could capture, but you get the idea.

The beauty, the freedom, the breathlessness, the strong wind, the adrenaline rush and the sense of being truly yourself, detached from everything, just an entity in this amazingly beautiful world. A small entity, who holds the power to do wonders. That’s what skydiving is to me. 

Singular Value Decomposition (Unsupervised Algorithms)

Singular value decomposition is used to reduce a dataset containing a large number of values to a dataset with significantly fewer values. This reduced dataset will still contain a large fraction of the variability present in the original data. It is used to extract and untangle information, like PCA.

Eigenvalues and Eigenvectors

An eigenvector of an n × n matrix A is a nonzero vector x such that Ax = λx for some scalar λ. A scalar λ is called an eigenvalue of A if there is a nontrivial solution x of Ax = λx; such an x is called an eigenvector corresponding to λ.

Example: In the first example, recall that we have Av = 4v. Thus, v is an eigenvector of A with corresponding eigenvalue λ = 4. If you also try Aw, you will find that Aw = w, so w is an eigenvector of A with corresponding eigenvalue 1.
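The matrix from that example is not reproduced in this post, so here is a quick check with a hypothetical 2 × 2 matrix that I picked because it also has eigenvalues 4 and 1. The matrix A and the vectors v and w below are my own illustration, not the original example.

```python
def matvec(A, x):
    """Multiply matrix A (a list of rows) by vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


# Hypothetical matrix, chosen so that its eigenvalues are 4 and 1.
A = [[2, 2],
     [1, 3]]
v = [1, 1]   # candidate eigenvector for eigenvalue 4
w = [2, -1]  # candidate eigenvector for eigenvalue 1

Av = matvec(A, v)
Aw = matvec(A, w)
print(Av, "equals 4*v:", Av == [4 * c for c in v])
print(Aw, "equals 1*w:", Aw == w)
```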


A matrix A is diagonalizable if we can rewrite it as a product A = P D P^{-1}, where P is an invertible matrix (and thus P^{-1} exists) and D is a diagonal matrix (where all off-diagonal elements are zero).

Since P is invertible, it must be square; hence, diagonalization really only makes sense for square matrices.

Another useful note is that if A = P D P^{-1}, then A P = P D. If we define P through its columns p_i and D through its diagonal entries d_i, we can consider the columns of P separately from each other: A p_i = d_i p_i. So the columns of P must be eigenvectors of A, and the values on the diagonal of D must be the corresponding eigenvalues of A.

In a sense, the singular value decomposition is essentially diagonalization in a more general setting.

Let’s start by reviewing the matrix transformations

Singular value decomposition

Singular value decomposition (or SVD) is a factorization of a matrix. In fact, it is a generalized version of eigenvalue decomposition. Before, for eigenvalue decomposition, we needed to have square matrices, and a matrix of size n × n would have at most n distinct eigenvalues (possibly fewer if some values repeat). With SVD the square-matrix requirement no longer applies.

Given an m × n matrix A with m > n, A can be factorized by SVD into three matrices:

– U is an m × n matrix with orthonormal columns, so that U^T U = I_n,
– S is an n × n diagonal matrix,
– V is an n × n orthogonal matrix satisfying V V^T = V^T V = I_n,

such that A = U S V^T. The entries in the diagonal matrix S are known as the singular values of A. They turn out to be the square roots of the eigenvalues of the square matrix A^T A. So, if A is a real symmetric matrix with positive eigenvalues, then the singular values and eigenvalues are the same. However, this is not true in general. It is important to realize they are related, but distinct factorizations.
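To make the link between singular values and A^T A concrete, here is a small sketch that computes the singular values of a 2 × 2 matrix by hand. I use a square matrix only to keep the arithmetic short, and the example matrix is made up; for real work you would call a library SVD routine instead.

```python
import math


def singular_values_2x2(A):
    """Singular values of a 2x2 matrix via the eigenvalues of A^T A."""
    (a, b), (c, d) = A
    # Entries of the symmetric matrix A^T A = [[p, q], [q, r]].
    p = a * a + c * c
    q = a * b + c * d
    r = b * b + d * d
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    mean = (p + r) / 2
    spread = math.sqrt(((p - r) / 2) ** 2 + q * q)
    eigenvalues = (mean + spread, max(mean - spread, 0.0))
    # Singular values are the square roots of those eigenvalues.
    return tuple(math.sqrt(e) for e in eigenvalues)


A = [[3.0, 0.0],
     [4.0, 5.0]]  # made-up example matrix
s1, s2 = singular_values_2x2(A)
print("singular values:", round(s1, 4), round(s2, 4))
# Sanity check: the product of the singular values equals |det(A)|.
print("product:", round(s1 * s2, 4), "|det A|:", abs(3.0 * 5.0 - 0.0 * 4.0))
```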

Image Compression Using SVD (Singular Value Decomposition):

Applications: Data Compression

The SVD is a thoroughly useful decomposition, useful for a whole ton of stuff. I’d like to quickly provide you with some examples, just to show you a small glimpse of what this can be used for in computer science, math, and other disciplines.

One application of the SVD is data compression. Consider some matrix A with rank five hundred; that is, the columns of this matrix span a 500-dimensional space. Encoding this matrix on a computer is going to take quite a lot of memory! We might be interested in approximating this matrix with one of lower rank – how close can we get to this matrix if we only approximate it as a matrix with rank one hundred, so that we only have to store a hundred columns? What if we use a matrix of rank twenty? Can we summarize all of the information in this very dense, 500-rank matrix with only a rank twenty matrix?

It turns out that you can prove that taking the k largest singular values of A, replacing the rest with zero (to form Σ′, where Σ is the diagonal matrix of singular values, called S above), and recomputing UΣ′V^T gives you the provably best rank-k approximation to the matrix. Not only that, but the sum of the first k singular values divided by the sum of all the singular values is a rough measure of the percentage of “information” that those singular values contain. If we want to keep 90% of the information, we just need to compute sums of singular values until we reach 90% of the total, and discard the rest of the singular values. This yields a quick and dirty compression algorithm for matrices: take the SVD, drop all but a few singular values, and then recompute the approximated matrix. Since we only need to store the columns of U and V that actually get used (many get dropped since we set elements on the diagonal of Σ to zero), we greatly reduce the memory usage.
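The "keep 90% of the information" rule can be sketched in a few lines. The function name and the list of singular values below are made up for illustration; in practice the values would come from an SVD routine.

```python
def components_to_keep(singular_values, target=0.90):
    """Smallest number of leading singular values whose sum reaches `target` of the total."""
    ordered = sorted(singular_values, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)


# Made-up singular values; their sum is 100, so the shares are easy to read.
spectrum = [50.0, 25.0, 20.0, 3.0, 2.0]
k = components_to_keep(spectrum)
print("keep", k, "of", len(spectrum), "singular values")
```

Here the first two values only cover 75% of the total, so a third one is needed to pass the 90% mark.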
Here’s a tiger:
We can convert this tiger to black and white, and then just treat this tiger as a matrix, where each element is the pixel intensity at the relevant location. Here are the singular values of this tiger:
Note that this is a log scale (base 10). Most of the action and the largest singular values are the first thirty or so, and they contain a majority of the “information” in this matrix! We can plot the cumulative percentage, to see how much the first thirty or fifty singular values contain of the information:
After just fifty of the singular values, we already have over 70% of the information contained in this tiger! Finally, let’s take some approximations and plot a few approximate tigers:
Note that after about thirty or fifty components, adding more singular values doesn’t visually seem to improve image quality. By a quick application of SVD, you’ve just compressed a 500×800 pixel image into a 500×50 matrix (for U), 50 singular values, and an 800×50 matrix (for V).
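To see what that saves, a quick count of stored numbers helps. This is only back-of-the-envelope arithmetic under the assumption that every entry costs the same amount of memory; the function name is my own.

```python
def storage_counts(rows, cols, k):
    """Numbers stored for the full image vs. a rank-k SVD approximation."""
    full = rows * cols
    # k columns of U, k singular values, k columns of V.
    truncated = rows * k + k + cols * k
    return full, truncated


full, truncated = storage_counts(500, 800, 50)
print("full image:", full)
print("rank-50 factors:", truncated)
print("fraction kept:", round(truncated / full, 3))
```

So the rank-50 version needs roughly a sixth of the original storage.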

The MATLAB code for generating these is incredibly straight-forward, as follows below.
Low-Rank Matrix Approximation Image Compression

Shoot any questions you have in the LinkedIn comments! (or here)


Lost in Stockholm

“I hear echoes of a thousand screams

As I lay me down to sleep

There’s a black hole deep inside of me

Reminding me

That I lost my backbone

Somewhere in Stockholm

I lost my backbone

Somewhere in Stockholm”

As these lyrics play in the background, I think of Stockholm and how I actually found my home 7064 km (or 4389 miles) away.

I love everything about that city, absolutely everything. I love how the sun rises and sets weird there, I love the summer with so much sun and I love the winter without it. I love how you can roam around freely until 4 am if you want and you feel safe, even when alone. I love what I feel for the city and how it makes me feel warm in the heart. I love some of the people I met there, and the career it gave me. The purpose it gave me.

Stockholm’s just stunningly beautiful, with such beautiful beaches and forests, and exceptionally amazing museums. The land of Abba, Opeth and Avicii, the place that has something to offer for everyone. The place that accepts you however you are. 


Stockholm and some people I met there gave me the strength to be myself, that I wasn’t the misfit I thought I was. It gave me the courage to talk about my goals and aspirations, and that if I really want, I can achieve them. 

The Avicii Tribute Concert held every year in Stockholm on December 5, taught me in a way how important mental health is, how it doesn’t discriminate against anyone. That it’s not only okay to talk about mental health, but it’s important to do so. We can’t lose more people to mental health, can we?

I also went to see the Sweden vs Spain Euro qualifier, and damn, Spain equalised with barely 2 minutes to spare, but the experience was something beautiful, you only have to be there to know that feeling.

And of course my fun trips to Ikea, where I always spent at least 4 times what I had planned – no no I am not a bad shopper, Ikea is exceptionally good at selling stuff! (You didn’t think I would write about Sweden and not mention Ikea, right?)

Me in Stockholm!

I also had the best time volunteering for Stockholm’s Nobel Nightcap and meeting some of the Nobel Prize Laureates. I am featured in the aftermovie, dressed as a Beatle.

So many good memories, amazing people I met there, the poker nights I had with my friends, the new year’s bonfire, the Diwali get together, the India day hosted by KTH Royal Institute of Technology in Stockholm and trying out new outdoor activities during the summer. It was just amazing.

When I moved from India to Sweden in January 2019, I did see a cultural gap, but the one I enjoyed to the utmost level. I enjoy freedom and open thoughts, and that’s exactly what I got there. Not to mention, my professor at KTH, a tall Polish dude (three times my height) was the best mentor anyone could ask for. Probably nocturnal like me, as he used to reply to every email even in the middle of the night and always checked-in on me (also my lab-mates were awesome). 

I learnt so many things after moving to Sweden such as speaking a little bit of Swedish, cooking amazing food, baking the best cheesecakes with strawberry syrup (ask my friends), the feeling behind fika and the importance of Lagom. 

Maybe next time, I will talk about my travels there (which are breathtaking by the way), but this is what Stockholm makes me feel at heart. It’s the home I found 4389 miles away. 

Relationship Between Mind and AI : Talk by Me

How can neuroscience benefit from AI? As we all know, brains are far too complex for us to understand at present. I read a book called “The Psychopath Inside” by James Fallon, where he explains the brain in terms of a 3*3 Rubik’s cube (it’s still so impossibly difficult to understand and visualise without prior knowledge). 

This is where AI jumps in, according to me, and can be employed in a number of ways. Using AI we can produce new tools or applications to come up with connections or general theoretical principles. This will help us understand the complex machine that our brain is. 

Read how AI can help understand our minds in the interview with Smriti Mishra.

Link to the interview:

Tune in to hear my talk during the NDSML Summit in the Data and ML Engineering track. This track focuses on case studies on applying agile approaches to designing, implementing and maintaining distributed data architecture and pipelines to support a wide range of tools, models and frameworks in production.

Also, please feel free to use my code to get 10% off on the registration!

Thank you for reading! See you during the event 🙂

Hierarchical Clustering (Unsupervised Learning algorithm)

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. 
– You start with raw unlabelled data and the endpoint is a set of clusters. 
– Each cluster is different from the other cluster, and the objects within each cluster are similar to each other.

How does it work? 
– Hierarchical clustering starts by treating each observation as a separate cluster. 
– Then it iteratively executes the following two steps: (1) identify the two clusters that are closest together, and (2) merge the two most similar clusters. 
– This iterative process continues until all the clusters are merged together. 
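The loop described above can be sketched in plain Python. To keep it short I made two assumptions: the points are one-dimensional, and the distance between two clusters is the distance between their closest members (single linkage). Other linkage rules would only change the distance function.

```python
def single_linkage_distance(c1, c2):
    """Distance between two clusters = distance between their closest members."""
    return min(abs(a - b) for a in c1 for b in c2)


def agglomerate(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [[p] for p in points]  # every observation starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Step 1: find the pair of clusters that are closest together.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 2: merge that pair into one cluster.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


points = [1.0, 2.0, 9.0, 10.0, 25.0]  # made-up 1-D observations
history = agglomerate(points)
for left, right, distance in history:
    print(left, "+", right, "at distance", distance)
```

The merge history, read from first to last, is exactly the information a dendrogram draws from bottom to top.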

Hierarchical clustering is of two types:

(1) Agglomerative (AGNES)

You build clusters in a bottom-up approach. Agglomerative clustering is the most common type of hierarchical clustering and is used to group objects in clusters based on their similarity. It’s also known as AGNES (Agglomerative Nesting). The algorithm starts by treating each object as a singleton cluster. Next, pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all objects. The result is a tree-based representation of the objects, called a dendrogram.

(2) Divisive (DIANA)

You build clusters in a top down approach (DIANA). It starts by including all objects in a single large cluster. At each step of iteration, the most heterogeneous cluster is divided into two. The process is iterated until all objects are in their own cluster.

Types of hierarchical clustering.

Thanks for reading, shoot any questions you have!

Principal Component Analysis (PCA) – Unsupervised Learning

PCA is one of the best-known techniques for dimension reduction, feature extraction, and data visualization.
– PCA is a method that rotates the dataset in such a way that the rotated features are statistically uncorrelated.
– This rotation is often followed by selecting only a subset of the new features, based on how much of the variance in the data they explain.
– PCA is used to reduce the number of independent variables in a dataset and is applicable when the ratio of data points to independent variables is low.
– PCA builds linear combinations of the original variables such that each resulting component captures the maximum remaining variance.
– Every principal component will ALWAYS be orthogonal (perpendicular) to every other principal component, and hence the components are linearly independent of each other.

Steps for PCA:
(1) Standardising or scaling the data
(2) Computing the covariance matrix
(3) Calculating the eigenvectors and eigenvalues
(4) Computing the principal components
(5) Reducing the dimension of the data by selecting the best components without losing much information
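The five steps can be traced by hand for the smallest interesting case: two features reduced to one component. The sketch below is limited to two columns so that the eigenvalues have a closed form; the data is made up, and a real project would use a linear algebra library instead.

```python
import math


def standardise(column):
    """Step 1: subtract the mean and divide by the sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / std for v in column]


def covariance(u, v):
    """Sample covariance of two columns that already have mean zero."""
    return sum(a * b for a, b in zip(u, v)) / (len(u) - 1)


def pca_2d(xs, ys):
    zx, zy = standardise(xs), standardise(ys)
    # Step 2: covariance matrix [[a, b], [b, c]].
    a, b, c = covariance(zx, zx), covariance(zx, zy), covariance(zy, zy)
    # Step 3: eigenvalues of a symmetric 2x2 matrix, largest first.
    mean = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b * b)
    eigenvalues = [mean + spread, mean - spread]
    # Eigenvector that belongs to the largest eigenvalue.
    if abs(b) > 1e-12:
        vec = (b, eigenvalues[0] - a)
    else:
        vec = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vec[0], vec[1])
    pc1 = (vec[0] / norm, vec[1] / norm)
    # Steps 4 and 5: project on the first component only (2 dimensions -> 1).
    scores = [x * pc1[0] + y * pc1[1] for x, y in zip(zx, zy)]
    explained = eigenvalues[0] / (eigenvalues[0] + eigenvalues[1])
    return pc1, scores, explained


# Made-up, strongly correlated features.
xs = [2.0, 4.0, 6.0, 8.0, 10.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]
pc1, scores, explained = pca_2d(xs, ys)
print("first component:", [round(c, 3) for c in pc1])
print("share of variance kept:", round(explained, 4))
```

Because the two features move together, a single component keeps nearly all of the variance.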

PCA Algorithm


Example of the PCA Algorithm 

Let’s try displaying the Times stories using the principal components. 
– First we make an empty plot— just to set up the axes nicely for the data which will actually be displayed. 
– Then plot a blue “m” at the location of each music story, and a red “a” at the location of each art story. 
– Although we have gone from 4431 dimensions to 2, and so thrown away a lot of information, we could draw a line across this plot and have most of the art stories on one side of it and all the music stories on the other.
– If we let ourselves use the first four or five principal components, we’d still have a thousand-fold savings in dimensions, but we’d be able to get almost-perfect separation between the two classes.
– This is a sign that PCA is really doing a good job at summarizing the information in the word-count vectors, and in turn that the bags of words give us a lot of information about the meaning of the stories. 

Figure explains : Projection of the Times stories on to the first two principal components. Labels: “a” for art stories, “m” for music.

Let’s discuss the rest in comments! Thanks for reading.

Hidden Markov Models (Unsupervised Learning Algorithms)

It is one of the more elaborate ML algorithms: a statistical model that analyzes the features of data and groups it accordingly.

The HMM is based on augmenting the Markov chain. 
– A Markov chain is a model that tells us something about the probabilities of sequences of random variables (states), each of which can take on values from some set.
– These sets can be words, or tags, or symbols representing anything, like the weather.
– A Markov chain makes a very strong assumption: if we want to predict the future in the sequence, all that matters is the current state.
– The states before the current state have no impact on the future except via the current state.
– Example: It’s as if to predict tomorrow’s weather you could examine today’s weather but you weren’t allowed to look at yesterday’s weather.

It finds use in Pattern Recognition, Natural Language Processing (NLP), data analytics, etc. 

A Markov process.

A simple weather model

The probabilities of weather conditions (modeled as either rainy or sunny), given the weather on the preceding day, can be represented by a transition matrix:

$$P = \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix}$$

The matrix P represents the weather model in which a sunny day is 90% likely to be followed by another sunny day, and a rainy day is 50% likely to be followed by another rainy day. The columns can be labelled “sunny” and “rainy”, and the rows can be labelled in the same order.

$P_{ij}$ is the probability that, if a given day is of type i, it will be followed by a day of type j.

Notice that the rows of P sum to 1: this is because P is a stochastic matrix.

Predicting the weather

The weather on day 0 (today) is known to be sunny. This is represented by a vector in which the “sunny” entry is 100%, and the “rainy” entry is 0%:

$$\mathbf{x}^{(0)} = \begin{bmatrix} 1 & 0 \end{bmatrix}$$

The weather on day 1 (tomorrow) can be predicted by:

$$\mathbf{x}^{(1)} = \mathbf{x}^{(0)} P = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.9 & 0.1 \end{bmatrix}$$

Thus, there is a 90% chance that day 1 will also be sunny.

The weather on day 2 (the day after tomorrow) can be predicted in the same way:

$$\mathbf{x}^{(2)} = \mathbf{x}^{(1)} P = \mathbf{x}^{(0)} P^{2} = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix}^{2} = \begin{bmatrix} 0.86 & 0.14 \end{bmatrix}$$

or

$$\mathbf{x}^{(2)} = \mathbf{x}^{(1)} P = \begin{bmatrix} 0.9 & 0.1 \end{bmatrix} \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.86 & 0.14 \end{bmatrix}$$

General rules for day n are:

$$\mathbf{x}^{(n)} = \mathbf{x}^{(n-1)} P$$

$$\mathbf{x}^{(n)} = \mathbf{x}^{(0)} P^{n}$$

Steady state of the weather

In this example, predictions for the weather on more distant days are increasingly inaccurate and tend towards a steady state vector. This vector represents the probabilities of sunny and rainy weather on all days, and is independent of the initial weather.

The steady state vector is defined as:

$$\mathbf{q} = \lim_{n \to \infty} \mathbf{x}^{(n)}$$

but converges to a strictly positive vector only if P is a regular transition matrix (that is, there is at least one $P^{n}$ with all non-zero entries).

Since q is independent of the initial conditions, it must be unchanged when transformed by P. This makes it an eigenvector (with eigenvalue 1), and means it can be derived from P. For the weather example:

$$
\begin{aligned}
P &= \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} \\
\mathbf{q} P &= \mathbf{q} && \text{($\mathbf{q}$ is unchanged by $P$.)} \\
&= \mathbf{q} I \\
\mathbf{q} (P - I) &= \mathbf{0} \\
\mathbf{q} \left( \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) &= \mathbf{0} \\
\mathbf{q} \begin{bmatrix} -0.1 & 0.1 \\ 0.5 & -0.5 \end{bmatrix} &= \mathbf{0} \\
\begin{bmatrix} q_{1} & q_{2} \end{bmatrix} \begin{bmatrix} -0.1 & 0.1 \\ 0.5 & -0.5 \end{bmatrix} &= \begin{bmatrix} 0 & 0 \end{bmatrix} \\
-0.1 q_{1} + 0.5 q_{2} &= 0
\end{aligned}
$$

and since the entries of q form a probability vector, we know that $q_{1} + q_{2} = 1$.

Solving this pair of simultaneous equations gives the steady state distribution:

$$\begin{bmatrix} q_{1} & q_{2} \end{bmatrix} = \begin{bmatrix} 0.833 & 0.167 \end{bmatrix}$$

In conclusion, in the long term, about 83.3% of days are sunny.
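The whole weather example fits in a few lines of plain Python. The sketch below uses the transition matrix from the example; the helper name and the choice of 100 iterations for approximating the limit are my own.

```python
def step(x, P):
    """One day forward: multiply the row vector x by the transition matrix P."""
    return [sum(x[i] * P[i][j] for i in range(len(x))) for j in range(len(P[0]))]


P = [[0.9, 0.1],   # sunny -> sunny, sunny -> rainy
     [0.5, 0.5]]   # rainy -> sunny, rainy -> rainy

today = [1.0, 0.0]        # day 0 is known to be sunny
day1 = step(today, P)
day2 = step(day1, P)
print("day 1:", [round(p, 3) for p in day1])
print("day 2:", [round(p, 3) for p in day2])

# Approximate the steady state by stepping forward many days.
steady = today
for _ in range(100):
    steady = step(steady, P)
print("steady state:", [round(p, 3) for p in steady])
```

Starting from a rainy day instead gives the same steady state, which is what "independent of the initial weather" means in practice.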

Let’s discuss the rest in the comments!

Unsupervised Learning Algorithm : K-means Clustering

K-means is a centroid-based clustering algorithm. Its main aim is to minimise the sum of (squared) distances between the data points and the centroids of their corresponding clusters. The input data is unlabelled, and the algorithm divides it into k clusters, iterating until the clusters stop improving.

The algorithm primarily performs two tasks:
– Determines the best positions for the k centroids by an iterative process.
– Assigns each data point to its closest centroid. The data points that are closest to a particular centroid form a cluster.
(As seen in the image)

K-means Clustering – Example:

A pizza chain wants to open its delivery centres across a city. Let’s think of the possible challenges.

  • Where is the pizza delivered frequently?
  • Number of pizza stores to take care of delivery in that area?
  • Locations for the pizza stores within all these areas in order to keep the distance between the store and delivery points minimum?

Understanding these metrics will involve a lot of analysis, statistical analysis and mathematics. Let’s understand how k-means clustering method works.

K-means Clustering Method:

If k is given, the K-means algorithm can be executed in the following steps:

  • Partition the objects into k non-empty subsets.
  • Compute the centroid of each cluster in the current partition.
  • Compute the distance from each point to every centroid and assign the point to the cluster whose centroid is nearest.
  • After re-assigning the points, recompute the centroid of each new cluster.
  • Repeat the last two steps until the assignments stop changing.
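The steps above can be sketched as follows. Instead of starting from a random partition, this version starts from two hand-picked initial centroids, which is a common variant and my own simplification; the delivery coordinates are made up for the pizza example.

```python
import math


def kmeans(points, centroids, max_iters=100):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(max_iters):
        # Assign every point to the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we are done
            break
        centroids = new_centroids
    return centroids, clusters


# Made-up delivery locations (x, y) in two obvious neighbourhoods.
orders = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
initial = [orders[0], orders[3]]  # hand-picked starting centroids

centroids, clusters = kmeans(orders, initial)
for centre, members in zip(centroids, clusters):
    print("store at", tuple(round(c, 2) for c in centre), "serves", members)
```

In the pizza story, each final centroid would be a candidate store location and each cluster the area it delivers to.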

The step by step process:

k-means clustering with example
The step by step process of the pizza chain analysing the process (source)