Linear Regression is a machine learning algorithm, which is based on the principle of supervised learning.
As we know, supervised learning has two types of tasks, classification and regression. Linear regression, as the name suggests, performs a regression task.
Regression is used to understand the relationship between dependent and independent variables. It is commonly used to make projections, such as the sales revenue of a given business. Linear regression and polynomial regression are popular regression algorithms.
Linear regression predicts a dependent variable value (y) based on a given independent variable (x). Hence, it gives us a linear relationship between x (input) and y(output). Therefore, the name is Linear Regression.
In the figure you can see above, the linear relationship between X (input) and Y (output) is shown. The regression line is the best fit line for our model (the red line).
The relationship between the variables is exactly what you learned in high school algebra (the equation of a straight line, with an intercept and a slope):
y = θ0 + θ1*x
x: input training data (univariate – one input variable(parameter))
y: labels to data (supervised learning)
While we train the model, it fits the best line to predict the value of y for a given value of x. The model gets the best regression fit line by finding the best θ0 and θ1 values.
θ0: intercept
θ1: coefficient of x
As soon as we find the best values for θ0 and θ1, we obtain the best fit line.
The next step is to finally make predictions using the model: it will predict the value of y for a given input value of x.
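To make the prediction step concrete, here is a minimal sketch in Python. The function name predict and the parameter values are made up for illustration; in practice θ0 and θ1 come out of training.

```python
def predict(x, theta0, theta1):
    """Return the model's prediction y = theta0 + theta1 * x for one input."""
    return theta0 + theta1 * x


# Hypothetical parameters (not learned from real data): intercept 2.0, slope 0.5
y_hat = predict(10.0, theta0=2.0, theta1=0.5)
print(y_hat)  # 2.0 + 0.5 * 10.0 = 7.0
```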
How do we update the θ0 and θ1 values to get the best fit line? Important question, right? This is where the cost function comes into play.
Cost Function (J):
The cost function is what returns the error value between actual and predicted outputs.
The best-fit regression line is the one for which the error between the predicted value and the true value is as small as possible. That is why it is so important to keep updating θ0 and θ1 until we reach the values that minimize the error between the predicted y value (predicted output) and the actual y value.
The cost function (J) of linear regression is the Root Mean Squared Error (RMSE) between the predicted y value (pred) and the actual y value (y).
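Written out, the cost is J(θ0, θ1) = sqrt( (1/n) * Σ (pred_i - y_i)^2 ), where n is the number of training examples. A small sketch of that computation in plain Python follows; the function names rmse and cost and the toy data are assumptions for illustration only.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between actual and predicted values."""
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def cost(xs, ys, theta0, theta1):
    """J(theta0, theta1): RMSE of the line y = theta0 + theta1 * x on the data."""
    preds = [theta0 + theta1 * x for x in xs]
    return rmse(ys, preds)


# Toy data lying exactly on y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(cost(xs, ys, theta0=0.0, theta1=2.0))  # perfect fit, cost is 0.0
print(cost(xs, ys, theta0=0.0, theta1=1.0))  # worse line, cost is larger
```

The closer the line is to the data, the smaller J gets, which is what lets us compare candidate values of θ0 and θ1.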
To update the θ0 and θ1 values so that the cost function goes down (which essentially equates to minimizing the RMSE value) and we reach the best fit line, the model uses Gradient Descent. The overall idea is to begin with random θ0 and θ1 values and then iteratively update them until the cost reaches a minimum.
How to select the values of θ0 and θ1 to begin with?
So the idea is that you start from random values, or from a rough estimate of the constant and the slope (θ0 and θ1) based on x, which is the given input, and y, which is the actual output. That's something the code takes care of. You then keep updating the values in order to achieve the least difference between the actual and predicted values.
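Here is a rough sketch of that loop, not a definitive implementation. A few assumptions that are not in the text above: the starting values are drawn uniformly from [-1, 1], the learning rate and number of iterations are hand-picked for this toy data, and the updates follow the gradient of the mean squared error (the square of RMSE), which has its minimum at the same θ0 and θ1 and gives simpler formulas.

```python
import random


def gradient_descent(xs, ys, learning_rate=0.05, n_iters=5000, seed=0):
    """Fit y = theta0 + theta1 * x by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    # Random starting point (assumed range; any small values would do here)
    theta0 = rng.uniform(-1.0, 1.0)
    theta1 = rng.uniform(-1.0, 1.0)
    n = len(xs)
    for _ in range(n_iters):
        # Prediction error for every training example
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error
        grad0 = (2.0 / n) * sum(errors)
        grad1 = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient to reduce the cost
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))  # close to 1.0 and 2.0
```

If the learning rate is too large the updates overshoot and the cost grows instead of shrinking, so on other data it usually needs some tuning.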
Some steps you take to optimise your results in linear regression
So let's begin with the data, as a lot depends on the quality of the data you are working with and how many holes (missing values) it has. Based on that, you inspect the data, find the outliers and replace them with the mean, median, or mode value, or apply some other data imputation method.
Outliers are the values that lie far away from the majority of values.
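One possible way to do this is sketched below. It assumes the common 1.5 × IQR rule for deciding what counts as "far away" and replaces flagged values with the median; both choices, the function name, and the sample numbers are illustrative assumptions, not the only option.

```python
import statistics


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    median = statistics.median(values)
    return [v if low <= v <= high else median for v in values]


# Hypothetical column with one obvious outlier (95)
column = [10, 12, 11, 13, 12, 95, 11]
cleaned = replace_outliers_with_median(column)
print(cleaned)  # the 95 is replaced by the median of the column
```

The median is used here rather than the mean because the mean itself gets pulled toward the outlier.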
Then identify which columns have the most impact on the column to be predicted. In a general sense, these can be found by looking at the correlation between each input column and the column to be predicted.
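As a sketch of that check, the snippet below computes the Pearson correlation between each input column and the target and ranks the columns by its absolute value. The column names and numbers are hypothetical, and Pearson correlation only captures linear relationships, so treat it as a first pass.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical dataset: two input columns and the column to be predicted
target = [3.0, 5.0, 7.0, 9.0, 11.0]
features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],    # moves in step with the target
    "feature_b": [2.0, -1.0, 0.0, -1.0, 2.0],  # no linear relation to the target
}

correlations = {name: pearson_correlation(col, target) for name, col in features.items()}

# Columns with the strongest linear relationship to the target come first
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
print(ranked)
```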
Root Mean Squared Error (RMSE) and Relative Absolute Error (RAE) are used to measure the error. RMSE gives the typical size of the error made by the model when predicting on the given dataset. Because it is expressed in the same units as the target, whether a value counts as high depends on the scale of the training data.
Every dataset has some noise, which causes an inherent error in every model. Still, if we get high errors on the dataset, we can try some of the following:
– Remove outliers in the data
– Do feature selection, as some of the features may not be that informative.
– Try to combine some features to make them more meaningful.
– Maybe the linear regression is underfitting or overfitting the data. You can check this by comparing the error on the training data with the error on held-out validation data, and then try a more complex model like polynomial regression (for underfitting) or regularization (for overfitting), which we will discuss in the coming days.
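One common way to check the last point is to compare the error on the data the model was trained on with the error on data it has not seen. The sketch below is only an illustration: the data is made up (it follows y = x², so a straight line cannot fit it well), the split into training and validation points is arbitrary, and the line is fitted with the closed-form least-squares formula instead of gradient descent to keep the example short.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = theta0 + theta1 * x (closed-form solution)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    theta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
              / sum((x - mean_x) ** 2 for x in xs))
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


def rmse(xs, ys, theta0, theta1):
    """RMSE of the line y = theta0 + theta1 * x on the given data."""
    squared = [((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical curved data (y = x^2), split into training and validation points
train_x, train_y = [0.0, 2.0, 4.0, 6.0], [0.0, 4.0, 16.0, 36.0]
val_x, val_y = [1.0, 3.0, 5.0, 7.0], [1.0, 9.0, 25.0, 49.0]

theta0, theta1 = fit_line(train_x, train_y)
train_rmse = rmse(train_x, train_y, theta0, theta1)
val_rmse = rmse(val_x, val_y, theta0, theta1)
print(train_rmse, val_rmse)
```

As a rule of thumb, a high error on both sets hints at underfitting, while a low training error paired with a much higher validation error hints at overfitting.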
So it depends on a whole lot of parameters that you get to fine-tune when you are running the model. That impacts the results.
Shoot any questions you may have! Tack så mycket (that’s just Swedish for thank you so much, no AI involved there, haha!)