Name	Name	Last commit message	Last commit date
Latest commit History 77 Commits
Formulas	Formulas
README.md	README.md

Linear Regression

Written by Juzer Shakir

Description
Notations
Definition
Flowchart
Univariate Linear Regression
- Definition
- Formula
- Cost Function
- Gradient Descent
Multivariate Linear Regression
- Definition
- Formula
- Cost Function
- Gradeint Descent
Feature Scaling and Mean Normalization
Bias - Variance
- High Bias
- Just Right
- High Variance
Resolving High Variance
- Cost Function
- Gradeint Descent

Description

A Mathematical intuition and quick guide and understanding of how Linear Regression Algorithms works. Given links to other study materials in order to understand the concepts more concretly.

Notations

m Number of Training Examples.
x "input" variable / features.
y "ouput" variable / "target" variable.
n Number of feature variable (x)
(x, y) One training example.
x_i , y_i i^th training example.
x_{i_j} i^th training example of the j^th column / feature.

Definition

A linear equation that models a function such that if we give any x to it, it will predict a value y , where both x and y are input and output varaibles respectively. These are numerical and continous values.

It is the most simple and well known algorithm used in machine learning.

Flowchart

The above Flowchart represents that we choose our training set, feed it to an algorithm, it will learn the patterns and will output a function called Hypothesis function 'H(x)'. We then give any x value to that function and it will output an estimated y value for it.

For historical reasons, this function H(x) is called hypothesis function.

Univariate Linear Regression

Definition for Univariate Linear Regression

When you have one feature / variable x as an input to the function to predict y, we call this Univariate Linear Regression problem.

Formula for Univariate Linear Regression

H(x) = th₀ + th₁x

Other way of representing this formula as what we are familiar with:

H(x) = b + mx

Where :

b = th₀ y intercept

m = th₁ slope

x = x feature / input variable

Source

Help

Intuition behind linear equation.

Need to Practice?

Cost Function for Univariate Linear Regression

All that said, how do we figure out the best possible straight line to the data that we feed?

This is where Cost Function will help us:

The best fit line to our data will be where we have least distance between the predicted 'y' value and trained 'y' value.

Formula for Cost Function

Where :

h(x_i) hypothesis function

y_i actual values of y

1/m gives Mean of Squared Errors

1/2 Mean is halved as a convenience for the computation of the Gradient Descent.

The above formula takes the sum of the distances between predicted values and actual values of training set, sqaure it, take the average and multiply it by 1/2.

This cost function is also called as Squared Error Function or Mean Squared Error.

Why do we take squares of the error's?
The MSE function is commonly used and is a reasonable choice and works well for most Regression problems.

Let's subsititute MSE function to function J :

Help

Intuition behind Cost Function.

Gradient Descent for Univariate Linear Regression

So now we have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to estimate the parameters in the hypothesis function. That's where Gradient Descent comes in.
Gradient Descent is used to minimize the cost function J, minimizing J is same as minimizing MSE to get best possible fit line to our data.

Formula for Gradient Descent

Where :

:= Is the Assignment Operator

a is Alpha, it's the number which is called learning rate. If its too high it may fail to converge and if too low then descending will be slow.

'th_j' Taking Gradient Descent of a feature or a column of a dataset.

/(th_j) J(th₀,th₁) Taking partial derivative of MSE cost function.

Additional Resources

Intuition behind Gradient Descent.

Partial Derivative.

Now Let's apply Gradient Descend to minmize our MSE function.
In order to apply Gradient Descent, we need to figure out the partial derivative term.
So let's solve partial derivative of cost function J.

Now let's plug these 2 values to our Gradient Descent:

Note :

Cost Function for Linear Regression is always going to be Convex or Bowl Shaped Function, so this function doesn't have any local minimum but one global minimum, thus always converging to global minimum.

The above hypothesis function has 2 parameters, th₀ & th₁, so Gradient Descent will run on each feature, hence here two times, one for feature and one for base (y-intercept), to get minimum value of j. So if we have n features, Gradient Descent will run on all n+1 features.

Multivariate Linear Regression

Linear Algebra

Intro to Vectors & Scalars.

Combined Vector Operations.

Intro to Matrices.

Representing linear systems with matrices.

Add & subtract matrices.

Multipling matrices.

Identity Matrix.

Properties of Matrix Multiplication.

Matrix Inverses.

Matrix Transpose.

Definition for Multivariate Linear Regression

Its same as Univariate Linear Regression, except it has more than one feature variable (x) to predict target variable (y).

Formula for Multivariate Linear Regression

Our hypothesis function for n = 4 :

Where :

th₀ y intercept

And rest are features x to help predict y value.

Intuition:
In order to develop an intuition about this function, let's imagine that this function represents price of a house (y) based on the features given (x), then we can think of this function as:

th₀ as the basic price of a house.

th₁ as price/m².

x₁ as area of a house (m²).

th₂ as price/floor.

x₂ as number of floors.

etc (You get the idea)

Let's set all the parameters:

th₀, th₁, th₂, th₃.......th_n = th

And Let's set all the features:

x₀, x₁, x₂, x₃.......x_n = x

Where :

th will be n+1 dimensional vector because we have th₀ which is not a feature.

x₀ is added just for convenience so that we can take matrix multiplication of th as th^T and x and we will set x₀ value to 1, so this doesn't change the values.

x will also be n+1 dimensional vector.

Cost Function for Multivariate Linear Regression

So,

J(th₀, th₁, th₂, th₃.......th_n) = J(th)

Where,

Gradient Descent for Multivariate Linear Regression

Appling Gradient Descend to minmize our MSE function after solving partial derivative of J(th), we get :

For Example : If we had 2 features, this is how gradient descent would run on each parameter:

Feature Scaling and Mean Normalization

We can speed up Gradient Descent by having each of our input values in roughly the same range. This is because th will descend quickly on small ranges and slowly on large ranges. If we have large ranges, it will oscilate inefficiently down to optimum (minimum) when the variables are very uneven.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally between :

-1 <= x_i <= 1

-0.5 <= x_i <= 0.5

These aren't exact requirements, we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges.
This can make Gradient Descent run much faster and converge in a lot few iterations.

Two techniques to help with this are Feature Scaling and Mean Normalization.

Feature Scaling involves diving the input values by the range (i.e max value - min value) of the input variable, resulting in a new values.

Mean Normalization involves subtracting the average of a feature variable from the values of the feature, dividing by range of values or by standard deviation, resulting in a new values.

Where:

m_j Average of a Feature variable j.

s_i Either Range or Standard Deviation of a Feature j.

Bias - Variance

High Bias

Consider a problem of predicting y. The figure below shows the result of fitting a hypothesis function th₀ + th₁x to a dataset.

We see that the data doesn't really lie on a straight line and so the fit is not very good.
This figure is an instance of Underfitting, in which the data clearly shows structure not captured by the model h(x).
Underfitting or high bias is when the form of our h(x) function maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features.

Just Right

If we add an extra feature x² and fit hypothesis function th₀ + th₁x + th₂x², then we obtain a slightly better fit to the data.

High Variance

If we add more features, it would intuitively seem that it could perform much better, however there's also a danger in adding too amny features. The result of fitting 4th order polynomial where our h(x) is th₀ + th₁x + th₂x² + th₃x³ + th₄x⁴

We see that even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor of for example housing prices y for different areas x.

Overfitting or High Variance, is caused by h(x) function that fits the available data but does not generalize (unable to accurately predict) well to predict new data. It is usually caused by giving too many features or having a complicated h(x) function that creates lots of unnecessory curves and angles unrelated to the data.

Resolving High Variance

There are 2 main options to address the issue of Overfitting:

Regularization
- Keep all features, but reduce the magnitude of parameters th_j.
- It works well when we have a lot of slightly useful features.
Reduce number of features
- Manually select which features to keep.
- Use model selection algorithm.

Cost Function for Regularization

If we have overfitting from our hypothesis function, for example like in figure 3, we can reduce the weight that some of the terms in our function carry by increasing their cost.
For example, let's say we wanted to make the following function more quadratic:
th₀ + th₁x + th₂x² + th₃x³ + th₄x⁴
We'll want to eliminate the influence of th₃x³ and th₄x⁴, without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function.

We've added two extra terms at the end to inflate the cost of th₃ and th₄. Now, in order for the cost function to get close to 0, we will have to reduce the values th₃ and th₄ to near 0. This will in turn greatly reduce the values of th₃x³ and th₄x⁴ in our hypothesis function.

As a result we see that the new H(x) looks like a quadratic function but fits the data better compared to figure 3 due to extra small terms th₃x³ and th₄x⁴.

We can also regularize all of our parameters in a single summation as:

The l, or lambda, is regularization parameter. It determines how much costs of our th parameters are inflated.
Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is choosen to be too large, it may smooth our function too much and causing High bias or Underfitting.

Note:
We penalize the parameters from th₁ .... th_n , but we don't penalize th₀. We treat this differently.

Gradient Descent for Regularization

We will modify our gradient descent function to separate out th₀ from rest of the parameters because we do not want to penalize th₀.

The term l/m times th_j performs our regularization.

With some manipulation our updated Gradient Descent for th_j can also be represented as:

The first term in the equation will always be less than 1. Intuitively you can see it as reducing the value th_j by some amount on every update. And the second term is exactly the same as it was in Gradient Descent without applying regularization.

Folders and files

Latest commit

History

Repository files navigation

Linear Regression

Written by Juzer Shakir

Table of Contents

Description

Notations

Definition

Flowchart

Univariate Linear Regression

Definition for Univariate Linear Regression

Formula for Univariate Linear Regression

Cost Function for Univariate Linear Regression

Formula for Cost Function

Gradient Descent for Univariate Linear Regression

Formula for Gradient Descent

Multivariate Linear Regression

Definition for Multivariate Linear Regression

Formula for Multivariate Linear Regression

Cost Function for Multivariate Linear Regression

Gradient Descent for Multivariate Linear Regression

Feature Scaling and Mean Normalization

Bias - Variance

High Bias

Just Right

High Variance

Resolving High Variance

Cost Function for Regularization

Gradient Descent for Regularization

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages