Gradient Descent

Gradient Descent is an algorithm that tries to seek the best values of predictors that give the lowest error.

I’ll implement Gradient Descent algorithm on a linear regression. First, we need to set an objective equation.

Objective Equation
Since the algorithm seeks to minimize error, we need an error equation. I’ll use Sum of Squared Error or SSE as the objective equation.

$$SSE=\sum_{i=1}^N(y_{i}-mx_{i}+\beta)^2$$

Let’s suppose we have this data set:

data <- data.frame(x = (1:3),
                   y = (4:6))

1 2	data <- data.frame(x = (1:3), y = (4:6))

Next, we supply the starting points, in this case, it is $m$ and $\beta$

We then can write the first for loop() to calculate the SSE where $m = 0$ and $\beta = 0$

for (i in 1:1){
  for (i in 1:nrow(data)){
    x <- data[i,1]
    y <- data[i,2]
    sse <- (y - ((m * x) + b))^2
    sum_sse <<- sum_sse + sse
    if(i == nrow(data)){
      temp <<- cbind(b, m, sum_sse)
      storage <<- rbind(storage, temp)
      totalsse <<- 0
      }
}

for (i in 1:1){

for (i in 1:nrow(data)){

x <- data[i,1]

y <- data[i,2]

sse <- (y - ((m * x) + b))^2

sum_sse <<- sum_sse + sse

if(i == nrow(data)){

temp <<- cbind(b, m, sum_sse)

storage <<- rbind(storage, temp)

totalsse <<- 0

}

The loop will loop through all observations in the data. Then assign the error to $sse$. Then it will sum up to $sum sse$ When it is done; it will put in a storage dataset to store $\beta$,$ m$, and $sum sse$.

Up until this point, it is the usual $SSE$ calculation. Now is the bread and butter of gradient descent: Partial Derivatives. Partial derivatives are to calculate the derivative of a function (equation in this case) with more than one variable. So, it works perfectly in our case.

$$\frac{\partial}{\partial m}=\frac{2}{N}\sum_{i=1}^{N}-x_{i}(y_{i}-(mx_{i}+\beta))$$

$$\frac{\partial}{\partial \beta}=\frac{2}{N}\sum_{i=1}^{N}-(y_{i}-(mx_{i}+\beta))$$

Paul’s Online Math Notes website gives a very detailed explanation of Partial Derivative (link).

#Calculating Gradient  
    for (j in 1:steps){
      for (i in 1:nrow(data)){
        x <- data[i,1]
        y <- data[i,2]    
       
        #Partial derivative respect to m
        dm <- -((2/nrow(data)) * x * (y - (m*x + b)))
        
        #Partial derivative respect to beta
        db <- -((2/nrow(data)) * (y - (m*x + b)))
        
        #Update values
        m_gradient <<- m_gradient + dm
        b_gradient <<- b_gradient + db      
      }
      m <- (m_base - (learningrate * m_gradient))
      b <- (b_base - (learningrate * b_gradient))

#Calculating Gradient

for (j in 1:steps){

for (i in 1:nrow(data)){

x <- data[i,1]

y <- data[i,2]

#Partial derivative respect to m

dm <- -((2/nrow(data)) * x * (y - (m*x + b)))

#Partial derivative respect to beta

db <- -((2/nrow(data)) * (y - (m*x + b)))

#Update values

m_gradient <<- m_gradient + dm

b_gradient <<- b_gradient + db

}

m <- (m_base - (learningrate * m_gradient))

b <- (b_base - (learningrate * b_gradient))

First, we calculate the derivatives, then assign the values to gradient values. After that, we will reassign the new value to $m$ and $\beta$. Another important variable in the calculation is the learning rate $\alpha$. The value will control how much we will move down the slope.

After we get the new value of $m$ and $\beta$, we then recalculate the $SSE$ as many times as we want.

So, to put all of that in a function.

gradient_descent <- function(m, b, data, steps, learningrate){
  sse <<- 0
  sum_sse <<- 0
  storage <<- data.frame(b = c(99),
                         m = c(99),
                         sum_sse = c(99))
  m_gradient <<- 0
  b_gradient <<- 0
  dm <<- 0
  dx <<- 0
  m_base <<- m
  b_base <<- b
  
  for (i in 1:1){
    for (i in 1:nrow(data)){
      x <- data[i,1]
      y <- data[i,2]
      sse <- (y - ((m * x) + b))^2
      sum_sse <<- sum_sse + sse
      if(i == nrow(data)){
        temp <<- cbind(b, m, totalerror)
        storage <<- rbind(storage, temp)
        sum_sse <<- 0
      }
    }
    #Calculating Gradient  
    for (j in 1:steps){
      for (i in 1:nrow(data)){
        x <- data[i,1]
        y <- data[i,2]    

        #Partial derivative respect to m
        dm <- -((2/nrow(data)) * x * (y - (m*x + b)))
        
        #Partial derivative respect to beta
        db <- -((2/nrow(data)) * (y - (m*x + b)))
        
        #Update values
        m_gradient <<- m_gradient + dm
        b_gradient <<- b_gradient + db
        
      }
      m <- (m_base - (learningrate * m_gradient))
      b <- (b_base - (learningrate * b_gradient))
      
      for (i in 1:nrow(data)){
       x <- data[i,1]
        y <- data[i,2]
        sse <- (y - ((m * x) + b))^2
        sum_sse <<- sum_sse + sse
        if(i == nrow(data)){
          temp <<- cbind(b, m, totalerror)
          storage <<- rbind(storage, temp)
          sum_sse <<- 0         
        }
      }
    }
  }
}

gradient_descent <- function(m, b, data, steps, learningrate){

sse <<- 0

sum_sse <<- 0

storage <<- data.frame(b = c(99),

m = c(99),

sum_sse = c(99))

m_gradient <<- 0

b_gradient <<- 0

dm <<- 0

dx <<- 0

m_base <<- m

b_base <<- b

for (i in 1:1){

for (i in 1:nrow(data)){

x <- data[i,1]

y <- data[i,2]

sse <- (y - ((m * x) + b))^2

sum_sse <<- sum_sse + sse

if(i == nrow(data)){

temp <<- cbind(b, m, totalerror)

storage <<- rbind(storage, temp)

sum_sse <<- 0

}

#Calculating Gradient

for (j in 1:steps){

for (i in 1:nrow(data)){

x <- data[i,1]

y <- data[i,2]

#Partial derivative respect to m

dm <- -((2/nrow(data)) * x * (y - (m*x + b)))

#Partial derivative respect to beta

db <- -((2/nrow(data)) * (y - (m*x + b)))

#Update values

m_gradient <<- m_gradient + dm

b_gradient <<- b_gradient + db

}

m <- (m_base - (learningrate * m_gradient))

b <- (b_base - (learningrate * b_gradient))

for (i in 1:nrow(data)){

x <- data[i,1]

y <- data[i,2]

sse <- (y - ((m * x) + b))^2

sum_sse <<- sum_sse + sse

if(i == nrow(data)){

temp <<- cbind(b, m, totalerror)

storage <<- rbind(storage, temp)

sum_sse <<- 0

}

Let’s give it a try.

gradient_descent(m = 2, b = 3, steps = 10, data=data, learningrate = 0.0001)

1	gradient_descent(m = 2, b = 3, steps = 10, data=data, learningrate = 0.0001)

storage %>%
  filter(b != 99)

1 2	storage %>% filter(b != 99)

> storage %>%
+   filter(b != 99)
          b        m totalerror
1  3.000000 2.000000   14.00000
2  2.999600 1.999067   13.96908
3  2.999200 1.998134   13.93824
4  2.998801 1.997203   13.90746
5  2.998403 1.996273   13.87675
6  2.998005 1.995344   13.84610
7  2.997607 1.994415   13.81553
8  2.997210 1.993488   13.78502
9  2.996813 1.992562   13.75458
10 2.996416 1.991637   13.72421
11 2.996020 1.990713   13.69390

> storage %>%

+ filter(b != 99)

b m totalerror

1 3.000000 2.000000 14.00000

2 2.999600 1.999067 13.96908

3 2.999200 1.998134 13.93824

4 2.998801 1.997203 13.90746

5 2.998403 1.996273 13.87675

6 2.998005 1.995344 13.84610

7 2.997607 1.994415 13.81553

8 2.997210 1.993488 13.78502

9 2.996813 1.992562 13.75458

10 2.996416 1.991637 13.72421

11 2.996020 1.990713 13.69390