Logistic regression is one of the most useful regression algorithms. In this post, I will talk about the algorithm, its optimization method, its implementation, and a comparison between logistic regression and the perceptron learning algorithm (PLA).
Simple Linear Regression
Method
First, let's look at simple linear regression. The formula is:
$$Y = X\beta$$
where $Y$ and $X$ are matrices. I incorporate the intercept into this formula by adding a $(1, 1, \ldots, 1)^T$ column at the leftmost part of the matrix $X$. So the $X$ here is an augmented matrix, and columns 2 through the end hold the variables.
Of course, we can use as many variables as we want; for instance, we can include interaction terms like $X_1X_2$.
Implementation
Here is the matlab code that implements this simple algorithm.
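Since the MATLAB snippet is not reproduced here, below is a minimal NumPy sketch of the same idea (the function name `linear_regression` is illustrative, not from the original post):

```python
import numpy as np

def linear_regression(X, y):
    """Fit Y = X * beta by ordinary least squares.

    A column of ones is prepended to X so that beta[0] is the intercept,
    matching the augmented-matrix convention described above.
    """
    X_aug = np.column_stack([np.ones(len(X)), X])
    # Solve the least-squares problem (equivalent to the normal equations).
    beta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return beta

# Example: y = 1 + 2x is recovered from noiseless data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(linear_regression(X, y))  # approximately [1. 2.]
```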


Logistic regression
Method
While simple linear regression produces a continuous predicted value, sometimes we do not need the actual value itself. We only need to separate observations into two categories, which is a binary classification problem.
To achieve this, we need a simple transformation that maps our value to a new value falling in $[0,1]$:
$$y = \frac{1}{1+e^{-\mathbf{X}\mathbf{w}}}$$
This is called the sigmoid function, and it satisfies our requirement. So suppose our new value $y$ already falls into $[0,1]$; then we need to calculate the required $\mathbf{w}$. But the question is, how do we calculate the exact value of $\mathbf{w}$, as we did in the previous method, simple linear regression?
It turns out that it is not that easy to calculate the value of $\mathbf{w}$ explicitly, so we need an optimization method to approximate it.
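The sigmoid transformation itself is straightforward to compute; a quick Python sketch to show how it squashes any real input into $(0, 1)$:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```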
Optimization
There are many optimization methods that can be applied to this problem. Here I use one of the most widely used, gradient descent.
The gradient descent algorithm converges toward a local optimum along all dimensions simultaneously. This is different from coordinate descent, which converges toward the optimum one dimension at a time. We also need to provide a learning rate, which controls the step size toward the optimum. If it is too high, the algorithm will jump over the local optimum; if it is too small, we will need many iterations of calculation.
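A toy example makes the learning-rate trade-off concrete. The sketch below (my own illustration, not from the original post) minimizes $f(w) = w^2$, whose gradient is $2w$, with a fixed learning rate:

```python
def gradient_descent_steps(w0, alpha, steps):
    # Minimize f(w) = w^2 (gradient 2w) with a fixed learning rate alpha.
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

# A small learning rate converges; one that is too large overshoots and diverges.
print(abs(gradient_descent_steps(1.0, 0.1, 50)))  # tiny: converged toward 0
print(abs(gradient_descent_steps(1.0, 1.1, 50)))  # huge: jumped over the optimum
```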
The formula for gradient descent is:
$$ \mathbf{w} = \mathbf{w} - \alpha * \frac{\partial RSS}{\partial \mathbf{w}} $$
RSS is defined as:
$$ RSS = \frac{1}{2}(\mathbf{y} - sigmoid(\mathbf{X}\mathbf{w}))^T(\mathbf{y} - sigmoid(\mathbf{X}\mathbf{w})) $$
In logistic regression, this RSS is not convex in $\mathbf{w}$, so to obtain a convex cost function we redefine it as:
$$
\begin{align}
RSS &= A + B \\
A &= -\mathbf{y} * log(sigmoid(\mathbf{X}\mathbf{w})) \\
B &= -(1 - \mathbf{y}) * log(1 - sigmoid(\mathbf{X}\mathbf{w}))
\end{align}
$$
The first part is the error for $y$ = 1, while the second part is for $y$ = 0.
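The two-part cost above can be computed directly; a minimal Python sketch (helper names are illustrative, and for simplicity no guard against $\log(0)$ is included):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(X, y, w):
    """Cost A + B from the formulas above.

    A penalizes low predicted probability where y = 1;
    B penalizes high predicted probability where y = 0.
    """
    p = sigmoid(X @ w)
    A = -np.sum(y * np.log(p))
    B = -np.sum((1 - y) * np.log(1 - p))
    return A + B

# Confident, correct predictions give a much smaller cost than w = 0.
X = np.array([[1.0, 5.0], [1.0, -5.0]])  # first column is the intercept
y = np.array([1.0, 0.0])
print(cross_entropy(X, y, np.array([0.0, 2.0])))  # small
print(cross_entropy(X, y, np.zeros(2)))           # larger, about 1.386
```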
This cost is the negative conditional log-likelihood; by differentiating it with respect to $\mathbf{w}$ (proof omitted here), we get the update rule:
$$ \mathbf{w} = \mathbf{w} + \alpha * (\mathbf{y} - sigmoid(\mathbf{X}\mathbf{w}))^T\mathbf{X} $$
where $\alpha$ is the learning rate.
There are two common ways to stop this optimization. One is to fix a finite number of iterations, say 2000, after which the algorithm stops automatically. Alternatively, we can monitor the cost function: if the decrease in the cost between iterations is small enough, we stop.
Implementation
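The original MATLAB implementation is not shown here; as a stand-in, below is a minimal Python sketch of the gradient descent procedure just described, combining both stopping criteria (function and parameter names are my own, for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, max_iter=2000, tol=1e-6):
    """Fit logistic regression by gradient descent.

    Stops after max_iter updates, or earlier once the decrease
    in the cross-entropy cost falls below tol.
    """
    X_aug = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    w = np.zeros(X_aug.shape[1])
    eps = 1e-12  # guard against log(0)

    def cost(w):
        p = sigmoid(X_aug @ w)
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    prev = cost(w)
    for _ in range(max_iter):
        error = y - sigmoid(X_aug @ w)     # (y - sigmoid(Xw))
        w = w + alpha * (X_aug.T @ error)  # gradient step on the cross-entropy
        cur = cost(w)
        if prev - cur < tol:
            break  # cost is no longer decreasing meaningfully
        prev = cur
    return w

def predict(X, w, threshold=0.5):
    # Threshold the sigmoid output (usually at 0.5) to get 0/1 labels.
    X_aug = np.column_stack([np.ones(len(X)), X])
    return (sigmoid(X_aug @ w) >= threshold).astype(int)
```

On a tiny separable data set such as `X = [[-2], [-1], [1], [2]]` with labels `[0, 0, 1, 1]`, this recovers a weight vector that classifies all four points correctly.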


Comparison between logistic regression and PLA
When looking at this formula:
$$ \mathbf{w} = \mathbf{w} + \alpha * (\mathbf{y} - sigmoid(\mathbf{X}\mathbf{w}))^T\mathbf{X} $$
it reminds us of the PLA update rule. There are two differences. First, in PLA the learning rate is fixed at 1. Second, the PLA update uses $(\mathbf{y} - \mathbf{\hat{y}})^T\mathbf{X}$ instead of $(\mathbf{y} - sigmoid(\mathbf{X}\mathbf{w}))^T\mathbf{X}$: in PLA the predicted result is either 0 or 1, but in logistic regression the result can be any value in $[0,1]$, and we set a threshold (usually 0.5) to determine which category an observation belongs to.
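To make the comparison concrete, here is a minimal Python sketch of a batch-style perceptron matching the update form above (classic PLA updates one misclassified point at a time; this batch form is an illustrative simplification of my own):

```python
import numpy as np

def pla_predict(X_aug, w):
    # PLA prediction is a hard 0/1, unlike the sigmoid's value in [0, 1].
    return (X_aug @ w >= 0).astype(float)

def pla_fit(X, y, max_iter=1000):
    """Perceptron learning with the batch update w <- w + (y - y_hat)^T X,
    i.e. the learning rate is fixed at 1."""
    X_aug = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(X_aug.shape[1])
    for _ in range(max_iter):
        y_hat = pla_predict(X_aug, w)
        if np.array_equal(y_hat, y):
            break  # all points classified correctly
        w = w + X_aug.T @ (y - y_hat)
    return w
```

On separable data the hard 0/1 error drives the same kind of correction as the logistic update, just without the graded sigmoid signal.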