Zhen Zhang's Blog

Logistic Regression

Logistic regression is one of the most useful regression algorithms. In this post I will talk about the algorithm, its optimization method, its implementation, and a comparison between logistic regression and the PLA (perceptron learning algorithm).

Simple Linear Regression

Method

First let’s see simple linear regression. The formula is:

$$Y = X * \beta$$

where $Y$ and $X$ are both matrices here. I incorporate the intercept into this formula by adding a $(1, 1, \dots, 1)^T$ column at the leftmost position of the matrix $X$. So the $X$ here is an augmented matrix, and columns 2 through the end hold the actual variables.

Of course, we can use as many variables as we want; for instance, we can include interaction terms like $X_1X_2$.
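For reference, the coefficients have a closed-form solution via the normal equations (assuming $X^TX$ is invertible); this is exactly what the implementation below computes:

$$\hat{\beta} = (X^TX)^{-1}X^TY$$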

Implementation

Here is the MATLAB code that implements this simple algorithm. The helper dimExpand builds the augmented matrix, and singlePolyReg solves the normal equations on it.

function matrix = dimExpand(x, order)
% This is one of the core functions in regression. What it does is transform
% X (the independent values) into the augmented matrix we need for the
% regression, based on the given order. For example, with order 0 it returns
% a column of ones; with order 1 it appends X itself to the order-0 column;
% and so on for higher orders.
% Note that this function is vectorized, so X can be either a column vector
% or a matrix.
% It maintains two counters: t1 tracks the column position of the current
% power block, and t2 counts the orders still to be filled in. Together they
% let the function accept a matrix and return the augmented matrix directly.
    t1 = 0;
    t2 = order;
    [r, n] = size(x);
    z = ones(r, 1 + n * order);
    while t2 > 0
        t1 = t1 + n;
        z(:, t1-n+2 : t1+1) = x.^(t1/n);
        t2 = t2 - 1;
    end
    matrix = z;
end
function coefficient = singlePolyReg(x, y, order)
% This function implements polynomial regression by solving the normal
% equations on the augmented matrix produced by dimExpand.
    z = dimExpand(x, order);
    coefficient = (transpose(z) * z) \ (transpose(z) * y);
end
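A minimal usage sketch (the toy data, the quadratic order, and the variable names below are my own illustrative choices, not part of the original code):

% Toy data: a noisy quadratic relationship (illustrative values only).
x = (0:0.5:5)';                      % column vector of inputs
y = 2 + 3*x - 0.5*x.^2 + 0.1*randn(size(x));
coef = singlePolyReg(x, y, 2);       % fit intercept + x + x.^2 terms
yhat = dimExpand(x, 2) * coef;       % fitted values from the augmented matrix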

Logistic Regression

Method

While simple linear regression produces a continuous predicted value, sometimes we do not need the actual number. We only need to separate the observations into two categories, which is a binary classification problem.

To achieve this, we need a simple transformation, one that maps our value to a new value that falls in $(0, 1)$:

$$y = \frac{1}{1+e^{-xw}}$$

This is called the sigmoid function, and it satisfies our requirement. So suppose our new value $Y$ already falls into $[0,1]$; then we need to calculate the required $w$. But the question is, how do we calculate the exact value of $w$, as we did in the previous method, simple linear regression?
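Before turning to that question, here is a quick numerical check (a throwaway snippet, not part of the implementation below) that the sigmoid really does map any real input into $(0, 1)$:

sig = @(t) 1 ./ (1 + exp(-t));   % the sigmoid as an anonymous function
sig([-5 0 5])                    % ans = 0.0067  0.5000  0.9933 (approximately)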

It turns out that it is not that easy to calculate the value of $w$ explicitly, so we need an optimization method to approximate it.

Optimization

There are many optimization methods that can be applied to this problem. Here I use one of the most widely used ones, gradient descent.

The gradient descent algorithm converges toward a local optimum in all dimensions simultaneously. This is different from coordinate descent, which moves toward the optimum one dimension at a time. We also need to provide a learning rate, which controls how fast we move toward the optimum: if it is too high, we may jump over the optimum; if it is too small, we need far too many iterations.

The formula for gradient descent is:

$$ \mathbf{w} = \mathbf{w} - \alpha * \frac{\partial RSS}{\partial \mathbf{w}} $$

RSS is defined as:

$$ RSS = \frac{1}{2}(\mathbf{y} - sigmoid(\mathbf{X}\mathbf{w}))^T(\mathbf{y} - sigmoid(\mathbf{X}\mathbf{w})) $$

In logistic regression, however, this RSS is not convex in $\mathbf{w}$. To keep a convex cost function, we replace it with the cross-entropy cost:

$$
\begin{align}
Cost &= A + B \\
A &= -\mathbf{y}^T \log(sigmoid(\mathbf{X}\mathbf{w})) \\
B &= -(1 - \mathbf{y})^T \log(1 - sigmoid(\mathbf{X}\mathbf{w}))
\end{align}
$$

The first part, $A$, is the error contributed by examples with $y = 1$, while the second part, $B$, is for examples with $y = 0$.
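To see how each term behaves, take a single example with $y = 1$, so only $A$ is active (the probabilities below are only illustrative):

-log(0.9)    % ~ 0.105: a confident, correct prediction costs little
-log(0.1)    % ~ 2.303: a confident, wrong prediction is heavily penalized
% For y = 0, term B behaves the same way with the roles of the probabilities swapped.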

Differentiating this cost with respect to $\mathbf{w}$ (equivalently, maximizing the conditional likelihood; the proof is omitted here), we get the update rule:

$$ \mathbf{w} = \mathbf{w} - \alpha * \mathbf{X}^T(sigmoid(\mathbf{X}\mathbf{w}) - \mathbf{y}) $$

where $\alpha$ is the learning rate.

There are two common criteria for stopping this optimization. One is to fix a finite number of iterations, say 2000, after which the algorithm stops automatically. The other is to monitor the cost function: if the reduction in the cost between iterations is small enough, we stop as well.
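Here is a minimal sketch of the update loop with both stopping rules combined, assuming the helper functions (dimExpand, sigmoid, costFun) defined in the next section; the names maxIter and tol are my own. The actual implementation below uses only the relative-change rule.

z       = dimExpand(x, 1);         % augmented design matrix
maxIter = 2000;                    % stopping rule 1: hard cap on iterations
tol     = 1e-4;                    % stopping rule 2: relative change in the cost
cost    = costFun(x, y, w);
for iter = 1:maxIter
    grad = z' * (sigmoid(z, w) - y);   % gradient of the cross-entropy cost
    w    = w - learnRate * grad;       % simultaneous update of all weights
    newCost = costFun(x, y, w);
    if abs(newCost - cost) / abs(cost) <= tol
        break;                         % the cost has stopped improving
    end
    cost = newCost;
end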

Implementation

function result = sigmoid(x,w)
% This is the sigmoid function designed for logistic regression.
    result = 1 ./ (1 + exp(- x * w));
end
function result = costFun(x,y,w)
% This function calculates the cross-entropy cost for logistic regression.
% The definition of the cost function is given above, so I will not repeat
% it here.
    result = 0;
    x = dimExpand(x,1);
    [n, ~] = size(x);
    for k = 1:n
        temp = sigmoid(x(k,:), w);      % predicted probability for row k
        if y(k) == 1
            result = result - log(temp);
        else
            result = result - log(1 - temp);
        end
    end
end
function result = categorizeY(threshold,y)
% This function transforms Y (the dependent values) into two categories
% based on the threshold value given to the function.
% It creates a new array and fills each entry based on the corresponding
% Y value.
    i = length(y);
    result = zeros(i,1);
    for n = 1:i
        if y(n) > threshold
            result(n) = 1;
        else
            result(n) = 0;
        end
    end
end
function result = graDescent(x,y,w,learnRate)
% This is the core function that runs gradient descent for logistic
% regression. Its main logic is: keep looping until the relative change in
% the cost is less than 0.0001. In each pass it updates the weight vector in
% all dimensions simultaneously, following the gradient descent update rule,
% and at the end it returns the weight vector.
    z = dimExpand(x,1);
    keep = true;
    cost = costFun(x,y,w);              % initial cost, so the first relative
                                        % change is well defined
    while keep
        grad = z' * (sigmoid(z,w) - y); % gradient of the cross-entropy cost
        w = w - learnRate * grad;
        cost_new = costFun(x,y,w);
        if abs(cost_new - cost) / abs(cost) <= 0.0001
            keep = false;
        end
        cost = cost_new;
    end
    result = w;
end
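A possible end-to-end call, with toy data, a 0.5 threshold, and a 0.01 learning rate that are all illustrative assumptions rather than values from the functions above:

% Toy binary data: one feature, labels 0/1 (deliberately not perfectly separable).
x  = [0.5; 1.0; 1.5; 2.0; 2.5; 3.0; 3.5; 4.0];
y  = [0; 0; 0; 1; 0; 1; 1; 1];
w0 = zeros(2, 1);                        % intercept + one slope
w  = graDescent(x, y, w0, 0.01);         % fit the weights by gradient descent
p  = sigmoid(dimExpand(x, 1), w);        % predicted probabilities
labels = categorizeY(0.5, p);            % threshold the probabilities at 0.5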

Comparison between logistic regression and PLA

When looking at this formula:

$$ \mathbf{w} = \mathbf{w} - \alpha * \mathbf{X}^T(sigmoid(\mathbf{X}\mathbf{w}) - \mathbf{y}) $$

It reminds us of the PLA algorithm from the previous post. The differences are: first, in PLA the learning rate is fixed at 1; second, the update uses $\mathbf{X}^T(\mathbf{y} - \mathbf{\hat{y}})$ instead of $\mathbf{X}^T(\mathbf{y} - sigmoid(\mathbf{X}\mathbf{w}))$. In other words, in PLA the predicted result $\mathbf{\hat{y}}$ is either 0 or 1, while in logistic regression the prediction can be any value in $(0, 1)$, and we set a threshold (usually 0.5) to decide which category each observation belongs to.
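In code, the two updates differ only in how the prediction is formed. A hedged sketch, written in the same batch form as the logistic update for the sake of comparison (classic PLA actually updates one misclassified point at a time, and the PLA line below is the standard perceptron rule, not code from the earlier post):

% Logistic regression step: the prediction is a probability in (0, 1).
w = w + alpha * z' * (y - sigmoid(z, w));
% PLA step: the prediction is hard-thresholded to 0 or 1, and alpha is fixed at 1.
w = w + z' * (y - (z * w > 0));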