Zhen Zhang's Blog

Some thoughts on PLA algorithm

PLA (Perceptron Learning Algorithm) is a very basic machine learning algorithm. It serves as an approach to separate two classes of data. In this post, I will first introduce the basic logic, then talk about the difficulties I encountered when implementing this algorithm.

Algorithm Summary

Theory

I learned this algorithm from the course “Machine Learning Algorithm” on Coursera.

For the input $(x_i, y_i)$ (here we take $x_i$ as a vector of dimension $d$), we compute a weighted sum using a weight vector $w$ and a threshold:

$$ t_i = \sum^d_{j=1}w_j x_{ij} $$

Then we compare $t_i$ to the threshold. If $t_i > \text{threshold}$, we claim the result is positive, and negative otherwise.

Of course we can incorporate the threshold into the sum formula:

$$
\begin{align}
h(x_i) &= \mathrm{sign}\left(\left(\sum^d_{j=1}w_j x_{ij}\right) - \text{threshold}\right) \\
&= \mathrm{sign}\left(\left(\sum^d_{j=1}w_j x_{ij}\right) + (-\text{threshold}) \cdot 1\right) \\
&= \mathrm{sign}\left(\sum^d_{j=0}w_j x_{ij}\right) \\
&= \mathrm{sign}(w^T x_i)
\end{align}
$$

where we set $x_{i0} = 1$ and $w_0 = -\text{threshold}$. This is known as the perceptron classifier.
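
To make this concrete, here is a minimal R sketch of the hypothesis; the function name perceptronPredict and the sample numbers are my own, not from the course. Note that R's sign() returns 0 for an input of exactly 0; the implementation below handles that case separately.

# The bias trick: prepend the constant 1 to x, so that w[1] plays the
# role of -threshold and the prediction is just sign(w^T x).
perceptronPredict <- function(w, x) {
  sign(sum(w * c(1, x)))
}

w <- c(-2, 1, 1)               # threshold 2, weights (1, 1)
perceptronPredict(w, c(3, 2))  # -2 + 3 + 2 = 3 > 0, so returns 1
perceptronPredict(w, c(1, 0))  # -2 + 1 + 0 = -1 < 0, so returns -1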

Pictures

At first, we give PLA an initial weight vector $w_0$, which most of the time will not separate all the data correctly. Then we correct the mistakes one by one:

PLA correction

(This picture is copied from the lecture slides of the course “Machine Learning Algorithm” on Coursera.)

It says that when the classifier is wrong, we can correct it as follows: if the true label is positive but our prediction is negative, we add the misclassified $x$ vector to the weight vector; in the opposite case, we subtract it. Both cases collapse into a single update, where $(x_n, y_n)$ is the misclassified observation:

$$ w_{t+1} = w_t + y_n x_n $$

The essence of this approach is that each update turns the weight vector toward (or, for a negative label, away from) the misclassified observation. The best case is that they end up aligned, since the weight vector serves as the normal vector to the boundary separating the two regions.
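
One line of algebra shows why this works: after an update on a misclassified $(x_n, y_n)$, the score $y_n w^T x_n$ (which is negative exactly when we are wrong) increases, since $y_n^2 = 1$ (and $x_n \neq 0$):

$$ y_n w_{t+1}^T x_n = y_n (w_t + y_n x_n)^T x_n = y_n w_t^T x_n + \|x_n\|^2 > y_n w_t^T x_n $$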

But one problem comes up: the order in which we correct the mistakes. By default, after we correct one mistake, we continue with the next observation rather than starting over from the first one.

R implementation

Here is the R code I wrote:

library(ggplot2)

plaAnalysis <- function(dataset, eta = 1) {
  ## preprocessing the basic attributes of the data
  data_collen <- dim(dataset)[2] - 1  # number of input dimensions
  data_len <- dim(dataset)[1]         # number of observations
  w <- rep(0, data_collen)            # initialize w as the zero vector
  num_mis <- 1                        # mistake indicator
  num_update <- 0                     # track the number of updates
  # the main loop: keep passing over the data until a pass finds no mistake
  while (num_mis == 1) {
    num_mis <- 0  # set to zero so the loop ends if no mistake is found
    for (i in 1:data_len) {
      x_here <- unlist(dataset[i, 1:data_collen])
      y_here <- as.numeric(dataset[i, data_collen + 1])
      test <- planeSide(data_collen, x_here, w)
      if (signx(test) != y_here) {
        w <- w + x_here * y_here * eta  # update w
        names(w) <- NULL
        num_mis <- 1  # a mistake was found, so we need another pass
        num_update <- num_update + 1
        # draw the current boundary: the line through p_norm with normal w
        # (p_norm is the midpoint I define; more explanation below)
        w_data <- data.frame(x = c(3, 1),
                             y = c(p_norm[2] + (p_norm[1] - 3) * w[1] / w[2],
                                   p_norm[2] + (p_norm[1] - 1) * w[1] / w[2]))
        myplot <- ggplot(dataset, aes(x = x, y = y)) +
          geom_point(aes(color = factor(z))) +
          geom_line(data = w_data, aes(x = x, y = y))
        print(myplot)
      }
    }
  }
  return(list(num_update = num_update, w = w))
}

# sign() that treats an exact 0 as a mistake by mapping it to -1
signx <- function(x) {
  ifelse(x == 0, -1, sign(x))
}

# This function determines which side of the plane a point belongs to;
# the plane passes through the global p_norm with normal vector w
planeSide <- function(n, x_here, w) {
  result <- 0
  for (i in 1:n) {
    result <- (x_here[i] - p_norm[i]) * w[i] + result
  }
  return(result)
}

## This only applies if we want to try random orders multiple times
# randomPla <- function(dataset, n) {
#   total <- 0
#   for (i in 1:n) {
#     order <- sample(nrow(dataset))
#     dataset_ran <- dataset[order, ]
#     result <- plaAnalysis(dataset_ran, eta = 0.5)
#     total <- total + result$num_update
#   }
#   return(total / n)
# }

print(plaAnalysis(plaData))

My thoughts

But now comes the confusion that puzzled me the most. We have already given the weight an initial value, but it only serves as a normal vector. To pin down the plane (whether a line, a 3D plane, or a higher-dimensional hyperplane), we need a point together with the normal vector to locate its position. The formula determining a plane from a normal vector and a point is (illustrated in 3D):

$$ (x - x_0)\,a + (y - y_0)\,b + (z - z_0)\,c = 0 $$

where $(x_0, y_0, z_0)$ is the point and $(a, b, c)$ is the normal vector. If we simply leave the point blank, the point we actually use is $(0, 0, 0)$. But there are circumstances where starting from the zero point gives us no answer. A picture of this situation is shown in the example part.
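
As a quick sanity check of this formula in R (the function planeValue and its numbers are made up for illustration; it is the same computation that planeSide performs above with p_norm):

# A plane through p0 = (1, 1, 1) with normal n = (0, 0, 1), i.e. the
# plane z = 1; the sign of the result tells us which side a point is on.
planeValue <- function(pt, p0, n) sum((pt - p0) * n)

planeValue(c(5, -2, 3), c(1, 1, 1), c(0, 0, 1))  # 2 > 0: above the plane
planeValue(c(5, -2, 0), c(1, 1, 1), c(0, 0, 1))  # -1 < 0: below the plane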

So I think the safest way is to find two points with different outputs, and then take their midpoint as the initial point.
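
Here is a sketch of that idea, assuming the label sits in the last column of the dataset as in my code above (pickAnchor is a hypothetical name):

# Pick the first pair of observations with different labels and take
# their midpoint as the anchor point p_norm for the separating plane.
pickAnchor <- function(dataset) {
  d <- ncol(dataset) - 1
  labels <- dataset[, d + 1]
  i <- which(labels == 1)[1]
  j <- which(labels == -1)[1]
  mid <- (unlist(dataset[i, 1:d]) + unlist(dataset[j, 1:d])) / 2
  unname(mid)
}

p_norm <- pickAnchor(plaData)  # midpoint of rows 1 and 2 for the data below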

Example

Here I use rnorm() to generate five observations, and then generate corresponding labels z that make them separable.

> plaData
#          x          y  z
# 1 2.742720 -2.4765163  1
# 2 1.374764 -0.3865385 -1
# 3 2.719462  0.9260654  1
# 4 2.595238 -3.2343898  1
# 5 1.503082 -0.4721146 -1

And the p_norm vector I chose is the midpoint of the first two observations.

> p_norm
# [1] 2.058742 -1.431527

Then PLA updates three times when the initial weight vector is the zero vector:

# $num_update
# [1] 3
# $w
# [1] 8.1816441 -0.6243856

PLA1

PLA2

PLA3

What if we start with the zero point as the initial point? You can draw it yourself; the result is that PLA runs indefinitely, because no line through the origin separates this dataset, so the algorithm never halts.

Pocket

Even if we run PLA correctly, we may still fail to find a perfect classifier, because the dataset may not be linearly separable. In this circumstance, we might want a weight vector that minimizes the proportion of errors. But that problem has been proved to be NP-hard, so the practical way is to run the update for a fixed number of times, say 100 updates, calculate the error rate after each one, and keep the best weight vector found so far in our “pocket”, checking again and again…
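
Here is a minimal sketch of this pocket idea in R, assuming a feature matrix X and labels y of ±1; the name pocketPla and the random choice of which mistake to fix are my own simplifications, not the course's exact formulation:

# Pocket PLA sketch: run PLA updates for a fixed budget and keep the
# weight vector with the lowest error rate seen so far "in the pocket".
# Uses the bias trick (x_0 = 1) instead of p_norm for brevity.
pocketPla <- function(X, y, max_updates = 100) {
  X <- cbind(1, as.matrix(X))               # prepend the constant column
  w <- rep(0, ncol(X))
  errRate <- function(w) mean(sign(X %*% w) != y)
  best_w <- w
  best_err <- errRate(w)
  for (t in 1:max_updates) {
    wrong <- which(sign(X %*% w) != y)      # indices of current mistakes
    if (length(wrong) == 0) break           # separable and solved: stop
    n <- wrong[sample.int(length(wrong), 1)]  # pick one mistake at random
    w <- w + y[n] * X[n, ]                  # the usual PLA update
    err <- errRate(w)
    if (err < best_err) {                   # better than the pocketed w?
      best_err <- err
      best_w <- w
    }
  }
  list(w = best_w, error_rate = best_err)
}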