In this post we will learn about the simplest form of artificial neural network, the perceptron, also called the single-layer perceptron. We will see that a single neuron can act as a linear classifier. We will first implement it in Python by processing each data sample separately, and then write a vectorized implementation of the same algorithm. Vectorizing saves us an extra for loop and gives us a sense of how vectorized equations work, which is very helpful in machine learning. We will build the theory and also implement the same algorithm in TensorFlow, so that we can relate a TensorFlow program to a plain Python script that does the same job. We will also learn a little bit about the types of non-linearities used in the neural network literature. If you are busy or would rather not go through the whole post, you can check out the full code in the GitHub repository.

**Acknowledgements**: Thanks to Professor Odelia Schwartz and Dr. Luis G Sanchez Giraldo for the wonderful class lecture (CSC688, Spring 2016, University of Miami) and lab assignment on Perceptron. This blog post follows both the lecture note and the lab assignment very closely.

### Single-layer perceptron

Rosenblatt’s single-layer perceptron is one of the earliest models for learning. Our goal is to find a linear decision function parametrized by the weight vector `W` and bias parameter `b`. Learning occurs by making small adjustments to these parameters every time the predicted label $\hat{y}$ mismatches the true label $y$ of an input data point $x$. In particular, the perceptron implements the following function to solve the classification problem:

$$\hat{y} = f\left(\sum_{i=1}^{d} w_i x_i + b\right),$$

where $f$ is a non-linear function and `d` is the number of features or the dimensionality of the problem. As we can see, changing the values of `W` and `b` gives us different functions, and thus it defines a collection of functions. The goal of learning is to find the `W` and `b` that “best capture” the input-output relation.
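
As a rough sketch, the function described above (a weighted sum of the inputs plus a bias, passed through a non-linearity) could be written as follows. The sigmoid is assumed for `f`, and the weights and bias are hand-picked, hypothetical values rather than learned ones; the names `sigmoid` and `perceptron_output` are introduced here only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, assumed here as the non-linear function f."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_output(x, w, b):
    """Compute f(sum_i w_i * x_i + b) for a single input vector x."""
    return sigmoid(np.dot(w, x) + b)

# Hypothetical parameters, chosen by hand only for illustration.
w = np.array([1.0, 1.0])
b = -1.5

for point in ([0, 0], [1, 0], [0, 1], [1, 1]):
    x = np.array(point, dtype=float)
    print(point, perceptron_output(x, w, b))
```

With these particular values only the input `[1, 1]` produces an output above 0.5. Changing `w` and `b` changes which inputs end up on which side, which is what "a collection of functions" means here.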

The concept of the artificial neural network was originally inspired by the goal of modeling neurons in the brain. It has since become more of an engineering and computer science topic, and has shown promising results in large-scale visual recognition and machine learning tasks. I do not plan to go into much detail on the biology or on comparing biological neurons with the perceptron, because it is not strictly necessary to know how biological neurons work in order to understand the working principle of a perceptron. Just for the sake of understanding, the following pictures give us an idea of what a neuron does. The first one is a cartoon of a biological neuron,

and below it is a perceptron model that loosely mimics the biological neuron above. Images are taken from Stanford’s CS231n lecture by Andrej Karpathy.

We see that each neuron/perceptron computes a dot product between its inputs and its weights, adds the bias, and then applies the non-linearity $f$, which is the *sigmoid* in this case. This non-linear function is also called the activation function.

As for the above figure, the total input is $z = \sum_{i=1}^{N} w_i x_i$, where `N` is the total number of inputs. The class prediction depends on whether the activation of a particular sample results in an output *f(z)* that is greater than a predefined threshold. For simplicity, this threshold is included in the formula as $z = \sum_{i=1}^{N} w_i x_i + b$, as we see in the above figure, where the threshold `b` is also called the bias. To make it even more general, sometimes `b` is replaced with $w_0$ and an $x_0 = 1$ is multiplied with it, so that the formula finally looks like $z = \sum_{i=0}^{N} w_i x_i$. Graphically, it looks something like the following figure, where we have two features in the data, $x_1$ and $x_2$. As you can see, for higher-dimensional data, the separating line will be a hyperplane.
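
The trick of folding the bias into the weights can be checked numerically. The sketch below assumes hypothetical values for the weights and bias; it prepends a constant feature of 1 to every sample and shows that the weighted sum is unchanged.

```python
import numpy as np

# 4 samples with 2 features each
x = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=np.float32)

# Hypothetical parameters, for illustration only.
w = np.array([2.0, -1.0], dtype=np.float32)
b = 0.5

# Explicit bias: z = x . w + b
z_explicit = x.dot(w) + b

# Absorbed bias: prepend a constant feature x_0 = 1 and a weight w_0 = b.
x_aug = np.hstack([np.ones((x.shape[0], 1), dtype=np.float32), x])  # 4x3
w_aug = np.concatenate([[b], w]).astype(np.float32)                 # 3 entries

z_absorbed = x_aug.dot(w_aug)

print(z_explicit)
print(z_absorbed)
```

Both printed arrays should agree, which is why many texts drop the separate bias term and treat it as one more weight.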

### Perceptron as a simple linear classifier

The following example shows the perceptron as a simple linear classifier. Let’s consider a very simple operation like `AND` and go over it step by step.

    import numpy as np
    import tensorflow as tf
    import matplotlib.pyplot as plt

Following are the global variables.

    NUM_FEATURES = 2
    NUM_ITER = 2000
    learning_rate = 0.01

We have two features in the data, each taking the value 0 or 1. `0.01` is a commonly used value for the learning rate. The learning rate indicates how quickly the model should learn, that is, how quickly it abandons old beliefs and replaces them with new ones. It is one of the most important and most discussed hyperparameters in machine learning. A larger learning rate means that the perceptron changes its mind more rapidly. There is a solid mathematical derivation of why and how it works, which is not part of today’s discussion. Instead, let’s see an intuitive example of how the learning rate works.

Let’s think of a cute little child who has never seen a cat before, and how she learns to recognize one. Let’s show her 10 examples of cats, all of which have orange fur. Now she will think that cats have orange fur in general. So whenever she sees an animal, she will look for orange fur to decide whether it is a cat. Now she sees a black cat and her parents tell her it’s a cat (supervised learning). With a large learning rate, she will quickly realize that “orange fur” is not the most important feature of cats. But with a small learning rate, she will think that this black cat is an outlier and that cats are still orange.
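
To make the analogy a bit more concrete, here is a minimal sketch that trains the sigmoid neuron and `AND` data used later in this post with a few different learning rates and reports the loss after a fixed number of iterations. The helper name `train_and_loss`, the 500 iterations and the particular learning rates are assumptions made for this illustration.

```python
import numpy as np

def train_and_loss(learning_rate, num_iter=500):
    """Train a sigmoid neuron on AND; return the final mean cross-entropy loss."""
    x = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=np.float64)
    y = np.array([0, 0, 1, 0], dtype=np.float64)
    w = np.zeros(2)
    b = 0.0
    for _ in range(num_iter):
        y_hat = 1.0 / (1.0 + np.exp(-(x.dot(w) + b)))
        err = y - y_hat
        w = w + learning_rate * x.T.dot(err)   # update from all samples at once
        b = b + learning_rate * np.sum(err)
    y_hat = 1.0 / (1.0 + np.exp(-(x.dot(w) + b)))
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

losses = {lr: train_and_loss(lr) for lr in (0.001, 0.01, 0.1, 0.5)}
for lr, loss in losses.items():
    print(f"learning_rate={lr}: loss={loss:.4f}")
```

On this tiny problem the larger rates reach a lower loss in the same number of iterations. That is not a general rule: on other problems a rate that is too large can overshoot and make training unstable.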

Let’s get down to business now.

The following is the Python implementation of the perceptron algorithm (gradient descent on the cross-entropy loss with a logistic sigmoid activation) to learn the above function `f(x)` so that it performs the logical `AND` operation. Strictly speaking, Rosenblatt’s original perceptron uses a hard threshold (step) activation; with a sigmoid activation and cross-entropy loss, the model trained here is equivalent to logistic regression. You can find different approaches to the perceptron training algorithm, but at the end of the day they all perform the same task: linear classification. Labels are 0 and 1, as you can see. `N` is the number of samples in the dataset, `y` is the actual/expected output and $\hat{y}$ (`yHat` in the code) is the predicted output. We will see the implementation of the same algorithm in TensorFlow in the next section.

As we have mentioned before, `f` is a non-linear function. It is applied to the linear combination (or vector dot product) of the inputs $x_i$, each multiplied by the corresponding weight coefficient $w_i$. Non-linearity allows neural networks to learn complex functions. Various types of non-linearity are out there. Here we use the sigmoid function

$$f(z) = \frac{1}{1 + e^{-z}},$$

which takes any real-valued input and squashes it into the interval (0, 1), so that the output can be interpreted as a probability. Since the probabilities of the two classes, $f(z)$ and $1 - f(z)$, sum to 1, a cross-entropy loss is usually formulated over sigmoid activations (softmax plays the same role when there are more than two classes). You can play around with other standard non-linearities used in neural networks. Another personal favorite of mine is ReLU, `f(x) = max(0, x)`, which replaces negative responses with zeros. I plan to dedicate a whole post to the various activation functions in deep learning, with their specific contributions and limitations.
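
For a quick feel of how the two activation functions mentioned above behave, the following sketch evaluates both on a handful of sample inputs. The function names and the input values are chosen here only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Keep positive values and replace negative values with zero."""
    return np.maximum(0.0, z)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print("z:      ", z)
print("sigmoid:", np.round(sigmoid(z), 3))
print("relu:   ", relu(z))
```

Note that the sigmoid saturates for large positive or negative inputs, while ReLU grows without bound on the positive side.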

Data `x` has size $4 \times 2$, where 4 is the *number of samples* and 2 is the *number of features*. The number of features is sometimes also called the dimensionality `d` of the data. The weight vector `W` always has as many entries as the dimensionality of the data.

    x = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], np.float32)  # 4x2, input
    y = np.array([0, 0, 1, 0], np.float32)  # 4, correct output, AND operation
    # y = np.array([0, 1, 1, 1], np.float32)  # OR operation

    W = np.zeros(NUM_FEATURES, np.float32)  # 2, weights
    b = np.zeros(1, np.float32)  # 1, bias

    N, d = np.shape(x)  # number of samples and number of features

    # process each sample separately
    for k in range(NUM_ITER):
        for j in range(N):
            yHat_j = x[j, :].dot(W) + b  # linear activation: dot product of two length-2 vectors, plus bias
            yHat_j = 1.0 / (1.0 + np.exp(-yHat_j))  # sigmoid

            err = y[j] - yHat_j  # error term
            deltaW = err * x[j, :]
            deltaB = err
            W = W + learning_rate * deltaW  # if err = y - yHat, then W = W + lRate * deltaW
            b = b + learning_rate * deltaB

    # Now plot the fitted line. We need only two points to plot the line
    plot_x = np.array([np.min(x[:, 0] - 0.2), np.max(x[:, 0] + 0.2)])
    plot_y = -1 / W[1] * (W[0] * plot_x + b)  # from w0*x + w1*y + b = 0, so y = (-1/w1) * (w0*x + b)

    print('W:' + str(W))
    print('b:' + str(b))
    print('plot_y: ' + str(plot_y))

    plt.scatter(x[:, 0], x[:, 1], c=y, s=100, cmap='viridis')
    plt.plot(plot_x, plot_y, color='k', linewidth=2)
    plt.xlim([-0.2, 1.2])
    plt.ylim([-0.2, 1.25])
    plt.show()

Here we process each sample separately in each iteration. The model computes the error `y - yHat` each time to correct the prediction. You can see from the updates of `W` and `b` that the update rule applies a positive or negative adjustment to the parameters, so that, next time, `yHat` becomes closer to `y`.
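
For readers wondering where the `err * x` update comes from, here is a short derivation sketch. It assumes the per-sample cross-entropy loss, written $L$, and a learning rate written $\eta$; both symbols are introduced here for illustration, and $\hat{y}$ corresponds to `yHat` in the code.

$$
\begin{aligned}
z &= \sum_{i=1}^{d} w_i x_i + b, \qquad \hat{y} = \frac{1}{1 + e^{-z}}, \\
L &= -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right], \\
\frac{\partial L}{\partial w_i} &= (\hat{y} - y)\, x_i, \qquad \frac{\partial L}{\partial b} = \hat{y} - y, \\
w_i &\leftarrow w_i + \eta\, (y - \hat{y})\, x_i, \qquad b \leftarrow b + \eta\, (y - \hat{y}).
\end{aligned}
$$

The last line is one gradient descent step on $L$. It has the same form as `W = W + learning_rate * deltaW` and `b = b + learning_rate * deltaB` in the code, with `err` playing the role of $y - \hat{y}$.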

The formula for `plot_y` comes from $w_0 x + w_1 y + b = 0$, which gives $y = -\frac{1}{w_1}(w_0 x + b)$.

We then see the following output for the weights, the bias and the corresponding `y` values used to plot the line. We refer to the errors made by the prediction as delta. Using the learning rate, the perceptron adjusts the weights and bias so that, next time, the prediction becomes a little bit better. This way, after enough iterations, it reaches a reasonable solution (your `W` and `b` values might be different).

`W:[ 5.60070666 5.58967588]`

`b:[-8.56957404]`

`plot_y: [ 1.73350219 0.3307394 ]`
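
As a sanity check, the printed values above can be plugged back into the model. The sketch below copies `W` and `b` from that output, recomputes the two `plot_y` values, and thresholds the sigmoid output at 0.5 to get class labels; the 0.5 threshold is an assumption of this sketch, not something stated in the post.

```python
import numpy as np

# Values copied from the printed output above.
W = np.array([5.60070666, 5.58967588])
b = -8.56957404

x = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=np.float64)
y = np.array([0, 0, 1, 0])  # AND labels

y_hat = 1.0 / (1.0 + np.exp(-(x.dot(W) + b)))  # predicted probabilities
labels = (y_hat > 0.5).astype(int)             # assumed decision threshold of 0.5

# Two points on the decision boundary w0*x + w1*y + b = 0
plot_x = np.array([-0.2, 1.2])
plot_y = -1 / W[1] * (W[0] * plot_x + b)

print(np.round(y_hat, 3))
print(labels)
print(plot_y)
```

The thresholded labels should match the `AND` truth table, and `plot_y` should agree with the printed values up to rounding.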

We see from the above plot that the single-layer perceptron does a good job of learning the `AND` operation. You can check that it can learn the `OR` operation too by un-commenting the appropriate `y` statement in the code. In fact, you can try any linearly separable dataset just by putting the data in `x` and the labels in `y`. Note that your values of `W` and `b` might differ from these ones, but the plot should look much the same in the end. Different values of `W` and `b` can describe the same line, since scaling them all by a common non-zero factor leaves the decision boundary unchanged.
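
One way to try several label sets without editing the script each time is to wrap the per-sample loop in a function. The sketch below does that for `AND` and `OR`. The function names `train` and `predict`, the 0.5 decision threshold and the learning rate of 0.1 (larger than the 0.01 used in the post) are all assumptions made for this illustration.

```python
import numpy as np

def train(x, y, learning_rate=0.1, num_iter=2000):
    """Per-sample training of a sigmoid neuron, following the loop shown earlier."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(num_iter):
        for x_j, y_j in zip(x, y):
            y_hat_j = 1.0 / (1.0 + np.exp(-(x_j.dot(w) + b)))
            err = y_j - y_hat_j
            w = w + learning_rate * err * x_j
            b = b + learning_rate * err
    return w, b

def predict(x, w, b):
    """Threshold the sigmoid output at 0.5 to get a 0/1 label."""
    return (1.0 / (1.0 + np.exp(-(x.dot(w) + b))) > 0.5).astype(int)

x = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=np.float64)
targets = {
    "AND": np.array([0, 0, 1, 0]),
    "OR": np.array([0, 1, 1, 1]),
}

results = {}
for name, y in targets.items():
    w, b = train(x, y)
    results[name] = predict(x, w, b)
    print(name, "predicted", results[name], "expected", y)
```

Any other linearly separable dataset could be passed to `train` in the same way, as long as `x` has one row per sample and `y` holds 0/1 labels.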

Instead of processing the samples one by one in every iteration, we can vectorize the formulas and get rid of the extra for loop, making the program run faster. The following implementation is also simpler than the one described above. As you will notice, vectorized versions of equations are very common in the machine learning literature.

    for k in range(NUM_ITER):
        yHat = x.dot(W) + b  # 4, linear activations for all samples
        yHat = 1.0 / (1.0 + np.exp(-yHat))  # sigmoid

        err = y - yHat  # 4, one error per sample

        deltaW = np.transpose(x).dot(err)  # has to have 2 entries, like W
        deltaB = np.sum(err)  # has to be a single value; collects the error from all 4 samples
        W = W + learning_rate * deltaW  # if err = y - yHat, then W = W + lRate * deltaW
        b = b + learning_rate * deltaB

Pay special attention to `deltaW` and `deltaB` here. We transpose the input `x` before multiplying it with the error term, which we didn’t do before. This is a simple linear algebra trick. The easiest way to think about it is in terms of the matrix sizes. For example, the input `x` is $4 \times 2$ and the true output `y` is $4 \times 1$ (and so `err` is $4 \times 1$), and the weights `W` have to be $2 \times 1$ (why?). The only way to get a $2 \times 1$ weight matrix out of `x` and `err` is to transpose `x` and then multiply it with `err`, and that is exactly what we have done here. For `deltaB`, we sum all four error terms, which means that we are collecting the errors made by ALL the samples. If you run this code, you can see that it gives us the same separation of the dataset `x`.
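
The shape argument above can be verified directly. The sketch below uses hypothetical parameter values and checks that, with `W` and `b` held fixed, the transposed matrix product gives the same `deltaW` as summing `err[j] * x[j, :]` over the samples in a loop.

```python
import numpy as np

x = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=np.float64)  # 4x2
y = np.array([0, 0, 1, 0], dtype=np.float64)                      # 4

# Hypothetical current parameters, for illustration only.
W = np.array([0.3, -0.2])
b = 0.1

y_hat = 1.0 / (1.0 + np.exp(-(x.dot(W) + b)))
err = y - y_hat                                                   # 4

# Loop version: accumulate err_j * x_j over all samples.
delta_W_loop = np.zeros(2)
for j in range(x.shape[0]):
    delta_W_loop += err[j] * x[j, :]

# Vectorized version: (2x4) times (4) gives (2).
delta_W_vec = x.T.dot(err)

print(delta_W_loop, delta_W_vec)
print(np.sum(err))  # deltaB: total error over all samples
```

One difference from the earlier per-sample loop is worth keeping in mind: that loop updates `W` and `b` after every sample, while the vectorized version computes all errors first and updates once per iteration, so the two training runs need not end at exactly the same numbers.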

### Perceptron using TensorFlow

Now, let’s implement the same perceptron algorithm using TensorFlow. This will allow us to see how TensorFlow approaches the problem differently. To me, TensorFlow seemed a bit trickier at first, but it is more robust once you get the hang of it, especially when the scale of the problem gets bigger.

Following is the code for the same vectorized version of the perceptron, written against the TensorFlow 1.x API (`tf.placeholder`, `tf.Session`). Data `x` and labels `y` are defined the same as before. In TensorFlow, instead of using `x` and `y` directly, we define placeholders in the graph. We then feed the values to these placeholders at the time of computation, when we run the graph in a session.

    x = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], np.float32)  # 4x2, input
    y = np.array([0, 0, 1, 0], np.float32)  # 4, correct output, AND operation
    # y = np.array([0, 1, 1, 1], np.float32)  # OR operation
    y = np.reshape(y, [4, 1])  # convert to 4x1

    X = tf.placeholder(tf.float32, shape=[4, 2])
    Y = tf.placeholder(tf.float32, shape=[4, 1])

    # dtype is passed by keyword: the second positional argument of tf.Variable is `trainable`
    W = tf.Variable(tf.zeros([NUM_FEATURES, 1]), dtype=tf.float32)
    B = tf.Variable(tf.zeros([1, 1]), dtype=tf.float32)

    yHat = tf.sigmoid(tf.add(tf.matmul(X, W), B))  # 4x1
    err = Y - yHat
    deltaW = tf.matmul(tf.transpose(X), err)  # has to be 2x1
    deltaB = tf.reduce_sum(err, 0)  # sum the errors of all 4 samples; broadcasts against the 1x1 bias
    W_ = W + learning_rate * deltaW
    B_ = B + learning_rate * deltaB

    step = tf.group(W.assign(W_), B.assign(B_))  # to update the values of weights and biases

    sess = tf.Session()
    init = tf.global_variables_initializer()
    sess.run(init)

    for k in range(NUM_ITER):
        sess.run([step], feed_dict={X: x, Y: y})

    W = np.squeeze(sess.run(W))
    b = np.squeeze(sess.run(B))

    # Now plot the fitted line. We need only two points to plot the line
    plot_x = np.array([np.min(x[:, 0] - 0.2), np.max(x[:, 0] + 0.2)])
    plot_y = -1 / W[1] * (W[0] * plot_x + b)
    plot_y = np.reshape(plot_y, [2, -1])
    plot_y = np.squeeze(plot_y)

    print('W: ' + str(W))
    print('b: ' + str(b))
    print('plot_y: ' + str(plot_y))

    plt.scatter(x[:, 0], x[:, 1], c=y.ravel(), s=100, cmap='viridis')  # ravel: y is 4x1 here, c expects a flat array
    plt.plot(plot_x, plot_y, color='k', linewidth=2)
    plt.xlim([-0.2, 1.2])
    plt.ylim([-0.2, 1.25])
    plt.show()

Weight `W` and bias `B` are declared as variables, because we train them in the program. The TensorFlow function `Variable()` has a parameter called `trainable=`, which by default is `True`. Setting it to `True` puts the variable in the graph collection `GraphKeys.TRAINABLE_VARIABLES`, which is the default list of variables used by the optimizer classes. We used the same formulas as before with slight modifications. To keep the original values of `W` and `B` while adding the update from the current iteration, we compute the temporary tensors `W_` and `B_`, and after each iteration we assign them back to the actual variables `W` and `B`. Let’s plot the solution for both the `AND` and `OR` operations.

We see that our simple single-layer perceptron model does a pretty good job of fitting the logical `AND` and `OR` operations. The full code with the necessary comments can be found in my GitHub repository.

### Discussion

Congrats if you made it to this point! Now we know the basic building block of an artificial neural network, the perceptron. The perceptron is a simple but useful model of computation. As homework, you can try this algorithm on a different dataset and see how the model performs. An easy extension would be to try it on the Iris flower dataset. You can also try different values of the learning rate and the number of iterations, and see how the perceptron behaves.

Though the perceptron is cool, its capability is very limited. A single-layer perceptron is not capable of learning even moderately complex operations such as `XOR`, which we plan to discuss in the next post. We will see why more than one layer is needed to represent the `XOR` operation, and we will develop an end-to-end `XOR` gate in the next post.
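
As a small preview of that limitation, the sketch below runs the vectorized update rule from this post on `XOR` labels. The learning rate of 0.1 and the 0.5 decision threshold are assumptions of this sketch. Because no straight line separates the `XOR` classes, the single neuron cannot label all four points correctly, whatever values `W` and `b` take.

```python
import numpy as np

x = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=np.float64)
y = np.array([0, 1, 0, 1], dtype=np.float64)  # XOR labels

W = np.zeros(2)
b = 0.0
learning_rate = 0.1  # assumed value, not taken from the post

for _ in range(2000):
    y_hat = 1.0 / (1.0 + np.exp(-(x.dot(W) + b)))
    err = y - y_hat
    W = W + learning_rate * x.T.dot(err)
    b = b + learning_rate * np.sum(err)

y_hat = 1.0 / (1.0 + np.exp(-(x.dot(W) + b)))
pred = (y_hat > 0.5).astype(int)  # assumed decision threshold of 0.5
accuracy = np.mean(pred == y)

print("W:", W, "b:", b)
print("predictions:", pred, "accuracy:", accuracy)
```

Comparing the printed predictions with the `XOR` labels shows at least one mistake, which is the motivation for adding a hidden layer.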

Nice blog. I think `plt.scatter(x[:, 0], x[:, 1], c=y, s=100, cmap='viridis')` should be replaced by `plt.scatter(x[:, 0], x[:, 1], c=y.ravel(), s=100, cmap='viridis')`


I don’t mean to be nitpicky, but what you did was logistic regression, not the perceptron learning algorithm. Logistic regression uses 1 / (1 + e^-(w*x)) as its activation function, whereas the perceptron uses: if w*x < 0 then -1 else 1. This makes a pretty big difference in functionality. On top of that, the perceptron uses a deterministic activation function that outputs either one class or the other, whereas the sigmoid activation provides a probabilistic result.


Yes, this is logistic regression. Not perceptron.
