This is my final project for the Computer Vision class taught by Rob Fergus in Fall 2016. I got the project of implementing the compression ideas of Song Han presented in the paper Learning both Weights and Connections for Efficient Neural Networks. I start with a brief summary of the report and my work to ease the reading.
In short, I altered the torch.nn.Linear and torch.nn.SpatialConvolution modules to mask connections and enable straightforward training. I also made a Pull Request to the project (#1073, CUDA support) and enjoyed it. Song Han starts the paper with a focus on the energy consumption of neural networks and motivates the need for compressing them by pointing out that a smaller network would fit in memory and therefore consume less energy. He first talks about the Related Work.
To prune the network, the importance of each weight is learned with a training method that is not explicitly described. After this learning step, connections whose importance weights are below a certain threshold are removed. Then the network is retrained, and this retraining step is crucial.
Caffe is used, and a mask is implemented over the weights so that the masked-out parameters are disregarded.
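The core operation, independent of the framework, is just an element-wise mask. A minimal sketch in Torch (my wording, not the paper's Caffe code; `layer` and `threshold` stand for any layer with a `.weight` field and a chosen cutoff):

```lua
-- Minimal sketch of threshold pruning with a mask (not the paper's Caffe code).
local weight = layer.weight                    -- e.g. an nn.Linear weight tensor
local mask = torch.abs(weight):gt(threshold)   -- binary mask: 1 = keep, 0 = prune
weight:cmul(mask:typeAs(weight))               -- zero out the pruned connections
```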
There is still one point that is not clear to me: how exactly the pruning interacts with L1 or L2 regularization. I need to think about this. But basically, this section shows that iterative pruning with L2 regularization gave the best results, and that one needs to prune different regions separately, because FC layers are more prunable than convolutional ones.
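For my own reference, the two regularizers in question are the standard penalty terms added to the training loss (this is textbook material, not something specific to the paper):

$$
E_{L1}(W) = E(W) + \lambda \sum_i |w_i|, \qquad
E_{L2}(W) = E(W) + \lambda \sum_i w_i^2 .
$$

L1 tends to push many weights exactly to zero, while L2 shrinks all of them smoothly, which is presumably why their effect differs before and after retraining.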
Yann LeCun's pruning paper emphasizes the importance of pruning as a regularizer and a performance optimizer. The idea of deleting parameters with small saliency is proposed. The magnitude of the weights had been proposed as a simple measure of saliency in the earlier literature, and its similarity to weight decay is mentioned. This paper proposes a better, more accurate measure of saliency.
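If I remember the paper right, the proposed saliency (Optimal Brain Damage) comes from a diagonal second-order approximation of the change in error when a single weight is removed:

$$
s_k = \frac{1}{2} h_{kk} w_k^2, \qquad h_{kk} = \frac{\partial^2 E}{\partial w_k^2},
$$

so a weight can be large in magnitude and still be unimportant if the error surface is flat along it.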
This paper focuses on transfer-learning tasks where models trained on ImageNet are transferred to solve smaller classification problems. One significant difference is that, instead of pruning individual weights, whole neurons are pruned.
It always feels good to read old papers. I visited this paper to learn more about weight decay and its connection to the bias function (regularizer). They reported that sparser connections are achieved as a result of applying an exponential bias.
This paper proposes around a 2x speedup in the convolutional layers by deriving low-rank approximations of the filters, and a 5-10x parameter reduction in the fully connected layers. The motivation of the paper is based on the findings of Denil et al. regarding the redundancies in network parameters.
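To make the parameter-reduction argument concrete (my own back-of-the-envelope, not the paper's exact construction): a fully connected layer with weight $W \in \mathbb{R}^{m \times n}$ can be replaced by a rank-$k$ factorization

$$
W \approx U V, \qquad U \in \mathbb{R}^{m \times k}, \; V \in \mathbb{R}^{k \times n},
$$

which cuts the parameter count from $mn$ to $k(m+n)$; taking $U$ and $V$ from the truncated SVD of $W$ minimizes the approximation error in the Frobenius norm.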
The main paper, published at ICLR 2016, combines the pruning idea with other methods like quantization and Huffman coding.
I started by training my first network, LeNet-5, on HPC and got a test error of 0.96% in 30 epochs with the default training parameters. It occupies around 5MB and has 313k parameters. My goal is to get a 10x compression in size following the three methods outlined in the paper. The parameter breakdown is below:
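(For reference, a per-layer count can be produced with a short snippet like the following; `model` is assumed to hold the trained LeNet-5.)

```lua
-- Sketch: print the number of parameters in each layer that has weights.
require 'nn'

local function parameterBreakdown(model)
   for _, layer in ipairs(model:listModules()) do
      if layer.weight then
         local n = layer.weight:nElement()
         if layer.bias then n = n + layer.bias:nElement() end
         print(torch.type(layer), n)
      end
   end
end

parameterBreakdown(model)
```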
I wanted to implement every part in Torch. After diving in, I realized this might be a hard task. The reason is basically that there is no sparse-tensor implementation, so no space is gained by making the weight matrices (connections) sparse. After struggling a bit, I decided to aim for an encoding and decoding method, because implementing sparse tensors and all the required operations is another project by itself, I believe. Layers like SpatialConvolution and Linear are implemented for speed, and their source code is not that easy to understand and modify. Therefore I decided to use full weight matrices throughout my experiments and to represent connectivity by the non-zero weights.
First I started with the Pruner module. After a couple of iterations I decided to initialize the Pruner module with a setVariables call, which takes a model, a pruner (mask-generating) function, a trainer, a tester and the relevant torchnet engine. With these parameters the Pruner module has full power to re-train and test the model. After initialization one is ready to prune the network. A Pruner:prune call takes a mask-generating function (there were two implementations, Pruner.maskThreshold and Pruner.maskPercentage, at the beginning), a table of layer ids and a table of mask-generating function parameters (either thresholds or percentages in this case). Basically, prune uses the layer ids to get the weight tensor of each target layer. Since this is development code there are no type checks, and the provided id should be a valid one (a layer with a .weight field, like nn.SpatialConvolution or nn.Linear). A mask is then generated by calling the provided function with the provided parameters and the selected weight tensor. The result is a binary mask with the same size as the weight tensor. The mask is saved in each layer and the resulting model is tested. prune repeats this process for each layer id and returns the percentage of retained connections for each layer and the test accuracy.
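A simplified sketch of the idea behind the two mask generators (illustrative, not the exact code in the repo):

```lua
-- Illustrative sketches of the two mask generators (not the exact repo code).

-- Keep connections whose absolute weight is above a fixed threshold.
local function maskThreshold(weight, threshold)
   return torch.abs(weight):gt(threshold)             -- ByteTensor: 1 = keep, 0 = prune
end

-- Prune the fraction p of connections with the smallest magnitudes.
local function maskPercentage(weight, p)
   local magnitudes = torch.abs(weight):view(-1)
   local sorted = torch.sort(magnitudes)               -- ascending order
   local cutIndex = math.floor(p * sorted:nElement())
   if cutIndex < 1 then
      return torch.ByteTensor(weight:size()):fill(1)   -- nothing to prune
   end
   return torch.abs(weight):gt(sorted[cutIndex])
end
```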
After pruning, one can call the Pruner:reTrain function with nEpochs to retrain the network. The test accuracy after retraining is returned.
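A typical session then looks roughly like this; the constructor and the layer ids below are illustrative shorthand, not necessarily the exact signatures:

```lua
-- Hypothetical usage sketch; argument names and layer ids are illustrative.
local pruner = Pruner()
pruner:setVariables(model, Pruner.maskPercentage, trainer, tester, engine)

-- Prune 50% of layer 1 and 70% of layer 4, then retrain for 10 epochs.
local retained, accuracy = pruner:prune(Pruner.maskPercentage, {1, 4}, {0.5, 0.7})
local newAccuracy = pruner:reTrain(10)
```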
I played with this and got some initial results by just masking according to the absolute value of the weights, obtaining similar, sometimes better, results with around 50% of each layer pruned, without any retraining. The individual sensitivity of each layer is shown below.
*(Figures: pruning sensitivity curves for conv1, fcc1, fcc3 and for conv2, fcc2.)*
Then I implemented retraining. To do that, after going through several possibilities like defining new nn modules or torchnet.engine hooks, I decided to alter the nn.Linear and nn.SpatialConvolution code and implement the pruning via masking directly. This choice let me prune the layers just by adding the binary mask and retrain them properly.
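The core of the change is small. Roughly (a sketch of the idea, not the exact patch), the mask is applied to the weights before the forward pass and to the weight gradients after the backward pass, so pruned connections stay at zero during retraining:

```lua
-- Sketch of the masking idea inside a layer's update path (not the exact patch).
-- self.mask is a binary tensor with the same shape as self.weight.

-- Before the forward pass: keep pruned weights at zero.
if self.mask then
   self.weight:cmul(self.mask:typeAs(self.weight))
end

-- After accGradParameters: stop gradients from reviving pruned connections.
if self.mask then
   self.gradWeight:cmul(self.mask:typeAs(self.gradWeight))
end
```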
I didn't need to use CUDA for the other homeworks, but this time I wanted to learn how to do it and see the difference. I realized that it is pretty straightforward: a generic function isCuda(inp), which calls inp:cuda() if the cuda flag is provided, does the necessary work. I got a 2x speedup on the LeNet-5 model compared to its multithreaded CPU version.
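The helper is essentially the following sketch; the flag name (`opt.cuda`) and the model constructor are placeholders for whatever the training script defines:

```lua
-- Sketch of the helper; opt.cuda and buildLeNet5 are placeholder names.
require 'nn'

local function isCuda(inp)
   if opt.cuda then
      require 'cunn'       -- also pulls in cutorch
      return inp:cuda()
   end
   return inp
end

-- Usage: wrap anything that may need to live on the GPU.
local model = isCuda(buildLeNet5())
local criterion = isCuda(nn.ClassNLLCriterion())
```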
There are two main methods proposed in the literature as pruning metrics.
An important point to make here is that all of the methods above try to approximate or optimize the L