Implementation of NetAdapt

Environment

This code is based on PyTorch (version 1.0.1). However, since PyTorch does not support 32-bit architectures and the goal I had in mind when implementing the paper was efficient inference on a Raspberry Pi, I used TensorFlow Lite to compute the performance tables instead (compute_table.py). This means that two environments are needed to run the project: one with PyTorch and scikit-learn and one with TensorFlow 1.13.1 (or a single environment with both).
The code uses CIFAR-10 as a dataset.

Performance tables

The performance tables are built in two steps. First, all the .tflite files are generated on a desktop computer. This takes approximately one day (without multiprocessing) and produces a few gigabytes' worth of .tflite files. The script does not use multiprocessing on its own, but it offers one argument to specify how many instances are launched in parallel and another for the offset of the current instance, so that each instance only does a fraction of the work (see the sketch below).
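As an illustration, the offset mechanism amounts to a strided split of the work items; this is only a sketch with made-up names, not the actual code of compute_table.py:

# Hypothetical sketch of the --num_process / --offset_process split.
# "configs" stands for the list of pruned-network variants to convert to .tflite files.
configs = [f"variant_{i}" for i in range(10)]   # placeholder work items
num_process, offset_process = 4, 1              # values passed on the command line

# each instance handles every num_process-th item, starting at its own offset
for cfg in configs[offset_process::num_process]:
    print("instance", offset_process, "handles", cfg)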

Once this is done, the files can be used to generate measurements on the target device. We found that averaging several measurement runs and smoothing the result gives better tables; see Section 4.5 of this report for an explanation and perf_tables/merge_perf_tables for an example using scipy.
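Purely to illustrate the averaging-and-smoothing idea (the real script is perf_tables/merge_perf_tables; the file names and table layout below are assumptions):

import pickle
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Hypothetical: several measurement runs produced by repeated '--mode=load' calls,
# each assumed to be a dict mapping a layer to an array of latencies per channel count.
runs = []
for name in ["run_0.pickle", "run_1.pickle", "run_2.pickle"]:
    with open(name, "rb") as f:
        runs.append(pickle.load(f))

merged = {}
for layer in runs[0]:
    stacked = np.stack([run[layer] for run in runs])                  # (n_runs, n_channel_choices)
    merged[layer] = gaussian_filter1d(stacked.mean(axis=0), sigma=2)  # average, then smooth

with open("merged_table.pickle", "wb") as f:
    pickle.dump(merged, f)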

To build a table, one would run these commands (4 processes might not be the optimal number)

python compute_table.py --save_file='res-40-2-tf-lite-pred' --mode='save' --output_folder='tflite_files' --num_process=4 --offset_process=0
python compute_table.py --save_file='res-40-2-tf-lite-pred' --mode='save' --output_folder='tflite_files' --num_process=4 --offset_process=1
python compute_table.py --save_file='res-40-2-tf-lite-pred' --mode='save' --output_folder='tflite_files' --num_process=4 --offset_process=2
python compute_table.py --save_file='res-40-2-tf-lite-pred' --mode='save' --output_folder='tflite_files' --num_process=4 --offset_process=3

on the desktop computer/server; compute_table.py uses a WideResNet-40-2 by default.

Then, on the target device (which needs the TensorFlow Lite benchmark binary), run

python compute_table.py --save_file='res-40-2-tf-lite-pred' --mode='load' --output_folder='tflite_files' --benchmark_lite_loc='path_towards_tf_lite_benchmark_binary'

We recommend repeating this step several times and then using a script analogous to perf_tables/merge_perf_tables. perf_tables/res-40-2-tf-lite-raspberry-pi.pickle is an example of the resulting file.

Code

A slight modification w.r.t. the paper is that we do not prune the layer with the smallest error increase but the one with the smallest relative error increase ($\frac{\Delta_{\text{error}}}{|\Delta_{\text{latency}}|}$), as in MobileNetV3.
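In code, this selection boils down to something like the following minimal sketch (the candidate structure and names are illustrative, not taken from prune.py):

# Hypothetical candidates: one per prunable layer, with the error increase and latency
# reduction measured for that layer's pruning proposal in the current NetAdapt iteration.
candidates = [
    {"layer": "block1.conv1", "delta_error": 0.004, "delta_latency": -3.1},
    {"layer": "block2.conv1", "delta_error": 0.002, "delta_latency": -1.2},
]

# keep the proposal with the smallest relative error increase: delta_error / |delta_latency|
best = min(candidates, key=lambda c: c["delta_error"] / abs(c["delta_latency"]))
print("prune", best["layer"])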

First, the base network needs to be built,

python train.py --net='res' --depth=40 --width=2.0 --save_file='res-40-2'

would, for example, train a base WideResNet-40-2 (assuming CIFAR-10 is located in ~/Documents/CIFAR-10).

Then, the network can be pruned, for example

python prune.py --save_file='res-40-2-pruned' --pruning_fact=0.4 --base_model='res-40-2' --perf_table='res-40-2'

would prune 40% of the res-40-2 network, using res-40-2 (assumed to lie in the perf_tables folder) as the performance table.

Limitations

Unfortunately, the network objects used in this script are non-trivial to implement (see model/wideresnet.py and utils/compute_table/wideresnet_table.py, for example); I only built a WideResNet. Networks with skip connections are harder to code because the number of channels must be shared between all the layers connected through skip connections; models without skip connections would be simpler to handle. The sketch below illustrates the constraint.
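A minimal PyTorch sketch of why the constraint exists (this is not the repository's actual model code): the element-wise addition in a residual block forces its two inputs to have the same number of channels, so every layer feeding those additions must be pruned to the same width.

import torch
import torch.nn as nn

class TinyResidualBlock(nn.Module):
    # 'channels' is shared by conv2's output and the shortcut: pruning one of them
    # without the other would break the element-wise addition in forward().
    def __init__(self, channels, mid_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, mid_channels, 3, padding=1)  # inner width can be pruned freely
        self.conv2 = nn.Conv2d(mid_channels, channels, 3, padding=1)  # output width tied to the skip

    def forward(self, x):
        return x + self.conv2(torch.relu(self.conv1(x)))  # shapes must match here

x = torch.randn(1, 16, 8, 8)
print(TinyResidualBlock(16, 8)(x).shape)  # torch.Size([1, 16, 8, 8])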

Results

I implemented NetAdapt as part of my master's thesis; this is only a curated repo. During my thesis, I found that training the network from scratch after pruning is much more interesting than fine-tuning it, as advocated by A Closer Look at Structured Pruning for Neural Network Compression. MobileNetV3 also retrained from scratch after pruning with NetAdapt. After pruning, we save the number of channels pruned per layer, which can be used to retrain WideResNets from scratch. I do not provide files to do so here, but I used this in pytorch and this in keras (tensorflow) to obtain networks retrained from scratch.
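Purely as an illustration of how saved per-layer channel counts could drive a from-scratch retraining (the widths and the toy builder below are assumptions, not the scripts I actually used):

import torch.nn as nn

# Hypothetical per-layer widths, e.g. loaded (via pickle) from the channel counts
# saved after pruning; the values here are made up.
widths = [16, 12, 12, 20]

def build_narrow_stack(widths, in_channels=3):
    # Toy builder: a plain stack of convolutions with the pruned widths; a real
    # WideResNet builder would also recreate the blocks and skip connections.
    layers, prev = [], in_channels
    for w in widths:
        layers += [nn.Conv2d(prev, w, 3, padding=1), nn.ReLU()]
        prev = w
    return nn.Sequential(*layers)

model = build_narrow_stack(widths)  # then train from scratch on CIFAR-10
print(model)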

Unfortunately, the pruning results are somewhat disappointing, at least on CIFAR-10 on a Raspberry Pi 3B with a WideResNet as the base architecture. We do not do better than uniform pruning, and we actually do worse than finding new architectures with a grid search.

More details on this can be found in my master's thesis report, Sections 4.8 and 4.6.
