DIDL 2018 Paper #6 Reviews and Comments
===========================================================================
Paper #6 Parallelized Training of Deep NN - Comparison of Current Concepts and Frameworks


Review #6A
===========================================================================

Overall merit
-------------
3. Weak accept

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
In this paper, the authors describe a performance study of data-parallel training on TensorFlow and MXNet. The results seem to indicate that MXNet shows better performance and scalability than TensorFlow, but it is unclear how far that result generalizes.

Comments for author
-------------------
The study is quite interesting, but not completely new (the authors mention similar studies in different settings). Whenever multiple systems serve the same purpose (in this case, deep learning), the question arises which system performs best, and we can observe countless studies that often reach different conclusions depending on their setup. The given study is performed on small data sets and a limited number of configurations / NN architectures. As such, it provides only a single data point, and it is unclear how far the conclusions drawn from it generalize to other situations. I encourage the authors to extend their study to larger data sets and other neural networks.

The choice of parameter servers in TensorFlow could impact performance and scalability. It is not clear whether the design choice by the authors ("to get similarly low effort for the training setup, we used for each TensorFlow experiment one dedicated parameter server") is really the best for TF performance. The argument of low effort for the training setup is not convincing, as the primary goal of the study is to compare the performance of the frameworks.

In terms of related work, a paper from last year's edition of the DIDL workshop could be cited as a reference for a deeper discussion of model parallelism [1].

Despite the weaknesses of the paper, I think that it will make an interesting contribution to the workshop.

[1] Ruben Mayer, Christian Mayer, and Larissa Laich. 2017. The tensorflow partitioning and scheduling problem: it's the critical path! In Proceedings of the 1st Workshop on Distributed Infrastructures for Deep Learning (DIDL '17). ACM, New York, NY, USA, 1-6. DOI: https://doi.org/10.1145/3154842.3154843


Review #6B
===========================================================================

Overall merit
-------------
3. Weak accept

Reviewer expertise
------------------
2. Some familiarity

Paper summary
-------------
This paper explores distributed deep neural network training and, in particular, compares two frameworks (TensorFlow and MXNet) in this context. The various concepts and limitations of distributed training are presented. The focus of the study is the experimental comparison of centralised vs. decentralised parameter servers (PS). After verifying the assumption that the number of nodes could be lowered if the batch size was reduced, the experiments show that both frameworks scale more effectively when the batch size is high, but that in general the distributed process used by MXNet is more effective than that of TensorFlow.

Comments for author
-------------------
The paper is well written and clearly describes its motivation and results.

How did the authors select the two platforms studied in this paper? TF is the most widely used, so it makes sense, but what about MXNet? PyTorch is the fastest-rising platform, so it might have made more sense; please justify your choice.

Overall, the question that remains is: how much of the scalability originates from the use of a PS (exemplified by TF in this work) vs. the distributed KVStore used by MXNet? Is the difference in the plots down to this aspect alone?
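To make this question concrete, the two mechanisms are configured quite differently. Below is a minimal sketch of each setup (assuming the TF 1.x distributed API and MXNet's kvstore module; host names, ports, and task indices are hypothetical placeholders, not values taken from the paper):

    import tensorflow as tf

    # TensorFlow (1.x): an explicit cluster with one dedicated parameter
    # server, mirroring the setup the authors describe. Hosts/ports below
    # are placeholders.
    cluster = tf.train.ClusterSpec({
        "ps":     ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222",
                   "worker1.example.com:2222"],
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # replica_device_setter places variables on the PS and computation
    # on the workers.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        pass  # model definition goes here

    import mxnet as mx

    # MXNet: the key-value store is sharded across server processes
    # (started via the launcher, which sets the required DMLC_*
    # environment variables); 'dist_sync' requests synchronous updates.
    kv = mx.kvstore.create("dist_sync")

Isolating how much of the observed gap stems from this architectural difference, rather than from other implementation details, would strengthen the comparison.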
Review #6C
===========================================================================

Overall merit
-------------
2. Weak reject

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
The paper compares two commonly used ML frameworks, TensorFlow and MXNet, in terms of their throughput, scalability, and ease of use. It suggests that MXNet is superior in most cases.

Comments for author
-------------------
STRENGTHS

The experimental setup seems rigorous, and it is explained well in the paper.

WEAKNESSES

- The paper has limited novelty. It simply compares the throughput of TensorFlow and MXNet for different neural network models, numbers of workers, and batch sizes. Although the authors claim that MXNet is better in terms of throughput, the reason why this happens is not clear. It would have been good if the authors had provided some broader insights into the desired properties of a deep learning framework. Such insights could be used more widely, going beyond the two specific frameworks considered in this paper.

- Basic concepts like model parallelism and data parallelism need not be explained in such great detail. The authors can instead cite relevant prior papers. The background section is unnecessarily long.

- I think there is significant redundancy in the plots. Since the speed-up is just the ratio of throughput_n and throughput_1, Figure 8 can be inferred directly from Fig 7 and Fig 9 (see the sketch after this list). It would be better to just show Fig 8 and omit Fig 7, or to show a combined plot of the two subfigures in Fig 7 and omit Fig 8. The same comment applies to Fig 9 and Fig 10 as well. I suggest retaining only one of these figures.
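To spell out the redundancy: the speed-up values are fully determined by the throughput measurements, so one of the two plots carries no extra information. A minimal sketch, using hypothetical throughput numbers (not taken from the paper):

    # Hypothetical throughput (images/sec) keyed by number of workers n.
    throughput = {1: 210.0, 2: 400.0, 4: 760.0}

    # Speed-up is just throughput_n / throughput_1, so the speed-up
    # curve can always be recomputed from the throughput curve.
    speedup = {n: t / throughput[1] for n, t in throughput.items()}
    # -> {1: 1.0, 2: ~1.90, 4: ~3.62}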