Heterogeneous model parallelism for deep neural networks

Moreno Álvarez, Sergio; Haut Hurtado, Juan Mario; Paoletti Ávila, Mercedes Eugenia; Rico Gallego, Juan Antonio


Persistent identifier for citing or linking to this item: http://hdl.handle.net/10662/20373


Title:	Heterogeneous model parallelism for deep neural networks
Authors:	Moreno Álvarez, Sergio; Haut Hurtado, Juan Mario; Paoletti Ávila, Mercedes Eugenia; Rico Gallego, Juan Antonio
Keywords:	Deep learning; High performance computing; Distributed training; Heterogeneous platforms; Model parallelism; HPC
Publication date:	2021
Publisher:	Elsevier
Source:	Sergio Moreno-Alvarez, Juan M. Haut, Mercedes E. Paoletti, Juan A. Rico-Gallego, Heterogeneous model parallelism for deep neural networks, Neurocomputing, Volume 441, 2021, Pages 1-12, ISSN 0925-2312, https://doi.org/10.1016/j.neucom.2021.01.125 (https://www.sciencedirect.com/science/article/pii/S0925231221002320)
Abstract:	Deep neural networks (DNNs) have transformed computer vision, establishing themselves as the current state of the art for image processing. Nevertheless, training current large DNN models remains one of the main open challenges. Data-parallelism has been the most widespread distributed training strategy, since it is easy to program and can be applied to almost all cases. However, it suffers from several limitations, such as its high communication requirements and memory constraints when training very large models. To overcome these limitations, model-parallelism has been proposed, solving the most substantial problems of the former strategy. However, describing and implementing the parallelization of DNN training across a set of processes deployed on several devices is a challenging task. Currently proposed solutions assume a homogeneous distribution, which is impractical when working with devices of different computational capabilities, a situation that is quite common on high performance computing platforms. To address these shortcomings, this work proposes a novel model-parallelism technique for heterogeneous platforms, implementing a load balancing mechanism between uneven devices of an HPC platform. Our proposal takes advantage of Google Brain's Mesh-TensorFlow for convolutional networks, splitting computing tensors across the filter dimension in order to balance the computational load of the available devices. The experiments conducted show improved exploitation of heterogeneous computational resources, enhancing training performance. The code is available at: https://github.com/mhaut/HeterogeneusModelDNN.
URI:	http://hdl.handle.net/10662/20373
ISSN:	0925-2312
DOI:	https://doi.org/10.1016/j.neucom.2021.01.125
Collection:	DIEEA - Articles; DISIT - Articles; DTCYC - Articles
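
The load balancing idea described in the abstract, splitting the filters of a convolutional layer across devices in proportion to their computational capability, can be illustrated with a minimal Python sketch. This is not the authors' implementation and it does not use Mesh-TensorFlow. The function names `split_filters` and `filter_slices`, and the relative speed values, are hypothetical and chosen only for illustration.

```python
def split_filters(num_filters, device_speeds):
    """Assign filter counts to devices in proportion to their relative speeds.

    Hypothetical helper: `device_speeds` are assumed relative throughput
    values (e.g. measured by a benchmark), not something taken from the paper.
    """
    if num_filters < 0 or not device_speeds or any(s <= 0 for s in device_speeds):
        raise ValueError("need a non-negative filter count and positive speeds")

    total = sum(device_speeds)
    # Ideal (fractional) share of filters for each device.
    exact = [num_filters * s / total for s in device_speeds]
    # Start from the integer part of each share.
    shares = [int(x) for x in exact]
    leftover = num_filters - sum(shares)
    # Give the remaining filters to the devices with the largest fractional parts.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - shares[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def filter_slices(shares):
    """Turn per-device filter counts into (start, stop) index ranges."""
    slices, start = [], 0
    for n in shares:
        slices.append((start, start + n))
        start += n
    return slices


if __name__ == "__main__":
    # Three devices, the third assumed to be twice as fast as the other two.
    shares = split_filters(64, [1.0, 1.0, 2.0])
    print(shares)
    print(filter_slices(shares))
```

With equal speeds the split reduces to the homogeneous case; with uneven speeds, faster devices receive a larger slice of the filter dimension, so that each device's work per training step is roughly proportional to its capability.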

Files

File	Description	Size	Format
j_neucom_2021_01_125.pdf (access restricted)		1.78 MB	Adobe PDF


This item is licensed under a Creative Commons license.