*Result*: Supporting automatic recovery in offloaded distributed programming models through MPI-3 techniques

Title:
Supporting automatic recovery in offloaded distributed programming models through MPI-3 techniques
Contributors:
Barcelona Supercomputing Center
Publisher Information:
ACM Digital Library
Publication Year:
2017
Collection:
Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge
Document Type:
conference object
File Description:
10 p.; application/pdf
Language:
English
Relation:
http://dl.acm.org/citation.cfm?id=3079093; info:eu-repo/grantAgreement/MINECO//IJCI-2015-23266/ES/IJCI-2015-23266/; https://hdl.handle.net/2117/106857
DOI:
10.1145/3079079.3079093
Rights:
http://creativecommons.org/licenses/by-nc-nd/3.0/es/ ; Open Access ; Attribution-NonCommercial-NoDerivs 3.0 Spain
Accession Number:
edsbas.51753FC7
Database:
BASE

*Further Information*

*In this paper we describe the design of fault tolerance capabilities for general-purpose offload semantics, based on the OmpSs programming model. Using ParaStation MPI, a production MPI-3.1 implementation, we explore the features that, being standard compliant, an MPI stack must support to provide the necessary fault tolerance guarantees, based on MPI's dynamic process management. Our results, including synthetic benchmarks and applications, reveal low runtime overhead and efficient recovery, demonstrating that the existing MPI standard provided us with sufficient mechanisms to implement an effective and efficient fault-tolerant solution. ; This research received funding from the European Community’s 7th Framework Programme via the DEEP-ER project under Grant Agreement no. 610476. This work has also been supported by the Spanish Ministry of Science and Innovation (contract TIN2012-34557) and by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266. The authors thank Jorge Bellón, from BSC, for his technical support with the Nanos++ internals. ; Peer Reviewed ; Postprint (author's final draft)*