# Research Topics - CDA

- the quest for efficient parallel or distributed algorithms that scale well;

Some computer scientists, like Rodney Brooks, consider these “Grand Challenges” to be among the most important questions in Computer Science, and that the future of this science and of related sciences depends greatly on our capacity to address them.

**A. Challenge related to the quest for efficient parallel or distributed algorithms that scale well**

**A.1. GPU Computing**

The use of Graphics Processing Units (GPUs) for general purpose computing is relatively recent. Before 2007-2008, the graphics processors in graphics cards were used almost exclusively for graphics tasks. With the advent of programming tools like CUDA and OpenCL, GPUs, which have many computing cores, have come to be used as computing accelerators alongside CPUs for High Performance Computing applications in science and engineering. The bulk of applications concerns signal and image processing and linear algebra. As an example, the Tesla K40 GPU, based on the NVIDIA Kepler architecture, has 2880 CUDA cores and a 1.43 TFlops double precision floating point peak performance. Most manufacturers have thus adopted computing accelerators in some of their machines. Some supercomputer manufacturers have also designed massively parallel supercomputers based on GPUs that were ranked in the Top 10, like the Titan supercomputer at Oak Ridge National Laboratory, USA, in 2012, or the Tianhe-1A in Tianjin, China.

GPU computing is a young field, yet almost all domains in science and engineering are now concerned. Among the most important we can cite astrophysics, seismology, the oil industry and nuclear energy. Using GPUs for High Performance Computing offers several advantages:

- GPUs are powerful accelerators with many cores;

- GPUs are widely available and relatively cheap;

- GPUs are energy efficient.

Team CDA has designed and analyzed several GPU-based exact methods, including dynamic programming parallel algorithms, branch & bound parallel algorithms and Simplex algorithms, obtaining reductions in computing time by a factor of 20 to 50 depending on the method and problems considered (these recent papers have been cited more than 80 times, see Google Scholar). The GPU computing methods developed by team CDA can easily be combined in order to propose efficient hybrid parallel methods for several types of combinatorial optimization problems. **The work on GPU computing of team CDA is supported by NVIDIA Corporation**, one of the leading GPU designers. Dr Didier El Baz received an NVIDIA Academic Partnership (see __NVIDIA Academic Partnership__) in October 2010, with a gift of two C2050 computing accelerators with 448 CUDA cores each. **The support of NVIDIA Corporation was renewed in July 2014 with a gift of a Tesla K40 GPU**.

Team CDA has also produced a series of works on the solution of multidimensional knapsack problems. Several heuristics that outperform well-known heuristics from the literature in both computing time and quality of solution have been proposed. Exact cooperative methods that combine dynamic programming and branch and bound have been studied. Research has also been done on domain reduction techniques; these studies permit one to precondition problems. We have considered in particular the constraint rotation method. Let us note finally that a method based on Lagrangian relaxation has also been proposed for the solution of combinatorial problems with equality constraints.
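To fix ideas on the problem class discussed above, the following is a minimal, hypothetical sketch of a simple greedy heuristic for the multidimensional knapsack problem; it orders items by profit per normalized resource consumption. It is only an illustration of the problem structure, not one of the team's heuristics.

```python
# Hypothetical greedy heuristic for the multidimensional knapsack problem.
# Not the team's method; a simple baseline for illustration only.

def greedy_mkp(profits, weights, capacities):
    """profits[i]: profit of item i; weights[i][j]: use of resource j by
    item i; capacities[j]: budget of resource j. Returns (value, chosen)."""
    n, m = len(profits), len(capacities)

    # Efficiency: profit divided by total normalized resource consumption.
    def efficiency(i):
        load = sum(weights[i][j] / capacities[j] for j in range(m))
        return profits[i] / load if load > 0 else float("inf")

    order = sorted(range(n), key=efficiency, reverse=True)
    remaining = list(capacities)
    value, chosen = 0, []
    for i in order:
        # Take the item greedily if it fits in every resource dimension.
        if all(weights[i][j] <= remaining[j] for j in range(m)):
            for j in range(m):
                remaining[j] -= weights[i][j]
            value += profits[i]
            chosen.append(i)
    return value, sorted(chosen)
```

Such a greedy pass is typically used as a fast starting point; the heuristics mentioned above improve on this kind of baseline in both time and solution quality.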

A parallel dynamic programming algorithm based on the list method, with pseudo-polynomial complexity, has been proposed for knapsack problems. Data exchange management plays a prominent part in this parallel method, and several load balancing techniques have been studied in order to obtain efficient parallel methods.
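As background for the list method mentioned above, here is a sequential sketch of dynamic programming by lists for the 0/1 knapsack problem: only non-dominated (weight, profit) states are kept at each step. In a parallel version, such lists are partitioned among processors, which is where the data exchange and load balancing issues discussed above arise. This sketch is illustrative, not the team's parallel algorithm.

```python
# Sequential sketch of the "list" dynamic programming method for the 0/1
# knapsack problem: maintain only non-dominated (weight, profit) states.

def knapsack_lists(items, capacity):
    """items: list of (weight, profit) pairs. Returns the optimal profit."""
    states = [(0, 0)]  # non-dominated (weight, profit) pairs, weight-sorted
    for w, p in items:
        # Extend every feasible state with the current item.
        extended = [(sw + w, sp + p) for sw, sp in states
                    if sw + w <= capacity]
        # Merge old and new states; for equal weights, higher profit first.
        merged = sorted(states + extended, key=lambda s: (s[0], -s[1]))
        # Dominance pruning: keep a state only if it strictly improves
        # the best profit seen at a smaller or equal weight.
        states, best = [], -1
        for sw, sp in merged:
            if sp > best:
                states.append((sw, sp))
                best = sp
    return max(sp for _, sp in states)
```

The pseudo-polynomial behavior comes from the list length being bounded by the capacity; pruning dominated states is what keeps the lists short in practice.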

**A.2. Asynchronous iterative algorithms**

The CDA team aims particularly at developing the concept of asynchronism, which constitutes a major challenge for the advancement of distributed computing at very large scale. This concept is gaining considerable attention today since large distributed systems must cope with the very asynchronous nature of communication networks, with dynamicity and with the presence of faults in the system. In particular, research is carried out on asynchronous iterative algorithms, which are particularly well suited to the massive distribution context. Studies are developed for high performance distributed processing on dedicated architectures, as in the ANR Smart Surface project, as well as on general architectures.

We consider both traditional research domains related to the design of distributed algorithms, like determining the global state of a distributed system for distributed termination detection purposes, and new challenges related to the solution of large scale problems in numerical simulation using techniques such as peer-to-peer distributed computing. As a consequence, the research work of the team concerns the design and analysis of efficient distributed algorithms. The lack of a central entity and of a global clock, together with a priori unbounded communication delays (i.e. the asynchronous model), massive parallelism and heterogeneity, leads to interesting problems. Our results during the period mainly deal with the design and analysis of parallel and distributed algorithms; in particular, they concern the solution of large scale systems of equations, mainly derived from boundary value problems, via parallel or distributed asynchronous iterative algorithms.

We have been working on asynchronous iterative algorithms for some time, considering in particular mathematical modeling and convergence analysis, stopping criteria, the study of round-off errors and the study of performance. We have contributed to the design and study of a new class of parallel or distributed iterative algorithms, i.e. Asynchronous Algorithms with Order Intervals (AAOI for short). The AAOI class is a general class of parallel or distributed iterative algorithms whereby several processors can perform computations, i.e. updating phases, at their own pace, without any ordering or synchronization. One of the main features of this new class of algorithms is that the current value of the components of the iterate vector, i.e. a partial update which is not necessarily labeled by an iteration number, can be used in the computation, whereas only values labeled by an iteration number are used in the classical asynchronous iterative model. This new feature is particularly important in the case of convergence under partial ordering, since it permits one to obtain better efficiency of parallel/distributed algorithms. This feature is also attractive when using two-stage iterative methods or block iterative methods, since it is not necessary to wait for the completion of an iteration to get updates. As a consequence, the AAOI model fits new communication mechanisms like RMA and introduces more flexibility in the way data can be used in computations. Convergence results in a contraction context have been established for AAOI.

Several theoretical stopping criteria related to forward, backward or absolute errors have been proposed for perturbed asynchronous fixed point methods in the presence of round-off errors. The last case is particularly interesting since all tests are made within the same macro-iteration. We have also studied distributed termination detection; in particular, we have proposed a method based on an activity graph and message acknowledgement that tends to minimize the number of exchanged messages.

Issues related to the implementation of asynchronous algorithms on message passing architectures have been studied in detail, with emphasis on communication aspects. The efficiency of synchronous and asynchronous iterative schemes of computation has been studied for nonlinear convection-diffusion problems. Computational results on IBM SP4 machines with more than one hundred processors have been analyzed in detail for different communication frequencies, showing the interest of asynchronous algorithms with high communication frequencies.

**B. Challenge related to the ease of programming and using massively parallel or distributed architectures**

We have designed and implemented a decentralized environment for High Performance Peer-to-Peer Distributed Computing, P2PDC. **These results were obtained in the framework of the ANR project High Performance Peer-to-Peer Computing (ANR-07-CIS7-011) coordinated by Dr. Didier El Baz**. This work represents our contribution to addressing the challenge of using and programming large scale distributed architectures. Concerning peer-to-peer computing more specifically, we have been particularly attracted by the success of peer-to-peer applications, which are popular for file sharing or video. We believed that the peer-to-peer concept could be used with equivalent success for High Performance Computing applications like combinatorial optimization applications (which are not necessarily data parallel), provided fast communication between peers was possible. In this context, peer-to-peer computing has the main advantage of being an economic solution for HPC applications. Peer-to-peer computing is certainly not the solution for exascale computing, but it is an interesting approach that contributes to new supercomputing challenges like massive parallelism and scalability, heterogeneity and fault tolerance. Our idea was to use this concept for HPC applications in optimization and numerical simulation that use iterative methods. The challenge here is to design a communication protocol that permits fast exchanges between peers, e.g., in order to communicate updates as in an iterative method, and to design a decentralized environment on top of the communication protocol. The P2PDC decentralized environment aims at providing an economic approach for the solution of large scale numerical simulation problems or difficult optimization problems via distributed iterative methods.
The P2PDC environment is based on the P2PSAP self-adaptive communication protocol, an extension of the Configurable Transport Protocol (CTP) based on the concept of micro-protocols and the CACTUS framework of the University of Arizona. The P2PSAP protocol relies on a reduced set of communication operations for ease of programming. The communication mode, e.g., synchronous or asynchronous, depends on elements of context at the network level, like the type of communication network, e.g., Infiniband or Ethernet, and the topology, e.g., intra- or inter-cluster communication; it also depends on elements of context at the application level, like the scheme of computation, e.g., synchronous iterative schemes, asynchronous iterative schemes or hybrid schemes.
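The kind of context-based adaptation described above can be sketched as a simple decision rule mapping network-level and application-level context to a communication mode. The field names and rules below are purely illustrative assumptions, not the actual P2PSAP logic.

```python
# Hypothetical sketch of context-based selection of the communication mode.
# Rules and vocabulary are illustrative, not taken from P2PSAP.

def select_mode(network, topology, scheme):
    """network: 'infiniband' or 'ethernet'; topology: 'intra' or 'inter'
    (cluster); scheme: 'synchronous', 'asynchronous' or 'hybrid'."""
    # The application-level scheme is decisive when it is not hybrid.
    if scheme == "asynchronous":
        return "asynchronous"
    if scheme == "synchronous":
        return "synchronous"
    # Hybrid scheme: synchronize inside a fast intra-cluster network,
    # communicate asynchronously across slower inter-cluster links.
    if topology == "intra" and network == "infiniband":
        return "synchronous"
    return "asynchronous"
```

In the real protocol, such decisions are taken by composing micro-protocols at the transport level rather than by a single function; this sketch only conveys the dependence of the mode on both levels of context.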

In particular, Team CDA has worked on peer-to-peer distributed computing methods in collaboration with EPROAD in the framework of the ANR project CIP. Several types of networks, like Ethernet, Infiniband and Myrinet, have been considered. We have then designed and developed the P2PDC decentralized environment in order to carry out peer-to-peer HPC applications. Combinatorial optimization applications like cutting stock problems have been solved via P2PDC. Nevertheless, we point out that our environment is not dedicated to this type of problems; it is in fact general, and numerical simulation problems like the obstacle problem or option pricing problems have also been considered.