Laboratory for Analysis and Architecture of Systems
The increasing importance of massively defective technologies in computing systems is a direct consequence of shrinking dimensions in today's CMOS and beyond-CMOS nanoelectronics, as well as in the technologies envisioned in the molecular-electronics approach. In all cases, this scaling increases the occurrence of permanent and transient physical faults in future chips. In this context, tolerating physical faults becomes an essential challenge that must be considered from the very beginning of every design. Moreover, the dependability of future chips should be achieved with very limited external control: increasing chip complexity reduces the controllability and observability of the chip from the external world, so external testing of the array components (processors, memories, routers, etc.) does not scale to massively parallel architectures.
We advocate the vision that future chips should integrate a hierarchy of nested self-organizing and adaptive mechanisms at different abstraction levels (from circuit level to task management), so that they continue delivering their processing services in a quasi-autonomous way in spite of defective cores. Furthermore, the chip organization should exhibit some physical redundancy in order to build a dependable sub-system out of a defective and intrinsically unreliable system. For three years we have been studying a self-organizing fault-tolerant (SOFT) methodology at the architectural level for massively defective multicore arrays, considering that as many as 40% of the cores in an array could be defective. SOFT multicore chips combine the following autonomous mechanisms [1,2,3]:
a) Chip self-diagnosis through the mutual test of adjacent cores. Each core independently executes a software-based self-test program, calculates a test signature, and stops communicating with any adjacent core whose signature differs from its own. This distributed disconnection mechanism automatically splits the grid into several single-connected zones (SCZ), regardless of the actions of defective cores. In other words, good cores disconnect bad cores. Fig. 1 displays a typical example of such chip partitioning, achieved autonomously with no external control. The chip here is a 7x9 2D-mesh array including 14 defective (black-colored) nodes and 4 input/output ports (IOPs, labeled N, E, S, W) positioned in the middle of each edge. A solid line between two adjacent routers shows an interconnect. The chip is split into a single-connected zone of good cores (green zone), which isolates the defective cores enclosed in separate red dotted curves. For clarity, we did not draw the interconnects logically disabled by good cores. Note that some good cores may not be reachable by the IOPs. For instance, the dotted loop in the top left corner of Fig. 1 encloses a cluster of three good cores with coordinates (0,5), (0,6) and (1,5), which cannot take part in the processing.
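The disconnection mechanism of a) can be illustrated with a minimal Python sketch. It is a simplified model under stated assumptions, not the SOFT implementation. It assumes that every defective core produces a wrong signature, that the link between two adjacent good cores always survives the mutual test, and that each IOP is attached to one core of the mesh. The names `GOOD_SIGNATURE`, `signature`, `good_zones` and `usable_zones` are hypothetical.

```python
from collections import deque

GOOD_SIGNATURE = 0xA5A5  # hypothetical value; the source does not specify one


def signature(core, defective):
    """Stand-in for the self-test: assume every defective core yields a wrong signature."""
    return 0 if core in defective else GOOD_SIGNATURE


def neighbors(core, width, height):
    """Yield the mesh neighbors of core inside a width x height grid."""
    x, y = core
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)


def good_zones(width, height, defective):
    """Partition the good cores into zones connected by links that passed the mutual test."""
    seen, zones = set(), []
    for x in range(width):
        for y in range(height):
            start = (x, y)
            if start in defective or start in seen:
                continue
            zone, frontier = {start}, deque([start])
            seen.add(start)
            while frontier:
                core = frontier.popleft()
                own = signature(core, defective)
                for nb in neighbors(core, width, height):
                    # A good core keeps a link only if the neighbor's signature matches its own.
                    if nb not in seen and signature(nb, defective) == own:
                        seen.add(nb)
                        zone.add(nb)
                        frontier.append(nb)
            zones.append(zone)
    return zones


def usable_zones(zones, iop_cores):
    """Keep only the zones that contain at least one core attached to an IOP."""
    return [zone for zone in zones if zone & set(iop_cores)]


if __name__ == "__main__":
    defective = {(1, 0), (0, 1)}  # these two faults cut off the corner core (0, 0)
    zones = good_zones(3, 3, defective)
    print(sorted(len(zone) for zone in zones))
    print(usable_zones(zones, [(2, 1)]))
```

In this 3x3 example the good core (0,0) ends up alone in its own zone, much like the cluster of three good cores in the top left corner of Fig. 1: it is fault-free but unreachable from the IOP, so it cannot take part in the processing.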
b) Self-configuration of communication routes. Step 1: Each IOP emits a route request message (RRM). The message is propagated by flooding: each node forwards each incoming RRM on all of its links except the incoming one. Each router that forwards an RRM appends to the message header the routing decision it executes locally (for instance, the index of the output link), so that the RRM records the route it follows as it propagates. Step 2: Each valid node receiving the RRM sends one route acknowledgment message (RAM) back to the emitter. The RAM simply follows the RRM route in the opposite direction, which dramatically limits the number of retransmissions. Overall, the number of RAMs returning to an IOP equals the number of nodes it can contact in the SCZ of good cores. Step 3: Each IOP collects the RAMs and stores the routes in a dedicated array of its memory, which we call the valid route array (VRA). Thus, at the end of this route discovery phase, each IOP has stored in its VRA the routes to the cores it can contact, and each core has stored the routes to the IOPs that contacted it. Of course, the fraction of cores which can be contacted is a decreasing function of the density of defective cores. Fig. 2 shows simulation results giving the average fraction of cores simultaneously accessible by 4, 3, 2, 1 or no IOP in an 11x11 array. The green bars show that almost all cores are accessible by all IOPs up to 25% of defective cores in the array (i.e., p_{f,N} ≤ 0.25). However, the reachability decreases dramatically above this threshold of faulty cores. The predominance of the red and orange bars at 40% of faulty cores shows the applicability limit of a method that consists in isolating the defective cores. The references below provide a detailed description of our activity on SOFT chips.
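The three-step route discovery of b) can be sketched in Python as follows. This is an illustrative model only, resting on assumptions the text does not state. It assumes that each node forwards only the first RRM it receives from a given IOP and drops later copies, which is what makes the flood terminate. It also assumes that links to defective cores are already disabled, and that a route is encoded as a list of output-link labels N, E, S, W. The names `discover_routes` and `follow` are hypothetical.

```python
from collections import deque

# Output-link labels and their grid offsets (a hypothetical encoding of the link index).
DIRECTIONS = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def discover_routes(width, height, defective, iop_core):
    """Flood an RRM from iop_core and return the VRA: reachable core -> route of link labels."""
    if iop_core in defective:
        return {}
    vra = {iop_core: []}  # the core attached to the IOP is reached with an empty route
    pending = deque([(iop_core, [], None)])  # (router holding the RRM, route so far, incoming link)
    while pending:
        core, route, incoming = pending.popleft()
        for label, (dx, dy) in DIRECTIONS.items():
            if label == incoming:
                continue  # Step 1: forward on all links except the incoming one
            nxt = (core[0] + dx, core[1] + dy)
            in_grid = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if not in_grid or nxt in defective or nxt in vra:
                continue  # no link, link disabled by the mutual test, or duplicate RRM dropped
            # Step 2: nxt answers with one RAM that retraces the recorded route backwards.
            # Step 3: the IOP stores that route in its valid route array.
            vra[nxt] = route + [label]
            pending.append((nxt, vra[nxt], OPPOSITE[label]))
    return vra


def follow(start, route):
    """Replay a stored route from start and return the core it ends on."""
    x, y = start
    for label in route:
        dx, dy = DIRECTIONS[label]
        x, y = x + dx, y + dy
    return (x, y)


if __name__ == "__main__":
    vra = discover_routes(3, 3, {(1, 0), (0, 1)}, iop_core=(2, 1))
    for core, route in sorted(vra.items()):
        print(core, route, follow((2, 1), route) == core)
```

Because the sketch processes messages in first-in, first-out order with a uniform hop delay, the first RRM to reach a node has travelled along a shortest fault-free path. That property belongs to this model and is not claimed by the text. Cores enclosed by defective neighbors, such as (0,0) in the example, never receive an RRM and therefore never appear in the VRA.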