Design and Implementation of a GALS Adapter for ANoC based Architectures
Yvain Thonnart, Edith Beigné, Pascal Vivet
MINATEC, CEA-LETI, Grenoble, France
ASYNC’09, Chapel Hill, NC, USA, May 18th, 2009

GALS Adapter for ANoC
[Figure: nine IP units, each in its own clock and supply domain (Clk1/Vdd1 to Clk9/Vdd9), connected by the asynchronous NoC, with two off-chip interfaces (OffChip clk1, OffChip clk2)]
- Asynchronous NoC: 2D-mesh based, wormhole source routing, 2 virtual channels for QoS, QDI (4-rail, 4-phase)
- Low-power scheme: DVFS is used at IP level
- GALS scheme: IPs are distinct frequency domains; off-chip NoC interfaces are synchronous
- Need efficient GALS interfaces

© CEA 2008. All rights reserved. Any reproduction in whole or in part on any medium or use of the information contained herein is prohibited without the prior written consent of CEA.

Outline
- Introduction
- ANoC GALS adapter architecture: objectives & principles
- New bi-synchronous FIFOs using Johnson code
- Design of ANoC interfaces
- Implementation & results
- Conclusion

ANoC GALS Adapter Objectives
- Design objectives
  - QDI asynchronous (4-rail / 4-phase) vs. synchronous NoC protocols
  - Virtual channel policy (VC0 / VC1)
  - Local clock generator for easy frequency decoupling
  - High throughput, low latency, low area
- Implementation objectives
  - Standard-cell based design
  - Delivered as a hard macro for easy design flow integration
  - Hide asynchronous complexity from the final user

ANoC GALS Adapter Architecture
- Efficient A-S and S-A FIFOs based on Johnson encoding
  - Protocol conversion between QDI & synchronous logic
  - Virtual channel multiplexing / demultiplexing
- Programmable local clock generator
  - Using a standard-cell based delay line
  - Robust & efficient programming interface using pausable clocking
[Figure: adapter block diagram: ANoC (QDI async. logic), A-S FIFO, S-A FIFO, IP unit (local clock domain), delay line, programmable clock generator]

Outline: New bi-synchronous FIFOs using Johnson encoding
- Metastability issues in FIFOs
- Johnson code
- FIFO micro-architecture

Dual clock FIFO principles
- Decoupled timing domains
  - Write pointer updated with the write clock
  - Read pointer updated with the read clock
- High throughput: 1 transfer / cycle
- Latency & minimal FIFO depth depend on the synchronization costs
- Synchronization issue: the pointers cross timing domains and need synchronization with the opposite clock
- Needs ad-hoc encoding to ensure proper detection of full and empty states
[Figure: write clock domain (Wclk, Wadd) and read clock domain (Rclk, Radd); full is flagged when R = W+1, empty when R = W]
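
Why the pointer encoding matters can be illustrated with a small sketch. It assumes, as a simplification that is not stated on the slides, that each bit changing during a pointer update may be captured independently as either its old or its new value by the opposite clock domain. The function name `possible_samples` is hypothetical.

```python
from itertools import product

def possible_samples(old, new, width):
    """Values that could be captured while a pointer moves from old to new.

    Assumption: every changing bit is seen independently as old or new.
    """
    changing = [i for i in range(width) if ((old ^ new) >> i) & 1]
    samples = set()
    for choice in product((0, 1), repeat=len(changing)):
        value = old
        for bit, take_new in zip(changing, choice):
            if take_new:
                value ^= 1 << bit  # this bit has already switched
        samples.add(value)
    return samples

# Binary increment 3 -> 4 (011 -> 100): all three bits change at once.
binary_case = possible_samples(0b011, 0b100, 3)
# Single-bit change 011 -> 111: only the old or the new value is possible.
single_bit_case = possible_samples(0b011, 0b111, 3)

print(sorted(binary_case))
print(sorted(single_bit_case))
```

With a plain binary pointer the sampled value can be unrelated to both the old and the new address, which is why an encoding with a Hamming distance of 1 between successive values is needed.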

From Gray encoding to Johnson encoding
- Gray encoding: Hamming distance of 1
  - Always synchronizes either the last value or the new value
- Disadvantages of Gray
  - Increment & comparison operations are not trivial: they need to be done in binary, requiring converters, or with sizable translation tables
  - Natural Gray is limited to 2^N FIFO depths: area consuming
  - Gray adaptations to non-2^N depths need complex logic
- Johnson encoding is to Gray what 1-hot is to binary
  - Also has a Hamming distance of 1
  - Less dense, not restricted to 2^N values
  - Increment & comparisons are trivial

Johnson counter
- Also known as “twisted ring”
- A single bit change propagates from LSB to MSB
- Implemented by a shift register looped by an inverter
- Increment and comparison are trivial
- An N-bit Johnson counter encodes 2N (twice N) values [Johnson 74]

Value  Code
0      000
1      001
2      011
3      111
4      110
5      100

[Figure: three D flip-flops clocked by clk and connected as a ring; outputs ptr[0], ptr[1], ptr[2]; next-state signals ptr_incr[0], ptr_incr[1], ptr_incr[2]]
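
The twisted-ring behaviour described above can be reproduced with a short software sketch. It models the shift register as an integer whose bit 0 is ptr[0]; the function names are hypothetical and this is an illustration, not the authors' implementation.

```python
def johnson_next(code, n_bits):
    """Shift towards the MSB and feed the inverted MSB back into the LSB."""
    msb = (code >> (n_bits - 1)) & 1
    mask = (1 << n_bits) - 1
    return ((code << 1) & mask) | (msb ^ 1)

def johnson_sequence(n_bits):
    """Return one full period of the counter, starting from the all-zero code."""
    sequence, code = [], 0
    while True:
        sequence.append(code)
        code = johnson_next(code, n_bits)
        if code == 0:
            return sequence

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

seq3 = johnson_sequence(3)
print([format(c, "03b") for c in seq3])
# prints ['000', '001', '011', '111', '110', '100']
```

The 3-bit sequence matches the value/code table, has 2N = 6 entries, and successive codes (including the wrap from 100 back to 000) differ in exactly one bit.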

Johnson encoding for FIFO design
- Use an N-bit Johnson counter (2N values) for a FIFO of depth N
  - Each bit defines a FIFO register
  - The bit value is the access parity for that given register
- Control logic equations
  - Wen for register i: Wptr[i] xor Wptr_incr[i]
  - Ren for register i: Rptr[i] xor Rptr_incr[i]
  - Empty: Rptr = Wptr
  - Full: Rptr = NOT Wptr
- Detect the code transition to access register i
- Full condition: no cell is lost in the FIFO to handle the full state, since read and write then point to the same register

Johnson code  Register id  Parity
00000         0            0
00001         1            0
00011         2            0
00111         3            0
01111         4            0
11111         0            1
11110         1            1
11100         2            1
11000         3            1
10000         4            1

Johnson encoding FIFO architecture
- The FIFO crosses timing domains, between distinct synchronous or asynchronous domains
- R/W pointer synchronization
  - using 2 (or more) flip-flops for synchronous domains
  - using glitchless logic for the asynchronous side of the A-S / S-A interfaces
[Figure: FIFO block diagram. Generic part: register array with wr_data, rd_data, wr_addr, rd_addr, wr_ptr, rd_ptr, wr_en, rd_en, wr_clk, rd_clk. Specific part: synchronization logic producing ready and full (wr_ptr = not rd_ptr) in the write clock domain, and valid and empty (wr_ptr = rd_ptr) in the read clock domain]
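
The control equations for the Johnson-encoded FIFO can be exercised with a small behavioural model. This is a minimal sketch under a strong simplification: both pointers live in a single process, so the cross-domain synchronization (2-flop or glitchless logic) is not modelled. The class name `JohnsonFifo` and its methods are hypothetical.

```python
class JohnsonFifo:
    """Depth-N FIFO addressed by two N-bit Johnson pointers (no synchronizers)."""

    def __init__(self, depth):
        self.depth = depth
        self.mask = (1 << depth) - 1
        self.wptr = 0  # write pointer, Johnson coded
        self.rptr = 0  # read pointer, Johnson coded
        self.regs = [None] * depth

    def _incr(self, ptr):
        # Twisted ring: shift towards the MSB, inverted MSB re-enters at the LSB.
        msb = (ptr >> (self.depth - 1)) & 1
        return ((ptr << 1) & self.mask) | (msb ^ 1)

    def _enabled_register(self, ptr):
        # ptr xor ptr_incr is one-hot; the set bit is the register to access.
        return (ptr ^ self._incr(ptr)).bit_length() - 1

    @property
    def empty(self):
        return self.rptr == self.wptr

    @property
    def full(self):
        return self.rptr == (~self.wptr & self.mask)

    def write(self, data):
        """Store data if there is room; return True on success."""
        if self.full:
            return False
        self.regs[self._enabled_register(self.wptr)] = data
        self.wptr = self._incr(self.wptr)
        return True

    def read(self):
        """Return the oldest item, or None when the FIFO is empty."""
        if self.empty:
            return None
        data = self.regs[self._enabled_register(self.rptr)]
        self.rptr = self._incr(self.rptr)
        return data

fifo = JohnsonFifo(5)
accepted = [fifo.write(i) for i in range(6)]  # the sixth write is refused
print(accepted, fifo.full)
print([fifo.read() for _ in range(5)], fifo.empty)
```

All five registers are filled before full is raised, which illustrates the point that no cell is lost to distinguish the full state from the empty state.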

Outline: Design of ANoC Interfaces
- A-S interface & S-A interface
- Local clock generator

A-to-S interface
- Write clock is locally generated on the asynchronous NoC side; read clock comes from the IP side
- Design constraints
  - The write clock is locally generated (output of the input data word completion tree)
  - The write clock edge must occur after the 4-rail to bundled-data (4rail-BD) conversion
  - FIFO size must be at least 5 due to the A-S interface round trip
[Figure: A-S interface datapath. Asynchronous NoC side: ai_data, ai_data_ack, ai_send[0], ai_send[1], ai_send_ack, ai_accept[0], ai_accept[1], ai_accept_ack[0], ai_accept_ack[1], 4RailBD converter. FIFO: wr_clk, wr_en = 1, wr_data, rd_clk, rd_en, rd_data, ready, valid. Synchronous IP side: si_data, si_send[0], si_send[1], si_accept[0], si_accept[1]]
[Figure: full / empty detector for the A-S interface. Synchronous operation (rd_clk↑, rd_ptr++, valid↑), asynchronous operation (ai_send↑, ai_accept↑, ready↑, wr_clk↑, wr_ptr++) and 2-flop synchronization; empty and full are derived by comparing wr_ptr, wr_ptr_half and wr_ptr_incr with rd_ptr bit by bit, from bit 0 to bit N-1]
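
The constraint that the write clock comes from the input word completion tree, and must follow the 4-rail to bundled-data conversion, can be illustrated with a small sketch. It assumes that "4-rail" denotes a 1-of-4 code (each group of four wires carries two bits, exactly one wire high when valid, all wires low as the 4-phase spacer) and that rail k encodes the value k; the rail-to-value mapping, the digit ordering and all function names are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

Digit = Tuple[int, int, int, int]  # one 1-of-4 group: rails 0..3

def decode_digit(rails: Digit) -> Optional[int]:
    """Decode one 1-of-4 digit; None means spacer (all rails low)."""
    if any(r not in (0, 1) for r in rails) or sum(rails) > 1:
        raise ValueError("illegal 1-of-4 code: %r" % (rails,))
    if sum(rails) == 0:
        return None
    return rails.index(1)

def word_complete(digits: Sequence[Digit]) -> bool:
    """Completion detection: true once every digit carries a valid value."""
    return all(decode_digit(d) is not None for d in digits)

def to_bundled_data(digits: Sequence[Digit]) -> int:
    """Convert a complete word to binary; digits[0] holds the two LSBs."""
    if not word_complete(digits):
        raise ValueError("word is not complete yet")
    value = 0
    for position, rails in enumerate(digits):
        value |= rails.index(1) << (2 * position)
    return value

partial = [(0, 1, 0, 0), (0, 0, 0, 0)]  # second digit still at spacer
complete = [(0, 1, 0, 0), (0, 0, 0, 1)]
print(word_complete(partial), word_complete(complete))
print(to_bundled_data(complete))
```

In this model the completion signal plays the role of the locally generated write clock: binary data is only sampled once `word_complete` is true, mirroring the requirement that the clock edge occurs after the conversion.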

S-to-A interface
- Write clock comes from the IP side; read clock is locally generated on the asynchronous NoC side
- Design constraints
  - The read clock is locally generated (output of the asynchronous data acknowledge completion tree)
  - Asynchronous token generation must occur after the bundled-data to 4-rail (BD-4rail) conversion
  - FIFO size must be at least 5 due to the S-A interface round trip
[Figure: S-A interface datapath. Synchronous IP side: so_data, so_send[0], so_send[1], so_accept[0], so_accept[1]. FIFO: wr_clk, wr_en, wr_data, rd_clk, rd_en = 1, rd_data, ready, valid. Asynchronous NoC side: BD4Rail converter, ao_data, ao_data_ack, ao_send[0], ao_send[1], ao_send_ack, ao_accept[0], ao_accept[1], ao_accept_ack[0], ao_accept_ack[1]]
[Figure: full / empty detector for the S-A interface. Synchronous operation (wr_clk↑, wr_ptr++, ready↑), asynchronous operation (valid↑, ao_send↑, ao_send_ack↓, rd_clk↓, rd_ptr_half++) and 2-flop synchronization; full and empty are derived by comparing wr_ptr and wr_ptr_incr with rd_ptr bit by bit, from bit 0 to bit N-1]

Local Clock Generator
- Safe reprogramming
  - Uses the pausable clock principle
  - AFSM controlling the req/ack pause signals for reprogramming
  - Reprogramming is only done during the low phase of the clock (all signals are stable and equal to 0)
- Design constraints
  - Linear delay increment for the total delay line delay
  - Clock division for lower frequencies, for DVFS and test
- Full standard-cell implementation
  - Latch-mux based programmable delay line
[Figure: clock generator schematic. Coarse-grain frequency selection: clk_div, counter + test /= 0, clk_out. Fine-grain frequency tuning: clk_freq[0] to clk_freq[k] select delay-line stages of weight 1 to 2^k between delay_in and delay_out. Other labels: ri_dl, ai_dl, rclk, dclk, aclk, iclk, ME]

Design Flow
[Figure: adapter block diagram: ANoC (QDI async. logic), A-S interface, S-A interface, IP unit (local clock domain), delay line, programmable clock generator]
GALS adapter implementation
- Full standard-cell (CORELIB + TAL)
  - TAL library (C-elements, mutex)
- Mixed RTL / gate instantiation
- Synthesis with definition of the 3 clock domains
- Max-delay constraints on cross-domain paths
- Standard place & route design flow
- Validation on SDF back-annotated netlist in all corners
- Delivered as a hard macro with all CAD views

GALS adapter usage
- Use the hard macro as a synchronous IP with CTS + .lib files
- Easy top-level integration
- Hides asynchronous complexity from the final user
- Provides full timing domain decoupling
- Using a GALS NoC scheme

Hard-Macro Layout (STMicroelectronics CMOS65LP)
[Figure: hard-macro layout, 160 µm × 80 µm, with blocks Clk Gen, Interface S-A and Interface A-S; ports Clk-in, Clk-out, reset_n, test_scan, clk_freq, sync_out_data, sync_in_data, async_out_data, async_in_data]

ANoC GALS Adapter Performances
[Figure: two plots of frequency (MHz) versus cfg_freq (0 to 15), one for the nominal case (1.20 V, 25°C) and one for the worst case (1.05 V, 105°C); curves: generated clock frequency, A-S interface max. throughput, S-A interface max. throughput, ANoC router max. throughput]

Case          Frequency range (delay line prog.) (*)  A-to-S throughput  S-to-A throughput
Worst case    200 MHz to 500 MHz                      400 MHz            280 MHz
Nominal case  380 MHz to 980 MHz                      680 MHz            510 MHz
(*) plus additional values through division factors

- Using QDI, ANoC and the GALS adapter will provide about 500 MFlit/s.
- The delay line will provide a local clock frequency up to 1 GHz (less precision for higher frequencies).

Comparison with previous solutions

Design              Max clock freq (MHz)  Max A-S throughput (Mflit/s)  Max S-A throughput (Mflit/s)  Layout area (µm²)
FAUST (*) (2005)    350                   300                           250                           34,000
ALPIN (2007)        400                   220                           180                           9,000
MAGALI (this work)  980                   710                           520                           12,500
(*) 130nm values converted to 65nm

- FAUST [Async’06]: use of FIFO interfaces based on Gray code
- ALPIN [NOCS’08]: use of pausable clocking
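
The remark that the delay line gives less precision at higher frequencies follows from its linear delay increment: equal steps in period translate into unequal steps in frequency. The sketch below illustrates this under assumptions that are not given in the slides: 16 configuration codes (taken from the 0 to 15 cfg_freq axis), code 0 as the fastest setting, a perfectly linear period, and the nominal-case endpoints of 980 MHz and 380 MHz. The function name is hypothetical and the intermediate values are illustrative, not measured data.

```python
def generated_frequency_mhz(code, f_max_mhz=980.0, f_min_mhz=380.0, n_codes=16):
    """Frequency for a delay-line code, assuming the period grows linearly."""
    t_min_ns = 1000.0 / f_max_mhz  # period at code 0 (assumed fastest)
    t_max_ns = 1000.0 / f_min_mhz  # period at the last code (assumed slowest)
    step_ns = (t_max_ns - t_min_ns) / (n_codes - 1)
    return 1000.0 / (t_min_ns + code * step_ns)

freqs = [generated_frequency_mhz(c) for c in range(16)]
# Frequency gap between neighbouring codes, from fastest to slowest.
gaps = [freqs[c] - freqs[c + 1] for c in range(15)]

print("gap between the two fastest codes: %.1f MHz" % gaps[0])
print("gap between the two slowest codes: %.1f MHz" % gaps[-1])
```

Under these assumptions the gap between neighbouring settings is largest near the top of the range, which is consistent with the slide's note and with the use of clock division to reach additional lower frequencies.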

Lessons from the past…
- Proposed to (re-)use Johnson encoding [Johnson74]
  - Allows efficient FIFO design & any FIFO depth, compared to standard Gray code
  - The FIFO can be sized to the minimum: area reduction
- Pausable clocking
  - Provides very low area overhead at the cost of very low bandwidth
  - Limits the maximum clock frequency [Dobkin, Ginosar 05]
  - Well suited to low-power solutions, not to high performance
  - Nevertheless efficient to provide a robust interface for the DFS scheme
- NoC virtual channels are costly for GALS interfaces
  - Do not demux the VCs in the GALS interface when VCs are not interleaved within the synchronous unit
  - Share all you can!
- GALS design flow
  - Provide the GALS interface as a hard macro with all CAD views
  - For easy integration at top level