Design and Implementation of a GALS Adapter for ANoC based Architectures
Yvain Thonnart, Edith Beigné, Pascal Vivet
MINATEC, CEA-LETI, Grenoble, France
ASYNC’09, Chapel Hill, NC, USA
May 18th, 2009
GALS Adapter for ANoC
[Figure: nine IP units, each with its own clock and supply (Clk1 / Vdd1 to Clk9 / Vdd9), connected by the asynchronous NoC, with two off-chip interfaces (OffChip clk1, OffChip clk2)]
Asynchronous NoC
2D-mesh based,
Wormhole source routing,
2 virtual channels for QoS,
QDI (4-rail, 4-phase)
Low-power scheme:
DVFS is used at IP level
GALS scheme:
IPs are distinct frequency domains
Synchronous off-chip NoC interfaces
Need efficient GALS interfaces
© CEA 2008. All rights reserved. Any reproduction in whole or in part on any medium or use of the information contained herein is prohibited without the prior written consent of CEA
GALS Adapter Architecture & Implementation
Outline
Introduction
ANoC GALS adapter architecture
New bi-synchronous FIFOs using Johnson code
Design of ANoC Interfaces
Implementation & Results
Conclusion
Outline
Introduction
ANoC GALS adapter architecture
Objectives & Principles
New bi-synchronous FIFOs using Johnson code
Design of ANoC Interfaces
Implementation & Results
Conclusion
ANoC GALS Adapter Objectives
Design Objectives
QDI asynchronous (4-rail/4-phase) vs. synchronous NoC protocols
Virtual Channel policy (VC0 / VC1)
Local Clock Generator for easy frequency decoupling
High throughput, low latency, low area
Implementation Objectives
Standard-cell based design
Delivered as a hard macro for easy design flow integration
Hide asynchronous complexity from the final user
ANoC GALS Adapter Architecture
Efficient A-S and S-A FIFOs based on Johnson encoding
Protocol conversion between QDI & Synchronous logic
Virtual Channel multiplexing / demultiplexing
Programmable Local Clock Generator
Using a standard-cell based delay-line
Robust & efficient programming interface using pausable clocking
[Figure: block diagram with ANoC (QDI Async. Logic), A-S FIFO, S-A FIFO, IP Unit (Local Clock Domain), Delay Line Prog., Clock Gen.]
Outline
Introduction
ANoC GALS Adapter Architecture Proposal
New bi-synchronous FIFOs using Johnson encoding
Metastability issues in FIFOs
Johnson Code
FIFO Micro-Architecture
Design of ANoC Interfaces
Implementation & Results
Conclusion
Dual clock FIFO principles
Decoupled timing domains
Write pointer updated with the write clock
Read pointer updated with the read clock
High throughput : 1 transfer / cycle
Latency & minimal FIFO depth depend on the synchronization costs
Synchronization issue?
Pointers cross timing domains and need synchronization with the opposite clock
Needs ad-hoc encoding to ensure proper detection of full and empty states
[Figure: dual-clock FIFO. Write clock domain (Wclk, write address Wadd, full when R = W+1) and read clock domain (Rclk, read address Radd, empty when R = W)]
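The full and empty conditions shown on this slide (R = W for empty, R = W+1 for full) can be modelled in a few lines. This is a minimal Python sketch, assuming plain binary addresses that wrap modulo the FIFO depth; the names fifo_flags and max_occupancy and the depth of 4 are illustrative and do not come from the slides.

```python
def fifo_flags(rd_addr: int, wr_addr: int, depth: int) -> tuple[bool, bool]:
    """Return (empty, full) for a circular FIFO with binary addresses.

    Assumption: addresses wrap modulo `depth`; names are illustrative.
    """
    empty = rd_addr == wr_addr                 # R = W
    full = rd_addr == (wr_addr + 1) % depth    # R = W + 1
    return empty, full


def max_occupancy(depth: int) -> int:
    """Count how many writes fit before `full` is raised, starting empty."""
    rd_addr, wr_addr, count = 0, 0, 0
    while not fifo_flags(rd_addr, wr_addr, depth)[1]:
        wr_addr = (wr_addr + 1) % depth
        count += 1
    return count


if __name__ == "__main__":
    print(fifo_flags(0, 0, 4))   # flags of an empty FIFO
    print(max_occupancy(4))      # writes accepted before full is raised
```

With this convention a FIFO of depth 4 accepts 3 writes before full is raised, because one slot is kept free to tell full from empty.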
From Gray encoding to Johnson encoding
Gray code has a Hamming distance of 1
always synchronizes either last value or new value
Disadvantages of Gray
Increment & comparison operations are not trivial
need to be done in binary, needing converters
or with large translation tables
Natural Gray is limited to 2^N FIFO depths: area consuming
Gray adaptations to non-2^N depths need complex logic
Johnson encoding is to Gray what 1-hot is to binary
Also has a Hamming distance of 1
Less dense, not restricted to 2^N values
Increment & comparisons are trivial
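The Hamming-distance argument can be checked numerically. The sketch below is an illustration, not code from the presentation: it uses the standard reflected Gray code and a closed-form Johnson code (the helper names are made up here), and compares the distance between consecutive values, including the wrap-around step.

```python
def gray(i: int) -> int:
    """Standard reflected binary Gray code of i."""
    return i ^ (i >> 1)


def johnson(i: int, n_bits: int) -> int:
    """i-th state (0 <= i < 2*n_bits) of an n_bits Johnson counter."""
    mask = (1 << n_bits) - 1
    if i <= n_bits:
        return (1 << i) - 1                 # ones fill in from the LSB
    return (mask << (i - n_bits)) & mask    # zeros then fill in from the LSB


def cyclic_distances(codes: list[int]) -> list[int]:
    """Hamming distance between each code and its successor, wrapping around."""
    return [bin(a ^ b).count("1") for a, b in zip(codes, codes[1:] + codes[:1])]


if __name__ == "__main__":
    print(cyclic_distances([gray(i) for i in range(8)]))        # 8 values, power of two
    print(cyclic_distances([gray(i) for i in range(6)]))        # Gray cut to 6 values
    print(cyclic_distances([johnson(i, 3) for i in range(6)]))  # 3-bit Johnson, 6 values
```

Gray code keeps a distance of 1 over a full power-of-two cycle, but simply cutting it to 6 values gives a wrap-around step of distance 3, while a 3-bit Johnson code covers 6 values with every step at distance 1.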
Johnson counter
Also known as “twisted ring”
A single bit change propagates from LSB to MSB
Implemented by a shift register looped by an inverter
Increment and comparison are trivial
An N-bit Johnson counter encodes 2N values
[Johnson 74]

Value  Code
0      000
1      001
2      011
3      111
4      110
5      100

[Figure: 3-bit Johnson counter made of three D flip-flops clocked by clk, with outputs ptr[0], ptr[1], ptr[2] and next-state signals ptr_incr[0], ptr_incr[1], ptr_incr[2]]
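The shift-register view of the counter can be reproduced in software. This is a small Python sketch under the assumption that ptr[0] is the LSB and that the inverted MSB is fed back into the LSB; the function names are illustrative.

```python
def johnson_incr(ptr: list[int]) -> list[int]:
    """Next state of a Johnson counter; ptr[0] is the LSB.

    Every bit shifts one position towards the MSB and the inverted MSB
    is fed back into the LSB.
    """
    return [1 - ptr[-1]] + ptr[:-1]


def johnson_sequence(n_bits: int) -> list[str]:
    """All 2*n_bits states, written MSB first, starting from all zeros."""
    ptr = [0] * n_bits
    states = []
    for _ in range(2 * n_bits):
        states.append("".join(str(b) for b in reversed(ptr)))
        ptr = johnson_incr(ptr)
    return states


if __name__ == "__main__":
    print(johnson_sequence(3))
```

For 3 bits the sequence is 000, 001, 011, 111, 110, 100, which matches the Value / Code table of this slide.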
Johnson encoding for FIFO design
Use an N-bit Johnson counter (2N values) for a FIFO of depth N
Each bit defines a FIFO register
The bit value is the access parity for that given register
Control logic equations
Wen for register i: Wptr[i] xor Wptr_incr[i]
Ren for register i: Rptr[i] xor Rptr_incr[i]
Empty: Rptr = Wptr
Full: Rptr = NOT Wptr

Johnson code  Register id  Parity
00000         0            0
00001         1            0
00011         2            0
00111         3            0
01111         4            0
11111         0            1
11110         1            1
11100         2            1
11000         3            1
10000         4            1

Detect code transition to access register i
Full condition? No cell is lost in the FIFO to handle the Full state: Read and Write on same register
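The control equations of this slide can be exercised with a small behavioural model. The sketch below is a single-clock Python illustration only: the class name JohnsonFifo is hypothetical, and clock domain crossing and pointer synchronization are deliberately left out.

```python
class JohnsonFifo:
    """Behavioural model of a depth-N FIFO addressed by N-bit Johnson pointers.

    Single-clock model for illustration only: clock domain crossing and
    pointer synchronization are not represented.
    """

    def __init__(self, depth: int) -> None:
        self.regs = [None] * depth
        self.wr_ptr = [0] * depth   # index 0 is the LSB
        self.rd_ptr = [0] * depth

    @staticmethod
    def _incr(ptr: list[int]) -> list[int]:
        # Shift towards the MSB, feed the inverted MSB back into the LSB.
        return [1 - ptr[-1]] + ptr[:-1]

    @staticmethod
    def _enables(ptr: list[int], ptr_incr: list[int]) -> list[int]:
        # en[i] = ptr[i] xor ptr_incr[i]: exactly one bit is set.
        return [a ^ b for a, b in zip(ptr, ptr_incr)]

    @property
    def empty(self) -> bool:
        return self.rd_ptr == self.wr_ptr                    # Rptr = Wptr

    @property
    def full(self) -> bool:
        return self.rd_ptr == [1 - b for b in self.wr_ptr]   # Rptr = NOT Wptr

    def write(self, data) -> int:
        """Store data and return the index of the register that was written."""
        if self.full:
            raise OverflowError("FIFO full")
        nxt = self._incr(self.wr_ptr)
        i = self._enables(self.wr_ptr, nxt).index(1)
        self.regs[i] = data
        self.wr_ptr = nxt
        return i

    def read(self):
        """Return the oldest stored item."""
        if self.empty:
            raise IndexError("FIFO empty")
        nxt = self._incr(self.rd_ptr)
        i = self._enables(self.rd_ptr, nxt).index(1)
        self.rd_ptr = nxt
        return self.regs[i]


if __name__ == "__main__":
    fifo = JohnsonFifo(5)
    print([fifo.write(x) for x in "abcde"], fifo.full)
    print([fifo.read() for _ in range(5)], fifo.empty)
```

In this model a depth-5 FIFO accepts 5 writes (registers 0 to 4) before Full is raised, which illustrates the "no cell lost" remark: full and empty are told apart by the access parity, not by a spare register.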
Johnson encoding FIFO architecture
The FIFO crosses timing domains, between distinct synchronous or asynchronous domains
R/W pointer synchronization?
using 2 (or more) flip-flops for synchronous domains
using glitchless logic for the asynchronous side of the AS/SA interfaces
[Figure: FIFO architecture. Generic part: register storage with wr_data, rd_data, wr_addr, rd_addr, wr_ptr, rd_ptr, wr_en, rd_en, wr_clk, rd_clk. Specific part: synchronization, with ready and valid signals between the Write Clock Domain and the Read Clock Domain. empty: wr_ptr = rd_ptr; full: wr_ptr = NOT rd_ptr]
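Why a single-bit-change code matters for pointer synchronization can be illustrated by enumerating what a sampler may capture during a pointer update. This Python sketch rests on a simplifying assumption that is not stated in the slides: each changing bit independently resolves to either its old or its new value. The function name possible_samples is illustrative.

```python
from itertools import product


def possible_samples(old: str, new: str) -> set[str]:
    """Words a sampler may capture while a pointer moves from old to new.

    Assumption: each changing bit independently resolves to its old or
    new value; unchanged bits are always read correctly.
    """
    choices = [(o,) if o == n else (o, n) for o, n in zip(old, new)]
    return {"".join(bits) for bits in product(*choices)}


if __name__ == "__main__":
    print(sorted(possible_samples("011", "111")))   # Johnson step 2 -> 3
    print(sorted(possible_samples("011", "100")))   # binary step 3 -> 4
```

Under this assumption a Johnson step can only be seen as the old or the new pointer value, whereas a binary step that flips three bits could be seen as any of the eight 3-bit words.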
Outline
Introduction
ANoC GALS Adapter Architecture Proposal
New bi-synchronous FIFOs using Johnson encoding
Design of ANoC Interfaces
A-S Interface & S-A Interface
Local Clock Generator
Implementation & Results
Conclusion
[Figure: block diagram with ANoC (QDI Async. Logic), A-S interface, S-A interface, IP Unit (Local Clock Domain), Delay Line Prog., Clock Gen.]
A-to-S interface
Design constraints
write clock is locally generated (output of the input data word completion tree)
write clock edge must occur after 4-rail to bundled-data (BD) conversion
FIFO size must be at least 5 due to AS interface round trip
Clocks: read clk from IP side, write clk locally generated
[Figure: A-to-S interface block diagram. Asynchronous NoC side: ai_data, ai_data_ack, ai_send[0], ai_send[1], ai_send_ack, ai_accept[0], ai_accept[1], ai_accept_ack[0], ai_accept_ack[1]; 4RailBD converter; FIFO with wr_clk, rd_clk, wr_en = 1, rd_en, wr_data, rd_data, ready, valid; Synchronous IP side: si_data, si_send[0], si_send[1], si_accept[0], si_accept[1]]
[Figure: full / empty detector for the A-S interface, with inputs wr_ptr[i], wr_ptr_half[i], wr_ptr_incr[i] and rd_ptr[i] (i = 0..N-1), outputs empty and full, and 2-flop synchronization]
[Figure: signal transition graph split into synchronous operation and asynchronous operation; transition labels: rd_clk↑, rd_ptr++, ready↑, ai_accept↑, valid↑, ai_send↑, wr_ptr++, wr_clk↑]
S-to-A interface
Design constraints
read clock is locally generated (output of the asynchronous data ack. completion tree)
asynchronous token generation must occur after bundled-data (BD) to 4-rail conversion
FIFO size must be at least 5 due to SA interface round trip
Clocks: write clk from IP side, read clk locally generated
[Figure: S-to-A interface block diagram. Synchronous IP side: so_data, so_send[0], so_send[1], so_accept[0], so_accept[1]; FIFO with wr_clk, rd_clk, wr_en, rd_en = 1, wr_data, rd_data, ready, valid; BD4Rail converter; Asynchronous NoC side: ao_data, ao_data_ack, ao_send[0], ao_send[1], ao_send_ack, ao_accept[0], ao_accept[1], ao_accept_ack[0], ao_accept_ack[1]]
[Figure: full / empty detector for the S-A interface, with inputs wr_ptr[i], wr_ptr_incr[i] and rd_ptr[i] (i = 0..N-1), outputs full and empty, and 2-flop synchronization]
[Figure: signal transition graph split into synchronous operation and asynchronous operation; transition labels: wr_clk↑, wr_ptr++, valid↑, ao_send↑, ready↑, ao_send_ack↓, rd_ptr_half++, rd_clk↓]
Local Clock Generator
Safe reprogramming
Use pausable clock principle
Design constraints
AFSM for controlling the req/ack pause signals for reprogramming
Linear delay increment for total delay line delay
Clock division for lower frequencies for DVFS and test
Reprogramming is only done during the low phase of the clock (all signals are stable and equal to 0)
Full standard-cell implementation
Latch-mux based programmable delay line
[Figure: local clock generator. Coarse grain frequency selection: counter-based clock division (clk_div, Test /=0) producing clk_out; ME (mutual exclusion element); signals ri_dl, ai_dl, rclk, dclk, aclk, iclk, clk_freq]
[Figure: fine grain frequency tuning. Delay line from delay_in to delay_out, with stages of weight 1 to 2^k selected by clk_freq[0] to clk_freq[k]]
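The linear delay increment of a binary-weighted delay line can be sketched numerically. Everything quantitative below is an assumption made for illustration: the polarity (a set clk_freq[k] bit inserts the 2^k stage), the 4-bit code width, the unit and fixed delays in picoseconds, and the relation "clock period = 2 x loop delay" are not given in the slides.

```python
def delay_line_ps(code: int, n_bits: int, unit_ps: float, fixed_ps: float) -> float:
    """Total delay for a given clk_freq code (hypothetical timing values)."""
    delay = fixed_ps
    for k in range(n_bits):
        if (code >> k) & 1:              # assumed: clk_freq[k] inserts the 2**k stage
            delay += (1 << k) * unit_ps
    return delay


def clock_mhz(code: int, n_bits: int, unit_ps: float, fixed_ps: float, div: int = 1) -> float:
    """Generated frequency, assuming period = 2 * loop delay, then divided by div."""
    period_ps = 2.0 * delay_line_ps(code, n_bits, unit_ps, fixed_ps)
    return 1e6 / (period_ps * div)


if __name__ == "__main__":
    for code in range(16):
        delay = delay_line_ps(code, 4, 50.0, 500.0)
        print(code, delay, round(clock_mhz(code, 4, 50.0, 500.0), 1))
```

In this model the delay grows by one unit per code step, so the frequency steps are larger where the delay is short, and the div parameter mimics the coarse grain clock division.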
Outline
Introduction
ANoC GALS Adapter Architecture Proposal
New bi-synchronous FIFOs using Johnson encoding
Design of ANoC Interfaces
Implementation & Results
Conclusion
Design Flow
[Figure: block diagram with ANoC (QDI Async. Logic), A-S interface, S-A interface, IP Unit (Local Clock Domain), Delay Line Prog., Clock Gen.]
GALS adapter implementation
Full standard-cell (CORELIB + TAL; the TAL library provides C-elements and Mutex)
Mixed RTL / gate instantiation
Synthesis with definition of the 3 clock domains
Max-delay constraints on cross-domain paths
Standard place & route design flow
Validation on SDF back-annotated netlist in all corners
Delivered as a hard macro with all CAD views
GALS adapter usage
Use the hard macro as a synchronous IP with CTS + .lib files
Easy top-level integration
Hides asynchronous complexity from the final user
Provides full timing domain decoupling
Using a GALS NoC scheme
Hard-Macro Layout (STMicroelectronics CMOS65LP)
[Figure: hard-macro layout, 160µm x 80µm, showing the clock generator (Clk Gen), Interface S-A and Interface A-S; pins: Clk-in, Clk-out, reset_n, test_scan, clk_freq, sync_out_data, sync_in_data, async_out_data, async_in_data]
ANoC GALS Adapter Performances
[Figure: two plots of frequency (MHz) versus cfg_freq (0 to 15), one for the nominal case (1.20V, 25°C) and one for the worst case (1.05V, 105°C). Curves: generated clock frequency, A-S interface max. throughput, S-A interface max. throughput, ANoC router max. throughput]

Case         Frequency range (delay line prog.)(*)   A-to-S throughput   S-to-A throughput
WorstCase    200MHz to 500MHz                        400MHz              280MHz
NominalCase  380MHz to 980MHz                        680MHz              510MHz
(*) plus additional values through division factors

Using QDI, ANoC and the GALS adapter will provide about 500MFlit/s.
The delay line will provide a local clock frequency up to 1GHz
(less precision for higher frequencies)
Comparison with previous solutions
Design               Max clock freq (MHz)   Max A-S throughput (Mflit/s)   Max S-A throughput (Mflit/s)   Layout area (µm²)
FAUST(*) (2005)      350                    300                            250                            34,000
ALPIN (2007)         400                    220                            180                            9,000
MAGALI (this work)   980                    710                            520                            12,500

FAUST [Async’06]: use of FIFO interfaces based on Gray code
ALPIN [NOCS’08]: use of pausable clocking
(*) 130nm values converted to 65nm
Outline
Introduction
ANoC GALS Adapter Architecture Proposal
New bi-synchronous FIFOs using Johnson encoding
Design of ANoC Interfaces
Implementation & Results
Conclusion
Lessons from the past…
Proposed to (re-)use Johnson encoding [Johnson74]
Allows efficient FIFO design & any FIFO depth compared to standard Gray code
The FIFO can be sized to the minimum depth, reducing area
Pausable clocking?
Provides very low area overhead at the cost of very low bandwidth
Limits the maximum clock frequency [Dobkin, Ginosar 05]
Well suited to low-power solutions, not to high performance
Nevertheless efficient to provide a robust interface for a DFS scheme
NoC Virtual Channels are costly for GALS interfaces
Do not demux the VCs in the GALS interface when VCs are not interleaved within the synchronous unit
Share all you can!
GALS Design Flow
Provide the GALS interface as a hard macro with all CAD views
For easy integration at top level