1 Proposal Draft on H Matrix and Decoder Choice for P1890 LDPC Decoders. IEEE P1890 Standards Working Group on Error Correction Coding for Non-Volatile Memories, July 2014

2 Member List of IEEE P1890

Name | Employer | Affiliation | Role
Cole, John | U.S. Army | U.S. Army Research Laboratory | Sponsor
Declercq, David | Ecole Nationale Superieure de l'Electronique et de ses Applications (ENSEA), Cergy, France | | Vice-Chair
Dolecek, Lara | University of California, Los Angeles, US | University of California, Los Angeles (UCLA), US |
Gunnam, Kiran | HGST Research, San Jose, CA, US | | Chair
Karayer, Erdem | Ege University, İzmir, Turkey | |
Mohsenin, Tinoosh | University of Maryland, Baltimore County, US | | Secretary
Motwani, Ravi | Intel Corporation, Santa Clara, CA, US | | Co-Chair
Vasic, Bane | University of Arizona, Tucson, AZ, US | IEEE Member / Self |
Yee, Rosanna | Intel Corporation, Vancouver, Canada | |
Kwok, Zion | Intel Corporation, Vancouver, Canada | |
Nelson, Scott | Intel Corporation, Vancouver, Canada | |
Wesel, Rick | University of California, Los Angeles, US | University of California, Los Angeles (UCLA), US |
Ranganathan, Sudarsan | University of California, Los Angeles, US | University of California, Los Angeles (UCLA), US |
Siegel, Paul | University of California, San Diego, US | University of California, San Diego (UCSD), US |
Amiri, Behzad | University of California, Los Angeles, US | University of California, Los Angeles (UCLA), US |
Planjery, Shiva | Codelucida, Tucson, AZ, US | |
Dick, Chris | Xilinx, San Jose, CA, US | |
Vakilinia, Kasra | University of California, Los Angeles, US | University of California, Los Angeles |

3 Several features (such as LDPC decoder architectures, datapath organization, out-of-order processing for Rnew, out-of-order processing in general, memory organization such as using the same memory for L and Q values and using the same memory for L and P values, the value-reuse property, check node unit designs, decoder scheduling features, code construction constraints for efficient hardware implementation, and numerous other features that are equally applicable to binary and non-binary decoders) are covered by the following issued and pending patents and patent applications owned by the Texas A&M University System (TAMUS):

8,359,522 Low density parity check decoder for regular LDPC codes
8,418,023 Low density parity check decoder for irregular LDPC codes
8,555,140 Low density parity check decoder for irregular LDPC codes
8,656,250 Low density parity check decoder for regular LDPC codes
14/141,508 (pending application)

The H matrix choice limits the decoder architecture choices. Based on the current choice of H matrices for the standard, the block serial layered decoder with out-of-order processing is the most efficient hardware architecture, and its features are covered by several claims of the TAMU LDPC decoder patents. If the H matrix choice is changed to a varying circulant size for different code rates, the block parallel layered decoder with constraints on the H matrix and a reconfigurable CNU becomes the most efficient hardware architecture, whose features are likewise covered by several claims of the TAMU patents. The serial CNU for min-sum, and the parallel CNU for min-sum with a reconfigurable min1-min2 finder, are also covered by several claims of the TAMU patents.

The following slides give a brief overview of a few key features and a few independent apparatus claims. Method claims are not listed for brevity, and not all relevant figures are shown. This is not intended to be exhaustive, nor is it intended as a discussion of the claims.
Technical Feature Discussion of Texas A&M LDPC Decoder

4 Texas A&M Block Serial Layered LDPC Decoder

5 Features/Advantages of the Texas A&M LDPC Decoder Compared to Other Decoders

1) Instead of having three separate memories to store the L, P, and Q values, the Q memory (sometimes called the LPQ memory) can be used to store all of them. Since at any time only L or P or Q is needed for a given circulant, the memory is managed at the circulant level. This feature is possible due to the specific way the datapath is organized and computations are scheduled; out-of-order processing for Rnew messages is one key enabler of this advantage.
2) The architecture comprises one cyclic shifter instead of two.
3) The value-reuse property is effectively used to compute the Rnew and Rold messages.
4) Low-complexity datapath design with no redundant datapath operations.
5) Low-complexity check node unit design based on the value-reuse property.
6) Out-of-order processing in PS (partial state) processing to eliminate pipeline and memory-access stall cycles.
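The value-reuse property behind features 3 and 5 rests on a well-known fact about min-sum: every outgoing check-node message has magnitude min1 (the smallest input magnitude) except the one going back to the min1 edge, which gets min2. A minimal Python sketch of a serial CNU exploiting this (the scaling factor `alpha` and the floating-point message format are illustrative assumptions, not details from the slides):

```python
def serial_cnu_min_sum(q_msgs, alpha=1.0):
    """Serial check-node unit using the min-sum value-reuse property.

    Instead of storing one R message per edge, only min1, min2, the
    index of min1, and the aggregate sign are kept; every outgoing
    R message is reconstructed from these few values.
    """
    mags = [abs(q) for q in q_msgs]
    min1 = min(mags)
    idx1 = mags.index(min1)
    min2 = min(m for i, m in enumerate(mags) if i != idx1)
    total_sign = 1
    for q in q_msgs:
        total_sign *= 1 if q >= 0 else -1
    r_msgs = []
    for i, q in enumerate(q_msgs):
        mag = min2 if i == idx1 else min1            # value reuse
        sign = total_sign * (1 if q >= 0 else -1)    # extrinsic sign
        r_msgs.append(sign * alpha * mag)
    return r_msgs
```

A parallel CNU computes the same min1/min2/index triple with a comparator tree instead of a serial scan; the reconstruction step is identical.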

6 Savings in the Texas A&M Decoder

Feature 1 has a significant impact on reducing the area of an LDPC decoder. For instance, in an LDPC decoder sustaining 15 iterations at 1 GHz in a 65 nm CMOS process supporting a 1 KB code, each of the L/P/Q memories consumes about 1 mm^2, and the decoder logic consumes an additional 1 mm^2. A standard implementation of the layered decoder therefore dedicates 4 mm^2 to memory and logic, whereas the proposed architecture requires only 2 mm^2 (1 mm^2 for the LPQ memory and 1 mm^2 for the decoder logic), a 50% reduction in memory area. Furthermore, out-of-order PS processing (Feature 6) reduces the necessary hardware by an additional 50% (i.e., from 2 mm^2 to 1 mm^2) for the class of LDPC codes needed, so the overall area reduction offered by the proposed architecture is 75% (i.e., from 4 mm^2 to 1 mm^2). See the next slides.
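The area arithmetic above can be restated in a few lines (figures taken directly from the slide):

```python
# Baseline layered decoder: three L/P/Q memories at 1 mm^2 each
# plus 1 mm^2 of decoder logic.
baseline = 3 * 1.0 + 1.0          # 4 mm^2
merged_lpq = 1.0 + 1.0            # single LPQ memory + logic = 2 mm^2
with_oop = merged_lpq / 2         # Feature 6 halves the remainder

mem_saving = 1 - merged_lpq / baseline   # memory-merge saving
total_saving = 1 - with_oop / baseline   # overall saving
```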

7 Out-of-Order Block Processing for Partial State

Re-ordering of block processing: while processing layer 2, the blocks that depend on layer 1 are processed last to allow for the pipeline latency. In the above example the pipeline latency can be 5 (the vector pipeline depth is 5), so no stall cycles are needed while processing layer 2 due to the pipelining. (In other implementations, stall cycles are introduced, which effectively reduces the throughput by a huge margin.) We also sequence the operations within a layer so that the block whose dependent data has been available the longest is processed first. This naturally leads to true out-of-order processing across several layers. In practice, we do not do out-of-order partial state processing involving more than 2 layers.
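The reordering rule described above can be sketched in a few lines. This is a simplified illustration with hypothetical block IDs and a hypothetical dependency set, not the slide's actual H matrix:

```python
def order_layer_blocks(blocks, depends_on_prev_layer):
    """Reorder the blocks of a layer so that blocks depending on the
    previous layer are processed last, hiding the pipeline latency.

    blocks: block IDs of the current layer, in natural order.
    depends_on_prev_layer: set of block IDs whose input comes from
    the previous layer and is still in the pipeline.
    """
    independent = [b for b in blocks if b not in depends_on_prev_layer]
    dependent = [b for b in blocks if b in depends_on_prev_layer]
    # Independent blocks fill the pipeline-latency window; by the time
    # the dependent blocks issue, the previous layer's results are ready.
    return independent + dependent
```

If at least pipeline-depth-many independent blocks exist in each layer, no stall cycles are inserted; otherwise a scheduler would fall back to stalling.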

8 Out-of-Order Layer Processing for R Selection

Normal practice is to compute the Rnew messages for each layer after that layer's CNU PS processing. Here, however, we decouple the execution of a layer's Rnew messages from the execution of that layer's CNU PS processing. Rather than simply generating Rnew messages layer by layer, we compute them on the basis of circulant dependencies. R selection is out-of-order so that it can feed the data required for the PS processing of the second layer. For instance, the Rnew messages for circulant 29, which belongs to layer 3, are not generated immediately after layer 3's CNU PS processing. Rather, Rnew for circulant 29 is computed when the PS processing of circulant 20 is done, as circulant 29 is a dependent circulant of circulant 20. Similarly, Rnew for circulant 72 is computed when the PS processing of circulant 11 is done, as circulant 72 is a dependent circulant of circulant 11. We execute the instruction/computation at the precise moment the result is needed.
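This on-demand R selection can be sketched as a scheduler that emits an Rnew computation right after the PS processing of the circulant it depends on. The circulant numbers below follow the slide's example (29 depends on 20, 72 depends on 11); the dependency mapping itself is illustrative:

```python
def schedule_r_selection(ps_order, dependents):
    """Interleave Rnew generation with PS processing.

    ps_order: circulants in the order their PS processing completes.
    dependents: maps a circulant to the circulants whose Rnew must be
    ready as soon as its PS processing is done.
    """
    schedule = []
    for c in ps_order:
        schedule.append(("PS", c))
        for d in dependents.get(c, []):
            # Rnew computed exactly when its result is needed,
            # not at the end of d's own layer.
            schedule.append(("Rnew", d))
    return schedule
```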

9 Clock Cycles Per Iteration for 2-Circulant Processing, 1 KB Codes

H matrix |     | CCI_Ideal | CCI, No OOP | CCI, OOP | Efficiency = CCI_Ideal/CCI | Efficiency with OOP = CCI_Ideal/CCI_OOP
H6       | 66  | 134       | 204         | 134      | 65.6%                      | 100%
H8       | 68  | 138       | 229         | 138      | 60.2%                      | 100%
H12      | 72  | 147       | 284         | 149      | 51.3%                      | 99.3%

CCI = clock cycles per iteration. Pipeline depth = 8; number of circulant banks = 3.
Bits processed per clock = (Number of circulants processed per clock * Circulant size) / (Average iterations * Column degree) = (2*140)/(2*4) = 35.
Effective bits processed per clock = Bits processed per clock * CCI_Ideal/CCI.
Efficiency is measured as the ideal number of cycles (if there were no stall cycles) divided by the total number of clock cycles. Out-of-order processing reduces the number of stall cycles and enables close to 100% efficiency, versus roughly 50% efficiency for decoders without out-of-order processing. This results in a 50% reduction in area and power for the Texas A&M decoder versus other decoders.
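The efficiency figures and the bits-per-clock formula in the table can be checked directly; the numbers below are from the H6 row of the 1 KB table:

```python
# H6 row of the 1 KB table.
cci_ideal, cci_no_oop, cci_oop = 134, 204, 134

eff_no_oop = cci_ideal / cci_no_oop   # about 0.66 (slide reports 65.6%)
eff_oop = cci_ideal / cci_oop         # 1.0, i.e. no stall cycles remain

# Bits processed per clock, per the slide's formula:
# (circulants per clock * circulant size) / (avg iterations * column degree)
bits_per_clock = (2 * 140) / (2 * 4)
```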

10 Clock Cycles Per Iteration for 2-Circulant Processing, 2 KB Codes

H matrix |     | CCI_Ideal | CCI, No OOP | CCI, OOP | Efficiency = CCI_Ideal/CCI | Efficiency with OOP = CCI_Ideal/CCI_OOP
H8       | 128 | 257       | 392         | 257      | 65.6%                      | 100%
H12      | 132 | 265       | 426         | 265      | 62.2%                      | 100%
H16      | 136 | 274       | 493         | 274      | 55.5%                      | 100%
H24      | 144 | 288       | 515         | 292      | 55.9%                      | 98.63%

Pipeline depth = 8; number of circulant banks = 3.
Bits processed per clock = (Number of circulants processed per clock * Circulant size) / (Average iterations * Column degree) = (2*140)/(2*4) = 35.
Effective bits processed per clock = Bits processed per clock * CCI_Ideal/CCI.
Efficiency is measured as the ideal number of cycles (if there were no stall cycles) divided by the total number of clock cycles. Out-of-order processing reduces the number of stall cycles and enables close to 100% efficiency, versus roughly 55% efficiency for decoders without out-of-order processing. This results in a 50% reduction in area and power for the Texas A&M decoder versus other decoders.

