# Implementation and Benchmarking of Round 2 Candidates in the NIST Post-Quantum Cryptography Standardization Process Using Hardware and Software/Hardware Co-design Approaches

Viet Ba Dang<sup>1</sup>, Farnoud Farahmand<sup>1</sup>, Michal Andrzejczak<sup>2</sup>, Kamyar Mohajerani<sup>1</sup>, Duc Tri Nguyen<sup>1</sup> and Kris Gaj<sup>1</sup>

> <sup>1</sup> Cryptographic Engineering Research Group, George Mason University
> Fairfax, VA, U.S.A.
>
> <sup>2</sup> Military University of Technology
> Warsaw, Poland

**Abstract.** Performance in hardware has typically played a major role in differentiating among leading candidates in cryptographic standardization efforts. Winners of two past NIST cryptographic contests (Rijndael in case of AES and Keccak in case of SHA-3) were ranked consistently among the two fastest candidates when implemented using FPGAs and ASICs. Hardware implementations of cryptographic operations may quite easily outperform software implementations for at least a subset of major performance metrics, such as speed, power consumption, and energy usage, as well as in terms of security against physical attacks, including side-channel analysis. Using hardware also permits much higher flexibility in trading one subset of these properties for another. A large number of candidates at the early stages of the standardization process makes accurate and fair comparison very challenging. Nevertheless, in all major past cryptographic standardization efforts, future winners were identified quite early in the evaluation process and held their lead until the standard was selected. Additionally, identifying some candidates as either inherently slow or costly in hardware helped to eliminate a subset of candidates, saving countless hours of cryptanalysis. Finally, early implementations provided a baseline for future design space explorations, paving the way to more comprehensive and fairer benchmarking at the later stages of a given cryptographic competition.
In this paper, we first summarize, compare, and analyze results reported by other groups until mid-May 2020, i.e., until the end of Round 2 of the NIST PQC process. We then outline our own methodology for implementing and benchmarking PQC candidates using both hardware and software/hardware co-design approaches. We apply our hardware approach to 6 lattice-based CCA-secure Key Encapsulation Mechanisms (KEMs), representing 4 NIST PQC submissions. We then apply a software-hardware co-design approach to 12 lattice-based CCA-secure KEMs, representing 8 Round 2 submissions. We hope that, combined with results reported by other groups, our study will provide NIST with helpful information regarding the relative performance of a significant subset of Round 2 PQC candidates, assuming that at least their major operations, and possibly the entire algorithms, are off-loaded to hardware.

**Keywords:** Post-Quantum Cryptography · hardware · software/hardware co-design · FPGA · System on Chip · ASIC · Key Encapsulation Mechanism · digital signature · public-key · ARM · NEON

### 1 Introduction

Hardware benchmarking has played a major role in all recent cryptographic standardization efforts, such as the AES, eSTREAM, SHA-3 [11, 32, 43, 44], and CAESAR contests [17, 18]. With the emergence of commonly-accepted hardware application programming interfaces (APIs) [37], development packages [33, 36], specialized optimization tools [31, 23], new design methodologies based on High-Level Synthesis (HLS) [34, 35], and mandatory hardware implementations in the final round of the CAESAR contest [17], the percentage of initial submissions implemented in hardware grew from 27.5% in the SHA-3 contest [30] to 49.1% in the CAESAR competition [18, 29]. In Round 2, all AES, all SHA-3, and all but one CAESAR candidates had at least one hardware implementation reported by the end of the evaluation process.
In almost all cases, candidates performing particularly well in hardware were identified quite early during the evaluation process. For example, Keccak led in terms of speed in hardware already in Round 2 of the SHA-3 contest. It outperformed 13 remaining Round 2 candidates and the old standard SHA-2. AEGIS-128 was identified as one of the three fastest authenticated ciphers in Round 2 of the CAESAR contest when implemented using high-performance FPGAs, Virtex-6, Virtex-7, Stratix IV, and Stratix V. It outperformed at least 25 other candidates and the current standard AES-GCM. At the same time, during each contest, several candidates were identified as particularly costly, slow, or cumbersome to implement in hardware. Examples included Mars during the AES contest; BMW, ECHO, and SIMD during the SHA-3 competition; HS1-SIV, POET, and OMD in the CAESAR contest. The early identification of hardware inefficiency helped to focus the effort of the cryptographic community on more promising candidates, potentially saving countless hours of cryptanalysis.

**Hardware vs. software.** Cryptographic algorithms are routinely implemented using both software and hardware. By software, we mean implementations that can be executed using processors. These processors may vary from low-cost low-power embedded processors, such as ARM Cortex-M4, to high-performance general-purpose microprocessors, such as Intel Core i7, with Haswell microarchitecture, supporting Advanced Vector Extensions 2 (AVX2) and the AES New Instructions (AES-NI). The common feature is that all of these processors are typically programmed using high-level programming languages, such as C. Code written in these languages is portable among different processor types. Software implementations can be further optimized by using assembly language programming, involving instructions specific to a given processor (or more accurately to its Instruction Set Architecture (ISA)).
Assembly language programs are not easily portable among processors based on different ISAs. By hardware, we mean implementations that can be executed using Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Programmable Logic (PL) of System on Chip FPGAs (SoC FPGAs), Application-Specific Standard Products (ASSPs), etc. The common feature is that most of these implementations are developed using hardware description languages (HDLs), such as VHDL and Verilog. These languages differ substantially from high-level programming languages by introducing the concepts of an entity, connectivity, concurrency, and timing. HDL source code is transformed by a synthesis tool to a netlist composed of basic logic components and connections among these components. Because of its generic nature, HDL code can be easily ported among different technologies, such as FPGAs and ASICs. ASIC implementations are faster, use less power, and require less physical area. FPGA implementations have the advantage of less expensive development tools, much shorter design cycle, and reconfigurability, understood as an ability to change the function of all internal building blocks and connections among them, even after a given integrated circuit has been deployed in actual products. Low cost and short development cycle are decisive factors making FPGAs more suitable for benchmarking and ranking candidates during the evaluation period. Reconfigurability supports the algorithm and parameter agility, making FPGAs more frequently used than ASICs during the early stages of deployment of PQC in real products. The relative performance of cryptographic algorithms in FPGAs has been shown correlated to their relative performance in ASICs [32]. At the same time, this correlation is not guaranteed to hold across multiple classes of cryptographic transformations, e.g., it is not guaranteed to work equally well for hash functions and PQC algorithms. 
Therefore, both FPGA and ASIC benchmarking studies are essential. Although software implementations are likely to be dominant during the first phase of deploying PQC standards in real applications, hardware implementations will inevitably follow. They are likely to start from hardware accelerators for constrained environments, such as smart cards and Internet of Things devices. Low-cost low-power processors used in such applications may not be able to keep up with the increased demands for computational power and energy usage. Thus, these processors may need to be extended with hardware accelerators. In the medium term, high-performance security processors enhanced with new PQC standards will emerge. These processors will be optimized to process in hardware all the algorithms associated with secure communication (such as those used in the post-quantum versions of TLS, IPSec, IKE, and WTLS/WAP protocols) and secure storage. Finally, in the longer-term, support for new instructions, enabling the efficient and side-channel resistant implementations of PQC standards, is likely to be added to the most popular processor ISAs. Co-processors for such instructions are, effectively, hardware implementations of PQC. Taking into account that the new PQC standards are likely to remain in use for decades, all of the mentioned above use cases should be given considerable weight. In particular, the performance of a given algorithm in hardware may affect its long-term performance in software, on processors equipped with new specialized instructions. Even if Round 2 hardware implementations are not a final word in terms of the algorithm performance, they provide the first glimpse into each candidate's suitability for hardware acceleration. They also establish an open source-code base on which more optimized implementation and implementations protected against side-channel and fault attacks can be built in Round 3 and beyond. **High-speed vs. 
lightweight.** Assuming the use of the same technology, hardware implementations outperform software implementations using at least one, and typically multiple metrics, such as speed, power consumption, energy usage, and security against physical attacks. They also allow much higher flexibility in trading one subset of these metrics for another. From the point of view of benchmarking and ranking of candidates, such flexibility may become a curse, especially taking into account that no two metrics are likely to have a simple linear dependence on each other. A practical solution to this problem is to focus during the evaluation process on two major types of implementations: high-speed and lightweight. In high-speed implementations, the primary target is speed. For PQC schemes, this target amounts to minimizing the execution times of major operations involving the public and private key, respectively. For Key Encapsulation Mechanisms (KEMs), these operations are encapsulation and decapsulation; for digital signature schemes, signature verification and generation; for public-key encryption (PKE), encryption and decryption. The time of key generation may also play a major role in the case when a public-private key pair cannot be reused for security reasons. The resource utilization is secondary. Still, hardware designers typically aim at achieving the Pareto optimality, in which any further improvement in speed comes at the disproportionate cost in terms of resource utilization. The primary advantage of high-speed implementations is that they reveal the inherent potential of a given algorithm for parallelization. As long as the resource-utilization limit is sufficiently high, this limit does not affect the ranking of algorithms. As a result, the ranking is strongly correlated with the features of algorithms themselves and is not substantially influenced by any additional assumptions and technology choices. 
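The notion of Pareto optimality mentioned above can be illustrated with a minimal sketch. It uses the standard definition (a design point is Pareto-optimal if no other point is at least as good in both metrics and strictly better in at least one), which is stricter than the informal description in the text. The design-point names and all numbers below are hypothetical and are not taken from any reported implementation.

```python
from typing import List, NamedTuple


class DesignPoint(NamedTuple):
    """One hypothetical hardware architecture of a single algorithm."""
    name: str
    latency_us: float  # execution time of the major operation; lower is better
    luts: int          # resource utilization; lower is better


def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """True if `a` is no worse than `b` in both metrics and strictly better in one."""
    no_worse = a.latency_us <= b.latency_us and a.luts <= b.luts
    strictly_better = a.latency_us < b.latency_us or a.luts < b.luts
    return no_worse and strictly_better


def pareto_front(points: List[DesignPoint]) -> List[DesignPoint]:
    """Keep only the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical design-space exploration results (illustrative values only).
points = [
    DesignPoint("A", 10.0, 50_000),
    DesignPoint("B", 12.0, 20_000),
    DesignPoint("C", 15.0, 25_000),  # slower and larger than B
    DesignPoint("D", 30.0, 8_000),
    DesignPoint("E", 11.0, 60_000),  # slower and larger than A
]

front = pareto_front(points)
print([p.name for p in front])  # -> ['A', 'B', 'D']
```

In this toy data set, moving from B to A buys a 2 µs speed-up at the cost of 30,000 additional LUTs, which is the kind of disproportionate trade-off a high-speed designer has to weigh when choosing the point to report.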
Additionally, only high-speed hardware implementations may effectively compete with optimized software implementations targeting high-performance processors with vector instructions (e.g., AVX2). In lightweight implementations, the primary targets are typically minimum resource utilization and minimum power consumption, under the assumption that the execution time does not exceed a predefined maximum. Another way of formulating the goal is to achieve minimum execution time, assuming a given maximum budget in terms of resource utilization, power consumption, or energy usage. The maximum budget on resource utilization is related to the cost of implementation; the budget on power assures correct operation without overheating or devoting additional resources to cooling. The maximum energy usage affects how long a battery-operated device can function before the next battery recharge. In the context of the standardization process for cryptographic algorithms, the mentioned above maximum budgets are very hard to select. Any change in these thresholds may favor a different subset of candidates. With new standards remaining in use for decades, timing, cost, and power requirements of new and emerging applications are very challenging to predict. Additionally, changes in technology significantly affect which hardware architectures meet particular constraints. For example, an architecture capable of accomplishing the execution time of 0.1 seconds (or below), under a certain power or energy budget, may substantially change with the improvements in technology. As a result, the majority of current limits are selected somewhat arbitrarily by different designers, or left undefined in their reports. Consequently, the ranking of PQC candidates based on their lightweight implementations, especially those developed by different groups, is extremely challenging and assumption-dependent. 
These rankings have little to do with the parallelization allowed by each algorithm, as most of the operations must be executed sequentially due to the small resource budget. The primary feature of algorithms these implementations reveal is the number and complexity of its distinct elementary operations. Each major operation infers an additional functional unit, increasing resource utilization and power consumption. Additionally, lightweight hardware implementations can outperform only software implementations targeting specific low-cost low-power embedded processors, such as Cortex-M4. In the case of FPGA implementations, resource utilization is a vector, such as (#LUTs, #flip-flops, #DSP units, #BRAMs). No single element of this vector can be expressed in terms of other elements. As a result, imposing a resource limit implies specifying the values of all components of this resource vector. One possible approach may be to choose the resources of the smallest FPGA of a given low-cost FPGA family. However, FPGA families and their resources change over time, so this limit has only a physical meaning during the limited time, covering the evaluation period, and may lose its significance just a few years after the standard is published and deployed. Finally, the same FPGA device may also need to accommodate any overhead associated with countermeasures against side-channel attacks. At the same time, this overhead or even effective countermeasures may remain unknown at the time of the candidates' evaluation. As a result, in this paper, we focus on the development, benchmarking, and ranking of high-speed implementations. At the same time, we do our best to summarize lessons learned from the development of lightweight implementations by other groups. **Speeding-up the development process.** Traditionally software and hardware benchmarking were conducted separately by different groups of experts, equipped with different knowledge and tools. 
Even the units for expressing speed were different – cycles per byte for software and megabits per second for hardware. For PQC algorithms, this approach is hard to maintain. These algorithms are simply too complex and too different from the current state-of-the-art in public-key cryptography to permit the development of optimized purely-hardware implementations for a significant fraction of Round 2 candidates by any single group within the time frame imposed by the NIST evaluation process. Two approaches to overcome the long development time have emerged. The first is software/hardware co-design [22, 74]; the second is the use of HLS [14, 22, 56]. Software/hardware co-design has been used for years in the industry and studied extensively in academia, with the goal of reaching performance targets using a shorter development cycle than that typical for hardware-only implementations. To the best of our knowledge, no benchmarking of software/hardware co-designs was reported during any previous cryptographic competitions. As a result, multiple problems specific to cryptographic contests, such as the choice of the most representative platform(s) and the fairness of software/hardware partitioning schemes, have never been addressed. It should be clearly stated that software/hardware benchmarking is not intended as a replacement for purely-hardware benchmarking. On the contrary, applying this approach to selected Round 2 candidates and developing a library of hardware accelerators for major operations of these candidates will make it much easier to develop hardware-only implementations in subsequent rounds. Although the software/hardware co-design approach can be used to realize both high-speed and lightweight implementations, in this paper, we focus on its application to high-speed designs. Within the proposed framework, one of the first issues to address is the choice of the appropriate platform. 
In particular, we need a computing platform allowing fast communication across the software/hardware boundary. We also need the suitable prototyping board, as the timing measurements had to be performed experimentally, and the computing platform had to be well-suited for attempting various software/hardware partitioning schemes. The choice of a suitable device and prototyping board is addressed in Section 4. With the preferred platform identified, our second major concern is the fairness of software/hardware benchmarking, especially in terms of deciding which operations within each evaluated scheme should be offloaded to hardware. In this paper, we propose a comprehensive approach to address this issue, aimed at achieving the best possible trade-off between the final performance and the required development time. This approach is described in detail in Section 4. The second approach to substantially accelerating the development time is the use of High-Level Synthesis. This approach amounts to refactoring a software implementation in C or C++ in such a way that this implementation can be used as an input to a High-Level Synthesis tool, such as Vivado HLS or LegUp, capable of automatically transforming such an implementation to HDL code. The result is a purely hardware implementation obtained based on the code written in a traditional programming language (typically C and C++). This language is turned into a high-level hardware description language using synthesis directives encoded using pragmas and specific coding techniques aimed at exposing the potential for parallelization and resource utilization reduction. This approach has been demonstrated to substantially reduce the development time of Round 2 and Round 3 CAESAR candidates. At the same time, it provided an almost identical ranking of candidates in terms of throughput and throughput to area ratio. 
However, taking into account significant differences between the complexity and underlying operations of secret-key authenticated ciphers and public-key PQC schemes, the use of HLS for benchmarking of PQC candidates remains controversial. The common perception is that obtained results are significantly worse in terms of both speed and resource utilization compared to manual HDL coding. However, our preliminary research indicates that, with a proper approach, the penalty in terms of the execution time in clock cycles can be made negligibly small. Only the penalty in terms of resource utilization and clock frequency remains. The former overhead affects only the secondary metric in high-speed designs; the latter can be kept in a similar range for multiple candidates. As a result, the use of high-level synthesis when applied to high-speed designs should remain an active area of research, and should not be dismissed upfront before more case studies are performed.

**Choice of FPGA family.** One of the major concerns is the NIST recommendation to focus on hardware benchmarking using the Xilinx Artix-7 FPGA family. This recommendation appeared in several NIST presentations related to Round 2 of the NIST standardization process, e.g., during PQCrypto 2019 in May 2019 and the Second PQC Standardization Conference in August 2019. We believe that, in its current form, this recommendation is counterproductive, and it impedes rather than supports fair and comprehensive hardware and software/hardware benchmarking. Let us start by explaining what an FPGA family is and what influence it has on the evaluation process. An FPGA family is a set of FPGA devices sharing the same internal structure and the same process technology (also known as technology node or process node), described by a number related to the size and density of transistors that can be fabricated using a given manufacturing process.
With the steady improvements in process technology, described by Moore's Law, the maximum capacity and speed of FPGA devices have been steadily increasing while their prices have remained approximately the same. Every new generation of FPGA devices of a particular vendor receives a unique name, referred to as a family name. Every family consists of multiple devices with various distinct sizes to match the needs of different applications. All devices of a particular family share the same internal architecture and process technology but differ in terms of the number of resources of a particular type, such as Look-Up Tables (LUTs), flip-flops (FFs), block memories, and digital signal processing units (DSP units) or multipliers. Most vendors release both low-cost families (such as Xilinx Artix-7) and high-performance families (such as Xilinx Virtex-7). Most of them also release mid-range families, such as Xilinx Kintex-7. The maximum amount of resources available in the largest device of a low-cost family is naturally significantly smaller than the equivalent amount in the largest device of a high-performance family (e.g., over 5 times smaller for Artix-7 vs. Virtex-7). Additionally, in recent years, FPGA vendors started releasing new types of programmable devices that enhance Programmable Logic of traditional FPGAs with the Processing System based on a hardwired embedded processor, such as ARM. Since this processor is custom designed, it takes full advantage of a given technological process and operates at a clock frequency significantly higher than Programmable Logic. With a fast processor and an efficient interface between this processor and Programmable Logic, these devices are ideal for software/hardware co-designs targeting high-speed. Although these types of devices appear under multiple commercial names, they are often collectively referred to as System on Chip FPGAs (SoC FPGAs). 
The first family of this type was Xilinx Zynq-7000, released in 2011, based on ARM Cortex-A9 embedded processors.

Hardware designs are described in hardware description languages. HDL code is typically identical for all FPGA families. As opposed to software, where each processor may require different optimized assembly language code, no such concept exists for hardware. As a result, it is straightforward to synthesize the same HDL code targeting various FPGA families from various vendors, as long as the maximum capacity of the largest device of a given family is not exceeded.

Giving preference to the Xilinx Artix-7 family has several undesired consequences summarized below:

1. Artix-7 is a low-cost FPGA family. As such, it is not very suitable for high-speed implementations. Hardware resources of even the largest device of this family are often insufficient to demonstrate the full potential for parallelizing operations of a given PQC algorithm. Thus, the use of Artix-7 makes perfect sense for benchmarking lightweight implementations but may lead to suboptimal results for high-speed implementations.
2. Artix-7 is a traditional FPGA, and not an SoC FPGA. As a result, the only way to develop a single-chip software/hardware implementation using Artix-7 is the use of so-called "soft" processor cores, i.e., processors implemented using programmable logic. Soft processors compatible with Artix-7 include MicroBlaze and lightweight versions of RISC-V. All of them operate at a much lower clock frequency than the hardwired embedded processors of SoC FPGAs.
3. Artix-7 is unsuitable for HLS designs. Such designs typically take significantly more resources than designs based on writing code manually in HDL. As a result, assuming the Pareto optimization for high-speed, they are unlikely to fit in the largest Artix-7 FPGA.
4. Artix-7 is a relatively old FPGA family, released by Xilinx in 2010. By the time of the release of the PQC standard, this family will be at least 12 years old. While still relatively popular for low-cost applications, this family does not represent the state-of-the-art in FPGA technology.
5. It is not customary to base the ranking of candidates in cryptographic contests on results obtained for a single family of a single vendor. Although Xilinx is the largest developer of FPGAs and SoC FPGAs, Intel comes a strong second, and other vendors, such as Microchip and Lattice Semiconductor, also develop FPGAs suitable for implementing cryptographic algorithms. During the SHA-3 competition, the results were reported for seven FPGA families from two major vendors, Xilinx and Altera. During the CAESAR contest, four Xilinx families and four Altera families were employed. For all of these families, results were generated based on the same HDL code. There was no need to purchase multiple tools or boards. Free or trial versions of tools were sufficient. The designs ended with the generation of post-place-and-route reports, which correctly described the worst-case performance of any particular instance of the given FPGA device.
6. Based on the authors' experiences, multiple reviewers of papers devoted to implementations of Round 2 PQC candidates treated NIST's choice of Artix-7 as an absolute requirement. Submissions not complying with this requirement were subject to rejection or requests for major revisions. As a result, a noble goal of making the results more comparable with one another was turned into a reason for suppressing or delaying the publication of relevant results.

Taking these concerns into account, our recommendation for Round 3 is to encourage reporting results for at least the following FPGA families:

1. For lightweight hardware implementations and lightweight software/hardware implementations based on soft processor cores: Xilinx Artix-7 (for compatibility with Round 2 results) and Intel Cyclone 10 LP.
2. For lightweight software/hardware implementations based on the use of hard processor cores: Xilinx Zynq 7000-series and Intel Cyclone V SoC FPGAs.
3. For high-speed hardware and high-speed software/hardware implementations: Xilinx Zynq UltraScale+ and Intel Stratix 10 SoC.

One of the reasons for selecting Xilinx Zynq UltraScale+, even for pure hardware implementations that do not require SoC capabilities, is the support for these devices by the free version of the Xilinx toolset, called Vivado HL WebPACK, which is sufficient to generate all required benchmarking results. Xilinx Virtex UltraScale+ FPGAs, which could be considered a natural candidate, are not supported by the same free version of the tools. The Xilinx Zynq UltraScale+ family is also recommended for high-speed software/hardware implementations based on the use of hard processor cores because of the moderate cost of suitable prototyping boards and the availability of a free Benchmarking Setup for Software/Hardware Implementations of PQC Schemes, developed at George Mason University [21].
### 2 Previous Work

**Table 1:** Reported Hardware Implementations

| Algorithms | High-Speed | Lightweight |
|---|---|---|
| **Lattice-based: Encryption/Key Exchange** | | |
| CRYSTALS-KYBER | [77], [14]<sup>H</sup>, CERG | [12], [13]\*, [1], [27] |
| FrodoKEM | [38], [14]<sup>H</sup>, [19] | [12], [13]\* |
| LAC | [77], CERG | - |
| NewHope | [14]<sup>H</sup>, [28], [78], [77], [40], CERG | [12], [13]\*, [1], [27] |
| NTRU | [14]<sup>H</sup>, CERG | - |
| NTRU Prime | CERG | -<br>[6] |
| Round5 | [19], [4], CERG | [3] |
| SABER | [14]<sup>H</sup>, [19], [53], [61] | [27] |
| Three Bears | - | - |
| **Isogeny-based: Encryption/Key Exchange** | | |
| SIKE | [48], [52], [20] | [52] |
| **Code-based: Encryption/Key Exchange** | | |
| BIKE | [5], [59] | - |
| Classic McEliece | [72], [14]<sup>H</sup> | - |
| HQC | - | - |
| LEDAcrypt | [14]<sup>H</sup> | [39] |
| NTS-KEM | - | - |
| ROLLO | - | - |
| RQC | - | - |
| **Lattice-based: Digital Signature** | | |
| CRYSTALS-DILITHIUM | [14]<sup>H</sup> | [12], [13]\* |
| FALCON | - | - |
| qTESLA | [14]<sup>H</sup> | [12], [13]\*, [73] |
| **Symmetric-based: Digital Signature** | | |
| Picnic | [41] | - |
| SPHINCS+ | [14]<sup>H</sup> | - |
| **Multivariate: Digital Signature** | | |
| GeMSS | - | - |
| LUOV | - | - |
| MQDSS | [14]<sup>H</sup> | - |

<sup>H</sup> design developed using the High-Level Synthesis (HLS) approach
\* extended version of [12]

Hardware and software/hardware implementations of Round 2 PQC candidates reported to date are summarized in Table 1. The PQC candidates are grouped by family and type of scheme. All Encryption and Key Exchange schemes are listed first, followed by Digital Signature schemes. The Encryption and Key Exchange schemes have candidates from three major families: lattice-based, isogeny-based, and code-based.
The Digital Signature schemes have candidates representing lattice-based, symmetric-based, and multivariate families. All implementations are classified as either High-Speed or Lightweight. However, the dividing line is not always very clear, and, in multiple cases, the authors have not used these terms explicitly themselves. HLS-based implementations are distinguished with the superscript <sup>H</sup>.

Eight out of 26 candidates (31%) do not have any high-speed implementation to date; 17 out of 26 (65%) do not have any lightweight implementation. The coverage of the code-based family is the weakest, with only 3 out of 7 candidates (BIKE, Classic McEliece, and LEDAcrypt) implemented targeting high-speed, and only 1 out of 7 (LEDAcrypt) realized using a lightweight approach. Similarly, the multivariate family remains mostly unexplored. Only two out of four candidates have their implementations reported, including one using only the HLS-based methodology. The symmetric-based digital signatures have no lightweight implementations, and even among high-speed implementations, only one is an RTL-based implementation, with the HDL code written manually.

The coverage of the lattice-based and isogeny-based encryption/key exchange schemes is the most complete. Eight out of nine lattice-based KEMs have high-speed implementations reported. The only exception is Three Bears. Five out of these eight have, on top of that, at least one lightweight implementation. In terms of the number of various implementations, NewHope leads the way with 10 related publications, followed by CRYSTALS-KYBER, with 7, and FrodoKEM and SABER with 5. The only isogeny-based scheme, SIKE, has been thoroughly explored in hardware as well, especially taking into account the earlier implementations of the underlying key agreement scheme SIDH [46], [9], [47], [45], [49].

The coverage of lattice-based signatures is not as good as that of lattice-based KEMs.
In particular, FALCON appears to be very difficult to implement using either the high-speed or the lightweight approach. Additionally, even in the case of CRYSTALS-DILITHIUM, somewhat surprisingly, its only high-speed implementation to date is an HLS-based design. In Tables 2–8, we summarize major results for hardware and software/hardware implementations of KEMs. Most of the schemes are KEMs with indistinguishability under chosen ciphertext attack (IND-CCA). Some are PKEs with indistinguishability under chosen plaintext attack (IND-CPA). If an IND-CPA-secure PKE is reported, this fact is marked with the superscript <sup>cpa</sup>. All of these tables have the same fields. The first two columns contain a reference to the publication and the name of the algorithm variant, respectively. The superscript <sup>Z</sup> next to the publication reference indicates an implementation using a Zynq-7000 SoC FPGA. The implementations targeting Artix-7 and Zynq-7000 are grouped together because the programmable logic of both families is realized using the same technological process and composed of the same basic building blocks. In the third column, the type of implementation is indicated, with HW standing for hardware and SW/HW standing for software/hardware. Among the software/hardware implementations, we specify the embedded processor used with the following notation: <sup>RV</sup> represents a RISC-V processor with the RV32IM ISA, i.e., RISC-V with the base 32-bit integer ISA and the standard Integer Multiplication and Division extension; <sup>c</sup> represents a custom processor described in [52], with the instruction set specified in the appendix of that paper; and <sup>A9</sup> represents the hard processor of the Zynq-7000 SoC FPGA family, namely ARM Cortex-A9. Unlike the first two options, this processor operates at a frequency significantly higher than the maximum clock frequency of the programmable logic.
At the same time, the transfer of control and data between the processor and the hardware accelerator contributes a non-negligible transfer overhead to all reported execution times. The next column, Max. Freq. corresponds to maximum clock frequency in MHz. The next five columns are used to report FPGA resource utilization, described as a vector (LUT, FF, Slice, DSP, BRAM), where the subsequent fields represent the number of look-up tables, flip-flops, slices, DSP units, and 36 kbit Block RAMs. For the last of these values, BRAM, 0.5 represents the use of an 18-kbit block RAM. The remaining 6 columns are used to report the execution time of Key Generation, Encapsulation, and Decapsulation, expressed in cycles and $\mu$s, respectively. The value in $\mu$s can be obtained by dividing the corresponding number of clock cycles by the maximum clock frequency in MHz.

**Table 2:** Level 1 KEMs and PKEs on Artix-7 (default) and Zynq-7000 (indicated with the superscript <sup>Z</sup>)

| Design | Algorithm | Type | Target | Max. Freq. [MHz] | LUT | FF | Slice | DSP | BRAM | Key Gen. [cycles] | Key Gen. [$\mu$s] | Encaps./Enc. [cycles] | Encaps./Enc. [$\mu$s] | Decaps./Dec. [cycles] | Decaps./Dec. [$\mu$s] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Security Level 1** | | | | | | | | | | | | | | | |
| [78] | NewHope-512<sup>cpa</sup> | HW | HS | 200 | 6,780 | 4,026 | - | 2 | 7.0 | 4,200 | 21.0 | 6,600 | 33.0 | 9,100 | 45.5 |
| [72] | mceliece348864<sup>cpa</sup> | HW | HS | 106 | 81,339 | 132,190 | - | 0 | 236.0 | 202,787 | 1,920.3 | 2,720 | 25.8 | 12,743 | 120.7 |
| [72] | mceliece348864<sup>cpa</sup> | HW | LW | 108 | 25,327 | 49,383 | - | 0 | 168.0 | 1,599,882 | 14,800.0 | 2,720 | 25.2 | 18,358 | 169.8 |
| [27] | Kyber-512 | SW/HW | LW | - | 23,925 | 10,844 | - | 21 | 32.0 | 150,106 | - | 193,076 | - | 204,843 | - |
| [38] | FrodoKEM-640 16x | HW | HS | 172<br>171<br>149 | 2,587<br>5,796<br>6,881 | 2,994<br>4,694<br>5,081 | 855<br>1,692<br>1,947 | 16<br>16<br>16 | 0<br>0<br>12.5 | 204,766 | 1,190.5 | 207,269 | 1,212.1 | 209,867 | 1,408.5 |
| [13] | Kyber-512 | SW/HW<sup>RV</sup> | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 74,519 | 2,980.8 | 131,698 | 5,267.9 | 142,309 | 5,692.4 |
| [27] | NewHope-512 | SW/HW | LW | - | 23,925 | 10,844 | - | 21 | 32.0 | 123,860 | - | 207,299 | - | 226,742 | - |
| [13] | NewHope-512 | SW/HW<sup>RV</sup> | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 97,969 | 3,918.8 | 236,812 | 9,472.5 | 258,872 | 10,354.9 |
| [27] | LightSaber | SW/HW | LW | - | 23,925 | 10,844 | - | 21 | 32.0 | 366,837 | - | 526,496 | - | 657,583 | - |
| [1] | Kyber-512 | SW/HW<sup>RV</sup> | LW | 59 | 1,842 | 1,634 | - | | 34.0 | 710,000 | 11,993.2 | 971,000 | 16,402.0 | 870,000 | 14,695.9 |
| [1] | NewHope-512 | SW/HW<sup>RV</sup> | LW | 59 | 1,842 | 1,634 | - | | 34.0 | 904,000 | 15,270.3 | 1,424,000 | 24,054.1 | 1,302,000 | 21,993.2 |
| [52] | SIKEp434 | SW/HW<sup>c</sup> | HS | 162 | 22,595 | 11,558 | 7,491 | 162 | 37.0 | 1,474,200 | 9,100.0 | 2,494,800 | 15,400.0 | 2,656,800 | 16,400.0 |
| [52] | SIKEp503 | SW/HW<sup>c</sup> | HS | 162 | 22,595 | 11,558 | 7,491 | 162 | 37.0 | 1,733,400 | 10,700.0 | 2,932,200 | 18,100.0 | 3,126,600 | 19,300.0 |
| [38] | FrodoKEM-640 1x | HW | LW | 191<br>190<br>162 | 971<br>4,246<br>4,446 | 433<br>2,131<br>2,152 | 290<br>1,180<br>1,254 | 1<br>1<br>1 | 0<br>0<br>12.5 | 3,237,288 | 16,949.2 | 3,275,862 | 17,241.4 | 3,306,122 | 20,408.2 |
| [52] | SIKEp434 | SW/HW<sup>c</sup> | LW | 143 | 10,976 | 7,115 | 3,512 | 22 | 21.0 | 2,187,902 | 15,300.0 | 3,718,004 | 26,000.0 | 3,946,804 | 27,600.0 |
| [52] | SIKEp503 | SW/HW<sup>c</sup> | LW | 143 | 10,976 | 7,115 | 3,512 | 22 | 21.0 | 2,602,603 | 18,200.0 | 4,390,104 | 30,700.0 | 4,676,105 | 32,700.0 |
| [13] | FrodoKEM-640 | SW/HW<sup>RV</sup> | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 11,453,942 | 458,157.7 | 11,609,668 | 464,386.7 | 12,035,513 | 481,420.5 |
| [5] | BIKE Level 1 | HW | HS | 135 | 1,865 | 589 | 290 | 0 | 4.0 | 7,370,429 | 54,540.0 | - | - | - | - |

<sup>Z</sup> Design implemented on Zynq-7000; <sup>cpa</sup> Design of a PKE variant resistant against Chosen-Plaintext Attack (CPA); <sup>RV</sup> co-design using RISC-V RV32IM; <sup>A9</sup> co-design using ARM Cortex-A9; \* Preliminary result

**Table 3:** Level 3 & 5 KEMs and PKEs on Artix-7 (default) and Zynq-7000 (indicated with the superscript <sup>Z</sup>)

| Design | Algorithm | Type | Target | Max. Freq. [MHz] | LUT | FF | Slice | DSP | BRAM | Key Gen. [cycles] | Key Gen. [$\mu$s] | Encaps./Enc. [cycles] | Encaps./Enc. [$\mu$s] | Decaps./Dec. [cycles] | Decaps./Dec. [$\mu$s] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Security Level 3** | | | | | | | | | | | | | | | |
| [72] | mceliece460896<sup>cpa</sup> | HW | LW | 107 | 38,669 | 74,858 | - | 0 | 303.0 | 5,002,044 | 46,704.4 | 3,360 | 31.4 | 31,005 | 289.5 |
| [38] | FrodoKEM-976 16x | HW | HS | 169<br>168<br>157 | 2,869<br>6,188<br>7,213 | 3,000<br>4,678<br>5,087 | 806<br>1,782<br>2,042 | 16<br>16<br>16 | 0<br>0<br>19.0 | 476,056 | 2,816.9 | 479,993 | 2,857.1 | 483,073 | 3,076.9 |
| [53]<sup>Z</sup> | Saber | SW/HW<sup>A9</sup> | HS | 125 | 7,400 | 7,331 | - | 28 | 2.0 | - | 3,273.0 | - | 4,147.0 | - | 3,844.0 |
| [13] | Kyber-768 | SW/HW<sup>RV</sup> | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 111,525 | 4,461.0 | 177,540 | 7,101.6 | 190,579 | 7,623.2 |
| [52] | SIKEp610 | SW/HW<sup>c</sup> | HS | 162 | 22,595 | 11,558 | 7,491 | 162 | 37.0 | 2,916,000 | 18,000.0 | 5,443,200 | 33,600.0 | 5,508,000 | 34,000.0 |
| [38] | FrodoKEM-976 1x | HW | LW | 189<br>187<br>162 | 1,243<br>4,650<br>4,888 | 441<br>2,118<br>2,153 | 362<br>1,272<br>1,390 | 1<br>1<br>1 | 0<br>0<br>19.0 | 7,560,000 | 40,000.0 | 7,480,000 | 40,000.0 | 7,714,286 | 47,619.0 |
| [52] | SIKEp610 | SW/HW<sup>c</sup> | LW | 143 | 10,976 | 7,115 | 3,512 | 22 | 21.0 | 4,347,204 | 30,400.0 | 8,108,108 | 56,700.0 | 8,208,208 | 57,400.0 |
| [13] | FrodoKEM-976 | SW/HW<sup>RV</sup> | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 26,005,326 | 1,040,213.0 | 29,749,417 | 1,189,976.7 | 30,421,175 | 1,216,847.0 |
| [5] | BIKE Level 3 | HW | HS | 135 | 1,884 | 557 | 593 | 0 | | 30,447,947 | 231,400.0 | - | - | - | - |
| **Security Level 5** | | | | | | | | | | | | | | | |
| [78] | NewHope-1024<sup>cpa</sup> | HW | HS | 200 | 6,781 | 4,127 | - | 2 | 8.0 | 8,000 | 40.0 | 12,500 | 62.5 | 17,300 | 86.5 |
| [40] | NewHope-1024<sup>cpa</sup> | HW | HS | 190 | 13,244 | 8,272 | - | 24 | 18.0 | - | - | 34,000 | 178.0 | 30,600 | 160.0 |
| [28] | NewHope-1024<sup>cpa</sup> | SW/HW | HS | 25 | 26,606 | 26,303 | - | 32 | 1.0 | 357,052 | 14,282.1 | 589,285 | 23,571.4 | 756,932 | 30,277.3 |
| [13] | Kyber-1024 | SW/HW<sup>RV</sup> | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 148,547 | 5,941.9 | 223,469 | 8,938.8 | 240,977 | 9,639.1 |
| [13] | NewHope-1024 | SW/HW<sup>RV</sup> | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 97,969 | 3,918.8 | 236,812 | 9,472.5 | 258,872 | 10,354.9 |
| [27] | Kyber-1024 | SW/HW | LW | - | 23,925 | 10,844 | - | 21 | 32.0 | 349,673 | - | 405,477 | - | 424,682 | - |
| [27] | NewHope-1024 | SW/HW | LW | - | 23,925 | 10,844 | - | 21 | 32.0 | 235,420 | - | 392,734 | - | 450,541 | - |
| [27] | FireSaber | SW/HW | LW | - | 23,925 | 10,844 | - | 21 | 32.0 | 1,300,272 | - | 1,622,818 | - | 1,898,051 | - |
| [1] | Kyber-1024 | SW/HW<sup>RV</sup> | LW | 59 | 1,842 | 1,634 | - | | 34.0 | 2,203,000 | 37,212.8 | 2,619,000 | 44,239.9 | 2,429,000 | 41,030.4 |
| [1] | NewHope-1024 | SW/HW<sup>RV</sup> | LW | 59 | 1,842 | 1,634 | - | | 34.0 | 1,776,000 | 30,000.0 | 2,742,000 | 46,317.6 | 2,528,000 | 42,702.7 |
| [52] | SIKEp751 | SW/HW<sup>c</sup> | HS | 162 | 22,595 | 11,558 | 7,491 | 162 | 37.0 | 3,742,200 | 23,100.0 | 6,188,400 | 38,200.0 | 6,658,200 | 41,100.0 |
| [52] | SIKEp751 | SW/HW<sup>c</sup> | LW | 143 | 10,976 | 7,115 | 3,512 | 22 | 21.0 | 7,965,108 | 55,700.0 | 13,156,013 | 92,000.0 | 14,185,614 | 99,200.0 |
| [13] | FrodoKEM-1344 | SW/HW<sup>RV</sup> | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 67,994,170 | 2,719,766.8 | 71,501,358 | 2,860,054.3 | 72,526,695 | 2,901,067.8 |

<sup>Z</sup> Design implemented on Zynq-7000; <sup>cpa</sup> Design of a PKE variant resistant against Chosen-Plaintext Attack (CPA); <sup>RV</sup> co-design using RISC-V RV32IM; <sup>c</sup> co-design using a custom processor; <sup>A9</sup> co-design using ARM Cortex-A9; \* Preliminary result. [40] only reports latency of Encapsulation and total latency of Key Generation and Decapsulation.

In Tables 2 and 3, we summarize implementations targeting Xilinx Artix-7 FPGAs and related Xilinx Zynq-7000 SoC FPGAs. Algorithm variants belonging to the security levels 1 and 2 are grouped together. So are variants belonging to security levels 4 and 5.
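Each time in $\mu$s in Tables 2 and 3 is the cycle count divided by the maximum clock frequency in MHz. The short Python sketch below re-derives a few table entries this way. It is only an illustration: the function name `cycles_to_us` and the row labels are hypothetical, and the sample cycle counts and frequencies are those listed in Table 2 for the NewHope-512 (IND-CPA) hardware design and the RISC-V-based Kyber-512 design of [13].

```python
def cycles_to_us(cycles: int, f_max_mhz: float) -> float:
    """Execution time in microseconds: clock cycles divided by f_max in MHz."""
    if f_max_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / f_max_mhz


# (label, max. frequency in MHz, (keygen, encaps, decaps) cycle counts),
# copied from Table 2; the labels themselves are illustrative.
rows = [
    ("NewHope-512 (CPA), HW", 200, (4_200, 6_600, 9_100)),
    ("Kyber-512, SW/HW RISC-V [13]", 25, (74_519, 131_698, 142_309)),
]

for label, f_max_mhz, cycle_counts in rows:
    # Round to one decimal place, as in the tables.
    times_us = [round(cycles_to_us(c, f_max_mhz), 1) for c in cycle_counts]
    print(f"{label}: {times_us} us")
```

The same relation can be read in reverse: for designs that report only a time in $\mu$s, multiplying by the clock frequency in MHz gives an approximate cycle count.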
In the first two security categories, six candidates (Classic McEliece, CRYSTALS-Kyber, FrodoKEM, NewHope, SIKE, and Saber) have implementations of all three operations reported. The preliminary implementation of BIKE focuses on key generation only. For most KEMs, the time of decapsulation is slightly longer than the time of encapsulation. The exceptions include the software/hardware implementations of NewHope-512 and Kyber-512 in [1]. Table entries are ordered according to the time of decapsulation in $\mu$s (and, if needed, according to the decapsulation time in clock cycles). The ranking of algorithms is hard to determine for two reasons. First, only three submissions (Classic McEliece, FrodoKEM, and NewHope) have hardware implementations supporting both encapsulation and decapsulation, and, out of them, Classic McEliece and NewHope are represented by their IND-CPA PKEs. Second, software/hardware implementations based on different processors are hard to compare with one another. Consequently, the comparison among the results obtained for software/hardware implementations based on RISC-V is the most insightful. These results suggest the following ranking: 1. Kyber-512, 2. NewHope-512, and 3. LightSaber. The difference in the decapsulation time between positions 1 and 2 is about 10%. The difference in the decapsulation time between positions 2 and 3 is a factor exceeding 3. Even with the use of a custom processor, the software/hardware implementations of SIKE are at least an order of magnitude slower than LightSaber. They also use a comparable number of LUTs and FFs, and a larger number of DSP units and BRAMs. For the security categories 4 and 5, very similar conclusions can be drawn. The ranking of software/hardware implementations that can be relatively fairly compared with each other in terms of the decapsulation and encapsulation times is 1. Kyber-1024, 2. NewHope-1024, 3. FireSaber, and 4. SIKEp751.
The very small difference between positions 1 and 2 makes the relation between Kyber-1024 and NewHope-1024 a virtual tie. The implementations of variants belonging to the security level 3 are very difficult to compare with one another fairly.

In Tables 4 and 5, we summarize implementations targeting Xilinx Virtex-7 FPGAs. A significant difference compared to the results for Artix-7 is the inclusion of results for hardware implementations developed using the HLS approach. At the same time, several PQC submissions with results reported only for Artix-7 are missing here. The outcome is the following ranking for the security levels 1 and 2: 1. LEDAkem-128, 2. NTRU-HRSS, 3. Kyber-512, 4. FrodoKEM-640, 5. SIKEp503, and 6. NewHope-512. This ranking is, however, hard to trust. LEDAkem-128 and NTRU-HRSS are implemented using their old parameter sets. Additionally, the implementations of 5 out of 6 candidates in this ranking are obtained using the experimental HLS-based approach, not verified by comparing any of the obtained results with results achieved using the traditional RTL-based methodology. For the security levels 3, 4, and 5, the results are very hard to compare. The implementations differ in terms of the type of KEM (IND-CCA vs. IND-CPA), implementation approach (HW-RTL, HW-HLS, SW/HW), and target (HS vs. LW).

**Table 4:** Level 1 KEMs on Virtex-7 (default) and Virtex-6 (indicated with the superscript <sup>V6</sup>)

| Design | Algorithm | Type | Target | Max. Freq. [MHz] | LUT | FF | Slice | DSP | BRAM | Key Generation [cycles] | Key Generation [$\mu$s] | Encaps./Enc.<sup>cpa</sup> [cycles] | Encaps./Enc.<sup>cpa</sup> [$\mu$s] | Decaps./Dec.<sup>cpa</sup> [cycles] | Decaps./Dec.<sup>cpa</sup> [$\mu$s] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Security Level 1** | | | | | | | | | | | | | | | |
| [14] | LEDAkem-128<sup>o</sup> | HW-HLS | HS | 100 | 102,496<br>406,135 | 13,157<br>164,230 | - | 0 | 0.0 | - | - | 11,075 | 110.8 | 18,079 | 180.8 |
| [14] | NTRU-HRSS<sup>o</sup> | HW-HLS | HS | 67 | 75,141<br>97,791 | 12,225<br>11,514 | - | 0 | 0.0 | - | - | 100,208 | 1,503.1 | 21,996 | 329.9 |
| [14] | Kyber-512 | HW-HLS | HS | 67 | 1,307,815<br>1,977,896 | 11,699<br>194,126 | - | 0 | 0.0 | - | - | 31,669 | 475.0 | 43,018 | 645.3 |
| [14] | FrodoKEM-640 | HW-HLS | HS | 100 | 179,290<br>128,031 | 105,875<br>97,355 | - | 0 | 0.0 | - | - | 335,891 | 3,358.9 | 117,736 | 1,177.4 |
| [48] | SIKEp503 | HW | HS | 171 | 25,094 | 26,971 | 9,514 | 264 | 34.0 | 640,000 | 3,738.3 | 1,120,000 | 6,542.1 | 1,210,000 | 7,067.8 |
| [14] | NewHope-512 | HW-HLS | HS | 67 | 136,457<br>164,937 | 25,639<br>28,999 | - | 0 | 0.0 | - | - | 307,847 | 4,617.7 | 721,986 | 10,829.8 |
| [52] | SIKEp434 | SW/HW | HS | 142 | 21,210 | 13,657 | 7,408 | 162 | 38.0 | 981,180 | 6,900.0 | 1,677,960 | 11,800.0 | 1,777,500 | 12,500.0 |
| [52] | SIKEp503 | SW/HW | HS | 142 | 21,210 | 13,657 | 7,408 | 162 | 38.0 | 1,166,040 | 8,200.0 | 1,976,580 | 13,900.0 | 2,104,560 | 14,800.0 |
| [39]<sup>V6</sup> | LEDAkem-128<sup>o,cpa</sup> | HW | LW | 235<br>140 | 104<br>2,222 | 53<br>658 | 33<br>870 | 0<br>0 | 1.0<br>13.0 | - | - | 712,000 | 3,029.8 | 2,620,000 | 18,714.3 |
| [52] | SIKEp434 | SW/HW | LW | 152 | 10,937 | 7,132 | 3,415 | 57 | 21.0 | 2,191,781 | 14,400.0 | 3,713,851 | 24,400.0 | 3,957,382 | 26,000.0 |
| [52] | SIKEp503 | SW/HW | LW | 152 | 10,937 | 7,132 | 3,415 | 57 | 21.0 | 2,602,740 | 17,100.0 | 4,383,562 | 28,800.0 | 4,672,755 | 30,700.0 |

<sup>cpa</sup> Design of a KEM variant resistant against Chosen-Plaintext Attack (CPA); <sup>V6</sup> Design implemented on Virtex-6; <sup>o</sup> Design for an old parameter set

**Table 5:** Level 3 & 5 KEMs and PKEs on Virtex-7

| Design | Algorithm | Type | Target | Max. Freq. [MHz] | LUT | FF | Slice | DSP | BRAM | Key Generation [cycles] | Key Generation [$\mu$s] | Encaps./Enc. [cycles] | Encaps./Enc. [$\mu$s] | Decaps./(Dec.+Enc.) [cycles] | Decaps./(Dec.+Enc.) [$\mu$s] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Security Level 3** | | | | | | | | | | | | | | | |
| [72] | mceliece460896<sup>cpa</sup> | HW | HS | 131 | 109,484 | 168,939 | - | 0 | 446.0 | 515,806 | 3,943.5 | 3,360 | 25.7 | 17,931 | 137.1 |
| [14] | Saber | HW-HLS | HS | 67 | 234,171<br>2,350,000 | 40,824<br>231,549 | - | 0 | 0.0 | - | - | 367,099 | 5,506.5 | 365,015 | 5,475.2 |
| [52] | SIKEp610 | SW/HW | HS | 142 | 21,210 | 13,657 | 7,408 | 162 | 38.0 | 1,962,360 | 13,800.0 | 3,654,540 | 25,700.0 | 3,711,420 | 26,100.0 |
| [52] | SIKEp610 | SW/HW | LW | 152 | 10,937 | 7,132 | 3,415 | 57 | 21.0 | 4,353,120 | 28,600.0 | 8,097,412 | 53,200.0 | 8,219,178 | 54,000.0 |
| **Security Level 5** | | | | | | | | | | | | | | | |
| [72] | mceliece6960119<sup>cpa</sup> | HW | HS | 130 | 116,928 | 188,324 | - | 0 | 607.0 | 974,306 | 7,500.4 | 5,413 | 41.7 | 25,135 | 193.5 |
| [72] | mceliece6688128<sup>cpa</sup> | HW | HS | 137 | 122,624 | 186,194 | - | 0 | 589.0 | 1,046,139 | 7,658.4 | 5,024 | 36.8 | 29,754 | 217.8 |
| [72] | mceliece8192128<sup>cpa</sup> | HW | HS | 130 | 123,361 | 190,707 | - | 0 | 589.0 | 1,286,179 | 9,901.3 | 6,528 | 50.3 | 32,765 | 252.2 |
| [72] | mceliece6960119<sup>cpa</sup> | HW | LW | 141 | 44,154 | 88,963 | - | 0 | 563.0 | 11,179,636 | 79,570.4 | 5,413 | 38.5 | 46,141 | 328.4 |
| [72] | mceliece6688128<sup>cpa</sup> | HW | LW | 136 | 44,345 | 83,637 | - | 0 | 446.0 | 12,389,742 | 91,034.1 | 5,024 | 36.9 | 52,333 | 384.5 |
| [72] | mceliece8192128<sup>cpa</sup> | HW | LW | 134 | 45,150 | 88,154 | - | 0 | 525.0 | 15,185,314 | 113,154.4 | 6,528 | 48.6 | 55,330 | 412.3 |
| [48] | SIKEp751 | HW | HS | 167 | 45,893 | 50,390 | 17,530 | 512 | 43.5 | 1,240,000 | 7,407.4 | 2,170,000 | 12,963.0 | 2,330,000 | 13,918.8 |
| [52] | SIKEp751 | SW/HW | HS | 142 | 21,210 | 13,657 | 7,408 | 162 | 38.0 | 2,516,940 | 17,700.0 | 4,166,460 | 29,300.0 | 4,479,300 | 31,500.0 |
| [14] | mceliece6960119 | HW-HLS | HS | 100 | 840,430<br>870,908 | 60,270<br>79,962 | - | 0 | 0.0 | - | - | 3,787,729 | 37,877.3 | 10,659,024 | 106,590.2 |
| [52] | SIKEp751 | SW/HW | LW | 152 | 10,937 | 7,132 | 3,415 | 57 | 21.0 | 7,960,426 | 52,300.0 | 13,150,685 | 86,400.0 | 14,185,693 | 93,200.0 |

<sup>cpa</sup> Design of a PKE variant resistant against Chosen-Plaintext Attack (CPA)

In Tables 6 and 7, we summarize the results for ASICs. ASIC performance studies have been reported in [77], [12], and [27]. All three studies were performed using different ASIC processes and standard-cell libraries. Therefore, the obtained results cannot be compared across any two, not to mention three, publications from this list. In all three cases, the approach was the development of a lattice-based co-processor supporting at least three different IND-CCA secure KEMs. The design presented in [77] is a domain-specific vector co-processor, leveraging the extensible RISC-V architecture. This co-processor has been integrated with an open-source RISC-V microprocessor, supporting the RV32IMC ISA. The RISC-V core has been modified to recognize the custom instructions and forward them to the vector co-processor. Similarly, in [12], the domain-specific Sapphire crypto-processor is coupled with an efficient RISC-V microprocessor, supporting the RV32IM ISA.
In [27], the authors proposed RISQ-V, an enhanced RISC-V architecture that embeds a set of powerful, tightly coupled accelerators to speed up lattice-based PQC. These accelerators are deeply integrated into the RISC-V pipeline. RISC-V was also extended with 28 new instructions for performing packed modular arithmetic, the butterfly operation, updates of twiddle factors, updates of and multiplication with scaling factors, bit reversal, hash computations, and binomial sampling.

In all three co-processors, all supported lattice-based KEMs share the same resources. Therefore, the candidates could be compared only in terms of execution time, power consumption, and energy usage. All implementations adopted the SW/HW co-design approach. In [77], the target was high speed. In [12] and [27], the target was minimum power and energy. In Table 6, we report the execution times, and in Table 7, both energy and power consumption at a specific operating frequency, common for all schemes compared within each study.

In the study reported in [77], for the most advanced TSMC 28 nm library, the ranking of candidates in terms of the most time-critical operation, decapsulation, is 1. Kyber, 2. NewHope, 3. LAC. The difference between positions 1 and 2 is about 6% for the security level 1, and 18% for the security level 5. LAC lags behind NewHope by a factor larger than 3. Kyber also uses less energy than NewHope for each major operation. The energy usage for decapsulation is almost identical for Kyber and NewHope. At the security level 1, Kyber uses about 1% less energy, and at the security level 5, about 7% less energy. At the same time, at the security level 1, LAC requires over 3 times more energy.

In the study reported in [12], for the TSMC 40 nm library, the ranking of candidates in terms of the decapsulation time is 1. Kyber, 2. NewHope, 3. Frodo-KEM. Kyber is faster than NewHope by a factor of 1.8 for the security level 1, and by about 7% for the security level 5. 
Frodo-KEM lags behind NewHope by a factor larger than 46 for the security level 1, and 280 for the security level 5. In terms of power usage, the differences among all three candidates are very small, and their ranking is identical to that for the decapsulation time. In terms of energy usage, Kyber and NewHope are very close to each other, and Frodo-KEM lags behind by more than two orders of magnitude.

Finally, in the study reported in [27], for the UMC 65 nm library, the ranking of candidates in terms of the decapsulation time is 1. Kyber, 2. NewHope, 3. Saber. The difference between positions 1 and 2 is about 11% for the security level 1, and 6% for the security level 5. Saber lags behind NewHope by a factor of about 3 for the security level 1, and about 4 for the security level 5. Most likely because of reporting energy jointly for key generation, encapsulation, and decapsulation, NewHope outperforms Kyber in terms of energy usage. However, it uses only about 5% less energy at the security level 1 and about 16% less at the security level 5. At the same time, according to Table 7, Saber requires over 3 times more energy than Kyber at the security level 1, and over 4 times more at the security level 5.

In all three studies, the key generation time seems to be comparable for Kyber and NewHope at the security level 1, and slightly smaller for NewHope at the security level 5.

**Table 6:** All KEMs on ASIC

| Design | Algorithm | Type | Target | Max. Freq. (MHz) | Area (kGE) | Memory (kB) | Key Gen. (cycles) | Key Gen. (μs) | Encaps. (cycles) | Encaps. (μs) | Decaps. (cycles) | Decaps. (μs) | Technology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Security Level 1** | | | | | | | | | | | | | |
| [77] | Kyber-512 | SW/HW | HS | 300 | 979 | 12.00 | 18,556 | 61.9 | 45,886 | 153.0 | 79,989 | 266.6 | TSMC 28 nm |
| [77] | NewHope-512 | SW/HW | HS | 300 | 979 | 12.00 | 18,563 | 61.9 | 44,513 | 148.4 | 84,501 | 281.7 | TSMC 28 nm |
| [77] | LAC-128-v3a | SW/HW | HS | 300 | 979 | 12.00 | 107,511 | 358.4 | 189,550 | 631.8 | 281,953 | 939.8 | TSMC 28 nm |
| [12] | Kyber-512 | SW/HW | LW | 72 | 106 | 40.25 | 74,519 | 1,035.0 | 131,698 | 1,829.1 | 142,309 | 1,976.5 | TSMC 40 nm |
| [12] | NewHope-512 | SW/HW | LW | 72 | 106 | 40.25 | 92,969 | 1,360.7 | 236,812 | 3,289.1 | 258,872 | 3,595.4 | TSMC 40 nm |
| [12] | FrodoKEM-640 | SW/HW | LW | 72 | 106 | 40.25 | 11,453,942 | 159,082.5 | 11,609,668 | 161,245.4 | 12,035,513 | 167,159.9 | TSMC 40 nm |
| [27] | Kyber-512 | SW/HW | LW | 45 | 170 | 465\* | 150,106 | 3,316.5 | 193,076 | 4,265.9 | 204,843 | 4,525.9 | UMC 65 nm |
| [27] | NewHope-512 | SW/HW | LW | 45 | 170 | 465\* | 123,860 | 2,736.6 | 207,299 | 4,580.2 | 226,742 | 5,009.8 | UMC 65 nm |
| [27] | LightSaber | SW/HW | LW | 45 | 170 | 465\* | 366,837 | 8,105.1 | 526,496 | 11,632.7 | 657,583 | 14,529.0 | UMC 65 nm |
| **Security Level 3** | | | | | | | | | | | | | |
| [12] | Kyber-768 | SW/HW | LW | 72 | 106 | 40.25 | 111,525 | 1,549.0 | 177,540 | 2,465.8 | 190,579 | 2,646.9 | TSMC 40 nm |
| [12] | FrodoKEM-976 | SW/HW | LW | 72 | 106 | 40.25 | 26,005,326 | 361,185.1 | 29,749,417 | 413,186.3 | 30,421,175 | 422,516.3 | TSMC 40 nm |
| **Security Level 5** | | | | | | | | | | | | | |
| [77] | Kyber-1024 | SW/HW | HS | 300 | 979 | 12.00 | 39,689 | 132.3 | 81,569 | 271.9 | 136,475 | 454.9 | TSMC 28 nm |
| [77] | NewHope-1024 | SW/HW | HS | 300 | 979 | 12.00 | 36,584 | 121.9 | 85,871 | 286.2 | 161,623 | 538.7 | TSMC 28 nm |
| [12] | Kyber-1024 | SW/HW | LW | 72 | 106 | 40.25 | 148,547 | 2,063.2 | 223,469 | 3,103.7 | 240,977 | 3,346.9 | TSMC 40 nm |
| [12] | NewHope-1024 | SW/HW | LW | 72 | 106 | 40.25 | 92,969 | 1,360.7 | 236,812 | 3,289.1 | 258,872 | 3,595.4 | TSMC 40 nm |
| [12] | FrodoKEM-1344 | SW/HW | LW | 72 | 106 | 40.25 | 67,994,170 | 944,363.5 | 71,501,358 | 993,074.4 | 72,526,695 | 1,007,315.2 | TSMC 40 nm |
| [27] | Kyber-1024 | SW/HW | LW | 45 | 170 | 465\* | 349,673 | 7,725.9 | 405,477 | 8,958.8 | 424,682 | 9,383.2 | UMC 65 nm |
| [27] | NewHope-1024 | SW/HW | LW | 45 | 170 | 465\* | 235,420 | 5,201.5 | 392,734 | 8,677.3 | 450,541 | 9,954.5 | UMC 65 nm |
| [27] | FireSaber | SW/HW | LW | 45 | 170 | 465\* | 1,300,272 | 28,728.9 | 1,622,818 | 35,855.5 | 1,898,051 | 41,936.6 | UMC 65 nm |

All SW/HW co-designs using RISC-V RV32IM

\* Numbers reported in kGE

**Table 7:** Power and Energy Comparison for all KEMs on ASIC

| Design | Algorithm | Type | Target | Freq. (MHz) | Area (kGE) | Memory (kB) | Key Gen. Power (mW) | Key Gen. Energy (μJ) | Encaps. Power (mW) | Encaps. Energy (μJ) | Decaps. Power (mW) | Decaps. Energy (μJ) | Technology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Security Level 1** | | | | | | | | | | | | | |
| [77] | Kyber-512 | SW/HW | HS | 300 | 979 | 12.00 | 29.26 | 1.81 | 23.67 | 3.62 | 24.94 | 6.65 | TSMC 28 nm |
| [77] | NewHope-512 | SW/HW | HS | 300 | 979 | 12.00 | 31.84 | 1.97 | 27.77 | 4.12 | 23.82 | 6.71 | TSMC 28 nm |
| [77] | LAC-128-v3a | SW/HW | HS | 300 | 979 | 12.00 | 25.90 | 9.28 | 24.33 | 15.37 | 23.74 | 22.31 | TSMC 28 nm |
| [12] | Kyber-512 | SW/HW | LW | 72 | 106 | 40.25 | 5.77 | 5.97 | 5.12 | 9.37 | 5.69 | 11.25 | TSMC 40 nm |
| [12] | NewHope-512 | SW/HW | LW | 72 | 106 | 40.25 | 5.30 | 4.37 | 5.30 | 10.02 | 5.80 | 11.46 | TSMC 40 nm |
| [12] | FrodoKEM-640 | SW/HW | LW | 72 | 106 | 40.25 | 6.65 | 1,057.65 | 7.01 | 1,129.95 | 6.88 | 1,150.83 | TSMC 40 nm |
| [27] | NewHope-512 | SW/HW | LW | 10 | 170 | 465\* | – | – | – | – | 2.42 | 135.03 | UMC 65 nm |
| [27] | Kyber-512 | SW/HW | LW | 10 | 170 | 465\* | – | – | – | – | 2.58 | 141.41 | UMC 65 nm |
| [27] | LightSaber | SW/HW | LW | 10 | 170 | 465\* | – | – | – | – | 2.78 | 431.18 | UMC 65 nm |
| **Security Level 3** | | | | | | | | | | | | | |
| [12] | Kyber-768 | SW/HW | LW | 72 | 106 | 40.25 | 5.28 | 8.19 | 5.19 | 12.80 | 5.86 | 15.52 | TSMC 40 nm |
| [12] | FrodoKEM-976 | SW/HW | LW | 72 | 106 | 40.25 | 6.70 | 2,420.97 | 7.05 | 2,912.95 | 6.94 | 2,932.13 | TSMC 40 nm |
| **Security Level 5** | | | | | | | | | | | | | |
| [77] | Kyber-1024 | SW/HW | HS | 300 | 979 | 12.00 | 35.45 | 4.69 | 29.20 | 7.94 | 25.57 | 11.63 | TSMC 28 nm |
| [77] | NewHope-1024 | SW/HW | HS | 300 | 979 | 12.00 | 29.36 | 3.58 | 24.53 | 7.02 | 23.57 | 12.70 | TSMC 28 nm |
| [12] | Kyber-1024 | SW/HW | LW | 72 | 106 | 40.25 | 5.95 | 12.27 | 5.25 | 16.30 | 5.91 | 19.76 | TSMC 40 nm |
| [12] | NewHope-1024 | SW/HW | LW | 72 | 106 | 40.25 | 6.13 | 8.35 | 5.05 | 16.59 | 5.89 | 21.17 | TSMC 40 nm |
| [12] | FrodoKEM-1344 | SW/HW | LW | 72 | 106 | 40.25 | 6.75 | 6,374.45 | 7.10 | 7,050.83 | 7.00 | 7,051.21 | TSMC 40 nm |
| [27] | NewHope-1024 | SW/HW | LW | 10 | 170 | 465\* | – | – | – | – | 2.41 | 259.98 | UMC 65 nm |
| [27] | Kyber-1024 | SW/HW | LW | 10 | 170 | 465\* | – | – | – | – | 2.60 | 307.68 | UMC 65 nm |
| [27] | FireSaber | SW/HW | LW | 10 | 170 | 465\* | – | – | – | – | 2.77 | 1,335.48 | UMC 65 nm |

All SW/HW co-designs using RISC-V RV32IM

\* Numbers reported in kGE

**Table 8:** All KEMs and PKEs on Zynq UltraScale+

| Design | Algorithm | Type | Target | Max. Freq. (MHz) | LUT | FF | Slice | DSP | BRAM | Key Gen. (cycles) | Key Gen. (μs) | Encaps. (cycles) | Encaps. (μs) | Decaps. (cycles) | Decaps. (μs) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Security Level 1** | | | | | | | | | | | | | | | |
| [19] | R5ND_1KEM_0d | SW/HW | HS | 260 | 55,442 | 82,341 | 10,627 | 0 | 2 | – | – | – | 19.0 | – | 24.0 |
| [61] | LightSaber | HW | HS | 150 | 25,079 | 10,750 | – | 0 | 2 | 2,761 | 18.4 | 4,033 | 26.9 | 5,037 | 33.6 |
| [19] | LightSaber | SW/HW | HS | 322 | 12,343 | 11,288 | 1,989 | 256 | 3.5 | – | – | – | 53.0 | – | 56.0 |
| [19] | FrodoKEM-640 | SW/HW | HS | 402 | 7,213 | 6,647 | 1,186 | 32 | 13.5 | – | – | – | 1,223.0 | – | 1,319.0 |
| **Security Level 3** | | | | | | | | | | | | | | | |
| [19] | R5ND_3KEM_0d | SW/HW | HS | 249 | 73,881 | 109,211 | 14,307 | 0 | 2 | – | – | – | 24.0 | – | 33.0 |
| [61] | Saber | HW | HS | 150 | 25,079 | 10,750 | – | 0 | 2 | 5,435 | 36.2 | 6,618 | 44.1 | 8,034 | 53.6 |
| [19] | Saber | SW/HW | HS | 322 | 12,566 | 11,619 | 1,993 | 256 | 3.5 | – | – | – | 60.0 | – | 65.0 |
| [19] | FrodoKEM-976 | SW/HW | HS | 402 | 7,807 | 6,693 | 1,190 | 32 | 17 | – | – | – | 1,642.0 | – | 1,866.0 |
| **Security Level 5** | | | | | | | | | | | | | | | |
| [19] | R5ND_5KEM_0d | SW/HW | HS | 212 | 91,166 | 151,019 | 18,733 | 0 | 2 | – | – | – | 32.0 | – | 42.0 |
| [40] | NewHope-1024 <sup>cpa</sup> | HW | HS | 406 | 13,961 | – | – | 25 | 18 | – | – | 34,000 | 83.0 | 30,600 | 75.0 |
| [19] | FireSaber | SW/HW | HS | 322 | 12,555 | 11,881 | 2,341 | 256 | 3.5 | – | – | – | 74.0 | – | 80.0 |
| [19] | FrodoKEM-1344 | SW/HW | HS | 417 | 7,015 | 6,610 | 1,215 | 32 | 17.5 | – | – | – | 2,186.0 | – | 3,120.0 |

All SW/HW co-designs using ARM Cortex-A53

<sup>cpa</sup> Design of a PKE variant resistant against Chosen-Plaintext Attack (CPA)

[40] only reports latency of Encapsulation and total latency of Key Generation and Decapsulation

**Table 9:** Digital Signature Schemes on Artix-7, Kintex-7 and Virtex-7

| Design | Algorithm | Type | Target | Max. Freq. (MHz) | LUT | FF | Slice | DSP | BRAM | Key Gen. (cycles) | Key Gen. (μs) | Sig. Verification (cycles) | Sig. Verification (μs) | Sig. Generation (cycles) | Sig. Generation (μs) | Family |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Security Level 1 & 2** | | | | | | | | | | | | | | | | |
| [41] | Picnic-L1-FS | HW | HS | 91 | 90,535 | 23,516 | 25,160 | 0 | 52.5 | – | – | 29,600 | 325.6 | 31,300 | 344.3 | Artix-7 |
| [13] | qTESLA-I <sup>o2</sup> | SW/HW | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 4,846,949 | 193,878.0 | 38,922 | 1,556.9 | 168,273 | 6,730.9 | Artix-7 |
| [13] | Dilithium-I | SW/HW | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 95,202 | 3,808.1 | 142,576 | 5,703.0 | 376,392 | 15,055.7 | Artix-7 |
| [13] | Dilithium-II | SW/HW | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 130,022 | 5,200.9 | 184,933 | 7,397.3 | 514,246 | 20,569.8 | Artix-7 |
| [73] | qTESLA-p-I | SW/HW | LW | 121 | 7,212 | 4,378 | 2,438 | 15 | 139.0 | 925,431 | 7,648.2 | 946,520 | 7,822.5 | 4,165,160 | 34,422.8 | Artix-7 |
| [25] | Rainbow-Ic <sup>o1</sup> | HW | HS | 90 | 52,895 | 32,476 | 15,112 | 0 | 67.0 | – | – | – | – | 979 | 10.9 | Kintex-7 |
| [25] | Rainbow-Ia | HW | HS | 111 | 27,712 | 27,679 | 8,939 | 0 | 59.0 | – | – | – | – | 1,980 | 17.8 | Kintex-7 |
| [41] | Picnic-L1-FS | HW | HS | 125 | 90,037 | 23,105 | – | 0 | 52.5 | – | – | 29,600 | 237.0 | 31,300 | 250.0 | Kintex-7 |
| [25] | Rainbow-Ic <sup>o1</sup> | HW | HS | 167 | 52,721 | 32,475 | 15,976 | 0 | 67.0 | – | – | – | – | 979 | 5.9 | Virtex-7 |
| [25] | Rainbow-Ia | HW | HS | 181 | 27,556 | 27,675 | 7,065 | 0 | 59.0 | – | – | – | – | 1,980 | 10.9 | Virtex-7 |
| [14] | qTesla <sup>o2</sup> | HW-HLS | HS | 100 | 346,020 / 247,726 | 112,657 / 77,567 | – | 0 | 0.0 | – | – | 36,423 | 364.2 | 63,736 | 637.4 | Virtex-7 |
| [14] | Dilithium | HW-HLS | HS | 100 | 1,327,355 / 388,991 | 146,076 / 108,154 | – | 0 | 0.0 | – | – | 5,380 | 53.8 | 155,166 | 1,551.7 | Virtex-7 |
| [14] | MQDSS <sup>o1</sup> | HW-HLS | HS | 67 | 270,713 / 323,734 | 47,441 / 93,945 | – | 0 | 0.0 | – | – | 25,084,906 | 376,273.6 | 25,825,918 | 387,388.8 | Virtex-7 |
| [14] | SPHINCS+ | HW-HLS | HS | 100 | 66,750 / 11,438 | 20,628 / 3,335 | – | 0 | 0.0 | – | – | 937,975 | 9,379.8 | 468,789,803 | 4,687,898.0 | Virtex-7 |
| **Security Level 3** | | | | | | | | | | | | | | | | |
| [13] | qTesla-III-speed <sup>o2</sup> | SW/HW | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 11,898,241 | 475,929.6 | 67,712 | 2,708.5 | 317,083 | 12,683.3 | Artix-7 |
| [13] | qTesla-III-size <sup>o2</sup> | SW/HW | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 11,479,190 | 459,167.6 | 69,154 | 2,766.2 | 348,429 | 13,937.2 | Artix-7 |
| [13] | Dilithium-III | SW/HW | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 167,433 | 6,697.3 | 229,481 | 9,179.2 | 634,763 | 25,390.5 | Artix-7 |
| [73] | qTESLA-p-III | SW/HW | LW | 121 | 7,475 | 4,518 | 2,473 | 15 | 147.0 | 2,305,220 | 19,051.4 | 2,315,950 | 19,140.1 | 7,745,088 | 64,009.0 | Artix-7 |
| **Security Level 4 & 5** | | | | | | | | | | | | | | | | |
| [41] | Picnic-L5-FS | HW | HS | 125 | 167,530 | 33,164 | – | 0 | 98.5 | – | – | 146,600 | 1,173.0 | 154,500 | 1,236.0 | Kintex-7 |
| [13] | Dilithium-IV | SW/HW | LW | 25\* | 14,975 | 2,539 | 4,173 | 11 | 14.0 | 223,272 | 8,930.9 | 276,221 | 11,048.8 | 815,636 | 32,625.4 | Artix-7 |

<sup>o1</sup> Design for a parameter set withdrawn at the beginning of Round 2

<sup>o2</sup> Design for a heuristic parameter set withdrawn by the submitters on Aug. 20, 2019

All SW/HW co-designs using RISC-V RV32IM

\* Preliminary result

**Table 10:** Digital Signature Schemes on ASIC

| Design | Algorithm | Type | Target | Max. Freq. (MHz) | Area (kGE) | SRAM (kB) | Key Gen. (cycles) | Key Gen. (μs) | Sig. Verification (cycles) | Sig. Verification (μs) | Sig. Generation (cycles) | Sig. Generation (μs) | Technology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Security Level 1 & 2** | | | | | | | | | | | | | |
| [12] | qTESLA-I <sup>o2</sup> | SW/HW | LW | 72 | 106 | 40.25 | 4,846,949 | 67,318.7 | 38,922 | 540.6 | 168,273 | 2,337.1 | TSMC 40 nm |
| [12] | Dilithium-I | SW/HW | LW | 72 | 106 | 40.25 | 95,202 | 1,322.3 | 142,576 | 1,980.2 | 376,392 | 5,227.7 | TSMC 40 nm |
| [12] | Dilithium-II | SW/HW | LW | 72 | 106 | 40.25 | 130,022 | 1,805.9 | 184,933 | 2,568.5 | 514,246 | 7,142.3 | TSMC 40 nm |
| **Security Level 3** | | | | | | | | | | | | | |
| [12] | qTesla-III-speed <sup>o2</sup> | SW/HW | LW | 72 | 106 | 40.25 | 11,898,241 | 165,253.3 | 67,712 | 940.4 | 317,083 | 4,403.9 | TSMC 40 nm |
| [12] | qTesla-III-size <sup>o2</sup> | SW/HW | LW | 72 | 106 | 40.25 | 11,479,190 | 159,433.2 | 69,154 | 960.5 | 348,429 | 4,839.3 | TSMC 40 nm |
| [12] | Dilithium-III | SW/HW | LW | 72 | 106 | 40.25 | 167,433 | 2,325.5 | 229,481 | 3,187.2 | 634,763 | 8,816.2 | TSMC 40 nm |
| **Security Level 4** | | | | | | | | | | | | | |
| [12] | Dilithium-IV | SW/HW | LW | 72 | 106 | 40.25 | 223,272 | 3,101.0 | 276,221 | 3,836.4 | 815,636 | 11,328.3 | TSMC 40 nm |

<sup>o2</sup> Design for a heuristic parameter set withdrawn by the submitters on Aug. 20, 2019

**Table 11:** Power and Energy Comparison for Digital Signature Schemes on ASIC

| Design | Algorithm | Type | Target | Max. Freq. (MHz) | Area (kGE) | SRAM (kB) | Key Gen. Power (mW) | Key Gen. Energy (μJ) | Sig. Verification Power (mW) | Sig. Verification Energy (μJ) | Sig. Generation Power (mW) | Sig. Generation Energy (μJ) | Technology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Security Level 1 & 2** | | | | | | | | | | | | | |
| [12] | qTESLA-I <sup>o2</sup> | SW/HW | LW | 72 | 106 | 40.25 | 7.89 | 531.55 | 7.99 | 4.32 | | 23.34 | TSMC 40 nm |
| [12] | Dilithium-I | SW/HW | LW | 72 | 106 | 40.25 | 6.82 | 9.00 | 7.73 | 15.31 | 6.77 | 35.41 | TSMC 40 nm |
| [12] | Dilithium-II | SW/HW | LW | 72 | 106 | 40.25 | 7.24 | 13.08 | 7.49 | 19.23 | | 54.82 | TSMC 40 nm |
| **Security Level 3** | | | | | | | | | | | | | |
| [12] | qTesla-III-speed <sup>o2</sup> | SW/HW | LW | 72 | 106 | 40.25 | 7.64 | 1,262.39 | 7.30 | 6.86 | | 43.91 | TSMC 40 nm |
| [12] | qTesla-III-size <sup>o2</sup> | SW/HW | LW | 72 | 106 | 40.25 | 7.71 | 1,229.18 | 7.59 | 7.27 | 9.97 | 48.23 | TSMC 40 nm |
| [12] | Dilithium-III | SW/HW | LW | 72 | 106 | 40.25 | 7.36 | 17.11 | 7.41 | 23.63 | | 65.26 | TSMC 40 nm |
| **Security Level 4** | | | | | | | | | | | | | |
| [12] | Dilithium-IV | SW/HW | LW | 72 | 106 | 40.25 | 6.89 | 21.38 | 7.44 | 28.55 | 6.93 | 78.53 | TSMC 40 nm |

<sup>o2</sup> Design for a heuristic parameter set withdrawn by the submitters on Aug. 20, 2019

All SW/HW co-designs using RISC-V RV32IM

In Table 8, we compare results reported by our own group at the end of 2019 in [19] with results reported by other groups for Saber and NewHope, in [61] and [40], respectively. All results were obtained using the same SoC FPGA, Zynq UltraScale+. For security levels 1, 3, and 5, the ranking of candidates in terms of the decapsulation time is 1. Round5, 2. Saber, 3. FrodoKEM. Software/hardware implementations based on ARM Cortex-A53 are compared in [19]. The differences between the candidates are so large that even when Saber is implemented in pure hardware by another group [61], the obtained improvement in terms of the decapsulation and encapsulation time does not affect Saber's position in the ranking. 
At the same time, [61] demonstrates the ability to avoid the use of a large number of DSP units, at the cost of approximately doubling the number of LUTs. It should be mentioned that a certain percentage of these additional LUTs might be necessary to support additional operations offloaded to hardware as compared to the software/hardware implementation reported in [19]. Somewhat similarly, for the security level 5, the purely hardware implementation of NewHope, reported in [40], is not fast enough to outperform the software/hardware implementation of Round5.

In Tables 9, 10, and 11, we summarize results available for the implementations of digital signatures. The implementations targeting FPGAs are considered first in Table 9. Unfortunately, multiple results available for qTESLA concern heuristic parameter sets that were withdrawn by the submitters on Aug. 20, 2019. Among the remaining designs, for Artix-7, the ranking of candidates for the security level 1 is 1. Picnic, 2. Dilithium, and 3. qTESLA. The differences among these candidates in terms of the execution time for the signature generation (more critical) and signature verification are very significant. At the same time, only the implementation of Picnic is a high-speed and purely hardware implementation. The remaining implementations are software/hardware implementations based on RISC-V. Additionally, the number of LUTs for Picnic is approximately 6 times larger than for Dilithium, and the number of BRAMs is 3.75 times larger. At the same time, compared to Picnic, the execution time for signature generation, expressed in clock cycles, is about 12 times longer for Dilithium-I and 16 times longer for Dilithium-II. For the security level 3, no implementation of Picnic is available. The implementations of Dilithium-III and qTESLA-p-III are comparable in terms of type, target, and resource utilization. At the same time, the implementation of Dilithium is about an order of magnitude faster. 
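
The factors quoted for Picnic and Dilithium can be recomputed from the cycle counts and clock frequencies listed in Table 9. The following minimal Python sketch does this for signature generation on Artix-7. The class and function names (`Design`, `latency_us`, `cycle_ratio`) are illustrative choices of ours and are not taken from any of the cited studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """One row of Table 9, restricted to signature generation."""
    name: str
    freq_mhz: float   # maximum clock frequency in MHz (rounded in the table)
    sign_cycles: int  # clock cycles for one signature generation

    def latency_us(self) -> float:
        # Cycles divided by a frequency in MHz yields microseconds.
        return self.sign_cycles / self.freq_mhz


def cycle_ratio(design: Design, baseline: Design) -> float:
    """How many times more clock cycles `design` needs than `baseline`."""
    return design.sign_cycles / baseline.sign_cycles


# Values copied from the Artix-7 rows of Table 9.
picnic = Design("Picnic-L1-FS", 91.0, 31_300)
dilithium_1 = Design("Dilithium-I", 25.0, 376_392)
dilithium_2 = Design("Dilithium-II", 25.0, 514_246)

for d in (dilithium_1, dilithium_2):
    print(f"{d.name}: {d.latency_us():.1f} us, "
          f"{cycle_ratio(d, picnic):.1f}x the cycles of {picnic.name}")
```

Because the tabulated frequencies are rounded, and the 25 MHz figure is marked as preliminary, latencies recomputed this way may differ slightly from the values in the table. The cycle ratios come out close to the factors of 12 and 16 quoted above.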
The implementations of digital signature schemes targeting Kintex-7 and Virtex-7 are summarized in the same table. After disregarding results for old parameter sets, the following observations can be made. For Kintex-7 implementations, Rainbow substantially outperforms Picnic at the security level 1. For Virtex-7, the ranking of candidates is 1. Rainbow, 2. Dilithium, and 3. SPHINCS+. However, the implementations of Dilithium and SPHINCS+ are HLS implementations, which puts them at a disadvantage compared to the RTL implementation of Rainbow.

In Tables 10 and 11, Dilithium and qTESLA are compared from the point of view of their execution time, energy usage, and power consumption in ASICs. Unfortunately, the practical importance of the underlying study, reported in [12] and performed in the first half of 2019, was diminished by the use of heuristic parameter sets of qTESLA, withdrawn by the submitters on Aug. 20, 2019.

### 3 Choice of Algorithms to Implement

In this paper, we focus on KEMs with indistinguishability under chosen-ciphertext attack (IND-CCA). Our primary goal was to implement all lattice-based IND-CCA secure KEMs described in the specifications of Round 2 PQC candidates. Eventually, we fell short of this goal by not implementing a KEM of a single lattice-based candidate, Three Bears. Additionally, we focused on Ring Learning with Rounding (RLWR) variants of Round5, and thus, we did not attempt to implement any LWR variants of this submission. The submission packages of four candidates (LAC, NTRU, NTRU Prime, and Round5) describe two substantially different KEMs each. As a result, we have implemented 12 KEMs representing 8 Round 2 candidates. For each implemented KEM, we generated results for all supported security levels. With a few exceptions, we did not generate results for the underlying public-key encryption (PKE) schemes or concurrently proposed IND-CPA secure KEMs. 
The reason for that was a focus on the highest-level schemes, which could be securely used to agree on shared session keys, based on the long-term public-private key pairs valid for an extended period of time. In this scenario, the time of the public-private key-pair generation is non-critical, and the design can focus entirely on minimizing the time of encapsulation and decapsulation.

All implemented PQC candidates can be divided into the following major sub-families, listed below together with their Round 2 representatives:

- LWE (Learning With Errors): FrodoKEM
- RLWE (Ring Learning With Errors): LAC (including LAC-v3a and LAC-v3b) and NewHope
- Module-LWE (Module Learning With Errors): CRYSTALS-Kyber
- RLWR (Ring Learning With Rounding): Round5 (with and without an error-correcting code)
- Mod-LWR (Module Learning With Rounding): Saber
- NTRU-based: NTRU (including NTRU-HPS and NTRU-HRSS) and NTRU Prime (including Streamlined NTRU Prime and NTRU LPRime)

Both implemented variants of LAC were announced in the middle of Round 2, on Dec. 19, 2019. The implemented variants of the remaining algorithms have remained unchanged since the beginning of Round 2. The following two submissions did not limit the generation of pseudorandom bits to any particular algorithm (e.g., SHAKE): LAC and NTRU. As a result, for each of them, we selected the variant of a pseudorandom number generator that is most efficient on our benchmarking platform. In the case of CRYSTALS-Kyber, we selected one of the variants described in the specification, a variant based on the SHA-3 functions.

Selected features of all implemented KEMs are summarized in Tables 12 and 13. In all of these KEMs, the elementary operation is multiplication mod $q$. In FrodoKEM, LAC-v3b, Round5, Saber, NTRU-HPS, and NTRU-HRSS, $q$ is a power of two, which significantly simplifies the reduction mod $q$. 
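
To illustrate why a power-of-two modulus simplifies reduction, the following Python sketch contrasts the two cases. It is an illustration under our own assumptions, not code from any of the submissions: the function names are hypothetical, and the moduli ($2^{13}$ and the prime 3329) are example values chosen only for demonstration.

```python
def reduce_pow2(x: int, log2_q: int) -> int:
    """Reduce x modulo q = 2**log2_q by keeping only the low log2_q bits.

    In hardware this amounts to dropping the upper wires of the product.
    """
    return x & ((1 << log2_q) - 1)


def reduce_generic(x: int, q: int) -> int:
    """Reduce x modulo an arbitrary q.

    For a prime q, a hardware design typically needs a dedicated reduction
    circuit (for example Barrett or Montgomery reduction) instead of a mask.
    """
    return x % q


LOG2_Q = 13            # example power-of-two modulus: q = 2**13 (assumption)
Q_POW2 = 1 << LOG2_Q
Q_PRIME = 3329         # example prime modulus (assumption)

a, b = 5000, 7000      # two example coefficients
product = a * b

# The bit mask and the general-purpose remainder agree for q = 2**13.
assert reduce_pow2(product, LOG2_Q) == product % Q_POW2

print("product mod 2^13:", reduce_pow2(product, LOG2_Q))
print("product mod 3329:", reduce_generic(product, Q_PRIME))
```

The mask-based variant involves no arithmetic at all, whereas the generic variant stands in for whatever reduction circuit a prime modulus requires.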
In NewHope and Kyber, q is a special prime, selected to support speeding up polynomial multiplication in $\mathbb{Z}_q[x]/(x^n+1)$ using the Number Theoretic Transform (NTT). In LAC-v3a, q is a one-byte prime (251). In Streamlined NTRU Prime and NTRU LPRime, it is a prime smaller than $2^{13}$. The moduli chosen for NTRU Prime algorithms may potentially lead to a higher resistance against future attacks.

In FrodoKEM, the most time-consuming operation is a matrix-by-matrix multiplication, where each component of a matrix is an element of $\mathbb{Z}_q$. In Kyber and Saber, the most time-consuming operations are matrix-by-vector and vector-by-vector multiplications, where each element of a matrix or a vector is a polynomial with n coefficients in $\mathbb{Z}_q$, and the multiplication of such polynomials is performed modulo the reduction polynomial $x^n+1$. In NewHope, LAC, Round5, and all NTRU-based KEMs, the most time-consuming operation is a polynomial multiplication.

**Table 12:** Features of selected NIST Round 2 PQC KEMs

| Feature | LAC (v3a/v3b) | NewHope | Round5 | Kyber | Saber | FrodoKEM |
|---|---|---|---|---|---|---|
| Underlying problem | Ring-LWE: Ring Learning With Errors | Ring-LWE: Ring Learning With Errors | RLWR: Ring Learning With Rounding | Module-LWE: Module Learning With Errors | Mod-LWR: Module Learning With Rounding | LWE: Learning With Errors |
| Degree $n$ | Power of 2 | Power of 2 | $2^8 < n < 2^{11}$ | Power of 2 | Power of 2 | $n \equiv 0 \pmod{8}$ |
| Modulus $q$ | Byte-level: Prime / Power of 2 | Prime | Power of 2 | Prime | Power of 2 | Power of 2 |
| Other major parameters | Parameters of the binomial distribution;<br>$[l_c, l_m, l_d]$: BCH code | $k$: noise parameter,<br>$\gamma$: NTT parameter | $p, t$: other moduli | $k$: the lattice dimension as a multiple of $n$,<br>$\eta$: noise parameter | $l$: number of polynomials per vector,<br>$p, T$: other moduli,<br>$\mu$: parameter of CBD | $B$: number of bits encoded in each matrix entry,<br>$\sigma$: standard deviation |
| Hash-based functions | SHA3-512,<br>SHAKE256 | SHAKE128,<br>SHAKE256 | 1: SHAKE128<br>3, 5: SHAKE256 | SHA3-256,<br>SHA3-512,<br>SHAKE128,<br>SHAKE256 | SHA3-256,<br>SHA3-512,<br>SHAKE128 | 1: SHAKE128<br>3, 5: SHAKE256 |
| Sampling | Integers are sampled from a fixed-weight centered binomial distribution (CBD) | Integers are sampled from a centered binomial distribution (CBD) | Integers from a uniform distribution are produced by a DRBG taking a random seed | Integers are sampled from a centered binomial distribution (CBD) | Integers are sampled from a centered binomial distribution (CBD) | Integers are sampled from an approximation of a rounded continuous Gaussian distribution |
| Decryption failures | Yes | Yes | Yes | Yes | Yes | Yes |
| Polynomial rings | $\mathbb{Z}_q[x]/(x^n+1)$ | $\mathbb{Z}_q[x]/(x^n+1)$ | $\mathbb{Z}_q[x]/\Phi_{n+1}$\*\* | $\mathbb{Z}_q[x]/(x^n+1)$ | $\mathbb{Z}_q[x]/(x^n+1)$ | None |
| #Polynomial multiplications in Encapsulation | 2 | 2 | 2 | $k^2 + k$ | $l^2 + l$ | None;<br>2 matrix-by-matrix\* |
| #Polynomial multiplications in Decapsulation | 3 | 3 | 3 | $k^2 + 2k$ | $l^2 + 2l$ | None;<br>3 matrix-by-matrix\* |

\* Elements of matrices in $\mathbb{Z}_q$

\*\* $\Phi_{n+1} = (x^{n+1} - 1)/(x - 1)$

| Feature | NTRU-HPS | NTRU-HRSS | Streamlined NTRU Prime | NTRU LPRime |
|---|---|---|---|---|
| Underlying problem | Shortest Vector Problem | Shortest Vector Problem | Shortest Vector Problem | Shortest Vector Problem |
| Polynomial $P$ | $x^n - 1$ | $\Phi_n = (x^n - 1)/(x - 1)$\*\* | $x^n - x - 1$ irreducible in $\mathbb{Z}_q[x]$ | $x^n - x - 1$ irreducible in $\mathbb{Z}_q[x]$ |
| Degree $n$\* | Prime | Prime | Prime | Prime |
| Modulus $q$ | Power of 2 with $q/8 - 2 \le 2n/3$ | Power of 2 with $q > 8\sqrt{2}(n+1)$ | Prime | Prime |
| Other major parameters | $w$: fixed weight for f and r | N/A | $w$: fixed weight for f and r.<br>$3w \le 2n$<br>$16w + 1 \le q$ | $w$: fixed weight for b and a.<br>$3w \le 2n$<br>$16w + 2\delta + 3 \le q$ |
| Hash-based functions | SHA3-256 | SHA3-256 | SHA3-512 | SHA3-512 |
| Sampling | Fixed-weight and variable-weight polynomials are sampled from a uniform distribution | Variable-weight polynomials are sampled from a uniform distribution | Fixed-weight polynomials are sampled from a uniform distribution | Fixed-weight polynomials are sampled from a uniform distribution |
| Decryption failures | No | No | No | No |
| Polynomial rings | R/q: $\mathbb{Z}_q[x]/(x^n-1)$<br>S/q: $\mathbb{Z}_q[x]/(\Phi_n)$\*\*<br>S/3: $\mathbb{Z}_3[x]/(\Phi_n)$\*\* | $\mathbb{Z}_q[x]/(x^n - 1)$<br>$\mathbb{Z}_3[x](x - 1)/(x^n - 1)$ | R/q: $\mathbb{Z}_q[x]/(x^n-x-1)$<br>R/3: $\mathbb{Z}_3[x]/(x^n-x-1)$ | |
| #Polynomial multiplications in Encapsulation | 1 in R/q | 1 in R/q | 1 in R/q | 2 in R/q |
| #Polynomial multiplications in Decapsulation | 1 in R/q<br>1 in S/q<br>1 in S/3 | 1 in R/q<br>1 in S/q<br>1 in S/3 | 2 in R/q<br>1 in R/3 | 3 in R/q |

\* Denoted by p in the specification of Streamlined NTRU Prime and NTRU LPRime

\*\* $\Phi_n = (x^n - 1)/(x - 1)$ irreducible in $\mathbb{Z}_q[x]$

**Table 13:** Features of NIST Round 2 NTRU-based PQC KEMs

The only KEMs with no decryption failures in the underlying PKE are NTRU-based KEMs (NTRU-HPS, NTRU-HRSS, Streamlined NTRU Prime, and NTRU LPRime). Round5 and NTRU-based KEMs use sampling from the uniform distribution. In LAC, NewHope, Kyber, and Saber, a Centered Binomial Distribution (CBD) is used. In FrodoKEM, an approximation of a rounded continuous Gaussian distribution is required.

Parameter sets of the 12 investigated algorithms are summarized in Table 14. The specification of NTRU associates two different security categories with each parameter set for NTRU-HPS and NTRU-HRSS. In this paper, we conservatively assumed the lower security level based on the so-called non-local computational models (see [69], Section 5.3 Security Categories). The same computational model is implicitly assumed by the submitters of the other investigated algorithms. In Table 14, we have divided parameter sets into three groups with security levels 1 and 2, 3 only, and 4 and 5, respectively. Only the first group contains variants of all 12 investigated algorithms (with ten at level 1 and two at level 2). The second group includes 10 variants at the security level 3. Finally, the last group includes 10 variants total (with two at level 4 and eight at level 5).

| Algorithm | Parameter Set | Security Level | Degree n | Modulus q | Sk Size [bytes] | Pk Size [bytes] | Ct Size [bytes] |
|---|---|---|---|---|---|---|---|
| FrodoKEM | Frodo-640 | 1 | 640 | $2^{15}$ | 19,888 | 9,616 | 9,720 |
| Kyber | KYBER512 | 1 | 256 | 3329 | 1,632 | 800 | 736 |
| LAC-v3a | LAC-128 | 1 | 512 | 251 | 1,056 | 544 | 704 |
| LAC-v3b | LAC-128 | 1 | 512 | 256 | 1,056 | 544 | 704 |
| NewHope | NEWHOPE512-CCA-KEM | 1 | 512 | 12289 | 1,888 | 928 | 1,120 |
| NTRU-HPS | ntruhps2048677 | 1\* | 677 | $2^{11}$ | 1,235 | 931 | 931 |
| NTRU-HRSS | ntruhrss701 | 1\* | 701 | $2^{13}$ | 1,452 | 1,138 | 1,138 |
| Str NTRU Prime | kem/sntrup653 | 2 | 653 | $4621 < 2^{13}$ | 1,518 | 994 | 897 |
| NTRU LPRime | kem/ntrulpr653 | 2 | 653 | $4621 < 2^{13}$ | 1,125 | 897 | 1,025 |
| Round5 | R5ND_CCA_1KEM_0d | 1 | 586 | $2^{13}$ | 708 | 676 | 740 |
| Round5 | R5ND_CCA_1KEM_5d | 1 | 508 | $2^{10}$ | 493 | 461 | 620 |
| Saber | LightSaber-KEM | 1 | 256 | $2^{13}$ | 1,568 | 672 | 736 |
| FrodoKEM | Frodo-976 | 3 | 976 | $2^{16}$ | 31,296 | 15,632 | 15,744 |
| Kyber | KYBER768 | 3 | 256 | 3329 | 2,400 | 1,184 | 1,088 |
| LAC-v3a | LAC-192 | 3 | 1024 | 251 | 2,080 | 1,056 | 1,352 |
| LAC-v3b | LAC-192 | 3 | 1024 | 256 | 2,080 | 1,056 | 1,352 |
| NTRU-HPS | ntruhps4096821 | 3\* | 821 | $2^{12}$ | 1,592 | 1,230 | 1,230 |
| Str NTRU Prime | kem/sntrup761 | 3 | 761 | $4591 < 2^{13}$ | 1,763 | 1,158 | 1,039 |
| NTRU LPRime | kem/ntrulpr761 | 3 | 761 | $4591 < 2^{13}$ | 1,294 | 1,039 | 1,167 |
| Round5 | R5ND_CCA_3KEM_0d | 3 | 852 | $2^{12}$ | 1,031 | 983 | 1,103 |
| Round5 | R5ND_CCA_3KEM_5d | 3 | 756 | $2^{12}$ | 828 | 780 | 934 |
| Saber | Saber-KEM | 3 | 256 | $2^{13}$ | 2,304 | 992 | 1,088 |
| Str NTRU Prime | kem/sntrup857 | 4 | 857 | $5167 < 2^{13}$ | 1,463 | 1,184 | 1,312 |
| NTRU LPRime | kem/ntrulpr857 | 4 | 857 | $5167 < 2^{13}$ | 1,999 | 1,322 | 1,184 |
| FrodoKEM | Frodo-1344 | 5 | 1344 | $2^{16}$ | 43,088 | 21,520 | 21,632 |
| Kyber | KYBER1024 | 5 | 256 | 3329 | 3,168 | 1,568 | 1,568 |
| LAC-v3a | LAC-256 | 5 | 1024 | 251 | 2,080 | 1,056 | 1,464 |
| LAC-v3b | LAC-256 | 5 | 1024 | 256 | 2,080 | 1,056 | 1,464 |
| NewHope | NEWHOPE1024-CCA-KEM | 5 | 1024 | 12289 | 3,680 | 1,824 | 2,208 |
| Round5 | R5ND_CCA_5KEM_0d | 5 | 1170 | $2^{13}$ | 1,413 | 1,349 | 1,509 |
| Round5 | R5ND_CCA_5KEM_5d | 5 | 946 | $2^{11}$ | 1,042 | 978 | 1,285 |
| Saber | FireSaber-KEM | 5 | 256 | $2^{13}$ | 3,040 | 1,312 | 1,472 |

\* assuming non-local computational models

**Table 14:** Parameter sets of investigated algorithms. Notation: Sk - Secret Key, Pk - Public Key, Ct - Ciphertext.

# 4 Methodology

### 4.1 Assumptions

All implemented schemes are Key Encapsulation Mechanisms (KEMs). For each of them, we support two major operations: Encapsulation and Decapsulation. Whenever possible, hardware resources and software functions are shared between these two operations. All parameter sets of the given PQC scheme share the same HDL code. At the same time, the choice among parameter sets is made at the time of synthesis, so the exact amount of FPGA resources required to implement each particular parameter set can be determined and reported. The key generation is assumed to be performed in software or using a separate hardware unit.

Based on the considerations discussed in Section 1, our optimization target is high speed for both hardware and software/hardware implementation approaches. In both cases, the primary goal is the minimum execution time for Encapsulation and Decapsulation. No explicit limits are imposed on any resources of the FPGA platform, such as Configurable Logic Block Slices, LUTs, flip-flops, BRAMs, or DSP units. The goal is to demonstrate each algorithm's inherent ability to execute multiple operations in parallel.

All implementations are required to be constant-time to make them resistant against any known timing attacks. No physical access to the device or its proximity is assumed, which means that countermeasures against power-based and electromagnetic analysis-based attacks are considered non-essential. Developing and implementing such countermeasures is beyond the scope of this study.

HDL code is required to be portable among multiple state-of-the-art FPGA families of Xilinx and Intel, assuming that a given design fits in the largest device of a given family. The code does not use any vendor-specific or family-specific primitives or megafunctions.

Each hardware unit uses only a single clock. This clock can operate at an arbitrary clock frequency lower than or equal to the maximum clock frequency determined by the critical path of a given hardware unit. All reported execution times correspond to this maximum clock frequency.

### 4.2 Choice of Benchmarking Platforms for Round 2

**Hardware.** The submissions selected for the hardware-only implementations (CRYSTALS-Kyber, LAC, NewHope, and Round5) have moderate resource requirements, even when optimized for high speed. As a result, we have decided to generate results for two FPGA families: Artix-7 and Virtex-7.
Based on Section 2, these families were selected for benchmarking by the largest number of other groups to date.

**Software/hardware co-design.** In recent years, several hardware/software co-design platforms have emerged. The most popular in the industry are those based on integrating an ARM-based processor and FPGA fabric on a single chip. Examples include Xilinx Zynq 7000 System on Chip (SoC), Xilinx Zynq UltraScale+ MPSoC, Intel Cyclone V SoC FPGAs, Intel Arria 10 SoC FPGAs, and Intel Agilex F-Series SoC FPGAs. These devices support software/hardware co-designs based on a traditional high-level language program running on an ARM processor, with the most time-critical computations performed on a dedicated hardware accelerator. The advantages of these platforms include the use of the most popular embedded processor family (ARM) operating at high speed (1 GHz or above), state-of-the-art commercial tools (available for free, or at a reduced price for academic use), availability of relatively inexpensive prototyping boards, and practical deployment in multiple environments.

The primary alternatives are FPGA-based systems with so-called "soft" processor cores implemented in reconfigurable logic. Examples include Xilinx MicroBlaze, Intel Nios II, and the open-source RISC-V, originally developed at the University of California, Berkeley [58, 75, 76]. The main advantage of these systems over "hard" processor cores is flexibility in the allocation of resources to processor cores, including the possibility of extending them with special instructions specific to PQC. Additionally, they are easy to port between different FPGA families, and even between FPGAs and ASICs. A disadvantage compared to the "hard" option is that the "soft" processors operate at much lower clock frequencies (typically 200-450 MHz).

During Round 2, NIST asked designers to focus on the ARM Cortex-M4 for embedded software implementations and the Artix-7 for FPGA implementations.
However, we are not aware of any SoC FPGA that contains a Cortex-M processor and the Artix-7 FPGA fabric on a single chip. Even if such a chip existed, it would be more suitable for benchmarking of lightweight implementations (optimized for minimum cost and power consumption), rather than benchmarking of the high-speed implementations targeted by our study. As a result, we have based our choice of a platform primarily on the projected practical importance of various platforms during the initial period of deploying new PQC standards, and the expected speed-up over purely-software implementations. These priorities led us to choose devices from the "hard" processor class, with a hard-wired ARM processor, and among them, the Zynq UltraScale+ family from Xilinx Inc., the vendor with the biggest market share in this device category. Zynq UltraScale+ and similar SoC FPGAs are likely to be used for practical deployments of PQC in the near future, wherever device speed and time-to-market are of primary concern. Implementations using these devices are even more likely than implementations using only hardware. However, the use of soft-core processors, and in particular the free and open-source RISC-V, should be considered as a natural next step, especially in light of DARPA's recent selection of the RISC-V Instruction Set Architecture (ISA) for investigation within its cybersecurity-related programs [54]. Since these soft-core processors can be implemented practically on any modern FPGA family, the choice of the family should be dependent primarily on the selected type of implementation: lightweight vs. high-speed. Based on the above discussion, we chose the Xilinx Zynq UltraScale+ MPSoC XCZU9EG-2FFVB1156E as our target device and the Xilinx ZCU102 Evaluation Kit as a prototyping board. Our target device, Xilinx Zynq UltraScale+ MPSoC XCZU9EG-2FFVB1156E, is composed of two major parts sharing the same chip. 
The primary component of the Processing System (PS) is a quad-core ARM Cortex-A53 Application Processing Unit, running at 1.2 GHz. As in the software benchmarking experiments conducted by other groups, we utilize only one core in all our experiments. The Programmable Logic (PL) includes a programmable FPGA fabric similar to that of Virtex UltraScale+ FPGAs, including Configurable Logic Block (CLB) slices, Block RAMs, DSP units, etc. The frequency of operation depends on the particular logic instantiated in the reconfigurable fabric but typically does not exceed 400 MHz.

**Computer-Aided Design Tools.** The software used is the Xilinx Vivado Design Suite HLx Edition and the Xilinx Software Development Kit (XSDK), both in version 2018.2.

### 4.3 Benchmarking Setup for Software/Hardware Co-design

A high-level block diagram of the experimental software/hardware co-design platform is shown in Fig. 1. The Hardware Accelerator is connected, through the dual-clock Input and Output FIFOs, to the AXI DMA, supporting high-speed communication with the Processing System. Timing measurements are performed using the popular Xilinx IP unit called AXI Timer, which is capable of measuring time in clock cycles of the 200 MHz system clock. The Hardware Accelerator can operate at a variable clock frequency, controlled from software using the Clocking Wizard unit.

### 4.4 Interface and Communication Protocol

The interface of the hardware accelerator is shown in Fig. 2. This interface is assumed to be identical for both hardware and software/hardware implementations and matches the interface of the Input and Output FIFOs, shown in Fig. 3. The default width of the data bus is 64 bits.

Each particular operation, such as load public key, start encapsulation, etc., is initiated by sending an appropriate header (in the form of a single 64-bit word) from a program running on the ARM processor to the data input of a hardware accelerator.
When an operation requires additional data, this data is transmitted using the subsequent Input FIFO words. After the hardware accelerator produces results or detects an error, a header word is sent in the opposite direction. If an additional output is required, this output follows the header and is arranged in 64-bit words. The detailed format of the exchanged inputs and outputs is left up to the designer of a hardware accelerator.

Compared to an earlier proposed PQC Hardware API [26], the adopted interface is significantly simpler and more flexible. Only one input port, infifo, is used in place of three separate ports, Public Data Input (PDI), Secret Data Input (SDI), and Random Data Input (RDI). Only one output port, outfifo, is used in place of two separate ports, Public Data Output (PDO) and Secret Data Output (SDO).

**Figure 1:** Block diagram of software/hardware co-design.

**Figure 2:** Hardware accelerator interface.

The proposed interface does not provide physical separation among the public, secret, and random data. Still, it appears to be sufficient at the current stage of the evaluation process for both purely hardware and software/hardware implementations. It also significantly simplifies the software/hardware partitioning and transfer of data between the processor and the hardware accelerator.

### 4.5 Porting Software Implementations to ARM Cortex-A53

To minimize overhead, we have run software in the Bare Metal mode, without any operating system. We have started from the best high-level language implementations of selected candidates available to date. In order to be run on ARM Cortex-A53 in the Bare Metal mode, these implementations had to be modified as described below.

Since no functions of OpenSSL are available in the Bare Metal mode, we have adopted for AES the Optimized ANSI C code of the Rijndael cipher based on the use of T-boxes, developed by Vincent Rijmen, Antoon Bosselaers, and Paulo Barreto [60].
Compared to the OpenSSL implementation, the selected implementation is written entirely in C, rather than in an assembly language of a specific processor. It does not contain any countermeasures against cache-timing attacks.

**Figure 3:** The Input and Output FIFO Interface.

For SHA-3, for all candidates other than Round5, we adopted fips202.c from SUPERCOP by Ronny Van Keer, Gilles Van Assche, Daniel J. Bernstein, and Peter Schwabe. For Round5, we used r5\_xof\_shake.c by Markku-Juhani O. Saarinen and keccak1600.c from SUPERCOP, by the same authors as fips202.c.

For all investigated KEMs, the encapsulation operation uses multiple calls to the function randombytes(), which produces a sequence of random bytes with uniform distribution. Other PQC benchmarking projects use a version of this function based on operating system functions and/or functions from OpenSSL [15, 62, 42, 65]. None of these options is available in the Bare Metal mode. Therefore, in our code, we use the implementation of randombytes() proposed by Saarinen in April 2018 [62], which is an improved version of the implementation developed by NIST for the generation of known-answer tests [57]. Since both of these implementations are based on AES in the ECB mode from the OpenSSL library, we have replaced the AES code with the above-mentioned standalone, optimized implementation of AES in C [60]. As a result, the selected implementation of randombytes() is likely to have different timing characteristics than the implementations used in other benchmarking studies, such as SUPERCOP [15], pqcbench [62], pqm4 [42], and liboqs [65].

Taking into account that the C implementations of NTRU-HPS, NTRU-HRSS, and Streamlined NTRU Prime use randombytes() to generate 3211, 1400, and 2611 bytes, respectively, we have sped up these function calls by a) limiting the number of bytes returned by randombytes() to 32, and b) generating the remaining random bytes using SHAKE128.
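The seed-expansion idea can be sketched as follows. This is only an illustration under stated assumptions: `randombytes_stub` is a hypothetical stand-in based on `os.urandom` (the bare-metal code uses an AES-based generator instead), and using the 32-byte value directly as the SHAKE128 input is our assumption about how the two steps are combined, not a description of the actual C code.

```python
import hashlib
import os
from typing import Optional

SEED_BYTES = 32  # number of bytes requested from randombytes(), as stated in the text


def randombytes_stub(n: int) -> bytes:
    """Hypothetical stand-in for randombytes(); here backed by the OS."""
    return os.urandom(n)


def expand_random(total_bytes: int, seed: Optional[bytes] = None) -> bytes:
    """Return total_bytes pseudorandom bytes derived from a 32-byte seed.

    Only SEED_BYTES bytes come from the (slow) random source; the rest
    are produced by the SHAKE128 extendable-output function.
    """
    if seed is None:
        seed = randombytes_stub(SEED_BYTES)
    if len(seed) != SEED_BYTES:
        raise ValueError("seed must be exactly 32 bytes long")
    return hashlib.shake_128(seed).digest(total_bytes)


if __name__ == "__main__":
    # Byte counts quoted in the text for the three affected KEMs.
    for name, n in [("NTRU-HPS", 3211), ("NTRU-HRSS", 1400),
                    ("Streamlined NTRU Prime", 2611)]:
        print(name, len(expand_random(n)))
```

Because SHAKE128 is an extendable-output function, a shorter request with the same seed is a prefix of a longer one, so one expansion routine can serve all three byte counts.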
For the three KEMs mentioned above, this change resulted in the speed-up of the relevant functions by a factor greater than 3.

No attempt at the optimization of the software implementations of KEMs by employing assembly language coding has been made.

### 4.6 Software Profiling, C Source Code Analysis, and Software/Hardware Partitioning

Our first step in evaluating the suitability of cryptographic algorithms for software/hardware co-design was profiling of their software implementations using one core of the ARM Cortex-A53. Profiling produced a list of the most time-consuming functions, including their absolute execution time, percentage execution time, and the number of times they are called.

We decided which functions to offload to hardware based on the highest potential for the total speed-up, as well as the fairness of comparison among investigated algorithms. The total speed-up obtained by offloading an operation to hardware depends on two major factors: the percentage of the execution time taken in software by an operation offloaded to hardware, and the speed-up for the offloaded operation itself. In order to maximize the first factor, we gave priority to operations that take the largest percentage of the execution time, preferably more than 90%. These operations may involve a single function call, several adjacent function calls, or a sequence of consecutive instructions in C. It is preferred that a given operation is executed only once, or only a few times, as each transfer of control and data between software and hardware involves a certain fixed timing overhead, independent of the size of input and output to the accelerator.
In order to maximize the second factor, we gave priority to operations that have a high potential for parallelization in hardware, and a small total size of inputs and outputs (which will need to be transferred to and from the hardware accelerator, respectively).

Most of the data required to make informed decisions regarding software/hardware partitioning can be obtained by profiling software implementations, possibly extended with some small modifications required to gather all relevant data. However, determining the potential for parallelization requires some knowledge of hardware or at least basic concepts of concurrent computing.

To assure fairness in our comparison, we offloaded to hardware all operations common to or similar across the implemented algorithms (e.g., all polynomial multiplications), and all operations that contributed significantly to the total execution time. Nevertheless, it should be understood that this heuristic procedure may need to be repeated several times because, after each round of offloading to hardware, different software operations may emerge as taking the majority of the total execution time. This process can stop when the development effort required for offloading the next most-critical operation to hardware is disproportionately high compared to the projected speed-up.

All encapsulations involve a single call to the function randombytes(), returning a seed to a pseudorandom number generator. This function could possibly be offloaded to hardware by implementing a True Random Number Generator (TRNG) in Programmable Logic. However, the correct implementation of a TRNG in FPGA fabric is a substantial project by itself. Additionally, in some cases, the seed would need to be transferred back to software, while in others it could be used directly by a hardware accelerator. To avoid these additional complications, in all the current software/hardware partitioning schemes, the seed is assumed to be generated in software.
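The dependence of the total speed-up on the two factors discussed in this section can be made concrete with a small Amdahl's-law-style model. This is a sketch under simplifying assumptions, not the procedure used to obtain the results in this paper: the function name and parameters are hypothetical, all times are normalized to the original software execution time, and the software/hardware transfer overhead is modeled as a single additive term.

```python
def total_speedup(fraction_offloaded: float,
                  accelerator_speedup: float,
                  transfer_overhead: float = 0.0) -> float:
    """Estimate the overall speed-up of a software/hardware co-design.

    fraction_offloaded  -- share of the software execution time spent in
                           the operation moved to hardware (0.0 to 1.0)
    accelerator_speedup -- speed-up of the offloaded operation itself
    transfer_overhead   -- software/hardware transfer time, expressed as
                           a fraction of the original software time
    """
    if not 0.0 <= fraction_offloaded <= 1.0:
        raise ValueError("fraction_offloaded must be between 0 and 1")
    if accelerator_speedup <= 0.0:
        raise ValueError("accelerator_speedup must be positive")
    remaining_software = 1.0 - fraction_offloaded
    accelerated_part = fraction_offloaded / accelerator_speedup
    return 1.0 / (remaining_software + accelerated_part + transfer_overhead)


if __name__ == "__main__":
    for p in (0.50, 0.90, 0.99):
        print(f"offloaded share {p:.2f}: "
              f"total speed-up {total_speedup(p, 100.0):.2f}")
```

In this model, the total speed-up can never exceed 1/(1 - fraction_offloaded), no matter how fast the accelerator is. With 90% of the time offloaded, the bound is 10, which illustrates why operations covering more than 90% of the execution time were given priority and why the procedure may need to be repeated once a new operation starts to dominate.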
### 4.7 RTL Design Methodology

The design of a hardware accelerator follows a traditional Register-Transfer Level (RTL) methodology. The entire system is divided into the Datapath and Controller. The Datapath is described using a hierarchical block diagram, and the Controller using hierarchical algorithmic state machine (ASM) charts. Multiple local controllers may be advantageous compared to a single global Controller. The RTL approach, although not novel by itself, is an important part of our methodology as it facilitates very efficient hardware accelerator designs. The block diagrams and ASM charts are very easy to translate to efficient and fully synthesizable VHDL code.

### 4.8 Potential Software Optimizations

Our software/hardware implementations could potentially be sped up by accelerating the remaining software part using assembly language programming. The ARM Cortex-A53 is a microarchitecture implementing the ARMv8-A 64-bit instruction set. This instruction set consists of traditional RISC instructions operating on 31 general-purpose 64-bit registers, as well as Single-Instruction Multiple-Data (SIMD) instructions operating on 32 128-bit registers, treated as vectors composed of smaller data words.

The SIMD architecture extension for the ARM Cortex-A processors, including Cortex-A53, is referred to as NEON [7]. The NEON 128-bit registers are considered as vectors of elements of the same data type, with NEON instructions operating on multiple elements simultaneously. Multiple data types are supported by this technology, including floating-point and integer operations.

A programmer can take advantage of NEON instructions using any of the following methods: a) auto-vectorization by a compiler, b) use of NEON-enabled libraries, c) use of NEON intrinsics, and d) use of hand-coded NEON assembly language code [7].
Auto-vectorization is the process by which a compiler automatically analyzes the code and identifies opportunities to optimize performance using NEON instructions and register files. NEON-enabled libraries focus on signal and image processing, computer vision, physics, and machine learning. Examples include Arm Compute Library, Ne10, Libyuv, and Skia. NEON intrinsics are function calls that the compiler replaces with an appropriate NEON instruction or sequence of NEON instructions. Intrinsics provide almost as much control as writing assembly language but leave the allocation of registers to the compiler [6]. Finally, using hand-coded NEON assembly language code provides the programmer with the highest level of control, which may be used for the most aggressive and advanced optimizations. Our implementations take advantage of only Option a) Auto-vectorization by a compiler. Regarding Option b), we are not aware of any NEON-enabled library that explicitly benefits lattice-based cryptosystems. No attempt was made to optimize software parts of our implementations using either intrinsics or hand-coded NEON assembly. Our justification for these choices is as follows. Operations that are most suitable for a speed-up using NEON instructions are typically also excellent candidates for offloading to hardware. As a result, at the time when further offloading to hardware is judged to be either counterproductive or too labor intensive, the remaining operations executed in software are most likely sequential in nature and cannot take advantage of NEON instructions and registers. Still, some speed up could be potentially accomplished by hand-coding these operations using scalar instructions of the ARMv8-A 64-bit instruction set. However, doing that would make the implementation much less portable. A similar effort could possibly be better spent on offloading all remaining operations to hardware. 
Even if the further offloaded operations cannot by themselves benefit from any substantial speed-up, moving them to hardware will eventually eliminate the entire transfer time, which remains substantial in all our software/hardware co-designs.

As a result, optimizing software implementations using NEON instructions and registers should be treated as an alternative optimization path, starting from the same starting point as our software/hardware co-designs. This starting point is a portable C implementation that can be easily profiled and analyzed for inherent parallelism.

### 4.9 Verification and Generation of Results

Functional verification of the hardware description language (HDL) code is performed by comparing simulation results with precomputed outputs generated by a reference software implementation. Fully verified and independently optimized VHDL code is then combined with the *optimized* software implementation of a given PQC candidate. Functional verification of the integrated software/hardware design is performed by running the code on the prototyping board and comparing the obtained outputs with outputs generated by a functionally equivalent reference implementation, run on the same ARM Cortex-A53 processor.

Experimental timing measurements follow, with the hardware accelerator's clock set (using the Clocking Wizard) to the optimal target frequency identified during the synthesis and implementation runs. The execution time is measured by using the AXI Timer module, shown in Fig. 1, in clock cycles of the AXI Timer, which operates at the default clock frequency of 200 MHz.

The encapsulation time does not include the time necessary to transfer the public key to the hardware accelerator. Similarly, the decapsulation time does not include the time necessary to transfer the private key to the hardware accelerator. In case a public key of the receiver is required during decapsulation, this key is assumed to be a part of the corresponding private key.
The time required for a key upload is calculated as the difference between the time necessary for transferring a concatenation of the key and the first input (e.g., the seed for encapsulation, or the ciphertext for decapsulation) and the time required to transfer the first input itself. This convention is consistent with the fact that the transmission of the key does not need to be repeated if the same key is reused multiple times. At the same time, the key upload overhead is typically so small that it is not efficient to send the key well before its first use. As a result, the key upload is assumed to be always combined with the transmission of the first input.

### 5 Results

### 5.1 Results for Hardware Implementations

Six CCA-secure KEMs representing four candidates (CRYSTALS-Kyber, LAC, NewHope, and Round5) have been implemented in pure hardware. LAC is represented by two variants, v3a with q=251 and v3b with q=256. Round5 is represented by R5ND\_CCA\_KEM\_0d, a ring variant without any error-correcting code, and R5ND\_CCA\_KEM\_5d, a ring variant with the XE5 forward error-correcting code used to decrease decryption failure rates during decapsulation (and thus improve bandwidth and security).

The maximum clock frequency and resource utilization of our hardware implementations of all six KEMs are summarized in Table 15 for Xilinx Artix-7 FPGAs and in Table 16 for Xilinx Virtex-7 FPGAs. All but one KEM fit within the largest device of the Artix-7 family. The only one that does not is the security level 5 variant of Round5 without error correction. Taking into account that we target high-speed implementations, more suitable for high-performance FPGAs, such as Virtex-7, the inability to fit the high-speed version of the highest-security variant in a low-cost FPGA family should not be used against Round5.

LAC-v3b, with q=256, clearly outperforms LAC-v3a, with q=251, in terms of both the maximum clock frequency and resource utilization.
For example, for the security level 1 on Artix-7, the implementation of LAC-v3b has about 4% higher frequency and requires about 19% fewer LUTs. In the case of Round5, R5ND\_CCA\_KEM\_5d (with error correction) significantly outperforms R5ND\_CCA\_KEM\_0d (without error correction). For example, at the security level 1 on Artix-7, the difference is at the level of 10% in terms of clock frequency, and 36% in terms of the number of LUTs.

| Algorithm | Security Category: Parameter Set | Max. Freq. [MHz] | LUT | FF | Slice | DSP | BRAM |
|---|---|---|---|---|---|---|---|
| Kyber | 1: KYBER512 | 210 | 11,864 | 10,348 | 3,989 | 8 | 15.0 |
| Kyber | 3: KYBER768 | 210 | 11,884 | 10,380 | 3,984 | 8 | 15.0 |
| Kyber | 5: KYBER1024 | 210 | 12,183 | 12,441 | 4,511 | 8 | 15.0 |
| LAC-v3a | 1: LAC-128 | 185 | 23,314 | 15,950 | 7,099 | 0 | 8.5 |
| LAC-v3a | 3: LAC-192 | 172 | 38,898 | 26,174 | 11,700 | 0 | 11.5 |
| LAC-v3a | 5: LAC-256 | 167 | 42,721 | 26,872 | 12,903 | 0 | 11.5 |
| LAC-v3b | 1: LAC-128 | 192 | 18,955 | 15,958 | 5,421 | 0 | 8.5 |
| LAC-v3b | 3: LAC-192 | 190 | 28,362 | 26,182 | 7,949 | 0 | 11.5 |
| LAC-v3b | 5: LAC-256 | 167 | 32,184 | 26,882 | 8,995 | 0 | 11.5 |
| NewHope | 1: NEWHOPE512-CCA-KEM | 225 | 9,000 | 8,732 | 3,194 | 4 | 12.0 |
| NewHope | 5: NEWHOPE1024-CCA-KEM | 225 | 9,000 | 8,732 | 3,194 | 4 | 12.0 |
| Round5 | 1: R5ND_CCA_1KEM_0d | 185 | 57,137 | 80,676 | 21,291 | 0 | 3.0 |
| Round5 | 3: R5ND_CCA_3KEM_0d | 165 | 78,825 | 107,564 | 29,441 | 0 | 3.0 |
| Round5 | 5: R5ND_CCA_5KEM_0d | | Doesn't fit | | | | |
| Round5 | 1: R5ND_CCA_1KEM_5d | 204 | 36,578 | 56,355 | 14,042 | 0 | 3.0 |
| Round5 | 3: R5ND_CCA_3KEM_5d | 174 | 59,852 | 95,170 | 24,869 | 0 | 3.0 |
| Round5 | 5: R5ND_CCA_5KEM_5d | 169 | 69,548 | 113,913 | 28,286 | 0 | 3.0 |

**Table 15:** Maximum frequency and resource utilization of hardware implementations on Artix-7.
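The percentages quoted for the security level 1 variants on Artix-7 can be recomputed directly from the Table 15 entries. The short sketch below does so; the helper name `percent_change` is ours and is introduced only for illustration.

```python
def percent_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


# Security level 1 entries of Table 15 (Artix-7): frequency in MHz, LUT count.
LAC_V3A = {"freq": 185, "lut": 23314}
LAC_V3B = {"freq": 192, "lut": 18955}
R5_0D = {"freq": 185, "lut": 57137}  # Round5 without error correction
R5_5D = {"freq": 204, "lut": 36578}  # Round5 with error correction

if __name__ == "__main__":
    for label, better, worse in [("LAC-v3b vs. LAC-v3a", LAC_V3B, LAC_V3A),
                                 ("Round5 5d vs. Round5 0d", R5_5D, R5_0D)]:
        freq_gain = percent_change(better["freq"], worse["freq"])
        lut_saving = -percent_change(better["lut"], worse["lut"])
        print(f"{label}: {freq_gain:.1f}% higher frequency, "
              f"{lut_saving:.1f}% fewer LUTs")
```

The computed values agree with the rounded figures given in the text (about 4% and 19% for LAC, and about 10% and 36% for Round5).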
**Table 16:** Maximum frequency and resource utilization of hardware implementations on Virtex-7.

| Algorithm | Security Category: Parameter Set | Max. Freq. [MHz] | LUT | FF | Slice | DSP | BRAM |
|-----------|----------------------------------|------------------|-----|----|-------|-----|------|
| Kyber | 1: KYBER512 | 245 | 13,745 | 11,107 | 4,590 | 8 | 14.0 |
| Kyber | 3: KYBER768 | 245 | 13,889 | 11,113 | 4,500 | 8 | 14.0 |
| Kyber | 5: KYBER1024 | 245 | 14,163 | 13,179 | 5,172 | 8 | 14.0 |
| LAC-v3a | 1: LAC-128 | 286 | 24,452 | 16,097 | 7,320 | 0 | 8.5 |
| LAC-v3a | 3: LAC-192 | 250 | 39,220 | 26,325 | 12,021 | 0 | 11.5 |
| LAC-v3a | 5: LAC-256 | 208 | 44,722 | 27,033 | 13,659 | 0 | 11.5 |
| LAC-v3b | 1: LAC-128 | 294 | 18,972 | 16,061 | 5,300 | 0 | 8.5 |
| LAC-v3b | 3: LAC-192 | 286 | 28,344 | 26,206 | 7,654 | 0 | 11.5 |
| LAC-v3b | 5: LAC-256 | 213 | 32,177 | 26,846 | 9,111 | 0 | 11.5 |
| NewHope | 1: NEWHOPE512-CCA-KEM | 295 | 11,345 | 8,838 | 3,572 | 4 | 12.0 |
| NewHope | 5: NEWHOPE1024-CCA-KEM | 295 | 11,345 | 8,838 | 3,572 | 4 | 12.0 |
| Round5 | 1: R5ND_CCA_1KEM_0d | 238 | 62,407 | 80,726 | 23,918 | 0 | 3.0 |
| Round5 | 3: R5ND_CCA_3KEM_0d | 208 | 78,727 | 107,631 | 28,034 | 0 | 3.0 |
| Round5 | 5: R5ND_CCA_5KEM_0d | 215 | 108,472 | 156,532 | 39,008 | 0 | 3.0 |
| Round5 | 1: R5ND_CCA_1KEM_5d | 256 | 38,350 | 56,413 | 14,731 | 0 | 3.0 |
| Round5 | 3: R5ND_CCA_3KEM_5d | 222 | 59,824 | 95,270 | 23,505 | 0 | 3.0 |
| Round5 | 5: R5ND_CCA_5KEM_5d | 202 | 69,561 | 113,933 | 28,643 | 0 | 3.0 |

Taking into account the best variants of all four submissions at security level 1, all clock frequencies fall within a narrow range: between 192 and 225 MHz for Artix-7 (Table 15) and between 245 and 295 MHz for Virtex-7 (Table 16). Thus, no candidate demonstrates a significant advantage in terms of the maximum clock frequency.
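The percentage differences quoted for the LAC and Round5 variants can be reproduced directly from the Table 15 entries. The following is a minimal sketch, not part of the original benchmarking flow; the dictionary layout and function names are our own, and only the numeric values come from Table 15.

```python
# Values copied from Table 15 (Artix-7, security level 1).
# The dictionary keys and helper names are illustrative assumptions.
TABLE15_LEVEL1 = {
    "LAC-v3a": {"freq_mhz": 185, "lut": 23_314},
    "LAC-v3b": {"freq_mhz": 192, "lut": 18_955},
    "R5ND_0d": {"freq_mhz": 185, "lut": 57_137},
    "R5ND_5d": {"freq_mhz": 204, "lut": 36_578},
}


def percent_gain(value: float, baseline: float) -> float:
    """Relative increase of `value` over `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


def percent_saving(value: float, baseline: float) -> float:
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


def compare(better: str, baseline: str) -> tuple[float, float]:
    """Return (frequency gain in %, LUT saving in %) of `better` vs. `baseline`."""
    b, r = TABLE15_LEVEL1[better], TABLE15_LEVEL1[baseline]
    return (
        percent_gain(b["freq_mhz"], r["freq_mhz"]),
        percent_saving(b["lut"], r["lut"]),
    )


if __name__ == "__main__":
    for better, baseline in [("LAC-v3b", "LAC-v3a"), ("R5ND_5d", "R5ND_0d")]:
        freq, lut = compare(better, baseline)
        print(f"{better} vs. {baseline}: +{freq:.1f}% frequency, -{lut:.1f}% LUTs")
```

Rounded to whole percentages, the results agree with the figures given in the text (about 4% and 19% for LAC, 10% and 36% for Round5).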
Ranking the candidates in terms of resource utilization is also very difficult because there is no clear equivalence between the various elements of the resource utilization vectors. For example, on Artix-7, NEWHOPE512-CCA-KEM uses about 4 times fewer LUTs than R5ND\_CCA\_1KEM\_5d, but requires 4 vs. 0 DSP units, and 4 times more BRAMs. Thus, neither of these implementations can be claimed to be clearly superior to the other.

However, an important differentiating factor is whether different security levels require similar or significantly different amounts of resources. It is generally more desirable to have an algorithm that can be implemented using the same amount of resources, independently of the security level. This feature allows an easier upgrade of a security level. It also indirectly implies that the 3-in-1 or 2-in-1 designs will have a resource utilization similar to that of the lowest-security variant, rather than higher than that of the highest-security variant. Out of the six investigated KEMs, this desirable property is exhibited only by Kyber and NewHope. On top of that, Kyber is slightly more flexible, due to the existence of a variant at security level 3. On the other hand, NewHope has a small advantage in terms of all elements of the resource utilization vector (e.g., for level 1 on Artix-7, it uses 9,000 vs. 11,864 LUTs, 8,732 vs. 10,348 FFs, 3,194 vs. 3,989 slices, 4 vs. 8 DSP units, and 12 vs. 15 BRAMs).

**Figure 4:** (a) Public Key and (b) Private Key transfer latency ($\mu$s) on Artix-7

**Figure 5:** (a) Public Key and (b) Private Key transfer latency ($\mu$s) on Virtex-7

The times necessary to load a public key (required for encapsulation) and a secret (private) key (required for decapsulation) are proportional to the size of the respective key and inversely proportional to the maximum clock frequency of a given PQC unit. All transfers are assumed to be conducted using a 64-bit infifo\_data bus.
The sizes of keys for all variants of all investigated algorithms are summarized in Table 14. The maximum clock frequencies are listed in Table 15 for Artix-7 and Table 16 for Virtex-7. In Fig. 4, we compare these key loading times for Artix-7, and in Fig. 5 for Virtex-7. For both Artix-7 and Virtex-7, R5ND\_CCA\_KEM\_5d has the shortest key-loading times, and NewHope the longest. However, the differences among these times are relatively minor. They do not exceed a factor of 2 for loading a public key, and 3 for loading a private key.

**Figure 6:** Execution Time for (a) Encapsulation and (b) Decapsulation ($\mu$s) on Artix-7

**Figure 7:** Execution Time for (a) Encapsulation and (b) Decapsulation ($\mu$s) on Virtex-7

The ranking of all 6 implemented KEMs in terms of the two primary performance metrics, for high-speed implementations, is shown in Fig. 6 for Artix-7 and in Fig. 7 for Virtex-7. The exact results and relative differences among the candidates are also summarized in Tables 17 and 18. The primary metrics used for ranking are the execution times for encapsulation and decapsulation, respectively. Out of the two, the time of decapsulation is always longer, and thus more critical.

**Table 17:** Ranking of hardware implementations in terms of the execution time for encapsulation. For each algorithm, the first number represents the execution time in $\mu$s; the second number is the ratio of the execution time for a given algorithm and the best execution time in the given ranking.

| Platform | Level 5: time [$\mu$s] | Ratio |
|----------|------------------------|-------|
| Artix-7 | 27.6 | 1.00 |
| Artix-7 | 28.1 | 1.02 |
| Artix-7 | 28.5 | 1.03 |
| Artix-7 | 29.3 | 1.06 |
| Artix-7 | 33.9 | 1.23 |
| Virtex-7 | 22.1 | 1.00 |
| Virtex-7 | 22.4 | 1.01 |
| Virtex-7 | 23.0 | 1.04 |

[Only part of the Level 5 columns of Table 17 is legible in the source; the algorithm names, the remaining Virtex-7 entries, and the Level 1 and Level 3 columns are missing.]

**Table 18:** Ranking of hardware implementations in terms of the execution time for decapsulation. For each algorithm, the first number represents the execution time in $\mu$s; the second number is the ratio of the execution time for a given algorithm and the best execution time in the given ranking.

| Level 1 | $\mu$s | Ratio | Level 3 | $\mu$s | Ratio | Level 5 | $\mu$s | Ratio |
|---------|--------|-------|---------|--------|-------|---------|--------|-------|
| **Artix-7** | | | | | | | | |
| R5ND_5d | 16.3 | 1.00 | Kyber | 27.2 | 1.00 | Kyber | 36.2 | 1.00 |
| LAC-v3b | 18.9 | 1.16 | R5ND_5d | 28.4 | 1.04 | R5ND_5d | 36.4 | 1.01 |
| NewHope | 19.7 | 1.21 | LAC-v3b | 28.7 | 1.06 | LAC-v3b | 37.9 | 1.05 |
| R5ND_0d | 20.6 | 1.26 | R5ND_0d | 33.2 | 1.22 | NewHope | 39.2 | 1.08 |
| Kyber | 21.4 | 1.31 | LAC-v3a | 37.4 | 1.38 | LAC-v3a | 43.8 | 1.21 |
| LAC-v3a | 22.2 | 1.36 | | | | | | |
| **Virtex-7** | | | | | | | | |
| LAC-v3b | 12.4 | 1.00 | LAC-v3b | 19.1 | 1.00 | NewHope | 30.0 | 1.00 |
| R5ND_5d | 13.3 | 1.07 | R5ND_5d | 22.7 | 1.19 | Kyber | 31.0 | 1.03 |
| LAC-v3a | 14.4 | 1.16 | Kyber | 23.3 | 1.22 | R5ND_5d | 31.2 | 1.04 |
| NewHope | 15.0 | 1.21 | LAC-v3a | 25.8 | 1.35 | R5ND_0d | 35.8 | 1.19 |
| R5ND_0d | 17.0 | 1.37 | R5ND_0d | 27.0 | 1.41 | LAC-v3b | 37.7 | 1.26 |
| Kyber | 18.3 | 1.48 | | | | LAC-v3a | 43.5 | 1.45 |

R5ND\_CCA\_KEM\_5d and LAC-v3b are ranked consistently among the fastest two for both Encapsulation and Decapsulation, on Artix-7 and Virtex-7, and at all security levels. Kyber and NewHope are at positions 3 and 4 for Artix-7 and 5 and 6 for Virtex-7. The relative ranking of these two algorithms changes depending on the operation, security level, and FPGA family, but the overall differences are minuscule. Thus, neither of these two algorithms has a clear edge over the other in terms of hardware efficiency. Overall, the most efficient variants of all four candidates are in a virtual tie with one another.
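The ranking convention used in Tables 17 and 18 (sort by execution time, then report each time relative to the best one) can be sketched as follows. The function name and data layout are illustrative assumptions; the input values are the Artix-7, Level 1 decapsulation times from Table 18.

```python
def rank_by_time(times_us: dict[str, float]) -> list[tuple[str, float, float]]:
    """Sort candidates by execution time and attach the ratio to the best time.

    Returns a list of (name, time in microseconds, ratio to best), fastest first.
    """
    best = min(times_us.values())
    ordered = sorted(times_us.items(), key=lambda item: item[1])
    return [(name, t, round(t / best, 2)) for name, t in ordered]


# Artix-7, security level 1, decapsulation times in microseconds (Table 18).
ARTIX7_LEVEL1_DECAPS_US = {
    "Kyber": 21.4,
    "LAC-v3a": 22.2,
    "LAC-v3b": 18.9,
    "NewHope": 19.7,
    "R5ND_0d": 20.6,
    "R5ND_5d": 16.3,
}

if __name__ == "__main__":
    for name, t, ratio in rank_by_time(ARTIX7_LEVEL1_DECAPS_US):
        print(f"{name:8s} {t:5.1f} us  {ratio:.2f}")
```

Because the ratios are normalized to the fastest candidate in each column, they can be compared within a column but not across security levels or platforms.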
In Table 19, we compare our implementations of the CCA-secure KEMs, Kyber and NewHope, with the equivalent implementations reported in [14], benchmarked using the same platform. The only major difference between the compared designs is the use of an HLS-based methodology in [14] and an RTL-based methodology in our work. The differences in the obtained results are huge, although probably not that surprising, taking into account the almost complete reliance on tools in [14]. The HLS-based designs are bigger than the RTL-based designs in terms of the number of LUTs by a factor of at least 144 for Kyber, and 14.5 for NewHope. These factors are obtained by dividing the area of the decapsulation unit in the HLS-based approach by the area of the combined key generation/encapsulation/decapsulation unit in the RTL approach. Thus, if units with the same functionality were compared, the ratios could be even higher. For encapsulation, the HLS-to-RTL ratios are 10.5 for Kyber and 93 for NewHope in terms of clock cycles, and 38 and 412, respectively, in terms of execution time. For decapsulation, the corresponding execution-time ratios are 36 for Kyber and 742 for NewHope. Overall, in terms of the latency-area product, the HLS-based designs are three orders of magnitude worse. Additionally, the significant difference between the specific ratios for Kyber and NewHope, combined with the almost identical performance and resource usage of the RTL designs, indicates that the approach pursued in [14] cannot correctly predict the relative ranking of PQC candidates unless the differences among them are truly enormous.

In Table 20, we compare our hardware implementation of NewHope with the best high-speed implementation of this algorithm available to date. This implementation was described in [78], but it covered only a subset of the functionality of the IND-CCA KEM, namely the IND-CPA secure public-key encryption (PKE).
Since, for our own implementation, we could generate results for any subset of the complete CCA KEM design on an arbitrary platform, the presented comparison is as fair as possible. Both sets of results concern exactly the same functionality, implemented using the same optimization target, with results generated using exactly the same platform. Our implementation outperforms the design by Zhang et al. [78] in terms of all execution times. At security level 1, the speed-up varies between 2.2 for decryption, through 2.4 for key generation, to 2.6 for encryption. Similarly, at security level 5, the speed-up varies between 2.0 for decryption, through 2.4 for key generation, to 2.6 for encryption. The penalty paid for this increase in speed is an increase in the number of LUTs by 33%, in flip-flops by a factor of 2.2, a doubling of the number of DSP units from 2 to 4, and an increase in the number of BRAMs from 7-8 to 12. Overall, taking into account the optimization for high speed, our design is superior. However, the design from [78] provides an interesting example of trading speed for a reduction in resource utilization.

In Table 21, we compare the execution times of the CCA-KEM schemes with the execution times of the underlying CPA-PKE schemes, for Kyber, LAC, and NewHope. The ratios of the encapsulation and encryption times vary between 1.07 and 1.17. That means that the PKE encryption is the dominant operation, and the overhead of other operations does not exceed 17%. For all four KEMs listed in this table, decapsulation includes one call to decryption and one call to encryption. Thus, the ratio listed under Decapsulation is the ratio of the execution time of decapsulation to the sum of the execution times of encryption and decryption. This ratio varies between 1.03 and 1.06, which means that the overhead of the remaining operations does not exceed 6%.

**Table 19:** Comparison between RTL-based designs from this work and HLS-based designs from [14].
All results for Virtex-7 FPGAs.

| Algorithm | Design | Max. Freq. [MHz] | Freq. Ratio | LUT | LUT Ratio | FF | FF Ratio | DSP | BRAM |
|-----------|--------|------------------|-------------|-----|-----------|----|----------|-----|------|
| Kyber-512 CCA-KEM | HLS [14], encapsulation unit | 67 | 0.27 | 1,307,815 | 95.18 | 11,699 | 1.05 | - | - |
| Kyber-512 CCA-KEM | HLS [14], decapsulation unit | 67 | 0.27 | | 143.9 | | 17.48 | - | - |
| Kyber-512 CCA-KEM | RTL (this work) | 245 | - | 13,745 | - | 11,107 | - | 8 | 14 |
| NewHope-512 CCA-KEM | HLS [14], encapsulation unit | 67 | 0.23 | 136,457 | 12.03 | 25,639 | 2.90 | - | - |
| NewHope-512 CCA-KEM | HLS [14], decapsulation unit | 67 | 0.23 | 164,937 | 14.54 | 28,999 | 3.28 | - | - |
| NewHope-512 CCA-KEM | RTL (this work) | 295 | - | 11,345 | - | 8,838 | - | 4 | 10.0 |

| Algorithm | Design | Key Gen. cycles | Key Gen. $\mu$s | Encaps. cycles | Encaps. cycle ratio | Encaps. $\mu$s | Encaps. $\mu$s ratio | Decaps. cycles | Decaps. cycle ratio | Decaps. $\mu$s | Decaps. $\mu$s ratio |
|-----------|--------|-----------------|-----------------|----------------|---------------------|----------------|----------------------|----------------|---------------------|----------------|----------------------|
| Kyber-512 CCA-KEM | HLS [14] | - | - | 31,669 | 10.47 | 475.0 | 38.47 | 43,018 | 9.79 | 645.3 | 35.97 |
| Kyber-512 CCA-KEM | RTL (this work) | 2,190 | 8.9 | 3,025 | - | 12.3 | - | 4,395 | - | 17.9 | - |
| NewHope-512 CCA-KEM | HLS [14] | - | - | 307,847 | 93.01 | 4,617.7 | 411.55 | 721,986 | 167.40 | 10,829.8 | 740.74 |
| NewHope-512 CCA-KEM | RTL (this work) | 2,152 | 7.3 | 3,310 | - | 11.2 | - | 4,313 | - | 14.6 | - |

**Table 20:** Comparison between this work and the best hardware design of NewHope reported in the literature to date. All results for Zynq-7000 SoC FPGA.
| Metric | NewHope-512 CPA-PKE: [78] | NewHope-512 CPA-PKE: This work (TW) | NewHope-512 CPA-PKE: Ratio | NewHope-1024 CPA-PKE: [78] | NewHope-1024 CPA-PKE: This work (TW) | NewHope-1024 CPA-PKE: Ratio |
|--------|------|------|------|------|------|------|
| Max. Freq. [MHz] | 200 | 225 | 0.89 | 200 | 225 | 0.89 |
| LUT | | | | 6,781 | 9,009 | 0.75 |
| FF | | | | 4,127 | 8,768 | 0.47 |
| DSP | | | | 2 | 4 | 0.50 |
| BRAM | | | | 8 | 12 | 0.67 |
| Key Generation, cycles | | | | 8,000 | 3,768 | 2.12 |
| Key Generation, $\mu$s | | | 2.39 | 40.0 | 16.7 | 2.39 |
| CPA Encryption, cycles | | | | 12,500 | 5,521 | 2.26 |
| CPA Encryption, $\mu$s | | | | 62.5 | 24.5 | 2.55 |
| CPA Decryption, cycles | | | | 4,800 | 2,667 | 1.80 |
| CPA Decryption, $\mu$s | | | | 24.0 | 11.9 | 2.02 |

[The remaining NewHope-512 entries of Table 20 are not legible in the source.]

**Table 21:** Comparison of the execution times of major operations of the related CPA PKE and CCA KEM schemes when implemented in hardware using Artix-7. The ratio columns contain ratios of the execution times of Encapsulation/Encryption and Decapsulation/(Encryption+Decryption). All execution times are calculated without taking into account the time necessary to read inputs and offload outputs.
| Algorithm | CPA-PKE Encryption: cycles | CPA-PKE Encryption: $\mu$s | CPA-PKE Decryption: cycles | CPA-PKE Decryption: $\mu$s | CCA-KEM Encapsulation: cycles | CCA-KEM Encapsulation: $\mu$s | CCA-KEM Encapsulation: ratio | CCA-KEM Decapsulation: cycles | CCA-KEM Decapsulation: $\mu$s | CCA-KEM Decapsulation: ratio |
|-----------|------|------|------|------|------|------|------|------|------|------|
| Kyber-512 | 2,632 | 12.5 | 1,638 | 7.8 | 3,025 | 14.4 | 1.15 | 4,395 | 20.9 | 1.03 |
| Kyber-768 | 3,528 | 16.8 | 1,830 | 8.7 | 4,065 | 19.4 | 1.15 | 5,555 | 26.5 | 1.04 |
| Kyber-1024 | 5,104 | 24.3 | 2,022 | 9.6 | 5,785 | 27.5 | 1.13 | 7,395 | 35.2 | 1.04 |
| LAC-128-v3a | 3,021 | 16.3 | 864 | 4.7 | 3,215 | 17.4 | 1.07 | 4,023 | 21.7 | 1.03 |
| LAC-192-v3a | 4,516 | 26.3 | 1,461 | 8.5 | 4,840 | 28.1 | 1.07 | 6,272 | 36.4 | 1.05 |
| LAC-256-v3a | 5,156 | 30.9 | 1,607 | 9.6 | 5,480 | 32.8 | 1.06 | 8,499 | 42.7 | 1.05 |
| LAC-128-v3b | 2,542 | 13.2 | 864 | 4.5 | 2,736 | 14.3 | 1.08 | 3,544 | 18.4 | 1.04 |
| LAC-192-v3b | 3,542 | 18.6 | 1,461 | 7.7 | 3,866 | 20.3 | 1.09 | 5,297 | 27.8 | 1.06 |
| LAC-256-v3b | 4,182 | 25.0 | 1,607 | 9.6 | 4,506 | 27.0 | 1.08 | 7,525 | 36.8 | 1.06 |
| NewHope-512 | 2,833 | 12.6 | 1,259 | 5.6 | 3,310 | 14.7 | 1.17 | 4,313 | 19.2 | 1.05 |
| NewHope-1024 | 5,521 | 24.5 | 2,667 | 11.9 | 6,358 | 28.3 | 1.15 | 8,601 | 38.2 | 1.05 |

**Table 22:** Comparison between CPA-KEM, CCA-KEM and CCA-PKE variants of R5ND\_5d on Artix-7

| Algorithm | Max. Freq. [MHz] | Encaps./Encrypt.: cycles | ratio | $\mu$s | ratio | Decaps./Decrypt.: cycles | ratio | $\mu$s | ratio |
|-----------|------|------|------|------|------|------|------|------|------|
| CPA_1KEM | 204 | 2,308 | 1.00 | 11.3 | 1.00 | 1,137 | 1.00 | 5.6 | 1.00 |
| CCA_1KEM | 204 | 2,492 | 1.08 | 12.2 | 1.08 | 3,328 | 2.93 | 16.3 | 2.93 |
| CCA_1PKE | 183 | 2,518 | 1.09 | 13.8 | 1.22 | 3,352 | 2.95 | 18.3 | 3.29 |
| CPA_3KEM | 169 | 3,582 | 1.00 | 21.2 | 1.00 | 1,726 | 1.00 | 10.2 | 1.00 |
| CCA_3KEM | 174 | 3,755 | 1.05 | 21.6 | 1.02 | 4,932 | 2.86 | 28.3 | 2.78 |
| CCA_3PKE | 163 | 3,782 | 1.06 | 23.2 | 1.09 | 4,956 | 2.87 | 30.4 | 2.98 |
| CPA_5KEM | 154 | 4,435 | 1.00 | 28.8 | 1.00 | 2,123 | 1.00 | 13.8 | 1.00 |
| CCA_5KEM | 169 | 4,655 | 1.05 | 27.5 | 0.96 | 6,137 | 2.89 | 36.3 | 2.63 |
| CCA_5PKE | 156 | 4,683 | 1.06 | 30.0 | 1.04 | 6,161 | 2.90 | 39.5 | 2.86 |

In Table 22, we compare our hardware implementations of different schemes of the Round5 proposal. The dependencies among these schemes are graphically illustrated in Fig. 8. Our comparison contains results for maximum clock frequency and the execution times of major operations in the CPA-KEM, CCA-KEM, and CCA-PKE schemes. These metrics illustrate the cost of additional security of CCA-KEM as compared to CPA-KEM. The biggest difference is in the execution time of the CCA-KEM decapsulation and the CCA-PKE decryption, as compared to the CPA-KEM decapsulation. This difference comes from the Fujisaki-Okamoto transformation, used for providing CCA security. In CCA-KEM, during decapsulation, one additional CPA-PKE encryption is performed. CCA-KEM and CCA-PKE use the same parameter set, but CCA-PKE includes CCA-KEM and performs additional symmetric-key encryption after encapsulation of the key used for it. The differences in the execution times of the CPA-KEM and CCA-KEM encapsulations and the CCA-PKE encryption are negligible.
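The cycle-count ratios in Table 22 quantify the cost of CCA security relative to the CPA-KEM baseline at the same security level. The sketch below recomputes them from the cycle counts in that table; the dictionary layout and function name are our own illustrative choices.

```python
# Cycle counts from Table 22 (R5ND_5d on Artix-7), stored as
# (encapsulation or encryption cycles, decapsulation or decryption cycles).
ROUND5_CYCLES = {
    1: {"CPA_KEM": (2308, 1137), "CCA_KEM": (2492, 3328), "CCA_PKE": (2518, 3352)},
    3: {"CPA_KEM": (3582, 1726), "CCA_KEM": (3755, 4932), "CCA_PKE": (3782, 4956)},
    5: {"CPA_KEM": (4435, 2123), "CCA_KEM": (4655, 6137), "CCA_PKE": (4683, 6161)},
}


def cca_overhead(level: int, scheme: str) -> tuple[float, float]:
    """Cycle ratios of `scheme` to CPA-KEM at the same security level.

    Returns (encapsulation ratio, decapsulation ratio), rounded to 2 decimals.
    """
    base_enc, base_dec = ROUND5_CYCLES[level]["CPA_KEM"]
    enc, dec = ROUND5_CYCLES[level][scheme]
    return round(enc / base_enc, 2), round(dec / base_dec, 2)


if __name__ == "__main__":
    for level in (1, 3, 5):
        for scheme in ("CCA_KEM", "CCA_PKE"):
            enc_ratio, dec_ratio = cca_overhead(level, scheme)
            print(f"level {level} {scheme}: encaps x{enc_ratio}, decaps x{dec_ratio}")
```

The encapsulation ratios stay below 1.10, while the decapsulation ratios are close to 3, which reflects the additional CPA-PKE encryption performed during decapsulation.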
**Figure 8:** Dependencies between CPA and CCA versions of Round5 proposals [70]

### 5.2 Profiling the best available software implementations in C

| Algorithm | Software Source | Ref | Opt |
|------------|---------------------------------------------|-----|-----|
| FrodoKEM | https://github.com/Microsoft/PQCrypto-LWEKE | ✓ | ✓ |
| Kyber | https://github.com/pq-crystals/kyber | ✓ | |
| LAC | https://github.com/pqc-lac/lac-intel64 | ✓ | ✓ |
| NewHope | https://github.com/newhopecrypto/newhope | ✓ | |
| NTRU | https://github.com/jschanck/ntru | ✓ | |
| NTRU Prime | https://bench.cr.yp.to/supercop.html | ✓ | ✓ |
| Round5 | https://github.com/r5embed/r5embed | ✓ | ✓ |
| Saber | https://github.com/KULeuven-COSIC/SABER | ✓ | |

**Table 23:** Source of software implementations

We implemented 12 CCA-secure KEMs representing eight Round 2 lattice-based candidates using the software/hardware co-design approach described in detail in Section 4. In Table 23, we list the repositories containing the C source code used as a starting point for our software/hardware implementations. In the case of four candidates - FrodoKEM, LAC, NTRU Prime, and Round5 - optimized implementations in C, different from the reference implementations, exist. For the remaining four candidates, their best portable implementations are the same (or almost the same) as their reference implementations submitted at the beginning of Round 2. We used the above-mentioned implementations in C as a starting point for our first software implementation of each of the 12 implemented KEMs, ported to ARM Cortex-A53 using the procedure described in Section 4.5.
The results of profiling for the obtained purely-software implementations, running on a single core of ARM Cortex-A53 at the frequency of 1.2 GHz, are presented in the left portions of Tables 29-38 in Appendix A. For each of the 12 investigated algorithms and each major operation (Encapsulation and Decapsulation), the two to five most time-consuming functions are identified. For each of these functions, we provide its execution time (in microseconds) and its percentage of the total execution time. In the right portions of the same tables, we list in bold the functions offloaded to hardware. Functions that are combined together are listed in the same field of the table, with sub-indices, such as 1.1, 1.2, 1.3, etc. A single execution time and a single percentage of the software/hardware execution time are given for such a combined function.

It is important to note that the execution time of all functions offloaded to hardware, listed in Tables 29-38, includes both the execution time in hardware and the time necessary to transfer control, inputs, and outputs between the processor and a hardware accelerator. It should also be mentioned that the number of functions offloaded to hardware may be misleading, as these functions may appear at different levels of hierarchy. For example, for the encapsulation in Kyber, only two functions are offloaded. However, these are functions involving the majority of operations of Kyber, amounting to 99.55-99.81% of the total execution time in the software-only implementation. For all algorithms, at least the first and the second most time-consuming functions are offloaded to hardware. The total percentage of the execution time taken by a portable software implementation to execute operations offloaded to hardware is shown in Figs. 9 and 10.
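One common way to interpret the offloaded percentages is Amdahl's law, which bounds the total speed-up by the fraction of the software execution time that remains on the processor. This framing is our illustration rather than an analysis taken from the paper, and it ignores the transfer overhead that is included in the measured times of the offloaded functions. The function names below are hypothetical.

```python
def max_speedup(offloaded_fraction: float) -> float:
    """Upper bound on the total speed-up if the offloaded part took zero time."""
    if not 0.0 <= offloaded_fraction < 1.0:
        raise ValueError("offloaded_fraction must be in [0, 1)")
    return 1.0 / (1.0 - offloaded_fraction)


def speedup(offloaded_fraction: float, accelerator_speedup: float) -> float:
    """Amdahl's law: total speed-up when only the offloaded part is accelerated."""
    if accelerator_speedup <= 0.0:
        raise ValueError("accelerator_speedup must be positive")
    remaining = 1.0 - offloaded_fraction
    return 1.0 / (remaining + offloaded_fraction / accelerator_speedup)


if __name__ == "__main__":
    # Offloaded fractions quoted in the text for Kyber encapsulation.
    for fraction in (0.9955, 0.9981):
        print(f"{fraction:.2%} offloaded -> speed-up bound {max_speedup(fraction):.0f}x")
```

Under these assumptions, offloading 99.55% of the execution time caps the achievable speed-up at roughly 222, and 99.81% at roughly 526, regardless of how fast the accelerator is.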
**Figure 9:** Encapsulation: The Software Part Sped Up by Hardware [%]

**Figure 10:** Decapsulation: The Software Part Sped Up by Hardware [%]

### 5.3 Results for Software/Hardware Implementations

Twelve hardware accelerators developed using the methodology described in Section 4 are characterized in Table 24 using their maximum clock frequency and resource utilization when implemented on Xilinx Zynq UltraScale+ SoC FPGA. All results have been obtained after placing and routing.

**Table 24:** Maximum frequency and resource utilization of hardware accelerators developed as a part of software/hardware co-designs targeting Zynq UltraScale+

| Algorithm | Parameter Set | Max. Freq. [MHz] | LUT | FF | Slice | DSP | BRAM |
|----------------|---------------------|------|--------|---------|--------|-----|------|
| FrodoKEM | Frodo-640 | 402 | 7,213 | 6,647 | 1,186 | 32 | 13.5 |
| FrodoKEM | Frodo-976 | 402 | 7,087 | 6,693 | 1,190 | 32 | 17.0 |
| FrodoKEM | Frodo-1344 | 417 | 7,015 | 6,610 | 1,215 | 32 | 17.5 |
| Kyber | KYBER512 | 410 | 12,034 | 10,532 | 2,327 | 8 | 14.0 |
| Kyber | KYBER768 | 405 | 12,195 | 10,461 | 2,253 | 8 | 14.0 |
| Kyber | KYBER1024 | 405 | 12,589 | 12,574 | 2,635 | 8 | 14.0 |
| LAC-v3a | LAC-128 | 385 | 25,123 | 16,005 | 3,720 | 0 | 8.5 |
| LAC-v3a | LAC-192 | 370 | 41,898 | 26,233 | 6,134 | 0 | 11.5 |
| LAC-v3a | LAC-256 | 357 | 46,756 | 26,989 | 6,774 | 0 | 11.5 |
| LAC-v3b | LAC-128 | 400 | 18,311 | 15,966 | 2,672 | 0 | 8.5 |
| LAC-v3b | LAC-192 | 385 | 27,209 | 26,193 | 4,024 | 0 | 11.5 |
| LAC-v3b | LAC-256 | 357 | 33,234 | 26,567 | 4,889 | 0 | 11.5 |
| NewHope | NEWHOPE512-CCA-KEM | 490 | 9,307 | 8,928 | 1,721 | 4 | 11.0 |
| NewHope | NEWHOPE1024-CCA-KEM | 490 | 9,307 | 8,928 | 1,721 | 4 | 11.0 |
| NTRU-HPS | ntruhps2048677 | 200 | 42,578 | 22,717 | 8,235 | 677 | 8.5 |
| NTRU-HPS | ntruhps4096821 | 200 | 49,735 | 30,599 | 9,924 | 821 | 8.5 |
| NTRU-HRSS | ntruhrss701 | 200 | 48,773 | 25,178 | 8,110 | 701 | 2.5 |
| NTRU LPRime | kem/ntrulpr653 | 278 | 45,901 | 39,426 | 8,938 | 0 | 8.0 |
| NTRU LPRime | kem/ntrulpr761 | 263 | 55,054 | 45,133 | 9,769 | 0 | 8.0 |
| NTRU LPRime | kem/ntrulpr857 | 250 | 64,022 | 50,120 | 10,554 | 0 | 8.0 |
| Str NTRU Prime | kem/sntrup653 | 278 | 62,797 | 33,531 | 9,110 | 0 | 9.0 |
| Str NTRU Prime | kem/sntrup761 | 263 | 70,066 | 38,144 | 10,319 | 0 | 9.0 |
| Str NTRU Prime | kem/sntrup857 | 250 | 78,379 | 42,274 | 11,509 | 0 | 9.0 |
| Round5 | R5ND_CCA_1KEM_0d | 294 | 52,589 | 80,875 | 10,154 | 0 | 3.0 |
| Round5 | R5ND_CCA_3KEM_0d | 267 | 72,870 | 107,748 | 13,360 | 0 | 3.0 |
| Round5 | R5ND_CCA_5KEM_0d | 250 | 99,310 | 156,732 | 18,095 | 0 | 3.0 |
| Round5 | R5ND_CCA_1KEM_5d | 357 | 38,116 | 56,189 | 7,538 | 0 | 3.0 |
| Round5 | R5ND_CCA_3KEM_5d | 294 | 54,532 | 95,436 | 12,395 | 0 | 3.0 |
| Round5 | R5ND_CCA_5KEM_5d | 238 | 69,254 | 114,007 | 12,774 | 0 | 3.0 |
| Saber | LightSaber-KEM | 322 | 12,343 | 11,288 | 1,989 | 256 | 3.5 |
| Saber | Saber-KEM | 322 | 12,566 | 11,619 | 1,993 | 256 | 3.5 |
| Saber | FireSaber-KEM | 322 | 12,555 | 11,881 | 2,341 | 256 | 3.5 |

NewHope, Kyber, and FrodoKEM are able to achieve the highest clock frequencies, above 400 MHz for all parameter sets. LAC has frequencies between 350 and 400 MHz, depending on the variant and security level. The maximum frequency of Round5 decreases significantly with the increase in the security level, especially for the version with the error-correcting code, where the frequency drops from 357 MHz for security level 1 to 238 MHz for security level 5. On the other hand, Saber has the same clock frequency, 322 MHz, for all of its parameter sets. The operating frequencies for the two variants of NTRU Prime are in the range 250-280 MHz. They are limited mainly by the reduction modulo q. To reduce numbers modulo the prime q, we selected the conditional subtraction method, which is relatively simple but comes with a long critical path. NTRU-HPS and NTRU-HRSS have the lowest clock frequency of 200 MHz. These frequencies are affected by the logic for converting polynomials from R/q to S/q and from R/q to S/3.

The accelerators for NTRU-HPS and NTRU-HRSS involve the highest number of integer multiplications performed in parallel. These multiplications in the FPGA fabric are delegated to dedicated DSP units. The DSP units are also taken advantage of in Saber and, to a lesser extent, in FrodoKEM, Kyber, and NewHope. LAC, Round5, NTRU LPRime, and Streamlined NTRU Prime do not involve any integer multiplications in hardware. This is because the coefficients of one of the multiplied polynomials always belong to the set $\{-1, 0, 1\}$.

FrodoKEM is the algorithm with the highest utilization of BRAMs, which reaches 17.5 blocks. The algorithms with the lowest utilization of BRAMs (between 2.5 and 3.5) include NTRU-HRSS, Round5, and Saber. The remaining KEMs require 8-14 BRAMs. Round5, Streamlined NTRU Prime, and NTRU LPRime use the largest number of LUTs, flip-flops (FFs), and Slices. FrodoKEM, NewHope, Kyber, and Saber use the smallest number.
The amount of resources used increases noticeably with the increase in the security level for 8 out of 12 KEMs. The following algorithms have a desirable property that the security levels do not substantially affect resource utilization (except for the small increase in the number of BRAMs in FrodoKEM): FrodoKEM, Kyber, NewHope, and Saber. Because of the timing dependencies, and in particular, the bottleneck caused by SHAKE, our implementation of FrodoKEM cannot be easily sped up by trading additional resources for speed. This example clearly illustrates the potential algorithmic limits on the amount of parallelization (and thus the maximum speed-up), which is independent of the amount of hardware resources available to the designer. The times necessary to load a public key (required for encapsulation) and a secret (private) key (required for decapsulation) are proportional to the size of the respective key and inversely proportional to the maximum clock frequency of a given PQC unit. All transfers are assumed to be conducted using a 64-bit infifo\_data bus. The sizes of keys for all variants of all investigated algorithms are summarized in Table 14. The maximum clock frequencies are listed in Table 24. In Fig. 11, we compare these key loading times for all 12 implemented KEMs. R5ND\_CCA\_KEM\_5d has the shortest key-loading times, and FrodoKEM the longest. However, the differences among these times are relatively minor for all KEMs other than FrodoKEM. They do not exceed a factor of 2 for loading a public key, and 4 for loading a private key. Except for NewHope at the security level 5 and FrodoKEM at all security levels, the public-key loading times stay below 1 $\mu$ s, and private-key loading times below 2 $\mu$ s. Total execution times of our software/hardware implementations are summarized in Fig. 12 for encapsulation, and Fig. 13 for decapsulation. 
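
The key-loading model described above (time proportional to key size and inversely proportional to the clock frequency, over a 64-bit bus) can be written as a small helper. This is a minimal sketch assuming one 64-bit word is accepted per clock cycle with no protocol overhead; the function name and the 800-byte key size in the usage note are illustrative assumptions.

```python
import math


def key_load_time_us(key_bytes, f_max_mhz, bus_bits=64):
    """Estimated time (in microseconds) to stream a key into the accelerator.

    Assumes one bus word per clock cycle and no handshake overhead.
    """
    cycles = math.ceil(key_bytes * 8 / bus_bits)  # number of bus words
    return cycles / f_max_mhz  # MHz == cycles per microsecond
```

For example, `key_load_time_us(800, 410)` models an 800-byte key loaded at the 410 MHz listed for KYBER512 in Table 24: 100 bus words, or roughly a quarter of a microsecond.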
Rankings can be considered separately for three groups of parameter sets listed in Table 14, with the security levels 1 and 2, 3 only, and 4 and 5, respectively. Only the first group contains all 12 investigated algorithms. In the second group, NTRU-HRSS and NewHope are missing, and in the third group, NTRU-HRSS and NTRU-HPS are not represented. In Figs. 12 and 13, KEMs are arranged according to their ranking for security levels 1 and 2. Each execution time is separated into three components: the execution time in hardware (i.e., in the hardware accelerator located in the programmable logic of Zynq UltraScale+ SoC FPGAs), the time required to transfer data and control between the processor and the hardware accelerator, and the execution time in software (i.e., in ARM Cortex-A53). For encapsulation, at least the function randombytes() is assumed to be executed in software to generate a seed for a deterministic random bit generator (typically based on SHAKE) implemented in hardware. For decapsulation, no internal function of a KEM has to be executed in software. We treat an implementation as a software/hardware implementation even if the operation of the processor is limited to sending KEM inputs to and receiving KEM outputs from the hardware accelerator. Hence, our software/hardware implementations of R5ND\_CCA\_KEM\_5d, LAC-v3b, NewHope, Kyber, LAC-v3a, and R5ND\_CCA\_KEM\_0d, which have the shortest execution times of encapsulation and decapsulation, are based on the purely hardware implementations of KEMs described in Section 5.1. The ranking of these six KEMs is similar, but not identical, to the ranking of their corresponding hardware implementations. The small changes in rankings come from small differences in the transfer time and execution time in software, as well as from the different maximum clock frequencies of hardware accelerators when implemented in the programmable logic of Zynq UltraScale+ rather than Artix-7 or Virtex-7 (as in Figs. 6 and 7).
Overall, however, not leaving any operation (other than randombytes()) in software gives these 6 KEMs enough advantage to outperform all six remaining schemes. FrodoKEM is by far the slowest KEM, and it cannot outperform any other scheme even if 100% of its operations are moved to hardware. For encapsulation, NTRU-HPS, Streamlined NTRU Prime, and NTRU LPRime are also very unlikely to move in ranking ahead of any of the first six schemes, because even after reducing their execution time in software to zero and making the transfer time similar to the transfer time of the first six schemes (i.e., in the range of 6.2-7.0 $\mu$s), their execution times would exceed the overall time for the KEM at position 6, R5ND\_CCA\_KEM\_0d.

**Figure 11:** (a) Public Key and (b) Private Key transfer latency ($\mu$s) of SW/HW co-design on Zynq UltraScale+

For Saber and NTRU-HRSS, it is too early to make such a judgment. However, the presented results at least reveal some potential weaknesses of these two algorithms (from the point of view of ease of their software/hardware partitioning), which can be observed by analyzing their profiling results, summarized in Tables 38 and 34. For NTRU-HRSS, even after moving to hardware its four most time-consuming operations, the software still amounts to a significant percentage of the total execution time. In Saber, even after moving to hardware its five most time-consuming operations, transfer time still dominates the total execution time.

**Figure 12:** Encapsulation: Total Execution Time in Software/Hardware [$\mu$s]

**Figure 13:** Decapsulation: Total Execution Time in Software/Hardware [$\mu$s]

This last point can be reinforced by analyzing Table 25. According to this table, our software/hardware implementation of Saber has the largest number of transfers between the processor and the accelerator (6).
Disregarding FrodoKEM, which is very slow in hardware already, NTRU-HRSS is the only other algorithm that requires more than one transfer during encapsulation.

**Table 25:** Data Transfer Summary for SW-HW co-designs

| Algorithm | Encaps. Count | Encaps. Load (bytes) | Encaps. Return (bytes) | Decaps. Count | Decaps. Load (bytes) | Decaps. Return (bytes) |
|-----------------|---|--------|--------|---|--------|--------|
| KYBER512 | 1 | 32 | 768 | 1 | 736 | 32 |
| KYBER768 | 1 | 32 | 1,120 | 1 | 1,088 | 32 |
| KYBER1024 | 1 | 32 | 1,600 | 1 | 1,568 | 32 |
| R5ND_1KEM_0d | 1 | 16 | 740 | 1 | 724 | 16 |
| R5ND_3KEM_0d | 1 | 24 | 1,103 | 1 | 1,079 | 24 |
| R5ND_5KEM_0d | 1 | 32 | 1,509 | 1 | 1,477 | 32 |
| R5ND_1KEM_5d | 1 | 16 | 620 | 1 | 604 | 16 |
| R5ND_3KEM_5d | 1 | 24 | 934 | 1 | 910 | 24 |
| R5ND_5KEM_5d | 1 | 32 | 1,285 | 1 | 1,253 | 32 |
| LightSaber-KEM | 6 | 1,600 | 704 | 5 | 1,216 | 896 |
| Saber-KEM | 6 | 2,272 | 960 | 5 | 1,216 | 1,280 |
| FireSaber-KEM | 6 | 2,976 | 1,216 | 5 | 1,600 | 1,664 |
| Frodo-640 | 4 | 19,400 | 22,032 | 3 | 9,768 | 22,008 |
| Frodo-976 | 4 | 31,472 | 31,328 | 3 | 15,816 | 31,304 |
| Frodo-1344 | 4 | 42,780 | 43,136 | 3 | 21,728 | 43,104 |
| LAC-128-v3a | 1 | 16 | 736 | 1 | 704 | 32 |
| LAC-192-v3a | 1 | 32 | 1,384 | 1 | 1,352 | 32 |
| LAC-256-v3a | 1 | 32 | 1,496 | 1 | 1,464 | 32 |
| LAC-128-v3b | 1 | 16 | 736 | 1 | 704 | 32 |
| LAC-192-v3b | 1 | 32 | 1,384 | 1 | 1,352 | 32 |
| LAC-256-v3b | 1 | 32 | 1,496 | 1 | 1,464 | 32 |
| NEWHOPE512 | 1 | 32 | 960 | 1 | 928 | 32 |
| NEWHOPE1024 | 1 | 32 | 1,856 | 1 | 1,824 | 32 |
| ntruhps2048677 | 1 | 720 | 1,372 | 2 | 2,318 | 1,408 |
| ntruhps4096821 | 1 | 864 | 1,708 | 2 | 2,906 | 1,692 |
| ntruhrss701 | 2 | 1,024 | 1,436 | 3 | 2,854 | 1,512 |
| kem/sntrup653 | 1 | 40 | 1,448 | 1 | 912 | 72 |
| kem/sntrup761 | 1 | 40 | 1,664 | 1 | 1,048 | 72 |
| kem/sntrup857 | 1 | 40 | 1,856 | 1 | 1,192 | 72 |
| kem/ntrulpr653 | 1 | 40 | 1,584 | 1 | 1,040 | 72 |
| kem/ntrulpr761 | 1 | 40 | 1,800 | 1 | 1,176 | 72 |
| kem/ntrulpr857 | 1 | 40 | 1,992 | 1 | 1,320 | 72 |

For decapsulation, the execution time in software is eliminated entirely for the first six KEMs in the ranking. The transfer time is similar for all algorithms from that group. The transfer time dominates the execution time in Saber, because, as shown in Table 25, five transfers are required, more than for any other algorithm. The transfer time is also unusually long for NTRU-HRSS and NTRU-HPS (with 3 and 2 transfers, respectively). As a result, it might be too early to judge whether Saber, NTRU-HRSS, and NTRU-HPS can be made as efficient as the first six KEMs after moving all their operations to hardware. On the other hand, for NTRU LPRime and Streamlined NTRU Prime there is already a strong indication that these algorithms will not be able to move ahead of any of the first six KEMs, even if implemented entirely in hardware. Finally, FrodoKEM is by far the slowest algorithm out of all 12 implemented in this study.

#### 5.4 Use of High-Level Synthesis

A traditional approach to high-level synthesis is based on starting from the existing implementation in C, C++, or SystemC, and then introducing modifications aimed at:

- inferring the desired interface
- optimizing speed
- minimizing resource utilization.

In the case of PQC candidates, a starting point is naturally determined by either the reference implementation or the best portable implementation written entirely in C, such as those described in Section 4.5, used as a starting point for the software/hardware co-design. Traditionally, a significant percentage of all modifications amounts to guiding the synthesis tool toward the desired outcome using the C language pragmas, ignored by traditional high-level language compilers, but treated as directives by a given synthesis tool.
This approach is used, in particular, in two popular HLS tools targeting FPGAs, Vivado HLS and LegUp. For example, there are over 20 pragma directives in the current version of Vivado HLS. Their different combinations lead to different hardware architectures. The impact of a particular pragma directive is heavily dependent on the code structure and the algorithm. Some directives may have no impact at all; others may dramatically change the speed vs. cost trade-off. Exploring all possible combinations is often unrealistic. Additionally, in many cases, code refactoring may give better results than an optimal choice and placement of directives. The first attempt at applying HLS to benchmarking PQC schemes was reported in [14]. Only a few directives aimed at unrolling loops and pipelining were applied. The authors also attempted to synthesize the C code of the entire algorithm. Taking into account the limited availability of RTL code, no comparison with the equivalent RTL code was attempted. An outcome of this approach is quite clearly illustrated in Table 19, where HLS designs of two representative algorithms, Kyber-512 and NewHope-512, are shown to be at least an order of magnitude slower than the equivalent RTL designs. Additionally, these HLS designs are shown to use at least an order of magnitude more LUTs. When the product of the Decapsulation Time and the number of LUTs is considered, a difference of at least three orders of magnitude emerges. As a result, a substantially different approach was needed to overcome this inefficiency. This approach was demonstrated by our group in [22], [24], [55], and [56]. First, HLS is combined with software/hardware co-design. This way, only the most time-consuming operations (and preferably a single operation) need to be offloaded to hardware. These operations can be identified using techniques described in Section 4.6. Second, these critical operations are described using block diagrams.
Third, the block diagrams are translated into HLS-ready C code, written from scratch, and enhanced with HLS directives encoded using pragmas. The designer then debugs the code using a C testbench, which is much easier to develop and use than an HDL testbench. When the code is determined to be functionally correct, it is passed through synthesis. If the number of clock cycles is different from the expected number obtained from the analysis of the block diagram, additional pragmas need to be added, or the code needs to be refactored (rewritten) to make it more suitable for HLS tools. For example, a programmer may apply explicit function sharing or eliminate dependencies preventing multiple operations from executing in parallel. These optimizations continue at least until the required number of clock cycles is reached. They may also be applied to reduce the number of specific logic resources, such as LUTs, DSP units, or BRAMs, as well as to increase the maximum clock frequency.

**Table 26:** Comparison between HLS and RTL method for NTT implementations.

| Method | DSP | BRAM | LUT | FF | Slice | Max. Freq. (MHz) | Latency (cycles) |
|---------|------|------|-------|-------|-------|------|-------|
| **NewHope-512 with 1 NTT module** | | | | | | | |
| HLS | 4 | 3 | 1,181 | 1,403 | 239 | 454 | 3,247 |
| RTL | 4 | 3 | 1,040 | 940 | 190 | 476 | 3,247 |
| HLS/RTL | 1.00 | 1.00 | 1.14 | 1.49 | 1.26 | 0.95 | 1.00 |
| **NewHope-1024 with 1 NTT module** | | | | | | | |
| HLS | 4 | 5 | 1,110 | 1,342 | 219 | 455 | 6,266 |
| RTL | 4 | 5 | 842 | 803 | 170 | 476 | 6,266 |
| HLS/RTL | 1.00 | 1.00 | 1.32 | 1.67 | 1.29 | 0.96 | 1.00 |
| **Kyber512 with 2 NTT modules** | | | | | | | |
| HLS | 24 | 7 | 2,325 | 2,346 | 430 | 455 | 1,271 |
| RTL | 24 | 5 | 2,040 | 3,223 | 433 | 500 | 1,271 |
| HLS/RTL | 1.00 | 1.40 | 1.14 | 0.73 | 0.99 | 0.91 | 1.00 |
| **Kyber768 with 3 NTT modules** | | | | | | | |
| HLS | 36 | 11 | 5,379 | 4,043 | 1,074 | 416 | 1,271 |
| RTL | 36 | 7.5 | 3,054 | 5,098 | 637 | 500 | 1,271 |
| HLS/RTL | 1.00 | 1.47 | 1.76 | 0.79 | 1.69 | 0.83 | 1.00 |
| **Kyber1024 with 4 NTT modules** | | | | | | | |
| HLS | 48 | 14 | 7,111 | 5,457 | 1,374 | 416 | 1,271 |
| RTL | 48 | 10 | 4,055 | 6,803 | 960 | 500 | 1,271 |
| HLS/RTL | 1.00 | 1.40 | 1.75 | 0.80 | 1.43 | 0.83 | 1.00 |

Below, we demonstrate the application of this approach to the preliminary software/hardware implementations of two Round 2 candidates: NewHope and Kyber. In both implementations, we decided to offload to hardware only the most time-consuming operation, the Number Theoretic Transform (NTT). This operation was first expressed using a detailed block diagram, presented in [56]. Then, the methodology described above was followed. In parallel, an optimized RTL implementation was developed for the purpose of evaluating the quality of our HLS design. The obtained results are summarized in Table 26. These results indicate that our primary goal of reaching the same number of clock cycles as that obtained using the RTL approach was accomplished.
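
The HLS/RTL rows of Table 26 can be recomputed from the raw values, which also gives the worst-case overhead of HLS across all five designs. The snippet below is only a worked check; the dictionary layout and the function name are illustrative choices, while the numbers are copied from Table 26.

```python
# (LUT, FF, Slice, max. clock frequency in MHz), copied from Table 26
TABLE_26 = {
    "NewHope-512":  {"HLS": (1181, 1403, 239, 454),  "RTL": (1040, 940, 190, 476)},
    "NewHope-1024": {"HLS": (1110, 1342, 219, 455),  "RTL": (842, 803, 170, 476)},
    "Kyber512":     {"HLS": (2325, 2346, 430, 455),  "RTL": (2040, 3223, 433, 500)},
    "Kyber768":     {"HLS": (5379, 4043, 1074, 416), "RTL": (3054, 5098, 637, 500)},
    "Kyber1024":    {"HLS": (7111, 5457, 1374, 416), "RTL": (4055, 6803, 960, 500)},
}


def worst_case_ratios(table):
    """Largest HLS/RTL resource ratios and smallest HLS/RTL frequency ratio."""
    ratios = [
        tuple(h / r for h, r in zip(row["HLS"], row["RTL"]))
        for row in table.values()
    ]
    lut, ff, slc, freq = zip(*ratios)
    return {"LUT": max(lut), "FF": max(ff), "Slice": max(slc), "Freq": min(freq)}
```

Calling `worst_case_ratios(TABLE_26)` and rounding to two digits reproduces the extreme entries of the HLS/RTL rows: the largest resource ratios and the smallest frequency ratio.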
At the same time, the clock frequency was lower by up to 17%, and the number of LUTs, flip-flops, and slices higher by up to 76%, 67%, and 69%, respectively. Overall, the RTL- and HLS-based approaches to the design of a hardware accelerator for NTT led to almost the same total speed-up of the software/hardware implementation. At the same time, the development time was several times shorter for the HLS-based approach. The disadvantage of our approach is the need for a detailed block diagram, which requires either hardware expertise within a team of HLS programmers or collaboration with a group of hardware designers. Additionally, most of the HLS-ready code needs to be written from scratch. The reduction in clock frequency plays a secondary role, as it typically does not significantly affect the overall speed-up of the software/hardware implementation over the portable C code. Similarly, for high-speed implementations, the exact resource utilization plays a secondary role and does not affect the ranking of candidates.

#### 5.5 Results for Software Implementations Optimized Using NEON Instructions of ARM

On the selected platform, Zynq UltraScale+, a reference implementation of a PQC scheme in C can be optimized using several approaches shown in Fig. 14. First, basic optimizations may still be possible in C without affecting the portability of the code. From here, two divergent paths are worth investigating. First, the Optimized Portable Implementation in C can be turned into a Software/Hardware Implementation in C and HDL, using the methodology described in this paper in Sections 4.3, 4.4, 4.6, and 4.7. After all C functions are moved to hardware, this implementation becomes a pure hardware implementation in HDL.

**Figure 14:** Types of optimized implementations.

**Table 27:** Comparison between Software Implementations using NEON instructions and Software-Hardware co-designs.

| Algorithm | Ref Encaps. ($\mu$s) | Ref Decaps. ($\mu$s) | Neon Encaps. ($\mu$s) | Neon Decaps. ($\mu$s) | SW-HW Encaps. ($\mu$s) | SW-HW Decaps. ($\mu$s) | Ref/Neon Encaps. (ratio) | Ref/Neon Decaps. (ratio) | Ref/SW-HW Encaps. (ratio) | Ref/SW-HW Decaps. (ratio) | Neon/SW-HW Encaps. (ratio) | Neon/SW-HW Decaps. (ratio) |
|----------------|---------|----------|-------|-------|------|-------|------|------|------|-------|-------|-------|
| NewHope-1024 | 723.5 | 891.2 | 338.3 | 363.6 | 20.8 | 23.8 | 2.1 | 2.5 | 34.8 | 37.4 | 16.27 | 15.27 |
| ntruhrss701 | 2,964.5 | 8,789.8 | 123.7 | 252.1 | 68.3 | 135.6 | 24.0 | 34.9 | 43.4 | 64.8 | 1.81 | 1.86 |
| ntruhps2048677 | 2,961.5 | 8,174.9 | 337.7 | 339.5 | 41.2 | 95.3 | 8.8 | 24.1 | 71.9 | 85.8 | 8.20 | 3.56 |
| ntruhps4096821 | 4,285.1 | 11,981.7 | 401.4 | 434.3 | 48.4 | 107.1 | 10.7 | 27.6 | 88.5 | 111.9 | 8.29 | 4.06 |
| LightSaber | 373.4 | 470.6 | 173.1 | 189.7 | 49.0 | 52.5 | 2.2 | 2.5 | 7.6 | 9.0 | 3.53 | 3.61 |
| Saber | 722.1 | 867.2 | 311.9 | 341.1 | 56.9 | 64.7 | 2.3 | 2.5 | 12.7 | 13.4 | 5.48 | 5.28 |
| FireSaber | 1,181.0 | 1,376.4 | 489.1 | 534.9 | 65.2 | 77.1 | 2.4 | 2.6 | 18.1 | 17.9 | 7.50 | 6.94 |

An alternative path is based on the use of SIMD instructions of ARMv8-A, referred to as NEON instructions. These instructions can be called from C using the so-called intrinsics. NEON intrinsics are function calls that the compiler replaces with an appropriate NEON instruction or a sequence of NEON instructions. Intrinsics provide almost as much control as writing assembly language but leave the allocation of registers to the compiler [6]. Operations that cannot take advantage of vector instructions are left in C.
This path can be further extended into pure assembly language code. This code may consist of hand-coded NEON assembly language instructions, as well as the remaining (so-called scalar) assembly language instructions of ARMv8-A. We have developed NEON-optimized implementations in C with NEON intrinsics for 4 investigated KEMs, representing 3 Round 2 PQC candidates, namely NewHope, NTRU-HPS, NTRU-HRSS, and Saber. Our starting point consisted of optimized implementations of these algorithms, targeting Intel and AMD processors, using AVX2 (Advanced Vector Extensions 2). In Table 27, we compare the performance of NEON-optimized software implementations (based on intrinsics) with the performance of our software/hardware implementations. Our software/hardware implementations appear to be superior for all investigated candidates and parameter sets. Compared to the software implementation based on NEON intrinsics, the execution times for NewHope-1024 are over 15 times shorter in software/hardware. For NTRU-HPS, the software/hardware implementation is about 8 times faster for encapsulation and 3.5-4.0 times faster for decapsulation. For Saber, the ratios are approximately the same for encapsulation and decapsulation. However, the advantage of software/hardware increases with the security level. Finally, NTRU-HRSS has the smallest ratios, in the range of 1.8-1.9. Multiple conference papers have been devoted to the NEON-based implementation of a single public-key cryptosystem [16, 63, 50, 10, 66, 64]. These papers demonstrate that developing optimized implementations based on NEON intrinsics, hand-coded NEON assembly language code, and hand-coded ARMv8-A RISC assembly language code is at least as complex and labor-intensive as the development of optimized software/hardware implementations.
The advantages of NEON-based implementations include: a) a software-only paradigm, with no need for expertise in hardware or knowledge of HDLs, b) NEON vector instructions run at a higher clock frequency than an FPGA-based hardware accelerator, c) using the NEON co-processor involves minimal (if any) transfer overhead, d) the NEON co-processor can potentially be reused for non-cryptographic operations, such as signal and image processing. The primary advantages of the software/hardware implementations are: a) Programmable logic is much more powerful and less restrictive than the NEON co-processor in terms of the number and type of operations that can be executed in parallel. As a result, a higher overall speed-up is accomplished. b) Hardware written in HDL is likely to be more portable than software written in the assembly language of a particular processor. In particular, our software/hardware implementations can be ported to any other modern SoC FPGA, assuming that the amount of the required hardware resources does not exceed the capabilities of the programmable logic of a given SoC device.

### 6 Comparison with performance of the AVX2-optimized software implementations

In Table 28, we compare the performance of our software/hardware implementations, running on Zynq UltraScale+, with the performance of the best software implementations available to date, running on Intel Xeon E3-1220 v3 (3.1 GHz). When comparing these implementations, one needs to keep in mind that the software portions of our implementations are written in portable C and run on a much less powerful processor, ARM Cortex-A53, at the frequency of 1.2 GHz. Hardware portions run in the programmable logic of Zynq UltraScale+, at a frequency specific to each algorithm, listed in Table 24, ranging from 200 MHz for NTRU-HPS and NTRU-HRSS, through 322 MHz for Saber, to 490 MHz for NewHope. Even the frequency of NewHope is over 6 times lower than the frequency of Intel Xeon.
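
The relation between the columns of Table 28 can be made explicit with a short calculation: the software time follows from the median cycle count and the 3100 MHz clock of the Intel Xeon, and the ratio is the software time divided by the software/hardware time. This is a minimal sketch; the function names are hypothetical, and the sample values are the kyber768 encapsulation entries of Table 28.

```python
def sw_time_us(median_cycles, cpu_mhz=3100):
    """Software execution time in microseconds from a median cycle count."""
    return median_cycles / cpu_mhz  # MHz == cycles per microsecond


def speedup(sw_us, swhw_us):
    """Ratio > 1 means the software/hardware implementation is faster."""
    return sw_us / swhw_us


# kyber768 encapsulation: 74040 median cycles in software, 18.0 us in SW/HW
kyber768_sw = sw_time_us(74040)
kyber768_ratio = speedup(kyber768_sw, 18.0)
```

With these inputs, `kyber768_sw` rounds to 23.9 and `kyber768_ratio` to 1.33, matching the corresponding row of the table.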
Additionally, all compared software implementations are optimized using AVX2 instructions, which let them take advantage of the parallelism present in each algorithm. Under these circumstances, it is no surprise that Zynq UltraScale+ can outperform Intel Xeon only when its software/hardware implementation is fully optimized by moving all operations other than randombytes() to programmable logic. Such implementations of Kyber, LAC, NewHope, and Round5 outperform the best software implementations for both encapsulation and decapsulation. For encapsulation, the speed-ups vary from 1.33 for Kyber at level 3 to 4.67 for Round5 without error correction at level 3. For decapsulation, the speed-ups vary from 1.03 for Kyber at level 3 to 4.54 for LAC at level 5. The only exceptions are Kyber at level 1, which reaches only the ratio of 0.94 for encapsulation and 0.71 for decapsulation, and Round5 without error correction, which reaches only the ratio of 0.72 for decapsulation at level 1.

Somewhat surprisingly, our software/hardware implementation of FrodoKEM also outperforms the best software implementation, even though the percentage of operations offloaded to hardware in FrodoKEM is the smallest among all implemented KEMs, as shown in Figs. 9 and 10.

**Table 28:** Comparison of the GMU software/hardware implementations, running on Zynq UltraScale+, with the software implementations in supercop-20200525 running on Intel Xeon E3-1220 v3 (3100 MHz)

| Algorithm | Median cycles | SW ($\mu$s) | SW/HW ($\mu$s) | Ratio |
|-------------------|----------|---------|---------|------|
| **Encapsulation** | | | | |
| **Level 1 & 2** | | | | |
| ntruhrss701 | 26116 | 8.4 | 68.3 | 0.12 |
| ntruhps2048677 | 35352 | 11.4 | 41.2 | 0.28 |
| kyber512 | 44404 | 14.3 | 15.3 | 0.94 |
| sntrup653 | 46620 | 15.0 | 48.5 | 0.34 |
| lightsaber2 | 67568 | 21.8 | 49.0 | 0.44 |
| ntrulpr653 | 69400 | 22.4 | 51.6 | 0.43 |
| lac128 | 82684 | 26.7 | 15.9 | 1.67 |
| r5nd1kem0d | 89500 | 28.9 | 16.7 | 1.73 |
| newhope512cca | 109040 | 35.2 | 14.6 | 2.42 |
| r5nd1kem5d | 122492 | 39.5 | 13.8 | 2.85 |
| frodokem640shake | 4529184 | 1,461.0 | 1,223.0 | 1.19 |
| **Level 3** | | | | |
| ntruhps4096821 | 43100 | 13.9 | 48.4 | 0.29 |
| sntrup761 | 48780 | 15.7 | 55.5 | 0.28 |
| ntrulpr761 | 72372 | 23.3 | 59.6 | 0.39 |
| kyber768 | 74040 | 23.9 | 18.0 | 1.33 |
| saber2 | 115948 | 37.4 | 56.9 | 0.66 |
| lac192 | 158628 | 51.2 | 21.4 | 2.39 |
| r5nd3kem5d | 209572 | 67.6 | 19.2 | 3.52 |
| r5nd3kem0d | 317244 | 102.3 | 21.9 | 4.67 |
| frodokem976shake | 9467152 | 3,053.9 | 1,642.5 | 1.86 |
| **Level 4 & 5** | | | | |
| sntrup857 | 60668 | 19.6 | 63.4 | 0.31 |
| ntrulpr857 | 91416 | 29.5 | 67.3 | 0.44 |
| kyber1024 | 103936 | 33.5 | 22.1 | 1.51 |
| firesaber2 | 175844 | 56.7 | 65.2 | 0.87 |
| lac256 | 188244 | 60.7 | 23.8 | 2.55 |
| newhope1024cca | 201772 | 65.1 | 20.8 | 3.13 |
| r5nd5kem5d | 368004 | 118.7 | 26.0 | 4.57 |
| r5nd5kem0d | 392492 | 126.6 | 29.2 | 4.34 |
| frodokem1344shake | 16379980 | 5,283.9 | 2,186.2 | 2.42 |
| **Decapsulation** | | | | |
| **Level 1 & 2** | | | | |
| kyber512 | 37600 | 12.1 | 17.1 | 0.71 |
| r5nd1kem0d | 43000 | 13.9 | 19.3 | 0.72 |
| sntrup653 | 59324 | 19.1 | 66.9 | 0.29 |
| ntruhps2048677 | 62004 | 20.0 | 95.3 | 0.21 |
| r5nd1kem5d | 63624 | 20.5 | 15.7 | 1.31 |
| ntruhrss701 | 63632 | 20.5 | 135.6 | 0.15 |
| lightsaber2 | 69508 | 22.4 | 52.5 | 0.43 |
| ntrulpr653 | 82732 | 26.7 | 70.9 | 0.38 |
| lac128 | 105388 | 34.0 | 17.1 | 1.99 |
| newhope512cca | 109728 | 35.4 | 15.1 | 2.35 |
| frodokem640shake | 4494652 | 1,449.9 | 1,321.3 | 1.10 |
| **Level 3** | | | | |
| sntrup761 | 59120 | 19.1 | 78.9 | 0.24 |
| kyber768 | 63916 | 20.6 | 20.1 | 1.03 |
| ntruhps4096821 | 79448 | 25.6 | 107.1 | 0.24 |
| ntrulpr761 | 85908 | 27.7 | 84.1 | 0.33 |
| r5nd3kem5d | 117028 | 37.8 | 22.8 | 1.65 |
| saber2 | 118848 | 38.3 | 64.7 | 0.59 |
| r5nd3kem0d | 156692 | 50.5 | 27.0 | 1.87 |
| lac192 | 243008 | 78.4 | 23.7 | 3.30 |
| frodokem976shake | 9380108 | 3,025.8 | 1,866.2 | 1.62 |
| **Level 4 & 5** | | | | |
| sntrup857 | 80904 | 26.1 | 86.8 | 0.30 |
| kyber1024 | 91628 | 29.6 | 24.7 | 1.20 |
| ntrulpr857 | 112116 | 36.2 | 97.5 | 0.37 |
| firesaber2 | 182136 | 58.8 | 77.1 | 0.76 |
| r5nd5kem0d | 193228 | 62.3 | 35.9 | 1.73 |
| newhope1024cca | 206248 | 66.5 | 23.8 | 2.79 |
| r5nd5kem5d | 209136 | 67.5 | 31.7 | 2.13 |
| lac256 | 377784 | 121.9 | 26.9 | 4.54 |
| frodokem1344shake | 16312844 | 5,262.2 | 3,119.9 | 1.69 |

### 7 Conclusions

In this paper, we first reviewed the previous work on hardware and software/hardware implementations of Round 2 PQC schemes. Out of 26 candidates, six - NewHope, CRYSTALS-Kyber, FrodoKEM, Saber, Round5, and SIKE - received the highest coverage in terms of the number of implementations and related publications. All of them have both high-speed and lightweight implementations reported. Candidates with Register-Transfer Level (RTL) high-speed implementations and no lightweight implementations include LAC, Classic McEliece, Picnic, and Rainbow. The publications on BIKE focused on key generation and decoding but did not report results for the entire KEM or PKE. Candidates with at least one software/hardware lightweight implementation but no RTL high-speed implementations include LEDAcrypt, CRYSTALS-DILITHIUM, and qTESLA. The coverage of the following candidates was limited to High-Level Synthesis implementations: SPHINCS+ and MQDSS. We are not aware of any publications on hardware or software/hardware implementations of Three Bears, HQC, NTS-KEM (before the merger with Classic McEliece), ROLLO, RQC, FALCON, GeMSS, and LUOV. With a few exceptions, the majority of lightweight implementations were software/hardware implementations based on RISC-V.
The lattice-based family received by far the most extensive coverage. The following candidates from other families were shown to be competitive with lattice-based cryptography in terms of speed: Classic McEliece for encryption and key exchange, and Picnic and Rainbow for digital signatures. However, all of them were investigated primarily from the point of view of high-speed implementations. In terms of the comparison of the lattice-based schemes, the previous publications were somewhat inconclusive. The largest differences were demonstrated in studies targeting ASIC implementations. These studies indicated the significant advantage of Kyber and NewHope over LAC, Saber, and FrodoKEM in terms of the execution times of encapsulation and decapsulation, as well as power consumption and energy usage. The benchmarking of lattice-based signature schemes was limited to CRYSTALS-DILITHIUM and qTESLA. The conclusions were complicated by the withdrawal of heuristic parameters of qTESLA by the submitters on Aug. 20, 2019, and very limited coverage of the remaining parameter sets. Due to the timing constraints, in our study, we decided to focus on 12 CCA-secure Key Encapsulation Mechanisms (KEMs) representing 8 out of 9 lattice-based key exchange schemes (all except Three Bears). Taking into account that even for this subset of candidates, the development of full RTL implementations appeared to be beyond the capabilities of a single group, we investigated the use of two techniques to speed up the development process: software/hardware co-design and High-Level Synthesis. A hybrid of these two approaches, with some modifications to the traditional HLS methodology, appeared to give quite promising results. However, we eventually devoted most of our effort to software/hardware co-design based on the merger of the RTL HDL code and optimized C code.
Unlike other groups, we applied software/hardware co-design to high-speed rather than lightweight implementations, which led to the choice of Xilinx Zynq UltraScale+, a state-of-the-art SoC FPGA family, as our primary platform. The differentiating factor is that this platform includes a hardwired ARM Cortex-A53 processor operating at the frequency of 1.2 GHz and a significant amount of programmable logic supporting hardware accelerators operating at clock frequencies of up to 500 MHz. Still, our designs remained almost completely portable due to leaving the software portion in C and modeling the hardware portion in hardware description languages, such as VHDL, Verilog, and Chisel. The detailed design methodology is described in this paper, and the corresponding code required to build a generic benchmarking platform, suitable for performing timing measurements of hardware and software/hardware co-designs, is available for other groups to adopt. It is also our intention to make our implementations of PQC candidates open-source after the corresponding publications are accepted to peer-reviewed conferences or journals. Our software/hardware co-design approach was successfully applied to all 12 KEMs mentioned above. For each KEM, multiple parameter sets, typically corresponding to three security levels, were supported. In order to determine the FPGA resources required for each parameter set individually, the choice between parameter sets is made during logic synthesis rather than at run time. For all algorithms other than FrodoKEM, the percentage of the original execution time in software taken by operations offloaded to hardware exceeded 97.4% for decapsulation and 96.9% for encapsulation. For FrodoKEM, operations taking at least 94% of the execution time in software were offloaded to hardware. Significant speed-ups ranging between 7.6 and 111.1 were obtained versus a portable implementation in C, running on ARM Cortex-A53.
More importantly, even when four KEMs representing three candidates - NewHope, NTRU, and Saber - were optimized in software by using NEON intrinsics, corresponding to special SIMD instructions of ARM, our software/hardware implementations maintained the lead by a factor ranging from 1.81 for encapsulation in NTRU-HRSS to 16.27 for encapsulation in NewHope. Finally, as an ultimate test, our implementations were compared with the software implementations optimized using AVX2 vector instructions, running on Intel Xeon E3-1220 v3 at the frequency of 3.1 GHz. For each security level, between 4 and 6 software/hardware implementations, running on Zynq UltraScale+ with the 1.2 GHz ARM core, were superior to the corresponding AVX2 implementations. For each candidate, an attempt was made to offload as many operations as possible to hardware. For 50% of investigated KEMs, this percentage reached 100%. Thus, the corresponding implementations could be treated as hardware implementations, assuming that a random seed (of the size of 16, 24, or 32 bytes) was transferred to the hardware module during encapsulation. KEMs implemented using this approach included Kyber, LAC (v3a and v3b), NewHope, and Round5 (with and without error-correcting code). Their code was benchmarked using Artix-7 and Virtex-7 FPGAs. In terms of both the execution times and resource utilization, Round5 with an error-correcting code (R5ND\_5d) outperformed Round5 without an error-correcting code (R5ND\_0d). Similarly, LAC-v3b appeared superior to LAC-v3a in terms of both speed and use of FPGA resources. Then, when the best representatives of four candidates - Kyber, LAC, NewHope, and Round5 - were compared, the following conclusions could be drawn. The execution times of these candidates were extremely close to one another. For encapsulation, the execution times were within 10% of one another at the security level 5, within 22% at the security level 3, and within 32% at the security level 1.
For decapsulation, the largest differences were 26% at level 5, 22% at level 3, and 48% at level 1. In multiple instances, merely changing the FPGA family from the low-cost Artix-7 to the high-performance Virtex-7 caused a significant change in the rankings, even though the HDL code remained exactly the same. As a result, we must conclude that the differences among these candidates in terms of speed are too small to give preference to any particular candidate. These results contradict one of the earlier reports placing LAC well behind NewHope and Kyber.

In terms of resource utilization, a small advantage belongs to NewHope and Kyber. Both use fewer LUTs and flip-flops than LAC and Round5, and their use of DSP units and BRAMs, although slightly higher, is very moderate. Additionally, both NewHope and Kyber use almost the same amount of resources regardless of the security level, whereas for both LAC and Round5, resource usage increases sharply with the security level. The former property appears to be an advantage for applications requiring support for the highest or all security levels. In particular, k-in-1 designs, which support all k security levels and allow switching among them at run time, typically have only slightly higher resource utilization than a design for the maximum security level alone. Thus, the flat dependence of resource utilization on the security level implies a potential for very cost-effective k-in-1 designs. At the same time, this potential should still be confirmed through complete designs.

For the remaining 6 KEMs, representing FrodoKEM, NTRU, NTRU Prime, and Saber, conclusions could be drawn only by comparing their software/hardware implementations and contrasting them with the corresponding software/hardware implementations of Kyber, LAC, NewHope, and Round5. In this case, all KEMs were implemented in Zynq UltraScale+. Hardware accelerators were assumed to be preloaded with appropriate public and private keys.
Encapsulation started with generating 16-32 random bytes in software and passing these bytes to the hardware accelerator. Decapsulation started with sending the ciphertext to the hardware accelerator. Both operations ended when the shared secret was available in the memory of the processor core.

Our evaluation revealed that FrodoKEM was at least an order of magnitude slower than the remaining investigated KEMs. The ranking of the remaining candidates in hardware could not be determined conclusively based on their software/hardware co-design rankings. The software/hardware co-designs of Saber, NTRU-HRSS, and NTRU-HPS in particular, and somewhat less likely those of Streamlined NTRU Prime and NTRU LPRime, could possibly still be improved significantly by offloading more operations to hardware, up to the point of overtaking one of the first six candidates in the ranking. This pitfall of software/hardware co-designs was identified early in the benchmarking process. It could have been overcome only if the candidates had differed significantly in their hardware efficiency. Such large differences were not identified for the five lattice-based KEMs mentioned above. Consequently, the only way to overcome this inherent weakness of the software/hardware methodology, when applied to this particular set of candidates, is to move all (or almost all) remaining operations of these algorithms to hardware. Doing that is, however, impractical at this point due to the timeline imposed by NIST.

At the same time, because moving more operations of these KEMs to hardware can only increase the resource usage of the corresponding hardware accelerators, it is still fair to compare their resource utilization with that of Kyber and NewHope. For NTRU-HPS and NTRU-HRSS, the concern is a large number of DSP units, exceeding 700 for NTRU-HRSS and 800 for NTRU-HPS at security level 3.
For Streamlined NTRU Prime and NTRU LPRime, the only concern is a relatively large number of LUTs, clearly exceeding that of LAC-v3b and approaching or exceeding that of Round5 with an error-correcting code (R5ND\_5d). Still, it is up to NIST and the cryptographic community to decide whether such relatively small differences in the hardware efficiency of lattice-based candidates should play any role in the Round 3 down-selection process.

#### 8 **Future Work**

Future work will depend on the number and type of candidates qualified for Round 3. Based on the lessons learned from Round 2, the following adjustments may be advisable:

- More focus on hardware implementations vs. software/hardware implementations. Software/hardware implementations may still be helpful for lightweight implementations with a clear resource utilization threshold. In these implementations, moving more operations to hardware may be prohibited by exceeding the resource budget.
- More focus on comparisons across families, rather than within the same family. Round 2 designs illustrate substantial similarities between candidates belonging to the same family but give a hint of more profound differences among representatives of different families.
- More hardware platforms to focus on. The larger the spectrum of platforms, the higher certainty that the reported rankings are not artifacts of a particular platform and will carry over to future generations of integrated circuits. For FPGAs and SoC FPGAs, benchmarking should target families of at least two major vendors, Xilinx and Intel. For ASIC implementations, different standard-cell libraries should be considered. ASIC studies are particularly challenging, as they are more time-consuming and costlier. However, they are indispensable as they may lead to different conclusions than those obtained from FPGA investigations.
- More work on optimized software implementations targeting vector instructions of embedded processors, such as RISC-V and ARM (including NEON instructions).
- Investigation of lightweight implementations protected against side-channel and fault attacks should be conducted by multiple groups, serving interchangeably as attackers and defenders.
- Trade-offs among speed, area, power, energy, and resistance against side-channel attacks should be thoroughly studied, especially for lightweight implementations.

# References

- [1] Erdem Alkim et al. ISA Extensions for Finite Field Arithmetic Accelerating Kyber and NewHope on RISC-V. Tech. rep. 049. 2020.
- [2] Erdem Alkim et al. NewHope Algorithm Specifications and Supporting Documentation Version 1.1. Tech. rep. Apr. 2020.
- [3] Michał Andrzejczak. "The Low-Area FPGA Design for the Post-Quantum Cryptography Proposal Round5". In: 2019 Federated Conference on Computer Science and Information Systems. Vol. 18. Leipzig, Germany, Sept. 2019, pp. 213–219. DOI: 10/ggbsbd.
- [4] Michal Andrzejczak, Farnoud Farahmand, and Kris Gaj. "Full Hardware Implementation of the Post-Quantum Public-Key Cryptography Scheme Round5". In: 2019 International Conference on ReConFigurable Computing and FPGAs (ReConFig). Cancun, Mexico: IEEE, Dec. 2019, pp. 1–2. ISBN: 978-1-72811-957-1. DOI: 10.1109/ReConFig48160.2019.8994765.
- [5] Nicolas Aragon. BIKE: Bit Flipping Key Encapsulation. Tech. rep. May 2020, p. 40.
- [6] ARM. Neon Intrinsics Reference. https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics. 2020.
- [7] ARM. Neon Programmer's Guide for Armv8-A. https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/neon-programmers-guide-for-armv8-a. 2020.
- [8] Roberto Avanzi et al. KYBER: Algorithm Specifications And Supporting Documentation, Version 2.0, NIST PQC Round 2. Tech. rep. Apr. 2019, p. 37.
- [9] Reza Azarderakhsh et al. "Key Compression for Isogeny-Based Cryptosystems". In: Proceedings of the 3rd ACM International Workshop on ASIA Public-Key Cryptography - AsiaPKC '16. Xi'an, China: ACM Press, 2016, pp. 1–10. ISBN: 978-1-4503-4286-5. DOI: 10/ggbsbz.
- [10] Reza Azarderakhsh et al. NEON PQCryto: Fast and Parallel Ring-LWE Encryption on ARM NEON Architecture. Cryptology ePrint Archive 2015/1081. Nov. 2015.
- [11] Brian Baldwin et al. "FPGA Implementations of the Round Two SHA-3 Candidates". In: 2010 International Conference on Field Programmable Logic and Applications, FPL 2010. Milan, Italy, Aug. 2010, pp. 400–407. ISBN: 978-1-4244-7842-2. DOI: 10/bmn2zv.
- [12] Utsav Banerjee, Tenzin S. Ukyab, and Anantha P. Chandrakasan. "Sapphire: A Configurable Crypto-Processor for Post-Quantum Lattice-Based Protocols". In: IACR Transactions on Cryptographic Hardware and Embedded Systems 2019.4 (Aug. 2019). DOI: 10.13154/tches.v2019.i4.17-61.
- [13] Utsav Banerjee, Tenzin S. Ukyab, and Anantha P. Chandrakasan. Sapphire: A Configurable Crypto-Processor for Post-Quantum Lattice-Based Protocols (Extended Version). Cryptology ePrint Archive 2019/1140. Oct. 2019.
- [14] Kanad Basu et al. NIST Post-Quantum Cryptography- A Hardware Evaluation Study. Cryptology ePrint Archive 2019/047. May 2019.
- [15] Daniel J. Bernstein and Tanja Lange. eBACS: ECRYPT Benchmarking of Cryptographic Systems. https://bench.cr.yp.to. 2019.
- [16] Daniel J Bernstein and Peter Schwabe. "NEON Crypto". In: Cryptographic Hardware and Embedded Systems CHES 2012. Vol. 7428. LNCS. Leuven, Belgium, Sept. 2012, pp. 320–339. DOI: 10.1007/978-3-642-33027-8\_19.
- [17] CAESAR: Competition for Authenticated Encryption: Security, Applicability, and Robustness Web Page. https://competitions.cr.yp.to/caesar.html. 2019.
- [18] Cryptographic Engineering Research Group (CERG) at George Mason University. Hardware Benchmarking of CAESAR Candidates. https://cryptography.gmu.edu/athena/index.php?id=CAES 2019.
- [19] Viet B Dang et al. "Implementing and Benchmarking Three Lattice-Based Post-Quantum Cryptography Algorithms Using Software/Hardware Codesign". In: 2019 International Conference on Field Programmable Technology, FPT 2019. Tianjin, China: IEEE, Dec. 9-13, 2019, pp. 206–214. DOI: 10.1109/ICFPT47387.2019.00032.
- [20] Rami Elkhatib, Reza Azarderakhsh, and Mehran Mozaffari-Kermani. Efficient and Fast Hardware Architectures for SIKE Round 2 on FPGA. Cryptology ePrint Archive 2020/611. May 2020.
- [21] Farnoud Farahmand. Benchmarking Setup for Software/Hardware Implementations of PQC Schemes. Sept. 2019.
- [22] Farnoud Farahmand et al. "Evaluating the Potential for Hardware Acceleration of Four NTRU-Based Key Encapsulation Mechanisms Using Software/Hardware Codesign". In: 10th International Conference on Post-Quantum Cryptography, PQCrypto 2019. LNCS. Chongqing, China: Springer, May 2019.
- [23] Farnoud Farahmand et al. "Minerva: Automated Hardware Optimization Tool". In: 2017 International Conference on ReConFigurable Computing and FPGAs, ReConFig 2017. Cancun: IEEE, Dec. 2017, pp. 1–8.
- [24] Farnoud Farahmand et al. "Software/Hardware Codesign of the Post Quantum Cryptography Algorithm NTRUEncrypt Using High-Level Synthesis and Register-Transfer Level Design Methodologies". In: 29th International Conference on Field Programmable Logic and Applications, FPL 2019. Barcelona, Spain: IEEE, Sept. 2019, pp. 225–231. ISBN: 978-1-72814-884-7. DOI: 10.1109/FPL.2019.00042.
- [25] Ahmed Ferozpuri and Kris Gaj. "High-Speed FPGA Implementation of the NIST Round 1 Rainbow Signature Scheme". In: 2018 International Conference on ReConFigurable Computing and FPGAs (ReConFig). Cancun, Mexico: IEEE, Dec. 2018, pp. 1–8. ISBN: 978-1-72811-968-7. DOI: 10/ggbsdm.
- [26] Ahmed Ferozpuri et al. *Hardware API for Post-Quantum Public Key Cryptosystems*. GMU Report. Fairfax, VA: George Mason University, Apr. 2018.
- [27] Tim Fritzmann, Georg Sigl, and Johanna Sepúlveda. RISQ-V: Tightly Coupled RISC-V Accelerators for Post-Quantum Cryptography. Cryptology ePrint Archive 2020/446. Apr. 2020.
- [28] Tim Fritzmann et al. "Towards Reliable and Secure Post-Quantum Co-Processors Based on RISC-V". In: 2019 Design, Automation Test in Europe Conference Exhibition (DATE). Mar. 2019, pp. 1148–1153. DOI: 10.23919/DATE.2019.8715173.
- [29] Kris Gaj. "Challenges and Rewards of Implementing and Benchmarking Post-Quantum Cryptography in Hardware". In: 2018 Great Lakes Symposium on VLSI, GLSVLSI 2018. Chicago, IL, USA: ACM Press, 2018, pp. 359–364. ISBN: 978-1-4503-5724-1. DOI: 10/ggbscs.
- [30] Kris Gaj, Ekawat Homsirikamol, and Marcin Rogawski. "Fair and Comprehensive Methodology for Comparing Hardware Performance of Fourteen Round Two SHA-3 Candidates Using FPGAs". In: Cryptographic Hardware and Embedded Systems, CHES 2010. Vol. 6225. LNCS. Santa Barbara, CA, Aug. 2010, pp. 264–278. ISBN: 978-3-642-15030-2 978-3-642-15031-9. DOI: 10.1007/978-3-642-15031-9\_18.
- [31] Kris Gaj et al. "ATHENa Automated Tool for Hardware EvaluatioN: Toward Fair and Comprehensive Benchmarking of Cryptographic Hardware Using FPGAs". In: 2010 International Conference on Field Programmable Logic and Applications, FPL 2010. Milan, Italy: IEEE, Aug. 2010, pp. 414–421. ISBN: 978-1-4244-7842-2. DOI: 10/d2bzw2.
- [32] Kris Gaj et al. Comprehensive Evaluation of High-Speed and Medium-Speed Implementations of Five SHA-3 Finalists Using Xilinx and Altera FPGAs. Cryptology ePrint Archive 2012/368. 2012.
- [33] E. Homsirikamol et al. Implementer's Guide to Hardware Implementations Compliant with the CAESAR Hardware API. GMU Report. Fairfax, VA: GMU, 2016.
- [34] Ekawat Homsirikamol and Kris Gaj. "Hardware Benchmarking of Cryptographic Algorithms Using High-Level Synthesis Tools: The SHA-3 Contest Case Study". In: Applied Reconfigurable Computing ARC 2015. Vol. 9040. LNCS. Cham: Springer International Publishing, 2015, pp. 217–228. ISBN: 978-3-319-16213-3 978-3-319-16214-0. DOI: 10.1007/978-3-319-16214-0\_18.
- [35] Ekawat Homsirikamol and Kris Gaj. "Toward a New HLS-Based Methodology for FPGA Benchmarking of Candidates in Cryptographic Competitions: The CAESAR Contest Case Study". In: 2017 International Conference on Field Programmable Technology, FPT 2017. Melbourne, Australia: IEEE, Dec. 2017, pp. 120–127. ISBN: 978-1-5386-2656-6. DOI: 10/ggbsf4.
- [36] Ekawat Homsirikamol, Panasayya Yalla, and Farnoud Farahmand. Development Package for Hardware Implementations Compliant with the CAESAR Hardware API. https://cryptography.gmu.edu/athena/index.php?id=CAESAR. 2016.
- [37] Ekawat Homsirikamol et al. CAESAR Hardware API. Cryptology ePrint Archive 2016/626. 2016.
- [38] James Howe. "Optimised Lattice-Based Key Encapsulation in Hardware". In: Second NIST Post-Quantum Cryptography Standardization Conference 2019. Aug. 2019, p. 13.
- [39] Jingwei Hu et al. "Lightweight Key Encapsulation Using LDPC Codes on FPGAs". In: *IEEE Trans. Comput.* (2019). ISSN: 0018-9340, 1557-9956, 2326-3814. DOI: 10.1109/TC.2019.2948323.
- [40] Arpan Jati et al. SPQCop: Side-Channel Protected Post-Quantum Cryptoprocessor. Cryptology ePrint Archive 2019/765. June 2019.
- [41] Daniel Kales et al. "Efficient FPGA Implementations of LowMC and Picnic". In: *The Cryptographers' Track at the RSA Conference 2020, CT-RSA 2020.* San Francisco: Springer, Feb. 2020.
- [42] Matthias J. Kannwischer et al. Pqm4 Post-Quantum Crypto Library for the ARM Cortex-M4. https://github.com/mupq/pqm4. 2019.
- [43] Jens-Peter Kaps et al. "Lightweight Implementations of SHA-3 Candidates on FPGAs". In: 12th International Conference on Cryptology in India, Indocrypt 2011. Vol. 7107. LNCS. Chennai, India, Dec. 2011, pp. 270–289. ISBN: 978-3-642-25577-9 978-3-642-25578-6. DOI: 10.1007/978-3-642-25578-6\_20.
- [44] Miroslav Knezevic et al. "Fair and Consistent Hardware Evaluation of Fourteen Round Two SHA-3 Candidates". In: *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 20.5 (May 2012), pp. 827–840. ISSN: 1063-8210, 1557-9999. DOI: 10/ctjzhr.
- [45] B. Koziel, R. Azarderakhsh, and M. M. Kermani. "A High-Performance and Scalable Hardware Architecture for Isogeny-Based Cryptography". In: *IEEE Transactions on Computers* 67.11 (Nov. 2018), pp. 1594–1609. ISSN: 0018-9340. DOI: 10/gff4vv.
- [46] Brian Koziel et al. "NEON-SIDH: Efficient Implementation of Supersingular Isogeny Diffie-Hellman Key Exchange Protocol on ARM". In: Cryptology and Network Security. Ed. by Sara Foresti and Giuseppe Persiano. Vol. 10052. Cham: Springer International Publishing, 2016, pp. 88–103. ISBN: 978-3-319-48964-3 978-3-319-48965-0. DOI: 10.1007/978-3-319-48965-0\_6.
- [47] Brian Koziel et al. "Post-Quantum Cryptography on FPGA Based on Isogenies on Elliptic Curves". In: *IEEE Transactions on Circuits and Systems I: Regular Papers* 64.1 (Jan. 2017), pp. 86–99. ISSN: 1549-8328, 1558-0806. DOI: 10/gd89pp.
- [48] Brian Koziel et al. SIKE'd Up: Fast and Secure Hardware Architectures for Supersingular Isogeny Key Encapsulation. Cryptology ePrint Archive 2019/711. June 2019, p. 27.
- [49] Weiqiang Liu et al. "High Performance Modular Multiplication for SIDH". In: *IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.* (2019), pp. 1–1. ISSN: 0278-0070, 1937-4151. DOI: 10.1109/TCAD.2019.2960330.
- [50] Patrick Longa. FourQNEON: Faster Elliptic Curve Scalar Multiplications on ARM Processors. Cryptology ePrint Archive 2016/645. July 2016, p. 16.
- [51] Xianhui Lu et al. LAC: Practical Ring-LWE Based Public-Key Encryption with Byte-Level Modulus. Cryptology ePrint Archive 2018/1009. Dec. 2019.
- [52] Pedro Maat C. Massolino et al. "A Compact and Scalable Hardware/Software Co-Design of SIKE". In: IACR Transactions on Cryptographic Hardware and Embedded Systems (Mar. 2020), pp. 245–271. ISSN: 2569-2925. DOI: 10.13154/tches.v2020.i2.245-271.
- [53] Jose Maria Bermudo Mera et al. Compact Domain-Specific Co-Processor for Accelerating Module Lattice-Based Key Encapsulation Mechanism. Cryptology ePrint Archive 2020/321. Mar. 2020, p. 15.
- [54] Richard Newell. Survey of Notable Security-Enhancing Activities in the RISC-V Universe. 17th International Workshop on Cryptographic Architectures Embedded in Logic Devices, CryptArchi 2019. Pruhonice, Czech Republic, June 2019.
- [55] Duc Tri Nguyen, Viet B. Dang, and Kris Gaj. "A High-Level Synthesis Approach to the Software/Hardware Codesign of NTT-Based Post-Quantum Cryptography Algorithms". In: 2019 International Conference on Field-Programmable Technology (ICFPT). Tianjin, China: IEEE, Dec. 2019, pp. 371–374. ISBN: 978-1-72812-943-3. DOI: 10.1109/ICFPT47387.2019.00070.
- [56] Duc Tri Nguyen, Viet B Dang, and Kris Gaj. "High-Level Synthesis in Implementing and Benchmarking Number Theoretic Transform in Lattice-Based Post-Quantum Cryptography Using Software/Hardware Codesign". In: 16th International Symposium on Applied Reconfigurable Computing, ARC 2020. Apr. 2020.
- [57] NIST. PQC API Notes. 2017.
- [58] David Patterson and Andrew Waterman. The RISC-V Reader: An Open Architecture Atlas. Book version: 0.0.1. Strawberry Canyon LLC, Oct. 2017.
- [59] Andrew H Reinders et al. Efficient BIKE Hardware Design with Constant-Time Decoder. Cryptology ePrint Archive 2020/117. Feb. 2020.
- [60] Vincent Rijmen, Antoon Bosselaers, and Paulo Barreto. Optimized ANSI C Code for the Rijndael Cipher (Now AES), Rijndael-Alg-Fst.c, v3.0. Dec. 2000.
- [61] Sujoy Sinha Roy and Andrea Basso. High-Speed Instruction-Set Coprocessor for Lattice-Based Key Encapsulation Mechanism: Saber in Hardware. Cryptology ePrint Archive 2020/434. Apr. 2020.
- [62] Markku-Juhani O. Saarinen. Pqcbench. https://github.com/mjosaarinen/pqcbench. 2019.
- [63] Hwajeong Seo et al. "Efficient Arithmetic on ARM-NEON and Its Application for High-Speed RSA Implementation: Efficient Arithmetic on ARM-NEON". In: Security and Communication Networks 9.18 (Dec. 2016), pp. 5401–5411. ISSN: 19390114. DOI: 10.1002/sec.1706.
- [64] Hwajeong Seo et al. SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange. Cryptology ePrint Archive 2018/700. July 2018.
- [65] Douglas Stebila and Michele Mosca. Liboqs Master Branch. https://github.com/open-quantum-safe/liboqs. 2019.
- [66] Silvan Streit and Fabrizio De Santis. "Post-Quantum Key Exchange on ARMv8-A: A New Hope for NEON Made Simple". In: *IEEE Transactions on Computers* 67.11 (Nov. 2018), pp. 1651–1662. ISSN: 0018-9340, 1557-9956, 2326-3814. DOI: 10/gff3sc.
- [67] FrodoKEM Submission Team. Round 2 Submissions FrodoKEM Candidate Submission Package. https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-2-Submissions. Apr. 2019.
- [68] NTRU Prime Submission Team. Round 2 Submissions NTRU Prime Candidate Submission Package. https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-2-Submissions. Apr. 2019.
- [69] NTRU Submission Team. Round 2 Submissions NTRU Candidate Submission Package. Apr. 2019.
- [70] Round5 Submission Team. Round 2 Submissions Round5 Candidate Submission Package. https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-2-Submissions. Apr. 2019.
- [71] Saber Submission Team. Round 2 Submissions Saber Candidate Submission Package. Apr. 2019.
- [72] Wen Wang, Jakub Szefer, and Ruben Niederhagen. "FPGA-Based Niederreiter Cryptosystem Using Binary Goppa Codes". In: 9th International Conference on Post-Quantum Cryptography, PQCrypto 2018. Ed. by Tanja Lange and Rainer Steinwandt. Vol. 10786. LNCS. Fort Lauderdale, Florida: Springer International Publishing, Apr. 2018, pp. 77–98. ISBN: 978-3-319-79062-6 978-3-319-79063-3. DOI: 10.1007/978-3-319-79063-3\_4.
- [73] Wen Wang et al. Parameterized Hardware Accelerators for Lattice-Based Cryptography and Their Application to the HW/SW Co-Design of qTESLA. Cryptology ePrint Archive 2020/054. Apr. 2020.
- [74] Wen Wang et al. "XMSS and Embedded Systems XMSS Hardware Accelerators for RISC-V". In: Selected Areas in Cryptography SAC 2019. Vol. 11959. LNCS. Waterloo, Ontario, Canada: Springer, 2019, pp. 523–550. - [75] Andrew Waterman and Krste Asanovic. The RISC-V Instruction Set Manual. Volume I: Unprivileged ISA v2.2. Tech. rep. 20190608-Base-Ratified. June 2019, p. 236. - [76] Andrew Waterman and Krste Asanovic. "The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, v1.12". In: (June 2019), p. 113. - [77] Guozhu Xin et al. "VPQC: A Domain-Specific Vector Processor for Post-Quantum Cryptography Based on RISC-V Architecture". In: *IEEE Transactions on Circuits and Systems I: Regular Papers* (2020), pp. 1–13. ISSN: 1558-0806. DOI: 10.1109/TCSI.2020.2983185. [78] Neng Zhang et al. "Highly Efficient Architecture of NewHope-NIST on FPGA Using Low-Complexity NTT/INTT". en. In: *IACR Transactions on Cryptographic Hardware and Embedded Systems* (Mar. 2020), pp. 49–72. ISSN: 2569-2925. DOI: 10.13154/tches.v2020.i2.49-72. #### **Results of Profiling** Α **Table 29:** Results of profiling FrodoKEM | D | Time | Time | <b>.</b> | Time | Time | |-----------------------------------|------------------|--------------|---------------------------------|------------------|----------------| | Function | [us] | [%] | Function | [us] | [%] | | Software | • | | Software/Hardware | | | | | | | Encapsulation | | | | 1. frodo_mul_add_sa_plus_e | 58,577.48 | 94.36 | 1.1 frodo_mul_add_sa_plus_e | | | | 2. Shake128 and frodo_sample_n x3 | 1,416.27 | 2.28 | 1.2 Shake128 and frodo_sample_n | 1.328.39 | 60.76 | | 3. frodo_mul_add_sb_plus_e | 654.64 | 1.05 | 1.3 frodo_mul_add_sb_plus_e | , | | | 4. Shake256 | 569.60<br>386.22 | 0.92 | 1.4 Shake256 2. frodo pack | 200.00 | 17.67 | | 5. frodo_pack 6. frodo_unpack | 276.00 | 0.62 | 2. frodo_pack 3. 
frodo unpack | 386.22<br>276.00 | 12.62 | | Others | 195.62 | 0.44 | Others | 195.62 | 8.95 | | Total | 62,075.83 | 100.00 | Total | 2,186.23 | 100.00 | | 10641 | / | | Decapsulation | 2,100.20 | 100.00 | | 1. frodo mul add sa plus e | 58,754.02 | 94.19 | 1.1 frodo mul add sa plus e | | | | 2. Shake128 and frodo sample n x3 | 883.14 | 1.42 | 1.2 Shake128 and frodo_sample_n | | | | 3. frodo unpack x3 | 765.56 | 1.23 | 1.3 frodo mul add sb plus e | 1,316.52 | 42.20 | | 4. frodo_mul_add_sb_plus_e | 649.68 | 1.04 | 1.4 Shake256 | | | | 5. frodo_mul_bs | 507.08 | 0.81 | 2. frodo_unpack x3 | 765.56 | 24.54 | | 6. Shake256 | 286.64 | 0.46 | 3. frodo_mul_bs | 507.08 | 16.25 | | Others | 530.74 | 0.85 | Others | 530.74 | 17.01 | | Total | 62,376.86 | 100.00 | Total | 3,119.90 | 100.00 | | | | | Encapsulation | | | | 1. frodo_mul_add_sa_plus_e | 31,430.38 | 90.82 | 1.1 frodo_mul_add_sa_plus_e | | | | 2. Shake128 and frodo_sample_n x3 | 1,410.18 | 4.07 | 1.2 Shake128 and frodo_sample_n | 760.74 | 46.32 | | 3. frodo_mul_add_sb_plus_e | 472.16 | 1.36 | 1.3 frodo_mul_add_sb_plus_e | | | | 4. Shake256 | 414.11<br>357.58 | 1.20 | 1.4 Shake256 | 357.58 | 01.77 | | 5. frodo_pack 6. frodo_unpack | 297.73 | 1.03<br>0.86 | 2. frodo_pack 3. frodo unpack | 297.73 | 21.77<br>18.13 | | Others | 226.40 | 0.65 | Others | 226.40 | 13.78 | | Total | 34,608.54 | 100.00 | Total | 1,642.45 | 100.00 | | Total | / | | Decapsulation | 1,042.40 | 100.00 | | 1. frodo mul add sa plus e | 31,441.14 | 90.74 | 1.1 frodo mul add sa plus e | | | | 2. Shake128 and frodo sample n x3 | 1,410.86 | 4.07 | 1.2 Shake128 and frodo_sample_n | | | | 3. frodo unpack x3 | 594.63 | 1.72 | 1.3 frodo_mul_add_sb_plus_e | 749.76 | 40.18 | | 4. frodo_mul_add_sb_plus_e | 471.29 | 1.36 | 1.4 Shake256 | | | | 5. frodo_mul_bs | 368.32 | 1.06 | 2. frodo_unpack x3 | 594.63 | 31.86 | | 6. Shake256 | 208.83 | 0.60 | 3. 
frodo_mul_bs | 368.32 | 19.74 | | Others | 153.51 | 0.44 | Others | 153.51 | 8.23 | | Total | 34,648.58 | 100.00 | Total | 1,866.22 | 100.00 | | | | | Encapsulation | | | | 1. frodo_mul_add_sa_plus_e | 13,794.27 | 85.19 | 1.1 frodo_mul_add_sa_plus_e | | | | 2. Shake128 and frodo_sample_n x3 | 1,002.40 | 6.19 | 1.2 Shake128 and frodo_sample_n | 352.52 | 28.82 | | 3. frodo_mul_add_sb_plus_e | 309.68 | 1.91 | 1.3 frodo_mul_add_sb_plus_e | | | | 4. Shake256 | 215.55 | 1.33 | 1.4 Shake256 2. frodo pack | 901.09 | 99.00 | | 5. frodo_pack 6. frodo_unpack | 291.83<br>277.26 | 1.80 | 2. frodo_pack 3. frodo unpack | 291.83<br>277.26 | 23.86<br>22.67 | | Others | 301.38 | 1.71 | Others | 301.38 | 24.64 | | Total | 16,192.37 | 100.00 | Total | 1,222.99 | 100.00 | | 10001 | | | Decapsulation | 1,444.99 | 100.00 | | 1. frodo_mul_add_sa_plus_e | 13,793.01 | 85.18 | 1.1 frodo mul add sa plus e | | | | 2. Shake128 and frodo sample n x3 | 1,002.85 | 6.19 | 1.2 Shake128 and frodo_sample_n | | | | 3. frodo unpack x3 | 548.74 | 3.39 | 1.3 frodo mul add sb plus e | 342.95 | 25.95 | | 4. frodo_mul_add_sb_plus_e | 309.21 | 1.91 | 1.4 Shake256 | | | | 5. frodo_mul_bs | 242.40 | 1.50 | 2. frodo_unpack x3 | 548.74 | 41.53 | | 6. Shake256 | 108.93 | 0.67 | 3. frodo_mul_bs | 242.40 | 18.35 | | Others | 187.23 | 1.16 | Others | 187.23 | 14.17 | | | 16,192.37 | 100.00 | Total | 1,321.32 | 100.00 | **Table 30:** Results of profiling Kyber | Function | Time | Time | Function | Time | Time | | | |------------------|--------|--------|-------------------------------|------|--------|--|--| | | [us] | [%] | | [us] | [%] | | | | Softwa | | | Software/Hardware | | | | | | Kyb | | -KEM | 1024 - Encapsulation | n | | | | | 1. indcpa_enc | 736.7 | 93.55 | 1.1. indcpa_enc | 20.6 | 93.21 | | | | 2. hash | 49.3 | 6.26 | 1.2. hash | 20.0 | 99.21 | | | | 3. randombytes | 1.5 | 0.19 | 2. 
randombytes | 1.5 | 6.79 | | | | Total | 787.5 | 100.00 | Total | 22.1 | 100.00 | | | | Kyb | er CCA | -KEM | 1024 - Decapsulation | 1 | | | | | 1. indcpa_enc | 734.2 | 76.99 | 1.1 indcpa_enc | | | | | | 2. indcpa_dec | 191.7 | 20.10 | 1.2 indcpa_dec | 24.7 | 100.00 | | | | 3. hash & verify | 27.7 | 2.90 | 1.3 hash & verify | | | | | | Total | 953.7 | 100.00 | Total | 24.7 | 100.00 | | | | Ky | ber CC | A-KEM | 768 - Encapsulation | | | | | | 1. indcpa_enc | 496.3 | 92.48 | 1.1. indcpa_enc | 16.4 | 91.62 | | | | 2. hash | 38.9 | 7.24 | 1.2. hash | 10.4 | 91.02 | | | | 3. randombytes | 1.5 | 0.28 | 2. randombytes | 1.5 | 8.38 | | | | Total | 536.7 | 100.00 | Total | 17.9 | 100.00 | | | | Kyl | ber CC | A-KEM | 768 - Decapsulation | | | | | | 1. indcpa_enc | 493.2 | 73.60 | 1.1 indcpa_enc | | | | | | 2. indcpa_dec | 154.6 | 23.07 | 1.2 indcpa_dec | 20.1 | 100.00 | | | | 3. hash & verify | 22.3 | 3.33 | 1.3 hash & verify | | | | | | Total | 670.1 | 100.00 | Total | 20.1 | 100.00 | | | | Kyl | ber CC | 4-KEM | 512 - Encapsulation | ļ | | | | | 1. indcpa_enc | 302.8 | 91.19 | 1.1. indcpa_enc | 13.7 | 90.15 | | | | 2. hash | 27.8 | 8.36 | 1.2. hash | 10.7 | 30.10 | | | | 3. randombytes | 1.5 | 0.45 | 2. randombytes | 1.5 | 9.85 | | | | Total | 332.0 | 100.00 | Total | 15.2 | 100.00 | | | | Kyl | ber CC | 4-KEM | 512 - Decapsulation | L | | | | | 1. indcpa_enc | 298.5 | 68.93 | 1.1 indcpa_enc | | | | | | 2. indcpa_dec | 117.9 | 27.22 | $1.2 \; \mathrm{indcpa\_dec}$ | 17.1 | 100.00 | | | | 3. hash & verify | 16.7 | 3.85 | 1.3 hash & verify | | | | | | Total | 433.0 | 100.00 | Total | 17.1 | 100.00 | | | **Table 31:** Results of profiling LAC-v3a | Than 4 | Time | Time | D | Time | Time | | | | |-----------------|-----------------------------|-----------|-------------------|------|--------|--|--|--| | Function | [us] | [%] | Function | [us] | [%] | | | | | Softw | vare | | Software/Hardware | | | | | | | | LAC-v | v3a-256 - | Encapsulation | | | | | | | 1. 
pke_enc_seed | 901.0 | 99.42 | 1.1 pke_enc_seed | | | | | | | 2. hash_to_k | 2.3 | 0.26 | 1.2 hash_to_k | 22.3 | 93.69 | | | | | 3. random_bytes | 1.5 | 0.17 | 1.3 Others | | | | | | | Others | 1.4 | 0.16 | 2. random_bytes | 1.5 | 6.31 | | | | | Total | 906.3 | 100.00 | Total | 23.8 | 100.00 | | | | | | LAC-v | v3a-256 - | Decapsulation | | | | | | | 1. pke_enc_seed | 901.1 | 65.37 | 1.1 pke_enc_seed | | | | | | | 2. pke_dec | 472.4 | 34.27 | $1.2~ m pke\_dec$ | 26.9 | 100.00 | | | | | 3. hash_to_k | 2.3 | 0.17 | 1.3 hash_to_k | 20.9 | 100.00 | | | | | Others | 2.6 | 0.19 | 1.4 Others | | | | | | | Total | 1,378.4 | 100.00 | Total | 26.9 | 100.00 | | | | | | LAC-v3a-192 - Encapsulation | | | | | | | | | 1. pke_enc_seed | 558.5 | 99.06 | 1.1 pke_enc_seed | | | | | | | 2. hash_to_k | 2.3 | 0.41 | 1.2 hash_to_k | 19.9 | 92.99 | | | | | 3. random_bytes | 1.5 | 0.27 | 1.3 Others | | | | | | | Others | 1.5 | 0.26 | 2. random_bytes | 1.5 | 7.01 | | | | | Total | 563.8 | 100.00 | Total | 21.4 | 100.00 | | | | | | LAC-v | v3a-192 - | Decapsulation | | | | | | | 1. pke_enc_seed | 558.7 | 71.44 | 1.1 pke_enc_seed | | | | | | | 2. pke_dec | 218.6 | 27.95 | $1.2~ m pke\_dec$ | 23.7 | 100.00 | | | | | 3. hash_to_k | 2.3 | 0.30 | 1.3 hash_to_k | 20.1 | 100.00 | | | | | Others | 2.5 | 0.31 | 1.4 Others | | | | | | | Total | 782.1 | 100.00% | Total | 23.7 | 100.00 | | | | | | | | Encapsulation | | | | | | | 1. pke_enc_seed | 328.6 | 98.73 | 1.1 pke_enc_seed | | | | | | | 2. hash_to_k | 2.3 | 0.69 | 1.2 hash_to_k | 14.9 | 93.73 | | | | | 3. random_bytes | 1.0 | 0.30 | 1.3 Others | | | | | | | Others | 0.9 | 0.27 | 2. random_bytes | 1.0 | 6.27 | | | | | Total | 332.8 | 100.00 | Total | 15.9 | 100.00 | | | | | | LAC-v | v3a-128 - | Decapsulation | | | | | | | 1. pke_enc_seed | 328.5 | 71.01 | 1.1 pke_enc_seed | | | | | | | 2. pke_dec | 130.3 | 28.17 | $1.2~ m pke\_dec$ | 17.1 | 100.00 | | | | | 3. 
hash_to_k | 2.3 | 0.50 | 1.3 hash_to_k | 11.1 | 100.00 | | | | | Others | 1.5 | 0.32 | 1.4 Others | | | | | | | Total | 462.6 | 100.00 | Total | 17.1 | 100.00 | | | | **Table 32:** Results of profiling LAC-v3b | Function | Time<br>[us] | <b>Time</b> [%] | Function | Time<br>[us] | <b>Time</b> [%] | | | |-----------------------------|--------------|-----------------|---------------------------|--------------|-----------------|--|--| | Softwa | | | Software/Hardware | | | | | | | LAC-v | 3b-256 - | Encapsulation | | | | | | 1. pke_enc_seed | 861.8 | 99.05 | 1.1 pke_enc_seed | | | | | | 2. hash_to_k | 2.3 | 0.27 | 1.2 hash_to_k | 19.6 | 92.88 | | | | 3. random_bytes | 1.5 | 0.17 | 1.3 Others | | | | | | Others | 4.4 | 0.51 | 2. random_bytes | 1.5 | 7.12 | | | | Total | 870.0 | 100.00 | Total | 21.1 | 100.00 | | | | | LAC-v | 3b-256 - | Decapsulation | | | | | | 1. pke_enc_seed | 861.6 | 64.69 | 1.1 pke_enc_seed | | | | | | 2. pke_dec | 463.3 | 34.78 | $1.2~\mathrm{pke\_dec}$ | 04.1 | 100.00 | | | | 3. hash_to_k | 2.3 | 0.17 | $1.3 \text{ hash\_to\_k}$ | 24.1 | 100.00 | | | | Others | 4.7 | 0.35 | 1.4 Others | | | | | | Total | 1,331.9 | 100.00 | Total | 24.1 | 100.00 | | | | LAC-v3b-192 - Encapsulation | | | | | | | | | 1. pke_enc_seed | 522.7 | 98.54 | 1.1 pke_enc_seed | | | | | | 2. hash_to_k | 2.3 | 0.44 | 1.2 hash_to_k | 16.8 | 91.82 | | | | 3. random_bytes | 1.5 | 0.28 | 1.3 Others | | | | | | Others | 3.9 | 0.74 | 2. random_bytes | 1.5 | 8.18 | | | | Total | 530.4 | 100.00 | Total | 18.3 | 100.00 | | | | | LAC-v | 3b-192 - | Decapsulation | | | | | | 1. pke_enc_seed | 522.8 | 70.59 | 1.1 pke_enc_seed | | | | | | 2. pke_dec | 211.2 | 28.51 | $1.2~ m pke\_dec$ | 20.6 | 100.00 | | | | 3. hash_to_k | 2.3 | 0.31 | 1.3 hash_to_k | 20.0 | 100.00 | | | | Others | 4.4 | 0.59 | 1.4 Others | | | | | | Total | 740.7 | 100.00 | Total | 20.6 | 100.00 | | | | | LAC-v | 3b-128 - | Encapsulation | ' | | | | | 1. pke_enc_seed | 309.4 | 98.34 | 1.1 pke_enc_seed | | | | | | 2. 
hash_to_k | 2.3 | 0.73 | 1.2 hash_to_k | 13.4 | 93.07 | | | | 3. random_bytes | 1.0 | 0.32 | 1.3 Others | | | | | | Others | 1.9 | 0.61 | 2. random_bytes | 1.0 | 6.93 | | | | Total | 314.6 | 100.00 | Total | 14.4 | 100.00 | | | | | LAC-v | 3b-128 - | Decapsulation | | | | | | 1. pke_enc_seed | 309.3 | 70.25 | 1.1 pke_enc_seed | | | | | | 2. pke_dec | 126.3 | 28.69 | $1.2~ m pke\_dec$ | 15.5 | 100.00 | | | | 3. hash_to_k | 2.3 | 0.52 | 1.3 hash_to_k | 15.5 | 100.00 | | | | Others | 2.3 | 0.53 | 1.4 Others | | | | | | Total | 440.2 | 100.00 | Total | 15.5 | 100.00 | | | Table 33: Results of profiling NewHope | Function | Time | Time | Function | Time | Time | | | |--------------------------------------|---------|--------|-----------------------|---------|--------|--|--| | Function | [us] | [%] | Function | [us] | [%] | | | | Softwa | are | | Software/H | ardware | | | | | NewH | Tope CC | CA-KEN | I 1024 - Encapsulati | on | | | | | 1. cpapke_enc | 668.3 | 91.33 | 1.1. cpapke_enc | 19.2 | 92.77 | | | | 2. hash | 62.0 | 8.47 | 1.2. hash | 19.2 | 92.11 | | | | 3. randombytes | 1.5 | 0.20 | 2. randombytes | 1.5 | 7.23 | | | | Total | 731.7 | 100.00 | Total | 20.7 | 100.00 | | | | NewHope CCA-KEM 1024 - Decapsulation | | | | | | | | | 1. cpapke_enc | 660.7 | 74.01 | 1.1 cpapke_enc | | | | | | 2. cpapke_dec | 193.9 | 21.72 | 1.2 cpapke_dec | 23.8 | 100.00 | | | | 3. hash & verify | 38.2 | 4.27 | 1.3 hash & verify | | | | | | Total | 892.7 | 100.00 | Total | 23.8 | 100.00 | | | | Newl | Hope Co | CA-KEN | I 512 - Encapsulation | on | | | | | 1. cpapke_enc | 316.9 | 89.60 | 1.1. cpapke_enc | 13.0 | 89.66 | | | | 2. hash | 35.3 | 9.97 | 1.2. hash | 15.0 | 09.00 | | | | 3. randombytes | 1.5 | 0.42 | 2. randombytes | 1.5 | 10.34 | | | | Total | 353.6 | 100.00 | Total | 14.5 | 100.00 | | | | Newl | Hope Co | CA-KEN | I 512 - Decapsulation | on | | | | | 1. cpapke_enc | 311.8 | 72.92 | 1.1 cpapke_enc | | | | | | 2. 
cpapke_dec | 93.3 | 21.82 | 1.2 cpapke_dec | 15.1 | 100.00 |
| 3. hash & verify | 22.5 | 5.26 | 1.3 hash & verify | | |
| Total | 427.5 | 100.00 | Total | 15.1 | 100.00 |

**Table 34:** Results from profiling NTRU

| Function | Time [us] | Time [%] | Function | Time [us] | Time [%] |
|----------|-----------|----------|----------|-----------|----------|
| Software | | | Software/Hardware | | |
| NTRU HPS4096821 - Encapsulation | | | | | |
| 1. poly_Rq_mul | 3,954.9 | 92.29 | 1.1 poly_Rq_mul | | |
| 2. owcpa_samplemsg | 251.4 | 5.87 | 1.2 owcpa_samplemsg | | |
| 3. shake256 | 54.3 | 1.27 | 1.3 shake256 | 29.8 | 63.77 |
| 4. sha3_256 | 7.6 | 0.18 | 1.4 sha3_256 | | |
| Others | 16.9 | 0.39 | Others | 16.9 | 36.23 |
| Total | 4,285.1 | 100.00 | Total | 46.7 | 100.00 |
| NTRU HPS4096821 - Decapsulation | | | | | |
| 1. poly_S3_mul | 3,972.1 | 33.15 | 1.1 poly_Rq_mul | | |
| 2. poly_Sq_mul | 3,960.3 | 33.05 | 1.2 poly_S3_mul | 37.4 | 34.92 |
| 3. poly_Rq_mul | 3,955.4 | 33.01 | 1.3 poly_Sq_mul | | |
| 4. poly_S3_frombytes x2 | 31.1 | 0.26 | 1.4 sha3_256 x2 | | |
| 5. sha3_256 x2 | 24.2 | 0.20 | 2. poly_S3_frombytes x2 | 31.1 | 29.01 |
| Others | 38.6 | 0.32 | Others | 38.6 | 36.07 |
| Total | 11,981.7 | 100.00 | Total | 107.1 | 100.00 |
| NTRU HPS2048677 - Encapsulation | | | | | |
| 1. poly_Rq_mul | 2,692.6 | 90.92 | 2.1 poly_Rq_mul | | |
| 2. owcpa_samplemsg | 199.8 | 6.75 | 2.2 owcpa_samplemsg | 26.0 | 62.44 |
| 3. shake256 | 45.9 | 1.55 | 2.3 shake256 | | |
| 4. sha3_256 | 7.5 | 0.25 | 2.4 sha3_256 | | |
| Others | 15.6 | 0.53 | Others | 15.6 | 37.56 |
| Total | 2,961.5 | 100.00 | Total | 41.6 | 100.00 |
| NTRU HPS2048677 - Decapsulation | | | | | |
| 1. poly_S3_mul | 2,706.8 | 33.11 | 1.1 poly_Rq_mul | | |
| 2. poly_Sq_mul | 2,693.2 | 32.94 | 1.2 poly_S3_mul | 34.1 | 35.77 |
| 3.
poly_Rq_mul | 2,693.1 | 32.94 | 1.3 poly_Sq_mul | 01.1 | 00.11 | | 4. poly_S3_frombytes x2 | 25.9 | 0.32 | 1.4 sha3_256 x2 | | | | 5. sha3_256 | 20.6 | 0.25 | 2. poly_S3_frombytes x2 | 25.9 | 27.14 | | Others | 35.3 | 0.43 | Others | 35.3 | 37.10 | | Total | 8,174.9 | 100.00 | Total | 95.3 | 100.00 | | | | | ncapsulation | | | | 1. poly_Rq_mul | 2,886.1 | 97.36 | 1. poly_lift | 27.9 | 42.74 | | 2. shake256 | 24.2 | 0.82 | 2.1 poly_Rq_mul | | | | 3. poly_lift | 27.9 | 0.94 | 2.2 owcpa_samplemsg | 22.3 | 34.12 | | 4.sha3_256 | 7.6 | 0.26 | 2.3 shake256 | | | | 5. owcpa_samplemsg | 3.5 | 0.12 | 2.3 sha3_256 | 15.4 | 00.14 | | Others | 15.1 | 0.51 | Others | 15.1 | 23.14 | | Total | 2,964.5 | 100.00 | Total | 65.3 | 100.00 | | | | | ecapsulation | | | | 1. poly_S3_mul | 2,900.8 | 33.00 | 1.1 poly_Rq_mul | | | | 2. poly_Sq_mul | 2,890.7 | 32.89 | 1.2 poly_S3_mul | 46.4 | 34.17 | | 3. poly_Rq_mul | 2,886.6 | 32.84 | 1.3 poly_Sq_mul | | | | 4. poly_lift | 27.2 | 0.31 | 1.4 sha3_256 | 07.0 | 00.05 | | 5. sha3_256 | 22.3 | 0.25 | 2. poly_lift | 27.2 | 20.05 | | Others | 62.1 | 0.71 | Others | 62.1 | 45.78 | | Total | 8,789.8 | 100.00 | Total | 135.6 | 100.00 | Table 35: Results of profiling NTRULPRime | Function | Time | Time | Function | Time | Time | |------------------------------------------|---------------|----------------|---------------------------------------------------------|--------------|-----------------| | C. C. | [us] | [%] | G & /II 1 | [us] | [%] | | Software | NICOTITE | | Software/Hardware | | | | | | | 7 - Encapsulation | | | | 1. Rq_mult_small x2 | 1,448.9 | 70.32 | 1.1 Short_fromlist | | | | 2. Short_fromlist | 261.6 | 12.69 | 1.2 Expand x2 | | | | 3. Expand x2 4. Hash x4 | 245.4<br>60.7 | 11.91<br>2.95 | 1.3 Hash X4 | 69.6 | 70.97 | | 5. crypto decode 857x1723 | 28.5 | 1.38 | 1.4 Rq_mult_small x2<br>1.5.crypto_encode_857x1723round | | | | 6. crypto_decode_657x1723round | 4.7 | 0.23 | 1.6 Others | | | | Others | 10.9 | 0.23 | 2. 
crypto decode 857x1723 | 28.5 | 29.03 | | Total | 2.060.5 | 100.00 | Total | 98.0 | 100.00 | | | , | | 7 - Decapsulation | 30.0 | 100.00 | | 1. Rq_mult_small x3 | 2,173.5 | 78.00 | 1.1 Short fromlist | | | | 2. Short fromlist | 261.4 | 9.38 | 1.2 expand x2 | | | | 3. expand x2 | 246.5 | 8.85 | 1.3 Rq_mult_small x3 | | | | 4. crypto_decode_857x1723 x2 | 50.2 | 1.80 | 1.4 Hash x3 | 47.3 | 48.52 | | 5. Hash x3 | 34.0 | 1.22 | 1.6 crypto encode 857x1723round | | | | 6. crypto encode 857x1723round | 4.8 | 0.17 | 1.7 Others | | | | Others | 16.3 | 0.58 | 2. crypto decode 857x1723 x2 | 50.2 | 51.48 | | Total | 2,786.6 | 100.00 | Total | 97.5 | 100.00 | | | | | 1 - Encapsulation | | | | 1. Rq mult small x2 | 1,169.5 | 68.62 | 1.1 Short fromlist | | | | 2. Short fromlist | 226.2 | 13.27 | 1.2 Expand x2 | | | | 3. Expand x2 | 214.8 | 12.60 | 1.3 Hash X4 | 00.0 | a= 0= | | 4. Hash X4 | 54.4 | 3.19 | 1.4 Rq_mult_small x2 | 66.0 | 67.07 | | 5. crypto_decode_761x1531 | 25.7 | 1.51 | 1.5 crypto_encode_761x1531round | | | | 6. crypto_encode_761x1531round | 4.2 | 0.25 | 1.6 Others | | | | Others | 9.5 | 0.56 | 2. crypto_decode_761x1531 | 32.4 | 32.93 | | Total | 1,704.4 | 100.00 | Total | 98.4 | 100.00 | | | NTRULE | Prime76 | 1 - Decapsulation | | | | 1. Rq_mult_small x3 | 1,753.6 | 76.60 | 1.1 Short_fromlist | | | | 2. Short_fromlist | 225.9 | 9.87 | 1.2 expand x2 | | | | 3. Expand x2 | 214.9 | 9.39 | 1.3 Rq_mult_small x3 | 71.4 | 54.20 | | 3. crypto_decode_761x1531 x2 | 45.4 | 1.98 | 1.4 Hash x3 | ,1.1 | 01.20 | | 4. Hash x3 | 31.0 | 1.35 | 1.5 crypto_encode_761x1531round | | | | 5. crypto_encode_761x1531round | 4.3 | 0.19 | 1.6 Others | | | | Others | 14.2 | 0.62 | 2. crypto_decode_761x1531 x2 | 60.3 | 45.80 | | Total | 2,289.4 | 100.00 | Total | 131.7 | 100.00 | | | | | 3 - Encapsulation | | | | 1. Rq_mult_small x2 | 934.0 | 67.19 | 1.1 Short_fromlist | | | | 2. Short_fromlist | 190.0 | 13.67 | 1.2 Expand x2 | | | | 3. 
Expand x2 | 183.6 | 13.21 | 1.3 Rq_mult_small x2 | 58.6 | 67.90 | | 4. Hash x4 | 48.2<br>22.7 | 3.46 | 1.4 Hash x4 | | | | 5. crypto_decode_653x1541 | | 1.64 | 1.5 crypto_encode_653x1541round | | | | 6. crypto_encode_653x1541round | 3.6<br>8.1 | 0.26 | 1.6 Others | 97.7 | 20.10 | | Others Total | 1,390.2 | 0.58<br>100.00 | 2. crypto_decode_653x1541 Total | 27.7<br>86.3 | 32.10<br>100.00 | | | | | 3 - Decapsulation | 00.3 | 100.00 | | | 1.400.9 | 75.50 | 1.1 Short from list | | | | 1. Rq_mult_small x3 2. Short_fromlist | 1,400.9 | 10.13 | 1.1 Short_fromust<br>1.2 Expand x2 | | | | 3. Expand x2 | 187.9 | 9.90 | _ | | | | 4. crypto_decode_653x1541 x2 | 38.8 | 2.09 | 1.3 Rq_mult_small x3<br>1.4 Hash x3 | 64.2 | 55.60 | | 4. crypto_decode_055x1541 x2 5. Hash x3 | 27.6 | 1.49 | 1.4 Hash x3<br>1.5 crypto_encode_653x1541round | | | | 6. crypto encode 653x1541round | 3.7 | 0.20 | 1.6 Others | | | | Others | 12.9 | 0.70 | 2. crypto decode 653x1541 x2 | 51.3 | 44.40 | | Total | 1,855.5 | 100.00 | Total | 115.5 | 100.00 | | 10001 | 1,000.0 | 100.00 | 10001 | 110.0 | 100.00 | Table 36: Results of profiling Streamlined NTRU Prime | Function | Time | Time | Function | Time | Time | |-----------------------|----------|---------|-----------------------------------|-------|--------| | Software | [us] | [%] | Software/Hardy | [us] | [%] | | | | TRUPri | me857 - Encapsulation | varc | | | 1. Rq_mult_small | 724.5 | 63.36 | 1.1 crypto_sort_uint32 | | | | 2. crypto_sort_uint32 | 259.7 | 22.71 | 1.2 Hash x5 | | | | 3. Hash x5 | 71.9 | 6.29 | 1.3 Rq_mult_small | 60.1 | 62.95 | | 4. Rq_decode | 30.4 | 2.66 | 1.4 Round_and_encode | 00.2 | 000 | | 5. Round_and_encode | 8.2 | 0.72 | 1.5 Others | | | | Others | 48.8 | 4.27 | 2. Rq_decode | 35.4 | 37.05 | | Total | 1,143.5 | 100.00 | Total | 95.5 | 100.00 | | | | | ne857 - Decapsulation | | | | 1. R3_mult | 1,019.4 | 39.43 | 1.1 Hash x4 | | | | 2. Rq_mult_small x2 | 1,448.9 | 56.05 | 1.2 Rq_mult_small x2 | F 7 0 | 40.00 | | 3. 
Hash x4 | 42.1 | 1.63 | 1.3 R3_mult | 57.3 | 48.39 | | 4. Rq_decode | 27.9 | 1.08 | 1.4 Others | | | | 5. Rounded_decode | 28.1 | 1.09 | 2. Rq_decode | 33.0 | 27.90 | | Others | 18.9 | 0.73 | 3. Rounded_decode | 28.1 | 23.71 | | Total | 2,585.2 | 100.00 | Total | 118.4 | 100.00 | | Strea | amlineN | ΓRUPrin | ne761 - Encapsulation | | | | 1. Rq_mult_small | 584.6 | 61.65 | 1.1 crypto_sort_uint32 | | | | 2. crypto_sort_uint32 | 223.8 | 23.61 | 1.2 Hash x5 | | | | 3. Hash x5 | 62.5 | 6.59 | 1.3 Rq_mult_small | 56.3 | 63.97 | | 4. Rq_decode | 27.0 | 2.85 | 1.4. Round_and_encode | | | | 5. Round_and_encode | 7.7 | 0.81 | 1.5 Others | | | | Others | 42.7 | 4.50 | 2. Rq_decode | 31.7 | 36.03 | | Total | 948.2 | 100.00 | Total | 88.0 | 100.00 | | Stre | amlineN' | TRUPri | me761- Decapsulation | | | | 1. R3_mult | 816.2 | 39.06 | 1.1 Hash x4 | | | | 2. Rq_mult_small x2 | 1,169.4 | 55.96 | $1.2 \text{ Rq\_mult\_small } x2$ | 53.3 | 59.35 | | 3. Hash x4 | 35.8 | 1.72 | 1.3 R3_mult | 55.5 | 59.55 | | 4. Rq_decode | 24.5 | 1.17 | 1.4 Others | | | | 5. Rounded_decode | 25.8 | 1.24 | 2. Rq_decode | 32.3 | 35.97 | | Others | 17.8 | 0.85 | 3. Rounded_decode | 4.2 | 4.68 | | Total | 2,089.6 | 100.00 | Total | 89.8 | 100.00 | | Stream | amlineN | ΓRUPrin | ne653 - Encapsulation | | | | 1. Rq_mult_small | 467.0 | 60.19 | 1.1 crypto_sort_uint32 | | | | 2. crypto_sort_uint32 | 185.5 | 23.90 | 1.2 Hash x5 | | | | 3. Hash x5 | 54.8 | 7.06 | $1.3 \; \mathrm{Rq\_mult\_small}$ | 52.3 | 65.60 | | 4. Rq_decode | 24.2 | 3.11 | 1.4 Round_and_encode | | | | 5. Round_and_encode | 6.4 | 0.82 | 1.5 Others | | | | Others | 38.2 | 4.92 | 2. Rq_decode | 27.4 | 34.40 | | Total | 775.9 | 100.00 | Total | 79.7 | 100.00 | | | | | me653- Decapsulation | | | | 1. R3_mult | 617.3 | 37.58 | 1.1 Hash x4 | | | | 2. Rq_mult_small x2 | 933.7 | 56.85 | $1.2 \text{ Rq\_mult\_small } x2$ | 51.0 | 63.88 | | 3. Hash x4 | 35.7 | 2.17 | 1.3 R3_mult | 51.0 | 05.00 | | 4. Rq_decode | 21.0 | 1.28 | 1.4 Others | | | | 5. 
Rounded_decode | 22.4 | 1.37 | 2. Rq_decode | 25.2 | 31.61 | | Others | 12.4 | 0.76 | 3. Rounded_decode | 3.6 | 4.51 | | | | | | | | **Table 37:** Results of profiling Round5 | Function | Time<br>[us] | Time [%] | Function | Time<br>[us] | Time [%] | |-----------------------|--------------|----------|----------------------------|--------------|----------| | Software | | | Software/Hardw | are | | | R5N | | A_5KEN | M_0d - Encapsulation | | | | 1. r5_cpa_pke_encrypt | 290.8 | 86.42 | 1.1. r5_cpa_pke_encrypt | 30.1 | 95.20 | | 2. hash | 44.2 | 13.13 | 1.2. hash | | | | 3. randombytes | 1.5 | 0.45 | 2. randombytes | 1.5 | 4.80 | | Total | 336.5 | 100.00 | Total | 31.6 | 100.00 | | | | | M_0d - Decapsulation | | | | 1. r5_cpa_pke_encrypt | 287.1 | 69.05 | 1.1 r5_cpa_pke_encrypt | | | | 2. r5_cpa_pke_decrypt | 83.6 | 20.11 | 1.2 r5_cpa_pke_decrypt | 36.8 | 100.00 | | 3. hash & verify | 45.1 | 10.84 | 1.3 hash & verify | | | | Total | 415.8 | 100.00 | Total | 36.8 | 100.00 | | | D_CC | | M_0d - Encapsulation | | | | 1. r5_cpa_pke_encrypt | 211.4 | 86.27 | 1.1. r5_cpa_pke_encrypt | 22.6 | 95.72 | | 2. hash | 32.6 | 13.32 | 1.2. hash | 22.0 | 00.12 | | 3. randombytes | 1.0 | 0.41 | 2. randombytes | 1.0 | 4.28 | | Total | 245.0 | 100.00 | Total | 23.6 | 100.00 | | R5N | | | M_0d - Decapsulation | | | | 1. r5_cpa_pke_encrypt | 208.2 | 67.27 | 1.1 r5_cpa_pke_encrypt | | | | 2. r5_cpa_pke_decrypt | 67.5 | 21.80 | 1.2 r5_cpa_pke_decrypt | 27.6 | 100.00 | | 3. hash & verify | 33.9 | 10.94 | 1.3 hash & verify | | | | Total | 309.5 | 100.00 | Total | 27.6 | 100.00 | | R5N | | A_1KEN | M_0d - Encapsulation | | | | 1. r5_cpa_pke_encrypt | 133.9 | 86.69 | 1.1. r5_cpa_pke_encrypt | 16.9 | 94.46 | | 2. hash | 19.6 | 12.67 | 1.2. hash | 10.9 | 94.40 | | 3. randombytes | 1.0 | 0.64 | 2. randombytes | 1.0 | 5.54 | | Total | 154.5 | 100.00 | Total | 17.9 | 100.00 | | R5N | D_CC | A_1KEN | M_0d - Decapsulation | | | | 1. r5_cpa_pke_encrypt | 130.3 | 67.61 | 1.1 r5_cpa_pke_encrypt | | | | 2. 
r5_cpa_pke_decrypt | 41.7 | 21.62 | 1.2 r5_cpa_pke_decrypt | 19.7 | 100.00 | | 3. hash & verify | 20.8 | 10.77 | 1.3 hash & verify | | | | Total | 192.7 | 100.00 | Total | 19.7 | 100.00 | | R5N | D_CC | A_5KEN | M_5d - Encapsulation | | | | 1. r5_cpa_pke_encrypt | 372.0 | 91.87 | 1.1. r5_cpa_pke_encrypt | 00.4 | 0.4 55 | | 2. hash | 31.4 | 7.76 | 1.2. hash | 26.4 | 94.55 | | 3. randombytes | 1.5 | 0.38 | 2. randombytes | 1.5 | 5.45 | | Total | 404.9 | 100.00 | Total | 27.9 | 100.00 | | R5N | D_CC | A_5KEN | M_5d - Decapsulation | | | | 1. r5_cpa_pke_encrypt | 372.0 | 69.24 | 1.1 r5_cpa_pke_encrypt | | | | 2. r5_cpa_pke_decrypt | 132.3 | 24.63 | 1.2 r5_cpa_pke_decrypt | 32.5 | 100.00 | | 3. hash & verify | 32.9 | 6.12 | 1.3 hash & verify | | | | Total | 537.2 | 100.00 | Total | 32.5 | 100.00 | | R5N | D CC | | M_5d - Encapsulation | | | | 1. r5_cpa_pke_encrypt | 214.3 | 88.83 | 1.1. r5_cpa_pke_encrypt | 10.0 | 05.04 | | 2. hash | 25.9 | 10.75 | 1.2. hash | 19.3 | 95.04 | | 3. randombytes | 1.0 | 0.42 | 2. randombytes | 1.0 | 4.96 | | Total | 241.3 | 100.00 | Total | 20.4 | 100.00 | | | D_CC | | M_5d - Decapsulation | 1 | 1 | | 1. r5_cpa_pke_encrypt | 214.2 | 67.19 | 1.1 r5_cpa_pke_encrypt | | | | 2. r5_cpa_pke_decrypt | 78.5 | 24.63 | 1.2 r5_cpa_pke_decrypt | 23.3 | 100.00 | | 3. hash & verify | 26.1 | 8.18 | 1.3 hash & verify | | | | Total | 318.8 | 100.00 | Total | 23.3 | 100.00 | | R5N | | | M_5d - Encapsulation | | | | 1. r5_cpa_pke_encrypt | 111.7 | 88.59 | 1.1. r5_cpa_pke_encrypt | | | | 2. hash | 13.4 | 10.62 | 1.2. hash | 13.5 | 93.15 | | 3. randombytes | 1.0 | 0.79 | 2. randombytes | 1.0 | 6.85 | | Total | 126.1 | 100.00 | Total | 14.4 | 100.00 | | R5N | | | $M_{2}$ 5d - Decapsulation | | | | 1. r5_cpa_pke_encrypt | 111.8 | 64.72 | 1.1 r5_cpa_pke_encrypt | | | | 2. r5_cpa_pke_decrypt | 46.7 | 27.06 | 1.2 r5_cpa_pke_decrypt | 16.0 | 100.00 | | 3. 
hash & verify | 14.2 | 8.22 | 1.3 hash & verify | 10.0 | | | Total | 172.7 | 100.00 | Total | 16.0 | 100.00 | | 20001 | 1 1 2 . 1 | 100.00 | 10001 | 10.0 | 100.00 | Table 38: Results of profiling for Saber | Function | Time | Time | Function | Time | Time | | | |----------------------------|----------|-----------|---------------------|--------|----------|--|--| | Function | [us] | [%] | | [us] | [%] | | | | Softwa | | | Software/Har | rdware | | | | | | | | ncapsulation | | | | | | 1. MatrixVectorMul | 815.84 | 69.08% | 1.1 MatrixVectorMul | | | | | | 2. InnerProduct | 204.44 | 17.31% | 1.2 InnerProduct | | | | | | 3. GenMatrix | 92.93 | 7.87% | 1.3 GenMatrix | 55.09 | 84.43% | | | | 4. Hash | 45.10 | 3.82% | 1.4 Hash | | | | | | 5. GenSecret | 12.50 | 1.06% | 1.5 GenSecret | | | | | | Others | 10.16 | 0.86% | Others | 10.16 | 15.57% | | | | Total | 1,180.97 | 100.00% | Total | 65.25 | 100.00% | | | | | Fire | Saber - D | ecapsulation | | | | | | 1. MatrixVectorMul | 816.34 | 59.31% | 1.1 MatrixVectorMul | | | | | | 2. InnerProduct x2 | 408.14 | 29.65% | 1.2 InnerProduct x2 | | | | | | 3. GenMatrix | 92.99 | 6.76% | 1.3 GenMatrix | 55.14 | 71.55% | | | | 4. Hash | 24.49 | 1.78% | 1.4 Hash | | | | | | 5. GenSecret | 12.53 | 0.91% | 1.5 GenSecret | | | | | | Others | 21.92 | 1.59% | Others | 21.92 | 28.45% | | | | Total | 1,376.41 | 100.00% | Total | 77.06 | 100.00% | | | | | | | capsulation | | | | | | 1. MatrixVectorMul | 458.94 | 63.55% | 1.1 MatrixVectorMul | | | | | | 2. InnerProduct | 153.19 | 21.21% | 1.2 InnerProduct | | | | | | 3. GenMatrix | 53.29 | 7.38% | 1.3 GenMatrix | 49.15 | 86.36% | | | | 4. Hash | 37.98 | 5.26% | 1.4 Hash | | 00100,0 | | | | 5. GenSecret | 10.97 | 1.52% | 1.5 GenSecret | | | | | | Others | 7.76 | 1.07% | Others | 7.76 | 13.64% | | | | Total | 722.13 | 100.00% | Total | 56.91 | 100.00% | | | | 10001 | | | capsulation | 00.01 | 100.0070 | | | | 1. MatrixVectorMul | 458.98 | 52.93% | 1.1 MatrixVectorMul | | | | | | 2. 
InnerProduct x2 | 306.52 | 35.35% | 1.2 InnerProduct x2 | | |
| 3. GenMatrix | 53.29 | 6.15% | 1.3 GenMatrix | 48.15 | 74.47% |
| 4. Hash | 20.87 | 2.41% | 1.4 Hash | | |
| 5. GenSecret | 11.00 | 1.27% | 1.5 GenSecret | | |
| Others | 16.51 | 1.90% | Others | 16.51 | 25.53% |
| Total | 867.17 | 100.00% | Total | 64.66 | 100.00% |
| LightSaber - Encapsulation | | | | | |
| 1. MatrixVectorMul | 203.70 | 54.55% | 1.1 MatrixVectorMul | | |
| 2. InnerProduct | 102.26 | 27.38% | 1.2 InnerProduct | | |
| 3. GenMatrix | 23.67 | 6.34% | 1.3 GenMatrix | 43.36 | 88.49% |
| 4. Hash | 27.31 | 7.31% | 1.4 Hash | | |
| 5. GenSecret | 10.86 | 2.91% | 1.5 GenSecret | | |
| Others | 5.64 | 1.51% | Others | 5.64 | 11.51% |
| Total | 373.44 | 100.00% | Total | 49.00 | 100.00% |
| LightSaber - Decapsulation | | | | | |
| 1. MatrixVectorMul | 204.43 | 43.44% | 1.1 MatrixVectorMul | | |
| 2. InnerProduct x2 | 204.80 | 43.52% | 1.2 InnerProduct x2 | | |
| 3. GenMatrix | 23.67 | 5.03% | 1.3 GenMatrix | 41.27 | 78.55% |
| 4. Hash | 15.55 | 3.30% | 1.4 Hash | | |
| 5. GenSecret | 10.83 | 2.30% | 1.5 GenSecret | | |
| Others | 11.27 | 2.40% | Others | 11.27 | 21.45% |
| Total | 470.55 | 100.00% | Total | 52.54 | 100.00% |

#### B Pseudocode of Implemented Algorithms

Below we show the pseudocode of all implemented KEMs, with parts offloaded to hardware marked with the gray background.

#### **B.1 FrodoKEM**

```
Algorithm 1 Pseudocode of FrodoKEM.Encaps [67]
Input: Public key pk = seed_A || \mathbf{b} \in \{0,1\}^{len_{seed_A} + D.n.\bar{n}}.
Output: Ciphertext \mathbf{c}_1||\mathbf{c}_2 \in \{0,1\}^{(\bar{m}.n+\bar{m}.\bar{n})D} and shared secret \mathbf{ss} \in \{0,1\}^{len_{ss}}.
1: Choose a uniformly random key \mu \leftarrow_{\$} U(\{0,1\}^{len_{\mu}})
2: Compute \mathbf{pkh} \leftarrow \mathrm{SHAKE}(pk, len_{pkh})
3: Generate pseudorandom values seed_{SE} || k \leftarrow \mathrm{SHAKE}(\mathbf{pkh} || \mu, len_{seed_{SE}} + len_k)
4: Generate pseudorandom bit string (r^{(0)}, r^{(1)}, ..., r^{(2\bar{m}n + \bar{m}\bar{n} - 1)}) \leftarrow \mathrm{SHAKE}(0x96||seed_{SE}, (2\bar{m}n + \bar{m}\bar{n}).len_x)
5: Sample error matrix \mathbf{S}' \leftarrow \text{Frodo.SampleMatrix}((r^{(0)}, r^{(1)}, ..., r^{(\bar{m}n-1)}), \bar{m}, n, T_x)
6: Sample error matrix \mathbf{E}' \leftarrow \text{Frodo.SampleMatrix}((r^{(\bar{m}n)}, r^{(\bar{m}n+1)}, ..., r^{(2\bar{m}n-1)}), \bar{m}, n, T_x)
7: Generate \mathbf{A} \leftarrow \text{Frodo.Gen}(seed_A)
8: Compute \mathbf{B}' \leftarrow \mathbf{S}'\mathbf{A} + \mathbf{E}'
9: Compute \mathbf{c}_1 \leftarrow \text{Frodo.Pack}(\mathbf{B}')
10: Sample error matrix \mathbf{E}'' \leftarrow \text{Frodo.SampleMatrix}((r^{(2\bar{m}n)}, r^{(2\bar{m}n+1)}, ..., r^{(2\bar{m}n+\bar{m}\bar{n}-1)}), \bar{m}, \bar{n}, T_x)
11: Compute \mathbf{B} \leftarrow \text{Frodo.Unpack}(\mathbf{b}, n, \bar{n})
12: Compute \mathbf{V} \leftarrow \mathbf{S}'\mathbf{B} + \mathbf{E}''
13: Compute \mathbf{C} \leftarrow \mathbf{V} + \text{Frodo.Encode}(\mu)
14: Compute \mathbf{c}_2 \leftarrow \text{Frodo.Pack}(\mathbf{C})
15: Compute \mathbf{ss} \leftarrow \mathrm{SHAKE}(\mathbf{c}_1||\mathbf{c}_2||k, len_{ss})
16: return ciphertext (\mathbf{c}_1||\mathbf{c}_2) and shared secret \mathbf{ss}
```

# **Algorithm 2** Pseudocode of FrodoKEM.Decaps [67]

Input: Ciphertext $\mathbf{c}_1||\mathbf{c}_2 \in \{0,1\}^{(\bar{m}.n+\bar{m}.\bar{n})D}$, secret key $sk' = (s||seed_A||\mathbf{b},\mathbf{S},\mathbf{pkh}) \in \{0,1\}^{len_s+len_{seed_A}+D.n.\bar{n}} \times \mathbb{Z}_q^{n\times\bar{n}} \times \{0,1\}^{len_{pkh}}$.

Output: Shared secret $\mathbf{ss} \in \{0,1\}^{len_{ss}}$.
- 1: $\mathbf{B}' \leftarrow \text{Frodo.Unpack}(\mathbf{c}_1)$
- 2: $\mathbf{C} \leftarrow \text{Frodo.Unpack}(\mathbf{c}_2)$
- 3: Compute $\mathbf{M} \leftarrow \mathbf{C} - \mathbf{B}'\mathbf{S}$
- 4: Compute $\mu' \leftarrow \text{Frodo.Decode}(\mathbf{M})$
- 5: Parse $pk \leftarrow seed_A || \mathbf{b}$
- 6: Generate pseudorandom values $seed'_{SE} || k' \leftarrow \text{SHAKE}(\mathbf{pkh} || \mu', len_{seed_{SE}} + len_k)$
- 7: Generate pseudorandom bit string $(r^{(0)}, r^{(1)}, ..., r^{(2\bar{m}n + \bar{m}\bar{n} - 1)}) \leftarrow \text{SHAKE}(0x96||seed'_{SE}, (2\bar{m}n + \bar{m}\bar{n}).len_x)$
- 8: Sample error matrix $\mathbf{S}' \leftarrow \text{Frodo.SampleMatrix}((r^{(0)}, r^{(1)}, ..., r^{(\bar{m}n-1)}), \bar{m}, n, T_x)$
- 9: Sample error matrix $\mathbf{E}' \leftarrow \text{Frodo.SampleMatrix}((r^{(\bar{m}n)}, r^{(\bar{m}n+1)}, ..., r^{(2\bar{m}n-1)}), \bar{m}, n, T_x)$
- 10: Generate $\mathbf{A} \leftarrow \text{Frodo.Gen}(seed_A)$
- 11: Compute $\mathbf{B}'' \leftarrow \mathbf{S}'\mathbf{A} + \mathbf{E}'$
- 12: Sample error matrix $\mathbf{E}'' \leftarrow \text{Frodo.SampleMatrix}((r^{(2\bar{m}n)}, r^{(2\bar{m}n+1)}, ..., r^{(2\bar{m}n+\bar{m}\bar{n}-1)}), \bar{m}, \bar{n}, T_x)$
- 13: Compute $\mathbf{B} \leftarrow \text{Frodo.Unpack}(\mathbf{b}, n, \bar{n})$
- 14: Compute $\mathbf{V} \leftarrow \mathbf{S}'\mathbf{B} + \mathbf{E}''$
- 15: Compute $\mathbf{C}' \leftarrow \mathbf{V} + \text{Frodo.Encode}(\mu')$
- 16: **if** $\mathbf{B}' || \mathbf{C} = \mathbf{B}'' || \mathbf{C}'$ **then**
- 17: **return** shared secret $\mathbf{ss} \leftarrow \text{SHAKE}(\mathbf{c}_1||\mathbf{c}_2||k', len_{ss})$
- 18: **else**
- 19: **return** shared secret $\mathbf{ss} \leftarrow \text{SHAKE}(\mathbf{c}_1||\mathbf{c}_2||s, len_{ss})$
- 20: **end if**

# B.2 LAC

```
Algorithm 3 LAC-CCA Key Generation [51]
Input: Random seed
Output: pk = (seed_{\vec{a}}, \vec{b}), sk = (\vec{s}, pk)
1: (seed_{\vec{a}}, seed_{\vec{s}}, seed_{\vec{e}}) \leftarrow \text{GenSeed}(seed)
2: \vec{a} \leftarrow \text{UniformSampl}(seed_{\vec{a}}) \in R_q
3: \vec{s} \leftarrow \text{CBDSampl}(seed_{\vec{s}})
4: \vec{e} \leftarrow \text{CBDSampl}(seed_{\vec{e}})
5: \vec{b} \leftarrow \vec{a}\vec{s} + \vec{e} \in R_q
6: pk = (seed_{\vec{a}}, \vec{b})
7: sk = (\vec{s}, pk = (seed_{\vec{a}}, \vec{b}))
```

#### Algorithm 4 LAC-CCA Encapsulation [51]

Input: $pk = (seed_{\vec{a}}, \vec{b})$, message $\vec{m}$

**Output:** A ciphertext $\vec{c}$ and a session key $\vec{ss}$

```
1: seed \leftarrow \mathsf{GenSeed}(\vec{m}, pk)
2: seeds \leftarrow \mathsf{GenSeed}(seed)
3: \vec{a} \leftarrow \mathsf{UniformSampl}(seed_{\vec{a}}) \in R_q
4: (\vec{r}, \vec{e}_1, \vec{e}_2) \leftarrow \mathsf{CBDSampl}(seeds)
5: \vec{c}_1 \leftarrow \vec{a}\vec{r} + \vec{e}_1 \in R_q
6: \vec{m}' \leftarrow \mathsf{BCH.Enc}(\vec{m}) \in \{0,1\}^{l_v}
7: \vec{m}'' \leftarrow \mathsf{D2.Enc}(\vec{m}') \in \mathbb{Z}_q^{l_v}
8: \vec{c}_2 \leftarrow \mathsf{Compress}((\vec{b}\vec{r})_{l_v} + \vec{e}_2 + \vec{m}'')
9: \vec{c} \leftarrow (\vec{c}_1, \vec{c}_2)
10: \vec{ss} \leftarrow \mathsf{Hash}(\vec{m}, \vec{c})
```

#### Algorithm 5 LAC-CCA Decapsulation [51]

**Input:** $sk = (\vec{s}, pk = (seed_{\vec{a}}, \vec{b})), \vec{c} = (\vec{c}_1, \vec{c}_2)$

Output: A session key $\vec{ss}$

```
1: \vec{u} \leftarrow \vec{c}_1 \vec{s} \in R_q
2: \vec{m}'' \leftarrow \mathsf{Decompress}(\vec{c}_2) - (\vec{u})_{l_v} \in \mathbb{Z}_q^{l_v}
3: \vec{m}' \leftarrow \mathsf{D2.Dec}(\vec{m}'') \in \{0,1\}^{l_v}
4: \vec{m} \leftarrow \mathsf{BCH.Dec}(\vec{m}')
5: seed \leftarrow \mathsf{GenSeed}(\vec{m}, pk)
6: seeds \leftarrow \mathsf{GenSeed}(seed)
7: \vec{a} \leftarrow \mathsf{UniformSampl}(seed_{\vec{a}}) \in R_q
8: (\vec{r}, \vec{e}_1, \vec{e}_2) \leftarrow \mathsf{CBDSampl}(seeds)
9: \underline{\vec{c}_1} \leftarrow \vec{a}\vec{r} + \vec{e_1} \in R_q
10: \underline{\vec{m}}' \leftarrow \mathsf{BCH.Enc}(\vec{m}) \in \{0,1\}^{l_v}
11: \underline{\vec{m}}'' \leftarrow \mathsf{D2.Enc}(\underline{\vec{m}}') \in \mathbb{Z}_q^{l_v}
12: \underline{\vec{c}_2} \leftarrow
\mathsf{Compress}((\vec{b}\vec{r})_{l_v} + \vec{e_2} + \underline{\vec{m}}'') 13: \underline{\vec{c}} \leftarrow (\underline{\vec{c}_1}, \underline{\vec{c}_2}) 14: if \vec{c} = \underline{\vec{c}} then 15: \vec{ss} \leftarrow \mathsf{Hash}(\vec{m}, \vec{c}) 16: else 17: \vec{ss} \leftarrow \mathsf{Hash}(\mathsf{Hash}(sk), \vec{c}) 18: end if ``` 10: return11: end if12: return K # B.3 Kyber $\mathbf{return}\ K := \mathrm{KDF}(z||\mathrm{H}(c))$ ``` \overline{\mathbf{Algorithm 6}} Pseudocode of Kyber.CCAKEM.Enc(pk) [8] Input: Public key pk \in \mathcal{B}^{12.k.n/8+32} Output: Ciphertext c \in \mathcal{B}^{d_u.k.n/8+d_v.n/8} Output: Shared key K \in \mathcal{B}^* 1: m \leftarrow \mathcal{B}^{32} 2: m \leftarrow H(m) 3: (\bar{K}, r) := G(m||H(pk)) 4: c := \text{Kyber.CPAPKE.Enc}(pk, m, r) 5: K := KDF(\bar{K}||H(c)) 6: return (c, K) Algorithm 7 Pseudocode of Kyber.CCAKEM.Dec(c, sk) Input: Ciphertext c \in \mathcal{B}^{d_u.k.n/8+d_v.n/8} Input: Secret key sk \in \mathcal{B}^{24.k.n/8+96} Output: Shared key K \in \mathcal{B}^* 1: pk := sk + 12.k.n/8 2: h := sk + 24.k.n/8 + 32 \in \mathcal{B}^{32} 3: z := sk + 24.k.n/8 + 64 4: m' := \text{Kyber.CPAPKE.Dec}(s, (u, v)) 5: (\bar{K}', r') := G(m'||h|) 6: c' := \text{Kyber.CPAPKE.Enc}(pk, m', r') 7: if c = c' then 8: return K := KDF(\bar{K}'||H(c)) 9: else ``` # **Algorithm 8** Pseudocode of Kyber.CPAPKE. 
$\operatorname{Enc}(pk, m, r)$: encryption

```
Input: Public key pk \in \mathcal{B}^{12.k.n/8+32}
Input: Message m \in \mathcal{B}^{32}
Input: Random coins r \in \mathcal{B}^{32}
Output: Ciphertext c \in \mathcal{B}^{d_u.k.n/8+d_v.n/8}
1: N := 0
2: \hat{t} := \text{Decode}_{12}(pk)
3: \rho := pk + 12.k.n/8
4: for i from 0 to k-1 do
5:     for j from 0 to k-1 do
6:         \hat{A}^T[i][j] := \text{Parse}(\text{XOF}(\rho, i, j))
7:     end for
8: end for
9: for i from 0 to k-1 do
10:     \mathbf{r}[i] := CBD_{\eta}(PRF(r, N))
11:     N := N + 1
12: end for
13: for i from 0 to k-1 do
14:     \mathbf{e}_1[i] := CBD_{\eta}(PRF(r, N))
15:     N := N + 1
16: end for
17: e_2 := CBD_{\eta}(PRF(r, N))
18: \hat{\mathbf{r}} := \text{NTT}(\mathbf{r})
19: \mathbf{u} := \text{NTT}^{-1}(\hat{A}^T \circ \hat{\mathbf{r}}) + \mathbf{e}_1
20: v := \text{NTT}^{-1}(\hat{t}^T \circ \hat{\mathbf{r}}) + e_2 + \text{Decompress}_q(\text{Decode}_1(m), 1)
21: c_1 := \text{Encode}_{d_u}(\text{Compress}_q(\mathbf{u}, d_u))
22: c_2 := \text{Encode}_{d_v}(\text{Compress}_q(v, d_v))
23: return c = (c_1||c_2)
```

### **Algorithm 9** Pseudocode of Kyber.CPAPKE.Dec(sk, c): decryption

```
Input: Secret key sk \in \mathcal{B}^{12.k.n/8}
Input: Ciphertext c \in \mathcal{B}^{d_u.k.n/8 + d_v.n/8}
Output: Message m \in \mathcal{B}^{32}
1: \mathbf{u} := \text{Decompress}_q(\text{Decode}_{d_u}(c), d_u)
2: v := \text{Decompress}_q(\text{Decode}_{d_v}(c + d_u.k.n/8), d_v)
3: \hat{s} := \text{Decode}_{12}(sk)
4: m := \text{Encode}_1(\text{Compress}_q(v - \text{NTT}^{-1}(\hat{s}^T \circ \text{NTT}(\mathbf{u})), 1))
5: return m
```

## B.4 NewHope

## Algorithm 10 Pseudocode of NewHope-CCA-KEM Encapsulation [2]

```
1: function NewHope-CCA-KEM.Encaps(pk)
2:     coin \stackrel{\$}{\leftarrow} \{0, \dots, 255\}^{32}
3:     \mu \leftarrow \text{SHAKE256}(32, coin) \in \{0, \dots, 255\}^{32}
4:     K||coin'||d \leftarrow \text{SHAKE256}(96, \mu||\text{SHAKE256}(32, pk)) \in \{0, \dots, 255\}^{32+32+32}
5:     c \leftarrow \text{NewHope-CPA-PKE.Encrypt}(pk, \mu, coin')
6:     ss \leftarrow \text{SHAKE256}(32, K||\text{SHAKE256}(32, c||d))
7:
\mathbf{return} \ (\bar{c} = c||d, ss)
```

### Algorithm 11 Pseudocode of NewHope-CCA-KEM Decapsulation [2]

```
1: function NewHope-CCA-KEM.Decaps(\bar{c}, \bar{sk})
2:     c||d \leftarrow \bar{c} \in \{0, \dots, 255\}^{3n/8 + 7n/4 + 32}
3:     sk||pk||h||s \leftarrow \bar{sk} \in \{0,\dots,255\}^{7n/4+7n/4+32+32+32}
4:     \mu' \leftarrow \text{NewHope-CPA-PKE.Decrypt}(c, sk)
5:     K'||coin''||d' \leftarrow \text{SHAKE256}(96, \mu'||h) \in \{0, \dots, 255\}^{32+32+32}
6:     if c = \text{NewHope-CPA-PKE.Encrypt}(pk, \mu', coin'') and d = d' then
7:         fail \leftarrow 0
8:     else
9:         fail \leftarrow 1
10:     K_0 \leftarrow K'
11:     K_1 \leftarrow s
12:     return (ss = \text{SHAKE256}(32, K_{fail}||\text{SHAKE256}(32, c||d)))
```

### Algorithm 12 Pseudocode of NewHope-CPA-PKE Encryption [2]

```
1: function NewHope-CPA-PKE.Encrypt(pk \in \{0, ..., 255\}^{7n/4+32}, \mu \in \{0, ..., 255\}^{32},
2:         coin \in \{0, \dots, 255\}^{32})
3:     (\hat{b}, publicseed) \leftarrow \text{DecodePk}(pk)
4:     \hat{a} \leftarrow \text{GenA}(publicseed)
5:     s' \leftarrow \text{BitRev}(\text{Sample}(coin, 0))
6:     e' \leftarrow \text{BitRev}(\text{Sample}(coin, 1))
7:     e'' \leftarrow \text{Sample}(coin, 2)
8:     \hat{t} \leftarrow \text{NTT}(s')
9:     \hat{u} \leftarrow \hat{a} \circ \hat{t} + \text{NTT}(e')
10:     v \leftarrow \text{Encode}(\mu)
11:     v' \leftarrow \text{NTT}^{-1}(\hat{b} \circ \hat{t}) + e'' + v
12:     h \leftarrow \text{Compress}(v')
13:     return c = \text{EncodeC}(\hat{u}, h)
```

## Algorithm 13 Pseudocode of NewHope-CPA-PKE Decryption [2]

```
1: function NewHope-CPA-PKE.Decrypt(c \in \{0, ..., 255\}^{7n/4+3n/8}, sk \in \{0, ..., 255\}^{7n/4})
2:     (\hat{u}, h) \leftarrow \text{DecodeC}(c)
3:     \hat{s} \leftarrow \text{DecodePoly}(sk)
4:     v' \leftarrow \text{Decompress}(h)
5:     \mu \leftarrow \text{Decode}(v' - \text{NTT}^{-1}(\hat{u} \circ \hat{s}))
6:     return \mu
```

#### **B.5** NTRU-HPS and NTRU-HRSS

# **Algorithm 14** Pseudocode of NTRU KEM Encapsulate(h)

```
1: coins \leftarrow_{\$} \{0,1\}^{256}
2: (r, m) \leftarrow \text{Sample\_rm}(coins)
3: c \leftarrow \text{Encrypt}(h, (r, m))
4: k
\leftarrow H_1(r,m)
5: return (c, k)
```

#### Algorithm 15 Pseudocode of NTRU KEM Decapsulate $((f, f_p, h_q, s), c)$ [69]

```
1: (r, m, fail) \leftarrow \text{Decrypt}((f, f_p, h_q), c)
2: k_1 \leftarrow H_1(r, m)
3: k_2 \leftarrow H_2(s, c)
4: if fail = 0 then
5:     return k_1
6: else
7:     return k_2
8: end if
```

## **Algorithm 16** Pseudocode of NTRU DPKE Encrypt(h, (r, m)) [69]

```
1: m' \leftarrow \text{Lift}(m)
2: c \leftarrow (r \cdot h + m') \mod (q, \Phi_1 \Phi_n)
3: return c
```

## **Algorithm 17** Pseudocode of NTRU DPKE Decrypt $((f, f_p, h_q), c)$ [69]

```
1: if c \not\equiv 0 \pmod{(q, \Phi_1)} then
2:     return (0, 0, 1)
3: end if
4: a \leftarrow (c \cdot f) \mod{(q, \Phi_1 \Phi_n)}
5: m \leftarrow (a \cdot f_p) \mod{(3, \Phi_n)}
6: m' \leftarrow \text{Lift}(m)
7: r \leftarrow ((c - m') \cdot h_q) \mod{(q, \Phi_n)}
8: if (r, m) \in \mathcal{L}_r \times \mathcal{L}_m then
9:     return (r, m, 0)
10: else
11:     return (0, 0, 1)
12: end if
```

#### B.6 Streamlined NTRU Prime and NTRULPRime

**Algorithm 18** Pseudocode of Encapsulation in Streamlined NTRU Prime and NTRU LPRime [68]

**Input:** $\underline{K} \in \mathsf{PublicKeys}$.

**Output:** Ciphertexts' $\times$ SessionKeys' = Ciphertexts $\times$ Confirm $\times$ SessionKeys'.

- 1: Decode $\underline{K}$, obtaining $K \in \mathsf{PublicKeys}$.
- 2: Generate a uniform random $r \in \mathsf{Inputs}$.
- 3: Encode $r$ as a string $\underline{r} \in \mathsf{Inputs}$.
- 4: Compute $c = \mathsf{Encrypt}(r, K) \in \mathsf{Ciphertexts}$.
- 5: Encode $c$ as a string $\underline{c} \in \mathsf{Ciphertexts}$.
- 6: Compute $C = (\underline{c}, \mathsf{HashConfirm}(\underline{r}, \underline{K})) \in \mathsf{Ciphertexts} \times \mathsf{Confirm}$.
- 7: **Return** $(C, \mathsf{HashSession}(1, \underline{r}, C))$.
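The wrapper structure of Algorithm 18 can be sketched in a few lines of Python. This is only an illustration of the data flow (encrypt, append a confirmation hash, derive the session key), not the submission's code: `toy_encrypt` is an insecure placeholder for $\mathsf{Encrypt}$, and the domain-separated SHA-512 helper `_h` with its prefix bytes is a hypothetical stand-in for $\mathsf{HashConfirm}$ and $\mathsf{HashSession}$. All function names here are assumptions introduced for the example.

```python
import hashlib
import secrets
from typing import Optional, Tuple


def _h(prefix: int, data: bytes) -> bytes:
    """Hypothetical domain-separated hash: SHA-512(prefix || data), truncated to 32 bytes."""
    return hashlib.sha512(bytes([prefix]) + data).digest()[:32]


def hash_confirm(r_enc: bytes, pk_enc: bytes) -> bytes:
    """Stand-in for HashConfirm(r, K); the prefix bytes are illustrative assumptions."""
    return _h(2, _h(3, r_enc) + _h(4, pk_enc))


def hash_session(flag: int, r_enc: bytes, big_c: bytes) -> bytes:
    """Stand-in for HashSession(flag, r, C)."""
    return _h(flag, _h(3, r_enc) + big_c)


def toy_encrypt(r_enc: bytes, pk_enc: bytes) -> bytes:
    """Placeholder for Encrypt(r, K): deterministic but NOT secure and NOT the real scheme."""
    pad = hashlib.sha256(pk_enc).digest()
    return bytes(a ^ b for a, b in zip(r_enc, pad))


def encapsulate(pk_enc: bytes, r_enc: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Steps 2-7 of Algorithm 18 on already-encoded byte strings."""
    if r_enc is None:
        r_enc = secrets.token_bytes(32)                   # step 2: uniform random input
    c_enc = toy_encrypt(r_enc, pk_enc)                    # steps 4-5: encrypt and encode
    big_c = c_enc + hash_confirm(r_enc, pk_enc)           # step 6: C = (c, HashConfirm(r, K))
    return big_c, hash_session(1, r_enc, big_c)           # step 7: (C, HashSession(1, r, C))


# Usage: a fixed r makes the call reproducible for inspection.
ciphertext, session_key = encapsulate(b"example public key", b"\x01" * 32)
```

Passing `r_enc` explicitly is a convenience for testing only; in normal use the input is drawn at random, as in step 2.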

**Algorithm 19** Pseudocode of Decapsulation in Streamlined NTRU Prime and NTRU LPRime [68]

**Input:** $C = (\underline{c}, \gamma) \in \mathsf{Ciphertexts} \times \mathsf{Confirm}$ and $(\underline{k}, \underline{K}, \rho) \in \mathsf{SecretKeys} \times \mathsf{PublicKeys} \times \mathsf{Inputs}$.

**Output:** $\mathsf{SessionKeys}$.

- 1: Decode $\underline{c}$, obtaining $c \in \mathsf{Ciphertexts}$.
- 2: Decode $\underline{k}$, obtaining $k \in \mathsf{SecretKeys}$.
- 3: Compute $r' = \mathsf{Decrypt}(c, k) \in \mathsf{Inputs}$.
- 4: Encode $r'$ as a string $\underline{r'} \in \mathsf{Inputs}$.
- 5: Compute $c' = \mathsf{Encrypt}(r', K) \in \mathsf{Ciphertexts}$.
- 6: Encode $c'$ as a string $\underline{c'} \in \mathsf{Ciphertexts}$.
- 7: Compute $C' = (\underline{c'}, \mathsf{HashConfirm}(\underline{r'}, \underline{K})) \in \mathsf{Ciphertexts} \times \mathsf{Confirm}$.
- 8: If $C' = C$ then return $\mathsf{HashSession}(1, \underline{r'}, C)$. Otherwise return $\mathsf{HashSession}(0, \rho, C)$. (The choice between these two outputs is secret information.)

**Algorithm 20** Pseudocode of Encryption in Streamlined NTRU Prime

**Input:** $r \in \mathsf{Inputs}$ and $K = h \in \mathsf{PublicKeys}$.

**Output:** $c \in \mathsf{Ciphertexts}$.

- 1: Compute $hr \in \mathcal{R}/q$.
- 2: **Return** $c = \mathsf{Round}(hr)$.

**Algorithm 21** Pseudocode of Decryption in Streamlined NTRU Prime [68]

**Input:** $c \in \mathsf{Ciphertexts}$ and $k = (f, v) \in \mathsf{Short} \times \mathcal{R}/3$.

**Output:** $r' \in \mathsf{Inputs}$.

- 1: Compute $3fc \in \mathcal{R}/q$.
- 2: View each coefficient of $3fc \in \mathcal{R}/q$ as an integer between $-(q-1)/2$ and $(q-1)/2$, and then reduce modulo 3, obtaining a polynomial $e \in \mathcal{R}/3$.
- 3: Multiply by $v \in \mathcal{R}/3$.
- 4: Lift $ev \in \mathcal{R}/3$ to a small polynomial $r' \in \mathcal{R}$.
- 5: **Return** $r'$ if $r'$ has weight $w$. Otherwise output $(1, 1, ..., 1, 0, 0, ..., 0)$.
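
Steps 2 and 5 of Algorithm 21 can be illustrated with a short Python sketch. It assumes that polynomials are held as plain lists of integer coefficients; the helper names are hypothetical, and $q = 4591$ is used only as an example odd modulus. The polynomial multiplications of steps 1 and 3 are not shown.

```python
from typing import List


def center_mod_q(x: int, q: int) -> int:
    """Representative of x mod q in [-(q-1)/2, (q-1)/2], for odd q."""
    x %= q
    return x - q if x > (q - 1) // 2 else x


def center_mod_3(x: int) -> int:
    """Representative of x mod 3 in {-1, 0, 1}."""
    return (x + 1) % 3 - 1


def reduce_coeffs_mod_3(coeffs: List[int], q: int) -> List[int]:
    """Step 2: view each coefficient as a centered integer, then reduce mod 3."""
    return [center_mod_3(center_mod_q(c, q)) for c in coeffs]


def check_weight(r: List[int], w: int) -> List[int]:
    """Step 5: return r if exactly w coefficients are nonzero,
    otherwise the fixed vector (1, ..., 1, 0, ..., 0) with w ones."""
    if sum(1 for c in r if c != 0) == w:
        return r
    return [1] * w + [0] * (len(r) - w)


if __name__ == "__main__":
    q = 4591  # example modulus (assumption for illustration)
    print(reduce_coeffs_mod_3([4590, 2296, 4, 5], q))  # [-1, 0, 1, -1]
    print(check_weight([1, 1, -1, 0], 2))              # [1, 1, 0, 0]
```

The centered representative matters here: reducing the value 4590 modulo 3 directly would give 0, whereas its centered form $-1$ reduces to $-1$.
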

**Algorithm 22** Pseudocode of Encryption in NTRU LPRime Expand

**Input:** $r \in \mathsf{Inputs}$ and $K = (S, A) \in \mathsf{Seeds} \times \mathsf{Rounded}$.

**Output:** $c \in \mathsf{Ciphertexts}$.

- 1: Compute $G = \mathsf{Generator}(S)$.
- 2: Compute $b = \mathsf{HashShort}(r) \in \mathsf{Short}$.
- 3: Compute $bG$ in $\mathcal{R}/q$.
- 4: Compute $bA$ in $\mathcal{R}/q$.
- 5: Compute $T = (T_0, T_1, ..., T_{I-1}) \in (\mathbb{Z}/\tau)^I$ as follows: $T_j = \mathsf{Top}((bA)_j + r_j(q-1)/2)$.
- 6: **Return** $c = (\mathsf{Round}(bG), T) \in \mathsf{Ciphertexts}$.

**Algorithm 23** Pseudocode of Decryption in NTRU LPRime Expand

**Input:** $c = (B, T) \in \mathsf{Rounded} \times (\mathbb{Z}/\tau)^I$ and $k = a \in \mathsf{SecretKeys}$.

**Output:** $r' \in \mathsf{Inputs}$.

- 1: Compute $aB$ in $\mathcal{R}/q$.
- 2: Compute $(r'_0, r'_1, ..., r'_{I-1}) \in \{0, 1\}^I$ as follows. View $\mathsf{Right}(T_j) - (aB)_j + 4w + 1 \in \mathbb{Z}/q$ as an integer between $-(q-1)/2$ and $(q-1)/2$. Then $r'_j$ is the sign bit of this integer: 1 if the integer is negative, otherwise 0.
- 3: **Return** $r' = (r'_0, r'_1, ..., r'_{I-1}) \in \mathsf{Inputs}$.

#### B.7 Round5

**Algorithm 24** Pseudocode of r5\_cca\_kem\_encapsulate(pk) [70]

```
Parameters: Integers p, t, q, n, d, \bar{m}, \bar{n}, \mu, b, \kappa, f, \tau; \xi \in \{\Phi_{n+1}(x), x^{n+1} - 1\}
Input: pk \in \{0,1\}^{\kappa} \times R_{n,p}^{d/n \times \bar{n}}
Output: ct = (\tilde{U}, v, g) \in R_{n,p}^{\bar{m} \times d/n} \times Z_t^{\mu} \times \{0, 1\}^{\kappa}, k \in \{0, 1\}^{\kappa}
1: m \stackrel{\$}{\leftarrow} \{0,1\}^{\kappa}
2: (L, g, \rho) = G(m||pk)
3: (\tilde{U}, v) = r5_cpa_pke_encrypt(pk, m, \rho)
4: ct = (\tilde{U}, v, g)
5: k = H(L||ct)
6: return (ct, k)
```

**Algorithm 25** Pseudocode of r5\_cca\_kem\_decapsulate(ct, sk) [70]

```
Parameters: Integers p, t, q, n, d, \bar{m}, \bar{n}, \mu, b, \kappa, f, \tau; \xi \in \{\Phi_{n+1}(x), x^{n+1} - 1\}
Input: ct = (\tilde{U}, v, g) \in R_{n,p}^{\bar{m} \times d/n} \times Z_t^{\mu} \times \{0, 1\}^{\kappa}, sk = (sk_{CPA-PKE}, y, pk) \in \{0, 1\}^{\kappa} \times \{0, 1\}^{\kappa} \times \{0, 1\}^{\kappa}
Output: k \in \{0, 1\}^{\kappa}
1: m' = r5_cpa_pke_decrypt(sk_{CPA-PKE}, (\tilde{U}, v))
2: (L', g', \rho') = G(m'||pk)
3: (\tilde{U}', v') = r5_cpa_pke_encrypt(pk, m', \rho')
4: ct' = (\tilde{U}', v', g')
5: if (ct = ct') then
6:   return k = H(L'||ct)
7: else
8:   return k = H(y||ct)
9: end if
```

**Algorithm 26** Pseudocode of r5\_cpa\_pke\_encrypt(pk, m, $\rho$)

```
Parameters: Integers p, t, q, n, d, \bar{m}, \bar{n}, \mu, b, \kappa, f, \tau; \xi \in \{\Phi_{n+1}(x), x^{n+1} - 1\}
Input: pk = (\sigma, B) \in \{0, 1\}^{\kappa} \times R_{n,p}^{d/n \times \bar{n}}, m, \rho \in \{0, 1\}^{\kappa}
Output: ct = (\tilde{U}, v) \in R_{n,p}^{\bar{m} \times d/n} \times Z_t^{\mu}
1: A = f_{d,n}^{(\tau)}(\sigma)
2: R = f_R(\rho)
3: U = R_{q \to p, h_2}(\langle A^T R \rangle_{\Phi_{n+1}})
4: \tilde{U} = U^T
5: v = \langle R_{p \to t, h_2}(Sample_{\mu}(\langle B^T R \rangle_{\xi})) \rangle_{\tau}
6: ct = (\tilde{U}, v)
7: return ct
```

**Algorithm 27** Pseudocode of r5\_cpa\_pke\_decrypt(sk, ct)

```
Parameters: Integers p, t, q, n, d, \bar{m}, \bar{n}, \mu, b, \kappa, f, \tau; \xi \in \{\Phi_{n+1}(x), x^{n+1} - 1\}
Input: sk \in \{0,1\}^{\kappa}, ct = (\tilde{U}, v) \in R_{n,p}^{\bar{m} \times d/n} \times Z_t^{\mu}
Output: \hat{m} \in \{0,1\}^{\kappa}
1: v_p = \frac{p}{t} v
2: S = f_S(sk)
3: U = \tilde{U}^T
4: y = R_{p \to b, h_3}(v_p - Sample_{\mu}(\langle S^T(U + h_4 J) \rangle_{\xi}))
5: return \hat{m}
```

#### B.8 Saber

**Algorithm 28** Pseudocode of Saber.KEM.Encaps$(pk = (seed_A, b))$

```
1: m \leftarrow u(\{0,1\}^{256})
2: (\hat{K}, r) = g(F(pk), m)
3: c = \text{Saber.PKE.Enc}(pk, m; r)
4: K = H(\hat{K}, c)
5: return (c, K)
```

**Algorithm 29** Pseudocode of Saber.KEM.Decaps$(sk = (s, z, pkh), pk = (seed_A, b), c)$ [71]

```
1: m' = \text{Saber.PKE.Dec}(s, c)
2: (\hat{K}', r') = g(pkh, m')
3: c' = \text{Saber.PKE.Enc}(pk, m'; r')
4: if c = c' then
5:   return K = H(\hat{K}', c)
6: else
7:   return K = H(z, c)
8: end if
```

**Algorithm 30** Pseudocode of Saber.PKE.Enc$(pk = (seed_A, b), m \in R_2; r)$ [71]

```
1: A = gen(seed_A) \in R_q^{l \times l}
2: if r is not specified then
3:   r = u(\{0, 1\}^{256})
4: end if
5: s' = \beta_{\mu}(R_q^{l \times 1}; r)
6: b' = ((As' + h) \mod q) >> (\epsilon_q - \epsilon_p) \in R_p^{l \times 1}
7: v' = b^T(s' \mod p) \in R_p
8: c_m = (v' + h_1 - 2^{\epsilon_p - 1} m \mod p) >> (\epsilon_p - \epsilon_T) \in R_T
9: return c := (c_m, b')
```

**Algorithm 31** Pseudocode of Saber.PKE.Dec$(sk = s, c = (c_m, b'))$ [71]

```
1: v = b'^{T}(s \mod p) \in R_{p}
2: m' = ((v - 2^{\epsilon_{p} - \epsilon_{T}} c_{m} + h_{2}) \mod p) >> (\epsilon_{p} - 1) \in R_{2}
3: return m'
```

### C Speed-ups Compared to the Best Portable Implementations in C

In Table 39, for each investigated KEM and each major operation (Encapsulation and Decapsulation), we list the total execution time in software (for the optimized portable software implementations in C running on ARM Cortex-A53 of Zynq UltraScale+ MPSoC), the total execution time in software and
hardware (after offloading the most time-consuming operations to hardware), and the obtained speed-up. The ARM processor runs at 1.2 GHz, the DMA for communication between the processor and the hardware accelerator runs at 200 MHz, and the hardware accelerators run at the maximum frequencies specific to the RTL implementation of each algorithm, listed in Table 24. All execution times were obtained through experimental measurements using the setup shown in Fig. 1. The speed-up for the software part offloaded to hardware is given in the column Accel. Speed-up. This speed-up is the ratio of the execution time of the accelerated portion in software (column Accel. SW [ms]) to the execution time of the accelerated portion in hardware, including all overheads (column Accel. HW [ms]). The last column indicates what percentage of the software-only execution time was taken by the accelerated portion of the program. Links to the underlying software implementations are summarized in Table 23.

**Table 39:** Speed-ups Compared to the Best Portable Implementations in C

| Algorithm | Parameter Set | Total SW [ms] | Total SW/HW [ms] | Total Speed-up | Accel. SW [ms] | Accel. HW [ms] | Accel. Speed-up | SW part sped up by HW [%] |
|---|---|---|---|---|---|---|---|---|
| **Encapsulation** | | | | | | | | |
| FrodoKEM | Frodo-640 | 16.192 | 1.223 | 13.2 | 15.322 | 0.353 | 43.5 | 94.62 |
| FrodoKEM | Frodo-976 | 34.609 | 1.642 | 21.1 | 33.727 | 0.761 | 44.3 | 97.45 |
| FrodoKEM | Frodo-1344 | 62.076 | 2.186 | 28.4 | 61.218 | 1.328 | 46.1 | 98.62 |
| Kyber | Kyber_512 | 0.327 | 0.015 | 21.4 | 0.326 | 0.014 | 23.7 | 99.52 |
| Kyber | Kyber_768 | 0.533 | 0.018 | 29.7 | 0.531 | 0.014 | 32.4 | 99.71 |
| Kyber | Kyber_1024 | 0.784 | 0.022 | 35.4 | 0.783 | 0.021 | 38.0 | 99.80 |
| LAC-v3a | LAC-128-v3a | 0.333 | 0.016 | 22.0 | 0.332 | 0.015 | 22.2 | 99.70 |
| LAC-v3a | LAC-192-v3a | 0.564 | 0.021 | 28.0 | 0.562 | 0.020 | 28.3 | 99.73 |
| LAC-v3a | LAC-256-v3a | 0.906 | 0.024 | 41.0 | 0.905 | 0.022 | 40.6 | 99.8 |
| LAC-v3b | LAC-128-v3b | 0.315 | 0.014 | 23.0 | 0.314 | 0.013 | 23.4 | 99.6 |
| LAC-v3b | LAC-192-v3b | 0.530 | 0.018 | 31.0 | 0.529 | 0.017 | 31.4 | 99.7 |
| LAC-v3b | LAC-256-v3b | 0.870 | 0.021 | 44.0 | 0.868 | 0.020 | 44.4 | 99.8 |
| NewHope | NewHope_512 | 0.348 | 0.021 | 23.9 | 0.346 | 0.020 | 26.6 | 99.5 |
| NewHope | NewHope_1024 | 0.723 | 0.010 | 34.8 | 0.722 | 0.019 | 37.5 | 99.78 |
| Round5 | R5ND_CCA1KEM0d | 0.155 | 0.018 | 8.6 | 0.154 | 0.017 | 9.1 | 99.3 |
| Round5 | R5ND_CCA3KEM0d | 0.245 | 0.024 | 10.4 | 0.244 | 0.023 | 10.8 | 99.59 |
| Round5 | R5ND_CCA5KEM0d | 0.337 | 0.032 | 10.6 | 0.335 | 0.030 | 11.1 | 99.5 |
| Round5 | R5ND_CCA1KEM5d | 0.126 | 0.014 | 8.7 | 0.125 | 0.013 | 9.3 | 99.2 |
| Round5 | R5ND_CCA3KEM5d | 0.241 | 0.020 | 11.9 | 0.240 | 0.019 | 12.4 | 99.5 |
| Round5 | R5ND_CCA5KEM5d | 0.405 | 0.028 | 14.5 | 0.403 | 0.026 | 15.3 | 99.6 |
| Saber | LightSaber-KEM | 0.373 | 0.049 | 7.6 | 0.368 | 0.043 | 8.5 | 98.4 |
| Saber | Saber-KEM | 0.722 | 0.057 | 12.7 | 0.714 | 0.049 | 14.5 | 98.9 |
| Saber | FireSaber-KEM | 1.181 | 0.065 | 18.1 | 1.171 | 0.045 | 21.3 | 99.1 |
| NTRU LPRime | kem/ntrulpr653 | 1.390 | 0.052 | 26.9 | 1.367 | 0.029 | 47.4 | 98.3 |
| NTRU LPRime | kem/ntrulpr761 | 1.704 | 0.060 | 28.6 | 1.679 | 0.034 | 49.6 | 98.4 |
| NTRU LPRime | kem/ntrulpr857 | 2.061 | 0.067 | 30.6 | 2.032 | 0.039 | 52.3 | 98.6 |
| NTRU-HPS | ntruhps2048677 | 2.961 | 0.041 | 71.9 | 2.946 | 0.026 | 115.2 | 99.4 |
| NTRU-HPS | ntruhps4096821 | 4.285 | 0.048 | 88.5 | 4.268 | 0.032 | 135.4 | 99.6 |
| NTRU-HRSS | ntruhrss701 | 2.964 | 0.068 | 43.4 | 2.921 | 0.025 | 115.4 | 98.5 |
| Str NTRU Prime | kem/sntrup653 | 0.776 | 0.049 | 16.0 | 0.752 | 0.024 | 30.9 | 96.8 |
| Str NTRU Prime | kem/sntrup761 | 0.770 | 0.049 | 17.1 | 0.732 | 0.024 | 32.3 | 97.1 |
| Str NTRU Prime | kem/sntrup857 | 1.143 | 0.063 | 18.0 | 1.113 | 0.033 | 33.7 | 97.3 |
| **Decapsulation** | | | | | | | | |
| FrodoKEM | Frodo-640 | 16.192 | 1.321 | 12.3 | 15.214 | 0.343 | 44.4 | 93.9 |
| FrodoKEM | Frodo-976 | 34.649 | 1.866 | 18.6 | 33.532 | 0.750 | 44.7 | 96.7 |
| FrodoKEM | Frodo-1344 | 62.377 | 3.120 | 20.0 | 60.573 | 1.317 | 46.0 | 97.1 |
| Kyber | Kyber_512 | 0.428 | 0.017 | 25.1 | 0.428 | 0.017 | 25.1 | 100.0 |
| Kyber | Kyber_768 | 0.666 | 0.020 | 33.2 | 0.666 | 0.020 | 33.2 | 100.0 |
| Kyber | Kyber_1024 | 0.950 | 0.025 | 38.5 | 0.950 | 0.025 | 38.5 | 100.0 |
| LAC-v3a | LAC-128-v3a | 0.463 | 0.017 | 27.1 | 0.463 | 0.017 | 27.1 | 100.0 |
| LAC-v3a | LAC-192-v3a | 0.782 | 0.024 | 32.9 | 0.782 | 0.024 | 32.9 | 100.0 |
| LAC-v3a | LAC-256-v3a | 1.378 | 0.027 | 51.3 | 1.378 | 0.027 | 51.3 | 100.0 |
| LAC-v3b | LAC-128-v3b | 0.440 | 0.015 | 28.4 | 0.440 | 0.015 | 28.4 | 100.0 |
| LAC-v3b | LAC-192-v3b | 0.741 | 0.021 | 36.0 | 0.741 | 0.021 | 36.0 | 100.0 |
| LAC-v3b | LAC-256-v3b | 1.332 | 0.024 | 55.2 | 1.332 | 0.024 | 55.2 | 100.0 |
| NewHope | NewHope_512 | 0.426 | 0.015 | 28.2 | 0.426 | 0.015 | 28.2 | 100.0 |
| NewHope | NewHope_1024 | 0.891 | 0.024 | 37.4 | 0.891 | 0.024 | 37.4 | 100.0 |
| Round5 | R5ND_CCA1KEM0d | 0.193 | 0.024 | 9.8 | 0.19273 | 0.024 | 9.8 | 100.0 |
| Round5 | R5ND_CCA3KEM0d | 0.309 | 0.028 | 11.2 | 0.30946 | 0.028 | 11.2 | 100.0 |
| Round5 | R5ND_CCA5KEM0d | 0.416 | 0.037 | 11.3 | 0.4158 | 0.037 | 11.3 | 100.0 |
| Round5 | R5ND_CCA1KEM5d | 0.173 | 0.016 | 10.8 | 0.17267 | 0.016 | 10.8 | 100.0 |
| Round5 | R5ND_CCA3KEM5d | 0.319 | 0.023 | 13.7 | 0.31877 | 0.023 | 13.7 | 100.0 |
| Round5 | R5ND_CCA5KEM5d | 0.537 | 0.033 | 16.5 | 0.53721 | 0.033 | 16.5 | 100.0 |
| Saber | LightSaber-KEM | 0.471 | 0.053 | 9.0 | 0.459 | 0.041 | 11.1 | 97.6 |
| Saber | Saber-KEM | 0.867 | 0.065 | 13.4 | 0.851 | 0.048 | 17.7 | 98.1 |
| Saber | FireSaber-KEM | 1.376 | 0.077 | 17.9 | 1.354 | 0.055 | 24.6 | 98.4 |
| NTRU LPRime | kem/ntrulpr653 | 1.856 | 0.071 | 26.2 | | 0.032 | 56.7 | 97.9 |
| NTRU LPRime | kem/ntrulpr761 | 2.289 | | 27.2 | 2.244 | 0.039 | 58.1 | 98.0 |
| NTRU LPRime | kem/ntrulpr857 | 2.787 | 0.098 | 28.6 | 2.736 | 0.047 | 57.8 | 98.2 |
| NTRU-HPS | ntruhps2048677 | 8.175 | 0.095 | 85.8 | 8.114 | 0.034 | 238.0 | 99.2 |
| NTRU-HPS | ntruhps4096821 | 11.982 | 0.107 | 111.9 | 11.912 | 0.037 | 318.6 | 99.4 |
| NTRU-HRSS | ntruhrss701 | 8.790 | 0.136 | 64.8 | 8.700 | 0.046 | 187.7 | 98.9 |
| Str NTRU Prime | | | | | | | | 97.3 |
| Str NTRU Prime | | | | | | | | 97.5 |