Research Article | Open Access

Prediction Modeling for Application-Specific Communication Architecture Design of Optical NoC

Published: 23 August 2022


Abstract

Multi-core systems-on-chip are becoming the state of the art. Therefore, there is a need for a fast and energy-efficient interconnect to take full advantage of the computational capabilities. Integrating silicon photonics with a traditional electrical interconnect in a Network-on-Chip (NoC) offers a promising solution for overcoming the scalability issues of the electrical interconnect. In this article, we derive and evaluate prediction modeling techniques for the design space exploration (DSE) of application-specific communication architectures for an Optical Network-on-Chip (ONoC). Our proposed model accurately predicts network packet latency, contention delay, and the static and dynamic energy consumption of the network. This work specifically addresses the challenge of accurately estimating performance metrics of the entire design space without having to perform time-consuming and computationally intensive exhaustive simulations. The proposed technique, based on machine learning (ML), can build accurate prediction models using only 10% to 50% (best case and worst case) of the entire design space. Across six benchmarks from the Splash-2 benchmark suite, the best of the six evaluated ML prediction models achieves an accuracy, expressed as R2 (coefficient of determination), of 0.99901, 0.99967, 0.99996, and 0.99999 for network packet latency, contention delay, static energy consumption, and dynamic energy consumption, respectively.


1 INTRODUCTION

Due to anticipated advancements in semiconductor manufacturing processes, the increased density of transistors will soon lead to several hundred cores on a single chip [19]. One of the most critical issues in the Chip Multiprocessor (CMP) era is the communication among different on-chip resources. Networks-on-Chip (NoCs) that use conventional RC wires to route data packets on shared channels are a good replacement for traditional dedicated buses [48]. However, due to capacitive and inductive effects, they cannot scale well with respect to performance and power when the number of cores grows to hundreds or thousands. The Optical Network-on-Chip (ONoC) is a hybrid communication architecture that, in addition to traditional electrical wires, utilizes waveguides (integrated optical structures) that transmit information using optical signals, making it a promising solution for efficient on-chip communication. The ONoC demonstrates significant advantages over the electrical NoC because of its high bandwidth (due to Wavelength Division Multiplexing (WDM)), reduced power consumption (due to bit rate transparency), and distance-insensitivity (due to low losses in the optical waveguides) [53].

While designing an ONoC communication architecture, a designer needs to decide on values for many possible design parameters. Modeling, simulation, and evaluation of every design configuration from a large design space are time-consuming and computationally intensive, resulting in increased time to market and increased non-recurring engineering costs. Therefore, designers need methods to explore large design spaces without incurring the high costs of exhaustive simulations. To avoid exhaustive simulations, we propose a Machine Learning (ML)–based prediction modeling technique that predicts latency values (network latency, contention delay) and energy consumption values (static and dynamic) of each design configuration in large design spaces but avoids high simulation cost and provides high accuracy.

In the proposed design methodology, the designer selects the communication architectures to be considered and the parameters to be explored. The first step of the proposed prediction modeling technique is to uniformly and randomly sample different design configurations from the entire design space. Then, the sampled dataset is simulated and divided into a training and testing set. The training set is used to generate an application-specific ML-based prediction model and the testing set is used to compute the prediction error rate. Finally, the prediction model is used to predict latency and energy consumption values for all of the remaining design configurations that are not in the sampled dataset (complement of the sampled dataset) with the computed prediction error rate.
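The workflow above can be sketched end to end in a few lines. This is a minimal illustration, not the article's implementation: `simulate` stands in for a costly Graphite run, `fit` for any of the ML algorithms evaluated later, and design points are assumed to be hashable values.

```python
import random

def r2_score(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

def explore(design_space, simulate, fit, sample_frac=0.1, train_frac=0.8, seed=0):
    """Sample uniformly at random, simulate only the sample, train/test a
    prediction model, then predict the unsimulated complement."""
    rng = random.Random(seed)
    sampled = rng.sample(design_space, int(sample_frac * len(design_space)))
    labeled = [(cfg, simulate(cfg)) for cfg in sampled]        # the costly step
    n_train = int(train_frac * len(labeled))
    train, test = labeled[:n_train], labeled[n_train:]
    model = fit(train)                                         # any regressor
    r2 = r2_score([y for _, y in test], [model(c) for c, _ in test])
    complement = [cfg for cfg in design_space if cfg not in set(sampled)]
    return {cfg: model(cfg) for cfg in complement}, r2
```

With `sample_frac=0.1`, only one-tenth of the design space is ever simulated; every other configuration receives a predicted value accompanied by the test-set R2.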

In this article, we consider two electro-optical ring communication network architectures: the Optical Ring Network-on-Chip (ORNoC) [38] and the ATAC Optical Network-on-Chip (ATAC) [19]. We model the considered communication architecture and, for each benchmark, simulate the sampled design configurations using Graphite [44]. Graphite uses the Design Space Exploration for Network Tool (DSENT) [54] for delay and energy calculation. DSENT provides models for the optical components, the electrical back-end circuitry, and the interface between the electrical and optical components. Furthermore, we consider varying values of eight communication architecture design parameters, shown in Section 4.1.1. The given set of design parameters (Section 3) yields 32,768 design configurations for each benchmark; 26,988 of those are feasible configurations for all benchmarks (see Section 4.1.2). Traditionally, every feasible design configuration would need to be modeled and evaluated (by simulation, emulation, or estimation). We will show that only 10% to 50% of the feasible design configurations need to be simulated to derive our prediction model, hence reducing the time required for simulation by 50% to 90%. Once formulated, our derived prediction model can be used to predict the output metrics for the entire design space, including the configurations that were not simulated. Based on the prediction, the designer may select a small subset of design configurations for detailed evaluation by modeling and simulation.
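A design space of this shape is simply the Cartesian product of the parameter value sets. The sketch below uses hypothetical parameter names, value sets, and a hypothetical feasibility constraint (the real eight parameters are listed in Table 2, and the actual constraints of Section 4.1.2 leave 26,988 feasible configurations); the set sizes 8, 8, 4, 4, 4, 2, 2, 2 are chosen only so that their product is 32,768.

```python
from itertools import product

# Hypothetical parameters; 8 * 8 * 4 * 4 * 4 * 2 * 2 * 2 = 32,768 configurations.
params = {
    "cluster_count":      [1, 2, 4, 8, 16, 32, 64, 128],
    "wavelengths_per_wg": [1, 2, 4, 8, 16, 32, 64, 128],
    "link_width":         [16, 32, 64, 128],
    "flit_size":          [16, 32, 64, 128],
    "buffer_depth":       [2, 4, 8, 16],
    "access_points":      [1, 2],
    "receive_networks":   [1, 2],
    "routing":            ["cluster", "distance"],
}
design_space = [dict(zip(params, v)) for v in product(*params.values())]

# A hypothetical feasibility check standing in for the article's constraints.
feasible = [c for c in design_space if c["flit_size"] <= c["link_width"]]
```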

This work provides a rapid design space exploration opportunity to a designer who wants to capture the trade-offs between many design configurations early on in the design process. Once the early design space exploration is performed, a more detailed analysis of process variation and thermal effects of selected configurations can be performed using the methods in [45, 46]. Please note that prediction modeling of computation components (including communication inside single-core) and memory architecture is out of the scope of this article.

The contributions of this work are:

  • A novel prediction modeling technique for the design space exploration of application-specific ONoC communication architectures. The proposed modeling technique requires only 10% to 50% of the set of feasible configurations to be simulated to predict desired metrics with R2 (coefficient of determination) of 0.99901 to 0.99999.

  • A systematic evaluation and quantitative comparison of a set of traditional ML models for exploring architectural parameters of application-specific ONoC communication architectures.

  • A comprehensive quantitative comparison of ML models and corresponding training set sizes for application-specific ONoC communication architectures. For each pair of benchmark and target, we can determine the minimal training set size needed to create highly accurate ML models.

  • A rich benchmark for modeling of application-specific ONoCs [55]. It includes network latency, contention delay, and static and dynamic energy consumption for all feasible configurations (26,988) for 6 different applications in the Splash-2 benchmark suite [59].

Prior work [31, 32] presents a quantitative evaluation of the dynamic behavior of ORNoC using real application workloads. Additionally, it compared ORNoC performance with ATAC in terms of latency and energy consumption. The results showed that ORNoC improves total energy consumption, on average, by 14.97% compared with ATAC, while latency remains the same. Additionally, Karimi et al. [32] introduced a prediction modeling technique and evaluated different prediction models: using Regression models (Linear Regression and Additive Regression), Neural Network models (RBF Network and Multilayer Perceptron), and Tree models (M5P tree and REP tree), where one unified ML model was formulated for all the benchmarks.

In this work, we improve on the prediction modeling technique from [32] by creating separate, more accurate prediction models specific to each benchmark, making the technique application-specific; by including and evaluating more sophisticated data preprocessing techniques to improve the accuracy of the prediction models; by exploring and evaluating a wide range of ML algorithms for the creation of application-specific prediction models; and by employing state-of-the-art, industry-standard ML tools for model generation. Moreover, this work presents a comprehensive evaluation of all considered prediction models across 3 evaluation metrics, as shown in Section 4.5.

The rest of this article is structured as follows. Section 2 reviews related works. Section 3 justifies the selection of ATAC and ORNoC and provides an overview. Section 4 introduces our design space exploration technique and further explains our prediction model. Section 5 presents the experimental results and Section 6 presents conclusions and discusses future work.


2 RELATED WORK

To the best of our knowledge, we are the first to model the ORNoC architecture, and our work in prediction modeling is unique: to date, neither extensive design space exploration nor prediction modeling has been done in the domain of ONoCs. We compare our work with others in three main categories: (1) ONoC architecture design; (2) simulation, modeling, and benchmarks of NoC architectures; and (3) prediction modeling for design space exploration of NoCs.

2.1 ONoC Architecture Design

New nanophotonic devices—such as modulators, photodetectors, and waveguides, as well as on-chip and chip-to-chip communication architectures—and tools for rapid design and analysis are emerging from research in industry and academia. Silicon nanophotonics [60] can improve bandwidth, latency, and energy by transferring data with light signaling between cores and memory, and it pairs well with existing board-to-board and chip-to-chip photonic offerings. However, more efficient communication architectures and more resilient nanophotonic devices are still needed before they can be mass-produced and silicon nanophotonic communication can be made reliable and robust. This section compares different ONoC architectures.

Chameleon [35] has a reconfiguration layer that can open point-to-point communication channels at runtime to better utilize bandwidth according to the application traffic and to reduce the power consumption of the optical network. A separate electrical network manages this reconfiguration process. QuT [24] proposes an optical architecture with a wavelength assignment algorithm based on WDM to reduce the number of wavelengths and micro-ring resonators but requires an additional optical control network. The need for a control network in QuT and Chameleon increases complexity, resulting in area and power overheads and reducing performance. Shacham et al. [53] propose an approach that uses electrical links for flow control and optical links for data transmission. Because this mechanism reserves an optical path before each transmission, this photonic network-on-chip additionally suffers from high contention delay.

A \( \lambda \)-router [36] is a point-to-point, all-optical, contention-free NoC. The number of wavelengths and switches increases quadratically with the number of nodes in the network for the \( \lambda \)-router, which limits its scalability and increases power consumption and area. Snake [49] is a wavelength-routed ONoC architecture providing point-to-point connections between cores. It uses photonic switching elements that require waveguide crossings. This increases the optical losses significantly, which, in turn, increases power consumption and limits the scalability of the entire architecture.

Bahirat et al. [5] propose a framework to synthesize hybrid photonic NoCs using Particle Swarm Optimization and Simulated Annealing. HELIX [6] proposes a framework for the application-specific synthesis of hybrid NoC architectures using a graph-based algorithm and a heuristic method. However, both need a reservation process before the data transmission phase to avoid collisions at the destination node; the reservation increases both system complexity and communication delay. Additionally, the back ends of both tools use ORION [30] for power estimation, which overestimates the desired metrics by 2x and 7x compared with DSENT [54] and does not model the modules needed as an interface between the electrical and optical parts, and therefore suffers from a lack of accuracy.

Many optical NoC architectures based on various topologies have been proposed. Shacham et al. [4] proposed an augmented-torus ONoC architecture that requires path set-up and tear-down. Additionally, several authors proposed mesh-based ONoCs, such as a three-dimensional (3-D) optical cubic mesh [22], a 3-D optical mesh architecture based on passive routing [62], and an optical mesh architecture based on a hybrid opto-electrical global crossbar [8]. However, mesh-based or torus-based ONoC architectures would have limited network performance for many-core CMPs because the network diameter would become very large [61]. To address the scalability issues, Yao et al. [61] propose a Clos–Benes-based Optical Network-on-Chip architecture. This hierarchical optical architecture uses the Clos topology for the inter-switch interconnect and the Benes topology for the intra-switch interconnect. The architecture requires two levels of control units (one for each type of switch) that implement arbitration and routing. A crossbar-based architecture was proposed in [9] that utilizes 64 wavelengths over 270 waveguides; the waveguides are divided between control and data (256 waveguides) and broadcast and arbitration (14 waveguides). All of these architectures require some sort of control (arbitration or path set-up) that is not required by ORNoC or ATAC.

Abdollahi et al. [2] proposed ONC3, an ONoC architecture based on the Cube-Connected Cycles topology, in which each node of a hypercube is replaced by a cycle of nodes. Insertion-loss-aware task mapping on ONC3 was proposed in [1]. The architecture uses passive routing, has no need for arbitration, and is contention free, but it requires an additional step of checking whether the optical destination is available. Kim et al. [33] propose the use of a Genetic Algorithm to create an irregular application-specific topology that optimizes throughput and optical signal-to-noise ratio (OSNR). The generated topology uses an optimized router architecture but requires an electrical network to establish the path, whereas ORNoC and ATAC do not. Kim et al. [33] present a case study for 16- and 36-core systems; we cannot estimate whether the rapid generation technique or the generated topology would scale well as the number of cores increases. Both of these architectures would need to be generated for the same subset of SPLASH-2 benchmarks as used in this article, and a runtime performance and energy evaluation would be needed to compare them with ORNoC and ATAC.

Li et al. [42] proposed implementations for Ring-, Matrix-, \( \lambda \)-router-, and Snake-based topologies in which the layouts avoiding waveguide crossings are compared with those minimizing the waveguide length according to worst-case and average losses. Evaluation of energy efficiency was performed based on the computed requirement for laser output power, which is estimated from the losses. The results show that the ring topology leads to a 43% reduction in laser output power, also showing the promise of ORNoC and ATAC concerning power efficiency.

Corona [58], Firefly [48], and FlexiShare [48] use an optical crossbar with optical token ring arbitration. The drawback of token-based arbitration is the increased wait time for receiving a token as the number of nodes grows. Corona uses 64 wavelengths to improve latency; however, this requires a large number of micro-ring resonators, resulting in a large area and high power consumption. To reduce the number of waveguides and wavelengths, Le Beux et al. [38] propose the ORNoC architecture, in which the same wavelength can be reused to realize multiple communications on the same waveguide concurrently, with no arbitration. The ATAC [19] architecture also uses a ring topology for communication in its optical network.

The reasons for selecting the ORNoC and ATAC architectures for the experimental evaluation of our prediction modeling are as follows. Compared with Corona, Firefly, and FlexiShare, ORNoC and ATAC have no waveguide crossings due to their ring topology and, hence, have smaller optical losses. Additionally, the ring topology shows a significant reduction in the laser output power requirement compared with Matrix-, \( \lambda \)-router-, and Snake-based topologies. Moreover, compared with these architectures, ORNoC and ATAC need neither a separate control network nor arbitration, which helps reduce overall power consumption, packet latency, and design complexity. Therefore, ORNoC and ATAC are the most promising architectures due to their relative simplicity of design and feasibility of implementation compared with the alternatives.

2.2 Simulation and Modeling of NoC Architecture

Hardware-Assisted Lightweight Performance Estimation (HALWPE) [47] presents a predictive modeling framework for GPU performance. HALWPE uses performance statistics collected from workloads running on current-generation GPUs to predict the performance of next-generation GPUs. Even though it achieves low error rates and faster simulation times than cycle-accurate simulation, its simulation time is still significant, which makes it impractical for early design space exploration. Moreover, HALWPE does not apply to ONoC design space exploration because it relies on running workloads on the previous generation of the same architecture, which does not exist in the case of ONoCs.

The ORION [30] simulator models power and area for design space exploration of an NoC. However, it does not model any optical components and has incomplete architectural and timing models for the router. PhoenixSim [31] is a photonic NoC simulator that models the optical components in an NoC. PhoenixSim lacks electrical models and depends on ORION for modeling all electrical routers and links. We use Graphite [44] for ONoC simulation. Graphite utilizes DSENT [54] for delay and energy calculation. DSENT provides models for the optical devices, the electrical back-end circuitry, and the interface between the electrical and optical parts. By using DSENT as a back end, Graphite can model the delay, area, and energy of both optical and electrical components to within 20% of SPICE simulation [54]. Therefore, Graphite has an advantage over the other simulators, making it the best available solution for design space exploration. Graphite provides fast and scalable performance: it has only a 41x slowdown compared with native execution [44] and, by using Lax with Barrier synchronization, it can approximate cycle-accurate simulation.

2.3 Prediction Modeling for DSE

Traditional DSE efforts fall into several categories: speeding up the simulation itself to make simulating all design space points feasible [3]; raising the abstraction level of the simulated model [56]; increasing the granularity of the simulation [50]; utilizing emulation [21]; using abstract analytic models (covering both hardware and software optimizations) [57]; and employing machine learning methods to predict the outputs for the entire design space based on only a few simulated points. The last category is the scope of our proposed technique. Joseph et al. [29] propose a linear regression model for predicting the performance of different processor architectures, while Lee et al. [39] propose regression modeling for predicting performance and power for various applications executing on any microprocessor configuration; both require numerically solving and evaluating linear systems to determine an efficient formulation of the linear regression function. Extracting all of the nonlinear interactions between one or more parameters and the output is an inherently difficult task, and it limits applicability to parameters with numeric values. Moreover, for finding the parameters that have a significant impact on performance and power, these models rely on designer domain knowledge instead of a stepwise procedure, making them impractical for other domains. Finally, both apply to the architecture of a single processor.

For predicting processor performance in CMP systems, Ipek et al. [26] use a Neural Network (NN) model that requires an extensive number of data points and often suffers from overfitting, which reduces the model's ability to generalize outside of the original dataset. ArchRanker [13] formulates DSE as a ranking problem in which a model is trained to predict which of two microprocessor architectures will perform better; however, it does not precisely estimate the performance of a specific architecture. Jia et al. [27] propose a regression-based, application-specific performance model for design space exploration of GPUs that can predict program runtime. Due to the differences between the GPU and ONoC design spaces, this approach is unsuitable for our domain. Li et al. [40] proposed an efficient and precise DSE methodology combining statistical sampling and the AdaBoost learning technique. The method includes three phases: first, an orthogonal design-based feature selection prunes the design space; second, an orthogonal array-based training data sampling method selects representative configurations for simulation; third, a new active learning approach, ActBoost, builds a predictive model. The proposed technique was refined in [41], which employs a more sophisticated selection method and machine learning techniques. Both methods show that the proposed framework is more efficient and precise than state-of-the-art DSE techniques. All of the previously proposed prediction models target the processor core architecture, while we focus on architectural parameters of the communication network design. Processor core, cache, main memory, and implementation technology parameters are not considered in this work; they will be part of our future work.

Uncore RPD [51] proposes DSE for the uncore (i.e., "outside of the processor core") through simultaneous consideration of memory and NoC parameters, a sophisticated sampling technique, and the use of regression models. Similar to our technique, Uncore RPD reduces the number of simulations required to characterize the NoC design space, but it focuses on memory hierarchy parameters and a few very high-level communication architecture parameters. Jooya et al. [28] propose methods to find the effective range for configuration parameters and a Neural Network-based predictor for the power and performance of an application; finally, they perform Pareto-optimal multi-objective optimization to highlight a smaller subset of configurations that meet a given system goal. The approach was applied to a GPGPU (general-purpose graphics processing unit).

Bahirat et al. [5] and HELIX [6] explore trade-offs in the design space using a set of design parameters orthogonal to the one considered here. Sepulveda et al. [52] propose a method for exploring the channel bandwidth design space for the communication architecture proposed in [35], examining the trade-offs among channel bandwidth configurations (number of waveguides and wavelengths) and performance, area, and power. This work is significant in terms of exploring different design parameters; however, it relies on a "manual" exploration method and is limited to a single communication architecture.


3 ONOC COMMUNICATION ARCHITECTURES

Table 1 compares different ONoC architectures in terms of the number of wavelengths (#WL), the number of waveguides (#WG), and the number of micro-ring resonators (#MR) required to connect a specific number of cores (#Core). The number of wavelengths (#WL) refers to the overall number of different wavelengths in the communication architecture; it serves as a proxy for complexity and is one of the major factors that affect power consumption. The wavelengths are distributed across different waveguides, assuming unidirectional transmission (due to the increased losses of bidirectional transmission) and up to the maximum number of wavelengths per waveguide. We compare FlexiShare [48], Corona [58], QuT [24], the \( \lambda \)-router [36], ATAC [19], and ORNoC [38]. A smaller number of WLs, WGs, and MRs leads to lower network power consumption, a simpler layout, a smaller area, and easier fabrication. The two networks that require the fewest MRs are ORNoC and ATAC. The networks that require the fewest WLs and WGs are QuT and the \( \lambda \)-router, followed by ORNoC and ATAC. However, QuT requires an additional optical control network that increases complexity, resulting in area and power overheads and reduced performance; the \( \lambda \)-router has limited scalability and suffers from waveguide crossings, which lead to increased power losses.

Table 1.
Architecture   #Core   #WG   #WL      #MR
FlexiShare      64     NA     2,464    550,000
Corona          64     388   24,832     45,056
                256    388   24,832  1,056,000
QuT             64       8      128     45,056
                128     16      256    172,000
λ-router        64      32      512     97,792
                128     64    1,024    392,192
ATAC            64      63    4,032      8,064
                128    254   16,256     32,512
ORNoC           64      23    1,418      2,836
                128     91    5,762     11,524
Table 1. Number of Wavelengths, Waveguides, and Micro-ring Resonators in Different ONoC Architectures

Based on the comparison of the architectural features in Section 2.1 and the quantitative comparison above, the ORNoC and ATAC architectures have several advantages: they require no waveguide crossings, resulting in smaller optical losses; they need neither a separate control network nor arbitration, resulting in lower overall power consumption and packet latency; and, as shown here, they require the smallest number of micro-ring resonators, resulting in reduced power consumption and design complexity. All of this leads to the conclusion that ORNoC and ATAC are the most promising architectures due to their feasibility of implementation, making them the focus of this work.
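As a sanity check, the ATAC rows of Table 1 can be reproduced from first principles. The closed form below is our own back-of-the-envelope sketch, not taken from the article: it assumes one dedicated wavelength per ordered endpoint pair (all-to-all channels), at most 64 wavelengths packed per waveguide by WDM, and 2 micro-ring resonators per channel (one modulating, one detecting); these assumptions happen to match the tabulated values.

```python
import math

def atac_counts(n, wl_per_wg=64, mr_per_channel=2):
    """Estimate (#WG, #WL, #MR) for an all-to-all ATAC-style optical network
    of n endpoints under the stated assumptions."""
    wavelengths = n * (n - 1)                        # one per ordered pair
    waveguides = math.ceil(wavelengths / wl_per_wg)  # packed by WDM
    micro_rings = mr_per_channel * wavelengths       # modulator + detector
    return waveguides, wavelengths, micro_rings
```

Under these assumptions, `atac_counts(64)` yields (63, 4032, 8064) and `atac_counts(128)` yields (254, 16256, 32512), matching the ATAC rows of Table 1.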

ORNoC and ATAC topology. An overview of the ORNoC [38] and ATAC [19] architectures is presented here. Both communication architectures have the same hybrid electro-optical topology. Figure 1(a) shows a high-level view of the topology, in which cores are connected with both an electrical interconnect (routers R and the mesh interconnect are shown in green) and an optical interconnect (the ring, as a collection of waveguides, is shown in yellow). The interface for electro-optical and opto-electrical conversion and the clustering are not shown in this figure. The topology consists of two layers: an electrical layer and an optical layer. The electrical layer hosts the processor cores, memory, and an electrical mesh network (EMesh—electrical NoC). The optical layer hosts the optical network, which consists of the waveguides that transmit the information, on-chip lasers, modulators, micro-ring resonators, photodetectors, and through-silicon vias (TSVs). TSVs run vertically, connecting the electrical and optical layers. Once an electrical signal reaches the optical layer, it is converted into light, coupled onto a waveguide, routed through the optical network, delivered to the destination, and converted back into an electrical signal before returning to the electrical layer through a TSV. Conceptually, the optical network implements a ring topology, and the waveguides are physically placed such that they form a serpentine layout (Figure 1(a)). For further details on the topology and layout, as well as the type and geometry of the waveguides and micro-ring resonators and the wavelength range used, see [37] (a case study with one electrical and one optical layer).

Fig. 1.

Fig. 1. (a) Hybrid electro-optical communication architecture in ORNoC and ATAC (source: [32], Figure 1). (b) Illustration of wavelength reuse in ORNoC (source: [38], Figure 1(b)).

ORNoC and ATAC communication and routing. Both architectures consist of an electrical network that connects all cores using a mesh topology and an optical network that connects the "optical hubs" via waveguides using a ring topology. An "optical hub" or "optical access point"1 is assigned to each cluster; it is the designated interface where electro-optical and opto-electrical conversion happens. Both ATAC and ORNoC implement clustering and cluster-based routing. The processing cores are grouped into clusters with a given number of cores, and both communication architectures implement hierarchical electro-optical routing. If the source and destination are in the same cluster, the packets are sent over the electrical (EMesh) network. Each cluster has one or more "access points," that is, cores that connect the cluster to the optical hub and, therefore, enable access to the optical network. If the source and destination are in different clusters and the source core is not an access point itself, the packet is sent via the EMesh to the cluster's access point. After electro-optical conversion, the packet is sent over the optical network to the access point in the destination cluster. In the destination cluster, one or more receive networks handle the incoming packet: the optical hub of the receive network performs opto-electrical conversion, and the electrical EMesh routes the packet to the destination core within the cluster. This is referred to as "cluster-based" routing.

ORNoC can also be designed to implement a "distance-based" routing policy. The selection of the routing policy affects the design topology: for the distance-based routing policy, the cores are "loosely" clustered. The designer defines a "distance threshold" as a number of hops (network segments) in the NoC. If the distance between the source and destination cores is less than the distance threshold, the packet is sent over the EMesh. Otherwise, the packet is sent via the EMesh to an optical hub, then via the optical network to the "destination" optical hub, and from that hub via the receive network to the destination core. The designer decides on the number and placement of optical hubs, in addition to the value of the "distance threshold," at design time.
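The two routing policies can be condensed into a single decision function. This is an illustrative sketch, not the architectures' actual implementation; the helper callables `cluster_of`, `access_point`, and `distance` are hypothetical stand-ins for the design-time cluster map, access-point placement, and hop-count metric.

```python
def route(src, dst, cluster_of, access_point, distance=None, threshold=None):
    """Return the sequence of network segments a packet traverses.

    Cluster-based (ORNoC and ATAC): intra-cluster packets stay on the EMesh;
    inter-cluster packets go EMesh -> optical ring -> receive network/EMesh.
    Distance-based (ORNoC only): a hop-count threshold decides instead.
    """
    if threshold is not None:                       # distance-based policy
        if distance(src, dst) < threshold:
            return ["emesh"]
        return ["emesh", "optical", "emesh"]
    if cluster_of(src) == cluster_of(dst):          # cluster-based policy
        return ["emesh"]
    hops = []
    if src != access_point(cluster_of(src)):
        hops.append("emesh")                        # reach the access point
    hops.append("optical")                          # optical ring transfer
    hops.append("emesh")                            # receive network to dst
    return hops
```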

Utilization of waveguides and wavelengths in ORNoC. Although ORNoC and ATAC have the same topology, the way they realize optical communication using waveguides and wavelengths differs and affects the design complexity and power consumption. ATAC implements all-to-all communication channels, in which each channel uses a dedicated [waveguide, wavelength] pair but the wavelength cannot be reused. ORNoC allows the reuse of a wavelength to realize several independent communication channels on a single waveguide. Figure 1(b) shows a virtual view of a waveguide with multiple wavelengths, multiplexed using WDM. Each concentric circle shows a distinct wavelength, and the collection of circles depicts the waveguide. The wavelengths are assigned to partitions to realize communications. For example, a “red” wavelength is used to realize communication between cores A and B (using only partition p1) but then reused to realize communication between cores B and C (using partition p2), and so forth. For communicating between cores A and B, an optical signal is injected into the waveguide at the source (core A) and ejected at the destination (core B). Core B injects its message into the waveguide using a “red” wavelength, thus reusing the wavelength to communicate with core C.

In ORNoC, a [wavelength, waveguide] pair is statically determined at design time based on the source–destination pair. The optical signal is coupled into the waveguide at the source using a laser source, micro-ring resonator, and modulator. The destination has a micro-ring resonator tuned to the wavelength, which decouples the optical signal, “taking it” out of the waveguide into the optical hub, where it undergoes opto-electrical conversion and is sent to the destination. A more detailed description is available in Section 3 and Figure 2 of [38].

Fig. 2.

Fig. 2. Overview of the proposed prediction modeling for design space exploration technique.

The algorithm [38] for assigning a [wavelength, waveguide] pair to each source–destination pair in the network enables the reuse of a wavelength within a waveguide for multiple communications. The ORNoC architecture mapping algorithm reduces the total number of wavelengths needed, which in turn reduces the number of waveguides and micro-ring resonators. This lowers the complexity, cost, and power consumption of the design.
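The reuse principle can be illustrated with a small greedy sketch; this is not the actual mapping algorithm of [38], and the ring encoding and function names are our own illustrative assumptions. Each communication occupies a contiguous arc of ring segments, and two communications may share a [waveguide, wavelength] channel whenever their arcs are disjoint.

```python
def ring_segments(src, dst, n):
    """Segments occupied by a unidirectional ring communication src -> dst."""
    return {(src + i) % n for i in range((dst - src) % n)}

def assign_channels(comms, n):
    """Greedily map each (src, dst) pair to the first channel whose
    already-occupied segments do not overlap its arc. Channel k stands
    for one [waveguide, wavelength] pair."""
    channels = []      # per channel: set of occupied ring segments
    assignment = []    # channel index chosen for each communication
    for src, dst in comms:
        arc = ring_segments(src, dst, n)
        for k, used in enumerate(channels):
            if used.isdisjoint(arc):       # reuse this wavelength
                used |= arc
                assignment.append(k)
                break
        else:                              # no fit: open a new channel
            channels.append(set(arc))
            assignment.append(len(channels) - 1)
    return assignment
```

On a four-node ring, the communications A→B, B→C, and C→A occupy disjoint arcs and therefore share a single channel, mirroring the “red” wavelength reuse of Figure 1(b); a second communication A→C overlaps A→B and requires a new channel.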

The selection of the architecture and numerous corresponding decisions about values for architectural parameters will affect the performance and energy consumption of the communication architecture running an application. Therefore, we present a technique for prediction modeling that significantly reduces design time while preserving high accuracy.

Skip 4PREDICTION MODELING TECHNIQUE Section

4 PREDICTION MODELING TECHNIQUE

The proposed prediction modeling technique for Design Space Exploration consists of three main steps, as shown at the top of Figure 2: (1) Generating the Dataset, (2) Data Preprocessing, and (3) Prediction Modeling. The designer may repeat the steps while modifying the configuration parameters selected in Step 1. Closing the loop—hence, automating DSE—is out of the scope of this work.

4.1 Generating the Dataset

The flowchart on the left-hand side of Figure 2 shows the steps in dataset generation. Different benchmarks, running on different design configurations of an ONoC architecture, produce different values for output metrics: the packet latency, contention delay, and static and dynamic energy consumption of the network. We identified the most relevant design parameters using analysis in Section 4.1.1. Table 2 shows eight design parameters and the corresponding sets of values. Each design parameter corresponds to a feature, that is, an input to ML algorithms for creating the prediction model. Note that “design parameters” and “features” refer to the same items: we will use the term “design parameter” in the context of ONoC communication architecture design and simulation, and the term “feature” in the context of ML modeling. Similarly, the simulation produces “output metrics” or “outputs” that are in ML context referred to as “targets” (i.e., the values that the ML model will be predicting).

Table 2.

Feature | Design Parameter | Values
\( f_1 \) | Number of Cores | 64, 256
\( f_2 \) | Cluster size | 1, 2, 4, 8, 16, 32, 64, 128
\( f_3 \) | #Receive network | 1, 2, 4, 8, 16, 32, 64, 128
\( f_4 \) | #Optical access point | 1, 2, 4, 8, 16, 32, 64, 128
\( f_5 \) | Routing strategy | Distance-based, Cluster-based
\( f_6 \) | Distance threshold | 2, 4, 8, 16
\( f_7 \) | Architecture | ATAC, ORNoC
\( f_8 \) | Laser | Throttled, Standard

Table 2. List of Features (Parameters) and Their Values

4.1.1 Selecting Design Parameters.

To generate a prediction model with high accuracy, we need to select features with a significant impact on the values of the targets. Including features that have little or no impact on the targets may result in overfitting: the generated prediction model performs well for the feature values used during its generation but poorly for other values of those features.

The optical technology parameters (e.g., ring tuning strategy, type of receiving network, and type of laser) and technology node are excluded from the feature set as they are out of the scope of this work. We also exclude the conservative and aggressive values for the network parameters. For example, for the distance-based routing strategy, a distance threshold value of 1 hop would be considered conservative, as it would imply the need to send the data via the optical network to the adjacent core. Similarly, a distance threshold value greater than 16 hops would be considered aggressive because it would imply the need to send the data via the optical network only for distances that are larger than the network diameter, which is equal to 16.

Next, we consider all other configurable parameters in the network and use the Correlation-based Feature Selection (CFS) algorithm to select a subset of parameters that has a significant impact on the latency and energy consumption of the network. The selected subset of parameters becomes the set of features used for model generation. The CFS algorithm evaluates the weight of each subset of features by considering the individual predictive ability of each feature along with the degree of redundancy between them [23]. We evaluated 30 subsets of features [31] by computing the merit \( M_s \) of a feature subset, using Equation (1): (1) \( \begin{equation} M_s = \frac{k\overline{r_{fo}}}{\sqrt {k+k(k-1)\overline{r_{ff}}}} , \end{equation} \) where S is a feature subset containing k features, \( \overline{r_{fo}} \) is the mean of the feature–target correlation, and \( \overline{r_{ff}} \) is the average feature–feature intercorrelation [23]. The selected subset with the highest merit is shown in Table 2. The feature selection is not limited to the ORNoC and ATAC architectures; it can be generalized and applied to other ONoC architectures.
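Equation (1) can be computed directly. The sketch below, with hypothetical correlation values, shows the merit calculation for one candidate subset; note that lowering the feature–feature intercorrelation raises the merit, which is exactly what CFS rewards.

```python
from math import sqrt

def cfs_merit(k, r_fo_mean, r_ff_mean):
    """Merit M_s of a k-feature subset (Equation (1)):
    k times the mean feature-target correlation, normalized by the
    expected feature-feature intercorrelation of the subset."""
    return (k * r_fo_mean) / sqrt(k + k * (k - 1) * r_ff_mean)

# Hypothetical example: a 2-feature subset whose features correlate
# 0.5 with the target on average and 0.2 with each other.
m = cfs_merit(2, 0.5, 0.2)
```

The subset with the highest merit across all evaluated subsets is the one retained as the feature set.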

4.1.2 Generating the Dataset.

The Cartesian Product (CP) \( \prod _{i=1}^8f_i \) for the identified set of features (\( f_1 \)–\( f_8 \)) defines the entire design space for ONoC communication architecture. Based on Table 2, the cardinality of the CP is equal to 32,768 configurations per benchmark. However, not all data points in this dataset correspond to Feasible Configurations (Figure 3(a)).

Fig. 3.

Fig. 3. (a) Cartesian Product (CP) containing non-feasible and feasible configurations. Feasible configurations make the entire dataset. (b) Sample set, divided into training set and test set.

The feature \( f_6 \) “distance threshold” is used only for the distance-based routing strategy and does not have any meaning for the “cluster-based” routing strategy (i.e., it would have a missing value). Therefore, for all configurations with cluster-based routing, the value of “distance threshold” is set to the maximum distance between any two cores in a cluster. This eliminates half of the configurations, leaving the set with 16,384 configurations.

Additionally, the number of cores needs to be larger than the cluster size. Thus, we need to exclude all possible configurations with 64 cores and cluster size of 64 or 128, a total of 4,096 of them.

Finally, we need to consider the specific limitation for configuring our network given that Equation (2) needs to be satisfied. (2) \( \begin{equation} cluster\: size \ge \#receive\: networks \ge \#access\: points \end{equation} \)

For example, given 256 cores and a cluster size of 128, all combinations of possible values for the number of receive networks and the number of optical access points are possible. However, for a cluster size of 64, neither the number of receive networks nor the number of optical access points can assume the value 128. The same constraint applies to all smaller cluster sizes: the number of receive networks and the number of optical access points must satisfy Equation (2). We exclude, for both 64 and 256 cores, every configuration that violates it.
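The pruning rules for these design parameters can be expressed as a simple filter over the Cartesian product. The sketch below covers only the four-feature sub-space of core count, cluster size, receive networks, and access points; the full rule set of this section additionally handles the routing-strategy and distance-threshold collapsing.

```python
from itertools import product

CORES = (64, 256)
POW2 = (1, 2, 4, 8, 16, 32, 64, 128)   # cluster size, #receive nets, #APs

def is_feasible(cores, cluster, n_recv, n_ap):
    # the number of cores must be larger than the cluster size
    if cluster >= cores:
        return False
    # Equation (2): cluster size >= #receive networks >= #access points
    return cluster >= n_recv >= n_ap

feasible = [c for c in product(CORES, POW2, POW2, POW2) if is_feasible(*c)]
```

Every retained tuple satisfies Equation (2), and all 64-core configurations with cluster sizes of 64 or 128 are dropped, as described above.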

Finally, after the exclusions, the number of feasible configurations is 4,498 per benchmark. The latency and energy consumption of the network depend on the application executed on the ONoC. Therefore, for the six benchmarks from the Splash-2 benchmark suite [59], the total number of data points in the set of feasible configurations (Figure 3(a)) is 26,988. For this study, we simulated all configurations in the set of feasible configurations: some are used to generate the prediction models (training set) and the rest serve as a “golden reference” to evaluate the prediction models (testing set), as described in Section 4.3. In the NoC design flow, the designer needs to simulate only the small subset used for training and testing.

The dataset is generated by simulation of the benchmarks on all feasible configurations on the developed models of ATAC and ORNoC in Graphite and extraction of the values for the outputs (targets) from the simulation output files. The parameters that assume default values during the simulation for all of the configurations are listed in Table 3. The dataset for each benchmark consists of data points, each of which represents a set of values for each of eight features and the corresponding values for the targets.

Table 3.

Parameter | Value | Parameter | Value
Technology node | 45 nm | Cache line size | 64 bytes
Temperature | 300 K | L1 Cache size | 16 KB
Tile width | 1 mm | L2 Cache size | 512 KB
Receive network | StarNet | L1 Associativity | 4
Ring tuning | Athermal | L2 Associativity | 8
Flit width | 64 | Replacement policy | LRU
Laser efficiency | 0.30 | Coupler loss | 2 dB
Waveguide loss | 0.2 dB | Ring drop loss | 1 dB
Ring through loss | 0.01 dB | Modulator loss | 0.01 dB
Photodetector capacitance | 5 fF | Ring heating efficiency | 100 K/mW
Optical link data rate | 2 Gb/s | Link bit error rate | \( 10^{-15} \)

Table 3. Fixed Parameters in Simulations

4.2 Data Preprocessing

Data preprocessing is needed for higher-quality data, which leads to better models and predictions. We apply standard data preprocessing techniques. First, we convert all numeric values to nominal values. Then, we “clean” the data by removing outliers through extreme value analysis (“original” dataset). Additionally, we remove the data points that have missing and very extreme values—for the targets Avg Latency(\( \ge \)100), Static energy (\( \ge \)4,000,000), and Avg Contention (\( \ge \)8,000), as they are the result of erroneous simulation (“clipped” dataset).

Finally, we applied the Isolation Forest algorithm [43] to the clipped dataset (“Isolation Forest” dataset). Isolation Forest is an anomaly detection algorithm that isolates anomalous instances from the dataset efficiently and accurately. It recursively partitions the data points using an ensemble of binary trees; because anomalous instances are easier to isolate, the algorithm identifies them explicitly instead of profiling normal data points. We apply Isolation Forest as an essential outlier-removal step before training the ML models because the presence of outliers inflates the standard deviation: the more extreme the outliers, the more the standard deviation is affected. Therefore, to ensure good performance of our proposed ML models, we remove the outliers using Isolation Forest.
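A minimal sketch of this step with scikit-learn follows; the synthetic data and the contamination rate are illustrative placeholders, not the values used for our dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # "normal" data points
X = np.vstack([X, [[50.0, 50.0]]])        # one extreme outlier appended

# fit_predict returns +1 for inliers and -1 for detected anomalies
labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)
X_clean = X[labels == 1]                  # dataset with outliers removed
```

The retained rows (`X_clean`) form the “Isolation Forest” dataset used in the rest of the flow.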

Figure 4 shows the effect of data preprocessing on the standard deviation values of features and targets for all three datasets. The graphs show that removing data points with missing and very extreme values (clipped) tremendously improves data quality, especially for the values of the targets (the corresponding standard deviations are significantly reduced). Isolation Forest further reduces the standard deviation values. Hence, the Isolation Forest dataset is used in the experiments.

Fig. 4.

Fig. 4. Comparison of standard deviation values with various data preprocessing techniques (logarithmic scale on Y-axis).

4.3 Prediction Modeling

The previous step generates a dataset ready to be used for training a prediction model. We employ several traditional ML algorithms to generate the models: (1) Linear Regression, (2) Support Vector Machine (SVM), (3) Decision Tree, (4) Random Forest, (5) AdaBoost, and (6) Neural Network. A brief description of the models is presented here. For more details, refer to [12].

4.3.1 Linear Regression.

Linear regression is a type of regression modeling in which there is a linear relationship between the target variable and the variables representing the features. The linear relationship is expressed as a linear combination of the features with weights to be learned: \( h_W(f) = w_0 + w_1f_1 + w_2f_2 + \cdots + w_kf_k \), where \( h_W(f) \) is a hypothesis function that predicts the target value, \( f_i \) are feature values, and \( w_i \) are weights (for \( i= 0,\ldots ,k \)). We define the \( Loss(W) \) function (Equation (3)) for a given set W of weights as the sum of the squared differences between the targets predicted by the hypothesis function \( h_W(f^i) \) and the observed outputs \( y^i \) in the dataset: (3) \( \begin{equation} Loss(W) = \sum _{i=1}^{n}(h_W(f^i) - y^i)^2 . \end{equation} \)

The Gradient Descent Algorithm is used to find the values of the weights \( w_i \) such that the value of \( Loss(W) \) is minimal, resulting in the “best fit” for the given set of data points.
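The loss minimization can be sketched in a few lines of NumPy: batch gradient descent on the (mean) squared loss of Equation (3). The toy data, learning rate, and iteration count below are illustrative assumptions.

```python
import numpy as np

# Toy data generated from y = 1 + 2*f, so the best-fit weights are known
f = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * f
X = np.column_stack([np.ones_like(f), f])    # bias column + one feature

W = np.zeros(2)                              # weights (w0, w1)
lr = 0.05                                    # learning rate
for _ in range(2000):
    # gradient of the mean squared loss with respect to W
    grad = 2.0 * X.T @ (X @ W - y) / len(y)
    W -= lr * grad
# W now approximates (w0, w1) = (1, 2), the "best fit" weights
```

Each iteration moves the weights opposite the gradient of \( Loss(W) \), converging to the minimizer for this convex loss.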

4.3.2 Support Vector Machine (SVM).

The Support Vector Machine (SVM) is a well-known supervised learning model that can also be used for regression problems. When the data are non-separable, the SVM can be transformed into its dual form. The dataset is represented in n-dimensional space, in which each data point corresponds to a point with a concrete value for each of the n features. The intuition behind employing the SVM in this work is to fit an optimal hyperplane in this n-dimensional space. It is more robust and efficient than the linear regression model. In this study, we use a linear kernel to predict the outputs.

4.3.3 Decision Tree and Random Forest.

A decision tree is a non-parametric supervised learning method used to create a model that predicts the value of a target by learning simple decision rules inferred from the data features. It is best suited for classification problems but can be used for regression problems as well (ours is a regression problem), in which case it is referred to as a regression tree.

A decision tree is constructed by recursive partitioning: starting from the root node, each node can be split into left and right child nodes. These nodes can then be further split, becoming parent nodes for their sub-tree. Our implementation uses an optimized version of the Classification and Regression Trees (CARTs) algorithm with pruning. For regression predictive modeling problems with targets that are continuous values, such as ours, the model will choose the node to be split if the split will minimize Mean Absolute Error (MAE, Equation (4)) and Mean Squared Error (MSE, Equation (5)), where \( X_m \) is the training data in a node m. (4) \( \begin{align} \begin{aligned}MAE(X_m) = \frac{1}{N_m} \sum _{i \in N_m} \left|y_i - \underset{i \in N_m}{\mathrm{median}}(y_i)\right| \end{aligned} \end{align} \)

Random Forest (RF) is an ensemble technique that consists of a large collection of independent decision trees; hence, it applies to both classification and regression problems. The original training set is randomly divided, and a tree model is generated for each subset independently. Consequently, RF can average out noise and reduce the variance among individual decision trees. In general, Random Forest outperforms a single Decision Tree but is computationally more intensive, especially if no limit is set on the number of trees. We set the number of trees to 100 and the minimum number of samples per split to 2, which reduces the model training time without affecting accuracy. As shown in the Experimental Results (Section 5), Decision Tree and Random Forest are the best-performing models for our dataset.
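With scikit-learn, the two tree-based regressors are configured as follows. The synthetic arrays are only placeholders for the preprocessed feature/target data; the Random Forest hyperparameters match the values stated above (100 trees, minimum of 2 samples per split).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 8))   # 8 features, as in Table 2
y = X[:, 0] * 10 + X[:, 1] ** 2            # placeholder target

# CART-based regression tree (scikit-learn's implementation)
dt = DecisionTreeRegressor(random_state=0).fit(X, y)

# 100 trees and a minimum of 2 samples per split, as in our setup
rf = RandomForestRegressor(n_estimators=100, min_samples_split=2,
                           random_state=0).fit(X, y)
```

`rf.predict(...)` on held-out configurations then feeds the evaluation metrics of Section 4.5.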

4.3.4 AdaBoost.

AdaBoost is a popular boosting method that belongs to the family of ensemble models. AdaBoost repeatedly modifies the training data and takes a weighted majority vote to generate the final prediction model. In each boosting iteration, weight coefficients are applied to the training samples and the learning algorithm is retrained; this forces the weak learner to focus on the training samples missed in the previous step. In this study, we use AdaBoost to fit a regressor on the re-weighted data.
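A short scikit-learn sketch of this regressor follows; the data and the number of estimators are illustrative assumptions (by default, scikit-learn boosts shallow regression trees as the weak learners).

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 8))
y = X[:, 0] * 10 + X[:, 1] ** 2            # placeholder target

# Each boosting round re-weights the samples the previous weak
# learner predicted poorly, then refits on the adjusted data.
ada = AdaBoostRegressor(n_estimators=50, random_state=0).fit(X, y)
```

As with the other models, the fitted regressor is then scored with the metrics of Section 4.5.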

4.3.5 Deep Neural Networks.

We focus on a multilayer perceptron (MLP) neural network, which is an artificial neural network composed of at least three layers of nodes: an input layer, a hidden layer, and an output layer. By training on a dataset, the MLP network learns a function \( R^m \rightarrow R^o \) that maps each input \( x \in R^m \) to an output \( y \in R^o \), where m is the number of dimensions for the input and o is the number of dimensions for the output. Given a set of features \( f_1, f_2,\ldots , f_m \) and a target y, it can learn a non-linear function approximator for regression. We implement an MLP model using Stochastic Gradient Descent (SGD). SGD updates the parameters using the gradient of the loss function with respect to the parameters being adapted. Usually, a larger dataset is needed to train a deep learning model; due to the limited number of samples in this work, the performance of the deep learning model falls short for the target “Static Energy Consumption.” Additionally, a Neural Network usually requires a large number of features. Despite these theoretical limitations, our results show that Neural Network models perform well for “Average Latency” and “Average Contention Delay.”
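An MLP regressor trained with SGD can be sketched with scikit-learn as follows; the layer sizes, learning rate, and synthetic data are illustrative assumptions, not our tuned configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(300, 8))
y = X[:, 0] * 10 + X[:, 1]                 # placeholder target

Xs = StandardScaler().fit_transform(X)     # feature scaling helps SGD
mlp = MLPRegressor(hidden_layer_sizes=(32, 32), solver="sgd",
                   learning_rate_init=0.01, max_iter=2000,
                   random_state=0).fit(Xs, y)
pred = mlp.predict(Xs[:3])
```

Any inputs at prediction time must be scaled with the same fitted `StandardScaler`.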

Inputs to all models are features selected in Section 4.1.1; a distinct model is generated for each of the four targets.

4.4 Evaluation of Prediction Models

A fair evaluation of a model requires disjoint datasets for model training and testing. Therefore, we split the dataset into two separate sets: a training set used for model training (formulation) and a test set used to evaluate the model's performance (error calculation). This guarantees a fair and unbiased model evaluation. The entire dataset is generated by simulation of design configurations, as discussed in Section 4.1.2. The data are preprocessed (Section 4.2) and split into the training and testing sets, and various ML-based models are generated (Section 4.3), one per target.

After formulating each model, we predict values for the targets and compute evaluation metrics to compare the predicted values to the simulation output values. We compute the evaluation metrics (Section 4.5) for the models using tenfold cross-validation. The original dataset is randomly partitioned into ten equal-sized subsets. A single subset is retained as the cross-validation set for testing, and the remaining nine subsets are used for training. The cross-validation process is then repeated ten times, so that each of the ten subsets is used exactly once as the cross-validation set. The results from all iterations are averaged to generate a single value for each evaluation metric of the model. In Section 5, we discuss our experimental results for each model.
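This tenfold procedure maps directly onto scikit-learn's KFold and cross_val_score utilities; the sketch below uses a noiseless linear dataset and a linear model purely as placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

X = np.arange(50, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0                  # noiseless linear target

# ten folds: each fold is the test set exactly once
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
mean_r2 = scores.mean()                    # averaged over the ten folds
```

`scores` holds one R2 value per fold; the average is the single value reported for the model.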

4.5 Evaluation Metrics

To evaluate the performance of the different ML models, we employ the commonly used measurements MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and R-Squared (R2, or coefficient of determination). MSE is the average of the squared differences between the original and predicted values over the dataset: (5) \( \begin{equation} MSE = \frac{1}{N}\sum _{i=1}^{N}(y_{i}-\hat{y}_{i})^{2} \end{equation} \)

RMSE is the square root of MSE: (6) \( \begin{equation} RMSE =\sqrt {MSE} =\sqrt {\frac{1}{N}\sum _{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}} \end{equation} \)

MSE measures the squared vertical distance from the predictions to the original values; RMSE measures the corresponding deviation in the units of the target. Therefore, we adopt both measurements to evaluate the performance of our prediction models. A model with a good fit has values close to zero for both MSE and RMSE; a model with a perfect fit has MSE and RMSE equal to 0.

R2 quantifies the proportion of the variance in the target that is explained by the model: (7) \( \begin{equation} R^{2} = 1- \frac{\sum (y_{i}-\hat{y}_{i})^{2}}{\sum (y_{i}-\bar{y})^{2}} , \end{equation} \) where \( \hat{y}_{i} \) is the predicted value of \( y_{i} \), and \( \bar{y} \) is the mean value of y. A model with a good fit has an R2 close to one; a model with a perfect fit has R2 equal to 1, while a model with a poor fit can yield a large negative value.

The most commonly used measurements for evaluating regression models are MAE, MSE, RMSE, and R2. In this work, we select MSE, RMSE, and R2 as our major evaluation metrics. First, MAE is not differentiable at zero, which makes it harder to manipulate mathematically than MSE. Second, although MAE is robust to outliers, we have already applied the Isolation Forest algorithm to remove the outlier instances. Last, in practice, these three metrics have proven effective in evaluating the performance of regression models.
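The three metrics of Equations (5)–(7) correspond directly to scikit-learn functions; a small worked example with hand-checkable values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0])   # simulated ("original") values
y_pred = np.array([1.0, 2.0, 4.0])   # model predictions

mse = mean_squared_error(y_true, y_pred)   # (0 + 0 + 1) / 3
rmse = np.sqrt(mse)                        # Equation (6)
r2 = r2_score(y_true, y_pred)              # 1 - 1/2 = 0.5
```

Here the single unit-sized error yields MSE = 1/3, and since the total variance around the mean (2.0) is 2, R2 = 1 − 1/2 = 0.5.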

Skip 5EXPERIMENTAL RESULTS Section

5 EXPERIMENTAL RESULTS

5.1 Experimental Setup

We use the Graphite [44] simulator to simulate all 26,988 configurations. Graphite is not a cycle-accurate simulator; however, it supports several synchronization strategies that provide a trade-off between timing accuracy and simulation performance. Although cycle-accurate simulators provide extremely accurate results, they typically have 1,000\( \times \) to 100,000\( \times \) worse simulation time [44]. To provide the best timing accuracy, we use Lax with Barrier synchronization, in which all active threads wait on a barrier until each thread reaches a specified number of cycles (in our case, 1,000 cycles). Frequently waiting on the barrier keeps the cores tightly synchronized and emulates a cycle-accurate simulation.

Graphite uses DSENT [54] as a back-end for delay and energy calculation. DSENT provides models for the optical components, the electrical back-end circuitry, and the interface between electrical and optical components. DSENT calculates the values for energy, delay, and area parameters based on the supplied technology node, providing Graphite with the values necessary to model the delay, area, and energy of both optical and electrical components. The values of parameters that are constant for all simulations are presented in Table 3. For definitions of the parameters Technology node, Temperature, Cache line size, L1 Cache size, L2 Cache size, L1 Associativity, L2 Associativity, and Replacement policy, refer to [25]; for Tile width, Ring tuning, and Receive network, see [34]; for Flit width, (Optical) link data rate, and Link bit error rate, see [10]; for Laser efficiency, Waveguide loss, Photodetector capacitance, and Ring heating efficiency, see [14]; and for Ring through loss, Ring drop loss, Coupler loss, and Modulator loss, refer to [11].

Benchmarks. We use the six benchmarks from the Splash-2 suite [59]: Radix, Barnes, Ocean, Cholesky, Lu, and Water. Splash-2 is a standard for comparing the performance of parallel systems. The benchmarks use shared memory and the Pthread library. The traffic was generated by simulating execution of each of the six different benchmarks from the Splash-2 suite on Emesh, ORNoC, and ATAC models implemented in a Graphite simulator. The benchmarks exhibit various communication patterns. For example, Radix has high rates of unicast traffic, Barnes has a high rate of broadcast traffic, and Ocean has an approximately equal amount of unicast and broadcast traffic. A short benchmark description follows. For more details and the instruction breakdown, see [7, 59].

Radix: Radix implements an integer radix sort. The algorithm operates on each radix r, causing all-to-all communication.

Barnes: Barnes implements the Barnes-Hut hierarchical N-body method to simulate the interaction of a system of bodies (galaxies) or particles under the influence of physical forces in three dimensions over many timesteps. Communication patterns are dependent on particle distribution.

Ocean: Ocean simulates large-scale ocean movements based on eddy and boundary currents. The grids are partitioned into square-like sub-grids to improve the communication-to-computation ratio.

Cholesky: Cholesky is a linear algebra algorithm used to decompose a Hermitian positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose. Compared with Lu, Cholesky operates on sparse matrices with a larger communication-to-computation ratio for comparable problem sizes, and there is no global synchronization between the steps.

Lu: The Lu kernel includes a dense matrix that is factored into the product of a lower triangular and upper triangular matrix. The matrix is divided into blocks (for temporal locality) and the elements inside the blocks are assigned contiguously (for spatial locality).

Water: Water evaluates forces and potentials that occur over time in a system of water molecules. Communication of molecules happens when molecules move into and out of cells, which causes the cell lists to be updated.

Implementation. We use the Python [20] programming language for statistical analysis, data preprocessing, and building our prediction models. For training ML models, we used Scikit-learn (version 0.20.3) [17], which is built on NumPy [15] and matplotlib [18]. Input data is presented as NumPy arrays, and the evaluation metrics were computed using Scikit-learn functions for MSE, RMSE, and R2-score [16]. Each iteration of Ten-Fold Cross-Validation generates values for MSE, RMSE, and R2. The values for each iteration are accumulated and averaged, and averages are presented. Due to the precision of data representation and operations, in some cases, RMSE is not equal to the \( \sqrt {MSE} \). We acknowledge the precision issue and compare the values reported by the error models.

We generated prediction models using six different ML algorithms: Linear Regression, SVM, Decision Tree, Random Forest, AdaBoost, and MLP Neural Network. A separate model is generated for each benchmark and each of the four targets, for 144 models in total. We vary the training set size in increments of 10%, from 10% to 90% of the entire set after preprocessing. For each training set size, we run Ten-Fold Cross-Validation and report average values for each metric. We ran a total of 1,296 experiments, each implementing Ten-Fold Cross-Validation.

Runtime evaluation. The generation time for all 1,296 prediction models was 1 day, running concurrently on an Intel Core i7-8650u CPU at 1.9 GHz with 16 GB of RAM. This is a significantly shorter time compared with generating the entire dataset using the simulation. The most computationally intensive prediction model is the one using neural networks. It took 2 hours to train all neural network models for Ten-Fold Cross-Validation for each training set size (from 10%–90%) for the entire design space. On average, it took 56 seconds to train and test even the most computationally intensive model. However, each simulation took between 5 minutes to 1 hour, and the total time to simulate the entire design space was 4 months. Our experimental results show that a designer can simulate only a small subset of the entire design space, for example, 10% for training and another 10% for testing. Then, the designer can use our prediction modeling technique to predict the values for the remaining 80% of design space within a single day. The proposed technique significantly reduces the requirement for time-consuming and computationally intensive simulation during the early system design process.

5.2 Evaluation of Different Models

We present a quantitative analysis of all ML models grouped by target. Due to space constraints, we present a subset of benchmarks for each target that show various trends for the evaluation metrics. In each figure, we show in bold the best value for each metric (MSE, RMSE, R2), and in red and bold the best value across all ML models for an observed target.

5.2.1 Target: Average Network Latency (Tables 46).

Across all benchmarks except Lu, the Random Forest algorithm generates models with the best accuracy for all 3 metrics. The training set size of 10% generates the minimum values for MSE and RMSE and the maximum for R2 for Radix and Barnes (Barnes, Ocean, and Water not shown) but only the minimum RMSE for Lu. For the Lu benchmark, the minimum MSE and maximum R2 were obtained with the Decision Tree algorithm for a training set size of 10%. This is the only benchmark for which the best values for the 3 metrics are split across two models, although the difference is small: the Random Forest MSE is 0.00675 and the R2 is 0.99895, compared with the Decision Tree MSE of 0.0065 and R2 of 0.99901. The designer may trade the convenience of having a single prediction model for all benchmarks against a small penalty in accuracy for MSE and R2 and use Random Forest to predict the target value for Lu as well. The best-performing model for the Cholesky, Ocean, and Water benchmarks is also generated using the Random Forest algorithm but needs training set sizes of 50%, 40%, and 30%, respectively. However, if a slight trade-off in accuracy is acceptable, the training set size for Cholesky may be reduced to 20%, or even to 10%: a 50% training set size has R2 0.99408, compared with R2 0.99206 and 0.99204 for the 20% and 10% training set sizes, respectively.

Table 4.

Table 4. Radix Benchmark—Latency: MSE, RMSE, R2 versus Training Set Size

Table 5.

Table 5. Lu Benchmark—Latency: MSE, RMSE, R2 versus Training Set Size

Table 6.

Table 6. Cholesky Benchmark—Latency: MSE, RMSE, R2 versus Training Set Size

5.2.2 Target: Average Contention Delay (Tables 79).

Across all benchmarks, the Random Forest algorithm generates models with the best accuracy for all 3 metrics. The models for the Lu, Ocean, and Radix benchmarks (Lu, Ocean, and Water not shown here) require a training set size of 10%, whereas Cholesky requires 20%. The Barnes benchmark would require a training set size of 50% to obtain the best value for all 3 metrics. If willing to slightly sacrifice accuracy, a training set size of 10% could also be used for Barnes (with R2 0.99342, compared with 0.99416 for 50%).

Table 7.

Table 7. Radix Benchmark—Contention Delay: MSE, RMSE, R2 vs. Training Set Size

Table 8.

Table 8. Barnes Benchmark-Contention Delay: MSE, RMSE, R2 versus Training Set Size

Table 9.

Table 9. Cholesky Benchmark—Contention Delay: MSE, RMSE, R2 versus Training Set Size

5.2.3 Target: Average Static Energy Consumption (Tables 1012).

Across all benchmarks, the Decision Tree algorithm generates models with the best accuracy for all 3 metrics. With a training set size of 10%, Radix, Barnes, Cholesky, Lu, and Water (Radix, Barnes, and Cholesky not shown here) obtain the best accuracy. The Ocean benchmark has the best value for MSE (\( 5.884 \times 10^{-5} \)) for Decision Tree using 10%, the best value for RMSE (0.00770) for Decision Tree using 20%, and the best value for R2 (0.99897) for Random Forest with 10%. However, with a small penalty in accuracy, we could use Decision Tree with the training set size of 10% for the Ocean benchmark as well, with \( 6.579 \times 10^{-5} \) and 0.00784 for MSE and RMSE, respectively.

Table 10.

Table 10. Lu Benchmark—Static Energy Consumption: MSE, RMSE, R2 versus Training Set Size

Table 11.

Table 11. Water Benchmark—Static Energy Consumption: MSE, RMSE, R2 versus Training Set Size

Table 12.

Table 12. Ocean Benchmark—Static Energy Consumption: MSE, RMSE, R2 versus Training Set Size

5.2.4 Target: Average Dynamic Energy Consumption (Tables 13–15).

For most of the benchmarks, the Decision Tree algorithm generates models with the best accuracy for all three metrics. With a training set size of 20%, Barnes, Cholesky, Lu, and Ocean (Cholesky, Lu, and Ocean not shown) obtain the best accuracy, whereas Radix requires a training set size of only 10%. One exception is Barnes, which has a smaller MSE for the model generated using Random Forest with a training set size of 10%. Another exception is Water, which has the best MSE for the Decision Tree model with a training set size of 40% and the best R2 for the Random Forest model with a training set size of 20%; both models for the Water benchmark have the same RMSE.

Table 13. Barnes Benchmark—Dynamic Energy Consumption: MSE, RMSE, R2 versus Training Set Size

Table 14. Radix Benchmark—Dynamic Energy Consumption: MSE, RMSE, R2 versus Training Set Size

Table 15. Water Benchmark—Dynamic Energy Consumption: MSE, RMSE, R2 versus Training Set Size

5.3 Prediction Models: Visualization, Discussion, and Applicability to DSE

Figures 5 to 7 visualize the accuracy of the prediction models. Each dot in the figures indicates the actual value obtained by simulation (x-axis) versus its predicted value (y-axis) for the observed ML model. If the dots lie along the line y = x (dashed black line), the predicted values equal the simulated values; the closer the dots are to the y = x line, the better the prediction.

Fig. 5. Prediction visualization for Linear Regression and SVM.

Linear Regression and SVM uniformly show poor performance on this dataset. For many benchmark/target pairs, the accuracy in terms of R2 does not exceed 0.6 for training set sizes from 10% to 80%. The poor fit can be observed in Figure 5. For the Linear Regression model for Average Latency on Lu, generated using a training set size of 60% with R2 = 0.40776 (Figure 5(a)), the dots are scattered within a wide envelope around the y = x line, showing that the predicted values do not match the simulated values. For the SVM model for Dynamic Energy Consumption on Cholesky with a training set size of 70% (Figure 5(b)), the model predicts the dynamic energy consumption to be one of only two values and misses the simulated values by far; hence the R2 value of –12.59192.

AdaBoost and Neural Network do not show consistent performance. AdaBoost reaches R2 in the range of 0.8 to 0.9 but requires a large training set size (20%–90%). An example of poor performance can be seen in Figure 6(a), plotted for the Average Contention Delay target (Water, training set size of 30%), in which the predicted values do not correspond well to the simulated values (dots far from the line y = x). In this case, AdaBoost creates a prediction model that predicts only a few distinct values for the Average Contention Delay target, thus missing the simulated values by far.

Fig. 6. Prediction visualization for AdaBoost and Neural Network.

For some benchmark/target pairs, Neural Network shows good performance: R2 around 0.97 for a training set size of 20% (Average Latency, Radix, Table 4) and for a training set size of 10% (Average Contention Delay, Cholesky, Table 9); for the others, the performance is poor: R2 is less than zero (Dynamic Energy Consumption, Barnes, Table 13). An example of a good fit achieved with Neural Network is the model for the Average Latency target (Radix, training set size of 20%), as seen in Figure 6(b). Our experiments show that across all targets, Neural Network models overfit as the training set size grows. Additionally, models created using Neural Network predict Average Latency and Average Contention Delay well (R2 ranges from 0.91 to 0.97) but predict Average Static and Dynamic Energy Consumption poorly (R2 is less than 0.3).

Across all benchmarks and targets, two models consistently stand out: Random Forest and Decision Tree. While Random Forest appears better suited for Average Network Latency and Average Contention Delay, Decision Tree appears better suited for Average Static and Dynamic Energy Consumption. Figure 7 shows examples of an excellent fit for Random Forest with an R2 of 0.99206 (Cholesky, Average Latency, training set size of 20%) and for Decision Tree with an R2 of 0.99996 (Water, Static Energy Consumption, training set size of 10%), where the vast majority of the dots fall on the y = x line.

Fig. 7. Prediction visualization for Random Forest and Decision Tree.
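As an illustration of how these two model families can be trained and scored, the sketch below uses scikit-learn [17]. It is hypothetical: the synthetic dataset stands in for simulated design-space points, and the features, target function, and hyperparameters are not those of our actual experiments.

```python
# Hypothetical sketch: fitting the two best-performing model families
# (Decision Tree and Random Forest) on a synthetic stand-in dataset and
# scoring them with R2, following the methodology described in Section 5.
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Three integer "architecture parameters" and a smooth synthetic target
# playing the role of a simulated latency-like metric.
X = [[i % 8, (i // 8) % 8, i // 64] for i in range(512)]
y = [2.0 * a + 0.5 * b * c for a, b, c in X]

# Train on 10% of the design space; predict the remaining 90%.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.1, random_state=0)

scores = {}
for model in (DecisionTreeRegressor(random_state=0),
              RandomForestRegressor(n_estimators=100, random_state=0)):
    model.fit(X_tr, y_tr)
    scores[type(model).__name__] = r2_score(y_te, model.predict(X_te))
```

The same loop extends naturally to the other training set sizes and the other model families compared in this section.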

In general, prediction errors for a model come from the model's inability to capture the nature of the training data well enough to make accurate predictions on the test data, not only on the training data. For example, there is no linear relationship between the feature values and the targets for Average Latency (Figure 5(a)); therefore, Linear Regression is not a suitable ML algorithm for predicting Average Latency. Additionally, the SVM (Figure 5(b)) does not predict Dynamic Energy Consumption well; we hypothesize that the poor prediction is due to the use of a linear kernel, which may not be suitable for this dataset. This article focuses on the evaluation of traditional ML models. In the future, we will investigate other ML models and further fine-tune the ones that showed good performance to further reduce the prediction error.
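The point about the missing linear relationship can be made concrete with a small, hypothetical example: fitting a line by ordinary least squares to a symmetric nonlinear target degenerates to predicting the mean, yielding an R2 of (approximately) zero. The quadratic target below is a stand-in, not our latency data.

```python
# Illustration of why Linear Regression fails on a nonlinear target:
# for y = x^2 over a symmetric range, the least-squares slope is ~0,
# so the fit predicts the mean and R2 is ~0 (no better than the mean).
xs = list(range(-10, 11))
ys = [x * x for x in xs]          # nonlinear target

# Ordinary least squares for a single feature.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
preds = [slope * x + intercept for x in xs]

ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r2 = 1.0 - ss_res / ss_tot        # ~0 here: no better than predicting the mean
```

A tree-based model, by contrast, can partition such a feature space into regions and fit each region separately, which is consistent with the strong results for Decision Tree and Random Forest above.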

Prediction errors depend not only on the ML algorithm but also on the training set size, which needs to be determined by experimentation for each ML algorithm and each dataset. A training set size that is too small provides insufficient information for the resulting model to learn the nature of the dataset. For example, for the Cholesky benchmark, the Latency target, and the Random Forest algorithm (Table 6), training set sizes from 10% to 40% generate models with R2 between 0.9902 and 0.99206, whereas a training set size of 50% provides the best R2: 0.99408.

Similarly, a very large training set causes reduced accuracy due to overfitting: the resulting model is tailored too specifically to the training data and does not perform well on the test data. For the Cholesky benchmark, the Latency target, and the Random Forest algorithm (Table 6), this is the case for training sets larger than 50%. Neural Network models have been shown to suffer from overfitting on our dataset as the training set size increases. One such example is the Radix benchmark for the Latency target (Table 4): R2 is 0.97190 for a Neural Network with a 20% training set size, and R2 decreases (not linearly) as the training set size grows, reaching 0.65119 for a training set size of 90%. Finding the most appropriate ML model and corresponding training set size is thus an important step toward achieving a prediction model with high accuracy.

To enable a comprehensive comparison and evaluation of traditional ML models, we generated a total of 1,296 experiments (6 ML models, 6 benchmarks, 4 targets, and training set sizes from 10% to 90% in increments of 10%, with ten-fold cross-validation) and computed the MSE, RMSE, and R2 for the predicted versus simulated values. Such a comprehensive approach is not needed in practice: based on the conclusions of this study, a designer may choose one or two of the best-performing ML models and focus on a smaller number of training set sizes (10% to 50% in increments of 10%) instead of generating all 1,296 models.
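The experiment grid can be enumerated mechanically. The sketch below reproduces the count of 1,296 configurations; the label strings are ours, but the dimensions (6 models × 6 benchmarks × 4 targets × 9 training set sizes) are those listed above.

```python
# Enumerate the full experimental grid described in Section 5:
# one configuration per (ML model, benchmark, target, training set size).
from itertools import product

models = ["Linear Regression", "SVM", "AdaBoost", "Neural Network",
          "Decision Tree", "Random Forest"]
benchmarks = ["Barnes", "Cholesky", "Lu", "Ocean", "Radix", "Water"]
targets = ["Latency", "Contention Delay",
           "Static Energy Consumption", "Dynamic Energy Consumption"]
train_sizes = [s / 100 for s in range(10, 100, 10)]   # 10% .. 90%

# 6 x 6 x 4 x 9 = 1,296 experiments, each evaluated with ten-fold CV.
experiments = list(product(models, benchmarks, targets, train_sizes))
```

Restricting the sweep to the two best-performing models and training set sizes of 10%–50% shrinks the grid to 2 × 6 × 4 × 5 = 240 configurations.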

6 CONCLUSION AND FUTURE WORK

This article presents an application-specific ML-based prediction modeling technique for design space exploration of communication architecture parameters for the optical network on chip in a multi-core system-on-chip. The proposed prediction model addresses a fundamental challenge of the early system design process: the need to accurately estimate the desired metrics without incurring high simulation costs. Our study shows that the proposed technique can build a prediction model with R2 as high as 0.99901, 0.99967, 0.99996, and 0.99999 for network packet latency, contention delay, static energy consumption, and dynamic energy consumption, respectively. Additionally, we showed that only 10% to 50% (best case and worst case) of the design space points need to be simulated to build the prediction model. Finally, we compared the performance of six different ML models and systematically evaluated the training set size requirement. Our experiments show that Random Forest and Decision Tree generate the best-performing prediction models for Average Network Latency and Average Contention Delay, and for Average Static and Dynamic Energy Consumption, respectively.

We plan to explore more sophisticated ML techniques (such as customized SVMs) and to formulate the prediction problem as a classification problem to better train neural networks and further improve the accuracy of our prediction models. Additionally, we can extend our work to other ONoC architectures, apply the technique to memory and core design parameters, and create a comprehensive prediction modeling technique for the overall system. Finally, we plan to investigate traditional NoC performance prediction models, such as queuing theory-based prediction models, and compare them with the ML-based models.

ACKNOWLEDGMENT

The authors would like to thank Evan Livingstone-White and Amine Mhedbhi for their early work on the implementation of the ONoC model, and Hardit Singh for researching some of the related work.

Footnotes

  1. In this article, an optical access point will be at each optical hub. However, it is possible to have a mapping other than "one-to-one" between optical access points and hubs for each cluster.

REFERENCES

  [1] Abdollahi Meisam and Mohammadi Siamak. 2020. Insertion loss-aware application mapping onto the optical cube-connected cycles architecture. Computers and Electrical Engineering 82 (March 2020), 106559.
  [2] Abdollahi Meisam, Tavana Mohammad Khavari, Koohi Somayyeh, and Hessabi Shaahin. 2012. ONC3: All-optical NoC based on cube-connected cycles with quasi-DOR algorithm. In 15th Euromicro Conference on Digital System Design, Cesme, Izmir, Turkey. IEEE, 296–303.
  [3] Ardestani E. K. and Renau J. 2013. ESESC: A fast multicore simulator using time-based sampling. In IEEE 19th International Symposium on High Performance Computer Architecture (HPCA'13), Shenzhen, China. 448–459.
  [4] Shacham Assaf, Bergman Keren, and Carloni Luca P. 2007. On the design of a photonic network-on-chip. In 1st International Symposium on Networks-on-Chip (NOCS'07), Princeton, New Jersey, USA. IEEE, 53–64.
  [5] Bahirat Shirish and Pasricha Sudeep. 2012. A particle swarm optimization approach for synthesizing application-specific hybrid photonic networks-on-chip. In 13th International Symposium on Quality Electronic Design (ISQED'12), Santa Clara, California, USA. IEEE, 78–83.
  [6] Bahirat Shirish and Pasricha Sudeep. 2014. HELIX: Design and synthesis of hybrid nanophotonic application-specific network-on-chip architectures. In Proceedings of the International Symposium on Quality Electronic Design (ISQED'14), Santa Clara, California, USA. IEEE, 91–98.
  [7] Barrow-Williams Nick, Fensch Christian, and Moore Simon. 2009. A communication characterisation of Splash-2 and Parsec. In IEEE International Symposium on Workload Characterization (IISWC'09), Austin, Texas, USA. 86–97.
  [8] Batten Christopher, Joshi Ajay, Orcutt Jason, Khilo Anatol, Moss Benjamin, Holzwarth Charles W., Popovic Miloš A., Li Hanqing, Smith Henry I., Hoyt Judy L., Kartner Franz X., Ram Rajeev J., Stojanovic Vladimir, and Asanovic Krste. 2009. Building many-core processor-to-DRAM networks with monolithic CMOS silicon photonics. IEEE Micro 29, 4 (2009), 8–21.
  [9] Beausoleil R. G., Fiorentino M., Ahn J., Binkert N., Davis A., Fattal D., Jouppi N. P., McLaren M., Santori C. M., Schreiber R. S., Spillane S. M., Vantrease D., and Xu Q. 2008. A nanophotonic interconnect for high-performance many-core computation. In 5th IEEE International Conference on Group IV Photonics, Stanford, California, USA. 365–367.
  [10] Benini Luca and Micheli Giovanni De. 2006. Networks on Chips: Technology and Tools. Elsevier Morgan Kaufmann.
  [11] Le Beux Sébastien, Li Hui, Nicolescu Gabriela, Trajkovic Jelena, and O'Connor Ian. 2014. Optical crossbars on chip: A comparative study based on worst-case losses. Concurrency and Computation: Practice and Experience 26 (Oct. 2014), 2492–2503.
  [12] Bishop Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
  [13] Chen T., Guo Q., Tang K., Temam O., Xu Z., Zhou Z., and Chen Y. 2014. ArchRanker: A ranking approach to design space exploration. In ACM/IEEE 41st International Symposium on Computer Architecture (ISCA'14), Minneapolis, Minnesota, USA. 85–96.
  [14] Chrostowski Lukas and Hochberg Michael. 2015. Silicon Photonics Design: From Devices to Systems. Cambridge University Press.
  [15] The NumPy community. [n.d.]. numpy.org. Retrieved May 24, 2021 from https://numpy.org/.
  [16] Scikit-Learn developers. [n.d.]. scikit-learn Model Evaluation. Retrieved May 24, 2021 from https://scikit-learn.org/stable/modules/model_evaluation.html.
  [17] Scikit-Learn developers. [n.d.]. scikit-learn.org. Retrieved May 24, 2021 from https://scikit-learn.org/stable/.
  [18] Matplotlib development team. [n.d.]. Matplotlib.org. Retrieved May 24, 2021 from https://matplotlib.org/.
  [19] Kurian George, Sun Chen, Chen Chia-Hsin Owen, Miller Jason E., Michel Jurgen, Wei Lan, Antoniadis Dimitri A., Peh Li-Shiuan, Kimerling Lionel, Stojanovic Vladimir, and Agarwal Anant. 2012. Cross-layer energy and performance evaluation of a nanophotonic manycore processor system using real application workloads. In IEEE 26th International Parallel and Distributed Processing Symposium. 1117–1130.
  [20] Python Software Foundation. [n.d.]. Python.org. Retrieved May 24, 2021 from http://www.python.org/.
  [21] Genko N., Atienza D., De Micheli G., Mendias J. M., Hermida R., and Catthoor F. 2005. A complete network-on-chip emulation framework. In Design, Automation and Test in Europe, Vol. 1. IEEE, 246–251.
  [22] Gu Huaxi and Xu Jiang. 2009. Design of 3D optical network on chip. In Symposium on Photonics and Optoelectronics. IEEE, 1–4.
  [23] Hall Mark A. 2000. Correlation-based feature selection for discrete and numeric class machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML'00). Morgan Kaufmann Publishers Inc., San Francisco, CA, 359–366.
  [24] Hamedani P. K., Jerger N. E., and Hessabi S. 2014. QuT: A low-power optical network-on-chip. In 8th IEEE/ACM International Symposium on Networks-on-Chip (NoCS'14), Ferrara, Italy. 80–87.
  [25] Hennessy John L. and Patterson David A. 2011. Computer Architecture: A Quantitative Approach (5th ed.). Morgan Kaufmann Publishers Inc., San Francisco, USA.
  [26] Ipek Engin, McKee Sally A., Singh Karan, Caruana Rich, de Supinski Bronis R., and Schulz Martin. 2008. Efficient architectural design space exploration via predictive modeling. ACM Trans. Archit. Code Optim. 4, 4, Article 1 (Jan. 2008), 1–34.
  [27] Jia Wenhao, Shaw Kelly, and Martonosi Margaret. 2012. Stargazer: Automated regression-based GPU design space exploration. In International Symposium on Performance Analysis of Systems & Software, New Brunswick, New Jersey, USA. IEEE, 2–13.
  [28] Jooya A., Dimopoulos N., and Baniasadi A. 2016. MultiObjective GPU design space exploration optimization. In International Conference on High Performance Computing Simulation (HPCS'16), Innsbruck, Austria. 659–666.
  [29] Joseph P. J., Vaswani Kapil, and Thazhuthaveetil M. J. 2006. Construction and use of linear regression models for processor performance analysis. In 12th International Symposium on High-Performance Computer Architecture, Tampa, Florida, USA. 99–108.
  [30] Kahng Andrew B., Li Bin, Peh Li-Shiuan, and Samadi Kambiz. 2012. ORION 2.0: A power-area simulator for interconnection networks. IEEE Trans. Very Large Scale Integr. Syst. 20, 1 (Jan. 2012), 191–196.
  [31] Karimi S. 2016. Prediction Modeling and Design Space Exploration in Optical Network on Chip. Master's thesis. Concordia University, Montreal, Canada.
  [32] Karimi S. and Trajkovic J. 2018. Comparative study and prediction modeling of photonic ring network on chip architectures. In 19th International Symposium on Quality Electronic Design (ISQED'19), Santa Clara, California, USA. 119–126.
  [33] Kim Yong Wook, Choi Seo Hong, and Han Tae Hee. 2021. Rapid topology generation and core mapping of optical network-on-chip for heterogeneous computing platform. IEEE Access 9 (2021), 110359–110370.
  [34] Kurian George, Miller Jason E., Psota James, Eastep Jonathan, Liu Jifeng, Michel Jurgen, Kimerling Lionel C., and Agarwal Anant. 2010. ATAC: A 1000-core cache-coherent processor with on-chip optical network. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT'10), Vienna, Austria. ACM, New York, NY, 477–488.
  [35] Le Beux S., Li H., O'Connor I., Cheshmi K., Liu X., Trajkovic J., and Nicolescu G. 2014. Chameleon: Channel efficient optical network-on-chip. In Design, Automation and Test in Europe Conference and Exhibition (DATE'14), Dresden, Germany. IEEE, 1–6.
  [36] Le Beux Sébastien, O'Connor Ian, Nicolescu Gabriela, Bois Guy, and Paulin Pierre. 2013. Reduction methods for adapting optical network on chip topologies to 3D architectures. Microprocess. Microsyst. 37, 1 (Feb. 2013), 87–98.
  [37] Le Beux Sébastien, Trajkovic Jelena, O'Connor Ian, and Nicolescu Gabriela. 2011. Layout guidelines for 3D architectures including optical ring network-on-chip (ORNoC). In IEEE/IFIP 19th International Conference on VLSI and System-on-Chip, Kowloon, Hong Kong. 242–247.
  [38] Le Beux S., Trajkovic J., O'Connor I., Nicolescu G., Bois G., and Paulin P. 2011. Optical ring network-on-chip (ORNoC): Architecture and design methodology. In Design, Automation and Test in Europe, Grenoble, France. IEEE, 1–6.
  [39] Lee Benjamin C. and Brooks David M. 2006. Accurate and efficient regression modeling for microarchitectural performance and power prediction. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XII), San Jose, California. ACM, New York, NY, 185–194.
  [40] Li Dandan, Yao Shuzhen, Liu Yu-Hang, Wang Senzhang, and Sun Xian-He. 2016. Efficient design space exploration via statistical sampling and AdaBoost learning. In Proceedings of the 53rd Annual Design Automation Conference (DAC'16), Austin, TX. ACM, New York, NY, Article 142, 1–6.
  [41] Li D., Yao S., and Wang Y. 2018. Processor design space exploration via statistical sampling and semi-supervised ensemble learning. IEEE Access 6 (2018), 25495–25505.
  [42] Li Hui, Le Beux Sébastien, Sepulveda Martha Johanna, and O'Connor Ian. 2017. Energy-efficiency comparison of multi-layer deposited nanophotonic crossbar interconnects. ACM Journal on Emerging Technologies in Computing Systems 13, 4 (2017), 1–25.
  [43] Liu F. T., Ting K. M., and Zhou Z. 2008. Isolation forest. In 8th IEEE International Conference on Data Mining, Pisa, Italy. 413–422.
  [44] Miller J. E., Kasture H., Kurian G., Gruenwald C., Beckmann N., Celio C., Eastep J., and Agarwal A. 2010. Graphite: A distributed parallel simulator for multicores. In 16th International Symposium on High-Performance Computer Architecture (HPCA'10), Bangalore, India. IEEE, 1–12.
  [45] Nikdast M., Nicolescu G., Trajkovic J., and Liboiron-Ladouceur O. 2016. Chip-scale silicon photonic interconnects: A formal study on fabrication non-uniformity. Journal of Lightwave Technology 34, 16 (2016), 3682–3695.
  [46] Nikdast Mahdi, Nicolescu Gabriela, Trajkovic Jelena, and Liboiron-Ladouceur Odile. 2018. DeEPeR: Enhancing performance and reliability in chip-scale optical interconnection networks. In Proceedings of the 2018 Great Lakes Symposium on VLSI (GLSVLSI'18), Chicago, IL, USA. ACM, New York, NY, 63–68.
  [47] O'Neal K., Brisk P., Shriver E., and Kishinevsky M. 2017. HALWPE: Hardware-assisted light weight performance estimation for GPUs. In 54th ACM/EDAC/IEEE Design Automation Conference (DAC'17), Austin, Texas, USA. 1–6.
  [48] Pan Yan, Kumar Prabhat, Kim John, Memik Gokhan, Zhang Yu, and Choudhary Alok. 2009. Firefly: Illuminating future network-on-chip with nanophotonics. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA'09), Austin, TX. ACM, New York, NY, 429–440.
  [49] Ramini L., Grani P., Bartolini S., and Bertozzi D. 2013. Contrasting wavelength-routed optical NoC topologies for power-efficient 3D-stacked multicore processors using physical-layer analysis. In Design, Automation and Test in Europe Conference and Exhibition (DATE'13). 1589–1594.
  [50] Sanchez Daniel and Kozyrakis Christos. 2013. ZSim: Fast and accurate microarchitectural simulation of thousand-core systems. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA'13), Tel-Aviv, Israel. ACM, New York, NY, 475–486.
  [51] Sangaiah K., Hempstead M., and Taskin B. 2015. Uncore RPD: Rapid design space exploration of the uncore via regression modeling. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD'15), Austin, Texas, USA. 365–372.
  [52] Sepúlveda J., Le Beux S., Luo J., Killian C., Chillet D., Li H., O'Connor I., and Sentieys O. 2015. Communication aware design method for optical network-on-chip. In IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, Turin, Italy. 243–250.
  [53] Shacham A., Bergman K., and Carloni L. P. 2008. Photonic networks-on-chip for future generations of chip multiprocessors. IEEE Trans. Comput. 57, 9 (2008), 1246–1260.
  [54] Sun C., Chen C. O., Kurian G., Wei L., Miller J., Agarwal A., Peh L., and Stojanovic V. 2012. DSENT—a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In IEEE/ACM 6th International Symposium on Networks-on-Chip, Copenhagen, Denmark. 201–210.
  [55] Trajkovic Jelena, Karimi Sara, and Hangsan Samantha. [n.d.]. Prediction Modeling Dataset. Available upon request. Accessed May 24, 2021.
  [56] Uddin I., Poss R., and Jesshope C. 2014. Analytical-based high-level simulation of the microthreaded many-core architectures. In 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, Torino, Italy. 344–351.
  [57] Unat Didem, Chan Cy, Zhang Weiqun, Williams Samuel, Bachan John, Bell John, and Shalf John. 2015. ExaSAT: An exascale co-design tool for performance modeling. The International Journal of High Performance Computing Applications 29, 2 (2015), 209–232.
  [58] Vantrease D., Schreiber R., Monchiero M., McLaren M., Jouppi N. P., Fiorentino M., Davis A., Binkert N., Beausoleil R. G., and Ahn J. H. 2008. Corona: System implications of emerging nanophotonic technology. In International Symposium on Computer Architecture, Beijing, China. 153–164.
  [59] Woo S. C., Ohara M., Torrie E., Singh J. P., and Gupta A. 1995. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy. 24–36.
  [60] Xu Y. and Pasricha S. 2014. Silicon nanophotonics for future multicore architectures: Opportunities and challenges. IEEE Design & Test 31, 5 (2014), 9–17.
  [61] Yao Renjie and Ye Yaoyao. 2020. Toward a high-performance and low-loss Clos–Benes-based optical network-on-chip architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, 12 (2020), 4695–4706.
  [62] Ye Yaoyao, Xu Jiang, Huang Baihan, Wu Xiaowen, Zhang Wei, Wang Xuan, Nikdast Mahdi, Wang Zhehui, Liu Weichen, and Wang Zhe. 2013. 3-D mesh-based optical network-on-chip for multiprocessor system-on-chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 32, 4 (2013), 584–596.
