SESSION: Keynote address

Silicon compilation: the answer to reducing IC development costs
Rajeev Madhavan
Pages: 1-2
DOI: 10.1145/1120725.1120728

Developing today's increasingly large and complex digital integrated circuit (IC) and system-on-chip (SoC) devices is becoming cost-prohibitive in terms of engineering resources and development time. Packing the advanced functionality of a microprocessor, a graphics processor, or a network controller into a silicon die just 18 millimeters on a side is a complex undertaking that can require a 50-person engineering team and up to 4 million lines of HDL code. In these complex designs, managing and minimizing power is becoming a major challenge. Meanwhile, the mask set needed to drive the semiconductor production equipment costs more than $1 million, and errors found after the mask set is created add $1 million or more to that cost.

Design at the end of the silicon roadmap
Jan M. Rabaey
Pages: 1-2
DOI: 10.1145/1120725.1120729

Scaling of silicon integrated technology into the deep sub-100 nm space brings with it a number of formidable challenges for the designer. Issues such as design complexity, power dissipation, process variability and reliability are challenging traditional design methodologies. This presentation conjectures that the only viable long-term solution to these challenges is to drastically revise the way we design, and it presents a roadmap of potential solutions. Ultimately, these innovative design solutions will help pave the way to the post-silicon era.

The development of integrated circuit industry in China
Zhenghua Jiang
Pages: 1-2
DOI: 10.1145/1120725.1120727

China's first semiconductor device was invented at Fudan University in Shanghai. Chinese-made chips equipped the country's missiles in the 1960s. However, technological progress was interrupted during the Cultural Revolution. Since the 1980s, the semiconductor industry has recovered and developed extremely fast, for many reasons, and industrialization accelerated from the late 1990s to the early 2000s. The annual growth rate of China's total IC industry output has exceeded 30% over the last six years, and we expect the growth rate to remain at the same or a higher level over the next decade. Shanghai is the leading and most representative region of China's IC industry: in recent years, Shanghai's IC industry has accounted for 51% of China's total output. A complete IC industry chain, covering design, fabrication, packaging, testing and other functions, has been established there. Among the many enterprises, Huahong NEC, Zhongxin, Hongli and others own six 4- to 6-inch chip production lines and eleven 8-inch IC production lines. Huahong Design, Shengsheng Shanghai, Jiaodahanxin, Fudan Micro-Electronics and some other enterprises are strong in IC design. Intel and many other world-famous enterprises have also established a presence in Shanghai.

SESSION: Tree construction and buffering
Patrick H. Madden, Cheng-Kok Koh

The polygonal contraction heuristic for rectilinear Steiner tree construction
Yin Wang, Xianlong Hong, Tong Jing, Yang Yang, Xiaodong Hu, Guiying Yan
Pages: 1-6
DOI: 10.1145/1120725.1120731

Motivated by VLSI/ULSI routing applications, we present a heuristic for rectilinear Steiner minimal tree (RSMT) construction. We transform a rectilinear minimum spanning tree (RMST) into an RSMT by a novel method called polygonal contraction. Experimental results show that the heuristic matches or exceeds the solution quality of the best previously known algorithms while running much faster.
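
The polygonal contraction step itself is the authors' contribution and is not reproduced here. As a minimal sketch of its starting point, the Python below builds a rectilinear minimum spanning tree with Prim's algorithm under the Manhattan (L1) metric; the function names and the simple O(n²) formulation are illustrative choices, not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def manhattan(a: Point, b: Point) -> int:
    """Rectilinear (L1) distance between two terminals."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def rmst(points: List[Point]) -> Tuple[int, List[Tuple[int, int]]]:
    """Prim's algorithm on the complete graph with Manhattan edge weights.

    Returns the total tree length and the tree edges as index pairs.
    """
    n = len(points)
    if n < 2:
        return 0, []
    in_tree = [False] * n
    dist = [float("inf")] * n  # cheapest connection of each node to the tree
    parent = [-1] * n
    dist[0] = 0
    total, edges = 0, []
    for _ in range(n):
        # pick the cheapest terminal not yet in the tree
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: dist[i])
        in_tree[u] = True
        total += dist[u]
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # relax connection costs through the newly added terminal
        for v in range(n):
            if not in_tree[v]:
                d = manhattan(points[u], points[v])
                if d < dist[v]:
                    dist[v], parent[v] = d, u
    return total, edges


if __name__ == "__main__":
    plus = [(0, 1), (2, 1), (1, 0), (1, 2)]
    print(rmst(plus)[0])  # 6
```

For the four plus-shaped terminals above, every pairwise distance is 2, so the RMST has length 6, while a Steiner tree with one Steiner point at (1, 1) has length 4. Closing that kind of gap is what an RMST-to-RSMT heuristic aims to do.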

An-OARSMan: obstacle-avoiding routing tree construction with good length performance
Yu Hu, Tong Jing, Xianlong Hong, Zhe Feng, Xiaodong Hu, Guiying Yan
Pages: 7-12
DOI: 10.1145/1120725.1120732

Routing is an important step in VLSI/ULSI physical design, and rectilinear Steiner minimum tree (RSMT) construction is an essential part of routing. Since macro cells, IP blocks, and pre-routed nets are often treated as obstacles in the routing phase, obstacle-avoiding RSMT (OARSMT) algorithms are useful for practical routing applications. This paper focuses on the OARSMT problem and presents an algorithm, named An-OARSMan, based on ant colony optimization. The algorithm uses a greedy obstacle-penalty distance (OP-distance) local heuristic that operates on the track graph. The algorithm has been implemented and tested on different kinds of obstacles. Experimental results show that An-OARSMan handles complex obstacle cases, including both convex and concave polygonal obstacles, with good length performance, and that it always achieves the optimal solution in cases with no more than 7 terminals.

Making fast buffer insertion even faster via approximation techniques
Zhuo Li, C. N. Sze, Charles J. Alpert, Jiang Hu, Weiping Shi
Pages: 13-18
DOI: 10.1145/1120725.1120733

As technology scales to 0.13 micron and below, designs require buffers to be inserted on interconnects of even moderate length, both on critical paths and to fix electrical violations. Consequently, buffer insertion is needed on tens of thousands of nets during physical synthesis optimization, and even the fast implementation of van Ginneken's algorithm requires several hours to perform this task. This work seeks to speed up van Ginneken-style algorithms by an order of magnitude while achieving similar results. To this end, we present three approximation techniques: (1) aggressive pre-buffer slack pruning, (2) squeeze pruning, and (3) library lookup. Experimental results on industrial designs show that using these techniques together yields solutions 9 to 25 times faster than van Ginneken-style algorithms, at a delay penalty of less than 3%.
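
The three approximation techniques are the authors' own and are not shown. For context, here is a minimal Python sketch of the classic van Ginneken dynamic program that they accelerate, restricted to a two-pin net (a chain of wire segments) with a single buffer type and Elmore delay. Each candidate is a (load capacitance, required arrival time) pair; the unit electrical values and all function names are hypothetical.

```python
from typing import List, Tuple

# A candidate is (load capacitance seen looking downstream, required arrival time q).
Cand = Tuple[float, float]


def prune(cands: List[Cand]) -> List[Cand]:
    """Drop dominated candidates: (c1, q1) dominates (c2, q2) if c1 <= c2 and q1 >= q2."""
    kept: List[Cand] = []
    for c, q in sorted(cands, key=lambda x: (x[0], -x[1])):
        # after sorting by load, a candidate survives only if it improves q
        if not kept or q > kept[-1][1]:
            kept.append((c, q))
    return kept


def add_wire(cands: List[Cand], r: float, c: float) -> List[Cand]:
    """Propagate candidates through a wire segment (Elmore delay, pi model)."""
    return [(cl + c, q - r * (c / 2 + cl)) for cl, q in cands]


def add_buffer_option(cands: List[Cand], cb: float, rb: float, db: float) -> List[Cand]:
    """Keep every unbuffered candidate and add the best buffered one.

    A buffer isolates the downstream load, so all buffered candidates present
    the same load cb and only the one with the largest q matters.
    """
    best_q = max(q - db - rb * cl for cl, q in cands)
    return prune(cands + [(cb, best_q)])


def van_ginneken_chain(sink: Cand, segments: List[Tuple[float, float]],
                       buf: Tuple[float, float, float], rd: float) -> float:
    """Best slack at the driver for a two-pin net.

    segments are (r, c) pairs listed from the sink toward the driver; a buffer
    site exists between consecutive segments. buf is (cb, rb, db) and rd is
    the driver resistance.
    """
    cands = [sink]
    for i, (r, c) in enumerate(segments):
        cands = prune(add_wire(cands, r, c))
        if i < len(segments) - 1:
            cands = add_buffer_option(cands, *buf)
    return max(q - rd * cl for cl, q in cands)


if __name__ == "__main__":
    seg = [(1.0, 1.0)] * 4
    print(van_ginneken_chain((1.0, 0.0), seg, (1.0, 1.0, 1.0), 1.0))     # -15.0
    print(van_ginneken_chain((1.0, 0.0), seg, (1.0, 1.0, 1000.0), 1.0))  # -17.0, buffers too slow
```

With these made-up values, the unbuffered chain reaches the driver with a slack of -17, and the dynamic program improves it to -15 by inserting buffers. Each call to `prune` walks the whole candidate list, and it is the size of these lists that pruning-based speed-ups, such as the ones named in the abstract, try to keep small.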

Concurrent flip-flop and buffer insertion with adaptive blockage avoidance
Zhong-Ching Lu, Ting-Chi Wang
Pages: 19-22
DOI: 10.1145/1120725.1120734

Given a routing tree for a multi-pin net, [5] presented two algorithms that extend the van Ginneken algorithm [3] to concurrent flip-flop and buffer insertion. One, called MiLa, minimizes latency; the other, called GiLa, finds a feasible solution subject to given latency constraints imposed on the sinks. However, neither considers the case where buffer/flip-flop blockages are present. In this paper, we enhance the MiLa and GiLa algorithms to avoid blockages by finding alternative registered-buffered paths between each internal node inside a blockage and its parent node. Experimental results show that, compared with MiLa, our approach finds solutions with the same latency (for about half of the test cases) or better latency (for the remaining test cases) and the same wirelength, while buffer/flip-flop usage and CPU time remain comparable or acceptable. Compared with GiLa, our approach finds a feasible solution for every test case, whereas GiLa fails to do so for several test cases.

Buffering global interconnects in structured ASIC design
Tianpei Zhang, Sachin S. Sapatnekar
Pages: 23-26
DOI: 10.1145/1120725.1120735

Structured ASICs present an attractive way to reduce design costs and turnaround times in nanometer designs. As with conventional ASICs, such designs require global wires to be buffered. However, via-programmable designs must prefabricate and preplace buffers in the layout. This paper proposes a novel and accurate statistical technique for estimating how prefabricated buffers should be distributed through a layout. It employs Rent's rule to estimate the buffer distribution required for the layout, so that an appropriate structured ASIC can be selected for the design. Experimental results show that the estimation for a uniform buffer distribution is accurate and economical.
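
The paper's buffer-distribution estimator is not reproduced here. As a reminder of the model it builds on, Rent's rule relates the number of external terminals T of a block to its gate count G as T = t·G^p, where t is the average number of terminals per gate and p is the Rent exponent. The Python sketch below, with purely illustrative values for t and p, evaluates the rule at each level of a recursive bisection, a common way to reason about how many wires cross block boundaries at each level of the hierarchy.

```python
from typing import List, Tuple


def rent_terminals(gates: float, t: float, p: float) -> float:
    """Rent's rule: expected number of external terminals of a block with `gates` gates."""
    return t * gates ** p


def terminals_per_level(total_gates: int, levels: int, t: float,
                        p: float) -> List[Tuple[int, float, float]]:
    """Terminal estimate for one block at each level of a recursive bisection.

    Level 0 is the whole design; level k has 2**k blocks of total_gates / 2**k gates.
    Returns (level, gates per block, terminals per block) tuples.
    """
    return [(k, total_gates / 2 ** k, rent_terminals(total_gates / 2 ** k, t, p))
            for k in range(levels + 1)]


if __name__ == "__main__":
    # illustrative parameters only: t = 4 terminals per gate, p = 0.6
    for level, gates, terms in terminals_per_level(1 << 16, 4, t=4.0, p=0.6):
        print(f"level {level}: {gates:8.0f} gates/block, ~{terms:7.1f} terminals/block")
```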

SESSION: System level design methodology for network-on-chip
X. Sharon Hu, Soonhoi Ha

Mapping and physical planning of networks-on-chip architectures with quality-of-service guarantees
Srinivasan Murali, Luca Benini, Giovanni De Micheli
Pages: 27-32
DOI: 10.1145/1120725.1120737

Networks on Chips (NoCs) have evolved as the communication design paradigm of future Systems on Chips (SoCs). In this work we target the NoC design of complex SoCs with heterogeneous processor/memory cores, providing Quality-of-Service (QoS) for the application. We present an integrated approach to mapping cores onto NoC topologies and to physical planning of NoCs, in which the positions and sizes of the cores and network components are computed. Our design methodology automates NoC mapping, physical planning, topology selection, topology optimization and instantiation, bridging an important design gap in building application-specific NoCs. We also present a methodology that guarantees QoS for the application during the mapping and physical planning process by satisfying the delay/jitter and real-time constraints of the traffic streams. Experimental studies show large area savings (up to 2x), bandwidth savings (up to 5x) and network component savings (up to 2.2x in buffer count, 3.8x in number of wires, and 1.6x in switch ports) compared to traditional design approaches.

Time and energy efficient mapping of embedded applications onto NoCs
César Marcon, André Borin, Altamiro Susin, Luigi Carro, Flávio Wagner
Pages: 33-38
DOI: 10.1145/1120725.1120738

This work analyzes the mapping of applications onto generic regular Networks-on-Chip (NoCs). Cores must be placed according to their communication requirements so as to minimize the overall application execution time and energy consumption. We extend previous mapping strategies by taking into account the dynamic behavior of the target application, and thus potential contention in the communication between cores. Experimental results for a suite of 22 benchmarks and various NoC sizes show an average reduction of 42% in the execution time of the mapped application, together with an average reduction of 21% in total energy consumption, for state-of-the-art technologies.
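
For context, a static mapping objective of the kind this work goes beyond can be sketched as follows: each core-to-core flow contributes its communication volume times the Manhattan hop count between the mesh tiles its endpoints occupy, and the mapping with the smallest total is chosen. This Python sketch uses exhaustive search, which is practical only for a handful of cores, and deliberately ignores the dynamic contention effects the paper adds; the core names and volumes are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Optional, Tuple

Tile = Tuple[int, int]
Flow = Tuple[str, str, int]  # (source core, destination core, communication volume)


def hops(a: Tile, b: Tile) -> int:
    """Hop count between two tiles of a 2-D mesh under XY routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mapping_cost(mapping: Dict[str, Tile], flows: List[Flow]) -> int:
    """Static cost: sum of volume x hop distance over all flows."""
    return sum(vol * hops(mapping[s], mapping[d]) for s, d, vol in flows)


def best_mapping(cores: List[str], tiles: List[Tile],
                 flows: List[Flow]) -> Tuple[int, Dict[str, Tile]]:
    """Try every placement of cores onto tiles and keep the cheapest."""
    best: Optional[Tuple[int, Dict[str, Tile]]] = None
    for perm in permutations(tiles, len(cores)):
        m = dict(zip(cores, perm))
        cost = mapping_cost(m, flows)
        if best is None or cost < best[0]:
            best = (cost, m)
    assert best is not None, "need at least as many tiles as cores"
    return best


if __name__ == "__main__":
    mesh = [(x, y) for x in range(2) for y in range(2)]  # 2x2 mesh
    traffic = [("A", "B", 10), ("C", "D", 10), ("A", "D", 1)]
    print(best_mapping(["A", "B", "C", "D"], mesh, traffic))
```

In this toy case every flow can be placed one hop apart, giving a cost of 21. A contention-aware mapper would additionally model when flows share links at the same time, which a volume-times-distance cost cannot capture.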

Communication-driven task binding for multiprocessor with latency insensitive network-on-chip
Liang-Yu Lin, Cheng-Yeh Wang, Pao-Jui Huang, Chih-Chieh Chou, Jing-Yang Jou
Pages: 39-44
DOI: 10.1145/1120725.1120739

Network-on-Chip (NoC) is a new paradigm for designing core-based Systems-on-Chip, featuring a high degree of reusability and scalability. In this paper, we propose a switch that employs latency-insensitive concepts and round-robin scheduling techniques to achieve high communication resource utilization. Assuming a 2D-mesh network topology built from this switch, this work not only models the communication and contention effects of the network but also develops a communication-driven task binding algorithm that uses a divide-and-conquer strategy to map applications onto the multiprocessor system-on-chip. The algorithm attempts to derive a binding of tasks that maximizes overall system throughput. Compared with task binding that ignores communication and contention effects, experimental results show an overall system throughput improvement of 20% over 844 test cases.
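
In a switch, round-robin scheduling is typically used to arbitrate among inputs competing for a shared resource, with the most recently granted input getting the lowest priority on the next decision. The Python sketch below is a generic round-robin arbiter, not the authors' switch design; the function name is hypothetical.

```python
from typing import List, Optional


def round_robin_grant(requests: List[bool], last_grant: int) -> Optional[int]:
    """Grant the first requesting input after `last_grant`, wrapping around.

    Returns None when no input is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        i = (last_grant + offset) % n
        if requests[i]:
            return i
    return None


if __name__ == "__main__":
    # all four inputs request every cycle: grants rotate 0, 1, 2, 3, 0, ...
    last = 3
    for cycle in range(5):
        last = round_robin_grant([True] * 4, last)
        print(f"cycle {cycle}: grant input {last}")
```

Because the search always starts just after the previous winner, no requesting input can be starved for more than n - 1 consecutive decisions.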

System-level communication modeling for network-on-chip synthesis
Andreas Gerstlauer, Dongwan Shin, Rainer Dömer, Daniel D. Gajski
Pages: 45-48
DOI: 10.1145/1120725.1120740

As we enter the network-on-chip era and system communication becomes a dominant factor, communication abstraction and synthesis are becoming an integral part of system design flows. The key to the success of any design flow is a set of well-defined abstraction levels and models, which enable automation of early validation, synthesis and verification. In this paper, we define system communication abstraction layers and corresponding design models that support successive, stepwise refinement from abstract message passing down to a cycle-accurate, bus-functional implementation. Experimental results show the benefits of our definitions and design flow.

MAIA: a framework for networks on chip generation and verification
Luciano Ost, Aline Mello, José Palma, Fernando Moraes, Ney Calazans
Pages: 49-52
DOI: 10.1145/1120725.1120741

The increasing complexity of SoCs makes networks on chip (NoCs) a promising substitute for bus and dedicated-wire interconnection schemes. However, new tools are needed to integrate NoC interconnection architectures and IP cores into SoCs. Such tools must fulfill three main requirements: (i) automated NoC generation; (ii) automated production of NoC-IP core interfaces; and (iii) seamless analysis of NoC traffic parameters. This paper presents the MAIA framework, which addresses all of these requirements. NoCs generated by the MAIA framework have been used to successfully prototype SoCs in FPGAs.

SESSION: Test and DFT (1)
Alex Orailoglu, Xiaoqing Wen

Theoretic analysis and enhanced X-tolerance of test response compact based on convolutional code
Yinhe Han, Yu Hu, Huawei Li, Xiaowei Li
Pages: 53-58
DOI: 10.1145/1120725.1120743

This paper addresses the problem of test response compaction. To maximize the compaction ratio, it proposes a single-output encoder based on the check matrix of an (n, n-1, m, 3) convolutional code. A theoretical analysis shows that this encoder avoids the cancellation of two erroneous bits and of any odd number of erroneous bits, handles one unknown bit (X bit), and diagnoses one erroneous bit. X-bit tolerance can be enhanced by choosing a proper memory size and check-matrix weight, and also through an optimized input assignment algorithm. The theoretical analysis and experimental results on aliasing show the efficiency of the proposed encoder.

Test compression for scan circuits using scan polarity adjustment and pinpoint test relaxation
Yasumi Doi, Seiji Kajihara, Xiaoqing Wen, Lei Li, Krishnendu Chakrabarty
Pages: 59-64
DOI: 10.1145/1120725.1120744

This paper presents a test compression method that effectively exploits the capability of run-length-based encoding. The method employs two techniques: scan polarity adjustment and pinpoint test relaxation. Given a test set for a full-scan circuit, scan polarity adjustment selectively flips the values of some scan cells in the test patterns. It can be realized by changing the connection between two scan cells so that the inverted output of a scan cell (Q̄) is connected to the next scan cell. Pinpoint test relaxation flips some specified 1s in the test patterns to 0s without any loss of fault coverage. Both techniques consult a gain-penalty table to determine which scan cells or bits to flip. Experimental results on ISCAS'89 benchmark circuits show that the proposed method reduces test data volume by 36%. Switching activity, i.e., test power during scan testing, was also reduced.
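
A minimal Python sketch of why polarity adjustment can help a run-length code, assuming a simple code that emits one codeword per run of 0s terminated by a 1 (plus one codeword for a trailing run of 0s). Flipping a scan cell inverts its bit in every pattern, which in hardware corresponds to the inverted-output connection described above. The greedy rule used here (flip a cell when most patterns hold a 1 in it) is a hypothetical simplification; the paper instead uses a gain-penalty table and also relaxes individual bits.

```python
from typing import List


def zero_runs(bits: str) -> List[int]:
    """Lengths of runs of 0s, each terminated by a 1; a trailing run is also emitted."""
    runs, n = [], 0
    for b in bits:
        if b == "0":
            n += 1
        else:
            runs.append(n)
            n = 0
    if n:
        runs.append(n)
    return runs


def codewords(patterns: List[str]) -> int:
    """Number of run-length codewords needed for all patterns."""
    return sum(len(zero_runs(p)) for p in patterns)


def adjust_polarity(patterns: List[str]) -> List[int]:
    """Hypothetical greedy rule: flip every scan cell that holds a 1 in most patterns."""
    width = len(patterns[0])
    return [i for i in range(width)
            if 2 * sum(p[i] == "1" for p in patterns) > len(patterns)]


def apply_flips(patterns: List[str], flipped: List[int]) -> List[str]:
    """Invert the selected scan-cell positions in every pattern."""
    s = set(flipped)
    return ["".join(("1" if b == "0" else "0") if i in s else b
                    for i, b in enumerate(p)) for p in patterns]


if __name__ == "__main__":
    tests = ["1101", "1100", "1001"]
    flips = adjust_polarity(tests)
    print(flips, codewords(tests), codewords(apply_flips(tests, flips)))  # [0, 1, 3] 8 4
```

In this toy example, flipping three scan cells halves the number of codewords, because each flipped cell turns a mostly-1 column into a mostly-0 one and lengthens the runs of 0s.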

Testing comparison faults of ternary CAMs based on comparison faults of binary CAMs
Jin-Fu Li
Pages: 65-70
DOI: 10.1145/1120725.1120745

With the increasing demand for high-performance networking applications, network components such as network interfaces and routers are built as dedicated hardware modules. Content addressable memories (CAMs) play an important role in these components. Testing CAMs is complicated by their special structure. This paper presents an efficient March-like test algorithm for detecting the comparison faults of ternary CAMs (TCAMs) based on the comparison fault models of binary CAMs. The test algorithm requires 5N Write operations, 2N Erase operations, and (3N + 2B) Compare operations for an N × B-bit TCAM.
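
The stated complexity translates directly into an operation count. The Python helper below simply evaluates the 5N Write, 2N Erase and (3N + 2B) Compare figures from the abstract for an N-word, B-bit TCAM; the function name and the 1024 × 64 example size are illustrative.

```python
def tcam_test_ops(n_words: int, bits_per_word: int) -> dict:
    """Operation counts of the March-like TCAM test for an N x B-bit array."""
    ops = {
        "write": 5 * n_words,
        "erase": 2 * n_words,
        "compare": 3 * n_words + 2 * bits_per_word,
    }
    ops["total"] = sum(ops.values())
    return ops


if __name__ == "__main__":
    print(tcam_test_ops(1024, 64))
    # {'write': 5120, 'erase': 2048, 'compare': 3200, 'total': 10368}
```

The count grows linearly in N, and only the Compare term depends on the word width B.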

SPIN-PAC: test compaction for speed-independent circuits
Feng Shi, Yiorgos Makris
Pages: 71-74
DOI: 10.1145/1120725.1120746

SPIN-PAC is a static test compaction method for speed-independent circuits. We demonstrate how test sets can be compacted by combining multiple consecutive test vectors within a test sequence into a vector pair of higher Hamming distance, and by eliminating or pruning independent test sequences. We discuss the exponential complexity of solving this problem optimally, propose an efficient algorithm to approximate it, and evaluate its performance through experiments.

A Huffman-based coding with efficient test application
Michihiro Shintani, Toshihiro Ohara, Hideyuki Ichihara, Tomoo Inoue
Pages: 75-78
DOI: 10.1145/1120725.1120747

Test compression/decompression using variable-length coding is an efficient way to reduce test application cost, i.e., the test application time and the storage size of an LSI tester. However, some codings impose slow test application and consequently require long test application times despite their high compression. In this paper, we show that test application time depends on both the compression ratio and the codeword length, and then propose a new Huffman-based coding method that achieves small test application time in a given test environment. The proposed coding method adjusts both the compression ratio and the codeword length to the test environment. Experimental results show that the proposed method achieves small test application time while keeping a high compression ratio.
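
To make the link between compression ratio and codeword length concrete, here is a minimal Python sketch of plain Huffman coding over fixed-length blocks of test data, reporting both the encoded size and the longest codeword. It is the standard Huffman construction, not the paper's adjusted coding; the block size and sample data are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count
from typing import Dict, Tuple


def huffman_code(freqs: Dict[str, int]) -> Dict[str, str]:
    """Standard Huffman code: repeatedly merge the two least frequent subtrees."""
    tie = count()  # tie-breaker so heapq never compares the dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def compress(bits: str, block: int) -> Tuple[Dict[str, str], int, int]:
    """Encode `bits` in fixed-size blocks; return (code, encoded bit count, longest codeword)."""
    blocks = [bits[i:i + block] for i in range(0, len(bits), block)]
    code = huffman_code(Counter(blocks))
    encoded_len = sum(len(code[b]) for b in blocks)
    longest = max(len(c) for c in code.values())
    return code, encoded_len, longest


if __name__ == "__main__":
    data = "00" * 5 + "01" * 2 + "10" + "11"  # 18 bits of made-up test data
    code, enc, longest = compress(data, 2)
    print(code, enc, longest)  # encoded size 15, longest codeword 3
```

For the sample data, 18 input bits become 15 encoded bits, but the rarest blocks receive 3-bit codewords. If a decompressor needs more tester cycles for longer codewords, test application can slow down even when the compression ratio is good, which is the trade-off the abstract describes.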

SESSION: (Special session) DFM

Embedded tutorial I: design for manufacturability
Vijay Pitchumani
Pages: 1-1
DOI: 10.1145/1120725.1120749

DFM (Design for Manufacturability) has recently become a buzzword; it excites passion in semiconductor process, design, EDA and manufacturing circles. What is all this hype about? This tutorial reviews DFM, the ugly cousin of technology scaling, in a broad context, and includes in its scope both hard defects and parametric variations arising from manufacturing issues. It presents the various sources of the problem and their impact on yield, silicon-versus-timing-model correlation, mask cost, data size and time-to-market. It then presents design methodology and EDA tool solutions, both current and future, including restrictive design rules, preferred rules, layout fixes, design-manufacturing integration, layout-dependent modeling, and variation-aware analysis and design. This tutorial is intended for engineers and project managers involved in design, EDA, OPC/RET/tapeout and design rule formulation.

ESDZapper: a new layout-level verification tool for finding critical discharging path under ESD stress
Rouying Zhan, Haolu Xie, Haigang Feng, Albert Wang
Pages: 79-82
DOI: 10.1145/1120725.1120750

On-chip ESD (electrostatic discharge) protection is a challenging IC design problem, and new CAD tools are essential for predicting and verifying ESD protection designs at the full-chip level. This paper reports a new CAD tool, called ESDZapper, that simulates complex ESD zapping test procedures and finds the critical discharging path under a specific ESD stress. ESDZapper is built on a novel concept of ESD-critical parameters. The capability of the new tool is demonstrated on a practical design example in a 0.35 μm BiCMOS technology.

A new method for model based frugal OPC
Xiaolang Yan, Ye Chen, Zheng Shi, Yue Ma
Pages: 83-86
DOI: 10.1145/1120725.1120751

Improvements in resolution enhancement technologies (RETs) enable the minimum feature size of ICs to keep shrinking in line with Moore's Law. However, growing mask data volume also greatly increases manufacturing cost. The cost increase is partly due to the complicated optical proximity corrections (OPC) applied to mask designs, and frugal OPC methods have been introduced to reduce this complexity. In this paper, a new method for frugal OPC is presented. Based on the recognition of critical spots under yield-related constraints, the new correction flow maintains fidelity at critical sites while still retaining the frugality of the modified designs.

SESSION: Clock, power grid and thermal analysis and optimization
Xiaodong Yang, Eli Chiprout

Fast computation of the temperature distribution in VLSI chips using the discrete cosine transform and table look-up
Yong Zhan, Sachin S. Sapatnekar
Pages: 87-92
DOI: 10.1145/1120725.1120753

Temperature-related effects are critical in determining both the performance and the reliability of VLSI circuits. Accurate and efficient estimation of the temperature distribution corresponding to a specific circuit layout is indispensable in physical design automation tools. In this paper, we propose a fast and highly accurate algorithm for computing the on-chip temperature distribution due to power sources located on the top surface of the chip. The method combines several computational techniques, including the Green function method, the discrete cosine transform (DCT), and table look-up. The high accuracy of the algorithm comes from the fully analytical nature of the Green function method, and the high efficiency comes from using the fast Fourier transform (FFT) to compute the DCT and then obtaining the temperature field for any power source distribution from a pre-calculated look-up table. Experimental results demonstrate that our method has a relative error below 1% compared with commercial computational fluid dynamics (CFD) software for thermal analysis, while being orders of magnitude more efficient than direct application of the Green function method.

Analysis of buffered hybrid structured clock networks
Yi Zou, Qiang Zhou, Yici Cai, Xianlong Hong, Sheldon X.-D. Tan
Pages: 93-98
DOI: 10.1145/1120725.1120754

This paper presents a novel approach for fast transient analysis of buffered hybrid structured clock networks. The new method applies structure reduction and relaxed hierarchical analysis to reduce circuit complexity and speed up simulation. A simple controlled-source model is used for clock buffers to handle the nonlinearity in buffered clock trees. Experimental results show that the proposed algorithm is about two orders of magnitude faster than HSPICE without loss of accuracy or stability, with delay times within a few percent of the exact values.

Clock network minimization methodology based on incremental placement
Liang Huang, Yici Cai, Qiang Zhou, Xianlong Hong, Jiang Hu, Yongqiang Lu
Pages: 99-102
DOI: 10.1145/1120725.1120755

In ultra-deep submicron VLSI circuits, the clock network is a major source of power consumption and power supply noise, so it is very important to minimize clock network size. Traditional design methodologies usually leave clock network minimization to the clock router alone. Since clock routing is carried out based on register locations, register placement has a fundamental influence on clock network size. In this paper, we propose a new clock network design methodology that incorporates register placement optimization. Given a cell placement, incremental modifications are performed according to clock skew specifications. These incremental placement changes move registers toward preferred locations that may enable a smaller clock network, while the side effects on logic cell placement and wire connections are kept under control. Experimental results on benchmark circuits show that the proposed methodology can reduce clock network size considerably, with limited impact on signal net wirelength and critical path delay.

A multi-level transmission line network approach for multi-giga hertz clock distribution
Hongyu Chen, Chung-Kuan Cheng
Pages: 103-106
DOI: 10.1145/1120725.1120756

In high-performance systems, process variations and fluctuations in operating environments have a significant impact on clock skew. Recently, hybrid structures of H-tree and mesh [2,15,18,19] were proposed to distribute the clock signal with a balanced H-tree and lock the skew using the shunt effect of the mesh. However, in the multi-gigahertz regime, the RC model [15] of the mesh is no longer valid, and the inductance of the mesh can even make the skew worse. In this paper, we investigate a novel architecture that incorporates multi-level transmission line shunts to distribute the global clock signal. We derive an analytical expression for the skew reduction contributed by the shunt of a transmission line whose length is an integral multiple of the clock wavelength. Based on this analytical skew expression, we adopt convex programming techniques to optimize the wire widths of the multi-level transmission line network. Simulation results show that the multi-level network achieves skew below 4 ps at a 10 GHz clock rate.
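
For a sense of scale, the clock wavelength on a transmission line follows from λ = v/f, with propagation velocity v = c₀/√ε_r for a TEM wave in a homogeneous dielectric. The Python sketch below assumes ε_r = 3.9 (silicon dioxide); real on-chip lines with low-k dielectrics, return-path effects or non-TEM behavior will differ, so the numbers are illustrative only, and the function names are hypothetical.

```python
import math
from typing import List

C0 = 299_792_458.0  # speed of light in vacuum, m/s


def wavelength_mm(freq_hz: float, eps_r: float) -> float:
    """Wavelength in millimetres of a TEM wave in a dielectric with relative permittivity eps_r."""
    v = C0 / math.sqrt(eps_r)  # propagation velocity, m/s
    return v / freq_hz * 1e3


def shunt_lengths_mm(freq_hz: float, eps_r: float, multiples: int) -> List[float]:
    """Candidate shunt lengths that are integer multiples of the clock wavelength."""
    lam = wavelength_mm(freq_hz, eps_r)
    return [k * lam for k in range(1, multiples + 1)]


if __name__ == "__main__":
    print(f"{wavelength_mm(10e9, 3.9):.2f} mm")  # 15.18 mm at 10 GHz
    print([round(x, 1) for x in shunt_lengths_mm(10e9, 3.9, 3)])
```

Under these assumptions, a one-wavelength shunt at 10 GHz is on the order of a centimetre or more long.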
|
|
|
Gibbs sampling in power grid analysis |
| |
Zhixin Tian,
Huazhong Yang,
Rong Luo
Pages: 107-110
doi>10.1145/1120725.1120757
The power grid plays an important role in determining circuit performance, and the accuracy and efficiency of power grid analysis algorithms have become critical for timing, power and noise estimation in modern integrated circuits. In this paper, a stochastic algorithm based on Gibbs sampling is proposed to solve the power grid analysis problem, and test results show that it reaches good accuracy with linear complexity. Because the method localizes computation, it supports incremental analysis, a property favored in modern CAD. Therefore, it can be embedded at all design and verification levels of integrated circuits.
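The abstract does not detail the Gibbs sampler itself. As a loosely related illustration of a stochastic power-grid solver whose work stays local to the node of interest, the sketch below uses a random-walk estimator (a different technique, not the authors' algorithm). It rests on the nodal equation V_i = sum_j (g_ij / G_i) V_j - I_i / G_i, where g_ij is the conductance between nodes i and j, G_i = sum_j g_ij, I_i is the load current at node i, and supply pads are held at VDD. The grid, the function name `random_walk_voltage`, and all values are hypothetical.

```python
import random

def random_walk_voltage(start, neighbors, load, pads, vdd, walks=20000, seed=0):
    """Monte Carlo estimate of the voltage at node `start`.

    neighbors[n] -> list of (neighbor, conductance) pairs for every non-pad node
    load[n]      -> current drawn from the grid at node n (A)
    pads         -> nodes tied to the supply and held at vdd
    """
    rng = random.Random(seed)
    total_drop = 0.0
    for _ in range(walks):
        node, drop = start, 0.0
        while node not in pads:
            nbrs = neighbors[node]
            g_total = sum(g for _, g in nbrs)
            # each visit adds the local term I_i / G_i to this walk's IR drop
            drop += load.get(node, 0.0) / g_total
            # step to a neighbor with probability g_ij / G_i
            r = rng.random() * g_total
            for nxt, g in nbrs:
                r -= g
                if r <= 0.0:
                    break
            node = nxt
        total_drop += drop
    # V_start = VDD - E[accumulated drop]
    return vdd - total_drop / walks

# Hypothetical 5-node chain: pads at nodes 0 and 4, 1 S between neighbors
chain = {1: [(0, 1.0), (2, 1.0)], 2: [(1, 1.0), (3, 1.0)], 3: [(2, 1.0), (4, 1.0)]}
loads = {1: 0.1, 2: 0.1, 3: 0.1}
print(random_walk_voltage(2, chain, loads, {0, 4}, vdd=1.0))
```

For this chain the exact node voltages are 0.85 V, 0.80 V and 0.85 V, and the estimate approaches them as `walks` grows. Only the nodes a walk actually visits are touched, which is the kind of locality that makes stochastic solvers attractive for incremental analysis.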
A wideband hierarchical circuit reduction for massively coupled interconnects
Hao Yu,
Lei He,
Zhenyu Qi,
Sheldon X.-D. Tan
Pages: 111-114
doi>10.1145/1120725.1120758
We develop a realizable circuit reduction to generate interconnect macro-models for parasitic estimation in wideband applications. The inductance is represented by the VPEC (vector potential equivalent circuit) model, which not only enables passive sparsification but also gives the correct low-frequency response, whereas recent circuit reduction methods intrinsically produce inaccurate values and low-frequency responses because of their nodal-susceptance formulation. By applying hierarchical circuit reduction enhanced by multi-point expansions, we obtain an accurate high-order impedance function that captures the high-frequency response. Passivity of the impedance function is then enforced by convex programming, and the function is realized by Foster's synthesis. Experiments show that our method is as accurate as PRIMA in the high-frequency range, but leads to a realized circuit model with up to 10X lower complexity and up to 8X shorter simulation time. In addition, under the same reduction ratio, its error margin is smaller than that of time-constant-based reduction in both time-domain and frequency-domain simulations.
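To make the last realization step concrete, here is a minimal sketch of Foster synthesis for the simplest case only: an impedance Z(s) = sum_k r_k / (s - p_k) with real poles p_k < 0 and positive residues r_k, which maps onto a series chain of parallel R||C sections. This is an assumption for illustration; the paper's impedance functions may contain complex poles that require RLC sections, and the function names are hypothetical.

```python
def foster_rc_sections(poles, residues):
    """Map Z(s) = sum_k r_k / (s - p_k), real p_k < 0 and r_k > 0,
    to series-connected parallel R||C sections (Foster form).

    A parallel R||C has Z(s) = (1/C) / (s + 1/(R*C)), so
    r_k = 1/C_k and p_k = -1/(R_k*C_k)  =>  C_k = 1/r_k, R_k = -r_k/p_k.
    """
    sections = []
    for p, r in zip(poles, residues):
        if p >= 0 or r <= 0:
            raise ValueError("RC Foster form needs p_k < 0 and r_k > 0")
        sections.append((-r / p, 1.0 / r))  # (R_k, C_k)
    return sections

def z_partial_fraction(s, poles, residues):
    # evaluate the pole/residue form directly
    return sum(r / (s - p) for p, r in zip(poles, residues))

def z_network(s, sections):
    # impedances of series-connected R||C tanks add
    return sum(1.0 / (1.0 / R + s * C) for R, C in sections)

secs = foster_rc_sections([-4.0, -1000.0], [2.0, 500.0])
print(secs)  # two (R, C) pairs
s = 50j      # evaluate on the imaginary axis, s = j*omega
print(z_partial_fraction(s, [-4.0, -1000.0], [2.0, 500.0]), z_network(s, secs))
```

Both evaluations agree at any frequency, which is the point of the synthesis: the element values reproduce the reduced impedance exactly.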
SESSION: Routing and interconnects
Martin D. F. Wong,
Tong Jing

A Min-area Solution to Performance and RLC Crosstalk Driven Global Routing Problem
Tong Jing,
Ling Zhang,
Jinghong Liang,
Jingyu Xu,
Xianlong Hong,
Jinjun Xiong,
Lei He
Pages: 115-120
doi>10.1145/1120725.1120786
This paper presents a novel global routing algorithm, AT-PO-GR, that minimizes the routing area under congestion, timing, and RLC crosstalk constraints. The proposed algorithm consists of three key parts: (1) timing and congestion optimization; (2) crosstalk budgeting and estimation; and (3) crosstalk elimination and local refinement. Compared with the recent work introduced in [9] and [10], the proposed algorithm achieves a smaller routing area and uses fewer shields under the same design constraints, while requiring less running time.

Thermal-driven multilevel routing for 3-D ICs
Jason Cong,
Yan Zhang
Pages: 121-126
doi>10.1145/1120725.1120787
3-D ICs have great potential for improving circuit performance and degree of integration. They are also an attractive platform for system-on-chip or system-in-package solutions. A critical issue in 3-D circuit design is heat dissipation. In this paper, we propose an efficient 3-D multilevel routing approach that includes a novel through-the-silicon via (TS-via) planning algorithm. The proposed approach features an adaptive lumped resistive thermal model and a two-step multilevel TS-via planning scheme. Experimental results show that, with multilevel TS-via planning, the thermal-driven approach can reduce the maximum temperature to the required temperature with a reasonable wirelength increase. Compared to a post-processing approach for dummy TS-via insertion, our approach uses 80% fewer TS-vias to achieve the same required temperature. To our knowledge, the proposed approach is the first thermal-driven 3-D routing algorithm.
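Why TS-vias help can be seen in a single-tile lumped resistive model. The sketch below is a simplified assumption, not the paper's adaptive multi-tile model: a tile dissipating P watts has a bulk vertical thermal resistance R_bulk, each TS-via adds a parallel path of resistance R_via, and the tile temperature is T = T_amb + P / (1/R_bulk + n/R_via). All values and function names are hypothetical.

```python
import math

def tile_temperature(power_w, r_bulk, r_via, n_vias, t_amb=25.0):
    # n identical TS-vias in parallel with the bulk heat path
    g = 1.0 / r_bulk + n_vias / r_via   # total thermal conductance (W/K)
    return t_amb + power_w / g

def min_vias(power_w, r_bulk, r_via, t_req, t_amb=25.0):
    """Smallest via count n with tile_temperature(...) <= t_req."""
    if t_req <= t_amb:
        raise ValueError("required temperature must exceed ambient")
    g_needed = power_w / (t_req - t_amb)
    extra = g_needed - 1.0 / r_bulk     # conductance the vias must supply
    return 0 if extra <= 0 else math.ceil(extra * r_via)

# Hypothetical tile: 2 W, 40 K/W bulk path, 200 K/W per via, 70 C limit
n = min_vias(2.0, 40.0, 200.0, t_req=70.0)
print(n, tile_temperature(2.0, 40.0, 200.0, n))
```

Without vias this tile reaches 105 C; four vias bring it under the 70 C limit. A router-level planner has to trade such via counts against routing resources across many coupled tiles, which this single-tile model ignores.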
Wave-pipelined on-chip global interconnect
Lizheng Zhang,
Yuhen Hu,
Charlie Chung-Ping Chen
Pages: 127-132
doi>10.1145/1120725.1120788
A novel wave-pipelined global interconnect system is developed for reliable, high-throughput on-chip data communication. We argue that because there is only a single signal propagation path and a single type of 1-input gate (the inverter), a wave-pipelined interconnect has less stringent timing constraints than a wave-pipelined combinational logic block. A clock and data recovery unit architecture based on a phase-locked loop, adopted from off-chip high-speed digital serial links, is designed for on-chip application so as to minimize power and area cost. Preliminary Monte Carlo simulation indicates that the wave-pipelined global interconnect architecture can potentially offer 18% higher throughput than a flip-flop-pipelined global interconnect architecture at about the same level of reliability. When delivering data through long interconnects at the same bit rate, the wave-pipelined architecture consumes less power and requires less chip area.

Evaluation of on-chip transmission line interconnect using wire length distribution
Junpei Inoue,
Hiroyuki Ito,
Shinichiro Gomi,
Takanori Kyogoku,
Takumi Uezono,
Kenichi Okada,
Kazuya Masu
Pages: 133-138
doi>10.1145/1120725.1120789
On-chip transmission-line interconnect has been proposed to reduce delay time and power consumption by replacing long RC interconnects. This paper proposes a methodology for replacing RC lines with transmission lines, in which the interconnects are estimated using a Wire Length Distribution (WLD). The advantages of on-chip transmission lines are discussed from the viewpoint of delay time and power consumption.

SESSION: System level modeling and embedded software
Tim Tuan,
S. K. Nandy

A formalism for functionality preserving system level transformations
Samar Abdi,
Daniel Gajski
Pages: 139-144
doi>10.1145/1120725.1120791
With the rise in complexity of modern systems, designers are spending significant time on modeling at the system level of abstraction. This paper introduces Model Algebra, a formalism built on top of system level design languages that can be used to implement functionality-preserving transformations on system level models. Such transformations enable us to implement high-level design decisions without having to write a new model for each design decision. Moreover, since these transformations preserve functionality, the transformed models do not need to be re-verified. We present the definition of Model Algebra and show how system level models can be represented as expressions in this formalism. The laws of Model Algebra are used to define correct model transformations. We show a system level design scenario in which design decisions gradually refine the functional model of the system into an architectural model with components and a communication structure. The refinement can be performed using the correct model transformations in our formalism.

Embedded software generation from system level specification for multi-tasking embedded systems
KiSeun Kwon,
YoungMin Yi,
DoHyung Kim,
SoonHoi Ha
Pages: 145-150
doi>10.1145/1120725.1120792
In this paper, we present a new design flow in which embedded software code is generated from the system level specification of a multi-tasking embedded system, both for simulation and for implementation. The generated software has a layered structure that uses virtual OS APIs and OS wrapper implementations to make it reconfigurable for multiple target platforms. The implementation of the OS wrapper is explained in detail. Using a DivX player example, we show experimental results comparing real-time performance on two different platforms.

Scheduler implementation in MP SoC design
Youngchul Cho,
Sungjoo Yoo,
Kiyoung Choi,
Nacer-Eddine Zergainoh,
Ahmed Amine Jerraya
Pages: 151-156
doi>10.1145/1120725.1120793
In the design of a heterogeneous multiprocessor system-on-chip, we face a new design problem: scheduler implementation. In this paper, we present an approach to implementing a static scheduler, which controls all task executions and communication transactions of a system according to a predetermined schedule. For the scheduler implementation, we consider both intra-processor and inter-processor synchronization. We also consider scheduler overhead, which is often neglected. In particular, we address the issue of centralized versus distributed implementation and investigate the pros and cons of the two approaches. Through experiments with synthetic examples and a real-world multimedia application, we show the effectiveness of our approach.

Optimizing embedded applications using programmer-inserted hints
G. Chen,
M. Kandemir
Pages: 157-160
doi>10.1145/1120725.1120794
This paper explores the possibility of exploiting programmer-inserted hints in application code to improve performance beyond what an optimizing compiler can achieve. These hints can be beneficial in two scenarios: (1) when compiler analysis fails to identify the opportunity for and/or the legality of a potential optimization, and (2) when it is not a good idea to invoke an optimization at the point where the opportunity is first encountered during execution. Our goal is to strike a balance between two extremes: a pure compiler-based scheme (i.e., a user-transparent approach) and a pure user-based scheme (i.e., assembly programming). In particular, we advocate a strategy in which a few programmer-inserted hints enable the compiler to do a much better job than the pure compiler approach, without requiring the programmer to encode low-level optimizations.

Static analysis and automatic code synthesis of flexible FSM model
Dohyung Kim,
Soonhoi Ha
Pages: 161-165
doi>10.1145/1120725.1120795
To describe complex control modules, extended FSM models need four features: concurrency, compositionality, static analyzability, and automatic code synthesis capability. In our codesign environment, we use a new FSM extension called the flexible FSM (fFSM) model. It extends expressive capability with concurrency, hierarchy, and state variables while maintaining formal properties. Because of the formality and structured nature of the fFSM model, we can apply a static analysis method to find ambiguous behavior and synthesize software and hardware automatically, which is the main focus of this paper. We expect that the proposed technique can be applied to other compositional FSM extensions.

SESSION: Test and DFT (2)
Kwang-Ting (Tim) Cheng,
Shiyi Xu

Constraint extraction for pseudo-functional scan-based delay testing
Yung-Chieh Lin,
Feng Lu,
Kai Yang,
Kwang-Ting Cheng
Pages: 166-171
doi>10.1145/1120725.1120797
Recent research results have shown that traditional structural testing for delay and crosstalk faults may result in over-testing, because a non-trivial number of such faults are untestable in the functional mode but testable in the test mode. This paper presents a pseudo-functional test methodology that attempts to minimize the over-testing problem of scan-based circuits for delay faults. The first pattern of a two-pattern test is still delivered by scan in the test mode, but it is generated so that it does not violate the functional constraints extracted from the functional logic. In this paper, we use a SAT solver to extract a set of functional constraints consisting of illegal states and internal signal correlations. Together with the functional justification (also called broad-side) test application scheme, the functional constraints are imposed on a commercial delay-fault ATPG tool to generate pseudo-functional delay tests. The experimental results indicate that the percentage of untestable delay faults is non-trivial for many circuits, which supports the hypothesis of an over-testing problem in delay testing. The results also indicate the effectiveness of the proposed constraint extraction method.
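To show what an "illegal state" constraint means, the sketch below finds the states of a tiny machine that are unreachable from reset and then drops scan-loaded first patterns that would place the circuit in one of them. It uses explicit breadth-first reachability, a different technique from the paper's SAT-based extraction and only practical for very small state spaces; the mod-3 counter and all function names are hypothetical.

```python
from collections import deque
from itertools import product

def reachable_states(reset_state, next_state, n_inputs):
    """Breadth-first search over next_state(state, inputs) -> state,
    where states and inputs are tuples of bits."""
    seen = {reset_state}
    frontier = deque([reset_state])
    while frontier:
        s = frontier.popleft()
        for x in product((0, 1), repeat=n_inputs):
            t = next_state(s, x)
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

def illegal_states(reset_state, next_state, n_state_bits, n_inputs):
    # every encodable state that functional operation can never reach
    reach = reachable_states(reset_state, next_state, n_inputs)
    return {s for s in product((0, 1), repeat=n_state_bits) if s not in reach}

def mod3_next(s, x):
    # hypothetical 2-bit mod-3 counter with an enable input
    (en,) = x
    if not en:
        return s
    v = (2 * s[0] + s[1] + 1) % 3
    return (v >> 1, v & 1)

bad = illegal_states((0, 0), mod3_next, n_state_bits=2, n_inputs=1)
scan_patterns = [(0, 0), (1, 1), (1, 0)]
legal_patterns = [p for p in scan_patterns if p not in bad]
print(bad, legal_patterns)
```

Here state (1, 1) can only be entered through scan, so a delay test launched from it may exercise behavior the chip never shows in the field; filtering it out is the essence of avoiding over-testing.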
Bridging fault detection in Double Fixed-Polarity Reed-Muller (DFPRM) PLA
Hafizur Rahaman,
Debesh K. Das
Pages: 172-177
doi>10.1145/1120725.1120798
A testable design for detecting stuck-at and bridging faults in Programmable Logic Arrays (PLAs) based on Double Fixed-Polarity Reed-Muller (DFPRM) expressions is proposed. DFPRMs are a generalization of FPRM expressions and offer compactness and easy testability. The EXOR part of the proposed design is implemented as a tree structure that admits a universal test set. For an n-variable function, this design can be tested by (2n+8) test vectors, which are independent of the function and the circuit-under-test (CUT). Except for a few intergate bridging faults in the EXOR tree, it detects all other single bridging faults (both OR- and AND-type) and all single stuck-at faults. This tree-based implementation significantly reduces circuit delay compared to a cascaded EXOR part.
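As background on the expressions involved, the sketch below computes ordinary fixed-polarity Reed-Muller (FPRM) coefficients from a truth table with the standard GF(2) butterfly; each nonzero coefficient becomes one product term feeding the EXOR part. It does not construct the paper's DFPRM form or its test set, and the function name `fprm_coefficients` is hypothetical.

```python
def fprm_coefficients(truth_table, polarity):
    """Fixed-polarity Reed-Muller coefficients of an n-variable function.

    truth_table[j] is f at the input whose bit i is x_i.
    Bit i of `polarity` = 1 means x_i appears complemented in every term.
    Returns c where c[m] = 1 iff the product of the literals selected by
    mask m appears in the EXOR sum.
    """
    size = len(truth_table)
    n = size.bit_length() - 1
    if size != 1 << n:
        raise ValueError("truth table length must be a power of two")
    # substitute x_i -> NOT x_i for the complemented variables
    c = [truth_table[j ^ polarity] for j in range(size)]
    # GF(2) butterfly, one stage per variable
    for i in range(n):
        bit = 1 << i
        for j in range(size):
            if j & bit:
                c[j] ^= c[j ^ bit]
    return c

# f = x0 OR x1, truth table indexed by (x1 x0)
print(fprm_coefficients([0, 1, 1, 1], polarity=0))  # x0 ^ x1 ^ x0*x1
print(fprm_coefficients([0, 1, 1, 1], polarity=3))  # 1 ^ ~x0*~x1
```

Choosing the polarity changes the number of product terms (three versus two in this example), which is why polarity selection matters for a compact PLA.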
Propagation delay fault: a new fault model to test delay faults
Xijiang Lin,
Janusz Rajski
Pages: 178-183
doi>10.1145/1120725.1120799
A new fault model, named the propagation delay fault model, is proposed to test gross gate delay defects modeled at each gate terminal together with distributed delay defects along the fault propagation paths. The proposed fault model assumes that the sum of the gross gate delay defect and the distributed delay defect is large enough to cause a timing violation on all paths passing through the fault site and the fault propagation path. Experimental results demonstrate that high fault coverage can be achieved in a reasonable amount of time and that the test set size is comparable to that generated for the transition fault model.

Oscillation ring based interconnect test scheme for SOC
Katherine Shu-Min Li,
Chung Len Lee,
Chauchin Su,
Jwu E Chen
Pages: 184-187
doi>10.1145/1120725.1120800
We propose a novel oscillation ring (OR) test architecture for testing interconnects in SoCs. In addition to stuck-at and open faults, this scheme can detect delay faults and crosstalk glitches. IEEE P1500 wrapper cells are modified. An efficient ring-generation algorithm is proposed to construct ORs based on a graph model. Experimental results on MCNC benchmark circuits show the feasibility of the scheme and the effectiveness of the algorithm. Our method achieves 100% fault coverage with a small number of tests.

Bridging fault testability of BDD circuits
Junhao Shi,
Görschwin Fey,
Rolf Drechsler
Pages: 188-191
doi>10.1145/1120725.1120801
In this paper, we study the testability of circuits derived from Binary Decision Diagrams (BDDs) under the bridging fault model. We show that testability can be formulated in terms of symbolic BDD operations, so that test pattern generation can be carried out in polynomial time. A technique to improve testability is presented. Experimental results show that a complete classification can be carried out very efficiently.

SESSION: TCAD
Kenji Nishi,
Changhong Dai

Yield driven gate sizing for coupling-noise reduction under uncertainty
Debjit Sinha,
Hai Zhou
Pages: 192-197
doi>10.1145/1120725.1120803
This paper presents a post-route gate-sizing algorithm for coupling-noise reduction that constrains yield loss under process variations. Coupling-noise reduction algorithms that do not consider uncertainty in the manufacturing process can leave a circuit susceptible to failure. Using probabilistic models, the coupling-noise reduction problem is solved as a fixpoint computation on a lattice. A novel gate-sizing algorithm with low area overhead is proposed for coupling-noise reduction under uncertainty. Experimental results are reported for the ISCAS benchmarks and larger circuits, with comparisons to traditional approaches.

Maze routing with OPC consideration
Yun-Ru Wu,
Ming-Chao Tsai,
Ting-Chi Wang
Pages: 198-203
doi>10.1145/1120725.1120804
As manufacturing process technology continues to advance, process variation becomes increasingly serious in nanometer designs. Optical proximity correction (OPC) is employed to correct the process variation caused by diffraction effects. To obtain the desired layout as early as possible, routers must be adapted to handle optical effects, both to reduce OPC time and to avoid routing results that cannot be corrected by the OPC process. In this paper, we formulate two practical OPC-aware maze routing problems and show how to enhance an existing maze routing algorithm to obtain an optimal algorithm for each problem. Experimental results demonstrate the effectiveness of the two enhanced algorithms.
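The abstract does not define the two OPC-aware problems, so the sketch below only shows the general mechanism of enhancing a maze router with an extra cost term. As a hypothetical stand-in for an OPC-related penalty, each bend (a corner that would need correction) costs `bend_cost` on top of unit wirelength. Because costs are no longer uniform, it searches with Dijkstra over (cell, incoming direction) states instead of Lee's breadth-first wave expansion. The function name and cost model are assumptions.

```python
import heapq

DIRS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def maze_route(grid, src, dst, bend_cost=5):
    """Minimum-cost path on a grid where each step costs 1 and each bend
    costs an extra `bend_cost`. grid[r][c] == 1 marks a blocked cell.
    Returns (cost, path) or None if dst is unreachable."""
    rows, cols = len(grid), len(grid[0])
    start = (src, -1)                 # -1 = no incoming direction yet
    best = {start: 0}
    parent = {start: None}
    heap = [(0, src, -1)]
    while heap:
        cost, cell, d = heapq.heappop(heap)
        if cost > best.get((cell, d), float("inf")):
            continue                  # stale heap entry
        if cell == dst:               # first pop of dst is the cheapest one
            path, state = [], (cell, d)
            while state is not None:
                path.append(state[0])
                state = parent[state]
            return cost, path[::-1]
        for nd, (dr, dc) in enumerate(DIRS):
            r, c = cell[0] + dr, cell[1] + dc
            if not (0 <= r < rows and 0 <= c < cols) or grid[r][c]:
                continue
            # changing direction (other than on the first step) is a bend
            step = 1 + (bend_cost if d not in (-1, nd) else 0)
            nxt = ((r, c), nd)
            if cost + step < best.get(nxt, float("inf")):
                best[nxt] = cost + step
                parent[nxt] = (cell, d)
                heapq.heappush(heap, (cost + step, (r, c), nd))
    return None

open_grid = [[0] * 3 for _ in range(3)]
print(maze_route(open_grid, (0, 0), (2, 2)))  # an L-shaped route with one bend
```

Tracking the incoming direction in the search state is what lets a bend-dependent cost stay optimal under Dijkstra; a plain per-cell search cannot tell whether entering a cell creates a corner.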
Towards automatic parameter extraction for surface-potential-based MOSFET models with the genetic algorithm
Masahiro Murakawa,
Mitiko Miura-Mattausch,
Tetsuya Higuchi
Pages: 204-207
doi>10.1145/1120725.1120805
In this paper, we present an automatic parameter extraction method using a genetic algorithm (GA) for surface-potential-based MOSFET models such as HiSIM (Hiroshima-university STARC IGFET Model). The method employs a two-stage extraction procedure operating on different sets of model parameters. Experimental results demonstrate that extraction of 34 parameters can be completed within 23 hours on a PC (AthlonXP 2500), whereas this would typically take a human expert several days.

Substrate resistance extraction with direct boundary element method
Xiren Wang,
Wenjian Yu,
Zeyi Wang
Pages: 208-211
doi>10.1145/1120725.1120806
Modeling substrate coupling is important in today's mixed-signal circuit designs. This paper presents a direct boundary element method (BEM) for substrate resistance calculation, in which only the boundary of the substrate region is discretized. First, an efficient scheme for non-uniform element partitioning is proposed. Second, a new technique is presented that reduces the size of the resulting linear system and thus accelerates equation solving, especially for problems with multiple right-hand sides such as substrate resistance extraction. Experiments show that the proposed method is highly efficient compared with existing methods while preserving high accuracy.

An efficient combinationality check technique for the synthesis of cyclic combinational circuits
Vineet Agarwal,
Navneeth Kankani,
Ravishankar Rao,
Sarvesh Bhardwaj,
Janet Wang
Pages: 212-215
doi>10.1145/1120725.1120807
It has recently been pointed out that cyclic circuits are not necessarily sequential, and that cyclic topologies that are combinational generally have lower literal counts than their acyclic counterparts. However, the synthesis of cyclic combinational circuits is potentially expensive because of the need to explore a wide range of cyclic topologies and check each of them for combinationality. We first obtain an acyclic implementation of the given set of Boolean functions. Then, using a branch-and-bound heuristic, we generate cyclic circuits to be checked for combinationality. Unlike earlier, more complex combinationality checks, our approach checks whether the cyclic circuit is functionally equivalent to the acyclic circuit obtained earlier. When synthesizing cyclic circuits with the proposed method, we observed improvements of up to 45% in literal count (for the Espresso and LGsynth93 benchmarks) over the acyclic circuits synthesized by the Berkeley SIS package.
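The abstract leaves open how the equivalence check treats values that never settle inside a cycle. The sketch below makes one common choice explicit: three-valued (0, 1, X) simulation starting from all-X, a standard combinationality test rather than the authors' method, combined with exhaustive comparison against the acyclic reference. A cyclic circuit passes only if, for every input, the output resolves to a binary value equal to the reference. The example circuit and all names are hypothetical, and exhaustive enumeration is only practical for a handful of inputs.

```python
from itertools import product

X = None  # unknown value in three-valued (0, 1, X) simulation

def t_not(a):
    return X if a is X else 1 - a

def t_and(a, b):
    if a == 0 or b == 0:
        return 0  # a controlling 0 decides the output even if the other input is X
    if a is X or b is X:
        return X
    return 1

def t_or(a, b):
    if a == 1 or b == 1:
        return 1  # a controlling 1 decides the output
    if a is X or b is X:
        return X
    return 0

def ternary_fixpoint(step, inputs, nets):
    """Start every net at X and re-evaluate until nothing changes."""
    state = {n: X for n in nets}
    for _ in range(len(nets) + 1):  # monotone, so this many rounds suffice
        nxt = step(inputs, state)
        if nxt == state:
            break
        state = nxt
    return state

def combinational_and_equivalent(step, nets, out, reference, n_inputs):
    for inputs in product((0, 1), repeat=n_inputs):
        value = ternary_fixpoint(step, inputs, nets)[out]
        if value is X or value != reference(inputs):
            return False
    return True

def cyclic_step(inputs, s):
    # hypothetical cyclic circuit: n1 = c ? b : n2,  n2 = c ? n1 : a
    a, b, c = inputs
    n1 = t_or(t_and(c, b), t_and(t_not(c), s["n2"]))
    n2 = t_or(t_and(t_not(c), a), t_and(c, s["n1"]))
    return {"n1": n1, "n2": n2}

def mux_reference(inputs):
    # acyclic reference: c ? b : a
    a, b, c = inputs
    return b if c else a

print(combinational_and_equivalent(cyclic_step, ["n1", "n2"], "n1", mux_reference, 3))
```

The cycle through n1 and n2 is always broken by the controlling value of c, so the check succeeds; a pure feedback loop such as n1 = n2, n2 = n1 stays at X and is rejected.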
Library cell layout with Alt-PSM compliance and composability
Ke Cao,
Puneet Dhawan,
Jiang Hu
Pages: 216-219
doi>10.1145/1120725.1120808
The sustained miniaturization of VLSI feature sizes presents great challenges to sub-wavelength photolithography and requires the use of many Resolution Enhancement Techniques (RETs). The difficulty and feasibility of deploying an RET such as Alternating Phase Shifting Mask (Alt-PSM) depend heavily on the circuit layout. In this paper, we propose a Boolean satisfiability (SAT) based library cell layout method that achieves Alt-PSM compliance and composability in a constructive manner. Compared to a previously reported post-processing approach, our method often yields further improvement in cell area efficiency.
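Under a common simplified model (an assumption here, not taken from the paper), Alt-PSM compliance is a 2-coloring question: the two shifters flanking each critical feature must receive opposite phases (0 and 180 degrees), so a layout is compliant exactly when its phase-conflict graph has no odd cycle. The sketch below only checks compliance by breadth-first 2-coloring; it does not perform the paper's SAT-based layout construction, and the function name is hypothetical.

```python
from collections import deque

def assign_phases(n_shifters, opposite_pairs):
    """Assign phase 0 or 180 to shifters so that every pair in
    `opposite_pairs` (shifters flanking the same critical feature) gets
    opposite phases. Returns a dict shifter -> 0/180, or None when an
    odd cycle makes the layout non-compliant."""
    adj = {i: [] for i in range(n_shifters)}
    for a, b in opposite_pairs:
        adj[a].append(b)
        adj[b].append(a)
    phase = {}
    for root in range(n_shifters):
        if root in phase:
            continue
        phase[root] = 0  # each connected component can start at either phase
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in phase:
                    phase[v] = 180 - phase[u]
                    queue.append(v)
                elif phase[v] == phase[u]:
                    return None  # odd cycle: phase conflict
    return phase

print(assign_phases(3, [(0, 1), (1, 2)]))          # compliant chain
print(assign_phases(3, [(0, 1), (1, 2), (2, 0)]))  # odd cycle -> None
```

The freedom to flip each component's phases independently is also what makes composability nontrivial: when cells are abutted, shifters at the shared boundary must agree, which a constructive layout method has to anticipate.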
Forward discrete probability propagation method for device performance characterization under process variations
Rasit Onur Topaloglu,
Alex Orailoglu
Pages: 220-223
doi>10.1145/1120725.1120809
Process variations are becoming influential at the device level in deep sub-micron and sub-wavelength design regimes, whereas a few generations ago they were influential only at the circuit level. Process variations cause device performance parameters, such as current or output resistance, to acquire a probability distribution. So far, these distributions have been estimated using Monte Carlo techniques. The large number of samples needed by Monte Carlo methods hinders the integration of probabilistic device performance at the circuit level because of run-time inefficiency. In this paper, we introduce a novel technique called Forward Discrete Probability Propagation (FDPP). This method discretizes the probability distributions and propagates these probabilities across a device formula hierarchy, such as the one present in the SPICE3v3 model. Consequently, probability distributions of process parameters are propagated to the device level. We show that accuracy comparable to a Monte Carlo method is achieved with far fewer samples.
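The core idea of forward discrete propagation can be sketched in a few lines: represent each random parameter as a small table of (value, probability) pairs and push the tables through a device formula, multiplying probabilities of input combinations. The sketch below assumes independent inputs, equal-width bins over mu +/- 3 sigma, and a hypothetical square-law drain-current formula; the paper's bin placement and its traversal of the SPICE3v3 formula hierarchy are not described in the abstract and are not reproduced here.

```python
import math
from collections import defaultdict

def discretize_normal(mu, sigma, n_bins=7, span=3.0):
    """Approximate N(mu, sigma^2) by n_bins equal-width bins over
    mu +/- span*sigma, each bin represented by its center value."""
    width = 2 * span * sigma / n_bins
    cdf = lambda x: 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
    dist = {}
    for k in range(n_bins):
        lo = mu - span * sigma + k * width
        dist[lo + width / 2] = cdf(lo + width) - cdf(lo)
    total = sum(dist.values())
    return {v: p / total for v, p in dist.items()}  # renormalize truncated tails

def propagate(f, dist_a, dist_b):
    """Push two independent discrete distributions {value: prob} through
    y = f(a, b), merging identical output values."""
    out = defaultdict(float)
    for a, pa in dist_a.items():
        for b, pb in dist_b.items():
            out[f(a, b)] += pa * pb
    return dict(out)

def mean(dist):
    return sum(v * p for v, p in dist.items())

# Hypothetical device: I_D = k/2 * (V_GS - V_T)^2 with V_GS = 1.0 V
vt = discretize_normal(0.4, 0.02)   # threshold voltage (V)
kp = discretize_normal(2e-4, 1e-5)  # transconductance parameter (A/V^2)
i_d = propagate(lambda v, k: 0.5 * k * (1.0 - v) ** 2, vt, kp)
print(len(i_d), mean(i_d))
```

Seven bins per input give 49 weighted output points here, versus the thousands of samples a Monte Carlo run would typically draw. A practical implementation would also re-bin each intermediate result so that table sizes do not grow multiplicatively through a deep formula hierarchy.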
SESSION: Simulation and modeling techniques for RF/analog circuits
Jaijeet Roychowdhury,
Yici Cai

Wideband modeling of RF/Analog circuits via hierarchical multi-point model order reduction
Zhenyu Qi,
Sheldon X.-D. Tan,
Hao Yu,
Lei He
Pages: 224-229
doi>10.1145/1120725.1120811
This paper proposes a novel wideband modeling technique for high-performance RF passives and linear(ized) analog circuits. The new method is based on a recently proposed s-domain hierarchical modeling and analysis method [27]. Theoretically, we show that s-domain hierarchical reduction is equivalent to implicit moment matching around s = 0, and that the existing hierarchical reduction method with one-point expansion is numerically stable for general tree-structured circuits. Practically, we propose a hierarchical multi-point reduction scheme for high-fidelity, wideband modeling of general passive or active linear circuits. A novel explicit waveform matching algorithm is proposed for searching the dominant poles and residues from different expansion points within the hierarchical reduction framework. Experimental results on large analog circuits and on-chip spiral inductors are presented to validate the proposed method.

Efficient symbolic sensitivity analysis of analog circuits using element-coefficient diagrams
Huiying Yang,
Mukesh Ranjan,
Wim Verhaegen,
Mengmeng Ding,
Ranga Vemuri,
Georges Gielen
Pages: 230-235
doi>10.1145/1120725.1120812
This paper presents a new method for efficient first-order symbolic sensitivity analysis of analog circuits by direct differentiation of symbolic expressions stored as element-coefficient diagrams (ECDs). An ECD is a compact graphical representation of a symbolic transfer function; it is the cancellation-free, per-coefficient term-generation version of determinant decision diagrams (DDDs). The symbolic sensitivity equations obtained from ECDs are stored as sensitivity ECDs (SECDs) and can be evaluated extremely fast, since they inherit the properties of ECDs. The proposed methodology has been applied to calculate the sensitivities of four benchmark circuits and has been shown to be as accurate as, and more efficient than, numerical sensitivity analysis performed by SPECTRE.

A new approach for ring oscillator simulation using the harmonic balance method
Xiaochun Duan,
Kartikeya Mayaram
Pages: 236-239
doi>10.1145/1120725.1120813
A novel approach for simulating the periodic steady state of ring oscillators with the harmonic balance method is described. A single delay cell based equivalent circuit is simulated and used to determine the response of the overall circuit. This results in an algorithm that is computationally efficient and readily converges for a variety of ring oscillator circuits. expand

Efficient transient simulation for transistor-level analysis
Zhengyong Zhu,
Khosro Rouz,
Manjit Borah,
Chung-Kuan Cheng,
Ernest S. Kuh
Pages: 240-243
doi:10.1145/1120725.1120814

In this paper, we introduce an efficient transistor-level simulation tool with SPICE accuracy for deep-submicron (DSM) VLSI circuits with strong coupling effects. The new approach uses multigrid methods for large power/ground, clock, and signal interconnect networks. Transistor devices are integrated using a novel two-stage Newton-Raphson method that dynamically models the interface between the linear network and the nonlinear devices. Speedups of orders of magnitude over Berkeley SPICE3 are observed on a set of DSM design circuits.
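The inner kernel of any Newton-Raphson coupling between a linear network and a nonlinear device is an ordinary Newton-Raphson solve at their interface. A minimal sketch, assuming the linear network has already been reduced to a Thevenin equivalent (`v_th`, `r_th`) seen by a single diode; the function name, device model, starting guess, and step limit are illustrative assumptions, not the paper's two-stage method.

```python
import math


def solve_device_node(v_th, r_th, i_s=1e-14, v_t=0.02585, tol=1e-12, max_iter=100):
    """Newton-Raphson for the node voltage v where a diode meets a Thevenin source.

    KCL at the node: (v_th - v)/r_th - i_s*(exp(v/v_t) - 1) = 0.
    """
    v = 0.6  # starting guess near a silicon junction drop (assumption)
    for _ in range(max_iter):
        f = (v_th - v) / r_th - i_s * math.expm1(v / v_t)
        df = -1.0 / r_th - (i_s / v_t) * math.exp(v / v_t)  # derivative of f
        step = f / df
        step = max(-0.1, min(0.1, step))  # crude step limiting (SPICE uses junction limiting)
        v -= step
        if abs(step) < tol:
            return v
    raise RuntimeError("Newton-Raphson did not converge")
```

With `v_th = 5.0` and `r_th = 1e3`, the solve settles near 0.69 V after the first step is clamped from an overshoot to 0.7 V.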

Block SAPOR: block Second-order Arnoldi method for Passive Order Reduction of multi-input multi-output RCS interconnect circuits
Bang Liu,
Xuan Zeng,
Yangfeng Su,
Jun Tao,
Zhaojun Bai,
Charles Chiang,
Dian Zhou
Pages: 244-249
doi:10.1145/1120725.1120815

Model order reduction techniques for second-order systems have recently attracted much research interest for the simulation of RCS interconnect circuits containing susceptance elements. In this paper, we propose Block SAPOR (Block Second-order Arnoldi method for Passive Order Reduction) for multi-input multi-output RCS circuits. The proposed algorithm simultaneously guarantees passivity and achieves higher accuracy than the first-order reduction technique PRIMA. Most importantly, the reduced system matrices obtained by the proposed method preserve the structure of the original system matrices, which makes it possible to construct an equivalent RCS circuit for the reduced system.
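For reference, RCS circuits are commonly written in the second-order form below, where G, C, and Γ are the conductance, capacitance, and susceptance (inverse-inductance) matrices of a nodal formulation, B and L are the input and output incidence matrices, zero initial conditions are assumed, and V is an orthonormal projection basis (here assumed to come from a second-order Arnoldi process). The notation is an assumption, not taken from the paper. Because the reduced matrices are obtained by congruence, they remain symmetric and positive semidefinite whenever the originals are, which is the usual route both to passivity and to synthesizing a reduced RCS netlist.

```latex
\begin{aligned}
G\,x(t) + C\,\dot{x}(t) + \Gamma \int_{0}^{t} x(\tau)\,d\tau &= B\,u(t),
  \qquad y(t) = L^{T} x(t), \\
\bigl(s^{2} C + s\,G + \Gamma\bigr)\,X(s) &= s\,B\,U(s), \\
C_{r} = V^{T} C V,\quad G_{r} = V^{T} G V,\quad \Gamma_{r} &= V^{T} \Gamma V,
  \quad B_{r} = V^{T} B,\quad L_{r} = V^{T} L.
\end{aligned}
```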

Block based statistical timing analysis with extended canonical timing model
Lizheng Zhang,
Yuhen Hu,
Charlie Chung-Ping Chen
Pages: 250-253
doi:10.1145/1120725.1120816

Block-based statistical timing analysis (STA) tools often yield less accurate results when timing variables become correlated due to global sources of variation and path reconvergence. To the best of our knowledge, no good solution is available that handles both types of correlation simultaneously.

In this paper, we present a novel statistical timing algorithm, AMECT (Asymptotic MAX/MIN approximation & Extended Canonical Timing model), that produces accurate timing estimates by handling both types of correlation simultaneously. An extended canonical timing model is developed to evaluate and decompose correlations between arbitrary timing variables, and an intelligent pruning method is designed to trade off runtime against accuracy.

Tested on the ISCAS benchmark suites, AMECT shows both high accuracy and high performance compared with Monte Carlo simulation: the distribution estimation error is below 1.5%, with a speedup of about 350X on a circuit with 5355 gates.
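The abstract does not spell out AMECT's extended canonical model or its asymptotic MAX/MIN approximation. As background only, here is a minimal sketch of the standard first-order canonical delay form with Clark's moment-matching MAX, the usual baseline that block-based SSTA builds on. All names are hypothetical and this is not AMECT.

```python
import math
from dataclasses import dataclass


def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


@dataclass
class Canonical:
    """First-order canonical delay: mean + sum(sens[i] * X_i) + rand * R, with X_i, R ~ N(0, 1)."""
    mean: float
    sens: list   # sensitivities to shared global variation sources X_i
    rand: float  # std dev of the independent (uncorrelated) part

    def var(self):
        return sum(a * a for a in self.sens) + self.rand ** 2


def cov(a, b):
    # Only the shared global sources contribute to covariance.
    return sum(x * y for x, y in zip(a.sens, b.sens))


def add(a, b):
    """SUM of two canonical delays (exact in this form)."""
    return Canonical(a.mean + b.mean,
                     [x + y for x, y in zip(a.sens, b.sens)],
                     math.hypot(a.rand, b.rand))


def clark_max(a, b):
    """MAX via Clark's moment matching, re-expressed in canonical form."""
    va, vb = a.var(), b.var()
    theta = math.sqrt(max(va + vb - 2.0 * cov(a, b), 1e-18))
    alpha = (a.mean - b.mean) / theta
    t = _cdf(alpha)  # tightness probability P(a > b)
    mean = a.mean * t + b.mean * (1 - t) + theta * _pdf(alpha)
    second = ((a.mean ** 2 + va) * t + (b.mean ** 2 + vb) * (1 - t)
              + (a.mean + b.mean) * theta * _pdf(alpha))
    var = max(second - mean ** 2, 0.0)
    sens = [t * x + (1 - t) * y for x, y in zip(a.sens, b.sens)]
    # The independent part absorbs whatever variance the shared terms do not explain.
    rand = math.sqrt(max(var - sum(s * s for s in sens), 0.0))
    return Canonical(mean, sens, rand)
```

For two independent standard normal arrival times, `clark_max` returns mean 1/√π and variance 1 - 1/π, the exact moments of their maximum.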

SESSION: Logic synthesis
Jianwen Zhu,
Sikun Li

FSM re-engineering and its application in low power state encoding
Lin Yuan,
Gang Qu,
Tiziano Villa,
Alberto Sangiovanni-Vincentelli
Pages: 254-259
doi:10.1145/1120725.1120844

We propose Finite State Machine (FSM) re-engineering, a performance enhancement framework for FSM synthesis and optimization procedures. We start with any traditional FSM synthesis and optimization procedure; then reconstruct a functionally equivalent but topologically different FSM based on the optimization objective; and conclude with another round of FSM synthesis and optimization (which can use the same procedure) on the newly constructed FSM. This allows us to explore a larger solution space that includes synthesis solutions for functionally equivalent FSMs rather than only the original FSM, making it possible to obtain solutions better than the optimal ones for the original FSM. Guided by the result of the first round of FSM synthesis, the solution space exploration can be rapid and cost-efficient.

To demonstrate this framework, we develop a genetic algorithm and a fast heuristic to re-engineer a low power state encoding procedure, POW3 [1]. On average, POW3 reduces switching activity by 12% over non-power-driven state encoding schemes on the MCNC FSM benchmarks. We then re-engineer these benchmarks with the proposed genetic algorithm and heuristic, respectively. When we apply POW3 to the re-engineered FSMs, we observe an additional 8.9% and 6.0% reduction in switching activity, respectively. This translates to an average energy reduction of 7.9% with little area increase. Finally, we obtain the optimal low power encoding for small benchmarks from an integer linear programming formulation. We find that the POW3-encoded original FSMs are 27.0% worse than the optimal, but this gap drops to 6.7% when POW3 is applied to the re-engineered FSMs.
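POW3 and the ILP optimum both target the switching activity of the state register. A common cost model, assumed here because the abstract does not define one, weights the Hamming distance between the codes of each transition by that transition's probability. The exhaustive `best_encoding` below is a hypothetical stand-in for an exact formulation on toy examples only; it is neither POW3 nor the re-engineering step.

```python
from itertools import permutations


def switching_activity(trans_prob, encoding):
    """Expected state-register bit flips per cycle.

    trans_prob maps (src, dst) -> probability of that transition per cycle;
    encoding maps state -> integer code. Cost = sum p * Hamming(code_src, code_dst).
    """
    return sum(p * bin(encoding[i] ^ encoding[j]).count("1")
               for (i, j), p in trans_prob.items())


def best_encoding(states, trans_prob, n_bits):
    """Exhaustive search over distinct codes; only practical for tiny FSMs."""
    best = None
    for codes in permutations(range(2 ** n_bits), len(states)):
        enc = dict(zip(states, codes))
        cost = switching_activity(trans_prob, enc)
        if best is None or cost < best[0]:
            best = (cost, enc)
    return best
```

For a four-state cycle A→B→C→D→A with equal transition probabilities, plain binary codes cost 1.5 flips per cycle, while the best (Gray-like) assignment reaches the minimum of 1.0.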

Post-layout logic duplication for synthesis of domino circuits with complex gates
Aiqun Cao,
Ruibing Lu,
Cheng-Kok Koh
Pages: 260-265
doi:10.1145/1120725.1120845

Logic duplication, used to resolve the reconvergent-path problem encountered in domino logic synthesis, is expensive in terms of area and power. In this paper, we propose a combined logic duplication minimization and technology mapping scheme for domino circuits with complex gates. Logic duplication is performed as a post-layout step so that the duplication cost can be minimized using accurate timing information. Experimental results show significant improvements in area, power, and delay.
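Domino gates are non-inverting, so a signal needed in both polarities along reconvergent paths forces duplication of its logic. A minimal sketch of how such nodes can be found, assuming a simple AND/OR/NOT netlist with primary inputs available in both polarities: push inverters backward (De Morgan) and record which polarities each node must supply. The names are hypothetical; this only identifies candidates and is not the paper's combined duplication-minimization and technology-mapping scheme, and it ignores timing.

```python
def required_polarities(netlist, outputs):
    """Propagate the polarities each node must supply in an inverter-free (domino) mapping.

    netlist maps name -> (op, fanins) with op in {"IN", "AND", "OR", "NOT"}.
    Polarity True means the true signal is needed, False means its complement.
    """
    order, seen = [], set()

    def visit(n):  # DFS post-order: fanins are appended before the node
        if n in seen:
            return
        seen.add(n)
        for f in netlist[n][1]:
            visit(f)
        order.append(n)

    for o in outputs:
        visit(o)
    need = {n: set() for n in order}
    for o in outputs:
        need[o].add(True)
    for n in reversed(order):  # every fanout is processed before its fanins
        op, fanins = netlist[n]
        for pol in need[n]:
            for f in fanins:
                # NOT flips polarity; AND/OR pass it on (the complement of an AND is
                # an OR of complemented inputs, by De Morgan, and vice versa).
                need[f].add(not pol if op == "NOT" else pol)
    return need


def nodes_to_duplicate(netlist, outputs):
    """Gates needed in both polarities must be duplicated in a domino implementation."""
    need = required_polarities(netlist, outputs)
    return {n for n, pols in need.items()
            if len(pols) == 2 and netlist[n][0] in ("AND", "OR")}
```

In a netlist where gate `g = AND(a, b)` feeds one output directly and another through an inverter, `nodes_to_duplicate` reports `{"g"}`.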

Detecting support-reducing bound sets using two-cofactor symmetries
Jin S. Zhang,
Malgorzata Chrzanowska-Jeske,
Alan Mishchenko,
Jerry R. Burch
Pages: 266-271
doi:10.1145/1120725.1120846

Detecting support-reducing bound sets is an important step in Boolean decomposition, affecting both the quality and the runtime of several applications in technology mapping and re-synthesis. This paper presents an efficient heuristic method for detecting support-reducing bound sets using two-cofactor symmetries. Experiments on the MCNC and ITC benchmarks show an average 40x speedup over the published exhaustive method for bound set construction.
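Two-cofactor symmetries relate the four cofactors f00, f01, f10, f11 of a function with respect to a pair of variables. The abstract does not specify the data structure used, so this minimal sketch uses explicit truth tables purely for clarity; the helper names are hypothetical and the sketch does not reproduce the paper's bound-set heuristic.

```python
from itertools import product


def cofactor(f, n, i, j, vi, vj):
    """Truth table of f with x_i = vi and x_j = vj, over the remaining n - 2 variables."""
    table = []
    for rest in product((0, 1), repeat=n - 2):
        it = iter(rest)
        bits = [vi if k == i else vj if k == j else next(it) for k in range(n)]
        table.append(f(bits))
    return tuple(table)


def equal_cofactor_pairs(f, n, i, j):
    """Pairs of equal two-variable cofactors among f00, f01, f10, f11.

    f01 == f10 is classical (non-skew) symmetry in x_i, x_j;
    f00 == f11 is skew (equivalence) symmetry.
    """
    cof = {(a, b): cofactor(f, n, i, j, a, b) for a in (0, 1) for b in (0, 1)}
    keys = sorted(cof)
    return [(p, q) for idx, p in enumerate(keys) for q in keys[idx + 1:]
            if cof[p] == cof[q]]
```

For the three-input majority function, only f01 and f10 coincide for (x0, x1), while two-input XOR shows both kinds of symmetry.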

Synthesis of quantum logic circuits
Vivek V. Shende,
Stephen S. Bullock,
Igor L. Markov
Pages: 272-275
doi:10.1145/1120725.1120847

The pressure of fundamental limits on classical computation and the promise of exponential speedups from quantum effects have recently brought quantum circuits to the attention of the EDA community [10, 17, 4, 16, 9]. We discuss efficient circuits to initialize quantum registers and implement generic quantum computations. Our techniques yield circuits half the size of those produced by the best previously published technique. Moreover, a theoretical lower bound shows that our new circuits can be improved by at most a factor of two. Further, the circuits grow by at most a factor of nine under severe architectural restrictions.

STACCATO: disjoint support decompositions from BDDs through symbolic kernels
Stephen Plaza,
Valeria Bertacco
Pages: 276-279
doi:10.1145/1120725.1120848

A disjoint support decomposition (DSD) is a representation of a Boolean function F obtained by composing two or more simpler component functions that have no common inputs. Decomposing a function is desirable for several reasons. First, it is a way to obtain a multi-level implementation of the function. It partitions the function into simpler blocks, which readily leads to smaller area and fewer interconnects. Moreover, it exposes parallelism in the computation of the function that can be exploited both in hardware and during simulation.

In this paper we present a novel algorithm, STACCATO, that generates a DSD starting from the BDD of a function. STACCATO is novel because 1) it provides a complete description of each decomposition, that is, it computes the "kernel" function K relating the elements of each decomposition, and 2) it performs better than previously known algorithms. Experimental results on both IWLS and industrial test-benches show that STACCATO is in most cases at least three times as fast as previously known solutions.
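For intuition only: the classical Ashenhurst test checks whether a single simple disjoint decomposition F = G(h(X), Y) exists for a chosen bound set X, using an explicit decomposition chart. This minimal sketch is exponential in the number of variables and is not STACCATO's BDD-based algorithm; the function names are hypothetical.

```python
from itertools import product


def column_multiplicity(f, n, bound):
    """Number of distinct columns in the decomposition chart of f for bound set `bound`.

    Each column fixes the bound-set variables and lists f over all free-variable values.
    """
    free = [k for k in range(n) if k not in bound]
    columns = set()
    for xb in product((0, 1), repeat=len(bound)):
        col = []
        for yb in product((0, 1), repeat=len(free)):
            bits = [0] * n
            for k, v in zip(bound, xb):
                bits[k] = v
            for k, v in zip(free, yb):
                bits[k] = v
            col.append(f(bits))
        columns.add(tuple(col))
    return len(columns)


def has_simple_disjoint_decomposition(f, n, bound):
    """Ashenhurst's condition: f = g(h(bound vars), free vars) iff multiplicity <= 2."""
    return column_multiplicity(f, n, bound) <= 2
```

For example, (x0 AND x1) OR x2 decomposes with bound set {x0, x1}, while the three-input majority function does not.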

SESSION: System level architecture design
Sreedhar Natarrajan,
Soo-Ik Chae

A framework for automated and optimized ASIP implementation supporting multiple hardware description languages
Oliver Schliebusch,
A. Chattopadhyay,
D. Kammler,
G. Ascheid,
R. Leupers,
H. Meyr,
Tim Kogel
Pages: 280-285
doi:10.1145/1120725.1120850

Architecture Description Languages (ADLs) are widely used to perform design space exploration for Application Specific Instruction Set Processors (ASIPs). While design space exploration is well supported by numerous tools providing high flexibility and quality, automated implementation is limited to simple transformations. Assuming fixed architectural templates, the information given in the ADL is mapped directly to a hardware description at the Register Transfer Level (RTL). Gate-level synthesis tools are not able to perform potential optimizations, as the computational complexity grows exponentially with the size of the architecture. Information such as exclusiveness, parallelism, or Boolean relations is spread over multiple modules and is therefore hard to determine. In this paper, we present an ASIP synthesis approach from architecture description languages based on an Intermediate Representation (IR). The IR is the key technology that enables new language-independent high-level optimizations and different hardware description language backends. The feasibility of our approach is demonstrated in a case study.

A processor core synthesis system in IP-based SoC design
Naoki Tomono,
Shunitsu Kohara,
Jumpei Uchida,
Yuichiro Miyaoka,
Nozomu Togawa,
Masao Yanagisawa,
Tatsuo Ohtsuki
Pages: 286-291
doi:10.1145/1120725.1120851

This paper proposes a new design methodology for SoCs that reuses hardware IPs. In our approach, after system-level HW/SW partitioning, we use IPs for the hardware parts but synthesize a new processor core instead of reusing a processor core IP. The system executes hardware and software efficiently in parallel by taking into account the response time of each hardware IP, obtained by the proposed calculation algorithm. Optimal hardware IPs are chosen by the proposed hardware IP selection algorithm. Experimental results show the effectiveness of the new design methodology.

Speed and voltage selection for GALS systems based on voltage/frequency islands
Koushik Niyogi,
Diana Marculescu
Pages: 292-297
doi:10.1145/1120725.1120852

Due to increasing clock speeds and shrinking technologies, distributing a single global clock signal throughout a chip is becoming increasingly difficult. In this paper, we address the problem of energy-optimal local speed and voltage selection in frequency/voltage island based systems under given performance constraints. Our results show that static voltage and speed assignment can achieve up to 42% savings in total energy for various media and signal processing applications, while application-specific dynamic approaches provide up to 44% energy savings for an MPEG-2 encoder application, compared with a single-clock system architecture.
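Why per-island voltage selection saves energy can be seen with a toy static assignment. This minimal sketch assumes dynamic energy of C_eff·V² per cycle (leakage ignored), a small table of (voltage, frequency) levels, and a common deadline; the level values and function names are hypothetical, and neither the paper's energy model nor its dynamic approach is reproduced.

```python
def pick_level(cycles, deadline, levels):
    """Lowest-voltage (V, f) level that completes `cycles` within `deadline` seconds."""
    feasible = [(v, f) for v, f in levels if cycles / f <= deadline]
    if not feasible:
        raise ValueError("no voltage/frequency level meets the deadline")
    return min(feasible)  # lowest V first; dynamic energy per cycle grows with V^2


def dynamic_energy(cycles, v, c_eff=1.0):
    """Switching energy ~ C_eff * V^2 per cycle; leakage is ignored in this sketch."""
    return c_eff * cycles * v * v


def islands_vs_single_clock(island_cycles, deadline, levels):
    """Energy with a per-island level versus one level shared by the whole chip."""
    per_island = sum(dynamic_energy(n, pick_level(n, deadline, levels)[0])
                     for n in island_cycles)
    v_chip, _ = pick_level(max(island_cycles), deadline, levels)
    single = sum(dynamic_energy(n, v_chip) for n in island_cycles)
    return per_island, single
```

With levels (1.2 V, 200 MHz), (1.0 V, 150 MHz), (0.8 V, 100 MHz), a 1 ms deadline, and islands of 180,000 and 90,000 cycles, the lightly loaded island can drop to 0.8 V while a single-clock chip must run both at 1.2 V.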

A system-level approach to hardware reconfigurable systems
Christian Haubelt,
Stephan Otto,
Cornelia Grabbe,
Jürgen Teich
Pages: 298-301
doi:10.1145/1120725.1120853

There is a trend toward networked and distributed hardware reconfigurable systems, which complicates the design process at the system level. This paper provides a solution to the problem of design space exploration for such next-generation embedded systems. We show the problems that occur when exploring the design space at the system level, which lead to new properties for valid implementations. The novelty of this approach lies in supporting both explicit communication modeling and time-multiplexed architecture modeling in a single model. The proposed design space exploration is based on Evolutionary Algorithms and a new slack-based list scheduler.
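As a minimal sketch of the general idea behind a slack-based list scheduler, assuming a single type of identical resource and integer latencies: slack is ALAP minus ASAP start time, and ready operations with the least slack are scheduled first. The function name and inputs are hypothetical, and communication and reconfiguration, which the paper models explicitly, are not represented.

```python
def list_schedule(ops, preds, delay, n_units):
    """Slack-priority list scheduling on n_units identical resources.

    ops must be in topological order; preds maps op -> list of predecessor ops;
    delay maps op -> integer latency. Returns (start times, slack per op).
    """
    succs = {o: [] for o in ops}
    for o in ops:
        for p in preds[o]:
            succs[p].append(o)
    asap = {}
    for o in ops:
        asap[o] = max((asap[p] + delay[p] for p in preds[o]), default=0)
    horizon = max(asap[o] + delay[o] for o in ops)  # critical-path length
    alap = {}
    for o in reversed(ops):
        alap[o] = min((alap[s] for s in succs[o]), default=horizon) - delay[o]
    slack = {o: alap[o] - asap[o] for o in ops}

    start, finish, busy = {}, {}, []
    t = 0
    while len(start) < len(ops):
        busy = [f for f in busy if f > t]  # units still occupied at time t
        ready = [o for o in ops if o not in start
                 and all(p in finish and finish[p] <= t for p in preds[o])]
        ready.sort(key=lambda o: (slack[o], o))  # least slack first
        for o in ready[:n_units - len(busy)]:
            start[o], finish[o] = t, t + delay[o]
            busy.append(finish[o])
        t += 1
    return start, slack
```

On one resource, a critical chain a→c (slack 0) is scheduled ahead of the independent operations b and d (slack 2 each), so the critical path is not delayed.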

High-level synthesis for DSP applications using heterogeneous functional units
Zili Shao,
Qingfeng Zhuge,
Chun Xue,
Bin Xiao,
Edwin H.-M. Sha
Pages: 302-304
doi:10.1145/1120725.1120854

This paper addresses high-level synthesis for real-time digital signal processing (DSP) architectures using heterogeneous functional units (FUs). For such special-purpose architecture synthesis, an important problem is how to assign a proper FU type to each operation of a DSP application and generate a schedule such that all requirements are met and the total cost is minimized. We propose a two-phase approach to solve this problem. In the first phase, we propose an algorithm that assigns proper FU types to the operations so that the total cost is minimized while the timing constraint is satisfied. In the second phase, based on the assignments obtained in the first phase, we propose a minimum resource scheduling algorithm that generates a schedule and a feasible configuration using as few resources as possible. Experimental results show that our approach generates high-performance assignments and schedules with significant reductions in total cost compared with previous work.
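For the special case of a simple chain of dependent operations, choosing an FU type per operation to minimize total cost under a timing constraint reduces to a knapsack-like dynamic program over accumulated time. This is a minimal sketch with hypothetical names and numbers; the paper's first-phase algorithm handles general DSP data-flow graphs, and its second, scheduling phase is not attempted here.

```python
def assign_fu_types(options, deadline):
    """Cheapest FU-type choice for a chain of operations under a total-time deadline.

    options[i] lists (time, cost) pairs, one per FU type usable by operation i.
    Returns (total_cost, chosen_type_indices), or None if the deadline is infeasible.
    """
    # best[t] = cheapest (cost, picks) for the processed prefix finishing at exactly time t
    best = {0: (0, [])}
    for choices in options:
        nxt = {}
        for t, (c, picks) in best.items():
            for k, (dt, dc) in enumerate(choices):
                nt = t + dt
                if nt > deadline:
                    continue  # prune: this prefix is already too slow
                if nt not in nxt or c + dc < nxt[nt][0]:
                    nxt[nt] = (c + dc, picks + [k])
        best = nxt
    return min(best.values(), key=lambda cp: cp[0]) if best else None
```

With three operations that may each use a fast FU (time 1, cost 10) or a slow one (time 3, cost 2) and a deadline of 5, the cheapest feasible choice uses one slow and two fast units for a cost of 22.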

SESSION: Test and verification
Yinghua Min,
Alan J. Hu

Evaluation of the statistical delay quality model
Yasuo Sato,
Shuji Hamada,
Toshiyuki Maeda,
Atsuo Takatori,
Seiji Kajihara
Pages: 305-310
doi:10.1145/1120725.1120856

In this paper we introduce a quality model that reflects fabrication process quality, design delay margin, and test timing accuracy. The model provides a measure that can predict the level of chip defects that cause delay failures, including marginal delays. The model can therefore be used to generate test vectors that are effective in terms of both testing cost and chip quality. Experiments on ISCAS89 benchmarks and several large industrial designs illustrate various characteristics of our statistical delay quality model.
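The abstract names three ingredients: process quality, design delay margin, and test timing accuracy. One plausible way to combine them, shown purely as an illustrative assumption and not as the authors' definition, is to integrate a defect-size density between a design margin (smaller extra delays are harmless) and a test detection threshold (larger ones are caught), then sum over fault sites. The density, thresholds, and function names are hypothetical.

```python
import math


def escape_probability(density, t_mgn, t_det, steps=10_000):
    """Probability that a defect's extra delay exceeds the design margin t_mgn
    (so it can fail in the field) yet stays below the test threshold t_det
    (so the test misses it). Trapezoidal integration of the density."""
    if t_det <= t_mgn:
        return 0.0  # the test catches every defect that could fail in the field
    h = (t_det - t_mgn) / steps
    total = 0.5 * (density(t_mgn) + density(t_det))
    total += sum(density(t_mgn + k * h) for k in range(1, steps))
    return total * h


def expected_escapes(faults, density):
    """faults: list of (t_mgn, t_det) pairs, one per delay-fault site (hypothetical inputs)."""
    return sum(escape_probability(density, m, d) for m, d in faults)
```

With an exponential density λe^(-λs), the escape probability has the closed form e^(-λ·t_mgn) - e^(-λ·t_det), which the numerical integral reproduces closely.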

Fault tolerant nanoelectronic processor architectures
Wenjing Rao,
Alex Orailoglu,
Ramesh Karri
Pages: 311-316
doi:10.1145/1120725.1120857

In this paper we propose a fault-tolerant processor architecture and an associated fault-tolerant computation model for the nanoelectronic environment, which is characterized by high and time-varying fault rates. The proposed architecture not only guarantees the correctness of computation but is also flexible, dynamically trading off computation resources against performance. The core of the architecture is a decentralized instruction control unit, called the voter, that achieves both fault tolerance and maximum parallel execution of instructions by exploiting the abundant computational resources provided by nanotechnologies. Although the result of each instruction must be confirmed by executing it on multiple computation units, multiple unconfirmed instructions can proceed as speculative branches. The voter implements a hardware-frugal computation unit allocation algorithm that organizes the redundant computations and dynamically controls the growth of speculative branches.
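The voter confirms each instruction only after several computation units agree on its result. A minimal quorum-based sketch of that confirmation step, using a hypothetical `Voter` class and `quorum` parameter; the speculative execution of unconfirmed instructions and the unit-allocation algorithm described in the abstract are not modelled.

```python
from collections import Counter


class Voter:
    """Confirms an instruction's result once `quorum` redundant units agree (hypothetical model)."""

    def __init__(self, quorum):
        self.quorum = quorum
        self.results = {}  # instruction id -> Counter of reported values

    def report(self, instr_id, value):
        """Record one unit's result; return the confirmed value, or None if not yet confirmed."""
        votes = self.results.setdefault(instr_id, Counter())
        votes[value] += 1
        if votes[value] >= self.quorum:
            return value
        return None
```

With a quorum of 2, a single faulty report (41 instead of 42) delays confirmation until a second unit reproduces 42.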

An efficient control-oriented coverage metric
Shireesh Verma,
Kiran Ramineni,
Ian G. Harris
Pages: 317-322
doi:10.1145/1120725.1120858

Coverage metrics, which evaluate the ability of a test sequence to detect design faults, are essential to the validation process. A key difficulty in determining fault detection is that the control flow path traversed in the presence of a fault cannot be determined. Fault detection can only be determined accurately by exploring the set of all control flow paths that may be traversed as a result of a fault. We present a coverage metric that determines the propagation of fault effects along all possible faulty control flow paths. The complexity of exploring multiple control flow paths is greatly reduced by the heuristic algorithm we present for pruning infeasible control flow paths. The proposed coverage metric provides high accuracy in designs that contain complex control flow, and the results obtained are promising.

An observability measure to enhance statement coverage metric for proper evaluation of verification completeness
Tai-Ying Jiang,
Chien-Nan Jimmy Liu,
Jing-Yang Jou
Pages: 323-326
doi:10.1145/1120725.1120859

Simulation-based validation is still the primary workhorse for the verification problem of getting the initial HDL description correct, especially for large-scale designs. However, most existing code coverage metrics do not address the observability issue [2]. We therefore provide additional observability measures for the statement coverage metric, enabling a more realistic evaluation of verification completeness for an HDL design. Compared with OCCOM [1,2,3], our approach estimates the probabilistic likelihood of propagating erroneous effects without unreasonable assumptions and always provides a lower-bound estimate.

Tightly integrate dynamic verification with formal verification: a GSTE based approach
Jin Yang,
Avi Puder
Pages: 327-330
doi:10.1145/1120725.1120860

GSTE (Generalized Symbolic Trajectory Evaluation) is a high-capacity formal verification technology that has been successfully applied to complex Intel designs with tens of thousands of state elements. In this paper, we extend the use of GSTE by developing a dynamic checker that verifies a GSTE specification against a scalar simulation trace. Unlike previous approaches, both the formal checker and the dynamic checker work directly on a GSTE specification, without the need for an intermediate monitor circuit. Our approach also offers a straightforward way to measure the quality (coverage) of a specification. The dynamic checker has been used in real-life microprocessor design verification.
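GSTE specifications are assertion graphs whose edges carry an antecedent (what the environment does) and a consequent (what the design must do). A much simplified dynamic check of one scalar trace against such a graph can be sketched as below; the edge-list encoding, predicate style, and `check_trace` name are hypothetical, and real GSTE semantics (for example, symbolic variables and fairness conditions) are richer.

```python
def check_trace(edges, initial, trace):
    """Check a scalar simulation trace against a simplified assertion graph.

    edges: list of (src, dst, antecedent, consequent); each predicate takes one state dict.
    Along every graph path whose antecedents hold on the trace, consequents must hold.
    Returns the index of the first violating step, or None if the trace conforms.
    """
    active = {initial}
    for step, state in enumerate(trace):
        nxt = set()
        for src, dst, ante, cons in edges:
            if src in active and ante(state):
                if not cons(state):
                    return step
                nxt.add(dst)
        active = nxt
        if not active:
            break  # no path of the specification matches the trace any more
    return None
```

For a request/acknowledge property, the graph can use a self-loop on `init` (so the property may start at any step), an edge `init -> wait` whose antecedent is `req == 1`, and an edge `wait -> done` whose consequent is `ack == 1`; a trace where `ack` never follows `req` then fails at the step after the request.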

SESSION: Special session

Panel I: who is responsible for the design for manufacturability issues in the era of nano-technologies?
C. K. Cheng,
Steve Lin,
Andrew Kahng,
Keh-Jeng Chang,
Vijay Pitchumani,
Toshiyuki Shibuya,
Roberto Suaya,
Zhiping Yu,
Fook-Luen Heng,
Don MacMillen
Pages: 1-1
doi:10.1145/1120725.1120862

The notion of design for manufacturability is blurring the separation between the tasks of design and manufacturing. In the era of nano-technologies, the description of design rules has reverted to an early-stage form consisting of many conditional cases, and has even become something of an art. It is therefore important to establish metrics for manufacturability. However, who is going to be held accountable for the final outcome? Should the designer, the EDA developer, the manufacturing engineer, or a new breed of experts take the lead in tackling the problem?

SESSION: Placement techniques
Xianlong Hong,
Ting-Chi Wang

On structure and suboptimality in placement
Satoshi Ono,
Patrick H. Madden
Pages: 331-336
doi:10.1145/1120725.1120864

Regular structures are present in many types of circuits. If this structure can be identified and exploited, performance can be improved dramatically. In this paper, we present a novel placement approach that successfully identifies regularity and obtains placements superior to those of other "general purpose" methods. This method has been integrated into our Feng Shui 2.6 bisection-based placement tool.

In experiments with the PEKO benchmarks, our results are within 32% of optimal for both the large and small suites. The largest example, with 2.1 million cells, can be completed in sixteen hours. Most of the run time is spent in detail placement; global placement takes under three hours. The success of our method shows that it can find structure even when the structure was not expected or intended.

As part of this work, we have made a number of observations about the nature of suboptimality in placement. These observations show that some neglected research areas have great potential, while problems that receive considerable attention are essentially adequately solved.

Optimal placement by branch-and-price
Pradeep Ramachandaran,
Ameya R. Agnihotri,
Satoshi Ono,
Purushothaman Damodaran,
Krishnaswami Srihari,
Patrick H. Madden
Pages: 337-342
doi:10.1145/1120725.1120865

Circuit placement has a large impact on all aspects of performance; speed, power consumption, reliability, and cost are all affected by the physical locations of interconnected transistors. The placement problem is NP-complete even for simple metrics.

In this paper, we apply techniques developed by the Operations Research (OR) community to the placement problem. Using an Integer Programming (IP) formulation and a "branch-and-price" approach, we are able to optimally solve placement problems an order of magnitude larger than those addressed by traditional methods. Our results show that suboptimality is rampant on the small scale, and that there is merit in increasing the size of the optimization windows used in detail placement.
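What "optimal" means for a small detail-placement window can be made concrete with exhaustive enumeration. The sketch below is a deliberately naive stand-in, not branch-and-price and not an IP formulation: it places cells into consecutive slots of a single row and minimizes half-perimeter wirelength. The names and the 1-D model are assumptions.

```python
from itertools import permutations


def hpwl(pos, nets):
    """1-D half-perimeter wirelength: span of each net's cell positions, summed."""
    return sum(max(pos[c] for c in net) - min(pos[c] for c in net) for net in nets)


def optimal_window(cells, nets):
    """Exact optimum for a tiny window by trying every slot order.

    The cost grows as O(n!) in the number of cells, which is why smarter exact
    methods are needed before optimal windows can be made larger.
    """
    best = None
    for order in permutations(cells):
        pos = {c: slot for slot, c in enumerate(order)}
        cost = hpwl(pos, nets)
        if best is None or cost < best[0]:
            best = (cost, order)
    return best
```

For a chain of nets a-b, b-c, c-d, the order a, c, b, d costs 5, whereas the optimum found by enumeration is 3.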

Detailed placement for improved depth of focus and CD control
Puneet Gupta,
Andrew B. Kahng,
Chul-Hong Park
Pages: 343-348
doi:10.1145/1120725.1120866

Sub-resolution assist features (SRAFs) are an essential technique for critical dimension (CD) control and process window enhancement in subwavelength lithography. However, as focus levels change during manufacturing, CDs at a given "legal" pitch can fail to achieve the manufacturing tolerances required for adequate yield. Furthermore, the adoption of off-axis illumination (OAI) and SRAF techniques to enhance resolution at minimum pitch worsens the printability of patterns at other pitches. This paper describes a novel dynamic programming-based technique for Assist-Feature Correctness (AFCorr) in detailed placement of standard-cell designs. For benchmark designs in 130 nm and 90 nm technologies, AFCorr achieves improved depth of focus and substantially better CD control with negligible timing, area, or CPU overhead. The advantages of AFCorr are expected to increase in future technology nodes.
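AFCorr is described only as a dynamic program applied during detailed placement. A hypothetical, much simplified version of that idea: cells in a row keep their order but may shift right by a few sites, a penalty is charged whenever the gap between neighbouring cells equals a "forbidden" spacing (standing in for pitches that print poorly), and displacement is also charged. The gap model, penalty weight, and names are assumptions, not the paper's cost function.

```python
def shift_row_dp(widths, orig, row_len, forbidden_gaps, max_shift, penalty=100):
    """Choose x positions (in sites) for cells in a fixed left-to-right order.

    Cell i may move from orig[i] to orig[i] + max_shift. Cost = total displacement
    plus `penalty` for every neighbouring gap whose width is in forbidden_gaps.
    Returns (cost, positions), or (inf, []) if no legal placement exists.
    """
    n = len(widths)
    best = [dict() for _ in range(n)]  # best[i][x] = (cost so far, x of cell i-1)
    for x in range(orig[0], orig[0] + max_shift + 1):
        if x + widths[0] <= row_len:
            best[0][x] = (x - orig[0], None)
    for i in range(1, n):
        for x in range(orig[i], orig[i] + max_shift + 1):
            if x + widths[i] > row_len:
                continue
            cands = []
            for px, (c, _) in best[i - 1].items():
                gap = x - (px + widths[i - 1])
                if gap < 0:
                    continue  # cells would overlap
                pen = penalty if gap in forbidden_gaps else 0
                cands.append((c + pen + (x - orig[i]), px))
            if cands:
                best[i][x] = min(cands)
    if not best[-1]:
        return float("inf"), []
    x = min(best[-1], key=lambda k: best[-1][k][0])
    cost = best[-1][x][0]
    positions = [x]
    for i in range(n - 1, 0, -1):  # backtrack through stored predecessors
        x = best[i][x][1]
        positions.append(x)
    return cost, positions[::-1]
```

Two 2-site cells at sites 0 and 3 leave a forbidden 1-site gap; the program resolves it by moving one cell by a single site, for a cost of 1 instead of the penalty of 100.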

Floorplan management: incremental placement for gate sizing and buffer insertion
Chen Li,
Cheng-Kok Koh,
Patrick H. Madden
Pages: 349-354
doi:10.1145/1120725.1120867

Incremental physical design is an important methodology for achieving design closure in high-performance, large-scale circuits. Placement tools must accommodate incremental changes to the layout and netlist caused by physical synthesis techniques without perturbing the original metrics. We present an incremental placement approach that uses floorplan sizing to manage the resources and demands of the whole chip region in order to accommodate changes due to gate sizing and buffer insertion. Experimental results show that this approach can accommodate a wide range of incremental changes without loss in wirelength or routability. Most importantly, it also maintains the stability of a placement, so that the convergence of physical synthesis iterations can be greatly enhanced.

SESSION: Security processor design
Lorena Anghel,
Steve Lin

Low-power techniques for network security processors
Yi-Ping You,
Chun-Yen Tseng,
Yu-Hui Huang,
Po-Chiun Huang,
TingTing Hwang,
Sheng-Yu Hsu
Pages: 355-360
doi:10.1145/1120725.1120869

In this paper, we present several low-power design techniques for network security processors, including a descriptor-based low-power scheduling algorithm, the design of a dynamic voltage generator, and dual-threshold-voltage assignment. The experiments show that the proposed methods and designs give network security processors the opportunity to achieve both high performance and low power.

A configurable AES processor for enhanced security
Chih-Pin Su,
Chia-Lung Horng,
Chih-Tsun Huang,
Cheng-Wen Wu
Pages: 361-366
doi:10.1145/1120725.1120870

We propose a configurable AES processor for extended-security communication. The proposed architecture can provide up to 219 different AES block cipher schemes at a reasonable hardware cost. Data can be encrypted not only with secret keys and initial vectors, but also by different block ciphers during the communication. A novel on-the-fly key expansion design is also proposed for 128-, 192-, and 256-bit keys. Our unified hardware can run both the original AES algorithm and the extended AES algorithm. The proposed processor has been fabricated in a 0.25 μm CMOS process, with a silicon area of 6.93 mm² (about 200.5K equivalent gates). Under a 66 MHz clock, the throughput for both the ECB and CBC operation modes is 844.8 Mbps, 704 Mbps, and 603.4 Mbps for 128-bit, 192-bit, and 256-bit keys, respectively.
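The reported throughputs are consistent with completing one 128-bit block every Nr clock cycles, where Nr = 10, 12, and 14 are the AES round counts for 128-, 192-, and 256-bit keys; that is, throughput = 128 × f / Nr. This one-round-per-cycle reading is inferred from the numbers and is not stated in the abstract: 128 × 66 / 10 = 844.8, 128 × 66 / 12 = 704, and 128 × 66 / 14 ≈ 603.4. The helper below is a hypothetical check of that arithmetic.

```python
AES_ROUNDS = {128: 10, 192: 12, 256: 14}  # Nr for each AES key size (FIPS-197)


def aes_throughput_mbps(clock_mhz, key_bits, block_bits=128):
    """Throughput if one block finishes every Nr cycles (one round per cycle, assumed)."""
    return block_bits * clock_mhz / AES_ROUNDS[key_bits]
```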

Power estimation strategies for a low-power security processor
Yen-Fong Lee,
Shi-Yu Huang,
Sheng-Yu Hsu,
I-Ling Chen,
Cheng-Tao Shieh,
Jian-Cheng Lin,
Shih-Chieh Chang

Pages: 367-371
doi>10.1145/1120725.1120871

In this paper, we present the power estimation methodologies used in developing a low-power security processor that contains a significant amount of logic and memory. For the logic part, we present a highly accurate tool called PowerMixer. This tool refines the so-called mixed-level methodology, which combines the accuracy of fast SPICE simulation with the speed of gate-level simulation. A grouping scheme is proposed to improve accuracy for design blocks as large as 100K gates. For the memory part, we investigate the power-consuming behavior of memories and point out potential problems associated with the current commercial design flow. These tools, together with a previously published static peak power estimation method [4], provide a practical evaluation platform for the power optimization and verification of our security processor.

Design and test of a scalable security processor
Chih-Pin Su,
Chen-Hsing Wang,
Kuo-Liang Cheng,
Chih-Tsun Huang,
Cheng-Wen Wu

Pages: 372-375
doi>10.1145/1120725.1120872

This paper presents a security processor that accelerates cryptographic processing in modern security applications. It supports popular cryptographic functions such as RSA, AES, hashing, and random number generation. With the proposed Crypto-DMA controller, data gathering and scattering become flexible for security processing, using a simple descriptor-based programming model. The architecture and its core-based platform are scalable and configurable to accommodate variations in performance, cost, and power consumption; different numbers of data channels and crypto-engines can be used to meet the specifications. In addition, a DFT platform is implemented for design-test integration. The security processor has been fabricated in 0.18 μm CMOS technology. The core area is 3.899 mm × 2.296 mm (approximately 525K gates), and the operating clock rate is 83 MHz.

System-level design space exploration for security processor prototyping in analytical approaches
Yung Chia Lin,
Chung Wen Huang,
Jenq Kuen Lee

Pages: 376-380
doi>10.1145/1120725.1120873

The customization of architectures in designing security processor-based systems typically involves time-consuming simulation and sophisticated analysis to explore the design space. In this paper, we present an analytical modeling strategy for synoptically exploring candidate architectures of security processor-based systems. We demonstrate how our analytical models can be employed in design space exploration of embedded security systems to deal with scalability issues and architectural constraints. Experiments against cycle-accurate simulation show the applicability of analytical modeling: the average prediction error is less than 10%, while the speedup is several orders of magnitude.

SESSION: (Special session) embedded tutorial II
Lei He

Leakage power: trends, analysis and avoidance
David Blaauw,
Anirudh Devgan,
Farid Najm

Pages: 1-1
doi>10.1145/1120725.1120875

Leakage power is emerging as a key challenge in IC design. Leakage is increasing exponentially with each technology generation and is expected to become the dominant part of total power. Device threshold voltage scaling, shrinking device dimensions, and larger circuit sizes are causing this dramatic increase in leakage. Because leakage varies exponentially with process parameters, chip yield is often directly influenced by leakage. The growing amount of leakage is also critical for power-constrained ICs. Traditionally, leakage has been considered an important design variable in handheld devices and in standby circuit operation. However, its significant increase now warrants treating it as a key design variable in all IC designs.

This tutorial presents a comprehensive review of leakage power issues in IC design. It is organized in four major parts. The first part provides an overview of the technology and scaling trends that are causing the significant increase in leakage current. The device physics that leads to sub-threshold and gate leakage will be described, along with their dependence on circuit design variables. This part will also cover basic transistor and circuit techniques to minimize leakage, such as the stack effect.

The second part focuses on circuit-level leakage estimation and avoidance. The use of multiple threshold voltages has been very successful in controlling circuit leakage. A comprehensive description of multiple-Vt techniques for leakage avoidance will be presented, along with associated leakage estimation techniques. Multiple-threshold CMOS (MTCMOS) will be described along with its leakage benefits and performance trade-offs. Multiple oxide technology options and their impact on gate leakage will also be discussed.

The third part focuses on chip-level effects on leakage. Leakage depends heavily on local and global process variations and can vary by an order of magnitude over the technology spread. Leakage estimation techniques that consider both inter-die and intra-die process variations will be covered. This part also addresses chip-level leakage minimization techniques such as Adaptive Body Bias (ABB) and power supply control.

The last part covers system and circuit architectures for leakage avoidance. In standby mode, circuit leakage can be lowered by putting the circuit in a low-leakage state. Caches and memory circuits occupy a large percentage of the area in modern chips, so their leakage needs to be carefully controlled. This part will cover topics including state assignment for leakage minimization and leakage-driven memory and cache circuits and architectures.

The tutorial is intended for designers and CAD engineers interested in next-generation design techniques and methodologies and emerging power challenges. A basic background in VLSI and CAD is useful but not required.
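The exponential sensitivity of leakage to threshold voltage can be made concrete with the standard first-order subthreshold model, I_sub ∝ exp((V_GS − V_T) / (n·kT/q)). A minimal sketch, assuming an illustrative slope factor n = 1.5 and a temperature of 300 K (neither value comes from the tutorial): with them the subthreshold swing is about 89 mV/decade, so lowering V_T by 100 mV raises subthreshold leakage roughly 13-fold.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C


def thermal_voltage(temp_k: float) -> float:
    """kT/q in volts."""
    return K_B * temp_k / Q_E


def leakage_ratio(delta_vt: float, n: float = 1.5, temp_k: float = 300.0) -> float:
    """Factor by which subthreshold leakage grows when V_T drops by delta_vt volts.

    First-order model I_sub ~ exp((V_GS - V_T) / (n * kT/q));
    n = 1.5 is an assumed, illustrative slope factor.
    """
    return math.exp(delta_vt / (n * thermal_voltage(temp_k)))


def subthreshold_swing_mv(n: float = 1.5, temp_k: float = 300.0) -> float:
    """Gate-voltage change (mV) that changes subthreshold current by 10x."""
    return 1e3 * n * thermal_voltage(temp_k) * math.log(10)


print(f"swing: {subthreshold_swing_mv():.1f} mV/decade")
print(f"100 mV lower V_T -> {leakage_ratio(0.100):.1f}x subthreshold leakage")
```

This captures only the V_T term; the tutorial's other drivers (gate leakage, device dimensions, process variation) are not modeled here.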

SESSION: (Special session) CAD for microarchitecture designs
Hannah Honghua Yang

Challenges to covering the high-level to silicon gap
Bill Grundmann

Pages: 1-1
doi>10.1145/1120725.1120877

Silicon architects have a difficult task. They must translate a high-level product desire into a lower-level description for silicon implementation, balancing their own creativity, the project schedule, the solution cost, and the degree of difficulty of implementation. Now they also have to worry about power dissipation and physical space realities.

Opportunities and challenges for better than worst-case design
Todd Austin,
Valeria Bertacco,
David Blaauw,
Trevor Mudge

Pages: 2-7
doi>10.1145/1120725.1120878

The progressive trend of fabrication technologies toward the nanometer regime has created a number of new physical design challenges for computer architects. Design complexity, uncertainty in environmental and fabrication conditions, and single-event upsets all conspire to compromise system correctness and reliability. Recently, researchers have begun to advocate a new design strategy called Better Than Worst-Case design that couples a complex core component with a simple, reliable checker mechanism. By delegating responsibility for the correctness and reliability of the design to the checker, it becomes possible to build provably correct designs that effectively address the challenges of deep submicron design. In this paper, we present the concepts of Better Than Worst-Case design and highlight two exemplary designs: the DIVA checker and Razor logic. We show how this approach to system implementation relaxes design constraints on core components, which reduces the effects of physical design challenges and creates opportunities to optimize performance and power characteristics. We demonstrate the advantages of relaxed design constraints for the core components by applying typical-case optimization (TCO) techniques to an adder circuit. Finally, we discuss the challenges and opportunities posed to CAD tools in the context of Better Than Worst-Case design. In particular, we describe the additional support required for analyzing the run-time characteristics of designs and the many opportunities created to incorporate typical-case optimizations into synthesis and verification.

Microarchitecture evaluation with floorplanning and interconnect pipelining
Ashok Jagannathan,
Hannah Honghua Yang,
Kris Konigsfeld,
Dan Milliron,
Mosur Mohan,
Michail Romesis,
Glenn Reinman,
Jason Cong

Pages: 8-15
doi>10.1145/1120725.1120879

As microprocessor technology continues to scale into the nanometer regime, recent studies show that interconnect delay will be a limiting factor for performance, and multiple cycles will be necessary to communicate global signals across the chip. Thus, longer interconnects need to be pipelined, and the impact of the extra latency along wires needs to be considered during early microarchitecture design exploration. In this paper, we address this problem and make the following contributions: (1) a floorplan-driven microarchitecture evaluation methodology that considers interconnect pipelining at a given target frequency by selectively optimizing architecture-level critical paths; (2) the use of microarchitecture performance sensitivity models to weight microarchitectural critical paths during floorplanning and optimize them for higher performance; and (3) a methodology to study the impact of frequency scaling on microarchitecture performance with consideration of interconnect pipelining.

For a sample microarchitecture design space, we show that considering interconnect pipelining can increase the estimated performance by 25% to 45% relative to a no-wire-pipelining approach. We also demonstrate the value of the methodology in exploring the target frequency of the processor.
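The bookkeeping behind interconnect pipelining is deciding how many clock cycles a global wire needs at a given target frequency. A minimal sketch, assuming a hypothetical repeated-wire delay of 100 ps/mm and ignoring flip-flop overhead unless one is supplied; it is not the paper's floorplan-driven methodology, only the latency count such a methodology feeds into microarchitecture evaluation. It shows why raising the target frequency can add cycles of wire latency.

```python
import math


def wire_latency_cycles(length_mm: float, delay_ps_per_mm: float,
                        clock_ps: float, flop_overhead_ps: float = 0.0) -> int:
    """Cycles needed to send a signal over a (pipelined) global wire.

    A wire that fits in one usable clock period costs 1 cycle; a longer
    wire is split by pipeline flip-flops, each stage adding one cycle.
    """
    usable_ps = clock_ps - flop_overhead_ps
    if usable_ps <= 0:
        raise ValueError("clock period must exceed flip-flop overhead")
    return max(1, math.ceil(length_mm * delay_ps_per_mm / usable_ps))


# Hypothetical 12 mm global wire at an assumed 100 ps/mm.
for freq_ghz in (2.0, 3.0, 4.0):
    period_ps = 1000.0 / freq_ghz
    cycles = wire_latency_cycles(12.0, 100.0, period_ps)
    # cycles - 1 intermediate registers (source/destination flops not counted)
    print(f"{freq_ghz:.0f} GHz: {cycles} cycles ({cycles - 1} pipeline flops)")
```

Under these assumptions the same wire costs 3, 4 and 5 cycles at 2, 3 and 4 GHz, which is the kind of frequency/latency trade-off the paper's third contribution studies.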

SESSION: University design contest
Xiaoyang Zeng,
Makoto Ikeda,
Lin Yang

TERPS: the embedded reliable processing system
Hongxia Wang,
Samuel Rodriguez,
Cagdas Dirik,
Amol Gole,
Vincent Chan,
Bruce Jacob

Pages: 1-2
doi>10.1145/1120725.1120886

TERPS is a fault-tolerant computer design that significantly reduces the threat of electromagnetic interference (EMI) by using hardware checkpoint/rollback recovery. TERPS tolerates EMI by periodically checkpointing processor state into a special safe-storage device. Detection of EMI invokes rollback, which restores processor state from a previously checkpointed state and resumes normal execution. Rollback results in a loss of performance determined by the EMI duration; TERPS ensures forward progress of the system provided that EMI events are separated by some minimum time interval (e.g., at least 5.12 μs for our prototype processor running at 100 MHz). The performance overhead of the mechanism is reasonable: 5-6% when checkpointing every 128 processor cycles.
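The checkpoint/rollback cycle can be illustrated with a toy model. A minimal sketch under deliberately simplified assumptions that are not from the TERPS paper: the whole processor state is a single integer that grows by one unit of work per cycle, a checkpoint is taken every `interval` cycles, every EMI-active cycle forces a rollback to the last checkpoint, and checkpoint cost is ignored. The function name and parameters are hypothetical.

```python
from typing import Set, Tuple


def run_with_checkpoints(total_cycles: int, interval: int,
                         emi_cycles: Set[int]) -> Tuple[int, int]:
    """Toy checkpoint/rollback model; returns (final_state, rollbacks)."""
    state = 0          # stand-in for the whole processor state
    safe_state = 0     # stand-in for the safe-storage device contents
    rollbacks = 0
    for cycle in range(total_cycles):
        if cycle in emi_cycles:
            state = safe_state        # EMI detected: restore last checkpoint
            rollbacks += 1
            continue                  # no useful work or checkpoint this cycle
        state += 1                    # one unit of useful work
        if (cycle + 1) % interval == 0:
            safe_state = state        # periodic checkpoint
    return state, rollbacks


print(run_with_checkpoints(1000, 128, set()))                  # (1000, 0)
print(run_with_checkpoints(1000, 128, set(range(300, 310))))   # (946, 10)
```

In the burst example, the 44 cycles of work done since the checkpoint at cycle 256 plus the 10 EMI cycles are lost; in this model the loss from one event is bounded by the checkpoint interval plus the EMI duration.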

AMDREL: a novel low-energy FPGA architecture and supporting CAD tool design flow
D. Soudris,
S. Nikolaidis,
S. Siskos,
K. Tatas,
K. Siozios,
G. Koutroumpezis,
N. Vasiliadis,
V. Kalenteridis,
H. Pournara,
I. Pappas,
A. Thanailakis

Pages: 3-4
doi>10.1145/1120725.1120887

The design of a novel embedded FPGA reconfigurable hardware architecture is introduced. Because power consumption is considered a primary concern, the architecture features a number of circuit-level low-power techniques. Additionally, a complete set of tools is presented that facilitates implementing applications on the proposed FPGA, starting from an RTL description and producing the actual configuration bitstream. The full-custom FPGA is under fabrication in 0.18 μm STM CMOS technology. The prototype supports partial and dynamic reconfiguration. The efficiency of the entire system (FPGA and tools) was demonstrated through comparisons with commercial systems.

Standard CMOS technology on-chip inductors with pn junctions substrate isolation
Hongyan Jian,
Zhangwen Tang,
Jie He,
Jinglan He,
Min Hao

Pages: 5-6
doi>10.1145/1120725.1120888

New substrate isolation structures using patterned stacked pn junctions for on-chip inductors in standard CMOS technology are presented. For the first time, by increasing the reverse bias voltage on the pn junctions, the reduction in substrate eddy-current loss due to pn-junction substrate isolation is reliably validated, and the maximum quality factor is improved by 19%. An inductor without a substrate shielding layer is compared with inductors using metal-1 patterned ground shielding, patterned n-well, n+ diffusion, and dual pn-junction isolation.

A bandwidth efficient subsampling-based block matching architecture for motion estimation
Hao-Yun Chin,
Chao-Chung Cheng,
Yu-Kun Lin,
Tian-Sheuan Chang

Pages: 7-8
doi>10.1145/1120725.1120889

We have developed new pel-subsampling-based search hardware for motion estimation, called quartet-pel motion estimation (QME). Memory accesses to the search-range memory can be reduced to 25%, and the computational complexity can also be reduced to 25% of that of the full-search block matching algorithm (FBMA). A flexible and efficient hardware architecture is also implemented; its flexibility comes from the configurable processing unit and an adjustable candidate number. In addition, complete verification and testing methods are considered.
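Pel subsampling reduces work because the matching cost touches fewer pels per candidate. A minimal sketch, assuming (hypothetically; the abstract gives neither QME's actual subsampling pattern nor its candidate-refinement step) that one pel is kept from every 2×2 group, which visits exactly 25% of the pels that a full-search SAD visits. The frames are synthetic, with a known motion vector of (2, 1).

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]


def sad(cur: Frame, ref: Frame, bx: int, by: int, dx: int, dy: int,
        n: int, step: int = 1) -> int:
    """SAD of the n x n block at (bx, by) in `cur` against `ref` shifted by
    (dx, dy). step=2 keeps one pel per 2x2 group (25% of the pels)."""
    total = 0
    for y in range(0, n, step):
        for x in range(0, n, step):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def block_search(cur: Frame, ref: Frame, bx: int, by: int, n: int, r: int,
                 step: int = 1) -> Optional[Tuple[int, int, int]]:
    """Exhaustive search over a +/-r window; returns (cost, dx, dy)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if not (0 <= by + dy and by + dy + n <= h
                    and 0 <= bx + dx and bx + dx + n <= w):
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n, step)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


def pels_per_candidate(n: int, step: int) -> int:
    """Number of pels compared per candidate block."""
    return len(range(0, n, step)) ** 2


W = H = 32
ref = [[(7 * x + 13 * y + x * y) % 256 for x in range(W)] for y in range(H)]
# Current frame is the reference moved so the true motion vector is (2, 1).
cur = [[ref[(y + 1) % H][(x + 2) % W] for x in range(W)] for y in range(H)]

print(block_search(cur, ref, 8, 8, 8, 4, step=1))          # full search
print(block_search(cur, ref, 8, 8, 8, 4, step=2))          # 1-of-4 subsampled
print(pels_per_candidate(8, 1), pels_per_candidate(8, 2))  # 64 16
```

On this synthetic frame both searches find (2, 1); on real video a subsampled cost can select a different vector than full search, which is presumably why QME keeps an adjustable number of candidates.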

Design and measurement of 6.4 Gbps 8:1 multiplexer in 0.18μm CMOS process
Akinori Shinmyo,
Masanori Hashimoto,
Hidetoshi Onodera

Pages: 9-10
doi>10.1145/1120725.1120890

We develop and measure an 8:1 multiplexer in a 0.18 μm CMOS process. Based on a prior detailed performance evaluation of both static CMOS and current-mode logic circuits, we design a hybrid multiplexer structure. The fabricated chip operates at up to 6.4 Gbps with a power consumption of 84 mW.
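The headline figures translate into per-stage data rates and energy per bit with simple arithmetic. A minimal sketch, assuming the 8:1 function is organized as a binary tree of three 2:1 stages; this is a common organization, but the abstract does not describe the tree structure or say which stages use static CMOS and which use current-mode logic.

```python
import math

OUTPUT_GBPS = 6.4   # reported maximum output data rate
MUX_RATIO = 8       # 8:1 multiplexer
POWER_MW = 84.0     # reported power consumption


def stage_output_rates(output_gbps: float, ratio: int) -> list:
    """Output rate of each 2:1 stage in a binary mux tree, input side first."""
    stages = int(math.log2(ratio))
    if 2 ** stages != ratio:
        raise ValueError("a binary tree needs a power-of-two ratio")
    return [output_gbps / 2 ** (stages - 1 - k) for k in range(stages)]


def energy_per_bit_pj(power_mw: float, rate_gbps: float) -> float:
    """Energy per output bit: mW / Gbps = 1e-3 W / 1e9 b/s = pJ/bit."""
    return power_mw / rate_gbps


print("per-input rate:", OUTPUT_GBPS / MUX_RATIO, "Gbps")
print("stage output rates:", stage_output_rates(OUTPUT_GBPS, MUX_RATIO), "Gbps")
print(f"energy: {energy_per_bit_pj(POWER_MW, OUTPUT_GBPS):.3f} pJ/bit")
```

Each input therefore runs at 0.8 Gbps, the stages run at 1.6, 3.2 and 6.4 Gbps, and the reported power corresponds to about 13.1 pJ per output bit; only the last, fastest stages would need the speed of current-mode logic under this assumed organization.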

A design of high speed double precision floating point adder using macro modules
Chi Huang,
Xinyu Wu,
Jinmei Lai,
Chengshou Sun,
Gang Li

Pages: 11-12
doi>10.1145/1120725.1120891

Using the SMIC 0.18 μm 1.8 V six-layer-metal CMOS process, we implement a 64-bit high-speed pipelined floating-point adder that complies with the IEEE 754 standard. After critical-path analysis of the pipelined structure, we custom-design three macro modules to reduce the critical-path delay. After datapath-style placement and routing, we complete the layout of the floating-point adder. The chip area is 1.44 mm² and the clock frequency is 518 MHz.

A low-power video segmentation LSI with boundary-active-only architecture
Takashi Morimoto,
Osamu Kiriyama,
Hidekazu Adachi,
Zhaomin Zhu,
Tetsushi Koide,
Hans Jürgen Mattausch

Pages: 13-14
doi>10.1145/1120725.1120892

We designed a cell-network-based video segmentation test chip in 0.35 μm CMOS technology that includes a power reduction technique which activates only the boundary cells of the regions currently being grown. The effectiveness of the technique is confirmed by measurement results for a 41×33 cell network, with an average segmentation time of 23 μs and an average power dissipation of 45.8 mW at a 10 MHz clock frequency.

The design and implementation of a DVB receiving chip with PCI interface
Xu Ningyi,
Li Shaohua,
Yu Wei,
He Guanghui,
Zhang Hao,
Luo Fei,
Zhou Zucheng

Pages: 15-16
doi>10.1145/1120725.1120893

A DVB receiving chip with a PCI interface for PCs is presented. The chip supports DVB protocols and integrates useful interfaces, including I2C, SmartCard, and PCI. A card with this chip can turn a PC into a digital TV terminal. The architecture of the FPGA prototype system is introduced, together with some of the main design issues. Experimental results show that the chip accomplishes the required functions.

Design and implementation of an SDH high-speed switch
De_Hui Zhang,
Quan_Liang Zhao,
Jun-Gang Han

Pages: 17-18
doi>10.1145/1120725.1120894

In this short paper, we propose the design of an SDH high-speed switch that can switch 16×16 STM-16 streams at 2.488 Gbit/s. The design uses a novel fabric structure to perform non-blocking connection of STM-1 data in any timeslot of 16-bit parallel STM-16 data at 155.5MB/s. The prototype is implemented in Altera Stratix GX FPGA devices, and test results show that it meets all requirements.

Design of vehicle position tracking system using short message services and its implementation on FPGA
Arias Tanti Hapsari,
Eniman Y Syamsudin,
Imron Pramana

Pages: 19-20
doi>10.1145/1120725.1120895

This paper describes the design of a system that reports a vehicle's position whenever it is requested. The position is obtained from GPS and transmitted using the Short Message Service (SMS). The system is designed in VHDL with Altera MAX+plus II software and implemented on an Altera UP1X demo board based on an Altera FLEX 10K EPF10K70RC240-4 FPGA.

Design of a 2.4-GHz integrated frequency synthesizer
Fei Wang,
Jianyu Zhang,
Xuan Wang,
Jinmei Lai,
Chengshou Sun

Pages: 21-22
doi>10.1145/1120725.1120896

A PLL-based 2.4-GHz integrated frequency synthesizer in a 0.35-μm RF process is presented. A fully integrated cross-coupled LC VCO with low phase noise is implemented. A prescaler with phase switching is used to eliminate glitches. A charge pump with excellent current matching and a wide output voltage range is achieved. The synthesizer has a frequency tuning range of 2.28 to 2.75 GHz. Simulation results show that it dissipates less than 66 mW, its settling time is less than 100 μs, and its phase noise is -117 dBc/Hz at a 600 kHz offset.

An improved test access mechanism structure and optimization technique in system-on-chip
Feng Jianhua,
Long Jieyi,
Xu Wenhua,
Ye Hongfei

Pages: 23-24
doi>10.1145/1120725.1120897

This paper presents a new test access mechanism (TAM) architecture and optimization method based on an improved flexible-width test bus. The method first establishes a test-time lower bound that does not depend on the TAM architecture, and then constructs a bus assignment that brings the test time down to that lower bound. We present experimental results for our improved flexible-width test buses on four benchmark SOCs. The results show a significant reduction in test time, outperforming previously proposed traditional methods.

SESSION: (Special session) embedded tutorial III
Howard Chen,
Lei He

Designing reliable circuit in the presence of soft errors
Vijaykrishnan Narayanan,
Yuan Xie,
Mary Jane Irwin

Pages: 1-1
doi>10.1145/1120725.1120905

As technology scales, with ever-shrinking geometries and higher-density circuits, soft errors and reliability are becoming a challenging design criterion in complex chip design. Soft errors are caused by radiation, which directly or indirectly induces a localized ionization capable of upsetting internal circuit states. Although these errors can result in an upset event, the circuit itself is most often not damaged. Addressing soft errors is important for a broad range of companies, either because their systems incorporate many semiconductor devices that are prone to soft errors or because they design embedded memories, FPGAs, and microprocessors. This tutorial is targeted at researchers and industry practitioners who wish to gain background on the soft error problem, the techniques that exist to counter it, and the challenges that lie ahead.

SESSION: Design optimization for high-performance digital circuits
Eli Chiprout,
Zheng Shi

Fast and effective gate-sizing with multiple-Vt assignment using generalized Lagrangian Relaxation
Hsinwei Chou,
Yu-Hao Wang,
Charlie Chung-Ping Chen

Pages: 381-386
doi>10.1145/1120725.1120881

Simultaneous gate sizing and multiple-Vt assignment for delay and power optimization is a complicated task in modern custom designs. In this work, our key contribution is a novel gate-sizing and multi-Vt assignment technique based on generalized Lagrangian Relaxation. Experimental results show that our technique exhibits linear runtime and memory usage and can effectively tune circuits with over 15,000 variables and 8,000 constraints in under 8 minutes (250x faster than state-of-the-art optimization solvers).

Effective analytical delay model for transistor sizing
Zhaojun Wo,
Israel Koren

Pages: 387-392
doi>10.1145/1120725.1120882

This paper describes an analytical delay model for transistor sizing. Gate delay is computed by mapping each gate onto two selected primitives, which model the short-channel effect and body effect in deep submicron CMOS circuits. A mapping algorithm for arbitrary series-parallel structures is adopted. The delays of complex gates computed with these mappings are found to be within 10% of SPICE for most gates. The delay model is incorporated into a transistor sizing algorithm based on TILOS. Experimental results are also presented for several circuits from the LGSynth91 benchmark suite.

Achieving continuous VT performance in a dual VT process
Kanak Agarwal,
Dennis Sylvester,
David Blaauw,
Anirudh Devgan

Pages: 393-398
doi>10.1145/1120725.1120883

In this paper, we present a novel approach to obtain any desired intermediate threshold voltage in a dual-VT process. The intermediate threshold voltages are achieved by combining low and high threshold voltages in a single device. We show that this combination can be easily implemented in layouts with negligible design and manufacturing overhead. Our results show that the power-delay characteristics of the achieved intermediate thresholds closely match the ideal (but impractical) scenario in which all intermediate thresholds are available in the technology.

Runtime leakage minimization through probability-aware dual-Vt or dual-tox assignment
Dongwoo Lee,
David Blaauw,
Dennis Sylvester

Pages: 399-404
doi>10.1145/1120725.1120884

With process scaling, runtime leakage current (leakage while the circuit is operating) has become a major concern in addition to traditional standby-mode leakage. In this paper, we propose a new leakage reduction method that specifically targets runtime leakage current. We first observe that the state probabilities of nodes in a circuit tend to be skewed, meaning that they are either high or low rather than evenly balanced. We then propose a method that exploits these skewed state probabilities by setting to high-Vt (thick-oxide) only those transistors that have a high likelihood of being OFF (ON) and hence contribute significantly to total runtime leakage. Accordingly, we also propose a library specifically tailored to this approach, which provides Vt and Tox assignments with favorable trade-offs under skewed input probabilities. The optimization algorithm performs simultaneous sizing and Vt and Tox assignment, and shows substantial leakage improvement over probability-unaware optimization.
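The selection rule can be illustrated for a single inverter whose input has a known signal probability. A minimal sketch, assuming independent gate inputs (a common approximation when propagating signal probabilities) and a hypothetical 0.8 threshold; the function names are illustrative. An NMOS is OFF when its input is 0 and a PMOS when its input is 1, so a skewed input leaves one transistor mostly OFF (a high-Vt candidate, against subthreshold leakage) and the other mostly ON (a thick-oxide candidate, against gate leakage). It does not model the paper's tailored library or its simultaneous sizing, Vt and Tox optimization.

```python
from typing import Dict


def and_prob(*input_probs: float) -> float:
    """P(output = 1) of an AND gate, assuming independent inputs."""
    p = 1.0
    for q in input_probs:
        p *= q
    return p


def inverter_assignment(p_in: float,
                        threshold: float = 0.8) -> Dict[str, Dict[str, str]]:
    """Suggest Vt / Tox per transistor of an inverter with P(input = 1) = p_in.

    High-Vt for a transistor that is likely OFF (cuts subthreshold leakage);
    thick oxide for one that is likely ON (cuts gate leakage).
    """
    p_off = {"nmos": 1.0 - p_in, "pmos": p_in}   # NMOS off at 0, PMOS off at 1
    result = {}
    for device, off in p_off.items():
        on = 1.0 - off
        result[device] = {
            "vt": "high" if off >= threshold else "low",
            "tox": "thick" if on >= threshold else "thin",
        }
    return result


# A 4-input AND of independent, unbiased inputs is 1 only 1/16 of the time,
# so an inverter it drives sees a heavily skewed input.
p = and_prob(0.5, 0.5, 0.5, 0.5)
print(p, inverter_assignment(p))
```

For that input (P = 0.0625) the NMOS is OFF 93.75% of the time and becomes the high-Vt candidate, while the PMOS is ON just as often and becomes the thick-oxide candidate; an unbiased input (P = 0.5) triggers neither.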

SESSION: Floorplanning and partitioning
Yao-Wen Chang,
Yoji Kajitani

Floorplanning for 3-D VLSI design
Lei Cheng,
Liang Deng,
Martin D. F. Wong

Pages: 405-411
doi>10.1145/1120725.1120899

In this paper, we present a floorplanning algorithm for 3-D ICs. The problem can be formulated as packing a given set of 3-D rectangular blocks while minimizing a suitable cost function. Our algorithm is based on a generalization of classical 2-D slicing floorplans to 3-D slicing floorplans. A new encoding scheme for slicing floorplans (2-D/3-D) and its associated set of moves form the basis of the new simulated-annealing-based algorithm. The best-known algorithm for packing 3-D rectangular blocks is based on simulated annealing using the sequence-triple floorplan representation. Experimental results show that our algorithm produces packings on average 3% better than the sequence-triple-based algorithm under the same annealing parameters, and runs much faster (17 times faster for problems containing 100 blocks). Moreover, our algorithm can be extended to consider various types of placement constraints and thermal distribution, whereas the existing sequence-triple-based algorithm does not have such capabilities. Finally, when specialized to 2-D problems, our algorithm is a new 2-D slicing floorplanning algorithm. We are excited to report the surprising result that our new 2-D floorplanner has produced slicing floorplans for the two largest MCNC benchmarks, ami33 and ami49, with the smallest areas ever reported in the literature (among all slicing and non-slicing floorplanning algorithms).
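Slicing floorplans are conventionally encoded as postfix (Polish) expressions in which operands are blocks and operators are cuts. A minimal sketch of evaluating such an expression in 3-D, assuming fixed block dimensions and using hypothetical operators X, Y and Z that mean "place the two sub-floorplans side by side along that axis"; the paper's own encoding scheme and annealing moves are not described in the abstract and are not reproduced here. Using only X and Y on blocks of equal height gives the classical 2-D case.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int]          # (width x, depth y, height z)
AXIS = {"X": 0, "Y": 1, "Z": 2}     # hypothetical cut operators


def evaluate_slicing(expr: List[str], blocks: Dict[str, Box]) -> Box:
    """Bounding box of a slicing floorplan given in postfix (Polish) form.

    Along the cut axis the two sub-floorplans' extents add; along the
    other two axes the result takes the larger extent.
    """
    stack: List[Box] = []
    for token in expr:
        if token in AXIS:
            if len(stack) < 2:
                raise ValueError("malformed expression")
            b = stack.pop()
            a = stack.pop()
            axis = AXIS[token]
            stack.append(tuple(a[i] + b[i] if i == axis else max(a[i], b[i])
                               for i in range(3)))
        else:
            stack.append(blocks[token])
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


blocks = {"a": (2, 1, 1), "b": (1, 1, 1), "c": (3, 2, 1)}
for expr in ("a b X c Y", "a b Z c X"):
    w, d, h = evaluate_slicing(expr.split(), blocks)
    print(expr, "->", (w, d, h), "volume", w * d * h)
```

An annealer over such expressions perturbs the token sequence and re-evaluates the cost; the two example expressions pack the same blocks into bounding volumes of 9 and 20.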

Optimal redistribution of white space for wire length minimization
Xiaoping Tang, Ruiqi Tian, Martin D. F. Wong
Pages: 412-417
doi: 10.1145/1120725.1120900

Existing floorplanning algorithms compact blocks to the left and bottom. Although compaction yields an optimal area, it may not serve other objectives well, such as minimizing total wire length, which is a first-order objective. How to place blocks to obtain optimal wire length has not been shown in the literature. In this paper, we first show that the problem can be formulated as a linear program. Then, instead of using general but slow linear programming, we propose an efficient min-cost-flow-based approach to solve it. Our approach is guaranteed to obtain the minimum total wire length in polynomial time while keeping the minimum area, by distributing white space more intelligently for a given floorplan topology. We also show that the approach can easily be extended to handle constraints such as a fixed frame (fixed area), I/O pins, pre-placed blocks, boundary blocks, range placement, alignment and abutment, rectilinear blocks, soft blocks, one-dimensional cluster placement, and bounded net delay, without loss of optimality. In practice, the algorithm is very efficient: it finishes in less than 0.4 seconds on all MCNC block-placement benchmarks. It is also very effective: experimental results show a 4.2% wire-length improvement even on very compact floorplans. It therefore provides an ideal means of post-floorplanning (floorplan refinement).

Crowdedness-balanced multilevel partitioning for uniform resource utilization
Yongseok Cheon, Martin D. F. Wong
Pages: 418-423
doi: 10.1145/1120725.1120901

In this paper, we propose a new multi-objective multilevel K-way partitioning method that is aware of the resource utilization distribution, assuming that the resource utilization of a partitioned block is proportional to its logic occupation and the interconnections it requires. A new quality measure for partitioning solutions, crowdedness, is defined as a virtual complexity metric that considers the physical size and the local connectivity of a partitioned block simultaneously, in the form of a weighted sum. Partitioning solutions driven by overall cut-quality minimization tend to show a wide variance in local interconnections across blocks. Differences in block size, combined with this variance in interconnections, can lead to a significant imbalance in crowdedness (equivalently, resource utilization), even when the feasibility condition imposed by a block-size constraint is satisfied. Using the crowdedness metric, we explore a new partitioning solution space in which local interconnections are adaptively adjusted according to block sizes, still under the same objective of minimizing overall interconnections. Through a carefully designed prioritized cell-move policy, the proposed crowdedness-based partitioning achieves near-optimal solutions in terms of resource utilization distribution, while the overall interconnection quality also improves and the feasibility constraint is barely violated. The proposed approach is practically beneficial for multi-FPGA applications, in which excessive interconnections for an FPGA generate additional logic inside that FPGA.
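The crowdedness metric is described only as a weighted sum of a block's physical size and its local connectivity. As a rough illustration, the sketch below assumes that local connectivity is the number of nets that touch a block and also cross the partition boundary, and uses arbitrary weights `alpha` and `beta`; the function name, the weights and this connectivity definition are assumptions, and the paper's exact formulation may differ.

```python
from typing import Dict, List, Set


def crowdedness(
    part: Dict[str, int],     # cell -> block index
    sizes: Dict[str, float],  # cell -> area
    nets: List[Set[str]],     # each net is a set of cells
    k: int,                   # number of blocks
    alpha: float = 1.0,       # assumed weight on physical size
    beta: float = 1.0,        # assumed weight on local connectivity
) -> List[float]:
    """Hypothetical crowdedness per block: alpha * size + beta * (cut nets touching it)."""
    size = [0.0] * k
    for cell, blk in part.items():
        size[blk] += sizes[cell]
    cut = [0] * k
    for net in nets:
        blocks = {part[c] for c in net}
        if len(blocks) > 1:  # the net crosses the partition boundary
            for blk in blocks:
                cut[blk] += 1
    return [alpha * size[b] + beta * cut[b] for b in range(k)]


if __name__ == "__main__":
    part = {"a": 0, "b": 0, "c": 0, "d": 1}
    sizes = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
    nets = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"a", "d"}]
    print(crowdedness(part, sizes, nets, 2))  # [5.0, 3.0]
```

A crowdedness-aware refinement pass could then prefer cell moves that narrow the spread between the most and least crowded blocks, rather than looking only at the total cut.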

Partitioning and placement for buildable QCA circuits
Ramprasad Ravichandran, Mike Niemier, Sung Kyu Lim
Pages: 424-427
doi: 10.1145/1120725.1120902

Quantum-dot Cellular Automata (QCA) is a novel computing mechanism that can represent binary information based on the spatial distribution of electron charge configurations in chemical molecules. In this paper, we present partitioning and placement algorithms for large-scale automatic QCA layout. The purpose of zone partitioning is to initially partition a given circuit such that a single clock potential modulates the interdot barriers in all of the QCA cells within each zone. We then place these zones during the placement step. We identify several objectives and constraints that enhance the buildability of QCA circuits and use them in our optimization process. The results are intended to define what is computationally interesting and could actually be built within a set of predefined constraints.

PMP: performance-driven multilevel partitioning by aggregating the preferred signal directions of I/O conduits
Chanseok Hwang, Massoud Pedram
Pages: 428-431
doi: 10.1145/1120725.1120903

In this paper, we present a new performance-driven multilevel partitioning algorithm that calculates the timing gain of a move in move-based partitioning strategies from the aggregation of preferred signal directions. In addition, we propose a new timing-aware multilevel clustering algorithm that, in the clustering step, uses the connection strength of an edge as the primary objective and the maximum depth or maximum hop count of any path containing the edge as a tiebreaker. These ideas are integrated into a general multilevel partitioning framework consisting of three phases: coarsening, initial partitioning, and uncoarsening with refinement. On the benchmarks, we reduce delay by 14.6% on average while increasing the cutsize by 1.2%, compared to hMetis [1].

SESSION: Advances in SAT technology and application
Masahiro Fujita, Jeremy Levitt

MUP: a minimal unsatisfiability prover
Jinbo Huang
Pages: 432-437
doi: 10.1145/1120725.1120907

After establishing the unsatisfiability of a SAT instance encoding a typical design task, there is a practical need to identify its minimal unsatisfiable subsets, which pinpoint the reasons for the infeasibility of the design. Because of the potentially expensive computation, existing tools for extracting unsatisfiable subformulas do not guarantee the minimality of their results. This paper describes a practical algorithm that decides the minimal unsatisfiability of any CNF formula through BDD manipulation. The algorithm has a worst-case complexity that is exponential only in the treewidth of the CNF formula. We provide an empirical evaluation of the algorithm, highlighting its efficiency on a set of hard problems as well as its ability to work with existing subformula extraction tools to achieve optimal results.
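The property that MUP decides can be stated directly: a CNF formula is minimally unsatisfiable if it is unsatisfiable and removing any single clause makes it satisfiable. The brute-force checker below is only a definitional illustration; it enumerates every assignment, so it is exponential in the number of variables and suitable only for tiny formulas, unlike the BDD-based algorithm described above. Clauses are assumed to use DIMACS-style signed integers.

```python
from itertools import product
from typing import List

Clause = List[int]  # DIMACS-style literals: 3 means x3, -3 means NOT x3


def satisfiable(cnf: List[Clause]) -> bool:
    """Brute-force satisfiability check over all assignments."""
    variables = sorted({abs(lit) for clause in cnf for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        assign = dict(zip(variables, values))
        if all(any(assign[abs(l)] == (l > 0) for l in clause) for clause in cnf):
            return True
    return False


def minimally_unsat(cnf: List[Clause]) -> bool:
    """True iff cnf is UNSAT and dropping any one clause makes it SAT."""
    if satisfiable(cnf):
        return False
    return all(satisfiable(cnf[:i] + cnf[i + 1:]) for i in range(len(cnf)))


if __name__ == "__main__":
    print(minimally_unsat([[1], [-1]]))       # True
    print(minimally_unsat([[1], [-1], [2]]))  # False: clause [2] is not needed
```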

Integration of supercubing and learning in a SAT solver
Domagoj Babić, Alan J. Hu
Pages: 438-444
doi: 10.1145/1120725.1120908

Learning is an essential pruning technique in modern SAT solvers, but it exploits only a relatively small amount of the information that can be deduced from conflicts. Recently, a new pruning technique called supercubing was proposed [1]. Supercubing can exploit functional symmetries, which are abundant in industrial SAT instances. We point out the significant difficulties of integrating supercubing with learning and propose solutions. Our experimental solver is the first supercubing-based solver with performance comparable to leading-edge solvers.

Dynamic symmetry-breaking for improved Boolean optimization
Fadi A. Aloul, Arathi Ramani, Igor L. Markov, Karem A. Sakallah
Pages: 445-450
doi: 10.1145/1120725.1120909

Despite impressive progress in Boolean Satisfiability (SAT) solving and several extensions to pseudo-Boolean (PB) constraints, many applications that use SAT, such as high-performance formal verification techniques, are still restricted to checking the satisfiability of certain conditions. However, there is also frequently a need to express a preference for certain solutions. Extending SAT solving to Boolean optimization allows objective functions to be used to describe a desirable solution. Although recent work in 0-1 Integer Linear Programming (ILP) offers extensions that can optimize a linear objective function, this is often achieved by solving a series of SAT or ILP decision problems. Our work articulates some pitfalls of this approach. An objective function may complicate the use of any symmetry present in the given constraints, even when the constraints are unsatisfiable and the objective function is irrelevant. We propose several new techniques that treat objective functions differently from CNF/PB constraints and accelerate Boolean optimization in many practical cases. We also develop an adaptive flow that analyzes a given Boolean optimization problem and picks the symmetry-breaking technique best suited to the problem characteristics. Empirically, we show that for non-trivial objective functions that destroy constraint symmetries, the benefit of static symmetry-breaking is lost, whereas dynamic symmetry-breaking accelerates problem solving in many cases. We also introduce a new objective function, Localized Bit Selection (LBS), which can be used to specify a preference for bit values in formal verification applications.

A fast counterexample minimization approach with refutation analysis and incremental SAT
Shengyu Shen, Ying Qin, SiKun Li
Pages: 451-454
doi: 10.1145/1120725.1120910

Eliminating irrelevant variables from a counterexample, to make it easier to understand, is an active research topic. The BFL algorithm is the most effective counterexample minimization algorithm compared with all other approaches, but its runtime overhead is very large because it makes one call to the SAT solver per candidate variable to be eliminated. We therefore propose a faster counterexample minimization algorithm based on refutation analysis and incremental SAT. First, for every UNSAT instance of BFL, we perform refutation analysis to extract the set of variables that lead to UNSAT; all variables not in this set can be eliminated simultaneously. In this way, we can eliminate many variables with only one call to the SAT solver. At the same time, we employ an incremental SAT approach to share learned clauses between similar instances of BFL, which prevents overlapping state space from being searched repeatedly. Theoretical analysis and experimental results show that our approach can be 1 to 2 orders of magnitude faster than BFL while retaining BFL's minimization ability.

Sequential equivalence checking using cuts
Wei Huang, PuShan Tang, Min Ding
Pages: 455-458
doi: 10.1145/1120725.1120911

This paper presents an algorithm that improves on Van Eijk's algorithm [5] by incorporating a cutpoint technique [8]. Combinational verification often uses this technique to convert a large circuit into several small ones that are verified separately, and well-chosen cuts reduce the time spent on combinational verification. We embed the technique into sequential equivalence checking. Experimental results show that the proposed method achieves about a 2x speedup over the original algorithm.

SESSION: Analysis and simulation techniques
Richard Shi, Koichiro Mashiko

Fast PLL simulation using nonlinear VCO macromodels for accurate prediction of jitter and cycle-slipping due to loop non-idealities and supply noise
Xiaolue Lai, Yayun Wan, Jaijeet Roychowdhury
Pages: 459-464
doi: 10.1145/1120725.1120913

Phase-locked loops (PLLs) are widely used in electronic systems. Because PLL malfunction is one of the most important causes of SoC re-fabs, fast simulation of PLLs that accurately captures non-ideal behavior is an immediate, pressing need in the semiconductor design industry. In this paper, we present a PLL simulation technique based on nonlinear macromodels that is considerably more accurate than prior linear PLL simulation techniques. Our method accurately captures transient behavior and faithfully estimates timing jitter in noisy PLLs. We demonstrate the proposed technique on PLLs based on ring and LC voltage-controlled oscillators (VCOs), and compare the results against linear PLL macromodels and full SPICE-level simulation. We show that, unlike prior approaches based on linear macromodels, the proposed nonlinear technique captures the dynamics of complex phenomena such as locking, cycle slipping, and PLL jitter induced by power-supply noise, accurately replicating qualitative features of full SPICE simulations while providing speedups of over two orders of magnitude.

Hierarchical analysis of process variation for mixed-signal systems
Fang Liu, Sule Ozev
Pages: 465-470
doi: 10.1145/1120725.1120914

Increasing process variability necessitates reliable analysis of its effects on circuit performance, not only at the top level but also at intermediate levels. Mixed-signal circuits with multiple hierarchical layers, multiple parameters, and complex functional relations are especially susceptible to such variations. In this paper, we present a hierarchical method for process variation analysis. Its ability to compute the variance of parameters at each hierarchical layer makes the method particularly suited to helping designers through design iterations. Experimental results indicate that the proposed method achieves high computational efficiency with at most a 2% loss in accuracy, even for highly non-linear functional relations.
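One generic way to carry variance up a hierarchy, shown here only as an illustration, is first-order (linearized) propagation through the sensitivities of each block's performance function, assuming independent lower-level parameters. The paper's own method, which targets highly non-linear relations, may differ. The symbols below ($f$, $x_i$, $\mu$) are introduced for illustration, and the last line is a worked example with made-up coefficients.

```latex
% Performance y at layer l+1 computed from independent layer-l parameters x_1, ..., x_n
y = f(x_1, \dots, x_n), \qquad
\sigma_y^2 \approx \sum_{i=1}^{n} \left( \left. \frac{\partial f}{\partial x_i} \right|_{\mu} \right)^{2} \sigma_{x_i}^2
% Worked example: y = 2x_1 + 3x_2 with \sigma_{x_1} = \sigma_{x_2} = 0.1
\sigma_y^2 = 2^2 (0.01) + 3^2 (0.01) = 0.04 + 0.09 = 0.13
```

The resulting $\sigma_y^2$ then serves as an input variance for the next layer up, which is what makes intermediate-level variances available to the designer.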

A novel wavelet method for noise analysis of nonlinear circuits
Xuan Zeng, Bank Liu, Jun Tao, Charles Chiang, Dian Zhou
Pages: 471-476
doi: 10.1145/1120725.1120915

In this paper, a novel wavelet method is proposed for noise analysis of nonlinear circuits. Compared with existing algorithms capable of assessing circuit performance in the presence of noise, the proposed method has several merits. First, it fully accounts for nonlinearities. Second, it can handle signals with continuous frequency spectra. Third, by taking advantage of properties of wavelet bases such as local compactness and multi-resolution, it achieves both high simulation speed and high accuracy. Furthermore, an adaptive scheme automatically selects the wavelet basis functions for a desired accuracy. Together, these merits allow the novel wavelet method to outperform previous techniques.

An error-driven adaptive grid refinement algorithm for automatic generation of analog circuit performance macromodels
Mengmeng Ding, Glenn Wolfe, Ranga Vemuri
Pages: 477-482
doi: 10.1145/1120725.1120916

In this paper, we present an error-driven adaptive sampling algorithm, called the adaptive grid refinement (AGR) algorithm, that automatically generates performance macromodels for analog circuits. Starting from samples on a coarse grid, the AGR algorithm builds a global model and validates its accuracy on an independent validation data set sampled within this grid. If this model is not accurate enough on the validation data, the grid is split into smaller grids of equal size. On each of these grids, a local model is built using samples on the grid and its neighboring grids, and is validated in the same way. A grid is refined no further once its local model is accurate on its validation data set. The algorithm stops when all local models are accurate on their corresponding validation data sets. We build six performance macromodels of a CMOS op-amp using the AGR algorithm and compare it with competing techniques. The strengths and weaknesses of the proposed algorithm are discussed.
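The refinement loop described above can be sketched in one dimension. This is a simplified illustration under several assumptions: the circuit is replaced by a hypothetical function `perf(x)`, each interval gets a least-squares line fitted to evenly spaced training samples, validation uses separate midpoints between those samples, "accurate" means a maximum absolute error of at most `tol`, and samples from neighboring grids are not shared. It shows the control flow (validate, split into equal halves, stop when every local model passes), not the paper's model type.

```python
from typing import Callable, List, Tuple

Model = Tuple[float, float, float, float]  # (lo, hi, slope, intercept)


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def agr(perf: Callable[[float], float], lo: float, hi: float,
        tol: float, n_train: int = 5, max_depth: int = 10) -> List[Model]:
    """1-D adaptive grid refinement on [lo, hi]; returns accurate local linear models."""
    models: List[Model] = []
    work = [(lo, hi, 0)]
    while work:
        a, b, depth = work.pop()
        step = (b - a) / (n_train - 1)
        xs = [a + i * step for i in range(n_train)]
        slope, icpt = fit_line(xs, [perf(x) for x in xs])
        # independent validation points: midpoints between training samples
        vx = [x + step / 2 for x in xs[:-1]]
        err = max(abs(perf(x) - (slope * x + icpt)) for x in vx)
        if err <= tol or depth >= max_depth:
            models.append((a, b, slope, icpt))  # accurate: refine no further
        else:
            mid = (a + b) / 2  # split into two equal-sized sub-grids
            work.append((a, mid, depth + 1))
            work.append((mid, b, depth + 1))
    return sorted(models)


if __name__ == "__main__":
    print(len(agr(lambda x: 3 * x + 1, 0.0, 1.0, 1e-6)))  # 1
    print(len(agr(lambda x: x * x, 0.0, 1.0, 1e-3)))      # 16
```

A linear response is captured by the single coarse model, while the curved response forces uniform refinement; a response with localized curvature would be refined only where needed, which is the point of the error-driven scheme.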

SESSION: Interconnect modeling and analysis and system level design methodology
Charlie Chung-Ping Chen, Yici Cai

Partial reluctance based circuit simulation is efficient and stable
Yu Du, Wayne Dai
Pages: 483-488
doi: 10.1145/1120725.1120918

Partial reluctance K, the inverse of the partial inductance matrix L, was proposed by Devgan et al. to capture on-chip inductance effects [3]. Partial-reluctance-based circuit simulation is considered efficient and stable because the partial reluctance effect is believed to be local and the partial reluctance matrix is believed to be positive definite, although neither property has been clearly proved or illustrated. In this paper, we prove that the mutual partial reluctance effect between a completely shielded short conductor segment and a conductor segment outside the shield is zero, which implies that the partial reluctance effect is local. We also propose an iterative cutting algorithm that guarantees strong diagonal dominance of the partial reluctance matrix, which is a sufficient condition for the matrix to be positive definite. With these two properties, circuit simulation based on partial reluctance is efficient and stable.
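Strict diagonal dominance with a positive diagonal is a sufficient condition for a symmetric matrix to be positive definite, which is the property the cutting step relies on. The sketch below checks that condition and, as a naive stand-in for the paper's iterative cutting algorithm (whose details are not given in this abstract), repeatedly zeroes the weakest off-diagonal coupling in the first non-dominant row, symmetrically, until every row is dominant. The dropping rule is a hypothetical heuristic; which couplings can be discarded safely is exactly what the paper's locality result addresses.

```python
from typing import List

Matrix = List[List[float]]


def strictly_dominant_rows(k: Matrix) -> List[bool]:
    """Row i is dominant if K[i][i] > sum of |K[i][j]| over j != i."""
    n = len(k)
    return [k[i][i] > sum(abs(k[i][j]) for j in range(n) if j != i) for i in range(n)]


def cut_until_dominant(k: Matrix) -> Matrix:
    """Naively zero the weakest off-diagonal couplings until every row is dominant.

    Assumes k is symmetric with a positive diagonal; returns a new matrix.
    """
    k = [row[:] for row in k]
    n = len(k)
    while True:
        bad = [i for i, ok in enumerate(strictly_dominant_rows(k)) if not ok]
        if not bad:
            return k
        i = bad[0]
        # weakest remaining coupling in the first non-dominant row (hypothetical rule)
        j = min((j for j in range(n) if j != i and k[i][j] != 0.0),
                key=lambda j: abs(k[i][j]))
        k[i][j] = 0.0
        k[j][i] = 0.0  # keep the matrix symmetric


if __name__ == "__main__":
    K = [[4.0, -1.0, -2.0], [-1.0, 3.0, -1.5], [-2.0, -1.5, 3.0]]
    print(strictly_dominant_rows(K))  # [True, True, False]
    print(cut_until_dominant(K))      # [[4.0, -1.0, -2.0], [-1.0, 3.0, 0.0], [-2.0, 0.0, 3.0]]
```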

SAGA: synthesis technique for guaranteed throughput NoC architectures
Krishnan Srinivasan, Karam S. Chatha
Pages: 489-494
doi: 10.1145/1120725.1120919

We present SAGA, a novel genetic algorithm (GA) based technique for synthesizing custom NoC architectures that support guaranteed-throughput traffic. The technique accepts as input a communication trace graph; the amount of data, period, and deadline for each trace; and the interconnection network architecture elements. It generates a custom NoC topology together with the routing and schedule of the communication traces on that architecture. SAGA minimizes both the energy consumption and the area of the design by solving a multi-objective optimization problem. We present a detailed analysis of the quality of the results and the solution times of the proposed technique through extensive experimentation with realistic benchmarks and comparisons with optimal MILP solutions. SAGA generates solutions that are as good as the optimal solutions produced by the MILP formulation. Whereas the MILP run time rises exponentially even for moderately sized graphs, SAGA generates solutions for large graphs in reasonable time.

Automated throughput-driven synthesis of bus-based communication architectures
Sudeep Pasricha, Nikil Dutt, Mohamed Ben-Romdhane
Pages: 495-498
doi: 10.1145/1120725.1120920

As System-on-Chip (SoC) designs become more complex, it becomes increasingly difficult to design communication architectures that satisfy design constraints. Manually traversing the vast communication design space for constraint-driven synthesis is no longer feasible. In this paper we propose an approach that automates the synthesis of bus-based communication architectures for systems characterized by (possibly several) throughput constraints. Our approach accurately and effectively prunes the large communication design space to synthesize a feasible, low-cost bus architecture that satisfies the constraints of a design.

Simulation acceleration of transaction-level models for SoC with RTL sub-blocks
Jae-Gon Lee, Wooseung Yang, Young-Su Kwon, Young-Il Kim, Chong-Min Kyung
Pages: 499-502
doi: 10.1145/1120725.1120921

This paper presents an optimized use of the channel between a simulator and an accelerator, where the simulator models a transaction-level SoC and the accelerator models RTL sub-blocks. Conventional simulation accelerators synchronize the progress of the simulator and the accelerator at every simulation time step, which results in poor performance because transactions on the simulator-to-accelerator channel are split into pieces. Occasional synchronization with prediction and recovery makes it possible to merge multiple transfers, yielding a substantial performance gain over the conventional method.

Statistical modeling of cross-coupling effects in VLSI interconnects
Mridul Agarwal, Kanak Agarwal, Dennis Sylvester, David Blaauw
Pages: 503-506
doi: 10.1145/1120725.1120922

In this paper, we develop an approach for statistical modeling of crosstalk noise and dynamic delay degradation in coupled RC interconnects under process variations. The proposed model enables closed-form computation of the mean and variance of the noise peak and of the worst-case dynamic delay for given variabilities in physical dimensions. We compare the proposed model against HSPICE Monte Carlo simulations and report average errors of 2.7% and 3.7% in the mean and standard deviation of the noise peak, respectively.

Compact and stable modeling of partial inductance and reluctance matrices
Hong Li, Venkataramanan Balakrishnan, Cheng-Kok Koh, Guoan Zhong
Pages: 507-510
doi: 10.1145/1120725.1120923

The sparsification of the reluctance matrix $L^{-1}$ (where $L$ denotes the usual inductance matrix) has been widely used in several recent investigations to make the simulation of interconnects tractable. Although these sparsification techniques work well in practice, the stability of these approximations has not been established; that is, the sparsified reluctance and inductance matrices are not guaranteed to be positive definite. In this work, we propose a band matching method with two advantages. First, we exploit the elegant structure of the inverse of banded matrices to construct an approximate inductance matrix $\tilde{L}$ whose band entries match those of the original $L$ and whose inverse is a banded matrix. This approach yields a compact representation of both the inductance and reluctance matrices. Second, we establish that the compact approximant $\tilde{L}$ is guaranteed to be positive definite. Simulation results show that our approach achieves an approximation accuracy comparable to that of existing methods.
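For the simplest case, bandwidth 1, the band-matching property can be checked numerically. Given the diagonal and first off-diagonal of a symmetric positive-definite $L$, filling every other entry by the recursion $\tilde{L}_{ik} = \tilde{L}_{i,k-1} L_{k-1,k} / L_{k-1,k-1}$ gives the classical maximum-determinant band completion: it matches $L$ on the band and its inverse is tridiagonal. This sketch assumes every 2×2 band block of $L$ is positive definite and is not the authors' general-bandwidth construction; the dense Gauss-Jordan `inverse` helper is only there to verify the result on a small example.

```python
from typing import List

Matrix = List[List[float]]


def tridiagonal_completion(l: Matrix) -> Matrix:
    """Keep the diagonal and first off-diagonal of l; fill the rest so the inverse is tridiagonal."""
    n = len(l)
    t = [[0.0] * n for _ in range(n)]
    for i in range(n):
        t[i][i] = l[i][i]
        if i + 1 < n:
            t[i][i + 1] = t[i + 1][i] = l[i][i + 1]
    for i in range(n):
        for k in range(i + 2, n):
            # t[i][k-1] is already known (band entry or previous iteration)
            t[i][k] = t[k][i] = t[i][k - 1] * l[k - 1][k] / l[k - 1][k - 1]
    return t


def inverse(a: Matrix) -> Matrix:
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


if __name__ == "__main__":
    L = [[4.0, 2.0, 1.5, 1.0], [2.0, 4.0, 2.0, 1.5],
         [1.5, 2.0, 4.0, 2.0], [1.0, 1.5, 2.0, 4.0]]
    T = tridiagonal_completion(L)
    print(T[0])  # [4.0, 2.0, 1.0, 0.5]
    K = inverse(T)
    print(max(abs(K[i][j]) for i in range(4) for j in range(4) if abs(i - j) > 1))
```

The second print shows an off-band magnitude at round-off level, that is, the sparse (here tridiagonal) reluctance matrix comes with an exactly matched inductance band.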

SESSION: High-level synthesis
Fan Mo, Jinian Bian

Scalable interprocedural register allocation for high level synthesis
Rami Beidas, Jianwen Zhu
Pages: 511-516
doi: 10.1145/1120725.1120951

The success of classical high-level synthesis has been limited by the complexity of the applications it can handle, which are typically not large enough to justify departing from the industry-standard register-transfer-level design methodology. Recent advances in micro-architecture models have enabled the use of stack-based controllers, allowing complex algorithms with multiple procedures to be implemented directly in hardware. Nevertheless, design optimizations across procedure boundaries have not been fully explored. In this paper, we address the problem of interprocedural register allocation in the context of high-level synthesis. In contrast to a recently proposed interprocedural register allocation algorithm, which processes an expensive global graph representation of the conflict relation among all values to achieve near-optimality, we introduce a new method called color palette propagation (CPP). The key idea behind our method is to propagate the use of colors, whose number is significantly smaller than the size of the conflict relation, across different procedures. With a complexity comparable to that of intraprocedural register allocation, our method is shown to scale to very large C programs. For the benchmarks that can be handled by conventional global methods, our method produced nearly the same number of registers while providing an average speedup factor of 90.

Simultaneous floorplanning and resource binding: a probabilistic approach
Azadeh Davoodi, Ankur Srivastava
Pages: 517-522
doi: 10.1145/1120725.1120952

In this work we present a probabilistic approach to simultaneous floorplanning and resource binding for low power. Traditional approaches iteratively perform floorplanning and resource binding using crude deterministic wire-length estimates such as the bounding box, since routing information for inter-module interconnect is not available. The lack of accurate wire lengths results in suboptimal designs and failure of timing closure. In this work we model wire lengths as probability distributions and propose a novel probabilistic optimization methodology. Experiments were conducted using state-of-the-art commercial and academic tools. The novelty of this work lies in a higher chance of ending with a feasible, synthesizable design without a loss in overall power (interconnect + module + register). Experimental results on the Mediabench benchmarks show that, on average, 2 modules remained unsynthesized after routing in the conventional case, whereas with our probabilistic approach all modules were synthesized after routing.

Reducing hardware complexity of linear DSP systems by iteratively eliminating two-term common subexpressions
Anup Hosangadi, Farzan Fallah, Ryan Kastner
Pages: 523-528
doi: 10.1145/1120725.1120953

This paper presents a novel technique to reduce the number of operations in multiplierless implementations of linear DSP transforms by iteratively eliminating two-term common subexpressions. Our method uses a polynomial transformation of linear systems that enables us to eliminate common subexpressions consisting of multiple variables. Our algorithm is fast and produces fewer additions/subtractions than all other known techniques. The synthesized examples show significant reductions in area and power consumption.
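A heavily simplified sketch of iterative two-term common subexpression elimination follows. Each output is modeled as a sum of distinct plain input variables (no shifts or signs, unlike the paper's polynomial representation of shifted constants). The most frequent pair of terms is extracted into a new temporary, and the process repeats until no pair occurs more than once. Lexicographic tie-breaking and the helper names are arbitrary assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Outputs = Dict[str, List[str]]         # output name -> distinct terms it sums
Temps = Dict[str, Tuple[str, str]]     # temporary name -> the two terms it adds


def two_term_cse(outputs: Outputs) -> Tuple[Outputs, Temps]:
    """Iteratively extract the most frequent two-term subexpression."""
    outs = {k: sorted(v) for k, v in outputs.items()}
    temps: Temps = {}
    while True:
        counts = Counter(
            pair for terms in outs.values() for pair in combinations(terms, 2)
        )
        if not counts:
            break
        # most frequent pair; ties broken lexicographically (arbitrary choice)
        best = min(counts, key=lambda p: (-counts[p], p))
        if counts[best] < 2:
            break
        name = f"t{len(temps) + 1}"
        temps[name] = best
        for k, terms in outs.items():
            if best[0] in terms and best[1] in terms:
                rest = [t for t in terms if t not in best]
                outs[k] = sorted(rest + [name])
    return outs, temps


def adder_count(outs: Outputs, temps: Temps) -> int:
    """One adder per temporary plus (terms - 1) adders per output."""
    return len(temps) + sum(len(t) - 1 for t in outs.values())


if __name__ == "__main__":
    ys = {"y1": ["x1", "x2", "x3"], "y2": ["x1", "x2", "x4"], "y3": ["x1", "x2", "x3", "x4"]}
    print(adder_count(ys, {}))                # 7 adders without sharing
    print(adder_count(*two_term_cse(ys)))     # 4 adders after extracting t1, t2
```

Here `t1 = x1 + x2` is shared by all three outputs and `t2 = t1 + x3` by two of them, which is the kind of sharing that reduces adder count, and with it area and power, in a multiplierless implementation.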

A fast algorithm for finding common multiple-vertex dominators in circuit graphs
René Krenz, Elena Dubrova
Pages: 529-532
doi: 10.1145/1120725.1120954

In this paper we present a fast algorithm for computing common multiple-vertex dominators in circuit graphs. Dominators are widely used in CAD applications such as satisfiability checking, equivalence checking, ATPG, technology mapping, decomposition of Boolean functions, and power optimization. State-of-the-art algorithms compute single-vertex dominators in linear time. However, single-vertex dominators appear rarely in circuit graphs, which calls for investigating a broader type of dominator and developing algorithms to compute it. We show that our new technique is faster and computes more common multiple-vertex dominators than existing techniques.
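For reference, single-vertex dominators in a circuit graph can be computed with the simple set-intersection formulation below, where a vertex u dominates v if every path from v to the output passes through u. This quadratic-time sketch only illustrates the definition; it is neither the linear-time single-vertex algorithm mentioned above nor the paper's multiple-vertex method. The example netlist is hypothetical, with a single output, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def dominators(succ: Dict[str, List[str]], output: str) -> Dict[str, Set[str]]:
    """dom[v] = vertices lying on every path from v to the output (v included)."""
    # TopologicalSorter expects predecessor lists; passing successors instead
    # yields every successor before its driver, i.e. sinks first.
    order = list(TopologicalSorter(succ).static_order())
    dom: Dict[str, Set[str]] = {}
    for v in order:
        if v == output:
            dom[v] = {v}
            continue
        outs = succ.get(v, [])
        common = set.intersection(*(dom[s] for s in outs)) if outs else set()
        dom[v] = {v} | common
    return dom


if __name__ == "__main__":
    # Hypothetical netlist: a, b, c are inputs, y is the single output.
    succ = {"a": ["g1"], "b": ["g1", "g2"], "c": ["g2"],
            "g1": ["g3"], "g2": ["g3"], "g3": ["y"]}
    dom = dominators(succ, "y")
    print(sorted(dom["b"]))  # ['b', 'g3', 'y']
```

Because `b` reaches the output through both `g1` and `g2`, only `g3` and `y` dominate it as single vertices; under the usual definition, the pair {g1, g2} is a multiple-vertex dominator of `b` (and also of `a`), the broader kind of dominator the paper targets.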
|
|
|
SESSION: Low power |
| |
Hai Zhou,
Rob Roy
|
|
|
|
|
Low-power domino circuits using NMOS pull-up on off-critical paths |
| |
Abdulkadir U. Diril,
Yuvraj S. Dhillon,
Abhijit Chatterjee,
Adit D. Singh
|
|
Pages: 533-538 |
|
doi>10.1145/1120725.1120956 |
|
Domino logic is used extensively in high-speed microprocessor datapath design. Although domino gates have small propagation delay, they consume relatively more power. We propose a scheme to reduce the power consumption of combinational domino logic blocks while maintaining performance. We replace the PMOS precharge transistor with an NMOS transistor, which reduces the overall power consumption of the gate at the expense of higher delay. A heuristic algorithm replaces the fast, high-power gates on off-critical paths with these slower, low-power gates while maintaining circuit performance. Our technique reduces the dynamic energy of the ISCAS'85 circuits by 16.25%.
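The abstract does not spell out the heuristic, so the following is only a generic slack-preserving greedy sketch of the same idea: try the slower, lower-power variant of each gate (largest power saving first) and keep it only if the critical path delay does not grow. Gate names and the `(delay, power)` numbers are hypothetical.

```python
def critical_delay(order, fanin, delay):
    """Longest path delay; `order` lists the gates in topological order."""
    arrival = {}
    for g in order:
        arrival[g] = delay[g] + max((arrival[p] for p in fanin[g]), default=0.0)
    return max(arrival.values())


def swap_off_critical(order, fanin, fast, slow):
    """Greedy sketch: fast[g] and slow[g] are (delay, power) pairs. Returns
    a dict mapping each gate to "fast" or "slow"."""
    variants = {"fast": fast, "slow": slow}
    choice = {g: "fast" for g in order}

    def delays():
        return {g: variants[choice[g]][g][0] for g in order}

    target = critical_delay(order, fanin, delays())  # performance to preserve
    for g in sorted(order, key=lambda g: fast[g][1] - slow[g][1], reverse=True):
        choice[g] = "slow"
        if critical_delay(order, fanin, delays()) > target + 1e-12:
            choice[g] = "fast"  # slowing this gate would cost performance
    return choice
```

With gates `a` (delay 3) and `b` (delay 1) feeding `c`, only `b` is off the critical path, so only `b` is swapped to the slower variant.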
|
|
|
Low-leakage robust SRAM cell design for sub-100nm technologies |
| |
Shengqi Yang,
Wayne Wolf,
Wenping Wang,
N. Vijaykrishnan,
Yuan Xie
|
|
Pages: 539-544 |
|
doi>10.1145/1120725.1120957 |
|
A novel low-leakage, robust SRAM design for sub-100nm technologies, the Hybrid SRAM (HSRAM) cell, is presented in this paper. Leakage power (especially subthreshold and gate leakage) and soft errors are challenging SRAM design. While these issues have been addressed separately in previous SRAM designs, no existing design simultaneously cuts leakage power and improves resistance to soft errors. In this work, we have built the first such SRAM cell by combining a high-k gate dielectric with a dynamic threshold voltage, realized as a transistor whose gate and substrate are jointly biased. The HSRAM not only makes gate leakage negligible but also lessens the severe increase in subthreshold leakage caused by the Fringing/Field Induced Barrier Lowering (FIBL) effect that accompanies the introduction of a high-k gate dielectric, and at the same time it reduces susceptibility to soft errors by increasing the node capacitance. Experiments were performed for this HSRAM at both the transistor level and the circuit level using ISE8.0 and HSPICE. They indicate that, compared to a standard 6T SRAM, the HSRAM cell can reduce total leakage by up to 93%, with up to a 23% increase in reliability degree and up to a 73% reduction in bitline delay.
|
|
|
Studying interactions between prefetching and cache line turnoff |
| |
Ismail Kadayif,
Mahmut Kandemir,
Guilin Chen
|
|
Pages: 545-548 |
|
doi>10.1145/1120725.1120958 |
|
While many prior studies have focused on performance and energy optimizations for caches, the interactions between these optimizations have received much less attention. This is unfortunate, since performance-oriented techniques generally influence the energy behavior of the cache, and energy-oriented techniques usually increase program execution cycles. The overall energy and performance behavior of caches in embedded systems when multiple techniques coexist remains an open research problem. This paper studies this interaction and illustrates how performance and energy optimizations affect each other. We also point out several potential optimizations that could be based on this study.
|
|
|
The development of high performance FFT IP cores through hybrid low power algorithmic methodology |
| |
Wei Han,
A. T. Erdogan,
T. Arslan,
M. Hasan
|
|
Pages: 549-552 |
|
doi>10.1145/1120725.1120959 |
|
This paper presents a solution based on parallel-pipelined architectures for high-throughput, power-efficient FFT IP cores. Low power consumption is achieved by combining hybrid low-power algorithms and architectures. A number of IP cores have been implemented to compare the impact of parameterization on power, area and speed. The results show that combining these techniques saves up to 55% power for a 64-point 4-parallel-pipelined FFT and up to 52% for a 16-point 2-parallel-pipelined FFT, compared to R4SDC pipelined FFTs.
|
|
|
Battery-aware instruction generation for embedded processors |
| |
Newton Cheung,
Sri Parameswaran,
Jörg Henkel
|
|
Pages: 553-556 |
|
doi>10.1145/1120725.1120960 |
|
Automatic instruction generation is an efficient way to satisfy growing performance demands and meet design constraints for application-specific instruction-set processors. A typical approach to instruction generation combines a large group of primitive instructions into a single extensible instruction to maximize speedup. However, this approach often leads to large power dissipation and discharge current, posing a challenge for battery-powered products. In this paper, we propose a battery-aware automatic tool for designing extensible instructions that minimizes the power dissipation distribution by separating an instruction into multiple instructions. We verify our automatic tool using 50 different code segments and five large real-world applications. For the code segments, our tool reduces energy consumption by a further 5.8% on average (up to 17.7%) compared to extensible instructions generated by previous approaches. For real-world applications, energy consumption is reduced by 6.6% on average (up to 16.53%), and performance also increases in most cases. The automatic instruction generation tool is integrated into our application-specific instruction-set processor tool suite.
|
|
|
A variation-aware low-power coding methodology for tightly coupled buses |
| |
Masanori Muroyama,
Kosuke Tarumi,
Koji Makiyama,
Hiroto Yasuura
|
|
Pages: 557-560 |
|
doi>10.1145/1120725.1120961 |
|
This paper describes a novel low-power coding methodology for buses. Ultra-deep-submicron technology and system-on-chip integration have made buses responsible for a considerable portion of power consumption; the major sources are the transition activity on the signal lines and the coupling capacitances between lines. In addition, variation in the effective coupling capacitances must now be considered. We address power reduction in the presence of these phenomena using variable-length coding. Experimental results show the effectiveness of our methodology.
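To see why coupling-capacitance variation changes which codes are cheap, the sketch below scores a bus transition with a simplified cost model (an assumption, not the paper's model): a self term per switching line plus a coupling term per adjacent pair, weighted by a per-pair ratio `lambdas[i]` so that variation can be modelled line by line. Choosing among candidate codewords is a hypothetical stand-in for the paper's variable-length coding.

```python
def transition_cost(prev, nxt, width, lambdas):
    """Normalized cost of one bus transition: sum(d_i^2) for the lines plus
    sum(lambdas[i] * (d_i - d_{i+1})^2) for adjacent pairs, where
    d_i in {-1, 0, +1} is the change on line i and lambdas[i] is the
    (possibly variation-dependent) coupling ratio between lines i and i+1."""
    d = [((nxt >> i) & 1) - ((prev >> i) & 1) for i in range(width)]
    self_term = sum(x * x for x in d)
    coupling = sum(lambdas[i] * (d[i] - d[i + 1]) ** 2 for i in range(width - 1))
    return self_term + coupling


def cheapest_codeword(prev, candidates, width, lambdas):
    """Pick the candidate encoding of the next value with the lowest cost."""
    return min(candidates, key=lambda c: transition_cost(prev, c, width, lambdas))
```

Under this model, adjacent lines switching in opposite directions (`01 -> 10`) cost 6 units with `lambdas=[1.0]`, while both lines rising together (`00 -> 11`) cost only 2; a larger coupling ratio on one pair makes opposite switching on that pair even more expensive.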
|
|
|
SESSION: Formal verification: theory and practice |
| |
Karem A. Sakallah,
Yuan Lu
|
|
|
|
|
Automatic assume guarantee analysis for assertion-based formal verification |
| |
Dong Wang,
Jeremy Levitt
|
|
Pages: 561-566 |
|
doi>10.1145/1120725.1120963 |
|
Assertion-based verification encourages the insertion of many assertions into a design. Typically, not all assertions can be proven (or falsified). The indeterminate assertions require manual analysis to determine design correctness, a time-consuming and error-prone process. This paper shows how automatic assume-guarantee reasoning can be used to reduce the amount of manual analysis. We present algorithms to automatically compute the assume-guarantee relations between assertions. We extend circular assume-guarantee reasoning to compute more proofs. We also show how automatic assume-guarantee reasoning can be used in practice to reduce the number of indeterminate assertions requiring manual analysis. We present the results of applying our algorithms to large industrial designs.
|
|
|
TED+: a data structure for microprocessor verification |
| |
Pejman Lotfi-Kamran,
Mohammad Hosseinabady,
Hamid Shojaei,
Mehran Massoumi,
Zainalabedin Navabi
|
|
Pages: 567-572 |
|
doi>10.1145/1120725.1120964 |
|
Formal verification of microprocessors requires a mechanism for efficiently representing and manipulating both arithmetic and random Boolean functions. Recently, a new canonical, graph-based representation called TED has been introduced for verification of digital systems. Although TED can effectively represent arithmetic expressions at the word level, it is not memory-efficient for representing bit-level logic expressions. In this paper, we present modifications to TED that improve its bit-level logic representation while maintaining its robustness for word-level arithmetic representation. We show that for random Boolean expressions, the modified TED performs the same as a BDD representation.
|
|
|
Improved Boolean function hashing based on multiple-vertex dominators |
| |
René Krenz,
Elena Dubrova
|
|
Pages: 573-578 |
|
doi>10.1145/1120725.1120965 |
|
The growing complexity of today's system designs requires fast and robust verification methods. Existing BDD, SAT or ATPG-based techniques do not provide sufficient solutions for many verification instances. Boolean function hashing is a probabilistic verification approach which can complement existing formal methods in a number of applications such as equivalence checking, biased random simulation, power analysis and power optimization. The proposed hashing technique is based on the arithmetic transform, which maps a Boolean function onto a probabilistic hash value for a given input assignment. The presented algorithm uses multiple-vertex dominators in circuit graphs to progressively simplify intermediate hashing steps. The experimental results on benchmark circuits demonstrate the robustness of our approach.
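The arithmetic transform of a Boolean function is the multilinear integer polynomial that agrees with the function on {0,1}^n; evaluating it at a random integer point modulo a large prime gives a hash in which equivalent functions always collide and, for small n, different functions collide with probability at most n/p (Schwartz-Zippel). The truth-table enumeration below is an exponential-time illustration only; the paper instead works on circuit graphs and uses dominators to simplify the computation. The modulus and helper names are assumptions.

```python
import random
from itertools import product

P = 2**31 - 1  # a prime modulus (assumption)


def at_hash(f, n, point, p=P):
    """Evaluate the arithmetic transform of f at an integer point, mod p:
    the sum over true minterms of prod(x_i if bit else 1 - x_i)."""
    total = 0
    for bits in product((0, 1), repeat=n):
        if f(*bits):
            term = 1
            for b, x in zip(bits, point):
                term = term * (x if b else 1 - x) % p
            total = (total + term) % p
    return total


def random_point(n, seed=0, p=P):
    rng = random.Random(seed)
    return [rng.randrange(p) for _ in range(n)]
```

For XOR the transform is x + y - 2xy, so at the point (3, 5) it evaluates to -22 modulo P, while AND (xy) gives 15; two different circuits that both compute XOR hash identically at any point.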
|
|
|
Lower bounds for dynamic BDD reordering |
| |
Rüdiger Ebendt,
Rolf Drechsler
|
|
Pages: 579-582 |
|
doi>10.1145/1120725.1120966 |
|
In this paper we present new lower bounds on BDD size. These lower bounds are derived from more general lower bounds that were recently given in the context of exact BDD minimization. The contribution of this paper is twofold. First, we gain deeper insight by examining the theory behind the new lower bounds; examples show that the new lower bounds are effective in situations where previous lower bounds are not, and vice versa. Second, given the constraints of practice, we trade off runtime against the quality of the lower bounds. Finally, a clever combination of old and new lower bounds results in a final lower bound that yields a significant improvement. Experimental results show the efficiency of our approach.
|
|
|
Partitioned model checking from software specifications |
| |
Xiushan Feng,
Alan J. Hu,
Jin Yang
|
|
Pages: 583-587 |
|
doi>10.1145/1120725.1120967 |
|
With the trends toward higher-level design, verification models written in software, and hardware/software codesign, it is increasingly important to verify that RTL hardware behaves correctly according to an executable software specification. In this paper, we propose a natural way to formalize a cycle-accurate software specification as an annotated control flow graph, and then we introduce a novel partitioned model-checking algorithm that exploits the annotated control flow graph. Preliminary experimental results show that our new method runs faster than standard model checking.
|
|
|
SESSION: Special session |
|
|
|
|
Are we ready for system-level synthesis? |
| |
Jason Cong,
Tony Ma,
Ivo Bolsens,
Phil Moorby,
Jan Rabaey,
John Sanguinetti,
Kazutoshi Wakabayashi,
Yoshi Watanabe
|
|
Pages: 1-1 |
|
doi>10.1145/1120725.1120969 |
|
Electronic system-level (ESL) design automation has been identified by Dataquest as the next productivity boost for the semiconductor industry. We have put together a distinguished panel of experts to discuss whether we are ready for system-level synthesis.
|
|
|
SESSION: Robust and low-power clock design |
| |
C. K. Cheng,
Weiping Shi
|
|
|
|
|
Register placement for low power clock network |
| |
Yongqiang Lu,
C. N. Sze,
Xianlong Hong,
Qiang Zhou,
Yici Cai,
Liang Huang,
Jiang Hu
|
|
Pages: 588-593 |
|
doi>10.1145/1120725.1120971 |
|
In modern VLSI designs, the increasingly severe power problem requires minimizing clock routing wirelength so that both power consumption and power supply noise are alleviated. In contrast to most traditional work, which handles this problem only during clock routing, we propose steering standard-cell register placement toward locations that enable even less clock routing wirelength and power. To minimize adverse impacts on conventional cell placement goals such as signal net wirelength and critical path delay, register placement is carried out in the context of quadratic placement. The proposed technique is particularly effective for the recently popular prescribed-skew clock routing. Experiments on benchmark circuits show encouraging results.
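Quadratic placement minimizes the sum of w·(x_u − x_v)² over nets; setting the gradient to zero places each movable cell at the weighted mean of its neighbours. The one-dimensional Gauss-Seidel sketch below shows that mechanism, and a hypothetical pseudo-net from a register to a clock tap point illustrates one way a register could be pulled toward clock-friendly locations; the paper's actual formulation is not given in the abstract.

```python
def quadratic_place_1d(movable, fixed, nets, iters=1000, tol=1e-9):
    """Minimize sum(w * (x_u - x_v)^2) over two-pin nets (u, v, w) with
    Gauss-Seidel sweeps. `fixed` maps pad/tap names to positions; every
    movable cell must appear in at least one net."""
    x = {cell: 0.0 for cell in movable}
    x.update(fixed)
    for _ in range(iters):
        delta = 0.0
        for cell in movable:
            num = den = 0.0
            for u, v, w in nets:
                if cell in (u, v):
                    other = v if u == cell else u
                    num += w * x[other]
                    den += w
            new = num / den  # weighted mean of connected positions
            delta = max(delta, abs(new - x[cell]))
            x[cell] = new
        if delta < tol:
            break
    return {cell: x[cell] for cell in movable}
```

A register `r` wired to pads at 0 and 10 settles at 5; adding a unit-weight pseudo-net to a tap at 10 moves it to 20/3, trading a little signal wirelength for a shorter clock connection.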
|
|
|
Skew scheduling and clock routing for improved tolerance to process variations |
| |
Ganesh Venkataraman,
C. N. Sze,
Jiang Hu
|
|
Pages: 594-599 |
|
doi>10.1145/1120725.1120972 |
|
The synthesis of clock networks in the presence of process variation is becoming vital to the performance of digital circuits. In this paper, we propose a clock tree design algorithm driven by tolerance to process variations. We consider tolerance to process variation in several stages of clock tree synthesis, including clock skew scheduling, abstract tree generation and layout embedding. The primary objective of this work is to minimize the maximum skew violation, and a layout embedding technique specifically targeting this objective is detailed. Experimental results indicate that our proposed procedure leads to a significant reduction in the maximum skew violation due to process variation with negligible change in wirelength.
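Clock skew scheduling can be written as difference constraints on clock arrival times t_i: for a path from register i to register j, setup requires t_i − t_j ≤ T − d_max − t_setup and hold requires t_j − t_i ≤ d_min − t_hold. Subtracting a uniform safety margin M from both bounds and maximizing M is one generic way to make a schedule tolerant to variation; this is a textbook-style sketch, not the paper's formulation, and the function names are hypothetical.

```python
def skew_schedule(n, paths, period, setup, hold, margin):
    """Bellman-Ford on difference constraints. paths: (i, j, dmin, dmax) for
    a combinational path from register i to register j. Returns arrival times
    t, or None if the margin-tightened constraints are infeasible."""
    edges = []  # edge (u, v, w) encodes t[v] - t[u] <= w
    for i, j, dmin, dmax in paths:
        edges.append((j, i, period - dmax - setup - margin))  # setup: t_i - t_j
        edges.append((i, j, dmin - hold - margin))            # hold:  t_j - t_i
    t = [0.0] * n  # as if relaxed from a virtual source with 0-weight edges
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if t[u] + w < t[v]:
                t[v] = t[u] + w
                changed = True
        if not changed:
            return t
    return None  # still relaxing after n passes: negative cycle


def max_uniform_margin(n, paths, period, setup, hold, upper, iters=50):
    """Binary-search the largest margin that keeps the schedule feasible."""
    if skew_schedule(n, paths, period, setup, hold, 0.0) is None:
        return None
    lo, hi = 0.0, upper
    for _ in range(iters):
        mid = (lo + hi) / 2
        if skew_schedule(n, paths, period, setup, hold, mid) is not None:
            lo = mid
        else:
            hi = mid
    return lo
```

For two registers with paths 0→1 (d_min 1, d_max 8) and 1→0 (d_min 1, d_max 2) at T = 10, the hold constraints alone bound the skew to [M − 1, 1 − M], so the largest uniform margin is 1.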
|
|
|
Stability analysis of active clock deskewing systems using a control theoretic approach |
| |
Vinil Varghese,
Tom Chen,
Peter Young
|
|
Pages: 600-605 |
|
doi>10.1145/1120725.1120973 |
|
In this paper, a methodology for analyzing closed-loop clock distribution and active deskewing networks is proposed. An active clock distribution and deskewing network is modelled as a closed-loop feedback control system using state space equations. The state space models of the system were then used to simulate the clock deskewing scheme and, most importantly, to analyze its stability using the integral quadratic constraints method. Such a systematic analysis method can be very useful to designers, because it lets them determine how the deskewing network behaves without repeated simulations. The proposed approach can be further extended to determine the performance of such systems under different configurations. We show how the proposed method is applied to an experimental clock deskewing system for performance and stability analysis.
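The integral quadratic constraints method handles uncertainty and nonlinearity; as a much simpler illustration of state-space stability analysis, a nominal linear discrete-time model x[k+1] = A x[k] is asymptotically stable exactly when every eigenvalue of A lies strictly inside the unit circle. The toy loop below, e[k+1] = e[k] − K·e[k−1] (a skew correction applied with one cycle of delay), is hypothetical and not the paper's model.

```python
import cmath


def eigenvalues_2x2(A):
    """Roots of z^2 - tr(A) z + det(A) = 0 for a 2x2 matrix A."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2


def is_schur_stable(A):
    """x[k+1] = A x[k] is asymptotically stable iff all |eigenvalues| < 1."""
    return all(abs(z) < 1 for z in eigenvalues_2x2(A))


def deskew_loop(K):
    # state [e[k], e[k-1]] for e[k+1] = e[k] - K * e[k-1]
    return [[1.0, -K], [1.0, 0.0]]
```

The characteristic polynomial is z² − z + K, and the Jury conditions give stability exactly for 0 < K < 1: a gain of 0.5 settles (eigenvalues 0.5 ± 0.5j), while 1.5 makes the loop oscillate with growing amplitude.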
|
|
|
Process variation robust clock tree routing |
| |
Wai-Ching Douglas Lam,
Cheng-Kok Koh
|
|
Pages: 606-611 |
|
doi>10.1145/1120725.1120974 |
|
As the minimum feature sizes of VLSI circuits shrink while clock frequencies increase, the effects of process variations become significant. We propose a UST/DME-based approach to perform simultaneous non-zero clock skew scheduling and clock tree routing, taking into consideration the effects of process variations on clock skews. Our approach ensures that the generated clock tree has a high tolerance to process variations while minimizing the total capacitance of the clock tree, which is proportional to the total wirelength and the total number of buffers. Monte Carlo simulations show that our approach generates clock trees that are highly tolerant to process variations.
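DME-style routing builds the tree bottom-up by merging subtrees at tapping points. For the nominal zero-skew case under the Elmore delay model, the classic tapping-point fraction x along a wire of length L (resistance r and capacitance c per unit length) joining subtrees with delays t₁, t₂ and loads C₁, C₂ is x = (t₂ − t₁ + rL(C₂ + cL/2)) / (rL(cL + C₁ + C₂)). The sketch below implements only that nominal merge; the paper's variation-aware, non-zero-skew handling is not reproduced, and the function names are hypothetical.

```python
def zero_skew_tap(t1, c1, t2, c2, length, r, c):
    """Fraction x of the wire, measured from subtree 1, at which both
    subtrees see equal Elmore delay."""
    R, C = r * length, c * length
    x = (t2 - t1 + R * (c2 + C / 2)) / (R * (C + c1 + c2))
    if not 0.0 <= x <= 1.0:
        raise ValueError("zero-skew point lies off the segment; wire elongation needed")
    return x


def branch_delay(frac, t_sub, c_sub, length, r, c):
    """Elmore delay through frac*length of wire into a subtree."""
    R, C = r * frac * length, c * frac * length
    return R * (C / 2 + c_sub) + t_sub
```

With identical subtrees the tap lands at the midpoint (x = 0.5); with a slower, heavier subtree 2 the tap moves toward it, and the two branch delays computed from the returned x agree.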
|
|
|
SESSION: DSP |
| |
Makoto Ikeda,
Xiaoyang Zeng
|
|
|
|
|
IP-block-based design environment for high-throughput VLSI dedicated digital signal processing systems |
| |
Nacer-Eddine Zergainoh,
Katalin Popovici,
Ahmed Jerraya,
Pascal Urard
|
|
Pages: 612-618 |
|
doi>10.1145/1120725.1120976 |
|
The growing need to design high-performance DSP systems correctly and in a short time forces us to use IPs in many designs. In this paper, we propose an efficient IP-block-based design environment for high-throughput VLSI systems. The flow generates a SystemC Register Transfer Level (RTL) architecture, starting from a Matlab functional model described as a netlist of functional IPs. The refinement process automatically inserts control structures to handle the delays induced by the use of RTL IPs. It also inserts a control structure to coordinate the execution of parallel clocked IPs. The delays may be managed by registers or by counters included in the control structure. Experiments show that the approach can produce efficient RTL architectures and save a large amount of design time.
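When functional IPs are replaced by pipelined RTL IPs, each input of a block may arrive at a different cycle. The register-based option mentioned above amounts to delaying every input to match the latest-arriving one; the sketch below computes those balancing stages over a block DAG. Block names and latencies are hypothetical, and the paper's control structures are not reproduced.

```python
def balance_latencies(order, preds, latency):
    """order: blocks in topological order; preds[b]: blocks feeding b;
    latency[b]: pipeline latency of b's RTL IP in cycles.
    Returns (ready, delays): the cycle at which each block's output is
    valid, and the register stages to insert on each edge (src, dst)."""
    ready, delays = {}, {}
    for b in order:
        start = max((ready[p] for p in preds[b]), default=0)  # latest input
        for p in preds[b]:
            delays[(p, b)] = start - ready[p]
        ready[b] = start + latency[b]
    return ready, delays
```

If a 4-cycle FIR and a 1-cycle multiplier both feed an adder, the multiplier output needs 3 balancing registers, and the adder result is valid at cycle 5.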
|
|
|
A resource-shared VLIW processor architecture for area-efficient on-chip multiprocessing |
| |
Kazutoshi Kobayashi,
Masao Aramoto,
Yoichi Yuyama,
Akihiko Higuchi,
Hidetoshi Onodera
|
|
Pages: 619-622 |
|
doi>10.1145/1120725.1120977 |
|
We propose an area-efficient resource-shared VLIW processor (RSVP) for future leaky nanometer process technologies. It consists of several single-way independent processor units (IPUs) that share parallel processor resources. Each IPU works as a variable-way VLIW processor, sharing the parallel resources according to the priorities of given tasks. RSVP allocates shared parallel resources to the IPUs cycle by cycle, so it can minimize the number of NOPs, which waste power. The performance per power (P3) of a 4-parallel 4-way RSVP, which corresponds to four 4-way VLIWs, is 3.7% better than that of a conventional 4-parallel 4-way VLIW multiprocessor in the current 90nm process. We estimate that the RSVP achieves 36% less leakage power and 28% better P3 in a future 25nm process. We have fabricated an RSVP test chip in a 180nm process that contains two IPUs and a shared resource equivalent to two 2-way VLIWs. It operates at a 100MHz clock and consumes 130mW.
|
|
|
An efficient deblocking filter architecture with 2-dimensional parallel memory for H.264/AVC |
| |
Lingfeng Li,
Satoshi Goto,
Takeshi Ikenaga
|
|
Pages: 623-626 |
|
doi>10.1145/1120725.1120978 |
|
In this paper, we present an efficient architecture for the deblocking filter in H.264/AVC. A novel 2-dimensional parallel memory scheme is employed to achieve highly efficient parallel access in both the horizontal and vertical directions. This parallel memory scheme also eliminates the need for a transpose circuit. Our design is implemented in 0.35μm technology. Synthesis results show that the equivalent gate count is only 9.35K (not including SRAMs) at a maximum frequency of 100MHz.
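The abstract does not describe the paper's memory mapping. A common way to get conflict-free access in both directions is skewed (diagonal) interleaving, assumed here: pixel (x, y) goes to bank (x + y) mod N, so any N horizontally or vertically consecutive pixels fall into N distinct banks, and a row or a column of a 4x4 block can be read in one cycle without a transpose buffer. The address function is a hypothetical choice that keeps the mapping one-to-one.

```python
def bank_of(x, y, n_banks):
    """Skewed bank assignment: pixel (x, y) -> bank (x + y) mod N."""
    return (x + y) % n_banks


def address_of(x, y, width, n_banks):
    """Word address inside the bank; assumes width is a multiple of n_banks."""
    return y * (width // n_banks) + x // n_banks
```

With 4 banks, row pixels (0..3, 0) map to banks 0, 1, 2, 3 and column pixels (0, 0..3) also map to banks 0, 1, 2, 3, so either access hits each bank exactly once.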
|
|
|
A new register file access architecture for software pipelining in VLIW processors |
| |
Yanjun Zhang,
Hu He,
Yihe Sun
|
|
Pages: 627-630 |
|
doi>10.1145/1120725.1120979 |
|
This paper presents a novel register file architecture for clustered VLIW (Very Long Instruction Word) processors that combines local register files with a global register file. It makes communication between function units through the global register file more efficient. The concept of an associate register is introduced for this architecture; it makes it possible to write a result to two destination registers in one operation, which can efficiently speed up software pipelining.
|
|
|
A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4 AVC/H.264 |
| |
Minho Kim,
Ingu Hwang,
Soo-Ik Chae
|
|
Pages: 631-634 |
|
doi>10.1145/1120725.1120980 |
|
We describe a fast VLSI architecture for full-search motion estimation for blocks of 7 different sizes in MPEG-4 AVC/H.264. The proposed variable block size motion estimation (VBSME) architecture consists of a 16x16 PE array, an adder tree and comparators to find all 41 motion vectors and their minimum SADs for the 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4 blocks. It employs a 2-D datapath, and its control of the search area data is simple and regular. By employing a preload register and a search data buffer inside each PE, the proposed VBSME achieves 100% PE utilization and allows real-time processing of 4CIF (704x576) video at 15 fps at 100MHz for a search range of [-32, +31].
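The 41 motion vectors follow from the partition counts per 16x16 macroblock: 1 + 2 + 2 + 4 + 8 + 8 + 16 = 41. An adder tree can derive every larger-partition SAD from the sixteen 4x4 SADs of each candidate vector. The software model below shows that reuse plus the full-search minimum per partition; it is a sketch of the general technique with hypothetical names, not the paper's RTL.

```python
# H.264 partition sizes as (width, height) in pixels
PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def merge_sads(sad4):
    """sad4[r][c]: SAD of the 4x4 sub-block at block row r, column c (0..3)
    for one candidate vector. Returns the SADs of all 41 partitions keyed by
    (width, height, top_block_row, left_block_col)."""
    out = {}
    for w, h in PARTITIONS:
        wb, hb = w // 4, h // 4  # partition size in 4x4 blocks
        for r0 in range(0, 4, hb):
            for c0 in range(0, 4, wb):
                out[(w, h, r0, c0)] = sum(
                    sad4[r][c] for r in range(r0, r0 + hb) for c in range(c0, c0 + wb))
    return out


def best_vectors(sad4_by_mv):
    """Full search: keep the minimum SAD and its motion vector per partition."""
    best = {}
    for mv, grid in sad4_by_mv.items():
        for part, s in merge_sads(grid).items():
            if part not in best or s < best[part][0]:
                best[part] = (s, mv)
    return best
```

Because each candidate's sixteen 4x4 SADs are summed once into all 41 partitions, the comparators can track the 41 minima in parallel instead of recomputing SADs per block size.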
|
|
|
Automatic synthesis and scheduling of multirate DSP algorithms |
| |
Ying Yi,
Mark Milward,
Sami Khawam,
Ioannis Nousias,
Tughrul Arslan
|
|
Pages: 635-638 |
|
doi>10.1145/1120725.1120981 |
|
To date, most high-level synthesis systems do not automatically solve current design problems, such as those related to timing, associated with the physical implementation of multirate DSP architectures, while others do not efficiently trade off the area and speed of algorithms for such architectures. This paper presents an automatic synthesis methodology that combines retiming techniques with folding transformations to solve the timing problems associated with implementing multirate DSP algorithms. We demonstrate that techniques for modeling computational unit latencies, which can influence the parameterisation of a multirate DSP IP core, can lead to highly efficient solutions. This is illustrated using a polyphase IIR IDCT example. Using the folding transformation, the control circuit for a hardware-sharing multirate DSP is also presented.
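Folding maps N operations onto one hardware unit. In the standard formulation from the VLSI DSP literature (the abstract does not state the paper's exact variant), an edge U→V carrying w(e) delays becomes D_F(U→V) = N·w(e) − P_U + v − u folded delays, where P_U is the pipelining depth of U's unit and u, v are the time slots of U and V. A negative D_F means retiming must be applied before folding, which is why the two techniques are combined. The helper names below are hypothetical.

```python
def folded_delays(edges, N):
    """D_F(U->V) = N*w(e) - P_U + v - u for each edge.
    edges: name -> (w, P_U, u, v); N: folding factor."""
    return {name: N * w - p_u + v - u for name, (w, p_u, u, v) in edges.items()}


def needs_retiming(edges, N):
    """Edges whose folded delay is negative; they cannot be folded as-is."""
    return sorted(name for name, d in folded_delays(edges, N).items() if d < 0)
```

With N = 4, an edge with one delay, a 2-stage source unit, u = 3 and v = 1 folds to 4 − 2 + 1 − 3 = 0 registers, while a delay-free edge from a 1-stage unit in the same slot gives −1 and must first be retimed.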
|
|
|
SESSION: Low power and special purpose FPGAs |
| |
Lei He,
Yu-Liang Wu
|
|
|
|
|
A high performance synthesisable unsymmetrical reconfigurable fabric for heterogeneous finite state machines |
| |
Zhenyu Liu,
Tughrul Arslan,
Sami Khawam,
Iain Lindsay
|
|
Pages: 639-644 |
|
doi>10.1145/1120725.1120983 |
|
The use of synthesizable reconfigurable cores in system-on-chip (SoC) designs is an increasing trend. Such domain-specific cores are used for their flexibility, powerful functionality and low power consumption. A reconfigurable Finite State Machine (FSM) is always required for control in any reconfigurable SoC. This paper presents a novel unbalanced, unsymmetrical reconfigurable architecture for generic FSMs. Compared with commercial FPGA devices, the new architecture reduces area by 43% and power consumption by 82%.
|
|
|
Routing track duplication with fine-grained power-gating for FPGA interconnect power reduction |
| |
Yan Lin,
Fei Li,
Lei He
|
|
Pages: 645-650 |
|
doi>10.1145/1120725.1120984 |
|
Power has become an increasingly important design constraint for FPGAs in nanometer technologies, and global interconnects should be the focus of FPGA power reduction because they consume more power than logic cells. We design area-efficient circuits for programmable fine-grained power-gating of individual unused interconnect switches, which dramatically reduces interconnect leakage power because, for the sake of programmability, interconnect switches have an intrinsically low utilization rate. The low-leakage interconnect via power-gating reduces total power by 38.18% for an FPGA in 100nm technology. Furthermore, it enables interconnect dynamic power reduction. We design a routing channel containing abundant or duplicated routing tracks with predetermined high and low Vdd, and develop a routing algorithm that uses low Vdd for non-critical routing to reduce dynamic power. The track-duplicated routing channel has low leakage power and increases the FPGA power reduction to 45.00%.
|
|
|
Exploiting temporal idleness to reduce leakage power in programmable architectures |
| |
Rajarshee P. Bharadwaj,
Rajan Konar,
Poras T. Balsara,
Dinesh Bhatia
|
|
Pages: 651-656 |
|
doi>10.1145/1120725.1120985 |
|
One of the biggest challenges that programmable devices such as FPGAs face in the ultra-deep-submicron regime is the exponential rise in leakage power consumption. As technology shrinks below 90nm, a new design paradigm has to evolve to tackle leakage power consumption. In this work we focus on a new design methodology that reduces leakage power by exploiting temporal locality in designs and accordingly grouping them into clusters that can be switched on and off. We propose a Power State Controller-based method that controls the switching of the clusters from one state to another. We demonstrate our technique using Data Flow Graphs, in which temporal locality can be effectively exploited. Our results show that substantial leakage savings can be achieved if the temporal idleness of designs is exploited effectively.
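Switching a cluster off only pays when it stays idle long enough to recover the energy spent turning it off and on again. The abstract does not give the controller's policy, so the sketch below uses a standard break-even rule as an assumption: gate an idle interval only if it exceeds t_be = E_switch / P_leak, and treat leakage while off as zero. Parameter values in the example are hypothetical.

```python
def gating_savings(idle_intervals, p_leak, e_switch):
    """Net energy saved by turning a cluster off during its idle intervals.
    p_leak: leakage power while on; e_switch: energy overhead of one
    off/on cycle. Intervals shorter than the break-even time stay on."""
    t_break_even = e_switch / p_leak
    return sum(p_leak * t - e_switch for t in idle_intervals if t > t_break_even)
```

With P_leak = 2 and E_switch = 6 (break-even time 3), idle intervals of 1, 5 and 10 time units save 0 + 4 + 14 = 18 energy units.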
|
|
|
Methodology for high level estimation of FPGA power consumption |
| |
Vijay Degalahal,
Tim Tuan
|
|
Pages: 657-660 |
|
doi>10.1145/1120725.1120986 |
|
Power consumption in FPGA designs calls for power-aware design and power budgeting early in the design cycle. In this work, we leverage the FPGA architecture to present an efficient and accurate methodology for pre-silicon dynamic power estimation of FPGA-based designs. Our methodology uses device-level simulations to characterize a coarse-grained architectural model and incorporates architectural parameters to estimate the dominant wire capacitance. This approach not only reduces the need for tedious and time-consuming silicon characterization but also ensures accurate pre-silicon power predictions. We apply the methodology to estimate the power consumption of the state-of-the-art Spartan-3™ FPGA family, evaluate the estimation results against silicon measurements, and present a detailed power breakdown of the FPGA. Our results show that the routing resources and the clock consume the most power.
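A first-order dynamic power estimate, once per-net capacitances and activities are known, is the switched-capacitance formula P = Σ ½·C·Vdd²·f·a, where a is the average number of transitions per clock cycle. The paper characterizes its model from device-level simulations; the sketch below only shows how such a model rolls up into a per-resource breakdown, with hypothetical capacitances, activities and supply values.

```python
def dynamic_power(nets, vdd, f_clk):
    """P = sum(0.5 * C * Vdd^2 * f * a) over (C, a) pairs, where a is the
    average number of transitions per clock cycle on the net."""
    return sum(0.5 * c * vdd ** 2 * f_clk * a for c, a in nets)


def breakdown(groups, vdd, f_clk):
    """Per-resource power, e.g. {"clock": [(C, a), ...], "routing": [...]}."""
    return {name: dynamic_power(nets, vdd, f_clk) for name, nets in groups.items()}
```

A 10 pF clock net (two transitions per cycle) at 1.2 V and 100 MHz dissipates about 1.44 mW, while 50 pF of routing toggling once every four cycles dissipates about 0.9 mW; the clock's activity of 2 is why it ranks so high despite its smaller capacitance.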

Leakage control in FPGA routing fabric
Suresh Srinivasan,
A. Gayasen,
N. Vijaykrishnan,
T. Tuan
Pages: 661-664
doi>10.1145/1120725.1120987
As FPGA designs in 65 nm technology are explored, reducing leakage power becomes an important design issue. A significant portion of FPGA leakage is consumed by unused multiplexers in the interconnect fabric. This work focuses on reducing the leakage of these unused multiplexers by controlling their inputs. We investigate the design issues involved in implementing such a technique and present experimental results demonstrating the effectiveness of our approach.

SESSION: RF circuit design and design methodology
Koichiro Mashiko,
Wing Hung Ki

A 1GHz CMOS fourth-order continuous-time bandpass sigma delta modulator for RF receiver front end A/D conversion
K. Praveen Jayakar Thomas,
Ram Singh Rana,
Yong Lian
Pages: 665-670
doi>10.1145/1120725.1120989
A design and circuit implementation of a CMOS fourth-order continuous-time bandpass fs/4 sigma-delta modulator is presented. The fully differential architecture of the modulator includes integrated LC resonators with active Q enhancement. It also includes return-to-zero and half-return-to-zero latches that drive the feedback switched-current-source DACs. The modulator is designed in a 0.18 μm 1P6M CMOS process, occupies a total area of 1.8 mm², and dissipates 290 mW from a 1.8 V power supply. At a sampling rate of 4 GHz and a 1 GHz signal with 500 kHz bandwidth, the circuit achieves a peak signal-to-noise-and-distortion ratio (SNDR) of 38 dB. A CMOS implementation of the modulator makes it feasible to integrate the subsequent DSP circuits on the same chip in an RF receiver. This paper aims to provide a CMOS solution for RF signals in the 1 GHz range.
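The paper's modulator is continuous-time, so the following is only a discrete-time analogy. A fourth-order bandpass noise transfer function centered at fs/4 can be obtained from a second-order lowpass NTF, (1 - z^-1)^2, by the classic substitution z^-1 -> -z^-2. The result is NTF(z) = (1 + z^-2)^2, whose zeros at z = ±j place the noise notch at fs/4. The sketch evaluates that NTF; it is not the authors' loop filter.

```python
import cmath
import math


def ntf_bandpass_fs4(f_over_fs: float) -> complex:
    """NTF(z) = (1 + z^-2)^2, derived from (1 - z^-1)^2 via z^-1 -> -z^-2."""
    z_inv = cmath.exp(-2j * math.pi * f_over_fs)  # z^-1 evaluated on the unit circle
    return (1 + z_inv ** 2) ** 2


def ntf_magnitude_db(f_over_fs: float) -> float:
    mag = abs(ntf_bandpass_fs4(f_over_fs))
    return 20 * math.log10(max(mag, 1e-12))  # clamp so the exact notch does not hit log(0)


if __name__ == "__main__":
    # With fs = 4 GHz the notch falls at 1 GHz, the signal frequency quoted in the abstract.
    fs = 4e9
    for f in (0.5e9, 0.9e9, 0.999e9, 1.0e9, 1.5e9):
        print(f"{f / 1e9:6.3f} GHz: {ntf_magnitude_db(f / fs):8.1f} dB")
```

Quantization noise is pushed away from fs/4, which is why a 4 GHz sampling clock suits a 1 GHz RF input.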

An elitist distributed particle swarm algorithm for RF IC optimization
Min Chu,
David J. Allstot
Pages: 671-674
doi>10.1145/1120725.1120990
An RF IC optimization methodology based on an elitist distributed particle swarm optimization algorithm is presented. By including a Pareto ranking mechanism and elitism in the algorithm, design alternatives and tradeoff information are provided with high efficiency. Post-optimization Monte-Carlo simulations are performed to assess first-order yield performance and aid in the selection of the final design. The approach is validated through the synthesis of a 5.2GHz direct-conversion front-end in 180nm CMOS.
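Pareto ranking, one ingredient named in the abstract, sorts candidate designs into successive non-dominated fronts. Below is a minimal sketch for objectives that are all minimized. The (noise figure, power) tuples are hypothetical, and the swarm velocity update and the elitist archive are not shown.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_ranks(points: List[Sequence[float]]) -> List[int]:
    """Rank 1 = non-dominated front, rank 2 = front after removing rank 1, and so on."""
    ranks = [0] * len(points)
    remaining = set(range(len(points)))
    rank = 1
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining if j != i)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


if __name__ == "__main__":
    # Hypothetical (noise figure in dB, power in mW) pairs for candidate front-ends.
    designs = [(3.0, 20.0), (2.5, 30.0), (3.5, 15.0), (3.2, 25.0), (4.0, 40.0)]
    print(pareto_ranks(designs))
```

The first three designs are mutual tradeoffs and form front 1. Design (3.2, 25.0) is beaten by (3.0, 20.0) and (4.0, 40.0) is beaten by both, so they land in fronts 2 and 3.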

Phase-locked loop synthesis using hierarchical divide-and-conquer multi-optimization
Min Chu,
David J. Allstot
Pages: 675-678
doi>10.1145/1120725.1120991
A hierarchical divide-and-conquer multi-optimization methodology for phase-locked loop synthesis is presented. By optimizing each building block in the PLL separately with various optimization techniques, high optimization efficiency and good circuit performance are achieved. The methodology is validated with the synthesis of a 1GHz third-order PLL in 240nm SiGe BiCMOS.

A 10Gb/s transmitter with multi-tap FIR pre-emphasis in 0.18μm CMOS technology
Miao Li,
Tad Kwasniewski,
Shoujun Wang,
Yuming Tao
Pages: 679-682
doi>10.1145/1120725.1120992
A 10 Gb/s current-mode logic (CML) transmitter with multi-tap finite impulse response (FIR) pre-emphasis has been implemented in 0.18 μm CMOS technology. A half-rate clock retiming circuit for generating symbol-spaced data is proposed to relax the speed requirement of traditional full-rate clock retiming. HSPICE simulation results of a 5-tap FIR transmitter show that the eye, which is closed after a 34-inch FR4 backplane, can be opened to 0.72 UI at 10 Gb/s. The transmitter dissipates 50 mW from a 1.8 V supply.
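Transmit FIR pre-emphasis forms each output symbol as a weighted sum of the current and preceding data symbols, y[n] = sum_k c_k * x[n-k]. The sketch applies causal, symbol-spaced taps to an NRZ stream. The tap values are hypothetical, and a real 5-tap design may also use pre-cursor taps. The half-rate retiming and the CML output stage are hardware details not modeled here.

```python
from typing import List, Sequence


def fir_preemphasis(symbols: Sequence[int], taps: Sequence[float]) -> List[float]:
    """Symbol-spaced FIR pre-emphasis: y[n] = sum_k taps[k] * x[n - k].

    Symbols before the start of the stream are treated as zero.
    """
    out = []
    for n in range(len(symbols)):
        acc = 0.0
        for k, c in enumerate(taps):
            if n - k >= 0:
                acc += c * symbols[n - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    bits = [1, 1, 1, -1, 1, -1, -1, -1]  # NRZ symbols (+1 / -1)
    taps = [0.75, -0.25]                 # hypothetical main tap and one post-cursor tap
    print(fir_preemphasis(bits, taps))
```

Symbols right after a transition come out at full swing (±1.0), while repeated bits are attenuated (±0.5). This boosts the high-frequency content that the backplane attenuates.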

A dynamic reconfigurable RF circuit architecture
Kenichi Okada,
Yoshiaki Yoshihara,
Hirotaka Sugawara,
Kazuya Masu
Pages: 683-686
doi>10.1145/1120725.1120993
This paper proposes a dynamically reconfigurable architecture for analog RF circuits. The architecture consists of RF circuits and a control circuit. The RF circuits can be reconfigured through transistor bias voltages and variable passive components, and RF circuit blocks can also be switched dynamically. The proposed architecture can realize a multi-band/multi-mode RF circuit on a single chip for software-defined radio, considerably reducing circuit area and power consumption. In addition, dynamic reconfiguration makes the RF circuits robust against process variation, dynamic temperature changes, and similar effects.

Prediction of LC-VCOs' tuning curves with period calculation technique
Zhangwen Tang,
Jie He,
Hongyan Jian,
Haiqing Zhang,
Jie Zhang,
Hao Min
Pages: 687-690
doi>10.1145/1120725.1120994
This paper describes a new method for predicting the tuning curves of an LC-tank voltage-controlled oscillator (VCO) using a period calculation technique. With this technique, oscillator tuning curves are predicted more accurately than with the traditional harmonic approximation. The theoretical analyses are validated experimentally with a CMOS complementary LC-tank VCO implemented in a 0.35 μm 1P4M pure-logic CMOS process.

SESSION: Design techniques in embedded and real-time system
Soon-Hoi Ha,
Chenglian Peng

Hardware/software partitioning for platform-based design method
Zhihui Xiong,
Jihua Chen,
Sikun Li
Pages: 691-696
doi>10.1145/1120725.1120996
The Ant System Algorithm offers positive feedback and efficient convergence in optimal search, but it lacks initial pheromone, which greatly limits its search speed. Targeting Platform-Based Design of Systems-on-a-Chip, we present a hardware/software bi-partitioning algorithm based on an Ant System Algorithm with Initial Pheromone. The main ideas are:
(a) Reuse the partitioning result of the reference design, provided by the Platform-Based Design method, as the current design's initial partitioning, and convert it into the initial pheromone needed by the Ant System Algorithm.
(b) Search for the optimal partitioning scheme with the Ant System Algorithm, starting from this initial pheromone.
Our algorithm exploits the system-level reuse inherent in Platform-Based Design to overcome the drawbacks of the Ant System Algorithm. Experiments show that it improves the efficiency of the Ant System Algorithm by 40% on average.
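Below is a minimal sketch of ideas (a) and (b). It rests on several assumptions of ours, not the authors' formulation:
- each task gets a binary HW/SW choice;
- the cost model is hypothetical: total execution time plus a penalty for exceeding a hardware-area budget;
- the Ant System rules are standard (evaporation rate rho, deposits inversely proportional to cost);
- the 0.9/0.1 pheromone bias toward the reference design is arbitrary.

```python
import random

SW, HW = 0, 1  # per-task mapping choices


def cost(part, sw_time, hw_time, hw_area, area_budget):
    """Hypothetical cost: total execution time plus a large penalty if HW area exceeds the budget."""
    time = sum(hw_time[i] if p == HW else sw_time[i] for i, p in enumerate(part))
    area = sum(hw_area[i] for i, p in enumerate(part) if p == HW)
    return time + (1000.0 * (area - area_budget) if area > area_budget else 0.0)


def initial_pheromone(reference, bias=0.9):
    """Idea (a): turn the reference design's partition into initial pheromone favoring its choices."""
    return [[bias if ref == c else 1.0 - bias for c in (SW, HW)] for ref in reference]


def ant_system(reference, sw_time, hw_time, hw_area, area_budget,
               n_ants=10, n_iters=30, rho=0.1, seed=0):
    """Idea (b): Ant System search starting from the reference-derived pheromone."""
    rng = random.Random(seed)
    tau = initial_pheromone(reference)
    best = list(reference)
    best_cost = cost(best, sw_time, hw_time, hw_area, area_budget)
    for _ in range(n_iters):
        solutions = []
        for _ in range(n_ants):
            # Each ant maps task i to HW with probability tau[i][HW] / (tau[i][SW] + tau[i][HW]).
            part = [HW if rng.random() < t[HW] / (t[SW] + t[HW]) else SW for t in tau]
            c = cost(part, sw_time, hw_time, hw_area, area_budget)
            solutions.append((c, part))
            if c < best_cost:
                best, best_cost = part, c
        # Evaporate all trails, then let each ant deposit pheromone inversely proportional to its cost.
        for t in tau:
            t[SW] *= 1.0 - rho
            t[HW] *= 1.0 - rho
        for c, part in solutions:
            for i, p in enumerate(part):
                tau[i][p] += 1.0 / (1.0 + c)
    return best, best_cost


if __name__ == "__main__":
    # Hypothetical 5-task example; `reference` stands in for the reused reference-design partition.
    sw_time = [10, 8, 6, 9, 4]
    hw_time = [2, 3, 2, 3, 1]
    hw_area = [5, 4, 3, 6, 2]
    reference = [HW, SW, SW, HW, SW]
    print(ant_system(reference, sw_time, hw_time, hw_area, area_budget=12))
```

The reference partition is evaluated first, so the search can only return it or something cheaper. Seeding the pheromone with the reference is what spares the colony a cold start.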

Abstracting functionality for modular performance analysis of hard real-time systems
Ernesto Wandeler,
Lothar Thiele
Pages: 697-702
doi>10.1145/1120725.1120997
System-level performance analysis techniques play an important role in the design of complex embedded systems. They allow designers to analyze essential characteristics of a system design at an early stage and thereby support important design decisions. Analytical methods for system-level performance analysis yield hard bounds. However, those bounds are often overly pessimistic because such methods can incorporate only limited detail about the system. To overcome this problem, we present new abstract models for event streams and system components of embedded systems, and show how these models can be combined into modules for modular performance analysis. With the presented models, we can capture complex functional properties of systems. Examples include caches, variable resource demand of the events in an event stream, and arbitrary up- and down-sampling of event streams in a system component. The applicability of our models and their advantages over traditional performance-analysis models are shown in a case study of a system component with an LRU (Least Recently Used) cache.
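For context, consider the traditional curve-based analysis that such models refine. An upper arrival curve alpha(D) bounds the events arriving in any window of length D, and a lower service curve beta(D) bounds the service delivered. The backlog bound is the maximum vertical distance between the curves and the delay bound is the maximum horizontal distance. The discrete-time sketch computes both for sampled curves. The example curves are hypothetical, and the paper's new workload and cache abstractions are not modeled.

```python
from typing import Sequence


def backlog_bound(alpha: Sequence[float], beta: Sequence[float]) -> float:
    """Maximum vertical distance: max over D of alpha(D) - beta(D)."""
    return max(a - b for a, b in zip(alpha, beta))


def delay_bound(alpha: Sequence[float], beta: Sequence[float]) -> int:
    """Maximum horizontal distance: for each D, the smallest tau >= 0 with beta(D + tau) >= alpha(D)."""
    worst = 0
    for d, demand in enumerate(alpha):
        tau = 0
        while d + tau < len(beta) and beta[d + tau] < demand:
            tau += 1
        if d + tau >= len(beta):
            raise ValueError("sampling horizon too short to bound the delay")
        worst = max(worst, tau)
    return worst


if __name__ == "__main__":
    horizon = range(20)
    # Hypothetical curves: bursts of up to 3 events plus 1 event per tick, served by a
    # resource that delivers 2 events per tick after an initial latency of 2 ticks.
    alpha = [3 + d for d in horizon]
    beta = [max(0, 2 * (d - 2)) for d in horizon]
    print("backlog bound:", backlog_bound(alpha, beta))  # in events
    print("delay bound:", delay_bound(alpha, beta))      # in ticks
```

Both bounds are worst-case guarantees. Capturing more functional detail, as the paper proposes, tightens the curves and therefore the bounds.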

Optimizing intra-task voltage scheduling using data flow analysis
Dongkun Shin,
Jihong Kim
Pages: 703-708
doi>10.1145/1120725.1120998
Intra-task voltage scheduling (IntraDVS), which adjusts the supply voltage within an individual task boundary, is an effective technique for developing low-power applications. In IntraDVS, slack times are estimated by analyzing the program's control flow. In this paper, we propose an optimization technique for IntraDVS that uses data flow information. The proposed algorithm improves energy efficiency by moving voltage-scaling points to earlier instructions, based on an analysis of the program's data flow. Experimental results on an MPEG-4 encoder program show that the proposed algorithm reduces energy consumption by 40-45% compared with the original IntraDVS algorithm.
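A simple energy model shows why moving a voltage-scaling point earlier helps. Assume energy per cycle scales with V^2 and V scales linearly with clock frequency, so energy is proportional to cycles * f^2. The sketch compares two points for lowering the frequency to just meet the deadline: a late one, and an earlier one where the same slack is already known. The cycle counts, the deadline and the linear V-f model are hypothetical simplifications, not the paper's energy model.

```python
def energy(segments):
    """Relative energy of (cycles, frequency) segments, assuming energy per cycle ~ V^2 ~ f^2."""
    return sum(cycles * freq ** 2 for cycles, freq in segments)


def scaled_run(total_worst, deadline, scale_at, remaining_worst):
    """Run at worst-case speed until `scale_at` cycles, then slow down to just meet the deadline.

    total_worst:     worst-case cycles assumed when the task starts
    scale_at:        cycles executed before the voltage-scaling point
    remaining_worst: worst-case cycles left after that point, as established by the analysis
    """
    f0 = total_worst / deadline  # initial frequency that guarantees the deadline
    elapsed = scale_at / f0
    f1 = remaining_worst / (deadline - elapsed)
    return [(scale_at, f0), (remaining_worst, f1)]


if __name__ == "__main__":
    # Hypothetical task: worst case 100 Mcycles, deadline 1 s; the taken path needs only 60 Mcycles.
    late = scaled_run(100, 1.0, scale_at=40, remaining_worst=20)   # slack detected late
    early = scaled_run(100, 1.0, scale_at=10, remaining_worst=50)  # same slack detected earlier
    print(f"late:  {energy(late):.0f}")
    print(f"early: {energy(early):.0f}")
```

Both schedules execute the same 60 Mcycles and finish exactly at the deadline. The early one runs more of those cycles at a low frequency, so its quadratic energy term is much smaller.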

FD-HGAC: a hybrid heuristic/genetic algorithm hardware/software co-synthesis framework with fault detection
John Conner,
Yuan Xie,
Mahmut Kandemir,
Robert Dick,
Greg Link
Pages: 709-712
doi>10.1145/1120725.1120999
Embedded real-time systems are becoming increasingly complex. To combat the rising design cost of these systems, co-synthesis tools have been developed that map tasks onto systems containing both software and specialized hardware. As transient fault rates increase with technology scaling, embedded systems must be designed to be fault tolerant in order to maintain system reliability. This paper presents and analyzes FD-HGAC, a tool that uses a genetic algorithm and heuristics to design real-time systems with partial fault detection. Across numerous trials, the tool produces systems with an average of 22% detection coverage at no cost or performance penalty.

Compiler-directed selective data protection against soft errors
G. Chen,
M. Kandemir,
M. J. Irwin,
G. Memik
Pages: 713-716
doi>10.1145/1120725.1121000
Soft errors in electronic devices are a growing concern for many embedded systems in diverse domains. Chip vendors are already working with system customers on ways to guard against the effects of soft errors. Error-code-based protection mechanisms for memories, such as ECC, are important. However, applying them indiscriminately to all data can incur serious memory-space and energy overheads. This paper demonstrates how an optimizing compiler can decide which data elements need to be protected, based on user-specified annotations. The proposed idea uses a variant of forward slicing.
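Forward slicing collects everything that can be affected by a chosen starting point by following def-use (data-dependence) edges. Below is a minimal sketch of a generic forward slice over a hypothetical dependence graph, seeded with user-annotated variables. The abstract does not say how the paper's variant differs from plain forward slicing, so only the basic traversal is shown.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def forward_slice(uses: Dict[str, List[str]], seeds: Iterable[str]) -> Set[str]:
    """Return all variables reachable from `seeds` along def-use edges.

    uses[v] lists the variables whose values are computed from v.
    """
    visited = set(seeds)
    queue = deque(visited)
    while queue:
        v = queue.popleft()
        for w in uses.get(v, []):
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return visited


if __name__ == "__main__":
    # Hypothetical program: key -> round_key -> state -> out; tmp and log are unrelated.
    uses = {
        "key": ["round_key"],
        "round_key": ["state"],
        "state": ["out"],
        "tmp": ["log"],
    }
    print(sorted(forward_slice(uses, ["key"])))  # candidates for protection
```

Only the data reachable from the annotated seed would be flagged for protection. In this example `tmp` and `log` stay unprotected, which is where the memory and energy savings come from.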

SESSION: Crosstalk noise avoidance and power/ground network optimization
David Z. Pan,
Dennis Sylvester

A perturbation-aware noise convergence methodology for high frequency microprocessors
Prashant Saxena,
Kumar N. Lalgudi,
Hans J. Greub,
Janet M. Wang-Roveda
Pages: 717-722
doi>10.1145/1120725.1121002
We present a practical flow that automates the process of analyzing noise failures and determining and implementing the most appropriate design fixes in high performance designs. For each noise problem, the flow implicitly identifies the most sensitive relevant electrical parameter(s) which it then maps to a physical solution that minimizes design perturbation. Integrated with standard physical synthesis, it was used extensively in a high volume 90 nm multi-GHz microprocessor project.

Successive pad assignment algorithm to optimize number and location of power supply pad using incremental matrix inversion
Takashi Sato,
Masanori Hashimoto,
Hidetoshi Onodera
Pages: 723-728
doi>10.1145/1120725.1121003
An efficient pad assignment algorithm that minimizes voltage drop on a power distribution network is proposed. Combining successive pad assignment (SPA) with incremental matrix inversion (IMI) efficiently determines both the location and the number of power supply pads. SPA creates an equivalent resistance matrix that keeps both the candidate pads and the power consumption points as external ports. As a result, topological modifications caused by connecting or disconnecting voltage sources and candidate pads are represented consistently. By reusing submatrices of the equivalent matrix, SPA greedily searches for the next pad location that minimizes the worst-case voltage drop. Each time a candidate pad is added, IMI significantly reduces the computational cost. Experimental results show that the proposed procedures enumerate the pad order efficiently in practical runtime.
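The sketch below shows why incremental inversion pays off in a greedy loop. Model the grid by a conductance matrix G, so that the IR drops d satisfy G d = I. Connecting a pad at node i through conductance g adds the rank-one term g * e_i * e_i^T to G. The Sherman-Morrison formula then updates G^-1 in O(n^2) instead of re-inverting in O(n^3).

This is an illustrative assumption, not necessarily the paper's formulation:
- IMI is treated as a Sherman-Morrison update;
- the 5-node chain, the currents and the pad conductance are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def invert(a: Matrix) -> Matrix:
    """Gauss-Jordan inverse with partial pivoting; used once for the starting matrix."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [x / p for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0.0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n:] for row in m]


def add_pad_update(inv: Matrix, i: int, g: float) -> Matrix:
    """Sherman-Morrison: (G + g e_i e_i^T)^-1 = G^-1 - g (G^-1 e_i)(e_i^T G^-1) / (1 + g G^-1[i][i])."""
    n = len(inv)
    col = [inv[r][i] for r in range(n)]
    row = inv[i][:]
    denom = 1.0 + g * inv[i][i]
    return [[inv[r][c] - g * col[r] * row[c] / denom for c in range(n)] for r in range(n)]


def worst_drop(inv: Matrix, currents: List[float]) -> float:
    """Worst IR drop: max over nodes of (G^-1 I)[node]."""
    n = len(currents)
    return max(sum(inv[r][c] * currents[c] for c in range(n)) for r in range(n))


def greedy_pads(g_grid: Matrix, currents: List[float], candidates: List[int],
                first_pad: int, n_pads: int, g_pad: float):
    """Add pads one at a time, each time picking the candidate that minimizes the worst drop."""
    g = [row[:] for row in g_grid]
    g[first_pad][first_pad] += g_pad  # one initial pad makes the grid matrix non-singular
    inv = invert(g)
    chosen = [first_pad]
    for _ in range(n_pads - 1):
        best = min((c for c in candidates if c not in chosen),
                   key=lambda c: worst_drop(add_pad_update(inv, c, g_pad), currents))
        inv = add_pad_update(inv, best, g_pad)  # O(n^2) update, no fresh inversion
        chosen.append(best)
    return chosen, worst_drop(inv, currents)


if __name__ == "__main__":
    # Hypothetical 5-node resistive chain (unit conductance per segment), 1 unit of current per node.
    chain = [[1.0, -1.0, 0.0, 0.0, 0.0],
             [-1.0, 2.0, -1.0, 0.0, 0.0],
             [0.0, -1.0, 2.0, -1.0, 0.0],
             [0.0, 0.0, -1.0, 2.0, -1.0],
             [0.0, 0.0, 0.0, -1.0, 1.0]]
    pads, drop = greedy_pads(chain, [1.0] * 5, [1, 2, 3, 4], first_pad=0, n_pads=2, g_pad=10.0)
    print(pads, round(drop, 3))
```

With a single pad at node 0, the far end of the chain sees a drop of 10.5. The greedy step picks node 3 rather than the chain end, because node 3 splits the load more evenly, and the worst drop falls to about 1.297.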

A unified transformational approach for reductions in fault vulnerability, power, and crosstalk noise & delay on processor buses
Raid Ayoub,
Alex Orailoglu
Pages: 729-734
doi>10.1145/1120725.1121004
In this paper we propose a coding scheme for general-purpose applications that reduces power dissipation, crosstalk noise, and crosstalk delay on bus lines while simultaneously detecting errors at run time. Power dissipation is reduced by lowering bus switching activity. The scheme reduces not only the switching activity on individual lines but also the coupling activity across adjacent lines, which is the major contributor to overall power dissipation in deep-submicron technology. Detailed analysis of crosstalk noise and delay shows that eliminating certain transition patterns, and reducing the infeasible ones in terms of crosstalk noise and power dissipation, is a feasible strategy for alleviating these problems. We propose an encoding technique that generates codewords from predefined transition patterns, one for each possible combination of input data. Restricting the encoder to these predefined patterns enables fast encoding and low hardware overhead. This work presents an extensive analysis of the resulting reduction in crosstalk and power.

SPICE-derived experimental results show reductions in worst-case crosstalk delay and noise of up to 24% and 10%, respectively. Extensive experimental results for various applications show significant reductions in power dissipation: up to 44% for switching activity on the bus lines and up to 25% for coupling activity. The results also show a drastic reduction, of up to 98%, in the number of patterns that are most likely to produce crosstalk errors.
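The sketch below makes switching activity and coupling activity concrete for a single bus transition from word u to word v. Self switching counts the lines that toggle. Coupling activity is weighted here by how adjacent lines move relative to each other; opposite transitions on neighbors are the worst case. The sketch uses one common convention, the 1 + k*lambda crosstalk classification with a coupling energy proportional to (d_i - d_j)^2. The abstract does not list the exact patterns the authors eliminate, so this is only a measurement aid, not their encoder.

```python
from typing import List, Tuple


def transitions(u: int, v: int, width: int) -> List[int]:
    """Per-line transition: +1 for 0->1, -1 for 1->0, 0 for no change (bit 0 = line 0)."""
    return [((v >> i) & 1) - ((u >> i) & 1) for i in range(width)]


def bus_activity(u: int, v: int, width: int) -> Tuple[int, int, int]:
    """Return (self switching, coupling activity, worst crosstalk class k).

    Coupling activity sums (d_i - d_j)^2 over adjacent pairs, proportional to coupling-cap energy.
    For a switching line, k = sum over its neighbors of |d_i - d_j|, so its effective capacitance
    is (1 + k * lambda) * C_line; k = 4 (both neighbors switching the opposite way) is the worst case.
    """
    d = transitions(u, v, width)
    switching = sum(1 for x in d if x != 0)
    coupling = sum((d[i] - d[i + 1]) ** 2 for i in range(width - 1))
    worst_k = 0
    for i, x in enumerate(d):
        if x == 0:
            continue
        k = sum(abs(x - d[j]) for j in (i - 1, i + 1) if 0 <= j < width)
        worst_k = max(worst_k, k)
    return switching, coupling, worst_k


if __name__ == "__main__":
    print(bus_activity(0b010, 0b101, 3))  # every line switches, neighbors in opposite directions
    print(bus_activity(0b000, 0b111, 3))  # every line switches in the same direction
```

Both example transitions toggle all three lines. Only the first produces coupling energy and the worst-case (k = 4) delay class, and patterns of that kind are what a crosstalk-avoiding code tries to exclude.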

VLSI on-chip power/ground network optimization considering decap leakage currents
Jingjing Fu,
Zuying Luo,
Xianlong Hong,
Yici Cai,
Sheldon X.-D. Tan,
Zhu Pan
Pages: 735-738
doi>10.1145/1120725.1121005
In today's power/ground (P/G) network design, on-chip decoupling capacitors (decaps) are usually made of MOS transistors with source and drain connected together. Gate leakage current worsens as the gate oxide thickness continues to shrink below 20 Å. As a result, decaps become leaky due to gate leakage in CMOS devices. In this paper, we take a first look at leaky decaps in P/G network optimization. We propose a leakage current model for practical decaps. We also present a new two-stage, leakage-current-aware approach that optimizes P/G networks efficiently while using area more efficiently.

Probabilistic congestion model considering shielding for crosstalk reduction
Jinjun Xiong,
Lei He
Pages: 739-742
doi>10.1145/1120725.1121006
We extend an existing probabilistic congestion model to consider shielding for crosstalk reduction. We then develop a multilevel router and use large industrial design examples to study the impact of various congestion models on routing congestion. We show two results.

(1) When shielding is applied as a post-routing optimization for crosstalk reduction, the existing probabilistic model outperforms a deterministic, routing-order-dependent congestion model. It reduces routing congestion by 17.1% on average under given routing area constraints, or routing area by 9.4% on average under given routing congestion constraints.

(2) Our extended probabilistic congestion model considering shielding enables shielding reservation and minimization during routing. It reduces routing congestion (or area) by 47.7% (or 31.0%) on average under given routing area (or congestion) constraints. The comparison is against the deterministic congestion model above, which cannot estimate shielding and therefore cannot minimize it during routing.
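One common form of probabilistic congestion model assumes that every monotone (shortest) route of a two-pin net within its bounding box is equally likely. The probability that the net crosses grid cell (i, j) is then a ratio of lattice-path counts. The sketch computes that map and sums it over nets. This generic model is given as an assumption: it does not include the shielding extension the paper proposes, and the paper's base model may use a different route distribution.

```python
from math import comb
from typing import List


def crossing_probabilities(m: int, n: int) -> List[List[float]]:
    """P[i][j]: probability that a uniformly random monotone route from (0, 0) to (m, n)
    passes through grid cell (i, j)."""
    total = comb(m + n, m)
    return [[comb(i + j, i) * comb((m - i) + (n - j), m - i) / total
             for j in range(n + 1)] for i in range(m + 1)]


def congestion_map(nets, width, height):
    """Sum crossing probabilities of two-pin nets given as ((x0, y0), (x1, y1)).

    Each net is mirrored into its own bounding box, so one formula covers every orientation.
    """
    grid = [[0.0] * height for _ in range(width)]
    for (x0, y0), (x1, y1) in nets:
        m, n = abs(x1 - x0), abs(y1 - y0)
        p = crossing_probabilities(m, n)
        for i in range(m + 1):
            for j in range(n + 1):
                x = x0 + (i if x1 >= x0 else -i)
                y = y0 + (j if y1 >= y0 else -j)
                grid[x][y] += p[i][j]
    return grid


if __name__ == "__main__":
    # Two hypothetical crossing nets on a 3x3 grid; the center cell collects 2/3 from each.
    grid = congestion_map([((0, 0), (2, 2)), ((2, 0), (0, 2))], 3, 3)
    for row in grid:
        print(" ".join(f"{v:.2f}" for v in row))
```

Every route crosses each anti-diagonal of its bounding box exactly once, so the probabilities along any anti-diagonal sum to 1. Unlike a deterministic, routing-order-dependent estimate, the map does not depend on which net is routed first.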

SESSION: Others in leading edge designs
Sheldon X.-D. Tan,
Hai Zhou

Customized on-chip memories for embedded chip multiprocessors
O. Ozturk,
M. Kandemir,
G. Chen,
M. J. Irwin,
M. Karakoy
Pages: 743-748
doi>10.1145/1120725.1121008
Ensuring that most data accesses are satisfied from on-chip memories is a critical problem for chip multiprocessors, as the cost of an off-chip access can be very high. In particular, multiple cores that need to access the off-chip memory system may contend with each other for the same buses/pins. On-chip memory space can be structured as shared memory or as private memory, but each option has its own drawbacks. To achieve lower power consumption than these conventional memory architectures, this paper proposes and evaluates an application-specific hybrid memory architecture with both shared and private components. A polyhedral tool captures the amount of privately accessed and shared data across processors. This information guides memory space partitioning along two dimensions: across parallel processors, and across shared and private memory components. We evaluate the resulting memory configurations on a set of benchmarks and compare them with pure private and pure shared architectures. When the same set of applications is run with the same code optimizations, the proposed hybrid memory design methodology consumes much less power than the conventional architectures.

Performance driven reliable link design for networks on chips
Rutuparna Ramesh Tamhankar,
Srinivasan Murali,
Giovanni De Micheli
Pages: 749-754
doi>10.1145/1120725.1121009
With decreasing transistor feature sizes, interconnect wire delay is becoming a major bottleneck in current Systems on Chips (SoCs). Shrinking feature sizes also make wires unreliable, as they are increasingly susceptible to noise sources such as crosstalk, coupling noise and soft errors. The growing importance of wire delay and reliability has led to a communication-centric design approach for building complex SoCs: Networks on Chip (NoC). Current NoC communication design methodologies are conservative and assume worst-case operating conditions for link design, resulting in a large latency penalty for data transmission. To substantially decrease link delay and thereby increase system performance, an aggressive design approach is needed. In this work we present Terror, a timing-error-tolerant communication system for aggressively designing the links of NoCs. Instead of avoiding timing errors through worst-case design, our methodology designs links aggressively and tolerates the resulting timing errors. Simulation results show large latency savings (up to 35%) for the Terror-based system compared with the traditional design methodology.

Dynamic power management using on demand paging for networked embedded systems
Yuvraj Agarwal,
Curt Schurgers,
Rajesh Gupta
Pages: 755-759
doi>10.1145/1120725.1121010
The power consumption of the network interface plays a major role in determining the total operating lifetime of wireless networked embedded systems. With on-demand paging, a low-power secondary radio is used to wake up the higher-power radio, allowing the latter to sleep for longer periods. In this paper we present the use of Bluetooth radios as a paging channel for 802.11b wireless LAN. We have implemented an on-demand paging scheme on an infrastructure-based WLAN consisting of iPAQ PDAs equipped with Bluetooth radios and Cisco Aironet wireless networking cards. Our results show power savings of 23% to 48% over the present 802.11b standard operating modes, with negligible impact on performance.

An FPGA implementation of low-density parity-check code decoder with multi-rate capability
Lei Yang,
Manyuan Shen,
Hui Liu,
C.-J. Richard Shi
Pages: 760-763
doi>10.1145/1120725.1121011
With their superior error-correction capability, low-density parity-check (LDPC) codes have attracted wide interest in wireless telecommunications. In the past, various structures of single-code-rate LDPC decoders have been implemented for different applications. However, wireless applications must cover a wide range of service requirements and diverse interference conditions. They therefore call for LDPC decoders that can operate at both high and low code rates. In this paper, a new multi-rate LDPC decoder architecture is presented and implemented in a Xilinx FPGA device. Three operating modes are supported through selection pins: irregular rate 1/2, regular rate 5/8 and regular rate 7/8. Measurement results show that the LDPC decoder achieves a BER below 10⁻⁵ at an SNR of 1.4 dB in the most critical case, the irregular rate-1/2 mode.
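The abstract does not name the decoding algorithm, so the sketch below is a much simpler stand-in rather than the paper's decoder. It shows the parity-check structure every LDPC decoder works with: a syndrome s = H * c mod 2, and Gallager-style hard-decision bit flipping. The 3x7 matrix is a toy choice (the Hamming(7,4) parity checks), not one of the paper's 1/2, 5/8 or 7/8 codes and not genuinely low-density. For a full-rank M x N parity-check matrix, the code rate is 1 - M/N.

```python
from typing import List

Matrix = List[List[int]]


def syndrome(h: Matrix, word: List[int]) -> List[int]:
    """s = H * word (mod 2); an all-zero syndrome means every parity check is satisfied."""
    return [sum(hij & wj for hij, wj in zip(row, word)) % 2 for row in h]


def bit_flip_decode(h: Matrix, word: List[int], max_iters: int = 20) -> List[int]:
    """Hard-decision bit flipping: repeatedly flip the bits involved in the most failed checks."""
    w = word[:]
    for _ in range(max_iters):
        s = syndrome(h, w)
        if not any(s):
            return w
        # For each bit, count the unsatisfied checks it participates in.
        votes = [sum(s[r] for r in range(len(h)) if h[r][j]) for j in range(len(w))]
        top = max(votes)
        w = [b ^ 1 if v == top else b for b, v in zip(w, votes)]
    return w


def code_rate(h: Matrix) -> float:
    """Design rate 1 - M/N, exact when H has full row rank."""
    return 1 - len(h) / len(h[0])


if __name__ == "__main__":
    # Toy parity-check matrix (Hamming(7,4) checks), used only for illustration.
    H = [[1, 1, 0, 1, 1, 0, 0],
         [1, 0, 1, 1, 0, 1, 0],
         [0, 1, 1, 1, 0, 0, 1]]
    received = [0, 0, 0, 1, 0, 0, 0]  # all-zero codeword with bit 3 flipped by the channel
    print(syndrome(H, received), bit_flip_decode(H, received), round(code_rate(H), 3))
```

A multi-rate decoder like the one described would switch among several H matrices. In this picture, the selection pins choose which check structure the same datapath evaluates.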

Single-track asynchronous pipeline controller design
Xiao Yong,
Zhou Runde
Pages: 764-768
doi>10.1145/1120725.1121012
Various applications have demonstrated that asynchronous circuits have great potential for energy-efficient, high-performance design. Asynchronous pipelines are well known to be a powerful way of implementing general computation. This paper presents two controllers. The first is a new fast asynchronous pipeline controller with forward and reverse latencies of 2 transitions. The second is a new robust QDI asynchronous pipeline controller using the Muller C-gate. The first controller reduces forward latency by 38.1% compared with the recently proposed ultra-high-speed GasP circuit and can run at 2.2 GHz in a TSMC 0.25 μm process. The second controller greatly simplifies timing verification and has lower area cost than the STFB circuit.
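The QDI controller is built around the Muller C-element (C-gate). Its output switches to the common input value when all inputs agree and otherwise holds its previous state. Below is a behavioral sketch. It models only the logic function, not the timing or transistor-level design of either controller.

```python
from typing import Iterable, List, Sequence


class CElement:
    """Behavioral Muller C-element: output follows the inputs when they all agree, else holds."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def step(self, inputs: Sequence[int]) -> int:
        if all(x == 1 for x in inputs):
            self.out = 1
        elif all(x == 0 for x in inputs):
            self.out = 0
        return self.out


def simulate(c: CElement, trace: Iterable[Sequence[int]]) -> List[int]:
    return [c.step(inp) for inp in trace]


if __name__ == "__main__":
    # Handshake-style sequence: the output rises only after both inputs rise,
    # and falls only after both have fallen.
    trace = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
    print(simulate(CElement(), trace))
```

This hold-until-agreement behavior lets a QDI stage wait for all of its request and acknowledge signals without relying on delay assumptions.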

Using data replication to reduce communication energy on chip multiprocessors
M. Kandemir,
G. Chen,
F. Li,
I. Demirkiran
Pages: 769-772
doi>10.1145/1120725.1121013
Chip multiprocessors are gaining popularity because they are well suited to data-intensive embedded and high-end processing. In particular, array-intensive embedded image and video applications can benefit greatly from these architectures due to the coarse-grain parallelism they offer. However, if not optimized, interprocessor communication can be a major energy consumer. This paper focuses on a distributed-memory chip multiprocessor architecture and array-intensive embedded applications. It proposes a compiler-based communication minimization strategy based on data replication. The proposed scheme replicates shared data items across the memories of the processors in a controlled fashion (i.e., under a memory limit), with the goal of eliminating otherwise necessary interprocessor communication.
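Below is a hedged sketch of replication under a memory limit. Each shared item has a size and an estimated interprocessor-communication volume that replicating it would remove. Items are chosen greedily by communication saved per byte until the memory budget is exhausted. The item names, the numbers and the greedy knapsack heuristic are our assumptions; the abstract does not say how the compiler selects items.

```python
from typing import Dict, List, Tuple


def choose_replicas(items: Dict[str, Tuple[int, int]], budget: int) -> List[str]:
    """items maps name -> (size_bytes, comm_saved_bytes); sizes must be positive.

    Picks items in descending order of communication saved per byte of extra memory,
    skipping any item that would push usage past `budget`.
    """
    order = sorted(items, key=lambda k: items[k][1] / items[k][0], reverse=True)
    chosen, used = [], 0
    for name in order:
        size, _ = items[name]
        if used + size <= budget:
            chosen.append(name)
            used += size
    return chosen


if __name__ == "__main__":
    # Hypothetical shared arrays: (size in bytes, communication eliminated in bytes).
    items = {
        "coeffs": (256, 4096),
        "frame": (4096, 8192),
        "lut": (1024, 4096),
        "edge": (512, 512),
    }
    print(choose_replicas(items, budget=2048))
```

The large `frame` array saves the most communication in absolute terms but does not fit in the budget, so the smaller, denser items are replicated instead. A compiler could feed such a selection with per-array access estimates from its own analysis.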

SESSION: Synthesis for FPGAs
Kia Bazargan,
Evangeline F. Y. Young

Three-dimensional place and route for FPGAs
Cristinel Ababei,
Hushrav Mogal,
Kia Bazargan
Pages: 773-778
doi>10.1145/1120725.1121015
We present timing-driven partitioning and simulated-annealing-based placement algorithms, together with a detailed routing tool, for 3D FPGA integration. The circuit is first divided into layers with a limited number of inter-layer vias and then placed on individual layers while minimizing the delay of critical paths. We use our tool as a platform to explore the potential delay and wire-length benefits that 3D technologies can offer for FPGA fabrics. Experimental results show that, with five layers in 3D integration, wire-length decreases by 21% and delay by 24% on average compared with traditional 2D chips.

Modern FPGA constrained placement
Wai-Kei Mak
Pages: 779-784
doi>10.1145/1120725.1121016
We consider the placement of FPGA designs with multiple I/O standards on modern FPGAs that support multiple I/O standards. We propose an efficient approach that solves the constrained I/O placement problem by 0-1 integer linear programming within a high-performance placement flow. We derive an elegant 0-1 integer linear program formulation. It applies not only to devices with symmetric I/O banks but also to devices with asymmetric I/O banks (i.e., different banks may have different sizes and/or support different subsets of I/O standards). Moreover, it can handle I/Os pre-locked by the user. We also show that additional restrictions, such as conditional usage of Vref pins, can easily be incorporated. Our formulation involves only a small number of 0-1 integer variables, independent of the device size and the number of I/O objects, so our approach can comfortably handle very large problem instances. Extensive experiments showed that the 0-1 integer linear program corresponding to a feasible instance of the constrained I/O placement problem can be solved in seconds.
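The sketch below shows, for one simplified constraint family, why the variable count can stay small. Each I/O bank is assigned at most one supply voltage (VCCO), and the I/Os of each voltage group must fit into the banks assigned that voltage. With binary variables x[b][v], the count depends on banks and voltages, not on how many I/O objects there are.

Several simplifications apply:
- feasibility is checked by exhaustive enumeration instead of an ILP solver;
- bank sizes, supported voltages and demands are hypothetical;
- only capacity constraints are modeled, ignoring pre-locked I/Os and Vref pins.

The paper's actual formulation is stated to be independent of device size as well, which this sketch does not show.

```python
from itertools import product
from typing import Dict, List, Optional, Sequence, Set, Tuple


def assign_banks(banks: Sequence[Tuple[int, Set[str]]],
                 demand: Dict[str, int]) -> Optional[List[Optional[str]]]:
    """Return one VCCO choice per bank (None = unused) such that every voltage group fits.

    banks:  (capacity, supported voltages) per bank; capacities and supported sets may differ,
            as in asymmetric banks.
    demand: number of I/Os required per voltage group.
    Trying every choice is equivalent to trying every 0-1 vector x[b][v] with sum_v x[b][v] <= 1.
    """
    options = [[None] + sorted(supported) for _, supported in banks]
    for choice in product(*options):
        ok = all(sum(cap for (cap, _), c in zip(banks, choice) if c == v) >= need
                 for v, need in demand.items())
        if ok:
            return list(choice)
    return None


if __name__ == "__main__":
    # Hypothetical device with three asymmetric banks.
    banks = [(8, {"3.3V", "2.5V"}), (8, {"3.3V", "2.5V", "1.8V"}), (4, {"1.8V"})]
    print(assign_banks(banks, {"3.3V": 10, "1.8V": 3}))
    print(assign_banks(banks, {"3.3V": 20}))
```

Enumeration grows exponentially with the number of banks. That is why a real tool would hand the same 0-1 variables to an ILP solver.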
Clustering techniques for coarse-grained, antifuse FPGAs
Chang Woo Kang, Massoud Pedram
Pages: 785-790
doi: 10.1145/1120725.1121017

In this paper, we present area- and performance-driven clustering techniques for coarse-grained, antifuse-based FPGAs. A macro logic cell in this class of FPGAs has multiple inputs and multiple outputs. Starting from this macro cell, a library of small logic cells is generated, and the target network is mapped with this library. For minimum-area clustering, our algorithm minimizes the number of macro logic cells required to cover the network: two linear equations are set up, and the optimal mapping solution is found from them. For performance-driven clustering, the number of macro logic cells on the critical path is minimized using an extension of Lawler's algorithm. Compared with a commercial tool, the area-driven clustering algorithm reduced the number of macro logic cells by 12.29%, and the performance-driven clustering reduced the maximum depth by 44.75%.
A novel CLB architecture and circuit packing algorithm for logic-area reduction in SRAM-based FPGAs
Vivek Garg, Vikram Chandrasekhar, M. Sashikanth, V. Kamakoti
Pages: 791-794
doi: 10.1145/1120725.1121018

The main objective of the technique presented in this paper is to exploit the relations among a set of Boolean functions so that one function can be generated from another. The paper defines a relation, termed split-equivalence, between logic functions. Using this relation, a single Look-Up Table (LUT) storing the truth table of a function F can be used to generate other functions that are split-equivalent to F, which reduces the overall logic area used to map the circuit onto the FPGA. The paper proposes a new Configurable Logic Block (CLB) architecture containing a single LUT that stores the truth table of a Boolean function F and can generate three split-equivalent functions of F. Given a set of Boolean functions to be mapped onto LUTs, the proposed technique identifies sets of four functions in which three are split-equivalent to the fourth. These sets are mapped onto the proposed CLB architecture. Compared with the standard CLBs of the Xilinx Virtex architecture, the proposed CLB occupies 26% less area, at the cost of a small increase in the number of SRAM configuration bits required to configure a CLB.
Resource sharing in pipelined CDFG synthesis
Somsubhra Mondal, Seda Öğrenci Memik
Pages: 795-798
doi: 10.1145/1120725.1121019

Efficient use of the limited resources available on an FPGA remains a crucial problem when synthesizing pipelined designs, and resource sharing addresses this challenge. In this paper, we propose resource sharing techniques that can be incorporated into an automated synthesis flow that generates pipelined designs. Given a synthesized pipelined design, we establish a direct relationship between the time slack available on modules and the multiplexing overhead caused by sharing. This flexibility is exploited maximally without violating any throughput constraint. We propose different techniques for resource sharing problems with varying restrictions: specifically, an optimal algorithm for Constant-Slack Resource Sharing and a heuristic for general Intra-Pipeline-Stage Resource Sharing. On average, our resource sharing technique reduces the demand for arithmetic functional units by 39.5% on a set of benchmarks from the multimedia domain.
SESSION: Analog circuit design
Chris Verhoeven, Junyan Ren
A 2.4-GHz linear-tuning CMOS LC voltage-controlled oscillator
Hong Zhang, Guican Chen, Ning Li
Pages: 799-802
doi: 10.1145/1120725.1121021

This paper presents a voltage-controlled oscillator (VCO) with highly linear frequency tuning. The VCO has two control inputs: one drives a pair of p+/n-well varactors for continuous tuning, and the other drives a pair of MOS varactors for band switching. This new circuit topology achieves a low tuning nonlinearity of 1.45% and a wide total tuning range of 2.33 GHz to 2.72 GHz. The simulated phase noise is only -115.7 dBc/Hz at a 600-kHz offset from 2.4 GHz. Implemented in a 0.25-μm process, the VCO dissipates only 15 mW from a 2.5-V supply.
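The tuning range above can be related to the tank through the ideal LC resonance f = 1/(2π√(LC)). The sketch below ignores parasitics and the varactor C-V curve and computes how far the total tank capacitance must swing to cover 2.33 GHz to 2.72 GHz; the 2-nH inductor value is a hypothetical choice, not taken from the paper.

```python
import math

def lc_frequency(inductance_h: float, capacitance_f: float) -> float:
    """Ideal tank resonance f = 1 / (2*pi*sqrt(L*C)), in Hz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))

def capacitance_for(inductance_h: float, freq_hz: float) -> float:
    """Total tank capacitance that resonates with L at freq_hz (inverse of lc_frequency)."""
    return 1.0 / ((2.0 * math.pi * freq_hz) ** 2 * inductance_h)

L_TANK = 2e-9  # hypothetical 2-nH tank inductor (assumption, not from the paper)

c_max = capacitance_for(L_TANK, 2.33e9)  # lowest frequency needs the most capacitance
c_min = capacitance_for(L_TANK, 2.72e9)  # highest frequency needs the least
print(f"C range {c_min * 1e12:.3f} pF to {c_max * 1e12:.3f} pF, ratio {c_max / c_min:.3f}")
```

Because f scales as 1/√C, the required capacitance ratio is (2.72/2.33)², about 1.36, whichever inductor is assumed.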
Adiabatic CMOS gate and adiabatic circuit design for low-power applications
Guoqiang Hang
Pages: 803-808
doi: 10.1145/1120725.1121022

This paper investigates a methodology for designing adiabatic circuits that use a two-phase power clock. First, algebraic expressions for power-clocked signals and their properties are discussed. Then, the design of adiabatic gates based on an AC power supply and CMOS transmission gates is analyzed. On this basis, basic rules for designing adiabatic circuits are proposed, and an adiabatic full adder is presented as a design example. SPICE simulations with a trapezoidal power clock show that the designed adiabatic circuits have correct logic function and ultra-low power consumption.
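The low-power claim rests on the standard adiabatic-charging argument: charging a load C through a resistance R from a supply that ramps over a time T dissipates roughly (RC/T)·CV², instead of the CV²/2 lost when the node is charged abruptly from a DC supply. The sketch below compares the two under that first-order approximation, which holds only for T ≫ RC; the component values are hypothetical and not taken from the paper.

```python
def conventional_dissipation(c_load: float, vdd: float) -> float:
    """Energy lost in the charging path when a node is charged abruptly from a DC supply."""
    return 0.5 * c_load * vdd ** 2

def adiabatic_dissipation(r_path: float, c_load: float, vdd: float, t_ramp: float) -> float:
    """Approximate loss when the supply ramps linearly over t_ramp (valid only for t_ramp >> R*C)."""
    return (r_path * c_load / t_ramp) * c_load * vdd ** 2

# Hypothetical values: 1-kOhm transmission-gate path, 100-fF load, 2.5-V swing, 10-ns ramp.
R, C, V, T = 1e3, 100e-15, 2.5, 10e-9
e_conv = conventional_dissipation(C, V)
e_adia = adiabatic_dissipation(R, C, V, T)
print(f"conventional {e_conv:.3e} J, adiabatic {e_adia:.3e} J, ratio {e_adia / e_conv:.3f}")
```

The ratio of the two is 2RC/T, so slower power-clock ramps save more energy at the cost of speed.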
An 11-bit 160-MS/s 1.35-V 10-mW D/A converter using automated device sizing system
Osamu Matsumoto, Hisashi Harada, Yasuo Morimoto, Toshio Kumamoto, Takahiro Miki, Masao Hotta
Pages: 809-814
doi: 10.1145/1120725.1121023

This paper describes an automated device-sizing system for current-steering D/A converters (DACs) and an 11-bit 160-MS/s DAC implemented with this system. Based on an analysis of the DAC's harmonic distortion (spurious components), a new circuit technique named One-Vgs Switching was developed to achieve a high spurious-free dynamic range (SFDR). The automated device-sizing system was also developed to allow quick retargeting of the current-steering DAC. The 11-bit 160-MS/s DAC was designed with this system and fabricated in a 0.18-μm technology. It operates from a 1.35-V supply with 10-mW power consumption, a 1.6-Vppd output swing, and 61-dB SFDR at fsig = 10.2 MHz. Its active area is 0.22 mm².
A class D audio power amplifier with high-efficiency and low-distortion
Chen Hai, Wu Xiaobo
Pages: 815-818
doi: 10.1145/1120725.1121024

Efficiency and fidelity are of key importance in audio power amplifiers. A new power amplifier configuration is proposed to improve both. By combining a linear amplifier and a nonlinear amplifier in parallel, it achieves high efficiency and low distortion. Simulations show that at an output power of 8.5 W its efficiency reaches up to 83%, while its THD (Total Harmonic Distortion) is as low as 0.14%.
Substrate noise modeling in early floorplanning of MS-SOCs
Grzegorz Blakiewicz, Marcin Jeske, Malgorzata Chrzanowska-Jeske, Jin S. Zhang
Pages: 819-823
doi: 10.1145/1120725.1121025

We propose a frequency-dependent sensitivity model for analog blocks and a noise-injection model for digital blocks, for use in early design planning of Mixed-Signal Systems-on-Chip (MS-SOCs). We assume that no precise layout information about the IP cores is available. We also propose an empirical formula for the separation-dependent coupling between large-area noisy ports and small-area sensitive ports in lightly doped substrates, which are preferred for mixed-signal circuits. The interaction between digital and analog blocks is incorporated into our floorplanner, which reduces the overall noise and the number of analog blocks that violate their noise limits. Experimental results on examples created from the MCNC floorplanning benchmarks are very encouraging.
SESSION: Low power design for embedded and real-time systems
Joerg Henkel, Jihua Chen
Instruction scheduling of VLIW architectures for balanced power consumption
Shu Xiao, Edmund M-K. Lai
Pages: 824-829
doi: 10.1145/1120725.1121027

An instruction word in a VLIW (very long instruction word) processor consists of a variable number of individual instructions. Therefore, the variation of power consumption over time depends significantly on the parallel instruction schedule generated by the compiler. Sharp power variations over time cause power-supply noise, degrade chip reliability and accelerate battery exhaustion. This paper proposes a branch-and-bound algorithm for instruction scheduling on VLIW architectures that effectively minimizes power variation without degrading speed. Our experimental results demonstrate the efficiency of our algorithm compared with previously presented approaches. Finally, a new rough-set-based approach to the instruction-level VLIW power model for this instruction scheduling optimization problem is discussed.
Power minimization techniques on distributed real-time systems by global and local slack management
Shaoxiong Hua, Gang Qu
Pages: 830-835
doi: 10.1145/1120725.1121028

Recently, a static power management with parallelism (P-SPM) technique was proposed to reduce the energy that distributed systems consume when executing a set of dependent real-time tasks [7]. Its authors claimed that P-SPM outperforms other known methods in energy reduction. However, how to take advantage of local static slack for further energy optimization remained an open problem. In this paper, we propose static power management with proportional distribution and parallelism (PDP-SPM), which not only answers this open problem but also exploits parallelism. Simulations on task graphs derived from DSP applications and the TGFF benchmark suite suggest that PDP-SPM achieves a 64% energy saving over a system without power management, and 15% over the P-SPM scheme.
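The PDP-SPM algorithm itself is not reproduced here. As a hedged illustration of why distributing slack proportionally is attractive, the sketch below stretches a chain of tasks that share one deadline in proportion to their worst-case execution times (WCETs), assuming energy per cycle scales with f² (supply voltage proportional to frequency). Under that convex model, the resulting uniform slowdown uses less energy than an uneven split of the same window; the task values are hypothetical.

```python
def proportional_slack(wcets, window):
    """Stretch a chain of tasks to fill `window`, giving each slack in proportion to its WCET."""
    total = sum(wcets)
    if total > window:
        raise ValueError("chain misses its deadline even at full speed")
    scale = window / total
    return [w * scale for w in wcets]

def relative_energy(wcets, stretched):
    """Energy relative to running everything at full speed.

    Assumes cycles are proportional to the full-speed WCET, each task runs at normalized
    frequency f = wcet / stretched_time, and energy per cycle grows as f**2.
    """
    full = sum(wcets)
    return sum(w * (w / s) ** 2 for w, s in zip(wcets, stretched)) / full

times = proportional_slack([2, 3], 10)  # both tasks end up at half speed
print(times, relative_energy([2, 3], times))
print(relative_energy([2, 3], [5.0, 5.0]))  # an uneven split of the same window costs more
```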
A generalized technique for energy-efficient operating voltage set-up in dynamic voltage scaled processors
Jaewon Seo, Nikil D. Dutt
Pages: 836-841
doi: 10.1145/1120725.1121029

Dynamic voltage scaling (DVS), an effective energy minimization technique, has been well studied in recent years. Yet the problem of selecting voltage levels for DVS systems with multiple voltages remains unresolved. In this paper, we present a novel technique for finding k operating voltages that minimize energy consumption (the voltage set-up problem). We give a new formulation of the voltage set-up problem that makes our solution less dependent on the specific DVS scheme, and we solve it optimally with dynamic programming in polynomial time. With almost the same time complexity, we extend the technique to explore the design space and determine the best number of voltage levels. Experiments confirm that the proposed voltage set-up solution reduces energy consumption by 19.2% on average compared with a previous technique [7].
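The abstract does not spell out its formulation, so the following is only a minimal sketch of how dynamic programming can solve a simplified voltage set-up problem under assumptions that are not the paper's: each workload has a known minimum voltage and cycle count, energy per cycle is proportional to V², and each workload runs at the lowest chosen level that meets its demand. Lowering a level to the highest demand it serves never increases the cost, so candidate levels can be restricted to the demanded voltages, giving an O(k·m²) recurrence over the m sorted demands. The function name `choose_levels` is hypothetical.

```python
from itertools import accumulate

def choose_levels(demands, cycles, k):
    """Choose at most k supply voltages minimizing sum(cycles_i * V_i**2).

    Simplified model (an assumption, not the paper's formulation): workload i needs at
    least demands[i] volts, runs cycles[i] cycles, and executes at the lowest chosen
    level >= demands[i]; energy per cycle is proportional to V**2.
    Returns (cost, sorted list of chosen levels).
    """
    pts = sorted(zip(demands, cycles))
    m = len(pts)
    volts = [v for v, _ in pts]
    prefix = [0] + list(accumulate(c for _, c in pts))  # prefix sums of cycle counts
    inf = float("inf")
    # best[c][j]: min cost of serving the j lowest demands with c levels, the top one = volts[j-1]
    best = [[inf] * (m + 1) for _ in range(k + 1)]
    back = [[0] * (m + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(1, m + 1):
            for i in range(j):  # demands i..j-1 all share level volts[j-1]
                cost = best[c - 1][i] + volts[j - 1] ** 2 * (prefix[j] - prefix[i])
                if cost < best[c][j]:
                    best[c][j], back[c][j] = cost, i
    c_used = min(range(1, k + 1), key=lambda c: best[c][m])
    levels, j, c = [], m, c_used
    while j > 0:  # walk the back-pointers to recover the chosen levels
        levels.append(volts[j - 1])
        j, c = back[c][j], c - 1
    return best[c_used][m], sorted(levels)

print(choose_levels([1.0, 2.0, 3.0], [1, 1, 1], 2))  # (17.0, [2.0, 3.0])
```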
A dynamic voltage scaling algorithm for energy reduction in hard real-time systems
Van R. Culver, Sunil P. Khatri
Pages: 842-845
doi: 10.1145/1120725.1121030

As the number and functional complexity of battery-powered portable devices continue to rise, energy-efficient design of such devices has become increasingly important. Many real-time scheduling algorithms have recently been developed to reduce energy consumption in hard real-time embedded systems that use processors capable of dynamic voltage scaling (DVS). This paper explores an algorithm that reduces energy consumption by considering tasks in tandem, based on the intuition that a good frequency for one task may be much worse for another. In particular, our algorithm considers pairs of tasks and optimizes them simultaneously so that their total energy consumption is minimized while all deadlines are met. Experimental results show that our method effectively improves on the results of look-ahead EDF, one of the best energy-aware schedulers, especially for task sets with moderate utilization and "harmonious" task periodicity.
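For context, here is a hedged sketch of the simplest static DVS baseline for EDF; it is not the pairwise algorithm from the paper. With periodic tasks whose deadlines equal their periods, EDF meets every deadline at normalized speed f exactly when the utilization U, computed at full speed, satisfies U ≤ f, so the processor can run at the lowest available level at or above U. The function name and the speed levels in the example are hypothetical.

```python
from fractions import Fraction

def edf_static_speed(tasks, levels):
    """Lowest available normalized speed at which EDF still meets every deadline.

    tasks:  (wcet_at_full_speed, period) pairs with integer values; deadline == period.
    levels: available normalized speeds (Fractions), 1 meaning full speed.
    At speed f every WCET grows to wcet / f, so EDF stays feasible iff U / f <= 1.
    """
    u = sum(Fraction(c, p) for c, p in tasks)
    speed = next((f for f in sorted(levels) if f >= u), None)
    if speed is None:
        raise ValueError(f"utilization {u} exceeds every available speed")
    return u, speed

LEVELS = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(1)]  # hypothetical DVS levels
print(edf_static_speed([(1, 4), (3, 8)], LEVELS))  # U = 5/8, so the 3/4 level is chosen
```

Exact fractions avoid rounding errors when U lands exactly on an available level.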
An efficient dynamic task scheduling algorithm for battery powered DVS systems
Jianli Zhuo, Chaitali Chakrabarti
Pages: 846-849
doi: 10.1145/1120725.1121031

Extending battery lifetime is a critical design goal for mobile computing devices. Maximizing battery lifetime is particularly difficult because battery behavior is nonlinear and depends on the characteristics of the discharge profile. In this paper, we address dynamic task scheduling with voltage scaling in a battery-powered DVS system. The objective is to maximize battery performance, measured in terms of the charge consumed while executing the tasks. We present a new battery-aware dynamic task scheduling algorithm, darEDF, based on an efficient slack-utilization scheme that dynamically sets the speeds of tasks in the run queue. We compare darEDF with three state-of-the-art energy-efficient algorithms, lpfpsEDF, lppsEDF and lpSEH, with respect to battery performance and energy consumption. We show that darEDF performs better than lpSEH (whose energy consumption is close to optimal) and has lower run-time complexity.
SESSION: Synthesis for low power
Shih-Chieh Chang, Farzan Fallah
Optimal module and voltage assignment for low-power
Deming Chen, Jason Cong, Junjuan Xu
Pages: 850-855
doi: 10.1145/1120725.1121054

Reducing power consumption through high-level synthesis has attracted growing interest from researchers because of its large potential for power reduction. In this work, we study functional unit binding (or module assignment) for a scheduled data flow graph under a dual-Vdd framework. We assume that each functional unit can be driven by either a low Vdd or a high Vdd, selected dynamically at run time, to save dynamic power. We develop a polynomial-time optimal algorithm that assigns low Vdd to as many operations as possible under the resource and time constraints while, at the same time, minimizing total switching activity through functional unit binding. Our algorithm consistently improves on a design flow that separates voltage assignment from functional unit binding. We also vary the initial schedule to examine power-latency tradeoffs. Experimental results show a 21% power reduction when the latency bound is tight. When latency is relaxed by 10% to 100%, the power reduction is 31% to 73% compared with synthesis using a single high Vdd and no latency relaxation. We also report energy consumption under the same experimental setting.
Bitwidth-aware scheduling and binding in high-level synthesis
Jason Cong, Yiping Fan, Guoling Han, Yizhou Lin, Junjuan Xu, Zhiru Zhang, Xu Cheng
Pages: 856-861
doi: 10.1145/1120725.1121055

Many high-level description languages, such as C/C++ or Java, cannot specify bitwidth information for variables and operations. Synthesizing from these specifications without bitwidth analysis can waste resources. Furthermore, conventional high-level synthesis techniques usually target uniform-width resources, so they cannot obtain the full resource savings even when bitwidth information is available. This work develops a bitwidth-aware synthesis flow, comprising bitwidth analysis, scheduling and binding, and register allocation and binding, that exploits the multi-bitwidth nature of operations and variables to produce area-efficient designs. We also develop a lower-bound estimate to evaluate the efficiency of our register allocation and binding solutions. The flow is implemented in the MCAS synthesis system [11]. Experimental results show that the proposed bitwidth-aware flow reduces area by 36% and wire-length by 52% on average compared with the uniform-width MCAS flow, while achieving the same performance.
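A minimal sketch of forward bitwidth analysis by interval propagation, assuming unsigned integer values only; it illustrates the idea and is not the MCAS implementation. Each signal carries a value range, and each operator's output range determines how many bits the synthesized operator and register actually need. The helper names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    """Closed interval [lo, hi] of non-negative integer values a signal can take."""
    lo: int
    hi: int

    @property
    def bits(self) -> int:
        """Unsigned bits needed to hold every value in the range (at least 1)."""
        return max(1, self.hi.bit_length())

def from_bits(width: int) -> Range:
    return Range(0, (1 << width) - 1)

def add(a: Range, b: Range) -> Range:
    return Range(a.lo + b.lo, a.hi + b.hi)

def mul(a: Range, b: Range) -> Range:
    # Endpoints multiply directly because both ranges are non-negative.
    return Range(a.lo * b.lo, a.hi * b.hi)

def mask(a: Range, width: int) -> Range:
    # x & (2**width - 1) never exceeds either x or the mask itself.
    return Range(0, min(a.hi, (1 << width) - 1))

a = b = from_bits(8)             # e.g. two unsigned chars held in 32-bit C ints
print(add(a, b).bits)            # 9
print(mul(a, b).bits)            # 16
print(mask(mul(a, b), 12).bits)  # 12
```

So a uniform-width flow would build 32-bit operators here, while the analysed widths are 9 bits for the adder and 16 bits for the multiplier.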
Functionality directed clustering for low power MTCMOS design
Tsuang-Wei Chang, Ting-Ting Hwang, Sheng-Yu Hsu
Pages: 862-867
doi: 10.1145/1120725.1121056

Multi-Threshold CMOS (MTCMOS) is a circuit style that can effectively reduce leakage power consumption. Sleep transistor sizing is the key issue in MTCMOS circuit design. If the sleep transistor is too large, circuit performance is maintained, but the dynamic power consumption of the sleep transistor increases. On the other hand, if it is too small, performance degrades significantly because of the increased resistance to ground. Previous approaches [1, 2] sized sleep transistors based on mutually exclusive discharge patterns, but they considered only the topology of the circuit. We observe that two gates that may switch simultaneously according to topology may, in terms of functionality, never discharge at the same time. We therefore propose an algorithm that determines how to cluster cells to share sleep transistors, taking both topology and functionality into account. The results show that the proposed method reduces the number of sleep transistors by 18% on average compared with the method that does not consider functionality.
Wake-up protocols for controlling current surges in MTCMOS-based technology
Azadeh Davoodi, Ankur Srivastava
Pages: 868-871
doi: 10.1145/1120725.1121057

This paper proposes strategies for controlling wake-up noise in circuits implemented in MTCMOS technology. In MTCMOS circuits, switching between the active and standby modes causes sudden current surges because of floating voltages at the internal nodes. These surges may compromise the reliability of the circuit. We address this problem by developing wake-up strategies that control these current surges as the circuit is turned on: turning the circuit on gradually draws a smaller current from the power-grid network. A novel partitioning technique is proposed for MTCMOS circuits under a given constraint on the maximum current drawn from the power-grid network. Two approaches are proposed: an optimal ILP-based formulation and a polynomial-time heuristic. Experimental results show up to 90.7% improvement in peak drawn current, with at most 4 clock cycles needed to turn on the circuit. The results also show that the heuristic produces good-quality solutions and runs up to 6600 times faster than the ILP approach on larger circuits.
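The paper's ILP and heuristic are not reproduced here. As a simplified sketch of the underlying idea, the code below groups circuit clusters into wake-up cycles with first-fit-decreasing bin packing so that the surge current drawn in any one cycle stays below a limit. It assumes each cluster's surge is a known number confined to the cycle in which that cluster wakes; the function name and the current values are hypothetical.

```python
def schedule_wakeup(surges, limit):
    """Assign clusters to wake-up cycles so each cycle's summed surge stays <= limit.

    First-fit decreasing bin packing (a stand-in heuristic, not the paper's method).
    Returns a list of cycles, each a list of cluster indices, in wake-up order.
    """
    cycles, loads = [], []
    for idx in sorted(range(len(surges)), key=lambda i: surges[i], reverse=True):
        s = surges[idx]
        if s > limit:
            raise ValueError(f"cluster {idx} alone exceeds the current limit")
        for c, load in enumerate(loads):
            if load + s <= limit:  # first earlier cycle with enough headroom
                cycles[c].append(idx)
                loads[c] += s
                break
        else:  # no existing cycle fits: open a new wake-up cycle
            cycles.append([idx])
            loads.append(s)
    return cycles

SURGES_MA = [5, 4, 3, 2, 1]  # hypothetical per-cluster surge currents in mA
plan = schedule_wakeup(SURGES_MA, limit=6)
print(plan, [sum(SURGES_MA[i] for i in cyc) for cyc in plan])
```

Waking all five clusters at once would draw 15 mA; spreading them over three cycles keeps every cycle at or below 6 mA.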
On multiple-voltage high-level synthesis using algorithmic transformations
Hsueh-Chih Yang, Lan-Rong Dung
Pages: 872-876
doi: 10.1145/1120725.1121058

This paper presents a multiple-voltage high-level synthesis methodology for low-power DSP applications based on algorithmic transformation techniques. Our approach is motivated by maximizing task mobilities, since higher mobility increases the chance of assigning tasks to low-voltage components. The mobility of a task is its freedom in choosing a start time, defined as the distance between its as-late-as-possible (ALAP) schedule time and its as-soon-as-possible (ASAP) schedule time. To increase task mobilities, we use loop shrinking, retiming and unfolding. Loop shrinking first reduces the iteration period bound (IPB); retiming and unfolding then shorten the minimum achieved sample period (MASP) as much as possible. Minimizing the MASP yields high task mobilities. We can then assign high-mobility tasks to low-voltage components and minimize energy dissipation under resource and latency constraints. Taking the overhead of level conversion into account, our approach achieves significant power reduction; for example, the experimental results show a power saving of up to 54.77% for a third-order IIR filter.
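The mobility definition above (ALAP start minus ASAP start) can be computed directly on a data flow graph. The sketch below assumes integer operator delays in control steps and a given latency bound; the five-operation graph is a made-up example, not one of the paper's benchmarks, and it does not model the loop transformations.

```python
from graphlib import TopologicalSorter

def asap_alap_mobility(delays, preds, latency):
    """Return ASAP starts, ALAP starts and mobility (ALAP - ASAP) for each operation.

    delays:  {op: delay in control steps}
    preds:   {op: set of predecessor ops}; every op must appear as a key
    latency: number of control steps within which all ops must finish
    """
    order = list(TopologicalSorter(preds).static_order())
    asap = {}
    for op in order:  # earliest start: after the slowest predecessor finishes
        asap[op] = max((asap[p] + delays[p] for p in preds[op]), default=0)
    succs = {op: [] for op in order}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)
    alap = {}
    for op in reversed(order):  # latest start: finish before the earliest successor starts
        alap[op] = min((alap[s] for s in succs[op]), default=latency) - delays[op]
    if any(alap[op] < asap[op] for op in order):
        raise ValueError("latency bound is too tight for this graph")
    return asap, alap, {op: alap[op] - asap[op] for op in order}

# Hypothetical DFG: a1 adds products m1 and m2, a3 adds a1 and a2.
# Multipliers take 2 control steps, adders 1.
DELAYS = {"m1": 2, "m2": 2, "a1": 1, "a2": 1, "a3": 1}
PREDS = {"m1": set(), "m2": set(), "a2": set(), "a1": {"m1", "m2"}, "a3": {"a1", "a2"}}
_, _, mobility = asap_alap_mobility(DELAYS, PREDS, latency=4)
print(mobility)
```

Here only a2 has non-zero mobility (2 steps), so it is the natural candidate for a slower low-voltage unit; relaxing the latency bound by one step raises every mobility by one.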
SESSION: New circuit and methodology
Jinmei Lai, Zheng Shi
An advanced bit-line clamping scheme in magnetic RAM for wide sensing margin
Jong-Chul Lim, Hye-Seung Yu, Jae-Suk Choi, Soo-Won Kim
Pages: 877-882
doi: 10.1145/1120725.1121060

This paper proposes a bit-line clamping scheme that provides a stable signal margin in magnetoresistive RAM (MRAM). MRAM distinguishes data by the difference in resistance of the magnetic tunnel junction (MTJ). However, the MTJ has so many error sources that they limit yield. In this paper, we focus on the resistance variation caused by the bit-line voltage. To maximize the signal difference, we keep the bit-line voltage as low as possible. The proposed scheme employs a CBLSA, an equalizer transistor and a 1T1MTJ array structure. It has excellent bit-line clamping characteristics, and the overall memory can be built with a simple architecture using current-mode sensing. As a result, the proposed memory structure clamps the bit-line voltage below 0.15 V while using very little power and area. This lower bit-line voltage enables more stable data access in MRAM. The circuit is designed in a 0.35-μm CMOS technology.
Constructing zero-deficiency parallel prefix adder of minimum depth
Haikun Zhu, Chung-Kuan Cheng, Ronald Graham
Pages: 883-888
doi: 10.1145/1120725.1121061

Parallel prefix addition is a general technique for speeding up binary addition. In the unit-delay model, we denote the size and depth of an n-bit prefix adder C(n) by s_C(n) and d_C(n), respectively. Snir proved that s_C(n) + d_C(n) ≥ 2n - 2 holds for any prefix adder; hence a prefix adder is said to be of zero deficiency if s_C(n) + d_C(n) = 2n - 2. In this paper, we first propose a new zero-deficiency prefix adder architecture, dubbed Z(d), which provably has the minimum depth among all zero-deficiency prefix adders. We then design a 64-bit prefix adder, Z64, derived from Z(d) with d = 8, and compare it with several classical prefix adders of the same bit width in terms of area and delay using the logical effort method. The results show that the proposed Z(d) adder is also promising for practical VLSI design.
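Snir's bound can be checked on concrete networks. The sketch below simulates a prefix network, given as levels of (dst, src) operators under the unit-delay model, verifies that it computes every prefix, and reports the deficiency s_C(n) + d_C(n) - (2n - 2). It uses the classical serial and Sklansky networks for illustration; it does not construct the paper's Z(d) adder, and the helper names are hypothetical.

```python
def check_prefix_network(n, levels):
    """Simulate a prefix network; return (size, depth) after verifying all prefixes.

    levels: list of levels; each level is a list of (dst, src) operators meaning
    x[dst] <- x[src] o x[dst], where every read in a level sees the previous level.
    """
    span = [(i, i) for i in range(n)]  # bit range [lo, hi] that x[i] currently covers
    size = 0
    for level in levels:
        new = list(span)
        for dst, src in level:
            (slo, shi), (dlo, dhi) = span[src], span[dst]
            if shi + 1 != dlo:
                raise ValueError(f"operator ({dst}, {src}) combines non-adjacent spans")
            new[dst] = (slo, dhi)
            size += 1
        span = new
    if span != [(0, i) for i in range(n)]:
        raise ValueError("network does not compute every prefix")
    return size, len(levels)

def serial_prefix(n):
    """Ripple chain: one operator per level."""
    return [[(i, i - 1)] for i in range(1, n)]

def sklansky_prefix(n):
    """Sklansky divide-and-conquer network; n must be a power of two."""
    levels, half = [], 1
    while half < n:
        level = []
        for base in range(0, n, 2 * half):
            pivot = base + half - 1  # last bit of the lower half of this block
            level.extend((i, pivot) for i in range(base + half, base + 2 * half))
        levels.append(level)
        half *= 2
    return levels

def deficiency(n, levels):
    size, depth = check_prefix_network(n, levels)
    return size + depth - (2 * n - 2)

for name, build in [("serial", serial_prefix), ("Sklansky", sklansky_prefix)]:
    size, depth = check_prefix_network(8, build(8))
    print(name, size, depth, deficiency(8, build(8)))
```

The serial chain is zero-deficiency but has depth n - 1, while Sklansky has depth log2 n but a deficiency of 1 for n = 8 and 72 for n = 64. Z(d) sits between these: zero deficiency with the minimum depth that allows it.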
An accurate 1.08-GHz CMOS LC voltage-controlled oscillator
Zhangwen Tang, Jie He, Hongyan Jian, Hao Min
Pages: 889-892
doi: 10.1145/1120725.1121062

An accurate 1.08-GHz CMOS LC voltage-controlled oscillator is implemented in a standard 0.35-μm 2P4M CMOS process. In this paper, we present a new, convenient method for calculating the oscillation period. With this period-calculation technique, the calculated frequency tuning curves agree perfectly with the measurements. From a 3.3-V supply, the LC-VCO achieves a measured phase noise of -82.2 dBc/Hz at a 10-kHz frequency offset while drawing 3.1 mA. The chip size is 0.86 mm × 0.82 mm.
Area-IO DRAM/logic integration with system-in-a-package (SiP)
Anru Wang, Wayne Dai
Pages: 893-896
doi: 10.1145/1120725.1121063

This paper presents a cost-effective area-IO DRAM (aDRAM)/logic integration implemented with CLC (Chip-Laminate-Chip)-based System-in-a-Package (SiP) technology. With 512 area-IOs inserted into the area-IO DRAM, its bandwidth reaches 10 GB/s at 166 MHz. An interface module with configurable IO width was also developed so that this implementation platform can be adapted to various applications. A performance analysis covering bandwidth and power is also presented. We demonstrate that area-IO DRAM/logic integration with SiP technology is a significantly more cost-effective implementation methodology than embedded DRAM or off-chip DRAM.
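The 10 GB/s figure is consistent with simple peak-bandwidth arithmetic, assuming each of the 512 area-IOs transfers one bit per 166-MHz clock (single data rate; the abstract does not state the transfer scheme). The function name is hypothetical.

```python
def peak_bandwidth_bytes(io_count: int, clock_hz: float, transfers_per_clock: int = 1) -> float:
    """Peak bandwidth in bytes/s, assuming every IO moves one bit per transfer."""
    return io_count * clock_hz * transfers_per_clock / 8

bw = peak_bandwidth_bytes(512, 166e6)
print(f"{bw / 1e9:.2f} GB/s")  # 10.62 GB/s, taking 1 GB = 1e9 bytes
```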
Design of an efficient memory subsystem for network processor
Shuguang Gong, Huawei Li, Yufeng Xu, Tong Liu, Xiaowei Li
Pages: 897-900
doi: 10.1145/1120725.1121064

The rapid growth of backbone network traffic widens the gaps among available network bandwidth, CPU computation power and memory bandwidth. Memory bandwidth has become the main performance bottleneck of network processors. In this paper, we propose an efficient memory subsystem design that combines dynamic memory allocation with a novel page-based memory access algorithm. The dynamic memory allocation provides fast random packet access and flexible queue management. Using the page-based memory access algorithm, we design an efficient memory controller that enables high throughput in the network processor.
Design of clocked circuits using UML
Zhenxin Sun, Weng-Fai Wong, Yongxin Zhu, Santhosh Kumar Pilakkat
Pages: 901-904
doi: 10.1145/1120725.1121065

Clocking is an essential component of any embedded system design. However, traditional design techniques either lack clocking support or are too complex for users. The Unified Modeling Language (UML) has been proposed as a design tool for real-time systems, but its clocking semantics have not been properly addressed. In this paper, we present our experience of using UML to design a clocked system. In particular, we use UML to model a digital down converter, an essential component of software radios. Our tool chain automatically generates the simulation and synthesizes the final implementation.
SESSION: FPGA circuits and architectures
Wai-Kei Mak, Feng Zhou
A function generator-based reconfigurable system
Vivek Garg, Vikram Chandrasekhar, M. Sashikanth, V. Kamakoti
Pages: 905-909
doi: 10.1145/1120725.1121067

This paper proposes a new reconfigurable system with a function-generator-based CLB architecture, which differs from the standard look-up table (LUT)-based CLB architectures of commercial FPGAs. The new function-generation architecture is based on the fact that a small set of k-input Boolean functions can generate all 2^(2^k) k-input Boolean functions using a simple mapping technique. The new function-generation architecture requires 58.6% less area than a standard 16 × 1 LUT used in commercial FPGAs, and it consumes 40.8% less power. The routing architecture of the proposed reconfigurable system is the same as in current FPGAs, so the algorithms presently used for technology mapping, packing, placement and routing on FPGAs can be used for the proposed system without much modification. The new architecture requires a 10% increase in SRAM configuration memory, an insignificant penalty compared with the reductions in FPGA area and power consumption achieved by the proposed CLB architecture.
|
|
|
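The abstract does not spell out the mapping technique. One standard way to see how a small set of k-input functions can stand in for all 2^(2^k) of them is NPN equivalence: permuting inputs, negating inputs and negating the output. The sketch below assumes NPN-style mappings purely for illustration (it is not necessarily the authors' technique; `npn_classes` is a hypothetical helper) and counts the classes by brute force.

```python
from itertools import permutations


def npn_classes(k: int) -> int:
    """Count NPN classes of k-input Boolean functions by brute force.

    A function is a truth-table integer: bit m holds f(m), where bit i of m
    is the value of input x_i. Illustrative only; feasible for small k.
    """
    n_rows = 1 << k
    full = (1 << n_rows) - 1
    perms = list(permutations(range(k)))
    canon = set()
    for f in range(1 << n_rows):
        best = None
        for p in perms:                      # input permutation
            for neg in range(n_rows):        # input negation mask
                g = 0
                for m in range(n_rows):
                    src = 0
                    for i in range(k):       # move input bit p[i] to position i
                        if (m >> p[i]) & 1:
                            src |= 1 << i
                    src ^= neg               # negate the selected inputs
                    if (f >> src) & 1:
                        g |= 1 << m
                for out in (g, g ^ full):    # optional output negation
                    if best is None or out < best:
                        best = out
        canon.add(best)                      # smallest member names the class
    return len(canon)


for k in (1, 2, 3):
    print(k, 2 ** (2 ** k), npn_classes(k))
```

For k = 1, 2 and 3 this reports 2, 4 and 14 classes out of 4, 16 and 256 functions, which shows how quickly a handful of representatives covers the whole space under such mappings.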
Crossbar based design schemes for switch boxes and programmable interconnection networks
Hongbing Fan,
Yu-Liang Wu
Pages: 910-915
doi>10.1145/1120725.1121068

Crossbars have been considered one of the most standard switching modules in conventional communication networks because of their simple routing algorithms and fabrication regularity. In programmable on-chip interconnection networks, such as the routing networks in field programmable gate arrays (FPGAs), switch boxes are often used instead for a better tradeoff between routability and area efficiency. Much work has been done on the topology design of switch boxes, e.g., universal switch boxes and hyper-universal switch boxes. However, the layout design of switch boxes tends to be difficult when their topology is less regular. In this paper we revisit the theoretical design aspects of the classic crossbar design schemes and investigate a new design style, the so-called meta-crossbar, which is obtained from a crossbar by adding the least number of switches and direct contacts needed to achieve the desired optimal routability. We show that a switch box can always be implemented by a meta-crossbar, which means that the layout design of switch boxes can be done almost like that of crossbars. As a result, we present a hyper-universal meta-crossbar design and a three-level meta-crossbar-based interconnection network design that is capable of routing all group communication requirements.

A domain specific reconfigurable Viterbi fabric for system-on-chip applications
Cheng Zhan,
Tughrul Arslan,
Sami Khawam,
Iain Lindsay
Pages: 916-919
doi>10.1145/1120725.1121069

This paper presents a novel embedded, dynamically reconfigurable fabric for implementing the Viterbi algorithm in a system-on-chip device. The proposed fabric can support Viterbi implementations for different standards, such as GSM, IS-95, CDMA and wireless LAN. Our results show that the proposed architecture has superior power consumption and throughput characteristics, demonstrating an 80% reduction in power consumption compared with a generic field programmable gate array (FPGA) and a 40-fold improvement in throughput over a digital signal processor (DSP). A reconfigurable system-on-chip platform based on such domain-specific reconfigurable fabrics is therefore an efficient solution for high-performance portable communication systems.

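For readers unfamiliar with the algorithm the fabric implements, here is a minimal hard-decision Viterbi decoder in software. It assumes a rate-1/2, constraint-length-3 convolutional code with generators 7 and 5 (octal), chosen only for brevity; the standards listed above use other parameters, and nothing here models the reconfigurable fabric. The function names are hypothetical.

```python
def parity(x: int) -> int:
    return bin(x).count("1") & 1


def conv_encode(bits, g=(0b111, 0b101), k=3):
    """Rate-1/2 convolutional encoder; appends k-1 zero tail bits to flush the state."""
    state, out = 0, []
    for b in list(bits) + [0] * (k - 1):
        reg = (b << (k - 1)) | state          # current input in the MSB
        out.extend(parity(reg & gi) for gi in g)
        state = reg >> 1                       # keep the k-1 most recent inputs
    return out


def viterbi_decode(symbols, g=(0b111, 0b101), k=3):
    """Hard-decision Viterbi decoding with Hamming branch metrics."""
    n_states = 1 << (k - 1)
    inf = float("inf")
    metric = [0] + [inf] * (n_states - 1)      # encoder starts in state 0
    paths = [[] for _ in range(n_states)]
    r = len(g)
    for t in range(0, len(symbols), r):
        rx = symbols[t:t + r]
        new_metric = [inf] * n_states
        new_paths = [None] * n_states
        for s in range(n_states):
            if metric[s] == inf:
                continue
            for b in (0, 1):                   # add-compare-select for both branches
                reg = (b << (k - 1)) | s
                expected = [parity(reg & gi) for gi in g]
                m = metric[s] + sum(e != x for e, x in zip(expected, rx))
                ns = reg >> 1
                if m < new_metric[ns]:
                    new_metric[ns] = m
                    new_paths[ns] = paths[s] + [b]
        metric, paths = new_metric, new_paths
    # The tail bits force the encoder back to state 0; drop them from the result.
    return paths[0][: len(paths[0]) - (k - 1)]
```

Because this code has free distance 5, the decoder recovers the message even with two well-separated channel bit errors.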
Design of a high performance FFT processor based on FPGA
Chu Chao,
Zhang Qin,
Xie Yingke,
Han Chengde
Pages: 920-923
doi>10.1145/1120725.1121070

A design method for a real-time FFT processor is presented. By optimizing the memory-mapping algorithm and the generation of twiddle factors, a radix-4 butterfly can be computed in one clock cycle. An adaptive overflow control approach is also introduced to avoid overflow without interrupting the computing pipeline. The design is implemented on an FPGA chip and operates at 127 MHz. It can complete a 1024-point complex FFT within 10.1 μs.

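The timing figure is consistent with one butterfly per cycle. A 1024-point transform needs log4(1024) = 5 radix-4 stages of 256 butterflies each, i.e. 1280 butterflies. At 127 MHz, 1280 cycles take about 10.08 μs, assuming no pipeline fill or memory overhead (the abstract does not state this). The floating-point sketch below shows the radix-4 butterfly inside a recursive decimation-in-time FFT. It does not model the paper's fixed-point datapath, memory mapping or adaptive overflow control, and both function names are hypothetical.

```python
import cmath


def fft_radix4(x):
    """Recursive radix-4 decimation-in-time FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4:
        raise ValueError("length must be a power of 4")
    sub = [fft_radix4(x[r::4]) for r in range(4)]   # DFTs of x[4m + r]
    quarter = n // 4
    out = [0j] * n
    for k0 in range(quarter):
        # Twiddle factors W_N^(r*k0) applied to each sub-DFT output.
        a = [sub[r][k0] * cmath.exp(-2j * cmath.pi * r * k0 / n) for r in range(4)]
        # Radix-4 butterfly: y_q = sum_r (-j)^(r*q) * a_r.
        out[k0] = a[0] + a[1] + a[2] + a[3]
        out[k0 + quarter] = a[0] - 1j * a[1] - a[2] + 1j * a[3]
        out[k0 + 2 * quarter] = a[0] - a[1] + a[2] - a[3]
        out[k0 + 3 * quarter] = a[0] + 1j * a[1] - a[2] - 1j * a[3]
    return out


def radix4_butterflies(n):
    """Butterfly count for an n-point radix-4 FFT: log4(n) stages of n/4 each."""
    stages, m = 0, n
    while m > 1:
        m //= 4
        stages += 1
    return stages * (n // 4)


cycles = radix4_butterflies(1024)   # 1280, if one butterfly completes per cycle
print(cycles / 127e6 * 1e6)         # about 10.08 microseconds
```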
Increasing FPGA resilience against soft errors using task duplication
G. Chen,
F. Li,
M. Kandemir,
I. Demirkiran
Pages: 924-927
doi>10.1145/1120725.1121071

Reconfigurable computing systems are becoming increasingly widespread because they offer the flexibility of programmable systems while approaching the performance of ASICs. While prior research on FPGAs mainly studied performance, power and area optimization, reliability-related issues have not received much attention. However, with increasing soft error rates, providing resilience to soft errors in FPGA-based embedded platforms is becoming an increasingly important issue. This paper proposes an OS-directed task duplication scheme that increases reliability by providing resilience against soft errors. The idea is to exploit the unused portions of the FPGA area to schedule duplicates of active tasks. The outputs of the primary and duplicate tasks are compared to detect soft errors.

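A minimal sketch of the two ideas in the abstract: filling unused FPGA area with duplicates, and comparing primary and duplicate outputs. The greedy smallest-first placement policy, the area units and both function names are hypothetical assumptions, not the paper's OS-directed scheme. A soft error is modeled as a single flipped result bit in the primary copy.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def pick_duplicates(task_areas: Dict[str, int], capacity: int) -> List[str]:
    """Greedily duplicate tasks (smallest first) into the area left free by primaries."""
    free = capacity - sum(task_areas.values())
    chosen = []
    for name, area in sorted(task_areas.items(), key=lambda kv: kv[1]):
        if area <= free:
            chosen.append(name)
            free -= area
    return chosen


def run_with_duplicate(task: Callable[[Sequence[int]], int], inputs: Sequence[int],
                       flip_bit: Optional[int] = None) -> Tuple[int, bool]:
    """Run primary and duplicate copies; return (primary result, mismatch detected)."""
    primary = task(inputs)
    duplicate = task(inputs)
    if flip_bit is not None:
        primary ^= 1 << flip_bit      # model a single-event upset in the primary copy
    return primary, primary != duplicate


print(pick_duplicates({"fir": 30, "fft": 50, "aes": 20}, capacity=140))  # ['aes']
print(run_with_duplicate(sum, [1, 2, 3], flip_bit=0))                    # (7, True)
```

Comparison alone detects a mismatch but cannot tell which copy is wrong; recovery needs re-execution or a third copy.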
Automatic extraction of function bodies from software binaries
Gaurav Mittal,
David Zaretsky,
Gokhan Memik,
Prith Banerjee
Pages: 928-931
doi>10.1145/1120725.1121072

This paper describes a method for automatically extracting function bodies from linked software binaries. It uses procedure-calling conventions along with limited control- and data-flow information. It has been tested on the TI C6000 DSP processor platform. Results are reported for eight benchmarks, for which our algorithm successfully identifies all functions. It identifies 198% more functions than the use of procedure-calling conventions alone.

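A toy sketch of the calling-convention idea: direct call targets mark function entries, and a small control-flow step trims padding after each function's last return. The instruction format (address, mnemonic, call target), the `extract_functions` name and the trimming rule are all hypothetical simplifications; real C6000 binaries, indirect calls and the paper's data-flow analysis are not modeled.

```python
def extract_functions(instrs, entry):
    """Split a linked instruction stream into function bodies.

    instrs: list of (address, mnemonic, call_target) tuples sorted by address;
    call_target is None for non-call instructions (hypothetical format).
    """
    # Calling-convention assumption: every direct call target starts a function.
    starts = {entry} | {t for _, op, t in instrs if op == "call" and t is not None}
    ordered = sorted(starts)
    bodies = {}
    for idx, start in enumerate(ordered):
        end = ordered[idx + 1] if idx + 1 < len(ordered) else None
        body = [(a, op) for a, op, _ in instrs if a >= start and (end is None or a < end)]
        # Limited control-flow step: drop padding that follows the last return.
        last_ret = max((i for i, (_, op) in enumerate(body) if op == "ret"), default=None)
        if last_ret is not None:
            body = body[: last_ret + 1]
        bodies[start] = [a for a, _ in body]
    return bodies


program = [(0, "mov", None), (1, "call", 5), (2, "call", 8), (3, "ret", None),
           (4, "nop", None), (5, "add", None), (6, "ret", None), (7, "nop", None),
           (8, "mul", None), (9, "ret", None)]
print(extract_functions(program, entry=0))  # {0: [0, 1, 2, 3], 5: [5, 6], 8: [8, 9]}
```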
SESSION: (Special session) EDA market in China

Panel III: EDA market in China
David Chen,
Nancy Wu,
Wayne Dai,
Jun Tan,
Weiping Liu,
Hao Min,
Jian-yue Pan
Pages: 1-1
doi>10.1145/1120725.1121074

SESSION: Poster session I

Modeling SystemC design in UML and automatic code generation
Chen Xi,
Lu JianHua,
Zhou ZuCheng,
Shang YaoHui
Pages: 932-935
doi>10.1145/1120725.1120760

The combination of the Unified Modeling Language (UML) and SystemC has led to an object-oriented, high-level design automation methodology. In this paper, a novel bidirectional UML-SystemC translation tool, UMLSC, is proposed. Specifically, a set of principles for modeling SystemC designs in UML and an algorithm for bidirectional UML-SystemC translation are presented. The principles and the algorithm are integrated into UMLSC, which provides a smooth link between visual specification, implementation and verification. An implementation example is given to verify the effectiveness of the proposed principles and algorithm.

Enabling RTOS simulation modeling in a system level design language
M. AbdElSalam Hassan,
Keishi Sakanushi,
Yoshinori Takeuchi,
Masaharu Imai
Pages: 936-939
doi>10.1145/1120725.1120761

In this paper, we propose a new process definition (T-THREAD) and an extension to the existing SystemC simulation engine (SIM_API library) to capture the real-time aspects of RTOS simulation models in a system-level design language (SLDL) such as SystemC. We describe the execution semantics of this process and show how it works in a complete embedded system simulation model.

A system-level framework for evaluating area/performance/power trade-offs of VLIW-based embedded systems
Giuseppe Ascia,
Vincenzo Catania,
Maurizio Palesi,
Davide Patti
Pages: 940-943
doi>10.1145/1120725.1120762

Architectures based on Very Long Instruction Word (VLIW) processors have found fertile ground in multimedia electronic appliances thanks to their ability to exploit high degrees of Instruction Level Parallelism (ILP) with a reasonable trade-off in complexity and silicon cost. In this context, Application Specific Instruction-set Processor (ASIP) specialization may require not only manipulation of the instruction set but also tuning of the architectural parameters of the processor (e.g., the number and type of functional units and register files) and of the memory subsystem (cache size, associativity, etc.). Setting these parameters to optimize certain metrics requires efficient Design Space Exploration (DSE) strategies, simulation tools (retargetable compilers and simulators) and accurate estimation models operating at a high level of abstraction. In this paper we present a framework for evaluating, in terms of performance, cost and power consumption, a system based on a parameterized VLIW microprocessor together with its memory hierarchy subsystem, following the execution of a specific application. The framework, which can be freely downloaded from the Internet, implements a number of multi-objective DSE strategies to obtain Pareto-optimal configurations of the system.

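To make "Pareto-optimal configurations" concrete, here is a minimal exhaustive Pareto filter over already-evaluated configurations, with every metric minimized. The metric tuple (execution time, area, power), the configuration names and the numbers are hypothetical. The framework's own DSE strategies exist precisely to avoid simulating every point; this sketch shows only the final dominance test.

```python
def dominates(a, b):
    """True if metrics a are no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(configs):
    """Names of configurations not dominated by any other (all metrics minimized)."""
    return sorted(
        name for name, m in configs.items()
        if not any(dominates(other, m) for o, other in configs.items() if o != name)
    )


# Hypothetical (execution time, area, power) results for four VLIW/cache settings.
configs = {"A": (10, 4, 3), "B": (8, 6, 3), "C": (12, 6, 4), "D": (9, 5, 5)}
print(pareto_front(configs))  # ['A', 'B', 'D']; C is dominated by both A and B
```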
Multi-metric and multi-entity characterization of applications for early system design exploration
Lukai Cai,
Andreas Gerstlauer,
Daniel Gajski
Pages: 944-947
doi>10.1145/1120725.1120763

At the system level, intensive analysis of the system application yields a variety of useful characteristics and gives designers valuable guidance for exploration. In this paper, we present such an analysis approach based on instrumentation-based profiling. The proposed approach analyzes complex system applications and generates multi-metric and multi-entity characteristics. Experimental results show the applicability of the approach to efficient early design space exploration.

An integrated performance and power model for superscalar processor designs
Yongxin Zhu,
Weng-Fai Wong,
Ştefan Andrei
Pages: 948-951
doi>10.1145/1120725.1120764

In current superscalar processors, designers cannot decouple performance and power issues, and extensive simulations are usually required to meet both power and performance constraints. This paper describes an integrated analytical model of performance and power. The model's performance and power results are in good agreement with detailed simulations, previous models and physically measured results. For designers, the model enables quick and flexible exploration of a subset, or even the entirety, of a huge parameter space comprising more than 15 workload and architectural parameters plus leakage power, feature size, clock and voltage.

Hierarchical task scheduler for interleaving subtasks on heterogeneous multiprocessor platforms
Zhe Ma,
Francky Catthoor,
Johan Vounckx
Pages: 952-955
doi>10.1145/1120725.1120765

Systems-on-a-chip (SoCs) now integrate more and more processors onto a single chip, and applications likewise consist of multiple (sub)tasks, expressed as separate source code, that can be partly executed concurrently. However, the subtask-level parallelism inside a single task is often too limited to fully utilize all the parallel processors, leaving many slacks on the processors. To make better use of the processors, subtasks of multiple tasks have to be executed in an interleaved fashion. This paper proposes design-time algorithms that interleave subtasks based on the separate schedules of the individual tasks. This interleaver can be considered part of a hierarchical scheduler that steers the code generation of very complex applications with many tasks. The scheduling experiments show that execution time can be shortened by 20%-30% when two tasks are interleaved, compared with sequential execution without subtask interleaving. Moreover, the solutions given by our scheduling algorithm are within 6% of the optimal solutions for up to 20 subtasks.

A flexible framework for communication evaluation in SoC design
Praveen Kalla,
Xiaobo Sharon Hu,
Jörg Henkel
Pages: 956-959
doi>10.1145/1120725.1120766

We present SoCExplore, a framework for fast communication-centric design space exploration of complex SoCs with network-based interconnects. Exploration is sped up by abstracting computation as a high-level trace, while accuracy is maintained through cycle-accurate interconnect simulation. The flexibility offered allows fast partition/mapping and interconnect design space exploration. Error analysis of such frameworks is non-trivial and is presented here for the first time. As a case study, a speed-up of 94% over architectural simulation is reported for the MPEG application.

Feasibility analysis of messages for on-chip networks using wormhole routing
Zhonghai Lu,
Axel Jantsch,
Ingo Sander
Pages: 960-964
doi>10.1145/1120725.1120767

The feasibility of a message in a network concerns whether its timing property can be satisfied without preventing any message already in the network from meeting its own timing property. We present a novel feasibility analysis for real-time (RT) and non-real-time (NT) messages in wormhole-routed networks on chip. For RT messages, we formulate a contention tree that captures contentions in the network. For coexisting RT and NT messages, we propose a simple bandwidth partitioning method that allows us to analyze their feasibility independently.

A clustering technique to optimize hardware/software synchronization
Junyu Peng,
Samar Abdi,
Daniel Gajski
Pages: 965-968
doi>10.1145/1120725.1120768

In this paper we present a scheme for reducing the synchronization overhead needed between components after HW/SW partitioning to preserve the original control flow of the specification. Since traffic between components is expensive, our scheme can significantly enhance the performance of the system implementation. Our optimization technique dynamically groups the tasks in the specification so that synchronization for different tasks can be shared. The grouping depends on the partitioning decision and is therefore performed during the generation of the partitioned model. We apply our grouping algorithm to various partitions of system-level models of industry-standard designs. The experimental results show a significant reduction in synchronization overhead compared with the unoptimized model.

Using abstract CPU subsystem simulation model for high level HW/SW architecture exploration
Aimen Bouchhima,
Iuliana Bacivarov,
Wassim Youssef,
Marius Bonaciu,
Ahmed A. Jerraya
Pages: 969-972
doi>10.1145/1120725.1120769

Current and future SoCs will contain an increasing number of heterogeneous multiprocessor subsystems combined with a complex communication architecture to meet flexibility, performance and cost constraints. Early validation of such complex MP-SoC architectures is a key enabler for managing this complexity and thus for enhancing design productivity. In this paper, we describe an abstract, high-level CPU subsystem model that captures the specific features of such MP-SoC architectures, along with a timed co-simulation environment for early exploration of the entire HW/SW design. The model is based on the Hardware Abstraction Layer (HAL) concept, allowing the validation of complex applications written on top of real-life operating systems. Experiments with an MPEG4 application demonstrate the value of the proposed methodology.

On combining iteration space tiling with data space tiling for scratch-pad memory systems
Chunhui Zhang,
Fadi Kurdahi
Pages: 973-976
doi>10.1145/1120725.1120770

Most previous studies on tiling concentrate only on the iteration space for cache-based memory systems. However, more and more real-time embedded systems are adopting Scratch-Pad Memories (SPMs), which emphasize the management of data flow through data-oriented tiling. In this paper, we analyze the relationships between the iteration space I and the data space D and propose a preliminary classification based on subscript functions. An important real-life application, matrix multiplication, is selected to illustrate how we combine mismatched iteration space tiling with data space tiling to obtain optimal solutions.

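As background for the matrix-multiply case study, the sketch below shows plain iteration-space tiling. Each (ii, kk, jj) step touches one tile of A, B and C, which is the working set an SPM-based system would copy in. The tile-sizing rule assumes all three T x T tiles must fit in the SPM at once, a simplifying assumption rather than the paper's combined iteration/data-space method. Both function names are hypothetical.

```python
def matmul_tiled(a, b, tile):
    """C = A * B computed tile by tile; the matrices need not be multiples of tile."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # Working set for this step: A[ii.., kk..], B[kk.., jj..], C[ii.., jj..].
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c


def largest_tile(spm_bytes, elem_bytes=4):
    """Largest T such that three T x T tiles fit in the scratch-pad (assumed rule)."""
    t = 1
    while 3 * (t + 1) ** 2 * elem_bytes <= spm_bytes:
        t += 1
    return t


print(largest_tile(4096))  # 18 for a 4 KB SPM with 4-byte elements
```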
REMIC: design of a reactive embedded microprocessor core
Zoran Salcic,
Dong Hui,
Partha Roop,
Morteza Biglari-Abhari
Pages: 977-981
doi>10.1145/1120725.1120771

Reactivity to external events is an important feature of almost all embedded systems. In this paper we present the design of a new reactive embedded microprocessor, REMIC, which supports reactivity in a new way, following the paradigm of the synchronous system-level language Esterel. The rationale for the REMIC design, its novel features, design details and some performance figures are presented to demonstrate its suitability for embedded systems. Besides single-processor systems, REMIC can easily be combined into multiprocessor architectures that support real concurrency.

Online hardware/software partitioning in networked embedded systems
Thilo Streichert,
Christian Haubelt,
Jürgen Teich
Pages: 982-985
doi>10.1145/1120725.1120772

Today's embedded systems are typically distributed and increasingly confronted with time-varying demands. Existing methodologies that optimize the partitioning of computational tasks into hardware (HW) and software (SW) at compile time become obsolete or inefficient in this context, because the optimal use of existing resources cannot be foreseen. Here, we investigate a discrete iterative algorithm that balances the load of a HW/SW partition online: when computational demands change, the system dynamically assigns tasks to reconfigurable HW or SW resources and migrates tasks to other nodes if necessary. For this purpose, an Evolutionary Algorithm combined with a discrete version of a diffusion algorithm is presented. For the diffusion algorithm, we show theoretically and experimentally that our version is run-time optimal in a linear number of steps.

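For readers unfamiliar with diffusion load balancing, here is a generic first-order discrete diffusion step. In each synchronous round, every edge moves floor((l_i - l_j) / (d_max + 1)) units of load from the heavier to the lighter node. The diffusion parameter 1/(d_max + 1), the integer "load units" and the `diffuse` name are assumptions. This is not the authors' discrete variant, and neither their optimality result nor the Evolutionary Algorithm is reproduced.

```python
def diffuse(loads, edges, max_rounds=1000):
    """Synchronous first-order discrete diffusion on an undirected graph.

    loads: integer load per node; edges: list of (i, j) pairs.
    Total load is conserved and loads never become negative.
    """
    loads = list(loads)
    deg = [0] * len(loads)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    div = max(deg) + 1                       # alpha = 1 / (max degree + 1)
    for _ in range(max_rounds):
        delta = [0] * len(loads)
        moved = False
        for i, j in edges:                   # flows use loads from the start of the round
            amt = abs(loads[i] - loads[j]) // div
            if amt:
                src, dst = (i, j) if loads[i] > loads[j] else (j, i)
                delta[src] -= amt
                delta[dst] += amt
                moved = True
        if not moved:                        # no edge can move a whole unit any more
            break
        loads = [l + d for l, d in zip(loads, delta)]
    return loads


print(diffuse([9, 0, 0], [(0, 1), (1, 2)]))  # [5, 3, 1]: small residual imbalance
```

The residual imbalance on the path graph illustrates the rounding effect that discrete diffusion variants have to handle.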
Comparing high-level modeling approaches for embedded system design
Lisane Brisolara,
Leandro Becker,
Luigi Carro,
Flávio Wagner,
Carlos E. Pereira,
Ricardo Reis
Pages: 986-989
doi>10.1145/1120725.1120773

This paper presents a comparison of three different high-level modeling approaches for embedded systems design, focusing on systems that require dataflow models. The proposed evaluation investigates the facilities these approaches provide for expressing system requirements, functional specifications and timing constraints. Properties such as model readability, testability and implementability are also considered, as is the support for different Models of Computation. A Crane Control System is used as a case study to apply the proposed comparison criteria.

Deriving a new efficient algorithm for min-period retiming
Hai Zhou
Pages: 990-993
doi>10.1145/1120725.1120774

A new efficient algorithm for the minimal-period retiming problem is derived by formal methods. Unlike all previous algorithms, which used binary search to check feasibility over a range of candidate periods, the derived algorithm checks the optimality of the current period directly. It is much simpler and more efficient than previous algorithms. Experimental results show that it is even faster than ASTRA, an efficient heuristic algorithm. Since the derived algorithm is incremental by nature, it can also be combined with other optimization techniques.

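As background, the sketch below shows the quantities a min-period retiming algorithm manipulates, in the standard Leiserson-Saxe formulation. A circuit is a graph with gate delays d(v) and register counts w(u, v) on edges. A retiming r changes edge weights to w(u, v) + r(v) - r(u). The clock period is the largest delay along a register-free path. This is not Zhou's derived algorithm, and the three-gate ring is a hypothetical example.

```python
from collections import defaultdict


def clock_period(delay, edges):
    """Max total gate delay along any path of zero-register edges.

    delay: {vertex: d(v)}; edges: list of (u, v, w) with w registers on u -> v.
    """
    succ = defaultdict(list)
    indeg = {v: 0 for v in delay}
    for u, v, w in edges:
        if w == 0:
            succ[u].append(v)
            indeg[v] += 1
    arrival = dict(delay)                    # each gate contributes its own delay
    ready = [v for v in delay if indeg[v] == 0]
    seen = 0
    while ready:                             # Kahn topological order
        u = ready.pop()
        seen += 1
        for v in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + delay[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if seen != len(delay):
        raise ValueError("register-free cycle: not a legal synchronous circuit")
    return max(arrival.values())


def retime(edges, r):
    """Apply retiming r: w_r(u, v) = w(u, v) + r(v) - r(u); weights must stay >= 0."""
    out = []
    for u, v, w in edges:
        wr = w + r[v] - r[u]
        if wr < 0:
            raise ValueError(f"illegal retiming on edge {u}->{v}")
        out.append((u, v, wr))
    return out


delay = {"a": 3, "b": 3, "c": 3}
ring = [("a", "b", 0), ("b", "c", 0), ("c", "a", 2)]
print(clock_period(delay, ring))                                   # 9
print(clock_period(delay, retime(ring, {"a": 0, "b": 0, "c": 1})))  # 6
```

With only two registers on a three-edge cycle, one edge must stay register-free, so 6 is the best achievable period for this ring.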
K-disjointness paradigm with application to symmetry detection for incompletely specified functions
Kuo-Hua Wang,
Jia-Hung Chen
Pages: 994-997
doi>10.1145/1120725.1120775

In this paper, we propose a K-disjointness paradigm that can effectively find all pairs of minterms with Hamming distance K between two Boolean functions. We relate this paradigm to the symmetry detection problem and propose an efficient symmetry detection algorithm for Boolean functions. Our algorithm handles not only completely specified functions but also incompletely specified functions. Experimental results on a set of MCNC and ISCAS benchmark circuits show that our algorithm is very effective and efficient in detecting symmetries of large Boolean functions.

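The link between Hamming distance and symmetry can be seen in a brute-force check. f is symmetric in (x_i, x_j) exactly when every minterm with x_i = 1, x_j = 0 agrees with its partner obtained by swapping those two bits, and that partner is at Hamming distance 2. With don't cares (None), a pair conflicts only when both values are specified and differ. This truth-table sketch is an illustration, not the paper's K-disjointness algorithm, and its function names are hypothetical.

```python
def symmetric_in(f, n, i, j):
    """f: list of length 2**n with values 0, 1 or None (don't care).

    Bit i of the list index is the value of input x_i.
    """
    for m in range(1 << n):
        if (m >> i) & 1 == 1 and (m >> j) & 1 == 0:
            partner = m ^ (1 << i) ^ (1 << j)   # swap x_i and x_j: Hamming distance 2
            a, b = f[m], f[partner]
            if a is not None and b is not None and a != b:
                return False
    return True


def symmetric_pairs(f, n):
    return [(i, j) for i in range(n) for j in range(i + 1, n) if symmetric_in(f, n, i, j)]


# f = x0 x1 + x2 is symmetric only in (x0, x1).
f = [((m & 1) & ((m >> 1) & 1)) | ((m >> 2) & 1) for m in range(8)]
print(symmetric_pairs(f, 3))                  # [(0, 1)]
print(symmetric_in([0, 1, None, 1], 2, 0, 1))  # True: the don't care can be set to 1
```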
Logic optimization using rule-based randomized search
Petra Färm,
Elena Dubrova,
Andreas Kuehlmann
Pages: 998-1001
doi>10.1145/1120725.1120776

In this paper we describe a new logic synthesis approach based on rule-based randomized search using simulated annealing. Our work is motivated by two observations: (1) Traditional logic synthesis uses literal count as the primary quality metric during the technology-independent optimization phase. This simplistic metric often leads to poor circuit structures, as it cannot foresee the impact of early choices on final area, delay, power consumption, etc. (2) Although powerful, global Boolean optimization is not robust, and the corresponding algorithms cannot be used in practice without artificially restricting the application window. Other techniques, such as algebraic methods, scale well but provide weaker optimization power. To address both problems, we use a randomized search based on a simple circuit graph representation and a complete set of local transformations that include algebraic and Boolean optimization steps. The objective of the search process can be tuned to complex cost functions combining area, timing, routability and power. Our experimental results on benchmark functions demonstrate the significant potential of the presented approach.

Fast synthesis of exact minimal reversible circuits using group theory
Guowu Yang,
Xiaoyu Song,
William N. N. Hung,
Marek A. Perkowski
Pages: 1002-1005
doi>10.1145/1120725.1120777

We present fast algorithms to synthesize exact minimal reversible circuits for various types of gates and costs. By reducing reversible logic synthesis problems to group theory problems, we use the powerful algebraic software GAP to solve them. Our algorithms not only minimize arbitrary gate cost functions but are also orders of magnitude faster than existing approaches to reversible logic synthesis. In addition, we show that the Peres gate is a better choice than the standard Toffoli gate in libraries of universal reversible gates.

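The group-theoretic view treats every n-bit reversible gate as a permutation of the 2^n basis states, so a circuit is a composition of permutations. The small sketch below encodes 3-bit gates this way and checks that one Peres gate, (a, b, c) -> (a, a XOR b, c XOR ab), equals a Toffoli gate followed by a CNOT. This is one intuition for why a library containing Peres gates can yield shorter circuits. It does not use GAP or reproduce the paper's minimization, and the helper names are hypothetical.

```python
def as_perm(gate):
    """Encode a 3-bit reversible gate as the tuple of images of states 0..7 (a is the MSB)."""
    perm = []
    for s in range(8):
        a, b, c = (s >> 2) & 1, (s >> 1) & 1, s & 1
        a2, b2, c2 = gate(a, b, c)
        perm.append((a2 << 2) | (b2 << 1) | c2)
    return tuple(perm)


def compose(p, q):
    """Permutation that applies p first, then q."""
    return tuple(q[p[s]] for s in range(len(p)))


toffoli = as_perm(lambda a, b, c: (a, b, c ^ (a & b)))
cnot_ab = as_perm(lambda a, b, c: (a, a ^ b, c))
peres = as_perm(lambda a, b, c: (a, a ^ b, c ^ (a & b)))

print(compose(toffoli, cnot_ab) == peres)  # True: one Peres gate = Toffoli, then CNOT
```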
Design and design automation of rectification logic for engineering change
Cheng-Hung Lin,
Yung-Chang Huang,
Shih-Chieh Chang,
Wen-Ben Jone
Pages: 1006-1009
doi>10.1145/1120725.1120778

In the later stages of VLSI design, a design implementation often has to be modified to accommodate a new specification, correct design errors, or meet design constraints. Besides meeting the design schedule for the new implementation, reducing mask cost has become very critical. In this paper, we propose a new method that adds a programmable rectification module to reduce mask cost and improve turnaround time. When a modification is needed, the rectification module can be programmed to realize the new implementation. The rectification module can be implemented as a mask-programmable gate array or an embedded FPGA. To reduce the size of the rectification module, we also propose algorithms that intelligently select some internal signals of the old implementation to become pseudo primary inputs and primary outputs. Our experimental results are very encouraging.

Power minimization for dynamic PLAs
Tzyy-Kuen Tien,
Chih-Shen Tsai,
Shih-Chieh Chang,
Chingwei Yeh
Pages: 1010-1013
doi>10.1145/1120725.1120779

In this paper, we propose a new dynamic PLA structure that incorporates super product lines. A super product line adds NAND functionality on top of the NOR structure, thus lowering the switching activity in the product lines as well as power consumption. Since there are many candidates for super product lines, we have developed a CAD algorithm based on maximum weighted matching to find the optimal solution. The post-simulation results show a significant reduction in power consumption: over 18 circuits, power consumption is reduced by 58.9% on average, with a delay overhead of only 1.6%.

Integrated algorithmic logical and physical design of integer multiplier |
| |
Shuo Zhou,
Bo Yao,
Jian-Hua Liu,
Chung-Kuan Cheng
Pages: 1014-1017
doi>10.1145/1120725.1120780

This paper presents an integrated methodology for high-performance integer multiplier design that combines algorithmic partial product generation, logic synthesis, and physical layout into a unified process. The interconnect delay, which dominates the performance of a multiplier, is thoroughly considered in this integration. Special structures in the multiplier are exploited to reduce the high complexity of the holistic approach. Compared with multipliers generated by a state-of-the-art tool, our results improve timing by 11% for a 16-bit multiplier and by 7.5% for a 32-bit multiplier.
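The partial product generation and reduction that the methodology integrates can be illustrated at word level. This sketch assumes a plain AND-array partial product generator (no Booth recoding) and a Wallace-style tree of 3:2 carry-save compressors; the function names are hypothetical, and it models only the arithmetic, not the interconnect-aware layout that the paper optimizes.

```python
def partial_products(a, b, width):
    """AND-array generation: row i is a << i when bit i of b is 1."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]


def csa(x, y, z):
    """3:2 carry-save compressor: a row of full adders applied bitwise."""
    s = x ^ y ^ z                                  # sum bits
    c = ((x & y) | (x & z) | (y & z)) << 1         # carry bits, weighted one place up
    return s, c


def wallace_multiply(a, b, width):
    """Reduce the partial products to two rows, then do one final addition.

    Returns (product, number_of_csa_levels)."""
    rows = partial_products(a, b, width)
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3           # rows that form complete triples
        nxt = []
        for k in range(0, full, 3):
            nxt.extend(csa(rows[k], rows[k + 1], rows[k + 2]))
        nxt.extend(rows[full:])                    # leftover rows pass through
        rows = nxt
        levels += 1
    return sum(rows), levels                       # final carry-propagate add


print(wallace_multiply(13, 11, 4))  # (143, 2)
```

For a 16-bit operand the tree needs six compressor levels; in a physical implementation, how these levels are wired is exactly where interconnect delay enters.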

Arrival time aware scheduling to minimize clock cycle length
R. Ruiz-Sautua,
M. C. Molina,
J. M. Mendías,
R. Hermida
Pages: 1018-1021
doi>10.1145/1120725.1120781

Conventional scheduling algorithms usually adjust the clock cycle duration to the execution time of the longest operations. This results in large slack times wasted in cycles that contain faster operations. To reduce this wasted time, multicycle and chaining techniques have been employed. The scheduling algorithm presented in this paper goes one step further: it breaks up some of the specification operations and schedules several data-dependent operation fragments in the same cycle. As a consequence, some specification operations are executed over several cycles (not necessarily consecutive ones), and some result bits are calculated in every execution cycle. Thus the execution of an operation may start even if its predecessors have not yet finished. In the experiments carried out, the proposed algorithm improves circuit performance by more than 70% on average, with slight increases in datapath area.
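Bit-level fragmentation can be illustrated with two chained additions, t = a + b followed by r = t + c, each split into a low and a high 8-bit fragment. In this hypothetical two-cycle schedule, the low fragment of the second addition runs in the same cycle as the low fragment of the first, before the first addition has produced its high bits; only carries cross the cycle boundary. The fragment width, schedule, and function names are assumptions for illustration, not the paper's algorithm.

```python
def add_fragment(x, y, carry_in, lo, width):
    """Add bits [lo, lo + width) of x and y plus carry_in.

    Returns (fragment_sum, carry_out); fragment_sum is right-aligned."""
    mask = (1 << width) - 1
    s = ((x >> lo) & mask) + ((y >> lo) & mask) + carry_in
    return s & mask, s >> width


def chained_add3(a, b, c, n=16, frag=8):
    """Compute (a + b + c) mod 2**n as two chained, fragmented additions."""
    hi_w = n - frag
    # Cycle 1: low fragment of t = a + b, then (chained) low fragment of r = t + c
    t_lo, t_carry = add_fragment(a, b, 0, 0, frag)
    r_lo, r_carry = add_fragment(t_lo, c, 0, 0, frag)
    # Cycle 2: high fragments, consuming only the carries produced in cycle 1
    t_hi, _ = add_fragment(a, b, t_carry, frag, hi_w)
    r_hi, _ = add_fragment(t_hi << frag, c, r_carry, frag, hi_w)
    return (r_hi << frag) | r_lo


print(chained_add3(0x1234, 0xABCD, 0x0F0F))  # 52496 == 0xCD10
```

Each cycle now contains two 8-bit additions instead of one 16-bit addition, which is the kind of slack reduction the abstract describes.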

Efficient synthesis of speed-independent combinational logic circuits
W. B. Toms,
D. A. Edwards
Pages: 1022-1026
doi>10.1145/1120725.1120782

Speed-independent synthesis of combinational logic datapath circuits using tools such as Petrify is often inefficient or infeasible, because such circuits typically contain many concurrent inputs and independent outputs. This paper presents a practical method for generating arbitrary combinational logic circuits using a subclass of speed-independent circuits known as Strongly-Indicating circuits, without the need to verify the speed-independence of the implementation by constructing a state graph or by other means.

A practical cut-based physical retiming algorithm for field programmable gate arrays
Peter Suaris,
Dongsheng Wang,
Nan-Chi Chou
Pages: 1027-1030
doi>10.1145/1120725.1120783

This paper presents a heuristic cut-based retiming algorithm for FPGA designs. It handles complex retiming constraints, including timing, architectural, and structural constraints; improves retimeability by incorporating logic resynthesis; and integrates efficiently with incremental placement. The algorithm thus improves timing compliance by allowing groups of registers to be rapidly retimed across blocks of combinational logic in the physical domain without violating any of these constraints. Experiments show that the algorithm improves the performance of FPGA designs by 16% on average, while achieving a 61.7% runtime speedup compared with classic retiming algorithms.
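For reference, the classic Leiserson-Saxe formulation that such comparisons are usually made against assigns each vertex an integer lag r(v); an edge u -> v carrying w registers then carries w + r(v) - r(u) registers, and a retiming is legal only if no edge ends up with a negative count. The sketch below applies and checks a given retiming; it does not search for one, and it is not the paper's cut-based heuristic. The function name is hypothetical.

```python
def retime(edges, lag):
    """Apply a retiming to a circuit graph.

    edges -- list of (u, v, w): w registers on the connection u -> v
    lag   -- dict vertex -> integer r(v)
    Returns the retimed edge list; raises ValueError if any weight goes negative.
    """
    out = []
    for u, v, w in edges:
        w_r = w + lag[v] - lag[u]          # Leiserson-Saxe: w_r(e) = w(e) + r(v) - r(u)
        if w_r < 0:
            raise ValueError(f"illegal retiming: edge {u}->{v} would get {w_r} registers")
        out.append((u, v, w_r))
    return out


# A 3-vertex loop with two registers; r(b) = -1 moves one register forward across b
loop = [("a", "b", 1), ("b", "c", 0), ("c", "a", 1)]
print(retime(loop, {"a": 0, "b": -1, "c": 0}))  # [('a', 'b', 0), ('b', 'c', 1), ('c', 'a', 1)]
```

Because the lags telescope around any cycle, the total register count on every loop is unchanged; only the register positions move.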

BDD-based two variable sharing extraction
Dennis Wu,
Jianwen Zhu
Pages: 1031-1034
doi>10.1145/1120725.1120784

Binary Decision Diagram (BDD) based logic synthesis has been shown to run faster than classic logic synthesis systems based on the Sum-of-Products (SOP) form. However, its synthesis quality has not been on par with the classic method because it lacks an effective sharing extraction strategy. In this paper, we present the first sharing extraction algorithm that directly exploits the structural properties of BDDs. Although our algorithm is limited to two-variable, disjunctive factors and may therefore miss sharing opportunities, we show that it can be made exact, incremental, and polynomial.

SESSION: Poster session II

Supporting sequential assumptions in hybrid verification
Ed Cerny,
Ashvin Dsouza,
Kevin Harer,
Pei-Hsin Ho,
Tony Ma
Pages: 1035-1038
doi>10.1145/1120725.1120818

We present a method for using a set of temporal properties (SVA, PSL, OVA, RTL monitors) as environment models for industrial-strength hybrid verification that combines formal methods with constrained random simulation. We demonstrate the effectiveness of the method on real-world designs.

Automatic functional test program generation for microprocessor verification
Tun Li,
Dan Zhu,
Lei Liang,
Yang Guo,
SiKun Li
Pages: 1039-1042
doi>10.1145/1120725.1120819

This paper presents a novel specification-driven, constraint-solving-based method for automatically generating test programs, from simple to complex, for advanced microprocessors. Our microprocessor architectural automatic test program generator (MA2TG) can produce not only random test programs but also instruction sequences for a specific constraint given in a user constraints file. The proposed methodology makes three important contributions. First, it simplifies microprocessor architecture modeling and eases the adoption of architecture modifications through an architecture description language (ADL) specification. Second, it generates test programs for specific constraints using state-of-the-art constraint-solving techniques. Finally, it dramatically reduces both the number of test programs needed for microprocessor verification and the verification time. We applied the method to the DLX processor to illustrate the usefulness of our approach.

Forward symbolic model checking for real time systems
Georgios Logothetis
Pages: 1043-1046
doi>10.1145/1120725.1120820

Synchronous languages are widely used in industrial applications for the design and implementation of real-time embedded and reactive systems, and they are also well suited to real-time verification because they have clean formal semantics. In this paper, we focus on the real-time temporal logic JCTL, which can directly support the real-time formal verification of synchronous programs in early (high-level) as well as late (low-level) design stages, bridging industrial real-time descriptions and formal real-time verification. We extend the model-checking capabilities of JCTL by introducing new forward symbolic model-checking techniques, allowing JCTL to benefit from forward as well as traditional backward state traversal methods, and from their combination.
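The difference between forward and backward traversal can be shown with an explicit-state sketch. It uses Python sets in place of the BDD-based symbolic state sets a real model checker would use, and it answers a plain safety question (can a bad state be reached?) rather than evaluating JCTL's real-time operators, so treat it as an illustration of traversal direction only. All names and the transition relation are hypothetical.

```python
def forward_reach(init, succ):
    """Least fixpoint of image computation starting from the initial states."""
    reached, frontier = set(init), set(init)
    while frontier:
        image = {t for s in frontier for t in succ.get(s, ())}
        frontier = image - reached
        reached |= frontier
    return reached


def backward_reach(targets, succ):
    """Least fixpoint of pre-image computation starting from the target states."""
    pred = {}
    for s, ts in succ.items():
        for t in ts:
            pred.setdefault(t, set()).add(s)
    return forward_reach(targets, pred)      # forward search on the reversed relation


succ = {0: [1], 1: [2], 2: [0], 3: [4]}      # hypothetical transition relation
init, bad = {0}, {4}
safe_fwd = not (forward_reach(init, succ) & bad)
safe_bwd = not (backward_reach(bad, succ) & init)
print(safe_fwd, safe_bwd)  # True True
```

Both directions reach the same verdict; which one is cheaper depends on whether the reachable set or the set of states that can reach a bad state is smaller, which is one reason to support both and their combination.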

Validating the result of a Quantified Boolean Formula (QBF) solver: theory and practice
Yinlei Yu,
Sharad Malik
Pages: 1047-1051
doi>10.1145/1120725.1120821

Despite the increasing use of QBF solvers, current QBF solvers do not provide any mechanism to verify their results. This paper demonstrates a methodology for independently validating the results of a DLL-based QBF solver using the traces generated during the solving process. It also presents a mechanism to extract small unsatisfiable subformulas, called cores, from unsatisfiable QBF instances.

Priority directed test generation for functional verification using neural networks
Hao Shen,
Yuzhuo Fu
Pages: 1052-1055
doi>10.1145/1120725.1120822

Functional verification is the bottleneck in delivering today's highly integrated electronic systems and chips. Automatic pseudo-random test generation faces challenges in simulation time and computational resources, and this paper proposes a novel solution named Priority Directed test Generation (PDG). With PDG, each test vector that has not yet been simulated is assigned a priority attribute, which indicates the likelihood of detecting new bugs by simulating that vector. We show how to apply an Artificial Neural Network (ANN) learning algorithm to the PDG problem. Several experiments illustrate how PDG achieves better results.

Comparison of schemes for encoding unobservability in translation to SAT
Miroslav N. Velev
Pages: 1056-1059
doi>10.1145/1120725.1120823

Seven schemes for encoding the unobservability of logic blocks in Boolean-to-CNF translation are compared. Four of the schemes are based on merging logic blocks with adjacent gates toward the primary output. Two are based on using CNF unobservability variables to encode the unobservability of logic blocks. A hybrid scheme is also explored. Encoding the unobservability of logic blocks accelerated the SAT solving of Boolean formulas from the formal verification of complex microprocessors, while still allowing a conventional CNF-based SAT solver to be used. On unsatisfiable CNF formulas, the best strategy was merging logic blocks with adjacent gates on the only path from the block output to the primary output, yielding a speedup of up to 16x for CNF formulas with hundreds of thousands of variables, millions of clauses, and tens of millions of literals. Furthermore, this speedup is relative to an already very efficient Boolean-to-CNF translation. On satisfiable CNF formulas, the best strategy was merging logic blocks with leaf gates and with adjacent gates on the only path to the primary output, as well as exploiting the polarity of gates and logic blocks to reduce the number of their clauses. The presented optimizations are general and applicable to other classes of Boolean formulas.
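The conventional Boolean-to-CNF translation that these schemes start from can be sketched with the standard Tseitin clauses for AND and OR gates, using DIMACS-style signed integers as literals. This is the textbook baseline only; the seven unobservability encodings compared in the paper are not reproduced here, and the function names are hypothetical.

```python
from itertools import product


def and_gate_cnf(out, inputs):
    """Tseitin clauses for out <-> AND(inputs); literals are signed ints."""
    clauses = [[-out, x] for x in inputs]          # out implies every input
    clauses.append([out] + [-x for x in inputs])   # all inputs true implies out
    return clauses


def or_gate_cnf(out, inputs):
    """Tseitin clauses for out <-> OR(inputs)."""
    clauses = [[out, -x] for x in inputs]          # any input true implies out
    clauses.append([-out] + list(inputs))          # out implies some input
    return clauses


def satisfied(clauses, assign):
    """True if every clause has a literal made true by assign (var -> bool)."""
    return all(any(assign[abs(lit)] == (lit > 0) for lit in clause) for clause in clauses)


# Gate g4 = AND(x1, x2, x3) feeding primary output g5 = OR(g4, x1)
cnf = and_gate_cnf(4, [1, 2, 3]) + or_gate_cnf(5, [4, 1])

# The clauses hold exactly when g4 and g5 agree with the gate functions
for x in product([False, True], repeat=3):
    for g4, g5 in product([False, True], repeat=2):
        assign = {1: x[0], 2: x[1], 3: x[2], 4: g4, 5: g5}
        consistent = g4 == all(x) and g5 == (g4 or x[0])
        assert satisfied(cnf, assign) == consistent
```

Here g4 is unobservable at g5 whenever x1 is true; the baseline encoding still emits all of g4's clauses, which is the redundancy the compared schemes try to remove.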

Implication of assertion graphs in GSTE
Guowu Yang,
Jin Yang,
William N. N. Hung,
Xiaoyu Song
Pages: 1060-1063
doi>10.1145/1120725.1120824

We address the problem of implication of assertion graphs, which occur in generalized symbolic trajectory evaluation (GSTE). GSTE has demonstrated its power in the formal verification of digital systems. Assertion graphs are used for property and model specifications. We present a novel implication technique for assertion graphs that relies on direct Boolean reasoning on each edge (and vertex) of an assertion graph, thus avoiding the reachability computation in GSTE. We have successfully applied both model-based and language-based implication to real industrial circuits. Experimental results demonstrate the promising performance of our approach.

XTW, a parallel and distributed logic simulator
Qing XU,
Carl Tropper
Pages: 1064-1069
doi>10.1145/1120725.1120825

In this paper, a new event scheduling mechanism, XEQ, and a new rollback procedure, rb-messages, are proposed for use in optimistic logic simulation. We incorporate both techniques in a simulator, XTW. XTW groups LPs into clusters and uses a multi-level queue, XEQ, to schedule events within each cluster. XEQ has O(1) event scheduling time complexity. Our new rollback mechanism replaces anti-messages with an rb-message and eliminates the need for an output queue at each LP. Experimental comparisons with Time Warp show superior performance for XTW, and experimental results on large circuits (5 million to 25 million gates) show that XTW scales well with both circuit size and the number of processors.

Comprehensive frequency dependent interconnect extraction and evaluation methodology
Rong Jiang,
Charlie Chung-Ping Chen
Pages: 1070-1073
doi>10.1145/1120725.1120826

This paper presents a wide-frequency-range interconnect extraction and analysis methodology. First, an improved reluctance-based extraction algorithm is proposed to generate compact interconnect models at a set of sample frequencies. Then, DLSCF (Discrete Least Square Curve Fitting) techniques are employed to produce approximation polynomials that compute the parasitics at other frequencies. Finally, after converting these approximation polynomials into power series in s and substituting them into the MNA (Modified Nodal Analysis) formulation, we develop and apply the WIFRIM (Wide Frequency Range Interconnect Moment Matching) algorithm to calculate moments of arbitrary order. Since WIFRIM needs to decompose the sparse conductance matrix only once, it achieves significant speedup while keeping the error within 1%.
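The curve-fitting step can be sketched as a discrete least-squares polynomial fit solved through the normal equations. The sample values below are synthetic, the polynomial degree is an arbitrary choice, and the function names are hypothetical. With real frequencies in the GHz range the frequency axis should be normalized first, because the normal equations become badly conditioned as the powers of f grow.

```python
def polyfit_ls(xs, ys, degree):
    """Discrete least-squares fit; returns coefficients c0..c_degree."""
    m = degree + 1
    # Normal equations (A^T A) c = A^T y with A[k][i] = xs[k] ** i
    ata = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for r in range(col + 1, m):
            f = ata[r][col] / ata[col][col]
            for k in range(col, m):
                ata[r][k] -= f * ata[col][k]
            aty[r] -= f * aty[col]
    # Back substitution
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = aty[r] - sum(ata[r][k] * coeffs[k] for k in range(r + 1, m))
        coeffs[r] = s / ata[r][r]
    return coeffs


def polyval(coeffs, x):
    """Evaluate c0 + c1*x + c2*x**2 + ... at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


# Synthetic resistance samples at normalized frequencies, R = 1 + 2f + 0.5f^2
freqs = [0.0, 1.0, 2.0, 3.0, 4.0]
res = [1.0 + 2.0 * f + 0.5 * f * f for f in freqs]
coeffs = polyfit_ls(freqs, res, 2)
print(polyval(coeffs, 2.5))  # parasitic estimate between samples, about 9.125
```

Once the parasitics are polynomials in frequency, substituting f = s/(2*pi*j) gives the power-series form in s that the abstract feeds into MNA.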

On-chip thermal gradient analysis and temperature flattening for SoC design
Takashi Sato,
Junji Ichimiya,
Nobuto Ono,
Kotaro Hachiya,
Masanori Hashimoto
Pages: 1074-1077
doi>10.1145/1120725.1120827

This paper quantitatively analyzes the thermal gradient of SoCs and proposes a thermal flattening procedure. First, the impact of dominant parameters, such as the area occupancy of memory/logic, power density, and floorplan, on the thermal gradient and clock skew is studied. Two important results are obtained: 1) the maximum temperature difference increases with higher memory area occupancy, and 2) the difference is very sensitive to the floorplan. We then propose a procedure to reduce the thermal gradient. A slight floorplan modification using the proposed procedure significantly improves the on-chip thermal gradient.

Return path selection for loop RL extraction
Akira Tsuchiya,
Masanori Hashimoto,
Hidetoshi Onodera
Pages: 1078-1081
doi>10.1145/1120725.1120828

This paper proposes a systematic method for selecting the power/ground wires that should be considered in interconnect RL extraction. The return current distribution affects the loop characteristics of interconnects. To extract exact RL values, all return paths would have to be considered; however, this is impossible because LSIs contain a huge number of P/G wires. As more wires are considered, extraction accuracy improves but extraction cost increases undesirably. The proposed method focuses on the energy dissipated in P/G wires and uses it to screen return paths. Experimental results show that our method enables accurate and computationally efficient RL extraction that considers the return current distribution.
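Energy-based screening might look like the following sketch. The criterion used here (keep the P/G wires that together account for a chosen fraction of the total I^2 R dissipation, largest first) is a hypothetical stand-in, as are the wire names and the per-wire current and resistance estimates; the abstract does not state the exact screening rule. Over a fixed time window, I^2 R is proportional to the dissipated energy.

```python
def screen_return_paths(wires, keep_fraction=0.95):
    """Keep the largest-dissipation P/G wires until keep_fraction of the total is covered.

    wires -- dict name -> (estimated_current_A, resistance_ohm), hypothetical estimates
    Returns the names of the wires retained for loop RL extraction.
    """
    dissipation = {name: i * i * r for name, (i, r) in wires.items()}   # I^2 R per wire
    total = sum(dissipation.values())
    kept, covered = [], 0.0
    for name in sorted(dissipation, key=dissipation.get, reverse=True):
        if covered >= keep_fraction * total:
            break
        kept.append(name)
        covered += dissipation[name]
    return kept


wires = {"vss_1": (1.0, 1.0), "vdd_1": (0.5, 1.0), "vss_far": (0.1, 1.0)}
print(screen_return_paths(wires))  # ['vss_1', 'vdd_1']
```

The distant wire carries little return current, so dropping it barely changes the extracted loop RL while shrinking the extraction problem, which is the accuracy/cost tradeoff the abstract describes.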

Delay extraction based closed-form SPICE compatible passive macromodels for distributed transmission line interconnects
Natalie Nakhla,
Ram Achar,
Michel Nakhla,
Anestis Dounavis
Pages: 1082-1085
doi>10.1145/1120725.1120829

Time-domain macromodeling of high-speed interconnects characterized by distributed transmission lines has generated immense interest in recent years. It has been demonstrated that preserving the passivity of the macromodel is essential to guarantee a stable global transient simulation. In this paper, a SPICE-compatible, closed-form passive macromodel for distributed transmission lines is presented. The proposed method represents the distributed stamp in terms of simple delay and resistive elements. While guaranteeing the passivity of the macromodel, the new method provides significant speedup and is easy to implement. The necessary formulation and validation examples are given.

Vector extraction for average total power estimation
Yongjun Xu,
Jinghua Chen,
Zuying Luo,
Xiaowei Li
Pages: 1086-1089
doi>10.1145/1120725.1120830

Power consumption has become a primary constraint in integrated circuit design. Many models have been proposed to evaluate dynamic and leakage power at every design level. However, accurately predicting TPower, the total power including both dynamic and leakage power, for large-scale designs within reasonable time remains unsolved. In this paper, vector extraction for power estimation is introduced as a new topic, based on an analysis of the distribution of different types of power consumption. Through extraction, a large number of vectors is compacted into far fewer without significantly affecting the specified power property, which makes it possible to apply an accurate and fast simulator. For validation, we apply the method to average TPower vector extraction and obtain good experimental results.

Relaxed hierarchical power/ground grid analysis
Yici Cai,
Zhu Pan,
Shelton X-D Tan,
Xianlong Hong,
Wenting Hou,
Lifeng Wu
Pages: 1090-1093
doi>10.1145/1120725.1120831

This paper proposes a novel hierarchical approach to the efficient analysis of large VLSI power/ground grids. In the existing hierarchical approach, sub-circuit equivalent models are sparsified with computation-intensive integer programming, and the resulting models may lead to larger errors if the top-level circuit matrix has a large condition number. In contrast, the new approach employs an iterative (relaxation) procedure that explicitly compensates for the errors and avoids introducing the dense matrices caused by circuit reduction. We also propose an efficient scheme for partitioning high-performance center-bumped P/G grids. Experimental results demonstrate that the new algorithm is more accurate than the existing hierarchical method while delivering greater speedup over flat simulators.
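Relaxation here means solving the nodal equations G v = b iteratively rather than by direct factorization. The sketch below is a plain Gauss-Seidel iteration on a small, diagonally dominant conductance matrix whose values are chosen for illustration; it is a generic instance of relaxation, not the paper's hierarchical error-compensation scheme, and the function name is hypothetical.

```python
def gauss_seidel(G, b, tol=1e-12, max_iter=10_000):
    """Solve G v = b by Gauss-Seidel relaxation.

    Converges for diagonally dominant G, as in typical resistive P/G grids."""
    n = len(b)
    v = [0.0] * n
    for _ in range(max_iter):
        delta = 0.0
        for k in range(n):
            # Use the newest values of the other node voltages immediately
            s = sum(G[k][j] * v[j] for j in range(n) if j != k)
            new = (b[k] - s) / G[k][k]
            delta = max(delta, abs(new - v[k]))
            v[k] = new
        if delta < tol:
            return v
    raise RuntimeError("Gauss-Seidel did not converge")


# Three-node chain; b was computed as G @ [1, 2, 3], so the exact answer is known
G = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]
print(gauss_seidel(G, b))  # approximately [1.0, 2.0, 3.0]
```

Each sweep touches only the nonzeros of one row at a time, so no dense matrix is ever formed, which is the property the abstract relies on.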

Sleep transistor sizing using timing criticality and temporal currents
Anand Ramalingam,
Bin Zhang,
Anirudh Devgan,
David Z. Pan
Pages: 1094-1097
doi>10.1145/1120725.1120832

Power gating is a circuit technique that enables both high-performance and low-power operation. One of the challenges in power gating is sizing the sleep transistor that gates the power supply. This paper presents a new methodology for sizing the sleep transistor based on timing criticality and temporal currents. The timing criticality information and temporal current estimates are obtained using a static timing analyzer. Results indicate that the proposed technique reduces sleep transistor area by 80% and 49% compared to module-based and cluster-based design, respectively.
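Why temporal currents matter can be shown with a toy calculation. Sizing for each module's own peak current implicitly sums the peaks, whereas modules that share a sleep transistor rarely peak at the same instant, so the peak of the summed waveform is smaller. The waveforms (arbitrary units), the allowed virtual-ground drop, the linear-region model R_on = r_unit / W, and the function names are all hypothetical.

```python
def sum_of_peaks(waveforms):
    """Per-module sizing: every module assumed to peak simultaneously."""
    return sum(max(w) for w in waveforms)


def peak_of_sum(waveforms):
    """Temporal sizing: peak of the time-aligned total current."""
    return max(sum(samples) for samples in zip(*waveforms))


def sleep_width(i_peak, v_drop_max, r_unit):
    """Smallest W with i_peak * (r_unit / W) <= v_drop_max (linear-region model)."""
    return i_peak * r_unit / v_drop_max


# Two modules whose current peaks occur at different time steps
waves = [[1.0, 5.0, 1.0],
         [5.0, 1.0, 1.0]]
w_peaks = sleep_width(sum_of_peaks(waves), v_drop_max=0.05, r_unit=1.0)
w_temporal = sleep_width(peak_of_sum(waves), v_drop_max=0.05, r_unit=1.0)
print(w_temporal / w_peaks)  # about 0.6
```

In this example the temporal view needs roughly 40% less width; timing criticality would further tell the designer how much voltage drop each cluster can tolerate.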

Timing analysis considering temporal supply voltage fluctuation
Masanori Hashimoto,
Junji Yamaguchi,
Takashi Sato,
Hidetoshi Onodera
Pages: 1098-1101
doi>10.1145/1120725.1120833

This paper proposes an approach to cope with temporal power/ground voltage fluctuation in static timing analysis. The proposed approach replaces temporal noise with an equivalent power/ground voltage. This replacement reduces the complexity arising from the variety of noise waveform shapes and improves the compatibility of power/ground-noise-aware timing analysis with the conventional timing analysis framework. Experimental results show that the proposed approach computes gate propagation delay under temporal noise with a maximum error of 10% and an average error of 0.5%.

Fast, accurate MOS table model for circuit simulation using an unstructured grid and preserving monotonicity
G Peter Fang,
David C Yeh,
David Zweidinger,
Lawrence A Arledge,
Vinod Gupta
Pages: 1102-1106
doi>10.1145/1120725.1120834

In this work, we developed a highly memory-efficient, accurate table model that is more than 10X faster than its analytical counterparts, the BSIM3/4 models. Speed derives from linear interpolation; accuracy and memory efficiency result from an unstructured grid founded on a BSP tree for discretizing the device function space. We also describe a methodology, invoked during table generation, that overcomes the non-monotonic device behavior resulting from interpolation on the unstructured grid; the method preserves both the continuity and the monotonicity of the device quantities. These table models are now implemented in our production circuit simulator, TISpice. Overall speedups of 1.8X to 4.8X were observed on suites of industry circuits.
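The lookup step of a table model can be sketched in one dimension as piecewise-linear interpolation on a non-uniform grid. The sample points are made up and the function name is hypothetical. The paper's tables are multidimensional and indexed by a BSP tree, where interpolation across cells is what can break monotonicity; in 1-D, linear interpolation of monotone samples is always monotone, so this example illustrates only the lookup.

```python
import bisect


def table_lookup(xs, ys, x):
    """Piecewise-linear interpolation on a sorted, non-uniform grid xs."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x outside table range")
    i = bisect.bisect_right(xs, x) - 1      # cell index with xs[i] <= x
    if i == len(xs) - 1:                    # exactly at the last grid point
        return ys[-1]
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


# Hypothetical drain-current samples (A) vs. V_ds (V); denser grid at low V_ds
vds = [0.0, 0.1, 0.5, 1.8]
ids = [0.0, 1e-5, 2e-4, 1e-3]
print(table_lookup(vds, ids, 0.3))  # about 1.05e-4
```

Lookup cost is a binary search plus one multiply-add, which is where the speed of table models over analytical evaluation comes from; a non-uniform grid spends samples only where the device curve bends.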

Congestion prediction in floorplanning
Chiu-wing Sham,
Evangeline F. Y. Young
Pages: 1107-1110
doi>10.1145/1120725.1120835

Routability optimization has become a major concern in floorplanning. In traditional floorplanners, area minimization is an important issue. Due to recent advances in VLSI technology, interconnect has become a dominant factor in the overall performance of a circuit. Routability prediction is thus very important at the floorplanning stage. In this paper, we propose a new congestion model, not confined to the assumption of shortest Manhattan-distance routes, that predicts the congestion after detailed routing. We have compared our new models and some existing models with the actual congestion measures obtained by global routing some placement results (using the Capo placer [3]) with a publicly available maze router [2]. Results show that our models significantly improve estimation accuracy over the other models.

CMP aware shuttle mask floorplanning
Gang Xu,
Ruiqi Tian,
David Z. Pan,
Martin D. F. Wong
Pages: 1111-1114
doi>10.1145/1120725.1120836

By putting different chips on the same mask, a shuttle mask (or multiple project wafer) provides an economical way for low-volume designs and design prototypes to share the rising mask cost. A challenging floorplanning problem is to pack these chips optimally according to objectives and constraints related to cost and manufacturability. In this paper, we study the problem of CMP-aware shuttle mask floorplanning, formulated as a rectangle packing problem whose objectives are the minimization of area and of post-CMP topography variation. We propose a 3-step procedure to solve the problem. First, we use the low-pass filter oxide CMP model to guide a simulated annealing search that minimizes topography variation. The result is then further improved by sliding each chip within its enclosing rectangle. Finally, we calculate the optimal amount of dummy features needed using linear programming. Our experiments show excellent results on real industry data.

An improved P-admissible floorplan representation based on Corner Block List
Renshen Wang,
Sheqin Dong,
Xianlong Hong
Pages: 1115-1118
doi>10.1145/1120725.1120837

The Corner Block List representation (CBL), introduced in 2000, is an efficient and effective model for floorplanning and placement, but it still has limitations such as redundancy and incompleteness. In this paper, we present an auxiliary 3-Route Model to eliminate the redundancy and insert empty rooms to resolve the incompleteness. The result is a P-admissible representation, ECBL(2), which performs better than the original CBL and whose solution space has size $O\big((2n)!\,2^{6n}/(n!\,n^{4})\big)$.

Fast floorplanning by look-ahead enabled recursive bipartitioning
Jason Cong,
Michail Romesis,
Joseph R. Shinnerl
Pages: 1119-1122
doi>10.1145/1120725.1120838

A new paradigm is introduced for floorplanning any combination of fixed-shape and variable-shape blocks under tight fixed-outline area constraints and a wirelength objective. Dramatic improvement over traditional floorplanning methods is achieved by explicit construction of strictly legal layouts for every partition block at every level of a cutsize-driven, top-down hierarchy. By scalably incorporating legalization into the hierarchical flow, post-hoc legalization is successfully eliminated. For large floorplanning benchmarks, an implementation, called PATOMA, generates solutions with half the wirelength of state-of-the-art floorplanners in orders of magnitude less run time.

LFF algorithm for heterogeneous FPGA floorplanning
Jun Yuan,
Sheqin Dong,
Xianlong Hong,
Yuliang Wu
Pages: 1123-1126
doi>10.1145/1120725.1120839

With increasing FPGA densities and greater demand for performance, a hierarchical approach is often used in FPGA design. Floorplanning is a key ingredient of hierarchical approaches. However, the heterogeneous resources across the FPGA fabric make FPGA floorplanning quite different from traditional floorplanning for ASICs. Inspired by accumulated human experience with the "packing" problem, we propose a "less flexibility first" (LFF) algorithm. Experimental results based on Xilinx's XC3S5000 show that our algorithm works better on the heterogeneous FPGA floorplanning problem.

Placement for configurable dataflow architecture
Mongkol Ekpanyapong,
Michael Healy,
Sung Kyu Lim
Pages: 1127-1130
doi>10.1145/1120725.1120840

As wire delay increasingly becomes a significant performance bottleneck in monolithic architectures, there is a strong motivation to move to Dataflow Architectures. In this paper, we propose a set of placement algorithms for generic dataflow architectures. Our timing-driven and profile-driven placement algorithms respectively are targeting streaming and non-streaming applications. Compared to the conventional wirelength-driven algorithm, our timing-driven placer reduces the longest path delay by 23% and maximum slack by 26% at the cost of 10% increase in wirelength for streaming applications. In addition, our profile-driven placer reduces the total execution time of non-streaming applications by 17%. Lastly, our simultaneous timing/profile-driven placer reduces the total execution time of non-streaming applications by 13% on average. expand
Wire congestion and thermal aware 3D global placement
Karthik Balakrishnan,
Vidit Nanda,
Siddharth Easwar,
Sung Kyu Lim
Pages: 1131-1134
doi:10.1145/1120725.1120841
The recent popularity of 3D IC technology stems from its enhanced performance capabilities and reduced wirelength. However, wire congestion and thermal issues are exacerbated by the compact nature of these layered technologies. In this paper, we develop techniques to reduce the maximum temperature and wire congestion of 3D circuits without compromising total wirelength and via count. Our approach consists of two phases. First, we use multi-level min-cut placement with a modified gain function to reduce local wire congestion and dynamic power consumption. Second, we perform simulated annealing together with full-length thermal analysis and global routing to reduce global wire congestion and maximum temperature. Our experimental results show a smooth tradeoff among congestion, temperature, wirelength, and via count.
Placement with symmetry constraints for analog layout design using TCG-S
Jai-Ming Lin,
Guang-Ming Wu,
Yao-Wen Chang,
Jen-Hui Chuang
Pages: 1135-1137
doi:10.1145/1120725.1120842
In order to handle device matching for analog circuits, some pairs of modules need to be placed symmetrically with respect to a common axis. In this paper, we deal with module placement with symmetry constraints for analog design using the Transitive Closure Graph-Sequence (TCG-S) representation. Since the geometric relationships of modules are transparent to TCG-S and its induced operations, TCG-S handles symmetry constraints more flexibly than previous works. We first derive the necessary and sufficient conditions of TCG-S for symmetric modules. Then, we propose a polynomial-time packing algorithm for a TCG-S with symmetry constraints. Experimental results show that the TCG-S based algorithm achieves the best area utilization.
SESSION: Poster session III

An LP-based methodology for improved timing-driven placement
Qingzhou (Ben) Wang,
John Lillis,
Shubhankar Sanyal
Pages: 1139-1143
doi:10.1145/1120725.1120925
A method for timing-driven placement is presented. The core of the approach is optimal timing-driven relaxed placement based on a linear programming (LP) formulation. The formulation captures all topological paths in a linear-sized LP, so heuristic net weights or net budgets are not necessary. Additionally, explicit enumeration of a large number of paths is avoided. The flow begins with a given placement and iteratively extracts timing-critical sub-circuits, optimally places each sub-circuit by LP, and applies a timing-driven legalizer. The approach is applied to the FPGA domain and yields an average 19.6% reduction in the clock period of routed MCNC designs versus [6] (with reductions of up to 39.5%).
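Why a linear-sized LP can cover every topological path: the usual device is one arrival-time variable per node and one constraint per edge, a[v] >= a[u] + d(u, v), whose tightest solution equals the longest path into each node. The sketch below is not the paper's LP; it only demonstrates that per-edge constraint structure by solving it directly in topological order and comparing the edge count with the path count. The circuit and its delays are hypothetical.

```python
from graphlib import TopologicalSorter

def arrival_times(edges, delay):
    """Latest arrival time at every node, using one relaxation per edge.

    Each edge (u, v) plays the role of one LP constraint a[v] >= a[u] + delay[(u, v)];
    the tightest solution equals the maximum delay over *all* paths into v.
    """
    preds = {}
    for u, v in edges:
        preds.setdefault(u, set())
        preds.setdefault(v, set()).add(u)
    a = {}
    for v in TopologicalSorter(preds).static_order():
        a[v] = max((a[u] + delay[(u, v)] for u in preds[v]), default=0)
    return a

def count_paths(edges, src, dst):
    """Number of distinct src -> dst paths that explicit enumeration would have to visit."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    memo = {dst: 1}
    def n(x):
        if x not in memo:
            memo[x] = sum(n(y) for y in succ.get(x, []))
        return memo[x]
    return n(src)

# Hypothetical circuit: k cascaded "diamonds", each with a fast (1+1) and a slow (2+2) branch.
k = 10
edges, delay = [], {}
for i in range(k):
    for mid, d in ((f"t{i}", 1), (f"b{i}", 2)):
        for e in ((f"n{i}", mid), (mid, f"n{i + 1}")):
            edges.append(e)
            delay[e] = d

print(len(edges), count_paths(edges, "n0", f"n{k}"), arrival_times(edges, delay)[f"n{k}"])
# -> 40 1024 40
```

Forty edge constraints encode the same worst-case delay as 1024 enumerated paths, and the gap grows exponentially with k.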
Placement stability metrics
Chuck J. Alpert,
Gi-Joon Nam,
Paul Villarrubia,
Mehmet C. Yildiz
Pages: 1144-1147
doi:10.1145/1120725.1120926
To achieve timing closure, one often has to run through several iterations of physical synthesis flows, for which placement is a critical step. During these iterations, one hopes to consistently move towards design convergence. A placement algorithm that is "stable" will consistently drive towards similar solutions, even with changes in the input netlist and placement parameters. Indeed, the stability of the algorithm is arguably as important a characteristic as the wirelength it achieves. However, currently there is no way to actually quantify the stability of a placement algorithm. This work seeks to address the issue by proposing metrics that measure the stability of a placement algorithm. Our experimental results examine the stability of three different placement algorithms with our proposed metrics and convincingly illustrate that some algorithms are quantifiably more stable than others. We believe that this opens the door to applying different standards for evaluating placement algorithms in terms of their effectiveness for achieving timing closure.
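As a rough illustration of what "quantifying stability" can mean, the sketch below compares two placements of the same design and reports the mean cell displacement plus the fraction of cells that moved farther than a threshold. These two numbers are hypothetical example metrics chosen for simplicity; they are not the metrics proposed in the paper, and the coordinates are invented.

```python
import math

def placement_drift(before, after, threshold):
    """Compare two placements given as {cell: (x, y)} over the cells they share.

    Returns (mean displacement, fraction of shared cells that moved more than
    `threshold`). Both numbers are illustrative, not the paper's metrics.
    """
    common = before.keys() & after.keys()
    if not common:
        raise ValueError("placements share no cells")
    dists = [math.dist(before[c], after[c]) for c in common]
    mean = sum(dists) / len(dists)
    moved = sum(d > threshold for d in dists) / len(dists)
    return mean, moved

# Two hypothetical runs after a small netlist change (cell "d" is new).
run1 = {"a": (0, 0), "b": (10, 0), "c": (5, 5)}
run2 = {"a": (3, 4), "b": (10, 0), "c": (5, 5), "d": (1, 1)}
print(placement_drift(run1, run2, threshold=1.0))
```

Under this kind of measure, a more stable placer would keep both numbers small when the netlist or parameters are perturbed slightly; cells that exist in only one run are ignored here, which is itself a design choice.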
Redundant-via enhanced maze routing for yield improvement
Gang Xu,
Li-Da Huang,
David Z. Pan,
Martin D. F. Wong
Pages: 1148-1151
doi:10.1145/1120725.1120927
Redundant via insertion is an effective way to reduce the yield loss caused by via failures. However, the existing methods are all post-layout optimizations that insert redundant vias after detailed routing. In this paper, we propose the first routing algorithm that considers the feasibility of redundant via insertion in the detailed routing stage. Our routing problem is formulated as maze routing with redundant via constraints. The problem is transformed into a multiple-constraint shortest path problem and solved by Lagrangian relaxation. Experimental results show that our algorithm finds routing layouts with a much higher rate of redundant vias than conventional maze routing.
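The Lagrangian relaxation step can be sketched for a single constraint: each edge carries a wirelength and a "risk" count (assumed here to be 1 when the edge creates a via that cannot receive a redundant via), the route must keep total risk at or below a budget, and the constraint is moved into the objective as length + lam * risk. Dijkstra solves each relaxed problem and a subgradient step adjusts lam. The graph, the risk encoding and all function names are hypothetical; the paper handles multiple constraints on a routing grid.

```python
import heapq

def dijkstra(graph, src, dst, lam):
    """Shortest path under the relaxed cost  length + lam * risk."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v, length, risk in graph.get(u, []):
            nd = d + length + lam * risk
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def path_totals(graph, path):
    """Total (length, risk) of a path; assumes no parallel edges between two nodes."""
    length = risk = 0
    for u, v in zip(path, path[1:]):
        edge_len, edge_risk = next((ln, rk) for w, ln, rk in graph[u] if w == v)
        length += edge_len
        risk += edge_risk
    return length, risk

def constrained_route(graph, src, dst, max_risk, iters=50, step=1.0):
    """Lagrangian relaxation with a subgradient update on the multiplier lam."""
    lam, best = 0.0, None
    for _ in range(iters):
        path = dijkstra(graph, src, dst, lam)
        if path is None:
            return None
        length, risk = path_totals(graph, path)
        if risk <= max_risk and (best is None or length < best[0]):
            best = (length, path)
        # Raise lam while the risk budget is exceeded, relax it otherwise.
        lam = max(0.0, lam + step * (risk - max_risk))
    return best

# Edges: (neighbor, wirelength, risk). The short route creates two unprotectable vias.
graph = {
    "S": [("A", 1, 1), ("B", 2, 0)],
    "A": [("T", 1, 1)],
    "B": [("T", 2, 0)],
}
print(constrained_route(graph, "S", "T", max_risk=0))  # -> (4, ['S', 'B', 'T'])
```

With a zero risk budget the multiplier rises after the first iteration until the longer but fully protectable route becomes the relaxed optimum; with a looser budget the shortest route is kept.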
Interconnect estimation without packing via ACG floorplans
Jia Wang,
Hai Zhou
Pages: 1152-1155
doi:10.1145/1120725.1120928
ACG (Adjacent Constraint Graph) is a general floorplan representation. Refining constraint graphs yields not only an efficient representation but also one that shares the advantages of adjacency graphs. Because most edges in an ACG connect modules that are close to each other, the physical distance between two modules can be measured without packing, by the shortest path between them in the ACG. Experimental results verify this relationship, and possible approaches for interconnect planning are discussed.
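A minimal sketch of the packing-free distance idea: treat the ACG's edges as undirected and use the hop count of a shortest path (breadth-first search) as a proxy for how far apart two modules are. Dropping edge directions and using unit edge weights are simplifying assumptions made here; the edge list is hypothetical.

```python
from collections import deque

def hop_distance(adj, a, b):
    """Edges on a shortest a-b path (breadth-first search), or None if unreachable."""
    seen = {a: 0}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return seen[u]
        for v in adj.get(u, ()):
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return None

def undirected(edges):
    """Adjacency sets with both directions of every edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

# Hypothetical ACG edges (horizontal and vertical relations, directions dropped).
acg = undirected([("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")])
print(hop_distance(acg, "A", "D"), hop_distance(acg, "E", "C"))  # -> 3 3
```

No coordinates are computed at any point, which is the property that makes such an estimate cheap inside a floorplanning loop.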
Timing driven track routing considering coupling capacitance
Di Wu,
Jiang Hu,
Min Zhao,
Rabi Mahapatra
Pages: 1156-1159
doi:10.1145/1120725.1120929
As VLSI technology enters the ultra-deep submicron era, wire coupling capacitance starts to dominate self capacitance and can no longer be neglected in timing-driven routing. In this paper, a coupling-aware, timing-driven track routing heuristic is proposed. Given a global routing solution and a timing constraint for each net, the major trunks of wire segments are assigned to routing tracks such that the minimum timing slack among all nets is maximized. Delay penalties from both coupling capacitance and wire detour are considered in a unified graph model. The core problem is formulated and solved as a Sequential Ordering Problem (SOP). Routing blockages are handled in a post-processing procedure. Experimental results on benchmark circuits show that the effect of coupling capacitance on timing is significant and that the proposed heuristic yields a greater improvement in coupling-aware timing than other approaches.
Multilevel full-chip gridless routing considering optical proximity correction
Tai-Chen Chen,
Yao-Wen Chang
Pages: 1160-1163
doi:10.1145/1120725.1120930
To handle modern routing with nanometer effects, we need to consider designs with variable wire widths and spacings, for which gridless routers are desirable because of their great flexibility. Gridless routing is much more difficult than grid-based routing because its solution space is significantly larger. In this paper, we present the first multilevel, full-chip gridless detailed router. The router integrates global routing, detailed routing, and congestion estimation at each level of the multilevel routing. It can handle non-uniform wire widths and considers routability and optical proximity correction (OPC). Experimental results show that our approach obtains significantly better routing solutions than previous works. For example, for a set of 11 commonly used benchmark circuits, our approach achieves 100% routing completion for all circuits, while the well-known state-of-the-art three-level routing and multilevel routing (multilevel global routing + flat detailed routing) cannot complete routing for any of the circuits. In addition, experimental results show that our multilevel gridless router handles non-uniform wire widths efficiently and effectively (still maintaining 100% routing completion for all circuits). In particular, our OPC-aware multilevel gridless router achieves an average reduction of 11.3% in pattern features and still maintains 100% routability for the 11 benchmark circuits.
Improving the scalability of SAMBA bus architecture
Ruibing Lu,
Aiqun Cao,
Cheng-Kok Koh
Pages: 1164-1167
doi:10.1145/1120725.1120931
SAMBA bus [1] is a high-performance bus architecture that can deliver multiple transactions in one bus cycle under single-winner bus arbitration. The bus architecture has several advantages, such as high bandwidth, low latency, and a low performance penalty from arbitration delay, all of which make it more scalable than traditional buses. However, its scalability may be limited by the bus access logic delay. Because each module is connected to the bus through an interface unit, and these interface units are connected in series on the bus, the bus logic delay increases linearly with bus size. In this paper, we propose to increase the scalability of SAMBA buses through two methods: control signal lookahead and module clustering. The control signal lookahead technique determines the bus access control signal in advance, thereby reducing the effective delay of each interface unit. Module clustering, on the other hand, reduces the number of interface units attached to a bus. Experimental results show that combining these two methods effectively reduces the bus logic delay and thus increases the scalability of SAMBA buses.
Process-variation robust and low-power zero-skew buffered clock-tree synthesis using projected scan-line sampling
Jeng-Liang Tsai,
Charlie Chung-Ping Chen
Pages: 1168-1171
doi:10.1145/1120725.1120932
A zero-skew clock tree with minimum clock delay is preferable because of its low unintentional and process-variation-induced skews. We propose a zero-skew buffered clock-tree synthesis flow and a novel algorithm that enables clock-tree optimization throughout the full zero-skew design space by considering simultaneous buffer insertion, buffer sizing, and wire sizing. For an industrial clock tree with 3101 sink nodes, our algorithm achieves up to 45X clock-delay improvement and up to 23% power reduction compared with its initial routing.
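For background on what "zero skew" means at a single merge step, the sketch below uses the classic Elmore-delay zero-skew merging formula (Tsay's method): given two subtrees with delays t1, t2 and load capacitances C1, C2, joined by a wire of length l with per-unit resistance r and capacitance c, the tap point at fraction x from subtree 1 equalizes the delays. This is the standard textbook construction, not the paper's projected scan-line sampling algorithm, and it ignores buffers and wire sizing. All numeric values are hypothetical.

```python
def zero_skew_tap(t1, c1, t2, c2, length, r, c):
    """Fraction x of the wire (measured from subtree 1) that equalizes Elmore delays.

    t1, t2: subtree delays; c1, c2: subtree load capacitances;
    r, c: wire resistance and capacitance per unit length.
    A result outside [0, 1] means the wire would have to be elongated (snaked).
    """
    rl = r * length
    return (t2 - t1 + rl * (c2 + c * length / 2)) / (rl * (c * length + c1 + c2))

def elmore_to_subtree(x, t, cap, length, r, c):
    """Elmore delay from the tap through a wire of length x * length into a subtree."""
    seg = x * length
    return r * seg * (c * seg / 2 + cap) + t

# Hypothetical values in consistent units.
x = zero_skew_tap(t1=10.0, c1=2.0, t2=11.0, c2=1.0, length=10.0, r=0.1, c=0.2)
d1 = elmore_to_subtree(x, 10.0, 2.0, 10.0, 0.1, 0.2)
d2 = elmore_to_subtree(1 - x, 11.0, 1.0, 10.0, 0.1, 0.2)
print(round(x, 6), round(d1, 6), round(d2, 6))  # -> 0.6 11.56 11.56
```

The formula follows from setting r*x*l*(c*x*l/2 + C1) + t1 equal to r*(1-x)*l*(c*(1-x)*l/2 + C2) + t2; the quadratic terms cancel, leaving a linear equation in x.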
Register-transfer level functional scan for hierarchical designs
Ho Fai Ko,
Qiang Xu,
Nicola Nicolici
Pages: 1172-1175
doi:10.1145/1120725.1120933
This paper discusses the potential benefits of inserting scan chains (SCs) in hierarchical designs at the register-transfer level (RTL) of design abstraction. Using new algorithms for functional scan chain design, it is shown how tight timing constraints for design-for-test (DFT) planning at RTL can improve the performance of a circuit, compared to its gate-level counterpart, without any loss in testability.
Using fault model relaxation to diagnose real scan chain defects
Yu Huang,
Wu-Tung Cheng,
Greg Crowell
Pages: 1176-1179
doi:10.1145/1120725.1120934
Software-based scan chain fault diagnosis typically consists of two steps. First, scan chain flush patterns are used to identify faulty chains and fault models. This is followed by chain diagnosis using scan patterns in the second step. In this paper, we target chain diagnosis for one special category of chain faults: intermittent scan chain faults. It is shown that these faults may not be modeled correctly in the first step. Hence, a novel diagnosis methodology based on scan chain fault model relaxation is proposed.
A retention-aware test power model for embedded SRAM
Baosheng Wang,
Josh Yang,
Yuejian Wu,
André Ivanov
Pages: 1180-1183
doi:10.1145/1120725.1120935
This paper addresses the test power modeling problem for embedded SRAMs (e-SRAMs). Previous research treats e-SRAMs the same as other SoC cores and uses a "single-rectangle" power model to describe their test power consumption. This leads to a significant waste of test time, since e-SRAM test usually includes a long period of "zero" power consumption for the detection of Data Retention Faults. This paper takes advantage of this "zero" power period and proposes a "retention-aware" test power model for e-SRAMs. The proposed model is evaluated, and its impact on test time reduction is reported for various scenarios in terms of retention test duration, memory capacity, test algorithm complexity, etc. A formula is derived to predict the maximum test time reduction when the "zero" power period is fully utilized in an SoC environment.
On-chip accumulated jitter measurement for phase-locked loops
Chih-Feng Li,
Shao-Sheng Yang,
Tsin-Yuan Chang
Pages: 1184-1187
doi:10.1145/1120725.1120936
A time-to-digital converter (TDC) circuit is presented to measure the worst-case accumulated jitter over N periods of the clock produced by a PLL. Both the most positive and the most negative accumulated jitter, which together give the worst case, can be calculated with the proposed approach. In a case study applying the proposed TDC circuit with a 4-bit flash ADC and an accumulation period of N = 8, the frequency range of the measured signal, the resolution, and the linearity error are 0.7-1.4 GHz, 44 ps, and 1.25%, respectively. In a 0.25 μm 1P6M CMOS process, HSPICE simulation shows that the maximum measurement error is 1 LSB after calibration.
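The quantity being measured can be stated precisely in software: accumulated jitter over N periods is the sum of N consecutive periods minus N times the nominal period, and the worst case is the most positive and most negative value over all windows. The sketch below computes this from a list of period samples with a sliding window and maps a result onto ideal TDC codes using the 44 ps resolution quoted above. It models the measured quantity only, not the TDC circuit; the period samples are hypothetical and the quantizer assumes an ideal, calibrated converter.

```python
def accumulated_jitter(periods_ps, nominal_ps, n):
    """Worst-case (most positive, most negative) accumulated jitter over any n consecutive periods."""
    if n <= 0 or n > len(periods_ps):
        raise ValueError("n must be between 1 and len(periods_ps)")
    window = sum(periods_ps[:n])
    errors = [window - n * nominal_ps]
    for i in range(n, len(periods_ps)):
        window += periods_ps[i] - periods_ps[i - n]  # slide the N-period window by one cycle
        errors.append(window - n * nominal_ps)
    return max(errors), min(errors)

def quantize(value_ps, lsb_ps):
    """Round a time interval to the nearest code of an ideal TDC with resolution lsb_ps."""
    return round(value_ps / lsb_ps)

# Hypothetical period measurements (ps) of a nominal 1 GHz clock; N = 2.
periods = [1000, 1100, 900, 1000]
worst_pos, worst_neg = accumulated_jitter(periods, nominal_ps=1000, n=2)
print(worst_pos, worst_neg, quantize(worst_pos, 44))  # -> 100 -100 2
```

Note that individual period errors of +100 ps and -100 ps cancel in the middle window, which is why accumulated jitter over N periods can differ substantially from N times the cycle-to-cycle jitter.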
SoC test scheduling using the B*-tree based floorplanning technique
Jen-Yi Wuu,
Tung-Chieh Chen,
Yao-Wen Chang
Pages: 1188-1191
doi:10.1145/1120725.1120937
We present in this paper a new algorithm to co-optimize test scheduling and core wrapper design under power constraints for core-based SoC (System on Chip) designs. The test scheduling problem is first transformed into a floorplanning problem with a given maximum height (test access mechanism width) constraint. Then, we apply the B*-tree based floorplanning technique to solve the SoC test scheduling problem. Experimental results on the ITC'02 benchmarks show that our method is very effective and efficient: it obtains the best results ever reported for SoC test scheduling with power constraints, in very efficient running time. Compared with recent works, our method achieves average improvements of 4.7% to 20.1%.
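The transformation can be illustrated as follows: each core test becomes a rectangle whose width is its test time and whose height is the number of TAM wires it uses, so the total test time is the packing's width and the TAM budget is a maximum height. The sketch below packs a given B*-tree in the standard way (a left child starts where its parent ends, a right child starts at the same time and is stacked above) and checks the TAM limit. It is a simplified sketch: power constraints, wrapper design (which changes rectangle shapes) and the search over trees are all omitted, and the core names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CoreTest:
    name: str
    time: int                            # rectangle width: test length in cycles
    tam: int                             # rectangle height: TAM wires used
    left: Optional["CoreTest"] = None    # B*-tree left child: starts when this test ends
    right: Optional["CoreTest"] = None   # B*-tree right child: same start time, stacked above

def pack(root):
    """Place each test as (x = start time, y = first TAM wire, width, height)."""
    placed = {}
    def lowest_y(x, w):
        # Contour height over [x, x + w): top of any already placed, overlapping rectangle.
        return max((py + ph for px, py, pw, ph in placed.values()
                    if px < x + w and x < px + pw), default=0)
    def visit(node, x):
        placed[node.name] = (x, lowest_y(x, node.time), node.time, node.tam)
        if node.left:
            visit(node.left, x + node.time)
        if node.right:
            visit(node.right, x)
    visit(root, 0)
    return placed

def evaluate(placed, tam_width):
    """Total test time and whether the schedule fits in tam_width wires."""
    makespan = max(x + w for x, y, w, h in placed.values())
    peak = max(y + h for x, y, w, h in placed.values())
    return makespan, peak <= tam_width

core1, core2, core3 = CoreTest("core1", 3, 2), CoreTest("core2", 2, 2), CoreTest("core3", 3, 1)
core1.left, core1.right = core2, core3
layout = pack(core1)
print(layout)
print(evaluate(layout, tam_width=3))  # -> (5, True)
```

In a full flow, a simulated-annealing loop would perturb the tree (swap, move, rotate nodes) and keep trees whose packing has a smaller makespan within the TAM (and power) limits.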
Fault tolerant quantum cellular array (QCA) design using Triple Modular Redundancy with shifted operands
Tongquan Wei,
Kaijie Wu,
Ramesh Karri,
Alex Orailoglu
Pages: 1192-1195
doi:10.1145/1120725.1120938
Owing to its extremely small feature size and ultra-low power consumption, Quantum-dot Cellular Automata (QCA) technology is projected to be a promising nanotechnology. However, in nanotechnologies, manufacturing-time defect levels and operational-time fault rates are expected to be quite high. Straightforward Triple Modular Redundancy (TMR) based fault tolerance is inappropriate for QCA nanotechnology, since wire delays dominate logic delays and faults in wires dominate the faults in a QCA-based design. Furthermore, long wires are necessary in TMR-based designs. In this paper, we show that fault tolerance can be obtained by using TMR with Shifted Operands (TMRSO). TMRSO uses shorter wires of QCA cells and exploits the self-latching property of clocked QCA arrays to provide the same level of fault tolerance as straightforward TMR while being significantly faster and smaller. This technique can be applied to a variety of operations; we have validated TMRSO on adders. Implementation results obtained using QCADesigner [6] show that an 8-bit adder using TMRSO achieves more than 50% area reduction and more than 100% throughput improvement compared to a TMR implementation.
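The shifted-operand principle can be sketched at the bit level in software: if the same adder computes the sum three times with both operands shifted left by 0, 1 and 2 bits, a single faulty bit slice corrupts a different logical bit in each realigned result, so a bitwise 2-of-3 majority vote recovers the correct sum. This is only an illustration of the principle under an assumed fault model (one inverted sum output per slice, no carry-line faults); it is not the paper's QCA circuit and does not model clocking or self-latching.

```python
WIDTH = 11  # hardware bit slices: 8 data bits, up to 2 shift positions, 1 carry-out

def faulty_adder(a, b, bad_slice=None):
    """WIDTH-bit adder whose sum output at `bad_slice` is inverted (single-slice fault)."""
    s = (a + b) & ((1 << WIDTH) - 1)
    if bad_slice is not None:
        s ^= 1 << bad_slice
    return s

def majority(x, y, z):
    """Bitwise 2-of-3 vote."""
    return (x & y) | (y & z) | (x & z)

def tmr_shifted_add(a, b, bad_slice=None):
    """Add two 8-bit values three times on the same (possibly faulty) adder with
    operands shifted by 0, 1 and 2 bits, realign the results, and vote."""
    results = []
    for shift in (0, 1, 2):
        s = faulty_adder(a << shift, b << shift, bad_slice)
        results.append(s >> shift)  # realign; a fault below `shift` is shifted out
    return majority(*results)

# Inject every single-slice sum fault (or none) for a sample of operand pairs.
for a in range(0, 256, 5):
    for b in range(0, 256, 7):
        for bad in [None, *range(WIDTH)]:
            assert tmr_shifted_add(a, b, bad) == a + b
print("all single-slice sum faults masked")
```

A fault on a carry line would typically corrupt several bits in each copy, so this simple vote is not guaranteed to mask it; the sketch deliberately stays within the single-sum-bit fault model.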
Efficiently generating test vectors with state pruning
Ying Chen,
Dennis Abts,
David J. Lilja
Pages: 1196-1199
doi:10.1145/1120725.1120939
This paper extends the depth-first search (DFS) used in the previously proposed witness string method for generating efficient test vectors. A state pruning method is added that exploits different search heuristics in simultaneous searches. Using an IBM Power4 multiprocessor system with the Berkeley Active Message library, we show that this new state pruning method is efficient and produces quantitatively better witness strings than both pure and guided DFS.
Cluster-based detection of SEU-caused errors in LUTs of SRAM-based FPGAs
E. Syam Sundar Reddy,
Vikram Chandrasekhar,
M. Sashikanth,
V. Kamakoti,
N. Vijaykrishnan
Pages: 1200-1203
doi:10.1145/1120725.1120940
This paper proposes a cluster-based parity-checking technique that can detect 100% of Single Event Upset (SEU) faults in the LUTs of SRAM-based FPGAs. The paper describes two different Configurable Logic Block (CLB) architectures that could be used to implement the proposed SEU detection technique. The first architecture can perform at-speed testing of the LUTs without interrupting the normal functioning of the FPGA. The second works by switching the CLBs from normal mode to testing mode and back; the LUTs are tested in testing mode. The switching frequency can be programmed externally and hence varied depending on the rate of SEU occurrences. Both proposed architectures were compared with the Xilinx Virtex and Virtex Pro architectures. They require only 2 (compared with Virtex) and 4 (compared with Virtex Pro) additional SRAM configuration bits per LUT. This is extremely low compared to the 16 additional SRAM configuration bits required by CLB architectures that implement standard duplication-with-comparison (DWC) techniques for detecting SEUs in LUTs. The area requirements of both proposed architectures are also significantly lower than those of DWC techniques. The proposed detection technique requires only 3 cycles of the Xilinx Virtex internal clock to detect the effect of an SEU in any LUT of the FPGA.
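Why parity suffices for SEUs: a single upset flips exactly one configuration bit, which always changes the parity of that LUT's contents. The sketch below stores one reference parity bit per 16-bit (4-input) LUT configuration in a cluster and reports which LUTs disagree on re-check. The cluster grouping, the one-bit-per-LUT storage and the function names are assumptions for illustration; they do not reproduce the paper's CLB architectures.

```python
def parity16(word):
    """Parity bit of a 16-bit LUT configuration: 1 if it holds an odd number of ones."""
    return bin(word & 0xFFFF).count("1") & 1

def make_cluster(luts):
    """Pair each LUT configuration with its reference parity bit."""
    return [(cfg, parity16(cfg)) for cfg in luts]

def check_cluster(cluster):
    """Indices of LUTs whose current contents disagree with the stored parity."""
    return [i for i, (cfg, p) in enumerate(cluster) if parity16(cfg) != p]

# Hypothetical LUT contents: 4-input AND (0x8000), 4-input XOR (0x6996), and two others.
cluster = make_cluster([0x8000, 0x6996, 0xFFFE, 0x0001])
print(check_cluster(cluster))  # -> []
cfg, p = cluster[1]
cluster[1] = (cfg ^ (1 << 3), p)  # simulate an SEU flipping bit 3 of LUT 1
print(check_cluster(cluster))  # -> [1]
```

Two upsets in the same LUT would cancel in its parity and go undetected, which is why such schemes rely on single-event upsets being rare between checks.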
Comprehensive analysis and optimization of CMOS LNA noise performance
Dong Feng,
Bingxue Shi
Pages: 1204-1207
doi:10.1145/1120725.1120941
A comprehensive analysis of CMOS low noise amplifier (LNA) noise performance is presented in this paper, including channel noise and induced gate noise in MOS devices. The impacts of distributed gate resistance and intrinsic channel resistance on noise performance are also considered and formulated. A new analytical formula for the noise figure is proposed, and two noise optimization approaches are carried out. This work will benefit the design of high-performance CMOS LNAs.
An analog front-end IP for 13.56MHz RFID interrogators
Jung-Hyun Cho,
Suk-Byung Chai,
Chung-Gi Song,
Kyung-Won Min,
Shiho Kim
Pages: 1208-1211
doi:10.1145/1120725.1120942
An analog front-end circuit for 13.56 MHz RFID interrogators compatible with ISO 14443, ISO 15693, and ISO 18000-3 Mode 1 was designed and fabricated in a 0.35 μm double-poly CMOS process. The fabricated chip operates from a single 3.3 V supply. The results of this work can be provided as reusable hard or firm IPs for designing single-chip 13.56 MHz RFID interrogators.
A two-stage genetic algorithm method for optimization the ΣΔ modulators
A. Zahabi,
O. Shoaei,
Y. Koolivand,
P. Jabehdar-maralani
Pages: 1212-1215
doi:10.1145/1120725.1120943
A two-stage optimization approach for the design of ΣΔ modulators using a genetic algorithm is proposed. The convergence speed and CPU time of the design process are reduced significantly by combining an equation-based genetic algorithm with a high-level simulation-based genetic algorithm. The proper circuit specifications of the modulator are obtained using a new idea called a gene-dependent fitness function, which takes some circuit-level non-idealities into account in the evaluation of the cost function. This significantly reduces the need for time-consuming circuit simulations and transient analyses.
A novel differential VCO circuit design for USB Hub
Gong Qian,
Yuan Guo-shun
Pages: 1216-1219
doi:10.1145/1120725.1120944
The paper describes a novel differential Voltage Controlled Oscillator (VCO) circuit, which is used in the phase-locked loop of a USB hub chip. The output clock frequency can be tuned from 36 MHz to 96 MHz by changing the value of the control signals. The VCO architecture, the design of its module circuits, simulation results, and the chip layout are included. Experimental results using CSMC 0.6 μm process technology show that the expected performance is obtained and that the VCO circuit consumes only 0.9 mW from a 5 V power supply.
Static power minimization in current-mode circuits
M. S. Bhat,
H. S. Jamadagni
Pages: 1220-1223
doi:10.1145/1120725.1120945
We propose a method involving selective signal gating to minimize power dissipation in current-mode CMOS analog and multiple-valued logic (MVL) circuits employing a stack of current comparators. First, we present an approximation model for the current in a current comparator circuit. Power reduction is achieved by turning off redundant comparator circuits using a switch architecture. Simulations are carried out for current-mode flash ADC designs and literal-generating circuits for MVL to validate the method.
A novel transmitter for 1000Base-T physical transceiver
Pages: 1224-1227
doi:10.1145/1120725.1120946
This paper describes a transmitter used in a 1000Base-T PHY chip. A Digital-to-Analog Converter with 5-bit resolution, 8-bit accuracy, a 125 MHz sample rate, and 4 ns transition timing has been implemented to satisfy all the specifications of the Gigabit Ethernet transmitter defined in the IEEE 802.3 standard. The entire design occupies 0.4 × 0.6 mm² in a 0.18 μm CMOS technology. Apart from the digital decoder block, the design dissipates no power inside the transmitter; most of the power is delivered to the peripheral interface circuit connected to the twisted pair at the transmitter front end.
A novel data processing circuit in high-speed serial communication
Yongjian Tang,
Lenian He,
Xiaolang Yan
Pages: 1228-1231
doi:10.1145/1120725.1120947
A novel data processing circuit for high-speed serial communication is demonstrated in this work. The circuit, which includes a serializer and a frequency divider, was developed to convert the transmission signals into the desired format. The chip design is based on the TSMC 0.25 μm mixed-signal model and uses a semi-custom design methodology. Pre- and post-layout simulation results indicate that the circuit reaches a speed of 480 MHz. Moreover, the data are processed properly, in agreement with the USB 2.0 specification.
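A behavioral sketch of what a serializer plus frequency divider does: parallel words are shifted out one bit per high-speed clock cycle, and the word-side clock is the bit clock divided by the word width. The 8:1 ratio, least-significant-bit-first ordering (the order USB uses on the wire) and the 480 Mb/s rate are assumptions for this illustration; the paper's circuit is not described at this level, and USB's NRZI encoding and bit stuffing are omitted.

```python
def serialize(data, bits_per_word=8):
    """Flatten parallel words into a serial bit stream, least-significant bit first."""
    stream = []
    for word in data:
        for i in range(bits_per_word):
            stream.append((word >> i) & 1)
    return stream

def word_clock_mhz(bit_rate_mbps, bits_per_word=8):
    """Parallel-side clock produced by dividing the bit clock by the word width."""
    return bit_rate_mbps / bits_per_word

print(serialize([0x01, 0xA5]))  # -> [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1]
print(word_clock_mhz(480))      # -> 60.0
```

Under these assumptions, a 480 MHz bit clock pairs with a 60 MHz parallel word clock, so the divide-by-8 output is what hands each new byte to the shift register.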
A monolithic CMOS L band DAB receiver
Ziqiang Wang,
Baoyong Chi,
Min Lin,
Shuguang Han,
Lu Liu,
Jinke Yao,
Zhihua Wang
Pages: 1232-1235
doi:10.1145/1120725.1120948
This paper presents a fully integrated CMOS low-IF receiver working in the L band for DAB applications. An image-rejection low noise amplifier (LNA) provides over 30 dB of image rejection across the whole band. A calibration circuit improves the matching of the quadrature LO signals. Together with the quadrature Weaver architecture, this lets the receiver reject the image signal by more than 65 dB. The receiver's noise figure is 4 dB and its OIP3 is 22 dBm. The receiver is implemented in a 0.25 μm CMOS process, with a core die area of 9 mm².
A bipolar IF amplifier/RSSI for ASK receiver
Yonggang Tao,
Yongsheng Xu,
Wei Jin,
Hui Yu,
Zongsheng Lai
|
|
Pages: 1236-1239 |
|
doi>10.1145/1120725.1120949 |
|
A bipolar logarithmic intermediate-frequency (IF) amplifier with a received signal strength indicator (RSSI) circuit for an ASK receiver is presented. The amplifier realizes a piecewise approximation to an exact logarithmic response. A special RSSI architecture is proposed for the demodulating log amplifiers, which have five stages, each consisting of a limiter amplifier and a gm cell. An input dynamic range of 90dB with ±1dB linearity is achieved.
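How cascaded limiting stages produce a piecewise approximation to a logarithm can be shown with an idealized successive-detection model. The sketch assumes ideal hard limiters with a per-stage gain of 10 and a clipping level of 1, and models each gm cell as a unity-weight detector whose outputs are summed; these values and the function names are illustrative assumptions, not the paper's circuit parameters.

```python
from typing import List


def stage_outputs(v_in: float, gain: float = 10.0, limit: float = 1.0,
                  stages: int = 5) -> List[float]:
    """Amplitude at the output of each cascaded limiting stage (ideal hard limiters)."""
    outputs: List[float] = []
    v = abs(v_in)
    for _ in range(stages):
        v = min(gain * v, limit)  # linear gain until the stage clips
        outputs.append(v)
    return outputs


def rssi(v_in: float, gain: float = 10.0, limit: float = 1.0, stages: int = 5) -> float:
    """Sum of the per-stage detector outputs (one gm cell per stage, unity weight)."""
    return sum(stage_outputs(v_in, gain, limit, stages))


if __name__ == "__main__":
    for v in (1e-5, 1e-4, 1e-3, 1e-2):
        print(f"{v:.0e} V -> RSSI {rssi(v):.3f}")
```

Each tenfold increase in input drives one more stage into clipping, so the summed output rises by roughly one clipping level per decade: a straight line in dB made of piecewise-linear segments. Real limiters compress gradually, which smooths the segment corners.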
|
|
|
SESSION: Poster session IV |
|
|
|
|
Evaluation of dual VDD fabrics for low power FPGAs |
| |
Rajarshi Mukherjee,
Seda Ogrenci Memik
|
|
Pages: 1240-1243 |
|
doi>10.1145/1120725.1121033 |
|
Power efficiency is becoming an increasingly important design aspect for FPGAs. It has recently been shown that well-known power minimization techniques from ASICs, such as creating supply voltage (Vdd) scalable islands of different granularity, can be applied to FPGAs. However, the discrete routing architecture of FPGAs amplifies any constraint imposed on the placement stage. In this work, we evaluate the overheads of voltage scaling schemes with respect to FPGA architectures and design flows in terms of critical path delay, channel width, and area/delay product. We present a detailed evaluation of how alternative realizations of voltage scaling schemes affect the physical design flow of FPGAs, and show that, depending on the voltage island configuration, a dynamic power gain as high as 47% is possible with a 17% area/delay product penalty, and a 30% power gain with an area/delay product penalty as low as 6%.
|
|
|
Design of an application-specific PLD architecture |
| |
Jae-Jin Lee,
Gi-Yong Song
|
|
Pages: 1244-1247 |
|
doi>10.1145/1120725.1121034 |
|
This paper presents a new application-specific PLD architecture that adopts a bit-level super-systolic array for application-specific arithmetic operations such as MAC. The proposed design offers a significant alternative view of programmable logic devices. The bit-level super-systolic array, whose cells themselves contain a systolic array, is ideal for the proposed PLD architecture in terms of area efficiency and clock speed, because it limits the routing requirement to local interconnections between Logic Units and global interconnections between Logic Modules. The clock cycle is limited only by the delay of one AND gate and one full adder.
|
|
|
Event-oriented computing with reconfigurable platform |
| |
Mitsuru Tomono,
Masaki Nakanishi,
Katsumasa Watanabe,
Shigeru Yamashita
|
|
Pages: 1248-1251 |
|
doi>10.1145/1120725.1121035 |
|
Reconfigurable computing has recently come under the spotlight as a new computing paradigm. A machine employing this paradigm combines the flexibility of a general-purpose processor with the performance of a dedicated system. In this paper, we propose Event-Oriented Computing, a new application area for reconfigurable computing, and present an architecture model suited to it. Using Artificial Life as an example, we report an evaluation of our architecture model.
|
|
|
Reconfigurable adaptive FEC system with interleaving |
| |
Kazunori Shimizu,
Nozomu Togawa,
Takeshi Ikenaga,
Satoshi Goto
|
|
Pages: 1252-1255 |
|
doi>10.1145/1120725.1121036 |
|
This paper proposes a reconfigurable adaptive FEC system with interleaving. For adaptive FEC schemes, we can implement an optimal RS decoder composed of the minimum number of hardware units for any given error correction capability t. If the hardware units of the RS decoder can be reduced for a given t, as large a deinterleaver as possible can be embedded into the RS decoder for that t. Dynamically reconfiguring the RS decoder, with the expanded deinterleaver embedded, for each t allows us to decode larger interleaved codes, which are more robust against burst errors. Our reconfigurable adaptive FEC system with interleaving achieves a better packet error rate and higher throughput than fixed hardware systems.
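Why a larger deinterleaver helps can be seen with a plain block interleaver. The sketch below is an illustration, not the paper's reconfigurable decoder: it assumes a row/column block interleaver holding `depth` RS codewords of `n` symbols each (all names hypothetical) and counts how a burst of consecutive channel errors lands in each codeword after deinterleaving. Because an RS code with capability t corrects up to t symbol errors per codeword, any burst of at most depth × t symbols remains correctable, so enlarging the deinterleaver extends the tolerable burst length.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def interleave(symbols: Sequence[T], depth: int, n: int) -> List[T]:
    """Write `depth` codewords of length `n` as rows; read the block out column by column."""
    if len(symbols) != depth * n:
        raise ValueError("expected exactly depth * n symbols")
    return [symbols[row * n + col] for col in range(n) for row in range(depth)]


def deinterleave(symbols: Sequence[T], depth: int, n: int) -> List[T]:
    """Inverse of interleave(): reading the transposed block restores row order."""
    return interleave(symbols, n, depth)


def burst_errors_per_codeword(depth: int, n: int, start: int, length: int) -> List[int]:
    """Symbol errors per codeword after a burst hits interleaved positions [start, start + length)."""
    if start < 0 or start + length > depth * n:
        raise ValueError("burst must lie inside one interleaved block")
    hit = [0] * (depth * n)
    for k in range(start, start + length):
        hit[k] = 1
    restored = deinterleave(hit, depth, n)
    return [sum(restored[row * n:(row + 1) * n]) for row in range(depth)]


if __name__ == "__main__":
    print(burst_errors_per_codeword(4, 6, 5, 4))  # [1, 1, 1, 1]
    print(burst_errors_per_codeword(4, 6, 5, 8))  # [2, 2, 2, 2]
```

With depth 4, a 4-symbol burst leaves one error per codeword (correctable with t = 1), while an 8-symbol burst needs t = 2.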
|
|
|
An AMBA AHB-based reconfigurable SOC architecture using multiplicity of dedicated flyby DMA blocks |
| |
Adeoye Olugbon,
Sami Khawam,
Tughrul Arslan,
Ioannis Nousias,
Iain Lindsay
|
|
Pages: 1256-1259 |
|
doi>10.1145/1120725.1121037 |
|
We propose a System-on-Chip (SoC) architecture for reconfigurable applications based on the AMBA Advanced High-performance Bus (AHB). The architecture features multiple low-area flyby DMA blocks for transferring configuration data. Furthermore, it eliminates the energy-consuming instructions used in comparable commercial reconfigurable SoCs. The flyby DMA blocks use up to 98% fewer gates than general-purpose DMA controllers. They also achieve flyby throughput, halving the number of clock cycles that conventional DMA needs for a data transfer. We also demonstrate parallel processing, which contributes to the improved system performance of the proposed architecture over comparable commercial designs.
|
|
|
Using GALS architecture to reduce the impact of long wire delay on FPGA performance |
| |
Xin Jia,
Ranga Vemuri
|
|
Pages: 1260-1263 |
|
doi>10.1145/1120725.1121038 |
|
Interconnect delay is becoming a major roadblock to FPGA performance as technology scales and chip sizes grow. Globally Asynchronous Locally Synchronous (GALS) design is considered a potential solution to this issue. An important design decision in building a GALS FPGA architecture is the choice of GALS island size. A large island reduces the asynchronous communication overhead but increases the interconnect delay inside the island; conversely, asynchronous communication overhead can become a major concern when islands are small. In this paper, we propose a design flow to investigate this tradeoff. The input circuit is first divided into partitions according to the specified GALS island size, and each partition is then implemented with commercially available CAD tools. The overall system performance is estimated by a performance evaluator. Experimental results validate our design flow and show a performance improvement of around 20% from adopting a GALS architecture.
|
|
|
A novel configurable motion estimation architecture for high-efficiency MPEG-4/H.264 encoding |
| |
Tiejun Li,
Sikun Li,
Chengdong Shen
|
|
Pages: 1264-1267 |
|
doi>10.1145/1120725.1121039 |
|
This paper proposes a flexible, efficient, and configurable motion estimation architecture. Its core is a motion estimation engine, NPSPE (Nine Points Search Pattern Engine), which supports the latest efficient block-based motion estimation algorithms used in MPEG-4/H.264 encoding, such as PMVFAST and EPZS. The architecture has been designed and synthesized in SMIC 0.18μm technology. The results show that it requires only 17.5K gates, yet its computing efficiency is about 15 times higher than that of the well-known low-power full-search (FS) engine with 16 PEs, while its PSNR is similar to that of FS.
|
|
|
A fast digit-serial systolic multiplier for finite field GF(2^m) |
| |
Chang Hoon Kim,
Soonhak Kwon,
Chun Pyo Hong
|
|
Pages: 1268-1271 |
|
doi>10.1145/1120725.1121040 |
|
This paper presents a new digit-serial systolic multiplier over GF(2^m) for cryptographic applications. When input data arrive continuously, the proposed array produces one multiplication result every ⌈m/D⌉ + 2 clock cycles, where D is the selected digit size. Since the inner structure of the proposed array is tree-type, its critical path grows logarithmically with D. The computation delay of the proposed architecture is therefore significantly lower than that of previously proposed digit-serial systolic multipliers, whose critical path grows in proportion to D. Furthermore, since the new architecture is regular and modular and has unidirectional data flow, it is well suited to VLSI implementation.
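As a software reference for what a digit-serial multiplier computes, the sketch below performs most-significant-digit-first multiplication over GF(2^m): each of the ceil(m/D) iterations absorbs one D-bit digit of the multiplier in a Horner step, mirroring the m/D term in the throughput figure above. It models only the arithmetic, not the paper's tree-type cell structure or pipelining. The function names are hypothetical, and the check uses the AES field GF(2^8) with f(x) = x^8 + x^4 + x^3 + x + 1 (0x11B).

```python
def clmul(x: int, y: int) -> int:
    """Carry-less (polynomial) product of two GF(2)[x] polynomials stored as ints."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        x <<= 1
        y >>= 1
    return result


def poly_mod(x: int, poly: int, m: int) -> int:
    """Reduce x modulo the degree-m irreducible polynomial `poly`."""
    for deg in range(x.bit_length() - 1, m - 1, -1):
        if (x >> deg) & 1:
            x ^= poly << (deg - m)  # cancel the leading term at degree `deg`
    return x


def gf2m_mul_digit_serial(a: int, b: int, poly: int, m: int, d: int) -> int:
    """Multiply a*b in GF(2^m), consuming b one D-bit digit per step, MSB digit first."""
    n_digits = -(-m // d)  # ceil(m / D) iterations, one per digit
    mask = (1 << d) - 1
    c = 0
    for i in range(n_digits - 1, -1, -1):
        digit = (b >> (i * d)) & mask
        # Horner step: C <- (C * x^D + A * digit) mod f(x)
        c = poly_mod((c << d) ^ clmul(a, digit), poly, m)
    return c


if __name__ == "__main__":
    for d in (1, 2, 4, 8):
        print(d, hex(gf2m_mul_digit_serial(0x57, 0x83, 0x11B, 8, d)))  # 0xc1 each time
```

The result is independent of D; only the number of iterations (and, in hardware, the depth of each step) changes, which is the tradeoff the abstract describes.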
|
|
|
Adaptive fuzzy control scheduling of window-constrained real-time systems |
| |
Zhu Xiangbin,
Tu ShiLiang
|
|
Pages: 1272-1275 |
|
doi>10.1145/1120725.1121041 |
|
The DWCS (Dynamic Window-Constrained Scheduling) algorithm performs well when the DWCS scheduler is not overloaded. When the scheduler is overloaded, however, many violations occur and they are not uniformly distributed. In this paper, we present an adaptive fuzzy control scheduling scheme based on the DWCS algorithm. When the system is overloaded, the improved algorithm can achieve a rectangular (uniform) distribution of violation ratios across all real-time tasks. To evaluate its effectiveness, we conducted extensive simulation studies, whose results show that the new algorithm outperforms the original DWCS algorithm.
|
|
|
A high performance QAM receiver for digital cable TV with integrated A/D and FEC decoder |
| |
Bo Shen,
Junhua Tian,
Zheng Li,
Jianing Su,
Qianling Zhang
|
|
Pages: 1276-1279 |
|
doi>10.1145/1120725.1121042 |
|
A DVB-C/ITU J.83-A compliant QAM (Quadrature Amplitude Modulation) demodulator for digital cable TV is proposed, which supports 4- to 256-QAM with variable bit rates up to 80Mbps. It integrates a 10-bit 40MSPS ADC, a (204,188) Reed-Solomon decoder, and a convolutional interleaver. The chip is implemented in SMIC 0.25μm CMOS technology with a die size of 3.5×3.5mm². It features a wide carrier offset acquisition range, a robust demodulation algorithm, and a small circuit area.
|
|
|
Partitioned bus coding for energy reduction |
| |
Lin Xie,
Peiliang Qiu,
Qinru Qiu
|
|
Pages: 1280-1283 |
|
doi>10.1145/1120725.1121043 |
|
For VLSI design in deep submicron technology, bus energy reduction has become increasingly important. This paper studies bus partitioning schemes for Transition Pattern Coding (TPC) using a genetic-algorithm-based approach. A closed-form expression is derived for the energy dissipation of a partitioned bus with TPC coding, and a general bus model with coupling capacitance is considered during energy estimation and optimization. The resulting partitioned bus coding reduces the encoding and decoding complexity of the original TPC. Experimental results show that TPC with careful bus partitioning saves up to 16.9% of the energy of TPC with random bus partitioning.
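The kind of energy model such an estimate relies on can be sketched with a common normalized cost for a bus with coupling capacitance: each line contributes (ΔV)² across its ground capacitance, and each adjacent pair contributes λ·(ΔV_i − ΔV_{i+1})² across its coupling capacitance, where λ is the coupling-to-ground capacitance ratio. This is an assumed, simplified model for illustration, not the paper's closed-form expression, and it implements neither TPC nor the genetic partitioning search. The function names are hypothetical.

```python
from typing import List, Sequence


def _bits(word: int, width: int) -> List[int]:
    """Bus line values, with line 0 holding the least significant bit."""
    return [(word >> i) & 1 for i in range(width)]


def transition_cost(prev: int, curr: int, width: int, lam: float) -> float:
    """Normalized cost of one transition: sum of squared swings across all capacitors."""
    delta = [c - p for p, c in zip(_bits(prev, width), _bits(curr, width))]  # +1 rise, -1 fall
    self_term = sum(d * d for d in delta)  # line-to-ground capacitors
    coupling_term = sum((delta[i] - delta[i + 1]) ** 2 for i in range(width - 1))
    return self_term + lam * coupling_term


def sequence_cost(words: Sequence[int], width: int, lam: float) -> float:
    """Total cost of driving a sequence of words onto the bus."""
    return sum(transition_cost(a, b, width, lam) for a, b in zip(words, words[1:]))


if __name__ == "__main__":
    print(transition_cost(0b01, 0b10, 2, 1.0))  # 6.0: neighbors switch in opposite directions
    print(transition_cost(0b00, 0b11, 2, 1.0))  # 2.0: neighbors switch together
```

Opposite transitions on adjacent lines are the most expensive case, which is why coding and partitioning choices that avoid them can reduce bus energy.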
|
|
|
An improved bit-plane and pass dual parallel architecture for coefficient bit modeling in JPEG2000 |
| |
Yanju Han,
Chao Xu,
Yizhen Zhang
|
|
Pages: 1284-1287 |
|
doi>10.1145/1120725.1121044 |
|
Embedded block coding with optimized truncation (EBCOT) is a critical part of JPEG2000 systems. Bit-plane and pass dual parallel methods can speed up the encoding, but the acceleration always comes with a more complicated circuit structure and increased circuit resources. In this paper, we present an improved bit-plane and pass dual parallel architecture (IBPDP), which not only achieves high encoding speed but also reduces the logic circuit requirement and the coding delay. Experimental results show that, compared with BPDP, the logic circuit is reduced by about 45% and the average delay per code-block falls by 10%.
|
|
|
A generalized quadrature bandpass sampling in radio receivers |
| |
Yi-Ran Sun,
Svante Signell
|
|
Pages: 1288-1291 |
|
doi>10.1145/1120725.1121045 |
|
Bandpass Sampling (BPS) realizes frequency down-conversion by undersampling. Noise aliasing, a direct consequence of the lower sampling rate, degrades performance. In this paper, a Generalized Quadrature BPS (GQBPS) scheme, combined with a filter that performs both reconstruction and bandpass filtering, is studied in the frequency domain with respect to both signal reconstruction and noise aliasing reduction. The theoretical analysis shows that GQBPS may offer a way to reduce noise aliasing at the cost of a more complicated reconstruction algorithm.
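For context, the sketch below computes the acceptable sampling rates for classic first-order (real) bandpass sampling of a band [f_L, f_H], using the standard condition 2f_H/n ≤ f_s ≤ 2f_L/(n − 1) for integer n from 1 to floor(f_H/(f_H − f_L)). It does not implement GQBPS. It only shows how undersampling is constrained: larger n means a lower f_s, which folds more out-of-band noise onto the signal band. The function name is hypothetical.

```python
import math
from typing import List, Tuple


def valid_bps_ranges(f_low: float, f_high: float) -> List[Tuple[int, float, float]]:
    """(n, fs_min, fs_max) for each alias-free uniform bandpass sampling zone.

    n = 1 is ordinary Nyquist sampling (fs >= 2 * f_high); larger n undersamples.
    """
    if not 0 <= f_low < f_high:
        raise ValueError("require 0 <= f_low < f_high")
    bandwidth = f_high - f_low
    n_max = math.floor(f_high / bandwidth)
    ranges = []
    for n in range(1, n_max + 1):
        fs_min = 2 * f_high / n
        fs_max = math.inf if n == 1 else 2 * f_low / (n - 1)
        ranges.append((n, fs_min, fs_max))
    return ranges


if __name__ == "__main__":
    for n, lo, hi in valid_bps_ranges(20e6, 25e6):
        print(f"n={n}: {lo / 1e6:.3f} MHz <= fs <= {hi / 1e6:.3f} MHz")
```

For a 20 to 25MHz band, the highest zone (n = 5) collapses to the single rate of 10MHz, leaving no margin for clock error; practical designs therefore pick a lower n, which also limits noise folding.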
|
|
|
Reducing leakage power in instruction cache using WDC for embedded processors |
| |
Xin Lu,
Yuzhuo Fu
|
|
Pages: 1292-1295 |
|
doi>10.1145/1120725.1121046 |
|
Power consumption is an important design issue for current embedded systems and SoCs. It has been shown that the instruction cache accounts for a significant portion of the power dissipation of the whole processor chip. The WDC (Way-Decay Cache) proposed in this paper is a novel cache architecture with resizable associativity and low leakage power. Experimental results show that, for the SPECint95 benchmarks, WDC reduces energy consumption without significantly hindering performance.
|
|
|
System-level architectural exploration using allocation-on-demand technique |
| |
Qiang Wu,
Jinian Bian,
Hongxi Xue
|
|
Pages: 1296-1298 |
|
doi>10.1145/1120725.1121047 |
|
Architectural exploration is very important in embedded system and SoC design. In this paper, a new heuristic algorithm using the allocation-on-demand technique is proposed to solve this problem. Unlike previous research efforts, this algorithm allocates new resources only when it fails to schedule tasks under the performance constraints. The resource cost of the system therefore increases monotonically as the algorithm runs, which helps determine the feasibility of the current solution earlier. Experimental results show that this approach supports an efficient architectural exploration process.
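A toy version of allocation-on-demand is sketched below under strong simplifying assumptions: independent tasks, a single resource type, each resource executing its tasks back to back, and tasks visited in deadline order. It is not the paper's heuristic, and `Task` and `allocate_on_demand` are hypothetical names. A new resource is added only when a task cannot meet its deadline on any existing resource, so the resource count only grows and an impossible deadline is reported as soon as it is reached.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    exec_time: int
    deadline: int


def allocate_on_demand(tasks: List[Task]) -> Optional[List[List[str]]]:
    """Per-resource task lists, or None if a task misses its deadline even on a fresh resource."""
    schedules: List[List[str]] = []
    finish: List[int] = []  # time at which each resource becomes free
    for task in sorted(tasks, key=lambda t: t.deadline):
        # The earliest-free resource is the best existing option for this task.
        best = min(range(len(finish)), key=finish.__getitem__, default=None)
        if best is not None and finish[best] + task.exec_time <= task.deadline:
            schedules[best].append(task.name)
            finish[best] += task.exec_time
            continue
        # Scheduling failed under the constraint: allocate a new resource.
        if task.exec_time > task.deadline:
            return None  # infeasible no matter how many resources are added
        schedules.append([task.name])
        finish.append(task.exec_time)
    return schedules


if __name__ == "__main__":
    print(allocate_on_demand([Task("A", 3, 3), Task("B", 3, 3), Task("C", 2, 8)]))
    # [['A', 'C'], ['B']]
```

Because resources are never removed, the cost seen by the search is monotonic, which is what lets an infeasible branch be abandoned early.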
|
|
|
A fractional delay-locked loop for on chip clock generation applications |
| |
P. Torkzadeh,
A. Tajalli,
M. Atarodi
|
|
Pages: 1300-1309 |
|
doi>10.1145/1120725.1121048 |
|
A fractional multiplying delay-locked loop (FMDLL) for high-speed on-chip clock generation applications is presented. The proposed DLL architecture overcomes some drawbacks of phase-locked loops (PLLs), such as jitter accumulation and stability, while maintaining the advantages of a PLL as a multi-rate fractional frequency multiplier. The output frequency can be tuned from 1GHz to 2.5GHz with selectable multiplication ratios of M + 0.05 × K, where 1 ≤ K ≤ 19. To generate finer ratios, K can be toggled between two consecutive integers; in this case, a digital delta-sigma modulator can be used to suppress the spurs in the output spectrum.
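The ratio arithmetic and the K-toggling idea can be sketched as follows. The ratio formula and the 1 ≤ K ≤ 19 range come from the abstract; the first-order, accumulator-based modulator is an assumption chosen for simplicity (the abstract does not specify the modulator's order or structure), and the function names are hypothetical. The modulator emits a sequence of integer K values whose average approaches a fractional target.

```python
import math
from fractions import Fraction
from typing import List, Union

STEP = Fraction(1, 20)  # 0.05 ratio resolution, kept exact


def multiplication_ratio(m: int, k: int) -> Fraction:
    """Output/reference frequency ratio M + 0.05 * K for an integer K in [1, 19]."""
    if not 1 <= k <= 19:
        raise ValueError("K must satisfy 1 <= K <= 19")
    return m + STEP * k


def first_order_dsm(k_target: Union[Fraction, float, str], n_cycles: int) -> List[int]:
    """Toggle K between floor(k_target) and floor(k_target) + 1 so its average tracks k_target."""
    target = Fraction(k_target)
    base = math.floor(target)
    frac = target - base
    acc = Fraction(0)
    sequence: List[int] = []
    for _ in range(n_cycles):
        acc += frac
        if acc >= 1:  # accumulator overflow selects the upper integer
            acc -= 1
            sequence.append(base + 1)
        else:
            sequence.append(base)
    return sequence


if __name__ == "__main__":
    seq = first_order_dsm("7.25", 8)
    print(seq)                                  # [7, 7, 7, 8, 7, 7, 7, 8]
    print(2 + STEP * Fraction(sum(seq), len(seq)))  # 189/80, i.e. an average ratio of 2.3625
```

A first-order modulator produces a strictly periodic toggle pattern, so its energy concentrates in discrete tones; this is one reason higher-order modulators are often preferred when spur suppression matters.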
|
|
|
A novel O(n) parallel banker's algorithm for System-on-a-Chip |
| |
Jaehwan John Lee,
Vincent John Mooney, III
|
|
Pages: 1304-1308 |
|
doi>10.1145/1120725.1121049 |
|
This paper proposes a novel O(n) Parallel Banker's Algorithm (PBA) with a best-case run time of O(1), reduced from the O(mn²) run-time complexity of the original Banker's Algorithm. We implemented the approach in hardware as the PBA Unit (PBAU) using Verilog HDL and verified its run-time complexity. PBAU is an Intellectual Property (IP) block that provides very fast, automatic deadlock avoidance for a MultiProcessor System-on-a-Chip (MPSoC, which we predict will be the mainstream of future high-performance computing environments). Moreover, our PBA supports multiple-instance, multiple-resource systems. We demonstrate that PBAU not only avoids deadlock in a few clock cycles (1600X faster than the Banker's Algorithm in software) but also, in a particular example, achieves a 19% speedup in application execution time over deadlock avoidance in software. Lastly, the MPSoC area overhead due to PBAU is small: under 0.05% in our candidate MPSoC example.
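For reference, the sketch below is the classic software Banker's Algorithm that PBA accelerates, not the parallel hardware algorithm itself. It consists of a safety check that repeatedly looks for a process whose remaining need fits in the available resources, plus a request test that grants a request only if the resulting state is still safe. The function names are hypothetical; m is the number of resource types and n the number of processes.

```python
from typing import List, Optional

Matrix = List[List[int]]


def safe_sequence(available: List[int], max_need: Matrix, allocation: Matrix) -> Optional[List[int]]:
    """Return a safe completion order of processes, or None if the state is unsafe."""
    n, m = len(allocation), len(available)
    need = [[max_need[i][j] - allocation[i][j] for j in range(m)] for i in range(n)]
    work = list(available)
    finished = [False] * n
    order: List[int] = []
    progressed = True
    # Each pass either finishes at least one process or ends the loop: O(m n^2) overall.
    while progressed:
        progressed = False
        for i in range(n):
            if not finished[i] and all(need[i][j] <= work[j] for j in range(m)):
                # Process i can run to completion and release everything it holds.
                for j in range(m):
                    work[j] += allocation[i][j]
                finished[i] = True
                order.append(i)
                progressed = True
    return order if all(finished) else None


def can_grant(pid: int, request: List[int], available: List[int],
              max_need: Matrix, allocation: Matrix) -> bool:
    """Deadlock avoidance: grant the request only if the post-grant state is safe."""
    m = len(available)
    if any(request[j] > max_need[pid][j] - allocation[pid][j] for j in range(m)):
        raise ValueError("request exceeds declared maximum need")
    if any(request[j] > available[j] for j in range(m)):
        return False  # not enough free resources now; the process must wait
    new_available = [available[j] - request[j] for j in range(m)]
    new_allocation = [row[:] for row in allocation]
    for j in range(m):
        new_allocation[pid][j] += request[j]
    return safe_sequence(new_available, max_need, new_allocation) is not None
```

The nested search over processes is what gives the software version its O(mn²) cost; evaluating the per-process comparisons in parallel hardware is the kind of step a unit like PBAU can collapse.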
|
|
|
Hardware/software co-design using hierarchical platform-based design method |
| |
Zhihui Xiong,
Sikun Li,
Jihua Chen
|
|
Pages: 1309-1312 |
|
doi>10.1145/1120725.1121050 |
|
A Hierarchical Platform-Based Design (Hi-PBD) method is put forward for SoC system design. The method divides the SoC design flow into three levels (system model, virtual components, and real components) to separate function from structure and computation from communication. Hi-PBD defines two mapping processes (design planning and virtual-real synthesis) that traverse all three design levels. Hi-PBD supports reuse of both the design templates at all three levels and the results of the two mapping processes, which greatly increases reuse efficiency. In addition, Hi-PBD boosts design flexibility by supporting revision at all three levels, and ensures that the final design meets its performance requirements through a novel performance constraint transmission strategy. Experiments indicate that Hi-PBD improves SoC high-level design efficiency by 30%-40% and achieves a platform template reuse ratio of 75%-90%.
|
|
|
Architecture and performance comparison of a statistic-based lottery arbiter for shared bus on chip |
| |
Yan Zhang
|
|
Pages: 1313-1316 |
|
doi>10.1145/1120725.1121051 |
|
This paper presents a statistic-based priority strategy for dynamic priority arbiters and investigates its application to the lottery arbiter. Two sets of M×M registers are proposed to record the arbitration history, and the period over which the history is recorded is programmable. A randomized verification environment is used to compare the performance of statistic-based and non-statistic-based arbiters. The results show that performance improves when the masters' request patterns change dynamically as different programs run on the system on chip, and especially when the grants of different masters' requests are correlated.
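A plain lottery arbiter, the baseline that the statistic-based strategy builds on, can be sketched as follows. Each requesting master holds a number of tickets, and a random draw over the tickets held by the current requesters picks the winner, so a master's share of grants tracks its share of tickets. The statistic-based history registers are not modeled; the ticket values, the injectable draw, and the function names are assumptions for illustration.

```python
import random
from typing import Optional, Sequence


def lottery_grant(requests: Sequence[bool], tickets: Sequence[int], draw: int) -> Optional[int]:
    """Index of the master that wins `draw`, which must lie in [0, tickets held by requesters)."""
    total = sum(t for req, t in zip(requests, tickets) if req)
    if total == 0:
        return None  # no requester holds any tickets
    if not 0 <= draw < total:
        raise ValueError("draw out of range")
    for master, (req, t) in enumerate(zip(requests, tickets)):
        if req:
            if draw < t:
                return master
            draw -= t
    raise AssertionError("unreachable: draw was below the ticket total")


def arbitrate(requests: Sequence[bool], tickets: Sequence[int], rng: random.Random) -> Optional[int]:
    """One arbitration round with a software random draw (hardware might use an LFSR)."""
    total = sum(t for req, t in zip(requests, tickets) if req)
    return None if total == 0 else lottery_grant(requests, tickets, rng.randrange(total))


if __name__ == "__main__":
    rng = random.Random(0)
    wins = [arbitrate([True, False, True], [1, 5, 3], rng) for _ in range(4000)]
    print(wins.count(0), wins.count(2))  # roughly 1:3; master 1 is not requesting
```

Making the draw an explicit argument keeps `lottery_grant` deterministic, which is convenient for the kind of randomized verification environment the abstract mentions: the test bench controls the random source.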
|
|
|
Using loop invariants to fight soft errors in data caches |
| |
Sri Hari Krishna N,
Seung Woo Son,
Mahmut Kandemir,
Feihui Li
|
|
Pages: 1317-1320 |
|
doi>10.1145/1120725.1121052 |
|
Ever-scaling process technology makes embedded systems more vulnerable to soft errors than in the past. One generic method for fighting soft errors duplicates instructions in either the spatial or the temporal domain and then compares the results to see whether they differ. This full duplication scheme, though effective, is very expensive in terms of performance, power, and memory space. In this paper, we propose an alternative scheme based on loop invariants and present experimental results showing that, averaged over all benchmarks tested, our approach catches 62% of the errors caught by full duplication. In addition, it reduces the execution cycles and memory demand of the full duplication strategy by 80% and 4%, respectively.
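The idea of checking a loop invariant instead of duplicating the loop body can be illustrated with a toy example. In the hypothetical sketch below (not the paper's compiler transformation), a strided copy maintains the invariant offset == stride × i. A corrupted offset, injected through an optional fault hook, violates the invariant and is caught on the next iteration at the cost of one multiply and compare rather than a second copy of the computation.

```python
from typing import Callable, List, Optional


class SoftErrorDetected(RuntimeError):
    """Raised when a loop invariant check fails."""


def strided_copy(src: List[int], stride: int,
                 fault: Optional[Callable[[int, int], int]] = None) -> List[int]:
    """Copy every `stride`-th element of `src`, checking offset == stride * i each iteration.

    `fault(i, offset)` is a hypothetical fault-injection hook that may corrupt `offset`.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    out: List[int] = []
    i, offset = 0, 0
    while offset < len(src):
        if offset != stride * i:  # loop invariant check
            raise SoftErrorDetected(f"invariant violated at iteration {i}")
        out.append(src[offset])
        i += 1
        offset += stride
        if fault is not None:
            offset = fault(i, offset)
    return out


if __name__ == "__main__":
    print(strided_copy(list(range(10)), 3))  # [0, 3, 6, 9]
    try:
        # Flip bit 0 of the offset after the second iteration.
        strided_copy(list(range(10)), 3, fault=lambda i, off: off ^ 1 if i == 2 else off)
    except SoftErrorDetected as err:
        print("caught:", err)
```

Errors that leave the invariant intact (for example, a flipped bit in the copied data itself) go undetected here, which is why an invariant-based scheme can catch only part of what full duplication catches.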
|