 Yaohua Wang

January 2016 ACM Transactions on Architecture and Code Optimization (TACO): Volume 12 Issue 4, January 2016
The efficacy of single instruction, multiple data (SIMD) architectures is limited when handling divergent control flows. This circumstance results in SIMD fragments using only a subset of the available lanes. We propose an iteration interleaving-based SIMD lane partition (IISLP) architecture that interleaves the execution of consecutive iterations and dynamically partitions ...
Keywords: SIMD, iteration interleaving, vector iteration, SIMD lane partition, instruction shuffle

June 2012 HPCC '12: Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems
To further improve the performance of SIMD (Single Instruction Multiple Data) architectures, which are widely used in the wireless communication domain, the main components of the Long Term Evolution (LTE) protocol are analyzed. The performance investigation is carried out on a cycle-accurate simulator featuring the main characteristics of existing SIMD architectures. Based on ...
Keywords: SIMD, LTE, MRF, Shuffle

May 2012 IPDPSW '12: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
Hybrid architectures combining VLIW, SIMD and multi-core schemes are increasingly prevalent in media processors, due to the abundant parallelism in media applications. However, parameters for current combinations, such as the VLIW length, SIMD width and core count, are set mainly according to simple profiling or the designer's experience ...
Keywords: VLIW, SIMD, Multi-core, Analytical Model

July 2011 NAS '11: Proceedings of the 2011 IEEE Sixth International Conference on Networking, Architecture, and Storage
The shuffle operation is one of the bottlenecks in vector DSPs. How the shuffle matrix is partitioned has a great effect on the design of the shuffle unit when small-grain data shuffles are handled with a smaller-sized crossbar. The traditional matrix block partitioning solution will bring much ...

July 2011 ISVLSI '11: Proceedings of the 2011 IEEE Computer Society Annual Symposium on VLSI
Stream processors are efficient for media applications because they exploit features of media processing such as data parallelism and producer-consumer locality. However, the loosely coupled structure between the host and the stream processor makes communication between the scalar and SIMD parts costly and scheduling across kernels less flexible. Besides, ...
Keywords: Stream Processor, Stream Length Effect, Enhanced Scalar Processor, Kernel Overlapping

September 2010 HPCC '10: Proceedings of the 2010 IEEE 12th International Conference on High Performance Computing and Communications
The emergence of large-scale chip multicore processors makes a highly parallel on-chip H.264/AVC encoder feasible. To reduce the data reload frequency, a hierarchical chip multi-core DSP platform with 64 DSP cores in total is designed to accommodate the computation- and data-intensive H.264/AVC encoder. To increase parallelism, macroblock-level parallelism is ...

