- 1 Advanced Data Mining Techniques and Architectures
- 2 Towards Efficient Dynamic Run Time Reconfigurable Systems
- 3 A Hardware Implementation of a Mixture of Experts
- 4 Packet Classification Algorithms and Architectures
- 5 Dec 2009, An Architecture Exploration Framework For The Implementation Of Embedded DSP Applications
- 6 Aug 2008, Efficient Implementations of WiMAX OFDM Functions on Reconfigurable Platforms
- 7 May 2006, Hardware Implementations of Artificial Neural Networks
- 8 December 2005, A Hardware/Software Co-design of the Fiduccia-Mattheyses Partitioning Algorithm
- 9 September 2005, A Handel-C Implementation of Artificial Neural Networks on a Reconfigurable Platform
- 10 Sep. 2004, Sequential/Parallel Heuristic Algorithms For VLSI Standard Cell Placement
Advanced Data Mining Techniques and Architectures
By Dunia Jamma, PhD, 2013 – Present
Abstract: Data mining is the process of discovering and extracting interesting knowledge from large amounts of data stored in multiple data sources such as file systems, databases, and data warehouses. This knowledge provides significant benefits to business strategy, scientific and medical research, governments, and individuals. This thesis proposes several novel algorithms and architectures to enhance the process of data mining.
Towards Efficient Dynamic Run Time Reconfigurable Systems
By Ahmed Alwattar, PhD, 2009 – Present
Abstract: Reconfigurable computing is a disruptive technology intended to fill the gap between high-performance ASICs and flexible general-purpose processors. Runtime partial reconfiguration is a recent method for selectively updating the circuitry of an FPGA while it remains active, allowing a group of logic to be changed very quickly when the application needs it. However, this comes at a price, since design complexity increases with this flow. In this thesis we propose to implement an OS to manage the required resources, in addition to improving the current flow. The potential speedup reconfigurable computing can achieve depends on the application and the amount of parallelism intrinsic to it. We use image processing applications as a case study to demonstrate how dynamic partial reconfiguration can be applied to achieve speedup, lower power consumption, and reduced cost.
A Hardware Implementation of a Mixture of Experts
By Antony Savich, PhD, May 2008 – Present
Abstract: Recent applications of Artificial Neural Networks (ANNs) have increased the size, performance, and complexity requirements of the single networks used to solve today's complex problems. Previous research focused on optimizing single-ANN performance and hardware requirements, accelerating training cycles and using cheaper implementation platforms (both generic processors and, more recently, reconfigurable hardware). Once trained, these single networks were inflexible to change, and in general it proved difficult to exploit the complete parallelism of single-ANN algorithms and fit a large, flexible, fully parallel network design into the confines of a single-chip solution. This research focuses on solving complex problems not with a single large ANN but by subdividing the solution into a group of smaller networks in a Mixture of Experts configuration, with the flexibility to modularly add new functionality after the initial training process. The goal of this research is to demonstrate that such a solution has a performance advantage over current single-network topologies, by using smaller, faster networks, and a flexibility advantage over monolithically trained networks in adapting to new conditions by expanding the problem/solution set. This will be accomplished through the implementation of a fully functional single-FPGA Mixture of Experts system and its subsequent performance analysis on various sample problems.
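As a rough illustration of the Mixture of Experts idea described above, the sketch below combines the outputs of several small expert models through a softmax gate. This is a toy software analogue, not the thesis' hardware design; all names, the gate, and the experts are hypothetical:

```python
import math

def softmax(zs):
    """Numerically stable softmax over a list of gate logits."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def moe_output(experts, gate_logits, x):
    """Weighted combination of expert outputs; one logit per expert."""
    weights = softmax(gate_logits)
    return sum(w * expert(x) for w, expert in zip(weights, experts))

# Two trivial "experts" standing in for small trained networks.
experts = [lambda x: x * 2.0, lambda x: x + 1.0]
y = moe_output(experts, [0.0, 0.0], 3.0)  # equal gate logits -> plain average
```

Adding a new expert after deployment is just appending to `experts` and giving the gate one more logit, which mirrors the modular-expansion flexibility the thesis targets.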
Packet Classification Algorithms and Architectures
By Omar Ahmed, PhD, 2008 – Present
Abstract: Packet classification is the process of categorizing packets into classes in a network device: an incoming packet is matched against the rules in the classifier, which identifies the action to be performed on the packet. Current software-based packet classification algorithms exhibit relatively poor performance, prompting many researchers to concentrate on novel frameworks and architectures that employ both hardware and software components. In this thesis, we propose two novel algorithms:
- Packet Classification with Incremental Update (PCIU)
- Group Based Search packet classification (GBSA)
The PCIU algorithm is a novel and efficient packet classification algorithm with a unique incremental update capability that demonstrated powerful results and was shown to be scalable for many different tasks and clients. While a pure software implementation can generate powerful results on a server machine, an embedded solution may be more desirable for some applications and clients. Embedded, specialized hardware-accelerator-based solutions are typically much more efficient in speed, cost, and size than solutions implemented on general-purpose processor systems. The algorithm, furthermore, was improved and made more accessible for a variety of applications through implementation in hardware. Four such implementations are detailed and discussed in this thesis. The results indicate that a hardware/software co-design approach (using Xilinx EDK/SDK and a Spartan 3E) yields a PCIU solution that is slower but easier to optimize and improve within time constraints. A hardware accelerator based on an ESL approach using Handel-C, on the other hand, resulted in a 22x speedup over a pure software implementation running on a state-of-the-art Xeon processor. An ASIP implementation achieves on average a 21x speedup in classification.
In this thesis, we also propose GBSA, another novel packet classification algorithm that is scalable, fast, and efficient. On average the algorithm consumes 0.4 MB of memory for a 10k rule set. The worst-case classification time per packet is 2 µs, and the pre-processing speed is 3M rules/sec on a CPU operating at 3.4 GHz. The proposed algorithm was evaluated and compared to state-of-the-art techniques such as RFC, HiCut, Tuple, and PCIU using several standard benchmarks. The results indicate that GBSA outperforms these algorithms in terms of speed, memory usage, and pre-processing time. The algorithm, furthermore, was improved and made more accessible for a variety of applications through implementation in hardware. Three such implementations are detailed and discussed in this thesis. The first was implemented using an Application-Specific Instruction-set Processor (ASIP), while the others are pure RTL implementations using two different ESL flows (Impulse-C and Handel-C). The GBSA ASIP implementation achieved, on average, an 18x speedup over a pure software implementation running on a Xeon processor, while the hardware accelerators (based on the ESL approaches) resulted in a 9x speedup.
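As background on the matching problem both PCIU and GBSA address, the sketch below shows the naive linear-search classifier that such algorithms are designed to outperform: each rule specifies inclusive ranges on header fields, and the first matching rule's action applies. The two-field rule format and all names are illustrative, not the thesis' rule model:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    src_lo: int  # source-address range (inclusive)
    src_hi: int
    dst_lo: int  # destination-address range (inclusive)
    dst_hi: int
    action: str  # e.g. "permit" or "deny"

    def matches(self, src, dst):
        return (self.src_lo <= src <= self.src_hi
                and self.dst_lo <= dst <= self.dst_hi)

def classify(rules, src, dst, default="deny"):
    """Return the action of the first (highest-priority) matching rule."""
    for rule in rules:           # O(number of rules) per packet
        if rule.matches(src, dst):
            return rule.action
    return default

rules = [Rule(0, 99, 0, 99, "permit"),   # narrow rule, checked first
         Rule(0, 255, 0, 255, "deny")]   # catch-all
```

The linear scan costs one rule comparison per rule per packet; the point of PCIU and GBSA is to replace this scan with precomputed lookup structures so classification time stays small even for 10k-rule sets.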
Dec 2009, An Architecture Exploration Framework For The Implementation Of Embedded DSP Applications
By Ahmed Elhossini, December 2009
Abstract: Embedded systems are widely used today in Digital Signal Processing (DSP) applications that usually require high computational power under tight constraints. Using SoC technology increases the challenges facing a designer when choosing an appropriate design. A tool that helps explore different architectures is required to design such an efficient system; it should be able to explore different architectures and evaluate them against the given constraints. The design space to be explored depends on the application domain and the target platform. This thesis proposes an efficient Particle Swarm Optimization (PSO) technique that can handle multi-objective optimization problems. It is based on the strength-Pareto approach originally used in Evolutionary Algorithms (EAs). The proposed modified particle-swarm algorithm is used to build three hybrid EA-PSO algorithms for solving different multi-objective optimization problems. The algorithm and its hybrid forms are tested using seven benchmarks from the literature, and the results are compared to the strength-Pareto evolutionary algorithm (SPEA2) and a competitive multi-objective PSO (MO-PSO). Combining PSO and evolutionary algorithms yields hybrid algorithms that outperform SPEA2, MO-PSO, and the proposed strength-Pareto PSO on different metrics. Accordingly, an optimization engine is built using these meta-heuristics that can be used to solve multi-objective optimization problems in general and design exploration in particular. The direction of the search process depends on the evaluation of each generated solution. This thesis therefore also presents an approach for performance evaluation of embedded systems: several cycle-accurate simulations are performed for the commercial embedded processors used in our study.
The simulation results are used to build Artificial Neural Network (ANN) models with accuracy of up to 90% relative to cycle-accurate simulation, at a very significant time saving. These models are combined with an analytical model and a static scheduler to increase the accuracy of the estimation process. The optimization engine is integrated with the performance evaluation module to build an architecture exploration framework for embedded DSP applications. The functionality of the framework is verified using benchmarks from industry. The results show over 90% accuracy when comparing the proposed solutions to cycle-accurate simulation.
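For context, the core single-objective PSO update that the thesis extends with a strength-Pareto archive for multi-objective problems can be sketched as follows. This is a generic textbook PSO, not the thesis' algorithm; all parameter values and names are illustrative:

```python
import random

def pso(f, dim, n=20, iters=100, w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0):
    """Minimize f over a dim-dimensional box using canonical PSO."""
    random.seed(0)
    xs = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    vs = [[0.0] * dim for _ in range(n)]
    pbest = [x[:] for x in xs]            # each particle's best position
    gbest = min(pbest, key=f)[:]          # swarm's best position
    for _ in range(iters):
        for i in range(n):
            for d in range(dim):
                # inertia + cognitive pull (pbest) + social pull (gbest)
                vs[i][d] = (w * vs[i][d]
                            + c1 * random.random() * (pbest[i][d] - xs[i][d])
                            + c2 * random.random() * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            if f(xs[i]) < f(pbest[i]):
                pbest[i] = xs[i][:]
                if f(pbest[i]) < f(gbest):
                    gbest = pbest[i][:]
    return gbest

sphere = lambda x: sum(v * v for v in x)  # simple convex test function
best = pso(sphere, dim=2)
```

The multi-objective variants replace the single `gbest` with a Pareto archive and select leaders by strength values, which is the part the thesis contributes; that machinery is not reproduced here.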
Aug 2008, Efficient Implementations of WiMAX OFDM Functions on Reconfigurable Platforms
By Ahmed Sagir, August 2008
Abstract: This thesis investigates three approaches to implementing the OFDM functions of the fixed-WiMAX standard on reconfigurable platforms. The custom RTL approach showed that a medium-size FPGA can accommodate the design at only a 50% occupancy rate. The AccelDSP approach incurred an area overhead of 10%, and the throughput obtained was almost one quarter of that of the custom RTL approach. The Tensilica Xtensa processor approach presented remarkable figures in terms of power, area, and design time. Comparing the three approaches indicated that the custom RTL approach leads in performance; however, both the AccelDSP and Tensilica approaches halved the design time and provided early architectural exploration capabilities. The power results showed that the Tensilica approach consumed roughly 12-15 times less total power than the other two approaches.
May 2006, Hardware Implementations of Artificial Neural Networks
By Antony Savich, May 2006
Abstract: Artificial Neural Networks, and the Multi-Layer Perceptron with Back-Propagation (MLP-BP) algorithm in particular, have historically suffered from slow training. Unfortunately, many applications require real-time training. This thesis studies aspects of implementing MLP-BP in Field Programmable Gate Array (FPGA) hardware to accelerate network training. This task is accomplished through analysis of numeric representation and its effect on network convergence, hardware performance, and resource consumption. The effects of pipelining on the back-propagation algorithm are analyzed, and a novel hardware architecture is presented. This architecture allows extended flexibility in the choice of numeric representation, degree of system-level parallelism, and network virtualization. A high degree of resource efficiency is achieved through careful architectural design, which allows large network topologies to be placed within a single FPGA. Examination of the pipelined architecture's performance demonstrates at least three orders of magnitude improvement over software implementations.
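For reference, the software baseline that such hardware accelerates is the plain MLP-BP training loop. Below is a minimal pure-Python sketch with a 2-4-1 sigmoid network learning XOR; the topology, learning rate, and epoch count are arbitrary illustrative choices, not the thesis' configuration:

```python
import math, random

random.seed(1)
sig = lambda z: 1.0 / (1.0 + math.exp(-z))

H = 4  # hidden-layer width (illustrative)
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR
lr = 0.5

def forward(x):
    h = [sig(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j]) for j in range(H)]
    o = sig(sum(w2[j] * h[j] for j in range(H)) + b2)
    return h, o

def mse():
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

loss0 = mse()
for _ in range(3000):                       # online (per-pattern) training
    for x, t in data:
        h, o = forward(x)
        d_o = (o - t) * o * (1 - o)         # output delta (chain rule)
        for j in range(H):
            d_h = d_o * w2[j] * h[j] * (1 - h[j])  # hidden delta (old w2)
            w2[j] -= lr * d_o * h[j]
            w1[j][0] -= lr * d_h * x[0]
            w1[j][1] -= lr * d_h * x[1]
            b1[j] -= lr * d_h
        b2 -= lr * d_o
```

Every weight update here depends serially on the previous forward pass, which is precisely the dependency that pipelined hardware architectures like the one above must restructure to gain throughput.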
December 2005, A Hardware/Software Co-design of Fiducia/Mattheus Partitioning Algorithm
By Fujina Li, December 2005
Abstract: The rapidly increasing size and complexity of digital circuits place a pressing demand for faster and more efficient techniques for VLSI physical design automation. Circuit partitioning is the first stage of VLSI physical design automation and significantly affects the results of the later stages. The Fiduccia-Mattheyses (F-M) algorithm has proved to be an efficient approximate algorithm for circuit partitioning, but as digital circuits grow larger and larger, there is a need to speed it up. Although Application Specific Integrated Circuits (ASICs) can achieve speedup over general-purpose processors, their major disadvantage is inflexibility. Reconfigurable computing is a relatively new area of computing: Reconfigurable Computing Systems (RCSs) fill the gap between the performance of ASICs and the flexibility of general-purpose processors. Today the capacity of FPGAs has increased to the point where both an embedded processor and dedicated hardware can be built on a single FPGA chip, providing an excellent platform for a hardware/software co-design approach. To accelerate the F-M algorithm, an embedded computing system consisting of a MicroBlaze processor and speedup hardware on an FPGA chip is proposed, where the computationally intensive modules are implemented in reconfigurable hardware while the MicroBlaze performs the operations that cannot be executed efficiently in hardware. The Xilinx MicroBlaze processor is the industry's fastest soft processor built on FPGAs.
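The computationally intensive kernel being accelerated is the F-M cell-gain calculation: gain(c) counts the nets that become uncut minus the nets that become newly cut if cell c moves across the partition. A minimal sketch with a hypothetical netlist representation (real implementations use incremental gain updates and bucket lists, not this full rescan):

```python
def fm_gain(cell, nets, side):
    """F-M gain of moving `cell` to the other partition.

    nets: list of nets, each a list of cell names.
    side: dict mapping cell name -> partition 0 or 1.
    """
    gain = 0
    for net in nets:
        if cell not in net:
            continue
        same = sum(1 for c in net if side[c] == side[cell])
        other = len(net) - same
        if same == 1:    # cell is alone on its side: moving uncuts the net
            gain += 1
        if other == 0:   # net entirely on cell's side: moving cuts it
            gain -= 1
    return gain

# Tiny example: nets {a,b}, {a,c,d}, {b,d}; a,b on side 0, c,d on side 1.
nets = [["a", "b"], ["a", "c", "d"], ["b", "d"]]
side = {"a": 0, "b": 0, "c": 1, "d": 1}
```

An F-M pass repeatedly moves the highest-gain unlocked cell (subject to a balance constraint), locks it, and updates neighbouring gains; it is exactly this gain bookkeeping that maps well onto dedicated hardware.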
September 2005, A Handel-C Implementation of Artificial Neural Networks on a Reconfigurable Platform
By Vijay Pandya, September 2005
Abstract: Today Artificial Neural Networks (ANNs) are widely used in various applications. The Back-Propagation (BP) algorithm for training an ANN has been increasingly popular since its advent in the late 1980s. The regular structure and broad field of application of the BP algorithm have drawn researchers' attention to time-efficient implementations. General Purpose Processors (GPPs) and ASICs have traditionally been the common computing platforms for building an ANN based on the BP algorithm; however, such machines suffer from the constant need to trade off flexibility against performance. In the last decade or so there has been significant progress in the development of a special kind of hardware: reconfigurable platforms based on Field Programmable Gate Arrays (FPGAs). FPGAs exhibit excellent flexibility, since the same hardware can be reprogrammed, while achieving good performance by allowing parallel computation. In this thesis various hardware implementations of ANNs are investigated. The research described in this thesis proposes three partially parallel architectures and one fully parallel architecture to realize the BP algorithm on an FPGA. The proposed designs are coded in Handel-C and verified for functionality by synthesis on a Virtex2000e FPGA chip. Validation of the designs is carried out on two toy benchmarks and one real-world benchmark. The partially parallel architectures and the fully parallel architecture are found to be 2.25 and 4 times faster, respectively, than a software implementation. All the architectures also exhibit a high rate of Weight Updates per Second (WUPS).
Sep. 2004, Sequential/Parallel Heuristic Algorithms For VLSI Standard Cell Placement
By Guangfa Lu, September 2004
Abstract: With advanced sub-micron technologies, the exponentially increasing number of transistors on a VLSI chip has made placement, a stage of physical design automation, more and more important and consequently extremely complicated and time-consuming. This research addresses placement for VLSI standard-cell designs. A number of heuristic optimization techniques for placement are studied and implemented, in particular local search, Tabu Search, Simulated Annealing, and Genetic Algorithms (GAs). Tabu Search reduces wire length by 52.4% on average, while Simulated Annealing yields a 61% improvement on average. Furthermore, two parallel island-based GA models are implemented on a loosely coupled parallel computing architecture to pursue better performance. The synchronous model achieved an average speedup of 6.2 on seven processors, while the asynchronous model achieved a 7.6 speedup; the former obtained its speedups while maintaining equal or better solution quality than a serial GA. In addition to the above algorithms, preprocessing and postprocessing procedures were analyzed and developed to further enhance solution quality.
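As a concrete illustration of the Simulated Annealing placement approach studied here, the toy loop below swaps cell positions and accepts worsening moves with the Metropolis probability exp(-delta/T), using half-perimeter wirelength (HPWL) as the cost. The cooling schedule, move count, and four-cell netlist are illustrative choices, not the thesis' benchmarks:

```python
import math, random

def hpwl(nets, pos):
    """Total half-perimeter bounding-box wirelength over all nets."""
    total = 0
    for net in nets:
        xs = [pos[c][0] for c in net]
        ys = [pos[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(nets, pos, T=10.0, cooling=0.95, moves=200):
    """Simulated annealing over cell-position swaps; mutates pos in place."""
    random.seed(0)
    cells = list(pos)
    cost = hpwl(nets, pos)
    while T > 0.01:
        for _ in range(moves):
            a, b = random.sample(cells, 2)
            pos[a], pos[b] = pos[b], pos[a]       # trial swap
            delta = hpwl(nets, pos) - cost
            if delta <= 0 or random.random() < math.exp(-delta / T):
                cost += delta                      # accept move
            else:
                pos[a], pos[b] = pos[b], pos[a]    # reject: undo swap
        T *= cooling                               # geometric cooling
    return cost

# Four cells on the corners of a 3x3 grid, three two-pin nets.
nets = [["a", "b"], ["b", "c"], ["a", "d"]]
pos = {"a": (0, 0), "b": (3, 3), "c": (0, 3), "d": (3, 0)}
final = anneal(nets, pos)
```

The high-temperature phase accepts uphill swaps to escape local minima, and the cooling schedule gradually turns the search greedy; this is the same accept/reject structure the thesis' Simulated Annealing placer uses at scale.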