# Performance Improvement by Customization

• Alex Doboli
• Edward H. Currie

## Abstract

The chapter focuses on design methods for optimizing system performance by customizing the architecture to the application’s requirements. The basic concepts are illustrated by examples that employ PSoC’s digital programmable and customized blocks.

## Keywords

Execution Time · Clock Cycle · Pulse Width Modulator · Clock Signal · Assembly Code


Performance-driven customization involves developing optimized hardware circuits and software routines for the performance-critical modules of an application. A module is performance-critical if it has a significant effect on the global performance of the implementation. Customization explores performance–cost tradeoffs for the implementations of the critical modules.

This chapter presents a design methodology for reducing the execution time of algorithms executed on an architecture with one general-purpose processor and coprocessors that can be shared by several modules of an application. This architecture is popular for many embedded applications. The methodology consists of the following steps: specification, profiling, identification of the performance-critical blocks, functional partitioning, hardware–software partitioning, hardware resource allocation, mapping of the performance-critical modules to hardware, and scheduling of the modules that share the same hardware resource.

PSoC’s support for customization and several illustrative design examples for performance-driven customization are also discussed. PSoC includes programmable digital blocks and blocks with dedicated functionality. The programmable blocks can be configured to function as timers, counters, dead-band circuits, and cyclic redundancy check (CRC) generators.

This chapter has the following structure:
• Section 1 introduces the concept of performance improvement by application-specific architecture customization.

• Section 2 presents the design methodology for customization.

• Section 3 discusses PSoC’s programmable digital blocks.

• Section 4 summarizes PSoC’s blocks with dedicated functionality.

• Section 5 presents concluding remarks.

## 4.1 Introduction to Application-Specific Customization

The performance of an embedded system, for example, execution speed and power consumption, can be substantially improved by customizing the system’s architecture to fit the application [7, 9, 10, 19]. Customization involves optimizing the modules for the performance-critical parts of the application (e.g., blocks and subroutines) including the use of customized hardware circuits and software routines.

As a rule of thumb, a block or subroutine is defined as “critical” with respect to a system’s overall performance attribute P if P changes significantly with the modification of an attribute S of the block; in other words, the sensitivity of P with respect to S is very large.

For example, if the attribute P is the total execution time of a system and S is the execution time of a block, the block is critical if it involves a large amount of data processing, and/or data communications with other subroutines. (More precise descriptions of performance criticality are given in Subsection 4.2.1 of the chapter.) The execution speed of an embedded system can often be increased by employing customized digital circuits for the critical parts of the application. Such circuits can be made much faster than corresponding software subroutines, because they do not have to follow the instruction cycle of a general-purpose processor (more precisely, the customized circuits do not implement the fetch-decode-execute-store instruction cycle), or equivalently, require fewer clock cycles for execution. In some applications, it is also possible to exploit the parallelism of the subroutine and to execute multiple operations simultaneously. In this case, the tradeoff is between the design cost and execution speed. The latter can often be increased substantially by relying upon dedicated hardware circuits for the performance-critical parts.

In broad terms, customizing an embedded system implementation involves the following tasks: (i) finding the performance-critical parts in the system (i.e., profiling the system), (ii) selecting the critical parts that should be implemented in hardware, (iii) determining the nature and structure of the hardware circuits that can be used to improve system performance, and (iv) implementing the overall system. The design tasks that correspond to the four generic activities depend on the specific nature of the embedded system architecture, as follows:
• Finding the performance-critical parts of a system requires that the designer estimate the performance characteristics of each part (e.g., blocks and subroutines) of the system, and the contribution of that part to the overall system performance. Improving the critical parts' performance can often significantly enhance overall performance.

For example, if the execution speed of the system is the primary metric of interest, the designer must determine, for each of the blocks and subroutines, the number of clock cycles needed to execute and communicate the data. In addition, it may be of interest to determine the amount of memory needed to store the data variables and to execute the code (e.g., the stack memory). In the literature [7, 9, 11] this step is referred to as profiling.

• Analyzing the suitability for hardware implementation and deciding on the nature of the hardware to be employed: The main goal is to partition the performance-critical parts into the parts that are to be implemented in hardware, and those that are to be implemented in software. The latter are executed by the general-purpose processor. The designer must estimate the resulting system performance and cost tradeoffs for each performance-critical part that is to be implemented in hardware. This step also determines the main characteristics of the hardware used, for example the structure of the circuits, the amount of resource required, and the mapping of the performance-critical parts to the physical circuits.

PSoC provides two kinds of digital blocks that allow implementation of application-specific hardware: (1) programmable digital blocks, which can be programmed to implement basic functions, such as timers, counters, deadband generators, and cyclic redundancy checks (CRCs) [1] (these blocks are discussed in Section 4.3), and (2) specialized digital blocks, which provide dedicated functionality, such as multiply-accumulate (MAC), decimator, and watchdog timer blocks [1].

• Implementing the overall system: This step includes the tasks of designing the hardware circuits, developing the software routines (including the firmware routines for controlling and accessing the hardware), and integrating the entire system.
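The cycle and memory counting performed during profiling can be sketched in a few lines of C. The mnemonics, cycle counts, and byte counts below are illustrative placeholders, not values taken from a processor datasheet:

```c
/* Profiling sketch: sum per-instruction clock cycles and code bytes for a
 * block of assembly instructions. The counts are illustrative only. */
typedef struct {
    const char *mnemonic;   /* instruction name (for readability only) */
    int cycles;             /* clock cycles to execute the instruction */
    int bytes;              /* flash/EPROM bytes occupied by the instruction */
} Insn;

/* Accumulate the execution time and code-memory footprint of one block. */
void profile_block(const Insn *code, int n, int *cycles, int *bytes) {
    *cycles = 0;
    *bytes = 0;
    for (int i = 0; i < n; i++) {
        *cycles += code[i].cycles;
        *bytes  += code[i].bytes;
    }
}
```

Summing these per-block totals over all blocks, weighted by their iteration counts, yields the overall execution time and memory estimates that guide the later partitioning steps.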

To improve design reusability, the customized hardware modules are based on a three-layer approach:
• The circuit layer includes the hardware used for the critical block, or subroutine.

• The low-level firmware layer incorporates the software routines, including interrupt service routines and drivers, that configure and operate the physical blocks. This layer utilizes the physical addresses of the blocks’ control, status, and data registers, to implement functionality. Even if two hardware blocks have the same functionality, their firmware routines might actually differ somewhat, because the two blocks may have different physical implementations.

• The high-level firmware (API) layer involves the interfacing routines that are used in the control and processing routines of an embedded system to operate the hardware blocks. The hardware is described at an abstract level using symbolic identifiers and abstract data, and without involving any physical details about the hardware. These routines rely on the low-level firmware layer for actually interacting with the hardware.

Figure 4.1 illustrates the three-layer implementation of modules. This figure is based on the assumption that different implementation styles are available for a given functional block. Alternative hardware circuits, shown as blocks (i.e., Hardware circuit i), can be used for implementing the block, with specific firmware and API routines existing for each circuit.

High-level firmware layer (API) methods provide an abstract interface to the blocks, hiding all of their implementation details. These methods are called without the designer knowing the implementation details of the method. Only the method’s precise functionality, the resources required for its implementation (e.g., hardware and memory), and the performance it delivers in terms of execution time and power consumption must be known. The designer can utilize this information during the partitioning, hardware allocation, and mapping steps to decide which specific circuit must be used in the system implementation to achieve the required performance. (These tasks are discussed in more detail in Subsection 4.2.2.)
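The three-layer organization can be illustrated with a minimal C sketch for a timer block. The register names (TIMER_CR0, TIMER_DR0) and the bit layout are hypothetical; in real firmware, the low-level layer would access memory-mapped register addresses rather than plain variables:

```c
#include <stdint.h>

/* ---- Circuit layer (modeled): a hypothetical 8-bit timer block. ----
 * Modeled as plain variables here so the sketch is self-contained;
 * real firmware would use the block's physical register addresses.  */
static uint8_t TIMER_CR0;   /* control register (hypothetical) */
static uint8_t TIMER_DR0;   /* period register  (hypothetical) */

/* ---- Low-level firmware layer: raw accesses to this block's registers. ---- */
static void timer_write_period(uint8_t period) { TIMER_DR0 = period; }
static void timer_set_enable(int on) {
    if (on) TIMER_CR0 |= 0x01;             /* hypothetical enable bit */
    else    TIMER_CR0 &= (uint8_t)~0x01;
}

/* ---- High-level firmware (API) layer: symbolic, no physical details. ---- */
void Timer_Start(uint8_t period) {
    timer_write_period(period);   /* delegate physical details downward */
    timer_set_enable(1);
}
uint8_t Timer_IsRunning(void) { return TIMER_CR0 & 0x01; }
```

If the timer were re-mapped to a block with a different register layout, only the two low-level routines would change; callers of Timer_Start would be unaffected, which is the point of the layering.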

The actual design tasks depend on the architecture employed by the embedded application:
• Architectures with a single general-purpose processor and associated coprocessors: The hardware used for the selected performance-critical blocks and the subroutines function as a coprocessor, and the functionality handled in the software domain is executed by the general-purpose processor, which also controls the coprocessor. Each of the selected critical parts is implemented as a dedicated circuit, and therefore, there is no sharing of hardware.

This situation requires a design procedure that partitions the blocks and subroutines into those that are implemented in hardware circuits, and those that are to be implemented in software. This task is called hardware–software partitioning [7, 9, 10]. If the same block, or subroutine, can be implemented using alternative hardware circuits (e.g., different adder circuits, multiplier blocks, etc.) then a second task called hardware resource allocation must be performed to identify the nature of the hardware circuits that offer the best performance–cost tradeoff.

• Architectures that have one general-purpose processor and shared coprocessors: Although this system has only one general-purpose processor, the same hardware circuits can be shared to implement multiple, performance-critical subroutines, or blocks. In this case, the processor controls the shared coprocessors. Compared to the previously discussed architectures that did not allow sharing of the coprocessors, the cost of the system implementation can be significantly lower due to the hardware sharing, and the performance may be only marginally degraded as a result of the additional overhead introduced by sharing.

In addition to the hardware–software partitioning, and hardware resource allocation tasks, the mapping (binding) of the critical blocks and subroutines to the hardware, and the execution scheduling for the critical blocks that share hardware is also required. The corresponding design flow is summarized in Figure 4.2, and detailed in Subsection 4.2.2.
• Architectures that have multiple general-purpose processors plus shared coprocessors: This is the most general case. There may be no processor that maintains overall control. The importance of such architectures for computationally intensive applications is continuing to increase, for example in multimedia, image processing, and telecommunications.

In this case, the design process involves not only the tasks of hardware–software partitioning, hardware resource allocation and mapping/scheduling of the customized hardware that are specific to the previous architectures, but also the tasks of allocating the general-purpose processors, mapping the software subroutines to processors, allocating the interconnect structure between the processors, mapping the data communications to the interconnect structures, and scheduling the software routines and data communications.

## 4.2 Design Methodology for Architecture Customization

In this section, the discussion focuses on the design methodology for embedded applications that utilize architectures with a single general-purpose processor and shared coprocessors for the performance-critical blocks and subroutines.

Figure 4.2 illustrates the design methodology for improving the execution time of embedded systems by customizing the system’s architecture. This methodology is based on the following tasks: (1) specification and profiling, (2) hardware–software partitioning, (3) hardware resource allocation, (4) operation binding, and (5) operation scheduling.

### 4.2.1 System Specification and Profiling

The goal of the profiling step is to characterize the blocks and subroutines of the application specification (i.e., the algorithm) in terms of execution time, measured in clock cycles, and of the memory size required for both data and program. Then, as shown in Figure 4.2, this information can be used to guide the design methodology. This section uses an example to illustrate the profiling step.

Figure 4.3 presents a typical example of an iterative algorithm used in digital signal processing (DSP) [4, 8, 15] applications that share the following defining characteristics:
1. The processing is data dominated, and involves a large number of multiplication/addition computations and data transfers to/from the data arrays.

2. The number of iterations for each loop is known and constant.

3. The iterations of the inner (“for”) loop are either uncorrelated or loosely correlated, because there are no interdependencies between iterations; therefore, the inner loop can be executed in parallel, provided that the architecture has the necessary hardware capabilities.

These three features of DSP algorithms encourage two kinds of performance optimizations of the system:
1. Optimization of the operators, individual instructions, and instruction sequences inside the loop bodies.

2. Optimization of the processing, and data accesses, across different loop iterations.
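The specification of Figure 4.3 is not reproduced here, but a loop nest with the three characteristics above can be sketched in C, using the chapter's array names a, F, and G and its accumulator aux. SZ and L are given small illustrative values, and the element types follow the one-byte assumption made later in the text:

```c
#include <stdint.h>

#define SZ 4   /* row length (illustrative value) */
#define L  1   /* window half-size; the inner loops run (2L+1)^2 times */

/* Data-dominated loop nest: fixed, known iteration counts, and inner
 * iterations with no interdependencies, hence parallelizable in hardware. */
void filter(uint8_t a[SZ][SZ], uint8_t F[SZ + 2*L][SZ + 2*L],
            uint8_t G[SZ][SZ]) {
    for (int i = 0; i < SZ; i++)
        for (int j = 0; j < SZ; j++) {
            uint8_t aux = 0;                       /* accumulator */
            for (int m = 0; m < 2*L + 1; m++)
                for (int n = 0; n < 2*L + 1; n++)
                    aux = (uint8_t)(aux + a[i][j] * F[i + m][j + n]);
            G[i][j] = aux;                         /* one output element */
        }
}
```

The iteration counts SZ² (2L+1)² of this nest are exactly the weights that appear in the execution-time analysis later in the section.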

Before discussing the parts that form the specification in Figure 4.3, the organization of the array data in the memory has to be detailed. This is important because accessing the data stored in the two-dimensional arrays involves significant overhead resulting from computation of the positions of the elements. Figure 4.4 illustrates the organization of the data arrays. For simplicity, it is assumed that the data and memory word size are the same, for example if PSoC is the target architecture, then the array elements are each one byte. The elements in the two-dimensional arrays a, F, and G in the specification are stored in consecutive memory locations, one row after another, as shown. The element a[m][n] is located at the following entry i of the linear sequence:
$${ i} = { m \times SZ + n}$$
(4.1)
where the constant SZ is the length of a row. For example, the memory entry for the element a[1][2] is at index $$i = 1 \times SZ + 2 = SZ + 2$$, as shown in Figure 4.4.
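Equation (4.1) can be checked with a few lines of C; the function below is a direct transcription of the row-major indexing rule:

```c
#include <stddef.h>

/* Row-major index of element a[m][n], per Equation (4.1): i = m * SZ + n,
 * where sz is the row length of the two-dimensional array. */
size_t row_major_index(size_t m, size_t n, size_t sz) {
    return m * sz + n;
}
```

For SZ = 4, the element a[1][2] maps to index 1 × 4 + 2 = 6, matching the SZ + 2 position shown in Figure 4.4.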

Both types of optimizations use the data collected in the profiling step.

Profiling involves the following steps:
• Producing the assembly code level description: The application’s assembly code is obtained by compiling (i.e., translating) the C program into assembly code. The designer can then determine the execution time and the program/data memory requirements by summing the number of execution clock cycles and memory bytes, respectively, required to execute each assembly code instruction.

• Functional partitioning: This is a hardware-related reorganization of the specification. The assembly code instructions are grouped into blocks, such that each block corresponds to a single instruction in the initial C program, and each block can be implemented by a hardware circuit with precisely defined functionality. This is important in determining the impact of a dedicated hardware circuit with respect to performance improvement.

• Finding the performance-criticality of the blocks: This requires the designer to calculate the performance attributes for each block in the specification, and for the entire system. Based on this analysis, the designer can determine the performance-criticality of the blocks.

For example, consider the following profiling example.

A. Developing the assembly code description . Figure 4.5 shows the assembly code structure of the specification in Figure 4.3. This representation is important for correctly estimating the number of clock cycles needed for each part of the C specification (i.e., the instructions or operations). This is quite easy to compute because the number of clock cycles required to execute each assembly instruction is known for a general-purpose processor. Also, the amount of RAM required for storing data, and flash/EPROM for storing code can be estimated.
The following is the correspondence between the high-level instructions in Figure 4.3 and the assembly code in Figure 4.5. (Note that although the assembly code in what follows is based on the PSoC microcontroller instruction set, the discussion is valid for any general-purpose processor.)
• The instruction for, with the index i, is represented by Blocks 1 and 2: Block 1 initializes the index i where [i] is the memory cell for variable i, and Block 2 checks the upper bound of the loop instruction. Block 18 increments the index after each iteration.

• Similarly, Blocks 3 and 4 correspond to the for loop with the index j where [j] is the memory cell for variable j. Block 17 increments the index j.

• The initialization of the element aux corresponds to Block 5 and [aux] is the memory cell for variable aux.

• Blocks 6 and 7 are used for the for instruction with the index m where [m] is the memory cell for variable m, and Blocks 8 and 9 are used for the for loop with the index n where [n] is the memory cell for variable n. Block 14 increments the index n, and Block 15 increments the index m.

• Block 10 accesses the array element a[i][j] from memory. The first five instructions use Equation (4.1) to compute the index of the array element in memory. The element is then retrieved using indexed addressing, where the constant _a indicates the start address of the array a. The data are stored in the memory location temp2.

As is the case for the PSoC microcontroller’s instruction set, it is assumed that the general-purpose processor does not have an instruction for multiplication, but instead uses the multiplication subroutine, _mult. The multiplicands are passed to the multiplication subroutine in the A and X registers, and the result is returned in these registers.

• Block 11 represents the assembly code for accessing the array element $$\textrm{F}[i+m][j+n]$$. Instructions 1-8 compute the index of the element in memory. Then, the value is accessed using indexed addressing with the constant _F pointing to the start address of the array F.

• Block 12 multiplies a[i][j] with $$F[i+m][j+n]$$, and the result is added to the value of the element aux by Block 13.

• Block 16 assigns the value to the element G[i][j]. The first four instructions compute the index of the element, and the rest of the instructions assign the value of the variable aux to the element G[i][j].

B. Functional partitioning for hardware-oriented specification reorganization. The initial block structure of the specification is restructured so that all the instructions in a block can be implemented using only one hardware module with a precisely defined functionality. This is important in determining the performance improvement that results from developing, and then utilizing, a customized hardware circuit for a set of assembly code instructions. For example, the first five instructions of Block 10, in Figure 4.5, compute the index of an array element in memory, and instructions 6 and 7 access memory to retrieve the array element and store it in a processor register. From an execution point of view, the first five instructions use the hardware circuits for computation (i.e., for multiplication and addition), and instructions 6 and 7 employ the memory circuits, and the address, data, and control signals of the memory modules. Hence, improving the execution speed of Block 10 must take into consideration the hardware required for computing the address, and the hardware for data access from memory. To determine the performance improvement offered by each customized hardware module, it is important to split Block 10 into two sub-blocks, such that each sub-block can be implemented by hardware with dedicated functionality.

The block structure in Figure 4.5 is reorganized as shown in Figure 4.6 to reflect the types of circuits used in processing and communication. Block 10 was split into the two sub-Blocks 10-1 and 10-2. The first nine instructions of Block 11 compute the index of an array element, and were organized as sub-Block 11-1, while the remaining instruction, defined as sub-Block 11-2, accesses memory. Similarly, Block 16 was split into sub-Block 16-1, which computes an array element index, and sub-Block 16-2, which writes an array element to memory.

C. Finding the performance-criticality of the blocks. Table 4.1 shows the profiling data for the refined block structure shown in Figure 4.6. The first column shows the block number, the second the number of required clock cycles, the third the number of bytes in nonvolatile memory (e.g., flash or EPROM) required to store the block’s code, and the fourth the number of internal registers storing data. For Blocks 10-1, 11-1, 12, and 16-1, the profiling data include the number of clock cycles required to execute the multiplication subroutine mult, and the number of EPROM bytes required to store the code of the subroutine. The execution time for Blocks 10-2, 11-2, and 16-2 includes the time needed to transfer data from the memory. These values are expressed as symbols because their actual values depend on the implementation style.
Table 4.1: Profiling data for the block structure in Figure 4.6.

| Block | # Clock Cycles    | # EPROM Bytes | # Registers |
|-------|-------------------|---------------|-------------|
| 1     | 8                 | 3             | -           |
| 2     | 13                | 5             | -           |
| 3     | 8                 | 3             | -           |
| 4     | 13                | 5             | -           |
| 5     | 8                 | 3             | -           |
| 6     | 8                 | 3             | -           |
| 7     | 13                | 5             | -           |
| 8     | 8                 | 3             | -           |
| 9     | 13                | 5             | -           |
| 10-1  | 30 + mult         | 9 + mult      | 2           |
| 10-2  | 5 + memory access | 4             | 2           |
| 11-1  | 52 + mult         | 17            | 2           |
| 11-2  | memory access     | 2             | 2           |
| 12    | 17 + mult         | 4             | 2           |
| 13    | 7                 | 2             | 1           |
| 14    | 12                | 4             | -           |
| 15    | 12                | 4             | -           |
| 16-1  | 35 + mult         | 11            | 2           |
| 16-2  | memory access     | 2             | 2           |
| 17    | 12                | 4             | -           |
| 18    | 12                | 4             | -           |

The execution-time criticality of the blocks is computed starting from the total execution time of the application:
$$\begin{array}{ll} T_{overall} = & SZ^2\,(2L+1)^2\,(T_9 + T_{10\textrm{-}1} + T_{10\textrm{-}2} + T_{11\textrm{-}1} + T_{11\textrm{-}2} + T_{12} + T_{13} + T_{14}) \\ & +\; SZ^2\,(2L+1)\,(T_7 + T_8 + T_{15}) \\ & +\; SZ^2\,(T_4 + T_5 + T_6 + T_{16\textrm{-}1} + T_{16\textrm{-}2} + T_{17}) \\ & +\; SZ\,(T_2 + T_3 + T_{18}) + T_1 \end{array}$$
(4.2–4.3)
or, considering the profiled clock cycles in Table 4.1:
$$\begin{array}{ll} T_{overall} = & SZ^2\,(2L+1)^2\,(136 + 3\,T_{mult} + 2\,T_{mem\ access}) + SZ^2\,(2L+1)\,(33) \\ & +\; SZ^2\,(76 + T_{mult} + T_{mem\ access}) + SZ\,(33) + 8 \end{array}$$
(4.4)

The blocks’ performance-criticality is characterized starting from Equation (4.4). This equation shows that the highest criticality is that of Blocks 9, 10-1, 10-2, 11-1, 11-2, 12, 13, and 14, whose execution complexity is proportional to the value $$SZ^2(2L+1)^2$$. For example, if $$\textrm{SZ} = \textrm{4}$$ and $$\textrm{L} = \textrm{2}$$, then the execution time of these eight blocks is more than 90% of the total execution time of the system. This percentage increases as the values of the two constants increase.
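Equation (4.4) can be evaluated numerically to check this criticality claim. The costs of the multiplication subroutine and of one memory access are left symbolic in the text, so the sketch below takes them as parameters; the values used in the example call are illustrative assumptions:

```c
/* Share of the total execution time (Equation 4.4) spent in the critical
 * blocks 9, 10-1, 10-2, 11-1, 11-2, 12, 13, and 14, whose cost is weighted
 * by SZ^2 (2L+1)^2. mult and mem are assumed per-call cycle counts. */
double critical_fraction(double sz, double l, double mult, double mem) {
    double w = (2.0*l + 1.0) * (2.0*l + 1.0);
    double critical = sz*sz * w * (136.0 + 3.0*mult + 2.0*mem);
    double total = critical
                 + sz*sz * (2.0*l + 1.0) * 33.0   /* Blocks 7, 8, 15       */
                 + sz*sz * (76.0 + mult + mem)    /* Blocks 4-6, 16-*, 17  */
                 + sz * 33.0                      /* Blocks 2, 3, 18       */
                 + 8.0;                           /* Block 1               */
    return critical / total;
}
```

With SZ = 4, L = 2, and plausible subroutine costs (e.g., 50 cycles for _mult and 5 cycles per memory access), the critical fraction evaluates to roughly 96%, consistent with the "more than 90%" figure above.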

### 4.2.2 System Partitioning and Implementation

The goal of the design process is to maximize the performance of the system implementation, for example minimize the system execution time by utilizing customized hardware circuits for the performance-critical blocks. Two kinds of customization optimizations can be achieved in the process: (i) optimization of the operators, individual instructions, and instruction sequences of the critical blocks, and (ii) optimization of the data processing and communication pertaining to different critical blocks.

The targeted architecture has one general-purpose processor and one coprocessor that can be shared among multiple performance-critical blocks and subroutines, as shown in Figure 4.7. The general-purpose processor executes the blocks and subroutines in the software domain. In addition, the coprocessor consists of hardware circuits that implement the critical blocks mapped to the hardware domain. The processor and coprocessor exchange data via the shared memory. These data can be mapped to multiple memory circuits, so that multiple data accesses can occur in parallel. The interconnect structure provides the needed connectivity between the processing elements and memory.

The first design optimization refers to the execution-time, optimized implementation of the operators, individual instructions, and instruction sequences pertaining to the performance-critical regions of an application. The example in Figure 4.3 is used as an illustration of this optimization. The refined block structure is shown in Figure 4.6, and the profiling data are given in Table 4.1. Blocks 9, 10-1, 10-2, 11-1, 11-2, 12, 13, and 14 are the performance-critical blocks of the application. As shown in Table 4.1, of the eight blocks, the most critical from an execution time point-of-view, are Blocks 10-1, 10-2, 11-1, 11-2, and 12. These blocks are further considered as candidates for implementation as customized hardware circuits.

These five blocks provide three kinds of functionality: Blocks 10-1 and 11-1 compute the index of an array element, Blocks 10-2 and 11-2 access the data memory, and Block 12 multiplies two values. The nature of the operations involved for the three block types, and the kind of corresponding hardware circuits, are different. For example, the blocks for computing the array element indexes involve data transfers between the processor registers, addition operations, and multiplications. The blocks for accessing the data memory involve registers and interconnect to memory. Block 12 requires data registers for the operands, and a multiplication module.

Figure 4.8 shows possible hardware structures for the implementation of the performance-critical blocks:
• Figure 4.8(a) shows the structure of a possible hardware module for Block 12. The structure uses three registers to store the two operands and the result, and a multiplier circuit. In addition to the time $$T_{mult}$$, as measured in clock cycles, required to perform the multiplication operation, the circuit also introduces overhead for register loading time, and retrieving the result from the register. The total execution time is given by:
$${ T^{HW}_{12}} = { T_{mult} + 2 \times T_{reg load} + T_{reg read}}$$
(4.5)
• Figure 4.8(b) shows a possible hardware structure for accessing the memory that stores the array elements. The access times for one memory word are cumulatively represented as the time $$T_{mem read}$$ for reading data from memory to a processor register, and the time $$T_{mem write}$$ for writing data from a processor register to memory. Hence,
$${ T_{10-2}} = { T_{11-2}} = {T_{ mem read}}$$
(4.6)
• Figure 4.8(c) illustrates a possible hardware structure for computing the index of the memory element a[i][j], if the array a is a two-dimensional array of size $$SZ \times SZ$$. This structure is an alternative to the more “intuitive” implementation that uses a multiplier module to compute the starting address of row i, within the array.

Each element of the lookup table stores a pointer to the first element of a row: the first element points to the start of the first row, the second element to the start of the second row, and so on. The contents of the table are known once the array data are allocated in memory. The memory address is found by summing the starting memory address of the array a, the displacement of the row i, and the displacement, within the row i, of the column j. The resulting memory address is placed on the memory address bus. The advantage of this solution is that no multiplications are needed, at the cost of extra memory for storing the lookup table.

The total execution time $$T_{ad comp}$$ includes the time to load the registers holding the indexes i and j, the time to read from the lookup table, and the time to perform the two additions:
$${T^{HW}_{10-1}} = { 2 \times T_{reg load} + T_{ {lookup}} + 2 \times T_{adder}}$$
(4.7)
$${ T^{HW}_{11-1}} = { 4 \times T_{reg load} + T_{ {lookup}} + 4 \times T_{adder}}$$
(4.8)
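The lookup-table address computation of Figure 4.8(c) can be modeled functionally in C. The sketch below assumes SZ = 4 and takes the array's base address as a parameter (the actual base would be fixed at allocation time):

```c
#include <stdint.h>

#define SZ 4   /* row length (illustrative value) */

/* Lookup table of row displacements: entry i holds i * SZ. It is filled
 * once the array is allocated, so address generation needs no multiplier. */
static const uint16_t row_offset[SZ] = { 0, SZ, 2*SZ, 3*SZ };

/* Address of a[i][j] as in Figure 4.8(c): base address + row displacement
 * + column, i.e., one table read followed by two additions. */
uint16_t elem_address(uint16_t base_a, uint8_t i, uint8_t j) {
    return (uint16_t)(base_a + row_offset[i] + j);
}
```

The table read plus two additions correspond directly to the $$T_{lookup} + 2 \times T_{adder}$$ terms of Equation (4.7).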

The information about the block-criticality and the execution time of the corresponding hardware circuits is used to guide the design decisions with respect to (a) hardware–software partitioning, (b) hardware resource allocation, (c) operation binding, and (d) scheduling.

A. Hardware–software partitioning . Hardware–software partitioning is the task of separating the specification blocks, and subroutines, into those that are implemented in the hardware domain (using customized circuits), and those that are executed by the general-purpose processor, so that the performance improvement is maximized, and the customization cost is less than a predefined limit. Alternatively, the scope of hardware–software partitioning can also be defined as cost minimization subject to satisfying the performance constraints.

Each of the performance-critical, customized hardware blocks and subroutines is analyzed to determine the system performance improvement. For example, if Block 11-1, is based on the hardware circuit shown in Figure 4.8(c), then the change in execution time for the system becomes:
$${ \Delta T} = { SZ^2 (2L + 1)^2 (T^{SW}_{11-1} - T^{HW}_{11-1})}$$
(4.9)
The resulting execution speed-up, defined as the ratio of the software execution time to the hardware–software execution time, for the system can be approximated by:
$${ Speedup} \quad \approx \quad { \frac {T^{SW}_9 + T^{SW}_{10-1} + T^{SW}_{10-2} + T^{SW}_{11-1} + T^{SW}_{11-2} + T^{SW}_{12} + T^{SW}_{13} + T^{SW}_{14}}{T^{SW}_9 + T^{SW}_{10-1} + T^{SW}_{10-2} + T^{HW}_{11-1} + T^{SW}_{11-2} + T^{SW}_{12} + T^{SW}_{13} + T^{SW}_{14}}}$$
(4.10)
Similarly, if Blocks 10-1, 10-2, 11-1, 11-2, and 12 are all implemented in hardware, the change in execution time is given by:
$$\begin{array}{lll}{ \Delta T} = SZ^2 (2L + 1)^2 [(T^{SW}_{10-1} - T^{HW}_{10-1}) + (T^{SW}_{10-2} - T^{HW}_{10-2}) \\ & & + (T^{SW}_{11-1} - T^{HW}_{11-1}) + { (T^{SW}_{11-2} - T^{HW}_{11-2}) + (T^{SW}_{12} - T^{HW}_{12})]}\end{array}$$
(4.11)
and the resulting speed-up is:
$${ Speedup} \quad \approx \quad { \frac{T^{SW}_9 + T^{SW}_{10-1} + T^{SW}_{10-2} + T^{SW}_{11-1} + T^{SW}_{11-2} + T^{SW}_{12} + T^{SW}_{13} + T^{SW}_{14}}{T^{SW}_9 + T^{HW}_{10-1} + T^{HW}_{10-2} + T^{HW}_{11-1} + T^{HW}_{11-2} + T^{HW}_{12} + T^{SW}_{13} + T^{SW}_{14}}}$$
(4.12)
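Equation (4.12) can be evaluated once values are assumed for the hardware execution times and for the mult and memory-access costs, none of which are fixed numerically by the text. The software cycle counts are taken from Table 4.1:

```c
/* Speedup per Equation (4.12): inner-loop time with everything in software,
 * over the time with Blocks 10-1, 10-2, 11-1, 11-2, and 12 in hardware.
 * mult/mem and the five hardware times are assumed (illustrative) values. */
double speedup_all_hw(double mult, double mem,
                      double hw10_1, double hw10_2, double hw11_1,
                      double hw11_2, double hw12) {
    /* Numerator: T9..T14 in software = 136 + 3*mult + 2*mem (Table 4.1). */
    double sw = 136.0 + 3.0*mult + 2.0*mem;
    /* Denominator: T9 (13), T13 (7), T14 (12) stay in software;
     * the five critical blocks use their hardware execution times. */
    double hw = 13.0 + 7.0 + 12.0
              + hw10_1 + hw10_2 + hw11_1 + hw11_2 + hw12;
    return sw / hw;
}
```

For example, with an assumed 50-cycle _mult, 5-cycle memory access, and single-digit hardware block latencies, the inner-loop speedup is several-fold, because the three multiplication calls dominate the software time.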

If the cost limit is exceeded, then the design process must consider alternative subsets of the set of performance-critical blocks, and find the subset that gives the best performance improvement, at a cost below the limit. For example, in the case of the five performance-critical blocks, the alternative subsets would be {Block 10-1, Block 11-1, Block 12}, {Block 10-2, Block 11-2, Block 12}, {Block 11-1, Block 11-2, Block 12}, and so on.

Note that, in the worst case, the number of subsets that must be analyzed to find the best performance/cost tradeoff can be very large: if the set of performance-critical blocks has cardinality N, then $$2^N$$ subsets must be analyzed, a number that grows exponentially with N. To address this issue, blocks are analyzed in decreasing order of the ratio:
$${ E}_{i} = { \frac{\Delta T_{i}}{\Delta Cost_{i}}}$$
(4.13)
where $$\Delta T_i$$ is the difference in execution time, in clock cycles, between the software and hardware implementations of Block i, and $$\Delta Cost_i$$ is the cost increase due to the customized circuit. The factor $$E_i$$ describes the performance improvement, per cost unit, provided by Block i. Therefore, provided that the cost limit is not exceeded, the design method maps the performance-critical blocks to the hardware domain in decreasing order of $$E_i$$. For the five performance-critical blocks, this method maps Block 11-1 to the hardware domain first, followed by Blocks 10-1, 12, 10-2, and 11-2.
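The greedy selection by decreasing $$E_i$$ can be sketched in C. The block names match the example above, but the ΔT and ΔCost values in the test are illustrative placeholders, not measured figures:

```c
#include <stdio.h>
#include <stdlib.h>

/* One performance-critical block: execution-time gain (clock cycles)
   and hardware cost of its customized circuit. */
typedef struct {
    const char *name;
    double dT;    /* T_sw - T_hw, in clock cycles */
    double dCost; /* cost increase of the circuit */
} Block;

/* Sort in decreasing order of E_i = dT_i / dCost_i. */
static int by_efficiency_desc(const void *a, const void *b)
{
    const Block *x = a, *y = b;
    double ex = x->dT / x->dCost, ey = y->dT / y->dCost;
    return (ex < ey) - (ex > ey);
}

/* Greedily map blocks to hardware in decreasing order of E_i,
   skipping any block whose cost would exceed the remaining budget.
   Returns the total execution-time gain of the selected subset. */
double greedy_partition(Block *blocks, int n, double cost_limit)
{
    double gain = 0.0, cost = 0.0;
    qsort(blocks, n, sizeof(Block), by_efficiency_desc);
    for (int i = 0; i < n; i++) {
        if (cost + blocks[i].dCost <= cost_limit) {
            cost += blocks[i].dCost;
            gain += blocks[i].dT;
            printf("map %s to hardware\n", blocks[i].name);
        }
    }
    return gain;
}
```

This greedy heuristic does not always find the optimal subset, but it avoids enumerating all $$2^N$$ subsets.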

B. Hardware resource allocation. This step identifies the type and number of hardware circuits that are incorporated into the application-specific coprocessor. This step is justified by the fact that there are often many alternative hardware implementations for the same block functionality. These alternative implementations not only provide different execution times and consume different amounts of hardware, but also create distinct opportunities for sharing hardware circuits among functional blocks. Sharing is an important way to reduce the total cost.

For example, consider Blocks 10-1 and 11-1, which compute the indexes of elements in two different arrays, as shown in Figures 4.5 and 4.6. In principle, these blocks perform the same computations; however, Block 11-1 calculates the element indexes by summing the values of four variables. Figure 4.8(c) presents the customized circuit for Block 10-1, and Figure 4.9(a) introduces the circuit for Block 11-1. The only difference between the circuits is the two additional adders in the second circuit. Equations (4.7) and (4.8) express the respective execution times.

The two blocks can be modified in several ways to lower the total implementation cost. Cost can be reduced by sharing a single hardware circuit between the two blocks, instead of having a separate circuit for each. Figure 4.9(b) shows the resulting circuit. Two multiplexer circuits are added to the initial Block 11-1 circuit, so that the correct indexes of the array elements can be selected. A third multiplexer selects the starting address of the array, represented by the constant _a for Block 10-1, and the constant _F for Block 11-1. The selection signals for the multiplexer circuits can be implemented by setting a dedicated bit if Block 10-1 is executed, and resetting the bit if Block 11-1 is performed. The implementation cost is reduced by eliminating the customized circuit for Block 10-1, although the reduction is slightly diminished by the three additional multiplexers.

The tradeoff, in this case, is an increase in the execution times of the two blocks. Assuming that all of the control signals are generated in sequence, the new (worst-case) execution times are:
$$T^{'HW}_{10-1} = 2 \times T_{reg\,load} + T_{lookup} + 2 \times T_{adder} + 3 \times T_{MUX}$$
(4.14)
$$T^{'HW}_{11-1} = 4 \times T_{reg\,load} + T_{lookup} + 4 \times T_{adder} + 3 \times T_{MUX}$$
(4.15)

In the best case, if the multiplexer circuits are activated in parallel, the two execution times increase only by the propagation delay of one multiplexer circuit.

The cost of the hardware circuits in Figure 4.9(b) can be further lowered by sharing the circuits for computing the two indexes of the array elements. The resulting circuit has a data path and controller circuits, which are shown in Figures 4.10 and 4.11. The data path circuit in Figure 4.10 uses the same adder circuit to first sum the values x and i, and then to sum values y and j.

Figure 4.11(a) is the controller for Block 10-1, and Figure 4.11(b) is the controller for Block 11-1. For Block 10-1, in state s1, the circuit loads values into the data path registers. In state s2, the circuit accesses the lookup table, and the selected value is stored in the A register. Finally, in state s3, two sums are computed: register A + DEMUX output, and _a + index. The actions corresponding to the controller states of Block 11-1 are indicated in Figure 4.11. The multiplexer and demultiplexer circuits of the data path are controlled by signals that depend on s2, s3, t2, and t3. Each of these four signals is 1 if the controller circuit is in the corresponding state, and 0 otherwise. The functions F are the logic functions of the combinational circuits that produce the select signals for the multiplexer and demultiplexer circuits.
The execution times of Blocks 10-1 and 11-1 are now equal, and described by the following:
$$T_{10-1} = T_{11-1} = 4 \times T_{FSM}$$
(4.16)
$$T_{FSM} = \max \{T_{reg\,load},\; T_{adder} + T_{DEMUX} + T_{lookup} + T_{reg\,load},\; T_{DEMUX} + 3 \times T_{adder} + T_{reg\,load}\}$$
(4.17)

Equation (4.17) defines the clock period for the controller circuit. Note that this circuit has lower implementation cost, in terms of the data path components, than the previous two circuits, but the execution times of Blocks 10-1 and 11-1 are now larger.

Figures 4.12 and 4.13 show a hardware solution with an even smaller number of data path components than the previous designs. A single adder circuit is introduced to perform all addition operations of Blocks 10-1 and 11-1. For the data path in Figure 4.12, the actions executed in each state of the controller are also shown. This circuit has the simplest data path of the hardware designs presented.
In this case, the execution times of Blocks 10-1 and 11-1 are the longest and given by:
$$T_{10-1} = T_{11-1} = 6 \times T_{FSM}$$
(4.18)
$$T_{FSM} = \max \{T_{reg\,load},\; T_{adder} + T_{DEMUX} + T_{lookup} + T_{reg\,load},\; T_{DEMUX} + T_{adder} + T_{reg\,load}\}$$
(4.19)
$$T_{FSM} \approx T_{adder} + T_{DEMUX} + T_{lookup} + T_{reg\,load}$$
(4.20)
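The sharing alternatives can be compared numerically. The sketch below evaluates Equations (4.14), (4.16), and (4.17) side by side; the component delays are illustrative placeholders, not PSoC datasheet figures:

```c
/* Illustrative component delays in nanoseconds (placeholders). */
static const double T_REG_LOAD = 2.0;
static const double T_LOOKUP   = 4.0;
static const double T_ADDER    = 3.0;
static const double T_MUX      = 1.0;
static const double T_DEMUX    = 1.0;

static double max3(double a, double b, double c)
{
    double m = a > b ? a : b;
    return m > c ? m : c;
}

/* Eq. (4.14): Block 10-1 on the multiplexed circuit of Fig. 4.9(b). */
double t_hw_10_1_muxed(void)
{
    return 2*T_REG_LOAD + T_LOOKUP + 2*T_ADDER + 3*T_MUX;
}

/* Eq. (4.17): clock period of the FSM that sequences the shared
   data path of Figs. 4.10 and 4.11 (the max over the per-state delays). */
double t_fsm_shared(void)
{
    return max3(T_REG_LOAD,
                T_ADDER + T_DEMUX + T_LOOKUP + T_REG_LOAD,
                T_DEMUX + 3*T_ADDER + T_REG_LOAD);
}

/* Eq. (4.16): both blocks need four FSM cycles on the shared path. */
double t_blocks_shared(void)
{
    return 4 * t_fsm_shared();
}
```

With these placeholder delays, the multiplexed circuit finishes well before four cycles of the shared FSM, illustrating the cost/time tradeoff of the text.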
In addition to the alternatives for sharing hardware modules among the block implementations, another possibility for improving the system's execution time is to use faster adder and multiplier circuits. For example, ripple-carry adder circuits have a lower cost than carry-lookahead adder circuits, but their time delay is also significantly longer. Discussing efficient circuit structures for different arithmetic and logic operators is beyond the scope of this text. More details can be found in the related literature [4, 8, 15].

C–D. Operation binding/scheduling. In addition to improving system performance by employing customized hardware for individual blocks and subroutines, the designer can also exploit the parallelism between operations in the same block, or in different blocks. This further reduces the system execution time, because multiple operations can be executed in parallel.

Without addressing the operation parallelism of the application, its execution time, shown in Figure 4.3, cannot be reduced below the limit given by:
$$T_{limit} = SZ^2 (2L+1)^2 (T_9 + T_{10-1} + T_{10-2} + T_{11-1} + T_{11-2} + T_{12} + T_{13} + T_{14})$$
(4.21)

However, if the values of the constants SZ and L are large, this limit may still be very large.

The execution time of the application can be reduced by observing that there are no dependencies between the iterations of the innermost loop, other than all iterations adding a value to the variable aux. This dependency can be removed by introducing the array aux[2L+1], such that element n corresponds to iteration n, which uses this element to store the product of the two matrix elements. The value auxv is then computed by summing all the elements of array aux[2L+1]. Figure 4.14 shows the modified algorithm.
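The transformation can be sketched in C. The array names a and F and the element access pattern below are simplified stand-ins for the actual matrix computation:

```c
#define L 2
#define N (2*L + 1)

/* Original inner loop: every iteration adds into the single scalar
   'aux', which serializes the loop. */
double inner_serial(const double a[N], const double F[N])
{
    double aux = 0.0;
    for (int n = 0; n < N; n++)
        aux += a[n] * F[n];
    return aux;
}

/* Transformed version: iteration n writes only aux[n], so all N
   products can be computed in parallel; the final reduction sums
   the array afterwards. */
double inner_parallel(const double a[N], const double F[N])
{
    double aux[N];
    for (int n = 0; n < N; n++)   /* independent iterations */
        aux[n] = a[n] * F[n];

    double auxv = 0.0;
    for (int n = 0; n < N; n++)   /* reduction step */
        auxv += aux[n];
    return auxv;
}
```

Both versions compute the same value; the second merely exposes the independence of the N products to parallel hardware.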

If the hardware circuits offer support for parallel execution, modifying the algorithm in this way allows the innermost iterations to be executed in parallel. Figure 4.15 presents the block structure of the application, so that the operation parallelism is exposed. The figure assumes that $$L = 2$$, hence the innermost loop of the algorithm includes five iterations. For simplicity, the initial performance-critical blocks of the application were merged into the following blocks: Block CA computes the address of an array element, Block MA accesses the data memory, and Block * multiplies two data values.

In this case, implementing optimized hardware circuits involves two additional steps: operation binding and operation scheduling:
• Operation binding is the process of mapping the specification’s functional blocks to the hardware resources of the architecture. This step is justified by the fact that the application-specific coprocessor can be based on multiple instances of the same hardware circuits, and therefore there is a variety of ways in which the functional blocks can be mapped to the circuits.
• Operation scheduling is the activity of time sequencing the execution of blocks and subroutines mapped to the same hardware circuits.

The goal of operation binding and scheduling is to maximize the performance improvement, by minimizing the execution time of the system.

For the parallelized inner loop of the block structure in Figure 4.15, Figure 4.16 illustrates two instances of operation binding and scheduling. These instances differ in terms of the number of hardware circuits that were allocated in the coprocessor.

In Figure 4.16(a), the architecture includes a distinct hardware circuit for each of the blocks CA (computing the index), MA (accessing the memory), and * (multiplication). Because of resource limitations, only the blocks pertaining to two of the five iterations can be executed simultaneously. The time scheduling of the execution is also shown. For example, in step 0, only Block CA is executed, but in step 1, Blocks MA and CA of two different iterations occur in parallel, because they use different hardware circuits. Hence, the execution of the two iterations ends in only six steps, as compared to ten steps if the two iterations are executed in sequence. Four iterations are performed in ten steps using parallel execution, as compared to the twenty steps required for serial execution. The total execution time of the five iterations is 15 cycles for the parallel execution, and 25 cycles if the iterations are performed sequentially. Hence, compared to sequential execution, parallel execution provides a relative speed-up of 40%. Relative speed-up is the ratio:
$$Speedup = \frac{T_{serial} - T_{parallel}}{T_{serial}} \times 100 \,\, [\%]$$
(4.22)
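Equation (4.22) can be checked directly against the cycle counts quoted for Figure 4.16; the helper name below is illustrative:

```c
#include <math.h>

/* Relative speed-up of Eq. (4.22), in percent. */
double relative_speedup(double t_serial, double t_parallel)
{
    return (t_serial - t_parallel) / t_serial * 100.0;
}
```

For the five iterations of Figure 4.16(a), relative_speedup(25, 15) evaluates to 40%.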

The execution-time reduction increases slightly as a function of L.

The execution time of the loop iterations can be further reduced by allocating more hardware circuits to increase the number of parallel operation executions. Figure 4.16(b) illustrates this case. The coprocessor includes two additional circuits for computing the memory address, and one for multiplication. In addition, the architecture includes two memory circuits that can be accessed in parallel. The operations of the four iterations are executed in eight steps, as shown, and five iterations take only nine steps. This corresponds to a speed-up of 55%, as compared to the serial execution of the iterations, and a speed-up of 40%, as compared to the coprocessor used for the case in Figure 4.16(a).

Note that introducing one additional multiplier circuit to the coprocessor in Figure 4.16(a) does not improve the performance of the design. Also, for the coprocessor in Figure 4.16(b), the improvement would be only two execution steps for the execution of the four iterations, a speed-up of $$\approx 25\%$$, and none for the execution of the five iterations. Therefore, adding more hardware resources does not necessarily improve execution time.

## 4.3 Programmable Digital Blocks

In this section, PSoC hardware is customized for an application by utilizing the programmable digital blocks and PSoC’s dedicated circuits.

PSoC has 16 programmable digital blocks, as shown in Figure 4.17. The blocks are organized as a two-dimensional array. Figure 4.18 describes the structure of a digital PSoC block. Each programmable block has three inputs: clock, data, and auxiliary data (Aux Data). It has four outputs: primary output (PO), auxiliary output (AO), block interrupt (BI), and broadcast output (BO).

The operation of a block is determined by programming seven registers: CR0 and FN for storing the configuration and function state of a block, IN and OU for selecting the input and output signals, and DR0, DR1, and DR2 for storing the data. In addition, the INT register defines the block's interrupt mask. Each digital PSoC block operates independently, and has a unique interrupt vector that points to a specific interrupt service routine. The physical addresses of the seven registers are shown in Tables 4.2–4.7 for the 16 blocks.
Table 4.2: Physical addresses of the DR0, DR1, and DR2 registers for rows 0 and 1 [1].

| Register | DBB00 | DBB01 | DCB02 | DCB03 | DBB10 | DBB11 | DCB12 | DCB13 |
|----------|-------|-------|-------|-------|-------|-------|-------|-------|
| DR0 | 0,20H | 0,24H | 0,28H | 0,2CH | 0,30H | 0,34H | 0,38H | 0,3CH |
| DR1 | 0,21H | 0,25H | 0,29H | 0,2DH | 0,31H | 0,35H | 0,39H | 0,3DH |
| DR2 | 0,22H | 0,26H | 0,2AH | 0,2EH | 0,32H | 0,36H | 0,3AH | 0,3EH |

The PSoC blocks are programmed with respect to [1]: (1) the input and clock signals, (2) the output signals, (3) the basic functionality, and (4) the data widths of the blocks:
• The data signal is selected as one of 16 different input signals. Similarly, the clock signal is chosen from one of 16 alternative clock signals, and then resynchronized with either SYSCLK or SYSCLKX2. An auxiliary data selection circuit chooses one of four auxiliary data inputs, which is used primarily for the SPI slave function.

• Each digital PSoC block has a primary and a secondary data output. Both outputs can be routed to one of four different destinations. PSoC blocks have two additional outputs: the block interrupt outputs for generating interrupt signals, and the broadcast output.

• PSoC blocks can be configured to be a timer, counter, deadband circuit, or a cyclic redundancy check (CRC) circuit. PSoC communication blocks (Blocks DCB in Figure 4.17) can be either an SPI master or slave, or a UART.

• The data width of a circuit can be expanded by chaining several digital PSoC blocks together. This chaining is achieved by using the primary output signals.

Table 4.3: Physical addresses of the DR0, DR1, and DR2 registers for rows 2 and 3 [1].

| Register | DBB20 | DBB21 | DCB22 | DCB23 | DBB30 | DBB31 | DCB32 | DCB33 |
|----------|-------|-------|-------|-------|-------|-------|-------|-------|
| DR0 | 0,40H | 0,44H | 0,48H | 0,4CH | 0,50H | 0,54H | 0,58H | 0,5CH |
| DR1 | 0,41H | 0,45H | 0,49H | 0,4DH | 0,51H | 0,55H | 0,59H | 0,5DH |
| DR2 | 0,42H | 0,46H | 0,4AH | 0,4EH | 0,52H | 0,56H | 0,5AH | 0,5EH |

Table 4.4: Physical addresses of the CR0 and FN registers for rows 0 and 1 [1].

| Register | DBB00 | DBB01 | DCB02 | DCB03 | DBB10 | DBB11 | DCB12 | DCB13 |
|----------|-------|-------|-------|-------|-------|-------|-------|-------|
| CR0 | 0,23H | 0,27H | 0,2BH | 0,2FH | 0,33H | 0,37H | 0,3BH | 0,3FH |
| FN | 1,20H | 1,24H | 1,28H | 1,2CH | 1,30H | 1,34H | 1,38H | 1,3CH |

Table 4.5: Physical addresses of the CR0 and FN registers for rows 2 and 3 [1].

| Register | DBB20 | DBB21 | DCB22 | DCB23 | DBB30 | DBB31 | DCB32 | DCB33 |
|----------|-------|-------|-------|-------|-------|-------|-------|-------|
| CR0 | 0,43H | 0,47H | 0,4BH | 0,4FH | 0,53H | 0,57H | 0,5BH | 0,5FH |
| FN | 1,40H | 1,44H | 1,48H | 1,4CH | 1,50H | 1,54H | 1,58H | 1,5CH |

Table 4.6: Physical addresses of the IN and OU registers for rows 0 and 1 [1].

| Register | DBB00 | DBB01 | DCB02 | DCB03 | DBB10 | DBB11 | DCB12 | DCB13 |
|----------|-------|-------|-------|-------|-------|-------|-------|-------|
| IN | 1,21H | 1,25H | 1,29H | 1,2DH | 1,31H | 1,35H | 1,39H | 1,3DH |
| OU | 1,22H | 1,26H | 1,2AH | 1,2EH | 1,32H | 1,36H | 1,3AH | 1,3EH |

As shown in Figures 4.18–4.20, a PSoC digital block's clock signal is selected as one of 16 possible clock signals using the Clock input field (bits 3-0) of the IN register. The selected clock signal is then synchronized with one of the available system clocks, either SYSCLK or SYSCLKX2. The identity of the synchronization signal is determined by AUXCLK (bits 7-6) of the OU register. Similarly, the DATA signal to the block is selected as one of 16 possible input signals by the Data input field (bits 7-4) of the IN register.

OUTEN (bit 2) of the OU register enables or disables the primary output of the digital PSoC blocks, and AUXEN (bit 5) enables or disables the auxiliary output. Output select (bits 1-0) selects the output data row for the primary output, and AUX IO select (bits 4-3) defines the output data row for the auxiliary output, as shown in Figure 4.20.
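The bit fields of the IN and OU registers described above can be packed as in the following sketch; the helper functions are illustrative, not part of any PSoC library:

```c
#include <stdint.h>

/* Pack the IN register of a digital PSoC block: bits 3-0 select one
   of 16 clock sources, bits 7-4 one of 16 data inputs. */
uint8_t in_reg(uint8_t data_sel, uint8_t clock_sel)
{
    return (uint8_t)(((data_sel & 0x0F) << 4) | (clock_sel & 0x0F));
}

/* Pack the OU register: bits 7-6 AUXCLK (sync source), bit 5 AUXEN,
   bits 4-3 AUX IO select, bit 2 OUTEN, bits 1-0 Output select. */
uint8_t ou_reg(uint8_t auxclk, int auxen, uint8_t aux_io_sel,
               int outen, uint8_t out_sel)
{
    return (uint8_t)(((auxclk & 0x03) << 6) |
                     ((auxen ? 1 : 0) << 5) |
                     ((aux_io_sel & 0x03) << 3) |
                     ((outen ? 1 : 0) << 2) |
                     (out_sel & 0x03));
}
```

For example, selecting data input 5 and clock source 3 yields the IN value 53H.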

### 4.3.1 Timer Block

The main purpose of the timer block is to generate low-level timing and interrupt signals based on a selectable clock signal that is provided as an input [1]. This block produces programmable frequencies and pulse widths that can be used for timing.

Specifically, the functionality of the timer block can be grouped into three categories: (A) terminal count (countdown) functions, (B) compare functions, and (C) capture functions. In addition, the block includes routines to start and stop the timer. The start routine enables the timer circuit, and starts the main timer, compare, and capture functions. The stop routine disables the timer circuit; as a result, all outputs, including the interrupt output, are gated low. The internal state registers are reset, but the data registers used in counting, for example the timer value, period, and compare value, are not affected.

The implementation for each of the timer block functions relies on both hardware and software. The hardware circuits implement the functionality that actually generates the low-level signals. The software consists of firmware routines that set, and access, the data and control registers of the hardware.
Table 4.7: Physical addresses of the IN and OU registers for rows 2 and 3 [1].

| Register | DBB20 | DBB21 | DCB22 | DCB23 | DBB30 | DBB31 | DCB32 | DCB33 |
|----------|-------|-------|-------|-------|-------|-------|-------|-------|
| IN | 1,41H | 1,45H | 1,49H | 1,4DH | 1,51H | 1,55H | 1,59H | 1,5DH |
| OU | 1,42H | 1,46H | 1,4AH | 1,4EH | 1,52H | 1,56H | 1,5AH | 1,5EH |
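Tables 4.2–4.7 follow a regular pattern: each block occupies four consecutive addresses, in both register banks, starting at 20H + 10H × row + 4 × column. A small sketch of this pattern, derived from the tables (the helper name is hypothetical):

```c
#include <stdint.h>

/* Bank-0 register offsets within a block's four-address window. */
enum { REG_DR0 = 0, REG_DR1 = 1, REG_DR2 = 2, REG_CR0 = 3 };
/* Bank-1 offsets within the same window. */
enum { REG_FN = 0, REG_IN = 1, REG_OU = 2 };

/* Address of a register for digital block (row, col), with row 0-3
   and col 0-3, following the pattern of Tables 4.2-4.7: each block
   occupies four consecutive addresses starting at
   0x20 + 0x10*row + 4*col, in both register banks. */
uint8_t block_reg_addr(int row, int col, int reg_offset)
{
    return (uint8_t)(0x20 + 0x10 * row + 4 * col + reg_offset);
}
```

For example, the DR1 register of block DCB12 (row 1, column 2) is at address 0,39H, matching Table 4.2.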

A. Terminal count

Terminal count (TC) generates a programmable timing signal frequency based on the input clock signal. The related functionality includes (1) an operator that sets the period of the timer circuit, (2) the main timer function that decrements the content of the timer circuit at each clock cycle, and (3) a function that generates an interrupt signal upon reaching the value 00H. Figure 4.21 shows the inputs, outputs, and dataflow of the timer block.

The following are the three TC-related functions:
1. Write period function: This function sets the timer circuit's period value. The value is stored in a dedicated register, DR1 (Data Register 1). The contents of the register do not change during the countdown process, and thus the value can be used for repeatedly initializing the timer block with the same period value. In addition to the DR1 register, the period value is also loaded into the DR0 register (Data Register 0), which is used for the countdown function. Tables 4.2 and 4.3 provide the physical addresses of the DR0 and DR1 registers for the 16 programmable digital blocks of a PSoC chip.

2. Main timer function: This function produces a timing signal with frequency equal to the clock frequency divided by the period value loaded into the DR1 register. The pulse width is either one input clock period, or one half of the input clock period.

This functionality is implemented by the timer circuit decrementing the value of its DR0 register on each rising edge of the clock signal. Upon reaching the value 00H on the next positive edge of the clock, the Terminal Count (TC) output of the block is raised, the counter is reloaded with the period value stored in its DR1 register and the count down process resumes.

3. Terminal count interrupt function: If the block interrupt is enabled, then the positive edge of the TC output generates an interrupt signal. This happens when the DR0 register reaches the value 00H.
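The terminal count behavior can be summarized with a minimal behavioral model; this is a sketch of the state machine in Figure 4.22, not a register-accurate simulation, and the type and function names are hypothetical:

```c
#include <stdint.h>

/* Behavioral model of the terminal-count function: DR0 counts down on
   each clock edge; on reaching 00H the TC output pulses and DR0 is
   reloaded from the period register DR1. */
typedef struct {
    uint8_t dr0;     /* down counter    */
    uint8_t dr1;     /* period register */
    int     enabled;
} Timer;

void timer_write_period(Timer *t, uint8_t period)
{
    t->dr1 = period;
    t->dr0 = period;  /* the period value is also loaded into DR0 */
}

void timer_start(Timer *t) { t->enabled = 1; }
void timer_stop(Timer *t)  { t->enabled = 0; }

/* One rising clock edge; returns 1 when the TC output pulses. */
int timer_clock(Timer *t)
{
    if (!t->enabled || t->dr1 == 0)
        return 0;
    t->dr0--;
    if (t->dr0 == 0) {     /* reached 00H: raise TC, reload from DR1 */
        t->dr0 = t->dr1;
        return 1;
    }
    return 0;
}
```

With a period value of 4, the model produces one TC pulse every four clock cycles, i.e., the clock frequency divided by the period value, as stated for the main timer function.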

Hardware circuit. The hardware circuit implements the data transfers involving the DR0 and DR1 registers that are required by the three functions, as shown in Figure 4.21, and generates the TC output and the interrupt signal. The circuit operation is programmed by the control registers, CR0 and FN. Tables 4.4 and 4.5 show the physical addresses of the CR0 and FN registers of the 16 programmable digital blocks.

Figure 4.22 illustrates the PSoC implementation of the main timer function [1]. The timer remains in the initial state until it is enabled by setting enable (bit 0) of the control register, CR0, to 1.

If the timer circuit is enabled, when the rising edge of the next clock cycle occurs, the timer period stored in the DR1 register is copied into the DR0 register and is used for the countdown operation. Then, on the rising edge of the clock, the DR0 register is decremented, if its value is not 00H. Once the value in the DR0 register becomes 00H, the output, TC, of the block is set to 1. If interrupts are enabled, an interrupt signal is also produced. Bit 3 of the FN register enables/disables the interrupts on terminal count. Bit 2 of the CR0 register determines the pulse width of the TC output, and the interrupt signal to be one full, or one half, clock cycle. Then, the timer period is reloaded into the DR0 register from the DR1 register. The countdown process resumes on the rising edge of the next clock cycle. Figure 4.23 shows the timing diagram for the PSoC timer [1].
Software routines. Firmware routines set the control registers that start and stop the timer circuit, set the timer period, and enable/disable the interrupts. Figure 4.24 shows three short firmware routines related to the terminal count functions. The Timer_Start routine enables the timer circuit by setting bit 0 of the timer block's control register; Figure 4.22 shows how this bit is used by the rest of the timer functionality. After the enabling bit is set, the timer circuit starts counting down on the next clock cycle. The Timer_Stop routine disables the timer circuit by resetting bit 0 of CR0. The Timer_WritePeriod routine writes a 16-bit value into the DR1 register that stores the period value. The LSB of the new value is in the A register, and the MSB is in the X register.

B. Compare functionality

The compare functionality can generate an output signal of programmable frequency and pulse width, and an interrupt signal. The compare functions are: (1) writing the pulse width (compare) value into the DR2 register (Data Register 2), (2) reading the compare value from the DR2 register, (3) producing the output signal, and (4) enabling/disabling the interrupts:
1. Write compare value: This function loads the DR2 register with the 16-bit value used in the timer's compare function. The output signal's pulse width is equal to (DR1 – DR2), and is programmed by selecting the appropriate register values.

2. Read compare value: This function returns the 16-bit value stored in the DR2 register that is used in the compare function.

3. Compare function: This function compares the value of the decrementing DR0 register with that of the DR2 register, which stores the reference value. If the tested condition is true, then the auxiliary output of the timing circuit is set. The comparison operator can be programmed for either $$DR0 \leq DR2$$, or $$DR0 < DR2$$. If the value of the DR1 register is larger than the value of the DR2 register, the compare output is low for the first (DR1 – DR2) clock cycles. Then, for the next DR2 clock cycles, the output is high. Therefore, both the output period (DR1) and the pulse width (DR2) are programmable.

If a capture function call occurs, then it overrides the DR2 register value used for comparison. (For more details, cf. the section on the capture functionality.)

4. Enable/disable compare interrupts: These functions enable or disable the interrupt signals generated when the tested comparison condition is true. If enabled, the interrupt signal is generated on the rising edge of the compare output.

Tables 4.2 and 4.3 show the physical addresses of the DR2 registers of the 16 programmable digital blocks. The compare functionality is implemented using both hardware and firmware as in the case of the terminal count functionality.
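A behavioral sketch of the compare output over one countdown period, assuming DR1 > DR2 and the $$DR0 \leq DR2$$ relation (function names are illustrative):

```c
/* Compare output for a single DR0 value, with the relation selected
   as in bit 4 of the FN register: DR0 < DR2 or DR0 <= DR2. */
int compare_out(int dr0, int dr2, int less_than_only)
{
    return less_than_only ? (dr0 < dr2) : (dr0 <= dr2);
}

/* Count high cycles of the compare output over one full countdown
   of DR0 from DR1 down to 1, using the DR0 <= DR2 relation. */
int high_cycles(int dr1, int dr2)
{
    int high = 0;
    for (int dr0 = dr1; dr0 >= 1; dr0--)
        if (compare_out(dr0, dr2, 0))
            high++;
    return high;
}
```

With DR1 = 10 and DR2 = 4, the output is low for the first 6 cycles and high for the next 4, matching the (DR1 – DR2) / DR2 split described above.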

Hardware circuit. The hardware circuit generates CMP, the low-level compare output, and the interrupt signal, which are the basis for the related data transfer shown in Figure 4.21.

Figure 4.25 illustrates the PSoC implementation of the compare functionality [1]. The circuit stays in the initial state provided that the timer circuit is not enabled. After enabling the timer by setting bit 0 in the CR0 register to 1, the DR0 and DR2 registers are compared at each clock cycle. Bit 4 of the FN register selects the comparison relationship to be either $$DR0 \leq DR2$$ or $$DR0 < DR2$$.

If the tested comparison condition is true, then the auxiliary output CMP of the circuit is set and, if interrupts are enabled by setting bit 3 of the FN register, an interrupt signal is produced. The length of the auxiliary output and interrupt signals is either one, or one half, clock cycle, as selected by bit 2 of the CR0 register.

Software routines. The firmware routines provide the capability to write/read the compare value stored in the DR2 register, to enable/disable the compare related interrupts, and to start/stop the circuit. Note that the start/stop routines are the same as those for the terminal count functionality shown in Figure 4.24.

Figure 4.24 shows the related firmware routines. The Timer_WriteCompareValue routine loads the 16-bit value used for the comparison function into the DR2 register. The LSB is in the A register, and the MSB is in the X register. The Timer_ReadCompareValue routine returns the content of the DR2 register. The LSB is returned in the A register and the MSB in the X register.

C. Capture functionality

The capture function returns the value of the DR0 register. There are two kinds of capture functions: hardware capture and software capture. Hardware capture occurs on the rising edge of the data input. Software capture occurs if the CPU core attempts to read the contents of the decrementing DR0 register. In both cases, the contents of the DR0 register are copied to the DR2 register, and the value can be accessed from there. The resulting data transfer is shown in Figure 4.21.

Software routines. Figure 4.26 shows the firmware routine that implements the capture function. This routine, called Timer_ReadTimerSaveCV, returns the value of the DR0 register without affecting the other registers.

It first allocates three stack locations to store the counter value, that is, bytes DR0_MSB and DR0_LSB, and the flags register. The X register points to the bottommost of the three stack locations allocated. Then, the values of the configuration register, CR0, and compare register, bytes DR2_MSB and DR2_LSB, are stored on the stack, above the allocated locations. Next, two bytes of the counter register (bytes DR0_MSB and DR0_LSB) are saved in the stack entries $$X + 1$$ and $$X + 2$$, respectively. Flags are stored in the stack entry pointed to by the X register. After disabling the interrupts, the DR2 and CR0 registers are restored to their saved values. Finally, the A and X registers return the value of the counter register: byte DR0_MSB is returned in the A register and byte DR0_LSB is returned in the X register.

### 4.3.2 Counter Block

The functionality of the counter block is very similar to that of the timer block. As in the case of the timer block, a counter has one clock input, one data input, one primary output, and one auxiliary output. The main functions are also similar. However, the counter circuits do not implement the capture functionality of the timer block.

The counter block uses three data registers: the DR0 register for counting, the DR1 register for storing the period value, and the DR2 register for storing the compare value. The period value is loaded into the DR1 register only if the counter circuit is disabled. The value is also automatically copied into the DR0 register.

Figure 4.27 shows the PSoC implementation of the countdown function of the block [1]. The main difference, with respect to the terminal count functionality of the timer block in Figure 4.22, is that the content of the DR0 register is decremented only if the Data input is high. If the Data input is low, then decrementing stops, the two output signals are set low, and the counter circuit's three registers, DR0, DR1, and DR2, retain their values.

Upon decrementing the value of the DR0 register to 00H, the counter circuit sets the auxiliary output to high, and issues an interrupt signal if interrupts are enabled. Similar to the timer block, interrupts are enabled by setting bit 3 of the FN register to 1. The auxiliary output and interrupt signals are always applied for one clock cycle.

### 4.3.3 Deadband Block

Deadband circuits generate two nonoverlapping clocks, called PHI1 and PHI2, that are separated by an interval called the deadband. Deadband circuits are implemented using digital PSoC blocks, for which the configuration and function registers are as shown in Figure 4.28. The deadband generator has three inputs: the clock signal, the reference signal (the DATA input of the digital block), and the KILL signal (the Aux DATA input of the block). The primary output of the circuit generates clock PHI1, and the auxiliary output generates clock PHI2. The reference input can be produced by a PWM, or by toggling a bit using the bit-bang interface [1]; the bit is the primary output of the previous digital block, or the bit-bang clock register.
Figure 4.29 shows the operation of the enabled deadband circuit. After the reference signal is asserted, the next rising edge of the clock cycle causes the count register of the PSoC block to be loaded with the period value, P. Then the count value is decremented on each clock cycle, and after reaching value zero, clock PHI1 becomes high. Clock PHI1 is lowered to zero on the falling edge of the reference input. Similarly, this behavior is repeated for clock PHI2 generated at the auxiliary output. On the rising edge of the reference signal, value P is loaded into the count register, and then decremented each clock cycle. Upon reaching value zero, clock PHI2 is set high until the time of the falling edge of the reference signal.
Deadband and ΔT are defined as:
$$Deadband = P + 1, \qquad \Delta T = T_{low} + P + 1$$
(4.23)

These values are determined by the characteristics of the reference signal and period value P.
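As a direct transcription of Equation (4.23), with illustrative helper names:

```c
/* Eq. (4.23): deadband length and phase offset, in clock cycles, as
   functions of the period value P and the low time of the reference
   signal. Helper names are illustrative. */
int deadband_cycles(int p)
{
    return p + 1;
}

int delta_t_cycles(int t_low, int p)
{
    return t_low + p + 1;
}
```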

If the circuit is disabled then all outputs are low, the DR0, DR1, and DR2 registers save their state, and the internal state of the circuit is set to a reset state. The asynchronous signal, KILL, applied to the data input of the block disables both outputs immediately.

## 4.4 Customized PSoC Digital Blocks

PSoC has three types of hardware blocks with customized functionality: (1) pulse width modulation, (2) multiply-accumulate, and (3) decimator circuits.

### 4.4.1 Pulse Width Modulator Blocks

Pulse width modulators (PWM) generate digital signals of programmable period, P, and pulse width, W, as defined in Figure 4.30(a). The duty cycle of the PWM is given by:
$$Duty\ cycle = \frac{Pulse\ width}{Period} = \frac{W}{P+1}$$
(4.25)
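The timing relationships of the PWM can be expressed directly; the helper names below are illustrative, not part of the PSoC firmware API:

```c
/* PWM timing: with pulse-width value W and period value P, the output
   pulse lasts W clock periods, the signal period is (P + 1) clock
   periods, and the duty cycle is W / (P + 1), per Eq. (4.25). */
double pwm_pulse_width(double w, double t_clock)
{
    return w * t_clock;
}

double pwm_period(double p, double t_clock)
{
    return (p + 1.0) * t_clock;
}

double pwm_duty_cycle(double w, double p)
{
    return w / (p + 1.0);
}
```

For example, W = 25 and P = 99 give a 25% duty cycle, independent of the clock frequency.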
The following routines are the firmware-level functions of the PWM block:
• PWM_Start : This function enables the PWM.

• PWM_Stop : This function disables the PWM. As a result, the circuit outputs are lowered to 0.
• PWM_WritePulseWidth : This function sets the pulse width of the output signal. If the loaded value is W then the pulse width is $$W \times T_{clock}$$. $$T_{clock}$$ is the period of the clock signal to the PWM.

• PWM_WritePeriod : This function sets the period of the output signal. If the loaded value is P, then the period of the signal is actually $$(P + 1) \times T_{clock}$$, where $$T_{clock}$$ is the period of the PWM clock signal.

• PWM_bReadCounter : This function returns the current value of the counter register.

• PWM_bReadPulseWidth : This function returns the value of the pulse width.

• PWM_EnableInterrupts: This function enables interrupts when the terminal count reaches zero.

• PWM_DisableInterrupts: This function disables interrupts when the terminal count reaches zero.

The PWM is based on digital PSoC blocks programmed to operate as counters [1]. One digital PSoC block is used for an 8-bit PWM, and two chained, digital, PSoC blocks are required for a 16-bit PWM. Figure 4.30(b) shows the configuration of the FN and CR0 control registers:
• Bits 2-0 in the FN register configure the digital PSoC block as a counter.

• Bit 3 determines the type of interrupt, that is, interrupt on a terminal count, if the bit is set to 0, and interrupt on a true compare value, if the bit is set to 1.

• Bit 4 sets the compare function, which is either ≤, if the bit is set to 0, or <, if the bit is set to 1.

• Bit 5 set to 1 configures the digital PSoC block as standalone, because it is an 8-bit PWM.

• Bit 6 enables the primary output of the PSoC block to drive the broadcast net, if the bit value is 1, or disables the driving, if the bit value is 0.

• Bit 7 is set to 0, if the data input is noninverted, and if set to 1, the data input is inverted.

• Bit 0 in the CR0 register is set to 1, if the PWM is enabled, and to 0, if the PWM is disabled.

In addition to the FN and CR0 registers, the DR0, DR1, IN, OU, and DR2 registers are also used in the implementation of the PWM. The DR0 register is the down counter. The DR1 register stores the period value, and the DR2 register stores the pulse width of the output signal.

Similar to the counter circuit, bits 7-4 in the IN register select one of the 16 possible data inputs, and bits 3-0 select one of the 16 possible clock signals. The OU register follows the general bit structure of the other programmable blocks.

Figure 4.31 shows the implementation of the firmware functions for a digital PSoC block, together with the physical addresses of the eight PSoC block registers for configuration, function, data, input and output selection, and interrupt mask. Routine PWM_Start starts the operation of the PWM by enabling bit 0 of the CR0 configuration register. Similarly, routine PWM_Stop stops the PWM by disabling bit 0 of the CR0 register. The PWM_WritePulseWidth routine loads the pulse width, expressed as a multiple of the selected clock cycle, into the DR2 register, which is used as a compare register for the PWM. The routine's parameter is passed through the A register. The PWM_WritePeriod routine sets the period of the PWM by loading the DR1 register with the period value. The parameter of this routine is also passed through the A register.
Figure 4.32 shows the firmware routines for reading the pulse width and the current value of the counter register stored in the DR0 register. Figure 4.33 shows the firmware routines for enabling and disabling interrupts. Figure 4.34 presents the API method for setting the duty cycle of the PWM.
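The start and stop routines reduce to setting and clearing bit 0 of the CR0 register. A behavioral C sketch of this logic, with an ordinary variable standing in for the memory-mapped register so it can run on a host (in firmware, cr0 would be a volatile pointer to the block's CR0 address):

```c
/* Behavioral model of PWM_Start/PWM_Stop: bit 0 of CR0 enables the block. */
static unsigned char cr0;

void pwm_start(void) { cr0 |= 0x01; }                 /* enable: set bit 0    */
void pwm_stop(void)  { cr0 &= (unsigned char)~0x01; } /* disable: clear bit 0 */

int pwm_is_enabled(void) { return cr0 & 0x01; }
```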

### Example (Software implementation of a PWM).

Figure 4.35 shows the main elements of the software implementation of a PWM. The circuit is used to turn an LED on/off based on the values in the PRT2DR register. Initially, the LED is turned on by loading a 1 in the PRT2DR register. Then, the value of the PWM pulse width, i.e., 0xFF in this example, is loaded into memory at address 4.

The LED remains on for a time interval equal to the execution time of the delay_loop1 loop. The pulse width of the PWM emulation is given by:
$$\mathit{Pulse\ width} = [(7 + 5) \times (R_5 + 1) + (7 + 5 + 8)] \times (R_4 + 1)\ \text{clock cycles}$$
(4.26)
The same equation should be used to evaluate the length of time that the LED is off. $$R_4$$ and $$R_5$$ represent the values loaded at memory addresses 4 and 5, respectively. The execution time is 7 clock cycles for a decrement instruction, 5 cycles for a jump instruction, and 8 clock cycles for a move instruction.
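Equation (4.26) can be checked numerically with a small sketch (r4 and r5 are the values stored at memory addresses 4 and 5):

```c
/* Pulse width of the software PWM, in clock cycles, per Eq. (4.26):
   the inner delay loop costs 7 + 5 cycles per iteration, plus
   7 + 5 + 8 cycles of outer-loop overhead, repeated (r4 + 1) times. */
unsigned long sw_pwm_pulse_width(unsigned r4, unsigned r5) {
    return ((7UL + 5) * (r5 + 1) + (7 + 5 + 8)) * (r4 + 1);
}
```

With both values zero the expression gives the minimum pulse width of 32 cycles; incrementing r5 changes the width in steps of 12 cycles, and incrementing r4 in steps of 32 cycles, matching the tuning granularity discussed in the text.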

If memory entries at addresses 4 and 5 are initialized to zero, the minimum pulse width is 32 clock cycles. Similarly, the minimum PWM period is 64 clock cycles. This means that the code cannot be used, if shorter pulse widths, or faster PWM signals, are needed. Also, the values of pulse width and period are controlled by two variables. The variable stored at location 5 provides a finer tuning, whereas the variable at location 4 gives a coarser adjustment. Fine tuning is in steps of 12 clock cycles, and coarse tuning is in steps of 32 cycles. The code has to be modified, if finer adjustment steps are needed for PWM pulse width, or period.

### 4.4.2 Multiply ACcumulate

Multiply ACcumulate (MAC) circuits provide two distinct functions: fast multiplication of two operands, and multiplication followed by summing, that is, accumulating, the resulting products.

Figure 4.36 shows the structure of a MAC that is connected to the system bus. Its operation does not require an additional clock or enable signal. For fast multiplication, the 8-bit multiplicands are loaded into the MUL_X and MUL_Y registers, and the 16-bit product is available in the register pair MUL_DH and MUL_DL. The MUL_DH register stores the more significant byte, and the MUL_DL register the less significant byte. The two multiplicands are signed integers, and the product is also a signed integer.

For the multiply accumulate function, the two multiplicands are loaded into registers MAC_X and MAC_Y. The product is summed with the previous results, and stored as a 32-bit, signed number in the ACC_DR3, ACC_DR2, ACC_DR1, and ACC_DR0 registers. These registers are reset by writing into the MAC_CL1 or MAC_CL0 register. The type of MAC function, that is either multiplication, or multiply and accumulate, is selected by loading the multiplicands into the respective register pair.
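The two MAC functions can be modeled behaviorally as follows. This is a host-side sketch of the arithmetic, not the device registers; the mac_t structure stands in for the ACC_DR3..ACC_DR0 register chain:

```c
#include <stdint.h>

/* 32-bit signed accumulator, modeling the ACC_DR3..ACC_DR0 registers. */
typedef struct { int32_t acc; } mac_t;

/* Writing MAC_CL1 or MAC_CL0 resets the accumulator registers. */
void mac_clear(mac_t *m) { m->acc = 0; }

/* Fast multiply path (MUL_X, MUL_Y -> MUL_DH:MUL_DL): signed 8x8 -> 16. */
int16_t mac_multiply(int8_t x, int8_t y) {
    return (int16_t)((int16_t)x * (int16_t)y);
}

/* Multiply-accumulate path (MAC_X, MAC_Y): each product is summed into acc. */
void mac_accumulate(mac_t *m, int8_t x, int8_t y) {
    m->acc += (int32_t)x * (int32_t)y;
}
```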

### Example (Scalar product of two vectors).

This example illustrates the computational speed benefits offered by the use of MAC, as compared to software routines, by calculating the scalar product of two vectors, a and b. The two vectors can have dimensions of 16, 64, or 256. This example compares three implementations of the scalar product algorithm using C language, assembly code, and assembly code using the MAC for multiplication and summing the partial results.
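For reference, the computation that all three implementations perform is the plain dot product below. This is a reconstruction in the spirit of Figure 4.37, not the figure's exact code:

```c
#include <stdint.h>

/* Scalar (dot) product of two vectors of 8-bit elements. */
int32_t scalar_product(const int8_t a[], const int8_t b[], int n) {
    int32_t sp = 0;
    for (int i = 0; i < n; i++)
        sp += (int32_t)a[i] * b[i];  /* accumulate partial product a[i]*b[i] */
    return sp;
}
```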

A. C program for the scalar product of two vectors

Figure 4.37 shows the C code for the scalar product algorithm. The code produced by the C compiler resides in flash memory. This includes the program instructions, and the constant values, required for initializing vectors a and b.

Figure 4.38 highlights the main steps of the code. The data stack space is first allocated for all local variables. Each stack entry is one byte. Variable i is stored at the bottom of the stack; two stack entries, i.e., two bytes, are allocated for it, because variable i is of type int. Vector a is stored next, and occupies sixteen stack entries (i.e., one entry for each vector element). Vector b is stored on top of vector a, and also has sixteen allocated entries. Finally, variable sp is allocated on top of the stack. Figure 4.38(a) shows the stack allocation of the local data variables.

Next, the code for the compiled program initializes the stack entries for vectors a and b with the constant values stored in flash. Figure 4.38(b) shows the code fragment for initializing vector a. Similar code initializes vector b. The first instruction, at address 0318, sets the current page pointer CUR_PP, stored in register 208, to point to SRAM page zero. Then, the virtual register __r1, also located in SRAM page 0, at address 03, is loaded with the value 80; virtual register __r0, situated in SRAM page 0, at address 04, is loaded with the value 1; and virtual register __r3, located in SRAM page 0, at address 01, points to the first, that is, the bottommost, stack entry assigned to vector a. The instruction at address 0326 sets the MVW_PP register, stored in a register at address 213, to point to SRAM page 7. The MVW_PP register is used to efficiently move data from the accumulator into SRAM by using indirect addressing. Then, the contents of the X register, pointing to the bottommost stack entry storing local variables, are saved on the stack. The virtual register __rX is initialized with the value 0.

Data transfer from flash memory. At address 032D, the A register is loaded with the contents of the virtual register __r0, and the X register with the contents of the virtual register __r1. The A register is stored on the stack, because the following instruction, ROMX, alters its contents. Registers A and X determine the address in flash from which instruction ROMX, at address 0332, reads data: the A register contains the most significant byte, and the X register the least significant byte. Address 0x0150 is used for the first read operation. The read data are returned in the A and X registers. The move immediate instruction, MVI, at address 0336, stores the data in the A register into the page pointed to by the MVW_PP register (SRAM page 7), at the location given by register $$X + 2$$. Then, the contents of the A register are popped from the stack and the index of the next vector element is incremented. Virtual register __rX is also incremented, and execution jumps back to address 0x033, if there are elements in vector a that still have to be read. Finally, the POP X instruction, at address 0343, restores the contents of the X register to point to the bottommost stack element allocated for the local variables.

Scalar product of two vectors. Figure 4.39 shows the compiler-produced assembly code that computes the scalar product of the two vectors in this example. The code first initializes the stack entries holding the loop index i to zero; these are the entry pointed to by the X register and the entry on top of it. Next, the virtual registers are set to point to the SRAM page holding vectors a and b. Virtual register __r0 points to page 7, where vector b is stored. Virtual register __r1 is loaded to point to the start of vector b (register X + 18), and is then updated to point to the current element b[i]. This element occupies the entry i - 1 positions above the stack entry holding b[0]. The current SRAM page number for the MVI instruction is computed and stored in the MVR_PP register, located in register 212. For this example, the MVR_PP register points to SRAM page 7. The value of element b[i] is loaded into the A register, and then stored in the virtual register __r0.

A similar set of steps is executed to access vector element a[i]. Virtual register __r2 points to the SRAM page number, and virtual register __r3 to the element a[i]. After storing the current SRAM page in the MVR_PP register (register 212), element a[i] is moved to the A register, and the virtual register __r2. At this point, both operands are available. Element a[i] is stored in register __r2 and element b[i] in register __r0. Next, the value of the X register is stored on the stack, because the multiplication routine __mul8 affects the register value. Then, value b[i] is stored in the A register, and the a[i] value in the X register. The two multiplicands for the multiplication routine __mul8 are passed by using the values in the A and X registers.

The partial product is stored in the A register and added to the location at address X + 34, which stores the scalar product. The X register is restored to the saved value. Index i is incremented, and the code iterates until all pairs of vector elements a[i] and b[i] have been multiplied together. The final scalar product is stored at entry $$X + 34$$ in the data stack.

B. Assembly code for scalar product without using MAC

Figure 4.40 shows the assembly code for the scalar product of two vectors without using the hardware MAC. The code is much shorter and faster than the code produced by the C compiler. The values for vectors a and b are stored directly in SRAM, from address 5 to address 20 for vector a, and from address 21 to address 36 for vector b. Index i is stored at location 3, and the scalar product is accumulated at address 37. The X register is set to point to the first element in vector a, and the A register is used to compute the product $$a[i] \times b[i]$$. If the value of element a[i] is zero, the multiplication algorithm ends, and execution jumps to the section labeled accu. The product value stored in the A register is added to the memory entry storing the scalar product (address 37). Then, the X register is incremented to point to the next element in vector a, and index i is incremented. If there are still elements to be multiplied, the algorithm jumps to address _mul8 and continues the computation.

The code for multiplying two unsigned numbers starts at address _mul8. Note that in this algorithm, the values of vectors a and b are destroyed during multiplication. The end result of the scalar product is stored at address 37.
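The multiplication at _mul8 is a standard shift-and-add loop. A C sketch of the same idea (like the assembly routine, it finishes early once the remaining multiplier bits are all zero):

```c
#include <stdint.h>

/* Shift-and-add unsigned 8x8 multiply, mirroring the _mul8 routine:
   examine the multiplier bit by bit, adding the (shifted) multiplicand
   whenever the current bit is 1. Ends as soon as 'a' becomes zero. */
uint16_t mul8(uint8_t a, uint8_t b) {
    uint16_t product = 0;
    uint16_t addend = b;   /* widened so the left shifts do not lose bits */
    while (a != 0) {
        if (a & 1)
            product += addend;
        a >>= 1;
        addend <<= 1;
    }
    return product;
}
```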

C. Assembly code for scalar product using MAC

Figure 4.41 shows the assembly code using MAC for the scalar product algorithm. First, the MAC is reset, including the ACC_DR3, ACC_DR2, ACC_DR1, and ACC_DR0 registers. This is accomplished by writing to the MAC_CL0 register, located at address EEH. The partial products $$a[i] \times b[i]$$ are computed by writing operand a[i] into the MAC_X register, at address ECH, and operand b[i] into the MAC_Y register, located at address EDH. The instruction MOV reg[EDH], 0 is introduced after each partial product to allow the MAC to finish its computation. After all partial products have been computed, the value of the scalar product is stored in the four registers: ACC_DR3 located at address EEH, ACC_DR2 located at address EFH, ACC_DR1 located at address ECH, and ACC_DR0 located at address EDH.
Table 4.8: Execution time in clock cycles for different implementation styles

| Vector size | C code without MAC | C code with MAC | Assembly code without MAC | Assembly code with MAC |
|---|---|---|---|---|
| 16 | 8958 | 6043 | 2861 | 390 |
| 64 | 45177 | 23659 | 11932 | 1580 |
| 256 | n/a | n/a | 52268 | 6188 |

D. Execution times

Table 4.8 summarizes the execution times for each of the implementations discussed. In addition to the listed execution times, 1494 clock cycles are needed for initializing the embedded system. Vector sizes of 256 could not be handled by the version of the C compiler used for this example. Note that for vectors of length 16, the code produced by the C compiler required 2915 fewer clock cycles with MAC than without it. This represents a 32% saving in the execution time of the algorithm. For vectors of length 64, the execution time when using MAC is 21,518 clock cycles less than without MAC, a 48% improvement in execution time. For vectors of length 16, the execution time for the algorithm written in assembly code is 6097 clock cycles shorter, a 68% saving in time, as compared to the C code without MAC. The time saving increases to 33,245 clock cycles for vectors of length 64, which represents a 73% improvement. Finally, for vectors of length 16, the algorithm written in assembly code using MAC requires 8568 fewer clock cycles than the C algorithm without MAC. Similarly, for vectors of length 64, it needs 43,597 fewer clock cycles. These savings in time represent 95% and 96%, respectively, of the execution time of the C program.

### 4.4.3 Decimator Blocks

Decimator blocks are used to remove the frequency components above a given frequency $$f_B$$, and then to reduce the data rate of the signal by a factor M, called the downsampling factor. Figure 4.42 presents the dataflow of the decimation operation. Incremental analog-to-digital converters and oversampled ADCs are two popular applications of decimator blocks.

The transfer function for decimator circuits is defined as
$$H(z) = \bigg(\frac{1}{M}\bigg)^2 (1 - z^{-M})^2 \bigg(\frac{1}{1-z^{-1}}\bigg)^2$$
(4.27)

PSoC provides two types of decimator blocks: type 1, and type 2. Type 1 circuits implement only the integration part of the transfer function H(z), whereas the differentiation part is computed in software. Type 2 provides a full hardware implementation of the transfer function.

The resolution of the analog-to-digital converters that use PSoC's decimator circuits can be estimated using the following two relationships: for single integration [1]:
$$\mathit{Resolution\ (bits)} = 1.5 \times (\log_2 M - 1)$$
(4.28)
and for double integration [1]:
$$\mathit{Resolution\ (bits)} = 2 \times (\log_2 M - 1)$$
(4.29)
Figure 4.43 shows the ADC resolution, in terms of the number of bits, as a function of the decimation factor M. The two plots correspond to single and double integration during decimation. For single integration, the decimation factor $$M = 32$$ provides 6-bit resolution, $$M = 64$$ provides 7.5-bit resolution, and $$M = 256$$ gives 10.5-bit resolution. For double integration, the factor $$M = 32$$ provides 8-bit resolution, $$M = 64$$ provides 10-bit resolution, and $$M = 256$$ gives 14-bit resolution.

Type 1 Decimators

This decimator circuit can be programmed to perform a single or a double integration operation [1]. The structure of the circuit is shown in Figure 4.44. The input data are applied to the one-bit input, DATA. If the input bit is 0, then the value -1 is added to the accumulated value; otherwise, the value 1 is added. The accumulated value is 16 bits long and is stored in the A0 register for the first integration operation, i.e., the sum DATA + A0. The 16-bit A1 register stores the accumulated value for the second integration operation (the sum A0 + A1). The output register contains the sum of the two integration operations.
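The integration path of Figure 4.44 can be modeled per input bit. A behavioral sketch, with a0 and a1 standing in for the two 16-bit accumulator registers:

```c
#include <stdint.h>

/* Type 1 decimator integration: each input bit contributes +1 or -1 to A0
   (first integration), and A0 is then summed into A1 (second integration). */
typedef struct { int16_t a0, a1; } dec1_t;

void dec1_step(dec1_t *d, int bit) {
    d->a0 += bit ? 1 : -1;   /* input bit interpreted as +1 / -1 */
    d->a1 += d->a0;          /* second integration stage         */
}
```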

Type 2 Decimators

Type 2 decimator circuits offer a full hardware implementation of the integration and differentiation steps in a decimator. The circuit structure is shown in Figure 4.45.

The type 1 and type 2 decimators are programmed by using the DEC_CR0, DEC_CR1, and DEC_CR2 control registers. Note that PSoC's decimator circuits are designed for analog-to-digital conversion (ADC) only, and the control registers' bit structures reflect this fact.

Register DEC_CR0: This register has the physical address 0, E6H.
The bit structure of the register is as follows [1]:
• IGEN (bits 7-4) selects the analog comparator column that is gated for the incremental ADC operation. If the value is 0001, then the analog column 0 is selected, if 0010, then the analog column 1 is used, if 0100, then the analog column 2 is chosen, and if 1000, then the column 3 is selected.

• ICLKS0 (bit 3) is used in conjunction with ICLKS (bits 5-3) of the DEC_CR1 register to determine the digital block that generates the gating signal for incremental ADC.

The following digital blocks are selected for different values of the bits ICLKS: (1) Block 02 for 0000, (2) Block 12 for 0001, (3) Block 01 for 0010, (4) Block 11 for 0011, (5) Block 00 for 0100, (6) Block 10 for 0101, (7) Block 03 for 0110, (8) Block 13 for 0111, (9) Block 22 for 1000, (10) Block 32 for 1001, (11) Block 21 for 1010, (12) Block 31 for 1011, (13) Block 20 for 1100, (14) Block 30 for 1101, (15) Block 23 for 1110, and (16) Block 33 for 1111.

• DCOL (bits 2-1) selects the input to the decimator circuit from one of PSoC’s four analog comparator columns. Column 0 is selected by the value 00, column 1 by 01, column 2 by 10, and column 3 by 11.

• DCLKS0 (bit 0) is used in conjunction with the DCLKS (bits 2-0) of the DEC_CR1 register to select the digital block that generates the DCLK signal for the decimator registers, that is, the DEC, DIFF 0, and DIFF 1 registers in Figure 4.45.

The following digital blocks are selected for different values of the bits DCLKS: (1) Block 02 for 0000, (2) Block 12 for 0001, (3) Block 01 for 0010, (4) Block 11 for 0011, (5) Block 00 for 0100, (6) Block 10 for 0101, (7) Block 03 for 0110, (8) Block 13 for 0111, (9) Block 22 for 1000, (10) Block 32 for 1001, (11) Block 21 for 1010, (12) Block 31 for 1011, (13) Block 20 for 1100, (14) Block 30 for 1101, (15) Block 23 for 1110, and (16) Block 33 for 1111.

Register DEC_CR1: The physical address of this register is 0, E7H.

The bit structure of the register is as follows [1]:
• ECNT (bit 7) set to the value 0 disables the decimator as a counter for incremental ADC, and instead configures the circuit to operate together with a ΔΣ ADC. If the bit is 1, then the decimator is enabled for incremental ADC. The bit is available only for decimators of type 1.

• IDEC (bit 6) selects the digital block latch control between noninverted, if the bit is set to 0, and inverted, if the bit is set to 1.

• ICLKS (bits 5-3) are used in conjunction with the ICLKS0 (bit 3) of the DEC_CR0 register to determine the digital block that generates the gating signal for the incremental ADC operation.

• DCLKS (bits 2-0) are used in conjunction with DCLKS0 (bit 0) of the DEC_CR0 register to select the digital block that generates the signal DCLK for the decimator registers.

Register DEC_CR2: The physical address of this register is 1, E7H.

The bit structure of this register is as follows [1]:
• Mode (bits 7-6) programs the operation of the type 2 decimator circuits: the decimator functions as a type 1 decimator, if the bits are 00, is used for incremental ADC, if the bits are 01, and functions as a type 2 decimator, if the bits are 10.

• Data out shift (bits 5-4) programs the shifting of the output bits: No shifting of bits occurs for the value 00, shifting of all bits to the right by one position for the value 01, shifting of all bits to the right by two positions for the value 10, and shifting of all bits to the right by four positions for the value 11.

• Data format (bit 3) defines the type of input data: if the bit is 0, then an input bit 0 is interpreted by the decimator as the value -1, and the input bit 1 as the value 1. If the control bit is programmed to 1 then the input bit 0 has the value 0, and for the input bit 1, the value is 1.

• Decimation rate (bits 2-0) programs the decimation rate of the circuit as: (1) the circuit is off for the value 000, (2) the rate is 32 for the value 001, (3) the rate is 50 for the value 010, (4) the rate is 64 for the value 011, (5) the rate is 125 for the value 100, (6) the rate is 128 for the value 101, (7) the rate is 250 for the value 110, and (8) the rate is 256 for the value 111.
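The rate encoding above is a simple lookup. A sketch of the decode (the value 0 means the circuit is off):

```c
/* Decimation-rate decode for DEC_CR2 bits 2-0, per the list above. */
static const unsigned short dec_rate[8] = {
    0,    /* 000: circuit off */
    32,   /* 001 */
    50,   /* 010 */
    64,   /* 011 */
    125,  /* 100 */
    128,  /* 101 */
    250,  /* 110 */
    256   /* 111 */
};
```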

The 16-bit output register of the decimator circuit is implemented using two 8-bit registers [1]:
• Register DEC_DH: The physical address of this register is 0, E4H. The register contains the more significant byte of the output register. If the register is written to, the contents of the accumulator registers are cleared.

• Register DEC_DL: The physical address of this register is 0, E5H. This register stores the least significant byte of the output register. As in the previous case, writing to this register clears the accumulator registers.
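Reading the 16-bit result then amounts to combining the two bytes. A sketch; the helper name is illustrative, not a PSoC API function:

```c
#include <stdint.h>

/* Assemble the decimator output from DEC_DH (high byte) and DEC_DL (low byte). */
int16_t dec_read_output(uint8_t dh, uint8_t dl) {
    return (int16_t)(((uint16_t)dh << 8) | dl);
}
```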

## 4.5 Conclusions

This chapter has focused on design methods for optimizing system performance by customization of the architecture for the application. The concepts are illustrated by referring to PSoC’s programmable and customized digital blocks.

Performance-driven customization involves developing optimized modules for the performance-critical modules of an application, including customized hardware circuits and software routines. A module is performance-critical, if it has a significant effect on the global performance of the implementation. For example, with respect to the total system execution time, a module is performance-critical if it involves a large amount of processing and/or data communications with other subroutines. For developing optimized modules, architecture customization explores different performance–cost tradeoffs for the critical modules.

Any design methodology for architecture customization includes four general steps: (i) finding the performance-critical parts of an application, (ii) selecting the critical parts that are subjected to implementation in hardware, (iii) determining the nature and structure of the optimized hardware circuits, and (iv) implementing the overall system design. The design methodologies differ depending on the type of the targeted architecture: architectures with one general-purpose processor and dedicated coprocessors, with one general-purpose processor and shared coprocessors, and with multiple general-purpose processors plus shared coprocessor.

A design methodology for reducing the execution time of algorithms executed on an architecture with one general-purpose processor and coprocessors that are shared by multiple modules of the implementation has also been presented. This architecture is popular for many embedded applications. The analyzed application is inspired by digital signal processing, and was selected because of the following characteristics: it performs a large number of computations, the number of loop iterations is constant and known, and loop iterations can be executed in parallel, since there are few data dependencies between the iterations.

The discussed design methodology consists of the following: specification, profiling, identification of the performance-critical blocks, functional partitioning, hardware–software partitioning, hardware resource allocation, mapping of the performance-critical modules to hardware, and scheduling of the modules that share the same hardware resource.

The second part of the chapter treated the support offered by the PSoC architecture for customization, and explained several design examples that illustrate the design methodology for performance-driven customization. The support provided by the PSoC architecture includes programmable digital blocks and blocks with dedicated functionality.

Programmable digital blocks can implement the following functions: timer, counter, deadband, and CRC. The operation of a block is defined by programming seven registers: registers IN and OU define the programmable input, output, and clock, registers FN and CR0 set the functionality, and registers DR2, DR1, and DR0 store the data involved in the circuit operation. In addition, register INT controls the interrupt generation by a programmable block. This chapter defined the functionality of the blocks, and described the firmware and API routines for operating the programmable blocks. (Refer to [1] for details on CRC blocks.)

The dedicated PSoC functionality and associated programming for the pulse width modulator (PWM), multiply accumulate (MAC), and decimator were also treated in detail in this chapter. (Refer to [1] for details on sleep and watchdog timers.)

A comprehensive discussion was presented of two examples of performance-driven architecture customization: the software implementation of a PWM, and the speed-optimized implementation of an algorithm for computing the scalar product of two vectors. Two additional design examples are presented elsewhere in the book.

## 4.6 Recommended Exercises

1. 1.

Using the PSoC programmable digital blocks, develop a technique for measuring the signal frequency. The goal of this exercise is to maximize the precision of the measurement.1

2. 2.
Develop an execution-time efficient implementation of the transfer function:
$${H(z) = 2 \times \frac{0.0488 + 0.0976 z^{-1} + 0.0488 z^{-2}}{1 - 0.9428 z^{-1} + 0.3333 z^{-2}}}$$

Estimate the maximum frequency of the signals that can be filtered with your implementation. Provide solutions for increasing the signal frequency that can be processed.

3. 3.
Develop a cost efficient implementation of the transfer function:
$${H(z) = 2 \times \frac{0.1464 + 0.1464 z^{-2}}{1 - 0.5858 z^{-1} + 0.4142 z^{-2}}}$$

Estimate the maximum frequency of the signals that can be filtered with your implementation. Provide solutions for increasing the signal frequency that can be processed. What is the corresponding cost increase?

4. 4.
Provide an execution time efficient implementation of the transfer function
$$H(z) = 2 \times \frac{0.1464 - 0.2929 z^{-1} + 0.1464 z^{-2}}{1 + 0.1716 z^{-2}}$$

Estimate the maximum frequency of the signals that can be filtered with your implementation. Provide solutions for increasing the signal frequency that can be processed.

5. 5.

Design a cost-efficient implementation for a filter bank that includes the three transfer functions in exercises 2, 3 and 4. Minimize the execution time of the implementation. The solution should exploit all opportunities for hardware sharing among the implementations for the three transfer circuits.

6. 6.

Develop a time-efficient design for a bitstream sequence recognizer. The input data are received serially at an input port. The application must recognize if a predefined sequence of length 8 appeared in the input. Estimate the execution time of your design, and predict the maximum input bit rate that can be handled by your implementation without bit loss.

Extend the design for sequences of lengths 16 bits and then 32 bits.

7. 7.

Propose a design for a FIFO (First-In-First-Out) buffer. The buffer should handle data of length B bytes, where the length is a parameter for the design. Estimate the execution time of your implementation. Suggest solutions for speedingup the design.

8. 8.

Develop (a) a cost-efficient and (b) a time-efficient algorithm for computing the square root of an unsigned integer number. Compare the two solutions.

9. 9.

Program the CRC algorithm in C language, and then in assembly code. Estimate the execution time, and code size, for the two algorithms. Identify the C language constructs that lead to significant execution overhead. Propose solutions that minimize the execution time of the C program. Compare the execution times of the two software procedures with the execution time of the CRC hardware module that is based on PSoC, programmable, digital blocks.

10. 10.

Propose a general approach for scheduling the operations pertaining to parallel loop iterations. Assume that there are no data dependencies between operations in different iterations. Consider an example of 5 parallel iterations and explain the scheduling result of your algorithm.

11. 11.

Propose a general method for binding operations to hardware circuits and variables to registers. The goal is to find the minimum amount of hardware that guarantees satisfying a predefined timing constraint on a ADFG (Acyclic Dataflow Graph). Illustrate your algorithm on an ADFG with more than 10 nodes that represent operations on at least 5 variables.

12. 12.

Is it possible to develop a hardware-only, PSoC-based implementation for the differentiation step of the decimator module, assuming that only Type 1 decimator blocks are available? If so, describe your design and if not, explain why not.

13. 13.

The design example in Section 4.2 minimized the execution time for a given cost of the hardware circuits. How does the design change, if the goal is to find the minimum cost implementation that meets a specified execution time requirement?

14. 14.

Analyze the advantages and limitations of the cost function defined in expression (4.12). Propose more complex cost functions that address the limitations.

15. 15.

Compute the expressions of the logic functions F in Figures 4.10 and 4.12.

16. 16.

Develop firmware routines for managing and operating the deadband block. Use the firmware routines in a C program to generate nonoverlapping signals with different frequencies, and deadband values.

17.

Build a 32-bit PWM module and develop the corresponding firmware routines for starting and stopping the module, setting the pulse width and period, enabling and disabling interrupts, and reading the pulse width and counter.

18.

Identify possibilities for optimizing the execution time of the code in Figures 4.39(b) and 4.40. Find the reasons for the difference in execution time between the C code and the assembly code without the MAC.

19.

Develop a set of firmware routines for operating the Type 1 decimator blocks.

20.

For the International Data Encryption Algorithm (IDEA), propose an implementation that is optimized for execution time. The algorithm is described in . Compare your solution with the solution discussed in .

## Footnotes

1.

See Application Note AN2283, “Measuring Frequency,” Cypress Semiconductor Corporation.

## References

[1] PSoC Mixed Signal Array, Technical Reference Manual, Document No. PSoC TRM 1.21, Cypress Semiconductor Corporation, 2005.
[2] Pseudo Random Sequence Generator, CY8C29/27/24/22/21xxx Data Sheet, Cypress Semiconductor, September 21, 2005.
[3] 8-bit and 16-bit Pulse Width Modulators, CY8C29/27/24/22/21xxx, CY7C6xxxx, and CYWUSB Data Sheet, Cypress Semiconductor, October 3, 2005.
[4] J. Ackenhusen, Real-Time Signal Processing: Design and Implementation of Signal Processing Systems, Upper Saddle River, NJ: Prentice Hall, 1999.
[5] M. Chiodo et al., A case study in computer-aided codesign of embedded controllers, Journal of Design Automation of Embedded Systems, 1, (1), pp. 51-67, 1996.
[6] D. Comer, Computer Networks and Internets with Internet Applications, third edition, Upper Saddle River, NJ: Prentice Hall, 2001.
[7] G. De Micheli, R. Ernst, W. Wolf, Readings in Hardware/Software Co-Design, San Francisco: Morgan Kaufmann, 2002.
[8] P. Diniz, E. da Silva, S. Netto, Digital Signal Processing, Cambridge: Cambridge University Press, 2002.
[9] R. Ernst, J. Henkel, and T. Benner, Hardware-software cosynthesis for microcontrollers, IEEE Design & Test, 10, (4), pp. 64-75, December 1993.
[10] R. Gupta, G. De Micheli, A cosynthesis approach to embedded system design automation, Journal of Design Automation of Embedded Systems, 1, (1), pp. 69-120, January 1996.
[11] A. A. Jerraya, J. Mermet (editors), System-Level Synthesis, Boston: Kluwer Academic Publishers, 1999.
[12] O. Ozbek, Estimating PSoC Power Consumption, Application Note AN2216, Cypress Semiconductor, September 21, 2004.
[13] M. Raaja, Binary to BCD Conversion, Application Note AN2112, Cypress Semiconductor, February 12, 2003.
[14] V. Sokil, Hardware Bitstream Sequence Recognizer, Application Note AN2326, Cypress Semiconductor, December 8, 2005.
[15] D. Stranneby, Digital Signal Processing: DSP and Applications, Boston: Newnes, 2001.
[16] S. Sukittanon, S. Dame, 3-Channel Filterbank in PSoC, Application Note AN2315, Cypress Semiconductor, September 26, 2005.
[17] J. Valvano, Embedded Microcomputer Systems: Real Time Interfacing, third edition, London: Thomson, 2007.
[18] D. Van Ess, Measuring Frequency, Application Note AN2283, Cypress Semiconductor, May 26, 2005.
[19] W. Wolf et al., Hardware-Software Codesign: Principles and Practices, Boston: Kluwer Academic Publishers, 1999.
[20] PSoC Express, Version 2.0, Cypress Semiconductor, 2006.