A Partitioned CAM Architecture with FPGA Acceleration for Binary Descriptor Matching

An efficient architecture for image descriptor matching that uses a partitioned content-addressable memory (CAM)-based approach is proposed. CAM is frequently used in high-speed content-matching applications. However, due to its lack of functionality to support approximate matching, conventional CAM is not directly useful for image descriptor matching. Our modifications improve the CAM architecture to support approximate content matching for selecting image matches with local binary descriptors. Matches are based on Hamming distances computed for all possible pairs of binary descriptors extracted from two images. We demonstrate an FPGA-based implementation of our CAM-based descriptor-matching unit to illustrate the high matching speed of our design. The time complexity of our modified CAM method for binary descriptor matching is O(n). Our method performs binary descriptor matching at a rate of one descriptor per clock cycle at a frequency of 102 MHz. The resource utilization and timing metrics of several experiments are reported to demonstrate the efficacy and scalability of our design.


INTRODUCTION
Binary descriptor matching is the process of finding the most similar binary vectors that describe a characteristic of two or more different sets. Fast matching of binary descriptors can improve the speed of many diverse applications such as image matching [22], image registration [8], aerial image processing [26], and DNA classification [7]. Due to the high number of key points in images, finding the corresponding matches of binary descriptors from one image to another can be computationally expensive. In this work, we propose an architecture based on content-addressable memory (CAM) for fast matching of binary descriptors. Conventional CAM architecture cannot be utilized for binary descriptor matching due to its lack of capability to tolerate approximation. Our CAM-based approach supports approximate matching and provides the advantage of finding the matching descriptor within a set of binary descriptors much faster than other methods. This is beneficial where a large number of Hamming distance calculations are required, including real-time image matching [23], or applications that require processing of large databases such as memory-augmented neural networks [1] and DNA sequencing [32].
The remainder of this section is organized as follows. We introduce CAM in more detail in Section 1.1. Since our focus in this work is to accelerate the matching step for image-matching applications, we overview fundamentals of image matching in Section 1.2. In Section 1.3, we review the challenges of using conventional CAM for binary descriptor matching. Finally, we introduce the hardware platform for implementing our novel CAM architecture in Section 1.4.

Content-addressable Memory
CAM [25], also known as associative memory, is often used for high-speed search applications. Using CAM as the memory structure in a search application increases speed, as CAM can identify the location of query data without any iteration.
In traditional random access memory structures, an address (location) is given as an input to the memory, and the corresponding data is read as the output of the memory. Unlike random access memory structures, CAM identifies the location of query data.
Figure 1(a) shows a general structure of a RAM (in this article, by RAM, we mean a conventional random access memory structure), and Figure 1(b) is an example of loading the corresponding content of an address using a RAM. Figure 1(c) is a CAM structure implemented using a RAM and a customized encoder. Using CAM, we can find the location(s) of a specific value (query data). The content stored in CAM is an array of bits where each bit corresponds to a location of the data. The most significant bit of this array corresponds to the last memory location, and the least significant bit corresponds to the first memory location. An encoder can be used to convert the one-hot-encoded locations to binary value(s) corresponding to the query data location. The specific design and output of the encoder depend on the application, and the encoder can be customized (for example, as a priority encoder or a multi-output encoder) based on the application's requirements. A numerical example of the CAM functionality is shown in Figure 1(d), in which for query data equal to 0101, the encoded array is 00000010, which corresponds to memory location 1 (binary value: 001). If the query data is 1100, then the corresponding output is 01100000, which corresponds to two locations: 5 and 6. For query data equal to 0001, 0011, 0100, 0110, 1000, 1001, 1011, 1101, or 1110, the corresponding content is 00000000, which means that no stored value equals the query (no hit).
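The behavior in Figures 1(c) and 1(d) can be sketched in software. The following minimal model (illustrative names only, not the authors' RTL) treats the CAM as a map from each stored value to a one-hot array over memory locations, followed by a multi-output encoder:

```python
# Minimal software model of a CAM with a multi-output encoder.
# contents[i] is the value stored at memory location i.

def build_cam(contents):
    """Map each stored value to a one-hot list over memory locations."""
    n = len(contents)
    cam = {}
    for loc, value in enumerate(contents):
        cam.setdefault(value, [0] * n)[loc] = 1
    return cam

def cam_lookup(cam, query, n):
    """Return the one-hot location array for a query (all zeros = no hit)."""
    return cam.get(query, [0] * n)

def encode_locations(one_hot):
    """Multi-output encoder: indices of all locations holding the query."""
    return [i for i, bit in enumerate(one_hot) if bit]

# Mirrors the spirit of Figure 1(d): 0101 at location 1, 1100 at 5 and 6
# (the other stored values here are illustrative filler).
contents = ["0010", "0101", "0111", "1010", "1111", "1100", "1100", "0000"]
cam = build_cam(contents)
```

With this model, a lookup of "0101" encodes to location [1], "1100" encodes to [5, 6], and a value not stored anywhere returns an empty location list (no hit).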

Image Matching
Image matching is one of the fundamental operations of computer vision for many higher-level applications such as structure from motion [38] and visual simultaneous localization and mapping [4]. Image matching comprises several steps, including key-point detection, patch description, descriptor matching, and outlier elimination. In the key-point detection and patch description steps, various algorithms such as SIFT [21], SURF [3], ORB [29], BRISK [20], and D2-Net [6] are used for finding key points in images and extracting features (descriptors) from image patches around the key points [22]. For each key point, a descriptor is generated. In the descriptor-matching step, a distance metric is calculated among all combinations of descriptors from the reference image and target image. Euclidean distance is an example distance metric used for descriptors with floating-point values. Hamming distance is used for binary descriptor matching, where each element of the descriptor is a binary value. Algorithms such as BRIEF, BRISK, and ORB generate binary descriptors. For distance calculation of each pair of binary descriptors, a bit-wise XOR is applied to identify the differing elements within the two descriptors. Then, the distance is calculated by counting the number of differing elements. If the distance between two descriptors is less than a pre-defined threshold, then that pair of key points (each key point corresponds to a descriptor) is considered a proposed match, either exact or approximate.
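The XOR-and-count rule described above can be written compactly; the threshold value in the usage below is an illustrative placeholder:

```python
def hamming_distance(d1, d2):
    """Hamming distance of two equal-width binary descriptors given as ints."""
    return bin(d1 ^ d2).count("1")   # XOR marks differing bits; then popcount

def is_proposed_match(d1, d2, threshold):
    """A pair of key points is proposed as a match if the distance is small."""
    return hamming_distance(d1, d2) < threshold
```

For example, `hamming_distance(0b10110100, 0b10110101)` is 1, so the pair would be proposed as a match under any threshold above 1.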

Using CAM for Binary Descriptor Matching
In image matching, the reference image and target image might be captured from different viewpoints and under different illumination, which might cause variations in the binary descriptors of patches around key points corresponding to the same local points. Therefore, in the binary descriptor-matching step of the image-matching algorithm, a descriptor in the reference image that is an approximate value of the query descriptor (the descriptor corresponding to the key point of the target image) is acceptable as a match. Although CAM can locate query data with high speed, it cannot be used for search applications, such as binary descriptor matching, in which an approximate match is acceptable. Riazi et al. [28] illustrate that CAM cannot be used directly for Hamming distance calculation. The distance between two binary descriptors is based on the difference between their bits, while in CAM, a match is found only if the query descriptor is an exact match to at least one of the contents of memory.
In this work, we introduce a CAM solution for binary descriptor matching. Instead of computing the Hamming distance between all combinations of binary descriptors of the reference and target images, our CAM-based binary descriptor matching proposes the nearest approximate descriptor to the query descriptor. We only compute the Hamming distance between the descriptor selected by the modified partitioned CAM and the query descriptor. If the Hamming distance is less than a pre-defined threshold, then the corresponding key points are considered a proposed match.

Hardware Implementation of Binary Descriptor Matching
Many computer vision algorithms, including binary descriptor-matching algorithms, require a large number of computations; however, they can be implemented in parallel for speed enhancement. Field-programmable gate arrays (FPGAs) are popular platforms for the implementation of computer vision algorithms due to their capability for parallel computation, reconfiguration, and low power consumption. In this work, we implement and demonstrate our content-addressable-memory-based design using an FPGA platform.
The rest of this article is organized as follows. In Section 2, we review previous work on CAM structures and implementations and other hardware-based binary descriptor-matching algorithms. Then, we present our modified CAM design for binary descriptor matching in Section 3. We report timing analysis and resource utilization of our implementation in Section 4. We conclude in Section 5.

RELATED WORK
We first review recent work on CAM in Section 2.1. Then, we discuss the hardware implementations of image-matching algorithms in Section 2.2.

Review of CAM Structures and Implementation
Irfan et al. [12] have surveyed various implementations of CAM on FPGAs. Although CAM has been used for a variety of high-speed applications, specifically including FPGA-based implementations in the recent literature [13, 14, 15, 33], few FPGA-based CAM architectures have been described for solving image-matching problems.
Lee et al. [19] propose a nanoelectromechanical-switch-based ternary content-addressable memory (NEMTCAM) for implementing a nearest-neighbor classifier. Their proposed NEMTCAM can calculate the Hamming distance by the discharge conductance distribution. Kazemi et al. [16] propose a novel distance function that can be evaluated using multi-bit content-addressable memories (MCAMs) based on ferroelectric FETs (FeFETs) to perform a single-step in-memory nearest-neighbor search with Hamming distance. Garzon et al. [7] propose a Hamming-distance-tolerant content-addressable memory (HD-CAM) for matching applications. They design the HD-CAM in 65 nm CMOS technology.
Ternary content-addressable memory (TCAM) is an extension of CAM that allows selected input bits to be treated as don't-care values in addition to 0s and 1s. TCAM is used when a portion of the input data is sufficient to find the address and an exact match is not required [12]. In this method, the don't-care bit positions in the query data must be specified in advance, which is incongruous with the nature of descriptor vectors in image matching: bit differences can occur at any bit position for each image patch in each image and thus cannot be empirically tuned. Therefore, TCAM [12] is also not suitable for descriptor matching. In contrast, with the novel approach proposed in this work, it makes no difference whether a small number of bit differences between two descriptor vectors occur near the beginning of the vectors or near the end. In either case, the result is identical, which is the required functionality of a descriptor-matching system.
A disadvantage of using CAM is its high memory requirement and hardware resource utilization, as shown by Irfan et al. [12]. Ullah et al. [34] propose UE-TCAM, which partitions the CAM based on the query data to reduce memory usage. The basic implementation of CAM requires 2^(N_b) memory locations, where N_b is the number of bits of data stored in data memory. This configuration, which is shown in Figure 2(a), results in a large memory footprint for increasingly larger N_b. In UE-TCAM, the query input of the CAM is partitioned into strings with a smaller number of bits. Each part of the query input is used to access a separate CAM. As a result, the output of each CAM module shows all the locations in the data memory that have the same bits in the same positions as the query input of that CAM module. To find the final location for the query input, the bit-wise logical AND of all the data read from the CAM units is calculated, and each non-zero bit of the result provides a location of the query input in the data memory (if all the bits are zero, there is no hit for that query input).
Figure 2(b) is based on the partitioning method in UE-TCAM [34] and presents an implementation of CAM with query data of N_b bits, partitioned into k strings of m bits. The content stored in each of the k partitioned CAM units has N bits. In Figure 2, b_{N_b-1} to b_0 represent the query data, a_{N-1} to a_0 represent the content stored in CAM, and A_{N-1} to A_0 represent the bit-wise AND result of all a_n values in each column. The final AND result is a binary vector, which is the encoded location of the query data. The position of each 1 value in the binary vector corresponds to a memory location.
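The partitioned lookup of Figure 2(b) can be sketched as a software model under the paper's notation (all names are illustrative): the query is split into k m-bit strings, each string indexes its own CAM unit, and the unit outputs are combined with a bit-wise AND.

```python
def partition(value, n_bits, m):
    """Split an n_bits-wide integer into n_bits//m chunks of m bits, MSB first."""
    k = n_bits // m
    mask = (1 << m) - 1
    return [(value >> (m * (k - 1 - i))) & mask for i in range(k)]

def build_partitioned_cam(contents, n_bits, m):
    """cams[u] maps an m-bit chunk to a one-hot list over the N locations."""
    n = len(contents)
    cams = [{} for _ in range(n_bits // m)]
    for loc, value in enumerate(contents):
        for u, chunk in enumerate(partition(value, n_bits, m)):
            cams[u].setdefault(chunk, [0] * n)[loc] = 1
    return cams

def exact_match_locations(cams, query, n_bits, m, n):
    """Bit-wise AND across unit outputs; surviving 1 bits are exact matches."""
    rows = [cams[u].get(chunk, [0] * n)
            for u, chunk in enumerate(partition(query, n_bits, m))]
    return [i for i in range(n) if all(row[i] for row in rows)]
```

With 6-bit contents and m = 2 (k = 3), a stored value is located exactly, but a query differing in even a single bit yields no hit; this limitation is what the modified architecture in Section 3 addresses.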

Hardware Implementation of Image-matching Algorithms
There have been many FPGA-based implementations of the detector (key-point detection) and descriptor (patch description) steps of image-matching algorithms described in the literature [9, 30, 31]. Since the focus of this work is on the binary descriptor-matching step, which follows the patch description step in an image-matching algorithm, we only review in more detail the work that has addressed binary descriptor matching on FPGAs.
Le et al. [17, 18] propose a CAM-based FPGA design for pattern matching. The objective of Reference [17] is to match a query pattern in an image with a reference image stored in a database, with applications such as face detection. The focus of References [17, 18] is not on the descriptor-matching step of the image-matching problem.
Rao et al. [27] propose an FPGA implementation of an ORB-based image-matching algorithm for full HD videos. After computing the binary descriptors, their descriptor-matching module calculates the Hamming distance between the descriptors stored in block RAMs (descriptors of the reference image) and the descriptors in the database (descriptors of the target image). Their image-matching algorithm requires 13.37 ms for each 1,920 × 1,080 frame with 500 features (key points).
Huang et al. [11] use the BRIEF algorithm for the patch description step of image matching. They also use Hamming distance for the binary descriptor-matching step. The maximum number of BRIEF descriptors in their design is 100. The BRIEF descriptors are stored in two FIFO memories, one for each image. The Hamming distances of each descriptor from the target image with all descriptors from the reference image are then computed in parallel. They use an adder tree for computing the summation in the Hamming distance and a comparator tree for finding the minimum among all computed Hamming distances. They achieve 310 fps in the overall system for a 512 × 512 image frame with 100 descriptors.
Ni et al. [24] propose an FPGA-based binocular image-matching algorithm. They use a SURF detector and a BRIEF descriptor with 128 bits. They also use parallel matching cores based on Hamming distance for finding the correspondences between left (reference) and right (target) images in a stereo vision system. They achieve 162 fps for 640 × 480 images.
Peng et al. [26] utilize the image-matching algorithm for high-resolution aerial images. They use a 512-bit BRISK descriptor together with 128 parallel Hamming distance calculator modules. Each module is composed of a 512-bit XOR operator and a bit accumulator to compute the Hamming distance of two binary descriptors. In each clock cycle, they compute the Hamming distance of one binary descriptor with 128 binary descriptors that are pre-stored in SRAM (static RAM) from the description step. Finally, they find the matching key points associated with the two best matches for each key point using a two-level comparator design. They use high-resolution (5,616 × 3,744) images with 4,588 pairs of key points, which require 548 ms for the image-matching algorithm.
Hu et al. [10] propose a binary matching system with a focus on the symmetry of image patches for achieving a high frame rate and ultra-low delay. Their descriptor-matching module has three steps. The first step computes the XOR of the descriptors in a template and the currently processed descriptor. In the second step, they use population counting for each 32-bit subset of the 128 bits of XOR results in parallel. In the final step, which is called the addition step, the four numbers for each Hamming distance are added together.
We compare our method with work that implements image-matching algorithms in hardware using conventional binary descriptor matching [10, 11, 24, 26, 27]. The methods proposed in these works compute a distance metric among all binary descriptors of the two images (reference image and target image). In their methodology, an iteration through all binary descriptors of the reference image, together with each binary descriptor of the target image, is required to calculate the Hamming distance of all possible pairs. In the work proposed here, we modify the partitioned CAM architecture of Reference [34] to increase the speed of binary descriptor matching. In our method, for each query binary descriptor of a target image, the nearest binary descriptor of a reference image is proposed using the enhanced, partitioned CAM, and only the Hamming distance between the query and the proposed binary descriptor is calculated. Therefore, for a single query binary descriptor, no iteration through all binary descriptors of the reference image is required.

METHODOLOGY: BINARY DESCRIPTOR MATCHING WITH OUR MODIFIED PARTITIONED CAM
In this section, we first present our design to use CAM for binary descriptor matching and illustrate it with an example in Section 3.1. Then, we discuss the effect of selecting the number of bits for CAM partitioning on binary descriptor matching in Section 3.2. Finally, we discuss the process timing of our method in Section 3.3.

Modified Partitioned CAM Architecture for Binary Descriptor Matching
In this section, we present our modifications to the partitioned CAM [34] so that it can be used for binary descriptor matching. We modify the partitioned CAM design (shown in Figure 2(b)) to load the location of the closest (approximate) data instead of only exact matches. Although we use the idea of partitioning CAM modules for a more efficient implementation, the novelty and contribution of our work is a method for using CAM for descriptor matching: none of the existing variants of CAM implementation can be used directly for descriptor matching, since they cannot tolerate a small number of bit differences between two descriptor vectors. The additional logic elements and the data-processing methodology that enable the use of a high-performance CAM for descriptor matching are a primary contribution of our work.
The pseudocode for our approach is shown in Algorithm 1. The input is the query data of N_b bits, represented by a bit string (b_{N_b-1} to b_0) and divided into k strings of m bits (k is the number of CAM units). Each m-bit string is the input of a CAM unit, and the corresponding content loaded from each CAM unit is an N-bit binary string. The loaded contents from the CAM units are represented as a matrix A_{k×N}, where each row corresponds to the output of one CAM unit and each column corresponds to a location of the data. In the subsequent step, the summation of each column of A is calculated and stored in the Sum_values array, so that each summation result (Sum_values(i)) gives the number of CAM units with output bit 1 at the corresponding index i. After calculating the summation, the index of the maximum value among the elements of Sum_values represents the location of the content closest to the query data (b_{N_b-1} to b_0). If the maximum value occurs at two indices, then the left-most position is selected. To select only contents within a pre-defined distance, we compare the maximum value with a pre-defined threshold (Sum_val_threshold), which is determined experimentally in Section 4.3. We then select the content as the output (the closest content to the query data) if the maximum value is greater than or equal to Sum_val_threshold. The output of this algorithm is the location of the approximate value of the query data.

Figure 3 demonstrates an example of using partitioned CAM units to locate an exact match and our modified partitioned CAM to locate an approximate match in a RAM. In this example, the query data is 101011, comprising 6 bits (N_b = 6), and three CAM units (k = 3) are used to find a match for the query data stored in RAM (as in Figure 3(c)), as shown in Figure 3(a). The result of processing the loaded values of the CAM units for the partitioned CAM and our modified partitioned CAM is shown in Figure 3(b). If the loaded values were processed in a conventional partitioned CAM module, then the logical bit-wise AND of the bits would have no non-zero bit, as shown in Figure 3(b), so the partitioned CAM cannot locate any content in the RAM (no hit). If the same loaded values are processed using our modified partitioned CAM, then instead of a bit-wise AND, the number of ones in each column of the CAM unit outputs is counted using bit-wise addition. This addition result is compared with a pre-defined threshold, and if it is greater than or equal to the threshold, the index is selected as the location of the approximate match. As shown in Figure 3(c), the content of address 4 in the RAM has only one bit of difference from the query data. As the identical value 101011 is not stored in the RAM, the conventional partitioned CAM cannot find a match for the input query, while our approach finds the closest (approximate) value and proposes location 4 in the RAM as the output.
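A software sketch of Algorithm 1 (illustrative, not the RTL): the loaded CAM-unit outputs form a k × N matrix A; the columns are summed, the maximum is located (left-most on ties), and the result is accepted only if it reaches Sum_val_threshold.

```python
def approximate_match_location(A, sum_val_threshold):
    """A[u][i] = 1 if CAM unit u hit location i; return a location or None."""
    n = len(A[0])
    sum_values = [sum(row[i] for row in A) for i in range(n)]  # hits per column
    max_value = max(sum_values)
    if max_value < sum_val_threshold:
        return None                       # no stored content is close enough
    return sum_values.index(max_value)    # left-most maximum on ties

# Example in the spirit of Figure 3: three CAM units (k = 3), six locations.
# Location 4 is hit by two of the three units, so a bit-wise AND finds
# nothing, while the summation with threshold 2 proposes location 4.
A = [
    [0, 0, 0, 0, 1, 0],   # CAM unit 0 output
    [0, 1, 0, 0, 1, 0],   # CAM unit 1 output
    [0, 0, 0, 0, 0, 1],   # CAM unit 2 output
]
```

Here `approximate_match_location(A, 2)` proposes location 4; raising the threshold to 3 (requiring an exact match on all units) yields no hit, mirroring the conventional partitioned CAM.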
The block diagram of our modified partitioned CAM for binary descriptor matching is shown in Figure 4. The inputs of this design are a key point and its corresponding descriptor from a target image. The goal is to match a key point of the target image with one of the key points of a reference image stored in a RAM (Reference Image Key-points RAM in Figure 4). The outputs of the design are a key point of the reference image and a key point of the target image that are selected as a match. The matching of key points is done based on the distance between their corresponding descriptors. The corresponding descriptor of the input key point is thus the query descriptor of the modified partitioned CAM.
In this design, there are two RAMs that store the descriptors and key points of the reference image. The index of each key point stored in the Reference Image Key-points RAM is the same as the index of its corresponding descriptor in the Reference Image Binary Descriptors RAM. Therefore, one index (named Location of approximate match) is used to load both the key point and its corresponding descriptor. For each query descriptor (descriptor of the target image), a descriptor among the descriptors of the reference image (stored in the Reference Image Binary Descriptors RAM) is selected using our modified partitioned CAM. The Hamming distance between this selected descriptor and the query descriptor (target image descriptor) is calculated. If the distance is less than a pre-defined threshold (Hamming_threshold), which is determined experimentally based on Precision and Recall values as shown in Section 4.3, then the corresponding key points are proposed as a match. The Precision and Recall metrics, which are formulated later in Equations (2) and (3), determine, respectively, how many of the proposed matches are correct and how many of the correct matches among the detected key points are selected.
The data flow of our method for the modified partitioned CAM implementation is shown in Figure 5(a). For CAM to support approximate matching (locating the closest value), we modify the partitioned CAM structure (shown in Figure 2(b)) by replacing the bit-wise AND with an adder tree that calculates the summation result (Sum_values) and a comparator tree that selects the maximum value (Max_value). The adder tree for each column computes the summation of all its bits (counting the number of 1s). Then, the maximum of the summation values is selected using a comparator tree. Next, an N-bit binary vector marking the elements of Sum_values equal to Max_value is generated: element i of the binary vector is 1 if Sum_values(i) is equal to the maximum value (Max_value); otherwise, element i of the binary vector is 0. Finally, a priority encoder converts the binary vector to the output location of the query data. In the comparator tree, each comparator output, used as the select signal of a multiplexer, passes the greater value to the next level; the comparator tree structure thus selects the maximum value among all summation results.
The modified partitioned CAM architecture can locate the content closest to the query data without any iteration.

Selection of the Number of Bits for Partitioning the Query Descriptor
In this design, the parameters that can be tuned are N_b and k, based on the trade-off between binary descriptor matching performance (in terms of Precision and Recall) and resource utilization. As the number of bits of the binary descriptor (N_b) is a parameter that is usually tuned based on the performance of the previous step of the image-matching algorithm (patch description), we can assume N_b is fixed. In Figure 5(a), k is the total number of CAM units, and m = N_b/k is the number of bits of the descriptor dedicated to the address of each CAM unit. For a fixed number of descriptor bits, changing the parameter k also changes m and affects the accuracy and resource utilization of the architecture. If m, which determines the size of the CAM units, increases, then the matching system detects exact matches on longer bit strings of the descriptor vectors. This may increase the matching error by missing correct matches that have only a few differing bits.

Figure 6 shows an example with two 8-bit binary descriptors (N_b = 8). We assume binary descriptor 1 is the query binary descriptor (target image binary descriptor), and binary descriptor 2 is the reference image binary descriptor stored at location i in the Reference Image Binary Descriptors RAM. The Hamming distance of these two descriptors is equal to 1, as the least significant bit is the only differing bit. If we assume the pair of descriptor 1 and descriptor 2 is a match, then for the design to select descriptor 1 as a match of descriptor 2, the ith bit of the loaded CAM unit outputs should be 1 for a pre-defined number of CAM units (refer to Figure 5(a), in which S_i should be greater than or equal to the pre-defined threshold Sum_val_threshold). If the ith bit of a CAM unit's output is equal to 1, then we consider it a hit for that CAM unit; otherwise, we consider it a miss.

Fig. 6. Example of various numbers of CAM units and their effect on the number of hits when matching two 8-bit binary descriptors with only one bit of difference. Partitioning with a higher number of CAM units (for example, k = 4 in panel (a)) results in a higher number of hits, while a lower number of CAM units (for example, k = 1 in panel (c)) leads to no hit.

As shown in Figure 6(a), for k = 4, there are three hits and one miss. In Figure 6(b), for k = 2, there is one hit and one miss. In Figure 6(c), for k = 1, there are zero hits and one miss. In our design, after this stage, if the number of hits is greater than or equal to the pre-defined threshold (Sum_val_threshold), we consider the pair of binary descriptors to be a match. Therefore, the probability of selecting descriptor 1 and descriptor 2 as a match increases when the number of CAM units (k) is increased.
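The effect shown in Figure 6 can be checked numerically. This helper (illustrative names) counts how many m-bit partitions of two descriptors agree exactly:

```python
def count_partition_hits(d1, d2, n_bits, m):
    """Number of m-bit partitions on which two descriptors match exactly."""
    mask = (1 << m) - 1
    return sum(
        ((d1 >> (m * i)) & mask) == ((d2 >> (m * i)) & mask)
        for i in range(n_bits // m)
    )

# Two 8-bit descriptors that differ only in the least significant bit.
d1, d2 = 0b10110100, 0b10110101
```

For k = 4 (m = 2) the pair yields three hits, for k = 2 (m = 4) one hit, and for k = 1 (m = 8) no hit, matching the counts in Figure 6.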
Consequently, increasing the number of CAM units k improves the image-matching performance in terms of Precision, while the resources required for the adder tree and comparator tree also increase.
Another important parameter in choosing the number of CAM units for partitioning the query binary descriptor is the total number of bits required for the CAM implementation. Each CAM unit requires 2^m × N bits, where m is the number of bits in each partition and N is the number of key points. The total number of memory bits for the CAM units is shown in Equation (1):

Total CAM bits = k × 2^m × N = (N_b / m) × 2^m × N,    (1)

where N_b is the total number of descriptor bits (N_b = k × m, as in Figure 5(a)). Therefore, if N_b = 256, then for m = 2, m = 4, and m = 8, the total numbers of required memory bits are 2^9 × N, 2^10 × N, and 2^13 × N, respectively. In this work, we chose m = 4, which results in 64 CAM units, after considering the trade-off between accuracy and resource utilization.

The approximation mentioned throughout this article is not a source of error. In fact, by introducing this approximation (computing the summation of the bits from the CAM outputs and partitioning the bits of the descriptor for storing in the CAM modules), we create the possibility of using CAM for image matching, which increases the speed of the matching step. Our proposed method finds the best match from the first image to the second image. Descriptor matching by nature should accommodate small differences in the bit values. The only potential error arises if we increase the size of the CAM modules (m) so much that the matching system effectively looks for an exact match of the descriptor from the first image in the second image. This may result in not finding an exact match and missing correct matches that have a small number of unmatched bits.
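The memory figures above can be verified directly with a one-line check of Equation (1), total CAM bits = (N_b/m) × 2^m × N:

```python
def cam_memory_bits(n_b, m, n_keypoints):
    """Total CAM storage: k = n_b/m units, each with 2**m words of n_keypoints bits."""
    k = n_b // m                      # number of CAM units
    return k * (2 ** m) * n_keypoints
```

For N_b = 256, this reproduces the totals quoted in the text: 2^9 × N for m = 2, 2^10 × N for m = 4, and 2^13 × N for m = 8.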

Process Timing of the Proposed Modified Partitioned CAM
The modified partitioned CAM binary descriptor matching has three steps. In this section, we present illustrative timing diagrams to give a conceptual description of the state machine operation. The diagrams visually illustrate how our proposed hardware architecture works in each step.

3.3.1 Step 1. Figure 7 shows the timing diagram of the first step, in which the locations corresponding to the descriptors of the reference image are written into the CAM units. Storing each descriptor location in CAM requires three clock cycles in our design: the first clock cycle loads the existing value from each CAM module. The input of each CAM module is the selected bits (4 bits in our design) of the descriptor. In the second clock cycle, the bit of the CAM word that represents the descriptor location in the data memory is set to 1. Finally, the new value is stored back in the same CAM cell in the third clock cycle. The first step is processed immediately after the key-point detection and patch description steps.
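The three-cycle write described above is a read-modify-write sequence; modeled in software (each commented line corresponds to one clock cycle in the design, and all names are illustrative):

```python
def store_descriptor_location(cam_unit, chunk, location):
    """Record that the descriptor at `location` contains `chunk` in this unit."""
    word = cam_unit.get(chunk, 0)   # cycle 1: load the existing CAM word
    word |= 1 << location           # cycle 2: set the bit for this location
    cam_unit[chunk] = word          # cycle 3: write the new value back

# Two descriptors whose 4-bit chunk for this unit is 0101, stored at
# locations 1 and 3, accumulate into a single one-hot-per-location word.
cam_unit = {}
store_descriptor_location(cam_unit, 0b0101, 1)
store_descriptor_location(cam_unit, 0b0101, 3)
```

After the two writes, the CAM word for chunk 0101 has bits 1 and 3 set, i.e., 0b1010.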

3.3.2 Step 2. The second step of our CAM-based binary descriptor matching is finding the descriptor in the Reference Image Binary Descriptors RAM most similar to the query descriptor (target image descriptor). First, we partition the query descriptor into m-bit values. Each m-bit array is used to access the data stored in one of the k CAM units. Then, the outputs of the CAM units are loaded in one clock cycle, and the summation of each column is calculated using an adder tree. The timing diagram of the second step is shown in Figure 8, in which each t_n corresponds to one clock cycle; at each clock cycle, four key points are being processed in the pipeline structure.

3.3.3 Step 3. The third step is writing zeros to all memory cells to prepare the CAM for the next image. This step happens once per image and is done in parallel for all CAM units. The time required for this step is equal to the number of cells in each CAM unit (16 in our design, as the input of each CAM has 4 bits). Compared with the number of clock cycles required for processing each image, the 16 clock cycles of this step are negligible.

EXPERIMENTAL RESULTS
In this section, we present the results of the simulation and implementation of our CAM-based binary descriptor matching on a Kintex UltraScale (KCU105) FPGA board [36], which contains a Xilinx XCKU040-2FFVA1156E FPGA.

Time Complexity
The descriptors from the reference image are generated in sequence and stored in the CAM. In a conventional descriptor-matching system, the descriptors from both the reference image and the target image are stored in memory, and the reference image descriptors are loaded one by one to be compared with each descriptor from the target image. In our CAM-based circuit, although the descriptors from the reference image are stored in the CAM sequentially, comparing them with a descriptor from the target image happens without any iterations, due to the modified CAM architecture. This is possible because of the nature of the CAM, which provides the address of the query data in the memory, and because of our proposed approach, which enables the CAM to accommodate the bit differences that arise in the descriptor-matching problem. As a result, only a single pass over the descriptors of the target image is required.
The commonly used method for binary descriptor matching is the computation of the Hamming distance between descriptor pairs from the reference image and the target image. Let the numbers of key points in the reference and target images be N1 and N2, respectively. Conventional matching based on Hamming distance computation then requires N1 × N2 Hamming distance operations, so its time complexity is O(n²), where n represents the number of arguments (in our case, the number of key points) passed to the algorithm. Many hardware implementations of Hamming distance calculation, such as Reference [24], have proposed parallel Hamming distance modules, leading to a time complexity of O(n²/m), where m is the number of parallel Hamming distance calculation modules. Since m is a bounded value and is smaller than the number of key points in practical applications, the time complexity of these methods is still O(n²).
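For reference, the conventional approach can be sketched as the full N1 × N2 sweep of Hamming distance computations (the function names here are illustrative):

```python
def hamming(a, b):
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")

def brute_force_match(reference, target):
    """For each target descriptor, return the index of the nearest
    reference descriptor: N1 * N2 distance computations in total."""
    return [min(range(len(reference)),
                key=lambda i: hamming(reference[i], q))
            for q in target]

reference = [0b0000, 0b1111, 0b1010]
target = [0b1101, 0b0001]
```

Every query descriptor triggers a pass over the entire reference set, which is exactly the inner loop that the CAM-based lookup removes.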
In conclusion, our method features high parallelism due to the CAM architecture. Each binary descriptor of the target image is compared with all binary descriptors of the reference image simultaneously (due to the nature of CAM functionality), without any one-to-one comparison. Therefore, the time required for matching N2 binary descriptors of the target image is proportional to N2, and our method has a time complexity of O(n). As the number of key points increases, our CAM-based matching becomes increasingly faster than the basic Hamming distance calculation method.

Speed Comparison
Table 1 compares our binary descriptor-matching method with other published work in terms of speed. The contributions of References [10, 11, 24, 26, 27] shown in Table 1 are primarily focused on the detector and descriptor stages of image matching, but they implement the detector, binary descriptor, and binary descriptor-matching steps of the image-matching algorithm. Our focus and contributions are related to the matching step, so to provide a fair comparison, we compare against works that use the Hamming distance for binary descriptor matching.
The focus of our work is on the binary descriptor-matching step. Our proposed method can be used with any detector and binary descriptor algorithm, and the matching stage is independent of the image size. However, the number of key points and the number of descriptor bits directly affect performance and resource utilization. As a result, we compute and report the time required to apply our binary descriptor-matching method to 128-bit (BRIEF), 256-bit (ORB), and 512-bit (BRISK) descriptors so that our results can be compared with other work that employs these three binary descriptors. As shown in Table 1, our method is faster than other work that uses the same number of key points and descriptor bits, because our method does not require computing the Hamming distance for all combinations of descriptors from the reference and target images. Although the image size directly affects the detector and descriptor speed and the frame rate metric, the processing speed of the matching step is affected only by the number of key points and the binary descriptor size, and is independent of the image size. Therefore, since our focus and contributions in this work are on the matching step and are not tied to a specific detector and descriptor, we do not report the image size of our work in Table 1.
Rao et al. [27] report 3 ms as the latency of their binary descriptor-matching step and 13.37 ms as the latency of the overall image-matching algorithm. They compute the Hamming distance for 500 key points per image, which requires 250,000 Hamming distance calculations to compute the distance between every pair of descriptors from the reference and target images. In our design, 500 key points lead to 500 operations, completed in 500 clock cycles. This requires 0.005 ms at a frequency of 100 MHz, which is 600 times faster than the binary descriptor-matching step proposed in Reference [27].
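The arithmetic behind this comparison can be checked directly; all values below are taken from the text (the 3 ms latency and 500 key points are reported in Reference [27], and 100 MHz is the clock frequency used in the comparison):

```python
keypoints = 500
clock_hz = 100e6                           # 100 MHz
pairwise_ops = keypoints * keypoints       # Hamming computations in [27]
cam_cycles = keypoints                     # one query descriptor per clock cycle
cam_latency_ms = cam_cycles / clock_hz * 1e3
speedup = 3.0 / cam_latency_ms             # vs. the 3 ms reported in [27]
```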

Accuracy of Image Matching
For verification of our hardware implementation, we obtain ORB descriptors from the images of the HPatches dataset [2], one of the most well-known and commonly used datasets for descriptor matching in the literature. Note that the purpose of accuracy verification in this section is to demonstrate that our method has accuracy comparable to the conventional matching methods used in the literature. Our contribution is a CAM-based architecture for descriptor matching that achieves higher speed than other work while exhibiting similar Precision and Recall. The experiments presented in this section demonstrate how our technique compares with other methods reported in the literature.
Our CAM-based binary descriptor-matching method uses two parameters, introduced in Section 3.1, that can be tuned for a given application. The first parameter is the summation threshold: after selecting the maximum of the summation values (Sum_values), we compare the maximum value (Max_val) with a pre-defined threshold (Sum_val_threshold). The second parameter is the Hamming distance threshold (Hamming_threshold) used for the final test of the selected pair.

The Precision value shows the ratio of correct matches to the proposed matches according to Equation (3):

Precision = # of correct proposed matches / # of proposed matches. (3)

In Figure 9, we investigate the effect of the summation threshold before the Hamming distance unit (which is equivalent to setting Hamming_threshold = 1). As shown in Figure 9, increasing the summation threshold (Sum_val_threshold) selects fewer key points as potential matches in our design. As a result, Precision increases, since only matches with higher similarities are accepted. However, Recall decreases when many of the correct matches are missed.
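The two metrics can be computed as follows, with matches represented as (reference index, target index) pairs; the pair sets below are made-up toy data for illustration, not results from HPatches:

```python
def precision_recall(proposed, ground_truth):
    """Compute Precision (Equation (3)) and Recall (Equation (2))
    from sets of (reference index, target index) match pairs."""
    correct = len(proposed & ground_truth)   # correct proposed matches
    precision = correct / len(proposed)      # correct / proposed
    recall = correct / len(ground_truth)     # correct / known matches
    return precision, recall

proposed = {(0, 0), (1, 2), (2, 1), (3, 3)}
ground_truth = {(0, 0), (1, 1), (2, 1), (3, 3), (4, 4)}
```

Raising a threshold shrinks the `proposed` set: Precision tends to rise because the surviving matches are more reliable, while Recall falls whenever correct pairs are dropped, which is the trade-off Figures 9 and 10 explore.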
The reason for the saturation of Precision and Recall at summation thresholds (Sum_val_threshold) lower than 25 is that we select the maximum value (Max_val), as shown in Algorithm 1. The Sum_val_threshold is a parameter for controlling the Precision and Recall of the matching results based on the requirements of an application, and it can be reasonably chosen based on the number of bits and the number of CAM modules. Although Figure 9 is provided to explain the effect of changing this parameter on the Precision and Recall values, the same methodology can be applied to an application-specific dataset (on a training set, for example).
Figure 10 shows the effect of tuning the Hamming distance threshold (Hamming_threshold) on Precision and Recall. The thresholds can be reasonably chosen based on the number of descriptor bits, the number of CAM units, and the requirements of a specific application. In this experiment, the summation threshold (Sum_val_threshold) is set to 25 empirically, based on Figure 9. Lower Hamming distance threshold (Hamming_threshold) values lead to higher Precision, as they admit only matches with lower distance values. Increasing the Hamming distance threshold (Hamming_threshold) results in a higher number of correct matches, which increases the Recall. Descriptor matching is concerned with finding the nearest descriptor vector, from the set of descriptors of a reference image, to a query descriptor from the target image. Therefore, the conventional descriptor-matching method generally used in the literature can be formulated as a K-nearest-neighbor (KNN) problem. In the basic form, K is equal to 1 and the closest descriptor with minimum distance (the one nearest neighbor) is selected. An improved version, commonly used in the literature [5, 35, 37], uses K = 2, selects the two nearest neighbors, and performs a ratio test based on the calculated distances. As a result, we compare our proposed method to the conventional KNN method to demonstrate the comparable accuracy of our novel CAM-based matching method.
Table 2 compares our CAM-based matching method with KNN using the Hamming distance as the distance metric. We used the KNN method (with K equal to 2 so that the ratio test is applicable) to select the two best matches (lowest Hamming distances) for each key point from the reference image, and applied the ratio test proposed by Lowe [21] (one of the most well-known works on descriptor matching in the literature) to decide whether the pair should be proposed as a match. The Precision and Recall shown in this table correspond to a Hamming distance threshold (Hamming_threshold) of 0.225 and a summation threshold (Sum_val_threshold) of 25. As shown in Table 2, our method exhibits Precision and Recall similar to those of the KNN method, indicating that its performance is comparable to other commonly used matching methods.
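The baseline we compare against can be sketched as 2-NN matching with Lowe's ratio test on Hamming distances. This is a minimal illustration: the ratio value 0.8 is an arbitrary example, not the one used for Table 2.

```python
def hamming(a, b):
    return bin(a ^ b).count("1")

def knn_ratio_match(reference, query, ratio=0.8):
    """2-NN with Lowe's ratio test: accept the nearest neighbor only
    if it is clearly closer than the second nearest."""
    order = sorted(range(len(reference)),
                   key=lambda i: hamming(reference[i], query))
    best, second = order[0], order[1]
    d1 = hamming(reference[best], query)
    d2 = hamming(reference[second], query)
    if d1 < ratio * d2:
        return best          # unambiguous nearest neighbor
    return None              # rejected: the two nearest are too close

reference = [0b00001111, 0b11110000, 0b11111111]
```

A query close to one reference descriptor passes the ratio test, while a query roughly equidistant from two descriptors is rejected, which is how the test filters ambiguous matches.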
The values of Precision and Recall depend on several factors, such as the complexity of the dataset and the number of points in the experiment. But under identical conditions, as in Table 2, our method achieves Precision and Recall similar to the KNN algorithm for the same set of points. It is important to note that any detector or descriptor algorithm can be used in the steps prior to descriptor matching in the image-matching pipeline; since the emphasis of our work is on matching, our experiments focus on the matching step. Related work in the literature has used KNN, a commonly used descriptor-matching technique, and as a result has provided minimal detail on the matching step. Our method performs the same number of comparisons as the conventional method, which leads to similar results (using the same descriptor vectors) in terms of Precision and Recall.

Hardware Implementation Metrics
All results provided in this section are based on post-place-and-route metrics on the FPGA device. Table 3 shows the resource usage of our binary descriptor-matching method, assuming 100 key points per image. In the LUT-CAM implementation, the CAM units are implemented using LUT RAM, and in the BRAM-CAM implementation, the CAM units are implemented on BRAMs. We also report the resource usage of the binary descriptor-matching step proposed by Ni et al. [24] in Table 3, as this is the only work, to the best of our knowledge, that reports resource utilization of the binary descriptor-matching step. Since most previous work, such as References [11, 26, 27], does not provide information about the resource requirements of the matching stage, we cannot properly compare parameters such as power and resource utilization. Our method for the 128-bit BRAM-based CAM requires about twice the number of LUTs and BRAMs as Reference [24]; however, because of our CAM-based binary descriptor matching, we achieve higher speed as a trade-off against resource utilization. Table 4 compares resource utilization for 100 and 500 key points. As shown in Table 4, increasing the number of key points (N in Figure 5) leads to a higher number of bits in each CAM unit. Therefore, the number of adder units in Figure 5(a), the number of Comparator units, and the critical delay path of the comparator tree also increase.
Although the number of binary descriptor bits and the number of key points extracted from each image affect both the resource utilization and the Precision and Recall metrics of our proposed architecture, these numbers do not impact the speed improvement described in this work. Choosing a higher number of descriptor bits normally increases Precision, and selecting a higher number of key points usually improves Recall (depending on the dataset); each of these choices also increases resource utilization. However, these design decisions are primarily driven by the specific dataset and the intended image-matching application.
Table 5 shows the power and maximum operating frequency of various configurations of our CAM-based binary descriptor-matching implementation. In Table 5, the number of bits (128, 256, 512) is proportional to the number of CAM units k (32, 64, 128), as the size of the CAM units is fixed (m = 4). The maximum frequency is computed based on the critical timing path of the circuit. For 100 key points per image, varying the number of bits does not lead to noticeable changes in the maximum frequency, which shows that the critical timing path of the design does not change significantly with the number of bits. However, using 500 key points per image decreases the maximum operating frequency to 102.75 and 91.69 MHz for BRAM-CAM and LUT-CAM, respectively. This change demonstrates that the critical timing path of the design is related to the comparator tree, a combinational circuit whose size is proportional to the number of key points. In addition, the power consumption of the implementation with 500 key points is much higher than that of the implementation with 100 key points, due to the use of more BRAM, as shown in Table 5. The choice between LUTs and BRAM for the implementation of the CAM depends on the available resources, the maximum operating frequency, and the power consumption of the FPGA.
Our proposed CAM matching method can be applied to any image size and any binary descriptor. As a case study, we provide the implementation metrics of key-point description and matching for two full-HD 1,920 × 1,080 images on the KCU105 FPGA board [36]. The resource utilization and speed metrics for 500 key points per image and a 256-bit binary descriptor extracted from image patches of 65 × 65 pixels are provided in Table 6.
The key-point locations are stored in RAM for each image. For each image, a 1,920 × 65 line buffer is implemented so that the binary descriptor unit has parallel access to the pixel values. The descriptors from the first image are stored in the CAM, and the descriptors from the second image are stored in a RAM. After computing all descriptors, the descriptors from the RAM (second image) are loaded sequentially, and the address of the best matching descriptor (first image) is retrieved from the CAM. The address is used to load the matching key-point coordinates from the key-point memory. This implementation results in a maximum frequency of 77 MHz and a frame rate of 40 fps, which includes image loading and descriptor computation time as well as the time for matching.

CONCLUSION
In this work, we introduce a method for using CAM to increase the speed of binary descriptor matching. Our method finds the location of the descriptor stored in a data memory that is closest to a query descriptor and proposes these descriptors' corresponding key points as a match. We also present an FPGA-based hardware architecture for binary descriptor matching based on partitioned CAM. The architecture can be used with any local binary descriptor to accelerate the matching step.

Fig. 1 .
Fig. 1. (a) General structure of a conventional RAM. (b) Example of reading the content of a RAM using an input address (an example input address with its corresponding content is shown in green). (c) CAM structure implemented using a RAM and a customized encoder that can be designed based on the requirements of a specific application. (d) Example of finding the location of query data in a CAM (two examples of query data with their corresponding outputs are shown in green and blue).

Fig. 2 .
Fig. 2. (a) Basic implementation of CAM. (b) Partitioned implementation of CAM. The total number of bits of the query data (N_b) is partitioned into m-bit strings, resulting in k CAM units. The content stored in each CAM unit has N bits, where N is the number of locations.

Fig. 3 .
Fig. 3. Example of finding a match in a RAM using partitioned CAM units, for the conventional partitioned CAM and our modified partitioned CAM. Panel (a) shows the query data and three CAM units; the selected content based on the input index is highlighted for each CAM unit. Panel (b) shows the outputs of the conventional partitioned CAM and our modified partitioned CAM. Panel (c) shows the content of the corresponding RAM. The selected content in the RAM unit is highlighted.

Figure 5(b) demonstrates the structure of the comparator tree in Figure 5(a). The comparator tree consists of log2(N) layers, where N is the number of key points for each image. Each layer in Figure 5(b) contains a number of Comparator units (a comparator and a multiplexer), as shown in Figure 5(c). Each Comparator unit compares two values from the previous layer and, by applying the Greater-than (Gt) signal of the comparator to the multiplexer, passes the larger value to the next layer.
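The comparator tree can be sketched behaviorally as follows. This is an illustrative software model under the assumption that N is a power of two; each layer keeps the larger of every pair of values together with its key-point index, mirroring the Comparator unit of Figure 5(c).

```python
def comparator_tree(values):
    """Reduce a list of column sums through log2(N) layers of pairwise
    comparisons; return (key-point index, maximum value)."""
    layer = list(enumerate(values))          # (key-point index, sum value)
    while len(layer) > 1:
        layer = [a if a[1] > b[1] else b     # Gt signal drives the multiplexer
                 for a, b in zip(layer[::2], layer[1::2])]
    return layer[0]

index, max_val = comparator_tree([3, 9, 2, 7])
```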

Fig. 5 .
Fig. 5. (a) Architecture of our matching module using partitioned CAM. The gray boxes are registers. The summation result is computed using N adder trees. Parameter m is the number of input bits for each CAM unit, and k is the total number of CAM units. If the descriptor has 256 bits and m = 4, then the number of CAM units is k = 64. (b) Comparator tree for selecting the maximum value using the Comparator units shown in panel (c). N is the number of key points in the reference image. (c) Architecture of each Comparator unit, which outputs the maximum of its two inputs (Value 1 and Value 2).

Fig. 7 .
Fig. 7. Timing diagram showing the filling of the CAM. After detecting a key point, three clock cycles are required to fill the CAM with the location of the descriptor corresponding to the detected key point.

Fig. 8 .
Fig. 8. Timing diagram showing the reading from the CAM for three consecutive key points. At each clock cycle, the pipeline registers are filled with the key-point data.

Figures 9 and 10 illustrate
Figures 9 and 10 illustrate Precision and Recall as these two parameters are tuned on the HPatches dataset. The Precision and Recall evaluation metrics, which are common metrics for binary descriptor matching, are used in this work to validate the accuracy of our method. The Recall in this experiment is computed as the number of correct proposed matches divided by the number of matches among all known key points for each pair of images, according to Equation (2):

Recall = # of correct proposed matches / # of matches of known key points. (2)

Fig. 9 .
Fig. 9. The effect of changing the summation threshold (Sum_val_threshold) on the Precision and Recall of matching.

Fig. 10 .
Fig. 10. Effect of changing the Hamming distance threshold on Precision and Recall.

Table 1 .
Comparison of Binary Descriptor-matching Speed of Our Design with Other Work
*The latency was not reported and is computed based on the reported frame rate.
**Not applicable (the matching stage does not depend on image size; the latency is affected only by the number of bits and the number of key points).

Table 2 .
Comparison of Precision and Recall Metrics

Table 3 .
Resource Utilization for Various Numbers of Bits

Table 4 .
Resource Utilization with Various Numbers of Key Points
*Each BRAM unit contains 36 kb.
*Each LUT RAM unit contains 64 bits.
*The percentage of resource usage is shown in parentheses.

Table 5 .
Power Usage and Maximum Operating Frequency for Various Numbers of Bits and Key Points

Table 6 .
Resource Utilization Metrics for 500 Key Points and 1,920 × 1,080 Images on KCU105