A Survey of Computer Vision Technologies In Urban and Controlled-environment Agriculture

In the evolution of agriculture to its next stage, Agriculture 5.0, artificial intelligence will play a central role. Controlled-environment agriculture, or CEA, is a special form of urban and suburban agricultural practice that offers numerous economic, environmental, and social benefits, including shorter transportation routes to population centers, reduced environmental impact, and increased productivity. Due to its ability to control environmental factors, CEA couples well with computer vision (CV) in the adoption of real-time monitoring of plant conditions and autonomous cultivation and harvesting. The objective of this paper is to familiarize CV researchers with agricultural applications and agricultural practitioners with the solutions offered by CV. We identify five major CV applications in CEA, analyze their requirements and motivation, and survey the state of the art as reflected in 68 technical papers using deep learning methods. In addition, we discuss five key subareas of computer vision and how they relate to these CEA problems, as well as nine vision-based CEA datasets. We hope the survey will help researchers quickly gain a bird's-eye view of this thriving research area and will spark inspiration for new research and development.


INTRODUCTION
Artificial intelligence (AI), especially computer vision (CV), is finding an ever broadening range of applications in modern agriculture. The next stage of agricultural technological development, Agriculture 5.0 [15,100,236,361], will feature AI-driven autonomous decision making as a central component. The term Agriculture 5.0 stems from a chronology [361] that begins with Agriculture 1.0, which heavily depends on human labor and animal power, and Agriculture 2.0, enabled by synthetic fertilizers, pesticides, and combustion-powered machinery, and develops to Agriculture 3.0 and 4.0, characterized by GPS-enabled precision control and Internet-of-Things (IoT) driven data collection [257]. Built upon the rich agricultural data collected, Agriculture 5.0 holds the promise to further increase productivity, satiate the food demand of a growing global population, and mitigate the negative environmental impact of existing agricultural practices.
As an integral component of Agriculture 5.0, controlled-environment agriculture (CEA), a farming practice carried out within urban, indoor, resource-controlled, and sensor-driven factories, is particularly suitable for the application of AI and CV. This is because CEA provides ample infrastructure support for data collection and autonomous execution of algorithmic decisions. In terms of productivity, CEA could produce higher yield per unit area of land [8,9] and boost the nutritional content of agricultural products [162,313]. In terms of environmental impact, CEA farms can insulate crops from environmental influences, reduce the need for fertilizer and pesticides, and efficiently utilize recycled resources like water, and may therefore be much more environmentally friendly and self-sustainable than traditional farming.
In the light of current global challenges, such as disruptions to global supply chains and the threat of climate change, CEA appears especially appealing as a food source for urban population centers. Under pressures of deglobalization brought by geopolitical tensions [371] and global pandemics [237,276], CEA provides the possibility to build farms close to large cities, which shortens the transportation distance and maintains secure food supplies even when long-distance routes are disrupted. The city-state Singapore, for example, has promised to source 30% of its food domestically by 2030 [1,315], which is only possible through suburban farms such as CEAs. Furthermore, CEA, as a form of precision agriculture, is by itself a viable solution to the reduction of the emission of greenhouse gases [9,37,249]. CEA can also shield plants from adverse climate conditions exacerbated by climate change as its environments are fully controlled [112] and is able to effectively reuse the arable land eroded due to climate change [373].
We argue that AI and CV are critical to the economic viability and long-term sustainability of CEAs as these technologies could save expenses associated with production and improve productivity. Suburban CEAs have high land costs. An analysis in Victoria, Australia [38] shows that, due to the higher land cost resulting from proximity to cities, even with an estimated 50-fold productivity improvement per land area, it still takes 6 to 7 years for a CEA to reach the break-even point. Thus, further productivity improvement from AI would act as a strong driver for CEA adoption. Moreover, the stacked setup of vertical farms makes it harder for farmers to perform daily surveillance and operations. Automated solutions empowered by computer vision could effectively solve this problem.
Finally, AI and CV technologies have the potential to fully characterize the complex, individually different, time-varying, and dynamic conditions of living organisms [39], which will enable precise and individualized management and further elevate yield. Thus, AI and CV technologies appear to be a natural fit to CEAs.
Most of the recent development of AI can be attributed to the newly discovered capability to train deep neural networks [175] that can (1) automatically learn multi-level representations of input data that are transferable to diverse downstream tasks [65,137], (2) easily scale up to match the growing size of data [291], and (3) conveniently utilize massively parallel hardware architectures like GPUs [114,337]. As function approximators, deep neural networks prove to be surprisingly effective in generalizing to previously unseen data [363]. Deep learning has achieved tremendous success in computer vision [302], natural language processing [47,83,118], multimedia [23,88], robotics [300], game playing [278], and many other areas.
The AI revolution in agriculture is already underway. State-of-the-art neural network technologies, such as ResNet [134] and MobileNet [139] for image recognition, and Faster R-CNN [244], Mask R-CNN [133], and YOLO [239] for object detection, have been applied to the management of crops [197], livestock [142,308], and plants in indoor and vertical farms [245,366]. AI has been used to provide decision support in a myriad of tasks from DNA analysis [197] and growth monitoring [245,366] to disease detection [262] and profit prediction [28].
While several surveys have explored the use of computer vision (CV) techniques in agriculture, none of them specifically focuses on CEA applications. Some surveys summarize studies based on aspects of practical applications in agriculture. [74,89,123,149,286] survey pest and disease detection studies. [40,111,312] discuss fruit and vegetable quality grading and disease detection. [307] summarizes studies in six sub-fields, including crop growth monitoring, pest and disease detection, automatic harvesting/fruit detection, fruit quality testing, automated management of modern farms, and the monitoring of farmland information with Unmanned Aerial Vehicles (UAVs). Other surveys organize existing works from a technical perspective, namely the algorithms used [241] or the formats of data [56]. [154], as an exception, introduces the development history of CV and AI in smart agriculture without investigating any individual studies.
Our work aims to address this gap and provide insights tailored to CEA-specific contexts.
As the volume of research in smart agriculture grows rapidly, we hope the current review article can bridge researchers from AI and agriculture and create a mild learning curve when they wish to familiarize themselves with the other area. We believe computer vision has the closest connections with, and is the most immediately applicable in, urban agriculture and CEAs. Hence, in this paper, we focus on reviewing deep-learning based computer vision technologies in urban farming and CEAs. We focus on deep learning because it is the predominant approach in AI and CV research. The contributions of this paper are two-fold, with the former targeted at AI researchers and the latter targeted at agriculture researchers:
• We identify five major CV applications in CEA and analyze their requirements and motivation. Further, we survey the state of the art as reflected in 68 technical papers and 14 vision-based CEA datasets.
• We discuss five key subareas of computer vision and how they relate to CEA. In addition, we identify four potential future directions for research in CV for CEA.
In Figure 1 we provide a graphical preview of our content. It illustrates the end-to-end agriculture process of CEAs, from seed planting to harvest and sales, with five major deep learning based CV applications (Growth Monitoring, Fruit and Flower Detection, Fruit Counting, Maturity Level Classification, and Pest and Disease Detection) mapped to the corresponding applicable plant growth stages. We do not survey the autonomous seed planting and harvesting steps as they are more relevant to robot functioning and robotic control, i.e., grasping, carrying, and placing objects, than to computer vision (we do include the localization of fruit in the fruit and flower detection section, which helps harvesting robots locate the target object and act on it). However, we provide here some literature related to agriculture robot and end-effector design for reference [36,57,92,235,362].
We structure the survey following the process in the figure. First, to provide a bird's-eye view of CV capabilities available to researchers in smart agriculture, we summarize several major CV problems and influential technical solutions in §2. Next, we review 68 papers with respect to the application of computer vision in the CEA system in §3. The discussion is organized into five subsections: Growth Monitoring, Fruit and Flower Detection, Fruit Counting, Maturity Level Classification, and Pest and Disease Detection. In the discussion, we focus on fruits and vegetables that are suitable for CEA, including tomato [10,13,127,360], mango [7], guava [277,342], strawberry [107,355], capsicum [177], banana [5], lettuce [368], cucumber [10,128,203], citrus [4], and blueberry [2]. Next, we provide a summary of fourteen publicly available datasets of plants and fruits in §4 to facilitate future studies in controlled-environment agriculture. Finally, we highlight a few research directions that could generate high-impact research in the near future in §5.
One thing to note here is that, except for the Leaf Instance Segmentation task under the Growth Monitoring section, all the tasks are performed with models trained on different datasets and evaluated on different metrics. This variety in datasets and evaluation metrics makes performance incomparable between studies. Such a phenomenon further indicates the necessity of our survey, which summarizes the current progress in the literature and encourages the development of general benchmarks to promote consistency and comparability in future research.

Image Recognition
The classic problem of image recognition is to classify an image containing a single object to the corresponding object class. The success of deep convolutional networks in this area dates (at least) back to LeNet [176] of 1998, which recognizes hand-written digits. The fundamental building block of such networks is the convolution operation. Using the principles of local connections and weight sharing, convolutional networks benefit from an inductive bias of translational invariance. That is, a convolutional network applies (approximately) the same operation to all pixel locations of the image.
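The weight-sharing idea can be illustrated with a minimal sketch in one dimension. The kernel and signal below are made-up values, and the helper name `conv1d_valid` is hypothetical. Because one kernel is applied at every position, shifting the input pattern shifts the response by the same amount (strictly speaking, this property is translation equivariance, from which pooling layers derive approximate invariance).

```python
def conv1d_valid(signal, kernel):
    """Slide one shared kernel over the signal (cross-correlation, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [1, 0, -1]                    # illustrative edge-like filter
signal = [0, 0, 5, 5, 0, 0, 0, 0]      # a small "object" at positions 2-3
shifted = [0, 0] + signal[:-2]         # the same object moved two steps right

out = conv1d_valid(signal, kernel)
out_shifted = conv1d_valid(shifted, kernel)

# The response to the shifted input is the original response, shifted by two.
print(out)          # [-5, -5, 5, 5, 0, 0]
print(out_shifted)  # [0, 0, -5, -5, 5, 5]
```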
The victory of AlexNet [166] in the 2012 ImageNet Large Scale Visual Recognition Challenge [253] is often considered as a landmark event that introduced deep neural networks into the AI mainstream. Subsequently, many variants of convolutional networks [151,173,279,297] have been proposed. Due to space limits, here we provide a brief review of a few influential works, which is by no means exhaustive. ResNet [135] introduces residual connections that allow the training of networks of more than 100 layers. ResNeXt [345] and MobileNet [140] employ grouped convolution that reduces interaction between channels and improves the efficiency of the network parameters. ShuffleNet [374] utilizes the shuffling of channels, which complements group convolution. EfficientNet [301] shows that simultaneous scaling of the network width, depth, and image resolution is key to efficient use of parameters.
Recently, the transformer model has proven to be a highly competitive architecture for image recognition and other computer vision tasks [90]. These models cut the input image into a sequence of small image patches and often apply strong regularization such as RandAugment [75]. Variants such as CaiT [311], CeiT [359], Swin Transformer [198], and others [72,78,346,380] achieve outstanding performance on ImageNet.
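As a rough illustration of the patch-cutting step only (not of any particular transformer variant), the sketch below splits a toy single-channel image into a row-major sequence of flattened patches. The function name `image_to_patches` and the 4 x 4 input are assumptions made for the example; real models additionally apply a learned linear projection and positional encodings.

```python
def image_to_patches(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tiles."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    sequence = []
    for top in range(0, h, patch):          # row-major order over the patch grid
        for left in range(0, w, patch):
            sequence.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return sequence

# Toy 4 x 4 image whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)

print(len(patches))   # 4 tokens
print(patches[0])     # [0, 1, 4, 5]
```

Under the same scheme, a 224 x 224 image cut into 16 x 16 patches yields a sequence of 14 x 14 = 196 tokens.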
Despite the maturity of the technology for image classification, the assumption that an image contains only one object may not be easily satisfied in real-world scenarios. Thus, it is often necessary to adopt a problem formulation as object detection or semantic / instance segmentation.

Object Detection
The object detection task is to identify and locate all objects in the image. It can be understood as the task that results from relaxing the assumption that the input image contains a single object. This is one natural problem formulation for real-world images and has seen wide adoption in agricultural applications.
In broad strokes, contemporary object detection methods can be categorized into anchor-box-based and point-based / proposal-free approaches. In anchor-box methods [110,243], the process starts with a number of predefined anchor boxes that are periodically tiled to cover the entire input image. For each anchor box, the network makes two types of predictions. First, it determines if the anchor box contains one of the predefined object classes. Second, if the box contains an object, the network attempts to move and reshape the box to become closer to the ground-truth location of the object. One-stage anchor-box detectors [77,101,190,196,240,376] make these predictions all at once. In comparison, two-stage detectors [110,132,189,243] discard anchor boxes that do not contain any object in the first stage and classify the remaining boxes into finer object categories in the second stage. The location adjustment, known as bounding box regression, can happen in both stages. It is also possible to employ more than two stages [48]. When the objects have diverse shapes and scales, these methods must create a large number of proposal boxes and evaluate them all, which can lead to high computational cost.
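The tiling of anchor boxes and their comparison against a ground-truth box can be sketched as follows. This is a minimal illustration under assumed values (a 64 x 64 image, a stride of 16 pixels, two anchor shapes); the helper names `make_anchors` and `iou` are hypothetical, and intersection-over-union is used here as the customary matching criterion, which the text above does not specify.

```python
def make_anchors(img_w, img_h, stride, sizes):
    """Tile anchors of the given (width, height) sizes on a regular grid of centres."""
    anchors = []
    for cy in range(stride // 2, img_h, stride):
        for cx in range(stride // 2, img_w, stride):
            for w, h in sizes:
                anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

anchors = make_anchors(64, 64, stride=16, sizes=[(16, 16), (32, 16)])
ground_truth = (16.0, 16.0, 32.0, 32.0)   # illustrative fruit bounding box

# The best-matching anchor is the one a detector would refine by box regression.
best = max(anchors, key=lambda a: iou(a, ground_truth))
print(len(anchors), best, iou(best, ground_truth))
```

Even this toy configuration produces 32 anchors for a single small image, which hints at why the number of boxes to evaluate grows quickly when many shapes and scales must be covered.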
While point-based object detectors [91,161,174,309,381] still need to identify rectangular boxes around the objects, they make predictions at the level of grid locations on the feature maps. The networks predict if a grid location is a corner or the center of an object bounding box. After that, the algorithm assembles the corners and centers into bounding boxes. The point-based approaches can reduce the total number of decisions to be made. A careful comparison and analysis of anchor-box methods and point-based methods can be found in [369].

Semantic, Instance, and Panoptic Segmentation
Segmentation is a pixel-level classification task, aiming to classify every pixel in the image into a type of object or an object instance. The variations of the task differ by their definitions of the classes. In semantic segmentation [73,94,113,170,199], each type of object, such as cat, cow, grass, or sky, is its own class, but different instances of the same object type (e.g., two cats) share the same class. In instance segmentation [76,129,131,230], different instances of the same object type become unique classes, so that two cats are no longer the same class. However, object types such as sky or grass, which are not easily divided into instances, are ignored. In the recently proposed panoptic segmentation [69,109,158,181,192,372], objects are first separated into things and stuff. Things are countable and each instance of things is its own class, whereas stuff is uncountable, impossible to separate into instances, appearing as texture or amorphous regions [12], and remains as one class. We note that the distinction between things and stuff is not rigid and can change depending on the application. For example, grass is typically considered as stuff, but in the leaf instance segmentation task, each leaf of a plant becomes an instance and is a separate class.
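The difference between the three label formats can be made concrete with a toy example. The 2 x 3 "image", the class names, and the helper `to_panoptic` below are all illustrative assumptions: leaves are treated as things and soil as stuff.

```python
# Semantic labels: one class per pixel, instances of the same type are merged.
semantic = [["leaf", "leaf", "soil"],
            ["leaf", "leaf", "soil"]]

# Instance ids: distinct leaves get distinct ids; 0 marks pixels without an instance.
instance_id = [[1, 2, 0],
               [1, 2, 0]]

THINGS = {"leaf"}   # countable classes; everything else is treated as stuff

def to_panoptic(semantic, instance_id, things):
    """Pair every pixel's class with an instance id (None for stuff classes)."""
    return [
        [(cls, inst if cls in things else None) for cls, inst in zip(sem_row, inst_row)]
        for sem_row, inst_row in zip(semantic, instance_id)
    ]

panoptic = to_panoptic(semantic, instance_id, THINGS)

semantic_classes = {cls for row in semantic for cls in row}
panoptic_segments = {seg for row in panoptic for seg in row}
print(len(semantic_classes))    # 2: leaf, soil
print(len(panoptic_segments))   # 3: two leaf instances plus soil
```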
The primary requirement of pixel-level classification is to learn pixel-level representations that consider sufficient context within a reasonable computational budget. A typical solution is to introduce a series of downsampling operations followed by a series of upsampling operations. Since classic works such as the Fully Convolutional Network (FCN) [199] and U-Net [252], this has been the mainstream strategy for various segmentation tasks.
Due to its use in leaf segmentation, a problem in plant phenotyping, instance segmentation may be the most relevant segmentation formulation for urban farming. Despite the apparent similarity to semantic segmentation, instance segmentation poses challenges due to the variable number of instance classes and possible permutation of class indices [80]. This could be handled by combining proposal-based object detection and segmentation [61,68,129,183,231].
Mask R-CNN [132] exemplifies this approach. Leveraging its object detection capability, the network associates each object with a bounding box. After that, the network predicts a binary mask for the object within the bounding box.
However, such methods may not perform well when there is substantial occlusion among objects or when objects are of irregular shapes [80].
Departing from the detect-then-segment paradigm, recurrent methods [242,251,258] that output one segmentation mask at a time may be considered as implicitly modeling occlusion. Pixel embedding methods [62,80,216,225,335,340,353] learn vector representations for every pixel and cluster the vectors. These methods are especially suitable for segmenting plant leaves and we will discuss them in greater detail in §3.1. Taking a page from the proposal-free object detector YOLO [239], SOLO [325] and SOLOv2 [326] divide the image into grids. The grid cell into which the center of an object falls is responsible for predicting the segmentation mask of the object.
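The grid-assignment rule described for SOLO can be sketched in a few lines. The function name `responsible_cell`, the 100 x 100 image, and the 5 x 5 grid are assumptions chosen for illustration; the actual models add further details (for example, a centre region rather than a single point) that are omitted here.

```python
def responsible_cell(center_x, center_y, img_w, img_h, grid):
    """Return the (row, col) grid cell that contains an object's centre."""
    col = min(int(center_x / img_w * grid), grid - 1)   # clamp centres on the border
    row = min(int(center_y / img_h * grid), grid - 1)
    return row, col

# Two illustrative objects in a 100 x 100 image divided into a 5 x 5 grid.
print(responsible_cell(55, 10, 100, 100, 5))    # (0, 2)
print(responsible_cell(100, 100, 100, 100, 5))  # (4, 4)
```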

Uncertainty Quantification
Real-world applications often require quantification of the amount of uncertainty in the predictions made by machine learning, especially when the predictions carry serious implications. For example, if the system incorrectly determines that fruits are not mature enough, it may delay harvesting and cause overripe fruits with diminished values. Thus, users of the ML system are justified in asking how certain the system is about its decision. In addition, it is desirable for the network to answer "I don't know" when facing a real-world input that it does not recognize [186].
Well-calibrated uncertainty measurements may enable such a capability. However, research shows that deep neural networks exhibit severe vulnerability to overconfidence, or underestimation of the uncertainty in their own decisions [117,206]. That is, the accuracy of the network decision is frequently lower than the probability that the network assigns to the decision. As a result, proper calibration of the networks should be a concern for systems built for real-world applications.
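The gap between assigned probability and actual accuracy can be summarized numerically. The sketch below computes the expected calibration error (ECE), a commonly used summary statistic that is not named in the text above and is introduced here only for illustration; the function name, the bin count, and the four predictions are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - accuracy)
    return ece

# An overconfident classifier: 95% confidence on every prediction, half of them wrong.
confidences = [0.95, 0.95, 0.95, 0.95]
correct = [1, 1, 0, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 2))   # 0.45
```

A perfectly calibrated model would score zero, since its confidence in each bin would match its accuracy there.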
Calibration of deep neural networks may be performed post-hoc (after training) using temperature scaling and histogram binning [87,117,321]. Also, regularization techniques applied during training, such as label smoothing [298] and mixup [138], have been shown to improve calibration [214,228,306]. Researchers propose new loss functions to replace existing ones that are susceptible to overconfidence [213,352]. Moreover, ensemble methods such as Vertical Voting [344], Batch Ensemble [332], and Multi-input Multi-output [130] can derive uncertainty estimates.
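Temperature scaling can be sketched as follows: a single scalar T divides the logits before the softmax, and T is chosen to minimize the negative log-likelihood on held-out validation data. The validation logits, the candidate grid, and the use of a grid search (instead of the gradient-based optimizers typically used in practice) are simplifying assumptions for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))

# Illustrative validation set: about 98% confidence, but only 75% accuracy.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
candidates = [0.5 + 0.25 * k for k in range(19)]   # 0.5, 0.75, ..., 5.0

T = fit_temperature(val_logits, val_labels, candidates)
before = softmax(val_logits[0])[0]
after = softmax(val_logits[0], T)[0]
print(T > 1.0, round(before, 2))   # a temperature above 1 softens the 0.98 confidence
```

Because every logit is divided by the same positive scalar, the predicted class does not change; only the confidence attached to it moves closer to the observed accuracy.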

Interpretability
Modern AI systems are known for their inability to provide faithful and human-understandable explanations for their own decisions. The unique characteristics of deep learning, such as network over-parameterization, large amounts of training data, and stochastic optimization, while being beneficial to the predictive accuracy (e.g., [27,182,282,289]), all create obstacles toward understanding how and why a neural network reaches its decisions. The lack of human-understandable explanations leads to difficulties in the verification and trust of network decisions [52,375].
We categorize model interpretation techniques into a few major classes, including visualization, feature attribution, instance attribution, inherently explainable models, and approximation by simple models. Visualization techniques present holistically what the model has learned from the training data by visualizing the model weights for direct visual inspection [34,93,97,205,212,299]. In comparison, feature attribution and instance attribution are often considered as local explanations as they aim to explain model predictions on individual samples. Feature attribution methods [22,58,60,210,234,263,273,281,295,349] generate a saliency map of an image or video frame, which highlights the pixels that contribute the most to its prediction. Instance attribution methods [32,46,67,159,233,273,350] attribute a network decision to training instances that, through the training process, exert positive or negative influence on the particular decision. Moreover, inherently explainable models [33,59,178,267,354] incorporate explainable components into the network architecture, which reduces the need to apply post-hoc interpretation techniques. In contrast, researchers also try to post-hoc approximate complex neural networks with simple models such as rule-based models [84,102,115,156,226,327] or linear models [14,105,106,163,247] that are easily understandable.
The most significant benefit of interpretation in the context of CEA lies in its ability to aid with the auditing and debugging of AI systems and datasets. With feature attribution, users can make sure the system captures the robust features, or semantically meaningful features, that generalize to real-world data. As in the well-known case of husky vs. wolf image classification, due to a spurious correlation, the neural network learns to classify all images with white backgrounds as wolf and those with green backgrounds as husky [209]. Such shortcut learning can be identified by feature attribution and subsequently corrected. Moreover, instance attribution allows researchers to pinpoint outliers or incorrectly labeled training data that may lead to misclassification [67].

CONTROLLED-ENVIRONMENT AGRICULTURE
Controlled-environment agriculture (CEA) is the farming practice carried out within urban, indoor, resource-controlled factories, often accompanied by stacked growth levels (i.e., vertical farming), renewable energy, and recycling of water and waste. CEA has recently been adopted in countries and regions around the world [38,82], such as Singapore [164], North America [6], Japan [9,272], and the UK [8].
CEA has economic and environmental benefits. Compared to traditional farming, CEA farms produce higher yield per unit area of land [8,9]. Controlled environments shield the plants from seasonality and extreme weather, so that plants can grow all year round given suitable lighting, temperature, and irrigation [38]. The growing conditions can also be further optimized to boost growth and nutritional content [162,313]. Rapid turnover increases farmers' flexibility in plant choice to follow consumption trends [35]. Moreover, farms' spending on pesticides, herbicides, and transportation can be cut down due to reduced contamination from the outside environment and proximity to urban consumers.
CEA farms, when designed properly, can become much more environmentally friendly and self-sustainable than traditional farming. With optimized growing conditions and limited external interference, the need for fertilizer and pesticides decreases, so that we can reduce the amount of chemicals that go into the environment as well as the resulting pollution. Furthermore, CEA farms can save water and energy through the use of renewable energy and aggressive water recycling. For instance, CEA farms from Spread, a Japanese company, recycle 98% of used water and reduce the energy cost per head of lettuce by 30% with LED lighting [9]. Finally, CEA farms can be situated in urban or suburban areas, thereby reducing transportation and storage cost. A simulation for different farm designs in Lisbon shows that vertical tomato farms with appropriate designs emit less greenhouse gas than conventional farms, mainly due to reduced water use and transportation distance [37].
A significant drawback of CEA, however, lies in its high cost, which may be partially addressed by computer vision technologies. According to [38], the higher land cost in Victoria, Australia means that the yield of vertical farms has to be at least 50 times that of traditional farming to break even. Computer vision holds the promise of boosting the level of automation and increasing yield, thereby making CEA farms economically viable. As discussed in the following sections, CV techniques can reduce a large share of variable costs, such as the wastage caused by incorrect or delayed harvesting decisions, and provide long-term benefits.
While it has the potential to reduce costs significantly, setting up a computer vision system in the field costs far less than constructing a CEA building. Building a CEA structure involves high upfront costs, including construction, insulation, lighting, and HVAC systems. According to [272], a 1,300 square meter CEA building with a production area of 4,536 square meters would require a capital investment of $7.4 million and incur annual operational costs of approximately $3.4 million.
On the other hand, setting up hardware systems for CV models is relatively inexpensive. The necessary components include servers (CPU, GPU, memory, storage), sensors, cameras, networking, and a cooling system. For example, a server with specifications like a 32-core 2.80 GHz Intel Xeon Platinum 8462Y+, 128 GB of memory, 4 NVIDIA RTX A6000 "Ada" GPUs, and 2 TB of storage costs around $60,000. Using this server to train a standard VGG-16 architecture on 5,000 images of size 224x224 pixels, with a batch size of 64 and 50 training epochs on the 4 GPUs, the estimated training time is less than an hour. Such a server is sufficient for daily training and inference of commonly used CV models. For a camera system, if we consider 10 surveillance cameras such as the Hikvision DS-2CD2142FWD-I, the total cost would be around $1,400. Additionally, a high-speed network infrastructure is required to transfer data between the computer hardware, storage, and camera systems. Covering an area of 1,300 square meters typically requires 4 to 7 routers, costing approximately $2,000. Finally, a liquid cooling system could cost between $1,000 and $2,000. In summary, a hardware system with a total cost of around $70,000 is sufficient for the daily operation, training, and inference of CV systems.
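The arithmetic behind this estimate can be checked with a short tally. All figures are the ones quoted above and in the preceding capital-investment estimate; the variable names are illustrative, and the comparison with the building cost is an added back-of-the-envelope step rather than a figure from the cited sources.

```python
# Itemized hardware costs in US dollars, as quoted in the text.
server = 60_000
cameras = 1_400                          # 10 surveillance cameras in total
network = 2_000                          # 4 to 7 routers
cooling_low, cooling_high = 1_000, 2_000

total_low = server + cameras + network + cooling_low
total_high = server + cameras + network + cooling_high
print(total_low, total_high)             # 64400 65400, rounded up to ~$70,000 in the text

# Share of the $7.4 million capital investment quoted for the CEA building.
cea_capital_investment = 7_400_000
share = total_high / cea_capital_investment
print(f"{share:.1%}")                    # 0.9%
```

Under these figures, the CV hardware amounts to less than one percent of the building's capital investment.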
CEA can take diverse form factors [35]. Nevertheless, with the autonomous setup of CEAs, which allows easy collection of new data, training a new CV model or fine-tuning a previous model to adapt to the above-mentioned changeable environment would be straightforward. Besides, there are also few-shot learning [294,328], weakly-supervised learning [16,222,382], and unsupervised learning techniques [49,261], which require minimal or zero annotations and can facilitate the adjustment of the models.
Besides environmental change, there also exist other factors that need to be taken into account when applying CV techniques in CEA. Two typical problems to consider are (1) how to cope with sub-optimal data with label noise and unbalanced class distributions, and (2) how to interpret the predictions of models or measure their uncertainty so that users can use the models with confidence. Quantitative measures of confidence or uncertainty would allow farmers to understand the decision generation process and make decisions with more confidence. Table 1 maps these factors to CV problems and lists the corresponding solutions and the sections that discuss them.
In the following, we investigate the application of autonomous computer vision techniques on Growth Monitoring, Fruit and Flower Detection, Fruit Counting, Maturity Level Classification and Pest and Disease Detection to increase production efficiency. In addition to existing applications, we also include techniques that can be easily applied to vertical farms even though they have not yet been applied to them.
[Table fragment, leaf instance segmentation results (method, SBD, |DiC|): [225], 84.5, 1.5; Crop Leaf and Plant Instance Segmentation [333], 91.1, 1.8; W-Net (GT-FG) [340], 91.9, -; SPOCO (GT-FG) [336], 93.2, 1.7]

Growth Monitoring
Growth monitoring, a critical component of plant phenotyping, aims to understand the life cycle of plants and estimate yield [148] by monitoring various growth indicators such as plant size, number of leaves, leaf size, and the land area covered by the plant. Plant growth monitoring helps quantify the effects of biological and environmental factors on growth and is thus crucial for finding the optimal growing conditions and developing high-yield crops [215,303].
As early as 1903, Wilhelm Pfeffer recognized the potential of image analysis in monitoring plant growth [229,287].
Traditional machine vision techniques, such as gray-level pixel thresholding [224], Bayesian statistics [45], and shallow learning techniques [150,357], have been applied to segment the objects of interest, such as leaves and stems, from the background to analyze plant growth. Compared to traditional methods, deep-learning techniques provide automatic representation learning and are less sensitive to image quality variations. For this reason, deep learning techniques for growth monitoring have recently gained popularity.
Among various growth indicators, leaf size and number of leaves per plant are the most commonly used [121,148,172,260]. Therefore, in the section below, we first discuss leaf instance segmentation, which can support both indicators at the same time, followed by a discussion of techniques for only leaf counting or for other growth indicators.
3.1.1 Leaf Instance Segmentation. Due to the popularity of the CVPPP dataset [207], the segmentation of leaf instances has attracted special attention from the computer vision community and warrants its own section. Leaf instance segmentation methods include recurrent network methods [242,251] and pixel embedding methods [62,80,225,333,340].
Parallel proposal methods are popular for general-purpose segmentation (see the section on Semantic, Instance, and Panoptic Segmentation), but are ill-suited for leaf segmentation. As most leaves have irregular shapes, the rectangular proposal boxes used in these methods do not fit the leaves well, resulting in many poorly positioned boxes. In addition, the density of leaves causes many proposal boxes to overlap and compounds the fitting problem. As a result, it is difficult to pick out the best proposal box from the large number of parallel proposals. Therefore, we focus on recurrent network based methods and pixel embedding based methods in this section. Quality metrics for leaf segmentation include Symmetric Best Dice (SBD) and Absolute Difference in Count (|DiC|). SBD measures the average overlap between the predicted and ground-truth leaf masks. |DiC| measures the average absolute difference between the predicted and ground-truth leaf counts over the entire test set.
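Both metrics can be sketched compactly. The code below assumes the definition commonly used with the CVPPP leaf segmentation benchmark, in which SBD is the smaller of the two directional best-Dice scores (prediction to ground truth and ground truth to prediction); masks are represented as sets of pixel coordinates, and all names and toy values are illustrative.

```python
def dice(a, b):
    """Dice overlap between two masks given as sets of pixel coordinates."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def best_dice(source, target):
    """For each source mask, take its best Dice against any target mask; average."""
    if not source:
        return 0.0
    return sum(max((dice(s, t) for t in target), default=0.0) for s in source) / len(source)

def symmetric_best_dice(pred, gt):
    """SBD: the smaller of the two directional best-Dice scores."""
    return min(best_dice(pred, gt), best_dice(gt, pred))

def abs_diff_in_count(pred_counts, gt_counts):
    """|DiC|: mean absolute difference in leaf count over a set of images."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)

# Toy image with two ground-truth leaves; the second prediction misses one pixel.
gt_masks = [{(0, 0), (0, 1)}, {(1, 0), (1, 1)}]
pred_masks = [{(0, 0), (0, 1)}, {(1, 0)}]

sbd = symmetric_best_dice(pred_masks, gt_masks)
dic = abs_diff_in_count(pred_counts=[5, 7], gt_counts=[6, 7])
print(round(sbd, 3), dic)   # 0.833 0.5
```

Taking the minimum of the two directions penalizes both over-segmentation (extra predicted leaves) and under-segmentation (missed leaves).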
Recurrent network based methods output the mask of one leaf at a time. Their decisions are usually informed by the already segmented parts of the image, which are summarized by the recurrent network. [242] applies LSTM and DeconvNet to segment one leaf at a time. The network first locates a bounding box for the next leaf and then performs segmentation within that box. After that, leaves segmented in all previous iterations are aggregated by the recurrent network and passed to the next iteration as contextual information. [251] employs convolution-based LSTMs (ConvLSTM) with FCN feature maps as input. At each time step, the network outputs a single-leaf mask and a confidence score. During inference, the segmentation stops when the confidence score drops below 0.5. [259] proposes a similar method that combines feature maps at different abstraction levels for prediction.
Pixel embedding methods learn vector representations for the pixels so that pixels in irregularly shaped leaves form regularly shaped clusters in the representation space, where they can be clustered directly. [333] performs simultaneous instance segmentation of leaves and plants. The authors propose an encoder-decoder framework, based on ERFNet [250], with two decoders. One decoder predicts the centroids of plants and leaves. The other decoder predicts the offset of each leaf pixel to the leaf centroid. The pixel location plus the offset vector should hence be very close to the leaf centroid. The dispersion among all pixels of the same leaf can be modeled as a Gaussian distribution, whose covariance matrix is also predicted by the second decoder and whose mean comes from the first decoder. The training maximizes the Gaussian likelihood for all pixels of the same leaf. The same process is applied to pixels of the same plant.
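The likelihood objective described for [333] can be illustrated with a small sketch. It assumes a diagonal covariance (one variance per image axis) and treats the centroid and variances as given numbers rather than network outputs; the exact loss used in [333] may differ, and all names here are hypothetical.

```python
import math


def gaussian_log_likelihood(pixel, offset, centroid, var):
    """Log-density of (pixel + offset) under a 2D Gaussian centred on centroid.

    var = (var_row, var_col) is a diagonal covariance, which is an assumption
    made for this sketch.
    """
    ll = 0.0
    for p, o, c, v in zip(pixel, offset, centroid, var):
        shifted = p + o  # where the pixel "votes" for the leaf centroid
        ll += -0.5 * math.log(2.0 * math.pi * v) - (shifted - c) ** 2 / (2.0 * v)
    return ll


def leaf_loss(pixels, offsets, centroid, var):
    """Negative mean log-likelihood over all pixels of one leaf."""
    total = sum(gaussian_log_likelihood(p, o, centroid, var)
                for p, o in zip(pixels, offsets))
    return -total / len(pixels)


# Two pixels of one leaf whose centroid is (2, 3).
pixels = [(0, 0), (4, 4)]
good_offsets = [(2, 3), (-2, -1)]   # both land exactly on the centroid
bad_offsets = [(0, 0), (0, 0)]      # no shift towards the centroid
good = leaf_loss(pixels, good_offsets, (2, 3), (1.0, 1.0))
bad = leaf_loss(pixels, bad_offsets, (2, 3), (1.0, 1.0))
```

Minimizing this loss pulls the shifted pixel positions towards the centroid, while the predicted variances control how tight each leaf's cluster is expected to be.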
[62,225,340] are three similar pixel embedding methods. They encourage pixels from the same leaf to have similar embeddings and pixels from different neighboring leaves to have different embeddings, which enables clustering in the embedding space. Their networks consist of two modules, the distance regression module and the pixel embedding module.
[225,340] arrange the two modules in sequence, while [62] places them in parallel. The distance regression module predicts the distance between each pixel and the closest object boundary. The pixel embedding module generates an embedding vector for each pixel. During inference, pixels are clustered around leaf centers, which are identified as local maxima in the distance map from the distance regression module.
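The inference step described above can be sketched as follows, assuming the distance map and the per-pixel embeddings are already available as nested lists. Assigning each foreground pixel to the center with the nearest embedding is one plausible clustering rule chosen for illustration; the surveyed methods may use a different rule, and the toy data are invented.

```python
def local_maxima(dist, min_value=1.0):
    """Return (row, col) of pixels that strictly exceed all of their neighbours."""
    h, w = len(dist), len(dist[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = dist[r][c]
            if v < min_value:
                continue
            neighbours = [dist[rr][cc]
                          for rr in range(max(0, r - 1), min(h, r + 2))
                          for cc in range(max(0, c - 1), min(w, c + 2))
                          if (rr, cc) != (r, c)]
            if all(v > n for n in neighbours):
                peaks.append((r, c))
    return peaks


def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_pixels(dist, emb):
    """Assign every foreground pixel (dist > 0) to the closest leaf center."""
    peaks = local_maxima(dist)
    labels = {}
    if not peaks:
        return peaks, labels
    for r, row in enumerate(dist):
        for c, v in enumerate(row):
            if v <= 0:
                continue
            labels[(r, c)] = min(
                range(len(peaks)),
                key=lambda i: sq_dist(emb[r][c], emb[peaks[i][0]][peaks[i][1]]))
    return peaks, labels


# Toy example: two leaves in the middle row, with 1-D embeddings.
DIST = [[0, 0, 0, 0, 0, 0, 0],
        [1, 2, 1, 0, 1, 2, 1],
        [0, 0, 0, 0, 0, 0, 0]]
EMB = [[(0.0,)] * 7,
       [(0.0,), (0.1,), (0.0,), (0.5,), (1.0,), (0.9,), (1.0,)],
       [(0.0,)] * 7]
peaks, labels = cluster_pixels(DIST, EMB)
```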
Lastly, [80,336] take a large-margin approach. They ensure that the embeddings of pixels from the same leaf are within a circular margin of the leaf center and that the embeddings of different leaf centers are far away from each other. This removes the need to determine the leaf centroids during inference because the embeddings are already well separated. [336] builds upon the method in [80] to perform pixel embedding and clustering of leaves under weak supervision, with annotations on only a subset of instances in the images. In addition, a differentiable instance-level loss for a single leaf is formed by comparing a Gaussian-shaped soft mask with the corresponding ground-truth mask, which overcomes the non-differentiability of assigning pixels to instances. Finally, consistency regularization, which encourages agreement between two embedding frameworks, is applied to improve the embeddings of unlabeled pixels.
Comparing the different approaches, proposal-free pixel embedding techniques seem to be the best choice for the leaf segmentation problem. As can be seen from Table 2, pixel embedding methods obtain both the highest SBD and the lowest |DiC|. One thing to note, however, is that the superior results of W-Net [340] and SPOCO [336] could be attributed to the inclusion of ground-truth foreground masks during inference. Even though the recurrent approach does not generate a large number of proposal boxes at once, it still uses rectangular proposals and therefore still suffers from the poor fit to irregular leaf shapes. Moreover, recurrent methods are usually slower than pixel embedding methods, due to the temporal dependence between the leaves.

Category | Technique | Evaluation Metric | Performance | Dataset
Fruit Object Detection | [360] | Precision (IoU > 0.5) | 94% | 1730 images of cherry tomatoes
Fruit Object Detection | [141] | Accuracy (IoU unspecified) | 95.50% | 800 images of tomatoes
Fruit Object Detection | [255] | F1 scores (IoU unspecified) | 83.80% | 122 images of 7 fruits
Fruit Object Detection | [365] | True positive rate and False positive rate (IoU unspecified) | 98%, 17% | 2116 self-acquired images of fruits and 511 images of fruits from ImageNet
Fruit Object Detection | [355] | Precision and Recall (IoU > 0.9) | 94.4%, 93.5% | 2000 images of strawberries
Fruit Object Detection | [270] | F1 scores (IoU unspecified) | |

Besides leaf size and leaf count, leaf fresh weight, leaf dry weight, and plant coverage (the area of land covered by the plant) are also used as metrics of growth. [368] applies a CNN to regress leaf fresh weight, leaf dry weight, and leaf area of lettuce from RGB images. [246] makes use of Mask R-CNN, a parallel proposal method, for lettuce instance segmentation.
The authors derive plant attributes such as contour, side-view area, height, and width from the segmentation masks and bounding boxes, using preset formulas. They also estimate growth rate from the changes in plant area at each time step and estimate fresh weight by linear regression on the attributes. [201] uses a Mask R-CNN pretrained on the COCO dataset, with ResNet-50 as the backbone, to segment lettuce leaves. The daily change of mean leaf area is used for growth rate calculation.

Fruit and Flower Detection
Algorithms for fruit and flower detection find the location and spatial distribution of fruits and fruit flowers. This task supports various downstream applications such as fruit count estimation, size estimation, weight estimation, robotic pruning, robotic harvesting, and disease detection [31,108,202,351]. In addition, fruit or flower detection may help devise plantation management strategies [108,127] because fruit or flower statistics such as positions, facing directions (the directions the flowers face), and spatial scatter can reveal the status of the plant and the suitability of environmental conditions. For example, knowledge of the flower distribution may allow pruning strategies that focus on regions of excessive density and achieve an even distribution of fruits, which optimizes the delivery of nutrients to the fruits.
Traditional approaches for fruit detection rely on manual feature engineering and feature fusion. As fruits tend to have unique colors and shapes, one natural approach is to apply thresholding on color [223,331] and shape information [195,221]. Additionally, [55,187,211] employ a combination of color, shape, and texture features. However, manual feature extraction is brittle when the image distribution changes with different camera resolutions, camera angles, illumination, and species [30].
Deep learning methods for fruit detection include object detection and segmentation. [360] applies SSD for cherry tomato detection. [141] leverages Faster R-CNN to detect tomatoes. Inside the generated bounding boxes, color thresholding and fuzzy-rule-based morphological processing methods are applied to remove the image background and obtain the contours of individual tomatoes. [255] leverages Faster R-CNN with VGG-16 as the backbone for sweet pepper detection. RGB and near-infrared (NIR) images are used together for detection. Two fusion approaches, early and late fusion, are proposed. Early fusion alters the first pretrained layer to allow 4 input channels (RGB and NIR), whereas late fusion aggregates the two modalities by training independent proposal models for each modality and then combining the proposed boxes by averaging the predicted class probabilities. [365] trains three multi-task cascaded convolutional networks (MTCNN) [364] for detecting apples, strawberries, and oranges. MTCNN contains a proposal network, a bounding box refinement network, and an output network in a feature pyramid architecture with gradually increasing input sizes for each network. The model is trained on synthetic images, which are random combinations of cropped negative patches and fruit patches, in addition to real-world images. [355] proposes R-YOLO with MobileNet-V1 as the backbone to detect ripe strawberries. Unlike the regular horizontal bounding boxes in object detection, the model generates rotated bounding boxes by adding a rotation-angle parameter to the anchors.
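To make the rotated bounding box idea concrete, the sketch below converts a box parameterized by its center, size, and a rotation angle into its four corner points. The (cx, cy, w, h, angle) parameterization and the counter-clockwise angle convention are assumptions for illustration; [355] may define the angle differently.

```python
import math


def rotated_box_corners(cx, cy, w, h, angle_deg):
    """Corners of a w x h box centred at (cx, cy) and rotated by angle_deg.

    The rotation is counter-clockwise in a standard x-right, y-up frame;
    in image coordinates (y pointing down) it appears clockwise.
    """
    t = math.radians(angle_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        # Rotate the corner offset, then translate it to the box centre.
        corners.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return corners


axis_aligned = rotated_box_corners(10.0, 5.0, 4.0, 2.0, 0.0)
quarter_turn = rotated_box_corners(0.0, 0.0, 4.0, 2.0, 90.0)
```

With an angle of zero the function reduces to an ordinary horizontal box, so the extra parameter strictly generalizes the usual anchor format.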
Delicate fruits, such as strawberries and tomatoes, are particularly vulnerable to damage during harvesting. Therefore, much research has been devoted to segmenting such fruits from backgrounds in order to determine the precise picking point. Precise fruit masks are expected to enable robotic fruit picking while avoiding damage to the neighboring fruits.
[188] performs semantic segmentation for guava fruits and determines their poses using FCN with RGB-D images as input. The FCN outputs a binary mask for fruits and another binary mask for branches. With the fruit binary mask, the authors employ Euclidean clustering [254] to separate individual guava fruits. From the clustering result and the branch binary mask, fruit centroids and the closest branch are located. Finally, the system predicts the vertical axis of the fruit as the direction perpendicular to the closest branch to facilitate robotic harvesting. Similarly, [13] leverages Mask R-CNN with ResNet as the backbone for semantic segmentation of tomatoes. In addition, the authors filter out false-positive detections of tomatoes from non-targeted rows by setting a depth threshold. [107] utilizes Mask R-CNN with a ResNet101 backbone to perform instance segmentation of ripe strawberries, raw strawberries, straps, and tables. Depth images are aligned with the segmentation mask to project the shape of strawberries into 3D space to facilitate automatic harvesting. [356] also applies Mask R-CNN with a ResNet101 + FPN backbone to perform instance segmentation and ripeness classification on strawberries. [143] leverages a similar network for instance segmentation of tomatoes. With the segmentation mask, the systems determine the cut points of the fruits.
Besides accuracy, the processing speed of neural networks is also important for their deployment on mobile devices or agricultural robots. [270] performs network pruning on YOLOv3-tiny to form a lightweight mango detection network.
A YOLOv3-tiny pretrained on the COCO dataset has learned to extract fruit-relevant features because the COCO dataset contains apple and orange images, but it has also learned irrelevant features. The authors thus use a generalized attribution method [274] to determine the contribution of each layer to fruit feature extraction and remove convolution kernels responsible for detecting non-fruit classes. They find that the lower-level features are shared across the detection of all classes and that pruning in the higher layers does not harm fruit detection performance. After pruning, the network requires significantly fewer floating-point operations (FLOPs) at the same level of accuracy.
[Table row residue: Average Precision (IoU > 0.5), RMSE | 71.6%, 1.484 | 724 images of blueberries]
Object detection is also applied for flower detection. [202] proposes a modified YOLOv4-Tiny with cascade fusion (CFNet) to detect citrus buds, citrus flowers, and gray mold, which is a disease commonly found on citrus plants. The authors additionally propose a block module with channel shuffle and depthwise separable convolution for YOLOv4-Tiny.
[292] shrinks the anchor boxes of Faster R-CNN to fit small fruits and applies soft non-maximum suppression to retain boxes that may contain occluded objects. As flowers usually have similar morphological characteristics, flowers from other non-targeted species could possibly be used as training data in a transfer learning scenario. In [293], the authors fine-tune a DeepLab-ResNet model [63] for fruit flower detection. The model is trained on an apple flower dataset but achieves high F1 scores on pear and peach flower images (0.777 and 0.854 respectively).
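The soft non-maximum suppression step mentioned for [292] can be sketched as below. This is the Gaussian-decay variant of soft-NMS written in plain Python; the values of `sigma` and `score_thresh`, and the choice of the Gaussian rather than the linear decay, are assumptions and not taken from [292].

```python
import math


def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS: decay the scores of overlapping boxes instead of deleting them.

    Returns a list of (box_index, final_score) in the order boxes were selected.
    """
    remaining = list(enumerate(scores))
    kept = []
    while remaining:
        best = max(remaining, key=lambda t: t[1])
        remaining.remove(best)
        kept.append(best)
        decayed = []
        for i, s in remaining:
            overlap = iou(boxes[best[0]], boxes[i])
            s = s * math.exp(-(overlap ** 2) / sigma)  # heavier overlap, stronger decay
            if s >= score_thresh:
                decayed.append((i, s))
        remaining = decayed
    return kept


# Two heavily overlapping fruits plus one isolated fruit.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
kept = soft_nms(boxes, [0.9, 0.8, 0.7])
```

In this toy case hard NMS with a typical IoU threshold would discard the second box outright, whereas soft-NMS keeps it with a reduced score, which is the behavior that helps with partially occluded fruits.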

Fruit Counting
Pre-harvest estimation of yields plays an important role in the planning of harvesting resources and marketing strategies [136,347]. As fruits are usually sold to consumers as a pack of uniformly sized fruits or as individual fruits, the fruit count also provides an effective yield metric [160], besides the distribution of fruit sizes. Traditional yield estimation is obtained through manual counting of samples from a few randomly selected areas [136]. Nonetheless, when the production is large-scale, counteracting the effect of plant variability requires a large quantity of samples from different areas of the field for an accurate estimate, resulting in high cost. Thus, researchers resort to CV-based counting methods.
A direct counting method is to regress the fruit count from the image. In [238], the authors apply a modified version of Inception-ResNet for direct tomato counting. The authors train the model on simulated images and test on real images, which suggests, once again, the viability of using simulated images to circumvent the cost of building a large dataset.
Besides direct regression, object detection [160,329], semantic segmentation [157], and instance segmentation [219] have also been used for fruit counting. These methods provide an intermediate level of results from which the count can be easily gathered. [160] proposes MangoYOLO based on YOLOv2-tiny and YOLOv3 for mango detection and counting.
The authors increase the resolution of the feature map to facilitate detection of small fruits. [124] proposes a pre-trained Faster R-CNN network, building upon DeepFruits [255], to estimate the quantity of sweet peppers. The authors design a tracking sub-system for sweet pepper counting. The sub-system identifies new fruits by measuring the IoU between, and comparing the boundaries of, detected and new fruits. [157] performs semantic segmentation for mango counting using a modification of FCN. The coordinates of blob-like regions in the semantic segmentation mask are used to generate bounding boxes corresponding to mango fruits. Finally, [219] applies Mask R-CNN for instance segmentation of blueberries. The model also classifies the maturity of individual blueberries and counts the number of berries according to the masks.
[Table row residue: Precision, Recall, F1 score and Average Precision (IoU > 0.9) | - | 120 RGB-D images of strawberries; [356] | Precision, Recall (IoU > 0.9) | 95.78%, 95.41% | 1900 images of strawberries; [143] | Class frequency weighted precision and recall (IoU unspecified) | 96.1%, 96.0% | 900 images of strawberries]
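The IoU-based idea behind the tracking sub-system described for [124] can be illustrated with a simplified sketch: a detection that overlaps an already tracked fruit is treated as the same fruit, and any other detection is counted as new. The threshold value and the greedy matching rule are assumptions; the actual sub-system in [124] also compares boundaries and is more elaborate.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_fruits(frames, iou_thresh=0.3):
    """Count distinct fruits over a sequence of per-frame detection boxes."""
    tracked = []  # last known box of every fruit counted so far
    for detections in frames:
        for box in detections:
            match, best = None, iou_thresh
            for k, old in enumerate(tracked):
                overlap = iou(box, old)
                if overlap >= best:
                    match, best = k, overlap
            if match is None:
                tracked.append(box)      # unseen fruit: count it
            else:
                tracked[match] = box     # known fruit: update its position
    return len(tracked)


# One fruit drifts slowly to the right; a second fruit appears in frame 2.
frames = [
    [(0, 0, 10, 10)],
    [(1, 0, 11, 10), (50, 50, 60, 60)],
    [(2, 0, 12, 10), (50, 50, 60, 60)],
]
total = count_fruits(frames)
```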
Occlusion poses a difficult challenge for counting. Due to this issue, the automatic count from detection or segmentation results is almost always lower than the actual number of fruits. To address this, [160] calculates and applies the ratio between the actual hand-harvest count and the automatic fruit count; it also uses both front and back views of mango trees to mitigate occlusion from one angle. Taking this idea one step further, [329] uses dual-view videos to detect and track mangoes as the camera moves. Utilizing different views of the same tree in a video, [329] recognizes around 20% more fruits. However, the detected count is still significantly lower than the actual number, underscoring the research challenge of exhaustive and accurate counting.
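The ratio-based correction described for [160] amounts to simple arithmetic, sketched below. Computing the factor from totals over a set of calibration trees is an assumption; [160] may instead average per-tree ratios, and the numbers are invented for illustration.

```python
def correction_factor(hand_counts, auto_counts):
    """Ratio of the total hand-harvest count to the total automatic count."""
    return sum(hand_counts) / sum(auto_counts)


def corrected_count(auto_count, factor):
    """Scale an automatic count to compensate for occluded, undetected fruits."""
    return auto_count * factor


# Hypothetical calibration trees: the detector sees 80% of the fruits.
factor = correction_factor([100, 120], [80, 96])
estimate = corrected_count(40, factor)
```

Such a factor is only as reliable as the calibration trees are representative, since occlusion levels can vary with canopy density and viewing angle.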

Maturity Level Classification
Maturity level classification aims to determine the ripeness of fruits or vegetables to aid in proper harvesting and food quality assurance. Premature harvesting results in plants that are unpalatable or incapable of ripening, while delayed harvesting can result in overripe plants or food decay [143].
The optimal maturity level differs for different targeted products and destinations. Fruits and vegetables can be consumed at different growing stages. For example, lettuce can be consumed either as baby lettuce or as fully grown lettuce.
The same holds for baby corn and normal corn. Since products are transported to different destinations, the length of transportation and the ripening speed must be considered when deciding the correct maturity level at harvest [367].
Manually distinguishing the subtle differences in maturity levels is time-consuming, prone to inconsistency, and costly. The labor cost of harvesting accounts for a large percentage of the operation cost of farms, with 42% of variable production expenses in U.S. fruit and vegetable farms being spent on labor for harvesting [144]. Automatic maturity level classification with computer vision, in contrast, can assist automatic harvesting [20,107,367] and reduce cost.
Similar to fruit detection, we can apply thresholding methods on color to detect ripeness. For example, [25] applies color thresholding in the HSI and YIQ color spaces. [305] applies linear color models. [179] utilizes a combination of color and texture features. [96,168,264,265,338] apply shallow learning methods based on a multitude of features.
More recently, researchers have evaluated deep learning based computer vision methods on maturity level classification and attained satisfactory results. For example, [366] applies a CNN to classify tomato maturity into five levels. However, to further facilitate automatic harvesting, object detection and instance segmentation are more commonly used to obtain the exact shape, location, and maturity level of fruits, as well as the position of the peduncles that robotic end-effectors cut.
With object detection, [355] applies the R-YOLO network described in the fruit detection section (§3.2) to detect ripe strawberries. [124], as mentioned in the fruit counting section §3. Using the segmentation methods discussed in §3.2, [13] classifies semantic segmentation masks of tomatoes into raw and ripe tomatoes. [107,356] perform instance segmentation and classify instance masks into ripe and raw strawberries. [143] first performs instance segmentation on tomatoes. After transforming the mask region into HSV color space, the authors employ a fuzzy system to classify tomatoes into four classes: immature (completely green), breaker (green to tannish), preharvest (light red), and harvest (fully colored).

Pest and Disease Detection
Plants are susceptible to environmental disorders caused by temperature, humidity, nutritional excess or deficiency, and light changes, and to biotic disorders due to fungi, bacteria, viruses, or other pests [103,280]. Infectious diseases and pest outbreaks induce inferior plant quality or plant death, resulting in at least 10% of global food production losses [290].
Although controlled vertical farming restricts the entry of pests and diseases, it cannot eliminate them. Pests and diseases can enter the farm through accidental contamination from employees, seeds, irrigation water and nutrient solution, poorly maintained environment or phytosanitation protocols, and unsealed entrances and ventilation systems [248]. For this reason, pest and disease detection is still worth studying in the context of CEA.
Manual diagnosis of plants is complex due to the large quantity of vertically arranged plants in the field and the numerous possible symptoms of diseases on different species. In addition, plants show different patterns along infection cycles and their symptoms can vary in different parts of the plant [43]. Consequently, autonomous computer vision systems that recognize diseases according to the species and plant organs are gaining traction. From a technological perspective, we sort existing techniques into three parts: single- and multi-label classification, handling unbalanced class distributions, and label noise and uncertainty estimates.
3.5.1 Single- and Multi-label Classification. Studies perform single-label, or one-label-per-image, classification of diseases of either one single species [24,262,280,370] or multiple species [95]. [370] creates a lightweight version of AlexNet, replacing the fully connected network with a global pooling layer, to classify six types of cucumber diseases.
Having a single label per image can be inaccurate. In the real world, one plant or one leaf can carry multiple diseases or contain multiple diseased regions. By detecting multiple targeted areas or disease classes, the multi-label setting can lead to improved efficiency and accuracy.
To deal with the possibility of having multiple diseases or multiple diseased areas on one plant simultaneously, two types of methods are proposed. [204] first segments out different infection areas on cucumber leaves using color thresholding following [203], then applies a DCNN on the segmented areas to classify four types of cucumber diseases.
Nevertheless, the color thresholding technique may not generalize to other plant species and environments. Another type of method leverages object detection or segmentation for locating and classifying infection areas. [262] locates multiple diseased regions of banana plants simultaneously using object detection but assigns only one disease label to each image. [103] compares Faster R-CNN, R-FCN, and SSD for detecting nine classes of diseases and pests that affect tomato plants. Multiple diseases and pests in one plant are detected simultaneously. [358] applies an improved DeepLab v3+ for segmentation of multiple black rot spots on grape leaves. The efficient channel attention mechanism [324] is added to the backbone of DeepLab v3+ for capturing local cross-channel interaction. A feature pyramid network and Atrous Spatial Pyramid Pooling [64] are utilized for fusing feature maps from the backbone network at different scales to improve segmentation.

3.5.2 Handling Unbalanced Class Distributions.
A common obstacle encountered in disease detection is unbalanced disease class distributions. There are typically far fewer diseased plants than healthy plants, and images of rare diseases are hard to find; the resulting data imbalance makes model training difficult. To remedy this problem, researchers propose weakly supervised learning [44], generative adversarial networks (GAN) [116], and few-shot learning [185,220].
Specifically, [44] applies multiple instance learning (MIL), a type of weakly supervised learning method, for multiclass classification of six mite species of citrus. In MIL, the learner receives a set of labeled bags, each containing multiple image instances. We know that at least one instance is associated with the class label, but do not know the exact instance. The MIL algorithm tries to identify the common characteristic shared by images in the positively labeled bags.
In this work, a CNN is first trained with labeled bags. Next, by calculating saliency maps of images in bags, the model identifies salient patches that have a high probability of containing mites. These patches inherit labels from their bags and are used to refine the CNN trained above.
[116] leverages a generative adversarial network (GAN) to generate realistic image patches of tip-burn lettuce and trains U-net for tip-burn segmentation. For the generation stage, lettuce canopy image patches are input into Wasserstein GANs [26] to generate stressed (tip-burned) patches so that there are an equal number of stressed and healthy patches. Then, in the segmentation stage, the authors generate a binary label map for the images using a classifier and an edge map. The binary label map labels each mini-patch (super-pixel) as stressed or healthy. The authors then feed the label map, alongside the original images, as input to U-net for mask segmentation.
In few-shot meta-learning, we are given a meta-train set and a meta-test set, with the two sets containing mutually exclusive image classes (i.e., classes in the training set do not appear in the testing set). The meta-train and meta-test sets each contain a number of episodes, each of which consists of some training (support) images and some test (query) images. The rationale of meta-learning is to equip the model with the ability to quickly learn to classify the test images from a small number of training images within each episode. The model acquires this meta-learning capability on the meta-train set and is evaluated on the meta-test set.
As an example, [220] performs pest and disease classification with few-shot meta-learning. The model framework consists of an embedding module and a distance module. The embedding module first projects support images into an embedding space using ResNet-18, then feeds the embedding vectors into a transformer to incorporate information from other support samples in the same episode. After that, the distance module calculates the Mahalanobis distance [104] between the query and support samples to classify the query. Similarly, [185] uses a shallow CNN for embedding and the Euclidean distance for calculating the similarity between the embeddings of the query and support samples.
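The distance-based classification step within one episode can be sketched as follows, using the Euclidean distance as in [185]. The sketch assumes the embeddings have already been computed by some network, and it summarizes each class by the mean of its support embeddings, which is a common choice but not necessarily the one used in [185] or [220]. The class names and vectors are invented.

```python
import math


def class_prototypes(support):
    """Map each label to the mean of its support embeddings."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return protos


def classify_query(query, support):
    """Return the label whose prototype is closest to the query embedding."""
    protos = class_prototypes(support)
    return min(protos, key=lambda label: math.dist(query, protos[label]))


# One 2-way, 2-shot episode with hypothetical 2-D embeddings.
support_set = {
    "aphid": [[0.0, 0.0], [0.0, 2.0]],
    "mildew": [[10.0, 10.0], [12.0, 10.0]],
}
first = classify_query([1.0, 1.0], support_set)
second = classify_query([9.0, 9.0], support_set)
```

Replacing `math.dist` with a Mahalanobis distance would move the sketch closer to the distance module described for [220], at the cost of estimating a covariance matrix from very few support samples.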
3.5.3 Label Noise and Uncertainty Estimates. [271] is another example of meta-learning, but it is used to improve the network's robustness against label noise. The model consists of two phases. The first phase is the conventional training of a CNN for classification. In the second phase, the authors generate ten synthetic mini-batches of images, containing real images with labels taken from similar images. As a result, these mini-batches could contain noisy labels. After a one-step update on the synthetic instances, the network is trained to output predictions similar to those of the CNN from the first phase. The result is a model that is not easily affected by noisy training data.
Finally, having a confidence score associated with the model prediction allows farmers to make decisions selectively under different confidence levels and can boost the acceptance of deep learning models in agriculture. As an example, [99] performs classification of tomato diseases and pairs the prediction with a confidence score following [79]. The confidence score, calculated using Bayes' rule, is defined as the probability of the true class label conditioned on the class probability predicted by the CNN. In addition, the authors build an ontology of disease classification. For example, the parent node "stressed plant" has as children "bacteria infection" and "virus infection", the latter of which in turn has "mosaic virus" as a child. If the confidence score of a specific terminal disease label is below a certain threshold, the model switches to its more general parent label in the tree for higher confidence. By the axioms of probability, the predicted probability of the parent label is the sum of the predicted probabilities of its direct descendants. For a general discussion of machine learning techniques that create well-calibrated uncertainty estimates, we refer readers to §2.4.
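The back-off along the disease ontology can be sketched as below, using the example tree from the text. For simplicity the sketch uses the raw predicted probabilities in place of the Bayes-calibrated confidence score of [99], and the threshold and probability values are invented; both are assumptions made for illustration.

```python
# Child -> parent links of the example ontology described in the text.
PARENT = {
    "mosaic virus": "virus infection",
    "virus infection": "stressed plant",
    "bacteria infection": "stressed plant",
}


def node_probability(node, leaf_probs, parent=PARENT):
    """Probability of a node: its own mass plus that of all its descendants."""
    total = leaf_probs.get(node, 0.0)
    for child, par in parent.items():
        if par == node:
            total += node_probability(child, leaf_probs, parent)
    return total


def back_off(label, leaf_probs, threshold=0.8, parent=PARENT):
    """Climb to more general labels until the probability reaches the threshold."""
    node = label
    while node_probability(node, leaf_probs, parent) < threshold and node in parent:
        node = parent[node]
    return node


# Hypothetical classifier output; the remaining 0.05 belongs to "healthy".
probs = {"mosaic virus": 0.55, "bacteria infection": 0.40}
cautious = back_off("mosaic virus", probs, threshold=0.8)
confident = back_off("mosaic virus", probs, threshold=0.5)
```

Here the specific label "mosaic virus" is not confident enough at a threshold of 0.8, so the prediction falls back to "stressed plant", whose probability is the sum over its descendants.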

DATASETS
High-quality datasets with human annotations are one of the most important factors in the success of a machine learning project [208,217,334]. In this section, we review established datasets that enable training of CV models. We exclude datasets for plants whose suitability for CEA we could not find supported in the literature, such as apples [41,126], broccoli [169], and dates [21]. We have manually checked every dataset listed and confirmed that it is available for download at the time of writing. By summarizing the datasets related to CEA, we aim to assist interested researchers in their future studies. In the meantime, we would like to encourage scholars to publish more datasets dedicated to CEA.
As listed in Table 7 and where computer vision technologies could provide short- to mid-term benefits to urban and suburban CEA. We identify three such areas, including realistic datasets that are unbalanced and noisy, uncertainty quantification, and multi-task learning / system integration.

Handling Realistic Data
The ability to handle realistic data is a critical competence that has not received sufficient research attention (with a few notable exceptions [44,116,185,220,271]). Unlike well-curated datasets that have accurate and abundant labels and relatively balanced label distributions, real-world data exhibit skewed label distributions as well as substantial noise in the labels. For effective real-world application, it is important that the CV algorithms can maintain good predictive performance under these conditions. In addition, the algorithmic tolerance of data imperfection can lower annotation cost and enable wider applications of CV. There has been substantial research on these topics in the computer vision community, such as long-tail recognition [81,193,269,322,378,379], few-shot and zero-shot learning [180,283,284,285,343], as well as noise-resistant classification [17,70,153,330,377] and metric learning [147,191,319].
We believe that research on smart agriculture could benefit from the existing body of literature.

Quantifying Uncertainty and Interpretability
Real-world applications call for reliable estimation of the quality of automated decisions. An incorrect prediction made by an AI system may have profound implications. For example, if the system incorrectly determines that fruits are not mature enough, it may delay harvesting and cause overripe fruits with diminished value. However, it is impossible to eliminate incorrect or uncertain predictions, as they originate from factors that are difficult to control and precisely measure, including model assumptions, test data shift, incomplete training data, and so on [11,146]. Thus, we argue that uncertainty quantification is another crucial factor for real-world deployment. Such quantification would allow
Besides uncertainty quantification, pairing the model with explanations of its decisions could enhance user confidence and assist auditing and debugging of the AI system. Specifically, instance attribution methods, as discussed in §2.5, enable detection of biased or low-quality data points with extreme influence on prediction [67]. For example, if the model is trained with an image of dry leaves with dust that resemble a certain disease of the plant, then during inference the model might misclassify diseased leaves as normal dry leaves or vice versa and induce plant death or unnecessary treatments. With instance attribution interpretation, researchers can identify misleading data points and perform adversarial training to improve model accuracy.

Multi-task Learning and System Integration
Real-world deployment usually requires the coordination of multiple CV capabilities provided by different networks.
When the system is designed well, these networks could facilitate each other and achieve synergistic effects. For example, instance segmentation can be used for fruit and flower localization (§3.2), growth monitoring (§3.1), and fruit maturity level detection (§3.4). However, academic research tends to study these problems in isolation and is thereby unable to reap the benefits of multi-task learning.
Multi-task learning [29,51,194] focuses on leveraging mutually beneficial supervisory signals from multiple correlated tasks.Recently, CV researchers have built large-scale networks [66,71,120,152,155,200,323,383] that perform a wide range of tasks and achieve state-of-the-art results on most tasks.This demonstrates the benefits of multi-task learning and could inspire similar work dedicated to smart farming in CEAs.
Another motivation for considering multi-task learning and system integration is that errors can propagate in a pipeline architecture.For example, a network could first incorrectly detect a leaf occluding a mature fruit as the fruit and then classify it as an immature fruit.As a result, simply concatenating multiple techniques will result in inferior overall performance than what practitioners may expect.Thus, we encourage system designers to consider end-to-end training, or other innovative techniques [119,341,348] for aligning and interfacing different components within a system.
Finally, multi-task learning handles multiple tasks simultaneously, which saves computation power, enhances data efficiency, and alleviates the necessity to maintain and iterate multiple models. Such benefits are crucial for popularizing CEAs, as they facilitate the efficient use of energy, computation power, and human resources. Consequently, both the initial setup and ongoing maintenance investments for CEA farms can be reduced, expediting the emergence of economically viable CEAs. Furthermore, mindful selection and combination of targeted tasks have the potential to further improve overall efficiency [288].
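As a rough illustration of how supervisory signals from several tasks can be combined, the sketch below forms a single training objective as a weighted sum of per-task losses. It assumes the per-task predictions come from one shared forward pass. The task names, the loss choices, and the weights are hypothetical and are not drawn from any of the cited systems.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target_index])


def squared_error(pred, target):
    """Squared error for a scalar regression target."""
    return (pred - target) ** 2


def multitask_loss(task_losses, task_weights):
    """Weighted sum of per-task losses; every task must have a weight."""
    return sum(task_weights[name] * loss for name, loss in task_losses.items())


# Hypothetical outputs of two heads sharing one backbone on a single image.
losses = {
    "maturity": cross_entropy([0.2, 0.8], target_index=1),
    "fruit_count": squared_error(pred=11.0, target=12.0),
}
weights = {"maturity": 1.0, "fruit_count": 0.1}  # hypothetical task weights
total = multitask_loss(losses, weights)
```

Because the total loss is a single scalar, gradients from every task flow into the shared parameters, which is where the mutual benefit between correlated tasks is expected to arise. Choosing the weights is itself a design decision that this sketch leaves open.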

Effective Use of Multimodality
Fusion of multi-modal data enhances the inference ability of models by incorporating complementary views of the data [171]. In the context of CEA, thermal or depth images capture the temperature or depth differences between foreground and background and enable filtering of non-target objects (e.g., fruits or leaves). Abnormal temperature changes during the growth cycle can also indicate disease infection before visual symptoms appear [53,54]. Furthermore, as different materials absorb, reflect, and transmit light in different ways and at different wavelengths, multi-spectral imaging (MSI) and hyper-spectral imaging (HSI), which capture images at multiple wavelengths of light, can be used to perform more specific internal inspection of leaves, fruits, and plants than thermal and depth images allow. Finally, LiDAR and RGB-D systems allow the generation of high-density 3D point clouds of plants, fruits [107,188], or the environment [318], which facilitate 3D volume measurement or cut-point detection during harvesting.
Existing works have demonstrated the efficacy of multi-spectral imaging (MSI) and hyper-spectral imaging (HSI) [13,42,320]. MSI has been utilized for yield prediction [310] and early disease detection [227,316]. However, the current literature has mostly explored MSI with shallow machine learning. We found only one work that leverages deep learning on MSI input [310], which applies a pruned VGG-16 for wheat yield estimation. HSI provides finer-grained resolution and divides the range of wavelengths into many more spectral bands than MSI, typically ranging from tens to hundreds of bands, though at a higher cost. Hyper-spectral images have been used as the sole modality in early disease detection with both shallow machine learning methods [18,19,296] and deep learning methods [98,122,218,320]. For reasons of relevance and space, we discuss only the deep learning methods here. Specifically, with a GAN-based data augmentation method, [320] performs early detection of tomato spotted wilt virus before visible symptoms appear using hyper-spectral images. [218] performs early detection of grapevine vein-clearing virus and shows the discriminative power of HSI in combination with CNN and shallow machine learning algorithms. [98] attains early barley disease detection by generating future predictions of hyper-spectral barley leaf images using a GAN. Moreover, HSI has also been utilized for yield prediction through fruit counting. [122] leverages CNN and HSI to segment semantic mango masks and count the number of fruits.
However, systematic exploration of fusion techniques for multimodal inputs remains relatively rare in CEA applications. Many existing approaches adopt pipeline-based multimodal integration techniques that do not exhaust the potential of deep learning due to the lack of end-to-end training. For example, in [13], the authors set a depth threshold to filter false positive tomato detections from the background. [42] first performs broccoli segmentation on the RGB image. Within the segmentation mask, the authors find the mode of the depth value distribution, which is used to calculate the diameter of the broccoli head. [188] conducts semantic segmentation for guava fruits using RGB images and reconstructs their 3D positions from the depth input. [107] utilizes Mask R-CNN to perform instance segmentation of strawberries and aligns the depth image with the segmentation mask to obtain the 3D shape of strawberries. These methods use the two modalities separately and do not apply end-to-end training of the pipeline. As exceptions, [255] proposes late fusion of RGB and near-infrared images in sweet pepper detection. [317] incorporates depth information by replacing the blue channel with the depth channel and applies Mask R-CNN to locate tomatoes.
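The RGB-depth strategies described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than a reproduction of the cited implementations: the data layout (detections as dictionaries, pixels as tuples), the function names, the units, and the threshold value are all assumptions made for the example.

```python
from statistics import mode


def filter_by_depth(detections, max_depth_m):
    """Keep detections no farther than max_depth_m, dropping background hits."""
    return [d for d in detections if d["depth_m"] <= max_depth_m]


def mask_depth_mode(depth_values_mm):
    """Most frequent depth value among the pixels inside a segmentation mask."""
    return mode(depth_values_mm)


def replace_blue_with_depth(rgb_pixels, depth_pixels):
    """Build (R, G, depth) pixels by swapping the blue channel for depth."""
    return [(r, g, d) for (r, g, _b), d in zip(rgb_pixels, depth_pixels)]


# Hypothetical detections with a depth estimate in metres.
detections = [
    {"label": "tomato", "depth_m": 0.6},
    {"label": "tomato", "depth_m": 2.4},  # likely a plant in a background row
]
near = filter_by_depth(detections, max_depth_m=1.0)

# Hypothetical depth readings (millimetres) inside one segmentation mask.
head_depth = mask_depth_mode([412, 415, 415, 415, 430])

# Two hypothetical 8-bit RGB pixels and their depth values rescaled to 0-255.
rgd = replace_blue_with_depth([(200, 30, 40), (10, 120, 15)], [85, 210])
```

The first two functions are post-processing steps applied after an RGB-only network has run, which is why such pipelines cannot be trained end to end. The channel replacement instead changes the network input, so the detector can learn from depth directly.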
In computer vision research, numerous techniques for fusing and jointly utilizing multimodal information have been proposed over the years, which we believe could contribute to CV applications in CEA. Due to space limits, we list only a few examples here. [268] proposes two different ways to combine multiple modalities in object detection, Concatenation and Element-wise Cross Product. The former combines feature maps from different modalities along the channel dimension and lets the network discover the best way to combine them from data. The latter technique, Element-wise Cross Product, applies element-wise multiplication to every possible pair of feature maps from the two modalities. [50] experiments with a variety of fusion techniques for RGB and optical flow and discovers a high-performing late-fusion strategy in action recognition. In self-supervised learning, [125] identifies similar data points using one modality and treats them as positive pairs in another modality. This technique provides another paradigm to leverage the complementary nature of multimodality.
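The two fusion operations attributed to [268] above can be illustrated with a small sketch. It follows only the description given in the text, not the original implementation. Each modality is assumed to be a list of feature maps (channels), and each feature map a flat list of values with the same number of positions; the function names are hypothetical.

```python
from itertools import product


def concat_fusion(maps_a, maps_b):
    """Stack the feature maps of both modalities along the channel dimension."""
    return maps_a + maps_b


def cross_product_fusion(maps_a, maps_b):
    """Multiply every pair of feature maps (one per modality) element-wise."""
    return [[x * y for x, y in zip(a, b)] for a, b in product(maps_a, maps_b)]


# Two hypothetical RGB feature maps and one depth feature map, two positions each.
rgb_maps = [[1.0, 2.0], [3.0, 4.0]]
depth_maps = [[0.5, 1.0]]

stacked = concat_fusion(rgb_maps, depth_maps)        # 2 + 1 = 3 channels
paired = cross_product_fusion(rgb_maps, depth_maps)  # 2 * 1 = 2 channels
```

The channel counts show the trade-off: concatenation grows linearly with the number of input maps and leaves the combination to later layers, whereas the pairwise product grows multiplicatively but encodes explicit cross-modal interactions.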

CONCLUSIONS
Smart agriculture, and particularly computer vision for controlled-environment agriculture (CV4CEA), is rapidly emerging as an interdisciplinary area of research that could potentially lead to enormous economic, environmental, and social benefits. In this survey, we first provide brief overviews of existing CV technologies that range from image recognition to structured understanding such as segmentation, and from uncertainty quantification to interpretable machine learning. Next, we systematically review existing applications of CV4CEA, including growth monitoring, fruit and flower detection, fruit counting, maturity level classification, and pest and disease detection. Finally, we highlight a few research directions that could generate high-impact research in the near future.
Like any interdisciplinary area, research progress in CV4CEA requires expertise in both computer vision and agriculture. However, it could take a substantial amount of time for any researcher to acquire an in-depth understanding of both subjects. By reviewing existing applications and available CV technologies, and by identifying possible future research directions, we aim to provide a quick introduction to CV4CEA for researchers with expertise in agriculture or computer vision alone. It is our hope that the current survey will serve as a bridge between researchers from diverse backgrounds and contribute to accelerated innovation in the next decade.

Fig. 1. An illustration of the end-to-end agriculture process of CEAs, from seed planting to harvest and sales, with five major deep-learning-based CV applications in agriculture (Growth Monitoring, Fruit and Flower Detection, Fruit Counting, Maturity Level Classification, and Pest and Disease Detection) mapped to the corresponding plant growth stages. Autonomous Seed Sowing and Autonomous Harvest and Sales, shown in gray boxes, are relevant steps in the agriculture process of CEAs but are outside the scope of our survey, which focuses on CV in CEAs. Orange lines represent arrows originating from pest and disease detection. Green lines represent arrows with stage 4 as their destination.
and the form factors may pose different requirements for computer vision technologies. Typical forms for CEA are glasshouses with transparent shells or completely enclosed facilities. Depending on the cultivars being planted, the internal arrangement of the farm can be classified into stacked horizontal systems, vertical growth surfaces, and multi-floor towers. Form factors influence lighting, which is an important consideration in CV applications. For example, glasshouses with transparent shells utilize natural light to reduce energy consumption but may not provide sufficient lighting for CV around the clock. In comparison, a completely enclosed facility offers greater control of lighting conditions. Moreover, the internal arrangement of the farm also affects camera angles. If the cultivars being planted change frequently as a result of the high turnover rate in CEAs, the arrangement of shelves and plants might change. This would affect the camera angles and thus the resulting inference performance. CV systems need to adapt to such changes in the environment.
3, proposes a pre-trained Faster R-CNN network to estimate both the ripeness and quantity of sweet pepper. Two formulations of the model are tested. One treats ripe/unripe as additional classes on top of foreground/background, and the other performs foreground/background classification first and then performs ripeness classification on foreground regions. The second approach generates better ripeness classification results, as the ripe/unripe classes are more balanced when only the foreground regions are considered.

Table 1. Factors to consider when applying CV techniques in CEA and some corresponding countermeasures.

Table 2. Performance of various leaf instance segmentation techniques on the CVPPP A1 test set. Higher SBD and lower |DiC| indicate better performance. (GT-FG) indicates models making use of ground-truth foregrounds.

Table 3. Performance of various fruit and flower detection techniques. Datasets without reference are unpublished datasets.

Table 4. Performance of various fruit counting techniques. Datasets without reference are unpublished datasets. [238] uses a direct regression method and thus does not need an IoU threshold.

Table 5. Performance of various maturity level classification techniques. Datasets without reference are unpublished datasets. A performance entry of "-" marks papers whose metric results could not be summarized. [366] uses a direct classification method and thus does not need an IoU threshold.

Table 6. Performance of various pest and disease detection techniques. Datasets without reference are unpublished datasets. A performance entry of "-" marks papers whose metric results could not be summarized. * Studies that perform direct classification on images and thus do not need an IoU threshold. [116] uses patch-level segmentation, which also does not need an IoU threshold.
Table 8, we discover fourteen datasets in CEA, with three for Growth Monitoring, five for Fruit Detection, and six for Pest and Disease Detection. Each targeted task contains at least one dataset that covers multiple species to facilitate training of generalizable and transferable models. The largest dataset is CVPPP, with 6,287 and 165,120 RGB images for Arabidopsis and Tobacco respectively, aimed at growth monitoring related tasks. All the available datasets are composed of real images. While real images provide realistic data, we also want to encourage publication of synthetic datasets, which usually feature balanced class distributions and accurate labeling. Another noteworthy point is that many real images are collected under simplified laboratory environments, which may bias the data toward specific lighting conditions, backgrounds, plant orientations, or camera positions. For real-world applications, practitioners may need to further fine-tune the trained models on more realistic data.

Table 7. Dataset for CV tasks in CEA

Table 8. Dataset for CV tasks in CEA