US20200242777A1 - Depth-aware object counting - Google Patents
- Publication number
- US20200242777A1 (application No. US16/754,988; US201716754988A)
- Authority
- US
- United States
- Prior art keywords
- filter
- image
- segment
- density map
- machine learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- All classifications fall under section G (Physics), class G06 (Computing or Calculating; Counting), subclass G06T (Image Data Processing or Generation, in General):
- G06T7/11 — Image analysis; Segmentation; Edge detection; Region-based segmentation
- G06T7/136 — Image analysis; Segmentation; Edge detection involving thresholding
- G06T2200/04 — Indexing scheme for image data processing or generation involving 3D image data
- G06T2207/10028 — Image acquisition modality; Range image; Depth image; 3D point clouds
- G06T2207/20081 — Special algorithmic details; Training; Learning
- G06T2207/20084 — Special algorithmic details; Artificial neural networks [ANN]
- G06T2207/30196 — Subject of image; Context of image processing; Human being; Person
- G06T2207/30242 — Subject of image; Context of image processing; Counting objects in image
Definitions
- the subject matter described herein relates to machine learning.
- Machine learning technology enables computers to learn tasks. For example, machine learning may allow a computer to learn to perform a task during a training phase. Later, during an operational phase, the computer may be able to perform the learned task.
- Machine learning may take the form of a neural network, such as a deep learning neural network, a convolutional neural network (CNN), a support vector machine, a Bayes classifier, and other types of machine learning models.
- Methods and apparatus, including computer program products, are provided for depth-aware object counting.
- a method that includes processing, by the trained machine learning model, a first segment of an image and a second segment of the image, the first segment being processed using a first filter selected, based on depth information, to enable formation of a first density map, and the second segment being processed using a second filter selected, based on the depth information, to enable formation of a second density map; combining, by the trained machine learning model, the first density map and the second density map to form a density map for the image; and providing, by the trained machine learning model, an output based on the density map, the output being representative of an estimate of a quantity of objects in the image.
- the trained machine learning model may receive the image including a plurality of objects, wherein the image is segmented, based on the depth information, into at least the first segment and the second segment.
- the depth information may be received from another machine learning model trained to output the depth information from the image.
- the trained machine learning model may include a multicolumn convolutional neural network including a first convolutional neural network and a second convolutional neural network.
- the first convolutional network may include the first filter.
- the second convolutional network may include the second filter.
- the first filter and the second filter each include a convolutional layer.
- the depth information may indicate the location of the first segment and/or the second segment.
- the depth information may indicate an object size due to distance from a camera.
- the depth information may indicate a first filter size of the first filter and a second filter size of the second filter.
- the trained machine learning model may select, based on the depth information, the first filter size of the first filter and the second filter size of the second filter.
- the training may be based on reference images, such that the machine learning model trains to learn generation of density maps.
- the plurality of objects may include a plurality of people, a plurality of vehicles, and/or a crowd of people.
- the first density map may estimate a density of objects in the first segment.
- the second density map may estimate a density of objects in the second segment.
- the density map may estimate a density of objects in the image.
- FIG. 1 depicts an example of an image including a crowd of people and a corresponding density map, in accordance with some example embodiments
- FIG. 2A depicts an example of a convolutional neural network (CNN), in accordance with some example embodiments
- FIG. 2B depicts another example of a CNN, in accordance with some example embodiments.
- FIG. 3A depicts an example of a neuron for a neural network, in accordance with some example embodiments
- FIG. 4 depicts a multicolumn convolutional neural network (MCCNN), in accordance with some example embodiments
- FIGS. 5A-5D depict process flows for determining an object count, in accordance with some example embodiments
- FIG. 6 depicts an example of an apparatus, in accordance with some example embodiments.
- FIG. 7 depicts another example of an apparatus, in accordance with some example embodiments.
- Machine learning may be used to perform one or more tasks, such as counting a quantity of objects within at least one image.
- a machine learning model such as a neural network, a convolutional neural network (CNN), a multi-column CNN (MCCNN), and/or other type of machine learning, can be trained to learn how to process at least one image to determine an estimate of the quantity of objects, such as people or other types of objects, in the at least one image (which may be in the form of frames of a video).
- public safety officials may want to know a crowd count at a given location, which can be useful for a variety of reasons including crowd control, restricting the quantity of people at a location, minimizing the risk of a stampede, and/or minimizing the risk of some other large-group-related mayhem.
- traffic safety officials may want to know a count of vehicles on a road (or at a location), and this count may be useful for a variety of reasons including traffic congestion control and management.
- the trained machine learning model may be used to count objects, such as people, vehicles, or other objects, in at least one image, in accordance with some example embodiments.
- the trained machine learning model may provide an actual count of the quantity of objects estimated to be in an image, or may provide a density map providing an estimate of the quantity of objects per square unit of distance, such as quantity of objects per square meter.
- the density map may provide more information in the sense that the density map may estimate the quantity of objects in the image and the distribution, or density, of objects across the image.
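- To make the relationship between a density map and a count concrete, the short NumPy sketch below integrates a density map to obtain a total count and a coarse distribution; the grid size and cell values are made-up illustrations, not values from this disclosure.

```python
import numpy as np

# Hypothetical 4x6 density map: each cell holds the estimated number of
# objects (e.g., people) falling in that cell; the values are illustrative.
density_map = np.array([
    [0.2, 0.1, 0.0, 0.3, 0.4, 0.1],
    [0.5, 0.6, 0.2, 0.1, 0.0, 0.3],
    [0.9, 1.1, 0.7, 0.4, 0.2, 0.1],
    [1.3, 1.5, 1.2, 0.8, 0.5, 0.2],
])

# The estimated count for the whole image is the integral (sum) of the map.
total_count = density_map.sum()

# The distribution can be read region by region, e.g., lower (near-camera)
# rows versus upper (far-from-camera) rows.
near_count = density_map[2:, :].sum()
far_count = density_map[:2, :].sum()

print(f"total ~ {total_count:.1f}, near ~ {near_count:.1f}, far ~ {far_count:.1f}")
```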
- FIG. 1 depicts an example of an image 100 including objects to be counted and a corresponding density map 105, in accordance with some example embodiments.
- the objects represent people, although as noted the objects may represent other types of objects as well.
- the density map 105 may provide information about the objects, such as people, in image 100, such as a density of people per square meter, a distribution of people across the image, and/or a count of the quantity of people in at least a portion of the image.
- the scale of the objects, such as people, in the image may change due to size (e.g., scale) changes caused by the perspective of the camera in relationship to the people. For example, a person in the foreground of the image 100 may appear larger as that person is closer to the camera, when compared to a similarly sized person in the background and thus farther away from the camera. This perspective-caused size variation may affect the accuracy of the count of objects in the at least one image 100 and the accuracy of the corresponding density map 105.
- a machine learning model such as a neural network, a CNN, an MCCNN, and/or the like, may be used to determine an estimate of the quantity of objects, such as people, in an image.
- the estimate may be in the form of a density map of the image.
- the machine learning model may be implemented as an MCCNN, although other types of machine learning models may be used as well.
- crowd counting is described in the paper by Y. Zhang et al., “Single-image crowd counting via multi-column convolutional neural network,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
- the density map 105 of an image 100 may be determined by at least segmenting, based on the relative distances of objects such as people from a camera viewpoint, the whole image into at least two regions, although the image may be segmented into other quantities as well (e.g., 3, 4, or more segmented regions).
- the machine learning model such as the MCCNN configured with at least one filter selected to handle the object sizes (e.g., head or people sizes) in the corresponding region, may determine a density map, in accordance with some example embodiments.
- the density maps for each of the segmented regions may then be combined to form a density map 105 for the whole image 100 , in accordance with some example embodiments.
- a technical effect of one or more of the example embodiments disclosed herein may be enhanced processing speed due to the segmentation of the images, when compared to processing the whole image. Another technical effect may be more accurate counting, as each segment is processed with a filter specifically selected to account for the size-induced perspective effects for that region and the objects in that region.
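- The flow just described can be summarized in a few lines of Python. This is a minimal sketch, assuming hypothetical helpers (estimate_depth, segment_by_depth, column_cnn_for) that stand in for the components discussed below; it is not the claimed implementation.

```python
import numpy as np

def count_objects(image, estimate_depth, segment_by_depth, column_cnn_for):
    """Depth-aware counting sketch: segment the image by depth, run a
    per-segment CNN whose filter size matches the object scale in that
    segment, then merge the per-segment density maps into one map."""
    depth_map = estimate_depth(image)          # role played by CNN 299 below
    masks = segment_by_depth(depth_map)        # e.g., regions 298A-C

    combined = np.zeros(image.shape[:2], dtype=np.float32)
    for mask in masks:
        cnn = column_cnn_for(mask, depth_map)  # filter size chosen from depth
        density = cnn(image)                   # density over the whole frame
        combined[mask] = density[mask]         # keep only this segment's part

    return combined, float(combined.sum())     # density map and object count
```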
- FIG. 2A depicts an example of a CNN 200 , in accordance with some example embodiments.
- the CNN may include at least one convolutional layer 210 , 230 , at least one pooling layer 220 , 240 , and a fully connected layer 250 .
- the convolution layer 210 may be referred to as a filter, and may comprise a matrix that convolves at least a portion of the input image 100 .
- the size of this filter, or matrix, may vary in order to detect and filter the object.
- a 7 by 7 matrix is selected as the filter at 210 to convolve with image 100, so objects to be counted would need to be smaller than 7×7 pixels in order to be properly captured (while objects larger than 7×7 would be filtered out).
- the pooling layer 220 may be used to downsample the convolved image output by the convolution layer 210 .
- the pooling layer may be formed by a sliding window (or vector) sliding across the convolved image output by the convolution layer 210 .
- the pooling layer may have a stride length representative of the width of the window in pixels.
- the fully connected layer 250 may generate an output 204 .
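- A toy PyTorch rendering of a network with this general shape (a 7×7 convolution, pooling, a second convolution, pooling, and a fully connected layer) might look as follows; the channel counts, the 5×5 second convolution, the 64×64 input, and the 10-way output are arbitrary assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ToyCNN(nn.Module):
    """Roughly mirrors FIG. 2A: conv -> pool -> conv -> pool -> fully connected."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=7, padding=3),   # 7x7 filter (cf. layer 210)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # pooling (cf. layer 220)
            nn.Conv2d(16, 32, kernel_size=5, padding=2),  # second conv (cf. layer 230)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # pooling (cf. layer 240)
        )
        self.fc = nn.Linear(32 * 16 * 16, 10)             # fully connected (cf. layer 250)

    def forward(self, x):
        x = self.features(x)
        return self.fc(x.flatten(start_dim=1))

out = ToyCNN()(torch.randn(1, 3, 64, 64))  # assumed 64x64 RGB input
print(out.shape)                           # torch.Size([1, 10])
```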
- FIG. 2B depicts another example of a CNN 299 , in accordance with some example embodiments.
- the CNN 299 may be configured to determine how to segment, based on a depth map, an input image.
- the depth map provides information regarding the relative distances of objects, such as people, heads, and/or the like, from the camera.
- CNN 299 may determine, based on the depth map 277 , segments 298 A-C for the input image 100 .
- the size based effects of perspective in a given segment may be the same or similar, so the filter convolving the segment may be better able to detect the object of interest, such as the heads, people, and/or the like.
- the CNN 299 may be trained to determine a depth map 277 , in accordance with some example embodiments.
- the depth map 277 may, as noted, provide an indication of the relative distances of objects (e.g., people, heads, and/or the like) from the camera. As such, the depth map may provide an indication of the perspective caused size differences in the image.
- objects farther away from the camera may have pixels that are brighter, when compared to objects that are closer to the camera.
- the depth map 277 may be used to segment, based on the perspective based size differences, the image 100 into two or more segmented regions, such as 298 A-C.
- the previous example uses a depth map having brighter pixels for objects farther away, the pixels may be darker or have other values to signify depth.
- first segmented region 298 A may have objects appearing smaller in size (due to perspective), when compared to the second segmented region 298 B.
- second segmented region 298 B may have objects appearing smaller in size (due to the perspective), when compared to the third segmented region 298 C.
- the CNN 299 may be trained using reference images. These reference images may include objects, such as people in a crowd, and labels indicating the segments determined a priori based on relative size differences caused by perspective. Moreover, these segments of the reference images may correspond to certain sized objects in each of the segments and, as such, corresponding filter sizes. The CNN may then be trained until the CNN can learn to segment the reference images, which may also dictate the filter size to be used for that segment. Once trained, the trained CNN 299 may be used to determine the segments in other input images, in accordance with some example embodiments. In some example embodiments, the training of the CNN
- the CNN 299 may include a 7×7 convolutional layer 210 (which is the initial filter layer), followed by a 3×3 pooling layer 220, followed by a 5×5 convolutional layer 230, followed by a 3×3 pooling layer 240, followed by a 3×3 convolution layer 265, followed by a 3×3 convolutional layer 267, followed by a 3×3 pooling layer 268, and then coupled to a fully connected layer 250 (also referred to as an activation layer).
- the fully connected layer may generate an output, which in this example is a depth map 277 .
- the CNN 299 is depicted as having a certain configuration of layers, other types and quantities of layers may be implemented as well to provide machine learning that generates the depth map 277 and associated segments 298 A-C.
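- Read literally, that layer ordering could be sketched in PyTorch as below. The channel widths are assumptions, and the final fully connected/activation stage is approximated with a 1×1 convolution so that the output stays a coarse, single-channel depth map rather than a flat vector.

```python
import torch
import torch.nn as nn

# Sketch of the layer ordering described for CNN 299; channel counts are
# assumed, and the last 1x1 convolution stands in for layer 250.
depth_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=7, padding=3), nn.ReLU(),   # 7x7 conv (cf. 210)
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 3x3 pool (cf. 220)
    nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),  # 5x5 conv (cf. 230)
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 3x3 pool (cf. 240)
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),  # 3x3 conv (cf. 265)
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),  # 3x3 conv (cf. 267)
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 3x3 pool (cf. 268)
    nn.Conv2d(32, 1, kernel_size=1),                         # output stage (cf. 250)
)

depth = depth_net(torch.randn(1, 3, 224, 224))
print(depth.shape)  # torch.Size([1, 1, 28, 28]) -- a coarse depth map
```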
- one or more thresholds may be used to form segments 298 A-C. For example, pixels brighter than a certain threshold value may be assigned to segment 298 A, while pixels darker than a certain threshold may be assigned to segment 298 C.
- each of the segments 298 A-C may, as noted, have a certain size object and thus map to a given size of filter at 410 A, 410 B, and 410 C as explained below with respect to FIG. 4 .
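- A minimal NumPy illustration of such thresholding is given below; the two cut-off values and the normalized [0, 1] depth range are assumptions, and the brighter-is-farther convention follows the example above.

```python
import numpy as np

def segment_by_depth(depth_map, far_threshold=0.66, near_threshold=0.33):
    """Split a depth map (larger value = farther from the camera) into three
    boolean masks, one per segmented region."""
    far_mask = depth_map >= far_threshold                 # e.g., segment 298A
    mid_mask = (depth_map >= near_threshold) & ~far_mask  # e.g., segment 298B
    near_mask = depth_map < near_threshold                # e.g., segment 298C
    return far_mask, mid_mask, near_mask

depth_map = np.random.rand(240, 320)   # assumed depth values normalized to [0, 1]
masks = segment_by_depth(depth_map)
print([int(m.sum()) for m in masks])   # pixel count assigned to each segment
```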
- FIG. 3A depicts an example of an artificial neuron Aj 350 which may be implemented in a neural network, such as a CNN, an MCCNN, and/or the like, in accordance with some example embodiments. It will be appreciated that FIG. 3A represents a model of an artificial neuron 350 , and the neuron 350 can have other configurations including quantities of inputs and/or quantities of outputs.
- the neuron 350 may include a plurality of inputs to receive the pixel related values of an image.
- the neuron 350 may generate an output A_j(t) 370 based on activation values A_i(t−1) (which correspond to A0-A7) 360 A-H, connection weights w_ij 365 A-H (which are labeled w_0j through w_7j), and input values 310 A-H (labeled S0-S7).
- each one of the activation values 360 A-H may be multiplied by one of the corresponding weights 365 A-H.
- connection weight w_0j 365 A is multiplied by activation value A0 360 A,
- connection weight w_1j 365 B is multiplied by activation value A1 360 B, and so forth.
- the products (i.e., of the multiplications of the connection weights and activation values) may then be summed, and the sum passed through the basis function K to generate the output A_j(t) 370.
- the outputs 370 may be used as an activation value at a subsequent time (e.g., at t+1) or provided to another node.
- the neuron 350 may be implemented in accordance with a neural model such as A_j(t) = K( Σ_{i=0..n} A_i(t−1) · w_ij ), where:
- K corresponds to a basis function (examples of which include a sigmoid, a wavelet, and any other basis function)
- A_j(t) corresponds to the output value provided by a given neuron (e.g., the j-th neuron) at a given time t
- A_i(t−1) corresponds to a prior output value (or activation value) assigned to connection i of the j-th neuron at a previous time t−1
- w_ij represents the weight of the i-th connection to the j-th neuron
- j varies in accordance with the quantity of neurons, and the values of i vary from 0 to n
- n corresponds to the number of connections to the neuron.
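- Written out in Python, the neuron model above amounts to a weighted sum passed through the basis function K; a sigmoid is used for K here purely as an example.

```python
import math

def neuron_output(activations, weights,
                  basis=lambda s: 1.0 / (1.0 + math.exp(-s))):
    """A_j(t) = K( sum_i A_i(t-1) * w_ij ), with a sigmoid as the example K."""
    weighted_sum = sum(a * w for a, w in zip(activations, weights))
    return basis(weighted_sum)

# Eight inputs, mirroring activation values A0-A7 and weights w_0j-w_7j in
# FIG. 3A; the numeric values are illustrative only.
activations = [0.1, 0.9, 0.3, 0.0, 0.7, 0.2, 0.5, 0.4]
weights = [0.5, -0.2, 0.8, 0.1, 0.0, 0.3, -0.7, 0.9]
print(neuron_output(activations, weights))
```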
- FIG. 3B depicts interconnected neurons 350 forming a neural network 399 , in accordance with some example embodiments.
- the neural network 399 may be configured to provide a CNN, such as CNNs 200 , 299 , an MCCNN, or portions, such as layers of a neural network (e.g., convolutional layer 210 may be implemented using a plurality of interconnected neurons 350 ).
- the neuron 350 including the neural network 399 may be implemented using code, circuitry, and/or a combination thereof.
- the neuron 350 and/or the neural network 399 may be implemented using specialized circuitry including, for example, at least one graphics processing unit (GPU, which is configured to better handle parallel processing, matrix operations, and/or the like when compared to a traditional central processing unit) or dedicated neural network circuitry.
- GPU graphics processing unit
- the neural network 399 may include an input layer 360 A, one or more hidden layers 360 B, and an output layer 360 C. Although not shown, other layers may be implemented as well, such as a pooling layer. It will be appreciated that the neural network's 3-2-3 node structure is used to facilitate explanation and, as such, the neural network 399 may be structured in other configurations, such as a 3×3 structure (with or without hidden layer(s)), a 5×5 structure (with or without hidden layer(s)), a 7×7 structure (with or without hidden layer(s)), and/or other structures (with or without hidden layer(s)) as well.
- training data such as reference images with labels (e.g., indicating segments, depth maps, crowd counts, and/or the like)
- reference images with labels e.g., indicating segments, depth maps, crowd counts, and/or the like
- the neural network 399 may receive labeled training data, such as reference images with the proper segments labeled, so that a CNN such as CNN 299 can train iteratively until it learns to form a depth map and/or segments for images.
- the neurons of the network may learn by minimizing a mean square error (e.g., between the labeled training data at the input layer 360 A and what is generated at the output of the output layer 360 C) using gradient descent and/or the like.
- the neural network's configuration such as the values of the weights, activation values, basis function, and/or the like, can be saved to storage. This saved configuration represents the trained neural network.
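- As one hedged illustration of this kind of training loop (not the specific procedure of this disclosure), a PyTorch sketch using mean squared error, gradient descent, and a final save of the learned configuration could look like this; the model and data_loader are assumed to be defined elsewhere from the labeled reference data.

```python
import torch
import torch.nn as nn

def train_and_save(model, data_loader, path="trained_model.pt",
                   epochs=10, lr=1e-3):
    """Iteratively minimize the mean squared error between the model output
    (e.g., a predicted depth map or density map) and the labels, then save
    the learned configuration (weights) to storage."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # gradient descent

    for _ in range(epochs):
        for images, labels in data_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

    torch.save(model.state_dict(), path)  # saved configuration = trained model
```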
- the CNN 299 may be used to segment image 100 into regions 298 A-C.
- each of the segmented regions 298 A-C may have about the same size object (e.g., head or people size), and thus map to a given size of filter at 410 A, 410 B, and 410 C.
- the segmented regions 298 A-C (and/or filter sizes for the regions) may be provided to another machine learning model, such as an MCCNN 400 as shown in FIG. 4 , in accordance with some example embodiments.
- the MCCNN 400 may include a CNN 405 A-C for each of the regions segmented in the image.
- the first CNN 405 A may include a first convolutional layer 410 A providing a filter of, for example, 3×3 pixels. This filter may be selected based on the size of the objects in the segmented region 298 A. As noted above, the segmented region 298 A may have about the same size objects (e.g., head or people size), so segmented region 298 A may map to the filter size of 3×3 pixels at 410 A, for example. In other words, the depth information defining where the segments are in image 100 may also enable the MCCNN to select the proper filter size for each segment 298 A-C.
- the first convolutional layer 410 A may be followed by a convolutional layer 412 A, a pooling layer 414 A, a convolutional layer 416 A, a pooling layer 417 A, a convolutional layer 418 A, and a fully connected layer 420 A.
- the first CNN 405 A includes a certain configuration of intermediate layers 412 A- 418 A, other types and/or quantities of layers may be implemented as well.
- the second CNN 405 B may include a first convolutional layer 410 B providing a filter of, for example, 5×5 pixels. This 5×5 pixel filter may be selected based on the size of the objects in the segmented region 298 B. As noted above with respect to filter 410 A, the segmented region 298 B may have about the same size object (e.g., head or people size), so segmented region 298 B may map to the filter size of 5×5 pixels at 410 B, for example.
- the first convolutional layer 410 B may be followed by a convolutional layer 412 B, a pooling layer 414 B, a convolutional layer 416 B, a pooling layer 417 B, a convolutional layer 418 B, and a fully connected layer 420 B.
- the second CNN 405 B includes a certain configuration of intermediate layers 412 B- 418 B, other types and/or quantities of layers may be implemented as well.
- the third CNN 405 C may include a first convolutional layer 410 C providing a filter of, for example, 7×7 pixels. This filter may be selected based on the size of the objects in the segmented region 298 C.
- the segmented region 298 C may also have about the same size objects (e.g., head or people sizes), so segmented region 298 C may map to the filter size of 7×7 pixels at 410 C, for example.
- the depth information defining where the segments are in image 100 may also enable selection of the proper filter size for each segment.
- the first convolutional layer 410 C may be followed by a convolutional layer 412 C, a pooling layer 414 C, a convolutional layer 416 C, a pooling layer 417 C, a convolutional layer 418 C, and a fully connected layer 420 C.
- the third CNN 405 C includes a certain configuration of intermediate layers 412 C- 418 C, other types and/or quantities of layers may be implemented as well.
- the MCCNN 400 (which in this example includes three CNN columns) may include the first CNN 405 A having the filter 410 A, which samples the first segmented region 298 A and outputs a first density map 498 A for the first region; the second CNN 405 B having the filter 410 B, which samples the second segmented region 298 B of the image and outputs a second density map 498 B for the second region; and the third CNN 405 C having the filter 410 C, which samples the third segmented region 298 C of the image and outputs a third density map 498 C for the third region.
- the density map 499 may, as noted, provide an estimate of the quantity of objects per square unit of distance, from which the quantity of objects in the image and the distribution of the objects across the image can be determined.
- the objects are people, although other types of object may be counted in the image as well.
- the filters 410 A-C in each of the column CNNs 405 A-C may, as noted, be selected based on the size of the objects in the corresponding region and, in particular, the size-induced perspective differences in the image. For example, in a given segmented region 298 A-C of the image, the people (or their heads) may have the same or similar perspective and thus the same or similar size. As such, the filter 410 A for the first CNN 405 A may be a smaller filter to take into account the similar people/head sizes in the region 298 A farther away from the camera, when compared to the filter 410 B for region 298 B, which is closer to the camera (and thus would require a larger filter).
- the filter 410 B for the second CNN 405 B handling the region 298 B may be a smaller filter, when compared to the filter 410 C for the third CNN 405 C handling the region 298 C.
- the MCCNN 400 may select the filters at 410 A-C based on the depth information for each of the three regions 298 A-C, and each region may be processed using one of the corresponding column CNNs 405 A, B, or C configured specifically for the approximate size of the object (e.g., heads or people) in the corresponding region.
- the MCCNN 400 may thus select, based on the depth information indicative of the segment and object size in the segment, the size of the corresponding initial filter 410 A, B, or C, so that the objects in the region can pass through the corresponding filter.
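- A compact PyTorch sketch of a three-column network along these lines is shown below. The per-column initial kernel sizes (3×3, 5×5, 7×7) follow the example above, while the channel counts, column depth, and the masked-sum way of merging the column outputs are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def make_column(first_kernel):
    """One column CNN whose first filter size matches the object scale of the
    segment it handles (e.g., 3x3 for far objects, 7x7 for near objects)."""
    pad = first_kernel // 2
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=first_kernel, padding=pad), nn.ReLU(),
        nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 1, kernel_size=1),   # single-channel density output
    )

class ToyMCCNN(nn.Module):
    def __init__(self, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.columns = nn.ModuleList([make_column(k) for k in kernel_sizes])

    def forward(self, image, masks):
        """masks: one {0,1} tensor per segment, same spatial size as image."""
        combined = torch.zeros_like(image[:, :1])
        for column, mask in zip(self.columns, masks):
            density = column(image)        # per-segment density map (cf. 498A-C)
            combined = combined + density * mask
        return combined                    # whole-image density map (cf. 499)

image = torch.randn(1, 3, 120, 160)
rows = torch.arange(120).view(1, 1, 120, 1).expand(1, 1, 120, 160)
masks = [(rows < 40).float(),
         ((rows >= 40) & (rows < 80)).float(),
         (rows >= 80).float()]
density = ToyMCCNN()(image, masks)
print(density.shape, float(density.sum()))  # density map and estimated count
```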
- the MCCNN 400 may be trained using a reference set of images. These reference images may have been segmented and may have known density maps for each of the segments. Reference images may represent ground truth in the sense that the quantity of people in (or the density map for) the image(s) (or segment(s)) may be known to a certain degree of certainty.
- the MCCNN 400 may then be trained until the MCCNN can learn to generate a density map for the reference images. Once trained, the trained MCCNN may be used to determine density maps for other input images, in accordance with some example embodiments.
- the image 100 (which is being processed to determine an object count) may represent a video stream captured by at least one camera, such as an omnidirectional, or multi-view, camera and/or the like.
- An example of an omnidirectional, multi-view camera is the Nokia OzO camera, which may generate 360-degree panoramic images in multiple planes.
- the images from the camera can be input to the CNN 299 and/or MCCNN 400 in order to enable generation of a density map and a corresponding crowd count in each image.
- the OzO camera may include a plurality of cameras, and the images from each of these cameras can be processed to enable segmentation and/or determine a density map from which a crowd count can be determined.
- each camera of an OzO camera may be input into a separate CNN of the MCCNN and then the output density maps may be combined to form an aggregate density map 499 .
- FIG. 5A depicts a process flow for training a machine learning model, such as CNN 299 to learn how to generate depth information, such as depth maps, to enable image segmentation, in accordance with some example embodiments.
- the description of FIG. 5A refers to FIGS. 1 and 2B .
- At 502, at least one reference image labeled with depth information may be received, in accordance with some example embodiments.
- the CNN 299 may receive reference images having labels indicating the depth of each image.
- each reference image may have a corresponding depth map and/or the location of the segments within the image.
- the objects in the segments in the reference image(s) may be about the same distance from the camera and, as such, have about the same size to enable filtering with the same size filter.
- a machine learning model may be trained to learn based on the received reference images, in accordance with some example embodiments.
- the CNN 299 may train, based on the received images, to learn how to generate the depth information, such as the depth map, the location of the segments for received reference images, and/or the size of the objects (or filter size) for each segment.
- the training may be iterative using gradient descent and/or the like.
- once training is complete, the CNN's configuration (e.g., values of the weights, activation values, basis function, and/or the like) may be saved to storage.
- This saved configuration represents the trained CNN, which can be used, in an operational phase, to determine depth information, such as depth maps, segment locations, and/or object sizes (or filter sizes), for images other than the reference images.
- FIG. 5B depicts a process flow for training a machine learning model, such as an MCCNN to provide object count information, in accordance with some example embodiments.
- the description of FIG. 5B refers to FIGS. 1 and 4.
- At 512, at least one reference image labeled with density information may be received, in accordance with some example embodiments.
- the MCCNN 400 may receive reference images having labels indicating the segments in each image, and a density of the objects, such as people/heads per square meter, object count, and/or the like, in the segment.
- each segment may have a corresponding density map to enable training.
- each of the segments may have about the same size objects (with respect to perspective), so a given filter can be used on the objects in the corresponding segment.
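- The disclosure does not spell out how the reference density labels are produced; one common convention in the crowd-counting literature (an assumption here, following work such as Zhang et al.) is to place a Gaussian kernel at each annotated head location, as sketched below.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_from_annotations(head_points, height, width, sigma=4.0):
    """Build a reference density map by placing a unit impulse at each
    annotated head location and blurring with a Gaussian, so that the map
    sums (approximately) to the number of annotated people."""
    impulses = np.zeros((height, width), dtype=np.float32)
    for y, x in head_points:
        impulses[min(int(y), height - 1), min(int(x), width - 1)] += 1.0
    return gaussian_filter(impulses, sigma=sigma)

gt = density_from_annotations([(10, 12), (30, 44), (31, 45)], height=60, width=80)
print(round(float(gt.sum()), 2))   # ~3.0, i.e., the annotated head count
```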
- a machine learning model may be trained to learn to determine density maps, in accordance with some example embodiments.
- the MCCNN 400 may train based on the received reference images to learn how to generate the object density information, such as a density map, count, and/or the like.
- each column CNN 405 A-C of the MCCNN may be trained using a first convolutional layer having a filter selected specifically to account for the size induced perspective effects of that region being handled by the column CNN.
- once training is complete, the MCCNN's configuration (e.g., values of the weights, activation values, basis function, and/or the like) may be saved to storage.
- This saved configuration represents the trained MCCNN, which can be used, in an operational phase, to determine density information, such as density maps and object counts, for images other than the reference images.
- FIG. 5C depicts a process flow for a trained machine learning model in an operational phase, in accordance with some example embodiments.
- the description of FIG. 5C refers to FIGS. 1 and 2B.
- At 522 at least one image may be received by the trained machine learning model, in accordance with some example embodiments.
- the trained CNN 299 may receive at least one image 100 requiring an estimate of an object count.
- the trained CNN may process the at least one input image 100 to determine, at 524 , depth information, which may be in the form of a depth map, and/or an indication of where the at least one image should be segmented, in accordance with some example embodiments.
- the depth information may also indicate the size of the objects in the segment(s) and/or the corresponding filter size for the segment(s).
- the trained machine learning model such as the trained CNN 299 , may output depth information to another machine learning model, such as the MCCNN 400 , in accordance with some example embodiments.
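- How depth information translates into a concrete filter size is not specified beyond the 3×3/5×5/7×7 example; the helper below is one hedged interpretation that maps a segment's mean depth to one of those kernel sizes, with the depth cut-offs being assumptions.

```python
import numpy as np

def filter_size_for_segment(depth_map, segment_mask,
                            cutoffs=(0.33, 0.66), sizes=(7, 5, 3)):
    """Pick an initial filter size from a segment's mean depth: nearer
    segments (larger apparent objects) get the larger 7x7 filter, farther
    segments (smaller apparent objects) get the smaller 3x3 filter."""
    mean_depth = float(depth_map[segment_mask].mean())  # larger = farther here
    if mean_depth < cutoffs[0]:
        return sizes[0]   # near segment, e.g., 298C -> 7x7
    if mean_depth < cutoffs[1]:
        return sizes[1]   # mid segment, e.g., 298B -> 5x5
    return sizes[2]       # far segment, e.g., 298A -> 3x3

depth_map = np.random.rand(240, 320)
far_mask = depth_map > 0.7                           # an assumed "far" segment
print(filter_size_for_segment(depth_map, far_mask))  # -> 3
```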
- FIG. 5D depicts a process flow for a trained machine learning model in an operational phase, in accordance with some example embodiments.
- the description of FIG. 5D refers to FIGS. 1 and 4.
- At 532 at least one image may be received by the trained machine learning model, in accordance with some example embodiments.
- the trained MCCNN 400 may receive at least one image.
- the image may be received with depth information to enable segmentation of the image 100 into a plurality of portions.
- the image 100 is segmented into 3 portions 298 A-C, although other quantities of segments may be used as well.
- the depth information may enable the MCCNN to select a filter at 410 A-C that is sized to process the size of objects found in each of the segments 298 A-C.
- Each segmented region 298 A-C may be processed, at 534 , by a CNN 405 A-C of the MCCNN 400 , in accordance with some example embodiments.
- the image may be segmented, based on depth information, into segments that take into account the perspective-induced size differences.
- This enables each of the CNN's 405 A-C to have a filter better suited for the size of the objects, such as heads, people, and/or the like, in the corresponding segment being handled by the corresponding CNN.
- the CNN 405 A handles the segment containing objects in the background (which appear smaller due to perspective), so the convolutional layer's 410 A filter is, for example, a 3×3 matrix to accommodate the relatively smaller sized heads and/or people.
- the trained machine learning model may generate a density map for each segmented region of the image, in accordance with some example embodiments. As shown at FIG. 4 , each column CNN 405 A-C generates a density map 498 A-C.
- the trained machine learning model may combine the density maps for each region to form a density map for the entire image received at the input, in accordance with some example embodiments.
- the MCCNN 400 may combine the density maps 498 A-C into density map 499 , which represents the density map 499 for the entire image 100 .
- a trained machine learning model may output an indication of the object count, in accordance with some example embodiments.
- the MCCNN 400 may output the density map 499 or further process the density map to provide a count, such as a people count, for the entire image or a count for a portion of the image.
- FIG. 6 depicts a block diagram illustrating a computing system 600 , in accordance with some example embodiments.
- the computing system 600 may be used to implement machine learning models, such as CNN 200 , CNN 299 , MCCNN 400 , and/or the like as disclosed herein including FIGS. 5A-5D to perform counting of objects in images, in accordance with some example embodiments.
- the processor 610 may be capable of processing instructions stored in the memory 620 and/or on the storage device 630 to display graphical information for a user interface provided via the input/output device 640 .
- the memory 620 may be a computer readable medium, such as volatile or non-volatile memory, that stores information within the computing system 600.
- the memory 620 can store instructions, such as computer program code.
- the storage device 630 may be capable of providing persistent storage for the computing system 600 .
- the storage device 630 can be a floppy disk device, a hard disk device, an optical disk device, or a tape device, or other suitable persistent storage mechanism.
- the input/output device 640 provides input/output operations for the computing system 600 .
- the input/output device 640 includes a keyboard and/or pointing device. In various implementations, the input/output device 640 includes a display unit for displaying graphical user interfaces. Alternatively or additionally, the input/output device 640 may include wireless and/or wired interface to enable communication with other devices, such as other network nodes. For example, the input/output device 640 can include an Ethernet interface, a WiFi interface, a cellular interface, and/or other wired and/or wireless interface to allow communications with one or more wired and/or wireless networks and/or devices.
- the apparatus 10 may comprise, or be comprised in, an apparatus, such as a mobile phone, smart phone, camera (e.g., OzO, closed circuit television, webcam), drone, self-driving vehicle, car, unmanned aerial vehicle, autonomous vehicle, and/or Internet of Things (IoT) sensor (such as a traffic sensor, industrial sensor, and/or the like) to enable counting of objects, in accordance with some example embodiments.
- the processor 20 may, for example, be embodied in a variety of ways including circuitry, at least one processing core, one or more microprocessors with accompanying digital signal processor(s), one or more processor(s) without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits (for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), and/or the like), or some combination thereof. Accordingly, although illustrated in FIG. 7 as a single processor, in some example embodiments the processor 20 may comprise a plurality of processors or processing cores.
- the apparatus 10 may be capable of operating with one or more air interface standards, communication protocols, modulation types, access types, and/or the like.
- Signals sent and received by the processor 20 may include signaling information in accordance with an air interface standard of an applicable cellular system, and/or any number of different wireline or wireless networking techniques, comprising but not limited to Wi-Fi, wireless local area network (WLAN) techniques, such as Institute of Electrical and Electronics Engineers (IEEE) 802.11, 802.16, 802.3, ADSL, DOCSIS, and/or the like.
- these signals may include speech data, user generated data, user requested data, and/or the like.
- the apparatus 10 and/or a cellular modem therein may be capable of operating in accordance with various first generation (1G) communication protocols, second generation (2G or 2.5G) communication protocols, third-generation (3G) communication protocols, fourth-generation (4G) communication protocols, fifth-generation (5G) communication protocols, Internet Protocol Multimedia Subsystem (IMS) communication protocols (for example, session initiation protocol (SIP)), and/or the like.
- the apparatus 10 may be capable of operating in accordance with 2G wireless communication protocols, such as IS-136 (Time Division Multiple Access, TDMA), Global System for Mobile communications (GSM), IS-95 (Code Division Multiple Access, CDMA), and/or the like.
- the apparatus 10 may be capable of operating in accordance with 2.5G wireless communication protocols General Packet Radio Service (GPRS), Enhanced Data GSM Environment (EDGE), and/or the like. Further, for example, the apparatus 10 may be capable of operating in accordance with 3G wireless communication protocols, such as Universal Mobile Telecommunications System (UMTS), Code Division Multiple Access 2000 (CDMA2000), Wideband Code Division Multiple Access (WCDMA), Time Division-Synchronous Code Division Multiple Access (TD-SCDMA), and/or the like. The apparatus 10 may be additionally capable of operating in accordance with 3.9G wireless communication protocols, such as Long Term Evolution (LTE), Evolved Universal Terrestrial Radio Access Network (E-UTRAN), and/or the like. Additionally, for example, the apparatus 10 may be capable of operating in accordance with 4G wireless communication protocols, such as LTE Advanced, 5G, and/or the like as well as similar wireless communication protocols that may be subsequently developed.
- the processor 20 and/or user interface circuitry comprising the processor 20 may be configured to control one or more functions of one or more elements of the user interface through computer program instructions, for example, software and/or firmware, stored on a memory accessible to the processor 20 , for example, volatile memory 40 , non-volatile memory 42 , and/or the like.
- the apparatus 10 may include a battery for powering various circuits related to the mobile terminal, for example, a circuit to provide mechanical vibration as a detectable output.
- the user input interface may comprise devices allowing the apparatus 10 to receive data, such as a keypad 30 (which can be a virtual keyboard presented on display 28 or an externally coupled keyboard) and/or other input devices.
- the apparatus 10 may include other short-range transceivers, such as an infrared (IR) transceiver 66 , a BluetoothTM (BT) transceiver 68 operating using BluetoothTM wireless technology, a wireless universal serial bus (USB) transceiver 70 , a BluetoothTM Low Energy transceiver, a ZigBee transceiver, an ANT transceiver, a cellular device-to-device transceiver, a wireless local area link transceiver, and/or any other short-range radio technology.
- Apparatus 10 and, in particular, the short-range transceiver may be capable of transmitting data to and/or receiving data from electronic devices within the proximity of the apparatus, such as within 10 meters, for example.
- the apparatus 10 may comprise memory, such as a subscriber identity module (SIM) 38 , a removable user identity module (R-UIM), an eUICC, an UICC, and/or the like, which may store information elements related to a mobile subscriber.
- the apparatus 10 may include volatile memory 40 and/or non-volatile memory 42 .
- volatile memory 40 may include Random Access Memory (RAM) including dynamic and/or static RAM, on-chip or off-chip cache memory, and/or the like.
- Non-volatile memory 42 which may be embedded and/or removable, may include, for example, read-only memory, flash memory, magnetic storage devices, for example, hard disks, floppy disk drives, magnetic tape, optical disc drives and/or media, non-volatile random access memory (NVRAM), and/or the like. Like volatile memory 40 , non-volatile memory 42 may include a cache area for temporary storage of data. At least part of the volatile and/or non-volatile memory may be embedded in processor 20 .
- the memories may store one or more software programs, instructions, pieces of information, data, and/or the like which may be used by the apparatus for performing operations disclosed herein including, for example, processing, by the trained machine learning model, a first segment of an image and a second segment of the image, the first segment being processed using a first filter selected, based on depth information, to enable formation of a first density map, and the second segment being processed using a second filter selected, based on the depth information, to enable formation of a second density map; combining, by the trained machine learning model, the first density map and the second density map to form a density map for the image; providing, by the trained machine learning model, an output based on the density map, the output being representative of an estimate of a quantity of objects in the image, and/or other aspects disclosed herein with respect to the CNN, MCCNN 400 , and/or the like for counting of objects in images.
- the processor 20 may be configured using computer code stored at memory 40 and/or 42 to at least including, for example, processing, by the trained machine learning model, a first segment of an image and a second segment of the image, the first segment being processed using a first filter selected, based on depth information, to enable formation of a first density map, and the second segment being processed using a second filter selected, based on the depth information, to enable formation of a second density map; combining, by the trained machine learning model, the first density map and the second density map to form a density map for the image; and/or other aspects disclosed herein with respect to the CNN, MCCNN 400 , and/or the like for counting of objects in images.
- a “computer-readable medium” may be any non-transitory medium that can contain, store, communicate, propagate, or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer or data processor circuitry, with examples depicted at FIG. 7.
- computer-readable medium may comprise a non-transitory computer-readable storage medium that may be any media that can contain or store the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
- the base stations and user equipment (or one or more components therein) and/or the processes described herein can be implemented using one or more of the following: a processor executing program code, an application-specific integrated circuit (ASIC), a digital signal processor (DSP), an embedded processor, a field programmable gate array (FPGA), and/or combinations thereof.
- These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
- These computer programs (also known as programs, software, software applications, applications, components, program code, or code) include machine instructions for a programmable processor.
- computer-readable medium refers to any computer program product, machine-readable medium, computer-readable storage medium, apparatus and/or device (for example, magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions.
- systems are also described herein that may include a processor and a memory coupled to the processor.
- the memory may include one or more programs that cause the processor to perform one or more of the operations described herein.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Description
- The subject matter described herein relates to machine learning.
- Machine learning technology enables computers to learn tasks. For example, machine learning may allow a computer to learn to perform a task during a training phase. Later, during an operational phase, the computer may be able to perform the learned task. Machine learning may take the form of a neural network, such as a deep learning neural network, a convolutional neural network (CNN), a support vector machine, a Bayes classifier, and other types of machine learning models.
- Methods and apparatus, including computer program products, are provided for depth-aware object counting.
- In some example embodiments, there may be provided a method that includes processing, by the trained machine learning model, a first segment of an image and a second segment of the image, the first segment being processed using a first filter selected, based on depth information, to enable formation of a first density map, and the second segment being processed using a second filter selected, based on the depth information, to enable formation of a second density map; combining, by the trained machine learning model, the first density map and the second density map to form a density map for the image; and providing, by the trained machine learning model, an output based on the density map, the output being representative of an estimate of a quantity of objects in the image.
- In some variations, one or more of the features disclosed herein including the following features can optionally be included in any feasible combination. The trained machine learning model may receive the image including a plurality of objects, wherein the image is segmented, based on the depth information, into at least the first segment and the second segment. The depth information may be received from another machine learning model trained to output the depth information from the image. The trained machine learning model may include a multicolumn convolutional neural network including a first convolutional neural network and a second convolutional neural network. The first convolutional network may include the first filter. The second convolutional network may include the second filter. The first filter and the second filter each include a convolutional layer. The depth information may indicate the location of the first segment and/or the second segment. The depth information may indicate an object size due to distance from a camera. The depth information may indicate a first filter size of the first filter and a second filter size of the second filter. The trained machine learning model may select, based on the depth information, the first filter size of the first filter and the second filter size of the second filter. The training may be based on reference images, such that the machine learning model trains to learn generation of density maps. The plurality of objects may include a plurality of people, a plurality of vehicles, and/or a crowd of people. The first density map may estimate a density of objects in the first segment. The second density map may estimate a density of objects in the second segment. The density map may estimate a density of objects in the image.
- The above-noted aspects and features may be implemented in systems, apparatus, methods, and/or articles depending on the desired configuration. The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims.
- In the drawings,
- FIG. 1 depicts an example of an image including a crowd of people and a corresponding density map, in accordance with some example embodiments;
- FIG. 2A depicts an example of a convolutional neural network (CNN), in accordance with some example embodiments;
- FIG. 2B depicts another example of a CNN, in accordance with some example embodiments;
- FIG. 3A depicts an example of a neuron for a neural network, in accordance with some example embodiments;
- FIG. 3B depicts an example of a neural network including at least one neuron, in accordance with some example embodiments;
- FIG. 4 depicts a multicolumn convolutional neural network (MCCNN), in accordance with some example embodiments;
- FIGS. 5A-5D depict process flows for determining an object count, in accordance with some example embodiments;
- FIG. 6 depicts an example of an apparatus, in accordance with some example embodiments; and
- FIG. 7 depicts another example of an apparatus, in accordance with some example embodiments.
- Like labels are used to refer to same or similar items in the drawings.
- Machine learning may be used to perform one or more tasks, such as counting a quantity of objects within at least one image. For example, a machine learning model, such as a neural network, a convolutional neural network (CNN), a multi-column CNN (MCCNN), and/or other type of machine learning, can be trained to learn how to process at least one image to determine an estimate of the quantity of objects, such as people or other types of objects, in the at least one image (which may be in the form of frames of a video). To illustrate further by way of another example, public safety officials may want to know a crowd count at a given location, which can be useful for a variety of reasons including crowd control, restricting the quantity of people at a location, minimizing the risk of a stampede, and/or minimizing the risk of some other large-group-related mayhem. To illustrate further by way of another example, traffic safety officials may want to know a count of vehicles on a road (or at a location), and this count may be useful for a variety of reasons including traffic congestion control and management. The trained machine learning model may be used to count objects, such as people, vehicles, or other objects, in at least one image, in accordance with some example embodiments.
- When counting objects in an image, the trained machine learning model may provide an actual count of the quantity of objects estimated to be in an image, or may provide a density map providing an estimate of the quantity of objects per square unit of distance, such as quantity of objects per square meter. The density map may provide more information in the sense that the density map may estimate the quantity of objects in the image and the distribution, or density, of objects across the image.
- Although some of the examples described herein refer to counting people in images, this is merely an example of the types of objects that can be counted, as other types of objects, such as vehicles and/or the like, may be counted as well.
-
FIG. 1 depicts an example of animage 100 including objects to be counted 100 and acorresponding density map 105, in accordance with some example embodiments. In the example ofFIG. 1 , the objects represent people, although as noted the objects may represent other types of objects as well. - The
density map 105 may provide information about the objects, such as people, inimage 100, such a density of people per square meter, a distribution of people across the image, and/or as a count of the quantity of people in at least a portion of the image. In the crowd counting example, the scale of the objects, such as people, in the image may change due to size (e.g., scale) changes caused by the perspective of the camera in relationship to the people. For example, a person in the foreground of theimage 100 may appear larger as that person is closer to the camera, when compared to a similarly sized person in the background and thus farther away from the camera. This perspective caused size variation may affect the accuracy of the count of objects in the at least oneimage 100 and the accuracy of thecorresponding density map 105. - In some example embodiments, a machine learning model, such as a neural network, a CNN, an MCCNN, and/or the like, may be used to determine an estimate of the quantity of objects, such as people, in an image. The estimate may be in the form of a density map of the image. In some example embodiments, the machine learning model may be implemented as an MCCNN, although other types of machine learning models may be used as well. In the case of an MCCNN, crowd counting is described in the paper by Y. Zhang et al., “Single-image crowd counting via multi-column convolutional neural network,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
- In some example embodiments, the
density map 105 of animage 100 may be determined by at least segmenting, based on the relative distances of objects such as people from a camera viewpoint, the whole image into at least two regions, although the image may be segmented into other quantities as well (e.g., 3, 4, or more segmented regions). For each segmented region, the machine learning model, such as the MCCNN configured with at least one filter selected to handle the object sizes (e.g., head or people sizes) in the corresponding region, may determine a density map, in accordance with some example embodiments. The density maps for each of the segmented regions may then be combined to form adensity map 105 for thewhole image 100, in accordance with some example embodiments. Without in any way limiting the scope, interpretation, or application of the claims appearing below, a technical effect of one or more of the example embodiments disclosed herein may be enhanced processing speed due to the segmentation of the images, when compared to processing the whole image, and/or another technical effect of one or more of the example embodiments disclosed herein may be more accurate counting as each segment is processed with a filter specifically to account for the size induced perspective effects for that region and the objects in that region. -
- FIG. 2A depicts an example of a CNN 200, in accordance with some example embodiments. The CNN may include at least one convolutional layer 210, 230, at least one pooling layer 220, 240, and a fully connected layer 250.
- The convolution layer 210 may be referred to as a filter, and may comprise a matrix that is convolved with at least a portion of the input image 100. As noted above, the size of this filter, or matrix, may vary in order to detect and filter the object. In this example, a 7 by 7 matrix is selected as the filter at 210 to convolve with image 100, so objects to be counted would need to be less than 7×7 pixels in order to be properly captured (while objects larger than 7×7 would be filtered out). The pooling layer 220 may be used to downsample the convolved image output by the convolution layer 210. To downsample the convolved image into a smaller image, the pooling layer may be formed by a sliding window (or vector) sliding across the convolved image output by the convolution layer 210. The pooling layer may have a stride length representative of the width of the window in pixels. The fully connected layer 250 may generate an output 204.
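A minimal sketch of a network following the layer order just described (7×7 convolution, pooling, 5×5 convolution, pooling, fully connected layer) is shown below; the channel counts, padding, input resolution, and single-value output are illustrative assumptions rather than details taken from the description.

```python
import torch
from torch import nn

class SmallCountingCNN(nn.Module):
    """Illustrative stand-in for a CNN such as CNN 200."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=7, padding=3),   # 7x7 filter (cf. 210)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling (cf. 220)
            nn.Conv2d(16, 32, kernel_size=5, padding=2),  # 5x5 filter (cf. 230)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling (cf. 240)
        )
        self.fully_connected = nn.Linear(32 * 16 * 16, 1)  # output (cf. 204)

    def forward(self, x):
        x = self.features(x)
        return self.fully_connected(x.flatten(start_dim=1))

# Example: one 64x64 RGB image in, one scalar (e.g., a count estimate) out.
out = SmallCountingCNN()(torch.randn(1, 3, 64, 64))
```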
- FIG. 2B depicts another example of a CNN 299, in accordance with some example embodiments. The CNN 299 may be configured to determine how to segment, based on a depth map, an input image. The depth map provides information regarding the relative distances of objects, such as people, heads, and/or the like, from the camera. For example, CNN 299 may determine, based on the depth map 277, segments 298A-C for the input image 100. The size-based effects of perspective in a given segment may be the same or similar, so the filter convolving the segment may be better able to detect the object of interest, such as the heads, people, and/or the like.
- In the example of FIG. 2B, the CNN 299 may be trained to determine a depth map 277, in accordance with some example embodiments. The depth map 277 may, as noted, provide an indication of the relative distances of objects (e.g., people, heads, and/or the like) from the camera. As such, the depth map may provide an indication of the perspective-caused size differences in the image. In the depth map 277, objects farther away from the camera may have pixels that are brighter, when compared to objects that are closer to the camera. As such, the depth map 277 may be used to segment, based on the perspective-based size differences, the image 100 into two or more segmented regions, such as 298A-C. Although the previous example uses a depth map having brighter pixels for objects farther away, the pixels may be darker or have other values to signify depth.
- To illustrate further, the first segmented region 298A may have objects appearing smaller in size (due to perspective), when compared to the second segmented region 298B. And, the second segmented region 298B may have objects appearing smaller in size (due to the perspective), when compared to the third segmented region 298C. Although the previous example segmented image 100 into three segments, other quantities of segments may be used as well.
- In some example embodiments, the CNN 299 may be trained using reference images. These reference images may include objects, such as people in a crowd, and labels indicating the segments determined a priori based on relative size differences caused by perspective. Moreover, these segments of the reference images may correspond to certain sized objects in each of the segments and, as such, to corresponding filter sizes. The CNN may then be trained until the CNN can learn to segment the reference images, which may also dictate the filter size to be used for each segment. Once trained, the trained CNN 299 may be used to determine the segments in other input images, in accordance with some example embodiments. The training of the CNN 299 is described further below with respect to FIG. 5A.
- In the example of FIG. 2B, the CNN 299 may include a 7×7 convolutional layer 210 (which is the initial filter layer), followed by a 3×3 pooling layer 220, followed by a 5×5 convolutional layer 230, followed by a 3×3 pooling layer 240, followed by a 3×3 convolution layer 265, followed by a 3×3 convolutional layer 267, followed by a 3×3 pooling layer 268, and then coupled to a fully connected layer 250 (also referred to as an activation layer). The fully connected layer may generate an output, which in this example is a depth map 277. Although the CNN 299 is depicted as having a certain configuration of layers, other types and quantities of layers may be implemented as well to provide machine learning that generates the depth map 277 and associated segments 298A-C. In some example embodiments, one or more thresholds may be used to form segments 298A-C. For example, pixels brighter than a certain threshold value may be assigned to segment 298A, while pixels darker than a certain threshold may be assigned to segment 298C. Moreover, each of the segments 298A-C may, as noted, have a certain size of object and thus map to a given size of filter at 410A, 410B, and 410C as explained below with respect to FIG. 4.
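The thresholding just described can be sketched as follows; the convention that brighter (larger) depth values mean farther away follows the description above, while the specific threshold values are assumptions chosen only for illustration.

```python
import numpy as np

def segment_by_depth(depth_map, far_threshold=0.66, near_threshold=0.33):
    """Split a normalized depth map into three regions by thresholding."""
    far_mask = depth_map >= far_threshold     # e.g., segment 298A (small objects)
    near_mask = depth_map <= near_threshold   # e.g., segment 298C (large objects)
    middle_mask = ~(far_mask | near_mask)     # e.g., segment 298B
    return far_mask, middle_mask, near_mask

# Each mask can then be associated with a filter size for its segment,
# for example 3x3 for the far segment, 5x5 for the middle, 7x7 for the near.
far, middle, near = segment_by_depth(np.random.rand(240, 320))
```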
- FIG. 3A depicts an example of an artificial neuron Aj 350 which may be implemented in a neural network, such as a CNN, an MCCNN, and/or the like, in accordance with some example embodiments. It will be appreciated that FIG. 3A represents a model of an artificial neuron 350, and the neuron 350 can have other configurations, including different quantities of inputs and/or outputs. For example, the neuron 350 may include a plurality of inputs to receive the pixel-related values of an image.
- Referring to FIG. 3A, the neuron 350 may generate an output Aj(t) 370 based on activation values Ai(t−1) (which correspond to A0-A7) 360A-H, connection weights wij 365A-H (which are labeled w0j through w7j), and input values 310A-H (labeled S0-S7). At a given time, t, each one of the activation values 360A-H may be multiplied by one of the corresponding weights 365A-H. For example, connection weight w0j 365A is multiplied by activation value A0 360A, connection weight w1j 365B is multiplied by activation value A1 360B, and so forth. The products (i.e., of the multiplications of the connection weights and activation values) are then summed, and the resulting sum is operated on by a basis function K to yield, at time t, the output Aj(t) 370 for node Aj 350. The outputs 370 may be used as an activation value at a subsequent time (e.g., at t+1) or provided to another node.
- The neuron 350 may be implemented in accordance with a neural model such as:
Aj(t) = K( Σ i=0…n wij · Ai(t−1) ),
- wherein K corresponds to a basis function (examples of which include a sigmoid, a wavelet, and any other basis function), Aj(t) corresponds to an output value provided by a given neuron (e.g., the jth neuron) at a given time t, Ai(t−1) corresponds to a prior output value (or activation value) assigned to connection i for the jth neuron at a previous time t−1, wij represents the ith connection value for the jth neuron, wherein j varies in accordance with the quantity of neurons, wherein the values of i vary from 0 to n, and wherein n corresponds to the number of connections to the neuron.
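A small numerical sketch of that neuron model follows, using a sigmoid as the basis function K (one of the examples mentioned above); the activation and weight values are invented for illustration.

```python
import numpy as np

def neuron_output(activations, weights):
    """Aj(t) = K( sum_i wij * Ai(t-1) ), with a sigmoid as the basis function K."""
    s = np.dot(weights, activations)   # weighted sum of prior activations
    return 1.0 / (1.0 + np.exp(-s))    # sigmoid basis function

A_prev = np.array([0.1, 0.9, 0.0, 0.3, 0.7, 0.2, 0.5, 0.4])  # A0..A7 at time t-1
w = np.array([0.2, -0.1, 0.4, 0.0, 0.3, -0.2, 0.1, 0.05])    # w0j..w7j
A_j_t = neuron_output(A_prev, w)                              # output Aj(t)
```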
-
FIG. 3B depicts interconnected neurons 350 forming a neural network 399, in accordance with some example embodiments. The neural network 399 may be configured to provide a CNN, such as CNNs 200, 299, an MCCNN, or portions, such as layers, of a neural network (e.g., convolutional layer 210 may be implemented using a plurality of interconnected neurons 350). The neurons 350 forming the neural network 399 may be implemented using code, circuitry, and/or a combination thereof. In some example embodiments, the neuron 350 and/or the neural network 399 (which includes the neurons 350) may be implemented using specialized circuitry including, for example, at least one graphics processing unit (GPU, which is configured to better handle parallel processing, matrix operations, and/or the like when compared to a traditional central processing unit) or dedicated neural network circuitry.
- In the example of FIG. 3B, the neural network 399 may include an input layer 360A, one or more hidden layers 360B, and an output layer 360C. Although not shown, other layers may be implemented as well, such as a pooling layer. It will be appreciated that the neural network's 3-2-3 node structure is used to facilitate explanation and, as such, the neural network 399 may be structured in other configurations, such as a 3×3 structure (with or without hidden layer(s)), a 5×5 structure (with or without hidden layer(s)), a 7×7 structure (with or without hidden layer(s)), and/or other structures (with or without hidden layer(s)) as well.
- During training of a neural network, such as neural network 399, training data, such as reference images with labels (e.g., indicating segments, depth maps, crowd counts, and/or the like), may be fed as an input to the input layer 360A neurons over time (e.g., t, t+1, etc.) until the neural network 399 learns to perform a task. In the example of FIG. 3B, the CNN 399 may receive labeled training data, such as reference images with the proper segments labeled, so that the CNN 299 can train iteratively until it learns to form a depth map and/or segments for images. To illustrate further, the neurons of the network may learn by optimizing to a mean square error (e.g., between the labeled training data at the input layer 360A and what is generated at the output of the output layer 360C) using gradient descent and/or the like. When the neural network is trained, the neural network's configuration, such as the values of the weights, activation values, basis function, and/or the like, can be saved to storage. This saved configuration represents the trained neural network.
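The mean-square-error and gradient-descent training just described, together with saving the learned configuration, could look roughly like the following sketch; the data-loader name, learning rate, epoch count, and file path are assumptions for illustration.

```python
import torch
from torch import nn

def train(model, reference_loader, epochs=10, lr=1e-3, path="trained_model.pt"):
    """Train on (image, label) pairs, where the label might be a depth map,
    a density map, or a count, then save the learned configuration."""
    criterion = nn.MSELoss()                                # mean square error
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # gradient descent
    for _ in range(epochs):
        for image, label in reference_loader:
            optimizer.zero_grad()
            loss = criterion(model(image), label)
            loss.backward()                                 # propagate the error
            optimizer.step()                                # update the weights
    torch.save(model.state_dict(), path)  # saved configuration = trained network
```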
- Referring again to FIG. 2B, the CNN 299 may be used to segment image 100 into regions 298A-C. As noted above, each of the segmented regions 298A-C may have about the same size of object (e.g., head or people size), and thus map to a given size of filter at 410A, 410B, and 410C. Moreover, the segmented regions 298A-C (and/or filter sizes for the regions) may be provided to another machine learning model, such as an MCCNN 400 as shown in FIG. 4, in accordance with some example embodiments.
- In accordance with some example embodiments, the MCCNN 400 may include a CNN 405A-C for each of the regions segmented in the image. In the example of FIG. 4, there are three segmented regions 298A-C, so there are three columns in the MCCNN, each column including a corresponding one of the CNNs 405A-C.
- The first CNN 405A may include a first convolutional layer 410A providing a filter of, for example, 3×3 pixels. This filter may be selected based on the size of the objects in the segmented region 298A. As noted above, the segmented region 298A may have about the same size objects (e.g., head or people size), so segmented region 298A may map to the filter size of 3×3 pixels at 410A, for example. In other words, the depth information defining where the segments are in image 100 may also enable the MCCNN to select the proper filter size for each segment 298A-C. The first convolutional layer 410A may be followed by a convolutional layer 412A, a pooling layer 414A, a convolutional layer 418A, a pooling layer 417A, a convolutional layer 418A, and a fully connected layer 420A. Although the first CNN 405A includes a certain configuration of intermediate layers 412A-418A, other types and/or quantities of layers may be implemented as well.
- The second CNN 405B may include a first convolutional layer 410B providing a filter of, for example, 5×5 pixels. This 5×5 pixel filter may be selected based on the size of the objects in the segmented region 298B. As noted above with respect to filter 410A, the segmented region 298B may have about the same size objects (e.g., head or people size), so segmented region 298B may map to the filter size of 5×5 pixels at 410B, for example. The first convolutional layer 410B may be followed by a convolutional layer 412B, a pooling layer 414B, a convolutional layer 418B, a pooling layer 417B, a convolutional layer 418B, and a fully connected layer 420B. Although the second CNN 405B includes a certain configuration of intermediate layers 412B-418B, other types and/or quantities of layers may be implemented as well.
- The third CNN 405C may include a first convolutional layer 410C providing a filter of, for example, 7×7 pixels. This filter may be selected based on the size of the objects in the segmented region 298C. The segmented region 298C may also have about the same size objects (e.g., head or people sizes), so segmented region 298C may map to the filter size of 7×7 pixels at 410C, for example. In other words, the depth information defining where the segments are in image 100 may also enable selection of the proper filter size for each segment. The first convolutional layer 410C may be followed by a convolutional layer 412C, a pooling layer 414C, a convolutional layer 418C, a pooling layer 417C, a convolutional layer 418C, and a fully connected layer 420C. Although the third CNN 405C includes a certain configuration of intermediate layers 412C-418C, other types and/or quantities of layers may be implemented as well.
- In accordance with some example embodiments, the MCCNN 400 (which in this example includes three CNN columns) may include the first CNN 405A having the filter 410A, which samples the first segmented region 298A and outputs a first density map 498A for the first region; the second CNN 405B having the filter 410B, which samples the second segmented region 298B of the image and outputs a second density map 498B for the second region; and the third CNN 405C having the filter 410C, which samples the third segmented region 298C of the image and outputs a third density map 498C for the third region. To generate the density map 499 for the entire image 100, the first density map 498A, the second density map 498B, and the third density map 498C may be combined, in accordance with some example embodiments. The density map 499 may, as noted, provide an estimate of the quantity of objects per square unit of distance, from which the quantity of objects in the image and the distribution of the objects across the image can be determined. In this example, the objects are people, although other types of objects may be counted in the image as well.
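A compact sketch of such a three-column arrangement appears below; the intermediate layers, channel counts, and the mask-based way of restricting each column to its own segment are assumptions made so the example runs, and do not reproduce the specific layer stacks 412A-420C described above.

```python
import torch
from torch import nn

def column(first_kernel):
    """One column CNN whose first filter size matches its segment's object size."""
    pad = first_kernel // 2
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=first_kernel, padding=pad),  # cf. 410A/B/C
        nn.ReLU(),
        nn.Conv2d(16, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(16, 1, kernel_size=1),  # single-channel density map per column
    )

class MultiColumnCounter(nn.Module):
    def __init__(self):
        super().__init__()
        self.far = column(3)     # small objects, e.g. segment 298A
        self.middle = column(5)  # e.g. segment 298B
        self.near = column(7)    # large objects, e.g. segment 298C

    def forward(self, image, far_mask, middle_mask, near_mask):
        # Each column contributes only inside its own segment; the per-segment
        # maps are then combined into one density map for the whole image.
        combined = (self.far(image) * far_mask
                    + self.middle(image) * middle_mask
                    + self.near(image) * near_mask)
        return combined, combined.sum()  # cf. density map 499 and an object count

# Masks are float tensors of shape (N, 1, H, W) with ones inside the segment.
```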
- In some example embodiments, the filters 410A-C in each of the column CNNs 405A-C may, as noted, be selected based on the size of the objects in the corresponding region and, in particular, the size-induced perspective differences in the image. For example, in a given segmented region 298A-C of the image, the size of the people (or their heads) may have the same or similar perspective and thus the same or similar size. As such, the filter 410A for the first CNN 405A may be a smaller filter to take into account the similar people/head sizes in the region 298A farther away from the camera, when compared to the filter 410B for region 298B, which is closer to the camera (and thus would require a larger filter). Likewise, the filter 410B for the second CNN 405B handling the region 298B may be a smaller filter, when compared to the filter 410C for the third CNN 405C handling the region 298C. In this way, the MCCNN 400 may select the filters at 410A-C based on the depth information for each of the three regions 298A-C, and each region may be processed using one of the corresponding column CNNs 405A, B, or C configured specifically for the approximate size of the objects (e.g., heads or people) in the corresponding region. The MCCNN 400 may thus select, based on the depth information indicative of the segment and the object size in the segment, the size of the corresponding initial filter 410A, B, or C, so that the objects in the region can pass through the corresponding filter.
- In some example embodiments, the MCCNN 400 may be trained using a reference set of images. These reference images may have been segmented and may have known density maps for each of the segments. The reference images may represent ground truth in the sense that the quantity of people in (or the density map for) the image(s) (or segment(s)) may be known to a certain degree of certainty. The MCCNN 400 may then be trained until the MCCNN can learn to generate a density map for the reference images. Once trained, the trained MCCNN may be used to determine density maps for other input images, in accordance with some example embodiments.
- Referring again to FIG. 1, the image 100 (which is being processed to determine an object count) may represent a video stream captured by at least one camera, such as an omnidirectional, or multi-view, camera and/or the like. An example of an omnidirectional, multi-view camera is the Nokia OzO camera, which may generate 360-degree panoramic images in multiple planes. In the case of the omnidirectional, multi-view camera, the images from the camera can be input to the CNN 299 and/or MCCNN 400 in order to enable generation of a density map and a corresponding crowd count in each image. To illustrate further, the OzO camera may include a plurality of cameras, and the images from each of these cameras can be processed to enable segmentation and/or to determine a density map from which a crowd count can be determined. Referring to FIG. 4, the image from each camera of an OzO camera may be input into a separate CNN of the MCCNN, and the output density maps may then be combined to form an aggregate density map 499.
- FIG. 5A depicts a process flow for training a machine learning model, such as CNN 299, to learn how to generate depth information, such as depth maps, to enable image segmentation, in accordance with some example embodiments. The description of FIG. 5A refers to FIGS. 1 and 2B.
- At 502, at least one reference image may be received labeled with depth information, in accordance with some example embodiments.
For example, the CNN 299 may receive reference images having labels indicating the depth of each image. To illustrate further, each reference image may have a corresponding depth map and/or the location of the segments within the image. The objects in the segments in the reference image(s) may be about the same distance from the camera and, as such, have about the same size to enable filtering with the same size filter.
- At 504, a machine learning model may be trained to learn based on the received reference images, in accordance with some example embodiments.
For example, the CNN 299 may train, based on the received reference images, to learn how to generate the depth information, such as the depth map, the location of the segments for received reference images, and/or the size of the objects (or filter size) for each segment. The training may be iterative, using gradient descent and/or the like. When the CNN is trained, the CNN's configuration (e.g., values of the weights, activation values, basis function, and/or the like) may be saved, at 506, to storage, in accordance with some example embodiments. This saved configuration represents the trained CNN, which can be used, in an operational phase, to determine depth information, such as depth maps, segments for images other than the reference images, and/or the size of the objects (or filter size) for each segment.
- FIG. 5B depicts a process flow for training a machine learning model, such as an MCCNN, to provide object count information, in accordance with some example embodiments. The description of FIG. 5B refers to FIGS. 1 and 4.
- At 512, at least one reference image may be received labeled with density information, in accordance with some example embodiments.
For example, the MCCNN 400 may receive reference images having labels indicating the segments in each image, and a density of the objects, such as people/heads per square meter, object count, and/or the like, in the segment. For example, the reference image 100 (FIG. 4) may be segmented a priori and each segment may have a corresponding density map to enable training. Moreover, each of the segments may have about the same size objects (with respect to perspective), so a given filter can be used on the objects in the corresponding segment.
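The description does not specify how the per-segment reference density maps are produced; one common approach in crowd-counting work, offered here purely as an assumption for illustration, is to place a unit impulse at each annotated head position and blur it with a Gaussian so the map still sums to the annotated count.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ground_truth_density(head_points, height, width, sigma=4.0):
    """Build a reference density map from annotated (x, y) head locations."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in head_points:
        density[int(y), int(x)] += 1.0            # one object per annotation
    return gaussian_filter(density, sigma=sigma)  # blurred; sum stays close to the head count

# Example: three annotated heads in a 240x320 reference image.
gt = ground_truth_density([(50, 60), (120, 80), (200, 150)], 240, 320)
```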
- At 514, a machine learning model may be trained to learn to determine density maps, in accordance with some example embodiments. For example, the MCCNN 400 may train based on the received reference images to learn how to generate the object density information, such as a density map, count, and/or the like. In some example embodiments, each column CNN 405A-C of the MCCNN may be trained using a first convolutional layer having a filter selected specifically to account for the size-induced perspective effects of the region being handled by that column CNN. When the MCCNN is trained, the MCCNN's configuration (e.g., values of the weights, activation values, basis function, and/or the like) may be saved, at 516, to storage, in accordance with some example embodiments. This saved configuration represents the trained MCCNN, which can be used, in an operational phase, to determine density information, such as density maps, for images other than the reference images.
- FIG. 5C depicts a process flow for a trained machine learning model in an operational phase, in accordance with some example embodiments. The description of FIG. 5C refers to FIGS. 1 and 2B.
- At 522, at least one image may be received by the trained machine learning model, in accordance with some example embodiments.
For example, the trained CNN 299 may receive at least one image 100 requiring an estimate of an object count. The trained CNN may process the at least one input image 100 to determine, at 524, depth information, which may be in the form of a depth map, and/or an indication of where the at least one image should be segmented, in accordance with some example embodiments. The depth information may also indicate the size of the objects in the segment(s) and/or the corresponding filter size for the segment(s). At 526, the trained machine learning model, such as the trained CNN 299, may output the depth information to another machine learning model, such as the MCCNN 400, in accordance with some example embodiments.
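Chaining the two trained models in the operational phase could look like the sketch below; the model objects and the segment_by_depth helper are assumptions carried over from the earlier sketches rather than interfaces defined in this description.

```python
import torch

def run_operational_phase(depth_model, counting_model, image, segment_by_depth):
    """Run the trained depth network, segment the image, then run the counter."""
    depth_model.eval()
    counting_model.eval()
    with torch.no_grad():
        depth_map = depth_model(image)                   # cf. step 524: depth information
        far, middle, near = segment_by_depth(depth_map)  # segments such as 298A-C
        return counting_model(image, far, middle, near)  # cf. step 526: handed onward
```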
- FIG. 5D depicts a process flow for a trained machine learning model in an operational phase, in accordance with some example embodiments. The description of FIG. 5D refers to FIGS. 1 and 4.
- At 532, at least one image may be received by the trained machine learning model, in accordance with some example embodiments.
For example, the trained MCCNN 400 may receive at least one image. Moreover, the image may be received with depth information to enable segmentation of the image 100 into a plurality of portions. In the example of FIG. 4, the image 100 is segmented into three portions 298A-C, although other quantities of segments may be used as well. Moreover, the depth information may enable the MCCNN to select a filter at 410A-C that is sized to process the size of objects found in each of the segments 298A-C.
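The depth-driven choice of initial filter size could be expressed as simply as the hypothetical mapping below; the threshold values and the use of a segment's mean depth are assumptions, not part of the description.

```python
def filter_size_for_segment(mean_depth, far_threshold=0.66, near_threshold=0.33):
    """Farther segments (smaller objects) get smaller filters, nearer ones larger."""
    if mean_depth >= far_threshold:
        return 3   # e.g., filter 410A for segment 298A
    if mean_depth <= near_threshold:
        return 7   # e.g., filter 410C for segment 298C
    return 5       # e.g., filter 410B for segment 298B
```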
- Each segmented region 298A-C may be processed, at 534, by a CNN 405A-C of the MCCNN 400, in accordance with some example embodiments. Specifically, the segments may be determined, based on the depth information, to take into account the perspective-induced size differences. This enables each of the CNNs 405A-C to have a filter better suited to the size of the objects, such as heads, people, and/or the like, in the corresponding segment being handled by the corresponding CNN. For example, the CNN 405A handles the segment containing objects in the background (which causes objects to appear smaller due to perspective), so the filter of convolutional layer 410A is, for example, a 3×3 matrix to accommodate the relatively smaller sized heads and/or people. As noted above, the size of the filter (which in this example is 3×3) may be selected to pass the objects of interest, which are people in this example. By comparison, the CNN 405C handles the segment containing objects in the foreground (which causes objects to appear larger due to perspective), so the filter of convolutional layer 410C is, for example, a 7×7 matrix to accommodate the relatively larger sized heads and/or people.
- At 536, the trained machine learning model may generate a density map for each segmented region of the image, in accordance with some example embodiments.
As shown in FIG. 4, each column CNN 405A-C generates a density map 498A-C.
- At 538, the trained machine learning model may combine the density maps for each region to form a density map for the entire image received at the input, in accordance with some example embodiments.
For example, the MCCNN 400 may combine the density maps 498A-C into density map 499, which represents the density map for the entire image 100.
- At 540, a trained machine learning model may output an indication of the object count, in accordance with some example embodiments.
For example, the MCCNN 400 may output the density map 499 or further process the density map to provide a count, such as a people count, for the entire image or a count for a portion of the image.
- FIG. 6 depicts a block diagram illustrating a computing system 600, in accordance with some example embodiments. The computing system 600 may be used to implement machine learning models, such as CNN 200, CNN 299, MCCNN 400, and/or the like, as disclosed herein including FIGS. 5A-5D, to perform counting of objects in images, in accordance with some example embodiments. For example, the system 600 may comprise, or be comprised in, an apparatus, such as a mobile phone, smart phone, camera (e.g., OzO, closed circuit television, webcam), drone, self-driving vehicle, car, unmanned aerial vehicle, autonomous vehicle, and/or Internet of Things (IoT) sensor (such as a traffic sensor, an industrial sensor, and/or the like) to enable counting of objects, in accordance with some example embodiments.
- As shown in FIG. 6, the computing system 600 can include a processor 610, a memory 620, a storage device 630, input/output devices 640, and/or a camera 660 (which can be used to capture images including objects to be counted, in accordance with some example embodiments). The processor 610, the memory 620, the storage device 630, and the input/output devices 640 can be interconnected via a system bus 650. The processor 610 may be capable of processing instructions for execution within the computing system 600. Such executed instructions can implement one or more aspects of the machine learning models, such as CNN 200, CNN 299, MCCNN 400, and/or the like. The processor 610 may be capable of processing instructions stored in the memory 620 and/or on the storage device 630 to display graphical information for a user interface provided via the input/output device 640. The memory 620 may be a computer readable medium, such as volatile or non-volatile memory, that stores information within the computing system 600. The memory 620 can store instructions, such as computer program code. The storage device 630 may be capable of providing persistent storage for the computing system 600. The storage device 630 can be a floppy disk device, a hard disk device, an optical disk device, a tape device, or other suitable persistent storage mechanism. The input/output device 640 provides input/output operations for the computing system 600. In some example embodiments, the input/output device 640 includes a keyboard and/or pointing device. In various implementations, the input/output device 640 includes a display unit for displaying graphical user interfaces. Alternatively or additionally, the input/output device 640 may include a wireless and/or wired interface to enable communication with other devices, such as other network nodes. For example, the input/output device 640 can include an Ethernet interface, a WiFi interface, a cellular interface, and/or other wired and/or wireless interfaces to allow communications with one or more wired and/or wireless networks and/or devices.
- FIG. 7 illustrates a block diagram of an apparatus 10, in accordance with some example embodiments. The apparatus 10 may represent a user equipment, such as a wireless device, examples of which include a smartphone, a tablet, and/or the like. The apparatus 10 may be used to implement machine learning models, such as CNN 200, CNN 299, MCCNN 400, and/or the like, as disclosed herein including FIGS. 5A-5D, to perform counting of objects in images, in accordance with some example embodiments. Moreover, the apparatus 10 may include a camera 799, and the processor 20 may comprise a GPU or other special purpose processor to handle the processing of the machine learning models. Like the system of FIG. 6, the apparatus 10 may comprise, or be comprised in, an apparatus, such as a mobile phone, smart phone, camera (e.g., OzO, closed circuit television, webcam), drone, self-driving vehicle, car, unmanned aerial vehicle, autonomous vehicle, and/or Internet of Things (IoT) sensor (such as a traffic sensor, an industrial sensor, and/or the like) to enable counting of objects, in accordance with some example embodiments.
- The apparatus 10 may include at least one antenna 12 in communication with a transmitter 14 and a receiver 16. Alternatively, transmit and receive antennas may be separate. The apparatus 10 may also include a processor 20 configured to provide signals to and receive signals from the transmitter and receiver, respectively, and to control the functioning of the apparatus. Processor 20 may be configured to control the functioning of the transmitter and receiver by effecting control signaling via electrical leads to the transmitter and receiver. Likewise, processor 20 may be configured to control other elements of apparatus 10 by effecting control signaling via electrical leads connecting processor 20 to the other elements, such as a display or a memory. The processor 20 may, for example, be embodied in a variety of ways including circuitry, at least one processing core, one or more microprocessors with accompanying digital signal processor(s), one or more processor(s) without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits (for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), and/or the like), or some combination thereof. Accordingly, although illustrated in FIG. 7 as a single processor, in some example embodiments the processor 20 may comprise a plurality of processors or processing cores.
- The apparatus 10 may be capable of operating with one or more air interface standards, communication protocols, modulation types, access types, and/or the like. Signals sent and received by the processor 20 may include signaling information in accordance with an air interface standard of an applicable cellular system, and/or any number of different wireline or wireless networking techniques, comprising but not limited to Wi-Fi, wireless local access network (WLAN) techniques, such as Institute of Electrical and Electronics Engineers (IEEE) 802.11, 802.16, 802.3, ADSL, DOCSIS, and/or the like. In addition, these signals may include speech data, user generated data, user requested data, and/or the like.
- For example, the apparatus 10 and/or a cellular modem therein may be capable of operating in accordance with various first generation (1G) communication protocols, second generation (2G or 2.5G) communication protocols, third-generation (3G) communication protocols, fourth-generation (4G) communication protocols, fifth-generation (5G) communication protocols, and/or Internet Protocol Multimedia Subsystem (IMS) communication protocols (for example, session initiation protocol (SIP)) and/or the like. For example, the apparatus 10 may be capable of operating in accordance with 2G wireless communication protocols IS-136, Time Division Multiple Access (TDMA), Global System for Mobile communications (GSM), IS-95, Code Division Multiple Access (CDMA), and/or the like. In addition, for example, the apparatus 10 may be capable of operating in accordance with 2.5G wireless communication protocols General Packet Radio Service (GPRS), Enhanced Data GSM Environment (EDGE), and/or the like. Further, for example, the apparatus 10 may be capable of operating in accordance with 3G wireless communication protocols, such as Universal Mobile Telecommunications System (UMTS), Code Division Multiple Access 2000 (CDMA2000), Wideband Code Division Multiple Access (WCDMA), Time Division-Synchronous Code Division Multiple Access (TD-SCDMA), and/or the like. The apparatus 10 may be additionally capable of operating in accordance with 3.9G wireless communication protocols, such as Long Term Evolution (LTE), Evolved Universal Terrestrial Radio Access Network (E-UTRAN), and/or the like. Additionally, for example, the apparatus 10 may be capable of operating in accordance with 4G wireless communication protocols, such as LTE Advanced, 5G, and/or the like, as well as similar wireless communication protocols that may be subsequently developed.
- It is understood that the processor 20 may include circuitry for implementing audio/video and logic functions of apparatus 10. For example, the processor 20 may comprise a digital signal processor device, a microprocessor device, an analog-to-digital converter, a digital-to-analog converter, and/or the like. Control and signal processing functions of the apparatus 10 may be allocated between these devices according to their respective capabilities. The processor 20 may additionally comprise an internal voice coder (VC) 20a, an internal data modem (DM) 20b, and/or the like. Further, the processor 20 may include functionality to operate one or more software programs, which may be stored in memory. In general, processor 20 and stored software instructions may be configured to cause apparatus 10 to perform actions. For example, processor 20 may be capable of operating a connectivity program, such as a web browser. The connectivity program may allow the apparatus 10 to transmit and receive web content, such as location-based content, according to a protocol, such as wireless application protocol (WAP), hypertext transfer protocol (HTTP), and/or the like.
- Apparatus 10 may also comprise a user interface including, for example, an earphone or speaker 24, a ringer 22, a microphone 26, a display 28, a user input interface, and/or the like, which may be operationally coupled to the processor 20. The display 28 may, as noted above, include a touch sensitive display, where a user may touch and/or gesture to make selections, enter values, and/or the like. The processor 20 may also include user interface circuitry configured to control at least some functions of one or more elements of the user interface, such as the speaker 24, the ringer 22, the microphone 26, the display 28, and/or the like. The processor 20 and/or user interface circuitry comprising the processor 20 may be configured to control one or more functions of one or more elements of the user interface through computer program instructions, for example, software and/or firmware, stored on a memory accessible to the processor 20, for example, volatile memory 40, non-volatile memory 42, and/or the like. The apparatus 10 may include a battery for powering various circuits related to the mobile terminal, for example, a circuit to provide mechanical vibration as a detectable output. The user input interface may comprise devices allowing the apparatus 10 to receive data, such as a keypad 30 (which can be a virtual keyboard presented on display 28 or an externally coupled keyboard) and/or other input devices.
- As shown in FIG. 7, apparatus 10 may also include one or more mechanisms for sharing and/or obtaining data. For example, the apparatus 10 may include a short-range radio frequency (RF) transceiver and/or interrogator 64, so data may be shared with and/or obtained from electronic devices in accordance with RF techniques. The apparatus 10 may include other short-range transceivers, such as an infrared (IR) transceiver 66, a Bluetooth™ (BT) transceiver 68 operating using Bluetooth™ wireless technology, a wireless universal serial bus (USB) transceiver 70, a Bluetooth™ Low Energy transceiver, a ZigBee transceiver, an ANT transceiver, a cellular device-to-device transceiver, a wireless local area link transceiver, and/or any other short-range radio technology. Apparatus 10 and, in particular, the short-range transceiver may be capable of transmitting data to and/or receiving data from electronic devices within the proximity of the apparatus, such as within 10 meters, for example. The apparatus 10 including the Wi-Fi or wireless local area networking modem may also be capable of transmitting and/or receiving data from electronic devices according to various wireless networking techniques, including 6LoWpan, Wi-Fi, Wi-Fi low power, WLAN techniques such as IEEE 802.11 techniques, IEEE 802.15 techniques, IEEE 802.16 techniques, and/or the like.
- The apparatus 10 may comprise memory, such as a subscriber identity module (SIM) 38, a removable user identity module (R-UIM), an eUICC, a UICC, and/or the like, which may store information elements related to a mobile subscriber. In addition to the SIM, the apparatus 10 may include other removable and/or fixed memory. The apparatus 10 may include volatile memory 40 and/or non-volatile memory 42. For example, volatile memory 40 may include Random Access Memory (RAM) including dynamic and/or static RAM, on-chip or off-chip cache memory, and/or the like. Non-volatile memory 42, which may be embedded and/or removable, may include, for example, read-only memory, flash memory, magnetic storage devices, for example, hard disks, floppy disk drives, magnetic tape, optical disc drives and/or media, non-volatile random access memory (NVRAM), and/or the like. Like volatile memory 40, non-volatile memory 42 may include a cache area for temporary storage of data. At least part of the volatile and/or non-volatile memory may be embedded in processor 20. The memories may store one or more software programs, instructions, pieces of information, data, and/or the like which may be used by the apparatus for performing operations disclosed herein including, for example, processing, by the trained machine learning model, a first segment of an image and a second segment of the image, the first segment being processed using a first filter selected, based on depth information, to enable formation of a first density map, and the second segment being processed using a second filter selected, based on the depth information, to enable formation of a second density map; combining, by the trained machine learning model, the first density map and the second density map to form a density map for the image; providing, by the trained machine learning model, an output based on the density map, the output being representative of an estimate of a quantity of objects in the image; and/or other aspects disclosed herein with respect to the CNN, MCCNN 400, and/or the like for counting of objects in images.
- The memories may comprise an identifier, such as an international mobile equipment identification (IMEI) code, capable of uniquely identifying apparatus 10. In the example embodiment, the processor 20 may be configured using computer code stored at memory 40 and/or 42 to control and/or provide one or more aspects disclosed herein (see, for example, process 600, 700, and/or other operations disclosed herein). For example, the processor 20 may be configured using computer code stored at memory 40 and/or 42 to at least perform operations including, for example, processing, by the trained machine learning model, a first segment of an image and a second segment of the image, the first segment being processed using a first filter selected, based on depth information, to enable formation of a first density map, and the second segment being processed using a second filter selected, based on the depth information, to enable formation of a second density map; combining, by the trained machine learning model, the first density map and the second density map to form a density map for the image; and/or other aspects disclosed herein with respect to the CNN, MCCNN 400, and/or the like for counting of objects in images.
- Some of the embodiments disclosed herein may be implemented in software, hardware, application logic, or a combination of software, hardware, and application logic. The software, application logic, and/or hardware may reside on memory 40, the control apparatus 20, or electronic components, for example. In some example embodiments, the application logic, software, or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a "computer-readable medium" may be any non-transitory media that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer or data processor circuitry, with examples depicted at FIG. 7. A computer-readable medium may comprise a non-transitory computer-readable storage medium that may be any media that can contain or store the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
- The subject matter described herein may be embodied in systems, apparatus, methods, and/or articles depending on the desired configuration. For example, the base stations and user equipment (or one or more components therein) and/or the processes described herein can be implemented using one or more of the following: a processor executing program code, an application-specific integrated circuit (ASIC), a digital signal processor (DSP), an embedded processor, a field programmable gate array (FPGA), and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. These computer programs (also known as programs, software, software applications, applications, components, program code, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term "computer-readable medium" refers to any computer program product, machine-readable medium, computer-readable storage medium, apparatus and/or device (for example, magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions. Similarly, systems are also described herein that may include a processor and a memory coupled to the processor. The memory may include one or more programs that cause the processor to perform one or more of the operations described herein.
- Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and/or variations may be provided in addition to those set forth herein. Moreover, the implementations described above may be directed to various combinations and subcombinations of the disclosed features and/or combinations and subcombinations of several further features disclosed above. Other embodiments may be within the scope of the following claims.
- If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Although various aspects of some of the embodiments are set out in the independent claims, other aspects of some of the embodiments comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims. It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications that may be made without departing from the scope of some of the embodiments as defined in the appended claims. Other embodiments may be within the scope of the following claims. The term "based on" includes "based on at least." The use of the phrase "such as" means "such as for example" unless otherwise indicated.
Claims (21)
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2017/108952 WO2019084854A1 (en) | 2017-11-01 | 2017-11-01 | Depth-aware object counting |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20200242777A1 true US20200242777A1 (en) | 2020-07-30 |
| US11270441B2 US11270441B2 (en) | 2022-03-08 |
Family
ID=66331257
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/754,988 Active US11270441B2 (en) | 2017-11-01 | 2017-11-01 | Depth-aware object counting |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US11270441B2 (en) |
| EP (1) | EP3704558A4 (en) |
| CN (1) | CN111295689B (en) |
| WO (1) | WO2019084854A1 (en) |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10997450B2 (en) * | 2017-02-03 | 2021-05-04 | Siemens Aktiengesellschaft | Method and apparatus for detecting objects of interest in images |
| CN113240650A (en) * | 2021-05-19 | 2021-08-10 | 中国农业大学 | Fry counting system and method based on deep learning density map regression |
| US20220388162A1 (en) * | 2021-06-08 | 2022-12-08 | Fanuc Corporation | Grasp learning using modularized neural networks |
| US20220391638A1 (en) * | 2021-06-08 | 2022-12-08 | Fanuc Corporation | Network modularization to learn high dimensional robot tasks |
| US20230196782A1 (en) * | 2021-12-17 | 2023-06-22 | At&T Intellectual Property I, L.P. | Counting crowds by augmenting convolutional neural network estimates with fifth generation signal processing data |
| CN118396422A (en) * | 2024-05-10 | 2024-07-26 | 浙江天演维真网络科技股份有限公司 | Intelligent peach blossom recognition and fruit yield prediction method based on fusion model |
Families Citing this family (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11048948B2 (en) * | 2019-06-10 | 2021-06-29 | City University Of Hong Kong | System and method for counting objects |
| CN110866476B (en) * | 2019-11-06 | 2023-09-01 | 南京信息职业技术学院 | Dense stacking target detection method based on automatic labeling and transfer learning |
| US11393182B2 (en) | 2020-05-29 | 2022-07-19 | X Development Llc | Data band selection using machine learning |
| CN111652168B (en) * | 2020-06-09 | 2023-09-08 | 腾讯科技(深圳)有限公司 | Group detection method, device, equipment and storage medium based on artificial intelligence |
| CN111815665B (en) * | 2020-07-10 | 2023-02-17 | 电子科技大学 | Crowd Counting Method Based on Depth Information and Scale Awareness Information in a Single Image |
| US11606507B1 (en) | 2020-08-28 | 2023-03-14 | X Development Llc | Automated lens adjustment for hyperspectral imaging |
| US11651602B1 (en) | 2020-09-30 | 2023-05-16 | X Development Llc | Machine learning classification based on separate processing of multiple views |
| US12033329B2 (en) | 2021-07-22 | 2024-07-09 | X Development Llc | Sample segmentation |
| US11995842B2 (en) | 2021-07-22 | 2024-05-28 | X Development Llc | Segmentation to improve chemical analysis |
| US12400422B2 (en) | 2021-08-25 | 2025-08-26 | X Development Llc | Sensor fusion approach for plastics identification |
| CN115151949B (en) * | 2022-06-02 | 2025-06-17 | 深圳市正浩创新科技股份有限公司 | Target object collection method, device and storage medium |
Family Cites Families (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7139409B2 (en) | 2000-09-06 | 2006-11-21 | Siemens Corporate Research, Inc. | Real-time crowd density estimation from video |
| US8195598B2 (en) * | 2007-11-16 | 2012-06-05 | Agilence, Inc. | Method of and system for hierarchical human/crowd behavior detection |
| US8184175B2 (en) * | 2008-08-26 | 2012-05-22 | Fpsi, Inc. | System and method for detecting a camera |
| CN102521646B (en) | 2011-11-11 | 2015-01-21 | 浙江捷尚视觉科技股份有限公司 | Complex scene people counting algorithm based on depth information cluster |
| US9740937B2 (en) * | 2012-01-17 | 2017-08-22 | Avigilon Fortress Corporation | System and method for monitoring a retail environment using video content analysis with depth sensing |
| GB2505501B (en) | 2012-09-03 | 2020-09-09 | Vision Semantics Ltd | Crowd density estimation |
| US10009579B2 (en) | 2012-11-21 | 2018-06-26 | Pelco, Inc. | Method and system for counting people using depth sensor |
| CN105654021B (en) | 2014-11-12 | 2019-02-01 | 株式会社理光 | Method and apparatus of the detection crowd to target position attention rate |
| JP6494253B2 (en) * | 2014-11-17 | 2019-04-03 | キヤノン株式会社 | Object detection apparatus, object detection method, image recognition apparatus, and computer program |
| US9613255B2 (en) * | 2015-03-30 | 2017-04-04 | Applied Materials Israel Ltd. | Systems, methods and computer program products for signature detection |
| CN107624189B (en) * | 2015-05-18 | 2020-11-20 | 北京市商汤科技开发有限公司 | Method and apparatus for generating a predictive model |
| KR101788269B1 (en) * | 2016-04-22 | 2017-10-19 | 주식회사 에스원 | Method and apparatus for sensing innormal situation |
| US10152630B2 (en) * | 2016-08-09 | 2018-12-11 | Qualcomm Incorporated | Methods and systems of performing blob filtering in video analytics |
| US10055669B2 (en) * | 2016-08-12 | 2018-08-21 | Qualcomm Incorporated | Methods and systems of determining a minimum blob size in video analytics |
| CN106778502B (en) | 2016-11-21 | 2020-09-22 | 华南理工大学 | Crowd counting method based on deep residual error network |
| CN106650913B (en) | 2016-12-31 | 2018-08-03 | 中国科学技术大学 | A kind of vehicle density method of estimation based on depth convolutional neural networks |
-
2017
- 2017-11-01 WO PCT/CN2017/108952 patent/WO2019084854A1/en not_active Ceased
- 2017-11-01 CN CN201780096479.5A patent/CN111295689B/en active Active
- 2017-11-01 US US16/754,988 patent/US11270441B2/en active Active
- 2017-11-01 EP EP17930503.2A patent/EP3704558A4/en active Pending
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10997450B2 (en) * | 2017-02-03 | 2021-05-04 | Siemens Aktiengesellschaft | Method and apparatus for detecting objects of interest in images |
| CN113240650A (en) * | 2021-05-19 | 2021-08-10 | 中国农业大学 | Fry counting system and method based on deep learning density map regression |
| US20220388162A1 (en) * | 2021-06-08 | 2022-12-08 | Fanuc Corporation | Grasp learning using modularized neural networks |
| US20220391638A1 (en) * | 2021-06-08 | 2022-12-08 | Fanuc Corporation | Network modularization to learn high dimensional robot tasks |
| US11809521B2 (en) * | 2021-06-08 | 2023-11-07 | Fanuc Corporation | Network modularization to learn high dimensional robot tasks |
| US12017355B2 (en) * | 2021-06-08 | 2024-06-25 | Fanuc Corporation | Grasp learning using modularized neural networks |
| US20230196782A1 (en) * | 2021-12-17 | 2023-06-22 | At&T Intellectual Property I, L.P. | Counting crowds by augmenting convolutional neural network estimates with fifth generation signal processing data |
| US12190594B2 (en) * | 2021-12-17 | 2025-01-07 | At&T Intellectual Property I, L.P. | Counting crowds by augmenting convolutional neural network estimates with fifth generation signal processing data |
| CN118396422A (en) * | 2024-05-10 | 2024-07-26 | 浙江天演维真网络科技股份有限公司 | Intelligent peach blossom recognition and fruit yield prediction method based on fusion model |
Also Published As
| Publication number | Publication date |
|---|---|
| US11270441B2 (en) | 2022-03-08 |
| CN111295689B (en) | 2023-10-03 |
| EP3704558A1 (en) | 2020-09-09 |
| WO2019084854A1 (en) | 2019-05-09 |
| CN111295689A (en) | 2020-06-16 |
| EP3704558A4 (en) | 2021-07-07 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11270441B2 (en) | Depth-aware object counting | |
| US12062243B2 (en) | Distracted driving detection using a multi-task training process | |
| KR102529574B1 (en) | Semantic Segmentation with Soft Cross-Entropy Loss | |
| CN111797983A (en) | A kind of neural network construction method and device | |
| US10614339B2 (en) | Object detection with neural network | |
| US10872275B2 (en) | Semantic segmentation based on a hierarchy of neural networks | |
| CN111292366B (en) | Visual driving ranging algorithm based on deep learning and edge calculation | |
| CN104615986B (en) | The method that pedestrian detection is carried out to the video image of scene changes using multi-detector | |
| US20200019799A1 (en) | Automated annotation techniques | |
| CN110033481A (en) | Method and apparatus for carrying out image procossing | |
| CN112101114B (en) | Video target detection method, device, equipment and storage medium | |
| WO2018152741A1 (en) | Collaborative activation for deep learning field | |
| CN112840347B (en) | Method, apparatus and computer readable medium for object detection | |
| CN113627332A (en) | A Distracted Driving Behavior Recognition Method Based on Gradient Control Federated Learning | |
| US20220171981A1 (en) | Recognition of license plate numbers from bayer-domain image data | |
| CN107301376A (en) | A kind of pedestrian detection method stimulated based on deep learning multilayer | |
| CN112101185B (en) | Method for training wrinkle detection model, electronic equipment and storage medium | |
| US20240273742A1 (en) | Depth completion using image and sparse depth inputs | |
| US20230298142A1 (en) | Image deblurring via self-supervised machine learning | |
| CN114970654B (en) | Data processing method and device and terminal | |
| CN111339226B (en) | A method and device for constructing a map based on classification detection network | |
| CN116823884A (en) | Multi-target tracking method, system, computer equipment and storage medium | |
| CN114943903A (en) | Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle | |
| CN113902044A (en) | Image target extraction method based on lightweight YOLOV3 | |
| CN116030507A (en) | An electronic device and a method for identifying whether a face in an image wears a mask |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| AS | Assignment |
Owner name: NOKIA TECHNOLOGIES OY, FINLAND Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NOKIA TECHNOLOGIES (BEIJING) CO., LTD;REEL/FRAME:056504/0448 Effective date: 20201124 Owner name: NOKIA TECHNOLOGIES (BEIJING) CO., LTD, CHINA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:JIANG, XIAOHENG;REEL/FRAME:056504/0400 Effective date: 20171114 Owner name: NOKIA TECHNOLOGIES OY, FINLAND Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:JIANG, XIAOHENG;REEL/FRAME:056504/0400 Effective date: 20171114 |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: AWAITING TC RESP., ISSUE FEE NOT PAID |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
| STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
| FEPP | Fee payment procedure |
Free format text: SURCHARGE FOR LATE PAYMENT, LARGE ENTITY (ORIGINAL EVENT CODE: M1554); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
| MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |