WO2007130688A2

WO2007130688A2 - Mobile computing device with imaging capability

Info

Publication number: WO2007130688A2
Application number: PCT/US2007/011120
Authority: WO
Inventors: Enrico Di Bernardo; Mario Munich; Paolo Pirjanian; William Gross
Original assignee: Evolution Robotics Inc
Current assignee: Evolution Robotics Inc
Priority date: 2006-05-10
Filing date: 2007-05-08
Publication date: 2007-11-15
Anticipated expiration: 2008-11-10
Also published as: WO2007130688A3

Abstract

A system and method for performing object recognition with a mobile computing device (110) and server (120) is disclosed. The mobile computing device (110), preferably a camera phone, is configured to capture (302) digital pictures or video, extract (304) visual features from the image data, and transmit (306) the visual features to a server via the cellular network or Internet, for example. Upon receipt, the server (120) compares the extracted features to the features of a plurality of known objects to identify (308) one or more items depicted in the image data. Depending on the item identified, the server may execute one or more predetermined actions including transmitting (312) product information to the mobile phone. The product information may specify the price, quantity, availability, location information for the identified item.

Description

MOBILE COMPUTING DEVICE WITH IMAGING CAPABILITY

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. Provisional Patent Application Serial No. 60/799,621 filed May 10, 2006, entitled "SYSTEM AND METHOD FOR USING MOBILE COMPUTING DEVICE WITH IMAGING CAPABILITY."

TECHNICAL FIELD

The invention generally relates to a mobile computing device configured to generate and transmit visual features of images. In particular, the invention relates to a mobile phone or other communication device configured to capture images, extract visual features from those images, and transmit those visual features to a server where objects depicted in the image are recognized.

BACKGROUND ART

A significant percentage of mobile phones are configured with pinhole cameras with which the user can capture digital images and video. The resolution and quality of the images is generally sufficient to recognize objects depicted in those images using various techniques known to those skilled in the art. The process of object recognition is, however, performed by a remote server instead of the phone due to processing and memory limitations. In some prior implementations, the images — or a portion of the image — are captured by the phone and transmitted to a remote server for identification based on, for example, the presence of a universal product code (UPC) shown in the image. This system, however, requires that image data be transmitted by the phone over a cellular network. Transmission of the raw image data can be especially burdensome where the image contains several mega pixels. In addition to the time transmission and expense of bandwidth, the process is susceptible to signal noise, transmission errors, and disruption if the connection between the phone and the base station is dropped. In addition, privacy laws in some countries prohibit image from being relayed to a server without the user explicit authorization.

In some other prior art approaches, the phone is configured to capture an image and decode a universal product code (UPC) depicted in the image. The decoded number associated with the UPC may then be transmitted back to the server for processing. While this approach is useful in limited situations in which the object has computer- readable indicia, such as system is unable to actually recognize the object based on its visual appearance more generally.

There is therefore a need for a technique that consumes less time and bandwidth than transmission of raw image data with little or no adverse effect on the recognition.

DISCLOSURE OF INVENTION

The invention features a system and method for using a mobile computing device in combination with a distributed network to perform object recognition and initiate responsive action. In the preferred embodiment, the invention is a method of performing object recognition with a camera phone in a cellular communications network, the method comprising: using the mobile computing device to extract one or more visual features from an image captured by the mobile computing device; transmitting the extracted features from the mobile computing device to a remote server; using the server to compare the one or more visual features with a database of visual features representing known objects; and identifying an item in the image as one of the known objects based on a match between the extracted features and the visual features of the plurality of known objects.

In the preferred embodiment, the mobile computing device is a mobile phone although various other devices may be suitable including digital cameras, personal computers, personal digital assistant (PDAs), digital cameras with wireless network capability, portable media players, and global positioning systems (GPSs), for example. The mobile device is configured to transmit the visual features to the object server recognition server via a cellular communications network, although a packet- switched network such as a local area network (LAN), metropolitan area network (MAN), and/or the Internet may also being employed. The visual features extracted from the image are preferably scale-invariant features. Suitable feature detector/descriptor types include, but are not limited to, scale-invariant feature transform (SIFT) features and Speeded Up Robust Features (SURF) in which the visual features can be characterized by a vector comprising image gradient data, image scale data, feature orientation data, and geometric location data.

Once the item depicted in the camera phone image is identified by the server, the server may take any of a number of actions. In some embodiments, the server is configured to return information to the mobile computing device or take action on behalf of the user of the mobile computing device. If the identified item is a product offered for sale, for example, the server may transmit product information with the price and availability as well as a hyperlink with which the user can purchase the product.

The invention also includes a system for performing object recognition with the mobile computing device and object recognition server. The mobile computing device preferably includes a camera adapted to capture a digital photograph or video of an item, a feature extractor adapted to extract one or more visual features from the image, and a transmitter adapted to send the one or more visual features from the mobile computing device to the server. The server preferably includes a database of visual features associated with a plurality of known objects and an object recognition processor configured to identify the item based on the one or more visual features and the visual features associated with the plurality of known objects. As stated, the mobile computing device is preferably a mobile phone and the server is configured to return product information to the phone or take other predetermined action.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, and in which:

FIG. 1 is a functional block diagram of a distributed network with, in accordance with the preferred embodiment of the present invention; FIG. 2 is a functional block diagram of a mobile computing device, an information service provider, and an object recognition server, in accordance with the preferred embodiment of the present invention;

FIG. 3 is a flowchart of a process of recognizing a product using a mobile computing device in a distributed network, in accordance with the preferred embodiment of the present invention; FIG. 4 is a flowchart of the method of extracting scale-invariant visual features, in accordance with the preferred embodiment of the present invention; and FIG. 5 is a flowchart of the method of implementing object recognition using scale- invariant visual features, in accordance with the preferred embodiment of the present invention.

BEST MODES FOR CARRYING OUT THE INVENTION

Illustrated in FIG. 1 is a network for implementing the object recognition system 100 and method of requesting information or otherwise acquiring information about a product or other item using a mobile communication device with minimal user interaction. The network includes a plurality of mobile computing devices with imaging capability, which are diagrammatically represented by users 1 10-112. The mobile computing devices in the preferred embodiment are cellular phones with built- in digital cameras, although various other forms of devices may be employed including personal computers, personal digital assistants (PDAs), digital cameras with wireless network capability, portable media players, global positioning systems (GPS), and the like, as well as some non-portable devices. The cellular phones, sometimes referred to herein as handsets, are operatively coupled to a cellular network 103 adapted to provide access to a data communications network such as the Internet 104 via a cellular service provider 102. In some embodiments, handsets are enabled with the 802.1 1 protocol (or Wi-Max, Bluetooth, or other means of wireless data communication) to access the Internet 104 or other packet-switched network more directly through an access point 105. The network further includes an information service provider 106 adapted to provide information about products, for example, based on objects that are depicted in digital photographs or snapshots taken by the users 1 10-1 12 and identified by the object recognition server 120. In accordance with an exemplary embodiment, users 1 10-1 12 access the information service provider 106 for purposes of purchasing the product or item, getting additional information on a product or item, or receiving coupons or discounts, for example.

Additionally, unless otherwise indicated, the functions described herein are performed by programs including executable code or computer readable instructions running on one or more handsets and/or general-purpose computers. The computers can include one or more central processing units for executing program code, volatile memory, such as random access memory (RAM) for temporarily storing data and data structures during program execution, non-volatile memory, such as a hard disc drive, flash memory, or optical drive, for storing programs and data, including databases, and a network interface for accessing an intranet and/or the Internet. However, the functions described herein can also be implemented using special purpose computers, application-specific integrated circuit (ASIC), state machines, and/or hardwired electronic circuits. The example processes described herein do not necessarily have to be performed in the described sequence, and not all states have to be reached or performed.

Illustrated in FIG. 2 is a functional block diagram of the object recognition system including a mobile computing device 110, information service provider 106, and object recognition server 120. The mobile computing device 1 10, preferably a cellular phone, includes a user interface 202 with a keypad 203, audio input/output 204, and a liquid crystal display (LCD) for displaying images and corresponding product information; a digital camera 208; and a feature extractor 252 for generating visual features used to identify objects depicted in the images acquired by the camera 208. When the user encounters a product, or printed material containing a picture of interest, or other item 210 of interest, the user may initiate a search for more product information by snapping a picture of the item. The acquired image is passed to the feature extractor 252 which generates a plurality of visual features that characterize the object(s) depicted therein. The feature extractor 252 in the preferred embodiment generates a plurality of scale-invariant visual features with a scale-invariant feature transform (SIFT) processor 254 discuss in.more detail below. The SIFT visual features generally represent the orientation, form, and contrast of lines, edges, corners, textures, and like elements of an image, but do not include computer readable indicia like UPC codes. The visual features are then transmitted to the object recognition server 120 which employs a pattern recognition processor or module 256 to identify the one or more items depicted in the photograph by comparing the extracted visual features with the visual features of known objects retained in the feature descriptor database 272. The one or more identified items are then used to retrieve instructions to initiate a predetermined action or send additional information from the product info database 280 of the information service provider 106 or from a remote part of the network via a uniform resource locator (URL). The additional information is returned to the user where it may be displayed as a product page 206 via the user interface 202. A product page 206 as used herein refers to a combination of text, graphical, and/or hyperlinks that present product information selected by or tailored for the user based on the image captured by the camera. The categories of product information that can be returned to the user 110 include web pages or documents with various forms of content (or hyperlinks to content) including purchasing information, text, graphics, audio, video, and various other forms resources. The purchasing information may further include one or more pictures of the product, the product's prices, quantity, availability, location, physical or functional specifications (dimensions, weight, model, part #, universal product code (UPC) code), pictures, video, product reviews, and/or a URL of a website or other link to such information. The object recognition 120 may also be configured to take a prescribed action associated with the identified objection including, for example, entering the user/sender into a sweepstakes specified in the image, allowing the user to cast a vote or a preference in an election or survey indicated in the image, sending samples of the product depicted to user's residence, or calling the user back to provide more information in person or with a recorded message.

The product page 206 may also include a thumbnail picture 21 1 of the identified item 210 to enable the user to verify the matching object identified by the server 120. Where the object recognition server 120 identifies a plurality of known objects or product pages associated with the pictured item 210, the user may be provided a list including a brief description of the each of the objects from which the user may select. If the user snaps a photograph of a can of COCA-COLA (TM). the list returned by the server 120 may include the price and product information for a 6-pack and 12-pack of COCA-COLA (TM). Similarly, if the object recognition server includes a plurality of informational pages or links associated with the item 210, the server may return a product page 21 1 including a menu from which the user may select, the menu enabling the user to select price information, the name and address of stores where the product can be purchased, maps locating the stores, news and reviews about the item, and like information.

In some embodiments, the product page 206 enables the user to execute an order for the product. Using a one-click implementation, for example, the user may order the identified product by clicking on a single "purchase" button once the user has logged in to the vendor's website by providing identifying information including an account name and personal identification number (PIN), for example. The order may be automatically charged against a user-designated financial account (e.g., credit card or phone bill) and shipped to a pre-designated address based on information previously entered by the user. In some embodiment the user may identify himself by entering biometric information such as a fingerprint or into the mobile device, thereby allowing the purchase to be automatically charged to the phone service provider bill.

Illustrated in FIG. 3 is a flowchart of a process of invoking object recognition and returning product information for a user based on a photograph taken with a mobile computing device, e.g., a camera phone handset. In the first step 302, the user snaps a photograph of an object, i.e., the item of interest. The item may be, for example, a product, place, advertisement, or the like. The product may also be displayed in a store or depicted in a catalog or a magazine, for example. The image captured by the handset may be in any of a number of raster formats including bit map, JPEG, or GIF or in video formats including MPEG and AVI. The item depicted in the photograph may include the entire item or a portion of the item sufficient to extract visual features with which to identify the item. Thereafter, the user may select a preprogrammed camera mode that initiates the following operations to solicit more information about the item. When selected, the image is transferred to a first process, preferably the phone's SIFT processor, which typically extracts 304 a plurality of visual features used to characterize the object depicted in the photograph. As one skilled in the art will appreciate, SIFT is able to consistently identify visual features with which objects can be identified independent of differences in image size, image lighting conditions, position and orientation of the features in the images and camera viewpoint. The number and quality of extracted visual features will vary depending on the image, but frequently varies between 50 and 2000 features per VGA resolution (640 x 480) image. The process of extracting visual features is discussed in more detail below in context of FIG. 4.

The visual features, generally in the form of feature descriptors, are transmitted 306 to the information service provider 106 with instructions to provide additional information or take other action regarding the item. The instructions may further specify the type of information requested by the user or the ID of the application which is generating the request. For example, the application running on the device may allow the user to specify what type of information the users is interested in — pricing, reviews, samples, coupons, store names and locations, etc. — in regards to the item photographed.

In this exemplary embodiment, the information service provider 106 forwards the plurality of visual features to a second process or module, preferably the object recognition processor 120. Upon receipt, the object recognition processor 120 compares each of the features from the captured image with the features of known objects retained in the feature descriptor database 272. A match is identified if the Euclidian distance between the extracted features and features of known objects satisfies a determined threshold, or the distance is minimal compared to the other known objects. If one or more objects are identified by matching 308 the visual features with those of the database, the identities of the objects are returned to the information service provider 106 and the associated product information retrieved 310. Thereafter, the associated product information is transmitted 312 to the user's handset. As stated, the user can respond 314 to the product information by, for example, purchasing the identified product using an executable instruction included in the product page 206.

Illustrated in FIG. 4 is a flowchart of the method of extracting scale-invariant visual features from a digital photograph in the preferred embodiment. Visual features, preferably SIFT-based visual features, are extracted 402 from the image by generating a plurality of Difference-of-Gaussian (DoG) images from the input image. A Difference-of-Gaussian image represents a band-pass filtered image produced by subtracting a first copy of the image blurred with a first Gaussian kernel from a second copy of the image blurred with a second Gaussian kernel. This process is repeated for multiple frequency bands in order to accentuate objects and object features independent of their size and resolution. While image blurring is achieved using a Gaussian convolution kernel of variable width, one skilled in the art will appreciate that the same results may be achieved by using a fixed-width Gaussian kernal of appropriate variance with images of different scale, i.e., resolution, produced by down-sampling the original input image.

Each of the DoG images is inspected by the SIFT processor 254 to identify the pixel extrema including minima and maxima. To be selected; an extremum must possess the highest or lowest pixel intensity among the eight adjacent pixels in the same DoG image as well as the nine adjacent pixels in the two adjacent DoG images having the closest related band-pass filtering, i.e., the adjacent DoG images having the next highest scale and the next lowest scale if present. The identified extrema, which may be referred to herein as image "keypoints," are associated with the center point of visual features. In some embodiments, an improved estimate of the location of each extremum within a DoG image may be determined through interpolation using a 3- dimensional quadratic function, for example, to improve feature matching and stability.

With each of the visual features localized, the local image properties are used to assign an orientation to each of the keypoints. By consistently assigning each of the features an orientation, different keypoints may be readily identified within different images even where the object with which the features are associated is displaced or rotated within the image. In the preferred embodiment, the orientation is derived from an orientation histogram formed from gradient orientations at all points within a circular window or region around the keypoint. As one skilled in the art will appreciate, it may be beneficial to weight the gradient magnitudes with a circularly- symmetric Gaussian weighting function where the gradients are based on non- adjacent pixels in the vicinity of a keypoint. The peak in the orientation histogram, which corresponds to a dominant direction of the gradients local to a keypoint, is assigned to be the feature's orientation. With the orientation of each keypoint assigned, the SIFT processor 254 of the feature extractor 252 generates 408 a feature descriptor to characterize the image data in a region surrounding each identified keypoint at its respective orientation. In the preferred embodiment, the surrounding region within the associated DoG image is subdivided into an Mx M array of subfields aligned with the keypoint's assigned orientation. Each subfϊeld in turn is characterized by an orientation histogram having a plurality of bins, each bin representing the sum of the image's gradient magnitudes having an orientation within a particular angular range and present within the associated subfϊeld. As one skilled in the art will appreciate, generating the feature descriptor from the one DoG image in which the inter-scale extrema is located ensures that the feature descriptor is largely independent of the scale at which the associated object is depicted in the images being compared. In the preferred embodiment, the feature descriptor includes a 128 byte array corresponding to a 4 x 4 array of subfields with each subfield including eight bins corresponding to an angular width of 45 degrees. The feature descriptor in the preferred embodiment further includes an identifier of the associated image, the scale of the DoG image in which the associated keypoint was identified, the orientation of the feature, and the geometric location of the keypoint in the associated DoG image.

As one skilled in the art will appreciate, other visual features may be used to identify the pictured item including, for example, scale-invariant and rotation-invariant technique referred to as Speeded Up Robust Features (SURF). The SURF technique uses a Hessian matrix composed of box filters that operator on points of the image to determine the location of keypoints as well as the scale of the image data at which the keypoint is an extremum in scale space. The box filters approximate Gaussian second order derivative filters. An orientation is assigned to the feature based on Gaussian- weighted, Haar-wavelet responses in the horizontal and vertical directions. A square aligned with the assigned orientation is centered about the point for purposes of generating a a feature descriptor. Multiple Haar-wavelet responses are generated at multiple points for orthogonal directions in each of 4 x 4 sub-regions that make up the square. The sum of the wavelet response in direction, together with the polarity and intensity information derived from the absolute values of the wavelet responses, yield a four-dimensional vector for each sub-region and a 64-length feature descriptor. SURF is taught in: Herbert Bay, Tinne Tuytelaars, Luc Van Gool, "SURF: Speeded Up Robust Features", Proceedings of the ninth European Conference on Computer Vision, May 2006, which is hereby incorporated by reference herein.

One skilled in the art will appreciate that there are other feature detectors and feature descriptors that may be employed in combination with the present invention. Exemplary feature detectors include the salient region detector that maximizes the entropy within the region, which was proposed by Kadir and Brady; and the edge- based region detector proposed by June et al; and various affϊne-invariant feature detectors known to those skilled in the art. Exemplary feature descriptors include Gaussian derivatives, moment invariants, complex features, steerable filters, and phase-based local features known to those skilled in the art.

Illustrated in FIG. 5 is a flowchart of the method of identifying one or more known objects that match the visual features extracted from the photograph, as referenced in step 308 of FIG. 3. As a first step, each of the extracted feature descriptors of the user's photograph is compared to the feature descriptors 272 of the known objects to identify 502 matching features. Two features match when the Euclidian distance between their respective SIFT feature descriptors is below some threshold. These matching features, referred to here as nearest neighbors, may be identified in any number of ways including a linear search ("brute force search"). In the preferred embodiment, however, the pattern recognition module 256 identifies a nearest- neighbor using a Best-Bin-First search in which the vector components of a feature descriptor are used to search a binary tree composed from each of the feature descriptors of the other images to be searched. Although the Best-Bin-First search is generally less accurate than the linear search, the Best-Bin-First search provides substantially the same results with significant computational savings.

After the nearest-neighbor is identified, a counter associated with the particular known object containing the nearest neighbor is incremented to effectively enter a "vote" to ascribe similarity between the image of the product and the known object with respect to the particular feature. In some embodiments, the voting is performed in a 5 dimensional space where the dimensions are image identifier or number, and the relative scale, rotation, and translation of the two matching features. The known objects that accumulate a number of "votes" in excess of a predetermined threshold (or dynamically determined threshold) are selected for subsequent processing as described below while the known objects with an insufficient number of votes are removed from further consideration.

With the features common to images of the item and known objects preliminarily identified, the pattern recognition module 256 determines 504 the geometric consistency between the sets of matching features common to the photograph of the item 210 and each image of a known object selected for subsequent processing. In the preferred embodiment, a combination of features (referred to as "feature patterns") that are common to the DoG images associated with the photograph and known object are aligned using an affine transformation, which maps the coordinates of features of the item's image to the coordinates of the corresponding features in the image of the known object. If the feature patterns are associated with the same underlying item 210 or product, the feature descriptors characterizing the product will geometrically align with only minor difference in the respective feature coordinates.

The degree to which the images of the known object and the item 210 match (or fail to match) can be quantified in terms of a "residual error" computed 506 for each affine transform comparison. A small error signifies a close alignment between the feature patterns which may be due to the fact that the same underlying object is being depicted in the two images. In contrast, a large error generally indicates that the feature patterns do not align, although common feature descriptors match individually by coincidence.

As indicated by decision block 508, the process described above is repeated for each image of a known object for which the number of matching visual features exceeds the predetermined threshold.

The process of implementing SIFT and pattern recognition described above has been extensively described in U.S. Patent No. 6,71 1 ,293 issued March 23, 2004, which is hereby incorporated by reference herein, and described by David G. Lowe, "Object Recognition from Local Scale-Invariant Features," Proceedings of the International Conference on Computer Vision, Corfu, Greece, September, 1999 and by David G. Lowe, "Local Feature View Clustering for 3D Object Recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, December, 2001 ; both of which are incorporated herein by reference.

In still other embodiments, the object recognition server 120 may be employed to: (1 ) identify and locate images including a known brand, design, or logo; (2) identify and locate several instances of the same image or substantially similar versions of an image that may include minor editing, the forms of editing including, for instance, cropping, re-sampling, and modification or removal of copyright information; (3) identify and locate all images in a database of images or video sequences, for example, that contain a user specified visual pattern, even if the original graphic has been distorted by: changes in scale, rotations (in-plane and out-of-plane), translations, affϊne geometric transformation, changes in brightness, changes in color, changes in gamma, compression artifacts, noise from various sources, lens distortion from an imaging process, cropping, changes in lighting, and occlusions that may obstruct portions of an object to be recognized (4) look-up works of art (paintings, statues, buildings, monuments) for reviews, historical information, and tour guide information, (5) look-up building addresses, (6) translate signage, (7) look-up digital postcards, (8) look-up movie information, (9) look-up collectibles, (10) retrieve restaurant guides, (11) look-up houses for sale, ( 12) look-up spare parts, (13) participate in scavenger hunts, and (14) identify the value of currency for the blind.

Although the description above contains many specifications, these should not be construed as limiting the scope of the invention but as merely providing illustrations of some of the presently preferred embodiments of this invention.

Therefore, the invention has been disclosed by way of example and not limitation, and reference should be made to the following claims to determine the scope of the present invention.

Claims

CLAIMSWe claim:

1. A method of performing object recognition in a communications network comprising a mobile computing device with camera operably coupled to a server, the method comprising: extracting, at the mobile computing device, one or more visual features from an image of an item captured with the mobile computing device; transmitting the one or more visual features from the mobile computing device to the server; comparing, at the server, the one or more visual features with a set of visual features associated with a plurality of known objects; and identifying the item from the plurality of known objects based on a match between the one or more extracted visual features and the set of visual features associated with the plurality of known objects.

2. The method of claim 1, wherein the visual features comprise scale-invariant features.

3. The method of claim 2, wherein the scale-invariant features comprise scale- invariant feature transform (SIFT) features or Speeded Up Robust Features (SURF).

4. The method of claim 3, wherein each of the visual features is characterized by a vector comprising image gradient data, an image scale, a feature orientation, and a geometric location.

5. The method of claim 1 , wherein the mobile computing device comprises a mobile phone with a camera.

6. The method of claim 1 , wherein the mobile computing device comprises a digital camera, personal computer, personal digital assistant (PDAs), digital camera with wireless network capability, portable media players, global positioning system (GPS).

7. The method of claim 1, wherein the communications network comprises the Internet.

8. The method of claim 1, further comprising the step of: returning, to the mobile computing device, information associated with the identified item.

9. The method of claim 8, wherein the information comprises product information.

10. The method of claim 9, wherein the information further comprises instructions adapted to allow a user to purchase the product associated with the product information.

1 1. The method of claim 8, wherein the returned information is selected from the group consisting of one or more: images, videos, multimedia data, web pages, hyperlinks, and combinations thereof.

12. A system for performing object recognition in a communications network, the system comprising: a first computing device and a second computing device; wherein the first computing device comprises: a) with camera adapted to capture a photograph of an item; b) a feature extractor adapted to extract one or more visual features from the image; c) a transmitter adapted to send the one or more visual features from the first computing device to the second computing device; and wherein the second computing device comprises: a) a database of visual features associated with a plurality of known objects; and b) a processor configured to identify the item based on the one or more visual features and the visual features associated with the plurality of known objects.

13. The system in claim 12, wherein the first computing device is a mobile phone.

14. The system in claim 12, wherein the second computing device is a server configured to initiate an action after identifying the item.

15. The system in claim 14, wherein the action includes transmitting product information to the user.

16. A mobile phone for performing object recognition in a communications network, the phone comprising: a camera configured to capture at least one image; a feature extractor configured to extract one or more scale-invariant visual features from the image; and a transmitter adapted to send the one or more visual features to an object recognition server via the communications network.