This application claims the benefit of and priority to U.S. Application No. 62/121,104, filed on Feb. 26, 2015, and is also a continuation-in-part of U.S. Application No. 14/749,118, filed on Jun. 24, 2015, the contents of both applications being incorporated by reference to the maximum extent allowable under law.
Detailed Description
In the following description, numerous details are set forth in order to provide an understanding of the present disclosure. However, it will be understood by those skilled in the art that the embodiments of the present disclosure may be practiced without these details and that numerous variations or modifications from the described embodiments may be possible.
As will be described in detail herein, the present disclosure relates to an algorithmic framework for determining mobile device user context in the form of motion activity, voice activity, and spatial environment using single sensor data and multi-sensor data fusion. In particular, the algorithmic framework provides probabilistic information about motion activity, voice activity, and spatial environment through heterogeneous sensor measurements, which may include data from accelerometers, barometers, gyroscopes, and microphones (but not limited to these sensors) embedded on mobile devices. The computing architecture allows for combining probabilistic outputs in a number of ways in order to infer meta-level context-aware information about a mobile device user.
Referring first to FIG. 1, an electronic device 100 is now described. The electronic device 100 may be a smartphone, tablet computer, smart watch, activity tracker, or other wearable device. The electronic device 100 includes a Printed Circuit Board (PCB) 99 on which various components are mounted. Conductive traces 97 printed on the PCB 99 are used to electrically couple these various components in a desired manner.
Mounted on the PCB 99 is a system-on-a-chip (SoC) 150 that includes a Central Processing Unit (CPU) 152 coupled to a Graphics Processing Unit (GPU) 154. Coupled to the SoC 150 are a memory block 140, an optional transceiver 160 (such as a WiFi transceiver) via which the SoC 150 may wirelessly communicate with a remote server over the internet, and a touch-sensitive display 130 via which the SoC 150 may display output and receive input. Also coupled to the SoC 150 is a sensor unit 110 that includes a three-axis accelerometer 111 for determining accelerations experienced by the electronic device 100, a microphone 112 for detecting audible noise in the environment, a barometer 113 for determining the atmospheric pressure of the environment (and thus indicating the altitude of the electronic device 100), a gyroscope 114 for determining the angular rate, and thus the orientation (roll, pitch, or yaw), of the electronic device 100 relative to the environment, a magnetometer 118 for determining the strength of magnetic fields in the environment and thereby the orientation of the electronic device 100, and a proximity sensor 119 for determining the proximity of a user with respect to the electronic device 100. The sensor unit 110 may also include a GPS receiver via which the SoC 150 may determine the geospatial location of the electronic device 100, and a light sensor for determining the ambient light level in the environment in which the electronic device 100 is located.
The sensor unit 110 is configurable and is mounted on the PCB 99 spaced apart from the SoC 150, with its various sensors coupled to the SoC 150 by the conductive traces 97. Some of the sensors of the sensor unit 110 may form a MEMS sensing unit 105, which may include any sensor that can be implemented in MEMS, such as the accelerometer 111 and the gyroscope 114.
The sensor unit 110 may be formed of discrete components and/or integrated components and/or a combination of discrete and integrated components, and may be formed as a package. It should be understood that the sensors shown as part of the sensor unit 110 are each optional, and that some of the sensors shown may be used, and some of the sensors shown may be omitted.
It should be understood that the configurable sensor unit 110 or the MEMS sensing unit 105 is not part of the SoC 150, but is instead a separate and distinct component from the SoC 150. In practice, the sensor unit 110 or MEMS sensing unit 105 and the SoC 150 may be separate, distinct, mutually exclusive structures or packages mounted on the PCB 99 at different locations and coupled together via the conductive traces 97 as shown. In other applications, the sensor unit 110 or MEMS sensing unit 105 and the SoC 150 may be contained within a single package, or may have any other suitable physical relationship. Further, in some applications, the sensor unit 110 or MEMS sensing unit 105 and the processing node 120 may be collectively considered as a sensor chip 95.
Each sensor of the sensor unit 110 collects signals, performs signal conditioning, and presents a digitized output, potentially at a different sampling rate. A single one of these sensors may be used, or multiple ones of these sensors may be used. The multi-channel digital sensor data from the sensors of the sensor unit 110 is passed to the processing node 120, which performs various signal processing tasks. First, the pre-processing steps of filtering and down-sampling the multi-channel sensor data are completed (block 121), and then time synchronization between the different data channels is performed when sensor data from multiple sensors is used (block 122). The sensor data obtained from a single sensor or from multiple sensors is then buffered into frames using overlapping/sliding time domain windows (block 123). Sensor-specific features are extracted from each data frame and given as output to a probabilistic classifier routine (block 124).
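By way of illustration only, the framing step of block 123 may be sketched as follows; the class and method names (FrameBuffer, frameSignal) and the parameter choices are hypothetical and are not part of this disclosure.

```java
// Hypothetical sketch of the framing step (block 123): buffering a single
// channel of pre-processed sensor data into overlapping, fixed-duration frames.
import java.util.ArrayList;
import java.util.List;

public class FrameBuffer {
    // Splits signal s_i(m) into frames x_i(n) of windowLen samples, with
    // consecutive frames offset by hopLen samples (overlap = windowLen - hopLen).
    static List<float[]> frameSignal(float[] signal, int windowLen, int hopLen) {
        List<float[]> frames = new ArrayList<>();
        for (int start = 0; start + windowLen <= signal.length; start += hopLen) {
            float[] frame = new float[windowLen];
            System.arraycopy(signal, start, frame, 0, windowLen);
            frames.add(frame);
        }
        return frames;
    }

    public static void main(String[] args) {
        // Example: 50 Hz accelerometer data, 5 s windows shifted by 2 s,
        // matching the parameters described later in this disclosure.
        float[] signal = new float[50 * 150]; // 150 s of samples (placeholder data)
        List<float[]> frames = frameSignal(signal, 50 * 5, 50 * 2);
        System.out.println("Frames: " + frames.size());
    }
}
```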
In the probabilistic classifier routine, Motion Activity Vectors (MAVs), Voice Activity Vectors (VAVs), and Spatial Environment Vectors (SEVs) are generated from these sensor-specific features. These vectors are then processed to form a posterior probability from each vector (block 125). Pattern libraries of probabilistic classifiers, stored in the memory block 140 or in a cloud 170 accessed over the internet, are used to obtain the three posterior probabilities from these vectors. Using these pattern libraries, a basic level context aware posterior probability is obtained for each data frame, which can be used to make inferences about the basic level or meta level context of the electronic device 100 (block 126). The display 130 may be used to present such inferences and intermediate results, as desired.
Therefore, a motion activity posterior probability is generated from the motion activity vector, and represents the probability of each element of the motion activity vector as a function of time. A voice activity posterior probability is generated from the voice activity vector, and represents the probability of each element of the voice activity vector as a function of time. A spatial environment posterior probability is generated from the spatial environment vector, and represents the probability of each element of the spatial environment vector as a function of time. The sum of the individual probabilities of the motion activity posterior probability at any given time is equal to one (i.e., 100%). Similarly, the sum of the individual probabilities of the voice activity posterior probability at any given time is equal to one, and the sum of the individual probabilities of the spatial environment posterior probability at any given time is equal to one.
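For clarity, these relationships may be written out as follows, using illustrative notation in which $L$ denotes the number of classes in a vector and $Z$ denotes the sensor observations:

```latex
\mathrm{MAP}(t) = \bigl[\, P(\mathrm{class}_1 \mid Z, t),\ \ldots,\ P(\mathrm{class}_L \mid Z, t) \,\bigr]',
\qquad
\sum_{l=1}^{L} P(\mathrm{class}_l \mid Z, t) = 1,
```

with analogous expressions and normalization constraints holding for the VAP and the SEP.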
The basic level context has a plurality of aspects, each aspect based on the motion activity vector, the voice activity vector, or the spatial environment vector. The aspects of the basic level context based on the motion activity vector are mutually exclusive of each other, the aspects of the basic level context based on the voice activity vector are mutually exclusive of each other, and the aspects of the basic level context based on the spatial environment vector are mutually exclusive of each other.
One of these aspects of the basic level context is the motion pattern of the user carrying the electronic device. Another of these aspects of the basic level context is the nature of the biologically generated sound within audible distance of the user. A further aspect of the basic level context is the nature of the physical space around the user.
Examples of the multiple classes of motion patterns, properties of biologically generated sounds, and properties of physical spaces will now be given, although it is to be understood that the present disclosure contemplates and is intended to encompass any such classes.
The different classes of motion patterns may include the user standing still, walking, going up stairs, going down stairs, jogging, cycling, climbing, using a wheelchair, and riding in a vehicle. The different classes of properties of the biologically generated sound may include the user being engaged in a telephone conversation, the user being engaged in a multi-party conversation, the user speaking, another party speaking, a background conversation occurring around the user, and an animal uttering a sound. The different classes of properties of the physical space around the user may include an office environment, a home environment, a mall environment, a street environment, a stadium environment, a restaurant environment, a bar environment, a beach environment, a natural environment, a temperature of the physical space, an air pressure of the physical space, and a humidity of the physical space.
Each vector has a "none of these" class, which represents all remaining classes that are not explicitly incorporated as elements of that vector. This allows the sum of the probabilities of the elements of the vector to be equal to one, keeping the vector mathematically consistent. This also makes the vector representation flexible, so that new classes can be explicitly incorporated into the corresponding vector as needed, which simply changes the composition of the "none of these" class for that vector.
A meta-level context represents an inference made from a combination of the probabilities of two or more classes of the posterior probabilities. For example, the meta-level context may be that the user of the electronic device 100 is walking in a mall, or is engaged in a telephone conversation in an office.
The processing node 120 may communicate the determined basic level context and meta-level context to the SoC 150, which may perform at least one contextual function of the electronic device 100 according to the basic level context or the meta-level context of the electronic device 100.
FIG. 3 shows the derivation of basic level context awareness from time-dependent information about the activity/environment classes in each of the three vectors. Meta-level context awareness is derived from the time-stamped information available from one or more of these basic level vectors and from information stored in the mobile device memory 140 or the cloud 170 (e.g., pattern libraries and databases). A desirable form of representing this information, useful in application development related to basic level and meta-level context awareness, is now introduced.
The information is represented in the form of the probability of each class of the vectors (motion activity, voice activity, and spatial environment) as a function of time, given the observations from one sensor or from multiple sensors. This general information representation can be used to solve several application problems, such as detecting possible events from each vector in a time frame. These probabilities can be estimated as the posterior probabilities of each element of the MAV, VAV, and SEV at a given time, conditioned on the "observations", which are features derived from the sensor data records. The respective vectors of probability values are the corresponding posterior probabilities, namely the motion activity posterior probability (MAP), the voice activity posterior probability (VAP), and the spatial environment posterior probability (SEP), which form the processed output carrying the basic level context awareness information.
FIG. 4 shows that the MAP comprises the probability of each element of the MAV as a function of time, estimated from features derived from time-window observation data. The probability of each motion activity class is estimated from time window data obtained from one or more of the various sensors. Some of the models that may be used are: i) Hidden Markov Models (HMMs); ii) Gaussian Mixture Models (GMMs); iii) Artificial Neural Networks (ANNs) that produce probabilistic outputs for each class; and iv) multi-class probabilistic Support Vector Machines (SVMs) incorporating Directed Acyclic Graphs (DAGs) or Max-Wins Voting (MWV). For each motion activity class, the model parameters are trained using supervised learning from a training database that includes annotated data from all sensors to be used.
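As a minimal sketch of how such a model may map a feature vector to per-class posterior probabilities, consider a single diagonal Gaussian per class (i.e., a one-component GMM) with parameters obtained from supervised training; this simplification, and all names in the code, are illustrative assumptions rather than the specific models used here.

```java
// Illustrative one-component GMM classifier: per-class diagonal Gaussian
// likelihoods are combined with class priors via Bayes' rule and normalized.
public class GaussianClassifier {
    double[][] mean;   // [class][feature] means from supervised training
    double[][] var;    // [class][feature] variances from supervised training
    double[] logPrior; // log of the class prior probabilities

    GaussianClassifier(double[][] mean, double[][] var, double[] logPrior) {
        this.mean = mean; this.var = var; this.logPrior = logPrior;
    }

    // Returns P(class_l | z) for each class l, normalized to sum to one.
    double[] posterior(double[] z) {
        int L = mean.length;
        double[] logp = new double[L];
        double max = Double.NEGATIVE_INFINITY;
        for (int l = 0; l < L; l++) {
            double s = logPrior[l];
            for (int f = 0; f < z.length; f++) {
                double d = z[f] - mean[l][f];
                s += -0.5 * (Math.log(2 * Math.PI * var[l][f]) + d * d / var[l][f]);
            }
            logp[l] = s;
            max = Math.max(max, s);
        }
        double sum = 0;
        double[] post = new double[L];
        for (int l = 0; l < L; l++) { post[l] = Math.exp(logp[l] - max); sum += post[l]; }
        for (int l = 0; l < L; l++) post[l] /= sum;
        return post;
    }

    public static void main(String[] args) {
        double[][] mean = {{0.2}, {1.5}};  // hypothetical trained means
        double[][] var  = {{0.1}, {0.3}};  // hypothetical trained variances
        double[] logPrior = {Math.log(0.5), Math.log(0.5)};
        GaussianClassifier c = new GaussianClassifier(mean, var, logPrior);
        double[] post = c.posterior(new double[] {1.2});
        System.out.printf("P = [%.3f, %.3f]%n", post[0], post[1]);
    }
}
```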
The number of sensors used to obtain the MAP depends on a number of factors, such as the number of available sensors on the mobile device 100, energy consumption constraints for the task, the desired accuracy of the estimation, and so forth. When more than one sensor is used, different methods may be used to estimate the MAP. One particularly useful method for fusing data from up to K different sensors to estimate the MAP is shown in FIG. 4. In this method, sensor-specific features are extracted from the time window data from each corresponding sensor, and these features from the sensors are used together to obtain the MAP.
FIG. 5 shows that the VAP and SEP comprise the probabilities of the elements of the VAV and SEV, respectively, as functions of time, estimated from features derived from time-window observations received from the microphone 112 (or, where a microphone array is used, from the beamformed output of such a microphone array). As with the MAP, the probabilities are obtained from models such as HMMs, GMMs, ANNs, and multi-class probabilistic SVMs incorporating DAGs or MWV that produce probabilistic outputs for each class. For each voice activity and spatial environment class, the model parameters are trained using supervised learning from a training database that includes annotated data from all sensors to be used.
The MAP based on tri-axial accelerometer data for a "walking" motion activity of 150 seconds duration is shown in FIG. 6. The tri-axial accelerometer data is sampled at 50 Hz, and five second time window data frames are extracted. Successive frames are obtained by shifting the time window by two seconds. The magnitude of the three-channel data is used to extract 17-dimensional features per frame. These features include the maximum value, minimum value, mean, root mean square, three cumulative features, and 10th order linear prediction coefficients. The probability of each activity is estimated from a multi-class probabilistic SVM framework incorporating a DAG. For the motion activities in the MAV, the multi-class probabilistic SVM-DAG model of the MAP graph in FIG. 6 is trained from the tri-axial accelerometer data using supervised learning from a training database that includes time-synchronized multi-sensor data from the tri-axial accelerometer 111, barometer 113, tri-axial gyroscope 114, microphone 112, and tri-axial magnetometer 118.
The temporal evolution of posterior probability information, as shown for the MAP in FIG. 6, is a general representation of context aware information at the basic level. It provides the probability of each class in an activity/context vector at a given time and shows its evolution over time. The following salient features of this representation format are relevant:
at any given time, the sum of the probabilities for all classes equals one; and
at any given time, the activity/context classification can be performed from the corresponding posterior probabilities by favoring the most probable class, thus providing a hard decision.
The "confidence" in the classification result, such as the difference between the maximum probability value and the second highest probability value, may be obtained from different measurements. The greater the difference between the two probability values, the greater confidence in the accuracy of the decoded class should be.
It can be observed from FIG. 6 that the probability of walking is highest compared to the probabilities of all other motion activities, which results in correct classification at almost all times in the graph. The classification result is erroneous in two small time intervals, where the correct activity is misclassified as "stair up".
Another time evolution of the MAP based on tri-axial accelerometer data, for a 30 second duration "stair up" motion activity, is shown in FIG. 7. It can be seen that the maximum probability class at each time instant varies between "stair up", "walking", and some other motion activities. Therefore, the decoded motion activity will be erroneous at those times where the "stair up" class does not have the maximum probability. Also, the maximum probability at each time instant is lower than for the "walking" activity shown in the MAP of FIG. 6, and is closer to the next highest probability. From this it can be deduced that the "confidence" in the accuracy of the decoded class is lower than in the "walking" activity case of FIG. 6.
FIG. 8 shows two methods of fusing data from multiple sensors. The first method involves concatenating the features obtained from each sensor to form a composite feature vector. This feature vector is then given as input to the probabilistic classifier. The second method is based on Bayesian theory. Suppose the observation is $Z^K = \{Z_1, \ldots, Z_K\}$, where $Z_i$ is the feature vector for sensor number $i$. The Bayesian approach assumes the following: given a particular class, the information collected from the feature vector $Z_i$ of sensor $S_i$ and the information collected from the feature vector $Z_j$ of sensor $S_j$ are independent. That is, $P(Z_i, Z_j \mid \mathrm{class}_l) = P(Z_i \mid \mathrm{class}_l)\,P(Z_j \mid \mathrm{class}_l)$, which gives the joint probability of the feature vectors from multiple sensors given the class. Bayes' theorem is then used to fuse the data from the multiple sensors to obtain the posterior probability.
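The following sketch illustrates this Bayesian fusion under the stated conditional independence assumption, working in the log domain to avoid numerical underflow; the per-sensor log-likelihood values are assumed to come from trained models such as those described above, and all names are illustrative.

```java
// Hedged sketch of the second (Bayesian) fusion method: under conditional
// independence, the per-class likelihoods contributed by each sensor multiply,
// and Bayes' theorem then yields the fused posterior.
public class BayesFusion {
    // logLik[i][l] = log P(Z_i | class_l) for sensor i and class l;
    // logPrior[l] = log P(class_l). Returns the fused P(class_l | Z^K).
    static double[] fuse(double[][] logLik, double[] logPrior) {
        int L = logPrior.length;
        double[] logPost = new double[L];
        double max = Double.NEGATIVE_INFINITY;
        for (int l = 0; l < L; l++) {
            double s = logPrior[l];
            for (double[] sensor : logLik) s += sensor[l]; // product of likelihoods
            logPost[l] = s;
            max = Math.max(max, s);
        }
        double sum = 0;
        double[] post = new double[L];
        for (int l = 0; l < L; l++) { post[l] = Math.exp(logPost[l] - max); sum += post[l]; }
        for (int l = 0; l < L; l++) post[l] /= sum;
        return post;
    }

    public static void main(String[] args) {
        double[][] logLik = {{-1.0, -2.0}, {-0.5, -3.0}}; // two sensors, two classes
        double[] logPrior = {Math.log(0.5), Math.log(0.5)};
        double[] post = fuse(logLik, logPrior);
        System.out.printf("P = [%.3f, %.3f]%n", post[0], post[1]);
    }
}
```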
FIG. 2 depicts a flow diagram of a method for determining probabilistic context awareness for a mobile device user using single-sensor and multi-sensor data fusion. Let $S_i$ denote the $i$-th sensor, where $i = 1, 2, \ldots, K$ and $K$ is the total number of sensors used (block 202). The sensors provide input data $s_i(m)$, where $i$ is the sensor number from 1 to $K$, and $m$ is the discrete time index. The preprocessed, time-aligned data $s_i(m)$ is segmented into a plurality of fixed duration frames $x_i(n)$ (block 204).
Thereafter, sensor-specific features are extracted and grouped into a plurality of vectors (block 206). Let $z_f^i$ be a feature $f$ extracted from the data $x_i(n)$ of the $i$-th sensor. The feature vector for the $i$-th sensor is given by $Z_i = [z_1^i, z_2^i, \ldots, z_{F_i}^i]'$. The composite feature vector for $K$ sensors is denoted by $Z^K$. For basic level context detection, the following features are extracted.
i. MAV:
a. An accelerometer: maximum value, minimum value, mean, root mean square, three cumulative features, and 10th order linear prediction coefficients (a sketch of some of these computations follows this list).
The three cumulative features are as follows:
1. Average minimum: defined as the average of the lowest 15% of the sorted values of $x_i(n)$.
2. Average median: defined as the average of the sorted values of $x_i(n)$ between 30% and 40%.
3. Average maximum: defined as the average of the sorted values of $x_i(n)$ between 95% and 100%.
b. A pressure sensor: maximum value, minimum value, mean, slope, and 6th order linear prediction coefficients.
c. A gyroscope: maximum value, minimum value, mean, root mean square, three cumulative features, and 10th order linear prediction coefficients.
d. A microphone: concatenated 10th order linear prediction coefficients, zero crossing rate, and short time energy.
ii. VAV and SEV:
a. A microphone: 13 mel-frequency cepstral coefficients (MFCCs), 13 differential MFCCs, and 13 double differential MFCCs.
b. A microphone array: 13 MFCCs, 13 differential MFCCs, and 13 double differential MFCCs.
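As referenced above, the following is an illustrative sketch of computing a subset of the accelerometer features (maximum, minimum, mean, root mean square, and the three cumulative features). The interpretation of the cumulative features as averages over sorted values is an assumption, and the linear prediction coefficients are omitted for brevity.

```java
// Illustrative extraction of 7 of the 17 per-frame accelerometer features
// from the magnitude signal x_i(n) of one frame.
import java.util.Arrays;

public class AccelFeatures {
    static double[] extract(double[] x) {
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        double mean = 0, rms = 0;
        for (double v : x) { mean += v; rms += v * v; }
        mean /= x.length;
        rms = Math.sqrt(rms / x.length);
        int n = x.length;
        return new double[] {
            sorted[n - 1],                 // maximum value
            sorted[0],                     // minimum value
            mean,                          // mean
            rms,                           // root mean square
            rangeMean(sorted, 0.00, 0.15), // average minimum
            rangeMean(sorted, 0.30, 0.40), // average median
            rangeMean(sorted, 0.95, 1.00)  // average maximum
        };
    }

    // Mean of the sorted values lying between fractions lo and hi of the frame.
    static double rangeMean(double[] sorted, double lo, double hi) {
        int a = (int) Math.floor(lo * sorted.length);
        int b = Math.max(a + 1, (int) Math.ceil(hi * sorted.length));
        b = Math.min(b, sorted.length);
        double s = 0;
        for (int k = a; k < b; k++) s += sorted[k];
        return s / (b - a);
    }

    public static void main(String[] args) {
        double[] frame = new double[250]; // 5 s at 50 Hz (placeholder data)
        for (int k = 0; k < frame.length; k++) frame[k] = Math.sin(0.1 * k);
        System.out.println(Arrays.toString(extract(frame)));
    }
}
```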
The feature vectors are given as inputs to a probabilistic classifier, such as a multi-class probabilistic SVM-DAG (block 208). The obtained outputs are the corresponding posterior probabilities, viz. the MAP, VAP, and SEP of the corresponding basic level context awareness vectors MAV, VAV, and SEV (block 212). The posterior probability is of the form $[P(\mathrm{class}_1 \mid Z^K)\ P(\mathrm{class}_2 \mid Z^K)\ \ldots\ P(\mathrm{class}_L \mid Z^K)]'$, where $L$ is the number of classes in the MAV/VAV/SEV.
FIGS. 9 and 10 show MAPs using data from two sensors, namely the tri-axial accelerometer and the barometer. The 17 features from the tri-axial accelerometer listed above are used together with one barometer feature (namely, the time slope of the pressure over a 5 second frame, estimated using the least squares method) in a multi-class probabilistic SVM-DAG model with an 18-dimensional input to obtain the probability of each activity class. Comparing FIG. 6 with FIG. 9, it can be seen that one of the two false decision intervals present when only accelerometer data is used is corrected by the fusion with barometer data. The effect of the fusion is likewise evident in the comparison of FIG. 7 and FIG. 10, where all of the incorrect decisions made using only accelerometer sensor data are corrected when the accelerometer data is fused with the barometer data. The additional input from the pressure sensor correctly disambiguates the "stair up" activity from "walking" and other activities.
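The barometer feature mentioned above, the least squares time slope of the pressure over a frame, may be sketched as follows; the sampling period and units are illustrative assumptions.

```java
// Hypothetical sketch: ordinary least squares estimate of the time slope of
// the pressure signal over one frame. The sample period dt is assumed; e.g.
// dt = 1/50 s would correspond to 50 Hz sampling.
public class PressureSlope {
    static double slope(double[] p, double dt) {
        int n = p.length;
        double sumT = 0, sumP = 0, sumTP = 0, sumTT = 0;
        for (int k = 0; k < n; k++) {
            double t = k * dt;
            sumT += t; sumP += p[k];
            sumTP += t * p[k]; sumTT += t * t;
        }
        // Closed-form least squares estimate of the linear trend.
        return (n * sumTP - sumT * sumP) / (n * sumTT - sumT * sumT);
    }

    public static void main(String[] args) {
        double[] p = new double[250]; // 5 s at 50 Hz (placeholder data)
        for (int k = 0; k < p.length; k++) p[k] = 1013.0 - 0.02 * k / 50.0;
        System.out.println("slope = " + slope(p, 1.0 / 50.0) + " hPa/s");
    }
}
```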
The performance of a 9-class motion activity classifier using the probabilistic MAP output is shown in FIG. 11 in the form of a confusion matrix. The classification is based on the fusion of the 18 features obtained from the accelerometer and barometer data collected from a smartphone. The MAP is obtained using a multi-class probabilistic SVM-DAG model that was previously trained on user data. The performance results have been obtained using leave-one-out validation on the data from 10 subjects. The rows of the confusion matrix give the true motion activity class and the columns give the decoded activity class. Thus, the diagonal values represent the percentage of correct decisions for the corresponding class, while the off-diagonal values represent incorrect decisions. The total percentage of correct decisions obtained for the 9 activity classes was 95.16%.
Single-sensor data and/or multi-sensor fused data are used to derive the probabilistic output of basic level context aware information. This general algorithmic framework for basic level context awareness is extensible, such that it may also include more motion and voice activity classes and spatial environment contexts in the probabilistic output format as needed. The corresponding posterior probability outputs may be integrated over time to provide a more accurate, but delayed, decision regarding the activity or environment class. The algorithmic framework also allows additional posterior probabilities for other classes of detection tasks, derived from the same sensors or from additional sensors, to be integrated.
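One plausible reading of this temporal integration, presented here as an assumption rather than as the specific method of this disclosure, is to accumulate log posterior probabilities across successive frames before making a decision:

```java
// Illustrative sketch: sum the log posteriors over the frames in an
// integration window and decode the class with the largest accumulated value,
// trading decision latency for accuracy.
public class TemporalIntegration {
    // posteriors[t][l] = P(class_l | Z, frame t). Returns the class index
    // maximizing the accumulated log posterior over all frames provided.
    static int integratedDecision(double[][] posteriors) {
        int L = posteriors[0].length;
        double[] acc = new double[L];
        for (double[] frame : posteriors)
            for (int l = 0; l < L; l++)
                acc[l] += Math.log(Math.max(frame[l], 1e-12)); // guard against log(0)
        int best = 0;
        for (int l = 1; l < L; l++) if (acc[l] > acc[best]) best = l;
        return best;
    }

    public static void main(String[] args) {
        double[][] posteriors = {{0.6, 0.4}, {0.45, 0.55}, {0.7, 0.3}};
        System.out.println("decoded class: " + integratedDecision(posteriors));
    }
}
```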
The posterior probability outputs for the motion activity, voice activity, and spatial environment classes can be used to perform meta-level probabilistic analysis and to develop embedded applications for context awareness, as shown in FIG. 12. For example, an inference of the "walking" activity class from the MAP and of the "mall" class from the SEP may together yield the meta-level inference that the user is walking in a mall. The probabilistic information in the three posterior probabilities can be used as input to a meta-level context aware classifier, upon which more advanced applications can be built.
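A minimal sketch of such a meta-level inference follows; treating the MAP and SEP entries as independent and multiplying them is an illustrative assumption, not the only combination contemplated by this disclosure, and the class indices are hypothetical.

```java
// Illustrative meta-level inference: combine the "walking" probability from
// the MAP with the "mall" probability from the SEP under an independence
// assumption to score the meta-level context "walking in a mall".
public class MetaLevelInference {
    static double walkingInMall(double[] map, int walkingIdx,
                                double[] sep, int mallIdx) {
        return map[walkingIdx] * sep[mallIdx];
    }

    public static void main(String[] args) {
        double[] map = {0.80, 0.10, 0.10}; // e.g. {walking, jogging, still}
        double[] sep = {0.70, 0.20, 0.10}; // e.g. {mall, office, street}
        double p = walkingInMall(map, 0, sep, 0);
        System.out.println("P(walking in a mall) = " + p);
    }
}
```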
FIG. 13 shows snapshots of an application developed using Java for an Android OS based smartphone. The user interface of the application includes start, stop, and pause buttons, as shown in the snapshot on the left, for calculating the posterior probabilities in real time, logging their time evolution, and displaying them graphically in real time for up to 40 past frames. The snapshot on the right shows the MAP of the 9 motion activity classes as a function of time. It also displays the decoded class of the current frame, taken from the maximum probability value. The total duration of time the user has spent in each motion activity class since the start of the application is also shown. The application determines the motion activity posterior probability using a fusion of accelerometer, barometer, and gyroscope data. The number of features varies depending on the number of sensors used. The posterior probability is evaluated using one of three methods: i) a multi-class probabilistic SVM in conjunction with a DAG; ii) a multi-class probabilistic SVM in conjunction with MWV; and iii) a multi-class SVM that produces hard decision outputs. The real-time graphical display of the probability values of all classes also gives a quick visual depiction of the "confidence" of the classification result, by comparing the most probable class with the second highest probability class.
Although the foregoing description has been described herein with reference to particular means, materials and embodiments, it is not intended to be limited to the particulars disclosed herein; but rather extends to all functionally equivalent structures, methods and uses, such as are within the scope of the appended claims.