WO2022155259A1 - Audible textual virtualization - Google Patents
- Publication number
- WO2022155259A1 (PCT/US2022/012197, US2022012197W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- computing device
- pronunciation
- user
- book
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B17/00—Teaching reading
- G09B17/02—Line indicators or other guides or masks
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B19/00—Teaching not covered by other main groups of this subclass
- G09B19/04—Speaking
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/221—Announcement of recognition results
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Definitions
- the present disclosure relates to detection of audible speech and visualization of the recognized speech.
- Speech recognition software is used to capture audible speech and identify the audible speech using a computer.
- Speech recognition software may be used for inputting commands and instructions, such as by telling a smart home device a command, talking into an input device to convert speech to text in a mobile device, or using speech commands instead of keyboard inputs to navigate through a menu.
- Current implementations of speech recognition software often have a noticeable delay after the audible command or input is spoken.
- the speech recognition software often miscategorizes or fails to recognize different speech inputs based on various factors of the speech input, such as the volume of the input, background noise, mumbles, etc.
- the method includes capturing, using a video capture device associated with a computing device, a video stream of a physical activity scene including a tangible interface object; capturing, using an audio capture device associated with the computing device, an audio stream of an environment around the audio capture device, the audio stream including a pronunciation of a word by a user; comparing the captured audio stream including the pronunciation of the word to an expected sound model; and displaying a visual cue on a display screen of the computing device based on the comparison.
- Implementations may include one or more of the following features.
- the method may include determining, using a processor of the computing device, an identity of the tangible interface object; and where, the expected sound model is based on the identity of the tangible interface object.
- the tangible interface object is a book and determining the identity of the tangible interface object includes determining a title of the book.
- the audio stream that includes the pronunciation of the word by the user is a pronunciation of the word from the book.
- the expected sound model is from a database of sound models associated with a specific page of the book.
- the visual cue is a virtual depiction of the pronunciation of the word.
- the visual cue further may include a highlighting effect that indicates a correctness of the pronunciation of the word.
- the method may include determining, using a processor of the computing device, that an expected word was missed by the user during the pronunciation of the word based on the comparison between the captured audio stream including the pronunciation of the word and the expected sound model; and displaying an additional visual cue indicating the expected word that was missed.
- the reading session system also includes a stand configured to position a computing device having one or more processors; a video capture device configured to capture a video stream of a physical activity scene, the video stream including a tangible interface object in the physical activity scene; an audio capture device configured to capture an audio stream of an environment around the audio capture device, the audio stream including a pronunciation of a word by a user; an activity application executable by the one or more processors to compare the captured audio stream including the pronunciation of the word to an expected sound model; and a display configured to display a visual cue based on the comparison.
- Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
- the reading session system may include: a detector executable by the one or more processors to determine an identity of the tangible interface object; and where, the expected sound model is based on the identity of the tangible interface object.
- the tangible interface object is a book and determining the identity of the tangible interface object includes determining a title of the book.
- the audio stream that includes the pronunciation of the word by the user is a pronunciation of the word from the book.
- the expected sound model is from a database of sound models associated with a specific page of the book.
- the visual cue is a virtual depiction of the pronunciation of the word.
- the visual cue further may include a highlighting effect that indicates a correctness of the pronunciation of the word.
- the activity application is further configured to determine that an expected word was missed by the user during the pronunciation of the word based on the comparison between the captured audio stream including the pronunciation of the word and the expected sound model and the display is further configured to display an additional visual cue indicating the expected word was missed.
- One general aspect includes a method including capturing, using a video capture device associated with a computing device, a video stream of a physical activity scene including a book with a group of visible words; capturing, using an audio capture device associated with the computing device, an audio stream including a pronunciation of the group of visible words; determining, using a processor of the computing device, an identity of a visible page of the book from the captured video stream; retrieving, using the processor of the computing device, a group of expected sound models based on the identity of the visible page of the book; comparing, using the processor of the computing device, the captured audio stream including the pronunciation of the group of visible words to the group of expected sound models; determining, using the processor of the computing device, a correctness of the pronunciations of the group of visible words by determining which pronunciations of the group of visible words from the captured audio stream exceed a matching threshold with a sound model from the group of expected sound models based on the comparison; and displaying, on a display of the computing device, the correctness of the pronunciations of the group of visible words as visual cues.
- Implementations may include one or more of the following features.
- the method where displaying the correctness of the pronunciations of the group of visible words as visual cues includes displaying a virtual representation of the visible words with highlighting indicating which of the pronunciations of the visible words were correct and which of the pronunciations of the visible words were incorrect.
- Determining the identity of the visible page of the book from the captured video stream includes detecting a reference marker on the visible page of the book from the captured video stream and determining an identity of the reference marker.
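- For purposes of illustration only, the per-word comparison described in the preceding aspect can be pictured with a short sketch. The snippet below assumes that an upstream recognizer has already turned the captured audio stream into a word list; the helper name score_page_reading and the 0.8 matching threshold are illustrative assumptions and are not values taken from the disclosure.

```python
# Hedged sketch only: assumes recognizer output is available as text; the
# threshold and function name are illustrative, not from the disclosure.
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.8  # assumed similarity needed to count a pronunciation as correct


def score_page_reading(recognized_words, expected_words):
    """Compare each recognized word to the expected word for the visible page."""
    results = []
    for spoken, expected in zip(recognized_words, expected_words):
        similarity = SequenceMatcher(None, spoken.lower(), expected.lower()).ratio()
        results.append({
            "expected": expected,
            "spoken": spoken,
            "correct": similarity >= MATCH_THRESHOLD,
            "similarity": round(similarity, 2),
        })
    return results


# Example: a reader's attempt at a short page of text.
print(score_page_reading(["here's", "a", "tase"], ["Here's", "a", "taste"]))
```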
- Figure 1 is an example configuration of a system for virtualization of speech recognition.
- Figure 2 is an example configuration of a system for virtualization of speech recognition.
- Figure 3 is an example configuration of a system for virtualization of speech recognition.
- Figure 4 is an example configuration of a system for virtualization of speech recognition.
- Figure 5 is an example configuration of a system for virtualization of speech recognition.
- Figure 6 is an example configuration of a system for virtualization of speech recognition.
- Figures 7A-7C are an example configuration of a system for virtualization of speech recognition.
- Figure 8 is an example configuration of a system for virtualization of speech recognition.
- Figure 9 is an example configuration of a system for virtualization of speech recognition.
- Figure 10 is an example configuration of a system for virtualization of speech recognition.
- Figure 11 is an example configuration of a system for virtualization of speech recognition.
- Figures 12A-12B are an example configuration of a system for virtualization of speech recognition.
- Figure 13 is an example configuration of a system for virtualization of speech recognition.
- Figure 14 is an example configuration of a system for virtualization of speech recognition.
- Figures 15A-15C are an example configuration of a system for virtualization of speech recognition.
- Figure 16 is an example configuration of a system for virtualization of speech recognition.
- Figure 17 is a block diagram illustrating an example computer system for virtualization of speech recognition.
- Figure 18 is a block diagram illustrating an example computing device.
- Figure 19 is a flowchart of an example method for virtualization of speech recognition by capturing audible sound and matching it to sound models.
- Figure 20 is an example configuration of a system for virtualization of speech recognition.
- Figure 1 is an example configuration of a system 100 for audible text virtualization.
- a graphical cue such as a visual cue 124 may be presented on a display 112 of a computing device 104.
- the configuration of the system 100 includes, in part, a tangible, physical activity scene 116, on which tangible interface objects 120 (not shown) may be positioned (e.g., placed, drawn, created, molded, built, projected, etc.) and a computing device 104 that is equipped with or otherwise coupled to a video capture device 110 (not shown), fitted with an adapter 108, that is configured to capture video of the physical activity scene 116.
- the video capture device 110 can have a field of view that is directed towards an area that includes the physical activity scene 116.
- the computing device 104 includes novel software and/or hardware capable of displaying a virtual scene including in some implementations the visual cue 124 along with other virtual elements.
- while the physical activity scene 116 on which the platform is situated is depicted as substantially horizontal in Figure 1, it should be understood that the physical activity scene 116 can be vertical or positioned at any other angle suitable to the user for interaction.
- the physical activity scene 116 can have any color, pattern, texture, and topography.
- the physical activity scene 116 can be substantially flat or be disjointed/discontinuous in nature.
- Non-limiting examples of an activity scene include a table, a table top, desk, counter, ground, a wall, a whiteboard, a chalkboard, a customized scene, a user’s lap, etc.
- the physical activity scene 116 may be preconfigured for use with a tangible interface object 120 (not shown), while in further implementations, the physical activity scene 116 may be any scene on which the tangible interface object 120 may be positioned.
- an input device 318 such as a microphone 1800 of the computing device 104, may capture audible sounds from a user, such as a word, a phrase or sentence that a user is saying. For example, the user may speak out loud a word that is represented by the visual cue 124, such as “hi”, and the microphone may capture an audio stream that includes the “hi” spoken by the user and store it as a sound file in the computing device 104 for further processing as described elsewhere herein.
- the user may speak out loud a word that is represented by a picture cue illustrating an object, such as a dog, and the microphone may capture the word “dog” spoken by the user and store it as a sound file in the computing device 104 for further processing as described elsewhere herein.
- a virtual character 122 may be displayed on the screen and may provide virtual interactions with a user.
- the virtual character 122 may execute a routine that includes playing a sound file to ask the user through the speaker of the computing device 104 to speak or sound out the letters of the visual cue 124 while the microphone captures the sounds the user makes.
- the virtual character 122 may point or gesture as shown in Figure 2 in order to draw the user’s attention or assist the user as the user sounds out the word represented by the visual cue 124.
- the application mimics how a user would interact with a person as they read with the virtual character 122.
- the virtual character 122 and/or a virtual graphical objects 126 may perform animations or other routines when the correct word is pronounced based on the word represented by the visual cue 124. These animations or other routines may provide reinforcement to the user when the correct word is pronounced in order to mimic a reading session with another person or teacher.
- the activity application 214 of the computing device 104 may identify that a word is correctly pronounced by comparing the sound file captured by the microphone 1800 to expected sound models for pronouncing the word represented by the visual cue 124 and matching portions of the sound file to portions of the expected models to satisfy a matching threshold, thereby determining that the correct word is being pronounced.
- the activity application(s) 214 may identify one or more tangible interface objects 120 or routines from the virtual scene and retrieve expected sound models for words that will be presented on the tangible interface objects 120 or routines from the virtual scene. By limiting the amount of sound models being compared, the activity application(s) 214 can quickly determine if there is a match between the sound file and the expected sound models.
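- For purposes of illustration only, the narrowing step described above might be sketched as follows; the vocabulary table, book title, and helper name are assumptions rather than part of the disclosure, and a production system could load the expected sound models from the storage 310 or a server 202 described elsewhere herein.

```python
# Illustrative only: the vocabulary table and helper name are assumptions.
# Keeping the candidate set small is what lets the comparison run in
# substantially real time.
PAGE_VOCABULARY = {
    ("The Dream Book", 3): ["here's", "a", "taste", "of", "dream", "magic"],
}


def expected_models_for(identified_object, page_number):
    """Return only the expected sound models for the identified object and page."""
    return PAGE_VOCABULARY.get((identified_object, page_number), [])
```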
- the system 100 may display prompts or feedback in a portion of the display screen 112 where the visual cue 124 is expected to appear.
- the animated character 122 refers to “dream magic” and a dream cloud icon 131 appears on the portion of the display screen 112 where the visual cue 124 may appear.
- the dream cloud icon may exhibit various animations when a word represented by the visual cue 124 is correctly or incorrectly pronounced by the user and captured by the microphone 1800 of the computing device 104.
- the system 100 may cause a speaker of the computing device 104 to provide hints or guidance, such as having the speaker play a recorded sound file that says phrases, such as “that looks tricky” or “try sounding it out” in order to guide and/or encourage the user.
- the activity application 214 may provide feedback in substantially real-time as the user correctly or incorrectly pronounces the phrase or word depicted by the visual cue 124.
- the activity application 214 can focus on portions of the visual cue 124 that the user is sounding out and emphasize the sound or portion of the visual cue 124 that the user needs additional help with, similar to how a teacher would assist a user as they read with the teacher.
- the activity application 214 may have different routines to assist the user based on how many errors have occurred. For example, after a first error, the activity application 214 may cause a portion of the display 112 to show an animation that indicates the user incorrectly pronounced something and encourages them to try again. In another example embodiment, as shown in Figure 6, after a second or subsequent error, the activity application 214 may identify a portion of the visual cue 124 that is being incorrectly pronounced and provide a pronunciation indicator 602 in order to highlight the portion that needs to be corrected or focused on during pronunciation.
- the activity application 214 may break down the word “grape” and have the user focus on the vowels and the sound the vowels make by using the pronunciation indicator 602 to highlight the vowel.
- a sound file may be used to provide feedback and play the vowel sound through the speaker in order to assist the user in recognizing what the vowel sound should sound like in the word “grape”.
- the feedback may indicate that the letter “a” in “grape” makes the long “aye” sound.
- the activity application 214 may instruct the user to sound out each letter in the visual cue 124 and make the sound for the letter “a” as instructed.
- the pronunciation indicator 602 may move from letter to letter as the user utters the portions of the word represented by the visual cue 124 and the activity application 214 may process the captured sound files as the user utters the portions of the word. This allows the activity application 214 to provide feedback in substantially real-time while reinforcing correct pronunciation and assisting in correcting improper pronunciation.
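- For purposes of illustration only, the letter-by-letter behavior of the pronunciation indicator 602 can be pictured as a small position tracker, sketched below; the per-letter sound labels and the class name are hypothetical, as the disclosure does not prescribe a particular data structure.

```python
# Minimal sketch, not the patented implementation: per-letter sound labels
# ("g", "r", "ay", ...) and the class itself are invented for illustration.
class PronunciationIndicator:
    def __init__(self, word, expected_sounds):
        self.word = word
        self.expected_sounds = expected_sounds  # one entry per letter
        self.position = 0                       # letter currently highlighted

    def current_letter(self):
        return self.word[self.position] if self.position < len(self.word) else None

    def advance_if_matched(self, uttered_sound):
        """Move the highlight forward only when the captured sound matches."""
        if self.position < len(self.word) and uttered_sound == self.expected_sounds[self.position]:
            self.position += 1
            return True
        return False  # stay on the same letter so feedback can focus here


indicator = PronunciationIndicator("grape", ["g", "r", "ay", "p", ""])
indicator.advance_if_matched("g")
print(indicator.current_letter())  # highlight now sits on "r"
```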
- Figures 7A-7C are example configurations for instructing a user to place a tangible interface object 120 (not shown) in front of the computing device 104.
- an animation may appear on the display screen 112 of the computing device 104 that depicts a book title representing a book 702 as a tangible interface object 120 for the user to place in front of the system 100.
- Another animation in Figure 7B illustrates to the user on the display screen 112 of the computing device 104 how the book 702 should be placed on the physical activity scene 116 in order for the book 702 to be within the field of view of the video capture device 110.
- Another animation in Figure 7C may then show the animated character 122 looking and/or pointing down into the space on the physical activity scene 116 where the user may place tangible interface object(s) 120, such as the book 702 from Figure 7A.
- the virtual scene is able to interact with the user in substantially real-time and direct the user as needed in how to interact with the physical activity scene 116.
- the interactions between the animated character 122 and the user may be routines executed by the activity application 214, while in further implementations, the activity application 214 may employ various machine learning algorithms and over time execute various independent interactions with the user autonomously. These interactions facilitate a reading session between a user and the animated character 122.
- the animated character 122 may be configured to provide these intuitive placement cues in order to indicate to the user where to position the tangible interface objects 120 in the real world.
- the placement cues may be designed to assist even young children who can detect where and/or how the animated character 122 is making the placement cues without explicit instructions being presented on the display screen 112 for how to position the tangible interface object 120.
- These placement cues may increase the engagement of the user as they subconsciously follow the placement cues to quickly position tangible interface objects 120 correctly without explicit instructions that may break the engagement.
- the tangible interface object 120 may be a book or other instructional material that may be placed on the physical activity scene 116 in front of the computing device 104 and within the field of view of the video capture device 110 coupled to the computing device 104.
- various image markings 802 on the tangible interface object 120 may be visible to the detection engine 212 of the computing device 104 (e.g., as shown in Figure 18) that may be used to recognize the type of the tangible interface object 120. These image markings 802 may be used to determine which activities to execute by the activity application 214 for display of various routines on the display 112 of the computing device 104.
- the activity application 214 may detect a correct placement of the tangible interface object 120, such as the book, and may provide instructions or placement cues in order to correctly position the tangible interface object 120 for optimal detection of the tangible interface object 120.
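- For purposes of illustration only, such a marker lookup and placement check might be sketched as below; the marker IDs, book record, and expected-region coordinates are all hypothetical, since the disclosure only states that image markings 802 identify the tangible interface object 120 and that placement cues may be provided.

```python
# All constants here are invented for illustration; only the idea of mapping a
# detected marking to a book and nudging its placement comes from the text.
BOOK_BY_MARKER = {
    17: {"title": "The Dream Book", "activity": "guided_reading"},
}
EXPECTED_REGION = (100, 300, 540, 700)  # (x_min, y_min, x_max, y_max) in frame pixels


def identify_and_check_placement(marker_id, marker_center):
    """Return the identified book record plus an optional placement cue."""
    book = BOOK_BY_MARKER.get(marker_id)
    if book is None:
        return None, "Hmm, that book isn't recognized yet."
    x, y = marker_center
    x_min, y_min, x_max, y_max = EXPECTED_REGION
    well_placed = x_min <= x <= x_max and y_min <= y <= y_max
    cue = None if well_placed else "Try sliding the book closer to the stand."
    return book, cue
```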
- a portion of the display 112 may include a virtual depiction 902 of the physical activity scene 116.
- the virtual depiction 902 may include an image of the physical activity scene 116 that has been captured by the video capture device 110 and trimmed to fit within the portion of the display 112.
- the virtual depiction 902 may be a virtualization of the tangible interface object 120, such as a top portion of a book showing only a portion of the book title.
- the detected pronunciation of those words appears on the display 112 as a visual cue 124.
- the words may slide up from a side of the display 112 as the utterances of those words are captured and recognized. If the words are correct, they may be highlighted in one color or indicator, and if they are incorrect, they may be highlighted in another color or indicator, and/or crossed out to indicate an error as the user continues to read from the tangible interface object 120.
- the detection engine 212 can detect if the input from the video capture device 110 has changed, such as if the tangible interface object 120 is jostled or shifted and may either update the calibration profile for the change and/or provide instructions 1102 to the user to correct an alignment of the tangible interface object 120.
- the activity application 214 detects when a user finishes reading words on a page of the tangible interface object 120 and may provide instructions 1102 or guide the user to turn the page or perform another interaction with the tangible interface object 120.
- the activity application 214 can detect a speed and/or cadence of the words being spoken by a user.
- the activity application 214 may create a profile of the user and over time learn the expected speed and/or cadence of the user. Using this profile, the activity application 214 can detect anomalies, such as a pause or hesitation that may indicate that the user is confused and/or struggling with a pronunciation of a word, and provide assistance based on that anomaly.
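- For purposes of illustration only, the cadence profile and hesitation check could be approximated as below; the running-average smoothing and the 3x-average-gap hesitation rule are assumptions, as the disclosure does not specify how the profile is learned.

```python
# Sketch only: assumes each recognized word arrives with a timestamp in seconds.
# The smoothing factor and the 3x-average hesitation rule are assumptions.
def update_profile_and_flag_pauses(word_times, avg_gap=None, hesitation_factor=3.0):
    """Return (updated average word gap, indexes of words preceded by an unusual pause)."""
    anomalies = []
    for i in range(1, len(word_times)):
        gap = word_times[i] - word_times[i - 1]
        if avg_gap is None:
            avg_gap = gap
        elif gap > hesitation_factor * avg_gap:
            anomalies.append(i)                   # likely hesitation: offer help here
        else:
            avg_gap = 0.9 * avg_gap + 0.1 * gap   # slowly learn the reader's pace
    return avg_gap, anomalies


gap, pauses = update_profile_and_flag_pauses([0.0, 0.5, 1.1, 1.6, 7.0])
print(pauses)  # the long gap before the last word is flagged
```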
- the activity application 214 may detect that a user has struggled with and/or incorrectly pronounced a word. That word may be highlighted as an incorrect word 140 on the display 112 of the computing device 104 as shown in Figure 12A.
- the incorrect word 140 may include a visual indicator to highlight to the user that the word was incorrect.
- the activity application 214 may highlight the incorrect word 140 without interrupting the user and allow the user to continue reading the words on the page of the tangible interface object 120.
- the activity application 214 may cause the incorrect word 140 to be the focus of a specific lesson as shown in Figure 12B.
- the incorrect word 140 may be reviewed and the activity application 214 may provide assistance in sounding out and/or pronouncing the incorrect word 140.
- the incorrect word 140 may then be practiced on the page again as shown in Figure 12A.
- the activity application 214 may trigger a routine associated with mini games and/or instructional moments to teach the sounding out and/or pronouncing the incorrect word 140.
- the animated character 122 may provide supplementary context of what various words mean and/or pronunciations in order to mimic an interaction with a teacher during a reading session.
- a request for feedback or ratings can be provided, such as by using a rating icon 160 on the display 112 of the computing device 104. The rating icon 160 may appear once the activity application 214 detects that the book has been completely read or the activity is completed.
- the rating icon 160 may be linked to a profile of a user and as personal ratings are gathered from specific users, a profile of likes and/or interests may be created.
- the profile may be used to determine difficulty levels associated with reading and the rating icon 160 response may be incorporated into determining future difficulty levels for reading recommendations. Whether the activity was too easy or too difficult for the user to complete may be incorporated into future reading recommendations in order to provide content that will engage the user without overwhelming or underwhelming them.
- various animations may be presented on the display 112 of the computing device 104 in order to engage with users as they participate in the activity, such as reading a page of the book or completing a last page of the book.
- a light animation 1502 may appear near the virtual character 122 on the display 112.
- the light animation 1502 may grow and fill at least a portion of the display 112.
- the light animation 1502 may appear to fill and light a virtual element 1504, such as a lamp or light in the virtual scene and cause the scene to be brighter.
- the effects such as the virtual element 1504 and/or brightness of the scene may persist through future activities within the virtual scene. This action of completing a task and causing an animation that turns into an effect that persists in the activity, can increase the engagement of the user.
- a continue prompt 170 may be displayed on the display 112. This continue prompt 170 may appear after an activity/task/book has been completed in order to determine if the user wants to continue with another activity/task/book/etc.
- a user can respond to the continue prompt 170 by selecting an icon displayed on the display 112.
- the user may speak an audible command, such as a “yes” or “no” and the activity application 214 may determine the next action based on a speech recognition of the audible command.
- additional audio commands may be received from the user, similar to how a user would interact with a teacher during a reading session. This allows the user to stay immersed in the experience without having to manage menus or other functionality.
- the physical activity scene 116 may be integrated with a stand 106 that supports the computing device 104 or may be distinct from the stand 106 but placeable adjacent to the stand 106.
- the size of the interactive area on the physical activity scene 116 may be bounded by the field of view of the video capture device 110 (not shown) and can be adapted by an adapter 108 and/or by adjusting the position of the video capture device 110.
- the boundary and/or other indicator may be a light projection (e.g., pattern, context, shapes, etc.) projected onto the physical activity scene 116.
- the computing device 104 included in the example configuration 100 may be situated on the scene or otherwise proximate to the scene.
- the computing device 104 can provide the user(s) with a virtual portal for displaying the virtual scene.
- the computing device 104 may be placed on a table in front of a user so the user can easily see the computing device 104 while interacting with the tangible interface object 120 on the physical activity scene 116.
- Example computing devices 104 may include, but are not limited to, mobile phones (e.g., feature phones, smart phones, etc.), tablets, laptops, desktops, netbooks, TVs, set-top boxes, media streaming devices, portable media players, navigation devices, personal digital assistants, etc.
- the computing device 104 includes or is otherwise coupled (e.g., via a wireless or wired connection) to a video capture device 110 (also referred to herein as a camera) for capturing a video stream of the physical activity scene.
- the video capture device 110 may be a front-facing camera that is equipped with an adapter 108 that adapts the field of view of the camera 110 to include, at least in part, the physical activity scene 116.
- the portion of the physical activity scene 116 captured by the video capture device 110 is also interchangeably referred to herein simply as the activity scene in some implementations.
- the computing device 104 and/or the video capture device 110 may be positioned and/or supported by a stand 106.
- the stand 106 may position the display 112 of the computing device 104 in a position that is optimal for viewing and interaction by the user who may be simultaneously positioning the tangible interface object 120 (e.g., shown in Figure 8) and/or interacting with the physical environment.
- the stand 106 may be configured to rest on the activity scene (e.g., table, desk, etc.) and receive and sturdily hold the computing device 104 so the computing device 104 remains still during use.
- the tangible interface object 120 may be used with a computing device 104 that is not positioned in a stand 106 and/or using an adapter 108.
- the user may position and/or hold the computing device 104 such that a front facing camera or a rear facing camera may capture the tangible interface object 120 and then a virtual scene may be presented on the display 112 of the computing device 104 based on the capture of the tangible interface object 120.
- the adapter 108 adapts a video capture device 110 (e.g., front-facing, rear-facing camera) of the computing device 104 to capture substantially only the physical activity scene 116, although numerous further implementations are also possible and contemplated.
- the adapter 108 can split the field of view of the front-facing camera into two scenes.
- the video capture device 110 captures a physical activity scene that includes a portion of the activity scene and is able to capture a tangible interface object 120 in either portion of the physical activity scene.
- the adapter 108 can redirect a rear-facing camera (not shown) of the computing device 104 toward a front-side of the computing device 104 to capture the physical activity scene of the activity scene located in front of the computing device 104.
- the adapter 108 can define one or more sides of the scene being captured (e.g., top, left, right, with bottom open).
- the adapter 108 can split the field of view of the front facing camera to capture both the physical activity scene and the view of the user interacting with the tangible interface object 120.
- the adapter 108 and the stand 106 for a computing device 104 may include a slot for retaining (e.g., receiving, securing, gripping, etc.) an edge of the computing device 104 to cover at least a portion of the video capture device 110.
- the adapter 108 may include at least one optical element (e.g., a mirror) to direct the field of view of the video capture device 110 toward the activity scene.
- the computing device 104 may be placed in and received by a compatibly sized slot formed in a top side of the stand 106.
- the slot may extend at least partially downward into a main body of the stand 106 at an angle so that when the computing device 104 is secured in the slot, it is angled back for convenient viewing and utilization by its user or users.
- the stand 106 may include a channel formed perpendicular to and intersecting with the slot.
- the channel may be configured to receive and secure the adapter 108 when not in use.
- the adapter 108 may have a tapered shape that is compatible with and configured to be easily placeable in the channel of the stand 106.
- the channel may magnetically secure the adapter 108 in place to prevent the adapter 108 from being easily jarred out of the channel.
- the stand 106 may be elongated along a horizontal axis to prevent the computing device 104 from tipping over when resting on a substantially horizontal activity scene (e.g., a table).
- the stand 106 may include channeling for a cable that plugs into the computing device 104.
- the cable may be configured to provide power to the computing device 104 and/or may serve as a communication link to other computing devices, such as a laptop or other personal computer.
- the adapter 108 may include one or more optical elements, such as mirrors and/or lenses, to adapt the standard field of view of the video capture device 110.
- the adapter 108 may include one or more mirrors and lenses to redirect and/or modify the light being reflected from the activity scene into the video capture device 110.
- the adapter 108 may include a mirror angled to redirect the light reflected from the activity scene in front of the computing device 104 into a front-facing camera of the computing device 104.
- many wireless handheld devices include a front-facing camera with a fixed line of sight with respect to the display of the computing device 104.
- the adapter 108 can be detachably connected to the device over the video capture device 110 to augment the line of sight of the video capture device 110 so it can capture the activity scene (e.g., surface of a table, etc.).
- the mirrors and/or lenses in some implementations can be polished or laser quality glass.
- the mirrors and/or lenses may include a first surface that is a reflective element.
- the first surface can be a coating/thin film capable of redirecting light without having to pass through the glass of a mirror and/or lens.
- a first surface of the mirrors and/or lenses may be a coating/thin film and a second surface may be a reflective element.
- the light passes through the coating twice; however, since the coating is extremely thin relative to the glass, the distortive effect is reduced in comparison to a conventional mirror. This mirror reduces the distortive effect of a conventional mirror in a cost-effective way.
- the adapter 108 may include a series of optical elements (e.g., mirrors) that wrap light reflected off of the activity surface located in front of the computing device 104 into a rear-facing camera of the computing device 104 so it can be captured.
- the adapter 108 could also adapt a portion of the field of view of the video capture device 110 (e.g., the front-facing camera) and leave a remaining portion of the field of view unaltered so that multiple scenes may be captured by the video capture device 110.
- the adapter 108 could also include optical element(s) that are configured to provide different effects, such as enabling the video capture device 110 to capture a greater portion of the activity scene.
- the adapter 108 may include a convex mirror that provides a fisheye effect to capture a larger portion of the activity scene than would otherwise be capturable by a standard configuration of the video capture device 110.
- the video capture device 110 could, in some implementations, be an independent unit that is distinct from the computing device 104 and may be positionable to capture the activity scene or may be adapted by the adapter 108 to capture the activity scene as discussed above. In these implementations, the video capture device 110 may be communicatively coupled via a wired or wireless connection to the computing device 104 to provide it with the video stream being captured.
- Figure 17 is a block diagram illustrating an example computer system 200 for audible textual virtualization. The illustrated system 200 includes computing devices 104a...104n and servers 202a...202n coupled via a network 206.
- the computing devices 104a... 104n may be respectively coupled to the network 206 via signal lines 208a...208n and may be accessed by users.
- the servers 202a...202n may be coupled to the network 206 via signal lines 204a...204n, respectively.
- the use of the nomenclature “a” and “n” in the reference numbers indicates that any number of those elements having that nomenclature may be included in the system 200 or other figures.
- the network 206 may include any number of networks and/or network types.
- the network 206 may include, but is not limited to, one or more local area networks (LANs), wide area networks (WANs) (e.g., the Internet), virtual private networks (VPNs), mobile (cellular) networks, wireless wide area network (WWANs), WiMAX® networks, Bluetooth® communication networks, peer-to-peer networks, other interconnected data paths across which multiple devices may communicate, various combinations thereof, etc.
- the computing devices 104a... 104n are computing devices having data processing and communication capabilities.
- a computing device 104 may include a processor (e.g., virtual, physical, etc.), a memory, a power source, a network interface, and/or other software and/or hardware components, such as front and/or rear facing cameras, display, graphics processor, wireless transceivers, keyboard, camera, sensors, firmware, operating systems, drivers, various physical connection interfaces (e.g., USB, HDMI, etc.).
- the computing devices 104a... 104n may couple to and communicate with one another and the other entities of the system 200 via the network 206 using a wireless and/or wired connection. While two or more computing devices 104 are depicted in Figure 17, the system 200 may include any number of computing devices 104. In addition, the computing devices 104a... 104n may be the same or different types of computing devices.
- one or more of the computing devices 104a... 104n may include a camera or video capture device 110, a detection engine 212, and activity application(s) 214.
- One or more of the computing devices 104 and/or cameras 110 may also be equipped with an adapter 108 as discussed elsewhere herein.
- the detection engine 212 is capable of detecting and/or recognizing audible sounds and one or more tangible interface object(s) 120.
- the detection engine 212 can detect the position and orientation of each of the tangible interface object(s) 120, detect how the tangible interface object 120 is being manipulated by the user, and cooperate with the activity application(s) 214 to provide users with a rich virtual experience by detecting the tangible interface object 120 and audible sounds from the user and generating a virtualization in the virtual scene.
- the detection engine 212 processes video captured by a camera 110 to detect visual markers and/or other identifying elements or characteristics to identify the tangible interface object(s) 120. Additional structure and functionality of the computing devices 104 are described in further detail below with reference to at least Figure 18.
- the servers 202 may each include one or more computing devices having data processing, storing, and communication capabilities.
- the servers 202 may include one or more hardware servers, server arrays, storage devices and/or systems, etc., and/or may be centralized or distributed/cloud-based.
- the servers 202 may include one or more virtual servers, which operate in a host server environment and access the physical hardware of the host server including, for example, a processor, memory, storage, network interfaces, etc., via an abstraction layer (e.g., a virtual machine manager).
- the servers 202 may include software applications operable by one or more computer processors of the servers 202 to provide various computing functionalities, services, and/or resources, and to send data to and receive data from the computing devices 104.
- the software applications may provide functionality for internet searching; social networking; web-based email; blogging; micro-blogging; photo management; video, music and multimedia hosting, distribution, and sharing; business services; news and media distribution; user account management; or any combination of the foregoing services.
- the servers 202 are not limited to providing the above-noted services and may include other network-accessible services.
- system 200 illustrated in Figure 17 is provided by way of example, and that a variety of different system environments and configurations are contemplated and are within the scope of the present disclosure. For instance, various functionality may be moved from a server to a client, or vice versa and some implementations may include additional or fewer computing devices, services, and/or networks, and may implement various functionality client or server-side. Further, various entities of the system 200 may be integrated into a single computing device or system or additional computing devices or systems, etc.
- FIG 18 is a block diagram of an example computing device 104.
- the computing device 104 may include a processor 312, memory 314, communication unit 316, display 320, camera 110, and an input device 318, which are communicatively coupled by a communications bus 308.
- the computing device 104 is not limited to such and may include other elements, including, for example, those discussed with reference to the computing devices 104 in the Figures.
- the processor 312 may execute software instructions by performing various input/output, logical, and/or mathematical operations.
- the processor 312 has various computing architectures to process data signals including, for example, a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, and/or an architecture implementing a combination of instruction sets.
- the processor 312 may be physical and/or virtual, and may include a single core or plurality of processing units and/or cores.
- the memory 314 is a non-transitory computer-readable medium that is configured to store and provide access to data to the other elements of the computing device 104.
- the memory 314 may store instructions and/or data that may be executed by the processor 312.
- the memory 314 may store the detection engine 212, the activity application(s) 214, and the camera driver 306.
- the memory 314 is also capable of storing other instructions and data, including, for example, an operating system, hardware drivers, other software applications, data, etc.
- the memory 314 may be coupled to the bus 308 for communication with the processor 312 and the other elements of the computing device 104.
- the communication unit 316 may include one or more interface devices (I/F) for wired and/or wireless connectivity with the network 206 and/or other devices.
- the communication unit 316 may include transceivers for sending and receiving wireless signals.
- the communication unit 316 may include radio transceivers for communication with the network 206 and for communication with nearby devices using closeproximity (e.g., Bluetooth®, NFC, etc.) connectivity.
- the communication unit 316 may include ports for wired connectivity with other devices.
- the communication unit 316 may include a CAT-5 interface, Thunderbolt™ interface, FireWire™ interface, USB interface, etc.
- the display 320 may display electronic images and data output by the computing device 104 for presentation to a user.
- the display 320 may include any conventional display device, monitor or screen, including, for example, an organic light-emitting diode (OLED) display, a liquid crystal display (LCD), etc.
- the display 320 may be a touch-screen display capable of receiving input from one or more fingers of a user.
- the display 320 may be a capacitive touch-screen display capable of detecting and interpreting multiple points of contact with the display scene.
- the computing device 104 may include a graphics adapter (not shown) for rendering and outputting the images and data for presentation on display 320.
- the graphics adapter may be a separate processing device including a separate processor and memory (not shown) or may be integrated with the processor 312 and memory 314.
- the input device 318 may include any device for inputting information into the computing device 104.
- the input device 318 may include one or more peripheral devices.
- the input device 318 may include a keyboard (e.g., a QWERTY keyboard), a pointing device (e.g., a mouse or touchpad), microphone 1800, a camera, etc.
- the input device 318 may include a touch-screen display capable of receiving input from the one or more fingers of the user 130.
- the functionality of the input device 318 and the display 320 may be integrated, and a user of the computing device 104 may interact with the computing device 104 by contacting a surface of the display 320 using one or more fingers.
- the user 130 could interact with an emulated (i.e., virtual or soft) keyboard displayed on the touch-screen display 320 by using fingers to contact the display 320 in the keyboard regions.
- the detection engine 212 may include a calibrator 302 and a detector 304.
- the elements 212 and 214 may be communicatively coupled by the bus 308 and/or the processor 312 to one another and/or the other elements 306, 310, 314, 316, 318, 320, and/or 110 of the computing device 104.
- one or more of the elements 212 and 214 are sets of instructions executable by the processor 312 to provide their functionality.
- one or more of the elements 212 and 214 are stored in the memory 314 of the computing device 104 and are accessible and executable by the processor 312 to provide their functionality. In any of the foregoing implementations, these components 212 and 214 may be adapted for cooperation and communication with the processor 312 and other elements of the computing device 104.
- the calibrator 302 includes software and/or logic for processing the video stream captured by the camera 110 to detect a continuous presence of one or more tangible interface object(s) 120. For example, the calibrator 302 can detect if the input from the video capture device 110 has changed, such as if the tangible interface object 120 is jostled or shifted and may either update the calibration profile for the change or provide instructions to the user to correct an alignment of the tangible interface object 120. In some implementations, the calibrator 302 may detect that a page of the tangible interface object 120, such as a book has been skipped by the user by mistake and provide instructions or guide the user to turn the page to the correct one in sequence for uninterrupted reading. The calibrator 302 may also detect that the user is pausing (e.g., taking more than 5 seconds) or hesitating to turn to the next page and prompt the user to turn to the next page.
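- For purposes of illustration only, the calibrator's change check might be approximated by a frame difference against the calibration reference, sketched below; the pixel-delta and changed-fraction thresholds are illustrative, and the disclosure does not tie the calibrator 302 to any particular algorithm.

```python
# Rough sketch with OpenCV; assumes both frames are same-size grayscale images.
# Threshold values are placeholders, not values from the disclosure.
import cv2
import numpy as np


def scene_has_shifted(reference_frame, current_frame, pixel_delta=25, changed_fraction=0.05):
    """True when enough pixels differ from the calibration reference frame."""
    diff = cv2.absdiff(reference_frame, current_frame)
    changed = np.count_nonzero(diff > pixel_delta)
    return changed / diff.size > changed_fraction
```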
- the detector 304 includes software and/or logic for processing the video stream captured by the camera 110 to detect and/or identify one or more tangible interface object(s) 120 included in the video stream and processing the audio stream captured by the microphone 1800 to detect and/or identify one or more audible sounds.
- the detector 304 may be coupled to and receive the video stream from the camera 110, the camera driver 306, and/or the memory 314.
- the detector 304 may process the images of the video stream to determine positional information for the line segments related to the tangible interface object(s) 120 and then analyze characteristics of the line segments included in the video stream to determine the identities and/or additional attributes of the line segments.
- the detector 304 may use visual characteristics to recognize custom designed portions of the physical activity scene 116, such as corners or edges, etc.
- the detector 304 may perform a straight line detection algorithm and a rigid transformation to account for distortion and/or bends on the physical activity scene 116.
- the detector 304 may match features of detected line segments to a reference object that may include a depiction of the individual components of the reference object in order to determine the line segments and/or the boundary of the expected objects in the physical activity scene 116.
- the detector 304 may account for gaps and/or holes in the detected line segments and/or contours and may be configured to generate a mask to fill in the gaps and/or holes.
- the detector 304 may recognize the line by identifying its contours. The detector 304 may also identify various attributes of the line, such as colors, contrasting colors, depth, texture, etc. In some implementations, the detector 304 may use the description of the line and the line attributes to identify a tangible interface object 120 by comparing the description and attributes to a database of virtual objects and identifying the closest matches by comparing recognized tangible interface object(s) 120 to reference components of the virtual objects. In some implementations, the detector 304 may incorporate machine learning algorithms to add additional virtual objects to a database of virtual objects as new shapes or sounds are identified.
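- For purposes of illustration only, the straight-line portion of this processing might be approximated with standard edge-detection and Hough-transform routines as below; the parameter values are placeholders and the disclosure does not commit the detector 304 to these specific operations.

```python
# Hedged sketch using OpenCV primitives; parameters are placeholders.
import math

import cv2


def detect_line_segments(frame_bgr):
    """Return (x1, y1, x2, y2) line segments found in a BGR camera frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    segments = cv2.HoughLinesP(edges, rho=1, theta=math.pi / 180,
                               threshold=80, minLineLength=40, maxLineGap=10)
    # Downstream code can match these segments against reference outlines of
    # known tangible interface objects, filling small gaps as described above.
    return [] if segments is None else [tuple(s[0]) for s in segments]
```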
- the detector 304 may be coupled to the storage 310 via the bus 308 to store, retrieve, and otherwise manipulate data stored therein. For example, the detector 304 may query the storage 310 for data matching any line segments that it has determined are present in the physical activity scene 116. In all of the above descriptions, the detector 304 may send the detected images to the detection engine 212 and the detection engine 212 may perform the above described features.
- the detector 304 may be able to process the video stream to detect a manipulation of the tangible interface object 120.
- the detector 304 may be configured to understand relational aspects between a tangible interface object 120 and determine an interaction based on the relational aspects.
- the detector 304 may be configured to identify an interaction related to one or more tangible interface objects present in the physical activity scene 116 and the activity application(s) 214 may determine a routine based on the relational aspects between the one or more tangible interface object(s) 120 and other elements of the physical activity scene 116.
- the activity application(s) 214 include software and/or logic for identifying one or more tangible interface object(s) 120, identifying an audible sound, and matching the audible sound to a model of expected sounds based on the tangible interface object(s) 120.
- the activity application(s) 214 may be coupled to the detector 304 via the processor 312 and/or the bus 308 to receive the information.
- the activity application(s) 214 may determine the virtual object and/or a routine by searching through a database of virtual objects and/or routines that are compatible with the identified combined position of tangible interface object(s) 120 relative to each other.
- the activity application(s) 214 may access a database of virtual objects or routines stored in the storage 310 of the computing device 104.
- the activity application(s) 214 may access a server 202 to search for virtual objects and/or routines.
- a user 130 may predefine a virtual object and/or routine to include in the database.
- the activity application(s) 214 may enhance the virtual scene as part of a routine.
- the activity application(s) 214 may display visual enhancements as part of executing the routine.
- the visual enhancements may include adding color, extra virtualizations, background scenery, incorporating the virtual object into a shape and/or character, etc.
- the visual enhancements may include having the virtual object move or interact with another virtualization (not shown) and/or the virtual character 122 in the virtual scene.
- the activity application(s) 214 may prompt the user 130 to select one or more enhancement options, such as a change to color, size, shape, etc. and the activity application(s) 214 may incorporate the selected enhancement options into the virtual object and/or the virtual scene.
- Non-limiting examples of the activity applications 214 may include video games, learning applications, assistive applications, storyboard applications, collaborative applications, productivity applications, etc.
- the camera driver 306 includes software storable in the memory 314 and operable by the processor 312 to control/operate the camera 110.
- the camera driver 306 is a software driver executable by the processor 312 for signaling the camera 110 to capture and provide a video stream and/or still image, etc.
- the camera driver 306 is capable of controlling various features of the camera 110 (e.g., flash, aperture, exposure, focal length, etc.).
- the camera driver 306 may be communicatively coupled to the camera 110 and the other components of the computing device 104 via the bus 308, and these components may interface with the camera driver 306 via the bus 308 to capture video and/or still images using the camera 110.
- the camera 110 is a video capture device configured to capture video of at least the activity scene.
- the camera 110 may be coupled to the bus 308 for communication and interaction with the other elements of the computing device 104.
- the camera 110 may include a lens for gathering and focusing light, a photo sensor including pixel regions for capturing the focused light and a processor for generating image data based on signals provided by the pixel regions.
- the photo sensor may be any type of photo sensor including a charge-coupled device (CCD), a complementary metal-oxide-semiconductor (CMOS) sensor, a hybrid CCD/CMOS device, etc.
- the camera 110 may also include any conventional features such as a flash, a zoom lens, etc.
- the camera 110 may include a microphone (not shown) for capturing sound or may be coupled to a microphone included in another component of the computing device 104 and/or coupled directly to the bus 308.
- the processor of the camera 110 may be coupled via the bus 308 to store video and/or still image data in the memory 314 and/or provide the video and/or still image data to other elements of the computing device 104, such as the detection engine 212 and/or activity application(s) 214.
- the storage 310 is an information source for storing and providing access to stored data, such as a database of virtual objects, virtual prompts, routines, and/or virtual elements, gallery(ies) of virtual objects that may be displayed on the display 320, user profile information, community developed virtual routines, virtual enhancements, etc., object data, calibration data, and/or any other information generated, stored, and/or retrieved by the activity application(s) 214.
- the storage 310 may be included in the memory 314 or another storage device coupled to the bus 308.
- the storage 310 may be, or may be included in, a distributed data store, such as a cloud-based computing and/or data storage system.
- the storage 310 may include a database management system (DBMS).
- the DBMS could be a structured query language (SQL) DBMS.
- storage 310 may store data in an object-based data store or multi-dimensional tables comprised of rows and columns, and may manipulate, i.e., insert, query, update, and/or delete, data entries stored in the verification data store using programmatic operations (e.g., SQL queries and statements or a similar database manipulation library). Additional characteristics, structure, acts, and functionality of the storage 310 are discussed elsewhere herein.
- FIG. 19 depicts a flowchart of an example method for virtualization of speech recognition by capturing audible sound and matching it to sound models.
- the video capture device 110 may capture a video stream of a physical activity scene 116 including a tangible interface object 120.
- the computing device 104 may use an input device, such as a microphone 1800, to capture a sound stream.
- the sound stream may include sounds of a user audibly pronouncing words, such as words printed on the tangible interface object 120 (e.g., a book) or appearing in the virtual scene.
- the activity application 214 may determine an identity of the tangible interface object 120, such as by matching a vision marker or other visible reference point to a database of markers or reference points and identifying the tangible interface object 120.
- the activity application 214 may compare the captured sound stream to an expected sound model based on the identity of the tangible interface object 120. By identifying an expected sound model for comparison, the processing time for comparing the sound stream to the expected sound model can be reduced so that the comparison occurs in substantially real time, because the comparison is limited to the sound stream and the expected sound models. For example, as a user reads words from a book, the expected sound model would include both the pronunciation of the words and common or expected variations and errors in pronunciation. Over time, machine learning algorithms can supplement and expand the expected sound models to capture additional sounds that are occurring.
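- As a minimal illustrative sketch (not the claimed implementation), the thresholded comparison against a narrowed set of expected sound models could be structured as below. The EXPECTED_MODELS table, the phoneme strings, the MATCH_THRESHOLD value, and the use of a simple sequence-similarity score are all assumptions added for illustration; the disclosure does not prescribe a particular matching algorithm.

```python
# Sketch: compare a recognized pronunciation against the narrowed set of
# expected sound models for the identified book page. All names and values
# here are illustrative assumptions, not the disclosed implementation.
from difflib import SequenceMatcher

# Hypothetical per-page database: word -> expected phoneme strings,
# including common or expected variations and errors.
EXPECTED_MODELS = {
    "taste": ["T EY S T", "T EY S"],     # canonical + a common truncation
    "here's": ["HH IH R Z", "HH IY R Z"],
}

MATCH_THRESHOLD = 0.8  # assumed tunable threshold


def match_pronunciation(word: str, heard_phonemes: str) -> bool:
    """Return True if the heard phonemes match any expected model for the word."""
    models = EXPECTED_MODELS.get(word.lower().strip(".,!?"), [])
    best = max(
        (SequenceMatcher(None, heard_phonemes.split(), m.split()).ratio() for m in models),
        default=0.0,
    )
    return best >= MATCH_THRESHOLD


if __name__ == "__main__":
    # e.g. the recognizer heard "T EY S T" while the reader was on "taste"
    print(match_pronunciation("taste", "T EY S T"))   # True  -> correct visual cue
    print(match_pronunciation("taste", "T AE S T"))   # False -> corrective cue
```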
- the display 112 can display a visual cue based on the comparison. The visual cue may be a correct word pronunciation or a tutorial on how to pronounce a specific sound that was identified as being incorrect from the comparison.
- Figure 20 is an example implementation showing a user reading the words on the tangible interface object 120 and pointing with a pointing object 2002.
- the activity application 214 may track the location of the tip of the pointing object 2002 as the user reads the words and provide additional feedback in the virtual scene for those words. For example, as the user reads “Here’s a taste” and points to each of the words, the virtual scene can display those words as visual cues 124 and include correct or incorrect highlighting, or another indication of correctness, as the user reads.
- the pointing object 2002 allows the user to point to each word as it’s being read, similar to how a reading session is conducted with a teacher, and the activity application 214 can track each word.
- the pointing object 2002 can be used to call out specific words.
- the activity application 214 can present a prompt, such as “point to the word ‘taste’”, that is either displayed or played as a sound file, and the user can then use the pointing object 2002 to point to the correct word.
- the activity application 214 can then identify the location of the tip of the pointing object and determine if the location is correct for the prompt.
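- One way to realize this check, sketched under assumed data structures, is to compare the tracked tip location against per-word bounding boxes detected on the page. The WordBox type, coordinates, and words below are hypothetical.

```python
# Sketch (assumed geometry, not the actual detection code): check whether the
# tracked tip of a pointing object falls inside the bounding box of the word
# named in a prompt such as "point to the word 'taste'".
from dataclasses import dataclass


@dataclass
class WordBox:
    word: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def pointed_word(tip_xy, page_words):
    """Return the word whose box contains the pointer tip, or None."""
    x, y = tip_xy
    for box in page_words:
        if box.contains(x, y):
            return box.word
    return None


if __name__ == "__main__":
    page = [WordBox("here's", 10, 40, 60, 55), WordBox("a", 65, 40, 75, 55),
            WordBox("taste", 80, 40, 130, 55)]
    tip = (95, 47)                      # hypothetical tip location from the video stream
    print(pointed_word(tip, page) == "taste")   # True -> correct response to the prompt
```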
- the pointing object 2002 can also help the user to call out specific words for additional assistance. For example, if the user struggles to pronounce the word “taste”, the user can point to or tap on the word “taste” on the tangible interface object 120 and the virtual scene can display a prompt or sound out the word “taste” in order to assist the user in learning the word.
- This interaction mimics how a teacher can interact with a user during a reading session. It should be understood that pointing is not limited to the pointing object 2002 and that the user can point with a finger or other tangible interface object 120 as well to stay immersed in the experience.
- at its core, the system 100 serves as a reading companion.
- Speech recognition is a component of this system: as the user speaks, the activity application 214 and/or the detection engine 212 may match the speech captured in the sound stream to expected models of speech.
- the sound models may be based on words and phrases in a set of books. As the user reads a page from a book, the expected models are matched against the speech that is being captured and compared.
- a user may be grouped into a difficulty level associated with reading when an initial assessment is performed.
- the activity application 214 may capture the speech as the user interacts with tangible interface objects 120 and/or the virtual scene and may identify, based on errors or correct pronunciations, what difficulty level the user should initially be grouped into. For example, the assessment may identify that the user is struggling with the “long a” sound and may have the user speak the “long a” sound out loud and test to see how successful they are with it. If the user is making common mistakes at certain difficulty levels, then instruction and basic intervention are all that is required to correct that error. At higher difficulty levels, if simple mistakes are occurring, the intervention may occur in the moment or after a page has been completed in order to correct and reinforce the lower-level skills that may need attention.
- a profile for the skill level of the user may be built and grown. For example, users may be classified into a great group, a good group, and a poor group from the initial assessment, and reading material may be curated based on the group into which they are added. This allows the activity application 214 to provide specifically curated material in a way that creates a good experience for the user, instead of material that is above or below their level and would create a bad experience during the reading.
- the amount of feedback may be determined based on the skill level profile of the user. For example, a poor reader may be offered less correction in the moment and instead be provided with regular feedback at the end. By comparison, for a reader at a good skill level, if a consistent mistake is being made that is not expected at that skill level, then immediate intervention may be performed in order to correct that mistake.
- the skill level the user is assigned may be factored in along with the focus of the lesson in order to determine which errors are corrected and which errors are ignored as the user reads the words of the book.
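- The kind of policy described above could, for example, be expressed as a small decision function. The skill levels, error tags, and rules below are illustrative assumptions, not the disclosed logic.

```python
# Sketch of an error-handling policy: whether to intervene on a detected error
# depends on the user's assigned skill level and the focus of the current
# lesson. Levels, tags, and rules here are assumptions for illustration only.
def should_intervene(error_tag: str, skill_level: str, lesson_focus: str) -> str:
    """Return 'immediate', 'end_of_page', or 'ignore' for a detected error."""
    if error_tag == lesson_focus:
        return "immediate"                 # always reinforce the lesson's focus
    if skill_level == "beginner":
        return "end_of_page"               # avoid interrupting a struggling reader
    if skill_level == "advanced" and error_tag in {"long_a", "sight_word"}:
        return "immediate"                 # simple mistake at a high skill level
    return "ignore"


if __name__ == "__main__":
    print(should_intervene("long_a", "advanced", "silent_e"))   # immediate
    print(should_intervene("long_a", "beginner", "silent_e"))   # end_of_page
```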
- a user profile may be created.
- the user profile may, in some instances, initially gather information about the user, such as information provided by a parent or teacher, etc.
- the user profile may include information such as an age, grade, skill level, supervisor (parent, teacher, etc.) judgement of the user’s skill level, a reading of a recent book completed by the user, etc.
- the user profile may factor in any speech issues, such as a lisp, stutter, shyness, accent, etc.
- the user profile may identify these speech issues over time based on the pronunciations provided by the user.
- the user profile may use various machine learning algorithms in order to train the models, identify the speech issues, and adapt over time so that the system continues to understand the user as various speech issues are modeled and identified.
- the user profile may track user performance data on a timeline. For example, the user profile may track when the user is having a rough day with reading, what days of the week the user does well with reading, a last time a new word was encountered while reading, etc.
- phonics values or fluency measures, such as correct speed, cadence, expression, etc., may be integrated into the user profile and the expected word models.
- Sight words may be used in order to characterize the profile and/or the expected word models. For example, using sight words, users may be categorized based on their reading level, which is essentially extrapolated from their skills, such as being able to recognize sight words, etc.
- the activity application 214 may categorize the user automatically based on the captured words spoken by the user during assessment. The activity application 214 may further update the categorization over time as the user progresses and either improves or struggles with various concepts.
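- A minimal sketch of such a profile and its automatic categorization, assuming a sight-word accuracy heuristic, is shown below. The field names, sample sight-word list, and cut-off percentages are illustrative assumptions.

```python
# Sketch of a user profile with sight-word based categorization, as one way to
# realize the grouping described above. All specifics are assumptions.
from dataclasses import dataclass, field

SIGHT_WORDS = {"the", "and", "said", "here", "was", "you"}  # assumed sample set


@dataclass
class ReaderProfile:
    age: int
    grade: int
    speech_notes: list = field(default_factory=list)   # e.g. ["lisp"], reported or learned
    history: list = field(default_factory=list)        # (timestamp, word, correct) tuples

    def sight_word_accuracy(self) -> float:
        attempts = [ok for _, w, ok in self.history if w in SIGHT_WORDS]
        return sum(attempts) / len(attempts) if attempts else 0.0

    def category(self) -> str:
        acc = self.sight_word_accuracy()
        if acc >= 0.9:
            return "great"
        if acc >= 0.7:
            return "good"
        return "needs_support"


if __name__ == "__main__":
    p = ReaderProfile(age=6, grade=1)
    p.history += [(1, "the", True), (2, "said", False), (3, "here", True), (4, "and", True)]
    print(p.category())   # "good" with this sample history
```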
- different forms of encouragement and feedback may be provided to different users. Some users may respond better to more immediate feedback, while other users may respond better to focusing on only the issues in the lessons. In some implementations, when specific mispronunciations are detected, those mispronunciations may either be ignored until a future lesson or result in immediate feedback to correct that mispronunciation. In some implementations, the mispronunciation can be identified in real-time and the user profile or other speech recognition algorithms can be used to predict not only what was said, but also common mispronunciations and the correct alternatives.
- the speech recognition models may be trained over time for further accuracy. Behind the scenes, the recognition models may track common errors and tag specific words with common errors or specific mispronunciations. When these speech recognition models have improved their accuracy and have been updated, they can provide further benefits in assisting the user to read by more quickly identifying different mispronunciations and identifying in the background both why the mispronunciation happened and how to provide correction that will help the user improve as they read. In some implementations, these speech recognition models may be localized, such as a US localization and a UK localization, in order to allow for different pronunciations and account for different language uses in those two regions.
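- The US/UK localization example could be realized with a locale-keyed pronunciation lexicon alongside tagged common errors, as sketched below. The phoneme strings, locale keys, and error tags are assumptions; a deployed system would derive these from its trained recognition models.

```python
# Sketch of a localized expected-model lookup with tagged common errors.
# The entries and locale keys are illustrative assumptions.
PRONUNCIATIONS = {
    "tomato": {"en-US": ["T AH M EY T OW"], "en-GB": ["T AH M AA T OW"]},
    "water":  {"en-US": ["W AO T ER"],      "en-GB": ["W AO T AH"]},
}

# Hypothetical tags of common errors gathered over time, per word.
COMMON_ERRORS = {"tomato": {"T OW M EY T OW": "long_o_first_syllable"}}


def expected_models(word: str, locale: str) -> list:
    """Return the locale's pronunciations first, then other locales' variants."""
    entry = PRONUNCIATIONS.get(word, {})
    return entry.get(locale, []) + [m for loc, ms in entry.items() if loc != locale for m in ms]


if __name__ == "__main__":
    print(expected_models("tomato", "en-GB"))                  # UK first, then US variant
    print(COMMON_ERRORS["tomato"].get("T OW M EY T OW"))       # tagged common-error label
```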
- the speech recognition models may be trained to detect emotion in the words spoken by the user. For example, the speech recognition models may detect one or more of whether a reading cadence of the user is slowing down, whether the user sounds alert and engaged, whether the user’s tone is dominant and in control during reading, etc.
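- One such signal, a slowing reading cadence, could be detected by comparing recent word timings against the reader’s baseline, as in the sketch below. The window size and slowdown factor are illustrative assumptions.

```python
# Sketch: flag that the reading cadence is slowing down by comparing the
# recent words-per-minute rate against the reader's baseline rate.
def cadence_slowing(word_times: list, baseline_wpm: float,
                    window: int = 5, slowdown: float = 0.6) -> bool:
    """word_times: timestamps (seconds) at which successive words were spoken."""
    if len(word_times) < window + 1:
        return False
    recent = word_times[-(window + 1):]
    elapsed_min = (recent[-1] - recent[0]) / 60.0
    recent_wpm = window / elapsed_min if elapsed_min > 0 else float("inf")
    return recent_wpm < slowdown * baseline_wpm


if __name__ == "__main__":
    times = [0, 0.6, 1.2, 1.8, 2.4, 5.0, 8.0, 11.0, 14.0, 17.0]
    print(cadence_slowing(times, baseline_wpm=100))  # True: the reader slowed markedly
```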
- This technology yields numerous advantages including, but not limited to, providing a low-cost alternative for developing a nearly limitless range of applications that blend both physical and digital mediums by reusing existing hardware (e.g., camera) and leveraging novel lightweight detection and recognition algorithms, having low implementation costs, being compatible with existing computing device hardware, operating in real-time to provide for a rich, real-time virtual experience, processing numerous (e.g., >15, >25, >35, etc.) tangible interface object(s) 120 and/or an interaction simultaneously without overwhelming the computing device, recognizing tangible interface object(s) 120 and/or an interaction (e.g., such as a wand interacting with the physical activity scene) with substantially perfect recall and precision (e.g., 99% and 99.5%, respectively), being capable of adapting to lighting changes and wear and imperfections in tangible interface object(s) 120, providing a collaborative tangible experience between users in disparate locations, being intuitive to setup and use even for young users (e.g., 3+ years old), being natural and intuitive to use, and
- Various implementations described herein may relate to an apparatus for performing the operations herein.
- This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer.
- a computer program may be stored in a computer readable storage medium, including, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, flash memories including USB keys with non-volatile memory, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
- the technology described herein can take the form of a hardware implementation, a software implementation, or implementations containing both hardware and software elements.
- the technology may be implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
- the technology can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
- a computer-usable or computer readable medium can be any non-transitory storage apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- a data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus.
- the memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories that provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- I/O devices (including, but not limited to, keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
- Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems, storage devices, remote printers, etc., through intervening private and/or public networks.
- Wireless (e.g., Wi-Fi™) transceivers, Ethernet adapters, and modems are just a few examples of network adapters.
- the private and public networks may have any number of configurations and/or topologies. Data may be transmitted between these devices via the networks using a variety of different communication protocols including, for example, various Internet layer, transport layer, or application layer protocols.
- data may be transmitted via the networks using transmission control protocol / Internet protocol (TCP/IP), user datagram protocol (UDP), transmission control protocol (TCP), hypertext transfer protocol (HTTP), secure hypertext transfer protocol (HTTPS), dynamic adaptive streaming over HTTP (DASH), real-time streaming protocol (RTSP), real-time transport protocol (RTP) and the real-time transport control protocol (RTCP), voice over Internet protocol (VOIP), file transfer protocol (FTP), WebSocket (WS), wireless access protocol (WAP), various messaging protocols (SMS, MMS, XMS, IMAP, SMTP, POP, WebDAV, etc.), or other known protocols.
- modules, routines, features, attributes, methodologies and other aspects of the disclosure can be implemented as software, hardware, firmware, or any combination of the foregoing.
- an element, an example of which is a module, of the specification is implemented as software, the element can be implemented as a standalone program, as part of a larger program, as a plurality of separate programs, as a statically or dynamically linked library, as a kernel loadable module, as a device driver, and/or in every and any other way known now or in the future.
- the disclosure is in no way limited to implementation in any specific programming language, or for any specific operating system or environment. Accordingly, the disclosure is intended to be illustrative, but not limiting, of the scope of the subject matter set forth in the following claims.
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Educational Technology (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Educational Administration (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Entrepreneurship & Innovation (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Various implementations for virtualization of tangible object components include a method that includes capturing, using a video capture device associated with a computing device, a video stream of a physical activity scene including a tangible interface object, capturing, using an audio capture device associated with the computing device, an audio stream of the environment around the audio capture device, the audio stream including a pronunciation of a word by a user, comparing the captured audio stream including the pronunciation of the word to an expected sound model, and displaying a visual cue on a display screen of the computing device based on the comparison.
Description
AUDIBLE TEXTUAL VIRTUALIZATION
BACKGROUND
[0001] The present disclosure relates to detection of audible speech and visualization of the recognized speech.
[0002] Speech recognition software is used to capture audible speech and identify the audible speech using a computer. Speech recognition software may be used for inputting commands and instructions, such as by telling a smart home device a command, talking into an input device to convert speech to text in a mobile device, or using speech commands instead of keyboard inputs to navigate through a menu. Current implementations of speech recognition software often have a noticeable delay after the audible command or input is spoken. Furthermore, the speech recognition software often miscategorizes or fails to recognize different speech inputs based on various factors of the speech input, such as the volume of the input, background noise, mumbles, etc.
[0003] Studies of children have found that when children read out loud to other people, pet animals, or even stuffed toys in some cases, it can help improve the children’s reading ability. However, current solutions in speech recognition software are not capable of recognizing and using the speech of a child as they speak out loud in a timely manner that would mimic allowing students to read to other people, pet animals, or stuffed toys. These current solutions have long delays and fail to recognize speech as children read out loud, making them unsuitable for reading with a child.
SUMMARY
[0004] According to one innovative aspect of the subject matter in this disclosure, a method for audible textual virtualization is described. In an example implementation, the method includes capturing, using a video capture device associated with a computing device, a video stream of a physical activity scene including a tangible interface object; capturing, using an audio capture device associated with the computing device, an audio stream of an environment around the audio capture device, the audio stream including a pronunciation of a word by a user; comparing the captured audio stream including the pronunciation of the word to an expected sound model; and displaying a visual cue on a display screen of the computing device based on the comparison.
[0005] Implementations may include one or more of the following features. The method may include determining, using a processor of the computing device, an identity of the tangible interface
object; and where, the expected sound model is based on the identity of the tangible interface object. The tangible interface object is a book and determining the identity of the tangible interface object includes determining a title of the book. The audio stream that includes the pronunciation of the word by the user is a pronunciation of the word from the book. The expected sound model is from a database of sound models associated with a specific page of the book. The visual cue is a virtual depiction of the pronunciation of the word. The visual cue further may include a highlighting effect that indicates a correctness of the pronunciation of the word. The method may include determining, using a processor of the computing device, that an expected word was missed by the user during the pronunciation of the word based on the comparison between the captured audio stream including the pronunciation of the word and the expected sound model; and displaying an additional visual cue indicating the expected word that was missed. The method may include determining, using a processor of the computing device, a correctness of the pronunciation of the word based on the comparison; and categorizing, using the processor of the computing device, the user based on the correctness. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0006] The reading session system also includes a stand configured to position a computing device having one or more processors; a video capture device configured to capture a video stream of a physical activity scene, the video stream including a tangible interface object in the physical activity scene; an audio capture device configured to capture an audio stream of an environment around the audio capture device, the audio stream including a pronunciation of a word by a user; an activity application executable by the one or more processors to compare the captured audio stream including the pronunciation of the word to an expected sound model; and a display configured to display a visual cue based on the comparison. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0007] Implementations may include one or more of the following features. The reading session system may include: a detector executable by the one or more processors to determine an identity of the tangible interface object; and where, the expected sound model is based on the identity of the tangible interface object. The tangible interface object is a book and determining the identity of the tangible interface object includes determining a title of the book. The audio stream that includes the pronunciation of the word by the user is a pronunciation of the word from the book. The expected sound model is from a database of sound models associated with a specific page of the book. The visual cue is a virtual depiction of the pronunciation of the word. The visual cue further may include a highlighting effect that indicates a correctness of the pronunciation of the word. The
activity application is further configured to determine that an expected word was missed by the user during the pronunciation of the word based on the comparison between the captured audio stream including the pronunciation of the word and the expected sound model and the display is further configured to display an additional visual cue indicating the expected word was missed. The activity application is further configured to determine a correctness of the pronunciation of the word based on the comparison and categorize the user based on the correctness. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0008] One general aspect includes a method including capturing, using a video capture device associated with a computing device, a video stream of a physical activity scene including a book with a group of visible words; capturing, using an audio capture device associated with the computing device, an audio stream including a pronunciation of the group of visible words; determining, using a processor of the computing device, an identity of a visible page of the book from the captured video stream; retrieving, using the processor of the computing device, a group of expected sound models based on the identity of the visible page of the book; comparing, using the processor of the computing device, the captured audio stream including the pronunciation of the group of visible words to the group of expected sound models; determining, using the processor of the computing device, a correctness of the pronunciations of the group of visible words by determining which pronunciations of the group of visible words from the captured audio stream exceed a matching threshold with a sound model from the group of expected sound models based on the comparison; and displaying, on a display of the computing device, the correctness of the pronunciations of the group of visible words as visual cues. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0009] Implementations may include one or more of the following features. The method where displaying the correctness of the pronunciations of the group of visible words as visual cues includes displaying a virtual representation of the visible words with highlighting indicating which of the pronunciations of the visible words were correct and which of the pronunciations of the visible words were incorrect. Determining the identity of the visible page of the book from the captured video stream includes detecting a reference marker on the visible page of the book from the captured video stream and determining an identity of the reference marker. Determining, using the processor of the computing device, a correctness of the pronunciations of the group of visible words further may include displaying, on the display of the computing device, a representation of each of the pronunciations of the group of visible words after being compared to the expected sound models.
Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0010] Other implementations of one or more of these aspects and other aspects described in this document include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices. The above and other implementations are advantageous in a number of respects as articulated through this document. Moreover, it should be understood that the language used in the present disclosure has been principally selected for readability and instructional purposes, and not to limit the scope of the subject matter disclosed herein.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] The disclosure is illustrated by way of example, and not by way of limitation in the figures of the accompanying drawings in which like reference numerals are used to refer to similar elements.
[0012] Figure 1 is an example configuration of a system for virtualization of speech recognition.
[0013] Figure 2 is an example configuration of a system for virtualization of speech recognition.
[0014] Figure 3 is an example configuration of a system for virtualization of speech recognition.
[0015] Figure 4 is an example configuration of a system for virtualization of speech recognition.
[0016] Figure 5 is an example configuration of a system for virtualization of speech recognition.
[0017] Figure 6 is an example configuration of a system for virtualization of speech recognition.
[0018] Figures 7A-7C are an example configuration of a system for virtualization of speech recognition.
[0019] Figure 8 is an example configuration of a system for virtualization of speech recognition.
[0020] Figure 9 is an example configuration of a system for virtualization of speech recognition.
[0021] Figure 10 is an example configuration of a system for virtualization of speech recognition.
[0022] Figure 11 is an example configuration of a system for virtualization of speech recognition.
[0023] Figures 12A-12B are an example configuration of a system for virtualization of speech recognition.
[0024] Figure 13 is an example configuration of a system for virtualization of speech recognition.
[0025] Figure 14 is an example configuration of a system for virtualization of speech recognition.
[0026] Figures 15A-15C are an example configuration of a system for virtualization of speech recognition.
[0027] Figure 16 is an example configuration of a system for virtualization of speech recognition.
[0029] Figure 17 is a block diagram illustrating an example computer system for virtualization of speech recognition.
[0030] Figure 18 is a block diagram illustrating an example computing device.
[0031] Figure 19 is a flowchart of an example method for virtualization of speech recognition by capturing audible sound and matching it to sound models.
[0032] Figure 20 is an example configuration of a system for virtualization of speech recognition.
DETAILED DESCRIPTION
[0033] Figure 1 is an example configuration of a system 100 for audible text virtualization. In the example system, a graphical cue, such as a visual cue 124 may be presented on a display 112 of a computing device 104. As depicted, the configuration of the system 100 includes, in part, a tangible, physical activity scene 116, on which tangible interface objects 120 (not shown) may be positioned (e.g., placed, drawn, created, molded, built, projected, etc.) and a computing device 104 that is equipped or otherwise coupled to a video capture device 110 (not shown) coupled to an adapter 108 configured to capture video of the physical activity scene 116. In some implementations, instead of an adapter 108, the video capture device 110 can have a field of view that is directed towards an area that includes the physical activity scene 116. The computing device
104 includes novel software and/or hardware capable of displaying a virtual scene including in some implementations the visual cue 124 along with other virtual elements.
[0034] While the physical activity scene 116 on which the platform is situated is depicted as substantially horizontal in Figure 1, it should be understood that the physical activity scene 116 can be vertical or positioned at any other angle suitable to the user for interaction. The physical activity scene 116 can have any color, pattern, texture, and topography. For instance, the physical activity scene 116 can be substantially flat or be disjointed/ discontinuous in nature. Non-limiting examples of an activity scene include a table, a table top, desk, counter, ground, a wall, a whiteboard, a chalkboard, a customized scene, a user’s lap, etc. In some implementations, the physical activity scene 116 may be preconfigured for use with a tangible interface object 120 (not shown). While in further implementations, the physical activity scene 116 may be any scene on which the tangible interface object 120 may be positioned.
[0035] In some implementations, an input device 318, such as a microphone 1800 of the computing device 104, may capture audible sounds from a user, such as a word, a phrase or sentence that a user is saying. For example, the user may speak out loud a word that is represented by the visual cue 124, such as “hi”, and the microphone may capture an audio stream that includes the “hi” spoken by the user and store it as a sound file in the computing device 104 for further processing as described elsewhere herein. In another example, the user may speak out loud a word that is represented by a picture cue illustrating an object, such as a dog, and the microphone may capture the word “dog” spoken by the user and store it as a sound file in the computing device 104 for further processing as described elsewhere herein.
[0036] In another example embodiment, as shown in Figure 2, a virtual character 122 may be displayed on the screen and may provide virtual interactions with a user. For example, the virtual character 122 may execute a routine that includes playing a sound file to ask the user through the speaker of the computing device 104 to speak or sound out the letters of the visual cue 124 while the microphone captures the sounds the user makes. In some implementations, the virtual character 122 may point or gesture as shown in Figure 2 in order to draw the user’s attention or assist the user as the user sounds out the word represented by the visual cue 124. By using the combination of the sound file to prompt the user along with the pointing or gesture of the virtual character 122, the application mimics how a user would interact with a person as they read with the virtual character 122.
[0037] In another example embodiment, as shown in Figure 3, the virtual character 122 and/or a virtual graphical object 126, such as a virtual assistant depicted in Figure 3 as a star character, may perform animations or other routines when the correct word is pronounced based on
the word represented by the visual cue 124. These animations or other routines may provide reinforcement to the user when the correct word is pronounced in order to mimic a reading session with another person or teacher. The activity application 214 of the computing device 104 (e.g., as shown in Figure 18) may identify that a word is correctly pronounced by comparing the sound file captured by the microphone 1800 to expected model sounds for pronouncing the word represented by the visual cue 124 and matching portions of the sound file to portions of the expected models in order to satisfy a matching threshold in order to determine that the correct word is being pronounced. In some implementations, in order to decrease the amount of time it takes to compare the sound file to expected sound models, the activity application(s) 214 may identify one or more tangible interface objects 120 or routines from the virtual scene and retrieve expected sound models for words that will be presented on the tangible interface objects 120 or routines from the virtual scene. By limiting the amount of sound models being compared, the activity application(s) 214 can quickly determine if there is a match between the sound file and the expected sound models.
[0038] In another example embodiment, as shown in Figure 4, the system 100 may display prompts or feedback in a portion of the display screen 112 that the visual cue 124 is expected to appear. For example, in one activity, the animated character 122 refers to “dream magic” and a dream cloud icon 131 appears on the portion of the display screen 112 where the visual cue 124 may appear. As shown in Figure 5, the dream cloud icon may exhibit various animations when a word represented by the visual cue 124 is correctly or incorrectly pronounced by the user and captured by the microphone 1800 of the computing device 104. In further implementations, the system 100 may cause a speaker of the computing device 104 to provide hints or guidance, such as having the speaker play a recorded sound file that says phrases, such as “that looks tricky” or “try sounding it out” in order to guide and/or encourage the user. The activity application 214 may provide feedback in substantially real-time as the user correctly or incorrectly pronounces the phrase or word depicted by the visual cue 124. In some implementations, the activity application 214 can focus on portions of the visual cue 124, that the user is sounding out and emphasize the sound or portion of the visual cue 124 that the user needs additional help with, similar to how a teacher would assist a user as they read with the teacher.
[0039] In some implementations, the activity application 214 may have different routines to assist the user based on how many errors have occurred. For example, after a first error, the activity application 214 may cause the portion of the display 112 to form an animation that indicates the user incorrectly pronounced something and encourages them to try again. In another example embodiment, as shown in Figure 6, after a second or subsequent error, the activity application 214 may identify a portion of the visual cue 124 that is being incorrectly pronounced and provide a
pronunciation indicator 602 in order to highlight the portion that needs to be corrected or focused on during pronunciation. For example, if the user is incorrectly pronouncing the word “grape” the activity application 214 may break down the word “grape” and have the user focus on the vowels and the sound the vowels make by using the pronunciation indicator 602 to highlight the vowel. In some implementations, a sound file may be used to provide feedback and play the vowel sound through the speaker in order to assist the user in recognizing what the vowel sound should sound like in the word “grape”. For example, the feedback may indicate that the letter “a” in “grape” makes the long “aye” sound. Then the activity application 214 may instruct the user to sound out each letter in the visual cue 124 and make the sound for the letter “a” as instructed. In some implementations, the pronunciation indicator 602 may move from letter to letter as the user utters the portions of the word represented by the visual cue 124 and the activity application 214 may process the captured sound files as the user utters the portions of the word. This allows the activity application 214 to provide feedback in substantially real-time while reinforcing correct pronunciation and assisting in correcting improper pronunciation.
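A minimal sketch of this letter-by-letter checking, under an assumed segmentation of words into expected sounds, is shown below; the segment table and phoneme labels are illustrative only and are not specified by the disclosure.

```python
# Sketch: step a pronunciation indicator through the parts of a word and flag
# the segment that was mispronounced (e.g., the long "a" in "grape").
# The segmentation format and expected sounds are illustrative assumptions.
WORD_SEGMENTS = {"grape": [("g", "G"), ("r", "R"), ("a", "EY"), ("p", "P"), ("e", "")]}


def check_segments(word: str, heard_sounds: list):
    """Yield (letter, expected_sound, correct) as the indicator moves letter to letter."""
    for (letter, expected), heard in zip(WORD_SEGMENTS[word], heard_sounds):
        yield letter, expected, heard == expected


if __name__ == "__main__":
    # The reader said a short "a" (AE) instead of the long "a" (EY).
    for letter, expected, ok in check_segments("grape", ["G", "R", "AE", "P", ""]):
        marker = "" if ok else "  <-- highlight with pronunciation indicator"
        print(f"{letter}: expected {expected or '(silent)'}{marker}")
```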
[0040] Figures 7A-7C are example configurations for instructing a user to place a tangible interface object 120 (not shown) in front of the computing device 104. In the example of Figure 7A, an animation may appear on the display screen 112 of the computing device 104 that depicts a book title representing a book 702 as a tangible interface object 120 for the user to place in front of the system 100. Another animation in Figure 7B illustrates to the user on the display screen 112 of the computing device 104 how the book 702 should be placed on the physical activity scene 116 in order for the book 702 to be within the field of view of the video capture device 110. Another animation in Figure 7C may then show the animated character 122 looking and/or pointing down into the space on the physical activity scene 116 where the user may place tangible interface object(s) 120, such as the book 702 from Figure 7A. By providing gesture from the animated character 122 and/or an animation showing the user how to position the tangible interface object 120, the virtual scene is able to interact with the user in substantially real-time and direct the user as needed in how to interact with the physical activity scene 116. In some implementations, the interactions between the animated character 122 and the user may be routines executed by the activity application 214, while in further implementations, the activity application 214 may employ various machine learning algorithms and over time execute various independent interactions with the user autonomously. These interactions facilitate a reading session between a user and the animated character 122. The animated character 122 may be configured to provide these placement cues in order to indicate to the user where to position the tangible interface objects 120 in the real -world based on the intuitive placement cues provided by the animated character 122. In some implementations, the placement
cues may be designed to assist even young children who can detect where and/or how the animated character 122 is making the placement cues without explicit instructions being presented on the display screen 112 for how to position the tangible interface object 120. These placement cues may increase the engagement from the user as they subconsciously follow the placement cues to quickly position tangible interface objects 120 correctly without explicit instructions that may break the engagement.
[0041] In some implementations, as shown in Figure 8, the tangible interface object 120 may be a book or other instructional material that may be placed on the physical activity scene 116 in front of the computing device 104 and within the field of view of the video capture device 110 coupled to the computing device 104. In some implementations, as shown in Figure 8, various image markings 802 on the tangible interface object 120 may be visible to the detection engine 212 of the computing device 104 (e.g., as shown in Figure 18) that may be used to recognize the type of the tangible interface object 120. These image markings 802 may be used to determine which activities to execute by the activity application 214 for display of various routines on the display 112 of the computing device 104.
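As one hypothetical way to realize this lookup, the detected image marking could index a small table that maps markers to a book identity and the activity routine to execute; the marker identifiers, titles, and routine names below are assumptions for illustration.

```python
# Sketch: map a detected image marking to a book identity and an activity
# routine. The marker IDs, titles, pages, and routine names are hypothetical.
MARKER_DB = {
    "marker_0412": {"title": "Here's a Taste", "page": 3, "activity": "read_along"},
    "marker_0977": {"title": "Dream Magic",    "page": 1, "activity": "sound_it_out"},
}


def identify_book(detected_marker_id: str):
    """Return (title, page, activity) for a detected marker, or None if unknown."""
    entry = MARKER_DB.get(detected_marker_id)
    if entry is None:
        return None
    return entry["title"], entry["page"], entry["activity"]


if __name__ == "__main__":
    # prints the title, page, and activity routine for the detected marker
    print(identify_book("marker_0412"))
```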
[0042] In some implementations, as shown in Figure 9, the activity application 214 may detect a correct placement of the tangible interface object 120, such as the book, and may provide instructions or placement cues in order to correctly position the tangible interface object 120 for optimal detection of the tangible interface object 120. As shown in Figure 9, a portion of the display 112 may include a virtual depiction 902 of the physical activity scene 116. The virtual depiction 902 may include an image of the physical activity scene 116 that has been captured by the video capture device 110 and trimmed to fit within the portion of the display 112. In further implementations, the virtual depiction 902 may be a virtualization of the tangible interface object 120, such as a top portion of a book showing only a portion of the book title. By virtually displaying a portion of the physical activity scene 116 on the virtual scene, the user is further immersed in the interaction with the virtual character 122.
[0043] In some implementations, as shown in Figure 10, as a user reads the words as printed or presented on the tangible interface object 120, such as a page of a book, the detected pronunciation of those words appears on the display 112 as a visual cue 124. For example, the words may slide up from a side of the display 112 as the utterance of those words are captured and recognized. If the words are correct, they may be highlighted in one color or indicator, and if they are incorrect, they may be highlighted in another color or indicator, and/or crossed out to indicate an error as the user continues to read from the tangible interface object 120.
[0044] In some implementations, as shown in Figure 11, the detection engine 212 can detect if the input from the video capture device 110 has changed, such as if the tangible interface object 120 is jostled or shifted and may either update the calibration profile for the change and/or provide instructions 1102 to the user to correct an alignment of the tangible interface object 120. In some implementations, the activity application 214 detects when a user finishes reading words on a page of the tangible interface object 120 and may provide instructions 1102 or guidance to the user to turn the page or perform another interaction with the tangible interface object 120. In some implementations, the activity application 214 can detect a speed and/or cadence of the words being spoken by a user. The activity application 214 may create a profile of the user and over time learn the expected speed and/or cadence of the user. Using this profile, the activity application 214 can detect anomalies, such as a pause or hesitation that may indicate that the user is confused and/or struggling with a pronunciation of a word and provide assistance based on that anomaly.
[0045] In some implementations, as shown in Figures 12A and 12B, the activity application 214 may detect that a user has struggled with and/or incorrectly pronounced a word. That word may be highlighted as an incorrect word 140 on the display 112 of the computing device 104 as shown in Figure 12A. In some implementations, the incorrect word 140 may include a visual indicator to highlight to the user that the word was incorrect. In some implementations, the activity application 214 may highlight the incorrect word 140 without interrupting the user and allow the user to continue reading the words on the page of the tangible interface object 120. In some implementations, if multiple instances of the incorrect word 140 are present or if the activity application 214 determines that additional review of the incorrect word 140 is needed, then the activity application 214 may cause the incorrect word 140 to be the focus of a specific lesson as shown in Figure 12B. The incorrect word 140 may be reviewed and the activity application 214 may provide assistance in sounding out and/or pronouncing the incorrect word 140. In some implementations, the incorrect word 140 may then be practiced on the page again as shown in Figure 12A. In some implementations, the activity application 214 may trigger a routine associated with mini games and/or instructional moments to teach the sounding out and/or pronouncing the incorrect word 140. [0046] In some implementations, as shown in Figure 13, if an entire section of words that the activity application 214 is expecting the user to pronounce are skipped or mispronounced while reading from a page of the tangible interface object 120, then the entire section of incorrect words 140 may be highlighted for additional focus on the display 112 of the computing device 104. In some implementations, the animated character 122 may provide supplementary context of what various words mean and/or pronunciations in order to mimic an interaction with a teacher during a reading session.
[0047] In some implementations, as shown in Figure 14, a request for feedback or ratings can be provided, such as by using a rating icon 160 on the display 112 of the computing device 104. The rating icon 160 may appear once the activity application 214 detects that the book has been completely read or the activity is completed. In some implementations, the rating icon 160 may be linked to a profile of a user and as personal ratings are gathered from specific users, a profile of likes and/or interests may be created. In some implementations, the profile may be used to determine difficulty levels associated with reading and the rating icon 160 response may be incorporated into determining future difficulty levels for reading recommendations. Whether the activity was too easy or too difficult for the user to complete may be incorporated into future reading recommendations in order to provide content that will engage the user without overwhelming or underwhelming them. [0048] In some implementations, as shown in Figures 15A-15C, various animations may be presented on the display 112 of the computing device 104 in order to engage with users as they participate in the activity, such as reading a page of the book or completing a last page of the book. In some implementations, as shown in Figure 15A, once the book is complete, a light animation 1502 may appear near the virtual character 122 on the display 112. In Figure 15B, the light animation 1502 may grow and fill at least a portion of the display 112. In Figure 15C, the light animation 1502 may appear to fill and light a virtual element 1504, such as a lamp or light in the virtual scene and cause the scene to be brighter. In some implementations, the effects, such as the virtual element 1504 and/or brightness of the scene may persist through future activities within the virtual scene. This action of completing a task and causing an animation that turns into an effect that persists in the activity can increase the engagement of the user.
[0049] In some implementations, as shown in Figure 16, a continue prompt 170 may be displayed on the display 112. This continue prompt 170 may appear after an activity/task/book has been completed in order to determine if the user wants to continue with another activity/task/book/etc. In some implementations, a user can respond to the continue prompt 170 by selecting an icon displayed on the display 112. In further implementations, the user may speak an audible command, such as a “yes” or “no” and the activity application 214 may determine the next action based on a speech recognition of the audible command. In some implementations, additional audio commands may be received from the user, similar to how a user would interact with a teacher during a reading session. This allows the user to stay immersed in the experience without having to manage menus or other functionality.
[0050] In some implementations, the physical activity scene 116 may be integrated with a stand 106 that supports the computing device 104 or may be distinct from the stand 106 but placeable adjacent to the stand 106. In some instances, the size of the interactive area on the physical
activity scene 116 may be bounded by the field of view of the video capture device 110 (not shown) and can be adapted by an adapter 108 and/or by adjusting the position of the video capture device 110. In additional examples, the boundary and/or other indicator may be a light projection (e.g., pattern, context, shapes, etc.) projected onto the physical activity scene 116.
[0051] In some implementations, the computing device 104 included in the example configuration 100 may be situated on the scene or otherwise proximate to the scene. The computing device 104 can provide the user(s) with a virtual portal for displaying the virtual scene. For example, the computing device 104 may be placed on a table in front of a user so the user can easily see the computing device 104 while interacting with the tangible interface object 120 on the physical activity scene 116. Example computing devices 104 may include, but are not limited to, mobile phones (e.g., feature phones, smart phones, etc.), tablets, laptops, desktops, netbooks, TVs, set-top boxes, media streaming devices, portable media players, navigation devices, personal digital assistants, etc.
[0052] The computing device 104 includes or is otherwise coupled (e.g., via a wireless or wired connection) to a video capture device 110 (also referred to herein as a camera) for capturing a video stream of the physical activity scene. As depicted in Figure 1, the video capture device 110 (not shown) may be a front-facing camera that is equipped with an adapter 108 that adapts the field of view of the camera 110 to include, at least in part, the physical activity scene 116. For clarity, the portion of the physical activity scene 116 captured by the video capture device 110 is also interchangeably referred to herein as the activity scene in some implementations.
[0053] As depicted in Figure 1, the computing device 104 and/or the video capture device 110 may be positioned and/or supported by a stand 106. For instance, the stand 106 may position the display 112 of the computing device 104 in a position that is optimal for viewing and interaction by the user who may be simultaneously positioning the tangible interface object 120 (e.g., shown in Figure 8) and/or interacting with the physical environment. The stand 106 may be configured to rest on the activity scene (e.g., table, desk, etc.) and receive and sturdily hold the computing device 104 so the computing device 104 remains still during use.
[0054] In some implementations, the tangible interface object 120 may be used with a computing device 104 that is not positioned in a stand 106 and/or using an adapter 108. The user may position and/or hold the computing device 104 such that a front facing camera or a rear facing camera may capture the tangible interface object 120 and then a virtual scene may be presented on the display 112 of the computing device 104 based on the capture of the tangible interface object 120.
[0055] In some implementations, the adapter 108 adapts a video capture device 110 (e.g., front-facing, rear-facing camera) of the computing device 104 to capture substantially only the physical activity scene 116, although numerous further implementations are also possible and contemplated. For instance, the adapter 108 can split the field of view of the front-facing camera into two scenes. In this example with two scenes, the video capture device 110 captures a physical activity scene that includes a portion of the activity scene and is able to capture a tangible interface object 120 in either portion of the physical activity scene. In another example, the adapter 108 can redirect a rear-facing camera (not shown) of the computing device 104 toward a front-side of the computing device 104 to capture the physical activity scene of the activity scene located in front of the computing device 104. In some implementations, the adapter 108 can define one or more sides of the scene being captured (e.g., top, left, right, with bottom open). In some implementations, the adapter 108 can split the field of view of the front facing camera to capture both the physical activity scene and the view of the user interacting with the tangible interface object 120.
[0056] The adapter 108 and the stand 106 for a computing device 104 may include a slot for retaining (e.g., receiving, securing, gripping, etc.) an edge of the computing device 104 to cover at least a portion of the video capture device 110. The adapter 108 may include at least one optical element (e.g., a mirror) to direct the field of view of the video capture device 110 toward the activity scene. The computing device 104 may be placed in and received by a compatibly sized slot formed in a top side of the stand 106. The slot may extend at least partially downward into a main body of the stand 106 at an angle so that when the computing device 104 is secured in the slot, it is angled back for convenient viewing and utilization by its user or users. The stand 106 may include a channel formed perpendicular to and intersecting with the slot. The channel may be configured to receive and secure the adapter 108 when not in use. For example, the adapter 108 may have a tapered shape that is compatible with and configured to be easily placeable in the channel of the stand 106. In some instances, the channel may magnetically secure the adapter 108 in place to prevent the adapter 108 from being easily jarred out of the channel. The stand 106 may be elongated along a horizontal axis to prevent the computing device 104 from tipping over when resting on a substantially horizontal activity scene (e.g., a table). The stand 106 may include channeling for a cable that plugs into the computing device 104. The cable may be configured to provide power to the computing device 104 and/or may serve as a communication link to other computing devices, such as a laptop or other personal computer.
[0057] In some implementations, the adapter 108 may include one or more optical elements, such as mirrors and/or lenses, to adapt the standard field of view of the video capture device 110. For instance, the adapter 108 may include one or more mirrors and lenses to redirect and/or modify
the light being reflected from activity scene into the video capture device 110. As an example, the adapter 108 may include a mirror angled to redirect the light reflected from the activity scene in front of the computing device 104 into a front-facing camera of the computing device 104. As a further example, many wireless handheld devices include a front-facing camera with a fixed line of sight with respect to the display of the computing device 104. The adapter 108 can be detachably connected to the device over the video capture device 110 to augment the line of sight of the video capture device 110 so it can capture the activity scene (e.g., surface of a table, etc.). The mirrors and/or lenses in some implementations can be polished or laser quality glass. In other examples, the mirrors and/or lenses may include a first surface that is a reflective element. The first surface can be a coating/thin film capable of redirecting light without having to pass through the glass of a mirror and/or lens. In an alternative example, a first surface of the mirrors and/or lenses may be a coating/thin film and a second surface may be a reflective element. In this example, the lights pass through the coating twice, however since the coating is extremely thin relative to the glass, the distortive effect is reduced in comparison to a conventional mirror. This mirror reduces the distortive effect of a conventional mirror in a cost-effective way.
[0058] In another example, the adapter 108 may include a series of optical elements (e.g., mirrors) that wrap light reflected off of the activity surface located in front of the computing device 104 into a rear-facing camera of the computing device 104 so it can be captured. The adapter 108 could also adapt a portion of the field of view of the video capture device 110 (e.g., the front-facing camera) and leave a remaining portion of the field of view unaltered so that multiple scenes may be captured by the video capture device 110. The adapter 108 could also include optical element(s) that are configured to provide different effects, such as enabling the video capture device 110 to capture a greater portion of the activity scene. For example, the adapter 108 may include a convex mirror that provides a fisheye effect to capture a larger portion of the activity scene than would otherwise be capturable by a standard configuration of the video capture device 110.
[0059] The video capture device 110 could, in some implementations, be an independent unit that is distinct from the computing device 104 and may be positionable to capture the activity scene or may be adapted by the adapter 108 to capture the activity scene as discussed above. In these implementations, the video capture device 110 may be communicatively coupled via a wired or wireless connection to the computing device 104 to provide it with the video stream being captured. [0060] Figure 17 is a block diagram illustrating an example computer system 200 for audible textual virtualizations. The illustrated system 200 includes computing devices 104a... 104n (also referred to individually and collectively as 104) and servers 202a...202n (also referred to individually and collectively as 202), which are communicatively coupled via a network 206 for
interaction with one another. For example, the computing devices 104a... 104n may be respectively coupled to the network 206 via signal lines 208a...208n and may be accessed by users. The servers 202a...202n may be coupled to the network 206 via signal lines 204a...204n, respectively. The use of the nomenclature “a” and “n” in the reference numbers indicates that any number of those elements having that nomenclature may be included in the system 200 or other figures.
[0061] The network 206 may include any number of networks and/or network types. For example, the network 206 may include, but is not limited to, one or more local area networks (LANs), wide area networks (WANs) (e.g., the Internet), virtual private networks (VPNs), mobile (cellular) networks, wireless wide area networks (WWANs), WiMAX® networks, Bluetooth® communication networks, peer-to-peer networks, other interconnected data paths across which multiple devices may communicate, various combinations thereof, etc.
[0062] The computing devices 104a... 104n (also referred to individually and collectively as 104) are computing devices having data processing and communication capabilities. For instance, a computing device 104 may include a processor (e.g., virtual, physical, etc.), a memory, a power source, a network interface, and/or other software and/or hardware components, such as front and/or rear facing cameras, display, graphics processor, wireless transceivers, keyboard, camera, sensors, firmware, operating systems, drivers, various physical connection interfaces (e.g., USB, HDMI, etc.). The computing devices 104a... 104n may couple to and communicate with one another and the other entities of the system 200 via the network 206 using a wireless and/or wired connection. While two or more computing devices 104 are depicted in Figure 17, the system 200 may include any number of computing devices 104. In addition, the computing devices 104a... 104n may be the same or different types of computing devices.
[0063] As depicted in Figure 17, one or more of the computing devices 104a... 104n may include a camera or video capture device 110, a detection engine 212, and activity application(s) 214. One or more of the computing devices 104 and/or cameras 110 may also be equipped with an adapter 108 as discussed elsewhere herein. The detection engine 212 is capable of detecting and/or recognizing audible sounds and one or more tangible interface object(s) 120. The detection engine 212 can detect the position and orientation of each of the tangible interface object(s) 120, detect how the tangible interface object 120 is being manipulated by the user, and cooperate with the activity application(s) 214 to provide users with a rich virtual experience by detecting the tangible interface object 120 and audible sounds from the user and generating a virtualization in the virtual scene.
[0064] In some implementations, the detection engine 212 processes video captured by a camera 110 to detect visual markers and/or other identifying elements or characteristics to identify
the tangible interface object(s) 120. Additional structure and functionality of the computing devices 104 are described in further detail below with reference to at least Figure 18.
[0065] The servers 202 may each include one or more computing devices having data processing, storing, and communication capabilities. For example, the servers 202 may include one or more hardware servers, server arrays, storage devices and/or systems, etc., and/or may be centralized or distributed/cloud-based. In some implementations, the servers 202 may include one or more virtual servers, which operate in a host server environment and access the physical hardware of the host server including, for example, a processor, memory, storage, network interfaces, etc., via an abstraction layer (e.g., a virtual machine manager).
[0066] The servers 202 may include software applications operable by one or more computer processors of the servers 202 to provide various computing functionalities, services, and/or resources, and to send data to and receive data from the computing devices 104. For example, the software applications may provide functionality for internet searching; social networking; web-based email; blogging; micro-blogging; photo management; video, music and multimedia hosting, distribution, and sharing; business services; news and media distribution; user account management; or any combination of the foregoing services. It should be understood that the servers 202 are not limited to providing the above-noted services and may include other network-accessible services.
[0067] It should be understood that the system 200 illustrated in Figure 17 is provided by way of example, and that a variety of different system environments and configurations are contemplated and are within the scope of the present disclosure. For instance, various functionality may be moved from a server to a client, or vice versa and some implementations may include additional or fewer computing devices, services, and/or networks, and may implement various functionality client or server-side. Further, various entities of the system 200 may be integrated into a single computing device or system or additional computing devices or systems, etc.
[0068] Figure 18 is a block diagram of an example computing device 104. As depicted, the computing device 104 may include a processor 312, memory 314, communication unit 316, display 320, camera 110, and an input device 318, which are communicatively coupled by a communications bus 308. However, it should be understood that the computing device 104 is not limited to such and may include other elements, including, for example, those discussed with reference to the computing devices 104 in the Figures.
[0069] The processor 312 may execute software instructions by performing various input/output, logical, and/or mathematical operations. The processor 312 has various computing architectures to process data signals including, for example, a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, and/or an architecture
implementing a combination of instruction sets. The processor 312 may be physical and/or virtual, and may include a single core or plurality of processing units and/or cores.
[0070] The memory 314 is a non-transitory computer-readable medium that is configured to store and provide access to data to the other elements of the computing device 104. In some implementations, the memory 314 may store instructions and/or data that may be executed by the processor 312. For example, the memory 314 may store the detection engine 212, the activity application(s) 214, and the camera driver 306. The memory 314 is also capable of storing other instructions and data, including, for example, an operating system, hardware drivers, other software applications, data, etc. The memory 314 may be coupled to the bus 308 for communication with the processor 312 and the other elements of the computing device 104.
[0071] The communication unit 316 may include one or more interface devices (I/F) for wired and/or wireless connectivity with the network 206 and/or other devices. In some implementations, the communication unit 316 may include transceivers for sending and receiving wireless signals. For instance, the communication unit 316 may include radio transceivers for communication with the network 206 and for communication with nearby devices using close-proximity (e.g., Bluetooth®, NFC, etc.) connectivity. In some implementations, the communication unit 316 may include ports for wired connectivity with other devices. For example, the communication unit 316 may include a CAT-5 interface, Thunderbolt™ interface, FireWire™ interface, USB interface, etc.
[0072] The display 320 may display electronic images and data output by the computing device 104 for presentation to a user. The display 320 may include any conventional display device, monitor or screen, including, for example, an organic light-emitting diode (OLED) display, a liquid crystal display (LCD), etc. In some implementations, the display 320 may be a touch-screen display capable of receiving input from one or more fingers of a user. For example, the display 320 may be a capacitive touch-screen display capable of detecting and interpreting multiple points of contact with the display screen. In some implementations, the computing device 104 may include a graphics adapter (not shown) for rendering and outputting the images and data for presentation on display 320. The graphics adapter (not shown) may be a separate processing device including a separate processor and memory (not shown) or may be integrated with the processor 312 and memory 314. [0073] The input device 318 may include any device for inputting information into the computing device 104. In some implementations, the input device 318 may include one or more peripheral devices. For example, the input device 318 may include a keyboard (e.g., a QWERTY keyboard), a pointing device (e.g., a mouse or touchpad), microphone 1800, a camera, etc. In some implementations, the input device 318 may include a touch-screen display capable of receiving input
from the one or more fingers of the user 130. For instance, the functionality of the input device 318 and the display 320 may be integrated, and a user of the computing device 104 may interact with the computing device 104 by contacting a surface of the display 320 using one or more fingers. In this example, the user 130 could interact with an emulated (i.e., virtual or soft) keyboard displayed on the touch-screen display 320 by using fingers to contact the display 320 in the keyboard regions.
[0074] The detection engine 212 may include a calibrator 302 and a detector 304. The elements 212 and 214 may be communicatively coupled by the bus 308 and/or the processor 312 to one another and/or the other elements 306, 310, 314, 316, 318, 320, and/or 110 of the computing device 104. In some implementations, one or more of the elements 212 and 214 are sets of instructions executable by the processor 312 to provide their functionality. In some implementations, one or more of the elements 212 and 214 are stored in the memory 314 of the computing device 104 and are accessible and executable by the processor 312 to provide their functionality. In any of the foregoing implementations, these components 212 and 214 may be adapted for cooperation and communication with the processor 312 and other elements of the computing device 104.
[0075] The calibrator 302 includes software and/or logic for processing the video stream captured by the camera 110 to detect a continuous presence of one or more tangible interface object(s) 120. For example, the calibrator 302 can detect if the input from the video capture device 110 has changed, such as if the tangible interface object 120 is jostled or shifted, and may either update the calibration profile for the change or provide instructions to the user to correct an alignment of the tangible interface object 120. In some implementations, the calibrator 302 may detect that a page of the tangible interface object 120, such as a book, has been skipped by the user by mistake and may provide instructions or guide the user to turn to the correct page in sequence for uninterrupted reading. The calibrator 302 may also detect that the user is pausing (e.g., taking more than 5 seconds) or hesitating to turn to the next page and may prompt the user to turn to the next page.
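By way of illustration only, the page-order and pause checks described in paragraph [0075] might be sketched in Python as follows. The class name, the prompt strings, and the reuse of the 5-second example threshold are assumptions for illustration and do not represent the actual implementation of the calibrator 302.

```python
import time
from typing import Optional

class PageProgressMonitor:
    """Illustrative page-order and pause tracker; not the actual calibrator 302."""

    PAUSE_THRESHOLD_S = 5.0  # mirrors the 5-second example pause window in the text

    def __init__(self) -> None:
        self.last_page: Optional[int] = None
        self.last_change_time = time.monotonic()

    def on_page_detected(self, page_number: int) -> Optional[str]:
        """Return a corrective prompt when a page is skipped, otherwise None."""
        prompt = None
        if self.last_page is not None and page_number > self.last_page + 1:
            prompt = f"Looks like a page was skipped. Please turn back to page {self.last_page + 1}."
        if page_number != self.last_page:
            self.last_change_time = time.monotonic()
        self.last_page = page_number
        return prompt

    def on_tick(self) -> Optional[str]:
        """Return a gentle nudge if the reader has paused longer than the threshold."""
        if time.monotonic() - self.last_change_time > self.PAUSE_THRESHOLD_S:
            return "Ready for the next page? Go ahead and turn it."
        return None
```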
[0076] The detector 304 includes software and/or logic for processing the video stream captured by the camera 110 to detect and/or identify one or more tangible interface object(s) 120 included in the video stream and processing the audio stream captured by the microphone 1800 to detect and/or identify one or more audible sounds. In some implementations, the detector 304 may be coupled to and receive the video stream from the camera 110, the camera driver 306, and/or the memory 314. In some implementations, the detector 304 may process the images of the video stream to determine positional information for the line segments related to the tangible interface object(s) 120 and then analyze characteristics of the line segments included in the video stream to determine the identities and/or additional attributes of the line segments.
[0077] In some implementations, the detector 304 may use visual characteristics to recognize custom designed portions of the physical activity scene 116, such as corners or edges, etc. The detector 304 may perform a straight line detection algorithm and a rigid transformation to account for distortion and/or bends on the physical activity scene 116. In some implementations, the detector 304 may match features of detected line segments to a reference object that may include a depiction of the individual components of the reference object in order to determine the line segments and/or the boundary of the expected objects in the physical activity scene 116. In some implementations, the detector 304 may account for gaps and/or holes in the detected line segments and/or contours and may be configured to generate a mask to fill in the gaps and/or holes.
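As a rough, non-limiting sketch of the line detection and gap filling described above, the following uses OpenCV's probabilistic Hough transform and a morphological closing; the specific parameter values and the use of cv2.estimateAffinePartial2D for the rigid transformation are illustrative assumptions rather than the disclosed algorithm.

```python
import cv2
import numpy as np

def detect_line_segments(frame_bgr: np.ndarray):
    """Detect straight line segments in a video frame (illustrative parameters)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    # Close small gaps/holes in the edge map before extracting line segments.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
    lines = cv2.HoughLinesP(closed, rho=1, theta=np.pi / 180,
                            threshold=80, minLineLength=40, maxLineGap=10)
    return [] if lines is None else [tuple(l[0]) for l in lines]

def estimate_rigid_transform(src_pts: np.ndarray, dst_pts: np.ndarray):
    """Estimate a rotation + translation + uniform scale to account for bends/skew."""
    matrix, _inliers = cv2.estimateAffinePartial2D(src_pts, dst_pts)
    return matrix
```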
[0078] In some implementations, the detector 304 may recognize the line by identifying its contours. The detector 304 may also identify various attributes of the line, such as colors, contrasting colors, depth, texture, etc. In some implementations, the detector 304 may use the description of the line and the line attributes to identify a tangible interface object 120 by comparing the description and attributes to a database of virtual objects and identifying the closest matches by comparing recognized tangible interface object(s) 120 to reference components of the virtual objects. In some implementations, the detector 304 may incorporate machine learning algorithms to add additional virtual objects to a database of virtual objects as new shapes or sounds are identified.
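The attribute-matching step of paragraph [0078] could be approximated by scoring a detected object's numeric attributes against stored reference descriptors, as in the hedged sketch below; the feature names and the distance metric are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ReferenceObject:
    name: str
    features: Dict[str, float]  # e.g., {"aspect_ratio": 1.4, "corner_count": 4.0, "mean_hue": 0.6}

def closest_match(detected: Dict[str, float], references: List[ReferenceObject]) -> ReferenceObject:
    """Return the reference object whose stored features are nearest to the detected attributes."""
    def distance(ref: ReferenceObject) -> float:
        shared = detected.keys() & ref.features.keys()
        if not shared:
            return float("inf")  # nothing comparable; treat as a non-match
        return sum(abs(detected[k] - ref.features[k]) for k in shared)
    return min(references, key=distance)
```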
[0079] The detector 304 may be coupled to the storage 310 via the bus 308 to store, retrieve, and otherwise manipulate data stored therein. For example, the detector 304 may query the storage 310 for data matching any line segments that it has determined are present in the physical activity scene 116. In all of the above descriptions, the detector 304 may send the detected images to the detection engine 212 and the detection engine 212 may perform the above described features.
[0080] The detector 304 may be able to process the video stream to detect a manipulation of the tangible interface object 120. In some implementations, the detector 304 may be configured to understand relational aspects of a tangible interface object 120 and determine an interaction based on those relational aspects. For example, the detector 304 may be configured to identify an interaction related to one or more tangible interface objects present in the physical activity scene 116 and the activity application(s) 214 may determine a routine based on the relational aspects between the one or more tangible interface object(s) 120 and other elements of the physical activity scene 116.
[0081] The activity application(s) 214 include software and/or logic for identifying one or more tangible interface object(s) 120, identifying an audible sound, and matching the audible sound to a model of expected sounds based on the tangible interface object(s) 120. The activity application(s)
214 may be coupled to the detector 304 via the processor 312 and/or the bus 308 to receive the information.
[0082] In some implementations, the activity application(s) 214 may determine the virtual object and/or a routine by searching through a database of virtual objects and/or routines that are compatible with the identified combined position of tangible interface object(s) 120 relative to each other. In some implementations, the activity application(s) 214 may access a database of virtual objects or routines stored in the storage 310 of the computing device 104. In further implementations, the activity application(s) 214 may access a server 202 to search for virtual objects and/or routines. In some implementations, a user 130 may predefine a virtual object and/or routine to include in the database.
[0083] In some implementations, the activity application(s) 214 may enhance the virtual scene as part of a routine. For example, the activity application(s) 214 may display visual enhancements as part of executing the routine. The visual enhancements may include adding color, extra virtualizations, background scenery, incorporating the virtual object into a shape and/or character, etc. In further implementations, the visual enhancements may include having the virtual object move or interact with another virtualization (not shown) and/or the virtual character 122 in the virtual scene. In some implementations, the activity application(s) 214 may prompt the user 130 to select one or more enhancement options, such as a change to color, size, shape, etc. and the activity application(s) 214 may incorporate the selected enhancement options into the virtual object and/or the virtual scene. Non-limiting examples of the activity applications 214 may include video games, learning applications, assistive applications, storyboard applications, collaborative applications, productivity applications, etc.
[0084] The camera driver 306 includes software storable in the memory 314 and operable by the processor 312 to control/operate the camera 110. For example, the camera driver 306 is a software driver executable by the processor 312 for signaling the camera 110 to capture and provide a video stream and/or still image, etc. The camera driver 306 is capable of controlling various features of the camera 110 (e.g., flash, aperture, exposure, focal length, etc.). The camera driver 306 may be communicatively coupled to the camera 110 and the other components of the computing device 104 via the bus 308, and these components may interface with the camera driver 306 via the bus 308 to capture video and/or still images using the camera 110.
[0085] As discussed elsewhere herein, the camera 110 is a video capture device configured to capture video of at least the activity scene. The camera 110 may be coupled to the bus 308 for communication and interaction with the other elements of the computing device 104. The camera 110 may include a lens for gathering and focusing light, a photo sensor including pixel regions for
capturing the focused light and a processor for generating image data based on signals provided by the pixel regions. The photo sensor may be any type of photo sensor including a charge-coupled device (CCD), a complementary metal-oxide-semiconductor (CMOS) sensor, a hybrid CCD/CMOS device, etc. The camera 110 may also include any conventional features such as a flash, a zoom lens, etc. The camera 110 may include a microphone (not shown) for capturing sound or may be coupled to a microphone included in another component of the computing device 104 and/or coupled directly to the bus 308. In some implementations, the processor of the camera 110 may be coupled via the bus 308 to store video and/or still image data in the memory 314 and/or provide the video and/or still image data to other elements of the computing device 104, such as the detection engine 212 and/or activity application(s) 214.
[0086] The storage 310 is an information source for storing and providing access to stored data, such as a database of virtual objects, virtual prompts, routines, and/or virtual elements, gallery(ies) of virtual objects that may be displayed on the display 320, user profile information, community developed virtual routines, virtual enhancements, etc., object data, calibration data, and/or any other information generated, stored, and/or retrieved by the activity application(s) 214. [0087] In some implementations, the storage 310 may be included in the memory 314 or another storage device coupled to the bus 308. In some implementations, the storage 310 may be, or may be included in, a distributed data store, such as a cloud-based computing and/or data storage system. In some implementations, the storage 310 may include a database management system (DBMS). For example, the DBMS could be a structured query language (SQL) DBMS. For instance, storage 310 may store data in an object-based data store or multi-dimensional tables comprised of rows and columns, and may manipulate, i.e., insert, query, update, and/or delete, data entries stored in the data store using programmatic operations (e.g., SQL queries and statements or a similar database manipulation library). Additional characteristics, structure, acts, and functionality of the storage 310 are discussed elsewhere herein.
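To make the storage description concrete, the following is a minimal sketch of a hypothetical SQLite-backed table of per-page expected sound models; the schema, table, and column names are assumptions and are not the disclosed layout of the storage 310.

```python
import json
import sqlite3

def open_storage(path: str = "storage.db") -> sqlite3.Connection:
    """Open (or create) an illustrative sound-model table."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS sound_models (
            book_title TEXT    NOT NULL,
            page       INTEGER NOT NULL,
            word       TEXT    NOT NULL,
            variants   TEXT    NOT NULL,  -- JSON list of accepted pronunciations
            PRIMARY KEY (book_title, page, word)
        )
    """)
    return conn

def expected_models_for_page(conn: sqlite3.Connection, title: str, page: int) -> dict:
    """Return {word: [accepted pronunciation variants]} for one page of one book."""
    rows = conn.execute(
        "SELECT word, variants FROM sound_models WHERE book_title = ? AND page = ?",
        (title, page),
    ).fetchall()
    return {word: json.loads(variants) for word, variants in rows}
```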
[0088] Figure 19 depicts a flowchart of an example method for virtualization of speech recognition by capturing audible sound and matching it to sound models. At 1902, the video capture device 110 may capture a video stream of a physical activity scene 116 including a tangible interface object 120. At 1904, the computing device 104 may use an input device, such as a microphone 1800, to capture a sound stream. The sound stream may include sounds of a user audibly pronouncing words, such as words that are present on the tangible interface object 120 (e.g., a book) or appearing in the virtual scene. At 1906, the activity application 214 may determine an identity of the tangible interface object 120, such as by matching a vision marker or other visible reference point to a database of markers or reference points and identifying the tangible interface object 120. At 1908,
the activity application 214 may compare the captured sound stream to an expected sound model based on the identity of the tangible interface object 120. By identifying an expected sound model for comparison, the processing time for comparing the sound stream to the expected sound model can be reduced so that the comparison occurs in substantially real-time, as the comparison is only between the sound stream and the expected sound models. For example, as a user reads words from a book, the expected sound model would include both the pronunciation of the words and common or expected pronunciation variations and errors. Over time, machine learning algorithms can supplement and expand the expected sound models to capture additional sounds that are occurring. At 1910, the display 112 can display a visual cue based on the comparison. The visual cue may be a correct word pronunciation or a tutorial on how to pronounce a specific sound that was identified as being incorrect from the comparison.
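The flow of Figure 19 might be sketched roughly as below. The helper objects and functions (marker_db, model_db, recognize_speech) are hypothetical placeholders standing in for the detector 304, the storage 310, and a speech recognizer; they are not part of the disclosure.

```python
def reading_feedback_step(frame, audio_chunk, marker_db, model_db, recognize_speech):
    """One pass of the Figure 19 flow: identify the page, compare speech, return a visual cue."""
    # 1902/1906: identify the tangible interface object (e.g., book page) from the video frame.
    identity = marker_db.lookup(frame)  # hypothetical marker lookup
    if identity is None:
        return {"cue": "show_book_prompt"}

    # 1908: retrieve only the expected sound models for that page, which keeps the
    # comparison small enough to run in substantially real time.
    expected = model_db.models_for(identity.title, identity.page)  # {word: [accepted variants]}

    # 1904/1908: recognize the spoken words in the captured audio and compare them.
    heard = recognize_speech(audio_chunk)  # hypothetical recognizer -> list of words
    results = []
    for target, spoken in zip(expected.keys(), heard):
        accepted = [a.lower() for a in expected[target]] + [target.lower()]
        results.append({"word": target, "correct": spoken.lower() in accepted})

    # 1910: hand per-word correctness to the display layer as visual cues.
    return {"cue": "highlight_words", "results": results}
```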
[0089] Figure 20 is an example implementation showing a user reading the words on the tangible interface object 120 and pointing with a pointing object 2002. In some implementations, the activity application 214 may track the location of the tip of the pointing object 2002 as the user reads the words and provide additional feedback in the virtual scene for those words. For example, as the user reads “Here’s a taste” and points to each of the words, the virtual scene can display those words as visual cues 124 and include either correct or incorrect highlighting, or another indication of correctness, as the user reads. The pointing object 2002 allows the user to point to each word as it is being read, similar to how a reading session is conducted with a teacher, and the activity application 214 can track each word. In further implementations, the pointing object 2002 can be used to call out specific words. For example, the activity application 214 can prompt with a question, such as “point to the word ‘taste’”, that is either displayed or played as a sound file, and the user can then use the pointing object 2002 to point to the correct word. The activity application 214 can then identify the location of the tip of the pointing object and determine if the location is correct for the prompt. The pointing object 2002 can also help the user call out specific words for additional assistance. For example, if the user struggles to pronounce the word “taste,” the user can point to or tap on the word “taste” on the tangible interface object 120, and the virtual scene can display a prompt or sound out the word “taste” in order to assist the user in learning the word. This interaction mimics how a teacher can interact with a user during a reading session. It should be understood that pointing is not limited to the pointing object 2002 and that the user can point with a finger or other tangible interface object 120 as well to stay immersed in the experience.
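Mapping the detected tip of the pointing object 2002 to a word on the page could be as simple as a point-in-rectangle test over known word bounding boxes, as in the sketch below; the bounding-box format and function names are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in page coordinates

def word_at_tip(tip: Tuple[int, int], word_boxes: Dict[str, Box]) -> Optional[str]:
    """Return the word whose bounding box contains the pointer tip, if any."""
    x, y = tip
    for word, (x0, y0, x1, y1) in word_boxes.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return word
    return None

def check_point_prompt(tip: Tuple[int, int], word_boxes: Dict[str, Box], prompted_word: str) -> bool:
    """True if the user pointed at the word named in a prompt such as 'point to the word taste'."""
    return word_at_tip(tip, word_boxes) == prompted_word
```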
[0090] In some implementations, the basic notion of the system 100 is that of a reading companion. Speech recognition is a component of this system: as the user speaks, the activity application 214 and/or the detection engine 212 may match the speech that is captured in the sound
stream to expected models of speech. In some implementations, the sound models may be based on words and phrases in a set of books. As the user reads a page from a book, the expected models will match to the speech that is being captured and compared.
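One way to match a captured word sequence against the words expected on a page, and to flag skipped words, is a simple sequence alignment. The sketch below uses Python's difflib as an illustrative stand-in for the disclosed matching; the labels and lower-casing are assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

def align_reading(expected_words: List[str], heard_words: List[str]) -> List[Tuple[str, str]]:
    """Label each expected word as read or missed by aligning expected and heard sequences."""
    matcher = SequenceMatcher(a=[w.lower() for w in expected_words],
                              b=[w.lower() for w in heard_words])
    results = []
    for tag, a0, a1, _b0, _b1 in matcher.get_opcodes():
        for i in range(a0, a1):
            results.append((expected_words[i], "read" if tag == "equal" else "missed_or_misread"))
    return results

# Example: align_reading(["here", "is", "a", "taste"], ["here", "is", "taste"])
# marks "a" as missed_or_misread and the remaining words as read.
```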
[0091] In some implementations, a user will be grouped into a difficulty level associated with reading as an initial assessment is performed. In the initial assessment, the activity application 214 may capture the speech as the user interacts with tangible interface objects 120 and/or the virtual scene and may identify, based on errors or correct pronunciations, what difficulty level the user should initially be grouped into. For example, the assessment may identify that the user is struggling with the “long a” sound and may have the user speak the “long a” sound out loud and test to see how successful they are with it. If the user is making common mistakes at certain difficulty levels, then instruction and basic intervention are all that is required to correct those errors. At higher difficulty levels, if simple mistakes are occurring, the intervention may occur in the moment or after a page has been completed in order to correct and reinforce some of the lower-level skills that need attention.
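The initial assessment could be approximated by tallying error rates per phonics target (such as the “long a” sound) and mapping overall accuracy to a level, as in the sketch below; the thresholds, tags, and level names are illustrative assumptions.

```python
from collections import Counter

def assess_level(attempts):
    """attempts: iterable of (phonics_tag, correct) pairs, e.g., ("long_a", False)."""
    totals, errors = Counter(), Counter()
    for tag, correct in attempts:
        totals[tag] += 1
        if not correct:
            errors[tag] += 1
    accuracy = 1 - sum(errors.values()) / max(1, sum(totals.values()))
    # Skills where more than half the attempts failed are flagged for targeted practice.
    weak_skills = [t for t in totals if errors[t] / totals[t] > 0.5]
    if accuracy >= 0.9:
        level = "advanced"
    elif accuracy >= 0.7:
        level = "intermediate"
    else:
        level = "beginner"
    return {"level": level, "weak_skills": weak_skills}
```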
[0092] In some implementations, after the initial assessment, a profile for the skill level of the user may be built and grown. For example, users may be classified into a great group, a good group, or a poor group from the initial assessment, and reading material may be curated based on the group into which a user is placed. This allows the activity application 214 to provide specifically curated material in a way that creates a good experience for the user, instead of material that may be above or below the user’s level and would create a bad experience during reading.
[0093] In some implementations, as the user reads the words present on the tangible interface object 120, such as a book, and the pronunciation is compared to the expected models, different color schemes or other classifiers may be used to signal which word pronunciations are correct or incorrect. In some implementations, the amount of feedback may be determined based on the skill level profile of the user. For example, a poor reader may be offered less correction in the moment and provided regular feedback at the end. By comparison, for a reader at a good skill level, if a consistent mistake is being made that is not expected at that skill level, then immediate intervention may be performed in order to correct that mistake. In further implementations, the skill level the user is assigned may be factored in along with the focus of the lesson in order to determine which errors are corrected and which errors are ignored as the user reads the words of the book.
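The feedback policy described above, deferred summary feedback for weaker readers and immediate correction of unexpected errors for stronger readers, might be expressed as a small rule table; the colors and rules below are assumptions for illustration.

```python
def word_color(correct: bool) -> str:
    """Illustrative color scheme for per-word visual cues."""
    return "green" if correct else "red"

def feedback_timing(skill_level: str, error_expected_at_level: bool) -> str:
    """Decide whether to intervene now or summarize at the end of the page."""
    if skill_level == "poor":
        return "end_of_page"   # avoid interrupting a struggling reader mid-sentence
    if not error_expected_at_level:
        return "immediate"     # surprising mistake at this level: correct it now
    return "end_of_page"
```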
[0094] In some implementations, a user profile may be created. The user profile may, in some instances, initially gather information about the user, such as information provided by a parent or teacher, etc. The user profile may include information such as an age, grade, skill level, supervisor (parent, teacher, etc.) judgement of the user’s skill level, a reading of a recent book completed by the user, etc. In some implementations, the user profile may factor in any speech issues, such as a lisp, stutter, shyness, accent, etc. The user profile may identify these speech issues over time based on the pronunciations provided by the user. The user profile may use various machine learning algorithms to train the models, identify the speech issues, and adapt over time in order to continue to understand the user as various speech issues are modeled and identified. The user profile may track user performance data on a timeline. For example, the user profile may track when the user is having a rough day with reading, what days of the week the user does well with reading, the last time a new word was encountered while reading, etc.
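A user profile of the kind described in paragraph [0094] might be modeled as a small data structure that accumulates session records over time. The fields below are drawn from the examples in the text, while the exact structure, field names, and helper method are assumptions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class SessionRecord:
    day: date
    words_read: int
    errors: int
    new_words: List[str] = field(default_factory=list)

@dataclass
class UserProfile:
    age: Optional[int] = None
    grade: Optional[str] = None
    skill_level: str = "unassessed"
    speech_notes: List[str] = field(default_factory=list)  # e.g., "lisp", "stutter", "accent"
    history: List[SessionRecord] = field(default_factory=list)

    def last_new_word_date(self) -> Optional[date]:
        """When a new word was most recently encountered while reading."""
        days = [s.day for s in self.history if s.new_words]
        return max(days) if days else None
```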
[0095] In some implementations, the phonics values or fluency such as correct speed, cadence, expressions, etc. may be integrated into the user profile and the expected word models. Sight words may be used in order to characterize the profile and/or the expected word models. For example, using sight words, users may be categorized based on their reading level, which is essentially extrapolated from their skills, such as being able to recognize sight words, etc. The activity application 214 may categorize the user automatically based on the captured words spoken by the user during assessment. The activity application 214 may further update the categorization over time as the user progresses and either improves or struggles with various concepts.
[0096] In some implementations, based on the user profile, different types of encouragement and feedback may be provided to different users. Some users may respond better to more immediate feedback, while other users may respond better to focusing on only the issues in the lessons. In some implementations, when specific mispronunciations are detected, those mispronunciations may either be ignored until a future lesson or result in immediate feedback to correct that mispronunciation. In some implementations, the mispronunciation can be identified in real-time, and the user profile or other speech recognition algorithms can be used to predict not only what was said but also common misspeaks and the correct alternatives.
[0097] In some implementations, the speech recognition models may be trained over time for further accuracy. Behind the scenes, the recognition models may track common errors and tag specific words with common errors or specific mispronunciations. When these speech recognition models have improved in accuracy and have been updated, they can provide further benefits in assisting the user to read by more quickly identifying different mispronunciations and identifying in the background both why a mispronunciation happened and how to provide correction that will result in improvement as the user reads. In some implementations, these speech recognition models may be localized to allow for different pronunciations in different regions, such as a US localization and a UK localization, in order to account for different
language uses in those two regions. In some implementations, the speech recognition models may be trained to detect emotion in the words spoken by the user. For example, the speech recognition models may detect one or more of whether a reading cadence of the user is slowing down, whether the user sounds alert and engaged, whether the user’s tone is dominant and in control during reading, etc.
[0098] This technology yields numerous advantages including, but not limited to, providing a low-cost alternative for developing a nearly limitless range of applications that blend both physical and digital mediums by reusing existing hardware (e.g., camera) and leveraging novel lightweight detection and recognition algorithms, having low implementation costs, being compatible with existing computing device hardware, operating in real-time to provide for a rich, real-time virtual experience, processing numerous (e.g., >15, >25, >35, etc.) tangible interface object(s) 120 and/or an interaction simultaneously without overwhelming the computing device, recognizing tangible interface object(s) 120 and/or an interaction (e.g., such as a wand interacting with the physical activity scene) with substantially perfect recall and precision (e.g., 99% and 99.5%, respectively), being capable of adapting to lighting changes and wear and imperfections in tangible interface object(s) 120, providing a collaborative tangible experience between users in disparate locations, being intuitive to setup and use even for young users (e.g., 3+ years old), being natural and intuitive to use, and requiring few or no constraints on the types of tangible interface object(s) 120 that can be processed.
[0099] It should be understood that the above-described example activities are provided by way of illustration and not limitation and that numerous additional use cases are contemplated and encompassed by the present disclosure. In the above description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, it should be understood that the technology described herein may be practiced without these specific details. Further, various systems, devices, and structures are shown in block diagram form in order to avoid obscuring the description. For instance, various implementations are described as having particular hardware, software, and user interfaces. However, the present disclosure applies to any type of computing device that can receive data and commands, and to any peripheral devices providing services.
[0100] In some instances, various implementations may be presented herein in terms of algorithms and symbolic representations of operations on data bits within a computer memory. An algorithm is here, and generally, conceived to be a self-consistent set of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals
capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
[0101] It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout this disclosure, discussions utilizing terms including “processing,” “computing,” “calculating,” “determining,” “displaying,” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system’s registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
[0102] Various implementations described herein may relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, including, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, flash memories including USB keys with non-volatile memory, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
[0103] The technology described herein can take the form of a hardware implementation, a software implementation, or implementations containing both hardware and software elements. For instance, the technology may be implemented in software, which includes but is not limited to firmware, resident software, microcode, etc. Furthermore, the technology can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any non-transitory storage apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. [0104] A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories that provide temporary storage of at least some
program code in order to reduce the number of times code must be retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
[0105] Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems, storage devices, remote printers, etc., through intervening private and/or public networks. Wireless (e.g., Wi-Fi™) transceivers, Ethernet adapters, and modems, are just a few examples of network adapters. The private and public networks may have any number of configurations and/or topologies. Data may be transmitted between these devices via the networks using a variety of different communication protocols including, for example, various Internet layer, transport layer, or application layer protocols. For example, data may be transmitted via the networks using transmission control protocol / Internet protocol (TCP/IP), user datagram protocol (UDP), transmission control protocol (TCP), hypertext transfer protocol (HTTP), secure hypertext transfer protocol (HTTPS), dynamic adaptive streaming over HTTP (DASH), real-time streaming protocol (RTSP), real-time transport protocol (RTP) and the real-time transport control protocol (RTCP), voice over Internet protocol (VOIP), file transfer protocol (FTP), WebSocket (WS), wireless access protocol (WAP), various messaging protocols (SMS, MMS, XMS, IMAP, SMTP, POP, WebDAV, etc.), or other known protocols.
[0106] Finally, the structure, algorithms, and/or interfaces presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method blocks. The required structure for a variety of these systems will appear from the description above. In addition, the specification is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the specification as described herein.
[0107] The foregoing description has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the specification to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the disclosure be limited not by this detailed description, but rather by the claims of this application. As will be understood by those familiar with the art, the specification may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Likewise, the particular naming and division of the modules, routines, features, attributes,
methodologies and other aspects are not mandatory or significant, and the mechanisms that implement the specification or its features may have different names, divisions and/or formats. [0108] Furthermore, the modules, routines, features, attributes, methodologies and other aspects of the disclosure can be implemented as software, hardware, firmware, or any combination of the foregoing. Also, wherever an element, an example of which is a module, of the specification is implemented as software, the element can be implemented as a standalone program, as part of a larger program, as a plurality of separate programs, as a statically or dynamically linked library, as a kernel loadable module, as a device driver, and/or in every and any other way known now or in the future. Additionally, the disclosure is in no way limited to implementation in any specific programming language, or for any specific operating system or environment. Accordingly, the disclosure is intended to be illustrative, but not limiting, of the scope of the subject matter set forth in the following claims.
Claims
1. A method comprising: capturing, using a video capture device associated with a computing device, a video stream of a physical activity scene including a tangible interface object; capturing, using an audio capture device associated with the computing device, an audio stream of an environment around the audio capture device, the audio stream including a pronunciation of a word by a user; comparing the captured audio stream including the pronunciation of the word to an expected sound model; and displaying a visual cue on a display screen of the computing device based on the comparison.
2. The method of claim 1, further comprising: determining, using a processor of the computing device, an identity of the tangible interface object; and wherein, the expected sound model is based on the identity of the tangible interface object.
3. The method of claim 2, wherein the tangible interface object is a book and determining the identity of the tangible interface object includes determining a title of the book.
4. The method of claim 3, wherein the audio stream that includes the pronunciation of the word by the user is a pronunciation of the word from the book.
5. The method of claim 3, wherein the expected sound model is from a database of sound models associated with a specific page of the book.
6. The method of claim 1, wherein the visual cue is a virtual depiction of the pronunciation of the word.
7. The method of claim 1, wherein the visual cue further comprises a highlighting effect that indicates a correctness of the pronunciation of the word.
8. The method of claim 1, further comprising: determining, using a processor of the computing device, that an expected word was missed by the user during the pronunciation of the word based on the comparison between the
captured audio stream including the pronunciation of the word and the expected sound model; and displaying an additional visual cue indicating the expected word that was missed.
9. The method of claim 1, further comprising: determining, using a processor of the computing device, a correctness of the pronunciation of the word based on the comparison; and categorizing, using the processor of the computing device, the user based on the correctness.
10. A reading session system comprising: a stand configured to position a computing device having one or more processors; a video capture device configured to capture a video stream of a physical activity scene, the video stream including a tangible interface object in the physical activity scene; an audio capture device configured to capture an audio stream of an environment around the audio capture device, the audio stream including a pronunciation of a word by a user; an activity application executable by the one or more processors to compare the captured audio stream including the pronunciation of the word to an expected sound model; and a display configured to display a visual cue based on the comparison.
11. The reading session system of claim 10, further comprising: a detector executable by the one or more processors to determine an identity of the tangible interface object; and wherein, the expected sound model is based on the identity of the tangible interface object.
12. The reading session system of claim 11, wherein the tangible interface object is a book and determining the identity of the tangible interface object includes determining a title of the book.
13. The reading session system of claim 12, wherein the audio stream that includes the pronunciation of the word by the user is a pronunciation of the word from the book.
14. The reading session system of claim 12, wherein the expected sound model is from a database of sound models associated with a specific page of the book.
15. The reading session system of claim 10, wherein the visual cue is a virtual depiction of the pronunciation of the word.
16. The reading session system of claim 10, wherein the visual cue further comprises a highlighting effect that indicates a correctness of the pronunciation of the word.
17. The reading session system of claim 10, wherein the activity application is further configured to determine that an expected word was missed by the user during the pronunciation of the word based on the comparison between the captured audio stream including the pronunciation of the word and the expected sound model and the display is further configured to display an additional visual cue indicating the expected word was missed.
18. The reading session system of claim 10, wherein the activity application is further configured to determine a correctness of the pronunciation of the word based on the comparison and categorize the user based on the correctness.
19. A method comprising: capturing, using a video capture device associated with a computing device, a video stream of a physical activity scene including a book with a group of visible words; capturing, using an audio capture device associated with the computing device, an audio stream including a pronunciation of the group of visible words; determining, using a processor of the computing device, an identity of a visible page of the book from the captured video stream; retrieving, using the processor of the computing device, a group of expected sound models based on the identity of the visible page of the book; comparing, using the processor of the computing device, the captured audio stream including the pronunciation of the group of visible words to the group of expected sound models; determining, using the processor of the computing device, a correctness of the pronunciations of the group of visible words by determining which pronunciations of the group of visible words from the captured audio stream exceed a matching threshold with a sound model from the group of expected sound models based on the comparison; and displaying, on a display of the computing device, the correctness of the pronunciations of the group of visible words as visual cues.
20. The method of claim 19, wherein displaying the correctness of the pronunciations of the group of visible words as visual cues includes displaying a virtual representation of the visible words with highlighting indicating which of the pronunciations of the visible words were correct and which of the pronunciations of the visible words were incorrect.
21. The method of claim 19, wherein determining the identity of the visible page of the book from the captured video stream includes detecting a reference marker on the visible page of the book from the captured video stream and determining an identity of the reference marker.
22. The method of claim 19, wherein determining, using the processor of the computing device, a correctness of the pronunciations of the group of visible words further comprises displaying, on the display of the computing device, a representation of each of the pronunciations of the group of visible words after being compared to the expected sound models.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202163136512P | 2021-01-12 | 2021-01-12 | |
| US63/136,512 | 2021-01-12 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022155259A1 true WO2022155259A1 (en) | 2022-07-21 |
Family
ID=82448650
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2022/012197 Ceased WO2022155259A1 (en) | 2021-01-12 | 2022-01-12 | Audible textual virtualization |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2022155259A1 (en) |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070011011A1 (en) * | 1999-07-16 | 2007-01-11 | Cogliano Mary A | Interactive book |
| US9317486B1 (en) * | 2013-06-07 | 2016-04-19 | Audible, Inc. | Synchronizing playback of digital content with captured physical content |
| US20150138385A1 (en) * | 2013-11-18 | 2015-05-21 | Heekwan Kim | Digital annotation-based visual recognition book pronunciation system and related method of operation |
| US20150356881A1 (en) * | 2014-06-04 | 2015-12-10 | Andrew Butler | Phonics Exploration Toy |
| US20170017642A1 (en) * | 2015-07-17 | 2017-01-19 | Speak Easy Language Learning Incorporated | Second language acquisition systems, methods, and devices |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20240031688A1 (en) | Enhancing tangible content on physical activity surface | |
| US11871109B2 (en) | Interactive application adapted for use by multiple users via a distributed computer-based system | |
| US9031493B2 (en) | Custom narration of electronic books | |
| US8793118B2 (en) | Adaptive multimodal communication assist system | |
| US20200387276A1 (en) | Virtualization of physical activity surface | |
| CN110162164B (en) | Augmented reality-based learning interaction method, device and storage medium | |
| US11314403B2 (en) | Detection of pointing object and activity object | |
| US20150016801A1 (en) | Information processing device, information processing method and program | |
| US20170017642A1 (en) | Second language acquisition systems, methods, and devices | |
| KR20190130774A (en) | Subtitle processing method for language education and apparatus thereof | |
| US20140278428A1 (en) | Tracking spoken language using a dynamic active vocabulary | |
| WO2022155259A1 (en) | Audible textual virtualization | |
| CN111031232B (en) | Dictation real-time detection method and electronic equipment | |
| KR20170009487A (en) | Chunk-based language learning method and electronic device to do this | |
| US20200233503A1 (en) | Virtualization of tangible object components | |
| KR20140101548A (en) | Apparatus and method for learning word by using link example sentence. | |
| CN111159433A (en) | Content positioning method and electronic equipment | |
| KR20140082127A (en) | Apparatus and method for learning word by using native speaker's pronunciation data and origin of a word | |
| Harvey | A Guiding Hand: Helping the Visually Impaired Locate Objects | |
| KR20140074457A (en) | Apparatus and method for language education by using native speaker's pronunciation data and thoughtunit and image data | |
| KR20140109551A (en) | Apparatus and method for learning foreign language by using augmented reality | |
| NZ792078A (en) | Interactive application adapted for use by multiple users via a distributed computer-based system | |
| KR20140087952A (en) | A word learning apparatus and method using pronunciation data of word origin and native speaker | |
| KR20140078077A (en) | Apparatus and method for language education by using native speaker's pronunciation data and thoughtunit | |
| KR20140087949A (en) | Apparatus and method for learning word by using augmented reality |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22740014 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 22740014 Country of ref document: EP Kind code of ref document: A1 |