WO2017124116A1 - Multimedia search, supplementation and exploration - Google Patents
Multimedia search, supplementation and exploration
- Publication number: WO2017124116A1 (international application PCT/US2017/013829)
- Authority: WO (WIPO, PCT)
- Prior art keywords: media, user, content, computer-implemented method
- Prior art date
- Legal status: Ceased (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/44—Browsing; Visualisation therefor
- G06F16/447—Temporal browsing, e.g. timeline
Definitions
- a media file may be identified from a plurality of media files based upon a name of the media file.
- Media output may be navigated with an interface used to access different parts of the corresponding media file.
- Content may be played before or after the media output.
- one or more computing devices and/or methods for searching media are provided.
- a query for the media, comprising a first term, may be received.
- a first result and a second result may be identified in time-associated information (e.g., a transcript) of the media based upon a determination that the first result comprises a first match of the first term and the second result comprises a second match of the first term.
- the first result and the second result may be provided (e.g., for presentation) based upon a first temporal property of the first match of the first term in the first result and a second temporal property of the second match of the first term in the second result.
- one or more computing devices and/or methods for supplementing media (e.g., a movie) with content are provided.
- the media may be segmented into a first portion (e.g., a first scene of the movie) and a second portion (e.g., a second scene of the movie) based upon time-associated text information (e.g., a transcript) of the media.
- the time-associated text information of the media may be analyzed to determine a first context (e.g., entertainment) for the first portion and a second context (e.g., business) for the second portion.
- a first content (e.g., an advertisement for an entertainment-related product) may be selected from a plurality of contents for the first portion based upon the first context and a second content (e.g., an advertisement for a business-related product) may be selected from the plurality of contents for the second portion based upon the second context.
- the first portion of the media may be supplemented (e.g., overlaid, interrupted, etc.) with the first content and the second portion of the media may be supplemented (e.g., overlaid, interrupted, etc.) with the second content.
- a first content may be selected from a plurality of contents for the video.
- a first area (e.g., a top area, a bottom area, a side area, etc.) may be selected from a plurality of areas in the video based upon image analysis of the video.
- the video may be supplemented (e.g., overlaid) with the first content at the first area.
- a first content may be selected from a plurality of contents for the video.
- the video may be supplemented with the first content.
- One or more properties (e.g., color, transparency, size, duration, etc.) of the first content may be adjusted based upon image analysis of the video.
- one or more computing devices and/or methods for generating a representation of a performance are provided.
- a request to implement a (e.g., karaoke) performance with a first user and a second user may be received.
- a determination may be made that the first user is associated with a first type of participation (e.g., singing) in the performance.
- a determination may be made that the second user is associated with a second type of participation (e.g., dancing) in the performance.
- a first content (e.g., a first version of a song) may be selected from a plurality of contents for the first user based upon the first type of participation and a second content (e.g., a second version of the song) may be selected from the plurality of contents for the second user based upon the second type of participation.
- the first content may be provided to the first user and the second content may be provided to the second user.
- a first signal (e.g., comprising audio of the first user singing) may be received from the first user in association with the performance and a second signal (e.g., comprising video of the second user dancing) may be received from the second user in association with the performance.
- a representation of the performance may be generated based upon a combination of the first signal, the second signal, the first content and the second content.
- one or more computing devices and/or methods for navigating through media are provided.
- a request to move (e.g., drag) a control along a first axis from a first portion of the first axis to a second portion of the first axis may be received.
- the media may be navigated through at a first rate of advancement (e.g. a first temporal resolution) based upon a first feature of the first portion.
- the media may be navigated through at a second rate of advancement (e.g. a second temporal resolution) based upon a second feature of the second portion.
- the first rate of advancement may be different than the second rate of advancement.
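The variable-rate navigation described above can be illustrated with a short sketch. The portion boundaries, the rates of advancement, and the function names below are illustrative assumptions rather than details from the disclosure; the point is only that a control position maps to a media time at a portion-specific temporal resolution.

```python
# Hypothetical axis portions: (start_fraction, end_fraction, media_seconds_per_full_axis_unit).
# The first half of the control advances coarsely, the second half finely (assumed values).
PORTIONS = [
    (0.0, 0.5, 600.0),  # first portion: coarse rate of advancement
    (0.5, 1.0, 60.0),   # second portion: fine rate of advancement
]

def control_to_media_time(position: float) -> float:
    """Convert a control position in [0, 1] to a media time in seconds,
    advancing at a different rate in each portion of the axis."""
    position = min(max(position, 0.0), 1.0)
    time_s = 0.0
    for start, end, rate in PORTIONS:
        if position <= start:
            break
        time_s += (min(position, end) - start) * rate
    return time_s

if __name__ == "__main__":
    for p in (0.25, 0.5, 0.75, 1.0):
        print(f"control at {p:.2f} -> media time {control_to_media_time(p):.0f} s")
```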
- FIG. 1 is an illustration of a scenario involving various examples of networks that may connect servers and clients.
- FIG. 2 is an illustration of a scenario involving an example configuration of a server that may utilize and/or implement at least a portion of the techniques presented herein.
- FIG. 3 is an illustration of a scenario involving an example configuration of a client that may utilize and/or implement at least a portion of the techniques presented herein.
- Fig. 4A is a flow chart illustrating an example method for searching media.
- Fig. 4B is a component block diagram illustrating an example system for searching media.
- Fig. 5A is a flow chart illustrating an example method for supplementing media with content.
- Fig. 5B is a component block diagram illustrating an example system for supplementing media with content.
- Fig. 5C is a flow chart illustrating an example method for supplementing a video with content.
- Fig. 5D is a flow chart illustrating an example method for supplementing a video with content.
- Fig. 6A is a flow chart illustrating an example method for generating a representation of a performance.
- Fig. 6B is a component block diagram illustrating an example system for generating a representation of a performance.
- Fig. 7A is a flow chart illustrating an example method for navigating through media.
- Fig. 7B is a component block diagram illustrating an example system for navigating through media.
- FIG. 8 is an illustration of a scenario featuring an example non-transitory machine readable medium in accordance with one or more of the provisions set forth herein.
- FIG. 9 is an illustration of a disclosed embodiment.
- Fig. 10 is an illustration of a disclosed embodiment.
- FIG. 11 is an illustration of a disclosed embodiment.
- Fig. 12 is an illustration of a disclosed embodiment.
- Fig. 13 is an illustration of a disclosed embodiment.
- Fig. 14 is an illustration of a disclosed embodiment.
- Fig. 15 is an illustration of a disclosed embodiment.
- Fig. 16 is an illustration of a disclosed embodiment.
- Fig. 17 is an illustration of a disclosed embodiment.
- Fig. 18 is an illustration of a disclosed embodiment.
- Fig. 19 is an illustration of a disclosed embodiment.
- Fig. 20 is an illustration of a disclosed embodiment.
- Fig. 21 is an illustration of a disclosed embodiment.
- Fig. 22 is an illustration of a disclosed embodiment.
- Fig. 23 is an illustration of a disclosed embodiment.
- Fig. 24 is an illustration of a disclosed embodiment.
- Fig. 25 is an illustration of a disclosed embodiment.
- Fig. 26 is an illustration of a disclosed embodiment.
- Fig. 27 is an illustration of a disclosed embodiment.
- Fig. 28 is an illustration of a disclosed embodiment.
- Fig. 29 is an illustration of a disclosed embodiment.
- Fig. 30 is an illustration of a disclosed embodiment.
- Fig. 31 is an illustration of a disclosed embodiment.
- Fig. 32 is an illustration of a disclosed embodiment.
- Fig. 33 is an illustration of a disclosed embodiment.
- Fig. 34 is an illustration of a disclosed embodiment.
- Fig. 35 is an illustration of a disclosed embodiment.
- Fig. 36 is an illustration of a disclosed embodiment.
- Fig. 37 is an illustration of a disclosed embodiment.
- Fig. 38 is an illustration of a disclosed embodiment.
- Fig. 39 is an illustration of a disclosed embodiment.
- Fig. 40 is an illustration of a disclosed embodiment.
- Fig. 41 is an illustration of a disclosed embodiment.
- Fig. 42 is an illustration of a disclosed embodiment.
- Fig. 43 is an illustration of a disclosed embodiment.
- Fig. 44 is an illustration of a disclosed embodiment.
- Fig. 45 is an illustration of a disclosed embodiment.
- Fig. 46 is an illustration of a disclosed embodiment.
- Fig. 47 is an illustration of a disclosed embodiment.
- Fig. 48 is an illustration of a disclosed embodiment.
- Fig. 49 is an illustration of a disclosed embodiment.
- Fig. 50 is an illustration of a disclosed embodiment.
- Fig. 51 is an illustration of a disclosed embodiment.
- Fig. 52 is an illustration of a disclosed embodiment.
- Fig. 53 is an illustration of a disclosed embodiment.
- Fig. 54 is an illustration of a disclosed embodiment.
- Fig. 55 is an illustration of a disclosed embodiment.
- Fig. 56 is an illustration of a disclosed embodiment.
- Fig. 57 is an illustration of a disclosed embodiment.
- Fig. 58 is an illustration of a disclosed embodiment.
- Fig. 59 is an illustration of a disclosed embodiment.
- Fig. 62 is an illustration of a disclosed embodiment.
- Fig. 63 is an illustration of a disclosed embodiment.
- Fig. 64 is an illustration of a disclosed embodiment.
- Fig. 65 is an illustration of a disclosed embodiment.
- Fig. 66 is an illustration of a disclosed embodiment.
- Fig. 67 is an illustration of a disclosed embodiment.
- Fig. 68 is an illustration of a disclosed embodiment.
- Fig. 69 is an illustration of a disclosed embodiment.
- Fig. 70 is an illustration of a disclosed embodiment.
- Fig. 71 is an illustration of a disclosed embodiment.
- Fig. 72 is an illustration of a disclosed embodiment.
- Fig. 73 is an illustration of a disclosed embodiment.
- Fig. 1 is an interaction diagram of a scenario 100 illustrating a service 102 provided by a set of servers 104 to a set of client devices 110 via various types of networks.
- the servers 104 and/or client devices 110 may be capable of transmitting, receiving, processing, and/or storing many types of signals, such as in memory as physical memory states.
- the servers 104 of the service 102 may be internally connected via a local area network 106 (LAN), such as a wired network where network adapters on the respective servers 104 are interconnected via cables, such as coaxial and/or fiber optic cabling, for example, and may be connected in various topologies, such as buses, token rings, meshes, and/or trees, for example.
- the servers 104 may utilize one or more physical networking protocols, such as Ethernet and/or Fiber Channel, and/or logical networking protocols, such as variants of the Internet Protocol (IP), the Transmission Control Protocol (TCP), and/or the User Datagram Protocol (UDP).
- the servers 104 may be interconnected directly, or through one or more other networking devices, such as routers, switches, and/or repeaters.
- the local area network 106 may be organized according to one or more network architectures, such as server/client, peer-to-peer, and/or mesh architectures, and/or one or more roles, such as administrative servers, authentication servers, security monitor servers, data stores for objects such as files and databases, business logic servers, time synchronization servers, and/or front-end servers providing a user- facing interface for the service 102.
- the local area network 106 may include, e.g., analog telephone lines, such as a twisted wire pair, a coaxial cable, full or fractional digital lines including T1, T2, T3, or T4 type lines, Integrated Services Digital Networks (ISDNs), Digital Subscriber Lines (DSLs), wireless links including satellite links, or other communication links.
- the local area network 106 may comprise one or more subnetworks, such as may employ differing architectures, may be compliant or compatible with differing protocols and/or may interoperate within the local area network 106. Additionally, one or more local area networks 106 may be interconnected; e.g., a router may provide a link between otherwise separate and independent local area networks 106.
- the local area network 106 of the service 102 may be connected to a wide area network 108 (WAN) that allows the service 102 to exchange data with other services 102 and/or client devices 110.
- the wide area network 108 may encompass various combinations of devices with varying levels of distribution and exposure, such as a public wide-area network, such as the Internet, and/or a private network, such as a virtual private network (VPN) of a distributed enterprise.
- the service 102 may be accessed via the wide area network 108 by a user 112 of one or more client devices 110, such as a portable media player, such as an electronic text reader, an audio device, or a portable gaming, exercise, or navigation device; a portable communication device, such as a camera, a phone, a wearable or a text chatting device; a workstation; and/or a laptop form factor computer.
- One or more client devices 110 may communicate with the service 102 via various connections to the wide area network 108.
- one or more client devices 110 may comprise a cellular communicator and may communicate with the service 102 by connecting to the wide area network 108 via a wireless local area network 106 provided by a cellular provider.
- one or more client devices 110 may communicate with the service 102 by connecting to the wide area network 108 via a wireless local area network 106 provided by a location such as the user's home or workplace, such as a WiFi (Institute of Electrical and Electronics Engineers (IEEE) Standard 802.11) network or a Bluetooth (IEEE Standard 802.15.1) personal area network.
- the servers 104 and the client devices 110 may communicate over various types of networks.
- Other types of networks that may be accessed by the servers 104 and/or client devices 110 include mass storage, such as network attached storage (NAS), a storage area network (SAN), and/or other forms of computer or machine readable media.
- FIG. 2 presents a schematic architecture diagram 200 of a server 104 that may utilize at least a portion of the techniques provided herein.
- a server 104 may vary widely in configuration or capabilities, alone or in conjunction with other servers, in order to provide a service such as the service 102.
- the server 104 may comprise one or more processors 210 that process instructions.
- the one or more processors 210 may optionally include a plurality of cores; one or more coprocessors, such as a mathematics coprocessor or an integrated graphical processing unit (GPU); and/or one or more layers of local cache memory.
- the server 104 may comprise memory 202 storing various forms of applications, such as an operating system 204; one or more server applications 206, such as a hypertext transport protocol (HTTP) server, a file transfer protocol (FTP) server, or a simple mail transport protocol (SMTP) server; and/or various forms of data, such as a database 208 or a file system.
- the server 104 may comprise one or more peripheral components, such as a wired and/or wireless network adapter 214 connectible to a local area network and/or wide area network; one or more storage components 216, such as a hard disk drive, a solid-state storage device (SSD), a flash memory device, and/or a magnetic and/or optical disk reader.
- the server 104 may comprise a mainboard featuring one or more communication buses 212 that interconnect the processor 210, the memory 202, and various peripherals, using one or more bus technologies, such as a variant of a serial or parallel AT Attachment (ATA) bus protocol; a Universal Serial Bus (USB) protocol; and/or a Small Computer System Interface (SCSI) bus protocol.
- a communication bus 212 may interconnect the server 104 with at least one other server (e.g., in a multibus scenario).
- Other components that may be included with the server 104 include a display; a display adapter, such as a graphical processing unit (GPU); input peripherals, such as a keyboard and/or mouse; and a flash memory device that may store a basic input/output system (BIOS) routine that facilitates booting the server 104 to a state of readiness.
- the server 104 may operate in various physical enclosures, such as a desktop or tower, and/or may be integrated with a display as an "all-in-one" device.
- the server 104 may be mounted horizontally and/or in a cabinet or rack, and/or may simply comprise an interconnected set of components.
- the server 104 may comprise a dedicated and/or shared power supply 218 that supplies and/or regulates power for the other components.
- the server 104 may provide power to and/or receive power from another server and/or other devices.
- the server 104 may comprise a shared and/or dedicated climate control unit 220 that regulates climate properties, such as temperature, humidity, and/or airflow. Servers 104 may be configured and/or adapted to utilize at least a portion of the techniques presented herein.
- Fig. 3 presents a schematic architecture diagram 300 of a client device 110 whereupon at least a portion of the techniques presented herein may be implemented.
- a client device 110 may vary widely in configuration or capabilities in order to provide one or more functionalities to a user such as the user 112.
- the client device 110 may serve the user in one or more roles, such as a workstation, kiosk, media player, gaming device, and/or appliance.
- the client device 110 may comprise one or more processors 310 that process instructions.
- the one or more processors 310 may optionally include a plurality of cores; one or more coprocessors, such as a mathematics coprocessor or an integrated graphical processing unit (GPU); and/or one or more layers of local cache memory.
- the client device 110 may comprise memory 301 storing various forms of applications, such as an operating system 303; one or more user applications 302, such as document applications, media applications, file and/or data access applications, communication applications such as web browsers and/or email clients, utilities, and/or games; and/or drivers for various peripherals.
- the client device 110 may comprise one or more peripheral components, such as a wired and/or wireless network adapter 306 connectible to a local area network and/or wide area network; one or more output components, such as a display 308 coupled with a display adapter (optionally including a graphical processing unit (GPU)), a sound adapter coupled with a speaker, and/or a printer; input devices for receiving input from the user, such as a keyboard 311, a mouse, a microphone, a camera, and/or a touch-sensitive component of the display 308; and/or environmental sensors, such as a global positioning system (GPS) receiver 319 that detects the location, velocity, and/or acceleration of the client device 110, and/or a compass, accelerometer, and/or gyroscope that detects a physical orientation of the client device 110.
- Other components that may optionally be included with the client device 110 include one or more storage components, such as a hard disk drive, a solid-state storage device (SSD), a flash memory device, and/or a magnetic and/or optical disk reader; and/or a flash memory device that may store a basic input/output system (BIOS) routine that facilitates booting the client device 110 to a state of readiness; and a climate control unit that regulates climate properties, such as temperature, humidity, and airflow.
- the client device 110 may be provided in one or more form factors, such as a desktop or tower workstation; an "all-in-one" device integrated with a display 308; a laptop, tablet, convertible tablet, or palmtop device; a wearable device mountable in a headset, eyeglass, earpiece, and/or wristwatch, and/or integrated with an article of clothing; and/or a component of a piece of furniture, such as a tabletop, and/or of another device, such as a vehicle or residence.
- the client device 110 may comprise a dedicated and/or shared power supply 318 that supplies and/or regulates power for other components, and/or a battery 304 that stores power for use while the client device 110 is not connected to a power source via the power supply 318.
- the client device 110 may provide power to and/or receive power from other client devices.
- the client device 110 may include one or more servers that may locally serve the client device 110 and/or other client devices of the user 112 and/or other individuals.
- a locally installed webserver may provide web content in response to locally submitted web requests.
- client devices 110 may be configured and/or adapted to utilize at least a portion of the techniques presented herein.
- the client device 110 may comprise a mainboard featuring one or more communication buses 312 that interconnect the processor 310, the memory 301, and various peripherals, using one or more bus technologies, such as a variant of a serial or parallel AT Attachment (ATA) bus protocol; the Universal Serial Bus (USB) protocol; and/or the Small Computer System Interface (SCSI) bus protocol.
- Descriptive content, in the form of signals or stored physical states within memory (e.g., an email address, instant messenger identifier, phone number, postal address, message content, date, and/or time), may be stored by the client device 110, typically along with contextual content.
- Contextual content may identify circumstances surrounding receipt of a phone number (e.g., the source of the phone number, such as a communication received from another user via an instant messenger application, or the date or time that the phone number was received), and may be associated with descriptive content.
- Contextual content may, for example, be used to subsequently search for associated descriptive content. For example, a search for phone numbers received from specific individuals, received via an instant messenger application or at a given date or time, may be initiated.
- One or more computing devices and/or techniques for searching media, supplementing media with content, supplementing a video with content, generating a representation of a performance and/or navigating through media are provided.
- a server, such as that of an online media content publisher, may serve to host media received from a user of the server such that the hosted media may be accessed by a plurality of users.
- the media may be difficult to find from amongst many (e.g., hundreds, thousands, millions, etc.) of other media. If the media has a length exceeding a threshold (e.g., 1 hour), identifying a portion of the media (e.g., a scene) that is of interest to a viewer may be difficult.
- the media may have to be identified based upon a file name, a title and/or a description of the media. It may be appreciated that a scene in media may not be described in the title, the file name, the description, etc.
- Viewing and/or listening to the media in its entirety may consume a significant amount of time and resources of the server and the user.
- Fast forwarding and/or rewinding through the media using some techniques may also consume a significant amount of time and resources while still possibly resulting in the viewer overlooking the portion of the media that is of interest.
- Supplementing the media using some techniques may interrupt, distract from and/or otherwise interfere with a desired experience of the viewer.
- media may be searched, supplemented and/or navigated through in a manner that is efficient, convenient, effective and/or timely.
- An embodiment of searching media is illustrated by an example method 400 of Fig. 4A.
- a user, such as user Jill (e.g., and/or a device associated with the user), may access and/or interact with a website, an application, etc. that provides a platform for searching the media using a server (e.g., of the website, the application, etc.).
- the server may host uploaded media, and the website may provide access to view and/or hear the uploaded media to an audience.
- a query for the media, comprising at least a first term, may be received (e.g., by the server and/or from the user).
- the media may comprise video, audio, an image and/or a document.
- the media may comprise a video file, a movie, a television show, an audiobook, a podcast, radio, a soundtrack, a song recording, a voice memo, a voicemail, virtual reality content, augmented reality content, videochat streaming, financial data, stock indexes, security prices, financial transaction data, financial statements, a balance sheet, an income statement, a statement of changes in equity, a cash flow statement, physiological data, medical data, medical sensor data, vital sign data, electrophysiological data, medical images, medical lab analysis data, industrial data, security data and/or military data.
- a first result may be identified in time-associated information (e.g., a transcript) of the media based upon a determination that the first result comprises a first match of the first term, and/or a second result may be identified in the time-associated information of the media based upon a determination that the second result comprises a second match of the first term.
- the time-associated information may comprise text information of the media, such as a transcript, a comment, a user annotation or a review associated with the media.
- the time-associated information may alternatively and/or additionally comprise non-text information of the media, such as audio (e.g., a soundtrack), an image (e.g., a frame), etc.
- One or more portions, terms, etc. of the time-associated information may be associated with one or more timestamps.
- the matches may comprise exact matches and/or non-exact matches between a term of the query and the time-associated information of the media.
- the identifying of the first result may be performed using a brute-force search, a satisfiability check, a temporal sliding window, clustering, unsupervised machine learning, supervised machine learning, reinforcement learning, deep learning and/or pre-indexing.
- the media may be transcribed (e.g., before 406, before 404 and/or after 404) to generate a transcript of the media, which may be at least some of the time-associated information.
- the transcript may include text representative of recognized objects (e.g., a computer), persons (e.g., the President) or other images (e.g., of a location, scenery, weather, etc.).
- the transcript may be in a first language (e.g., English), and may be translated to generate a second transcript of the media in a second language (e.g., German), which may be at least some of the time-associated information.
- One or more landmarks (e.g., entities, trademarks, names, brands, key phrases, indications of emphasis, etc.), tags, summaries and/or cross-references may be identified and extracted (e.g., before 406, before 404 and/or after 404) from the time-associated information (e.g., and stored in an index).
- the identifying of the first result (e.g., and/or one or more other results) may be performed using the index.
- the first result may be provided (e.g., for presentation) based upon a first temporal property of the first match of the first term in the first result and the second result may be provided (e.g., for presentation) based upon a second temporal property of the second match of the first term in the second result.
- the first result and the second result may be provided (e.g., for presentation) responsive to determining that the first temporal property and the second temporal property (e.g., individually, in combination, etc.) exceed a threshold temporal property.
- the first result may be provided (e.g., for presentation) in association with a higher rank than the second result based upon a comparison of the first temporal property and the second temporal property (e.g., the first result may be ranked higher based upon the first temporal property being greater than, less than, before and/or after the second temporal property).
- the query may comprise the first term and a second term.
- the first result may be identified based upon the determination that the first result comprises the first match of the first term and a determination that the first result comprises a first match of the second term
- the second result may be identified based upon the determination that the second result comprises the second match of the first term and a determination that the second result comprises a second match of the second term.
- the first result may be provided (e.g., for presentation) based upon the first temporal property of the first match of the first term in the first result and a third temporal property of the first match of the second term in the first result
- the second result may be provided (e.g., for presentation) based upon the second temporal property of the second match of the first term in the second result and a fourth temporal property of the second match of the second term in the second result.
- a first temporal distance of the first result may be determined based upon the first temporal property and the third temporal property.
- the first temporal distance may correspond to a difference between a first timestamp of the first match of the first term in the first result and a second timestamp of the first match of the second term in the first result.
- a second temporal distance of the second result may be determined based upon the second temporal property and the fourth temporal property.
- the second temporal distance may correspond to a difference between a third timestamp of the second match of the first term in the second result and a fourth timestamp of the second match of the second term in the second result.
- the first result and the second result may be provided (e.g., for presentation) responsive to determining that the first temporal distance (e.g., 5 seconds) and the second temporal distance (e.g., 10 seconds) are less than a threshold temporal distance (e.g., 12 seconds).
- results with an excessively large temporal distance between terms may be assumed as being unlikely to be the result sought by the user, and thus excluded from presentation to the user.
- the threshold temporal distance may be determined based upon the query.
- the query may comprise a value (e.g., 12 seconds) specifying the threshold temporal distance, or the value may be estimated based upon one or more non-numerical aspects of the query.
- the first result may be ranked (e.g., and correspondingly presented) higher than the second result responsive to determining that the first temporal distance of the first result is less than the second temporal distance of the second result. For example, a result with little temporal distance between terms may be determined to be more likely to be the result sought by the user than a result with a large temporal distance between terms.
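As a minimal sketch of the temporal-distance filtering and ranking just described, assume the time-associated information is a list of (timestamp, text) transcript entries; the function and field names and the 12-second default threshold below are illustrative assumptions, not details from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class Result:
    first_term_ts: float   # timestamp of the match of the first term (seconds)
    second_term_ts: float  # timestamp of the match of the second term (seconds)
    excerpt: str

    @property
    def temporal_distance(self) -> float:
        return abs(self.second_term_ts - self.first_term_ts)

def find_matches(transcript, term):
    """Return (timestamp, text) transcript entries containing the term (non-exact, case-insensitive)."""
    return [(ts, text) for ts, text in transcript if term.lower() in text.lower()]

def search(transcript, first_term, second_term, threshold_s=12.0):
    """Pair matches of the two terms, drop pairs whose temporal distance exceeds the
    threshold, and rank the remaining results by increasing temporal distance."""
    results = []
    for ts1, text1 in find_matches(transcript, first_term):
        for ts2, text2 in find_matches(transcript, second_term):
            result = Result(ts1, ts2, f"{text1} ... {text2}")
            if result.temporal_distance <= threshold_s:
                results.append(result)
    return sorted(results, key=lambda r: r.temporal_distance)

if __name__ == "__main__":
    transcript = [
        (12.0, "the gate opens at dawn"),
        (15.0, "bring the gold to the gate"),
        (90.0, "gold was found in the river"),
    ]
    for r in search(transcript, "gold", "gate"):
        print(f"{r.temporal_distance:4.1f} s apart: {r.excerpt}")
```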
- the providing of the first result and the second result may comprise providing the results for presentation (e.g., by providing instructions to a device of the user to display the results), or providing the results to another website, application, etc. for further processing.
- the media may comprise a soundtrack (e.g., a song), and the time-associated information of the media may comprise lyrics of the soundtrack.
- the first result and/or the second result (e.g., in response to being selected) may be used to provide a karaoke presentation of the soundtrack (e.g., from a timestamp of the first result and/or the second result).
- a list of index keys corresponding to the query may be generated.
- the list of index keys may comprise index keys corresponding to each result and associated with a corresponding portion (e.g., timestamp, segment, etc.) of the media.
- the list of index keys may comprise a first index key corresponding to the first result and associated with a first portion of the media, and a second index key corresponding to the second result and associated with a second portion of the media. When the first index key is selected, access may be provided to the first portion of the media.
- one or more frames of the media corresponding to a first timestamp of the first portion may be displayed, the media may be played from the first timestamp of the first portion and/or a portion of time-associated information (e.g., a transcript) corresponding to the first timestamp may be displayed.
- access may be provided to the second portion of the media.
- one or more frames of the media corresponding to a second timestamp of the second portion may be displayed, the media may be played from the second timestamp of the second portion and/or a portion of time-associated information (e.g., a transcript) corresponding to the second timestamp may be displayed.
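Under the same assumptions, the list of index keys could be as simple as a mapping from a key for each result to the timestamp of the corresponding portion of the media; selecting a key then yields the offset from which playback or transcript display can start. This is a sketch, not the disclosed data structure.

```python
def build_index_keys(result_timestamps):
    """Map a human-readable index key to the media timestamp (in seconds) of each result."""
    return {f"result-{i}": ts for i, ts in enumerate(result_timestamps, start=1)}

def select_index_key(index_keys, key):
    """Return the playback offset, in seconds, at which to start presenting the media."""
    return index_keys[key]

if __name__ == "__main__":
    keys = build_index_keys([12.0, 15.0, 90.0])
    print(keys)                                 # {'result-1': 12.0, 'result-2': 15.0, 'result-3': 90.0}
    print(select_index_key(keys, "result-2"))   # 15.0
```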
- a user may identify a portion of the media to view, edit, comment upon, etc. It should be appreciated that a user may identify a portion of first media from a plurality of media to view, edit, comment upon, etc.
- method 400 may be implemented in contexts other than a search performed by a user.
- a content allocator may submit the query to identify portions of the media with context matching a context of one or more contents, and to supplement the portions of the media with the one or more contents.
- Fig. 4B illustrates an example of a system 450 for searching media.
- Query component 410 may receive an indication of a media to search (e.g., the movie "Classic Ancient Warrior Movie"), and a query with one or more terms to find in the media (e.g., the terms "gold gate").
- Search component 412 may search through time-associated information of the media (e.g., a transcript of the movie) to identify results where the terms of the query are found.
- the search component 412 may find a first result comprising a first portion of the transcript of the movie comprising text with the first term of the query "gold" and the second term of the query "gate," and a second result comprising a second portion of the transcript of the movie comprising text with the first term of the query "gold" and the second term of the query "gate."
- Temporal property determination component 414 may determine one or more temporal properties of each of the results. For example, a first temporal distance (e.g., 10 seconds) between a match (e.g., same, similar, relevant, etc.) of the first term in the first result and a match (e.g., same, similar, relevant, etc.) of the second term in the first result may be measured (e.g., based upon a timestamp associated with the match of the first term in the first result, a timestamp associated with the match of the second term in the first result, and/or based upon an estimate based upon a number of words, characters, syllables, etc. between the matches).
- a second temporal distance (e.g., 5 seconds) between a match (e.g., same, similar, relevant, etc.) of the first term in the second result and a match (e.g., same, similar, relevant, etc.) of the second term in the second result may be measured.
- Ranking and presentation component 416 may rank the first result and the second result based upon temporal properties of each of the results. For example, the second result may be ranked higher than the first result responsive to determining that the second temporal distance (e.g., 5 seconds) of the second result is less than the first temporal distance (e.g., 10 seconds) of the first result. It should be appreciated that the ranking based on temporal distance may be given higher priority than other ranking considerations (e.g., how many words are between each of the matches).
- the ranking and presentation component 416 may present the ranked results with an excerpt of at least a portion of the time-associated information (e.g., transcript) corresponding to each result, a link to access the portion of the time-associated information corresponding to each result and/or a link to access (e.g., view, play, hear, etc.) the portion of the media corresponding to each result.
- a website, an application, etc. may provide a platform for supplementing the media with content (e.g., advertisements) using a server (e.g., of the website, the application, etc.).
- the media may comprise video, audio, an image, a document, virtual reality content and/or augmented reality content.
- the server may host uploaded media, and the website may provide access to view and/or hear the uploaded media to an audience.
- the media may be segmented into a first portion and a second portion based upon time-associated text information of the media. For example, the segmenting may be performed based upon identification of scene changes, topic changes, location changes, etc. in the time- associated text information of the media, and/or may be performed based upon timestamps.
- the first portion and the second portion may be of different, similar or equal length. It may be appreciated that the media may be segmented into any number of portions, such as three, four, five, or five hundred, and that each of the portions may be of different, similar or equal length. The number of portions for the media to be segmented into may be determined based upon a (e.g., default or user defined) desired length of each portion, or based upon a number of portions that is determined to be appropriate for the media based upon the time-associated text information.
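A plausible reading of this segmentation step, sketched below with assumed inputs, walks (timestamp, text) entries of the time-associated text information and starts a new portion at a scene-change marker or once a target portion length is exceeded; the "[SCENE]" marker convention and the 300-second default are assumptions, not part of the disclosure.

```python
TARGET_PORTION_LENGTH_S = 300.0  # assumed default desired length of each portion

def segment(entries, target_length_s=TARGET_PORTION_LENGTH_S):
    """Segment (timestamp, text) entries into portions at scene markers or length boundaries."""
    portions, current, portion_start = [], [], None
    for ts, text in entries:
        if portion_start is None:
            portion_start = ts
        starts_new_scene = text.startswith("[SCENE]")   # assumed scene-change marker
        too_long = ts - portion_start > target_length_s
        if current and (starts_new_scene or too_long):
            portions.append(current)
            current, portion_start = [], ts
        current.append((ts, text))
    if current:
        portions.append(current)
    return portions

if __name__ == "__main__":
    transcript = [
        (0.0, "[SCENE] a crowded racetrack"),
        (40.0, "the cars line up"),
        (400.0, "[SCENE] a quiet office"),
        (430.0, "quarterly numbers are discussed"),
    ]
    for i, portion in enumerate(segment(transcript), start=1):
        print(f"portion {i}: {portion[0][0]:.0f}s - {portion[-1][0]:.0f}s")
```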
- the time-associated text information of the media may be analyzed to determine a first context for the first portion and a second context for the second portion.
- a first portion of the time-associated text information that corresponds to the first portion of the media may be analyzed to determine the first context, which may be a first topic, a first theme, a first location, a first object, a first person, etc.
- a second portion of the time-associated text information that corresponds to the second portion of the media may be analyzed to determine the second context, which may be a second topic, a second theme, a second location, a second object, a second person, etc.
- the media may be transcribed (e.g., before 504, before 502 and/or after 502) to generate a transcript of the media, which may be at least some of the time-associated text information.
- a first content may be selected from a plurality of contents for the first portion based upon the first context
- a second content may be selected from the plurality of contents for the second portion based upon the second context.
- a first advertisement for a first product may be determined to be relevant to the first portion of the media, and may thus be selected for the first portion of the media
- a second advertisement for a second product may be determined to be relevant to the second portion of the media, and may thus be selected for the second portion of the media.
- the first content may be selected for the first portion responsive to determining that the first portion is content-compatible (e.g., based upon the first context) and/or the second content may be selected for the second portion responsive to determining that the second portion is content-compatible (e.g., based upon the second context).
- the time-associated text information of the media may be analyzed to determine a third context for a third portion of the media.
- the third portion may (e.g., selectively) not be supplemented with content responsive to determining that the third portion is content-incompatible (e.g., based upon the third context).
- Content-compatibility versus incompatibility may correspond to a likelihood of content being received without (versus with) negative reaction and/or without exceeding a level of distraction.
- the first portion may be determined to be content-compatible due to the first context being indicative of a racing scene
- the second portion may be determined to be content-compatible due to the second context being indicative of a funny scene
- the third portion may be determined to be content-incompatible due to the third context being indicative of a funeral scene.
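The context analysis and content selection described above might, in a simple form, score each portion's text against per-context keyword sets, skip portions whose context is deemed content-incompatible, and pick content matched to the remaining contexts. The keyword lists, context names, and content identifiers below are purely illustrative assumptions.

```python
# Illustrative keyword-based context detection; a real system could use topic models or classifiers.
CONTEXT_KEYWORDS = {
    "entertainment": {"race", "concert", "game", "funny"},
    "business": {"meeting", "quarterly", "numbers", "office"},
    "somber": {"funeral", "memorial"},
}
CONTENT_INCOMPATIBLE_CONTEXTS = {"somber"}  # assumed: such portions are not supplemented

CONTENTS_BY_CONTEXT = {  # hypothetical plurality of contents
    "entertainment": "ad-for-entertainment-product",
    "business": "ad-for-business-product",
}

def determine_context(portion_text: str) -> str:
    """Pick the context whose keyword set overlaps most with the portion's text."""
    words = set(portion_text.lower().split())
    scores = {ctx: len(words & kws) for ctx, kws in CONTEXT_KEYWORDS.items()}
    return max(scores, key=scores.get)

def select_content(portion_text: str):
    """Return content for a portion, or None if the portion is content-incompatible."""
    context = determine_context(portion_text)
    if context in CONTENT_INCOMPATIBLE_CONTEXTS:
        return None
    return CONTENTS_BY_CONTEXT.get(context)

if __name__ == "__main__":
    print(select_content("the cars line up for the race"))    # ad-for-entertainment-product
    print(select_content("quarterly numbers in the office"))  # ad-for-business-product
    print(select_content("the funeral procession begins"))    # None (content-incompatible)
```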
- the first portion of the media may be supplemented with the first content
- the second portion of the media may be supplemented with the second content.
- the first content may be overlaid upon and/or displayed concurrently with the first portion of the media while the second content may be overlaid upon and/or displayed concurrently with the second portion of the media.
- the first content may be played before, after or in between frames of the first portion, while the second content may be played before, after or in between frames of the second portion.
- a first timestamp may be selected from a plurality of timestamps in the first portion based upon a first match between the first content and a portion of the time-associated text information associated with the first timestamp.
- a second timestamp may be selected from a plurality of timestamps in the second portion based upon a second match between the second content and a portion of the time-associated text information associated with the second timestamp.
- a timestamp corresponding to a part of the media determined to be best situated, least disruptive and/or most relevant for each content may be found in each portion.
- the first portion of the media may be supplemented with the first content at the first timestamp
- the second portion of the media may be supplemented with the second content at the second timestamp.
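Selecting a timestamp within a portion can likewise be sketched as choosing the entry of the time-associated text information whose text best matches the content; the keyword-overlap scoring below is an assumption for illustration, not the disclosed matching method.

```python
def select_timestamp(portion_entries, content_keywords):
    """Pick the timestamp in a portion whose text best matches the content's keywords."""
    def score(entry):
        _, text = entry
        return len(set(text.lower().split()) & set(content_keywords))
    best_ts, _ = max(portion_entries, key=score)
    return best_ts

if __name__ == "__main__":
    portion = [(400.0, "a quiet office"), (430.0, "quarterly numbers are discussed")]
    print(select_timestamp(portion, {"quarterly", "numbers"}))  # 430.0
```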
- method 500 may be implemented in the context of a movie theater, where the media may be a movie.
- the media may be a movie.
- the first portion of the movie may be supplemented with the first content
- the second portion of the movie may be supplemented with the second content
- the portions of the movie may be provided for presentation in the movie theater. It may be appreciated that costs of accessing the movie theater may be reduced or eliminated by supplementing the movie with various, relevant (e.g., and sponsored) content.
- method 500 may be implemented in the context of television programming, where the media may be (e.g., live) television.
- the first portion of the television may be supplemented with the first content
- the second portion of the television may be supplemented with the second content
- the portions of the television may be provided for presentation (e.g., broadcast) to an audience.
- costs of accessing the television may be reduced or eliminated by (e.g., dynamically) supplementing the television with various, relevant (e.g., and sponsored) content, and that the content supplemented may be more relevant and/or less invasive/interruptive (e.g., by being displayed concurrently with or overlaid on television programming) than existing television programming.
- method 500 may be implemented in the context of educational material, where the media may be an educational lecture.
- the first portion of the educational lecture may be supplemented with the first content
- the second portion of the educational lecture may be supplemented with the second content
- the portions of the educational lecture may be provided for presentation to a student.
- costs of getting an education may be reduced or eliminated by (e.g., dynamically) supplementing the educational lecture with various, relevant (e.g., and sponsored) content.
- Presentation of the content may be implemented in an education-friendly manner. For example, the content may be presented at one or more times determined to be less likely to distract the student from learning and/or content selected by the student may not be accessed and/or presented until after one or more portions of the educational lecture are completed.
- Fig. 5B illustrates an example of a system 525 for supplementing media with content.
- Segmenter 510 may segment media (e.g., a video) into one or more portions, such as a first portion and a second portion.
- Context analyzer 512 may (e.g., in parallel) analyze time-associated text information of the media, the first portion and/or the second portion to determine a first context 514 associated with the first portion and a second context 516 associated with the second portion.
- the first context 514 may indicate that the first portion of the media pertains to sports
- the second context 516 may indicate that the second portion of the media pertains to cars.
- Content selector 518 may (e.g., in parallel) select a first content from a plurality of contents based upon the first context 514 and/or a second content from a plurality of contents based upon the second context 516. For example, a first sponsored message pertaining to sports may be selected based upon the first context 514, while a second sponsored message pertaining to cars may be selected based upon the second context 516.
- Assembler 520 may assemble the first portion of the media, the first content, the second portion of the media and/or the second content to generate a supplemented media 522 comprising a combination of the first portion supplemented with the first content and the second portion supplemented with the second content.
- the first portion of the media, the first content, the second portion of the media and the second content may be merged, concatenated and/or otherwise combined with one another (e.g., temporally, spatially, and/or in other ways), and the supplemented media 522 comprising the combination of the first portion of the media, the first content, the second portion of the media and/or the second content may be generated.
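A structural sketch of the assembler 520, under the assumption that supplemented media can be represented as an ordered list of segments, is shown below; a real system would mux actual audio/video rather than the placeholder strings used here.

```python
def assemble(portions_with_content):
    """Interleave each media portion with its selected content; None leaves the portion unsupplemented."""
    supplemented = []
    for portion, content in portions_with_content:
        supplemented.append(("media", portion))
        if content is not None:
            supplemented.append(("content", content))
    return supplemented

if __name__ == "__main__":
    supplemented_media = assemble([
        ("scene-1-sports", "sponsored-message-sports"),
        ("scene-2-cars", "sponsored-message-cars"),
        ("scene-3-funeral", None),  # content-incompatible portion left as-is
    ])
    print(supplemented_media)
```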
- a website, an application, etc. may provide a platform for supplementing the video with content (e.g., advertisements) using a server (e.g., of the website, the application, etc.).
- the server may host uploaded videos, and the website may provide access to view and/or hear the uploaded videos to an audience.
- a first content may be selected (e.g., based upon context, etc.) from a plurality of contents for the video.
- a first area (e.g., and one or more additional areas) may be selected from a plurality of areas in the video based upon image analysis of the video. For example, image analysis may be performed upon one or more frames of the video to identify the first area (e.g., as a location suitable for placing supplemental content). It should be appreciated that 524 may happen before, after or concurrently with 526.
- the first area may be selected responsive to determining, based upon the image analysis, that the first area has a focus below a focus threshold. For example, a determination may be made that areas that are more out-of-focus are less likely to be important and/or result in disruption, inconvenience, etc. if they are overlaid by content.
- the first area may be selected responsive to determining, based upon the image analysis, that the first area comprises a first image feature. For example, a determination may be made that an area of a video representative of a side of a truck may be an appropriate location to overlay content (e.g., associated with a truck).
- the first area may be selected responsive to determining, based upon the image analysis, that the first area does not comprise a representation of a face. For example, a determination may be made that areas that display a face (e.g., of a person, animal, character, etc.) are likely to be important and/or result in disruption, inconvenience, etc. if they are overlaid by content.
- the first area may be selected responsive to determining, based upon the image analysis, that the first area has a texture below a texture threshold. For example, a determination may be made that areas that have a low level of texture are less likely to be important and/or result in disruption, inconvenience, etc. if they are overlaid by content.
- the first area may be selected responsive to determining, based upon the image analysis, that the first area has a texture above a texture threshold or within a range of texture.
- the first area may be selected responsive to determining, based upon the image analysis, that the first area has motion below a motion threshold. For example, a determination may be made that areas that have a low level of motion are less likely to be important and/or result in disruption, inconvenience, etc. if they are overlaid by content.
- the first area may be selected responsive to determining, based upon the image analysis, that the first area has motion above a motion threshold or within a range of motion.
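The area-selection heuristics above can be approximated with simple image statistics. The sketch below uses gradient variance as a rough focus measure and frame differencing as a rough motion measure, with assumed thresholds; face and texture checks, and any real computer-vision pipeline, are omitted.

```python
import numpy as np

# Assumed thresholds; a real system would tune these and add face detection, texture measures, etc.
FOCUS_THRESHOLD = 50.0
MOTION_THRESHOLD = 5.0

def sharpness(region: np.ndarray) -> float:
    """Rough focus proxy: variance of horizontal and vertical intensity gradients."""
    gy, gx = np.gradient(region.astype(float))
    return float(gx.var() + gy.var())

def motion(region: np.ndarray, prev_region: np.ndarray) -> float:
    """Rough motion proxy: mean absolute difference against the previous frame."""
    return float(np.abs(region.astype(float) - prev_region.astype(float)).mean())

def select_area(frame: np.ndarray, prev_frame: np.ndarray, areas):
    """Return the first candidate area (y0, y1, x0, x1) that is out of focus and static enough."""
    for y0, y1, x0, x1 in areas:
        region, prev_region = frame[y0:y1, x0:x1], prev_frame[y0:y1, x0:x1]
        if sharpness(region) < FOCUS_THRESHOLD and motion(region, prev_region) < MOTION_THRESHOLD:
            return (y0, y1, x0, x1)
    return None

if __name__ == "__main__":
    h, w = 120, 160
    prev_frame = np.zeros((h, w), dtype=np.uint8)
    frame = np.zeros((h, w), dtype=np.uint8)
    frame[:60, :] = np.random.randint(0, 255, (60, w))  # busy top half, flat bottom half
    candidates = [(0, 60, 0, 160), (60, 120, 0, 160)]   # top area, bottom area
    print(select_area(frame, prev_frame, candidates))   # (60, 120, 0, 160): the flat bottom area
```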
- the video may be supplemented with the first content at the first area (e.g., and the one or more additional areas).
- the first area may be a frame, a combination of frames, and/or a region (e.g., top, bottom, side, etc.) of one or more frames of the video.
- the video may be supplemented with the first content at the first area through image overlay (e.g., overlaying an advertisement).
- method 550 may be implemented in the context of a movie theater, where the media may be a movie.
- method 550 may be implemented in the context of television programming, where the media may be (e.g., live) television.
- method 550 may be implemented in the context of educational material, where the media may be an educational lecture.
- a website, an application, etc. may provide a platform for supplementing the video with content (e.g., advertisements) using a server (e.g., of the website, the application, etc.).
- the server may host uploaded videos, and the website may provide access to view and/or hear the uploaded videos to an audience.
- a first content may be selected (e.g., based upon context, etc.) from a plurality of contents for the video.
- the video may be supplemented with the first content.
- the first content may be overlaid and/or displayed concurrently with the video.
- the first content may be played before, after or in between frames of the video.
- one or more properties of the first content may be adjusted based upon image analysis of the video.
- the first content may be displayed across one or a plurality of frames of the video, and may be (e.g., dynamically) adjusted across the plurality of frames based upon the image analysis of each frame.
- a color of the first content is adjusted (e.g., between a first frame and a second frame of the video) based upon the image analysis.
- the first content may be adjusted from a first color to a second color (e.g., on the first frame, and/or to a third color on the second frame, and/or the first content may be left the first color on a third frame).
- a transparency of the first content is adjusted (e.g., between a first frame and a second frame of the video) based upon the image analysis.
- the first content may be adjusted from a first transparency to a second transparency (e.g., on the first frame, and/or to a third transparency on the second frame, and/or the first content may be left the first transparency on the third frame).
- a size of the first content is adjusted (e.g., between a first frame and a second frame of the video) based upon the image analysis.
- the first content may be adjusted from a first size to a second size (e.g., on the first frame, and/or to a third size on the second frame, and/or the first content may be left the first size on the third frame).
- a duration of the first content is adjusted (e.g., between a first frame and a second frame of the video) based upon the image analysis.
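Per-frame adjustment of the overlay's properties might, for example, derive a color and transparency from the brightness of the underlying frame region; the thresholds and the light-on-dark rule below are assumptions for illustration only.

```python
import numpy as np

def adjust_overlay(frame_region: np.ndarray):
    """Pick overlay color and transparency from image analysis of the underlying region."""
    brightness = float(frame_region.mean())                       # 0 (dark) .. 255 (bright)
    color = (255, 255, 255) if brightness < 128 else (0, 0, 0)    # light overlay on dark regions, and vice versa
    alpha = 0.4 if brightness > 200 else 0.8                      # more transparent over very bright regions
    return color, alpha

if __name__ == "__main__":
    dark = np.full((50, 50), 20, dtype=np.uint8)
    bright = np.full((50, 50), 230, dtype=np.uint8)
    print(adjust_overlay(dark))    # ((255, 255, 255), 0.8)
    print(adjust_overlay(bright))  # ((0, 0, 0), 0.4)
```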
- a website, an application, etc. may provide a platform for generating a representation of a performance using a server (e.g., of the website, the application, etc.).
- the server may host performances, and the website may provide access to view and/or hear the performances (e.g., live, recorded, etc.) to an audience.
- a request may be received to implement a performance with a first user and a second user (e.g., and one or more other users).
- the first user and the second user (e.g., and the one or more other users) may be at one or more (e.g., different) geographical locations.
- a determination may be made that the first user is associated with a first type of participation in the performance.
- the first user (e.g., or another user) may request that the first user participate in the performance with the first type of participation
- the first user may (e.g., randomly) be assigned the first type of participation (e.g., from amongst a plurality of types of participation)
- the first user may be assigned the first type of participation based upon one or more scores, one or more past performances and/or one or more games (e.g., played against the second user and/or one or more other users before the performance).
- a determination may be made that the second user is associated with a second type of participation in the performance.
- the second user (e.g., or another user) may request that the second user participate in the performance with the second type of participation
- the second user may (e.g., randomly) be assigned the second type of participation (e.g., from amongst a plurality of types of participation)
- the second user may be assigned the second type of participation based upon one or more scores, one or more past performances and/or one or more games (e.g., played against the first user and/or one or more other users before the performance).
- the second type of participation may be different than or the same as the first type of participation. Types of participation may correspond to singing, dancing and/or playing one or more instruments.
- a first content may be selected from a plurality of contents for the first user based upon the first type of participation
- a second content may be selected from the plurality of contents for the second user based upon the second type of participation.
- the first content may comprise a first version of a soundtrack (e.g., associated with the first type of participation, such as a version of the soundtrack for singing) and the second content may comprise a second version of the soundtrack (e.g., associated with the second type of participation, such as a version of the soundtrack for dancing).
- the first content may be the same as the second content.
- the first content may be provided to the first user, and the second content may be provided to the second user.
- the server may send the first content to a device of the first user via a first (e.g., network) connection, and/or may send the second content to a device of the second user via a second (e.g., network) connection.
- a local computer or karaoke machine can supply the first content and the second content in a merged form to both the first user and the second user when they are at the same geographical location.
- a first signal may be received from the first user in association with the performance, and a second signal may be received from the second user in association with the performance.
- the first signal may comprise an acoustic signal comprising a representation of the first user singing and/or the second signal may comprise a visual signal comprising a representation of the second user dancing.
- the server may receive the first signal from the device of the first user via the first (e.g., network) connection (e.g., or a third connection different than the first connection) and/or may receive the second signal from the device of the second user via the second (e.g., network) connection (e.g., or a fourth connection different than the second connection).
- the server can be a local computer or karaoke machine, and the first connection and second connection can be local wired/wireless connections.
- a representation of the performance may be generated based upon a combination of the first signal, the second signal, the first content and/or the second content. For example, a video of the performance playing the soundtrack combined with audio of singing by the first user and images of dancing by the second user may be generated and/or provided for display to the first user, the second user, an audience, one or more judges, etc.
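- A minimal sketch of one way such a combination might be performed (assuming, purely for illustration, that the soundtrack and the first user's vocal signal are mono PCM arrays at the same sample rate; synchronization and video muxing are omitted):

```python
import numpy as np

def mix_performance_audio(soundtrack: np.ndarray, vocals: np.ndarray,
                          vocal_gain: float = 0.8) -> np.ndarray:
    """Mix a user's vocal signal over the selected soundtrack version.
    The shorter signal is padded with silence; the sum is clipped to int16 range."""
    n = max(len(soundtrack), len(vocals))
    a = np.zeros(n, dtype=np.float32)
    a[:len(soundtrack)] = soundtrack
    b = np.zeros(n, dtype=np.float32)
    b[:len(vocals)] = vocals
    mixed = a + vocal_gain * b
    return np.clip(mixed, -32768, 32767).astype(np.int16)
```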
- Fig. 6B illustrates an example of a system 650 for generating a representation of a (e.g., karaoke) performance.
- Performance creation component 618 may create a performance responsive to a request from one or more users, such as a first user, Jack, and a second user, Jill, and/or randomly (e.g., for one or more users accessing a common application, service, etc.).
- the request may include a name for the performers of the performance, such as a band name, for example.
- Participation manager 620 may determine that the first user, Jack, is associated with a first type of participation 622, such as singing, and that the second user, Jill, is associated with a second type of participation 624, such as dancing. Participation manager 620 may also determine whether Jack and Jill are at the same geographical location or different geographical locations.
- Content selector 626 may (e.g., in parallel) select a first content from a plurality of contents based upon the first type of participation 622 and a second content from the plurality of contents based upon the second type of participation 624. For example, a first version of a soundtrack customized for singing may be selected based upon the first type of participation 622, while a second version of the (same) soundtrack customized for dancing may be selected based upon the second type of participation 624. It may be appreciated that in some examples, the first content and the second content may be the same content (e.g., when Jack and Jill are at the same geographical location), while in other examples, the first content and the second content may be different soundtracks, images, videos, and/or different types of media.
- the first content may be provided to the first user 628, Jack, while the second content may be provided to the second user 630, Jill.
- the first version of the soundtrack may be played to Jack, while the second version of the soundtrack may be played to Jill.
- a first signal 632 such as a (e.g., audio) signal of Jack singing along with the first version of the soundtrack, may be received from the first user 628.
- a second signal 634 such as a (e.g., video) signal of Jill dancing to the second version of the soundtrack, may be received from the second user 630.
- Assembler 636 may assemble the first signal 632, the second signal 634, the first content and/or the second content (e.g., and/or third content comprising a third version of the soundtrack) to generate a representation of the performance 638.
- the representation of the performance 638 may comprise, for example, a video displaying Jack singing the soundtrack, Jill dancing to the soundtrack (e.g., and another user, Jane, playing an instrument in the soundtrack) and the soundtrack being played in the background.
- first signal 632, the second signal 634, the first content and/or the second content may be merged, concatenated and/or otherwise combined with one another, and the representation of the performance 638 comprising the combination of the first signal 632, the second signal 634, the first content and/or the second content may be generated.
- An embodiment of navigating through media is illustrated by an example method 700 of Fig. 7A.
- a user, such as Jill (e.g., and/or a device associated with the user), may access and/or interact with a website, an application, etc. that provides a platform for searching the media using a server or a local computer (e.g., of the website, the application, etc.).
- the server may host uploaded media, and the website may provide an audience with access to view and/or hear the uploaded media. It should be appreciated that the media may be hosted locally or on the networked server. Accordingly, at 704, a request to move (e.g., drag) a control along a first axis from a first portion of the first axis to a second portion of the first axis may be received (e.g., by the server and/or from the user).
- the media may comprise video, audio, an image, a document and/or an application interface.
- the media may be navigated through at a first rate of advancement based upon a first feature of the first portion of the media.
- the media may be navigated through at a second rate of advancement based upon a second feature of the second portion of the media.
- the first rate of advancement may be different than (e.g., or the same as) the second rate of advancement.
- portions of the media determined to be more (e.g., above a threshold) important, popular, exciting, etc. may be navigated through at a slower rate of advancement than portions of the media determined to be less (e.g., below the threshold) important, popular, exciting, etc.
- the axis itself can be represented nonlinearly in the graphical user interface, with a first pixel distance between a first location and a second location representing a first rate of advancement, and a second pixel distance between the second location and a third location representing a second rate of advancement, where the first rate of advancement may be different than (e.g., or the same as) the second rate of advancement. For example, the user may jump from a location A to a location B with more precision between A and C (where B is between A and C), but jump from a location B to a location D with relatively less precision between B and E (in comparison to between A and C), where D is between B and E.
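- A toy illustration of such a nonlinear axis (the segment boundaries and times below are invented for the example): a piecewise-linear table maps pixel positions on the control bar to media time, so equal pixel jumps give different precision in different segments.

```python
# Each tuple maps a pixel range of the control bar to a span of media time.
SEGMENTS = [
    # (pixel_start, pixel_end, time_start_s, time_end_s)
    (0,   400,   0.0,  60.0),   # first 400 px cover 1 minute  -> fine precision
    (400, 500,  60.0, 600.0),   # next 100 px cover 9 minutes  -> coarse precision
]

def pixel_to_time(x: float) -> float:
    """Convert a pixel position on the nonlinear first axis to a media timestamp."""
    for px0, px1, t0, t1 in SEGMENTS:
        if px0 <= x <= px1:
            frac = (x - px0) / (px1 - px0)
            return t0 + frac * (t1 - t0)
    raise ValueError("pixel position outside the control bar")
```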
- the first axis may correspond to (e.g., a spectrum of) time.
- a first point on the axis may correspond to a first timestamp of the media
- a second point on the axis may correspond to a second timestamp (e.g., after the first timestamp) of the media
- a rate of advancement may comprise temporal resolution.
- the media that is navigated through may comprise a list of contacts (e.g., of the user).
- the first feature may comprise a frequency of contact between the user and a first contact in the list of contacts
- the second feature may comprise a frequency of contact between the user and a second contact in the list of contacts, etc.
- the media that is navigated through may comprise a list of messages (e.g., of the user).
- the list of messages may comprise text messages, instant messages, email messages, etc.
- the first feature may comprise a feature of a first message in the list of messages
- the second feature may comprise a feature of a second message in the list of messages, etc.
- a feature of a message may comprise a contact frequency between the receiver and sender of the message, whether the receiver is a TO recipient or a CC recipient of the message, an importance of the message, a timestamp of the message, a length of the message, a domain name of the sender, a subject of the message, a signature of the sender, whether the message is an initial message, whether the message is a reply message, whether the message is a forwarded message, a number of recipients of the message, user data, user history, how frequently the user replied to previous messages from the sender, how soon the user replied to the previous messages of the sender after the user saw them, and/or the length of the previous messages between the receiver and the sender.
- the first feature and/or the second feature may be determined based upon information of the media.
- the first feature and/or the second feature may be determined based upon text, an image, audio, comments, tags, titles, a transcript, cross-references, data analytics from a plurality of users, recommendations, reviews and/or user history associated with the media.
- the first feature may correspond to a first distance of (e.g., a first instance of) a focus point from the first portion of the first axis and/or the second feature may correspond to a second distance of (e.g., a second instance of) the focus point from the second portion of the first axis.
- where the first instance of the focus point is a first distance (e.g., 2 inches on the computer display) away from the first portion of the first axis (e.g., along a second axis different than (e.g., perpendicular to) the first axis) and the second instance of the focus point is a second distance (e.g., less than the first distance) (e.g., 1 inch on the computer display) from the second portion of the first axis (e.g., along the second axis), the first rate of advancement may be greater than the second rate of advancement.
- a representation of the moving of the control along a representation of the first axis may be provided for presentation, for example, as part of a playback interface.
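- One possible way (an assumption for illustration, not the required implementation) to turn the focus-point distance into a rate of advancement is to decay the scrub rate with the perpendicular distance of the focus point from the first axis:

```python
def rate_of_advancement(focus_distance_px: float,
                        base_rate: float = 10.0, min_rate: float = 0.5) -> float:
    """The farther the focus point (e.g., the user's finger) is dragged away from the
    scrub axis, the slower (finer) the advancement; touching the axis scrubs fastest."""
    # Halve the rate for every 100 px of perpendicular distance, with a floor.
    return max(min_rate, base_rate * 0.5 ** (focus_distance_px / 100.0))
```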
- Fig. 7B illustrates an example of a system 750 for navigating through media.
- An interface 755 may be displayed on a device of a user.
- the interface 755 may, in some examples, display an application, such as a media (e.g., video) player, which may include a media display portion 702 within which media may be played and/or a media control bar 704 as part of a playback interface.
- the interface 755 may further display information about a source of the media, a control that when selected enables sharing the media, one or more other recommended media and/or a media upload button, which may be selected by the user to upload one or more videos to a server associated with the application.
- the media in the display portion 702 and/or in a preview box may be updated to reflect the movement of the control and/or to present (e.g., display) which portion and/or frame of the media would be played if the control was released at that instant.
- the rate of movement (e.g., the updating of the media displayed) may be different (e.g., faster) than a normal rate of playing the media (e.g., to enable the user to identify a desired part of the media without having to view and/or listen to the media from the beginning).
- a control at a first location 706 in the media control bar 704 may be moved (e.g., dragged) to a second location 708. Responsive to determining that the second location 708 is within a first portion 710 of the media, the updating of the media displayed may be at a first rate of advancement (e.g., 3x).
- the control at the first location 706 in the media control bar 704 may be moved (e.g., dragged) to a third location 712. Responsive to determining that the third location 712 is within a second portion 714 of the media, the updating of the media displayed may be at a second rate of advancement (e.g., 10x).
- the user can click on the locations on the first axis of the media control to jump between locations on the first axis.
- the first axis may be represented nonlinearly, wherein the first rate of advancement between the first location and second location may be different than (e.g., or the same as) the second rate of advancement between the second location and third location.
- the user can click to jump to a fourth location that is between the first location and second location at the first rate of advancement, and/or click to jump to a fifth location that is between the second location and third location at the second rate of advancement.
- At least some of the disclosed subject matter may be implemented on a client (e.g., a device of a user), and in some examples, at least some of the disclosed subject matter may be implemented on a server (e.g., hosting a service accessible via a network, such as the Internet).
- Fig. 8 is an illustration of a scenario 800 involving an example non-transitory machine readable medium 802.
- the non-transitory machine readable medium 802 may comprise processor-executable instructions 812 that when executed by a processor 816 cause performance (e.g., by the processor 816) of at least some of the provisions herein.
- the non-transitory machine readable medium 802 may comprise a memory semiconductor (e.g., a semiconductor utilizing static random access memory (SRAM), dynamic random access memory (DRAM), and/or synchronous dynamic random access memory (SDRAM) technologies), a platter of a hard disk drive, a flash memory device, or a magnetic or optical disc (such as a compact disc (CD), digital versatile disc (DVD), or floppy disk).
- the example non-transitory machine readable medium 802 stores computer-readable data 804 that, when subjected to reading 806 by a reader 810 of a device 808 (e.g., a read head of a hard disk drive, or a read operation invoked on a solid-state storage device), expresses the processor-executable instructions 812.
- the processor-executable instructions 812, when executed, cause performance and/or implementation of an embodiment 814, such as at least some of the example method 400 of Fig. 4A, the example method 500 of Fig. 5A, the example method 550 of Fig. 5C, the example method 575 of Fig. 5D, the example method 600 of Fig. 6A and/or the example method 700 of Fig. 7A, for example, and/or at least some of the example system 450 of Fig. 4B, the example system 525 of Fig. 5B, the example system 650 of Fig. 6B and/or the example system 750 of Fig. 7B, for example.
- 1. Multimodal karaoke (how to integrate singing, dancing and instrument playing together; also has a transcript search capability based on lyrics)
- 2. Non-linear video navigation (how to navigate through video manually in a precise and easy way, especially on mobile devices)
- 3. Videomark technology (conveniently manage and navigate through contents)
- 4. Time-associated text search and media navigation control (aka Search and Play)
- Media search may be done with queries in fields such as title.
- Time-associated text information, such as transcripts, legends, captions, lyrics and subtitles, may not be used in existing search technologies such as search engines.
- current media players, whether online or offline and whether hardware-implemented or software-implemented, do not provide a convenient way to locate specific words or phrases in Time-associated text information such as transcripts, for accurate localization and media navigation. For instance, if the word "computer" is mentioned in the speaker's speech in the video but not in the title or playlist names, current search will not work.
- current web search may be limited to searching video and audio by title or description, not by the content of the media works themselves.
- Time-associated text information such as transcripts, legends, captions, subtitles and annotations can provide important information for content searching in media.
- the search can be done purely based on Time-associated text information.
- the search engine can search different videos based on the presence of the query in the transcripts of the videos. The search engine will subsequently return the videos whose transcripts contain the query words.
- the search engine can ask the user to input a combination of queries in different fields, where at least one of the queries targets the transcript field.
- a query "happy” in the field of "title” and a query “lab” in the field of “transcript” may be inputted by the user for a combined search for desirable videos.
- the engine will subsequently return the results of relevant videos (or segments of videos, playlists, channels, etc.) based on the search, where the two queries are found in the title and transcript, respectively.
- the results may be ranked based on relevance or other attributes, such as upload time.
- other text fields that might not necessarily correspond directly to the transcript of the audio, such as descriptions, comments and tags, may also be included in the search.
- a query "happy” in the field of "tag” and a query “lab” in the field of "transcript” can be inputted by the user.
- the search engine will subsequently return results of relevant videos (or segments of videos, playlists, channels, etc.) based on the search, where either the first query is found in the tags or the second query is found in the transcript.
- the search engine will have an algorithm to rank the relevance of results, which is described in a later part of the disclosed subject matter.
- the user stores his/her family videos on a local hard disk drive, and a computer program scans over all those videos so that the user can quickly find, by searching, family videos dated years back.
- Media referred to in the disclosed subject matter uses a combination of different content forms.
- Media may include a combination of text, audio, still image, animation, video, or interactive content forms.
- the disclosed subject matter is applicable to any media, such as files online or local, as well as YouTube, MP3, Quicktime, AAC, radio, DVDs, Blu-Ray, TV, virtual reality, Netflix, Amazon Video, HBO Now, or any content-distribution service, in physical media, in the cloud or in cyberspace.
- the media are intended for use on a computer (including smartphones, tablets, etc.) or on any other hardware such as DVD players, Karaoke players, video streaming sticks, or TV set-top boxes.
- many hardware implementations contain a computing unit.
- One such hardware-implemented example is a karaoke player that has an embedded system/computer, usually using a microcontroller.
- Time-associated text information is defined as any text information that is associated with the sound, speech or any other information-carrying signals.
- With Time-associated text information, such as transcripts (including subtitles, closed captions, lyrics, annotations, or equivalents of the aforementioned), of media entities available, the search can go into the Time-associated text and provide matching results in the Time-associated text, which carries additional information related to the audio/video, unlike current search where only text outside of the media entities, such as titles, descriptions or tags, is used.
- the aforementioned search method in audio-related text can also be expanded to all kinds of media content search, including but not limited to search functions built into or outside of content distribution websites, such as YouTube.com, Netflix.com, Amazon.com Video, HBO on Demand, Apple Store, Google Play, etc.
- software refers to programs or code that can run on a computer or other electronic devices.
- the software can be implemented as a program that can run on different operating systems or without an operating system.
- an operating system itself is a program, and it may provide support for the disclosed subject matter (e.g., via a built-in function or user interface) to a certain extent.
- the software can be implemented at the hardware level, such as using a field-programmable gate array (FPGA) or specific electronic circuits.
- a timestamp refers to a specific point on the time axis of the media. For instance, the timestamp when the word "hello" is said can be expressed in absolute time, such as 5 minutes 55 seconds (counted from the beginning of the video). Likewise, other ways to encode a timestamp are also possible.
- the basic processing pipeline for the Time-associated text search is illustrated below.
- the software will take a plurality of user-inputted queries, match the queries against Time-associated text information in media in the text domain (e.g., transcript), audio domain (e.g., the sound waveforms) and visual domain (e.g., video frames), and return the results to the user through a list or a GUI-powered (graphical user interface powered) visualization.
- the user can select the result of interest (such as a video or a video segment) for viewing.
- Figure 9: Flowchart of search with Time-associated text.
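- The following is a minimal sketch of the text-domain portion of that pipeline (the Caption/Match structures and the in-memory library are illustrative assumptions): each query term is matched against the time-associated text and the matching timestamps are returned for presentation.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Caption:        # one line of time-associated text
    start_s: float
    text: str

@dataclass
class Match:
    media_id: str
    timestamp_s: float
    snippet: str

def search_transcripts(query_terms: List[str],
                       library: Dict[str, List[Caption]]) -> List[Match]:
    """library maps media_id -> captions; returns every caption containing all terms."""
    results = []
    for media_id, captions in library.items():
        for cap in captions:
            text = cap.text.lower()
            if all(term.lower() in text for term in query_terms):
                results.append(Match(media_id, cap.start_s, cap.text))
    return results   # a GUI layer would then let the user pick a result and seek to it
```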
- Each query consists of parts known as terms.
- Each term can form a query, e.g., two words can be a query while each word is a term.
- a term can also be a relationship on terms. For example, not "happy" can be a term meaning the word "happy" shall not appear in search results, where "not" is treated as a keyword for a logical relationship and the word "happy" itself is also a term.
- time-related terms may be included in the search query, and they can be composed (e.g., combined using binary operators such as AND and OR) with any existing search terms in a search query (e.g., "words in a query cannot be more than 5 seconds apart").
- a term doesn't have to be in the form of text or a formatted string. It can be media too, e.g., an audio clip (as in Apple Siri or Microsoft Cortana) or an image (as in Wolfram Mathematica syntax, where images can be operands of functions).
- a query is searched against a reference. Similar to terms, a reference does not have to be limited to text; it may be a media work comprising multiple media, e.g., a video that comes with transcripts in two languages.
- a result of searching for a query against a reference is called a match.
- a match is a collection of terms and their timestamps such that the query is fully or partially satisfied.
- a query can be searched against one reference or multiple references, returning one or more matches.
- the user can specify searching for different terms in different media fields. For example, the user can specify to search for one word in the title and a two-word phrase in the closed captions of a media work.
- a field can be time-advancing with playing of the media, e.g., English transcript, Chinese transcript, left-channel audio, right-channel audio, video frames, etc.; or a field can be time-independent, e.g., the title, the description, etc.
- a field could also mean a constraint to apply to terms; for example, we can have a field called "must have exact words" or "not including the following words".
- a field can also include human annotations from the content creator, viewers and distributors (such as Amazon X-Ray), providing trivia and background about the storyline and actors.
- Treating language as a field allows the user to search for Time-associated text in specific languages. For instance, if the user inputs the query "hola" in the field "transcript" and the query "Spanish" in the field "language", the software will return results from clips where the corresponding Spanish word "hola" is said. Similarly, dialects such as Cantonese or Hakka can also be specified in the search field. This function requires the identification of the language in media entities. This problem is also known as language identification.
- the language detection can be done in audio domain.
- the language detection can be done in text domain, e.g., using n-gram classification models.
- the language detection can be done based on tags from users or the transcripts.
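- A toy sketch of text-domain language identification using character n-gram profiles (the sample profiles below are placeholders; a production system would use trained models or an existing language-identification library):

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    text = " " + text.lower() + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def detect_language(snippet: str, profiles: dict) -> str:
    """profiles maps language name -> Counter of character trigrams from sample text."""
    grams = char_ngrams(snippet)
    def overlap(profile):
        return sum(min(c, profile.get(g, 0)) for g, c in grams.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

profiles = {"english": char_ngrams("the quick brown fox jumps over the lazy dog"),
            "spanish": char_ngrams("hola como estas el rapido zorro marron salta")}
print(detect_language("hola amigo", profiles))   # -> "spanish" (with these toy profiles)
```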
- the input method may be based on text input, voice input, body gesture, brain-computer interface, or any other input methods. Keyboards (physical or software), mouse, voice, touchpads, microphones, RGB-D camera (e.g., Microsoft Kinect), audio-based controller (e.g., Amazon Alexa or Dot), or other input devices may be used.
- the input can be done on one computer (e.g., a mobile smartphone or a tablet computer) while the result is shown on another computer (e.g., a Google Chromecast).
- the input device might also be far away from the output device (e.g., a phone at Mountain View for input and a tablet computer at Redmond for output).
- the software will allow the user to simply input a plurality of words as the query, and the software will determine whether the inputted words are individual terms or whether several of the words form phrases, in which case a compound word should be considered as a single term. Detecting compound words has been well studied in NLP, in problems such as collocation. One approach is to use item names in Wikipedia, e.g., the Illinois Wikifier.
- Terms-and-connectors search: users are empowered to search with multiple terms and to connect them using different connectors in the query.
- the syntax of terms-and-connectors includes, but is not limited to, the following forms. Phrases in double quotes: enclose a term in double quotes, "like this". Double quotes can define a single search term that contains spaces; e.g., "holly dolly", where the space is quoted as a character, differs from holly dolly, where the space is interpreted as a logical AND.
- Boolean connectors can be used to connect terms, such as (blue OR red) AND green, which differs from blue OR (red AND green).
- terms can be excluded by prefixing a hyphen or dash (-), which is a "logical not". For example, payment card -"credit card" finds all matches with "payment" and "card" but not "credit card", yet would still include "credit" on its own.
- a wildcard character * can match any extra character string and can prefix or suffix a word or string. For example, "*like" will match "childlike" or "dream-like". Spelling relaxation occurs by suffixing a tilde (~), e.g., this~, with results like "thus" and "thins"; this is also called search~ish. It should be appreciated that the aforementioned terms-and-connectors search method can be defined with other reserved words.
- the terms-and-connectors search can be used in combination with different fields of search, such as the fields "transcript", "title" and "tag", where each field specifies where the corresponding terms will be searched for. For example, a user could query for all occurrences where the word "happy" appears in the closed captions while the word "lab" appears in the title.
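- A small, hedged sketch of evaluating a subset of this terms-and-connectors syntax (quoted phrases, implicit AND, OR between groups, and -exclusion; the wildcard and tilde relaxation are omitted here):

```python
import re

def matches(query: str, text: str) -> bool:
    """Toy evaluator: quoted phrases, implicit AND between terms, OR between groups,
    and a leading '-' for exclusion ("logical not")."""
    text_l = text.lower()
    def term_ok(term: str) -> bool:
        negate = term.startswith("-")
        term = term.lstrip("-").strip('"').lower()
        present = term in text_l
        return (not present) if negate else present
    # OR has lower precedence than the implicit AND between terms inside a group.
    for group in re.split(r"\s+OR\s+", query):
        terms = re.findall(r'-?"[^"]+"|\S+', group)
        if all(term_ok(t) for t in terms):
            return True
    return False

print(matches('payment card -"credit card"', "pay by payment card or debit card"))        # True
print(matches('payment card -"credit card"', "use your credit card for payment card fees"))  # False
```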
- temporal constraints are relationships on timestamps.
- search results can be presented to the users with temporal consideration, e.g., temporal distances (expressed in time) among query terms in each match can be involved in ranking. This provides an additional advantage over the capabilities of prior-art search techniques.
- the temporal distance in the disclosed subject matter refers to the interval between event A and event B, where an event can be a word, a phrase or a sentence in the timed transcript, a scene in the movie (e.g., fighting), or a plot-related event (e.g., Tom Cruise enters the scene).
- the word “happy” is said at a timestamp of 0 hour, 0 minutes 20 seconds
- the word "lab” is said at a timestamp of 0 hour, 2 minutes 30 seconds
- the temporal distance between "happy” and "lab” is 0 hour, 2 minutes 10 seconds.
- temporal constraints can be built based on temporal distance. For instance, we can specify a temporal constraint of "less than 30 seconds" between the words "happy" and "lab”.
- the media is a video clip accompanied by a transcript where each word is associated with a timestamp while the said query is a phrase.
- the user can find the timestamp of matches in which 80% of the words in the query are at most 5 seconds apart, e.g., find all timestamps where at least 5 out of the 6 words in "happy lab good best award employee" are mentioned no more than 5 seconds apart.
- the matches are ranked based on the proximity of temporal distances between all query terms in each match. In this case, if several successive words are too far away temporally from each other, they are not considered a match.
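- A sketch of such a proximity check (assuming per-term timestamp lists have already been extracted from the time-associated text): report every window of the given width that covers at least the required number of distinct query words.

```python
def proximity_matches(term_times: dict, min_terms: int, window_s: float):
    """term_times maps each query word to the timestamps (s) where it occurs.
    Returns (start_time, covered_terms) for every window of width window_s that
    covers at least min_terms distinct query words (e.g., 5 of 6 words within 5 s)."""
    events = sorted((t, w) for w, times in term_times.items() for t in times)
    hits = []
    for i, (start, _) in enumerate(events):
        covered = {w for t, w in events[i:] if t - start <= window_s}
        if len(covered) >= min_terms:
            hits.append((start, covered))
    return hits
```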
- the distance between words can be a combination (e.g., weighted sum, etc.) of temporal, lexical and semantic distances.
- Lexical distance can be defined as how many words are in between, how many syllables are in between, or even how many sentences are in between, or their distance on the parsing tree, etc. For instance, in the sentence "this is a powerful patent", the lexical distance between "this" and "patent" is 3 words.
- the semantic distance can be defined in many ways.
- the semantic distance can be defined as the average pairwise vector distance between any two words, excluding stop/background words, within 5 words upstream and 5 words downstream of the match. For example, if the query is to find the 2 words "happy" and "lab", and one match is in the phrase "Google Venture gives Happy Lab a $20 Million valuation", the semantic distance will be the average vector distance over any pair of the words "google", "venture", "million" and "valuation". The vector distance is defined as the cosine of the angle between the two vectors representing the two words. Methods to generate vector representations of words include but are not limited to Google word2vec, Stanford NLP Lab's GloVe, etc.
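- A sketch of that computation, assuming a word-to-vector mapping (e.g., pre-trained word2vec or GloVe vectors) is available as a plain dictionary:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_distance(context_words, embeddings) -> float:
    """Average pairwise cosine between the vectors of the non-stop words near a match,
    following the definition above; embeddings maps word -> numpy vector."""
    vecs = [embeddings[w] for w in context_words if w in embeddings]
    pairs = [(a, b) for i, a in enumerate(vecs) for b in vecs[i + 1:]]
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
```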
- TymeTravel syntax refers to the method and syntax for encoding time-related queries.
- the user can specify the approximate time range of each term, e.g., 'happy' around 3:00 AND 'lab' at 5:30 after 'happy', or 'happy' no more than 30 seconds before 'lab'.
- the user can even specify the timestamp of which term to be returned, or they could generally specify using words, including but not limited to, "center", “leftmost", and "latest”.
- the timestamp of 0.5 second will be returned if the user specifies to return the earliest timestamp among the two.
- the time constraints can be represented in a syntax other than natural languages.
- "the word 'happy' no earlier than 30 seconds before the word 'lab'" can be expressed as time(lab) - time(happy) <= 30s, and "the word 'happy' around 3 minutes 00 seconds and the word 'lab' after 5 minutes 30 seconds" can be expressed as happy@3:00 AND lab@5:30+.
- some reserved words can be used to allow the user to specify constraints on the terms, such as "all-apart 5s", meaning that all words in the query must be 5 seconds apart temporally.
- a temporal query can be sent via HTTP, HTTPS, or another Internet protocol, with or without encryption, through POST, GET or any other HTTP/S method. It should also be appreciated that the temporal query can be sent with the text query.
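- A hypothetical sketch of encoding and parsing such time-related constraints (the exact grammar below is an assumption for illustration only, since TymeTravel is described as a general grammar rather than a fixed one):

```python
import re

def encode_tymetravel(terms: dict, constraints: list) -> str:
    """Hypothetical encoder: terms maps word -> optional anchor like "3:00" or "5:30+";
    constraints are strings such as "time(lab) - time(happy) <= 30s"."""
    parts = [f"{w}@{anchor}" if anchor else w for w, anchor in terms.items()]
    return " AND ".join(parts + constraints)

def parse_time_constraint(expr: str):
    """Parse "time(A) - time(B) <= Ns" into (A, B, N_seconds), else None."""
    m = re.match(r"time\((\w+)\)\s*-\s*time\((\w+)\)\s*<=\s*(\d+)s", expr)
    return (m.group(1), m.group(2), float(m.group(3))) if m else None

query = encode_tymetravel({"happy": "3:00", "lab": "5:30+"},
                          ["time(lab) - time(happy) <= 30s"])
# -> "happy@3:00 AND lab@5:30+ AND time(lab) - time(happy) <= 30s"
```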
- TymeTravel beyond end users: the use of TymeTravel syntax does not have to be limited to end users. Any information can be converted into/from TymeTravel syntax and be used by any part of the software system.
- a client program uses TymeTravel syntax to send the queries to a server.
- the user fills in different terms into a webform and then the program converts the inputs of the webform into TymeTravel syntax and sends it to a server.
- the program can automatically (without an explicit time-related search request from the user) synthesize a query in TymeTravel syntax and send it to a server.
- the TymeTravel syntax is used to exchange data/query/info between two different programs running on the same computer.
- the TymeTravel syntax is used to exchange information between a program and the APIs/libraries that it calls.
- TymeTravel syntax is not a fixed grammar. It should be treated as a general grammar that allows users to specify time-related constraints in query terms.
- the query does not have to be initiated by end users. Any information can be converted into/from a query and be used by any part of the software system. For example, the user fills in different terms into a webform and then the program converts the inputs of the webform into a query and sends it to a server. In another example, based on data, the program can automatically (without an explicit time-related search request from the user) synthesize a query and send it to a server. In yet another example, a query can be exchanged between two different programs running on the same computer. In yet another example, a query can be exchanged between a program and the APIs/libraries that it calls, where the APIs/libraries can be local or remote.
- the search does not have to be limited to Time-associated text information; it can cover all kinds of information, embedded in the media or outside of the media, in all kinds of media forms, related to time or not.
- an object recognized in a movie at a timestamp can be converted into text describing it (e.g., a picture of a chair can be recognized and converted to the text information "a chair").
- Text-based search/matching: in the embodiments of the disclosed subject matter where audio information is transcribed into text, many string matching algorithms can be employed to find the matches for a query, including but not limited to the naive string searching algorithm, Rabin-Karp algorithm, finite-state automaton search, Knuth-Morris-Pratt algorithm, Boyer-Moore algorithm, dynamic programming-based string alignment, Bitap algorithm, Aho-Corasick algorithm, Commentz-Walter algorithm, etc.
- Text-based matching results can be ranked based on the distances between the query and the matches, using distance metrics including but not limited to Hamming distance, edit distance, Levenshtein distance, Jaro-Winkler distance, most frequent k words, Sørensen-Dice coefficient, Jaccard similarity, Tversky index, Jensen-Shannon divergence, etc. Typos will be tolerated with spell correction suggestions.
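- As an illustration of one of the listed metrics, a candidate snippet can be scored with the classic dynamic-programming edit distance and the candidates ranked by it (a sketch, not the full ranking scheme):

```python
def levenshtein(a: str, b: str) -> int:
    """Row-by-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def rank_matches(query: str, candidates: list) -> list:
    """Rank candidate transcript snippets by edit distance to the query (closest first)."""
    return sorted(candidates, key=lambda c: levenshtein(query.lower(), c.lower()))
```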
- Text-based matching can be expanded to include words related to the query terms. For example, when searching for "inn", the words "hotel" and "resort" will also be included.
- a common case in handling natural languages is that one word, phrase, or clause/sentence can match multiple words, phrases, or clauses/sentences. This is due to the ambiguity of natural languages.
- text matches can also be found using text alignment algorithms studied in NLP. Query terms, individually, as a partial group or as a whole, can be aligned against the time-associated text. Methods and tools for text alignment include but are not limited to the Smith-Waterman algorithm, Jacana-align, semi-Markov phrase-based alignment, IBM models, the MANLI aligner, etc.
- Audio-based search/matching: in another embodiment, the matching will be done in the audio domain rather than the text domain.
- the matching algorithm will operate based on audio signals directly.
- the query from the user can be an audio clip, either from the user's audio input or an artificial audio clip converted/synthesized from text input.
- the audio query can be directly aligned/searched against the audio of the media stream and the playing will begin from the timestamp of the beginning of the matching. Aligning audios can be done by Hidden Markov Models (HMM), Dynamic Time Warping (DTW), Neural Networks, Deep Learning Models, etc.
- the audio input or text input may be translated by the software from one language to another one for the search.
- “Hola” in Spanish may be translated to "Hello” in English for the search.
- the match for audio can be done in the time domain, the frequency domain, the joint time-frequency domain, or a combination thereof, via any necessary transforms, e.g., the Fourier transform, the short-term Fourier transform or the wavelet transform.
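- A plain dynamic time warping sketch over 1-D feature sequences (e.g., per-frame energies); a real system would typically align richer features such as MFCCs, but the recursion is the same:

```python
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Classic DTW cost between two 1-D feature sequences; lower means better alignment."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(float(x[i - 1]) - float(y[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])
```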
- Image-based search/matching: in yet another embodiment, the matching will be performed in the image domain.
- the matching algorithm will operate not based on text but based on images and video frames.
- the query of the user can be a plurality of video clips or a plurality of images, either uploaded from existing data or taken at the time of search.
- the image query can be directly aligned/searched against the video frames of the media stream and the playing will begin from the timestamp of the beginning of the matching.
- Searching for images can be done by Object recognition algorithms, including but not limited to, Histogram of Oriented Gradients (HOG), Aggregated Channel Features (ACF), Viola-Jones algorithms, SURF, etc.
- Ranking can be done based on the matching scores.
- Hybrid search: instead of making use of the text, audio or video independently for searching and matching, in another embodiment we propose a joint media search and data mining scheme.
- when a user specifies the query, they can define some terms in text and some other terms in video. For example, "Steve Jobs" in the text query field and "iMac G3" in the video/image query field can be specified by a user.
- the text searching algorithm will find matches for "Steve Jobs" in text while the image recognition algorithm will detect iMac G3 objects in the video stream. Users can set different priority levels for different types of search terms. User feedback or tagging can also be used to teach the computers.
- Step 1: the software finds matches for each query term in each field.
- by field, we refer to an organization of data that follows the time axis, e.g., the transcript in one language or the stock price of one ticker symbol, such as the transcript in Chinese or the stock price of Apple Inc.
- Step 2: we check whether the query is satisfied at each timestamp where at least one term is matched.
- a term herein refers to keywords, audio waveforms, images, videos or other items composing the query for the search. If matched, and the user specifies how the return time should be extracted, e.g., "the center timestamp of all terms", the specific timestamps will be returned. If matched but the user does not give a preference on the return timestamp, the timestamp of the earliest term in the query may be returned as the default option.
- there are multiple ways to check the satisfiability of the query in Step 2.
- the combinations of all timestamps, each of which matches at least one term, and terms associated with those timestamps are enumerated and checked against the query. This approach is illustrated using the example below where the query is finding two words "happy" and "lab” in the transcript and the temporal distance between the matches of the two words must be under 5 seconds.
- Figure 10: An example of matching multiple terms involving time.
- a sliding temporal window is used and the query is checked within each incremental temporal position of the sliding window along the time axis. For example, if the user wants to search for co-occurrences of the two words "happy" and "lab" that are no more than 5 seconds apart, we establish a +/- 5-second sliding temporal window for each occurrence of "happy" and "lab" and check whether both words appear in each position of the sliding window.
- the sliding temporal window is established based on the time constraints inputted by the user or by default.
- the temporal constraint is, using the TymeTravel syntax introduced above, "time(lab) - time(happy) <= 30".
- the sliding temporal window will be at least 30 seconds in temporal width to allow checking whether the word "lab” appears within 30 seconds after the timestamp of the word "happy”.
- a default value for the sliding window will be used, such as 10 seconds in temporal width, because in most cases of media enjoyment people care about a short event.
- a plurality of matches will be found, and each match corresponds to a temporal position of the sliding window that satisfies the query.
- Clustering is a well-studied problem in machine learning.
- Approaches for clustering include but are not limited to connectivity-based clustering (e.g., single-linkage clustering, complete linkage clustering, average linkage clustering, etc.), centroid-based clustering (e.g., k-means clustering, etc.), distribution based clustering (e.g., expectation-maximization algorithm, etc.) and density-based clustering (e.g., DBSCAN, OPTICS, etc.).
- connectivity-based clustering e.g., single-linkage clustering, complete linkage clustering, average linkage clustering, etc.
- centroid-based clustering e.g., k-means clustering, etc.
- distribution based clustering e.g., expectation-maximization algorithm, etc.
- density-based clustering e.g., DBSCAN, OPTICS, etc.
- the clustering algorithm can consider temporal constraints and temporal distances in the query to avoid splitting two timestamps that can satisfy the query into more than one clusters. For example, if one temporal constraint is that time(happy)-time(lab) ⁇ 30, then we must merge two consecutive clusters containing "happy" and "lab". An example illustrating this embodiment is given below.
- Figure 12: Cluster-based match finding.
- Figure 13 (left): Algorithm steps for using a sliding window to check matches along time.
- a matrix is created for every timestamp of a matched query term, where each row corresponds to a query term and each column corresponds to a timestamp.
- the timestamps are sorted into ascending or descending order.
- Column by column, we first check the satisfaction of non-temporal terms at each timestamp, labeling those satisfied as 0's and those not satisfied as 1's to form a binary vector. Then, with this binary vector, we check the temporal constraints at the timestamps that are labeled as 1's. This embodiment is further explained in Section 2.2.5.
- the query satisfaction check can be done via encoding each query into a constraint satisfiability problem.
- Each search term will be represented as a variable.
- each term for the condition "any of these words" will be translated into the powerset (excluding the empty set) of the listed terms, at least one of which must be included in the query. For example, the term "any of these words: happy lab" will be translated into 3 expressions: "happy", "lab" and "happy, lab".
- each "all these words" or "exact word or phrase" condition will be translated into expressions, each of which is one term of the condition.
- Figure 14: The 3 embodiments of Algorithm 2. In practice, a combination of all search algorithms can be used.
- the preprocessing includes transcribing, translation, parsing, etc.
- an index is a mapping from queries to their corresponding results.
- the queries for building an index can be generated by computers automatically or taken from real queries that users input. By using an index, the results can be acquired without doing the search again.
- an index is usually stored in a data structure called a hash table. Using an index to speed up data fetching has been widely used in search applications, such as web search, database search, etc.
- the disclosed subject matter enables the software to build and update indexes for queries over Time-associated text information, time-associated information, and other information.
- the queries can be of different lengths using various operators.
- the indexes for temporal queries can be built and updated. Then, for each non-temporal query, we enumerate the combinations of basic temporal constraints, e.g., temporal distances between terms and the time elapsed for terms with respect to the beginning, end and middle of the media. Combinations of different non-temporal queries and temporal constraints on them form the temporal queries that users may enter, and hence are indexed.
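- A minimal sketch of such an index (the structure is an illustrative assumption): a hash table from each term to its (media, timestamp) occurrences, which already suffices to answer simple pairwise temporal-distance queries without rescanning the transcripts.

```python
from collections import defaultdict

def build_term_index(library):
    """library maps media_id -> list of (timestamp_s, word).
    The index maps each word to every (media_id, timestamp) where it occurs."""
    index = defaultdict(list)
    for media_id, words in library.items():
        for t, w in words:
            index[w.lower()].append((media_id, t))
    return index

def within(index, term_a, term_b, max_gap_s=10.0):
    """All (media_id, t_a, t_b) where the two indexed terms occur within max_gap_s."""
    by_media = defaultdict(list)
    for mid, t in index.get(term_b.lower(), []):
        by_media[mid].append(t)
    return [(mid, ta, tb)
            for mid, ta in index.get(term_a.lower(), [])
            for tb in by_media.get(mid, [])
            if abs(ta - tb) <= max_gap_s]
```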
- the media database used to build the index can be acquired in many ways. Content hosting websites, such as YouTube, Vimeo, Netflix and Hulu, can simply build an index for all their media contents.
- the indexer can find a YouTube video as long as it has the weblink to it.
- the indexer program can generate queries to media content providers to obtain the weblinks to all entities that belong to or are distributed through that provider.
- the indexer program can crawl by following the menus on the content provider's website, e.g., following menus of movies provided by Amazon Prime Video.
- the indexer program can leverage search engines by submitting queries to them (such as Google Video search) and following links in the search results.
- a weblink does not necessarily mean a URL specified in the HTTP protocol or a URI specified by the Android operating system. It can be any form of identification that points to a media entity.
- the index can be pre-built and updated on a plurality of levels.
- the index can be built for a particular user (e.g., Tom's files on all his computers, tablets and smartphones).
- the index can be built for a particular group of users (e.g., all the family members including the father, mother and kids; all students and teachers involved in a class).
- the index can be built and updated for the entire search engine, or a video content provider.
- the index can be built and updated for a local device (e.g., the karaoke machine, a computer, a smartphone).
- the index can be built and updated in a cloud-based architecture. It should be appreciated that different versions of the index can be stored, and recovery of an earlier version of the index is possible if needed.
- the search software begins with only one query term and gradually adds other query terms.
- the second term will be added into the search and the query satisfiability will be checked (e.g., within the sliding temporal window).
- new terms will be added gradually (e.g., one at a time, two at a time, or a plurality at a time) until the query is satisfied or until the query is dissatisfied. If the query is dissatisfied, the algorithm will begin again from the first term at its next match occurrence.
- the search results, even for dissatisfied/failed searches, could be logged for ranking.
- terms can be ranked based on the computational time complexity of searching for them (e.g., text searching is easier than recognizing objects in video frames), the likelihood that a match for the term can be found (e.g., there could be more matches for the word "government" than "telescope" in a video of a presidential debate), and other factors.
- the pseudocode below shows how to find exactly one match. Note that the pseudocode below does not use a data structure to log what combinations of matches have been searched. Hence it is slow, especially when we want to reuse it to search for multiple matches.
- break // go to add the next term; else, check next match of the term
- such a search process can be optimized using heuristic methods, including but not limited to the A* algorithm, unit propagation, DPLL, etc.
- the algorithms can also be used to find all matches, i.e., traverse all nodes on the search tree.
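- A hedged sketch of this term-by-term, backtracking assignment (the constraint callback stands in for whatever temporal constraints the query carries, and is assumed to reject invalid partial assignments early):

```python
def find_one_match(term_times, constraint):
    """Assign one timestamp to each query term, adding terms one at a time and
    backtracking to the previous term's next match when the constraint fails."""
    terms = list(term_times)

    def extend(i, partial):
        if i == len(terms):
            return dict(partial)
        for t in term_times[terms[i]]:
            candidate = partial + [(terms[i], t)]
            if constraint(candidate) and (found := extend(i + 1, candidate)):
                return found
        return None   # backtrack: caller tries the next match of the previous term

    return extend(0, [])

# e.g., require all chosen timestamps to lie within 5 seconds of each other
ok = lambda partial: max(t for _, t in partial) - min(t for _, t in partial) <= 5
print(find_one_match({"happy": [1.0, 20.0], "lab": [23.5, 40.0]}, ok))
# -> {'happy': 20.0, 'lab': 23.5}
```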
- Figure 15: An illustration of query matching using a search tree. The matches for query terms are visualized on the time axis, while the search tree is given below the time axis. On the search tree, unsatisfied combinations of timestamps are shown with void arrows and satisfied ones with solid arrows. Solid lines represent nodes visited in the search process, while dashed lines are for future visits. The illustration shows the search tree after checking "happy"@t2 and "lab"@t3.
- the search tree can be pruned.
- the checking of "happy"@t1 and "lab"@t5 is not necessary after we learn that "happy"@t1 and "lab"@t4 dissatisfies the query, because t5 is greater than t4 and thus if t4 cannot satisfy the temporal constraint, t5 cannot satisfy the temporal constraint either.
- the trace back is only one level up on the tree.
- More intelligent methods can be used to prune the search tree or prioritize the branches on the search tree, including but not limited to, conflict-driven clause learning (originated from solving Boolean Satisfiability problems), cut-set based, smart backtrack, A* search, etc.
- temporal constraints can be user-defined or set by default. For example, if the user inputs "happy lab" without specifying temporal constraints, the computer will use a default setting (e.g., a temporal constraint that the time interval between the first keyword and the second keyword is less than 10 seconds).
- Transcript of the media: "Let's take a look at how a volcano erupts. In Italy, over 300 volcano eruptions happen each year."
- Query: 2 words out of the words "Italy", "volcano" and "erupt"/"eruption", where the temporal distance between the 2 words is at most 2 seconds.
- Step 1: we run the search for each term (a match of each term is in bold font) and extract the timestamp associated with each word (in parentheses):
- Italy: Let's take a look at how a volcano erupts. In Italy (8.5s), over 300 volcano eruptions happen each year.
- Step 2: we find matches that satisfy the terms, with or without rules connecting them.
- the clustering condition is that no two timestamps in one cluster are more than 2 seconds apart.
- the last two clusters have large enough overlaps in both time and terms and therefore are merged as one cluster.
- each cluster satisfies the query. Because the user does not specify what time in each cluster to return, the earliest time of each cluster is returned, i.e., 5.9s and 8.5s. As an alternative, without breaking sentences, the time of the first word belonging to each of the 2 clusters is returned, i.e., 0.0s for "Let's" and 8.3s for "In".
- the two clusters can be ranked using some embodiments to be discussed as follows. For example, the 2nd cluster contains all 3 query terms while the first only has 1. If the only ranking criterion is coverage of query terms, then the 2nd cluster will be ranked as the 1st.
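- A sketch of the cluster-based grouping used in this worked example (the timestamps other than 8.5s and 5.9s are illustrative guesses, not values from the figure): timestamps are grouped with a single-linkage rule so that no gap inside a cluster exceeds the temporal constraint.

```python
def cluster_timestamps(events, max_gap_s=2.0):
    """events: list of (timestamp_s, term). Single-linkage style clustering:
    start a new cluster whenever the gap to the previous timestamp exceeds max_gap_s."""
    clusters = []
    for t, term in sorted(events):
        if clusters and t - clusters[-1][-1][0] <= max_gap_s:
            clusters[-1].append((t, term))
        else:
            clusters.append([(t, term)])
    return clusters

events = [(5.9, "volcano"), (6.3, "erupts"), (8.5, "Italy"),
          (9.8, "volcano"), (10.1, "eruptions")]
for c in cluster_timestamps(events):
    print(round(c[0][0], 1), sorted({term for _, term in c}))
# 5.9 ['erupts', 'volcano']
# 8.5 ['Italy', 'eruptions', 'volcano']
```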
- the smallest discretized timestamp interval is 0.1 second. It should be appreciated that other values can also be applied as the smallest discretized timestamp interval, such as 1 second.
- results of the search will be ranked.
- the flowchart including ranking is shown below:
- Figure 17: Flowchart of search with Time-associated text information, with ranking method.
- a query could return many results and they need to be ranked to present to users.
- Existing text search engines rank text search results, and this topic has been well studied in areas such as information retrieval for traditional search (e.g., website search). All of those reported algorithms can be applied in our applications for ranking search results.
- the ranking algorithms that can be used with our methods include, but are not limited to, Inverse Document Frequency (IDF), Term Frequency-Inverse Document Frequency (TF-IDF), Okapi BM25, cosine similarity between the TF-IDF vectors of two results, PageRank, the HITS algorithm, etc.
- the ranking is based on a score, denoted as the ranking score.
- the final ranking score is a function combining the values of factors in various ways, or a combination of those ways, including but not limited to summation, subtraction, multiplication, division, exponent, logarithm, sigmoid, sine, cosine, softmax, etc. Methods used in existing ranking algorithms may also be used, solely or as part of (including in joint use with) the ranking function.
- the ranking methods reported in the prior art for search (e.g., PageRank, Inverse Document Frequency, TF-IDF, etc.) may produce a primitive ranking score, which can be used as part of the input for calculating the final ranking score.
- the ranking function does not necessarily have to be expressed analytically; it could be a numerical transformation obtained or stored in many ways, including but not limited to a weighted sum of those factors, artificial neural networks (including neural networks for deep learning), support vector machines with or without kernel functions, or ensembled versions of them (e.g., via boosting or bagging, or specialized methods such as random forests for decision trees) or combinations of them.
- the function can be one transform, or a combination of a plurality of transforms.
- the computation of the final ranking score comprises existing ranking algorithms (e.g., existing algorithms can be used to generate the primitive ranking score) and our new methods (partly based on the power of searching on time-associated media).
- the temporal distances between terms in a query can be included in calculating the ranking score.
- the temporal distance can be defined as a function of many factors carrying temporal information.
- the temporal distance can be defined as the time difference between two timestamps, or the natural exponent of the time difference between two timestamps. Different distance measures can be used when calculating the time difference.
- the ranking score is reciprocally proportional to the variance of all the timestamps of the matched elements; here the temporal distance is defined as the variance of the timestamp differences.
- for example, suppose the query is "happy lab search" and the time constraint is that the terms must be no more than 5 seconds apart.
- the primitive ranking score calculated using various ranking methods can be divided by 1.04 to take the temporal distances into consideration.
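- One possible (assumed, not prescribed) way to combine the primitive score with the temporal spread, in the spirit of the 1.04 example above:

```python
import statistics

def final_ranking_score(primitive_score: float, match_timestamps: list) -> float:
    """A sketch: divide the primitive score by (1 + variance of the match timestamps),
    so matches whose terms are tightly grouped in time rank higher."""
    var = statistics.pvariance(match_timestamps) if len(match_timestamps) > 1 else 0.0
    return primitive_score / (1.0 + var)
```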
- the temporal distance can serve as an input for calculating the primitive ranking score; in another aspect, the temporal distance can serve as an input for calculating the final ranking score directly.
- the ranking score is weighted by the sigmoid of the temporal distance, which here is defined as the geometric average of the distances of all timestamps to their mean.
- the temporal distance can reflect other factors, too.
- the temporal distance between two matching words will be doubled if they appear in two different lines or sentences of the transcript (e.g., we may define a penalty term for having words in different lines or sentences).
- a new line is defined here by a temporal gap greater than an interval (e.g., > 2s). This takes into account the fact that most people usually pay attention to words that are temporally close to each other when watching videos.
- the temporal distance between two query words are amplified by their distance on the parsing tree, where the parsing tree distance is defined as the number of branches (a branch is also known as an edge in the graph theory) along the shortest path between the two words on the parsing tree.
- the temporal distance is signed with positive and negative.
- the temporal distance from “happy” to “lab” could have different signs than the temporal distance from "lab” to "happy” (denoted as negative).
- signed temporal distances can be calculated between each term pair.
- the (term1, term2) pair will have the opposite sign from the (term2, term1) pair, if the first term of the pair is defined as the leading term.
- the absolute values may also be used in addition to the signed values for calculating the ranking scores.
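- A minimal sketch of computing such signed temporal distances for every term pair of a match, assuming each matched term carries a timestamp in seconds and the first term of each pair is the leading term:

```python
from itertools import combinations

def signed_temporal_distances(match):
    """match: dict mapping query term -> timestamp (seconds) of its occurrence.

    Returns signed distances for each ordered term pair; the (term1, term2)
    pair has the opposite sign of the (term2, term1) pair.
    """
    distances = {}
    for a, b in combinations(match, 2):
        distances[(a, b)] = match[b] - match[a]
        distances[(b, a)] = match[a] - match[b]
    return distances

# "happy" at 12.0 s and "lab" at 14.5 s: (happy, lab) = +2.5, (lab, happy) = -2.5
print(signed_temporal_distances({"happy": 12.0, "lab": 14.5}))
```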
- the confidence level is a general term to describe the likelihood that one query element (or a combination of elements, e.g., a keyword) matches one element (or a combination of elements) in the search target, e.g., how closely a word in the user-inputted query matches a word in the transcript, or how confident the system is about recognizing an actor in a scene.
- the confidence level can depend on many factors, such as vectorized word distances for word-to-word matches, or object matching scores for figure-to-video-frame matches.
- the confidence level is a vector distance (including but not limited to the cosine of the angle or the Euclidean distance) between the two vectors representing the two words.
- the vector for a word can be obtained using methods such as Google Word2Vec or Stanford GloVe.
- methods such as HOG or ACF can be used to compute a score about how good or reliable the match is.
- not only can the confidence level itself be a factor in calculating the final ranking score, but it can also be used to affect other factors previously described in the disclosed subject matter, including the temporal distance.
- the confidence levels of all matches in one result are passed into a function, and the output of the function, denoted as the overall confidence score of this result, will be used as one input for the function to calculate the final ranking score. This process can be represented by the flowchart shown below.
- a match contains 3 words, "happy lab search", in one sliding window for the query "happy experimental find", where the confidence levels are 100% for "happy"->"happy", 80% for "lab"->"experimental", and 90% for "search"->"find".
- the overall confidence score may be defined as the average of them, thus, 90% for this 3-word match.
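- A minimal sketch of the averaging rule from the example above (other aggregation functions could equally be used):

```python
def overall_confidence(confidence_levels):
    """Average the per-term confidence levels of one match window."""
    return sum(confidence_levels) / len(confidence_levels)

# "happy"->"happy" = 1.00, "lab"->"experimental" = 0.80, "search"->"find" = 0.90
print(overall_confidence([1.00, 0.80, 0.90]))  # 0.9
```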
- the confidence levels can be phrase-based or N-gram based (e.g., with "happy lab" as one term), rather than unigram-based.
- the overall confidence score can serve as an input for calculating the primitive ranking score; in another aspect, the overall confidence score can serve as an input for calculating the temporal distance; in yet another aspect, the overall confidence score can serve as one of the inputs for calculating the final ranking score.
- the confidence levels of all matches will be passed into the function to calculate final ranking score directly.
- its confidence level is a function of factors of the match and of the terms in the match. Those factors include but are not limited to: media type, length of the media, the body in which the match is found (e.g., a producer-provided subtitle, a speech-to-text converted caption, audio or video), the order of the query terms in the match (e.g., finding "happy lab" where the term "happy" is before the term "lab"), and the source of the media (e.g., if an element involves searching a word in a user annotation, the confidence level for such a match is linked to the credibility and history of the annotator).
- the confidence levels of matches can be weighted, to calculate either the final ranking score or the overall confidence level, based on the type of media and the length of the media.
- the algorithm can have more confidence in a text match found in the lyrics of a short music video (MTV, karaoke, etc.) than in an object-recognition match found in a lengthy, blurry user-uploaded surveillance video.
- the user could set the search result to rely on video frame matches more than on text matches, i.e., video matches are given higher confidence weight than text matches.
- the user wants to find words matching a speech-to-text (STT) converted caption.
- the confidence level for matching a word in the query (inputted as text) and a word obtained by transcribing the media is the product of the vector distance between the vectors representing the two words and the confidence level of the speech-to-text conversion for the second word.
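- A minimal sketch of this product, assuming word embeddings (e.g., Word2Vec or GloVe vectors) are available as numpy arrays and using cosine similarity as the vector comparison; the function name is illustrative:

```python
import numpy as np

def match_confidence(query_vec, transcript_vec, stt_confidence):
    """Confidence for matching a typed query word against a transcribed word:
    cosine similarity of the two word vectors multiplied by the
    speech-to-text confidence reported for the transcribed word."""
    cosine = float(np.dot(query_vec, transcript_vec) /
                   (np.linalg.norm(query_vec) * np.linalg.norm(transcript_vec)))
    return cosine * stt_confidence
```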
- An extreme case for the confidence level is when the search is on professional annotation of the media, such as the X-Ray (TM) labels made by Amazon.com when streaming videos. For professional annotation, the confidence level should be significantly high, e.g., set to 100%.
- Order or time of occurrence is also a factor that should be considered when ranking results and for calculating the final ranking score.
- users are allowed to specify temporal constraints in the query (e.g., happy@25%+, meaning that the word "happy" appears in the first quarter of the media), while the search results may match the temporal constraints to different extents, which can be ranked.
- the present disclosed subject matter also uses the discrepancies between results and queries as a factor for calculating the final ranking score, which can be modeled as a function.
- Different temporal discrepancies can be used to calculate the final ranking score, and different types of temporal discrepancies should affect the final ranking score to different extents. Users can specify (e.g., by ordering) their tolerance to different temporal discrepancies.
- the final ranking score will be obtained by multiplying the primitive ranking score with the sigmoid of number of temporal discrepancies in the result.
- different types of discrepancies have different weights in calculating the final ranking score.
- the weight for discrepancies on one type of temporal constraint (e.g., a global constraint such as happy@25%+) is 2, while the weight for discrepancies on a local temporal constraint, e.g., time(happy) - time(lab) < 40, is 1.
- the order or time of occurrences of keywords can be used to calculate the final ranking score.
- the user could configure the search engine described in the disclosed subject matter to calculate the final ranking score with the temporal distance being considered. For instance, if the user prefers an earlier occurrence of the query keyword, he may check the "the earlier the better" checkbox supplied by the software GUI, and in this case the occurrence at 1/6 of the media will be ranked higher. Exemplified equations are shown below:
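- Since no equation survived extraction at this point, one illustrative (not prescribed) form, assuming the primitive score is scaled by how early the match occurs, is:

```latex
s_{\mathrm{final}} \;=\; s_{\mathrm{primitive}} \cdot \Bigl(1 - \frac{t_{\mathrm{match}}}{T}\Bigr)
```

where t_match is the timestamp of the match and T is the total media duration, so that, all else being equal, a match at T/6 is ranked above a match at T/2.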
- the ranking results, parameters (including weights) for ranking factors, and equations for calculating final ranking score can be refined based on user feedback, machine learning, crowd sourcing and/or user-interaction with the system. To customize and reorder the ranking, the updating process can be scoped for different scenarios.
- Possible scenarios comprise a genre of media work (e.g., in speeches, the text is more important than the video), a piece of media work (e.g., one chapter of an audio book or one scene of a movie), different fields of a media work (e.g., two queries using the same terms to search the same movie, but one on text and the other on audio, can have different rankings), a user, a group of users (e.g., a friend circle on social media), etc., or a combination thereof (e.g., a customized ranking for a group of users on all dramas produced during the 1990s).
- the feedback can be from one user, a group of users, a plurality of databases, recent trends, etc.
- User feedback can be extracted in multiple ways and various types of information can be collected.
- the most common way is to log which result the user clicks.
- when the user clicks a search result, he/she votes for that result.
- the most voted result should pop up. Indeed, over the long run, the ranking will be tuned more and more toward user votes. For example, if a sufficient number of users click the second result instead of the top result, the second result will be popped to the top.
- A more explicit way of soliciting user feedback is asking questions (e.g., "should this result be moved up or down?") or asking the user to manually reorder the search results.
- the order that a user clicks the results and the temporal interval between clicks are logged.
- the software will apply methods comprising pointwise approaches (e.g., RankBoost, McRank), listwise approaches (e.g., AdaRank, ListMLE), pairwise approaches (e.g., GBlend, LambdaMART), or combination thereof (e.g., IntervalRank using pairwise and listwise approaches).
- the ranking update problem can be modeled as a regression problem in the context of machine learning, e.g., finding a mapping from factors mentioned above to user votes.
- the ranking update problem can be solved through a classification problem in the context of machine learning, e.g., classifying whether the rank of each result is over-ranked or under-ranked.
- the classifier can be binary or multi-class.
- the machine learning algorithm can be supervised or unsupervised.
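- A minimal sketch of the click-feedback formulation described above, treating "did users vote for this result" as the label and the ranking factors as features; scikit-learn and the specific numbers are illustrative assumptions:

```python
from sklearn.linear_model import LogisticRegression

# One row of ranking factors per displayed result
# (primitive score, temporal distance, overall confidence, displayed rank).
X = [[12.3, 2.5, 0.90, 1],
     [10.1, 0.8, 0.95, 2],
     [ 9.7, 6.0, 0.60, 3]]
y = [0, 1, 0]  # 1 if users clicked (voted for) the result, 0 otherwise

clf = LogisticRegression().fit(X, y)
# Re-rank future results by the predicted probability of a user vote.
vote_probability = clf.predict_proba(X)[:, 1]
```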
- the results of very low ranking scores may not be returned/displayed to users.
- Time-associated text information such as transcripts, legends, captions, lyrics, and annotations may or may not be already available for search.
- speech-to-text conversion can be employed to extract the transcript.
- the speech-to-text conversion or automatic speech recognition (ASR) method comprises an algorithm/model, including but not limited to, Hidden Markov Models (HMM), Dynamic Time Warping (DTW), Neural Networks, Deep Learning Models.
- The basic pipeline, including text generation from audio, is shown below.
- Figure 20. Flowchart of search with time-associated text, including transcribing
[00375] 2.5 Translation
- Enabling translation of either the queries and/or the reference has several benefits; e.g., the user does not have to search in the native language of the media entity, or they can watch the media entity with captions in a different language.
- Both the query and the native transcripts can be translated. However, translating the native transcripts might be more accurate as it can take advantage of the context.
- Machine translation approaches including rule-based machine translation (RBMT) or statistical machine translation (SMT) can be used.
- RBMT systems include direct systems that use dictionaries to map input to output, transfer RBMT systems that employ morphological and syntactical analysis, and interlingual RBMT systems that employ abstract meanings of words. SMT systems can work at the word, phrase or sentence level.
- the translation can be done hierarchically according to syntax.
- Hybrid systems may use both RBMT and SMT systems.
- the translation can be done at different levels, including the word, phrase, clause, sentence and paragraph levels. Because the order of words of the same sentence may differ between languages, translating at the sentence level may have an added advantage.
- each sentence in the new caption will be displayed from the time at the beginning of the sentence in the native language to the time at the end of the sentence in the native language. It is also possible to chop the translated sentence into many parts and display them sequentially during that time interval.
- To determine when to display which word we can employ automated audio and transcript alignment algorithms, such as finite state transducer (FST) or time-encoded transfer. Therefore, the translated transcripts can be obtained for searching.
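- A minimal sketch of sentence-level translation that keeps the captions time-associated; the translate() helper is a hypothetical stand-in for whichever RBMT/SMT engine is used:

```python
def translate(sentence, target_lang):
    """Hypothetical stand-in for an RBMT/SMT (or neural) translation engine."""
    return f"[{target_lang}] {sentence}"

def translated_captions(native_captions, target_lang="en"):
    """native_captions: list of (start_time, end_time, sentence) in the native language.
    Each translated sentence is displayed over the same interval as the native
    sentence it came from, so the translated transcript stays time-associated."""
    return [(start, end, translate(text, target_lang))
            for (start, end, text) in native_captions]
```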
- Figure 21. Flowchart of search with time-associated text information with built-in translation
- Multi-channel sound. Many media entities contain more than one audio channel, for purposes such as stereo effects.
- the software will search the query in the transcript(s) of one or more channel(s), and show the result individually or collectively.
- An alternative approach is to first form a consensus from any, some, or all channels, and then run the search discussed above on the consensus of them.
- the consensus can be formed in many ways, in audio, text, or jointly using the information from more than one form.
- the consensus can be formed in many ways, including but not limited to averaging the signals from different channels, or detecting the phase shift between channels and then aligning the signals, etc.
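- A minimal numpy sketch of the second option, assuming two mono channel arrays of equal length; the lag estimate and sign convention are illustrative simplifications:

```python
import numpy as np

def channel_consensus(channel_a, channel_b):
    """Form a consensus signal from two audio channels: estimate the lag between
    them by cross-correlation, roughly align channel B to channel A, then average."""
    corr = np.correlate(channel_a, channel_b, mode="full")
    lag = int(np.argmax(corr)) - (len(channel_b) - 1)
    aligned_b = np.roll(channel_b, lag)
    return (channel_a + aligned_b) / 2.0
```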
- the problem of text alignment is well studied, with methods including the IBM method, Hidden Markov Models (HMM), the Competitive Linking Algorithm, Dynamic Time Warping (DTW), Giza++, etc.
- Safeguard. In one embodiment, the software can be used to screen for inappropriate words or sex-related words in the video. Inappropriate text or media content will be detected using the matching algorithms mentioned above. Then a "clean version" will be generated with the inappropriate content removed. For example, a high-pitch noise can be placed at the timestamps where bad words are found. As another approach, a new copy can be made with the time durations containing bad words removed.
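- A minimal numpy sketch of the first approach (overwriting flagged intervals with a high-pitch tone); the mono waveform layout and tone frequency are illustrative assumptions:

```python
import numpy as np

def bleep(audio, sample_rate, bad_intervals, tone_hz=1000.0):
    """Return a "clean version" of a mono waveform: each flagged interval
    (start_s, end_s), in seconds, is overwritten with a high-pitch tone."""
    clean = audio.copy()
    for start_s, end_s in bad_intervals:
        i, j = int(start_s * sample_rate), int(end_s * sample_rate)
        t = np.arange(j - i) / sample_rate
        clean[i:j] = 0.5 * np.sin(2 * np.pi * tone_hz * t)
    return clean
```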
- Time-associated text information (e.g., a transcript) may be pre-existing (e.g., closed captions provided by the producer).
- Text and media data can be aligned using many methods. For example, text and speech can be aligned using, including but not limited to, comparison in the audio domain by converting text to speech (such as the SyncTS system), universal phone models, finite state machine methods, time-encoded methods, dynamic programming, etc. If consensus between data of some domains (e.g., text and audio) cannot be reached, alignment with more domains (images, other data sources) can be the tie breaker.
- Music/audio management. In software/services such as iTunes, Google Music, Amazon Music and karaoke software, the music may also be arranged for searching based on matching the query with lyrics. As such, songs with the same words, phrases or sentences can be grouped together. For instance, when a query "hotel" is inputted into the system, all songs with "hotel" in the lyrics will show up under the search and can thus be categorized together. Furthermore, the search can enable synonym matching or fuzzy search if desirable. For instance, "inn" and "hotel" may be matched together as synonyms. This way, music or videos with the same "theme" or "context" can be grouped together.
- Search results can be presented to the users in multiple ways, including but not limited to, the original media entities, the most interesting (ranked by landmarks, which will also be discussed in later sections of this disclosure) segments (linear or nonlinear) of media, a montage of certain segments of media, text results of any form and lengths, or the combination of them. Ranks and scores of multiple matches will also be presented to the user, numerically or graphically, or the combination. The confidence level and ranking scores may also be presented to the user as the search results.
- conventional media management and navigation systems only offer limited functionalities: for instance, currently available management systems are based only on files, rather than on time-associated information. For instance, using a conventional system, a playlist can be created which contains a sequence of videos to be played; in this case, the playlist is based on different video files, ignoring the time-associated information.
- the user can navigate and manage the contents within the same video based on time-associated information and/or audio-associated information.
- the software can create a new type of data that stores shortcuts: a mapping relationship which maps from an index key to a timestamp in the media.
- the index key can be a key for text, images, audio, videos, other media or a combination thereof.
- the index key is similar to elements in a "table of contents" that is used for text contents, such as a PhD thesis.
- the index key will map a text key to a timestamp in a video.
- a plurality of index fields can form a table of contents for a video. For instance, a table of contents for the TV series "Friends" episode 1 can be formed as follows:
[00391] Figure 22. Index keys presented to the user based on the Videomark technology
- each text key maps to one unique timestamp or a plurality of timestamps.
- the software will play from the corresponding timestamp associated with the event that shows that Rachel decided to stay with Monica.
- index keys can be generated either manually by the user or automatically by the software. It should be further appreciated that the index keys can be generated by inputting a search query such as "Ross and Rachel" into the software, and the software will automatically generate the index keys.
- the video does not necessarily have to be segmented into video segments. Instead, the index keys can simply be the beginning and end timestamps for each desired video duration, without the video segmentation process.
- the table of contents can have fields that map to timestamps in a plurality of videos. For instance, in the table of contents shown below, the key "8. Rachel is going out with Paulo" maps to a timestamp in another episode (stored as a separate video file), when Rachel is going out with Paulo. Consequently, the user can conveniently manage and navigate through the content based on information embedded in the media, and easily jump to the corresponding timestamp within the right video file for media playing.
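- A minimal sketch of such an index-key ("Videomark") structure, assuming each key maps to one or more (media file, timestamp) pairs, possibly across several video files; the file names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class VideomarkIndex:
    # Each index key maps to one or more (media file, timestamp-in-seconds) pairs.
    keys: Dict[str, List[Tuple[str, float]]] = field(default_factory=dict)

    def add(self, key: str, media_file: str, timestamp: float) -> None:
        self.keys.setdefault(key, []).append((media_file, timestamp))

    def jump_targets(self, key: str) -> List[Tuple[str, float]]:
        return self.keys.get(key, [])

toc = VideomarkIndex()
toc.add("4. Rachel decided to stay with Monica", "friends_s01e01.mp4", 512.0)
toc.add("8. Rachel is going out with Paulo", "friends_s01e07.mp4", 803.5)
```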
- Figure 23 Index keys presented to user based on the Videomark technology. Multiple files are indexed.
- the table of contents presented to the user contains text keys with a hierarchical structure. For instance, an exemplified table of contents is shown below, where two levels of text keys are presented. The keys "4.1 Rachel checked out Monica's apartment" and "4.2 Monica mentioned that she got the apartment from her grandma" are hierarchically under the key "4. George decided to stay with Monica". By clicking on "4.1 Rachel checked out Monica's apartment", the software will jump to the corresponding timestamp and play from the segment where Rachel checked out Monica's apartment.
[00398] Figure 24. Index keys presented to the user based on the Videomark technology. Multiple files are indexed. Hierarchical structure is included.
- the index key can contain images as a representation for display to the user.
- the first key field contains an image, which may remind the user of the plot.
- the image key field maps to a timestamp that is associated with the image.
- the timestamp is temporally proximal to the time when the image is shown in the video.
- the image is user-specified and the timestamp is also user-specified.
- Figure 25 Index keys presented to user based on the Videomark technology. Multiple files are indexed. Hierarchical structure is included.
- the index key can include images, animations or video clips.
- the index key is a combination of image and text, as shown below:
- Figure 26 Index keys presented to user based on the Videomark technology. Multiple files are indexed. Hierarchical structure is included. The index key can include images, animations or video clips.
- the index keys can also be in formats other than text or images, when they are presented to the user.
- the index keys can be in the forms of video, animation such as GIF format, or audio clips.
- the index key can be a combination of video, animation, audio clips, text and images.
- the index key can be a GIF animation, which helps the user to understand the context.
- the index key can be a video clip. When the user places the mouse cursor on the video clip, the video clip will play (muted or with sound), so the user can preview the index key to determine whether it is what he/she wants.
- an index key can map to a plurality of timestamps.
- when the user clicks on one such index key, the user will be presented with a selection from a plurality of timestamps in the form of a menu. The user can therefore select the desirable timestamp from the menu.
- when the user clicks on a higher-level index key, such as "Friends meet each other in the Central Perk coffee shop", the user may be redirected to the lower-level table of contents enumerating multiple timestamps associated with the index key, as shown below.
- the user can therefore make a selection for the desirable timestamp.
- the different levels of the index key list can be expanded or collapsed, as needed. An example is given below.
- Figure 27. Index keys presented to the user based on the Videomark technology; when the user clicks on one such index key, the user will be presented with a selection from a plurality of timestamps in the form of a menu.
- an index key maps to a timestamp in a plurality of audio files, such as voice memos, audio recordings, podcasts, music files or karaoke soundtracks.
- the disclosed subject matter allows the user to manage and navigate within voice memos, podcasts and audio recordings.
- the disclosed subject matter allows the user to manage and navigate within music files. As such, the user can jump to the desirable timestamps within songs.
- the disclosed subject matter embodies karaoke player software that allows the user to navigate within the karaoke soundtracks to the timestamps where certain lines of lyrics are sung, thereby skipping the irrelevant parts of the soundtracks.
- index keys can map to the timestamps in video lectures or audio lectures.
- the students can also submit video or audio recordings as their homework answers to a problem; the teacher can then navigate through the answers in video/audio form easily using the methods in the disclosed subject matter.
- some of the index keys can map to a text block in a text file such as a PDF file.
- the index key can also map to a website.
- the index key can also map to a hashtag or tweet.
- the index key can also map to an image.
- An example is shown below. In this case, when the user clicks on "1. Law of reflection", the user will be redirected to the paragraph in an optics textbook (e.g., as a PDF file) which discusses the law of reflection. When the user clicks on "2. Demonstration of Law of reflection" or "4. Demonstration of Law of refraction", the user will be redirected to the corresponding timestamps in the video, as previously described.
- Figure 28 Index keys presented to user based on the Videomark technology. Some of the index keys can map to a text block in a text file
- the software may provide a graphical user interface that presents the relevant information to the user, so that the user does not have to jump between more than one software application when using text, video and audio contents.
- a representative user interface is illustrated below.
- the panel for index keys will present the index keys such as table of contents, as previously discussed.
- the Panel for Presentation of Results will present the videos (playing from the timestamp of choice) or present the text content (e.g., particular paragraphs in the textbook) to the user.
- the software has an embedded media player for playing audio/video, and a text editor/reader for presenting the text content. It should be appreciated that a website can also be presented in the panel for presentation of results.
- Figure 29 GUI presented to user based on the Videomark technology.
- the graphic user interface of the software is shown below.
- the software can show the user a recommendation such as "since you like ABCD, you might want to try EFGH". Advertisements may also be shown here, in text, photo, animation, video or a combination thereof. Shopping choices can also be presented here, partly based on the user's choice of index key and/or user history.
- Social media can also be integrated or interfaced with the software: for instance, the user can share part of the content with his/her friends via social media (e.g., "I just watched XYZ").
- Figure 30 GUI presented to user based on the Videomark technology. Additional panel for showing advertisements, recommendation or social media components is included.
- the software will generate the list of index keys based on user input or search query.
- the index keys may be the exact search results generated by the software discussed in section 2 of the disclosed subject matter (with corresponding timestamps); in another aspect of the embodiment, the index keys may be entities based on search results with user input.
- the software will integrate the aforementioned search functionality with the media management functionality. Thus, the software can help the user create a personalized table of contents based on the search query. It should be appreciated that the list of index keys can be generated either in a just-in-time fashion, or previously calculated, indexed and stored to speed up the query.
- a representative flowchart is shown below:
- Figure 32 flowchart for processes including user query, index key generations and presentation of relevant results
- the user may search for the term "Rachel" AND "Ross" as the query; the software will match the search query against the videos of "Friends", based on the transcript.
- the results will be ranked and presented to the user as a list of index keys, in an orderly fashion (e.g., starting from season 1 episode 1 to the last episode of season 10, sequentially).
- a result as shown below may be presented to the user, as follows.
- the software will play the video from the corresponding timestamp, so that the user can easily navigate through the show and watch it in a convenient way. It should be appreciated that movies, TV shows, video lectures, audio clips, voice memos, games, radio recordings and podcasts can all benefit from the disclosed subject matter in a similar way.
- the software will create a spoiler-free version of list of index keys.
- the spoiler-free version will omit some detailed plot points so that the user will not encounter spoilers for the media to be watched. For instance, with the show "Friends", the user may search for the term "Rachel" AND "Ross" as the query, and the software will generate the following spoiler-free version of the list of index keys.
- the user can click on an icon called "switch to full version list", and the user will be redirected to the version of the list of index keys containing spoilers, as shown previously.
- a warning such as "Are you sure you want to switch to the full version of the list containing spoilers?" can be displayed to solicit user confirmation.
- the spoiler-free version can be the default choice, to avoid revealing spoilers.
- the software will show the user the full version by default if the watch history is tracked and suggests that the user has watched the show.
- the software will generate a short video based on the user query and list of index keys.
- the short video is similar to a summary video or trailer. For instance, with the show "Friends", the user may search for the term "Rachel" AND "Ross" as the query, and the software will generate a short video containing short video segments associated with the index keys.
- each video segment associated with an index key may be a video starting from the timestamp of the index key and lasting for a duration of time of choice (e.g., 5-12 minutes after the timestamp).
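- A minimal sketch of this compilation step using the moviepy library (one possible choice, API as in moviepy 1.x); the segment length, file names and output path are illustrative assumptions:

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def summary_video(video_path, index_timestamps, segment_length=60.0,
                  out_path="summary.mp4"):
    """Cut a fixed-length segment starting at each index-key timestamp and
    concatenate the segments into one short compilation video."""
    source = VideoFileClip(video_path)
    clips = [source.subclip(t, min(t + segment_length, source.duration))
             for t in index_timestamps]
    concatenate_videoclips(clips).write_videofile(out_path)
```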
- each video segment associated with an index key may have a variable duration, where the duration is determined by other segmentation techniques based on artificial intelligence. As such, the user can view the shorter video compilation instead of the whole TV show. Upon different user queries, different video compilations can be generated. It should be appreciated that for each index key, a plurality of video segments (greater than 1) may be generated, based on user preference or default settings. A representative flowchart of this process is shown below:
- the list of index keys is presented to the user in the form of a glossary.
- each element of the glossary maps to a plurality of timestamps in media such as videos or audio clips.
- some elements of the glossary map to text blocks in textbooks, and other elements of the glossary map to a plurality of timestamps in media such as videos or audio clips.
- some elements of the glossary map to websites or hashtags.
- the glossary can have a plurality of hierarchical levels of index keys.
- the lists of index keys can be expanded or collapsed as needed, for user visualization.
- the user can enter a search query as a new element to be added to the pre-configured glossary. It should be appreciated that the glossary feature described here will benefit both educational and non-educational applications.
- the aforementioned methods can be used in medical fields, such as medical data search, recording, management and navigation.
- in medical records and data, the most common forms of data are timed readings (such as blood pressure at a given date), timed text (e.g., a physician's summary for a visit at a certain date), images (e.g., CT, MRI, ultrasound data at a given date), videos, etc.
- a medical training software can be embodied.
- the medical training software can navigate the user between video lectures, video for diagnostics, surgical demo videos, textbooks, websites, images, using the methods previously described in the disclosed subject matter.
- the aforementioned methods can be applied to the management of user-generated videos, such as videos taken on the phone or videos created in a social media application (e.g., Snapchat). Consequently, when the user queries the software, the Videomark technology will arrange and present the results to the user. For instance, if the query "Mike" is inputted, the software will generate a list of index keys based on the search results for "Mike", similar to the examples previously discussed. Thus the user can manage and edit the video contents conveniently.
- the list of index keys generated by one user can be shared to other users via social media applications such as Snapchat, Twitter, WeChat or Facebook.
- the videos being searched with the queries are media messages containing video (e.g., in Snapchat or WeChat), and the list of index keys based on the user query is built into the social media application. Consequently, the user can easily manage video messages in a social media application such as Snapchat.
- the software will search in the user-defined library, such as videos in local storage, video-containing messages, MMS, videos in cloud storage, or a combination thereof. It should be appreciated that dating social media applications can also benefit from the disclosed subject matter.
- the aforementioned methods can be used to manage adult videos.
- the adult videos can be user-generated or created by adult websites. For instance, when the query "take off" is inputted, the list of index keys will be generated based on the said query.
- the previously discussed "generating short video” feature can also be applied to adult video contents.
- the software can automatically generate a short summary video integrating video segments, based on user input/query.
- the media entity can be a game video or recording (e.g., in the game's own format).
- the videomarks hence can be events related with and specific to the game.
- the videomarks may include all times that a hero dies.
- the game video or replay will jump to place where a hero dies.
- the game does not have to be a video game.
- videomarks can be all touchdown moments in one season of NFL. When the user traverses all videomarks, he/she will enjoy all touchdown moments in this season.
- Video/Audio Editing [00443]
- conventional video/audio editing software is not very convenient to use. For instance, to segment a short clip from a longer video, one has to manually search along the time axis to define the starting and ending points of the segment. Also, it is sometimes very hard to quickly locate the right recorded event (e.g., if we look for the events when "Edison" is said in footage, it is not very convenient to search for them manually).
- video editing software can be developed with built-in search capability.
- the software can search for an event based on time-associated text information. For instance, when the word "Edison" is inputted into the software as the query, the software will search, rank and return to the user the results matching "Edison", i.e., the points where the word "Edison" is said in the video.
- the search and rank process is similar to the methods previously described in the disclosed subject matter. By clicking on a result, the software will take the user to the timestamp associated with the query (e.g., when the word "Edison" is said in the transcript).
- the user can easily define the starting point or end point of a video segment, without manually searching in the videos and dragging back and forth on the time axis.
- video editing functionalities known in the field of computer science can be enabled in the media editing software, such as divide, combine, segment, special effects, slow motion, fast motion, picture-in-picture, montage, trimming, splicing, cutting, arranging clips across the timeline, color manipulation, titling, visual effects, and mixing audio synchronized with the video image sequence.
- the conventional method of manually editing the media across the time axis is also supported by the media editing software described in the disclosed subject matter. A representative flowchart of this embodiment is shown below.
- Figure 35 flowchart including video editing functionalities.
- search results are presented to the user using the "Videomark technology" described in the disclosed subject matter.
- the user will be able to drag and drop the index keys, or realign the index keys along the timeline/time-axis for media editing, with GUI.
- a representative flowchart is shown below
- reserved words can be defined for automatic or semi-automatic segmentation of media.
- words such as "3, 2, 1, action" and "cut" can be defined as the starting point and end point, respectively.
- the occurrences of "3, 2, 1, action" and the associated timestamps will be automatically defined as the starting points of segments, and the occurrences of "cut" and the associated timestamps will be automatically defined as the end points of segments. Consequently, various common production terminologies can be incorporated into the software as reserved words.
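- A minimal sketch of this reserved-word segmentation over a time-associated transcript; the reserved words follow the example above, and the transcript layout is an assumption:

```python
def segment_by_reserved_words(transcript, start_word="action", end_word="cut"):
    """transcript: list of (timestamp, word) pairs in temporal order.
    Returns (start, end) timestamp pairs delimited by the reserved words."""
    segments, start = [], None
    for timestamp, word in transcript:
        token = word.lower().strip(",.!?")
        if token == start_word and start is None:
            start = timestamp
        elif token == end_word and start is not None:
            segments.append((start, timestamp))
            start = None
    return segments
```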
- the aforementioned methods can be applied to editing music and podcasts, including music soundtrack and music video.
- the media editing software embodied by the disclosed subject matter can run on any computing device, such as desktops, laptops, smartphones, tablet computers, smart watches, and smart wearable devices. It should be further appreciated that camcorders, cameras, webcams, and voice recorders with computing power can also run the media editing software. It should be appreciated that the media editing software can run locally, on the cloud, or in a local/cloud hybrid environment. It should be appreciated that the media editing software can be integrated with video streaming websites/providers and social media software applications.
- the software can enable video selfie (e.g., taking a video of oneself) capturing, processing, management and navigation.
- video selfie software can be developed with built-in search capability.
- the software can search for an event based on time-associated text information. For instance, when the word "Edison" is inputted into the software as the query, the software will search, rank and return to the user the results matching "Edison", i.e., the points where the word "Edison" is said in the video.
- the search and rank process is similar to the methods previously described in the disclosed subject matter. By clicking on a result, the software will take the user to the timestamp associated with the query (e.g., when the word "Edison" is said in the transcript).
- the user can easily define the starting point or end point of a video segment, without manually searching in the videos and dragging back and forth on the time axis.
- video editing functionalities known in the field of computer science can be enabled in the media editing software, such as divide, combine, segment, special effects, slow motion, fast motion, picture-in-picture, montage, trimming, splicing, cutting, arranging clips across the time axis, color manipulation, titling, visual effects, and mixing audio synchronized with the video image sequence.
- the conventional way of manually editing the media across the timeline is also supported by the media editing software described in the disclosed subject matter.
- the search results are presented to the user using the "Videomark technology" previously described in the disclosed subject matter. The user will be able to drag and drop the index keys, or realign the index keys along the timeline/time-axis for media editing of selfie videos.
- the video selfie software can facilitate batch processing of videos. Because a video contains a collection of image frames, the processing of a video is not as easy as the processing of individual selfie images. In one aspect of the embodiment, the processing can be done on one representative frame, and the settings will be applied to the other image frames automatically.
- the software comprises object recognition, object tracking and face recognition algorithms.
- regions of interest can also be user-defined (for instance, the user may draw a square via the GUI to define the ROI manually).
- features such as a human or a dog can also be identified using segmentation techniques. The software will detect the faces and track them across the frames. Consequently, similar settings can be applied to the processing of each face across the frames.
- Faces can be identified by face recognition algorithms. Algorithms for facial recognition include, but are not limited to, eigenfaces, fisherfaces, and local binary pattern histograms. Many open-source software libraries have off-the-shelf functions for facial recognition, including OpenCV.
- the software may show 3 selected frames to the user (e.g., a frame selected from frames 1-100, a frame selected from frames 101-130, and a frame selected from frames 131-200, respectively).
- a plurality of frames will be synthesized based on the characteristics of all frames. For example, a weighted algorithm can be implemented to generate a "representative" image for frames 1-100, frames 101-130, and frames 131-200, respectively.
- One possible method is to perform a region-based ensemble averaging (averaging each face after an affine transformation).
- Another method is to digitally synthesize an image with the faces of Jim, Mike, Lucy and Lily, with their faces having typical intensity values.
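- A minimal numpy sketch of the weighted ensemble averaging mentioned above, assuming the frames (or face regions) have already been aligned; the weighting scheme is an illustrative choice:

```python
import numpy as np

def representative_frame(frames, weights=None):
    """frames: list of aligned image frames as arrays of identical shape.
    Returns a weighted ensemble average as the "representative" image."""
    frames = np.stack([np.asarray(f, dtype=float) for f in frames])
    if weights is None:
        weights = np.ones(len(frames))
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    # Contract the weight vector against the frame axis (weighted average).
    return np.tensordot(weights, frames, axes=1)
```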
- the user only has to edit one image frame instead of a plurality of frames.
- the software will track the features and regions of interests across different frames.
- faces will be tracked.
- the software can use object recognition algorithms comprising: edge detection, primal sketch, recognition by parts, appearance-based methods, edge matching, divide-and-conquer search, greyscale matching, gradient matching, histograms of receptive field responses, large modelbases, feature-based methods, interpretation trees, hypothesize-and-test, pose consistency, pose clustering, invariance, geometric hashing, scale-invariant feature transform (SIFT), Speeded Up Robust Features (SURF), bag-of-words representations, 3D cues, artificial neural networks and deep learning, context, explicit and implicit 3D object models, fast indexing, global scene representations, gradient histograms, intraclass transfer learning, leveraging internet data, and reflectance
- the software will map the regions of interest, such as the faces, in the other image frames to the representative frame. Given that the user has already specified global and/or local operations/processing of the image, where a global operation applies to the whole image frame while a local operation applies only to a plurality of pixel neighborhoods, the software will map the ROIs in the other frames to the ROI in the representative frame. For instance, a one-to-one correspondence will be established between the different regions within the face of Lily in the representative image frame and the different regions within the face of Lily in the other frames. Consequently, the global and local operations that the user specified on the representative frame can be applied to the other frames. It should be appreciated that the operations on the other frames may not be identical to those on the representative frame, to accommodate the differences between individual frames.
- the flowchart is shown below
- the ROI such as the faces are divided into a plurality of regions.
- the software calculates the histograms within each region in the ROI of the processed representative image frame (after the global and/or local operations are applied). Subsequently, the software will also calculate the histograms within each region in the ROI of the other image frames. The software will then process the other image frames so that the histograms of the regions in the ROI become more similar to those of the processed representative image frame. It should be appreciated that the values and/or distribution of the histogram will be used for the calculation of the operations for the other image frames.
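- A minimal numpy sketch of pulling the histogram of a region in another frame toward the corresponding region of the processed representative frame, via simple quantile mapping (an illustrative choice of histogram-matching method):

```python
import numpy as np

def match_region_histogram(region, reference_region):
    """Adjust the pixel values of `region` so that its histogram resembles that
    of the already-processed `reference_region` (simple quantile mapping)."""
    flat = region.ravel()
    order = np.argsort(flat)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(flat))              # rank of each pixel value
    reference_sorted = np.sort(reference_region.ravel())
    # Map each pixel's rank to the value at the same relative rank in the reference.
    idx = ranks * (len(reference_sorted) - 1) // max(len(flat) - 1, 1)
    return reference_sorted[idx].reshape(region.shape)
```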
- the user may process one image with global and local operation.
- the user is processing a selfie photo so that she looks better in that image.
- the software may apply the aforementioned methods to batch process other photos of the user, so that she looks better on other photos without requiring her to go through editing each photo individually.
- the software will record the user's history of photo/video editing activities and automatically generate a template for editing photos with individualized settings.
- Machine learning algorithms may be applied to calculate the setting based on user history.
- the software will rely on crowd sourcing and record the editing activities of a plurality of users.
- the software will use machine learning algorithms such as deep learning to generate suggested template settings for the user, for photo/video editing. It should be appreciated that the software may use both local user history and crowd-sourcing/deep-learning results to generate the optimized settings.
- the query may be a plurality of particular words, phrases or sentences, etc., in any language.
- the user may type, say or use other methods (such as those mentioned in Section 2.1 earlier) to input the said queries to the computer. For example, when the user says "Pythagorean theorem" as a query in a geometry class recording video, the user will jump to the particular timestamp when the teacher mentions the phrase "Pythagorean theorem".
- the video segment may be presented to the user.
- the video may start to play at a timestamp when the said query word was uttered.
- the software will jump to the desirable timestamp based on the user preference. For instance, the user may opt to jump to the first timestamp where the query match is found, or alternatively the user may manually select the timestamp to jump to based on the results returned by the software.
- the search does not have to be limited to Time-associated text but all kinds of information, at all kinds of media forms, related with time or not.
- the search in the disclosed subject matter is a query search based on transcripts associated with the media of interest.
- the transcript-based query search can be performed using 3 steps illustrated below.
- the time-associated transcript may be provided by the content provider, e.g., close captioning or subtitles from movies.
- Upon user input of a plurality of queries and query searching based on transcripts, the software will jump to a particular timestamp where the said queries match the transcripts. It should be appreciated that if a plurality of matching results are found, the software can jump to a particular timestamp selected from the matching results, based on the user preference.
- a default setting can be implemented such that the default mode of operation of the software is to jump to the earliest timestamp; in another aspect of the default setting, the software can return a plurality of results to the user and the user can manually select the timestamp to jump to.
- Figure 36 Flowchart of media navigation/playing control
- a query of "happy lab" may be inputted into the software, and the software will determine a plurality of locations/timestamps where the said query occurs in the text information. Consequently, the user can locate the timestamps within the media that are matched with the query.
- the association algorithm may comprise a method selected from the group consisting of temporally mapping to, temporally being prior to, or temporally being after.
- the time constraint between the said timestamp and said query may be user-defined or automatic.
- the user can input "happy lab" in the search field "transcript", and input "5 seconds after" in the field "association method"; the software will return all timestamps that occur 5 seconds after the corresponding query match in the transcript. This way, the user can determine the correct timestamp.
- the user can input "happy lab" in the search field "transcript", and input "in the first half of video" in the field "association method"; the software will return all timestamps in the first half of the video that are associated with occurrences of the said query (e.g., where it was said by the presenter) in the transcript.
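- A minimal sketch of applying such association methods to the matched timestamps; the method names mirror the examples above and are illustrative:

```python
def associated_timestamps(match_times, media_duration, method="5 seconds after"):
    """Apply an association method to the timestamps (seconds) where the query matched."""
    if method == "5 seconds after":
        return [t + 5.0 for t in match_times]
    if method == "in the first half of video":
        return [t for t in match_times if t < media_duration / 2.0]
    return match_times  # default: the matching timestamps themselves

print(associated_timestamps([12.0, 300.0], media_duration=400.0,
                            method="in the first half of video"))  # [12.0]
```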
- the timestamp associated with the words/phrases/sentences in the search query (the association algorithm is described previously) is extracted, and the media stream then starts playing from the said timestamp.
- a confirmation may be prompted to the user before jumping to the new playing time.
- the user can use various methods to submit the query.
- the basic pipeline including transcribing audio into text is shown below (Figure 37).
- the software will search in the video transcript for the query, find the matching results and jump to the timestamp when the said query occurs.
- the software will search in the video transcript for the query, find the matching results and jump to the timestamp 5 seconds prior to when the said query occurs.
- Figure 37 Flowchart of media navigation/playing control using searching in audio-generated text
- Figure 38 Flowchart of media navigation/playing control with Time- associated text including translation
- All methods mentioned in Section 2, e.g., inputting and submitting a query, matching, ranking, and safeguarding, can be used here.
- Multiple results can be returned to the user with timestamps, previews (in all media, text, video snapshot, etc.), ranking scores.
- the search can be on mixed fields. The user has the freedom to select from one of the results.
- a plurality of the search results can be represented to the user by a GIF animation, which helps the user to understand the context.
- a plurality of the search results can be represented to the user by video clips. When the user places the mouse cursor on a video clip, the video clip will play (muted or with sound), so the user can preview the result to determine whether it is what he/she wants.
- a representative graphic user interface containing the search results is shown in Fig. 39.
- Figure 39 An exemplified GUI for the software.
- the user put in the query, and view 4 search results.
- the cursor is moved onto the video clip window (e.g., Timestamp 1)
- the clip will automatically play in the clip window for preview.
- the video will play full screen from that particular timestamp.
- End point search. In yet another embodiment, a text-based search will determine an ending point of the clip to be played (part of the video).
- the search and matching methods are essentially similar to those in the aforementioned embodiments. Differently from the starting point, here the end point is being specified.
- the pipeline is illustrated below.
- Figure 40 Flowchart of media navigation/playing control with end point search
- At least 2 queries will be inputted into the software, with the first query representing the starting point for the clip, and the second query representing the end point of the clip to be searched for.
- the user may specify a query "happy” in the field "starting word”, and specify a query "lab” in the field "ending word”.
- the software will return a plurality of video clips with the corresponding starting word and ending word.
- the user may specify additional information such as the length of clips as another field for the search to narrow down possible clips.
- the user may also manually select the clips of their desire based on the search results.
- the user can also use a composition of the "starting word" and "end word" in one query, such as "starting with happy and ending with lab", which can be inputted through voice commands known in the field of the computer industry, or through text commands.
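- A minimal sketch of pairing starting-word and ending-word occurrences into candidate clips; the pairing rule and the optional length filter are illustrative assumptions:

```python
def find_clips(start_times, end_times, max_length=None):
    """Pair each occurrence of the starting word with the next occurrence of the
    ending word; optionally discard clips longer than max_length seconds."""
    clips = []
    for s in start_times:
        later_ends = [e for e in end_times if e > s]
        if later_ends:
            e = min(later_ends)
            if max_length is None or e - s <= max_length:
                clips.append((s, e))
    return clips

# "happy" said at 10 s and 200 s; "lab" said at 25 s and 230 s
print(find_clips([10.0, 200.0], [25.0, 230.0], max_length=60.0))
```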
- Graphic user interface. An exemplified GUI of the software is shown below.
- "the brain" is submitted by the user as the query, and a list of matching results is returned. Both the timestamps and the transcripts containing the matches are shown to the user. The user can select any of the matched results, and the software will play the video from the selected timestamp.
- FIG. 41 Graphic user interface for the software: an example
- the voice box described in the disclosed subject matter can be made available for a variety of communication methods, such as telephone, mobile phone, online audio/video chat (e.g. Skype video messages, Google Hangouts, WhatsApp, WeChat voice message), Voice over IP (voice being transmitted over the Internet or any kind of packet-switching networks such as Bluetooth scatternet), voice mail messages, answering machines, etc. It should be further appreciated that this disclosed subject matter should not be limited to voice or audio calls, but also messaged-based communication, such as text messages.
- our disclosed subject matter allows users to rapidly locate the voicemails containing the query of interest inputted by user.
- the basic processing pipeline for the Time-associated text search in voicemails is illustrated below.
- the software will take a plurality of user-inputted queries, match the queries with the time-associated text information in the media based on the text domain to locate the relevant voicemails (e.g., based on transcripts), and return the relevant voicemails to the user.
- the user may choose to play the voicemails based on search results. Either the whole voicemail or a segment of recording within the said voicemail containing the said query word can be presented to the user.
- the query can be a plurality of words, phrases or sentences.
- the search is not limited only to the transcript of the voicemail, but also all types of text information, such as the caller name or phone number.
- a new number not recognized by the current phonebook will trigger a search on the internet for such a number, and relevant caller information will be fetched for searching and ranking purposes.
- Ranking can be done with different user preferences. In one aspect, the ranking can be done based on the date and time of the call; in another aspect, the ranking can be done based on caller ID (recognized number or new number; the group the number belongs to, e.g., family, friends, etc.).
- Figure 42 Flowchart for searching text query in text information in voicemails
- upon completion of the aforementioned search process, the software will play the relevant voicemails beginning from the timestamps associated with the queries. For instance, if the query "happy lab" is inputted, the software will match the query with the voicemails based on time-associated text such as the transcript, and play the media from the points matching the query.
- a query of "happy lab" may be inputted into the software, and the software will determine a plurality of locations where the said query occurs in the transcripts of the voicemails. Consequently, the user can locate the timestamps within the voicemails that are associated with the said query.
- the association algorithm may comprise a method selected from the group consisting of mapping to, being prior to, and being after.
- the time intervals between the said timestamp and said query may be user-defined or automatic. For instance, the user can input "happy lab" in the search field "transcript", and input "5 seconds after" in the field "association method"; the software will return all timestamps that occur 5 seconds after the said query in the voicemail transcript. This way, the user can determine the correct timestamp in the voicemail. It should be appreciated that segments of the voicemail containing the query may also be generated and presented to the user.
- One aspect of the embodiment is illustrated in the flowchart below:
- Figure 43 Flowchart for playing voicemails based on matching locations
- Videomark technology described in section 3 can also be used here for voicemail search and management.
- the user may input the query using voice recognition methods known in the field of computer industry, including but not limited to, Hidden Markov Models (HMM), Dynamic Time Warping (DTW), Neural Networks, Deep Learning Models.
- The voice input may be converted to the text domain for further search.
- a flowchart for the software is illustrated below:
- voice input query search and voicemail playing, where the searching and matching of the query is done in the audio domain (e.g., on sound waveforms, or a transform of the sound waveforms such as a spectrogram) instead of the text domain (e.g., the transcript).
- joint audio-text domain searching and matching can be performed.
- a multi-tier searching process can be performed, where the searching and matching are performed on the text domain first, and then on the audio domain.
- the disclosed subject matter will automatically categorize or classify voicemails and incoming calls into different categories, including identifying spam voicemails.
- the classification can be done in the text domain (based on time-associated text information such as the transcript), the audio domain (e.g., sound waveforms, or any transformation of the waveforms), or a combination thereof.
- a rule-based approach, a statistical approach, or a combination thereof may be used.
- category names can be used as predicate names.
- important(1234567891) means that any call from the number 1234567891 is important, and spam(8001234567) means that the number 8001234567 is a spam number.
- the rule "any 800 number is a spam unless it is tagged as not spam by the user" and a user-specified rule ("8003310500 is not a spam" - AT&T's customer service number) can be expressed as shown below.
- [00536] the unless-clause means "the number X is not known to not be a spam".
- if a user manually tags a number X as not spam, we create a literal -spam(X) in the knowledge base.
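- Since the rule expressions themselves did not survive in the text above, here is a hedged procedural rendering of the stated rules; the function and default whitelist are illustrative, not the disclosure's own notation:

```python
def is_spam(number, user_not_spam=frozenset({"8003310500"})):
    """Default rule: any 800 number is spam unless the user has tagged it as not spam."""
    if number in user_not_spam:       # user-asserted literal -spam(X)
        return False
    return number.startswith("800")   # the "any 800 number is a spam" default

print(is_spam("8001234567"))  # True  -> treated as spam
print(is_spam("8003310500"))  # False -> user-tagged not spam (AT&T customer service)
```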
- some training samples will be used to initialize the classifier, which can be provided by the service provider via manual annotation.
- the carrier will create a feature vector, which contains information about the call, such as the length of the call, the time of the call, the calling frequency, how many people received calls from this number, and natural language features from the transcript of the call if it is a voicemail.
- the carrier will label whether the caller is a spam caller or not. For spam callers, such information can be obtained in many ways, such as user complaints. Then, a classifier can be trained to recognize spam callers. In one embodiment, a caller not labeled as spam is not necessarily not a spam caller - it could be a false negative.
- the classifier can treat callers not labeled as spam as unlabeled data and train itself accordingly.
- the classifier can be updated when more data, especially those submitted by users become available. Users also tag voicemails or calls into different categories and such tagging can be used to update the classifier using machine learning approaches known in the field of computer science.
- additional information can be provided to the users to help them decide and tag whether a caller is spam, such as how many people this caller has called over the past 24 hours, and the geographical variance of the destinations of the caller.
- a user has the option not to share his/her tags with the carrier or allow the carrier to use his/her call records to update the system-wide classifier, for reasons such as privacy concerns. In that case, the phone can use the user's tags to update the classifier locally, only for that user.
- the statistical approach can have different modelings.
- the statistical approach can be modeled as a regression problem where a score of spam likelihood will be outputted, instead of a binary decision, spam or not.
- Algorithms to train a statistical classifier or regressor include but are not limited to Support Vector Machine, Decision Tree, Artificial Neural Network, etc.
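- A minimal sketch of this training step is given below, assuming scikit-learn is available; the call-record fields and sample data are hypothetical, and any of the classifiers listed above could be substituted for the SVM:

```python
# Minimal sketch: training a spam-call classifier from call-record features.
# Assumes scikit-learn; field names and sample data are hypothetical.
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

calls = [
    {"length_s": 35, "hour": 14, "calls_per_day": 900,
     "transcript": "free cruise offer act now", "spam": 1},
    {"length_s": 120, "hour": 10, "calls_per_day": 3,
     "transcript": "hi it's mom please call me back", "spam": 0},
]

# Natural language features from the transcript plus numeric call features.
text_vec = TfidfVectorizer()
X_text = text_vec.fit_transform(c["transcript"] for c in calls)
X_num = np.array([[c["length_s"], c["hour"], c["calls_per_day"]] for c in calls])
X = sp.hstack([X_text, sp.csr_matrix(X_num)])
y = np.array([c["spam"] for c in calls])

clf = SVC(probability=True).fit(X, y)   # retrain/update as user tags arrive
```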
- ensemble learning approaches such as boosting or bagging can be used to boost the performance of classifiers/regressors.
- this problem can be solved using Deep Learning methods, such as Deep Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks, etc. Through a Deep Learning approach, the step of feature vector construction can be greatly reduced. For example, the natural language features from voicemails may not need to be extracted but can be fed into the deep learning model as raw data.
- the language features to train the classifier include n-grams, structure features, semantic dimensions, etc. Other data features, such as calling time, can also be used to train the classifier.
- the value of X determines the likelihood that this caller is also spam to the user.
- the distance from the user to a friend on social network can be defined in various ways, such as how often they communicate, how frequently are they in the same picture, etc.
- the voicemail/call categorization system can use text information, such as the transcripts or caller ID, and non-text information, such as the number, time and duration of the call, separately or collectively.
- a received voicemail/call will be classified in both voice and text domains.
- a speech-to-text conversion converts audio signal to text.
- Classifiers in the voice and text domains will both make decisions on the voicemail/call.
- the results from both classifiers will be fused via majority vote and presented to the user.
- the user can manually apply tags and those tags can be used to update classifiers.
- system-wide classifier can also be trained and applied. For example, a frequent spam caller will be labeled as spam for all users.
- Information fusion algorithms will be used to make judgment when different classifiers disagree, using methods including but not limited to, maximum entropy method, JDL/DFIG model, Kalman filter, Dempster-Shafer algorithm, central limit theorem, Bayesian networks, etc.
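- A minimal sketch of the simplest fusion rule, majority vote, is shown below; the classifier outputs are hypothetical placeholders, and any of the fusion algorithms listed above could replace the vote:

```python
# Minimal sketch: majority-vote fusion of voice-domain and text-domain decisions.
# The classifier outputs below are hypothetical placeholders.
from collections import Counter

def fuse_majority(decisions):
    """Return the label predicted by the most classifiers (ties broken arbitrarily)."""
    return Counter(decisions).most_common(1)[0][0]

voice_decision = "spam"      # from the audio-domain classifier
text_decision = "not_spam"   # from the transcript-domain classifier
user_tag = "spam"            # an optional user tag treated as another vote

print(fuse_majority([voice_decision, text_decision, user_tag]))  # "spam"
```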
- the classifiers, modeled as either multiple uni-class classifiers or a multi-class classifier, can be trained using architectures including but not limited to, Naive Bayes classifier, Hidden Markov Model (HMM), Artificial Neural Network (ANN), Support Vector Machine (SVM), and Deep Learning.
- Figure 47 Flowchart of voicemail/call classification. Note that the classification can be done in the voice or text domain only. Note that this approach also works when the problem is modeled as regression, which does not give a binary decision on whether a call/caller is spam but rather a likelihood.
- the software will combine local information with information from other databases (e.g., online databases of known spam callers/telemarketers) to form the database for spam filtering.
- Audiobooks will benefit greatly from the disclosed subject matter.
- the user will be able to advance to the chapters and timestamps where certain queries occur. For instance, when the query "Steve Jobs" is inputted, the software can accurately locate the chapters where the query occurs (e.g., "Steve Jobs" is mentioned in the audio).
- the user will be able to play the audio segments where the queries occur. Similarly, both starting and ending points of the audio clips may be specified.
- the search in the transcript will also enable the user to manage the audiobook chapters or audio segments in a convenient way by topics, similarities and contents, within one audiobook or across a plurality of audiobooks.
- podcasts can use the aforementioned features and technology described in the disclosed subject matter.
- Podcast in the form of video or audio-only can all benefit from the disclosed subject matter.
- recording of radio programs can also benefit from the disclosed subject matter, in a similar fashion.
- a flowchart representing one aspect of the search technology applied in audiobook, podcast and radio recordings based on Time-associated text information is illustrated below.
- the software will allow the user to play the audiobook, podcast, or radio recording from the timestamp associated with the query words.
- Figure 48 Flowchart of search in audiobook (including any audio- focused media entity, such as podcast and radio.)
- the user may select the segment he/she prefers so the audiobook/podcast/radio will play from the timestamp of choice.
- results will be ranked before they are returned to the user:
- Figure 49 Flowchart of search in audiobook (including any audio- focused media entity, such as podcast and radio.) with result ranking.
- the software will first transcribe the audio information to generate the transcripts.
- the software will play the segment from the timestamp associated with the query matching locations, as illustrated in the flowchart below:
- Figure 50 Flowchart of search in audiobook (including any audio- focused media entity, such as podcast and radio.) with transcribing and result ranking.
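- As an illustration of this transcribe-search-rank-play pipeline, a minimal sketch is given below; the transcript format and the temporal-order ranking heuristic are assumptions, not the required implementation:

```python
# Minimal sketch: search a timestamped transcript and rank matching segments.
# The transcript structure and ranking heuristic are illustrative assumptions.
transcript = [
    (12.5, "Steve Jobs returned to Apple in 1997"),
    (300.0, "the Macintosh team worked around the clock"),
    (1250.0, "Steve Jobs was fired from Apple in 1985"),
]

def search(query, transcript):
    """Return (timestamp, line) pairs containing the query, earliest first."""
    query = query.lower()
    hits = [(t, line) for t, line in transcript if query in line.lower()]
    return sorted(hits)   # rank by temporal order (one possible ranking)

for timestamp, line in search("steve jobs", transcript):
    print(f"play from {timestamp:.1f}s: {line}")
```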
- Another flowchart representing another aspect of audiobook/podcast/radio-recording search based on transcript is shown below.
- voice input with speech recognition may be used to guide the navigation through the chapters in the audiobook.
- user may say "Jump to when Steve Jobs was fired", and the software will do so to play from the timestamp when Steve Jobs was fired in the audiobook; in another example, user may say "play the audiobook until Zuckerberg bought the Oculus”.
- artificial intelligence may also be integrated into this software. For instance, the user can say "play the audiobook until my coffee ordered from Amazon is delivered", and the software will play the audiobook until the coffee ordered on Amazon is delivered, as indicated by the system notification.
- Such integrated intelligence can be realized on desktop, laptop, tablet computers, smartphones, smart systems (e.g., Amazon Alexa) or other embedded systems.
- Figure 51 Flowchart of text-based audiobook (including any audio-focused media entity, such as podcast and radio) search comprising voice input and audiobook play.
- the queries that the user inputs can be used for advertisement purposes, independent of the media entity or jointly with the content of the media entity. For instance, if the audiobook is supplied for free, at the end of each chapter the user may be required to listen to an audio advertisement.
- the podcast advertisement may be based on the user queries for targeted advertising.
- the user queries and user feedback to advertisements can be used for crowdsourcing and machine learning for targeted advertising.
- All methods (e.g., inputting and submitting a query, matching, ranking, and safeguarding) mentioned in Section 2 can be used here.
- Multiple results can be returned to the user with timestamps, previews (in all media: text, video snapshots, etc.), and ranking scores.
- the search can be on mixed fields.
- the user has the freedom to select from one of the results.
- the audiobook, audio recordings and podcasts can also benefit from the Videomark technology described in section 3.
- the audiobook, audio recordings and podcasts can be managed and presented to the user using the Videomark technology.
- the user can query the software using the "TymeTravel" syntax/methods described in the disclosed subject matter.
- the textbook information can be linked to the relevant timestamps in the videos based on media-associated information such as the transcript of the video. For instance, when the student is reading the textbook where "Compton scattering" is mentioned, a hyperlink can be placed in the textbook linking to the corresponding timestamp where "Compton scattering" is discussed.
- the media can be a combined package of entities consisting of videos and books. As such, the query inputted by the user will be searched in all the media entities in the package. For instance, when the query "Compton scattering" is inputted by the user, the corresponding matching results in both textbooks (locations where the query keywords occur) and in videos (the timestamps associated with the query) will be ranked and returned to the user.
- Each occurrence may mean a word, a phrase, a clause or a sentence.
- the mapping from textbook block to transcript occurrence may be stored in a database or other data structures for user to look up and query, so that the query speed is accelerated.
- One block in the textbook may match multiple occurrences in the transcript of the media, and one occurrence in the transcript and other Time-associated text information may also be matched to multiple blocks in the text, with different ranks and matching scores.
- the ranking algorithm is similar to the methods previously described in section 2.
- the matching system between the textbooks and the videos can be essentially considered as a multiple-to-multiple mapping system, mathematically.
- all textbook blocks matching the part of the transcript and other Time-associated text information under playing will be presented to the viewer, along with their locations in the textbook, ranks and matching scores.
- the user may preview this textbook information in a thumbnail window within the video player, in a picture-in-picture format.
- the users can visit any or all matches found, automatically, manually or semi-automatically.
- the matching and ranking can be done using approximate/fuzzy matching to find the matches of phrases of least distance (a string metric for measuring the difference between two sequences), which can be defined in many ways, such as edit distance, Levenshtein distance, etc.
- matches can be found using topic modeling, where the distance between a transcript line and a textbook sentence can be computed based on their topic vectors. Also, much additional information can be used to help here. For example, locations cited in the index of the textbook will have a higher rank. All text matching algorithms (including but not limited to, naive string searching, the Rabin-Karp algorithm, finite-state automaton search, the Knuth-Morris-Pratt algorithm, the Boyer-Moore algorithm, dynamic programming-based string alignment, the Bitap algorithm, the Aho-Corasick algorithm, and the Commentz-Walter algorithm) can be used here individually or collectively in any combination. The matching and ranking methods described in previous sections of this disclosed subject matter can be applied for this purpose.
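- A minimal sketch of such fuzzy matching between a textbook block and timestamped transcript lines is shown below; the standard-library similarity ratio stands in for any of the distances above, and the data is hypothetical:

```python
# Minimal sketch: fuzzy-match a textbook block against timestamped transcript lines.
# difflib's ratio stands in for edit/Levenshtein distance; the data is hypothetical.
from difflib import SequenceMatcher

textbook_block = "Compton scattering is the inelastic scattering of a photon by a charged particle"
transcript = [
    (845.0, "today we discuss Compton scattering of photons by electrons"),
    (910.0, "next we move on to the photoelectric effect"),
]

def best_matches(block, transcript, top_k=2):
    """Score each transcript line against the textbook block and return the best ones."""
    scored = [(SequenceMatcher(None, block.lower(), line.lower()).ratio(), t, line)
              for t, line in transcript]
    return sorted(scored, reverse=True)[:top_k]

for score, t, line in best_matches(textbook_block, transcript):
    print(f"{score:.2f}  {t:.0f}s  {line}")
```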
- the search can include multiple fields.
- the student can search a keyword in the textbook and search a particular instrument in the video.
- the textbook itself can be considered to contain multiple fields, such as the main text body, section head, sidenotes, footnotes, examples, homework, and even the note that the student takes by him-/herself.
- the mapping between text blocks in the textbook and transcript blocks in the videos can also be done in a reversed way by first extracting blocks from the transcripts and then constructing the mapping from transcript blocks to their matching occurrences in the textbook.
- the presentation can also be done in a reversed way that when the user selects part of the textbook, the corresponding parts in the media, found through searching for matches in the transcript, are shown with ranks and scores. And the user can play any or all matches in the media.
- the textbook structure can provide navigation in video watching, or vice versa. For example, once the mapping between blocks in the text and those in the transcript and other Time-associated text information is established, the hierarchy of the textbook can be transferred to different segments of the video. Video segmentation can be done using text-based segmentation methods mentioned above, pure audio/video segmentation methods, or simply by transferring delimiters from the textbook hierarchy to the video. An illustration of textbook to transcript mapping is shown below:
- Figure 52 Illustrating how a segment of speech is converted into text transcript and then corresponding part in the textbook is identified and highlighted.
- the matching and linking between the textbook and the audios/videos can also be enabled by the "Videomark" technology described in section 3 of the disclosed subject matter.
- the Videomark system can list all the organization of the information in the textbook and in the video.
- a glossary can be generated and create links to the corresponding timestamps in the audios/videos and the corresponding text blocks in the textbook files.
- matching/mapping between the first media and a second media can also be done using methods substantially similar to the aforementioned aspects and embodiments. All methods, e.g., inputting and submitting query, matching, ranking, and safeguarding, mentioned in Section 2, can be used here. Multiple results can be returned to the user with timestamps, previews (in all media, text, video snapshot, etc.), ranking scores.
- the search can be on mixed fields. The user has the freedom to select from one of the results.
- the user can query the software using the "TymeTravel" syntax/methods described in the disclosed subject matter.
- the search methods in the disclosed subject matter can go beyond time- associated media and to be applied to any time-associated data.
- the media could be a multi-channel physiological time series that has timed annotations, while the query is a medical condition in which a plurality of symptoms occur with a time-relevant relationship.
- for example, a generalized tonic-clonic seizure (GTCS) has two phases, tonic and clonic, that are about 10 to 20 seconds apart.
- a computer algorithm or a human being can recognize and annotate tonic and clonic phases from physiological time series, including but not limited to, electroencephalogram (EEG), electrocardiogram (ECG), electromyography (EMG), gyroscope, and accelerometer; the search algorithm will then search both phases on multiple channels and set the window between these two phases to about 20 seconds, in order to automatically alarm that a GTCS happened to the subject.
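- One way this two-phase, windowed search could look in code is sketched below, under the assumption that the phase annotations are simple timestamped labels:

```python
# Minimal sketch: detect a GTCS pattern from timestamped phase annotations.
# The annotation format and window values are illustrative assumptions.
annotations = [
    (100.0, "tonic"),    # seconds from recording start, annotated phase
    (112.0, "clonic"),
    (500.0, "tonic"),
]

def find_gtcs(annotations, min_gap=10.0, max_gap=20.0):
    """Return (tonic_time, clonic_time) pairs where clonic follows tonic within the window."""
    tonic = [t for t, label in annotations if label == "tonic"]
    clonic = [t for t, label in annotations if label == "clonic"]
    return [(t1, t2) for t1 in tonic for t2 in clonic if min_gap <= t2 - t1 <= max_gap]

for t1, t2 in find_gtcs(annotations):
    print(f"possible GTCS: tonic at {t1}s, clonic at {t2}s -> raise an alarm")
```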
- a medical record software can be made using the "Videomark” methods described in the disclosed subject matter.
- the different segments of the sensor data such as EEG can be organized.
- the user can query the software using the "TymeTravel” syntax/methods described in the disclosed subject matter.
- the media can be stock indexes and the time- related annotation is the events that happen as time advances.
- the search query can be "when S&P500 drops at most 100 points while NASDAQ drops at least 50 points". The two events, "S&P500 drops at most 100 points" and "NASDAQ drops at least 50 points", are timed with the stock index log.
- data types that can benefit from the disclosed subject matter include: stock market data, foreign exchange rates, commodity prices, bond prices, interest rates, cash flow, market cap, etc.
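- A minimal sketch of such a timed query over an index log is shown below; the log format and point changes are hypothetical sample data:

```python
# Minimal sketch: find timestamps where two timed index events co-occur.
# The log format and point changes are hypothetical sample data.
index_log = [
    # (timestamp, S&P500 change in points, NASDAQ change in points)
    ("2016-01-15T10:00", -80, -60),
    ("2016-01-15T10:01", -150, -55),
    ("2016-01-15T10:02", -20, -10),
]

matches = [ts for ts, sp_change, nasdaq_change in index_log
           if -100 <= sp_change < 0      # S&P500 drops at most 100 points
           and nasdaq_change <= -50]     # NASDAQ drops at least 50 points

print(matches)  # ['2016-01-15T10:00']
```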
- a financial analysis software can be made using the "Videomark” methods described in the disclosed subject matter.
- the different segments of the financial data such as stock prices can be organized using the "Videomark” technology.
- the user can query the software using the "TymeTravel” syntax/methods described in the disclosed subject matter.
- Landmark extraction and annotation-driven media watching Various types of landmarks, such as entities (e.g., trademarks, names, brands), keyphrases (e.g., "a tragic accident"), and emphases (e.g., "I would like to explain again") can be extracted from the transcript to provide guidance for users. Landmarks can be user-defined/annotated or learned by the machine.
- named entity recognition can extract all important names from the transcript.
- rule-based keyword extraction can identify the timestamps where new concepts are introduced.
- An effective rule is called the "a kind of" rule, e.g., "A computer is a kind of electronic device." Any sentence that fits the "a kind of" rule is likely introducing a new concept and hence the sentence can be considered a landmark.
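- A sketch of this rule as a simple pattern over timestamped transcript sentences is given below; the regular expression and sample transcript are illustrative assumptions:

```python
# Minimal sketch: rule-based landmark extraction with the "a kind of" rule.
# The pattern and sample transcript are illustrative assumptions.
import re

A_KIND_OF = re.compile(r"\bis a kind of\b", re.IGNORECASE)

transcript = [
    (30.0, "A computer is a kind of electronic device"),
    (95.0, "Let's now look at the keyboard"),
]

landmarks = [(t, s) for t, s in transcript if A_KIND_OF.search(s)]
print(landmarks)  # [(30.0, 'A computer is a kind of electronic device')]
```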
- with landmarks, users can play the media entity without following the time lineage. For example, they can jump to the timestamps of different landmarks and watch only a few seconds to catch the most interesting or important parts.
- the "Videomark" technology described in the disclosed subject matter can be used to manage the landmarks extracted and assist users in navigating through the video.
- a preview/trailer of the media can be automatically generated from landmarks.
- the segment of video before the landmark may be condensed into a fast-forwarded trailer or preview. For instance, suppose the time constraint of "2 minutes before and 5 minutes after" is the default (or the time constraint can be specified by the user) and the landmark keyword "Macintosh" is mentioned.
- the software will automatically generate a shortened video comprising video segments associated with the landmark keyword ("Macintosh") and satisfying the constraint "2 minutes before and 5 minutes after" the said landmark keyword. It should be appreciated that the time constraint can be specified using the "TymeTravel" syntax specified in the disclosed subject matter.
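- A minimal sketch of computing the trailer clip windows around the landmark-keyword timestamps is shown below; the landmark times are hypothetical and the actual cutting/fast-forwarding is left to any external video tool:

```python
# Minimal sketch: compute trailer clip windows around landmark-keyword timestamps.
# Landmark times are hypothetical; actual video cutting is left to an external tool.
landmark_times = [600.0, 2400.0]      # seconds where "Macintosh" is mentioned
BEFORE, AFTER = 2 * 60, 5 * 60        # "2 minutes before and 5 minutes after"

clips = [(max(0.0, t - BEFORE), t + AFTER) for t in landmark_times]
print(clips)  # [(480.0, 900.0), (2280.0, 2700.0)] -> concatenate into a fast-forwarded trailer
```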
- the segments of video associated with the landmark timestamps may be condensed into a fast-forwarded montage trailer, where the screen is spatially split into a plurality of smaller videos.
- the user can select the relevant segment for viewing by clicking on the smaller video in the screen.
- the segments may be arranged per temporal order or per the ranking algorithms previously described in section 2.
- the segment of video between two landmarks may be played in slow motion for the user to view.
- the landmarks generated by the disclosed subject matter can be used in combination with existing user-annotated landmarks or other existing databases (such as X-Ray in Amazon Video), to provide useful information to the user and help the user navigate through the video.
- Media temporal segmentation from transcript or other audio-related text information With the help of transcripts or other audio-related text information, the media segmentation can be performed in various ways. In one embodiment, the segmentation is performed based on time-associated text information such as transcripts. In another embodiment, the segmentation is performed based on the analysis of images of the media. In another embodiment, the segmentation is based on the metadata associated with the video. In yet another embodiment, the segmentation is based on a combination of images, metadata and time-associated text information.
- the media segmentation is performed temporally using transcripts or other time-associated text information: With the help of transcripts or other time-associated text information, a media stream can be segmented temporally.
- NLP-based segmentation has multiple approaches, including but not limited to, hidden Markov models (HMM), lexical chains, word clustering, topic modeling, etc.
- Topic modeling algorithms, such as Latent Dirichlet Allocation (LDA) and multi-grain LDA (MG-LDA), can be used for segmentation based on topics.
- Clustering algorithms, such as connectivity-based clustering, centroid-based clustering, distribution-based clustering, and density-based clustering, can be used for segmentation of media.
- the text body to train the topic model can be at multiple scales, e.g., transcripts for all media on a website, transcript for a particular media entity, etc.
- the unit of topic modeling can be of various sizes at different levels, such as at the sentence level, at the 100-sentence level, or at the all-words-within-10-minute-interval level (temporally defined).
- the timestamps associated with the segmentation based on transcript will be used as the timestamps for the media entity.
- the segmentation is first done on the transcript (the text domain), resulting in beginning and end timestamps associated with each text segment, and then these timestamps are transferred to the segments of the videos.
- the pairs of the beginning and end timestamps become the delimiters for segmenting the videos.
- the software may detect the discourse features of sentences, and use these features for media segmentation. For example, a sentence beginning with the word "next" is likely to be the beginning of a new segment.
- the media segmentation can be done at different levels of the transcript text body, such as topic level, sentence level and even word/phrase level.
- Different temporal constraints can be applied in combination with the text segmentation method to better segment the video. The algorithm with temporal constraint is represented below:
- Step 3: If the beginning and end timestamps (delimiters) meet the temporal constraints, use the delimiters to slice the video and proceed to Step 4. If the beginning and end timestamps (delimiters) do not meet the temporal constraints, return to Step 1 but use a different set of parameters to segment the transcript.
- in Step 3 of the previous flowchart, only the delimiters that do not meet the time constraints are returned to Step 1.
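- A minimal sketch of this constraint-checking loop is shown below; segment_transcript() is a hypothetical stand-in for any of the text-based segmenters above, and the constraints are illustrative:

```python
# Minimal sketch: re-segment the transcript until delimiters meet temporal constraints.
# segment_transcript() is a hypothetical stand-in for any text-based segmenter.
MIN_LEN, MAX_LEN = 30.0, 600.0   # example temporal constraints, in seconds

def segment_transcript(params):
    """Placeholder segmenter returning (start, end) timestamp pairs."""
    return [(0.0, 45.0), (45.0, 400.0)] if params["granularity"] == "topic" else [(0.0, 5.0)]

params = {"granularity": "sentence"}
while True:
    delimiters = segment_transcript(params)
    if all(MIN_LEN <= end - start <= MAX_LEN for start, end in delimiters):
        break                                  # Step 3: constraints met, slice the video
    params = {"granularity": "topic"}          # back to Step 1 with different parameters

print(delimiters)  # these (start, end) pairs become the delimiters for the video segments
```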
- Various machine learning techniques discussed previously, such as supervised machine learning, unsupervised machine learning, deep learning and reinforcement learning, can be applied to generate algorithms for automatic segmentation.
- a summary can be generated for each segment from corresponding transcript and used as the caption for each scene.
- the composition of the summaries will form a synopsis of the video.
- Text summarization methods include, but are not limited to, TextRank, LexRank, and maximum entropy-based summarization. Summarization is discussed in more detail later.
- the aforementioned segmentation method is novel as it may convert the videos to text-gated video clips for storage.
- a new video can be synthesized. For instance, this method can be used to remove certain words such as "hell" from the video clips.
- clips with sentences starting with "I am" may be combined together sequentially to form a new video.
- the "Videomark” technology in the disclosed subject matter can be further combined with the segmentation method for media management and navigation.
- Tagging Automated tagging from transcripts and other Time-associated text information Tagging is an effective way to represent the various aspects of a media entity. Many applications can be built on tags, e.g., finding similar media entities, advanced search, etc.
- tags come from content creators (e.g., people who upload videos to a service provider such as YouTube) or automated extraction from any text associated with the media. If the content creator does not provide any tags or accurate text for tag extraction, then the media piece will be tagless or mistagged.
- the frequencies of all words may be counted and then the most frequent words (by absolute number, percentage, or any other metric) become tags. For instance, if the word "Steve Jobs" has the highest count (e.g., 101 times) by absolute number, then "Steve Jobs" may be used as one of the tags;
- a known frequency-based method, tf-idf, may be used. Such a method computes two parameters: TF, the frequency of each word in each document, denoted as tf(w, d), where w is a word and d is a document; and IDF, the inverse document frequency, which discounts words that appear in many documents.
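- A minimal sketch of tf-idf-based tag extraction is given below, assuming scikit-learn; the transcripts are hypothetical placeholders:

```python
# Minimal sketch: pick tags as the highest tf-idf words of each transcript.
# Assumes scikit-learn; the transcripts below are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer

transcripts = [
    "steve jobs introduced the macintosh and talked about apple",
    "the lecture covers compton scattering and photon energy",
]

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(transcripts)
vocab = vec.get_feature_names_out()

for doc_idx in range(tfidf.shape[0]):
    row = tfidf[doc_idx].toarray().ravel()
    top = row.argsort()[::-1][:3]        # three highest-scoring words become tags
    print([vocab[i] for i in top])
```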
- a combination method comprising tf-idf and topic modeling may also be used for tagging. Other methods for tag extraction may also be used.
- we may also leverage the information hidden in audio to extract tags or landmarks.
- accent or emphasis in speech usually marks important words.
- the vocal signature of the speaker may also be extracted.
- the audio-domain tag and video-domain tag may be used separately or together, sequentially or simultaneously.
- With the power of NLP, a transcript can be turned into a summary, synopsis, or plot of the video. This feature will allow users to get a general idea of the media clip if the producer does not provide a synopsis.
- Text summarization algorithms include TextRank, PageRank, LexRank, maximum entropy-based summarization, etc. Some heuristics can be used, such as the beginning-of-paragraph heuristics (this method can be very effective for lecture videos).
- the summarization can be done at the single-document level or the multi-document level, where a document can have various definitions, such as the transcript of an entire media entity, or the transcripts of all episodes of one season of a TV show.
- this method can also be used to teach the software to do automatic summarization. For example, most TV shows use the beginning of every episode to quickly review what happened in the previous episode. The short review of what happened in the previous episode is a summary of the previous episode. The transcript of the short review and the transcript of the entire previous episode can form a pair of data to train the computer to generate summaries. In another aspect, we can use this to generate trailers. Given the transcript of a lengthy media entity, we first summarize its transcript, and then the video/audio segments corresponding to the summary sentences become the trailer. Note that here the "transcript" means all text associated with the media.
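- As a rough illustration, a simple word-frequency score can stand in for TextRank/LexRank in an extractive summarizer over a timestamped transcript; the chosen sentences' timestamps then define the trailer segments. The data and stop-word list are hypothetical:

```python
# Minimal sketch: extractive summarization of a timestamped transcript.
# A word-frequency score stands in for TextRank/LexRank; data is hypothetical.
from collections import Counter

STOPWORDS = {"the", "a", "was", "has"}   # tiny illustrative stop-word list

transcript = [
    (10.0, "apple introduced the macintosh computer today"),
    (55.0, "the weather outside the venue was rainy"),
    (90.0, "the macintosh computer has a graphical interface"),
]

words = Counter(w for _, s in transcript for w in s.split() if w not in STOPWORDS)

def sentence_score(sentence):
    """Score a sentence by the total corpus frequency of its non-stop words."""
    return sum(words[w] for w in sentence.split() if w not in STOPWORDS)

summary = sorted(transcript, key=lambda ts: sentence_score(ts[1]), reverse=True)[:2]
summary.sort()     # restore temporal order
print(summary)     # the corresponding video/audio segments form the trailer
```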
- Cross-referencing generation A very common case in speech is that the speaker will refer to a topic mentioned earlier or hint a topic to be discussed later.
- a link between the referred location (e.g., "now let's discuss Newton's Second Law") and the referring location (e.g., "we will discuss his second law later") can be established to help users jump back and forth between the corresponding timestamps.
- Crowdsourcing approach can solicit the linkage from users.
- the linkages labeled by users can be directly used to establish cross-references, or can be used to train the software to do so automatically or semi-automatically via machine learning.
- features for machine learning may include word-to-word distance in the transcript, such as those established via vectorized representations of words or distance measures based on a common-sense knowledge base, or the temporal distances described in the disclosed subject matter.
- the cross-reference generation can also be applied between different types of media, for instance, between video and textbook, or between audio and textbook, etc.
- Ontology learning tasks that can be applied here include concept discovery, concept hierarchy derivation, learning of non-taxonomic relations, rule discovery, ontology population, concept hierarchy extension, and frame and event detection.
- information retrieval also studies knowledge base construction from text. Representative systems include OpenIE, NELL, etc.
- the knowledge extracting can even go beyond finding logical relationships between objects mentioned in the media.
- named entity recognition (NER) can be employed, for example, to find all tools needed to fix a car, given that the video is known to be about fixing a car problem and the software can recognize which words in the transcripts are tools.
- the knowledge learned from multiple videos can be merged into more comprehensive ones.
- if in one video the software learns that a wrench is a tool that we use to tighten things (e.g., a sentence saying "let's tighten it using a size 5 wrench") and in another video it learns that screws are to be tightened (e.g., a sentence saying "the screw must be tightened firmly"), then it can learn that wrenches are tools to be applied onto screws.
- "See-also" recommendation Currently, content providers (e.g., YouTube, Amazon Prime Video, etc.) do not use transcripts or other text information embedded in the media to recommend new media for users to consume after enjoying the current one. With the availability of transcripts and other Time-associated text information, the relevance between media can be calculated in addition to existing sources. Using machine learning, this can be done in either a supervised or an unsupervised way, or a combination thereof. First, text features of each document are extracted, such as topic models or n-gram models. In one embodiment using the supervised way, we can use the users' watching behavior to train the software. For example, two video clips that are frequently watched consecutively are likely to be very related.
- similarities between two transcripts can be calculated using their topic vectors or the frequency vector of n- grams.
- the media entities that are most similar to the entity that the user just finished watching will be presented to the user.
- All approaches for estimating the similarity between any two media entities can be used individually or collectively in any form of combination.
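- A minimal sketch of ranking "see-also" candidates by transcript similarity is shown below, assuming scikit-learn; the transcripts are hypothetical placeholders:

```python
# Minimal sketch: rank "see-also" candidates by transcript similarity.
# Assumes scikit-learn; the transcripts are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

just_watched = "steve jobs apple macintosh keynote"
candidates = {
    "video_a": "history of apple and the macintosh project",
    "video_b": "how to bake sourdough bread at home",
}

vec = TfidfVectorizer()
matrix = vec.fit_transform([just_watched] + list(candidates.values()))
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()

ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
print(ranked)  # video_a should rank above video_b
```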
- All natural language processing features mentioned above can be used here for machine learning.
- the features can be at different levels, e.g., word level, phrase level, sentence-level or even document level (e.g., vector representation of documents).
- the recommendation based on text information can be jointly used with any existing recommendation approaches. For example, different recommendation approaches can each give a media entity a score and the final score is a function of those scores. The recommendation will be presented to the user based on final scores.
- the query histories of user in the Time-associated text information described in the disclosed subject matter can be used for recommendation system.
- the search terms indicate what the user wants and the frequency of searches indicates how much the user wants it.
- the time constraints can also be used for recommendation.
- the time constraints carry important information. For example, if the user has been looking for "duration less than 2 minutes" around the timestamps associated with the query, this provides additional information about what the user wants.
- the crowd-sourcing feature will be enabled and statistics will be collected for analysis. For instance, if many users watched the "see-also" video recommended and quickly put in another query (e.g., less than 40 seconds later), that means the recommended video may not be well received and the recommendation needs to be updated.
- media entities have been used to deliver advertisements to users, especially when the user enjoys media content without paying, e.g., Spotify free version, YouTube free version, or Vudu "Movies on us”.
- content providers do not use information included in audio or transcripts to match the audiences/viewers and advertisements.
- the metadata for contextual advertisements is conventionally based on the title, topic, genre, actor, director, band, and user annotation of the media entity, as well as the user history and/or cookies (e.g., search, browsing, purchase histories), to select and deliver contextual advertisements. This is especially problematic when the media entity is user uploaded without proper text information (e.g., title, descriptions, etc.), resulting in improper ad targeting.
- Advertisement matching approaches include keyword matching (e.g., maximizing the overlap between the keywords of an advertisement and the top words of a transcript), contextual advertisement, and collaborative filtering (e.g., making automatic predictions about the interests of a user by collecting preferences or taste information from many users). All approaches for advertisement based on the transcript of the current media entity that a user is watching can be used separately or collectively in any free combination.
- the targeted advertisement can be based on the user query history for searching in the audio-associated information.
- the advertisement can be selected based on user query history to enable targeted advertisement.
- the user query used in the Time-associated text information can be used for contextual advertisement purposes.
- the Time-associated text information such as transcripts can be used to provide contexts for the media in a more precise, well-defined manner.
- transcripts provide more enriched contexts, with precise temporal definition via timestamps. For instance, a movie can cover a wide array of topics in different sets. As such, grouping a movie into a genre such as "Romance", "Action", or "Sci-Fi" is very insufficient. Even with a more specialized categorization such as "a Star Wars movie", it is still not sufficient to differentiate one segment of the movie from other segments.
- some video segments are about sports (characters are running), some segments can be about cars and flights (characters are driving to airports), and some segments are about romance (characters are staying in a hotel near the beach).
- the contexts of each video segment can be analyzed and extracted, using NLP, topic modeling and automatic summarization approaches discussed previously.
- Other artificial intelligence approaches discussed previously can also be applied to analyze contexts. Consequently, the video can be automatically segmented into different segments and the corresponding contexts can be linked to each segment, using the transcripts and other Time-associated text information.
- the contextual advertisements can be dynamically delivered based on the video segments, rather than the whole videos.
- when the sports segment is shown, a sports store ad is shown beside the video by the program; when the driving-to-airport segment is shown, ads about car dealerships and airlines can be delivered and shown beside the video by the program; when the segments about romance/hotel are shown, ads about vacation resorts can be shown beside the video by the program.
- Contextual ads as an application of our search algorithms.
- the search technologies we discussed previously can be used for contextual advertisement.
- the keywords of advertisements in the advertisement network/database can be used as, or used to synthesize, the search queries to generate results of suitable timestamps in suitable media for contextual ads. For instance, if an advertisement is about running shoes, the keywords are "running" and "shoes", and those keywords can be used to search for suitable timestamps containing those words. Also, the matches will be compared to the part of the transcript close to the matches for the keywords, to make sure the times near the matches are suitable for the advertisement as well. The methods described above for detecting the suitability of advertising can be used here.
- keywords can be from multiple sources. Besides manually provided keywords by the advertiser, keywords can be generated from the content of the advertisement itself (such as objects recognized in the advertisement image/video or phrases extracted from the advertisement) or from the user activities (such as cookie or his/her search history).
- the contextual advertisement can be implemented with media/video segmentation based on transcripts.
- a flowchart about how this process works is shown below; the media segmentation for this purpose can be performed in any of the ways described above in the media temporal segmentation section (based on transcripts or other time-associated text information, images, metadata, or a combination thereof), with the resulting beginning and end timestamps serving as the delimiters for the video segments.
- the advertisement network/database may be a commercial network/database such as Google AdSense. For instance, based on the contexts of each segment, the software can query AdSense to retrieve a relevant advertisement for display.
- the media (such as video, audiobook, podcast, etc) are first transcribed.
- the process can be represented by the following flowchart:
- the segmentation of the video is based on the multimodal joint analysis described previously in the patent. For instance, transcript-audio joint analysis can be performed for segmentation; in another aspect, image-transcript joint analysis can be performed for segmentation.
- the context of the video segments is determined by the multimodal joint analysis described previously in the patent. For instance, transcript-audio joint analysis can be performed; in another aspect, image-transcript joint analysis can be performed to determine the context.
- a plurality of advertisements may be delivered for the same segment. For instance, for a segment with the context of a hotel, more than one ad can be delivered, either sequentially or in parallel. In one aspect, if more than one ad is delivered within one segment, the ads may also be displayed repeatedly (ad 1, ad 2, ad 3, ad 1, ad 2, ad 3, ad 1, ad 2, ad 3, etc.).
- the ad itself can be a media entity and the context of the ad may be used in our disclosed subject matter.
- Ideal timestamps for playing ads The software identifies the ideal timestamps to start and end displaying contextual advertisements. For instance, based on the context of the segment, the timestamps related to phrases of interest in the transcripts can be extracted. For instance, in an action movie there is a video segment about a criminal breaking into a house. The timestamp when the owner says "we need a security system" (defined as T1) can be identified by the software as a good landmark for starting to deliver ads about home security. The timestamp when the criminal has left the house (defined as T2) may be a good landmark for stopping the delivery of ads about home security.
- the software may start displaying contextual advertisements 5 seconds after T1, and end displaying the advertisement 10 seconds after T2.
- the timestamps for ideal ad-delivery can be determined by NLP methods previously described.
- the ideal timestamps can be determined by video-text joint analysis or audio-text joint analysis. For instance, if the software detects a region with low texture and high homogeneity and the transcript at this time is highly relevant to the context, this can be an ideal timestamp to display an overlaying ad on top of the said region.
- the ideal timestamps for displaying ads are the timestamps corresponding to the parts of the transcript that best match the keywords in the ads. For example, if the keyword of an ad is "shoes", then times around the occurrences of words like "wear" or "walk" are good times to display ads about shoes.
- the match can be done between any part (including all) of the ads and any part (including all) of the transcript.
- the match between words in ads and words in transcript can be computed based on word similarity (e.g., via word2vec) or topic similarity.
- a sliding window may be imposed on the transcript (or on the transcripts of both the ad and the media entity, if the ad itself is also media).
- the sliding window can be defined temporally (e.g., 10 seconds) or lexically (e.g., 10 consecutive words). Then the match between the text in that window and the entire or part of the ad will be computed.
- a window can also be defined as a segment of the ads or the transcript.
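- A minimal sketch of a lexically defined sliding-window match between ad keywords and a timestamped transcript is shown below; the window size, scoring, and data are illustrative assumptions:

```python
# Minimal sketch: sliding-window match of ad keywords against a timestamped transcript.
# The window size, scoring, and data are illustrative assumptions.
ad_keywords = {"shoes", "running", "wear", "walk"}
transcript = [  # (timestamp, word) pairs
    (40.0, "i"), (40.3, "love"), (40.6, "to"), (41.0, "walk"),
    (41.4, "in"), (41.8, "comfortable"), (42.3, "shoes"), (43.0, "daily"),
]

WINDOW = 5   # lexically defined: 5 consecutive words
best = max(
    (sum(w in ad_keywords for _, w in transcript[i:i + WINDOW]), transcript[i][0])
    for i in range(len(transcript) - WINDOW + 1)
)
print(best)  # (keyword overlap of the best window, timestamp where that window starts)
```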
- Ad-compatible segments In another embodiment, the video is categorized by the software into "ad-compatible segments" and "ad-incompatible segments".
- the ad-compatible segments are the ones which are more appropriate to anchor ads.
- the addition of ads will cause less repulsion from viewers; conversely, the ad-incompatible segments are the ones which are less appropriate to anchor ads.
- the addition of ads will cause more repulsion from viewers. For instance, in a World War II movie such as "Pearl Harbor", the segment in which the male character is dancing with a nurse can be an ad-compatible segment, which has a context of "dancing" to anchor advertisements such as dancing classes or dancing studios.
- the segments in which the Japanese Air Force attacks Pearl Harbor are ad-incompatible segments, and displaying ads during this time may upset the user and compromise their viewing experience significantly.
- the software will automatically determine whether a segment is ad- compatible or ad-incompatible by analyzing the Time-associated text information such as the transcripts. For instance, previously we discussed how to segment the video based on transcripts. Adding an advertisement in the middle of a segment will maximize the chance that the audience finishes watching this advertisement because he/she wants to continue watching the video. Further, the segments that the audience are most unlikely to skip can be detected and ads can be added there.
- the unlikelihood of skipping can be estimated in many ways, e.g., the topic similarity between a segment and all previous segments that the audience didn't skip, the topic similarity between a segment and all previous segments that the friends of the audience didn't skip.
- other information such as user annotation, comments, sentiment analysis (e.g., do not add ads at a segment which is sad like funeral scenes) can be applied to determine which video segments are ad-compatible.
- the software will analyze the transcript in conjunction with audio and images. For instance, in an action movie, the software can treat the video segments with an abundance of gun shots as ad-incompatible, as gun shot scenes are frequently the climax of the movie.
- another instance is the detection of actors wearing less clothing (bikinis, nude scenes) based on computer vision algorithms, rendering those video segments ad-compatible.
- the process can be represented by the following flowchart
- the software dynamically displays the overlaying advertisements (banner ads overlaying a small part of the video) based on the dynamic contextual advertisement methods described above. As such, the software will display the advertisements as overlaying advertisements based on the contextual information of the video segments. In one aspect, the software performs an alpha composition between the overlaying ads and the video. As such, the overlaying ads are partially transparent.
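- A minimal sketch of such an alpha composition of a banner onto one video frame is shown below, assuming OpenCV and NumPy; the file names, placement, and alpha value are hypothetical:

```python
# Minimal sketch: alpha-composite a partially transparent banner ad onto a video frame.
# Assumes OpenCV and NumPy; file names, placement, and alpha value are hypothetical.
import cv2
import numpy as np

frame = cv2.imread("frame.png")          # one decoded video frame (H x W x 3)
banner = cv2.imread("banner_ad.png")     # banner image, smaller than the frame
alpha = 0.4                              # banner opacity (0 = invisible, 1 = opaque)

h, w = banner.shape[:2]
y, x = frame.shape[0] - h - 10, 10       # bottom-left placement with a 10 px margin

roi = frame[y:y + h, x:x + w].astype(np.float32)
blended = alpha * banner.astype(np.float32) + (1 - alpha) * roi
frame[y:y + h, x:x + w] = blended.astype(np.uint8)

cv2.imwrite("frame_with_ad.png", frame)
```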
- Companion ads In another embodiment, the dynamic contextual advertisements are delivered by the software via companion ads, where the ads are displayed alongside the video outside the boundary of the video, such as the black strips on both sides of a movie. As such, there is no spatial overlap between the video and the advertisements.
- the dynamic contextual advertisements are delivered by the software via traditional ads (in between video segments), where the video is temporarily paused to play the advertisements. It should be appreciated that the aforementioned methods can determine the optimal time for delivering the ads. Also, segmentation based on the transcript enables the ads to cause less distraction to the user's train of thought.
- the overlaying ads and companion ads are images.
- the overlaying ads and companion ads are movies or GIFs.
- the user can insert a JavaScript code snippet or do XXX to enable the dynamic contextual advertisement of the media.
- transcript information of the videos watched and the query history can be used in combination to determine the relevance of an advertisement.
- advertisement selection methods known in the field of information technology may be used in conjunction with the methods described above.
- the software may not want to interrupt the user when he/she clicks on the dynamic ads.
- when the user clicks on the ads, the software will automatically add the advertised services/products to his/her wishlist/shopping cart, so that the viewer can look into these products later.
- the software will compile a list of ads that user clicked on and send this list to the user via email. As such, the user can look at these ad items and make purchase decisions later on, without interrupting the viewing experiences.
- the software will integrate the advertisements user clicked into a personalized webpage for the user to view and shop, after viewing the video.
- Coupon In another embodiment, the software enables a "watch and save" feature. Discounts such as coupons and special offers are activated when the viewer watches the video or media. The software displays coupons or special offers as advertisements.
- the flowchart is shown below: 1. Segment the media file into a plurality of segments based on transcripts
- the coupon database may be a service such as Groupon.
- the dynamic contextual advertisement can be used at home, on the go, or in movie theaters. It can be used on computers, smartphones, tablets, TV receivers, etc.
- Smartphone/tablet application The dynamic advertisement technology can be used in mobile streaming applications such as Netflix.
- the software may offer users the choice of paying lower subscription fees when users activate the dynamic advertisement option.
- Movie theater In one embodiment, the dynamic advertisement technology can be used in the movie theaters.
- the dynamic targeted ad based on the transcript or other audio-associated information can be shown concurrently with the movie; in one aspect, the dynamic ad is shown on top of, below, or alongside the movie. In another embodiment, the dynamic ad can be shown in between different segments of the movie. In another aspect, overlaying ads can be shown.
- the movie ticket price may be subsidized by the advertisement at a reduced rate; in some cases, the movie ticket can be free.
- the dynamic advertisement technology can be used in delivering traditional TV content.
- the TV channels may broadcast their program using the dynamic advertisement technology to deliver contextual ads. For instance, instead of interrupting the viewers periodically, more overlaying ads or companion ads can be used based on contexts, for a better user experience.
- Streaming stick/TV box/TV receivers In one embodiment, the dynamic advertisement technology can be running on streaming stick, TV boxes, TV receivers, or DVR machines.
- the dynamic advertisement technology can be used in delivering online education or Massive open online course (MOOC).
- the course lectures are very easy to transcribe.
- user preferences/history based on students' past exams/quizzes/activities will also facilitate the selection of the appropriate advertisements or coupons by the software.
- Overlaying ads and companion ads may be preferred in delivering educational content as they cause less distraction to the user.
- the advertisements that students clicked on may be presented to the viewer after finishing the lecture, using the methods previously described, instead of redirecting students to the merchant website immediately.
- the advertisement may lower the overall tuition for the students.
- the student may still have to come to school for formal tests.
- the software may also prioritize the ads to be displayed. For instance, study-related ads, book-related ads or career-related ads may be given higher priority in being displayed.
- Videochat/Audiochat advertisement The dynamic contextual advertisement technology can also be used for videochat, audiochat and phone calls. Based on the context of the information, relevant ads can be delivered to users by the software.
- the software will analyze the video frames of the segments using image analysis, and determine pixel locations and/or size of the advertisements to be delivered.
- computer vision algorithms can be used to understand the images.
- image features can be extracted from the video by the software to determine the regions to place the ad. For instance, out-of-focus regions of the images, regions without human faces, regions with less optical flow or other image features can be used to determine the regions for overlaying the ad.
- image processing techniques may be used for analysis of video for ad placement: pixel-based operations, point-based operations, adaptive thresholding, contrast stretching, histogram equalization, histogram matching, histogram operations, image enhancement, image filtering, noise removal, edge detection, edge enhancement, Fourier transform and analysis, frequency-domain processing, image restoration, restoration by the inverse Fourier filter, the Wiener-Helstrom filter, constrained deconvolution, blind deconvolution, iterative deconvolution and the Lucy-Richardson algorithm, constrained least-squares restoration, stochastic input distributions and Bayesian estimators, the generalized Gauss-Markov estimator, shape descriptors, shape-preserving transformations, shape transformation, affine transformation, the Procrustes transformation, projective transform, nonlinear transformations, warping, piecewise warp, piecewise affine warp, morphological processing, dilation and erosion, morphological opening and closing, boundary extraction, extracting connected components, region filling, etc.
- the following computer vision techniques may be used for analysis of video for ad placement: Point operators, Linear filtering, Pyramids and wavelets, Geometric transformations, Global optimization, Feature detection and matching (Points and patches, Edges, Lines, etc.), Segmentation (Active contours, Split and merge, Mean shift and mode finding, Normalized cuts, Graph cuts and energy-based methods, etc.), Feature-based alignment (2D and 3D feature-based alignment, Pose estimation, Geometric intrinsic calibration, etc.), Structure from motion (Triangulation, Two-frame structure from motion, Factorization, Bundle adjustment, Constrained structure and motion, etc.), Dense motion estimation (Translational alignment, Parametric motion, Spline-based motion, Optical flow, Layered motion, etc.), Image stitching (Motion models, Global alignment, Compositing, etc.), Computational photography techniques (Photometric calibration, High dynamic range imaging, Super-resolution and blur removal, Image matting and compositing, Texture analysis, synthesis and transfer, etc.), Stereo correspondence, etc.
- the regions selected by the software can be temporally static (the region does not move across nearby frames) or dynamic (e.g., the region moves across nearby frames).
- animation effects of "moving ad” or "fly in”, “fly out” can be created.
- Object recognition and object detection for placing overlaying advertisement is performed to determine where to place the overlaying ad in the video.
- the software finds and identifies objects in the video to facilitate placing of overlaying ad at a relevant and appropriate location, with relevant starting and end timestamps. For instance, cars in the video can be recognized, and the overlaying advertisement of car dealership can be placed on or nearby the said cars identified.
- dogs are identified and pet food advertisement can be overlaid in the region nearby the said dogs in the said video.
- Elvis Presley is identified in the videos and overlaying ad can be placed in the regions nearby Elvis. The regions and durations of overlaying ad placement in the video can be therefore determined, for improved viewing experience.
- Possible object recognition algorithms are: Appearance-based methods (Edge matching, Divide-and-Conquer search, Greyscale matching, Gradient matching, Histograms of receptive field responses, Large model bases, etc), Feature-based methods (Interpretation trees, Hypothesize and test, Pose consistency, Pose clustering, Invariance, Geometric hashing, Scale-invariant feature transform (SIFT), Speeded Up Robust Features (SURF), BRIEF (Binary Robust Independent Elementary Features) ), Bag-of-words model in computer vision, Recognition by parts, Viola-Jones object detection, SVM classification with histograms of oriented gradients (HOG) features, Image segmentation and blob analysis, or a combination thereof.
- the region selected based on object recognition and detection may be either temporally static (the geometrical center of the selected region does not move across frames) or dynamic (the geometrical center of the selected region moves across frames).
- the displacement of the geometrical center of the selected region across frames can follow the displacement/movement of the recognized object, providing a pleasant ad viewing experience, as sketched below.
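- One way such object-following regions could be implemented is sketched here, assuming sparse Lucas-Kanade optical flow on feature points inside the detected object's bounding box; the corner-detection parameters are arbitrary assumptions:

```python
# Hedged sketch: move the overlay region's center frame-to-frame by the mean
# sparse optical flow of feature points inside the recognized object's box.
import cv2
import numpy as np

def track_region(prev_gray, next_gray, region):
    x, y, w, h = region
    mask = np.zeros_like(prev_gray)
    mask[y:y + h, x:x + w] = 255
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50,
                                  qualityLevel=0.01, minDistance=5, mask=mask)
    if pts is None:
        return region  # nothing trackable; keep the region static
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good = status.ravel() == 1
    if not good.any():
        return region
    dx, dy = (new_pts[good] - pts[good]).reshape(-1, 2).mean(axis=0)
    return (int(x + dx), int(y + dy), w, h)
```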
- plane detection is performed to identify planes and their orientations in the video.
- the overlaying ad can therefore be placed with a perspective and projection consistent with the plane orientation in the image. For instance, carpet flooring can be identified in the video, and the overlaying ad can be placed on the carpet region with the correct orientation and perspective, consistent with the carpet plane's orientation and perspective in the video, as sketched below.
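- A minimal sketch of the perspective-consistent overlay, assuming the four corners of the detected plane region are already known (plane detection itself is not shown), might look like the following:

```python
# Hedged sketch: given the four corners of a detected plane (e.g., a patch of
# carpet) in the frame, warp the ad image into that quadrilateral so its
# perspective matches the plane.
import cv2
import numpy as np

def overlay_on_plane(frame, ad, plane_corners):
    # plane_corners: 4x2 float32 array, clockwise from top-left, in frame coordinates.
    h, w = ad.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    M = cv2.getPerspectiveTransform(src, np.float32(plane_corners))
    warped = cv2.warpPerspective(ad, M, (frame.shape[1], frame.shape[0]))
    mask = cv2.warpPerspective(np.ones((h, w), np.uint8) * 255, M,
                               (frame.shape[1], frame.shape[0]))
    out = frame.copy()
    out[mask > 0] = warped[mask > 0]
    return out
```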
- Texture analysis is performed on the video to find the region and timestamps for placing the overlaying advertisement.
- An image texture is a set of metrics calculated in image processing to quantify the perceived texture of an image, which gives information about the spatial arrangement of color or intensities in an image or selected region of an image or video. For instance, the regions with less complex texture and/or more homogenous spatial color distribution can be identified and overlaying advertisement can be delivered in these regions in video.
- regions with less complex and/or repetitive textures typically imply that there is less image complexity in the region (e.g., no human faces, fewer interesting details, etc.), which can be used by the software for placing the overlaying ad.
- a region with a more homogeneous spatial color distribution is often less interesting to the viewer (e.g., flooring, a wall, the sky, etc.), which can be used by the software for placing the overlaying ad.
- the desirable texture can be set manually by the user, automatically by the software, or by a combination thereof; a texture-scoring sketch is shown below.
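- A minimal texture-scoring sketch along these lines, using local intensity variance as the homogeneity measure (one of many possible texture metrics; the 9x9 window is an arbitrary assumption), is:

```python
# Hedged sketch: score candidate regions by local intensity variance, treating
# low-variance (homogeneous, "boring") regions as better overlay candidates.
import cv2
import numpy as np

def homogeneity_score(frame, region):
    x, y, w, h = region
    gray = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY).astype(np.float32)
    mean = cv2.blur(gray, (9, 9))
    mean_sq = cv2.blur(gray * gray, (9, 9))
    local_var = mean_sq - mean * mean          # per-pixel local variance
    return float(local_var.mean())             # lower = more homogeneous

def pick_flattest(frame, candidates):
    return min(candidates, key=lambda r: homogeneity_score(frame, r))
```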
- Face detection for placing an overlaying advertisement: In another embodiment, regions containing human faces can be avoided when overlaying ads in video, since it is generally undesirable to overlay ads on human faces.
- Some possible face detection algorithms that can be applied for placing overlaying advertisement are: Weak classifier cascades, Viola & Jones algorithm, PCA, ICA, LDA, EP, EBGM, Kernel Methods, Trace Transform, AAM, 3-D Morphable Model, 3-D Face Recognition, Bayesian Framework, SVM, HMM, Boosting & Ensemble, or a combination thereof.
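- For illustration only, the face-avoidance rule could be prototyped with OpenCV's bundled Haar cascade (one of the listed options); the detector parameters are assumptions, not prescribed values:

```python
# Hedged sketch: drop any candidate overlay region that intersects a detected
# face, using OpenCV's bundled Haar cascade.
import cv2

_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def overlaps(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def remove_face_regions(frame, candidates):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [r for r in candidates
            if not any(overlaps(r, tuple(f)) for f in faces)]
```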
- motion analysis can be applied to select regions for placing overlaying ad on video.
- the regions with less motion are selected for overlaying ads.
- the regions with more motion are selected for overlaying ads.
- optical flow analysis, such as the Lucas-Kanade method, is performed, and regions with less optical flow, i.e., less motion, will be used for ad delivery.
- Other motion detection algorithms, such as block matching, template matching, subtracting a sequence of frames, and background/foreground segmentation, can also be applied to identify regions with less motion.
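- As a hedged sketch of the low-motion selection, dense Farneback optical flow (one concrete optical-flow option) is computed below and the frame is divided into an assumed 4x4 grid, with the calmest cell returned as the overlay candidate:

```python
# Hedged sketch: compute dense optical flow between two consecutive frames and
# pick the grid cell with the smallest mean flow magnitude as the overlay region.
import cv2
import numpy as np

def least_motion_cell(prev_bgr, next_bgr, grid=(4, 4)):
    prev_g = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_g = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_g, next_g, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    H, W = mag.shape
    gh, gw = H // grid[0], W // grid[1]
    best, best_val = None, float("inf")
    for i in range(grid[0]):
        for j in range(grid[1]):
            val = mag[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw].mean()
            if val < best_val:
                best, best_val = (j * gw, i * gh, gw, gh), val
    return best  # (x, y, w, h) of the calmest cell
```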
- the selected region can be static.
- the selected region can be dynamic, and the temporal displacement of the region across frames is consistent with the average displacement/motion of pixels in the selected region (e.g., if a region containing a slow-moving car is selected, the geometrical center of the selected region displaces across frames in accordance with the displacement/movement of the car).
- out-of-focus regions in the video are identified.
- the out-of-focus areas are typically background and less important.
- the software can identify the out-of-focus regions based on image analysis and overlay the advertisements on those regions.
- Some possible algorithms are: contrast detection, Fourier analysis (out-of-focus regions contain a smaller proportion of high-frequency components), analyzing the variance of pixel neighborhoods with a sliding window, histogram analysis of pixel neighborhoods (e.g., a 4x4 pixel neighborhood), gradient analysis, or a combination thereof. It should be appreciated that any algorithm used in the field of autofocus control can be used for identifying the out-of-focus regions.
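- A minimal sketch of out-of-focus detection using the variance of the Laplacian (one of the sharpness measures commonly used in autofocus control) could look like the following; the 4x4 grid is an arbitrary assumption:

```python
# Hedged sketch: rank grid cells by the variance of the Laplacian; cells with a
# low variance contain few sharp edges and are likely out of focus.
import cv2

def blurriest_cell(frame, grid=(4, 4)):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    H, W = gray.shape
    gh, gw = H // grid[0], W // grid[1]
    scores = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = gray[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            sharpness = cv2.Laplacian(cell, cv2.CV_64F).var()
            scores.append(((j * gw, i * gh, gw, gh), sharpness))
    return min(scores, key=lambda s: s[1])[0]   # least sharp = most defocused
```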
- the software automatically detects the video frames with similar perspectives. For instance, in a video captured by multiple cameras, there are image frames captured from different perspectives and angles. The image frames from the same camera angle may not be shown continuously, and video producers often switch back and forth between footage captured by different cameras throughout a video.
- the software can detect the different groups of image frames captured from a similar camera perspective and place the same overlaying ad in more than one such group, while skipping the ad in the groups of image frames that fall temporally in between the groups captured from the similar camera perspective.
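- One possible (non-prescribed) way to group frames by camera set-up is color-histogram correlation, sketched below; the 8-bin histograms and the 0.9 threshold are assumptions:

```python
# Hedged sketch: group frames by camera set-up using color-histogram
# correlation; frames whose histograms correlate strongly with a group's
# reference frame are assumed to come from the same camera angle.
import cv2

def assign_shot_groups(frames, threshold=0.9):
    groups, refs = [], []          # refs[i] = reference histogram of group i
    for frame in frames:
        hist = cv2.calcHist([frame], [0, 1, 2], None,
                            [8, 8, 8], [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        for gid, ref in enumerate(refs):
            if cv2.compareHist(ref, hist, cv2.HISTCMP_CORREL) > threshold:
                groups.append(gid)
                break
        else:
            refs.append(hist)
            groups.append(len(refs) - 1)
    return groups                   # group id per frame
```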
- the size of the overlaying advertisement is smaller than the region selected by the software.
- the software will only use a subset of the region selected and automatically move the said advertisement within the region.
- an animation effect can be created, such as sliding, flying in, flying out, etc.
- the regions can be ranked based on spatial constraints and/or temporal constraints, as well as image analysis results (e.g., texture properties, motion, object recognition results, whether a human face is contained, etc.).
- Penalty function for spatial locations: In one embodiment, the software imposes a penalty function on the regions it selects for ranking purposes. For instance, the software may impose a higher penalty on regions toward the spatial center of the video and a lower penalty toward the spatial periphery of the video. As such, peripheral regions will be given priority for overlaying advertisement placement.
- the ranking is based on a score, denoted the ranking score.
- the final ranking score can be a function combining the values of the factors in various ways, or a combination of those ways, including but not limited to summation, subtraction, multiplication, division, exponent, logarithm, sigmoid, sine, cosine, softmax, etc. Methods used in existing ranking algorithms may also be used, on their own or as part of the ranking function (including joint use).
- the ranking function does not necessarily have to be expressed analytically, and it could be a numerical transformation obtained or stored in many ways, including but not limited to a weighted sum of those factors, artificial neural networks (including neural networks for deep learning), support vector machines with or without kernel functions, or ensembled versions of them (e.g., via boosting or bagging, or specialized methods such as random forests for decision trees), or combinations of them.
- the function can be one transform or a combination of a plurality of transforms; an illustrative ranking function is sketched below.
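- The sketch below illustrates one such ranking score, combining a center-position penalty with image-analysis factors as a weighted sum; the weights and the specific factors are illustrative assumptions, not the prescribed ranking function:

```python
# Hedged sketch: combine a center-position penalty with image-analysis factors
# into a single ranking score. Weights and the exact combination are
# illustrative, not prescribed by the method.
import math

def ranking_score(region, frame_size, motion, texture_var, has_face,
                  w_center=1.0, w_motion=0.5, w_texture=0.3, w_face=5.0):
    x, y, w, h = region
    W, H = frame_size
    cx, cy = x + w / 2.0, y + h / 2.0
    # Penalty grows as the region's center approaches the frame's center.
    center_penalty = 1.0 - math.hypot(cx - W / 2.0, cy - H / 2.0) / math.hypot(W / 2.0, H / 2.0)
    penalty = (w_center * center_penalty + w_motion * motion +
               w_texture * texture_var + w_face * float(has_face))
    return -penalty        # higher score = better overlay candidate

def rank_regions(scored_inputs):
    # scored_inputs: iterable of (region, frame_size, motion, texture_var, has_face)
    return sorted(scored_inputs, key=lambda args: ranking_score(*args), reverse=True)
```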
- Machine learning for placing the overlaying ad: Various machine learning techniques discussed previously, such as supervised learning, unsupervised learning, deep learning, and reinforcement learning, can be applied to identify the regions for overlaying advertisements. For instance, based on data of human selections of regions, the software can learn how to select regions using machine learning. The human selection data can be crowd-sourced.
- the software will analyze the color information of the region of the video frames where the overlaying ad is displayed and adjust the display settings of the overlaying advertisement dynamically.
- the display settings that may be adjusted are: color, transparency, contrast, saturation, size, animation pattern, duration, etc.
- Adjust the color of the overlaying ad based on image analysis: For instance, if the overlay region contains primarily blue, the software may adjust the advertisement toward red so that the ad stands out more.
- average color values of pixels in the region, or color histograms of pixels in the region, can be calculated to represent the color properties of the region.
- machine learning such as unsupervised learning or supervised learning can be performed to analyze the color information. For instance, k-means can be applied on the color histogram to identify the primary color composition of the region. PCA, ICA, SVM and other machine learning algorithms may be applied to analyze color information.
- the color information of the video frames in the overlay region can be analyzed using an HSV transform, where RGB information is converted to the HSV color space.
- the HSV-space data of the region can be further analyzed using histograms, average values, PCA, k-means, or various machine learning algorithms.
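- A minimal sketch of the HSV/k-means color analysis, assuming OpenCV's k-means implementation and an arbitrary choice of three clusters, is:

```python
# Hedged sketch: convert the overlay region to HSV and run k-means on its
# pixels to find the dominant colors, which then drive the ad's color choice.
import cv2
import numpy as np

def dominant_hsv_colors(frame, region, k=3):
    x, y, w, h = region
    hsv = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    samples = hsv.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(samples, k, None, criteria, 5,
                                    cv2.KMEANS_RANDOM_CENTERS)
    counts = np.bincount(labels.ravel(), minlength=k)
    order = np.argsort(counts)[::-1]
    return centers[order]      # HSV cluster centers, most frequent first
```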
- the transparency level of the overlaying ads is determined by computer vision and image analysis of the relevant video frames.
- the software dynamically adjusts the alpha composition (the transparency level of the ad) based on the pixel values of the ad and the pixel values of the video frames in the region where they overlap. For instance, if, in the overlay region, the relevant video frames show primarily a homogeneous color with simple texture, the software will decrease the transparency level of the overlaying ad, as objects with a homogeneous color and simple texture are typically not important to viewers.
- if the relevant video frames show complex texture, or show special objects such as human faces, the transparency level of the ad will be increased so that the viewer can still see these objects after the ad is overlaid. As such, the user will have a much better viewing experience, as the overlaying ads are less distracting.
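- One way the texture-driven transparency adjustment could be sketched is shown below; the variance-of-Laplacian complexity measure, the 500.0 scaling constant, and the alpha bounds are all assumptions:

```python
# Hedged sketch: pick the overlay's alpha from the local texture complexity of
# the underlying region -- near-opaque over flat areas, mostly transparent over
# busy areas -- and blend with cv2.addWeighted.
import cv2
import numpy as np

def blend_ad(frame, ad, region, alpha_min=0.25, alpha_max=0.9):
    x, y, w, h = region
    patch = frame[y:y + h, x:x + w]
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
    complexity = cv2.Laplacian(gray, cv2.CV_64F).var()
    # Map complexity into [alpha_min, alpha_max]; the 500.0 scale is a guess.
    opacity = float(np.clip(alpha_max - complexity / 500.0, alpha_min, alpha_max))
    ad_resized = cv2.resize(ad, (w, h))
    blended = cv2.addWeighted(ad_resized, opacity, patch, 1.0 - opacity, 0)
    out = frame.copy()
    out[y:y + h, x:x + w] = blended
    return out
```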
- Adjust the size, duration, and animation pattern of the overlaying ads: similar to the color and transparency settings, other settings such as size, duration, and animation pattern can also be automatically adjusted by the software.
- the software can perform a comprehensive analysis of the media file to generate a timed multimodal context file.
- a plurality of timed information can be included, such as timed contexts, ideal ad-playing timestamps, timed ad-compatibility score, timed ad-layout, timed ad-adjustment, etc.
- the functionality of these properties is illustrated in the table in Figure 53.
- the file is organized as a data structure such as a matrix, where the rows correspond to timestamps (e.g., every row is a one-second increment: row 1: 0h0m0s; row 2: 0h0m1s; row 3: 0h0m2s, ...), and the columns are properties such as timed contexts, ideal ad-playing timestamps, timed ad-compatibility score, timed ad-layout, timed ad-adjustment, etc.
- The multimodal context file: with the multimodal context file, the information about how to display contextual ads for the video is documented in detail. As such, the file can conveniently be used as a companion file to the video. It should be appreciated that in some cases these properties can be stored and transmitted to servers in real time, without being written into a file.
- the file does not have to be a file in a computer's storage; it can be a data structure maintained in an online database.
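- A minimal in-memory sketch of such a timed multimodal context structure (one row per second, columns as listed above) is shown below; the field names and types are illustrative assumptions:

```python
# Hedged sketch: one possible in-memory layout of the timed multimodal context
# "file" -- a row per second with the columns named above. Field names are
# illustrative.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ContextRow:
    timestamp_s: int                       # 0, 1, 2, ... seconds into the media
    contexts: List[str] = field(default_factory=list)   # e.g. ["business"]
    ideal_ad_slot: bool = False            # good moment to interrupt/overlay?
    ad_compatibility: float = 0.0          # 0..1 compatibility score
    ad_layout: Dict[str, int] = field(default_factory=dict)      # region x/y/w/h
    ad_adjustment: Dict[str, float] = field(default_factory=dict)  # alpha, size, ...

def build_context_table(duration_s: int) -> List[ContextRow]:
    # One row per second; analysis code fills in the columns afterwards.
    return [ContextRow(timestamp_s=t) for t in range(duration_s)]
```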
- a bottleneck for remote streaming is network bandwidth, because of the large data volume of media.
- the video does not need to be updated at a constant and/or high frequency.
- the media encoding rate can be varied based on the speed of text changing in the transcript. For example, when the professor is not talking, the encoding rate can be lower; if the professor is talking fast, the encoding rate needs to be higher.
- a high encoding rate (e.g., 30 FPS for video) does not need to be maintained throughout.
- Each line/block in the transcript has a timestamp for its beginning and end.
- a simple way is to set the coding rate proportional to the number of syllables during each line. Speech recognition can also be added here to improve accuracy.
- the real time interval in which each word or even each syllable is spoken can be determined by speech-transcript alignment, such as universal phone models, finite state machine methods, the time-encoded method, dynamic programming, etc. Once the interval of each word or syllable is detected, the encoding rate can be set as a function of the rate of syllable change, for example, proportional to the rate at which syllables are spoken.
- the syllable rate is either directly detected or estimated by dividing the number of syllables in each word by the word's duration. This is an easy task for some languages, such as Chinese or Korean, where each written character corresponds to one syllable.
- the syllables are directly detected by the audio analysis software, and no transcription is needed.
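- A hedged sketch of the transcript-driven variable encoding rate is shown below; the vowel-group syllable heuristic, the proportionality constant, and the frame-rate bounds are assumptions:

```python
# Hedged sketch: derive a per-line video encoding rate from the transcript,
# making the frame rate proportional to how quickly syllables are spoken.
import re

VOWEL_GROUPS = re.compile(r"[aeiouy]+", re.IGNORECASE)

def estimate_syllables(text: str) -> int:
    # Crude English heuristic: count vowel groups per word, at least 1 per word.
    return sum(max(1, len(VOWEL_GROUPS.findall(word))) for word in text.split())

def encoding_rates(transcript, fps_per_syllable=2.0, fps_min=1.0, fps_max=30.0):
    # transcript: list of (start_s, end_s, text) lines with timestamps.
    rates = []
    for start, end, text in transcript:
        duration = max(end - start, 1e-6)
        syl_per_s = estimate_syllables(text) / duration
        fps = min(max(fps_per_syllable * syl_per_s, fps_min), fps_max)
        rates.append((start, end, fps))
    return rates     # (start, end, frames-per-second) per transcript line
```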
- Figure 55: The illustration shows how one segment of speech is converted into text, how the speech speed is then estimated at the word level in units of syllables, and how the final video encoding rate is set relative to the speech speed. Each step can be done in multiple ways, and measurements can be in any unit. In the example, the total frame count is reduced to 48, whereas without this technique it would be 450, assuming 1/4 second per syllable in human speech.
[00947] 13. Virtual reality (VR) and augmented reality (AR) applications
- Our software technology can be coupled with virtual reality hardware/software such as head-mounted displays, heads-up displays, and other display technologies.
- the other embodiments previously described in the disclosed subject matter may be applied to settings of VR or AR.
- the software may help the user navigate through video instructions.
- the user may wear a head mounted display, and the instructions are being navigated through using search methods previously described in this disclosed subject matter. For instance, the user may be fixing a car while watching a video tutorial.
- the user can use audio search to jump back and forth in the video to rewatch any part for as many times as he/she wants, e.g., "jump to where I need to install a filter using a size- 10 wrench".
- the user can also instruct the media player to pause as well as give a preview of the entire procedure using our landmark-based trailer composition.
- the user can use the voice recognition to input a query such as "engine” to search for the timestamps when the word "engine” is mentioned.
- the "Videomark” technology can be used for guiding the navigation of the video in VR and AR.
- the query in VR and AR can be specified using the "Tymetravel" technology; the user can also use "Tymetravel" to share the query with other people who are using computers/smartphones/VR/AR at a remote end.
- the virtual reality system may have display components such as an LCD (liquid crystal display), an OLED (organic light emitting diode) display, a liquid crystal on silicon (LCoS) display, a projection display, a head-mounted display (HMD), a head-mounted projection display (HMPD), an optical see-through display, a switchable optical see-through display, a selective occlusion see-through head-mounted display, and a video see-through display.
- the display may comprise an augmented reality window, augmented monitors, a projection on the patient/projective head-mounted display, a selective occlusion see-through head-mounted display, and a retinal scanning display.
- the display may be stereoscopic, binocular, or non-stereoscopic.
- the virtual reality display may have microphone and camera(s) built-in, and have wired or wireless connectivity.
- the VR display comprises a smartphone with a lens-integrated case (e.g., Google Cardboard) to work with the software.
- the VR set has at least one micro-display with lens.
- the VR set has a pupil-forming or non-pupil-forming configuration.
- the software runs on the smartphone, local tablets, and local computers.
- the software is run on the VR set (on its embedded computing device); in yet another embodiment, the software is run on the cloud, either fully or partially.
- the display methods to create a VR/AR experience include, but are not limited to, parallax displays, polarized displays, active shutter, anaglyph, etc.
- existing voice memo software only provides recording functionality and very basic memo management functionality. For instance, the audio clips or voice memos are organized only by time of recording; therefore, management of and search within each voice memo based on the information in the recording (such as the transcript) is not available.
- the disclosed subject matter presents software that can record voice memos, manage the memos, search within the memos, and segment the memos into clips based on the time-associated text information.
- the search methods and ranking methods described previously in the disclosed subject matter can be readily applied to the field of audio recording and voice memos.
- a representative flowchart implemented in the voice memo software is shown in the figure below:
- Figure 56: A representative flowchart of voice memo search.
- the software can play the audio clips from the timestamp associated with the query that is specified by the user, after ranking or returning the results. It should be appreciated that the methods for specifying the time constraints previously discussed can be used in the voice memo software implementation.
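- For illustration, a minimal transcript-based memo search along these lines might look like the following; the data layout and the `player.play` call are hypothetical:

```python
# Hedged sketch: search every memo's time-aligned transcript for a query term
# and return (memo id, timestamp) hits ordered by how early they occur, so
# playback can start at the matching timestamp.
from typing import List, Tuple

def search_memos(memos, query: str) -> List[Tuple[str, float, str]]:
    # memos: dict of memo_id -> list of (start_s, text) transcript segments.
    q = query.lower()
    hits = []
    for memo_id, segments in memos.items():
        for start_s, text in segments:
            if q in text.lower():
                hits.append((memo_id, start_s, text))
    return sorted(hits, key=lambda h: h[1])   # earliest matches first

# Example: jump playback of the best hit to its timestamp.
# memo_id, t, _ = search_memos(library, "dentist appointment")[0]
# player.play(memo_id, start=t)      # `player` is a hypothetical playback API
```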
- the voice memo software can implement the "Videomark" technology described in the disclosed subject matter. Consequently, videomarks as well as a glossary for the audio clips or audio segments can be generated.
- the user can manage the audio clips or voice memos based on the information embedded in the clips (such as the transcript) and will be able to easily jump to the desired timestamp within the correct audio clip.
- the "TymeTravel" methods described in the disclosed subject matter can be implemented in the voice memo software. As such, the user can specify, share and manage the audio clips in a convenient way.
- existing Karaoke software, Karaoke machines, and singing machines are not very convenient to use.
- searching for a particular song in a Karaoke machine is typically done by looking up singers, song titles, or languages, or by spelling the title.
- the existing system is not convenient for the user if the user does not know the song title and singer name for the song.
- Karaoke software and Karaoke machines can be developed with built-in search capability.
- the software can search for a song based on time-associated text information such as lyrics. For instance, when the phrase "reminds me of childhood memories" is input into the software/machine as the query, the Karaoke software/machine will search, rank, and return to the user the results in which the phrase "reminds me of childhood memories" appears within the lyrics of a music video or soundtrack.
- the search and rank process is similar to the methods previously described in the disclosed subject matter.
- the software will take the user to the timestamp associated with the query (e.g., when the phrase "reminds me of childhood memories" is sung in the lyrics/transcript/closed captions). As such, the user can start singing at the desired timestamp, without singing the whole song.
- karaoke singing functionalities known in the field of music entertainment and computer engineering can be enabled in the Karaoke software and Karaoke machine, such as singing with the instrumental soundtrack only (without the original singer's vocals), singing with the original singer's vocals, tuning the key up and down, looking up songs by traditional methods (singer names, genre, song titles, languages, rankings, etc.), cutting unfinished songs, reordering the selected songs, updating the music database, etc.
- a representative flowchart of this embodiment is shown in Figure 57.
- the Karaoke software embodied by the disclosed subject matter can run on any computing device, such as desktops, laptops, smartphones, tablet computers, smart watches, smart wearable devices, video game consoles, TV streaming devices, TV DVRs, smart soundbars, smart speakers, and smart audio power amplifiers. It should be appreciated that the Karaoke software can run locally, on the cloud, or in a local/cloud hybrid environment. It should be appreciated that the Karaoke software can be integrated with video streaming websites/providers and social media applications. The Karaoke software may use the following methods of input: touchscreen, keyboard, remote control, mouse, voice recognition, gesture recognition, etc.
- the Karaoke software or Karaoke machine supports remote collaboration. Unlike current software/machines designed for singing locally, the Karaoke software and machine in the disclosed subject matter support singers in different geographical locations.
- the software has built-in communication capability to allow for a plurality of users to sing together, even if they are situated in different cities.
- the soundtrack/music video can be streamed from a remote, cloud-based service provider; in another aspect of the embodiment, the soundtrack/music video can be stored locally or downloaded to a local computer before playing; in yet another aspect of the embodiment, the soundtrack/music video can be obtained using a hybrid approach of real-time online streaming and local storage.
- the remote collaboration can be enabled by communication technologies, wired or wireless. Possible communication technologies that can be applied comprise WiFi, Bluetooth, LAN, near-field communication (NFC), infrared communication, radio-frequency communication, TV cable, satellite TV communication, telephone communication, cellular network, 3G network, 4G network, etc.
- the Karaoke software or Karaoke machine allows the user to use an instrument and form a band for performance and entertainment. Unlike traditional software, which only allows people to sing, the disclosed subject matter allows the user to participate in a plurality of roles. For instance, the user can participate by playing an instrument, such as a guitar, drums, or piano.
- a representative flowchart is shown below.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
One or more computing devices, systems, and/or methods for searching, supplementing, and/or browsing media are provided. For example, a query for media may be used to identify results and to provide the results based upon temporal properties of the results. In another example, the media may be segmented into portions based upon time-associated text information of the media, and each portion of the media may be supplemented with content selected based upon a context of the portion. In another example, an area of a video may be selected based upon image analysis of the video, and the video may be supplemented with content at the area. In another example, a video may be supplemented with content, and properties of the content may be adjusted based upon image analysis of the video. In another example, the media may be browsed at different rates of advancement. (Translated from the French-language abstract.)
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201662279616P | 2016-01-15 | 2016-01-15 | |
| US62/279,616 | 2016-01-15 | ||
| US201762446650P | 2017-01-16 | 2017-01-16 | |
| US62/446,650 | 2017-01-16 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2017124116A1 true WO2017124116A1 (fr) | 2017-07-20 |
Family
ID=57910187
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2017/013829 Ceased WO2017124116A1 (fr) | 2016-01-15 | 2017-01-17 | Recherche, complémentation et exploration de multimédia |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2017124116A1 (fr) |
Cited By (63)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107844327A (zh) * | 2017-11-03 | 2018-03-27 | 南京大学 | 一种实现上下文一致性的检测系统及检测方法 |
| RU2679967C1 (ru) * | 2018-03-14 | 2019-02-14 | Общество с ограниченной ответственностью "Смарт Энджинс Рус" | Устройство отыскания информации по ключевым словам |
| CN110019934A (zh) * | 2017-09-20 | 2019-07-16 | 微软技术许可有限责任公司 | 识别视频的相关性 |
| CN110119784A (zh) * | 2019-05-16 | 2019-08-13 | 重庆天蓬网络有限公司 | 一种订单推荐方法及装置 |
| WO2019241776A1 (fr) * | 2018-06-15 | 2019-12-19 | Geomni, Inc. | Systèmes et procédés de vision artificielle permettant la modélisation de toits de structures à l'aide de données bidimensionnelles et de données partielles tridimensionnelles |
| CN110880161A (zh) * | 2019-11-21 | 2020-03-13 | 大庆思特传媒科技有限公司 | 一种多主机多深度摄像头的深度图像拼接融合方法及系统 |
| CN110971976A (zh) * | 2019-11-22 | 2020-04-07 | 中国联合网络通信集团有限公司 | 一种音视频文件分析方法及装置 |
| CN111291085A (zh) * | 2020-01-15 | 2020-06-16 | 中国人民解放军国防科技大学 | 层次化兴趣匹配方法、装置、计算机设备和存储介质 |
| WO2020132142A1 (fr) * | 2018-12-18 | 2020-06-25 | Northwestern University | Système et procédé de calcul dans le domaine temporel en pipeline à l'aide de bascules à domaine temporel, et leur application en analyse de série chronologique |
| WO2020243645A1 (fr) * | 2019-05-31 | 2020-12-03 | Apple Inc. | Interfaces utilisateur pour une application de navigation et de lecture de podcast |
| US20210201143A1 (en) * | 2019-12-27 | 2021-07-01 | Samsung Electronics Co., Ltd. | Computing device and method of classifying category of data |
| US11057682B2 (en) | 2019-03-24 | 2021-07-06 | Apple Inc. | User interfaces including selectable representations of content items |
| US11070889B2 (en) | 2012-12-10 | 2021-07-20 | Apple Inc. | Channel bar user interface |
| EP3910645A1 (fr) * | 2020-05-13 | 2021-11-17 | Siemens Healthcare GmbH | Récupération d'image |
| CN113688212A (zh) * | 2021-10-27 | 2021-11-23 | 华南师范大学 | 句子情感分析方法、装置以及设备 |
| US11194546B2 (en) | 2012-12-31 | 2021-12-07 | Apple Inc. | Multi-user TV user interface |
| US11245967B2 (en) | 2012-12-13 | 2022-02-08 | Apple Inc. | TV side bar user interface |
| US11290762B2 (en) | 2012-11-27 | 2022-03-29 | Apple Inc. | Agnostic media delivery system |
| US11297392B2 (en) | 2012-12-18 | 2022-04-05 | Apple Inc. | Devices and method for providing remote control hints on a display |
| SE2051550A1 (en) * | 2020-12-22 | 2022-06-23 | Algoriffix Ab | Method and system for recognising patterns in sound |
| WO2022150401A1 (fr) * | 2021-01-05 | 2022-07-14 | Pictory, Corp | Procédé, système et appareil de résumé de vidéo par intelligence artificielle |
| CN114780512A (zh) * | 2022-03-22 | 2022-07-22 | 荣耀终端有限公司 | 一种灰度发布方法、系统及服务器 |
| CN114840129A (zh) * | 2021-02-01 | 2022-08-02 | 苹果公司 | 显示具有分层结构的卡片的表示 |
| EP4042292A1 (fr) * | 2020-12-17 | 2022-08-17 | Google LLC | Amélioration automatique de diffusion multimédia en continu à l'aide d'une transformation de contenu |
| US11461397B2 (en) | 2014-06-24 | 2022-10-04 | Apple Inc. | Column interface for navigating in a user interface |
| US11467726B2 (en) | 2019-03-24 | 2022-10-11 | Apple Inc. | User interfaces for viewing and accessing content on an electronic device |
| CN115203380A (zh) * | 2022-09-19 | 2022-10-18 | 山东鼹鼠人才知果数据科技有限公司 | 基于多模态数据融合的文本处理系统及其方法 |
| CN115362438A (zh) * | 2020-03-31 | 2022-11-18 | 斯纳普公司 | 在多媒体消息传送应用中对可修改视频进行搜索和排序 |
| US11514113B2 (en) | 2020-09-22 | 2022-11-29 | International Business Machines Corporation | Structural geographic based cultural group tagging hierarchy and sequencing for hashtags |
| WO2022251323A1 (fr) * | 2021-05-25 | 2022-12-01 | Emaginos Inc. | Plate-forme analytique éducative et procédé associé |
| US11520467B2 (en) | 2014-06-24 | 2022-12-06 | Apple Inc. | Input device and user interface interactions |
| US11520858B2 (en) | 2016-06-12 | 2022-12-06 | Apple Inc. | Device-level authorization for viewing content |
| US11532333B1 (en) | 2021-06-23 | 2022-12-20 | Microsoft Technology Licensing, Llc | Smart summarization, indexing, and post-processing for recorded document presentation |
| CN115527112A (zh) * | 2022-09-15 | 2022-12-27 | 哈尔滨工程大学 | 一种基于深度神经网络的声呐图像纹理特征去除方法 |
| EP4109299A1 (fr) * | 2021-06-24 | 2022-12-28 | Nokia Solutions and Networks Oy | Génération interactive de requêtes pour des éléments médiatiques annotés dans le temps |
| US11543938B2 (en) | 2016-06-12 | 2023-01-03 | Apple Inc. | Identifying applications on which content is available |
| US11582517B2 (en) | 2018-06-03 | 2023-02-14 | Apple Inc. | Setup procedures for an electronic device |
| US11609678B2 (en) | 2016-10-26 | 2023-03-21 | Apple Inc. | User interfaces for browsing content from multiple content applications on an electronic device |
| US20230094828A1 (en) * | 2021-09-27 | 2023-03-30 | Sap Se | Audio file annotation |
| US20230162020A1 (en) * | 2021-11-23 | 2023-05-25 | Microsoft Technology Licensing, Llc | Multi-Task Sequence Tagging with Injection of Supplemental Information |
| US11683565B2 (en) | 2019-03-24 | 2023-06-20 | Apple Inc. | User interfaces for interacting with channels that provide content that plays in a media browsing application |
| US11720229B2 (en) | 2020-12-07 | 2023-08-08 | Apple Inc. | User interfaces for browsing and presenting content |
| CN116595808A (zh) * | 2023-07-17 | 2023-08-15 | 中国人民解放军国防科技大学 | 事件金字塔模型构建与多粒度时空可视化方法和装置 |
| US11736619B2 (en) * | 2020-12-03 | 2023-08-22 | International Business Machines Corporation | Automated indication of urgency using Internet of Things (IoT) data |
| KR20230146233A (ko) * | 2022-04-12 | 2023-10-19 | 한국전자통신연구원 | 자연어를 이용한 동영상 구간 검색 방법 및 장치 |
| US11843838B2 (en) | 2020-03-24 | 2023-12-12 | Apple Inc. | User interfaces for accessing episodes of a content series |
| US11863837B2 (en) | 2019-05-31 | 2024-01-02 | Apple Inc. | Notification of augmented reality content on an electronic device |
| US11895371B1 (en) * | 2021-09-21 | 2024-02-06 | Amazon Technologies, Inc. | Media content segment generation and presentation |
| US11899895B2 (en) | 2020-06-21 | 2024-02-13 | Apple Inc. | User interfaces for setting up an electronic device |
| US11934640B2 (en) | 2021-01-29 | 2024-03-19 | Apple Inc. | User interfaces for record labels |
| US11962836B2 (en) | 2019-03-24 | 2024-04-16 | Apple Inc. | User interfaces for a media browsing application |
| US20240134912A1 (en) * | 2020-05-19 | 2024-04-25 | Miso Technologies Inc. | System and method for question-based content answering |
| US12062367B1 (en) * | 2021-06-28 | 2024-08-13 | Amazon Technologies, Inc. | Machine learning techniques for processing video streams using metadata graph traversal |
| US12074935B2 (en) | 2021-12-30 | 2024-08-27 | Google Llc | Systems, method, and media for removing objectionable and/or inappropriate content from media |
| CN118691375A (zh) * | 2024-06-28 | 2024-09-24 | 广州七亩地科技发展有限公司 | 一种基于用户画像的商品推荐方法及系统 |
| US12147964B2 (en) | 2017-05-16 | 2024-11-19 | Apple Inc. | User interfaces for peer-to-peer transfers |
| US12149779B2 (en) | 2013-03-15 | 2024-11-19 | Apple Inc. | Advertisement user interface |
| US12197501B2 (en) | 2023-05-31 | 2025-01-14 | Microsoft Technology Licensing, Llc | Historical data-based video categorizer |
| US12216987B2 (en) * | 2020-11-25 | 2025-02-04 | Nec Corporation | Generating heading based on extracted feature words |
| US12307082B2 (en) | 2018-02-21 | 2025-05-20 | Apple Inc. | Scrollable set of content items with locking feature |
| CN120107898A (zh) * | 2025-05-09 | 2025-06-06 | 中国电建集团西北勘测设计研究院有限公司 | 适用于水电站大坝泄水场景下的人员入侵视觉识别方法 |
| US12335569B2 (en) | 2018-06-03 | 2025-06-17 | Apple Inc. | Setup procedures for an electronic device |
| US20250298835A1 (en) * | 2024-03-19 | 2025-09-25 | Tubertdata LLC | Methods And Systems For Personalized Transcript Searching And Indexing Of Online Multimedia |
2017
- 2017-01-17 WO PCT/US2017/013829 patent/WO2017124116A1/fr not_active Ceased
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| AU2009212772A1 (en) * | 2009-08-24 | 2011-03-10 | Canon Kabushiki Kaisha | Method for media navigation |
| US20130307792A1 (en) | 2012-05-16 | 2013-11-21 | Google Inc. | Gesture touch inputs for controlling video on a touchscreen |
| WO2015157711A1 (fr) * | 2014-04-10 | 2015-10-15 | Google Inc. | Procédés, systèmes et supports pour rechercher un contenu vidéo |
| US20150293996A1 (en) * | 2014-04-10 | 2015-10-15 | Google Inc. | Methods, systems, and media for searching for video content |
Cited By (107)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11290762B2 (en) | 2012-11-27 | 2022-03-29 | Apple Inc. | Agnostic media delivery system |
| US12225253B2 (en) | 2012-11-27 | 2025-02-11 | Apple Inc. | Agnostic media delivery system |
| US11070889B2 (en) | 2012-12-10 | 2021-07-20 | Apple Inc. | Channel bar user interface |
| US12342050B2 (en) | 2012-12-10 | 2025-06-24 | Apple Inc. | Channel bar user interface |
| US11317161B2 (en) | 2012-12-13 | 2022-04-26 | Apple Inc. | TV side bar user interface |
| US12177527B2 (en) | 2012-12-13 | 2024-12-24 | Apple Inc. | TV side bar user interface |
| US11245967B2 (en) | 2012-12-13 | 2022-02-08 | Apple Inc. | TV side bar user interface |
| US11297392B2 (en) | 2012-12-18 | 2022-04-05 | Apple Inc. | Devices and method for providing remote control hints on a display |
| US12301948B2 (en) | 2012-12-18 | 2025-05-13 | Apple Inc. | Devices and method for providing remote control hints on a display |
| US11822858B2 (en) | 2012-12-31 | 2023-11-21 | Apple Inc. | Multi-user TV user interface |
| US12229475B2 (en) | 2012-12-31 | 2025-02-18 | Apple Inc. | Multi-user TV user interface |
| US11194546B2 (en) | 2012-12-31 | 2021-12-07 | Apple Inc. | Multi-user TV user interface |
| US12149779B2 (en) | 2013-03-15 | 2024-11-19 | Apple Inc. | Advertisement user interface |
| US12086186B2 (en) | 2014-06-24 | 2024-09-10 | Apple Inc. | Interactive interface for navigating in a user interface associated with a series of content |
| US11520467B2 (en) | 2014-06-24 | 2022-12-06 | Apple Inc. | Input device and user interface interactions |
| US11461397B2 (en) | 2014-06-24 | 2022-10-04 | Apple Inc. | Column interface for navigating in a user interface |
| US12105942B2 (en) | 2014-06-24 | 2024-10-01 | Apple Inc. | Input device and user interface interactions |
| US12468436B2 (en) | 2014-06-24 | 2025-11-11 | Apple Inc. | Input device and user interface interactions |
| US11520858B2 (en) | 2016-06-12 | 2022-12-06 | Apple Inc. | Device-level authorization for viewing content |
| US12287953B2 (en) | 2016-06-12 | 2025-04-29 | Apple Inc. | Identifying applications on which content is available |
| US11543938B2 (en) | 2016-06-12 | 2023-01-03 | Apple Inc. | Identifying applications on which content is available |
| US11966560B2 (en) | 2016-10-26 | 2024-04-23 | Apple Inc. | User interfaces for browsing content from multiple content applications on an electronic device |
| US11609678B2 (en) | 2016-10-26 | 2023-03-21 | Apple Inc. | User interfaces for browsing content from multiple content applications on an electronic device |
| US12147964B2 (en) | 2017-05-16 | 2024-11-19 | Apple Inc. | User interfaces for peer-to-peer transfers |
| CN110019934A (zh) * | 2017-09-20 | 2019-07-16 | 微软技术许可有限责任公司 | 识别视频的相关性 |
| CN110019934B (zh) * | 2017-09-20 | 2023-07-14 | 微软技术许可有限责任公司 | 识别视频的相关性 |
| CN107844327B (zh) * | 2017-11-03 | 2020-10-27 | 南京大学 | 一种实现上下文一致性的检测系统及检测方法 |
| CN107844327A (zh) * | 2017-11-03 | 2018-03-27 | 南京大学 | 一种实现上下文一致性的检测系统及检测方法 |
| US12307082B2 (en) | 2018-02-21 | 2025-05-20 | Apple Inc. | Scrollable set of content items with locking feature |
| RU2679967C1 (ru) * | 2018-03-14 | 2019-02-14 | Общество с ограниченной ответственностью "Смарт Энджинс Рус" | Устройство отыскания информации по ключевым словам |
| US11582517B2 (en) | 2018-06-03 | 2023-02-14 | Apple Inc. | Setup procedures for an electronic device |
| US12335569B2 (en) | 2018-06-03 | 2025-06-17 | Apple Inc. | Setup procedures for an electronic device |
| US12333218B2 (en) | 2018-06-15 | 2025-06-17 | Insurance Services Office, Inc. | Computer vision systems and methods for modeling roofs of structures using two-dimensional and partial three-dimensional data |
| US10909757B2 (en) | 2018-06-15 | 2021-02-02 | Geomni, Inc. | Computer vision systems and methods for modeling roofs of structures using two-dimensional and partial three-dimensional data |
| US11922098B2 (en) | 2018-06-15 | 2024-03-05 | Insurance Services Office, Inc. | Computer vision systems and methods for modeling roofs of structures using two-dimensional and partial three-dimensional data |
| WO2019241776A1 (fr) * | 2018-06-15 | 2019-12-19 | Geomni, Inc. | Systèmes et procédés de vision artificielle permettant la modélisation de toits de structures à l'aide de données bidimensionnelles et de données partielles tridimensionnelles |
| US11467831B2 (en) | 2018-12-18 | 2022-10-11 | Northwestern University | System and method for pipelined time-domain computing using time-domain flip-flops and its application in time-series analysis |
| WO2020132142A1 (fr) * | 2018-12-18 | 2020-06-25 | Northwestern University | Système et procédé de calcul dans le domaine temporel en pipeline à l'aide de bascules à domaine temporel, et leur application en analyse de série chronologique |
| US11057682B2 (en) | 2019-03-24 | 2021-07-06 | Apple Inc. | User interfaces including selectable representations of content items |
| US12432412B2 (en) | 2019-03-24 | 2025-09-30 | Apple Inc. | User interfaces for a media browsing application |
| US12008232B2 (en) | 2019-03-24 | 2024-06-11 | Apple Inc. | User interfaces for viewing and accessing content on an electronic device |
| US11962836B2 (en) | 2019-03-24 | 2024-04-16 | Apple Inc. | User interfaces for a media browsing application |
| US11445263B2 (en) | 2019-03-24 | 2022-09-13 | Apple Inc. | User interfaces including selectable representations of content items |
| US11467726B2 (en) | 2019-03-24 | 2022-10-11 | Apple Inc. | User interfaces for viewing and accessing content on an electronic device |
| US12299273B2 (en) | 2019-03-24 | 2025-05-13 | Apple Inc. | User interfaces for viewing and accessing content on an electronic device |
| US11750888B2 (en) | 2019-03-24 | 2023-09-05 | Apple Inc. | User interfaces including selectable representations of content items |
| US11683565B2 (en) | 2019-03-24 | 2023-06-20 | Apple Inc. | User interfaces for interacting with channels that provide content that plays in a media browsing application |
| CN110119784A (zh) * | 2019-05-16 | 2019-08-13 | 重庆天蓬网络有限公司 | 一种订单推荐方法及装置 |
| CN110119784B (zh) * | 2019-05-16 | 2020-08-04 | 重庆天蓬网络有限公司 | 一种订单推荐方法及装置 |
| US11863837B2 (en) | 2019-05-31 | 2024-01-02 | Apple Inc. | Notification of augmented reality content on an electronic device |
| US12204584B2 (en) | 2019-05-31 | 2025-01-21 | Apple Inc. | User interfaces for a podcast browsing and playback application |
| WO2020243645A1 (fr) * | 2019-05-31 | 2020-12-03 | Apple Inc. | Interfaces utilisateur pour une application de navigation et de lecture de podcast |
| US11797606B2 (en) | 2019-05-31 | 2023-10-24 | Apple Inc. | User interfaces for a podcast browsing and playback application |
| US12250433B2 (en) | 2019-05-31 | 2025-03-11 | Apple Inc. | Notification of augmented reality content on an electronic device |
| CN110880161A (zh) * | 2019-11-21 | 2020-03-13 | 大庆思特传媒科技有限公司 | 一种多主机多深度摄像头的深度图像拼接融合方法及系统 |
| CN110971976A (zh) * | 2019-11-22 | 2020-04-07 | 中国联合网络通信集团有限公司 | 一种音视频文件分析方法及装置 |
| CN110971976B (zh) * | 2019-11-22 | 2021-08-27 | 中国联合网络通信集团有限公司 | 一种音视频文件分析方法及装置 |
| US12205024B2 (en) * | 2019-12-27 | 2025-01-21 | Samsung Electronics Co., Ltd. | Computing device and method of classifying category of data |
| US20210201143A1 (en) * | 2019-12-27 | 2021-07-01 | Samsung Electronics Co., Ltd. | Computing device and method of classifying category of data |
| CN111291085A (zh) * | 2020-01-15 | 2020-06-16 | 中国人民解放军国防科技大学 | 层次化兴趣匹配方法、装置、计算机设备和存储介质 |
| CN111291085B (zh) * | 2020-01-15 | 2023-10-17 | 中国人民解放军国防科技大学 | 层次化兴趣匹配方法、装置、计算机设备和存储介质 |
| US12301950B2 (en) | 2020-03-24 | 2025-05-13 | Apple Inc. | User interfaces for accessing episodes of a content series |
| US11843838B2 (en) | 2020-03-24 | 2023-12-12 | Apple Inc. | User interfaces for accessing episodes of a content series |
| EP4127974A1 (fr) * | 2020-03-31 | 2023-02-08 | Snap Inc. | Recherche et classement de vidéos modifiables dans une application de messagerie multimédia |
| CN115362438A (zh) * | 2020-03-31 | 2022-11-18 | 斯纳普公司 | 在多媒体消息传送应用中对可修改视频进行搜索和排序 |
| EP3910645A1 (fr) * | 2020-05-13 | 2021-11-17 | Siemens Healthcare GmbH | Récupération d'image |
| US12288609B2 (en) | 2020-05-13 | 2025-04-29 | Siemens Healthineers Ag | Image retrieval |
| US20240134912A1 (en) * | 2020-05-19 | 2024-04-25 | Miso Technologies Inc. | System and method for question-based content answering |
| US11899895B2 (en) | 2020-06-21 | 2024-02-13 | Apple Inc. | User interfaces for setting up an electronic device |
| US12271568B2 (en) | 2020-06-21 | 2025-04-08 | Apple Inc. | User interfaces for setting up an electronic device |
| US11514113B2 (en) | 2020-09-22 | 2022-11-29 | International Business Machines Corporation | Structural geographic based cultural group tagging hierarchy and sequencing for hashtags |
| US12216987B2 (en) * | 2020-11-25 | 2025-02-04 | Nec Corporation | Generating heading based on extracted feature words |
| US11736619B2 (en) * | 2020-12-03 | 2023-08-22 | International Business Machines Corporation | Automated indication of urgency using Internet of Things (IoT) data |
| US11720229B2 (en) | 2020-12-07 | 2023-08-08 | Apple Inc. | User interfaces for browsing and presenting content |
| EP4042292A1 (fr) * | 2020-12-17 | 2022-08-17 | Google LLC | Amélioration automatique de diffusion multimédia en continu à l'aide d'une transformation de contenu |
| SE2051550A1 (en) * | 2020-12-22 | 2022-06-23 | Algoriffix Ab | Method and system for recognising patterns in sound |
| SE544738C2 (en) * | 2020-12-22 | 2022-11-01 | Algoriffix Ab | Method and system for recognising patterns in sound |
| WO2022150401A1 (fr) * | 2021-01-05 | 2022-07-14 | Pictory, Corp | Procédé, système et appareil de résumé de vidéo par intelligence artificielle |
| US12008038B2 (en) | 2021-01-05 | 2024-06-11 | Pictory, Corp. | Summarization of video artificial intelligence method, system, and apparatus |
| US11934640B2 (en) | 2021-01-29 | 2024-03-19 | Apple Inc. | User interfaces for record labels |
| CN114840129A (zh) * | 2021-02-01 | 2022-08-02 | 苹果公司 | 显示具有分层结构的卡片的表示 |
| WO2022251323A1 (fr) * | 2021-05-25 | 2022-12-01 | Emaginos Inc. | Plate-forme analytique éducative et procédé associé |
| GB2622335A (en) * | 2021-05-25 | 2024-03-13 | Emaginos Inc | Educational analytics platform and method thereof |
| US11790953B2 (en) | 2021-06-23 | 2023-10-17 | Microsoft Technology Licensing, Llc | Smart summarization, indexing, and post-processing for recorded document presentation |
| US11532333B1 (en) | 2021-06-23 | 2022-12-20 | Microsoft Technology Licensing, Llc | Smart summarization, indexing, and post-processing for recorded document presentation |
| WO2022271319A1 (fr) * | 2021-06-23 | 2022-12-29 | Microsoft Technology Licensing, Llc | Génération de résumé, indexation et post-traitement intelligents pour présentation de document enregistré |
| EP4109299A1 (fr) * | 2021-06-24 | 2022-12-28 | Nokia Solutions and Networks Oy | Génération interactive de requêtes pour des éléments médiatiques annotés dans le temps |
| US12062367B1 (en) * | 2021-06-28 | 2024-08-13 | Amazon Technologies, Inc. | Machine learning techniques for processing video streams using metadata graph traversal |
| US11895371B1 (en) * | 2021-09-21 | 2024-02-06 | Amazon Technologies, Inc. | Media content segment generation and presentation |
| US20230094828A1 (en) * | 2021-09-27 | 2023-03-30 | Sap Se | Audio file annotation |
| US11893990B2 (en) * | 2021-09-27 | 2024-02-06 | Sap Se | Audio file annotation |
| CN113688212A (zh) * | 2021-10-27 | 2021-11-23 | 华南师范大学 | 句子情感分析方法、装置以及设备 |
| US20230162020A1 (en) * | 2021-11-23 | 2023-05-25 | Microsoft Technology Licensing, Llc | Multi-Task Sequence Tagging with Injection of Supplemental Information |
| US12353998B2 (en) * | 2021-11-23 | 2025-07-08 | Microsoft Technology Licensing, Llc | Multi-task sequence tagging with injection of supplemental information |
| US12074935B2 (en) | 2021-12-30 | 2024-08-27 | Google Llc | Systems, method, and media for removing objectionable and/or inappropriate content from media |
| CN114780512A (zh) * | 2022-03-22 | 2022-07-22 | 荣耀终端有限公司 | 一种灰度发布方法、系统及服务器 |
| KR102859509B1 (ko) * | 2022-04-12 | 2025-09-16 | 한국전자통신연구원 | 자연어를 이용한 동영상 구간 검색 방법 및 장치 |
| KR20230146233A (ko) * | 2022-04-12 | 2023-10-19 | 한국전자통신연구원 | 자연어를 이용한 동영상 구간 검색 방법 및 장치 |
| CN115527112A (zh) * | 2022-09-15 | 2022-12-27 | 哈尔滨工程大学 | 一种基于深度神经网络的声呐图像纹理特征去除方法 |
| CN115203380B (zh) * | 2022-09-19 | 2022-12-20 | 山东鼹鼠人才知果数据科技有限公司 | 基于多模态数据融合的文本处理系统及其方法 |
| CN115203380A (zh) * | 2022-09-19 | 2022-10-18 | 山东鼹鼠人才知果数据科技有限公司 | 基于多模态数据融合的文本处理系统及其方法 |
| US12197501B2 (en) | 2023-05-31 | 2025-01-14 | Microsoft Technology Licensing, Llc | Historical data-based video categorizer |
| CN116595808A (zh) * | 2023-07-17 | 2023-08-15 | 中国人民解放军国防科技大学 | 事件金字塔模型构建与多粒度时空可视化方法和装置 |
| CN116595808B (zh) * | 2023-07-17 | 2023-09-08 | 中国人民解放军国防科技大学 | 事件金字塔模型构建与多粒度时空可视化方法和装置 |
| US20250298835A1 (en) * | 2024-03-19 | 2025-09-25 | Tubertdata LLC | Methods And Systems For Personalized Transcript Searching And Indexing Of Online Multimedia |
| CN118691375A (zh) * | 2024-06-28 | 2024-09-24 | 广州七亩地科技发展有限公司 | 一种基于用户画像的商品推荐方法及系统 |
| CN120107898A (zh) * | 2025-05-09 | 2025-06-06 | 中国电建集团西北勘测设计研究院有限公司 | 适用于水电站大坝泄水场景下的人员入侵视觉识别方法 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2017124116A1 (fr) | Recherche, complémentation et exploration de multimédia | |
| Amato et al. | AI in the media and creative industries | |
| US12374372B2 (en) | Automatic trailer detection in multimedia content | |
| Manzoor et al. | Multimodality representation learning: A survey on evolution, pretraining and its applications | |
| Shah et al. | Multimodal analysis of user-generated multimedia content | |
| US10769438B2 (en) | Augmented reality | |
| US10970334B2 (en) | Navigating video scenes using cognitive insights | |
| US20220208155A1 (en) | Systems and methods for transforming digital audio content | |
| US9892109B2 (en) | Automatically coding fact check results in a web page | |
| US10679063B2 (en) | Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics | |
| JP6361351B2 (ja) | 発話ワードをランク付けする方法、プログラム及び計算処理システム | |
| US20140255003A1 (en) | Surfacing information about items mentioned or presented in a film in association with viewing the film | |
| CN107924414A (zh) | 促进在计算装置处进行多媒体整合和故事生成的个人辅助 | |
| CN112989076A (zh) | 多媒体内容搜索方法、装置、设备及介质 | |
| US20140164371A1 (en) | Extraction of media portions in association with correlated input | |
| US12124524B1 (en) | Generating prompts for user link notes | |
| WO2019032994A1 (fr) | Dispositifs de communication orale, faciale et gestuelle et architecture informatique d'interaction avec un contenu multimédia numérique | |
| US20250119625A1 (en) | Generating video insights based on machine-generated text representations of videos | |
| US20160034585A1 (en) | Automatically generated comparison polls | |
| US20140161423A1 (en) | Message composition of media portions in association with image content | |
| US20230282017A1 (en) | Contextual sentiment analysis of digital memes and trends systems and methods | |
| US20140163956A1 (en) | Message composition of media portions in association with correlated text | |
| US20190095392A1 (en) | Methods and systems for facilitating storytelling using visual media | |
| Shao et al. | Evaluation on algorithms and models for multi-modal information fusion and evaluation in new media art and film and television cultural creation | |
| EP3306555A1 (fr) | Diversification des résultats de recherche multimédia sur des réseaux sociaux en ligne |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17701980; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 17701980; Country of ref document: EP; Kind code of ref document: A1 |