WO2016108113A1 - Apparatus, method, and server for generating audiobooks - Google Patents
Apparatus, method, and server for generating audiobooks
- Publication number
- WO2016108113A1 (PCT/IB2015/059598)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- dynamic range
- audio
- recording
- processor
- storage unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B5/00—Electrically-operated educational appliances
- G09B5/04—Electrically-operated educational appliances with audible presentation of the material to be studied
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/60—Network streaming of media packets
- H04L65/75—Media network packet handling
- H04L65/762—Media network packet handling at the source
Definitions
- the present specification relates generally to generating audiobooks.
- audiobooks are generated using recording equipment in a studio, where a user or reader reads a book from a physical or electronic copy. The reader's voice is recorded. After the reader has completed recording the book, a post production team will process the recordings to generate an audiobook for publishing.
- the audio recording is generally processed manually to synchronize the sound recording with the text of a book and for quality control. If portions of the recording are not recorded well, the reader would have to go re-record the section after locating the text in the book. In addition, the post production team would be required to insert the new recording in the appropriate location of the audiobook.
- an apparatus for generating an audiobook includes a network interface configured to receive ebook data and to transmit the audiobook to a server.
- the apparatus also includes a memory storage unit in communication with the network interface.
- the memory storage unit is configured to store the ebook data and a recording database.
- the recording database is configured to store a received audio recording.
- the apparatus includes a display screen in communication with the memory storage unit.
- the display screen is configured to display a text segment from the ebook data.
- the apparatus includes a microphone in communication with the memory storage unit.
- the microphone is configured to receive audio input for generating the received audio recording based on the text segment.
- the apparatus also includes an input device configured to receive a stop input upon completion of the received audio recording.
- the apparatus includes a processor in communication with the memory storage unit.
- the processor is configured to post-process the received audio recording.
- the processor is also configured to generate the audiobook using the received audio recording.
- the plurality of audio recordings may be indexed to the ebook data.
- the apparatus may further include a parsing engine.
- the parsing engine may be configured to parse the ebook data into a plurality of text segments.
- Each text segment of the plurality of text segments may have a minimum number of words.
- the minimum number of words may be sixty.
- the parsing engine may parse the ebook data at a natural break.
- the natural break is after a sentence.
- the processor may be further configured to assess the received audio recording for quality.
- the processor may be further configured to analyze the plurality of audio recordings to determine an upper dynamic range limit and a lower dynamic range limit.
- the display may generate a warning indicator when dynamic range of the audio input is outside of the upper dynamic range limit and the lower dynamic range limit.
- a method of generating an audiobook involves receiving ebook data via a network interface for storage in a memory storage unit.
- the memory storage unit is in communication with the network interface.
- the method further involves displaying a text segment from the ebook data on a display screen.
- the method also involves receiving audio input via a microphone.
- the audio input is for generating a received audio recording based on the text segment.
- the method involves receiving a stop input upon completion of the received audio recording for storage in a recording database in the memory storage unit.
- the method involves post-processing the received audio recording to generate the audiobook.
- a server for generating an audiobook includes a network interface configured to send and receive data via a network.
- the server includes a memory storage unit in communication with the network interface.
- the memory storage unit is configured to store ebook data and a recording database, wherein the recording database is configured to store a plurality of audio recordings received via the network.
- the server includes a processor in communication with the memory storage unit.
- the processor is configured to send a plurality of text segments to a device for displaying on a display screen, receive the plurality of audio recordings based on the plurality of text segments for storing in the recording database, and post-process the plurality of audio recordings to generate the audiobook.
- non-transitory computer readable medium configured with codes for generating an audiobook.
- the codes are for directing a processor to receive ebook data via a network interface for storage in a memory storage unit.
- the memory storage unit is in communication with the network interface.
- the codes further direct the processor to display a text segment from the ebook data on a display screen.
- the codes are for further directing a processor to receive audio input via a microphone.
- the audio input is for generating a received audio recording based on the text segment.
- the codes also direct the processor to receive a stop input upon completion of the received audio recording for storage in a recording database in the memory storage unit.
- the codes are for directing a processor to post-process the received audio recording to generate the audiobook.
- Figure 3 is a flow chart of a method of generating an audiobook in accordance with an embodiment.
- Figure 5 is a flow chart of a method of generating an audiobook in accordance with another embodiment.
- Figure 6 is a schematic representation of an apparatus for generating an audiobook in accordance with another embodiment.
- in FIG. 1, an apparatus for generating an audiobook is generally shown at 50.
- Figure 2 shows a schematic block diagram of the electronic components of the apparatus 50 in greater detail.
- the apparatus 50 includes a processor 52, a network interface 54, a display screen 56, a microphone 58, an input device 60, and a memory storage unit 62. It should be emphasized that the structure shown of the apparatus 50 is purely exemplary. In the present embodiment, the apparatus 50 is connected to a network 70, such as the Internet, for receiving and transmitting content as will be discussed in greater detail below.
- the display screen 56 is not particularly limited and can include a variety of different types of screens such as an array of light emitting diodes (LED), liquid crystals, plasma cells, organic light emitting diodes (OLED), or an electrophoretic ink (E-ink) screen. Furthermore, the display screen 56 can be a touchscreen configured to receive input for controlling the apparatus 50. In the present embodiment, the display screen 56 is a computer monitor in communication with the memory storage unit 62. The display screen 56 is generally configured to display content, such as a text segment selected from the ebook data stored on the memory storage unit 62. In particular, the display screen 56 is configured to display at least one text segment selected from the ebook data.
- the microphone 58 is in communication with the memory storage unit 62.
- the microphone 58 is an external microphone connected to a computer.
- the microphone 58 is generally configured to receive audio input that can be used to generate an audio recording for storage in the memory storage unit 62.
- the audio recording can be associated with a text segment selected from the ebook data and displayed on the display screen 56. It is to be appreciated by a person of skill in the art with the benefit of this description that the microphone 58 is not particularly limited and can be any type of microphone or device configured to receive audio input for storage on the memory storage unit 62.
- the memory storage unit 62 is in communication with other components of the apparatus 50 and can be of any type such as non-volatile memory (e.g. Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory, hard disk, floppy disk, optical disk, solid state drive, or tape drive) or volatile memory (e.g. random access memory (RAM)).
- RAM random access memory
- although the memory storage unit 62 is generally a type of non-volatile memory because of the robust nature of non-volatile memory, some embodiments can use volatile memory in situations where high access speed is desired.
- the memory storage unit 62 is a non-volatile memory unit storing a plurality of data records in a database.
- the memory storage unit 62 is generally configured to receive ebook data via the network
- the memory storage unit 62 is configured to store a recording database having a plurality of audio recordings received from the microphone 58.
- the recording database is not particularly limited.
- the recording database can include an index for indexing the plurality of audio recordings and associating each audio recording with a text segment selected from the ebook data.
- the apparatus 50 can include a feature that provides a complete playback of all audio recordings in the recording database in sequence to allow for effective reviewing of the audiobook to ensure that the quality of the audio recordings is consistent. It is to be appreciated that this feature can be useful in applications where audio recordings are not generated in sequence or when the user generating the audio recordings drifts with respect to tone or accents.
- the text segments can include some overlap to provide for blending of audio recordings for each text segment.
- the plurality of text segments can then be stored on the memory storage unit 62 in a separate database.
- each of the text segments can be indexed and identified, such as by page number, chapter number, line number or any other suitable identifier associated with the ebook.
- the manner by which the parsing engine 64 parses the ebook data is not particularly limited. In the present embodiment, the parsing engine 64 parses the ebook data into text segments with a minimum number of words. For example, the minimum number of words can be approximately sixty. Alternatively, the minimum number of words can be greater or less than sixty. For example, in some embodiments, the minimum number of words can be about thirty.
- the minimum number of words can be about forty. In yet another embodiment, the minimum number of words can be about fifty. In yet another embodiment, the minimum number of words can be about seventy. In yet another embodiment, the minimum number of words can be about eighty. In yet another embodiment, the minimum number of words can be about ninety.
- the parsing engine 64 can be configured to parse the ebook data into text segments with an equal number of words or with the additional limitation of a maximum number of words. By limiting the text segments to approximately the same size, the plurality of audio recordings would be more uniform in size.
- the parsing engine 64 can be omitted.
- the ebook data in the memory storage unit 62 can be received from the network 70 in a pre-parsed format.
- the ebook data can be manually parsed by a user (e.g. a reader) prior to generating the audiobook.
- Block 530 comprises receiving audio input via the microphone 58.
- the manner by which the audio input is received is not particularly limited.
- the audio input generally includes sound signals corresponding to the user reading the text segment displayed in block 520.
- the received audio input is then used to generate and store a received audio recording associated with the text segment displayed in block 520.
- the manner by which the audio recording is generated and stored is not particularly limited.
- the audio recording can be stored as a sound file and in a recording database on the memory storage unit 62.
- meta-tags can be optionally generated during the execution of block 530.
- the meta-tags can be used to identify various text segments such as by chapter, page, or line numbers corresponding to a print publication of the ebook.
- the meta-tags can be used to reference a text segment using a specific identifier number.
- the manner by which the meta-tags are generated is not particularly limited.
- the meta-tags can be generated manually by a user providing input, for example, via the input device 60.
- the meta-tag can be automatically generated by the processor 52.
- the input from the input device 60 can also represent an approval of the generated audio recording for storage in the memory storage unit 62. It is to be appreciated with the benefit of this description that the audio recording can be re-recorded if not approved, such as if the quality or audio input is not satisfactory (e.g. stutter during a recording, misread of the text segment, etc.). Therefore, the input from the input device 60 directs the apparatus 50 to stop recording the audio input in the presently received audio recording and to approve the received audio recording for storage into the database in the memory storage unit 62. In other embodiments, the two steps can be separated into two separate inputs for improved control or an optional toggle feature can allow the user to select between having a single input from the input device 60 carry out one or both of the functions described above.
- one quality that can be assessed in block 550 is the volume or dynamic range of the audio recording generated by the executions of blocks 530 and 540.
- the processor 52 can analyze a plurality of audio recordings to determine an upper dynamic range limit and a lower dynamic range limit.
- the manner by which the upper dynamic range limit and the lower dynamic range limit are determined is not particularly limited.
- the processor 52 can analyze the plurality of audio recordings or a subset thereof in the recordings database stored on the memory storage unit 62 associated with the ebook data to determine the maximum and minimum dynamic ranges and set the upper dynamic range limit and the lower dynamic range limit to the maximum and minimum, respectively.
- the upper dynamic range limit and the lower dynamic range limit can be set to a percentage of the maximum and minimum, respectively.
- the upper dynamic range limit and the lower dynamic range limit can be set to 90% of the maximum and minimum values. In other embodiments, the upper dynamic range limit and the lower dynamic range limit can be set to 80% of the maximum and minimum values. In another embodiment, the upper dynamic range limit and the lower dynamic range limit can be set to 75% of the maximum and minimum values.
- the upper dynamic range limit and the lower dynamic range limit can be set using an average across the plurality of audio recordings or a subset thereof, such that a predetermined percentage would fall within the upper dynamic range limit and the lower dynamic range limit.
- the predetermined percentage is not particularly limited.
- the percentage can represent a standard deviation (about 68%).
- the percentage can represent the 95th percentile.
- the processor 52 can determine whether audio input in the recording is within the upper dynamic range limit and the lower dynamic range limit.
- the processor 52 can adjust portions of the received audio recording where the audio input is outside of the range by capping the audio input at the upper dynamic range limit or the lower dynamic range limit.
- the entire received audio recording can be analyzed and scaled as a whole such that no portion exceeds the upper dynamic range limit and the lower dynamic range limit.
- block 550 is specifically performed by executing a series of automated Batch, Sound Exchange (SoX), and AutoHotKey (AHK) scripts to improve the audio quality, expand the dynamic range, and index the text segments.
- SoX Sound Exchange
- AHK AutoHotKey
- the apparatus is a device 50a, such as a portable electronic device (e.g. a smartphone or a tablet).
- the device 50a includes a processor 52a, a network interface 54a, a display screen 56a, a microphone 58a, and a speaker 68a.
- the processor 52a, the network interface 54a, the display screen 56a, the microphone 58a, and the speaker 68a are each disposed within a housing (not shown) of the device 50a, with the other components in electrical communication with the processor 52a.
- the network interface 54a is not particularly limited and can include various network interface devices such as a network interface controller (NIC).
- the network interface 54a is generally configured to connect the device 50a to a network, such as the network 70 described above, for receiving and sending content, such as ebook data.
- the network interface 54a can connect using a data link layer standard such as Ethernet, Wi-Fi, mobile network (such as, but not limited to, fourth generation (4G), third generation (3G), code division multiple access (CDMA), Groupe Special Mobile (GSM) or Long Term Evolution (LTE) standards), or Token Ring.
- 4G fourth generation
- 3G third generation
- CDMA code division multiple access
- GSM Groupe Special Mobile
- LTE Long Term Evolution
- the display screen 56a is not particularly limited and can include a variety of different types of screens such as an array of light emitting diodes (LED), liquid crystals, plasma cells, organic light emitting diodes (OLED), or an electrophoretic ink (E-ink) screen. Furthermore, the display screen 56a can be a touchscreen configured to receive input for controlling the device 50a. In the present embodiment, the display screen 56a is generally configured to display content, such as an ebook.
- LED light emitting diodes
- OLED organic light emitting diodes
- E-ink electrophoretic ink
- the speaker 68a is also not particularly limited and can be of any type of speaker.
- the device 50a can include a built in speaker commonly found in portable electronic devices.
- the device 50a can include an external speaker capable of generating louder audio output when desired.
- the device 50a can be configured to provide monophonic sound, stereophonic sound such as surround sound, or virtual surround sound.
- the speaker 68a can also be linked to other speakers (not shown) through a wireless connection, such as Bluetooth or Wi-Fi, for applications where wireless connections are desirable such as in a car audio system.
- the speaker 68a is configured to provide sound output such as for an audiobook or for playing back an audio recording. Accordingly, it is to be appreciated that in some embodiments where sound generation is not required, the speaker 68a can be omitted. In other embodiments, it is to be appreciated by a person of skill in the art with the benefit of this description that the device 50a can also serve as an audiobook player when the device 50a is not generating an audiobook. Therefore, the speaker 68a can also be used to output an audiobook to the user.
- the processor 52a is generally configured to execute programming instructions 100a for generating audiobooks.
- the instructions 100a can be configured to direct the processor 52a to carry out a method of generating an audiobook as described further below.
- the programming instructions 100a are stored in a computer readable storage medium (not shown in Figure 4) accessible by the processor 52a.
- a method for generating an audiobook using the device 50a is represented in the form of a flow-chart and indicated generally at 600.
- the method 600 is performed using the device 50a. It is to be appreciated that the following discussion of the method 600 will lead to further understanding of the device 50a and its various components. It is also to be understood that the device 50a and/or the method 600 can be varied, and need not work exactly as discussed herein in conjunction with each other, and that such variations are within the scope of the present invention. For example, the method 600 can also be carried out using the apparatus 50 described above. It is to be emphasized that the method 600 need not be performed in the exact sequence as shown and that various blocks can be performed in parallel rather than in sequence.
- the output can be optionally divided into portions, such as text segments, for display on the display screen 56a.
- the manner by which the output is divided is not particularly limited.
- the output can be divided into portions to facilitate reading of the output by a user.
- facilitating reading of the ebook output can involve dividing the ebook output into portions without splitting sentences or dividing a dialog into separate portions.
- Other manners by which reading can be facilitated involve dividing the output into portions that can be read in one breath of a reader.
- some embodiments of the device 50a use a predetermined routine to divide the ebook output.
- the division of the ebook output can be customized for each reader's preference.
- the manner by which the ebook output is divided can involve a machine learning engine that can automatically customize the manner by which the ebook output is divided for each reader after observing the capabilities and tendencies of the reader.
- Block 630 comprises receiving audio input via the microphone 58a.
- the manner by which the audio input is received is not particularly limited.
- the audio input generally includes sound signals corresponding to the user reading the ebook output generated in block 620.
- the received audio input is converted into a sound file and stored on the device 50a and can be included as part of the audiobook being generated.
- the input corresponding to an approval is not limited to manual intervention from a user.
- the processor 52a can be optionally configured to detect errors in the audio input automatically and subsequently generate an input corresponding to an approval if no errors are detected or if the number of errors falls within a predetermined tolerance.
- Some examples of potential errors that the processor 52a can detect include noise fluctuations and/or other extraneous sounds, such as lip smacking, or sounds made when a reader is hesitating such as "um" or "ah".
- the manner by which the errors are detected is not particularly limited.
- the frequency patterns of common errors can be compared to frequency patterns stored on a memory storage unit (not shown) of the device 50a.
- the audio input need not be a perfect studio-quality recording.
- the device 50a is generally not configured for a studio, but is instead configured to be commonly used by a reader in most locations.
- a reader can generate a part of an audiobook while at a coffee shop or while waiting for or riding on a bus. Accordingly, the approval of the audio input can forward the audio input for post processing.
- the processor can provide an opportunity to re-record the portion of the ebook output or edit the recorded audio input using various sound editing tools.
- the manner by which a determination is made that no input has been received is not particularly limited. For example, in some embodiments, a pre-determined time-out period can be used such that after the period has expired, the processor 52a can proceed as though no input corresponding to an approval is received. In other embodiments, the processor 52a can be directed to wait indefinitely for a response.
- the method 600 described above is a non-limiting representation.
- variants of the embodiments discussed above can be used.
- any combination of the optional features mentioned above can be included in the device.
- the device can include a manual approval from a reader at block 640 in addition to the automatic error detection routines discussed above.
- the parsing engine 64 can be operated by the server 80b. In such an embodiment, all or a portion of the ebook data is transmitted to the server 80b, which operates the parsing engine 64 to generate text segments that are returned to the apparatus 50b.
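The approval logic described above for the method 600 (automatic approval when the number of detected errors falls within a predetermined tolerance, and an optional time-out while waiting for an input corresponding to an approval) can be sketched roughly as follows. This is only an illustrative sketch: the function names `auto_approve` and `wait_for_approval`, the representation of detected errors as a list, and the use of a queue to deliver the reader's input are assumptions that do not appear in the specification.

```python
import queue


def auto_approve(detected_errors, tolerance=0):
    """Approve automatically when the error count is within the tolerance.

    detected_errors is a hypothetical list of errors found in the audio input
    (for example lip smacking or hesitation sounds).
    """
    return len(detected_errors) <= tolerance


def wait_for_approval(inputs, timeout=None):
    """Wait for an input corresponding to an approval.

    inputs is a queue.Queue fed by the input device (assumption).
    timeout=None waits indefinitely; otherwise, once the time-out period
    expires, proceed as though no approval was received.
    """
    try:
        return bool(inputs.get(timeout=timeout))
    except queue.Empty:
        return False
```

A recording that is not approved by either path would then be offered for re-recording or editing, as described above.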
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
Abstract
An apparatus, a method, a server, and a computer readable medium for generating an audiobook are provided. The apparatus includes a network interface for receiving ebook data, a memory storage unit for storing data, a display screen, a microphone for receiving audio input, an input device, and a processor for post-processing to generate the audiobook. The method involves receiving ebook data, displaying a text segment from the ebook data, receiving audio input, receiving a stop input, and post-processing an audio recording to generate the audiobook. The server includes a network interface, a memory storage unit and a processor for carrying out a method of generating an audiobook from received information. The computer readable medium is configured with codes for generating an audiobook, wherein the codes direct a processor to carry out a method of generating an audiobook from received information.
Description
APPARATUS, METHOD, AND SERVER FOR GENERATING AUDIOBOOKS
CROSS REFERENCE TO RELATED APPLICATIONS
This application claims priority from US Provisional Application 62/097,997, filed December 30, 2014, the contents of which are all incorporated herein by reference.
FIELD
[0001] The present specification relates generally to generating audiobooks.
BACKGROUND
[0002] Traditionally, audiobooks are generated using recording equipment in a studio, where a user or reader reads a book from a physical or electronic copy. The reader's voice is recorded. After the reader has completed recording the book, a post production team will process the recordings to generate an audiobook for publishing. In this regard, the audio recording is generally processed manually to synchronize the sound recording with the text of a book and for quality control. If portions of the recording are not recorded well, the reader would have to re-record the section after locating the text in the book. In addition, the post production team would be required to insert the new recording in the appropriate location of the audiobook.
SUMMARY
[0003] In accordance with an aspect of the invention, there is provided an apparatus for generating an audiobook. The apparatus includes a network interface configured to receive ebook data and to transmit the audiobook to a server. The apparatus also includes a memory storage unit in communication with the network interface. The memory storage unit is configured to store the ebook data and a recording database. The recording database is configured to store a received audio recording. In addition, the apparatus includes a display screen in communication with the memory storage unit. The display screen is configured to display a text segment from the ebook data. Furthermore, the apparatus includes a microphone in communication with the memory storage unit. The microphone is configured to receive audio input for generating the received audio recording based on the text segment. The apparatus also includes an input device configured to receive a stop input upon completion of the received audio recording. Also, the apparatus includes a processor in communication with the memory storage unit. The processor is configured to post-process the received audio recording. The processor is also configured to generate the audiobook using the received audio recording.
[0004] The recording database may be configured to store a plurality of audio recordings including the received audio recording.
[0005] The plurality of audio recordings may be indexed to the ebook data.
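One way to picture a recording database whose audio recordings are indexed to the ebook data is the following minimal in-memory sketch. The class name `RecordingDatabase`, its method names, and the choice of an integer segment index as the key are assumptions made for illustration only; the specification does not limit how the index is implemented.

```python
class RecordingDatabase:
    """In-memory stand-in for the recording database (illustrative only)."""

    def __init__(self):
        # Maps a text segment index (assumed integer key) to its audio recording.
        self._by_segment = {}

    def store(self, segment_index, audio):
        # Storing again for the same segment replaces the earlier take,
        # so a re-recording lands at the right place automatically.
        self._by_segment[segment_index] = audio

    def missing(self, segment_count):
        # Segments of the ebook that have no audio recording yet.
        return [i for i in range(segment_count) if i not in self._by_segment]

    def in_reading_order(self):
        # Recordings sorted by segment index, regardless of recording order.
        return [self._by_segment[i] for i in sorted(self._by_segment)]
```

Because each recording is keyed by its text segment, recordings made out of sequence can still be assembled in reading order, and a re-recorded segment does not have to be located and inserted by hand.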
[0006] The apparatus may further include a parsing engine. The parsing engine may be configured to parse the ebook data into a plurality of text segments.
[0007] Each text segment of the plurality of text segments may have a minimum number of words.
[0008] The minimum number of words may be sixty.
[0009] The parsing engine may parse the ebook data at a natural break.
[0010] The natural break is after a sentence.
[0011] The natural break is after a paragraph.
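As a minimal sketch of the parsing described in paragraphs [0006] to [0011], the following Python splits text at sentence ends and closes a text segment once it holds at least a minimum number of words. The function name `parse_segments`, the regular expression used to detect the end of a sentence, and the decision to merge a short trailing remainder into the previous segment are all assumptions made for illustration; the specification does not prescribe a particular implementation.

```python
import re

# Assumed sentence boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def parse_segments(text, min_words=60):
    """Split text into segments of at least min_words words.

    Segments are only closed at a natural break (here, after a sentence).
    """
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    segments, current, count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= min_words:
            segments.append(" ".join(current))
            current, count = [], 0
    if current:
        # Trailing text shorter than min_words: merge it into the previous
        # segment when there is one (an assumption, not from the source).
        tail = " ".join(current)
        if segments:
            segments[-1] = segments[-1] + " " + tail
        else:
            segments.append(tail)
    return segments
```

Breaking after a paragraph instead would follow the same pattern with a different boundary expression, for example one that matches blank lines.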
[0012] The processor may be further configured to assess the received audio recording for quality.
[0013] The processor may be further configured to analyze the plurality of audio recordings to determine an upper dynamic range limit and a lower dynamic range limit.
[0014] The processor may be further configured to analyze the audio input in the received audio recording to determine whether a dynamic range of the audio input is between the upper dynamic range limit and the lower dynamic range limit.
[0015] The display may generate a warning indicator when dynamic range of the audio input is outside of the upper dynamic range limit and the lower dynamic range limit.
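The checks in paragraphs [0013] to [0015] can be sketched as below, under explicit assumptions. The specification does not define how the dynamic range of a recording is measured, so this sketch assumes it is the ratio, in decibels, between the loudest and quietest non-silent frames of the recording. The function names, the frame length, and the representation of audio as a list of float samples are hypothetical.

```python
import math


def dynamic_range_db(samples, frame=4):
    """Assumed measure: dB ratio of the loudest to the quietest frame RMS."""
    rms = []
    for i in range(0, len(samples) - frame + 1, frame):
        chunk = samples[i:i + frame]
        rms.append(math.sqrt(sum(x * x for x in chunk) / frame))
    rms = [r for r in rms if r > 0.0]  # ignore frames of digital silence
    if not rms:
        return 0.0
    return 20.0 * math.log10(max(rms) / min(rms))


def range_limits(recordings):
    """Lower and upper limits taken as the minimum and maximum dynamic
    range found across the existing recordings."""
    values = [dynamic_range_db(r) for r in recordings]
    return min(values), max(values)


def needs_warning(samples, lower, upper):
    """True when the recording's dynamic range is outside the limits."""
    value = dynamic_range_db(samples)
    return not (lower <= value <= upper)
```

In this sketch the result of `needs_warning` would drive the warning indicator on the display; a real implementation would use a longer frame and could derive the limits from a percentage of the maximum and minimum rather than the raw values.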
[0016] In accordance with another aspect of the invention, there is provided a method of generating an audiobook. The method involves receiving ebook data via a network interface for storage in a memory storage unit. The memory storage unit is in communication with the network interface. The method further involves displaying a text segment from the ebook data on a display screen. The method also involves receiving audio input via a microphone. The
audio input is for generating a received audio recording based on the text segment. The method involves receiving a stop input upon completion of the received audio recording for storage in a recording database in the memory storage unit. Furthermore, the method involves post-processing the received audio recording to generate the audiobook.
[0017] In accordance with another aspect of the invention, there is provided a server for generating an audiobook. The server includes a network interface configured to send and receive data via a network. In addition, the server includes a memory storage unit in communication with the network interface. The memory storage unit is configured to store ebook data and a recording database, wherein the recording database is configured to store a plurality of audio recordings received via the network. Furthermore, the server includes a processor in communication with the memory storage unit. The processor is configured to send a plurality of text segments to a device for displaying on a display screen, receive the plurality of audio recordings based on the plurality of text segments for storing in the recording database, and post-process the plurality of audio recordings to generate the audiobook.
[0018] In accordance with another aspect of the invention, there is provided a non-transitory computer readable medium configured with codes for generating an audiobook. The codes are for directing a processor to receive ebook data via a network interface for storage in a memory storage unit. The memory storage unit is in communication with the network interface. The codes further direct the processor to display a text segment from the ebook data on a display screen. The codes are for further directing the processor to receive audio input via a microphone. The audio input is for generating a received audio recording based on the text segment. The codes also direct the processor to receive a stop input upon completion of the received audio recording for storage in a recording database in the memory storage unit. In addition, the codes are for directing the processor to post-process the received audio recording to generate the audiobook.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] Reference will now be made, by way of example only, to the accompanying drawings in which:
[0020] Figure 1 is a schematic representation of an apparatus for generating an audiobook in accordance with an embodiment;
[0021] Figure 2 is a schematic representation of components of the apparatus shown in figure 1;
[0022] Figure 3 is a flow chart of a method of generating an audiobook in accordance with an embodiment;
[0023] Figure 4 is a schematic representation of components of the apparatus for generating audiobooks in accordance with another embodiment;
[0024] Figure 5 is a flow chart of a method of generating an audiobook in accordance with another embodiment; and
[0025] Figure 6 is a schematic representation of an apparatus for generating an audiobook in accordance with another embodiment.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0026] Referring to figure 1, an apparatus for generating an audiobook is generally shown at 50. Figure 2 shows a schematic block diagram of the electronic components of the apparatus 50 in greater detail. The apparatus 50 includes a processor 52, a network interface 54, a display screen 56, a microphone 58, an input device 60, and a memory storage unit 62. It should be emphasized that the structure shown of the apparatus 50 is purely exemplary. In the present embodiment, the apparatus 50 is connected to a network 70, such as the Internet, for receiving and transmitting content as will be discussed in greater detail below.
[0027] In the present embodiment, the processor 52 is in communication with the other components of the apparatus 50. The processor 52 is generally configured to execute programming instructions 100 for generating audiobooks. In general, the programming instructions 100 are stored in the memory storage unit 62. In the present embodiment, the instructions 100 can be configured to direct the processor 52 to control the hardware components and to analyze data received from the memory storage unit 62 or the microphone 58. For example, the processor 52 can be directed to carry out post-processing of audio recordings stored in the memory storage unit 62. As another example, the processor 52 can also be directed to carry out live assessments of audio input received at the microphone 58.
[0028] In the present embodiment, the network interface 54 is generally configured to connect the apparatus 50 to the network 70 for receiving and sending content, such as receiving ebook data from a content provider via the network 70, or sending a generated audiobook to an external server (not shown). It is to be appreciated that in some embodiments, the network
interface 54 can also be used to transmit audio recordings to an external server (not shown) for further processing prior to the generation of an audiobook, as discussed below in further detail.
[0029] The manner by which the network interface 54 connects the apparatus 50 to the network 70 is not particularly limited. For example, the network interface 54 can be a network interface controller (NIC) and connect to the network 70 using a data link layer standard such as Ethernet, Wi-Fi, mobile network (such as, but not limited to, fourth generation (4G), third generation (3G), code division multiple access (CDMA), Groupe Special Mobile (GSM) or Long Term Evolution (LTE) standards), or Token Ring. Furthermore, it is to be appreciated that in some embodiments, the network interface 54 can be omitted and that the apparatus 50 can use another input/output method to receive and send content, such as reading from and writing to a computer readable medium.
[0030] The display screen 56 is not particularly limited and can include a variety of different types of screens such as an array of light emitting diodes (LED), liquid crystals, plasma cells, organic light emitting diodes (OLED), or an electrophoretic ink (E-ink) screen. Furthermore, the display screen 56 can be a touchscreen configured to receive input for controlling the apparatus 50. In the present embodiment, the display screen 56 is a computer monitor in communication with the memory storage unit 62. The display screen 56 is generally configured to display content, such as a text segment selected from the ebook data stored on the memory storage unit 62. In particular, the display screen 56 is configured to display at least one text segment selected from the ebook data.
[0031] The microphone 58 is in communication with the memory storage unit 62. In the present embodiment, the microphone 58 is an external microphone connected to a computer. The microphone 58 is generally configured to receive audio input that can be used to generate an audio recording for storage in the memory storage unit 62. In particular, the audio recording generated can be associated with the text segment selected from the ebook data and displayed on the display screen 56. It is to be appreciated by a person of skill in the art with the benefit of this description that the microphone 58 is not particularly limited and can be any type of microphone or device configured to receive audio input for storage on the memory storage unit 62.
[0032] The input device 60 is in communication with the processor 52 for controlling the apparatus 50. In particular, the input device 60 is generally configured to receive input corresponding to completion of an audio recording for stopping the storage of audio input received from the microphone 58. Continuing with the above example, the input device 60 can
receive a stop input from a user after audio input corresponding to the text segment selected from the ebook data is stored in the memory storage unit 62.
[0033] In the present embodiment, the input device 60 is a key or plurality of keys on a standard computer keyboard as shown in figure 1. It is to be appreciated by a person of skill in the art that variations are contemplated. In some embodiments, the input device 60 can be a single button separate from the keyboard. In other embodiments, the input device 60 can be combined with the display screen 56 in the form of a touchscreen device configured to receive a touch input. Furthermore, it is to be appreciated by a person of skill in the art that the input device 60 can be omitted from the apparatus 50 in some embodiments. For example, in some embodiments, the completion of an audio recording corresponding to the end of a text segment can be automatically detected using speech recognition and natural language processing techniques.
[0034] The memory storage unit 62 is in communication with other components of the apparatus 50 and can be of any type such as non-volatile memory (e.g. Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory, hard disk, floppy disk, optical disk, solid state drive, or tape drive) or volatile memory (e.g. random access memory (RAM)). Although the memory storage unit 62 is generally a type of non-volatile memory because of the robust nature of non-volatile memory, some embodiments can use volatile memory in situations where high access speed is desired. In the present embodiment, the memory storage unit 62 is a non-volatile memory unit storing a plurality of data records in a database. The memory storage unit 62 is generally configured to receive ebook data via the network interface 54. In addition, the memory storage unit 62 is configured to store a recording database having a plurality of audio recordings received from the microphone 58. It is to be appreciated by a person of skill in the art with the benefit of this description that the recording database is not particularly limited. For example, the recording database can include an index for indexing the plurality of audio recordings and associating each audio recording with a text segment selected from the ebook data. In addition, the apparatus 50 can include a feature that provides a complete playback of all audio recordings in the recording database in sequence to allow for effective reviewing of the audiobook to ensure that the quality of the audio recordings is consistent. It is to be appreciated that this feature can be useful in applications where audio recordings are not generated in sequence or when the user generating the audio recordings drifts with respect to tone or accent.
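The index and sequential playback feature described above can be illustrated with a minimal sketch. The class name `RecordingIndex`, its methods, and the use of integer segment positions and file paths are assumptions made for illustration only and are not taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingIndex:
    """Hypothetical index associating each audio recording with a text segment."""

    # Maps the position of a text segment in the ebook to an audio file path.
    entries: dict = field(default_factory=dict)

    def add(self, segment_id, audio_path):
        # Storing under an existing segment id replaces the earlier recording,
        # which models re-recording a segment.
        self.entries[segment_id] = audio_path

    def playback_order(self):
        # Recordings may be made out of sequence; playback follows ebook order.
        return [self.entries[k] for k in sorted(self.entries)]

    def missing(self, total_segments):
        # Segments that do not yet have an associated recording.
        return [i for i in range(total_segments) if i not in self.entries]
```

Keying the index by segment position means a review playback can run in ebook order even when the recordings were generated out of sequence.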
[0035] In the present embodiment, the apparatus 50 further includes a parsing engine 64 in
communication with the memory storage unit 62. In the present embodiment, the parsing engine 64 operates using a separate dedicated processor. However, it is to be appreciated that variations are contemplated and the parsing engine 64 can be operated by a generic processor and/or the processor 52 in some embodiments. The parsing engine 64 is generally configured to parse the ebook data into a plurality of text segments. In the present embodiment, the text segments are sequential and can be connected in order to reconstruct the ebook.
[0036] In other embodiments, the text segments can include some overlap to provide for blending of audio recordings for each text segment. The plurality of text segments can then be stored on the memory storage unit 62 in a separate database. In addition, each of the text segments can be indexed and identified, such as by page number, chapter number, line number or any other suitable identifier associated with the ebook. The manner by which the parsing engine 64 parses the ebook data is not particularly limited. In the present embodiment, the parsing engine 64 parses the ebook data into text segments with a minimum number of words. For example, the minimum number of words can be approximately sixty. Alternatively, the minimum number of words can be greater or less than sixty. For example, in some embodiments, the minimum number of words can be about thirty. In another embodiment, the minimum number of words can be about forty. In yet another embodiment, the minimum number of words can be about fifty. In yet another embodiment, the minimum number of words can be about seventy. In yet another embodiment, the minimum number of words can be about eighty. In yet another embodiment, the minimum number of words can be about ninety.
[0037] Furthermore, it is to be appreciated by a person of skill in the art with the benefit of this description that the text segment can vary in length such that the ebook data is parsed at a natural break in the language to facilitate the generation of an audio recording and improve the quality of the generated audiobook. For example, the natural break can be after the end of a sentence, paragraph or other natural pause in a language such that the transition from one audio recording to another audio recording is less easily detectable when listening to the audiobook as a whole.
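As a non-limiting illustration, parsing into text segments that each have a minimum number of words and that end at a sentence boundary could be sketched as follows. The function name `segment_text`, the naive punctuation-based sentence splitter, and the choice to merge a short trailing remainder into the last segment are assumptions of this sketch and are not prescribed by the description.

```python
import re


def segment_text(text, min_words=60):
    """Split text into segments of at least min_words words, breaking after sentences."""
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments, current, count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= min_words:
            # Minimum reached at a sentence boundary: close this segment.
            segments.append(" ".join(current))
            current, count = [], 0
    if current:
        # Trailing sentences fall short of the minimum: attach them to the
        # previous segment if there is one, otherwise keep them as the only segment.
        remainder = " ".join(current)
        if segments:
            segments[-1] += " " + remainder
        else:
            segments.append(remainder)
    return segments
```

A paragraph-level break could be handled the same way by splitting on blank lines instead of sentence punctuation.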
[0038] It is to be appreciated that variations to the parsing engine 64 are contemplated. In other embodiments, the parsing engine 64 can be configured to parse the ebook data into text segments with an equal number of words or have the additional limitation of a maximum number of words. By limiting the text segments to approximately the same size, the plurality of audio recordings would be more uniform in size. In some embodiments, it is to be appreciated that
the parsing engine 64 can be omitted. For example, the ebook data in the memory storage unit 62 can be received from the network 70 in a pre-parsed format. As another example, the ebook data can be manually parsed by a user (e.g. a reader) prior to generating the audiobook.
[0039] Referring now to figure 3, a method for generating an audiobook using the apparatus 50 is represented in the form of a flow-chart and indicated generally at 500. In order to assist in the explanation of the method 500, it will be assumed that the method 500 is performed using the apparatus 50. It is to be appreciated that the following discussion of the method 500 will lead to further understanding of the apparatus 50 and its various components. It is also to be understood that the apparatus 50 and/or the method 500 can be varied, and need not work exactly as discussed herein in conjunction with each other, and that such variations are within the scope of the present invention. It is to be emphasized that the method 500 need not be performed in the exact sequence as shown and that various blocks can be performed in parallel rather than in sequence; hence the elements of the method 500 are referred to herein as "blocks" rather than "steps".
[0040] The manner in which the method 500 is started is not particularly limited. For example, the method 500 can start upon receiving input at the apparatus 50. Alternatively, the method 500 can start when the apparatus 50 is powered on or receiving a command via the network interface 54. In another embodiment, the method 500 can also start when an application, such as a web browser or downloaded app, is executed.
[0041] Block 510 comprises receiving ebook data for storage in the memory storage unit 62. The manner by which ebook data is received is not particularly limited. In the present embodiment, the ebook data is generally received via the network interface 54 from the network 70. In addition, the format of the ebook data is not particularly limited and the apparatus 50 can be configured to be compatible with a wide variety of ebook data formats. In addition, the processor 52 can further be configured to automatically detect the format of received ebook data. In the present embodiment, the ebook data can be received in various formats such as iBook, Kindle, EPUB, eReader, PDF, OpenDocument Format, MOBI, Amazon Word, Microsoft Reader, plain text, or Microsoft Word. In other embodiments, additional formats can also be supported such as proprietary formats supported by devices such as Kindle, Kobo, Nook and iPad, etc.
[0042] Block 520 comprises displaying a text segment from the ebook data stored on the memory storage unit 62 to the display screen 56. The manner by which the text segment is generated is not particularly limited and can include various manners. In the present
embodiment, a parsing engine 64 parses the ebook data into a plurality of text segments and stores the text segments in a database on the memory storage unit 62. The parsed text segments can then each be read for output to the display screen 56.
[0043] The manner by which the ebook data is divided into text segments by the parsing engine 64 is not particularly limited. For example, the ebook data can be divided into portions to facilitate reading the text segment by a user. It is to be appreciated by a person of skill in the art with the benefit of this description that facilitating reading of the text segment can involve dividing the ebook data into text segments without splitting sentences or dividing a dialog into separate text segments. Other manners by which reading can be facilitated involve dividing the ebook data into portions that can be read in one breath of the user (i.e. typically about 60 words for an average person). Accordingly, it is contemplated that some embodiments of the apparatus 50 use a predetermined routine to parse the ebook data. In other embodiments, the parsing of the ebook data can be customized for each user's preferences. For example, some users can prefer to have larger text segments and others prefer shorter text segments. It is to be appreciated that larger text segments can provide for more seamless generation of an audiobook requiring less input (as discussed later in block 540). In contrast, it is to be appreciated that shorter text segments can facilitate editing and allow for easier re-recording of a specific audio recording if desired. In further embodiments still, the manner by which the ebook data is parsed can involve a machine learning engine that can automatically customize the manner by which the ebook data is parsed for each user after observing the capabilities and tendencies of the user.
[0044] In other embodiments, the ebook data can include separated text segments when received from a third party source. Accordingly, in these embodiments, the parsing engine 64 can be optional and omitted.
[0045] Block 530 comprises receiving audio input via the microphone 58. The manner by which the audio input is received is not particularly limited. In the present embodiment, the audio input generally includes sound signals corresponding to the user reading the text segment displayed in block 520. The received audio input is then used to generate and store a received audio recording associated with the text segment displayed in block 520. The manner by which the audio recording is generated and stored is not particularly limited. For example, the audio recording can be stored as a sound file and in a recording database on the memory storage unit 62.
[0046] It is to be appreciated by a person of skill in the art that meta-tags can be optionally
generated during the execution of block 530. The meta-tags can be used to identify various text segments such as by chapter, page, or line numbers corresponding to a print publication of the ebook. Alternatively, the meta-tags can be used to reference a text segment using a specific identifier number. The manner by which the meta-tags are generated is not particularly limited. For example, the meta-tags can be generated manually by a user providing input, for example, via the input device 60. In another embodiment, the meta-tag can be automatically generated by the processor 52.
[0047] Block 540 comprises receiving an input from the input device 60 to stop recording of the received audio recording at the end of the text segment displayed in block 520. It is to be appreciated by a person of skill in the art with the benefit of this description, that the received input is generally configured to coincide with the completion of the audio recording as indicated by a user (e.g. the user completed reading the text segment). In other embodiments, block 540 can be modified or omitted. For example, in other embodiments, the apparatus 50 can include a voice recognition system and a language processor to automatically detect the completion of the audio recording based on a comparison with the text segment.
[0048] The input from the input device 60 can also represent an approval of the generated audio recording for storage in the memory storage unit 62. It is to be appreciated with the benefit of this description that the audio recording can be re-recorded if not approved, such as if the quality or audio input is not satisfactory (e.g. stutter during a recording, misread of the text segment, etc.). Therefore, the input from the input device 60 directs the apparatus 50 to stop recording the audio input in the presently received audio recording and to approve the received audio recording for storage into the database in the memory storage unit 62. In other embodiments, the two steps can be separated into two separate inputs for improved control or an optional toggle feature can allow the user to select between having a single input from the input device 60 carry out one or both of the functions described above.
[0049] Block 550 comprises post-processing the audio recording generated by the executions of blocks 530 and 540. The manner by which the audio recording is processed during the execution of block 550 is not particularly limited and can involve processing the audio recording using various components. In the present embodiment, the audio recording is assessed for quality. In particular, the quality of the audio recording generated by the executions of blocks 530 and 540 can be analyzed and compared with the quality of other audio recordings in the recording database stored on the memory storage unit 62. The manner by which the comparisons are made is not particularly limited. For example, the audio recording
generated by the executions of blocks 530 and 540 can be compared with audio recordings associated with an adjacent text segment previously recorded, multiple text segments within a predetermined proximity, or a random sample of text segments in the recording database.
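The three ways of choosing comparison recordings mentioned above can be sketched as follows. The function name `comparison_ids`, the strategy labels, and the representation of text segments as integer positions are hypothetical and introduced only for illustration.

```python
import random


def comparison_ids(current_id, recorded_ids, strategy="adjacent",
                   window=3, sample_size=5, seed=None):
    """Pick previously recorded segments to compare the current recording against."""
    # Every recorded segment except the one being assessed, in ebook order.
    others = sorted(i for i in recorded_ids if i != current_id)
    if strategy == "adjacent":
        # Only the immediately neighbouring segments.
        return [i for i in others if abs(i - current_id) == 1]
    if strategy == "proximity":
        # All segments within a predetermined distance of the current one.
        return [i for i in others if abs(i - current_id) <= window]
    if strategy == "random":
        # A random sample drawn from the whole recording database.
        rng = random.Random(seed)
        return rng.sample(others, min(sample_size, len(others)))
    raise ValueError(f"unknown strategy: {strategy}")
```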
[0050] In the present embodiment, one quality that can be assessed in block 550 is the volume or dynamic range of the audio recording generated by the executions of blocks 530 and 540. The processor 52 can analyze a plurality of audio recordings to determine an upper dynamic range limit and a lower dynamic range limit. The manner by which the upper dynamic range limit and the lower dynamic range limit are determined is not particularly limited. For example, the processor 52 can analyze the plurality of audio recordings or a subset thereof in the recording database stored on the memory storage unit 62 associated with the ebook data to determine the maximum and minimum dynamic ranges and set the upper dynamic range limit and the lower dynamic range limit to the maximum and minimum, respectively. As another example, the upper dynamic range limit and the lower dynamic range limit can be set to a percentage of the maximum and minimum, respectively. For example, the upper dynamic range limit and the lower dynamic range limit can be set to 90% of the maximum and minimum values. In other embodiments, the upper dynamic range limit and the lower dynamic range limit can be set to 80% of the maximum and minimum values. In another embodiment, the upper dynamic range limit and the lower dynamic range limit can be set to 75% of the maximum and minimum values.
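A minimal sketch of this determination is given below. It assumes that each audio recording has already been reduced to a single numeric level (how that level is measured is not specified here), and it applies the percentage literally to both the maximum and the minimum. The function name `range_limits` and the `fraction` parameter are hypothetical.

```python
def range_limits(levels, fraction=1.0):
    """Return (upper_limit, lower_limit) from per-recording levels.

    fraction=1.0 uses the maximum and minimum directly; fraction=0.9 sets
    both limits to 90% of the maximum and minimum, respectively.
    """
    if not levels:
        raise ValueError("at least one recording level is required")
    return fraction * max(levels), fraction * min(levels)
```

For example, levels of 10.0, 20.0 and 40.0 with a fraction of 0.75 would give an upper limit of 30.0 and a lower limit of 7.5.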
[0051] In other embodiments, the upper dynamic range limit and the lower dynamic range limit can be set using an average across the plurality of audio recordings or a subset thereof, such that a predetermined percentage would fall within the upper dynamic range limit and the lower dynamic range limit. The predetermined percentage is not particularly limited. For example, the percentage can represent a standard deviation (about 68%). In another embodiment, the percentage can represent the 95th percentile.
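One way to sketch limits based on an average is to place them a chosen number of standard deviations either side of the mean. This assumes per-recording levels that are roughly normally distributed, in which case one standard deviation covers about 68% of recordings and two cover about 95%. The function name `limits_from_spread` and the `num_std` parameter are hypothetical, and the sketch is only one possible reading of the paragraph above.

```python
import statistics


def limits_from_spread(levels, num_std=1.0):
    """Return (upper_limit, lower_limit) as mean +/- num_std standard deviations."""
    mean = statistics.mean(levels)
    # Population standard deviation over the recordings analyzed.
    std = statistics.pstdev(levels)
    return mean + num_std * std, mean - num_std * std
```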
[0052] Once the upper dynamic range limit and the lower dynamic range limit have been determined, the processor 52 can determine whether audio input in the recording is within the upper dynamic range limit and the lower dynamic range limit. The processor 52 can adjust portions of the received audio recording where the audio input is outside of the range by capping the audio input at the upper dynamic range limit or the lower dynamic range limit. In other embodiments, the entire received audio recording can be analyzed and scaled as a whole such that no portion falls outside the upper dynamic range limit and the lower dynamic range limit.
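The two adjustments described above can be sketched as follows, assuming the recording is represented as a list of non-negative level values (for example, one per short frame). The function names `cap_levels` and `scale_to_upper` are hypothetical, and the scaling variant in this sketch applies a single gain that enforces the upper limit only, which is a simplification of the embodiment described.

```python
def cap_levels(levels, upper, lower):
    """Cap each level at the nearest limit when it falls outside the range."""
    return [min(max(level, lower), upper) for level in levels]


def scale_to_upper(levels, upper):
    """Scale the whole recording by one gain so that its peak does not exceed upper."""
    peak = max(levels)
    if peak <= upper or peak == 0:
        # Already within the upper limit: leave the recording unchanged.
        return list(levels)
    gain = upper / peak
    return [level * gain for level in levels]
```

Capping changes only the portions that are out of range, whereas scaling preserves the relative levels within the recording.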
[0053] In the present embodiment, block 550 is specifically performed by executing a series of automated Batch, Sound Exchange (SoX), and AutoHotKey (AHK) scripts to improve the audio quality, expand the dynamic range, and index the text segments. In particular, the following SoX settings are applied to each audio recording:
[0054] It is to be appreciated by a person of skill in the art with the benefit of this description that the upper dynamic range limit and the lower dynamic range limit can also be stored in the memory storage unit 62 for optional use during the execution of other steps in the method 500. The upper dynamic range limit and the lower dynamic range limit can be adjusted and dynamic in nature, such as being updated after each approval of an audio recording. As an example of such use, block 530 can include a live assessment of the audio input received at the microphone 58 which determines whether the received audio input is within the upper dynamic range limit and the lower dynamic range limit. If the audio input is not within the upper dynamic range limit and the lower dynamic range limit, the display 56 can generate a warning indicator to the user.
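The live assessment and the updating of the limits after each approval could be sketched as below. The class name `LiveLevelMonitor`, its methods, and the rule of resetting the limits to the maximum and minimum of the approved levels are assumptions of this sketch. How a level is measured from the microphone input is left out.

```python
class LiveLevelMonitor:
    """Hypothetical live check of incoming audio levels against stored limits."""

    def __init__(self, upper_limit, lower_limit):
        self.upper_limit = upper_limit
        self.lower_limit = lower_limit

    def needs_warning(self, level):
        # True when the level is outside the stored limits, in which case
        # the display would show a warning indicator to the user.
        return not (self.lower_limit <= level <= self.upper_limit)

    def update_limits(self, approved_levels):
        # Recompute the limits from the levels of all approved recordings,
        # for example after each approval of an audio recording.
        if approved_levels:
            self.upper_limit = max(approved_levels)
            self.lower_limit = min(approved_levels)
```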
[0055] Again, it is to be re-emphasized that the method 500 described above is a non-limiting representation. For example, variants of the embodiments discussed above can be used.
[0056] Referring to figure 4, a schematic block diagram of the electronic components of another embodiment of an apparatus for generating audiobooks is generally shown. In the
present embodiment, the apparatus is a device 50a, such as a portable electronic device (e.g. a smartphone or a tablet). Like components of the device 50a bear like reference to their counterparts in the apparatus 50, except followed by the suffix "a". The device 50a includes a processor 52a, a network interface 54a, a display screen 56a, a microphone 58a, and a speaker 68a. The processor 52a, the network interface 54a, the display screen 56a, the microphone 58a, and the speaker 68a are each disposed within a housing (not shown) of the device 50a, and the network interface 54a, the display screen 56a, the microphone 58a, and the speaker 68a are each in electrical communication with the processor 52a.
[0057] In the present embodiment, the network interface 54a is not particularly limited and can include various network interface devices such as a network interface controller (NIC). The network interface 54a is generally configured to connect the device 50a to a network, such as the network 70 described above, for receiving and sending content, such as ebook data. For example, the network interface 54a can connect using a data link layer standard such as Ethernet, Wi-Fi, mobile network (such as, but not limited to, fourth generation (4G), third generation (3G), code division multiple access (CDMA), Groupe Special Mobile (GSM) or Long Term Evolution (LTE) standards), or Token Ring. Furthermore, it is to be appreciated that in some embodiments, the network interface 54a can be omitted and that the device 50a can use another input/output device to receive and send content.
[0058] The display screen 56a is not particularly limited and can include a variety of different types of screens such as an array of light emitting diodes (LED), liquid crystals, plasma cells, organic light emitting diodes (OLED), or an electrophoretic ink (E-ink) screen. Furthermore, the display screen 56a can be a touchscreen configured to receive input for controlling the device 50a. In the present embodiment, the display screen 56a is generally configured to display content, such as an ebook.
[0059] The microphone 58a is also not particularly limited and can be any type of microphone. In the present embodiment, the device 50a can include a built-in microphone commonly found in portable electronic devices, such as smartphones, tablets and laptops. In other examples, the device 50a can include an external microphone such as a hands-free Bluetooth device or a car audio system connected via Bluetooth. In the present embodiment, the microphone 58a is configured to receive audio input such as the voice of a user.
[0060] The speaker 68a is also not particularly limited and can be any type of speaker. For example, the device 50a can include a built-in speaker commonly found in portable electronic devices. In other examples, the device 50a can include an external speaker capable of generating louder audio output when desired. Furthermore, the device 50a can be configured
to provide monophonic sound, stereophonic sound such as surround sound, or virtual surround sound. In addition, the speaker 68a can also be linked to other speakers (not shown) through a wireless connection, such as Bluetooth or Wi-Fi, for applications where wireless connections are desirable such as in a car audio system.
[0061] In the present embodiment, the speaker 68a is configured to provide sound output such as for an audiobook or for playing back an audio recording. Accordingly, it is to be appreciated that in some embodiments where sound generation is not required, the speaker 68a can be omitted. In other embodiments, it is to be appreciated by a person of skill in the art with the benefit of this description that the device 50a can also serve as an audiobook player when the device 50a is not generating an audiobook. Therefore, the speaker 68a can also be used to output an audiobook to the user.
[0062] The processor 52a is generally configured to execute programming instructions 100a for generating audiobooks. For example, the instructions 100a can be configured to direct the processor 52a to carry out a method of generating an audiobook as described further below. In general, the programming instructions 100a are stored in a computer readable storage medium (not shown in figure 4) accessible by the processor 52a.
[0063] Referring now to figure 5, a method for generating an audiobook using the device 50a is represented in the form of a flow-chart and indicated generally at 600. In order to assist in the explanation of the method 600, it will be assumed that the method 600 is performed using the device 50a. It is to be appreciated that the following discussion of the method 600 will lead to further understanding of the device 50a and its various components. It is also to be understood that the device 50a and/or the method 600 can be varied, and need not work exactly as discussed herein in conjunction with each other, and that such variations are within the scope of the present invention. For example, the method 600 can also be carried out using the apparatus 50 described above. It is to be emphasized, that method 600 need not be performed in the exact sequence as shown and that various blocks can be performed in parallel rather than in sequence.
[0064] Block 610 is the start of the method 600. The manner in which the method 600 is started is not particularly limited. For example, the method 600 can start upon receiving input at the device 50a. Alternatively, the method 600 can start when the device 50a is powered on or receiving a command via the network interface 54a. In another embodiment, the method 600 can also start when an application, such as a web browser or downloaded app, is executed.
[0065] Block 620 comprises generating ebook output to the display screen 56a. The manner by which the output is generated is not particularly limited and can include various manners. For example, the ebook can be downloaded from a third party source in one of many formats. The processor 52a subsequently can translate the downloaded ebook into text for output to the display screen 56a. It is to be appreciated that the format of the ebook received is not limited and can include a wide variety of formats typically used for ebooks to be translated into text for output to the display screen 56a. Furthermore, the processor 52a can further be configured to automatically detect the format of an ebook file and execute the appropriate translation routine.
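By way of illustration only, the automatic format detection and dispatch to a translation routine described above can be sketched as follows. The function names, the `TRANSLATORS` registry, and the two supported formats are assumptions introduced for this sketch and are not part of the described embodiment; the tag-stripping routine is deliberately rough.

```python
import re
from pathlib import Path
from typing import Callable, Dict


def translate_plain_text(raw: bytes) -> str:
    """Decode a plain-text ebook into displayable text."""
    return raw.decode("utf-8", errors="replace")


def translate_html(raw: bytes) -> str:
    """Strip tags from an HTML ebook (rough; real markup needs a proper parser)."""
    text = raw.decode("utf-8", errors="replace")
    return re.sub(r"<[^>]+>", "", text)


# Hypothetical registry mapping a detected format to its translation routine.
TRANSLATORS: Dict[str, Callable[[bytes], str]] = {
    "txt": translate_plain_text,
    "html": translate_html,
}


def detect_format(filename: str, raw: bytes) -> str:
    """Guess the format from the content first, then from the file extension."""
    head = raw[:256].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    suffix = Path(filename).suffix.lower().lstrip(".")
    if suffix in ("htm", "html"):
        return "html"
    return suffix or "txt"


def ebook_to_text(filename: str, raw: bytes) -> str:
    """Detect the ebook format and run the matching translation routine."""
    fmt = detect_format(filename, raw)
    try:
        translator = TRANSLATORS[fmt]
    except KeyError:
        raise ValueError(f"unsupported ebook format: {fmt!r}")
    return translator(raw)
```

In this sketch, supporting a further format amounts to registering one more routine in the registry, which keeps the detection step separate from the translation step.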
[0066] In the present embodiment, the output can be optionally divided into portions, such as text segments, for display on the display screen 56a. The manner by which the output is divided is not particularly limited. For example, the output can be divided into portions to facilitate reading of the output by a user. It is to be appreciated by a person of skill in the art that facilitating reading of the ebook output can involve dividing the ebook output into portions without splitting sentences or dividing a dialog into separate portions. Other manners by which reading can be facilitated involve dividing the output into portions that can be read in one breath of a reader. Accordingly, it is contemplated that some embodiments of the device 50a use a predetermined routine to divide the ebook output. In other embodiments, the division of the ebook output can be customized for each reader's preference. In further embodiments still, the manner by which the ebook output is divided can involve a machine learning engine that can automatically customize the manner by which the ebook output is divided for each reader after observing the capabilities and tendencies of the reader.
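As a minimal sketch of one predetermined routine of this kind, the following divides text into portions that end only at sentence boundaries, each holding at least a minimum number of words. The function name, the regular expression used to find sentence ends, and the handling of a short trailing remainder are assumptions made for illustration; the default of sixty words follows the minimum recited in the claims.

```python
import re
from typing import List


def split_into_segments(text: str, min_words: int = 60) -> List[str]:
    """Divide text into segments of at least min_words without splitting sentences."""
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= min_words:
            segments.append(" ".join(current))
            current, count = [], 0
    if current:
        # Fold a short trailing remainder into the previous segment so that
        # no segment falls below the minimum (unless the whole text is short).
        if segments:
            segments[-1] += " " + " ".join(current)
        else:
            segments.append(" ".join(current))
    return segments
```

A reader-specific variant could adjust `min_words` per reader; keeping a dialog together or sizing portions to one breath would need additional rules not shown here.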
[0067] Block 630 comprises receiving audio input via the microphone 58a. The manner by which the audio input is received is not particularly limited. In the present embodiment, the audio input generally includes sound signals corresponding to the user reading the ebook output generated in block 620. The received audio input is converted into a sound file and stored on the device 50a and can be included as part of the audiobook being generated.
[0068] It is to be appreciated by a person of skill in the art that meta-tags can be optionally generated during the execution of block 630. The meta-tags can be used to identify various sections of the ebook output such as chapters or page numbers corresponding to page numbers in a printed version. The manner by which the meta-tags are generated is not particularly limited. For example, the meta-tags can be generated manually by a reader providing input, for example, via the display screen 56a having a touchscreen. In other embodiments, the meta-tags can be generated using a keyboard or button (not shown for this embodiment) on the device 50a. In yet another embodiment, the meta-tags can be automatically generated by the processor 52a by referencing the portion of the ebook output displayed at block 620. In yet another embodiment, the meta-tags can be generated using a voice recognition system that matches the received audio input to the ebook output.
[0069] Block 640 comprises receiving an input corresponding to an approval of the audio input. The manner by which the approval is received is not particularly limited. In the present embodiment, a button is presented to a user on the display screen 56a which the user can touch using the touchscreen display screen 56a to approve the audio input received at block 630. It is to be appreciated with the benefit of this description, that the audio input can be optionally reviewed in some embodiments. Accordingly, a second button can be presented on the touchscreen display screen 56a for reviewing the audio input.
[0070] It is to be appreciated that the input corresponding to an approval is not limited to manual intervention from a user. For example, in the present embodiment, the processor 52a can be optionally configured to detect errors in the audio input automatically and subsequently generate an input corresponding to an approval if no errors are detected or if the number of errors falls within a predetermined tolerance. Some examples of potential errors that the processor 52a can detect include noise fluctuations and/or other extraneous sounds, such as lip smacking, or sounds made when a reader is hesitating such as "um" or "ah". The manner by which the errors are detected is not particularly limited. For example, the frequency patterns of common errors can be compared to frequency patterns stored on a memory storage unit (not shown) of the device 50a.
[0071] In some further embodiments, the received audio input can be provided to a speech recognition engine (not shown) for conversion to text. The text received from the speech recognition engine can subsequently be compared by the processor 52a with the ebook data to ensure accuracy of the audiobook being generated. For example, if the reader misread a word or sentence, the processor 52a can generate a warning so that the reader can further review the audio input. Alternatively, if no error is detected, the processor 52a can generate an input corresponding to an approval.
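The comparison between the recognized text and the ebook data can be sketched as below, assuming the speech recognition engine has already returned a transcript as a string. The word-level alignment uses Python's standard `difflib.SequenceMatcher`; the function names and the `tolerance` parameter are hypothetical, and the description does not prescribe any particular comparison technique.

```python
import difflib
import re
from typing import List, Tuple


def normalize(text: str) -> List[str]:
    """Lower-case the text and keep only word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_misreadings(expected_text: str, transcript: str) -> List[Tuple[str, str, str]]:
    """Return (kind, expected words, spoken words) for each mismatch."""
    expected = normalize(expected_text)
    spoken = normalize(transcript)
    matcher = difflib.SequenceMatcher(a=expected, b=spoken, autojunk=False)
    issues = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            issues.append((tag, " ".join(expected[i1:i2]), " ".join(spoken[j1:j2])))
    return issues


def auto_approve(expected_text: str, transcript: str, tolerance: int = 0) -> bool:
    """Approve when the number of mismatches is within the given tolerance."""
    return len(find_misreadings(expected_text, transcript)) <= tolerance
```

Each returned mismatch could drive the warning presented to the reader, while an empty result corresponds to the automatically generated approval.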
[0072] It is to be appreciated by a person of skill in the art having the benefit of this description that the audio input need not be a perfect studio-quality recording. In particular, the device 50 is generally not configured for a studio, but instead configured to be commonly used by a reader in most locations. For example, a reader can generate a part of an audiobook while at a coffee shop or while waiting for or riding on a bus. Accordingly, the approval of the audio input can forward the audio input for post processing.
[0073] In the event that no input corresponding to an approval is received or generated using one of the automated error detection routines, the processor can provide an opportunity to re-record the portion of the ebook output or edit the recorded audio input using various sound editing tools. The manner by which a determination is made that no input has been received is not particularly limited. For example, in some embodiments, a pre-determined time-out period can be used such that after the period has expired, the processor 52a can proceed as though no input corresponding to an approval is received. In other embodiments, the processor 52a can be directed to wait indefinitely for a response.
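The time-out behaviour can be illustrated with a short sketch in which approval inputs arrive on a queue. The queue, the `"approve"` string, and the returned action names are assumptions made for this example; passing `None` as the time-out corresponds to waiting indefinitely.

```python
import queue
from typing import Optional


def wait_for_approval(inputs: "queue.Queue[str]", timeout_s: Optional[float]) -> bool:
    """Return True if an approval arrives; False on any other input or on time-out.

    A timeout_s of None blocks until some input is received.
    """
    try:
        response = inputs.get(timeout=timeout_s)
    except queue.Empty:
        # The pre-determined period expired: proceed as though no approval was received.
        return False
    return response == "approve"


def next_action(inputs: "queue.Queue[str]", timeout_s: Optional[float] = 30.0) -> str:
    """Choose the follow-up step after a recording has been captured."""
    if wait_for_approval(inputs, timeout_s):
        return "post_process"
    return "offer_rerecord_or_edit"
```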
[0074] Block 650 comprises post processing the audio input received at block 630. The manner by which the audio input is processed during the execution of block 650 is not particularly limited and can involve various routines. For example, the volume of the audio input can be leveled to account for varying distances from the microphone 58a by the reader during use of the device 50a. Another example of post processing can include running a noise reduction routine to reduce the amount of ambient noise or removing noise spikes, such as from a passing truck in the background.
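One simple way to level volume, offered only as a sketch, is to scale each short block of samples toward a common root-mean-square (RMS) level. The sample format (floats between -1.0 and 1.0), the window size, the target level, and the gain cap are all assumptions chosen for illustration; the description does not specify a leveling technique.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_volume(
    samples: Sequence[float],
    window: int = 1024,
    target_rms: float = 0.1,
    max_gain: float = 10.0,
    silence_floor: float = 1e-4,
) -> List[float]:
    """Scale each window toward target_rms, capping the gain and clipping to [-1, 1]."""
    out: List[float] = []
    for start in range(0, len(samples), window):
        block = samples[start:start + window]
        level = rms(block)
        if level < silence_floor:
            gain = 1.0  # leave near-silence alone rather than amplifying noise
        else:
            gain = min(target_rms / level, max_gain)
        out.extend(max(-1.0, min(1.0, s * gain)) for s in block)
    return out
```

Because the gain changes abruptly at window boundaries in this sketch, a practical routine would likely smooth the gain between windows; noise reduction and spike removal would be separate steps.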
[0075] Again, it is to be re-emphasized that the method 600 described above is a non-limiting representation. For example, variants of the embodiments discussed above can be used. As an example of a variant, any combination of the optional features mentioned above can be included in the device. For example, the device can include a manual approval from a reader at block 640 in addition to the automatic error detection routines discussed above.
[0076] As another example of a variation, multiple versions of audio recordings can be stored in a memory storage unit. For example, if an error is detected or a user does not approve a recording at block 640, the same portion of the ebook output can be re-recorded multiple times to generate multiple versions of the audio recording. The user can then be presented with the multiple versions of the audio recording from which the user can select the best one or edit one or more of the multiple versions to produce a final version for the associated portion of the ebook output. It is to be appreciated that editing can include applying a filter to the audio input or combining two or more of multiple versions. In other embodiments, a verification engine can be used to automatically select or produce a final version from the multiple recorded versions corresponding to an ebook output.
[0077] Referring to figure 6, an apparatus for generating an audiobook is generally shown at 50b. Like components of the device 50b bear like reference to their counterparts in the apparatus 50, except followed by the suffix "b". The apparatus 50b includes a processor 52b, a network interface 54b, a display screen 56b, a microphone 58b, an input device 60b, and a memory storage device 62b. In the present embodiment, the apparatus 50b is connected to a network 70, such as the Internet, for receiving and transmitting content as will be discussed in greater detail below. In addition, a server 80b is connected to the network 70 and in communication with the apparatus 50b.
[0078] It is to be appreciated by a person of skill in the art with the benefit of this description that the apparatus 50b can be used to carry out the methods 500 and 600. Furthermore, the apparatus 50b can be used to carry out variations of the methods 500 and 600 by performing one or more of the blocks on the server 80b instead of the apparatus 50b. Accordingly, this frees resources in the apparatus 50b to perform other tasks and/or allows the hardware components of the apparatus 50b to be more economical. For example, blocks 550 and 650 generally require more computational resources. In some embodiments, the audio recordings can be uploaded to the server 80b for post processing. The manner by which the server 80b can carry out post-processing is not particularly limited and can include additional methods using more complex routines made possible with the additional computer resources generally available in the server 80b over the apparatus 50b. It is also to be appreciated that by moving the post-processing step away from the apparatus to another location, other possibilities for post-processing are available, such as allowing for manual post-processing or combining manual and automatic post-processing at the server 80b.
[0079] In other embodiments, the ebook data can also be stored on the server 80b and text segments streamed to the apparatus 50b as required for reducing the size requirement of the memory storage unit 62b.
[0080] In another embodiment, it is to be appreciated that the parsing engine 64 can be operated by the server 80b. In such an embodiment, all or a portion of the ebook data is transmitted to the server 80b, which operates the parsing engine 64 to generate text segments that are returned to the apparatus 50b.
[0081] Various advantages will now be apparent. Of note is the ability to generate an audiobook on a personal apparatus or device without the need to use a conventional recording studio. By generating the audiobook independently, a customizable audiobook with a specific user's voice can be generated more efficiently and economically.
[0082] As another example of a variation, it is to be appreciated by a person of skill in the art that the system shown in figure 6 can be modified to combine recordings from multiple users, where each user can record one or more text segments. In particular, the apparatus 50b can be duplicated such that more than one apparatus can provide the server 80b with audio recordings that can subsequently be combined to generate an audiobook. For example, this feature can be particularly useful for generating audiobooks with multiple voices.
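As a sketch of how a server might combine recordings received from more than one apparatus, the following orders uploaded recordings by the index of the text segment each one corresponds to. The `Recording` fields, the rule that a later upload replaces an earlier one for the same segment, and the error raised for unrecorded segments are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Recording:
    segment_index: int  # position of the text segment within the ebook
    reader_id: str      # hypothetical identifier for the contributing user
    audio: bytes        # encoded audio for that segment


def assemble_audiobook(recordings: Iterable[Recording], total_segments: int) -> List[Recording]:
    """Order recordings by text segment, taking the latest upload for each segment."""
    by_segment: Dict[int, Recording] = {}
    for rec in recordings:
        by_segment[rec.segment_index] = rec  # a later upload replaces an earlier one
    missing = [i for i in range(total_segments) if i not in by_segment]
    if missing:
        raise ValueError(f"segments not yet recorded: {missing}")
    return [by_segment[i] for i in range(total_segments)]
```

Indexing each recording to its text segment is what lets segments read by different users be interleaved in the order of the ebook.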
[0083] While specific embodiments have been described and illustrated, such embodiments should be considered illustrative and should not serve to limit the accompanying claims.
Claims
1. An apparatus for generating an audiobook, the apparatus comprising: a network interface configured to receive ebook data and to transmit the audiobook to a server; a memory storage unit in communication with the network interface, the memory storage unit configured to store the ebook data and a recording database, wherein the recording database is configured to store a received audio recording; a display screen in communication with the memory storage unit, the display screen configured to display a text segment from the ebook data; a microphone in communication with the memory storage unit, the microphone configured to receive audio input for generating the received audio recording based on the text segment; an input device configured to receive a stop input upon completion of the received audio recording; and a processor in communication with the memory storage unit, the processor configured to post-process the received audio recording, the processor further configured to generate the audiobook using the received audio recording.
2. The apparatus of claim 1, wherein the recording database is configured to store a plurality of audio recordings including the received audio recording.
3. The apparatus of claim 2, wherein the plurality of audio recordings is indexed to the ebook data.
4. The apparatus of any one of claims 1 to 3, further comprising a parsing engine, the parsing engine configured to parse the ebook data into a plurality of text segments, the plurality of text segments including the text segment.
5. The apparatus of claim 4, wherein each text segment of the plurality of text segments has a minimum number of words.
6. The apparatus of claim 5, wherein the minimum number of words is sixty.
7. The apparatus of any one of claims 4 to 6, wherein the parsing engine parses the ebook data at a natural break.
8. The apparatus of claim 7, wherein the natural break is after a sentence.
9. The apparatus of claim 7, wherein the natural break is after a paragraph.
10. The apparatus of claim 2, wherein the processor is further configured to assess the received audio recording for quality.
11. The apparatus of claim 10, wherein the processor is further configured to analyze the plurality of audio recordings to determine an upper dynamic range limit and a lower dynamic range limit.
12. The apparatus of claim 11, wherein the processor is further configured to analyze the audio input in the received audio recording to determine whether a dynamic range of the audio input is between the upper dynamic range limit and the lower dynamic range limit.
13. The apparatus of claim 12, wherein the display generates a warning indicator when dynamic range of the audio input is outside of the upper dynamic range limit and the lower dynamic range limit.
14. A method of generating an audiobook, the method comprising: receiving ebook data via a network interface for storage in a memory storage unit, the memory storage unit in communication with the network interface; displaying a text segment from the ebook data on a display screen; receiving audio input via a microphone, the audio input for generating a received audio recording based on the text segment; receiving a stop input upon completion of the received audio recording for storage in a recording database the memory storage unit; and post-processing the received audio recording to generate the audiobook.
15. The method of claim 14, further comprising storing a plurality of audio recordings in the recording database, the plurality of audio recordings including the received audio recording.
16. The method of claim 15, further comprising indexing the plurality of audio recordings to the ebook data.
17. The method of any one of claims 14 to 16, further comprising parsing the ebook data into a plurality of text segments using a parsing engine.
18. The method of claim 17, wherein parsing comprises parsing the ebook data into a plurality of text segments, each text segment of the plurality of text segments having a minimum number of words.
19. The method of claim 18, wherein the minimum number of words is sixty.
20. The method of any one of claims 17 to 19, wherein parsing comprises parsing the ebook data at a natural break.
21. The method of claim 20, wherein the natural break is after a sentence.
22. The method of claim 20, wherein the natural break is after a paragraph.
23. The method of claim 15, further comprising assessing the received audio recording for quality.
24. The method of claim 23, further comprising analyzing the plurality of audio recordings to determine an upper dynamic range limit and a lower dynamic range limit.
25. The method of claim 24, further comprising analyzing the audio input to determine whether a dynamic range of the audio input in the received audio recording is between the upper dynamic range limit and the lower dynamic range limit.
26. The method of claim 25, further comprising generating a warning indicator when dynamic range of the audio input is outside of the upper dynamic range limit and the lower dynamic range limit.
27. A server for generating an audiobook, the server comprising: a network interface configured to send and receive data via a network; a memory storage unit in communication with the network interface, the memory storage unit configured to store ebook data and a recording database, wherein the recording database is configured to store a plurality of audio recordings received via the network; and a processor in communication with the memory storage unit, the processor configured to send a plurality of text segments to a device for displaying on a display screen, receive the plurality of audio recordings based on the plurality of text segments for storing in the recording database, and post-process the plurality of audio recordings to generate the audiobook.
28. The server of claim 27, wherein the plurality of audio recordings is indexed to the ebook data.
29. The server of claim 27 or 28, further comprising a parsing engine, the parsing engine configured to parse the ebook data into the plurality of text segments.
30. The server of claim 29, wherein each text segment of the plurality of text segments has a minimum number of words.
31. The server of claim 30, wherein the minimum number of words is sixty.
32. The server of any one of claims 29 to 31, wherein the parsing engine parses the ebook data at a natural break.
33. The server of claim 32, wherein the natural break is after a sentence.
34. The server of claim 32, wherein the natural break is after a paragraph.
35. The server of any one of claims 27 to 34, wherein the processor is further configured to assess the plurality of audio recordings for quality.
36. The server of claim 35, wherein the processor is further configured to analyze the plurality of audio recordings to determine an upper dynamic range limit and a lower dynamic range limit.
37. The server of claim 36, wherein the processor is further configured to analyze audio input in each audio recording of the plurality of audio recordings to determine whether a dynamic range of the audio input is between the upper dynamic range limit and the lower dynamic range limit.
38. The server of claim 37, wherein the processor generates a warning message when dynamic range of the audio input is outside of the upper dynamic range limit and the lower dynamic range limit.
39. A non-transitory computer readable medium configured with codes for generating an audiobook, the codes for directing a processor to: receive ebook data via a network interface for storage in a memory storage unit, the memory storage unit in communication with the network interface; display a text segment from the ebook data on a display screen; receive audio input via a microphone, the audio input for generating a received audio recording based on the text segment; receive a stop input upon completion of the received audio recording for storage in a recording database the memory storage unit; and post-process the received audio recording to generate the audiobook.
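For illustration of the dynamic range assessment recited in the claims, the sketch below derives lower and upper limits from a plurality of stored recordings and flags a new recording that falls outside them. The claims do not define how dynamic range or the limits are computed, so everything here is an assumption: dynamic range is taken as the decibel ratio between the loudest and quietest windowed RMS levels, and the limits as the mean plus or minus `k` standard deviations across the stored recordings.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def window_rms(samples: Sequence[float], window: int) -> List[float]:
    """RMS level of each consecutive window of samples."""
    levels = []
    for start in range(0, len(samples), window):
        block = samples[start:start + window]
        levels.append(math.sqrt(sum(s * s for s in block) / len(block)))
    return levels


def dynamic_range_db(samples: Sequence[float], window: int = 1024, floor: float = 1e-6) -> float:
    """Assumed definition: dB ratio between the loudest and quietest window."""
    if not samples:
        raise ValueError("recording is empty")
    levels = [max(level, floor) for level in window_rms(samples, window)]
    return 20.0 * math.log10(max(levels) / min(levels))


def range_limits(
    recordings: Sequence[Sequence[float]], window: int = 1024, k: float = 2.0
) -> Tuple[float, float]:
    """Assumed rule: limits are the mean +/- k standard deviations of stored ranges."""
    ranges = [dynamic_range_db(r, window) for r in recordings]
    mean = statistics.mean(ranges)
    spread = statistics.pstdev(ranges)
    return mean - k * spread, mean + k * spread


def needs_warning(samples: Sequence[float], limits: Tuple[float, float], window: int = 1024) -> bool:
    """True when the recording's dynamic range falls outside the limits."""
    lower, upper = limits
    return not (lower <= dynamic_range_db(samples, window) <= upper)
```

A result of `True` from `needs_warning` would correspond to the warning indicator or warning message of the claims.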
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201462097997P | 2014-12-30 | 2014-12-30 | |
| US62/097,997 | 2014-12-30 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2016108113A1 (en) | 2016-07-07 |
Family
ID=56284361
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB2015/059598 Ceased WO2016108113A1 (en) | 2014-12-30 | 2015-12-14 | Apparatus, method, and server for generating audiobooks |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2016108113A1 (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6279017B1 (en) * | 1996-08-07 | 2001-08-21 | Randall C. Walker | Method and apparatus for displaying text based upon attributes found within the text |
| US20020184189A1 (en) * | 2001-05-30 | 2002-12-05 | George M. Hay | System and method for the delivery of electronic books |
| US20070088712A1 (en) * | 2005-10-14 | 2007-04-19 | Watson David A | Apparatus and method for the manufacture of audio books |
| US20100290638A1 (en) * | 2009-04-14 | 2010-11-18 | Heineman Fred W | Digital audio communication and control in a live performance venue |
| US8548618B1 (en) * | 2010-09-13 | 2013-10-01 | Audible, Inc. | Systems and methods for creating narration audio |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113065318A (en) * | 2021-03-23 | 2021-07-02 | 上海匠欣信息科技有限公司 | Electronic point-reading material manufacturing method and device, electronic equipment and storage medium |
| CN113065318B (en) * | 2021-03-23 | 2024-03-22 | 上海匠欣信息科技有限公司 | Electronic point reading material manufacturing method and device, electronic equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15875321; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 15875321; Country of ref document: EP; Kind code of ref document: A1 |