
WO2024211927A2 - Systems and methods for content and appearance preserving instant translation - Google Patents


Info

Publication number
WO2024211927A2
Authority
WO
WIPO (PCT)
Prior art keywords
input
text
output
objects
text objects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/039660
Other languages
French (fr)
Other versions
WO2024211927A3 (en)
Inventor
Hong Heather Yu
Xiyun Song
Pingfan Wu
Masood Mortazavi
Zongfang LIN
Yubin Zhou
Zhiqiang Lao
Liang Peng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FutureWei Technologies Inc
Original Assignee
FutureWei Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FutureWei Technologies Inc filed Critical FutureWei Technologies Inc
Priority to PCT/US2024/039660
Publication of WO2024211927A2
Publication of WO2024211927A3
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography

Definitions

  • This disclosure generally relates to improving the quality of computer-generated instant text translation.
  • a group of people want to present or exchange information with one another, such as during live or online meetings, or through document sharing such as via email, computer download, or courier service.
  • all group members are fluent in a common language, and exchange text information in the common language so that all group members may easily understand the communicated text.
  • other groups include people who are not all fluent in a common language. For example, many multi-national companies employ people in a variety of countries throughout the world. Thus, some groups may need to exchange text information with one another even though all group members may not share a common language or may not all be equally comfortable communicating in a common language.
  • instant translation means text translation that is performed very rapidly in real time. For example, text in a first language (e.g., Chinese) can be copied and pasted in a mobile phone translation app, and the app almost instantly displays text translated to a second language (e.g., English). Likewise, some mobile phone apps can very quickly translate text captured by the mobile phone’s camera.
  • One aspect includes a computer implemented method of translating text at a processing device.
  • the computer implemented method includes receiving, at the processing device, input content; identifying, at the processing device, input text objects in the input content; translating the identified input text objects from an input language to an output language.
  • the computer implemented method also includes determining, at the processing device, an appearance characteristic of the identified input text objects; generating, at the processing device, output text objects that comprise the translated identified input text objects and that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input text objects; and generating, at the processing device, output content by replacing the identified input text objects with the generated output text objects.
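  • As an illustration only, the claimed flow might be sketched in Python as follows. The TextObject fields, the translate_text placeholder, and the tuple-based positions are hypothetical and stand in for whatever translation engine and style detector a given embodiment uses.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TextObject:
    text: str
    style: dict      # e.g., {"font": "Arial", "size": 14, "color": "#000000"}
    position: tuple  # (x, y) location of the text object within the content

def translate_text(text: str, src: str, dst: str) -> str:
    # Placeholder for whatever machine-translation engine an embodiment uses.
    return f"[{dst}] {text}"

def translate_content(input_objects: List[TextObject], src: str, dst: str) -> List[TextObject]:
    """Translate each identified input text object while preserving its appearance."""
    output_objects = []
    for obj in input_objects:
        translated = translate_text(obj.text, src, dst)
        # The generated output text object reuses the determined appearance
        # characteristics (style, position) of the input text object it replaces.
        output_objects.append(TextObject(translated, obj.style, obj.position))
    return output_objects

if __name__ == "__main__":
    page = [TextObject("Spring flower arrangement", {"font": "Arial", "size": 24, "color": "#333333"}, (120, 40))]
    print(translate_content(page, "en", "fr"))
```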
  • Implementations may include the foregoing implementation, wherein the identified input text objects comprise one or more of symbols, characters, numbers.
  • Implementations may include any of the foregoing implementations, wherein the identified input text objects comprise text embedded within image objects. Implementations may include any of the foregoing implementations, wherein an appearance characteristic of the identified input text objects includes one or more of a font style, color, size, shape, typeface, underlining, dimension, relative location, and orientation. Implementations may include any of the foregoing implementations, wherein replacing comprises removing the identified input text objects and inserting the generated output text objects.
  • Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes identifying, at the processing device, input image objects in the input content; determining, at the processing device, an appearance characteristic of the identified input image objects; generating output image objects, at the processing device, that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input image objects; and generating output content further comprises replacing the identified input image objects with the generated output image objects.
  • Implementations may include any of the foregoing implementations, wherein an appearance characteristic of the identified input image objects include non-text image details of the identified input image objects.
  • Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes identifying, at the processing device, an input background in the input content; determining, at the processing device, an appearance characteristic of the identified input background; generating, at the processing device, an output background that has an appearance characteristic that substantially matches the determined appearance characteristic of the identified input background; and generating output content further comprises replacing the identified input background with the generated output background.
  • Implementations may include any of the foregoing implementations, wherein an appearance characteristic of the identified input background includes one or more of a color, a shading, and a simulated texture. Implementations may include any of the foregoing implementations, wherein the processing device comprises any of a server and a client device.
  • Implementations may include any of the foregoing implementations, wherein a first subset of the method steps is performed by a server and a second subset of the method steps is performed by a client device. Implementations may include any of the foregoing implementations, wherein the method steps are performed by a sender client device, which is configured to provide the output content to one or more receiver client devices. Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes estimating a computing capability of a user processing device, and selectively performing the determining and generating steps based on the estimated computing capability.
  • Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes determining from the estimated computing capability that the user processing device comprises a first power type device, and generating, at the processing device, output text objects that comprise the translated identified input text objects, but without preserving appearance characteristics of the identified input text objects; and determining from the estimated computing capability that the user processing device comprises a second power type device comprising greater computing capability than a first power type device, and generating, at the processing device, output text objects that comprise the translated identified input text objects, and preserving appearance characteristics of the identified input text objects.
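  • A minimal sketch of this two-mode decision is shown below; the CPU-core threshold and the capability probe are illustrative assumptions, not values specified by the disclosure.

```python
import os

LOW_POWER_CORE_THRESHOLD = 4  # illustrative cutoff; the disclosure does not specify one

def estimate_computing_capability() -> int:
    # Crude proxy for device computing power: the number of logical CPU cores.
    return os.cpu_count() or 1

def choose_translation_mode() -> str:
    """Return 'overlay' for a first (low) power type device, 'style-preserving' otherwise."""
    if estimate_computing_capability() <= LOW_POWER_CORE_THRESHOLD:
        # First power type: translate text but do not preserve appearance characteristics.
        return "overlay"
    # Second power type: full pipeline that preserves appearance characteristics.
    return "style-preserving"

print(choose_translation_mode())
```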
  • Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes adjusting, at the processing device, a color and a contrast of the output text objects based on a user’s ambient environment.
  • Another aspect includes a non-transitory computer-readable medium storing computer instructions for translating text at a processing device.
  • the instructions when executed by one or more processors, cause the one or more processors to perform the steps of: receiving, at the processing device, input content; identifying, at the processing device, input text objects in the input content; translating the identified input text objects from an input language to an output language; determining, at the processing device, an appearance characteristic of the identified input text objects; generating, at the processing device, output text objects that comprise the translated identified input text objects and that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input text objects; and generating, at the processing device, output content by replacing the identified input text objects with the generated output text objects.
  • Implementations may include the foregoing implementation, wherein the identified input text objects comprise one or more of symbols, characters, numbers. Implementations may include any of the foregoing implementations, wherein the identified input text objects comprise text embedded within image objects. Implementations may include any of the foregoing implementations, wherein an appearance characteristic of the identified input text objects includes one or more of a font style, color, size, shape, typeface, underlining, dimension, relative location, and orientation. Implementations may include any of the foregoing implementations, wherein replacing comprises removing the identified input text objects and inserting the generated output text objects.
  • Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes identifying, at the processing device, input image objects in the input content; determining, at the processing device, an appearance characteristic of the identified input image objects; generating output image objects, at the processing device, that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input image objects; and generating output content further comprises replacing the identified input image objects with the generated output image objects.
  • Implementations may include any of the foregoing implementations, wherein an appearance characteristic of the identified input image objects include non-text image details of the identified input image objects.
  • Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes identifying, at the processing device, an input background in the input content; determining, at the processing device, an appearance characteristic of the identified input background; generating, at the processing device, an output background that has an appearance characteristic that substantially matches the determined appearance characteristic of the identified input background; and generating output content further comprises replacing the identified input background with the generated output background.
  • Implementations may include any of the foregoing implementations, wherein an appearance characteristic of the identified input background includes one or more of a color, a shading, and a simulated texture.
  • Implementations may include any of the foregoing implementations, wherein the processing device comprises any of a server and a client device. Implementations may include any of the foregoing implementations, wherein a first subset of the method steps is performed by a server and a second subset of the method steps is performed by a client device.
  • Another aspect includes a user equipment device that includes a non-transitory memory storage comprising instructions, and one or more processors in communication with the memory storage.
  • the one or more processors execute the instructions to cause the device to receive, at the user equipment device, input content; identify, at the user equipment device, input text objects in the input content; translate selective ones of the identified input text objects from an input language to an output language; determine, at the user equipment device, an appearance characteristic of the selective ones of identified input text objects; generate, at the user equipment device, output text objects that comprise the translated selective ones of the identified input text objects and that have an appearance characteristic that substantially matches the determined appearance characteristic of the selective ones of the identified input text objects; and generate, at the user equipment device, output content by replacing the identified selective ones of the input text objects with the generated output text objects.
  • Implementations may include the foregoing implementation, wherein the one or more processors execute the instructions to further cause the device to provide a user interface for specifying the selective ones of the identified input text objects. Implementations may include any of the foregoing implementations, wherein the one or more processors execute the instructions to further cause the device to provide the user interface for specifying the output language. Implementations may include any of the foregoing implementations, wherein the one or more processors execute the instructions to further cause the device to determine from one or more of a user profile and system settings the selective ones of the identified input text objects. Implementations may include any of the foregoing implementations, wherein the one or more processors execute the instructions to further cause the device to receive from the user interface the output language.
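  • A sketch of such selective translation of user-specified text objects is shown below; the object identifiers and the translate callable are hypothetical.

```python
def translate_selected(text_objects, selected_ids, translate):
    """Translate only the text objects the user selected; leave the others unchanged."""
    output = []
    for obj_id, text in text_objects:
        output.append((obj_id, translate(text) if obj_id in selected_ids else text))
    return output

# Example: translate the body text but leave the embedded chart labels untouched.
objects = [("108a1", "Spring flower arrangement"), ("110a3", "Red, Yellow, Green")]
print(translate_selected(objects, {"108a1"}, lambda t: f"(fr) {t}"))
```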
  • FIG. 1A is a diagram of example input content.
  • FIG. 1B is a diagram of example output content generated from the input content of FIG. 1A.
  • FIG. 1C is a diagram of another example output content generated from the input content of FIG. 1A.
  • FIG. 1D is a diagram of yet another example output content generated from the input content of FIG. 1A.
  • FIG. 1E is a diagram of still another example output content generated from the input content of FIG. 1A.
  • FIG. 2A is a simplified block diagram of an example instant translation system.
  • FIG. 2D is a more detailed block diagram of an embodiment of a translation block of the example instant translation system of FIGS. 2A-2C.
  • FIG. 3A is a simplified block diagram of an example architecture of an instant translation system.
  • FIG. 3B is a simplified block diagram of another example architecture of an instant translation system.
  • FIG. 3C is a simplified block diagram of still another example architecture of an instant translation system.
  • FIG. 3D is a simplified block diagram of yet another example architecture of an instant translation system.
  • FIG. 3E is a simplified block diagram of still another example architecture of an instant translation system.
  • FIG. 4 is a flowchart of an example method of instantly translating text.
  • Certain embodiments of the present disclosure can be used to instantly translate input content that includes input text in an input language (e.g., English) to output content that includes output text in an output language (e.g., French) and that has appearance characteristics that substantially match appearance characteristics of the input text. Certain embodiments of the present disclosure also can be used to provide output content that has non-text features that have appearance characteristics that substantially match appearance characteristics of non-text features in the input content. As used herein, “instantly translate” means to translate text very rapidly in real time, even if not literally instantaneous.
  • FIG. 1A depicts example input content 100a that may be provided by a presenter and received by a recipient.
  • Example input content 100a may be a document (e.g., a Word document), a presentation slide (e.g., a PowerPoint slide), a web page, a video, or other similar content or a combination of one or more of such content.
  • the presenter may be a speaker doing a presentation at a meeting while displaying input content 100a on a screen, and the recipient may be an audience member (live or online) attending the presentation and seeing input content 100a displayed on the screen.
  • a presenter may send input content 100a (e.g., a PDF document) via email to the recipient, who may then display input content 100a on a computer screen, print input content 100a on a printer, or save input content 100a to computer memory.
  • the presenter may be a computer server providing input content 100a that includes HTML code for rendering by a recipient’s web browser. Persons of ordinary skill in the art will understand that these are nonlimiting examples.
  • content may include text objects, image objects, and background.
  • text objects include text and embedded text.
  • image objects include raster images (e.g., photographs, scans), vector images (e.g., charts and graphs), AI-generated images (“AI images”) (which may be non-raster and non-vector images), video, and other similar images.
  • background includes background features (e.g., color, shading) that are not text objects or image objects.
  • text includes text symbols such as characters, numbers, and other similar symbols, such as in a word processing document, a spreadsheet, a slide presentation program, a text box, or similar.
  • embedded text includes text symbols embedded within image objects.
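  • The decomposition of content into text objects, image objects, and background might be represented with simple data structures such as the following sketch; the field names are illustrative assumptions, not the disclosure's representation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TextObject:
    text: str                 # text symbols: characters, numbers, other similar symbols

@dataclass
class ImageObject:
    kind: str                 # "raster", "vector", "ai", or "video"
    embedded_text: List[TextObject] = field(default_factory=list)

@dataclass
class Background:
    color: str = "#FFFFFF"
    shading: str = "none"     # e.g., "crosshatch"

@dataclass
class Content:
    text_objects: List[TextObject]
    image_objects: List[ImageObject]
    background: Background

page = Content(
    text_objects=[TextObject("Spring flower arrangement")],
    image_objects=[ImageObject(kind="raster", embedded_text=[TextObject("Spring Tulips")])],
    background=Background(color="#FFFFFF", shading="crosshatch"),
)
print(page)
```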
  • input content 100a includes input text objects 102a, input image objects 104a, and input background 106a.
  • input text objects 102a include input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3.
  • input content 100a may include more, fewer, and/or different input text objects 102a.
  • input image objects 104a include input AI image 112a, input photograph 114a, and input chart 116a.
  • input content 100a may include more, fewer, and/or different input image objects 104a.
  • input AI image 112a includes input embedded text 110a1, input photograph 114a includes input embedded text 110a2, and input chart 116a includes input embedded text 110a3.
  • input background 106a includes a background color and shading.
  • the background color may be white or some other color, and the shading includes a crosshatch pattern, some other pattern, or no pattern.
  • input AI image 112a includes various image components, some of which represent input embedded text 110a1.
  • input photograph 114a includes pixels, some of which represent input embedded text 110a2.
  • input chart 116a includes vector paths, some of which represent input embedded text 110a3.
  • an input video also may include pixels, some of which represent input embedded text included in the video.
  • input text objects 102a are in an input language.
  • the input language is English.
  • all input text objects 102a in input content 100a are in the same input language, such as depicted in FIG. 1A.
  • input text objects 102a may be in a variety of different input languages.
  • input text 108a1 may be in a first input language (e.g., English), input text 108a2 may be in a second input language (e.g., Korean), input embedded text 110a1 may be in a third input language (e.g., Russian), input embedded text 110a2 may be in a fourth input language (e.g., Spanish), and input embedded text 110a3 may be in a fifth language (e.g., Hungarian).
  • input text objects 102a are all in a single input language.
  • input content that includes input text objects in an input language is converted to output content that includes output text objects in an output language different from the input language.
  • FIG. 1 B depicts example output content 100b generated from input content 100a of FIG. 1A.
  • output content 100b includes output text objects 102b, output image objects 104b, and output background 106b.
  • output text objects 102b include output text 108b1 and 108b2, and output embedded text 110b1 , 110b2 and 110b3.
  • output content 100b includes the same number of output text objects 102b as input text objects 102a in input content 100a.
  • output image objects 104b include output AI image 112b, output photograph 114b, and output chart 116b.
  • output content 100b includes the same number of output image objects 104b as input image objects 104a in input content 100a.
  • output AI image 112b includes output embedded text 110b1, output photograph 114b includes output embedded text 110b2, and output chart 116b includes output embedded text 110b3.
  • output background 106b includes the same background color and shading as input background 106a of FIG. 1A.
  • output AI image 112b includes various image components, some of which represent output embedded text 110b1.
  • output photograph 114b includes pixels, some of which represent output embedded text 110b2.
  • output chart 116b includes vector paths, some of which represent output embedded text 110b3.
  • output text objects 102b are in an output language.
  • the output language is French.
  • all output text objects 102b in output content 100b are in the same output language, such as depicted in FIG. 1 B.
  • output text objects 102b may be in a variety of different output languages.
  • output text 108b1 may be in a first output language (e.g., Polish), output text 108b2 may be in a second output language (e.g., Italian), output embedded text 110b1 may be in a third output language (e.g., Croatian), output embedded text 110b2 may be in a fourth output language (e.g., Chinese), and output embedded text 110b3 may be in a fifth output language (e.g., Persian).
  • the remaining description will assume that output text objects 102b are all in a single output language.
  • input content 100a that includes input text objects 102a (e.g., input text 108a1 and 108a2 and input embedded text 110a1 , 110a2 and 110a3) in an input language (e.g., English) is converted to output content 100b that includes output text objects 102b (e.g., output text 108b1 and 108b2 and output embedded text 110b1 , 110b2 and 110b3) in an output language (e.g., French) and having appearance characteristics that substantially match appearance characteristics of input text objects 102a.
  • output content 100b includes output image objects 104b and output background 106b that have appearance characteristics that substantially match appearance characteristics of input image objects 104a and input background 106a, respectively.
  • appearance characteristics of input text objects 102a include font style, color, size, shape, typeface (e.g., roman, bold, italics), underlining, dimension (e.g., 2D or 3D), relative location, orientation, and other similar appearance characteristics of text.
  • appearance characteristics of input image objects include non-text image details of input image objects, such as non-text details in input AI image 112a, input photograph 114a, and input chart 116a.
  • Non-text details in input AI image 112a include, for example, the various flowers, greenery, vase, table surface, wood grain, and other such elements and the placement of such elements in input AI image 112a.
  • Non-text details in input photograph 114a include, for example, the flowers, vase, water, table surface, lighting effects, wall shading, and other such elements and the placement of such elements in input photograph 114a.
  • Non-text details in input chart 116a include, for example, the background color, vertical axes, horizontal gridlines, bar chart elements, and other such elements and the placement of such elements in input chart 116a.
  • appearance characteristics of input background includes color, shading, simulated texture, and other similar appearance characteristics of input background 106a.
  • output text objects 102b (output text 108b1 and 108b2 and output embedded text 110b1 , 110b2 and 110b3) in output content 100b are in output language French and have appearance characteristics that substantially match appearance characteristics of input text objects 102a (input text 108a1 and 108a2 and input embedded text 110a1 , 110a2 and 110a3, respectively) in input content 100a of FIG. 1A.
  • output text 108b2 in FIG. 1 B has a color, a font, a font size, a 3D effect and a relative location that substantially matches a color, a font, a font size, a 3D effect and a relative location of input text 108a2 in input content 100a.
  • output embedded text 110b1 in output content 100b has a script font, a font size, a color and a relative location that substantially matches a script font, a font size, a color and a relative location of input embedded text 110a1 in input content 100a.
  • output image objects 104b and output background 106b substantially preserve a style of input image objects 104a and input background 106a, respectively.
  • appearance characteristics of non-text image details in output AI image 112b, output photograph 114b, and output chart 116b substantially match appearance characteristics of non-text portions of input image objects, such as non-text image details in input AI image 112a, input photograph 114a, and input chart 116a.
  • output AI image 112b in output content 100b preserves image details of input AI image 112a (e.g., the wood grain on the table surface) of input content 100a of FIG. 1A.
  • output photograph 114b in output content 100b preserves image details of input photograph 114a (e.g., the tulips lying on the table surface) of input content 100a of FIG. 1A.
  • output background 106b preserves image details of input background 106a (e.g., color and shading) in input content 100a of FIG. 1A.
  • input content 100a that includes input text objects 102a may be selectively converted to output content 100b that includes output text objects (e.g., output text 108b1 and 108b2, and output embedded text 110b1 , 110b2 and 110b3) in an output language (e.g., French) and having a style that substantially matches a style of input text objects 102a.
  • output content 100b includes output image objects 104b and output background 106b that substantially preserve a style of input image objects 104a and input background 106a, respectively.
  • such selectivity may convert some but not all input text objects 102a (e.g., input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3).
  • FIG. 1C depicts example output content 100c generated from input content 100a of FIG. 1A.
  • input text 108a1 and 108a2 is converted from English to French, but input embedded text 110a1 , 110a2 and 110a3 is not converted.
  • output text 108b1 and 108b2 in output content 100c is displayed in output language French and has a font, a font size, a color, a shape, a typeface, a dimension, and a relative location and orientation that substantially matches a font, a font size, a color, a shape, a typeface, a dimension, and a relative location and orientation of input text 108a1 and 108a2 in input content 100a of FIG. 1A.
  • Output content 100c also includes output embedded text 110c1 , 110c2 and 110c3 that is displayed in English, the same as that of input embedded text 110a1 , 110a2 and 110a3 in input content 100a of FIG. 1A.
  • FIG. 1 D depicts example output content 100d generated from input content 100a of FIG. 1A.
  • embedded text is converted from English to French, but text is not converted.
  • output embedded text 110d1, 110d2 and 110d3 in output content 100d is displayed in output language French and has a font, a font size, a color, a shape, a typeface, a dimension, and a relative location and orientation that substantially matches a font, a font size, a color, a shape, a typeface, a dimension, and a relative location and orientation of input embedded text 110a1, 110a2 and 110a3 of FIG. 1A.
  • Output content 100d also includes output text 108d1 and 108d2 that is displayed in English, the same as that of input text 108a1 and 108a2 of FIG. 1A.
  • output text objects 102b (e.g., output text 108b1 and 108b2, and output embedded text 110b1, 110b2 and 110b3) in output content 100b depicted in FIG. 1B replace input text objects 102a (e.g., input text 108a1 and 108a2, and input embedded text 110a1, 110a2 and 110a3, respectively) in input content 100a of FIG. 1A.
  • output text 108b1 and 108b2 in output content 100c depicted in FIG. 1C replaces input text 108a1 and 108a2, respectively, in input content 100a of FIG. 1A.
  • output embedded text 110d1, 110d2 and 110d3 in output content 100d depicted in FIG. 1D replaces input embedded text 110a1, 110a2 and 110a3, respectively, in input content 100a of FIG. 1A.
  • A feature of output text objects 102b replacing input text objects 102a is that the translated output text objects 102b are not superimposed or “taped over” the corresponding input text objects 102a. Instead, input text objects 102a (e.g., input text 108a1 and 108a2, and input embedded text 110a1, 110a2 and 110a3) are effectively “erased” by inpainting input image objects 104a and input background 106a in input content 100a.
  • output text objects 102b (e.g., output text 108b1 and 108b2, and output embedded text 110b1, 110b2 and 110b3) are then overlaid on the inpainted input image objects 104a and input background 106a to form output content 100b.
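  • A minimal sketch of this erase-and-reinsert approach, using OpenCV inpainting as a stand-in for the disclosed inpainting block, is shown below; the bounding box, font, and file names are illustrative assumptions.

```python
import cv2
import numpy as np

def replace_embedded_text(image, text_box, translated_text):
    """Erase the original embedded text by inpainting, then draw the translated text."""
    x, y, w, h = text_box
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    mask[y:y + h, x:x + w] = 255                                # pixels occupied by the input text
    restored = cv2.inpaint(image, mask, 3, cv2.INPAINT_TELEA)   # fill in the missing background
    cv2.putText(restored, translated_text, (x, y + h),          # overlay the output text object
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (40, 40, 40), 2, cv2.LINE_AA)
    return restored

if __name__ == "__main__":
    photo = np.full((240, 640, 3), 255, dtype=np.uint8)
    cv2.putText(photo, "Spring Tulips", (50, 100), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 0), 2)
    out = replace_embedded_text(photo, (40, 60, 320, 60), "Tulipes de printemps")
    cv2.imwrite("output.png", out)
```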
  • FIG. 2A is a simplified block diagram of an example translation system 200a for instantly translating input content that includes input text in an input language (e.g., English) to output content that includes output text in an output language (e.g., French) and that has appearance characteristics that substantially match appearance characteristics of the input text, and has non-text features that have appearance characteristics that substantially match appearance characteristics of non-text features in the input content.
  • input content 100a is provided as input to a language and text detection block 202 that is configured to identify input text objects in the input content (referred to herein as “identified input text objects”), and also identify the input language (referred to herein as “identified input language”) of the identified input text objects.
  • Using input content 100a of FIG. 1A as an example, language and text detection block 202 identifies input text objects 102a (input text 108a1 and 108a2, and input embedded text 110a1, 110a2 and 110a3), and identifies the input language (English).
  • language and text detection block 202 may be implemented in hardware, software, or a combination of hardware and software. In embodiments, language and text detection block 202 may be implemented using artificial intelligence machine learning systems that are trained to identify input text objects 102a in input content 100a, and also identify the input language of input text objects 102a.
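  • One way to approximate such a language and text detection block with off-the-shelf tools (pytesseract for OCR, langdetect for language identification) is sketched below; these libraries are illustrative substitutes for the trained models the disclosure describes.

```python
import pytesseract
from langdetect import detect
from PIL import Image

def detect_text_and_language(image_path: str):
    """Return recognized words with bounding boxes and a guess at the input language."""
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    words = [(data["text"][i],
              (data["left"][i], data["top"][i], data["width"][i], data["height"][i]))
             for i in range(len(data["text"])) if data["text"][i].strip()]
    full_text = " ".join(word for word, _ in words)
    language = detect(full_text) if full_text else None   # e.g., "en"
    return words, language
```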
  • the identified input text objects 102a and identified input language is provided to a format detection block 204 that is configured to identify formats of the identified input text objects 102a.
  • Example formats include two-dimensional text, three-dimensional text, embedded text in images and video, and other similar formats.
  • format detection block 204 may be implemented in hardware, software, or a combination of hardware and software.
  • format detection block 204 may be implemented using artificial intelligence machine learning systems that are trained to detect input file formats.
  • the identified input text objects 102a, identified input language, and identified formats are provided to a translation determination decision block 206 to determine if the identified input text objects 102a in the identified input language should be translated to output text objects 102b in a desired output language.
  • translation determination decision block 206 may be implemented in hardware, software, or a combination of hardware and software.
  • translation determination decision block 206 prompts a user to specify whether the identified input text objects 102a should be translated, and also prompts the user to specify a desired output language. For example, if the identified input language is English and a user specifies a desired output language is French, translation determination decision block 206 may prompt a user whether the identified input text objects 102a should be translated to French.
  • translation determination decision block 206 does not prompt a user to identify input text objects 102a to be translated or a desired output language. Instead, translation determination decision block 206 may automatically make this determination whenever the identified input text objects 102a are in an input language different from a predetermined desired output language.
  • example instant translation system 200a also may receive user profile and preferences information 208 that may specify various user preferences.
  • user profile and preferences information 208 may specify that a user’s preferred languages are French and Chinese.
  • translation determination decision block 206 may receive this preferred language information and may either prompt the user to select which of the preferred languages should be used to translate the identified input text objects 102a, or may automatically select one of the preferred languages specified in user profile and preferences information 208.
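  • A sketch of this decision logic, with the preferred-language list and an optional user prompt treated as inputs, is shown below; the function signature is an illustrative assumption.

```python
def should_translate(detected_language, preferred_languages, prompt_user=None):
    """Decide whether to translate and into which output language.

    preferred_languages comes from user profile and preferences information
    (e.g., ["fr", "zh"]); prompt_user, if supplied, lets the user choose among
    them, otherwise the first preferred language is selected automatically.
    """
    if detected_language in preferred_languages:
        return False, detected_language            # already in a preferred language
    if prompt_user is not None:
        return True, prompt_user(preferred_languages)
    return True, preferred_languages[0]

print(should_translate("en", ["fr", "zh"]))        # (True, 'fr')
```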
  • In instant translation system 200a, if translation determination decision block 206 determines that the identified input text objects 102a should not be translated, instant translation system 200a proceeds along the “N” (no) output path and loops back to language and text detection block 202 to identify any additional input text objects 102a in input content 100a. Alternatively, if translation determination decision block 206 determines that the identified input text objects 102a should be translated from the identified input language to a specified or preferred output language, instant translation system 200a proceeds along the “Y” (yes) output path to translation block 210.
  • translation block 210 includes a modality detection block 212 that is configured to identify various content modalities in input content 100a.
  • Example modalities include input text, input embedded text, input image objects, input background, and other similar modalities.
  • modality detection block 212 determines that input content 100a includes input text 108a1 and 108a2, input embedded text 110a1 , 110a2 and 110a3, input image objects 104a and input background 106a.
  • a modality is the classification of a single independent channel of input/output between a computer and a human. Such channels may differ based on sensory nature (e.g., visual vs. auditory).
  • each input modality relies on specific types of sensors, devices, and recognition algorithms to capture and interpret user inputs.
  • each output modality is a distinct channel through which a computer system conveys information or feedback to a user and represents an independent channel of presenting information, utilizing a particular sensory or cognitive process.
  • modality detection block 212 also is configured to classify any identified input image objects 104a as including specific types of input image objects. Thus, continuing to use input content 100a of FIG. 1A as an example, modality detection block 212 determines that input image objects 104a include input AI image 112a, input photograph 114a, and input chart 116a.
  • Referring again to FIG. 2A, in an embodiment modality detection block 212 also is configured to identify and classify any identified input background. Thus, continuing to use input content 100a of FIG. 1A as an example, modality detection block 212 determines that input background 106a includes both a background color and shading.
  • OCR block 214 is configured to convert identified input embedded text into a machine-readable text format.
  • OCR block 214 converts input embedded text 110a1, 110a2 and 110a3 in input AI image 112a, input photograph 114a, and input chart 116a, respectively, into a machine-readable text format.
  • style detection block 220 is configured to determine appearance characteristics of identified input text objects 102a.
  • appearance characteristics of text include font, color, size, shape, typeface (e.g., roman, bold, italics), underlining, dimension (e.g., 2D or 3D), relative location, orientation, and other similar appearance characteristics of input text objects 102a.
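  • A simplified sketch of style detection for embedded text is shown below; it estimates only size and dominant color from a bounding box, and the darkest-pixel heuristic is an assumption rather than the disclosed method.

```python
import numpy as np

def detect_text_style(image, text_box):
    """Estimate simple appearance characteristics of text from its bounding box:
    approximate size (box height in pixels) and dominant color of the text pixels."""
    x, y, w, h = text_box
    region = image[y:y + h, x:x + w]
    gray = region.mean(axis=2)
    # Treat the darkest pixels in the region as text pixels (a simplifying assumption).
    text_pixels = region[gray < gray.mean()]
    color = tuple(int(c) for c in text_pixels.mean(axis=0)) if len(text_pixels) else None
    return {"size_px": h, "color_rgb": color, "position": (x, y)}

if __name__ == "__main__":
    img = np.full((120, 400, 3), 230, dtype=np.uint8)
    img[40:70, 30:200] = (40, 40, 160)             # stand-in for dark "text" pixels
    print(detect_text_style(img, (20, 30, 220, 60)))
```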
  • input AI image 112a includes various image components, some of which represent input embedded text 110a1 (referred to herein as “text image components”), input photograph 114a includes pixels, some of which represent input embedded text 110a2 (referred to herein as “text image pixels”), and input chart 116a includes vector paths, some of which represent input embedded text 110a3 (referred to herein as “text vector paths”).
  • image components of an AI image that are not text image components are referred to herein as “background image components,” pixels of a photograph or other raster image that are not text pixels are referred to herein as “background pixels,” and vector paths of a vector image that are not text vector paths are referred to herein as “background vector paths.”
  • background and environment analysis block 222 is configured to distinguish background image components, background pixels and background vector paths from text image components, text pixels and text vector paths, respectively (e.g., to aid text identification in scenarios in which text is very faint and difficult to see).
  • background and environment analysis block 222 is configured to enhance contrast between background image components and text image components in AI images.
  • background and environment analysis block 222 also is configured to enhance contrast between background pixels and text pixels in raster images.
  • background and environment analysis block 222 also is configured to enhance contrast between background vector paths and text vector paths in vector graphics.
  • background and environment analysis block 222 also is configured to analyze a user’s ambient environment (e.g., a user may be viewing input content on a computer screen in a very dark room, or on a mobile phone in very bright sun). In an embodiment, background and environment analysis block 222 also is configured to adjust the color and contrast of the translated input text objects 102a based on the user’s environment to make it easier for the user to read the translated text.
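  • A sketch of such an environment-based adjustment is shown below; the lux thresholds and color offsets are illustrative assumptions.

```python
def adjust_text_color_for_environment(text_rgb, ambient_lux):
    """Nudge translated text toward higher contrast in difficult viewing conditions."""
    if ambient_lux > 10000:        # very bright sunlight: render the text fully dark
        return (0, 0, 0)
    if ambient_lux < 10:           # very dark room: brighten the text
        return tuple(min(255, c + 80) for c in text_rgb)
    return text_rgb

print(adjust_text_color_for_environment((60, 60, 60), 20000))   # (0, 0, 0)
```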
  • inpainting block 224 is configured to generate output image objects 104b and output background 106b by removing input embedded text 110a1 , 110a2 and 110a3 from input image objects 104a, removing input text 108a1 and 108a2 from input background 106a, filling-in missing data in portions of input image objects 104a and input background 106a in which the text was removed, and providing the resulting output image objects 104b and output background 106b to text insertion block 228.
  • inpainting block 224 thus preserves appearance characteristics of input image objects 104a and input background 106a of input content 100a.
  • inpainting block 224 generates output photograph 114b by removing input embedded text 110a2 (“Spring Tulips”) from input photograph 114a, and filling in the portions of the photograph (e.g., portions of the table, the wood grain in the table, portions of tulip petals) that would be missing when input embedded text 110a2 is removed from input photograph 114a.
  • inpainting block 224 generates output chart 116b by removing input embedded text 110a3 (“Red, Yellow, . . ., Green”) from input chart 116a, and filling in the portions of the chart (e.g., the white background) that would be missing when input embedded text 110a3 is removed from input chart 116a.
  • inpainting block 224 generates output background 106b by removing input text 108a1 and 108a2 from input background 106a, and filling in the portions of the input background (e.g., color and shading) that would be missing when input text 108a1 and 108a2 are removed from input background 106a.
  • each of modality detection block 212, OCR block 214, translation block 216, localization block 218, style detection block 220, background and environment analysis block 222 and inpainting block 224 may be implemented in hardware, software, or a combination of hardware and software.
  • one or more of modality detection block 212, OCR block 214, translation block 216, localization block 218, style detection block 220, background and environment analysis block 222 and inpainting block 224 may be implemented using neural networks (e.g., deep learning based, machine learning based) and/or using conventional hardware and software technologies.
  • text insertion block 228 is configured to insert output text objects 102b generated by text generation block 226 into the in-painted images generated by inpainting block 224, resulting in an output image (e.g., output AI image 112b of FIG. 1B).
  • instant translation system 200a converts input content 100a that includes input text objects 102a in an input language (e.g., English) to output content 100b that includes output text objects 102b in an output language (e.g., French).
  • FIG. 2B is a simplified block diagram of another example instant translation system 200b for converting input content 100a that includes input text objects 102a in an input language (e.g., English) to output content 100b that includes output text objects 102b in an output language (e.g., French), with output text objects 102b having appearance characteristics that substantially match appearance characteristics of input text objects 102a, and with output image objects 104b and output background 106b that substantially preserve a style of input image objects 104a and input background 106a, respectively.
  • Example instant translation system 200b is similar to example instant translation system 200a of FIG. 2A, but also includes a translation method determination block 230.
  • low computation mode decision block 234 is configured to determine from the estimated computing capabilities from device computing power estimation block 232 whether the user’s device is a first power type device (e.g., a “low power” device) or a second power type device (e.g., not a low power device).
  • a mobile phone with a limited processor and limited memory may be classified as a low power device
  • a laptop computer with an advanced processor that operates at a high clock rate may be classified as not a low power device.
  • instant translation system 200b proceeds to translation block 210, such as described above and depicted in FIG. 2A.
  • instant translation system 200b converts input content 100a that includes input text objects 102a in an input language (e.g., English) to output content 100b that includes output text objects 102b in an output language (e.g., French), and that preserves appearance characteristics of the input text objects 102a, and preserves appearance characteristics of input image objects 104a, and input background 106a.
  • instant translation system 200b converts input content 100a that includes input text objects 102a in an input language (e.g., English) to output content 100b that includes output text objects in an output language (e.g., French), but without preserving appearance characteristics of the input text objects 102a, or the style of input image objects 104a, and input background 106a.
  • FIG. 1 E depicts example output content 100e generated by text generation and overlay insertion block 240 from input content 100a of FIG. 1A.
  • output content 100e includes output text objects (output text 108e1 and 108e2, and output embedded text 110e1, 110e2 and 110e3) in output language French overlaid on input text (input text 108a1 and 108a2, and input embedded text 110a1, 110a2 and 110a3, respectively) in input content 100a of FIG. 1A.
  • instant translation system 200b converts input content that includes input text in an input language (e.g., English) to output content that includes output text in an output language (e.g., French), but without matching appearance characteristics of the input text, or the style of input image objects 104a and input background 106a. This enables instant translation system 200b to provide accurate text translation even if a user device lacks sufficient power to preserve the style of the input text, or the style of the non-text features in the input content.
  • each of device computing power estimation block 232, low computation mode decision block 234, translation block 236, localization block 238, and text generation and overlay insertion block 240 may be implemented in hardware, software, or a combination of hardware and software.
  • one or more of device computing power estimation block 232, low computation mode decision block 234, translation block 236, localization block 238, and text generation and overlay insertion block 240 may be implemented using neural networks (e.g., deep learning based, machine learning based) and/or using conventional hardware and software technologies.
  • FIG. 2C is a simplified block diagram of another example instant translation system 200c for converting input content 100a that includes input text objects 102a in an input language (e.g., English) to output content 100b that includes output text objects 102b in an output language (e.g., French), that substantially matches appearance characteristics of the input text objects 102a and substantially matches appearance characteristics of input image objects 104a, and input background 106a in the input content 100a.
  • Example instant translation system 200c is similar to example instant translation system 200a of FIG. 2A, but also includes a preferred language detection block 242.
  • preferred language detection block 242 is configured to automatically detect a user’s preferred/default language and provide the detected language to translation determination decision block 206.
  • preferred language detection block 242 detects a user’s preferred language through a system language setting or a browser language setting, or other similar technique.
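  • A sketch of preferred-language detection from a browser Accept-Language header or the operating system locale is shown below; the parsing rules are simplified assumptions.

```python
import locale

def detect_preferred_language(browser_accept_language=None):
    """Detect a preferred language from a browser Accept-Language header if
    available, otherwise fall back to the operating system locale."""
    if browser_accept_language:
        # e.g., "fr-FR,fr;q=0.9,en;q=0.8" -> "fr"
        return browser_accept_language.split(",")[0].split("-")[0].strip()
    lang, _ = locale.getdefaultlocale() or (None, None)
    return lang.split("_")[0] if lang else "en"

print(detect_preferred_language("fr-FR,fr;q=0.9,en;q=0.8"))   # fr
```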
  • preferred language detection block 242 may be implemented in hardware, software, or a combination of hardware and software, or may be implemented using neural networks (e.g., deep learning based, machine learning based) and/or using conventional hardware and software technologies.
  • user profile and preferences information 208a may be used to specify various user preferences.
  • example instant translation systems 200a-200c may use information specified in user profile and preferences information 208a to tailor the type and appearance of translated output content.
  • user profile and preferences information 208a may specify that a user wants translated content to be very high quality, or that lower quality translation is sufficient in exchange for power or resource conservation.
  • translation block 216 may use this preference information to select a translation quality.
  • user profile and preferences information 208a may specify that high contrast between the background and foreground translated text is very important, or alternatively may specify that such high contrast is not important.
  • background and environment analysis block 222 may use this preference information to determine whether and how much to adjust translated text based on the user’s environment. Persons of ordinary skill in the art will understand that other similar user preferences may be used to select various features of the translation operation.
  • FIG. 2D is a more detailed block diagram of an embodiment of translation block 210 of FIGS. 2A-2C.
  • an output of modality detection block 212 is input to a decision block 244 which is configured to determine whether a modality identified by modality detection block 212 is input text (e.g., input text 108a1 and 108a2) (Y) or some other identified modality (e.g., input embedded text, input image objects or input background) (N).
  • metadata processing block 246a is configured to process metadata included in the input text.
  • Microsoft Office and PDF documents include various metadata fields that can be used to specify various preferences and parameters that may be used by the disclosed text translation technology.
  • a permanence indicator may have a first value (e.g., “T”) to specify that the translated output content may be temporary (e.g., viewed, but not printed, saved or forwarded), or a second value (e.g., “P”) to specify that the translated output content may be permanent (e.g., viewed, printed, saved, forwarded).
  • an originality indicator may have a first value (e.g., “O”) to specify that specific text is original, or a second value (e.g., “T”) to specify that specific text is translated.
  • a copyright indicator may have a first value (e.g., “W”) to specify that a watermark must be embedded in the translated output content, or a second value (e.g., “N”) to specify that a watermark is not needed in the translated output content.
  • metadata processing block 246a may be configured to use a “custom field” for input image objects to specify any of the various metadata indicators discussed above. Also, metadata processing block 246a may be configured to use other fields for input image objects, such as a copyright notice, a language identifier, user preferences, special instructions, or other similar custom field data. Persons of ordinary skill in the art will understand that the disclosed text translation technology may use other similar metadata indicators and fields.
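  • The metadata indicators described above might be modeled as in the following sketch; the field names, values, and post-processing behavior are illustrative assumptions, not the disclosed metadata handling.

```python
from dataclasses import dataclass

@dataclass
class TranslationMetadata:
    """Illustrative metadata indicators carried with a document or image object."""
    permanence: str = "T"    # "T" = temporary (view only), "P" = permanent (print/save/forward)
    originality: str = "O"   # "O" = original text, "T" = translated text
    copyright: str = "N"     # "W" = embed a watermark in the output, "N" = no watermark needed

@dataclass
class OutputContent:
    text: str
    watermarked: bool = False
    allow_save: bool = True

def postprocess_output(content: OutputContent, meta: TranslationMetadata) -> OutputContent:
    if meta.copyright == "W":
        content.watermarked = True   # stand-in for embedding an actual watermark
    if meta.permanence == "T":
        content.allow_save = False   # output may be viewed but not saved or forwarded
    return content

print(postprocess_output(OutputContent("Tulipes de printemps"), TranslationMetadata(copyright="W")))
```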
  • an output of metadata processing block 246a is input to a matching font generation block 248a, a translation block 216a and a localization block 218a.
  • matching font generation block 248a is configured to identify fonts that match or substantially match fonts in the input text (e.g., input text 108a1 and 108a2). In an embodiment, if matching font generation block 248a cannot identify a matching (or substantially matching) font, matching font generation block 248a may be configured to select a font that is a closest match to the input text.
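  • A sketch of closest-match font selection using simple string similarity is shown below; the disclosed block may use a different matching technique, so this is only an illustrative stand-in.

```python
import difflib

def closest_matching_font(detected_font, available_fonts):
    """Pick the available font that best matches the detected input font, falling
    back to the closest name match when an exact match is not available."""
    if detected_font in available_fonts:
        return detected_font
    matches = difflib.get_close_matches(detected_font, available_fonts, n=1, cutoff=0.0)
    return matches[0] if matches else available_fonts[0]

fonts = ["Arial", "Times New Roman", "Courier New", "Brush Script MT"]
print(closest_matching_font("Brush Script", fonts))   # Brush Script MT
```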
  • translation block 216a and localization block 218a are instances of translation block 216 and localization block 218, respectively, described above and depicted in FIG. 2A.
  • outputs of matching font generation block 248a, translation block 216a and localization block 218a are input to text generation block 226, described above and depicted in FIG. 2A.
  • an output of text generation block 226 is input to text insertion block 228, as described above and depicted in FIG. 2A.
  • If a modality identified by modality detection block 212 is not input text (e.g., it is input embedded text, input image objects or input background), the content is provided to an embedded text determination block 250 that is configured to determine if the content includes input embedded text (e.g., input embedded text 110a1, 110a2 and 110a3). If embedded text determination block 250 determines that the content does not include input embedded text, a next page/document block 252 is configured to retrieve or request a next page or document to be translated.
  • If embedded text determination block 250 determines that the content includes input embedded text (e.g., input embedded text 110a1, 110a2 and 110a3), OCR block 214 converts the determined input embedded text into a machine-readable text format, and style detection block 220 determines appearance characteristics of the input embedded text, as described above and depicted in FIG. 2A.
  • outputs of OCR block 214 and style detection block 220 are input to a metadata processing block 246b, which is configured to process metadata included in the input embedded text, such as described above regarding metadata processing block 246a.
  • an output of metadata processing block 246b is input to a matching font generation block 248b, a translation block 216b and a localization block 218b, which are configured to operate in the same manner as matching font generation block 248a, translation block 216a and localization block 218a, described above.
  • background and environment analysis block 222 is configured to distinguish background image components, background pixels and background vector paths from text image components, text pixels and text vector paths, respectively in the received input content, such as described above and depicted in FIG. 2B.
  • an output of background and environment analysis block 222 is provided to inpainting block 224, which is configured to remove text from images, and fill-in missing data in portions of the input image in which the text was removed and provide the resulting in-painted images to text insertion block 228, as described above and depicted in FIG. 2A.
  • each of decision block 244, metadata processing block 246a, matching font generation block 248a, metadata processing block 246b, embedded text determination block 250, and next page/document block 252 may be implemented in hardware, software, or a combination of hardware and software.
  • one or more of decision block 244, metadata processing block 246a, matching font generation block 248a, metadata processing block 246b, embedded text determination block 250, and next page/document block 252 may be implemented using neural networks (e.g., deep learning based, machine learning based) and/or using conventional hardware and software technologies.
  • FIG. 3A is a simplified block diagram of an example architecture 300a in which a translation server 302 performs all of the instant translation processing.
  • translation server 302 is configured to receive input content 304 that includes input text in an input language (e.g., English), translate input content 304 to output content 306 that includes output text in an output language (e.g., French), preserving the style of the input text and the non-text features in the input content, and provide the output content 306 to one or more client devices 308 (e.g., a laptop computer 308a, a wearable device 308b, a mobile device 308c, or other similar client device).
  • Input content 304 may be a document (e.g., a Word document), a presentation slide (e.g., a PowerPoint slide), a web page, or other similar content.
  • Input content 304 may be provided by a presenter, such as a speaker doing a presentation at a meeting while input content 304 is livestreamed over a private or public network (e.g., the Internet) to translation server 302.
  • translation server 302 is shown connected to three client devices 308, in other embodiments multiple translation servers 302 may be used, and each translation server 302 may be coupled to a corresponding one of client devices 308a, 308b and 308c.
  • the one or more translation servers 302 may be part of a cloud service 310, which in various embodiments may provide cloud computing services dedicated to text translation.
  • translation servers 302 are not part of a cloud service but may be one or more translation servers that are operated by a single enterprise, such that the network environment is owned and contained by a single entity (such as a corporation) and in which client devices 308a, 308b and 308c are all connected via the private network of the entity.
  • one or more network nodes may be disposed between the one or more translation servers 302 and client devices 308a, 308b and 308c.
  • the network nodes may include “edge” nodes that are generally one network hop from client devices 308a, 308b and 308c.
  • each network node may be a switch, router, processing device, or other network-coupled processing device which may or may not include data storage capability, allowing output content 306 to be stored in the node for distribution to client devices 308a, 308b and 308c.
  • network nodes may be basic network switches having no available caching memory.
  • translation server 302 includes an interaction engine 312, a localization engine 314, a translation engine 316, a text and style detection, processing and analysis engine 318, an image/video/graphics processing and analysis engine 320, a rendering engine 322, and a watermark engine 324.
  • translation server 302 also may include user profile and preferences information 208.
  • translation server 302 may include more, fewer, or different components than the example components depicted in FIG. 3A.
  • the functions of two or more of the example components depicted in FIG. 3A may be combined into a single component.
  • translation server 302 communicates with users via interaction engine 312, for example, via a user interface for exchanging information with users of client devices 308a, 308b and 308c.
  • interaction engine 312 may provide an app-based or operating system-based user interface.
  • interaction engine 312 may provide a user interface for users to specify the input content (or portions of input content) to be translated, the input language, and the output language.
  • interaction engine 312 may provide a user interface that allows users to download, save and/or print the translated output content.
  • interaction engine 312 may extract various translation parameters, such as desired output language, translation quality, and other translation parameters from user profile and preferences information 208.
  • translation server 302 uses localization engine 314 and translation engine 316 to translate the specified input content from the input language to the desired output language.
  • localization engine 314 uses grammar, syntax rules and cultural aspects of language to assist translation engine 316 to correctly translate the meaning of input text.
  • translation engine 316 may perform the translation using artificial intelligence-based translation systems, such as Google Translate, Quillbot, DeepL, Smartling or other similar artificial intelligence-based translation systems.
  • translation server 302 uses text and style detection, processing and analysis engine 318 to identify input text objects in the input content, and also identify the input language (e.g., English) of the input text objects.
  • translation server 302 uses text and style detection, processing and analysis engine 318 to determine appearance characteristics of identified input text objects, such as font, color, size, shape, typeface (e.g., roman, bold, italics), underlining, dimension (e.g., 2D or 3D), relative location, orientation, and other similar appearance characteristics of identified input text objects.
  • translation server 302 uses text and style detection, processing and analysis engine 318 to generate output text translated by translation engine 316.
  • translation server 302 uses image/video/graphics processing and analysis engine 320 to identify input embedded text, input image objects and input background in the input content. In embodiments, translation server 302 also uses image/video/graphics processing and analysis engine 320 to classify any identified input image objects as including specific types of input image objects (e.g., Al images, graphics, photograph, chart, video), and any identified input background (e.g., color and shading).
  • translation server 302 also uses image/video/graphics processing and analysis engine 320 to distinguish background image components, background pixels and background vector paths from text image components, text pixels and text vector paths, respectively, in the input content, and to enhance contrast between background image components, background pixels and background vector paths and text image components, text pixels and text vector paths, respectively, in translated text (e.g., to aid text identification in scenarios in which text is very faint and difficult to see).
  • translation server 302 also uses image/video/graphics processing and analysis engine 320 to adjust the color and contrast of the translated text based on the user’s environment to make it easier for the user to read the translated text.
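  • one simple, illustrative way to realize such a contrast adjustment is to compute a relative-luminance contrast ratio (in the spirit of the WCAG definition) between the style-matched text color and the sampled background, and to fall back to a darker or lighter color when the ratio is too low; the threshold and fallback colors below are assumptions:

```python
# Illustrative contrast check between translated text color and its local background.
def relative_luminance(rgb):
    """Relative luminance of an sRGB color, per the WCAG formula."""
    def channel(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

def pick_text_color(preferred_fg, background, minimum_ratio=4.5):
    # Keep the style-matched color when readable; otherwise fall back to black or white.
    if contrast_ratio(preferred_fg, background) >= minimum_ratio:
        return preferred_fg
    black, white = (0, 0, 0), (255, 255, 255)
    return black if contrast_ratio(black, background) >= contrast_ratio(white, background) else white

print(pick_text_color((200, 200, 60), (255, 255, 240)))  # low contrast -> falls back to black
```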
  • translation server 302 uses rendering engine 322 to perform inpainting to remove input embedded text from images, and fill-in missing data in portions of the input image in which the input embedded text was removed. In embodiments, translation server 302 also uses rendering engine 322 to insert the translated output text generated by text and style detection, processing and analysis engine 318 into the in-painted images.
  • translation server 302 uses watermark engine 324 to insert any necessary watermark in the translated output content.
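  • the disclosure does not prescribe a particular watermarking scheme; as one hedged example, a simple visible watermark could be composited onto the rendered output with Pillow (the file names, watermark text, position and opacity below are placeholders):

```python
# Illustrative visible watermark applied to translated output content (Pillow).
from PIL import Image, ImageDraw, ImageFont

page = Image.open("output_page.png").convert("RGBA")      # hypothetical rendered output
overlay = Image.new("RGBA", page.size, (0, 0, 0, 0))      # fully transparent layer
draw = ImageDraw.Draw(overlay)
font = ImageFont.load_default()

# Semi-transparent notice in the lower-left corner; text, position and opacity are assumptions.
draw.text((10, page.height - 20), "Machine translated - verify before use",
          font=font, fill=(128, 128, 128, 160))

watermarked = Image.alpha_composite(page, overlay)
watermarked.convert("RGB").save("output_page_watermarked.png")
```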
  • the engines of translation server 302 of FIG. 3A are examples, and the functions of one or more of interaction engine 312, localization engine 314, translation engine 316, text and style detection, processing and analysis engine 318, image/video/graphics processing and analysis engine 320, rendering engine 322, and watermark engine 324 may be combined, or performed by different ones of the example engines described above.
  • translation server 302 may include more, fewer or different engines than the examples depicted in FIG. 3A.
  • FIG. 3B is a block diagram of a network processing device 326 that can be used to implement various embodiments of translation server 302 of FIG 3A.
  • Specific network processing devices may use all of the components shown, or only a subset of the components, and levels of integration may vary from device to device.
  • network processing device 326 may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc.
  • network processing device 326 includes a processing unit 328 equipped with one or more input/output devices, such as network interfaces, storage interfaces, and the like.
  • processing unit 328 includes a central processing unit (CPU) 330, a memory 332, a mass storage device 334, a network interface 336 and an I/O interface 338, all connected to a bus 340.
  • bus 340 may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus, or the like.
  • network interface 336 enables network processing device 326 to communicate over a network 342 (e.g., the Internet) with other processing devices such as those described herein.
  • CPU 330 may include any type of electronic data processor.
  • memory 332 may include any type of system memory such as static random-access memory (SRAM), dynamic random-access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like.
  • memory 332 may include ROM for use at bootup, and DRAM for program and data storage for use while executing programs.
  • memory 332 includes computer readable instructions that are executed by CPU 330 to implement embodiments of the disclosed technology, including interaction engine 312, localization engine 314, translation engine 316, text and style detection, processing and analysis engine 318, image/video/graphics processing and analysis engine 320, rendering engine 322, watermark engine 324, and user profile and preferences information 208.
  • interaction engine 312, localization engine 314, translation engine 316, text and style detection, processing and analysis engine 318, image/video/graphics processing and analysis engine 320, rendering engine 322, and watermark engine 324 are described herein in various flowcharts and figures.
  • mass storage device 334 may include any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via bus 340.
  • mass storage device 334 may include, for example, one or more of a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, or the like.
  • FIG. 3A depicts an example architecture 300a in which translation server 302 performs all of the translation processing. Such an architecture is useful for scenarios in which client devices 308a, 308b, 308c have limited processing power and capabilities. In other scenarios, client devices 308a, 308b, 308c may have high processing power and capabilities.
  • FIG. 3C depicts an alternative example architecture 300b for such scenarios, in which client devices 308a, 308b, 308c perform all of the translation processing.
  • each of client devices 308a, 308b, 308c may include a processor 344 that performs all of the translation processing.
  • processor 344 is configured to receive input content 304 that includes input text in an input language (e.g., English), translate input content 304 to output content 306 that includes output text in an output language (e.g., French), preserving the style of the input text and the non-text features in the input content, and display or save the output content 306 on the respective client device 308 (e.g., laptop computer 308a, wearable device 308b, mobile device 308c, or other similar client device).
  • processor 344 includes interaction engine 312, localization engine 314, translation engine 316, text and style detection, processing and analysis engine 318, image/video/graphics processing and analysis engine 320, rendering engine 322, watermark engine 324 and user profile and preferences information 208, such as described above and depicted in FIG. 3A.
  • client devices 308a, 308b, 308c are coupled to a network 346 (e.g., a public network, a private network, a local area network, the Internet or other similar network).
  • processor 344 uses interaction engine 312 to receive input content 304 via network 346.
  • processor 344 performs translation processing using interaction engine 312, localization engine 314, translation engine 316, text and style detection, processing and analysis engine 318, image/video/graphics processing and analysis engine 320, rendering engine 322, watermark engine 324 and user profile and preferences information 208, such as described above and depicted in FIG. 3A.
  • the example network processing device 326 of FIG. 3B, described above, can be used to implement various embodiments of processor 344 of FIG 3C.
  • FIG. 3A depicts an example architecture 300a in which translation server 302 performs all of the translation processing
  • FIG. 3C depicts an alternative example architecture 300b in which client devices 308a, 308b, 308c perform all of the translation processing
  • FIG. 3D depicts an example architecture 300c in which translation processing is divided between translation server 302 and client devices 308a, 308b, 308c.
  • translation server 302 uses localization engine 314, translation engine 316, text and style detection, processing and analysis engine 318, image/video/graphics processing and analysis engine 320, watermark engine 324 and user profile and preferences information 208 such as described above, and processor 344 on client devices 308a, 308b, 308c uses interaction engine 312, rendering engine 322, watermark engine 324 and user profile and preferences information 208 such as described above, to divide translation processing operations between translation server 302 and client devices 308a, 308b, 308c.
  • Persons of ordinary skill in the art will understand that other architectures with other divisions of operations between translation server 302 and client devices 308a, 308b, 308c also may be used.
  • FIG. 3E is a simplified block diagram of still another example architecture 300d that includes a sender client device 348 coupled via network 346 to receiver client devices 308a, 308b and 308c.
  • sender client device 348 includes input content 304 and also includes processor 344 configured to perform all of the translation processing for translating input content 304 to output content 306, substantially matching appearance characteristics of the input text and substantially matching appearance characteristics of non-text features in the input content.
  • sender client device 348 is configured to provide the output content 306 via network 346 to receiver client devices 308a, 308b, and 308c.
  • Example architecture 300d may be referred to as “sender side processing.”
  • sender side processing may be useful if a sending user has copyright permission from a content owner and sender client device 348 has ample processing power, but one or more of receiver client devices 308a, 308b and 308c is a very simple mobile device that has low computing resources.
  • sender client device 348 performs as much processing as possible to accommodate the limited processing power of receiver client devices 308a, 308b and 308c.
  • some translation processing may be performed on translation server 302, some translation processing may be performed by edge computing centers, and some translation processing may be performed by client devices.
  • Edge computing centers can be deployed at edges of a communication network such that computing resources can be available in close proximity to end user clients. In this way, the edge computing centers can be employed to support computation-intensive and latency-sensitive applications at user equipment having limited resources.
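  • a toy sketch of how such a split might be chosen at run time, assuming the client can report a rough capability score; the score scale, thresholds, and stage names below are illustrative assumptions rather than part of the disclosure:

```python
# Illustrative assignment of pipeline stages based on an estimated client capability score.
PIPELINE = ["detect", "translate", "localize", "match_style", "inpaint", "render"]

def split_pipeline(client_score: float):
    """Return (server_stages, client_stages) for a capability score in [0, 1]."""
    if client_score < 0.3:        # low-power client: server (or edge) does everything
        return PIPELINE, []
    if client_score < 0.7:        # mid-range client: keep heavy vision stages remote
        return PIPELINE[:5], PIPELINE[5:]          # client only renders
    return [], PIPELINE                            # capable client does it all locally

server_side, client_side = split_pipeline(0.5)
print("server:", server_side)   # detect .. inpaint
print("client:", client_side)   # render
```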
  • translation server 302 and/or processor 344 may be configured to permit selective translation of input content.
  • translation server 302 and/or processor 344 may use interaction engine 312 to provide a user interface that allows a user to interactively select to translate a part or all of input content.
  • a content owner may specify copyright rules that permit partial or full translation. For example, a content owner may permit translation of text, but prohibit translation of embedded text in images or videos.
  • translation server 302 and/or processor 344 may use interaction engine 312 to allow a user to specify that only certain words or phrases may be translated, or may allow translation of everything except certain words (e.g., product names, brand names, trademarks).
  • translation server 302 and/or processor 344 may use interaction engine 312 to allow a user to specify preferences or presets to automatically select part of the input content to translate.
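  • as a hedged illustration, such selectivity could be expressed as a simple filter over identified text objects, driven by user preferences and a do-not-translate list; the preference fields and sample terms below are assumptions:

```python
# Illustrative selection of which identified text objects should be translated.
DO_NOT_TRANSLATE = {"ExampleBrand", "Model X-100"}      # assumed product/brand names

def should_translate(text_obj: dict, prefs: dict) -> bool:
    # text_obj: {"text": str, "kind": "text" or "embedded"}; prefs are assumed user presets.
    if text_obj["text"].strip() in DO_NOT_TRANSLATE:
        return False                                    # never translate protected terms
    if text_obj["kind"] == "embedded" and not prefs.get("translate_embedded_text", True):
        return False                                    # e.g., owner forbids editing embedded text
    return True

objects = [
    {"text": "Quarterly results", "kind": "text"},
    {"text": "ExampleBrand", "kind": "text"},
    {"text": "Sales by region", "kind": "embedded"},
]
prefs = {"translate_embedded_text": False}
print([o["text"] for o in objects if should_translate(o, prefs)])   # ['Quarterly results']
```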
  • FIG. 4 is a flowchart of an example method 400 of instantly translating text.
  • Example method 400 may be implemented by any of example instant translation systems 200a-200c described above and depicted in FIGS. 2A-2C, using any of example architectures 300a-300d described above and depicted in FIGS. 3A and 3C- 3E.
  • input content is received (e.g., input content 100a of FIG. 1A). For example, input content may be received from a user via a user interface, a mobile phone camera, an email interface or other method.
  • input text objects are identified in the received input content. For example, input text objects 102a (input text 108a1 and 108a2, and input embedded text 110a1, 110a2 and 110a3) may be identified in input content 100a of FIG. 1A.
  • at step 406, the identified input text objects are translated from an input language to an output language. For example, input text objects 102a (input text 108a1 and 108a2, and input embedded text 110a1, 110a2 and 110a3) may be translated from a first language (e.g., English) to a second language (e.g., French).
  • at step 408, an appearance characteristic of the identified input text objects is determined. For example, as described above, one or more appearance characteristics (e.g., font style, color, size, shape, typeface, underlining, dimension, relative location, and orientation) of input text objects 102a (input text 108a1 and 108a2, and input embedded text 110a1, 110a2 and 110a3) may be identified.
  • output text objects are generated that include the translated identified input text objects and that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input text objects. For example, output text objects 102b (output text 108b1 and 108b2, and output embedded text 110b1, 110b2 and 110b3) are generated that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input text objects 102a (input text 108a1 and 108a2, and input embedded text 110a1, 110a2 and 110a3).
  • output content 100b is generated by replacing the identified input text objects 102a (input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3) with the generated output text objects 102b (output text 108b1 and 108b2, and output embedded text 110b1 , 110b2 and 110b3).
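  • read as code, the flowchart corresponds to a simple sequential pipeline; the sketch below mirrors that flow with toy stand-in implementations, where every helper is a placeholder for the corresponding block described above rather than the actual system:

```python
# Illustrative end-to-end flow mirroring the steps of method 400.
# Every helper below is a toy stand-in for the corresponding block described above.
from dataclasses import dataclass

@dataclass
class TextObject:
    text: str
    style: dict        # e.g., {"font": "serif", "size": 32}
    position: tuple    # (x, y) on the page

def identify_text_objects(content):                 # stand-in for language/text detection
    return content["text_objects"]

def translate(text, target):                        # stand-in for translation/localization
    toy_dictionary = {"Hello": {"fr": "Bonjour"}, "Flowers": {"fr": "Fleurs"}}
    return toy_dictionary.get(text, {}).get(target, text)

def generate_output_objects(objs, target):
    # Keep each object's detected appearance characteristics; change only the text.
    return [TextObject(translate(o.text, target), o.style, o.position) for o in objs]

def replace_text_objects(content, new_objs):        # stand-in for inpainting + text insertion
    return {**content, "text_objects": new_objs}

page = {"background": "white",
        "text_objects": [TextObject("Hello", {"font": "serif", "size": 32}, (10, 10))]}
output = replace_text_objects(page, generate_output_objects(identify_text_objects(page), "fr"))
print(output["text_objects"][0].text)   # -> "Bonjour"
```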
  • a connection may be a direct connection or an indirect connection (e.g., via one or more other parts).
  • the element when an element is referred to as being connected or coupled to another element, the element may be directly connected to the other element or indirectly connected to the other element via intervening elements.
  • the element When an element is referred to as being directly connected to another element, then there are no intervening elements between the element and the other element.
  • Two devices are “in communication” if they are directly or indirectly connected so that they can communicate electronic signals between them.
  • These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • the technology described herein can be implemented using hardware, software, or a combination of both hardware and software.
  • the software used can be stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein.
  • the processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media.
  • computer readable media may include computer readable storage media and communication media.
  • Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by a computer.
  • a computer readable medium or media does (do) not include propagated, modulated, or transitory signals
  • each process associated with the disclosed technology may be performed continuously and by one or more computing devices.
  • Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A system and method of instantly translating text at a processing device. Input content is received, and input text objects are identified in the input content. The identified input text objects are translated from an input language to an output language. An appearance characteristic of the identified input text objects is determined, and output text objects are generated that include the translated identified input text objects and that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input text objects. Output content is generated by replacing the identified input text objects with the generated output text objects.

Description

SYSTEMS AND METHODS FOR CONTENT AND APPEARANCE PRESERVING INSTANT TRANSLATION
Inventors
Hong Heather Yu Xiyun Song Pingfan Wu
Masood Mortazavi Zongfang Lin Yubin Zhou Zhiqiang Lao Liang Peng
FIELD
[0001] This disclosure generally relates to improving the quality of computer- generated instant text translation.
BACKGROUND
[0002] In many instances, a group of people want to present or exchange information with one another, such as during live or online meetings, or through document sharing such as via email, computer download, or courier service. In some instances, all group members are fluent in a common language, and exchange text information in the common language so that all group members may easily understand the communicated text. However, other groups include people who are not all fluent in a common language. For example, many multi-national companies employ people in a variety of countries throughout the world. Thus, some groups may need to text exchange information with one another even though all group members may not share a common language or may not all be equally comfortable communicating in a common language.
[0003] As a result of recent advances in artificial intelligence and cloud computing capabilities, some “instant” translation tools on mobile devices and computers are quite effective at very quickly translating text between any of a large number of languages. As used herein, “instant translation” means text translation that is performed very rapidly in real time. For example, text in a first language (e.g., Chinese) can be copied and pasted in a mobile phone translation app, and the app almost instantly displays text translated to a second language (e.g., English). Likewise, some mobile phone apps can very quickly translate text captured by the mobile phone’s camera.
[0004] Although these existing solutions offer very fast and highly accurate text translation, the tools often fail to preserve faithfully the style of the original text or the style of non-text features of the original content. For example, many instant translation tools display translated text in a default font, color and size, with no regard to the style of the original text. Also, many instant translation tools overlay the translated text on the original text, obscuring any underlying and surrounding graphics or images and thus changing the overall design and appearance characteristics of the original content.
SUMMARY
[0005] One aspect includes a computer implemented method of translating text at a processing device. The computer implemented method includes receiving, at the processing device, input content; identifying, at the processing device, input text objects in the input content; translating the identified input text objects from an input language to an output language. The computer implemented method also includes determining, at the processing device, an appearance characteristic of the identified input text objects; generating, at the processing device, output text objects that comprise the translated identified input text objects and that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input text objects; and generating, at the processing device, output content by replacing the identified input text objects with the generated output text objects. [0006] Implementations may include the foregoing implementation, wherein the identified input text objects comprise one or more of symbols, characters, numbers. Implementations may include any of the foregoing implementations, wherein the identified input text objects comprise text embedded within image objects. Implementations may include any of the foregoing implementations, wherein an appearance characteristic of the identified input text objects includes one or more of a font style, color, size, shape, typeface, underlining, dimension, relative location, and orientation. Implementations may include any of the foregoing implementations, wherein replacing comprises removing the identified input text objects and inserting the generated output text objects. Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes identifying, at the processing device, input image objects in the input content; determining, at the processing device, an appearance characteristic of the identified input image objects; generating output image objects, at the processing device, that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input image objects; and generating output content further comprises replacing the identified input image objects with the generated output image objects. Implementations may include any of the foregoing implementations, wherein an appearance characteristic of the identified input image objects include non-text image details of the identified input image objects. Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes identifying, at the processing device, an input background in the input content; determining, at the processing device, an appearance characteristic of the identified input background; generating, at the processing device, an output background that has an appearance characteristic that substantially matches the determined appearance characteristic of the identified input background; and generating output content further comprises replacing the identified input background with the generated output background. Implementations may include any of the foregoing implementations, wherein an appearance characteristic of the identified input background includes one or more of a color, a shading, and a simulated texture. Implementations may include any of the foregoing implementations, wherein the processing device comprises any of a server and a client device. 
Implementations may include any of the foregoing implementations, wherein a first subset of the method steps is performed by a server and a second subset of the method steps is performed by a client device. Implementations may include any of the foregoing implementations, wherein the method steps are performed by a sender client device, which is configured to provide the output content to one or more receiver client devices. Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes estimating a computing capability of a user processing device, and selectively performing the determining and generating steps based on the estimated computing capability. Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes determining from the estimated computing capability that the user processing device comprises a first power type device, and generating, at the processing device, output text objects that comprise the translated identified input text objects, but without preserving appearance characteristics of the identified input text objects; and determining from the estimated computing capability that the user processing device comprises a second power type device comprising greater computing capability than a first power type device, and generating, at the processing device, output text objects that comprise the translated identified input text objects, and preserving appearance characteristics of the identified input text objects. Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes adjusting, at the processing device, a color and a contrast of the output text objects based on a user’s ambient environment,
[0007] Another aspect includes a non-transitory computer-readable medium storing computer instructions for translating text at a processing device. The instructions, when executed by one or more processors, cause the one or more processors to perform the steps of: receiving, at the processing device, input content; identifying, at the processing device, input text objects in the input content; translating the identified input text objects from an input language to an output language; determining, at the processing device, an appearance characteristic of the identified input text objects; generating, at the processing device, output text objects that comprise the translated identified input text objects and that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input text objects; and generating, at the processing device, output content by replacing the identified input text objects with the generated output text objects.
[0008] Implementations may include the foregoing implementation, wherein the identified input text objects comprise one or more of symbols, characters, numbers. Implementations may include any of the foregoing implementations, wherein the identified input text objects comprise text embedded within image objects. Implementations may include any of the foregoing implementations, wherein an appearance characteristic of the identified input text objects includes one or more of a font style, color, size, shape, typeface, underlining, dimension, relative location, and orientation. Implementations may include any of the foregoing implementations, wherein replacing comprises removing the identified input text objects and inserting the generated output text objects. Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes identifying, at the processing device, input image objects in the input content; determining, at the processing device, an appearance characteristic of the identified input image objects; generating output image objects, at the processing device, that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input image objects; and generating output content further comprises replacing the identified input image objects with the generated output image objects. Implementations may include any of the foregoing implementations, wherein an appearance characteristic of the identified input image objects include non-text image details of the identified input image objects. Implementations may include any of the foregoing implementations, wherein the computer implemented method further includes identifying, at the processing device, an input background in the input content; determining, at the processing device, an appearance characteristic of the identified input background; generating, at the processing device, an output background that has an appearance characteristic that substantially matches the determined appearance characteristic of the identified input background; and generating output content further comprises replacing the identified input background with the generated output background. Implementations may include any of the foregoing implementations, wherein an appearance characteristic of the identified input background includes one or more of a color, a shading, and a simulated texture. Implementations may include any of the foregoing implementations, wherein the processing device comprises any of a server and a client device. Implementations may include any of the foregoing implementations, wherein a first subset of the method steps is performed by a server and a second subset of the method steps is performed by a client device.
[0009] Another aspect includes a user equipment device that includes a non- transitory memory storage comprising instructions, and one or more processors in communication with the memory storage. The one or more processors execute the instructions to cause the device to receive, at the user equipment device, input content; identify, at the user equipment device, input text objects in the input content; translate selective ones of the identified input text objects from an input language to an output language; determine, at the user equipment device, an appearance characteristic of the selective ones of identified input text objects; generate, at the user equipment device, output text objects that comprise the translated selective ones of the identified input text objects and that have an appearance characteristic that substantially matches the determined appearance characteristic of the selective ones of the identified input text objects; and generate, at the user equipment device, output content by replacing the identified selective ones of the input text objects with the generated output text objects.
[0010] Implementations may include the foregoing implementation, wherein the one or more processors execute the instructions to further cause the device to provide a user interface for specifying the selective ones of the identified input text objects. Implementations may include any of the foregoing implementations, wherein the one or more processors execute the instructions to further cause the device to provide the user interface for specifying the output language. Implementations may include any of the foregoing implementations, wherein the one or more processors execute the instructions to further cause the device to determine from one or more of a user profile and system settings the selective ones of the identified input text objects. Implementations may include any of the foregoing implementations, wherein the one or more processors execute the instructions to further cause the device to receive from the user interface the output language.
[0011] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying figures, for which like references indicate the same or similar elements.
[0013] FIG. 1 A is a diagram of example input content.
[0014] FIG. 1 B is a diagram of example output content generated from the input content of FIG. 1A.
[0015] FIG. 1 C is a diagram of another example output content generated from the input content of FIG. 1A.
[0016] FIG. 1 D is a diagram of still another example output content generated from the input content of FIG. 1A.
[0017] FIG. 1 E is a diagram of yet another example output content generated from the input content of FIG. 1A.
[0018] FIG. 2A is a simplified block diagram of an example instant translation system.
[0019] FIG. 2B is a simplified block diagram of another example instant translation system. [0020] FIG. 2C is a simplified block diagram of still another example instant translation system.
[0021] FIG. 2D is a more detailed block diagram of an embodiment of a translation block of the example instant translation system of FIGS. 2A-2C.
[0022] FIG. 3A is a simplified block diagram of an example architecture of an instant translation system.
[0023] FIG. 3B is a simplified block diagram of another example architecture of an instant translation system.
[0024] FIG. 3C is a simplified block diagram of still another example architecture of an instant translation system.
[0025] FIG. 3D is a simplified block diagram of yet another example architecture of an instant translation system.
[0026] FIG. 3E is a simplified block diagram of still another example architecture of an instant translation system.
[0027] FIG. 4 is a flowchart of an example method of instantly translating text.
WRITTEN DESCRIPTION
[0028] Certain embodiments of the present disclosure can be used to instantly translate input content that includes input text in an input language (e.g., English) to output content that includes output text in an output language (e.g., French) and that has appearance characteristics that substantially match appearance characteristics of the input text. Certain embodiments of the present disclosure also can be used to provide output content that has non-text features that have appearance characteristics that substantially match appearance characteristics of non-text features in the input content. As used herein, “instantly translate” means to translate text very rapidly in real time, even if not literally instantaneous. [0029] FIG. 1A depicts example input content 100a that may be provided by a presenter and received by a recipient. Example input content 100a may be a document (e.g., a Word document), a presentation slide (e.g., a Power Point slide), a web page, a video, or other similar content or combination of one or more of such content. The presenter may be a speaker doing a presentation at a meeting while displaying input content 100a on a screen, and the recipient may be an audience member (live or online) attending the presentation and seeing input content 100a displayed on the screen.
[0030] Alternatively, a presenter may send input content 100a (e.g., a PDF document) via email to the recipient, who may then display input content 100a on a computer screen, print input content 100a on a printer, or save input content 100a to computer memory. In yet another alternative, the presenter may be a computer server providing input content 100a that includes HTML code for rendering by a recipient’s web browser. Persons of ordinary skill in the art will understand that these are nonlimiting examples.
[0031] In embodiments described below, content may include text objects, image objects, and background. As used herein, “text objects” include text and embedded text. As used herein, “image objects” includes raster images (e.g., photographs, scans), vector images (e.g., charts and graphs), Al-generated images (“Al images”) (which may be non-raster and non-vector images), video and other similar images. As used herein, “background” includes background features (e.g., color, shading) that are not text objects or image objects.
[0032] As used herein, “text” includes text symbols such as characters, numbers, and other similar symbols, such as in a word processing document, a spreadsheet, a slide presentation program, a text box, or similar. As used herein, “embedded text” includes text symbols embedded within image objects.
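For illustration only, the taxonomy above can be modeled roughly as the following data structures; the field names and kinds are assumptions chosen for readability, not claim language.

```python
# Illustrative data model for the content taxonomy described above (assumed field names).
from dataclasses import dataclass, field
from typing import List

@dataclass
class TextObject:
    text: str                          # characters, numbers, and other symbols
    embedded_in_image: bool = False    # True for embedded text inside an image object

@dataclass
class ImageObject:
    kind: str                          # "raster", "vector", "ai_image", "video", ...
    embedded_text: List[TextObject] = field(default_factory=list)

@dataclass
class Background:
    color: str = "white"
    shading: str = "none"

@dataclass
class Content:
    text_objects: List[TextObject] = field(default_factory=list)
    image_objects: List[ImageObject] = field(default_factory=list)
    background: Background = field(default_factory=Background)

# Example: a page with one text object and one photograph containing embedded text.
page = Content(
    text_objects=[TextObject("Spring flowers")],
    image_objects=[ImageObject("raster", embedded_text=[TextObject("Tulips", embedded_in_image=True)])],
)
print(len(page.text_objects), len(page.image_objects), page.background.color)   # 1 1 white
```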
[0033] In the embodiment of FIG. 1A, input content 100a includes input text objects 102a, input image objects 104a, and input background 106a. In an embodiment, input text objects 102a include input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3. In other embodiments, input content 100a may include more, fewer, and/or different input text objects 102a.
[0034] In an embodiment, input image objects 104a include input Al image 112a, input photograph 114a, and input chart 116a. In other embodiments, input content 100a may include more, fewer, and/or different input image objects 104a. In the embodiment of FIG. 1A, input Al image 112a includes input embedded text 110a1 , input photograph 114a includes input embedded text 110a2, and input chart 116a includes input embedded text 110a3.
[0035] In the embodiment of FIG. 1A, input background 106a includes a background color and shading. In embodiments, the background color may be white, or some other color, and the shading may include a crosshatch pattern, some other pattern, or no pattern.
[0036] In an embodiment, input Al image 112a includes various image components, some of which represent input embedded text 110a1. Likewise, input photograph 114a includes pixels, some of which represent input embedded text 110a2. Similarly, input chart 116a includes vector paths, some of which represent input embedded text 110a3. Although not depicted in FIG. 1 A, an input video also may include pixels, some of which represent input embedded text included in the video.
[0037] In an embodiment, input text objects 102a are in an input language. In the example of FIG. 1A, the input language is English. In an embodiment, all input text objects 102a in input content 100a are in the same input language, such as depicted in FIG. 1A. In other embodiments, input text objects 102a may be in a variety of different input languages.
[0038] For example, input text 108a1 may be in a first input language (e.g., English), input text 108a2 may be in a second input language (e.g., Korean), input embedded text 110a1 may be in a third input language (e.g., Russian), input embedded text 110a2 may be in a fourth input language (e.g., Spanish), and input embedded text 110a3 may be in a fifth language (e.g., Hungarian). For simplicity, the remaining description will assume that input text objects 102a are all in a single input language.
[0039] In accordance with embodiments of the present disclosure, input content that includes input text objects in an input language is converted to output content that includes output text objects in an output language different from the input language. For example, FIG. 1 B depicts example output content 100b generated from input content 100a of FIG. 1A.
[0040] In the embodiment of FIG. 1 B, output content 100b includes output text objects 102b, output image objects 104b, and output background 106b. In an embodiment, output text objects 102b include output text 108b1 and 108b2, and output embedded text 110b1, 110b2 and 110b3. In embodiments, output content 100b includes a same number of output text objects 102b as input text objects 102a in input content 100a.
[0041] In an embodiment, output image objects 104b include output Al image 112b, output photograph 114b, and output chart 116b. In embodiments, output content 100b includes a same number of output image objects 104b as input image objects 104a in input content 100a. In the embodiment of FIG. 1 B, output Al image 112b includes output embedded text 110b1, output photograph 114b includes output embedded text 110b2, and output chart 116b includes output embedded text 110b3.
[0042] In the embodiment of FIG. 1 B, output background 106b includes a same background color and shading as input background 106a of FIG. 1A.
[0043] In an embodiment, output Al image 112b includes various image components, some of which represent output embedded text 110b1 . Likewise, output photograph 114b includes pixels, some of which represent output embedded text 110b2. Similarly, output chart 116b includes vector paths, some of which represent output embedded text 110b3.
[0044] In an embodiment, output text objects 102b are in an output language. In the example of FIG. 1 B, the output language is French. In an embodiment, all output text objects 102b in output content 100b are in the same output language, such as depicted in FIG. 1 B. In other embodiments, output text objects 102b may be in a variety of different output languages.
[0045] For example, output text 108b1 may be in a first output language (e.g., Polish), output text 108b2 may be in a second output language (e.g., Italian), output embedded text 110b1 may be in a third output language (e.g., Croatian), output embedded text 110b2 may be in a fourth output language (e.g., Chinese), and output embedded text 110b3 may be in a fifth output language (e.g., Persian). For simplicity, the remaining description will assume that output text objects 102b are all in a single output language.
[0046] In accordance with embodiments of the present disclosure, input content 100a that includes input text objects 102a (e.g., input text 108a1 and 108a2 and input embedded text 110a1 , 110a2 and 110a3) in an input language (e.g., English) is converted to output content 100b that includes output text objects 102b (e.g., output text 108b1 and 108b2 and output embedded text 110b1 , 110b2 and 110b3) in an output language (e.g., French) and having appearance characteristics that substantially match appearance characteristics of input text objects 102a. In addition, output content 100b includes output image objects 104b and output background 106b that have appearance characteristics that substantially match appearance characteristics of input image objects 104a and input background 106a, respectively.
[0047] As used herein, appearance characteristics of input text objects 102a include font style, color, size, shape, typeface (e.g., roman, bold, italics), underlining, dimension (e.g., 2D or 3D), relative location, orientation, and other similar appearance characteristics of text.
[0048] As used herein, appearance characteristics of input image objects include non-text image details of input image objects, such as non-text details in input Al image 112a, input photograph 114a, and input chart 116a. Non-text details in input Al image 112a include, for example, the various flowers, greenery, vase, table surface, wood grain, and other such elements and placement of such elements in input Al image 112a. Non-text details in input photograph 114a include, for example, the flowers, vase, water, table surface, lighting effects, wall shading, and other such elements and placement of such elements in input photograph 114a. Non-text details in input chart 116a include, for example, the background color, vertical axes, horizontal gridlines, bar chart elements, and other such elements and placement of such elements in input chart 116a.
[0049] As used herein, appearance characteristics of input background include color, shading, simulated texture, and other similar appearance characteristics of input background 106a.
[0050] Thus, as depicted in FIG. 1 B, output text objects 102b (output text 108b1 and 108b2 and output embedded text 110b1 , 110b2 and 110b3) in output content 100b are in output language French and have appearance characteristics that substantially match appearance characteristics of input text objects 102a (input text 108a1 and 108a2 and input embedded text 110a1 , 110a2 and 110a3, respectively) in input content 100a of FIG. 1A.
[0051] For example, output text 108b2 in FIG. 1 B has a color, a font, a font size, a 3D effect and a relative location that substantially matches a color, a font, a font size, a 3D effect and a relative location of input text 108a2 in input content 100a. Similarly, output embedded text 110b1 in output content 100b has a script font, a font size, a color and a relative location that substantially matches a script font, a font size, a color and a relative location of input embedded text 110a1 in input content 100a.
[0052] In addition, as depicted in FIG. 1 B, output image objects 104b and output background 106b substantially preserve a style of input image objects 104a and input background 106a, respectively. In particular, appearance characteristics of non-text image details in output Al image 112b, output photograph 114b, and output chart 116b substantially match appearance characteristics of non-text portions of input image objects, such as non-text image details in input Al image 112a, input photograph 114a, and input chart 116a.
[0053] For example, output Al image 112b in output content 100b preserves image details of input Al image 112a (e.g., the wood grain on the table surface) of input content 100a of FIG. 1A. Likewise, output photograph 114b in output content 100b preserves image details of input photograph 114a (e.g., the tulips lying on the table surface) of input content 100a of FIG. 1A. Similarly, output background 106b preserves image details of input background 106a (e.g., color and shading) in input content 100a of FIG. 1A.
[0054] In other embodiments, input content 100a that includes input text objects 102a (e.g., input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3) in an input language (e.g., English) may be selectively converted to output content 100b that includes output text objects (e.g., output text 108b1 and 108b2, and output embedded text 110b1 , 110b2 and 110b3) in an output language (e.g., French) and having a style that substantially matches a style of input text objects 102a. In addition, output content 100b includes output image objects 104b and output background 106b that substantially preserve a style of input image objects 104a and input background 106a, respectively.
[0055] In an embodiment, such selectivity may convert some but not all input text objects 102a (e.g., input text 108a1 and 108a2, and input embedded text 110a1, 110a2 and 110a3). For example, FIG. 1 C depicts example output content 100c generated from input content 100a of FIG. 1A. In this embodiment, input text 108a1 and 108a2 are converted from English to French, but input embedded text 110a1, 110a2 and 110a3 is not converted.
[0056] Thus, in this example, output text 108b1 and 108b2 in output content 100c is displayed in output language French and has a font, a font size, a color, a shape, a typeface, a dimension, and a relative location and orientation that substantially matches a font, a font size, a color, a shape, a typeface, a dimension, and a relative location and orientation of input text 108a1 and 108a2 in input content 100a of FIG. 1A. Output content 100c also includes output embedded text 110c1 , 110c2 and 110c3 that is displayed in English, the same as that of input embedded text 110a1 , 110a2 and 110a3 in input content 100a of FIG. 1A.
[0057] In another example, FIG. 1 D depicts example output content 100d generated from input content 100a of FIG. 1A. In this embodiment, embedded text is converted from English to French, but text is not converted. Thus, in this example, output embedded text 110d1, 110d2 and 110d3 in output content 100d is displayed in output language French and has a font, a font size, a color, a shape, a typeface, a dimension, and a relative location and orientation that substantially matches a font, a font size, a color, a shape, a typeface, a dimension, and a relative location and orientation of input embedded text 110a1, 110a2 and 110a3 of FIG. 1A. Output content 100d also includes output text 108d1 and 108d2 that is displayed in English, the same as that of input text 108a1 and 108a2 of FIG. 1A.
[0058] Persons of ordinary skill in the art will understand that in still other embodiments, other types of selective conversion may be used. For example, only input embedded text in photographs may be converted from the input language to the output language, and all other input text objects may remain in the input language. Or only input embedded text in graphs and charts may be converted from the input language to the output language, and all other input text objects may remain in the input language.
[0059] In the example embodiments described above and depicted in FIGS. 1A- 1 D, output text objects 102b (e.g., output text 108b1 and 108b2, and output embedded text 110b1 , 110b2 and 110b3) in output content 100b depicted in FIG. 1 B replaces input text objects 102a (e.g., input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3, respectively) in input content 100a of FIG. 1A.
[0060] Likewise, output text 108b1 and 108b2 in output content 100c depicted in FIG. 1 C replaces input text 108a1 and 108a2, respectively, in input content 100a of FIG. 1A. Similarly, output embedded text 110d1, 110d2 and 110d3 in output content 100d depicted in FIG. 1 D replaces input embedded text 110a1, 110a2 and 110a3, respectively, in input content 100a of FIG. 1A.
[0061] What is meant by output text objects 102b replacing input text objects 102a is that the translated output text objects 102b are not superimposed or “taped over” the corresponding input text objects 102a. Instead, input text objects 102a (e.g., input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3) are effectively “erased” by inpainting input image objects 104a and input background 106a in input content 100a. After input text objects 102a have been erased in this manner, output text objects 102b (e.g., output text 108b1 and 108b2, and output embedded text 110b1 , 110b2 and 110b3) are overlayed over the in-painted input image objects 104a and input background 106a to form output content 100b.
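A hedged sketch of the overlay half of this "erase, then overlay" step, assuming the in-painted image is already available (for example, produced as in the inpainting sketch earlier) and that the matching-font step resolved a comparable TrueType font on the system; the file paths, font, coordinates, and colors are placeholders.

```python
# Illustrative overlay of translated text onto an already in-painted image (Pillow).
from PIL import Image, ImageDraw, ImageFont

inpainted = Image.open("inpainted_slide.png").convert("RGB")   # original text already erased
draw = ImageDraw.Draw(inpainted)

# Assumed: the style detector reported this position, size, and color for the original text,
# and the matching-font step resolved a comparable TrueType font available on this system.
font = ImageFont.truetype("DejaVuSans-Bold.ttf", size=36)
draw.text((120, 80), "Bonjour tout le monde", font=font, fill=(20, 40, 120))

inpainted.save("output_slide.png")
```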
[0062] FIG. 2A is a simplified block diagram of an example translation system 200a for instantly translating input content that includes input text in an input language (e.g., English) to output content that includes output text in an output language (e.g., French) and that has appearance characteristics that substantially match appearance characteristics of the input text, and has non-text features that have appearance characteristics that substantially match appearance characteristics of non-text features in the input content.
[0063] In an embodiment, input content 100a is provided as input to a language and text detection block 202 that is configured to identify input text objects in the input content (referred to herein as “identified input text objects”), and also identify the input language (referred to herein as “identified input language”) of the identified input text objects. Thus, using input content 100a of FIG. 1 A as an example, language and text detection block 202 identifies input text objects 102a (input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3), and identifies the input language (English).
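By way of illustration only, one way to approximate the behavior described for language and text detection block 202 is to combine an off-the-shelf OCR engine with a language identifier. The following Python sketch uses pytesseract and langdetect; the function name, return structure, and file path are assumptions introduced for the example and are not part of the disclosure.

```python
# Illustrative sketch: locate text regions and guess the input language.
from langdetect import detect
from PIL import Image
import pytesseract


def detect_text_and_language(image_path):
    """Return detected text regions and a best-guess input language code."""
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    regions = []
    for i, word in enumerate(data["text"]):
        if word.strip():  # keep only non-empty OCR tokens
            regions.append({
                "text": word,
                "box": (data["left"][i], data["top"][i],
                        data["width"][i], data["height"][i]),
            })

    full_text = " ".join(r["text"] for r in regions)
    language = detect(full_text) if full_text else None  # e.g., "en"
    return regions, language
```

In practice a trained neural detector could replace either tool; the sketch only shows the two outputs the block is described as producing: the identified input text objects and the identified input language.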
[0064] In embodiments, language and text detection block 202 may be implemented in hardware, software, or a combination of hardware and software. In embodiments, language and text detection block 202 may be implemented using artificial intelligence machine learning systems that are trained to identify input text objects 102a in input content 100a, and also identify the input language of input text objects 102a.
[0065] In an embodiment, the identified input text objects 102a and identified input language are provided to a format detection block 204 that is configured to identify formats of the identified input text objects 102a. Example formats include two-dimensional text, three-dimensional text, embedded text in images and video, and other similar formats. In embodiments, format detection block 204 may be implemented in hardware, software, or a combination of hardware and software. In embodiments, format detection block 204 may be implemented using artificial intelligence machine learning systems that are trained to detect input file formats.
[0066] In an embodiment, the identified input text objects 102a, identified input language, and identified formats are provided to a translation determination decision block 206 to determine if the identified input text objects 102a in the identified input language should be translated to output text objects 102b in a desired output language. In embodiments, translation determination decision block 206 may be implemented in hardware, software, or a combination of hardware and software.
[0067] In an embodiment, translation determination decision block 206 prompts a user to specify whether the identified input text objects 102a should be translated, and also prompts the user to specify a desired output language. For example, if the identified input language is English and a user specifies a desired output language is French, translation determination decision block 206 may prompt a user whether the identified input text objects 102a should be translated to French.
[0068] In another embodiment, translation determination decision block 206 does not prompt a user to identify input text objects 102a to be translated or a desired output language. Instead, translation determination decision block 206 may automatically make this determination whenever the identified input text objects 102a are in an input language different from a predetermined desired output language.
[0069] In another embodiment, example instant translation system 200a also may receive user profile and preferences information 208 that may specify various user preferences. For example, user profile and preferences information 208 may specify that a user’s preferred languages are French and Chinese. In an embodiment, translation determination decision block 206 may receive this preferred language information and may either prompt the user to select which of the preferred languages should be used to translate the identified input text objects 102a, or may automatically select one of the preferred languages specified in user profile and preferences information 208. [0070] In an embodiment, if translation determination decision block 206 determines that the identified input text objects 102a should not be translated, instant translation system 200a proceeds along the “N” (no) output path and loops back to language and text detection block 202 to identify any additional input text objects 102a in input content 100a. Alternatively, if translation determination decision block 206 determines that the identified input text objects 102a should be translated from the identified input language to a specified or preferred output language, instant translation system 200a proceeds along the “Y” (yes) output path to translation block 210.
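The decision logic described for translation determination decision block 206 might be realized as shown in the hedged sketch below: translate automatically when the detected input language is not one of the user's preferred languages, or defer to an interactive prompt. All function and field names are illustrative assumptions.

```python
# Illustrative sketch of the translate / do-not-translate decision.
def should_translate(detected_language, preferences, prompt=None):
    preferred = preferences.get("preferred_languages", [])
    if detected_language in preferred:
        return False, None  # already in a language the user reads

    if prompt is not None:
        # Interactive mode: ask the user which preferred language to use.
        return prompt(detected_language, preferred)

    # Automatic mode: pick the first preferred language as the output language.
    return (True, preferred[0]) if preferred else (False, None)


# Example usage with an automatic decision:
decision, output_language = should_translate(
    "en", {"preferred_languages": ["fr", "zh"]})
print(decision, output_language)  # True fr
```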
[0071] In an embodiment, translation block 210 includes a modality detection block 212 that is configured to identify various content modalities in input content 100a. Example modalities include input text, input embedded text, input image objects, input background, and other similar modalities. Thus, continuing to use input content 100a of FIG. 1A as an example, modality detection block 212 determines that input content 100a includes input text 108a1 and 108a2, input embedded text 110a1, 110a2 and 110a3, input image objects 104a and input background 106a. In human-computer interaction, a modality is the classification of a single independent channel of input/output between a computer and a human. Such channels may differ based on sensory nature (e.g., visual vs. auditory), or other significant differences in processing (e.g., text vs. image). Each input modality relies on specific types of sensors, devices, and recognition algorithms to capture and interpret user inputs. Similarly, each output modality is a distinct channel through which a computer system conveys information or feedback to a user and represents an independent channel of presenting information, utilizing a particular sensory or cognitive process.
[0072] Referring again to FIG. 2A, in an embodiment modality detection block 212 also is configured to classify any identified input image objects 104a as including specific types of input image objects. Thus, continuing to use input content 100a of FIG. 1A as an example, modality detection block 212 determines that input image objects 104a includes input Al image 112a, input photograph 114a, and input chart 116a. [0073] Referring again to FIG. 2A, in an embodiment modality detection block 212 also is configured to identify and classify any identified input background. Thus, continuing to use input content 100a of FIG. 1 A as an example, modality detection block 212 determines that input background 106a includes both a background color and shading.
[0074] Referring again to FIG. 2A, in an embodiment the output of modality detection block 212 is input to an optical character recognition (“OCR”) block 214, a translation block 216, a localization block 218, a style detection block 220, a background and environment analysis block 222, and an inpainting block 224. Each of these will be discussed in turn.
[0075] In an embodiment, OCR block 214 is configured to convert identified input embedded text into a machine-readable text format. Thus, continuing to use input content 100a of FIG. 1A as an example, in an embodiment OCR block 214 converts input embedded text 110a1, 110a2 and 110a3 in input Al image 112a, input photograph 114a, and input chart 116a, respectively, into a machine-readable text format.
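As a minimal sketch of this step, and assuming the image objects have already been located, OCR block 214 could be approximated by running Tesseract over each cropped image object to obtain a machine-readable string. The file name and the sample output are hypothetical.

```python
# Illustrative OCR of embedded text within one image object.
from PIL import Image
import pytesseract

input_photograph = Image.open("input_photograph.png")   # analogue of 114a
embedded_text = pytesseract.image_to_string(input_photograph)
print(embedded_text.strip())  # e.g., "Spring Tulips"
```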
[0076] Referring again to FIG. 2A, in an embodiment translation block 216 is configured to translate identified input text objects 102a from the input language to the output language specified by the user. In embodiments, translation block 216 may be implemented using artificial intelligence-based translation systems, algorithms or engines, such as Meta M2M100, OpenNMT, Google Translate, Quillbot, DeepL, Smartling or other similar artificial intelligence-based translation systems.
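One possible concrete realization of translation block 216, using the publicly available M2M100 model from Hugging Face Transformers, is sketched below. The disclosure only names M2M100 as one example engine, so this particular pairing, the model checkpoint, and the sample output are assumptions.

```python
# Illustrative translation of an identified input text object from English to French.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")


def translate(text, src_lang="en", tgt_lang="fr"):
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


print(translate("Beautiful Roses"))  # e.g., "Belles roses"
```

Any of the other named engines could be substituted behind the same `translate` interface; the sketch only fixes the input/output contract of the block.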
[0077] In an embodiment, localization block 218 is configured to use grammar, syntax rules and cultural aspects of language to facilitate translation block 216 correctly translating the identified input text objects 102a. In some embodiments, the functions of translation block 216 and localization block 218 can be combined into a single block.
[0078] In an embodiment, style detection block 220 is configured to determine appearance characteristics of identified input text objects 102a. In embodiments, appearance characteristics of text include font, color, size, shape, typeface (e.g., roman, bold, italics), underlining, dimension (e.g., 2D or 3D), relative location, orientation, and other similar appearance characteristics of input text objects 102a.
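A simple heuristic sketch of style detection block 220 for embedded text follows: estimate text color and size from the pixels inside a detected bounding box. The darker-than-background assumption, the returned field names, and the size approximation are all illustrative, not requirements of the disclosure.

```python
# Heuristic appearance-characteristic extraction for one text region.
import numpy as np
from PIL import Image


def detect_style(image, box):
    """box = (left, top, width, height) of a detected text region."""
    left, top, width, height = box
    crop = np.asarray(image.convert("RGB").crop(
        (left, top, left + width, top + height)))

    gray = crop.mean(axis=2)
    mask = gray < gray.mean()          # assume text is darker than its background
    text_pixels = crop[mask] if mask.any() else crop.reshape(-1, 3)

    return {
        "color": tuple(int(c) for c in text_pixels.reshape(-1, 3).mean(axis=0)),
        "size_px": height,             # font size approximated by box height
        "location": (left, top),
    }
```

A trained style classifier could additionally recognize typeface, boldness, italics, underlining, and 2D/3D dimension; this sketch covers only the characteristics that are cheap to measure directly.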
[0079] As described above, input Al image 112a includes various image components, some of which represent input embedded text 110a1 (referred to herein as “text image components”), input photograph 114a includes pixels, some of which represent input embedded text 110a2 (referred to herein as “text pixels”), and input chart 116a includes vector paths, some of which represent input embedded text 110a3 (referred to herein as “text vector paths”). In embodiments, image components of an Al image that are not text image components are referred to herein as “background image components,” pixels of a photograph or other raster image that are not text pixels are referred to herein as “background pixels,” and vector paths of a vector image that are not text vector paths are referred to herein as “background vector paths.”
[0080] In an embodiment, background and environment analysis block 222 is configured to distinguish background image components, background pixels and background vector paths from text image components, text pixels and text vector paths, respectively (e.g., to aid text identification in scenarios in which text is very faint and difficult to see). In an embodiment, background and environment analysis block 222 is configured to enhance contrast between background image components and text image components in Al images. In an embodiment, background and environment analysis block 222 also is configured to enhance contrast between background pixels and text pixels in raster images. In an embodiment, background and environment analysis block 222 also is configured to enhance contrast between background vector paths and text vector paths in vector graphics.
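A minimal sketch of the contrast-enhancement role of background and environment analysis block 222 is shown below, using OpenCV's CLAHE on the luminance channel so that faint text stands out from the background. The specific algorithm and parameter values are illustrative assumptions; any contrast-enhancement technique could serve the same purpose.

```python
# Illustrative contrast enhancement to separate text from background.
import cv2


def enhance_text_contrast(bgr_image):
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    l_eq = clahe.apply(l)              # equalize only the lightness channel
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)


enhanced = enhance_text_contrast(cv2.imread("input_photograph.png"))
```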
[0081] In an embodiment, background and environment analysis block 222 also is configured to analyze a user’s ambient environment (e.g., a user may be viewing input content on a computer screen in very dark room, or on a mobile phone in very bright sun). In an embodiment, background and environment analysis block 222 also is configured to adjust the color and contrast of the translated input text objects 102a based on the user’s environment to make it easier for the user to read the translated text.
[0082] In an embodiment, inpainting block 224 is configured to generate output image objects 104b and output background 106b by removing input embedded text 110a1 , 110a2 and 110a3 from input image objects 104a, removing input text 108a1 and 108a2 from input background 106a, filling-in missing data in portions of input image objects 104a and input background 106a in which the text was removed, and providing the resulting output image objects 104b and output background 106b to text insertion block 228. Without wanting to be bound by any particular theory, it is believed that inpainting block 224 thus preserves appearance characteristics of input image objects 104a and input background 106a of input content 100a.
[0083] Thus, using input content 100a of FIG. 1 A as an example, in an embodiment inpainting block 224 generates output Al image 112b by removing input embedded text 110a1 (“Beautiful Roses”) from Al image 112a, and filling in the portions of input Al image 112a (e.g., leaves and flower elements at the top of the floral arrangement) that would be missing when input embedded text 110a1 is removed from input Al image 112a.
[0084] Likewise, in an embodiment inpainting block 224 generates output photograph 114b by removing input embedded text 110a2 (“Spring Tulips”) from input photograph 114a, and filling in the portions of the photograph (e.g., portions of the table, the wood grain in the table, portions of tulip petals) that would be missing when input embedded text 110a2 is removed from input photograph 114a.
[0085] Similarly, in an embodiment inpainting block 224 generates output chart 116b by removing input embedded text 110a3 (“Red, Yellow, . . ., Green”) from input chart 116a, and filling in the portions of the chart (e.g., the white background) that would be missing when input embedded text 110a3 is removed from input chart 116a.
[0086] In addition, inpainting block 224 generates output background 106b by removing input text 108a1 and 108a2 from input background 106a, and filling in the portions of the input background (e.g., color and shading) that would be missing when input text 108a1 and 108a2 are removed from input background 106a.
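The erase-and-fill behavior attributed to inpainting block 224 in the preceding paragraphs could be approximated as follows: build a mask from the detected text bounding boxes and let an off-the-shelf inpainting routine reconstruct the occluded background. The disclosure does not mandate any particular algorithm; OpenCV's TELEA inpainting and the padding value are used here purely as an illustration.

```python
# Illustrative text removal by masked inpainting.
import cv2
import numpy as np


def erase_text(bgr_image, text_boxes, pad=2):
    """text_boxes: iterable of (left, top, width, height) regions to remove."""
    mask = np.zeros(bgr_image.shape[:2], dtype=np.uint8)
    for left, top, width, height in text_boxes:
        mask[max(top - pad, 0):top + height + pad,
             max(left - pad, 0):left + width + pad] = 255
    # Fill the masked regions from the surrounding image content.
    return cv2.inpaint(bgr_image, mask, 3, cv2.INPAINT_TELEA)
```

A learned inpainting model would typically give better fills for complex photographs; the sketch only demonstrates the remove-then-fill structure of the block.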
[0087] In embodiments, each of modality detection block 212, OCR block 214, translation block 216, localization block 218, style detection block 220, background and environment analysis block 222 and inpainting block 224 may be implemented in hardware, software, or a combination of hardware and software. In embodiments, one or more of modality detection block 212, OCR block 214, translation block 216, localization block 218, style detection block 220, background and environment analysis block 222 and inpainting block 224 may be implemented using neural networks (e.g., deep learning based, machine learning based) and/or using conventional hardware and software technologies.
[0088] In an embodiment, the outputs of OCR block 214, translation block 216, localization block 218, and style detection block 220 are input to a text generation block 226 that is configured to generate output text objects 102b from the translated text provided by translation block 216 (and facilitated by localization block 218), having substantially the same appearance characteristics as the input text objects determined by style detection block 220.
[0089] In an embodiment, text insertion block 228 is configured to insert output text objects 102b generated by text generation block 226 into the in-painted images generated by inpainting block 224, resulting in an output image (e.g., output Al image 112b of FIG. 1 B).
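An illustrative sketch of text insertion block 228 is shown below: draw the translated string onto the in-painted image at the original location, reusing the detected color and an approximately matching font size. The `style` record is the hypothetical structure from the style-detection sketch above, and the font file name is an assumption; any font with similar metrics could be substituted.

```python
# Illustrative insertion of translated text over an in-painted image.
from PIL import Image, ImageDraw, ImageFont


def insert_text(inpainted_image, translated_text, style):
    draw = ImageDraw.Draw(inpainted_image)
    font = ImageFont.truetype("DejaVuSans.ttf", style["size_px"])
    draw.text(style["location"], translated_text,
              font=font, fill=style["color"])
    return inpainted_image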
[0090] Without wanting to be bound by any particular theory, it is believed that instant translation system 200a converts input content 100a that includes input text objects 102a in an input language (e.g., English) to output content 100b that includes output text objects 102b in an output language (e.g., French).
[0091] In addition, without wanting to be bound by any particular theory, it is believed that instant translation system 200a generates output text objects 102b having appearance characteristics that substantially match appearance characteristics of input text objects 102a. [0092] In addition, without wanting to be bound by any particular theory, it is believed that instant translation system 200a generates output content 100b that includes output image objects 104b and output background 106b that substantially preserve a style of input image objects 104a and input background 106a, respectively.
[0093] FIG. 2B is a simplified block diagram of another example instant translation system 200b for converting input content 100a that includes input text objects 102a in an input language (e.g., English) to output content 100b that includes output text objects 102b in an output language (e.g., French), with output text objects 102b having appearance characteristics that substantially match appearance characteristics of input text objects 102a, and with output image objects 104b and output background 106b that substantially preserve a style of input image objects 104a and input background 106a, respectively. Example instant translation system 200b is similar to example instant translation system 200a of FIG. 2A, but also includes a translation method determination block 230.
[0094] In an embodiment, translation method determination block 230 is configured to adapt the text translation method based on the computing capabilities of a user’s device (e.g., desktop computer, laptop computer, mobile phone). In an embodiment, translation method determination block 230 includes a device computing power estimation block 232, a low computation mode decision block 234, a translation block 236, a localization block 238 and a text generation and overlay insertion block 240.
[0095] In an embodiment, device computing power estimation block 232 is configured to estimate the computing capabilities of a user’s device. For example, device computing power estimation block 232 may estimate the type of user device (e.g., mobile phone, laptop computer, etc.), processor type, processor power, processor speed, available memory, communication protocol, or other similar factors related to the computing capabilities of a user’s device.
[0096] In an embodiment, low computation mode decision block 234 is configured to determine, from the estimated computing capabilities provided by device computing power estimation block 232, if the user’s device is a first power type device (e.g., a “low power” device) or is a second power type device (e.g., not a low power device). For example, a mobile phone with a limited processor and limited memory may be classified as a low power device, whereas a laptop computer with an advanced processor that operates at a high clock rate may be classified as not a low power device.
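A rough sketch of device computing power estimation block 232 and low computation mode decision block 234 follows: classify the local device from CPU count and installed memory. The thresholds are arbitrary assumptions chosen only to illustrate the branching behavior, and `psutil` is one of many ways to obtain the memory estimate.

```python
# Illustrative low-power classification and pipeline selection.
import os

import psutil  # third-party, used here only for the memory estimate


def is_low_power_device(min_cores=4, min_ram_gb=4):
    cores = os.cpu_count() or 1
    ram_gb = psutil.virtual_memory().total / 2**30
    return cores < min_cores or ram_gb < min_ram_gb


# Branch between the full style-preserving pipeline and the lightweight
# translate-and-overlay path described for low power devices.
mode = "overlay_only" if is_low_power_device() else "style_preserving"
print(mode)
```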
[0097] In an embodiment, if low computation mode decision block 234 determines that the user device is not a low power device, instant translation system 200b proceeds to translation block 210, such as described above and depicted in FIG. 2A. In other words, if a user device is not a low power device, instant translation system 200b converts input content 100a that includes input text objects 102a in an input language (e.g., English) to output content 100b that includes output text objects 102b in an output language (e.g., French), and that preserves appearance characteristics of the input text objects 102a, and preserves appearance characteristics of input image objects 104a, and input background 106a.
[0098] If, however, low computation mode decision block 234 determines that the user device is a low power device, instant translation system 200b converts input content 100a that includes input text objects 102a in an input language (e.g., English) to output content 100b that includes output text objects in an output language (e.g., French), but without preserving appearance characteristics of the input text objects 102a, or the style of input image objects 104a, and input background 106a.
[0099] In particular, translation block 236 and localization block 238 are configured to perform the same translation and localization functions as translation block 216 and localization block 218, respectively, described above. Thus, in an embodiment, localization block 238 provides output text objects in the output language to text generation and overlay insertion block 240, which is configured to generate the actual output text objects and then overlay the generated output text objects over the input content (e.g., input content 100a of FIG. 1A).
[00100] For example, FIG. 1 E depicts example output content 100e generated by text generation and overlay insertion block 240 from input content 100a of FIG. 1A. In particular, output content 100e includes output text objects (output text 108e1 and 108e2, and output embedded text 110e1 , 110e2 and 110e3) in output language French overlayed on input text (input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3, respectively) in input content 100a of FIG. 1A.
[00101] Thus, if low computation mode decision block 234 determines that the user device is a low power device, instant translation system 200b converts input content that includes input text in an input language (e.g., English) to output content that includes output text in an output language (e.g., French), but without matching appearance characteristics of the input text, or the style of input image objects 104a and input background 106a. This enables instant translation system 200b to provide accurate text translation even if a user device lacks sufficient power to preserve the style of the input text, or the style of the non-text features in the input content.
[00102] In embodiments, each of device computing power estimation block 232, low computation mode decision block 234, translation block 236, localization block 238, and text generation and overlay insertion block 240 may be implemented in hardware, software, or a combination of hardware and software. In embodiments, one or more of device computing power estimation block 232, low computation mode decision block 234, translation block 236, localization block 238, and text generation and overlay insertion block 240 may be implemented using neural networks (e.g., deep learning based, machine learning based) and/or using conventional hardware and software technologies.
[00103] FIG. 2C is a simplified block diagram of another example instant translation system 200c for converting input content 100a that includes input text objects 102a in an input language (e.g., English) to output content 100b that includes output text objects 102b in an output language (e.g., French), that substantially matches appearance characteristics of the input text objects 102a and substantially matches appearance characteristics of input image objects 104a, and input background 106a in the input content 100a. Example instant translation system 200c is similar to example instant translation system 200a of FIG. 2A, but also includes a preferred language detection block 242. [00104] In an embodiment, preferred language detection block 242 is configured to automatically detect a user’s preferred/default language and provide the detected language to translation determination decision block 206. In an embodiment, preferred language detection block 242 detects a user’s preferred language through a system language setting or a browser language setting, or other similar technique.
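The role of preferred language detection block 242 might be approximated as in the sketch below: fall back from an HTTP Accept-Language header (a browser setting) to the operating system locale. The header parsing is deliberately simplistic and the default value is an assumption.

```python
# Illustrative detection of a user's preferred/default language.
import locale


def detect_preferred_language(accept_language_header=None):
    if accept_language_header:
        # e.g., "fr-FR,fr;q=0.9,en;q=0.8" -> "fr"
        first = accept_language_header.split(",")[0]
        return first.split(";")[0].split("-")[0].strip().lower()

    system_locale = locale.getdefaultlocale()[0]  # e.g., "fr_FR"
    return system_locale.split("_")[0].lower() if system_locale else "en"


print(detect_preferred_language("fr-FR,fr;q=0.9,en;q=0.8"))  # fr
```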
[00105] In embodiments, preferred language detection block 242 may be implemented in hardware, software, or a combination of hardware and software, or may be implemented using neural networks (e.g., deep learning based, machine learning based) and/or using conventional hardware and software technologies.
[00106] As described above, user profile and preferences information 208a may be used to specify various user preferences. In various embodiments, example instant translation systems 200a-200c may use information specified in user profile and preferences information 208a to tailor the type and appearance of translated output content.
[00107] For example, user profile and preferences information 208a may specify that a user wants translated content to be very high quality, or that lower quality translation is sufficient in exchange for power or resource conservation. Thus, translation block 216 may use this preference information to select a translation quality.
[00108] In other examples, user profile and preferences information 208a may specify that high contrast between the background and foreground translated text is very important, or alternatively may specify that such high contrast is not important. In such a scenario, background and environment analysis block 222 may use this preference information to determine whether and how much to adjust translated text based on the user’s environment. Persons of ordinary skill in the art will understand that other similar user preferences may be used to select various features of the translation operation.
[00109] FIG. 2D is a more detailed block diagram of an embodiment of translation block 210 of FIGS. 2A-2C. In an embodiment, an output of modality detection block 212 is input to a decision block 244 which is configured to determine whether a modality identified by modality detection block 212 is input text (e.g., input text 108a1 and 108a2) (Y) or some other identified modality (e.g., input embedded text, input image objects or input background) (N).
[00110] In an embodiment, if decision block 244 determines that an identified modality is input text (e.g., input text 108a1 and 108a2), the input text is provided to a metadata processing block 246a. In an embodiment, metadata processing block 246a is configured to process metadata included in the input text. For example, Microsoft Office and PDF documents include various metadata fields that can be used to specify various preferences and parameters that may be used by the disclosed text translation technology.
[00111] For example, a permanence indicator may have a first value (e.g., “T”) to specify that the translated output content may be temporary (e.g., viewed, but not printed, saved or forwarded), or a second value (e.g., “P”) to specify that the translated output content may be permanent (e.g., viewed, printed, saved, forwarded).
[00112] In another example, an originality indicator may have a first value (e.g., “O”) to specify that specific text is original, or a second value (e.g., “T”) to specify that specific text is translated.
[00113] In still another example, a copyright indicator may have a first value (e.g., “W”) to specify that a watermark must be embedded in the translated output content, or a second value (e.g., “N”) to specify that a watermark is not needed in the translated output content. Persons of ordinary skill in the art will understand that other similar metadata indicators may be used.
[00114] In addition, metadata processing block 246a may be configured to use a “custom field” for input image objects to specify any of the various metadata indicators discussed above. Also, metadata processing block 246a may be configured to use other fields for input image objects, such as a copyright notice, a language identifier, user preferences, special instructions, or other similar custom field data. Persons of ordinary skill in the art will understand that the disclosed text translation technology may use other similar metadata indicators and fields.
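For illustration only, the permanence, originality, and copyright indicators described above could be represented as a small metadata record that the translation pipeline consults before rendering output content. The single-letter values mirror the examples in the text, but the record layout, field names, and default values are assumptions.

```python
# Illustrative interpretation of metadata indicators for translated output.
DEFAULT_METADATA = {"permanence": "P", "originality": "O", "copyright": "N"}


def apply_metadata_policy(metadata):
    policy = {**DEFAULT_METADATA, **metadata}
    return {
        "allow_save_or_print": policy["permanence"] == "P",  # "T" = view only
        "mark_as_translated": policy["originality"] == "T",
        "embed_watermark": policy["copyright"] == "W",
    }


print(apply_metadata_policy({"permanence": "T", "copyright": "W"}))
# {'allow_save_or_print': False, 'mark_as_translated': False, 'embed_watermark': True}
```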
[00115] In an embodiment, an output of metadata processing block 246a is input to a matching font generation block 248a, a translation block 216a and a localization block 218a. In an embodiment, matching font generation block 248a is configured to identify fonts that match or substantially match fonts in the input text (e.g., input text 108a1 and 108a2). In an embodiment, if matching font generation block 248a cannot identify a matching (or substantially matching) font, matching font generation block 248a may be configured to select a font that is a closest match to the input text.
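A heuristic sketch of matching font generation block 248a is shown below: pick the installed font whose name is closest to the detected font name, falling back to a default when no close match exists. The candidate font list, cutoff, and default are hypothetical; a production system might instead compare rendered glyph shapes or font metrics.

```python
# Illustrative closest-match font selection.
import difflib

INSTALLED_FONTS = ["DejaVu Sans", "DejaVu Serif", "Liberation Sans",
                   "Liberation Serif", "Noto Sans CJK"]


def match_font(detected_font_name, default="DejaVu Sans"):
    matches = difflib.get_close_matches(
        detected_font_name, INSTALLED_FONTS, n=1, cutoff=0.4)
    return matches[0] if matches else default


print(match_font("Helvetica"))   # may fall back to the default sans-serif font
print(match_font("Noto Sans"))   # "Noto Sans CJK"
```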
[00116] In embodiments, translation block 216a and localization block 218a are instances of translation block 216 and localization block 218, respectively, described above and depicted in FIG. 2A. In an embodiment, outputs of matching font generation block 248a, translation block 216a and localization block 218a are input to text generation block 226, described above and depicted in FIG. 2A. In an embodiment, an output of text generation block 226 is input to text insertion block 228, as described above and depicted in FIG. 2A.
[00117] Referring again to decision block 244, if a determination is made that a modality identified by modality detection block 212 is not input text (e.g., input embedded text, input image objects or input background), the content is provided to an embedded text determination block 250 that is configured to determine if the content includes input embedded text (e.g., input embedded text 110a1 , 110a2 and 110a3). If embedded text determination block 250 determines that the content does not include input embedded text, a next page/document block 252 is configured to retrieve or request a next page or document to be translated.
[00118] If, however, embedded text determination block 250 determines that the content includes input embedded text (e.g., input embedded text 110a1, 110a2 and 110a3), OCR block 214 converts determined input embedded text into a machine-readable text format and style detection block 220 determines appearance characteristics of the input embedded text, as described above and depicted in FIG. 2A. [00119] In an embodiment, outputs of OCR block 214 and style detection block 220 are input to a metadata processing block 246b, which is configured to process metadata included in the input embedded text, such as described above regarding metadata processing block 246a. In an embodiment, an output of metadata processing block 246b is input to a matching font generation block 248b, a translation block 216b and a localization block 218b, which are configured to operate in the same manner as matching font generation block 248a, translation block 216a and localization block 218a, described above.
[00120] In an embodiment, background and environment analysis block 222 is configured to distinguish background image components, background pixels and background vector paths from text image components, text pixels and text vector paths, respectively, in the received input content, such as described above and depicted in FIG. 2A. In an embodiment, an output of background and environment analysis block 222 is provided to inpainting block 224, which is configured to remove text from images, fill in missing data in portions of the input image in which the text was removed, and provide the resulting in-painted images to text insertion block 228, as described above and depicted in FIG. 2A.
[00121] In embodiments, each of decision block 244, metadata processing block 246a, matching font generation block 248a, metadata processing block 246b, embedded text determination block 250, and next page/document block 252 may be implemented in hardware, software, or a combination of hardware and software. In embodiments, one or more of decision block 244, metadata processing block 246a, matching font generation block 248a, metadata processing block 246b, embedded text determination block 250, and next page/document block 252 may be implemented using neural networks (e.g., deep learning based, machine learning based) and/or using conventional hardware and software technologies.
[00122] In embodiments, a variety of different system architectures may be used to implement instant translation systems of this technology, such as the example instant translation systems 200a-200c described above. For example, FIG. 3A is a simplified block diagram of an example architecture 300a in which a translation server 302 performs all of the instant translation processing.
[00123] In an embodiment, translation server 302 is configured to receive input content 304 that includes input text in an input language (e.g., English), translate input content 304 to output content 306 that includes output text in an output language (e.g., French), preserving the style of the input text and the non-text features in the input content, and provide the output content 306 to one or more client devices 308 (e.g., a laptop computer 308a, a wearable device 308b, a mobile device 308c, or other similar client device). Persons of ordinary skill in the art will understand that there may be more, fewer, or different client devices 308 than the example client devices depicted in FIG. 3A.
[00124] Input content 304 may be a document (e.g., a Word document), a presentation slide (e.g., a PowerPoint slide), a web page, or other similar content. Input content 304 may be provided by a presenter, such as a speaker doing a presentation at a meeting while input content 304 is livestreamed over a private or public network (e.g., the Internet) to translation server 302.
[00125] Alternatively, a presenter may upload or otherwise transmit input content 304 (e.g., a PDF document) to translation server 302. In yet another alternative, the presenter may be a content server (e.g., an online bookstore, a news organization, or other similar content server) providing input content 304 to translation server 302. Persons of ordinary skill in the art will understand that input content 304 may be some other type of content, and other types of presenters may provide input content 304 to translation server 302.
[00126] Although a single translation server 302 is shown connected to three client devices 308, in other embodiments multiple translation servers 302 may be used, and each translation server 302 may be coupled to a corresponding one of client devices 308a, 308b and 308c. In embodiments, the one or more translation servers 302 may be part of a cloud service 310, which in various embodiments may provide cloud computing services dedicated to text translation. [00127] In still other embodiments, translation servers 302 are not part of a cloud service but may be one or more translation servers that are operated by a single enterprise, such that the network environment is owned and contained by a single entity (such as a corporation) and in which client devices 308a, 308b and 308c are all connected via the private network of the entity.
[00128] Lines between client devices 308a, 308b and 308c and translation server 302 represent network connections which may be wired or wireless and which may include one or more public and/or private networks.
[00129] Although not depicted in FIG. 3A, one or more network nodes may be disposed between the one or more translation servers 302 and client devices 308a, 308b and 308c. In embodiments, the network nodes may include “edge” nodes that are generally one network hop from client devices 308a, 308b and 308c. In embodiments, each network node may be a switch, router, processing device, or other network-coupled processing device which may or may not include data storage capability, allowing output content 306 to be stored in the node for distribution to client devices 308a, 308b and 308c. In other embodiments, network nodes may be basic network switches having no available caching memory.
[00130] In an embodiment, translation server 302 includes an interaction engine 312, a localization engine 314, a translation engine 316, a text and style detection, processing and analysis engine 318, an image/video/graphics processing and analysis engine 320, a rendering engine 322, and a watermark engine 324. In an embodiment, translation server 302 also may include user profile and preferences information 208.
[00131] In other embodiments, translation server 302 may include, more, fewer, or different components than the example components depicted in FIG. 3A. In addition, the functions of two or more of the example components depicted in FIG. 3A may be combined into a single component.
[00132] In embodiments, translation server 302 communicates with users via interaction engine 312, for example, via a user interface for exchanging information with users of client devices 308a, 308b and 308c. In embodiments, interaction engine 312 may provide an app-based or operating system-based user interface.
[00133] For example, interaction engine 312 may provide a user interface for users to specify the input content (or portions of input content) to be translated, the input language, and the output language. In embodiments, interaction engine 312 may provide a user interface that allows users to download, save and/or print the translated output content. In embodiments, interaction engine 312 may extract various translation parameters, such as desired output language, translation quality, and other translation parameters from user profile and preferences information 208.
[00134] In embodiments, translation server 302 uses localization engine 314 and translation engine 316 to translate the specified input content from the input language to the desired output language. In an embodiment, localization engine 314 uses grammar, syntax rules and cultural aspects of language to assist translation engine 316 to correctly translate the meaning of input text. In embodiments, translation engine 316 may perform the translation using artificial intelligence-based translation systems, such as Google Translate, Quillbot, DeepL, Smartling or other similar artificial intelligence-based translation systems.
[00135] In embodiments, translation server 302 uses text and style detection, processing and analysis engine 318 to identify input text objects in the input content, and also identify the input language (e.g., English) of the input text objects. In embodiments, translation server 302 uses text and style detection, processing and analysis engine 318 to determine appearance characteristics of identified input text objects, such as font, color, size, shape, typeface (e.g., roman, bold, italics), underlining, dimension (e.g., 2D or 3D), relative location, orientation, and other similar appearance characteristics of identified input text objects. In embodiments, translation server 302 uses text and style detection, processing and analysis engine 318 to generate output text translated by translation engine 316.
[00136] In embodiments, translation server 302 uses image/video/graphics processing and analysis engine 320 to identify input embedded text, input image objects and input background in the input content. In embodiments, translation server 302 also uses image/video/graphics processing and analysis engine 320 to classify any identified input image objects as including specific types of input image objects (e.g., Al images, graphics, photograph, chart, video), and any identified input background (e.g., color and shading).
[00137] In embodiments, translation server 302 also uses image/video/graphics processing and analysis engine 320 to distinguish background image components, background pixels and background vector paths from text image components, text pixels and text vector paths, respectively, in the input content, and to enhance contrast between background image components, background pixels and background vector paths and text image components, text pixels and text vector paths, respectively, in translated text (e.g., to aid text identification in scenarios in which text is very faint and difficult to see).
[00138] In embodiments, translation server 302 also uses image/video/graphics processing and analysis engine 320 to adjust the color and contrast of the translated text based on the user’s environment to make it easier for the user to read the translated text.
[00139] In embodiments, translation server 302 uses rendering engine 322 to perform inpainting to remove input embedded text from images, and fill-in missing data in portions of the input image in which the input embedded text was removed. In embodiments, translation server 302 also uses rendering engine 322 to insert the translated output text generated by text and style detection, processing and analysis engine 318 into the in-painted images.
[00140] In embodiments, translation server 302 uses watermark engine 324 to insert any necessary watermark in the translated output content.
[00141] Persons of ordinary skill in the art will understand that the various elements depicted in translation server 302 of FIG. 3A are examples, and the functions of one or more of interaction engine 312, localization engine 314, translation engine 316, text and style detection, processing and analysis engine 318, image/video/graphics processing and analysis engine 320, rendering engine 322, and watermark engine 324 may be combined, or performed by different ones of the example engines described above. In addition, persons of ordinary skill in the art will understand that translation server 302 may include more, fewer or different engines than the examples depicted in FIG. 3A.
[00142] FIG. 3B is a block diagram of a network processing device 326 that can be used to implement various embodiments of translation server 302 of FIG. 3A. Specific network processing devices may use all of the components shown, or only a subset of the components, and levels of integration may vary from device to device. Furthermore, network processing device 326 may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc.
[00143] In an embodiment, network processing device 326 includes a processing unit 328 equipped with one or more input/output devices, such as network interfaces, storage interfaces, and the like. In an embodiment, processing unit 328 includes a central processing unit (CPU) 330, a memory 332, a mass storage device 334, a network interface 336 and an I/O interface 338, all connected to a bus 340.
[00144] In an embodiment, bus 340 may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus, or the like. In an embodiment, network interface 336 enables network processing device 326 to communicate over a network 342 (e.g., the Internet) with other processing devices such as those described herein.
[00145] In embodiments, CPU 330 may include any type of electronic data processor. In embodiments, memory 332 may include any type of system memory such as static random-access memory (SRAM), dynamic random-access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like. In embodiments, memory 332 may include ROM for use at bootup, and DRAM for program and data storage for use while executing programs.
[00146] In an embodiment, memory 332 includes computer readable instructions that are executed by CPU 330 to implement embodiments of the disclosed technology, including interaction engine 312, localization engine 314, translation engine 316, text and style detection, processing and analysis engine 318, image/video/graphics processing and analysis engine 320, rendering engine 322, watermark engine 324, and user profile and preferences information 208. In embodiments, the functions of interaction engine 312, localization engine 314, translation engine 316, text and style detection, processing and analysis engine 318, image/video/graphics processing and analysis engine 320, rendering engine 322, and watermark engine 324 are described herein in various flowcharts and figures.
[00147] In embodiments, mass storage device 334 may include any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via bus 340. In embodiments, mass storage device 334 may include, for example, one or more of a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, or the like.
[00148] As described above, FIG. 3A depicts an example architecture 300a in which translation server 302 performs all of the translation processing. Such an architecture is useful for scenarios in which client devices 308a, 308b, 308c have limited processing power and capabilities. In other scenarios, client devices 308a, 308b, 308c may have high processing power and capabilities. FIG. 3C depicts an alternative example architecture 300b for such scenarios, in which client devices 308a, 308b, 308c perform all of the translation processing.
[00149] In embodiments, each of client devices 308a, 308b, 308c may include a processor 344 that performs all of the translation processing. In embodiments, processor 344 is configured to receive input content 304 that includes input text in an input language (e.g., English), translate input content 304 to output content 306 that includes output text in an output language (e.g., French), preserving the style of the input text and the non-text features in the input content, and display or save the output content 306 on the respective client device 308 (e.g., laptop computer 308a, wearable device 308b, mobile device 308c, or other similar client device). Persons of ordinary skill in the art will understand that there may be more, fewer, or different client devices 308 than the example client devices depicted in FIG. 3C. [00150] In embodiments, processor 344 includes interaction engine 312, localization engine 314, translation engine 316, text and style detection, processing and analysis engine 318, image/video/graphics processing and analysis engine 320, rendering engine 322, watermark engine 324 and user profile and preferences information 208, such as described above and depicted in FIG. 3A.
[00151] In embodiments, client devices 308a, 308b, 308c are coupled to a network 346 (e.g., a public network, a private network, a local area network, the Internet or other similar network). In embodiments, processor 344 uses interaction engine 312 to receive input content 304 via network 346.
[00152] In embodiments, processor 344 performs translation processing using interaction engine 312, localization engine 314, translation engine 316, text and style detection, processing and analysis engine 318, image/video/graphics processing and analysis engine 320, rendering engine 322, watermark engine 324 and user profile and preferences information 208, such as described above and depicted in FIG. 3A. In an embodiment, the example network processing device 326 of FIG. 3B, described above, can be used to implement various embodiments of processor 344 of FIG. 3C.
[00153] As described above, FIG. 3A depicts an example architecture 300a in which translation server 302 performs all of the translation processing, and FIG. 3C depicts an alternative example architecture 300b in which client devices 308a, 308b, 308c perform all of the translation processing. FIG. 3D depicts an example architecture 300c in which translation processing is divided between translation server 302 and client devices 308a, 308b, 308c.
[00154] In the illustrated example, translation server 302 uses localization engine 314, translation engine 316, text and style detection, processing and analysis engine 318, image/video/graphics processing and analysis engine 320, watermark engine 324 and user profile and preferences information 208 such as described above, and processor 344 on client devices 308a, 308b, 308c uses interaction engine 312, rendering engine 322, watermark engine 324 and user profile and preferences information 208 such as described above, to divide translation processing operations between translation server 302 and client devices 308a, 308b, 308c. Persons of ordinary skill in the art will understand that other architectures with other divisions of operations between translation server 302 and client devices 308a, 308b, 308c also may be used.
[00155] FIG. 3E is a simplified block diagram of still another example architecture 300d that includes a sender client device 348 coupled via network 346 to receiver client devices 308a, 308b and 308c. In an embodiment, sender client device 348 includes input content 304 and also includes processor 344 configured to perform all of the translation processing for translating input content 304 to output content 306, substantially matching appearance characteristics of the input text and substantially matching appearance characteristics of non-text features in the input content. In an embodiment, sender client device 348 is configured to provide the output content 306 via network 346 to receiver client devices 308a, 308b, and 308c.
[00156] Example architecture 300d may be referred to as “sender side processing.” For example, such sender side processing may be useful if a sending user has copyright permission from a content owner and sender client device 348 has ample processing power, but one or more of receiver client devices 308a, 308b and 308c is a very simple mobile device that has low computing resources. In this embodiment, sender client device 348 performs as much processing as possible to accommodate the limited processing power of receiver client devices 308a, 308b and 308c.
[00157] In other architecture embodiments, some translation processing may be performed on translation server 302, some translation processing may be performed by edge computing centers, and some translation processing may be performed by client devices. Edge computing centers can be deployed at edges of a communication network such that computing resources can be available in close proximity to end user clients. In this way, the edge computing centers can be employed to support computation-intensive and latency-sensitive applications at user equipment having limited resources.
[00158] In the example embodiments described above, translation server 302 and/or processor 344 may be configured to permit selective translation of input content. For example, translation server 302 and/or processor 344 may use interaction engine 312 to provide a user interface that allows a user to interactively select to translate a part or all of input content. Also, a content owner may specify copyright rules that permit partial or full translation. For example, a content owner may permit translation of text, but prohibit translation of embedded text in images or videos.
[00159] In still other embodiments, translation server 302 and/or processor 344 may use interaction engine 312 to allow a user to specify that only certain words or phrases may be translated, or may allow translation of everything except certain words (e.g., product names, brand names, trademarks). In additional embodiments, translation server 302 and/or processor 344 may use interaction engine 312 to allow a user to specify preferences or presets to automatically select part of the input content to translate.
[00160] FIG. 4 is a flowchart of an example method 400 of instantly translating text. Example method 400 may be implemented by any of example instant translation systems 200a-200c described above and depicted in FIGS. 2A-2C, using any of example architectures 300a-300d described above and depicted in FIGS. 3A and 3C-3E.
[00161] At step 402, receive input content. For example, as described above, input content (e.g., input content 100a of FIG. 1A) may be received from a user via a user interface, a mobile phone camera, an email interface or other method.
[00162] At step 404, identify input text objects in the input content. For example, as described above, input text objects 102a (input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3) may be identified in input content 100a of FIG. 1A.
[00163] At step 406, translate the identified input text objects from an input language to an output language. For example, as described above, input text objects 102a (input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3) may be translated from a first language (e.g., English) to a second language (e.g., French).
[00164] At step 408, determine an appearance characteristic of the identified input text objects. For example, as described above, one or more appearance characteristics (e.g., font style, color, size, shape, typeface, underlining, dimension, relative location, and orientation) of input text objects 102a (input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3) may be identified.
[00165] At step 410, generate output text objects that include the translated identified input text objects and that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input text objects. For example, as described above, output text objects 102b (output text 108b1 and 108b2, and output embedded text 110b1 , 110b2 and 110b3) are generated that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input text objects 102a (input text 108a1 and 108a2, and input embedded text 110a1 , 110a2 and 110a3).
[00166] At step 412, generate output content by replacing the identified input text objects with the generated output text objects. As described above and depicted in FIG. 1B, output content 100b is generated by replacing the identified input text objects 102a (input text 108a1 and 108a2, and input embedded text 110a1, 110a2 and 110a3) with the generated output text objects 102b (output text 108b1 and 108b2, and output embedded text 110b1, 110b2 and 110b3).
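For illustration, the steps of method 400 could be composed into a single pipeline as sketched below. The individual stages are passed in as callables (for example, the detection, translation, style-detection, inpainting, and insertion sketches shown earlier); the function signature, image-format conversions, and error handling are all hypothetical and omitted details, not required structure of the disclosed method.

```python
# Hypothetical end-to-end composition of method 400.
def instant_translate(content, detect, translate, detect_style, erase, insert,
                      output_language="fr"):
    # Steps 402-404: receive content and identify input text objects/language.
    text_objects, input_language = detect(content)
    for obj in text_objects:
        # Step 406: translate the identified text.
        translated = translate(obj["text"], input_language, output_language)
        # Step 408: determine appearance characteristics of the original text.
        style = detect_style(content, obj["box"])
        # Steps 410-412: erase the original text, then generate output content
        # by drawing the translated text with the matched appearance.
        content = erase(content, [obj["box"]])
        content = insert(content, translated, style)
    return content
```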
[00167] For the purposes of this document, it should be noted that the dimensions of the various features depicted in the figures may not necessarily be drawn to scale.
[00168] For purposes of this document, reference in the specification to “an embodiment,” “one embodiment,” “some embodiments,” or “another embodiment” may be used to describe different embodiments or the same embodiment.
[00169] For the purposes of this document, a connection may be a direct connection or an indirect connection (e.g., via one or more other parts). In some cases, when an element is referred to as being connected or coupled to another element, the element may be directly connected to the other element or indirectly connected to the other element via intervening elements. When an element is referred to as being directly connected to another element, then there are no intervening elements between the element and the other element. Two devices are “in communication” if they are directly or indirectly connected so that they can communicate electronic signals between them.
[00170] Although the present disclosure has been described with reference to specific features and embodiments thereof, various modifications and combinations can be made thereto without departing from the scope of the disclosure. The specification and drawings are, accordingly, to be regarded simply as an illustration of the disclosure as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present disclosure.
[00171] The present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims.
[00172] Furthermore, in the above detailed description of the present subject matter, numerous specific details are set forth to provide a thorough understanding of the present subject matter. However, persons of ordinary skill in the art will understand that the present subject matter may be practiced without such specific details.
[00173] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. Each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
[00174] These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
[00175] The technology described herein can be implemented using hardware, software, or a combination of both hardware and software. The software used can be stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media.
[00176] By way of example, and not limitation, computer readable media may include computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
[00177] Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by a computer. A computer readable medium or media does (do) not include propagated, modulated, or transitory signals.
[00178] The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure.
[00179] The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.
[00180] For purposes of this document, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.
[00181] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
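By way of example, and not limitation, the following Python sketch outlines one possible arrangement of the text-translation operations described above. It is an illustrative sketch only: the TextObject structure, the BoundingBox type, and the translate callable are hypothetical placeholders introduced for this example, and detection of text objects in the input content and compositing of the final output content are assumed to occur elsewhere.

```python
# Illustrative, non-limiting sketch only. TextObject, BoundingBox, and the
# translate callable are hypothetical placeholders; detection of text objects
# and compositing of the output content are assumed to happen elsewhere.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

BoundingBox = Tuple[int, int, int, int]  # (x, y, width, height)


@dataclass
class TextObject:
    region: BoundingBox       # relative location of the text in the content
    text: str                 # recognized text in the input or output language
    style: Dict[str, object]  # appearance characteristics, e.g. font, size, color


def translate_preserving_appearance(
    input_text_objects: List[TextObject],
    translate: Callable[[str], str],  # e.g. a machine-translation service call
) -> List[TextObject]:
    """Generate output text objects that carry the translated text while
    substantially matching the appearance of the input text objects."""
    output_text_objects = []
    for obj in input_text_objects:
        output_text_objects.append(
            TextObject(
                region=obj.region,         # keep the relative location
                text=translate(obj.text),  # input language -> output language
                style=dict(obj.style),     # preserve appearance characteristics
            )
        )
    return output_text_objects


if __name__ == "__main__":
    # Trivial stand-in translator used only to exercise the sketch.
    inputs = [TextObject((10, 20, 200, 40), "Hola", {"font": "Arial", "size": 18})]
    print(translate_preserving_appearance(inputs, lambda text: "Hello"))
```

In a complete implementation, the generated output text objects would replace the identified input text objects in the output content, for example by removing the input text objects, restoring the underlying background, and rendering the output text objects at the same relative locations.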

Claims

1. A computer implemented method of instantly translating text at a processing device, comprising:
receiving, at the processing device, input content;
identifying, at the processing device, input text objects in the input content;
translating the identified input text objects from an input language to an output language;
determining, at the processing device, an appearance characteristic of the identified input text objects;
generating, at the processing device, output text objects that comprise the translated identified input text objects and that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input text objects; and
generating, at the processing device, output content by replacing the identified input text objects with the generated output text objects.
2. The computer implemented method of claim 1, wherein the identified input text objects comprise one or more of symbols, characters, and numbers.
3. The computer implemented method of any preceding claim, wherein the identified input text objects comprise text embedded within image objects.
4. The computer implemented method of any preceding claim, wherein an appearance characteristic of the identified input text objects includes one or more of a font style, color, size, shape, typeface, underlining, dimension, relative location, and orientation.
5. The computer implemented method of any preceding claim, wherein replacing comprises removing the identified input text objects and inserting the generated output text objects.
6. The computer implemented method of any preceding claim, further comprising:
identifying, at the processing device, input image objects in the input content;
determining, at the processing device, an appearance characteristic of the identified input image objects; and
generating, at the processing device, output image objects that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input image objects;
wherein generating the output content further comprises replacing the identified input image objects with the generated output image objects.
7. The computer implemented method of claim 6, wherein an appearance characteristic of the identified input image objects includes non-text image details of the identified input image objects.
8. The computer implemented method of any preceding claim, further comprising:
identifying, at the processing device, an input background in the input content;
determining, at the processing device, an appearance characteristic of the identified input background; and
generating, at the processing device, an output background that has an appearance characteristic that substantially matches the determined appearance characteristic of the identified input background;
wherein generating the output content further comprises replacing the identified input background with the generated output background.
9. The computer implemented method of claim 8, wherein an appearance characteristic of the identified input background includes one or more of a color, a shading, and a simulated texture.
10. The computer implemented method of any preceding claim, wherein the processing device comprises any of a server and a client device.
11. The computer implemented method of any preceding claim, wherein a first subset of the method steps is performed by a server and a second subset of the method steps is performed by a client device.
12. The computer implemented method of any preceding claim, wherein the method steps are performed by a sender client device, which is configured to provide the output content to one or more receiver client devices.
13. The computer implemented method of any preceding claim, further comprising:
estimating a computing capability of a user processing device; and
selectively performing the determining and generating steps based on the estimated computing capability.
14. The computer implemented method of claim 13, further comprising:
determining from the estimated computing capability that the user processing device comprises a first power type device, and generating, at the processing device, output text objects that comprise the translated identified input text objects but without preserving appearance characteristics of the identified input text objects; and
determining from the estimated computing capability that the user processing device comprises a second power type device having greater computing capability than the first power type device, and generating, at the processing device, output text objects that comprise the translated identified input text objects while preserving appearance characteristics of the identified input text objects.
15. The computer implemented method of any preceding claim, further comprising adjusting, at the processing device, a color and a contrast of the output text objects based on a user’s ambient environment.
16. A non-transitory computer-readable medium storing computer instructions for instantly translating text at a processing device that, when executed by one or more processors, cause the one or more processors to perform the steps of:
receiving, at the processing device, input content;
identifying, at the processing device, input text objects in the input content;
translating the identified input text objects from an input language to an output language;
determining, at the processing device, an appearance characteristic of the identified input text objects;
generating, at the processing device, output text objects that comprise the translated identified input text objects and that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input text objects; and
generating, at the processing device, output content by replacing the identified input text objects with the generated output text objects.
17. The non-transitory computer-readable medium of claim 16, wherein the identified input text objects comprise one or more of symbols, characters, and numbers.
18. The non-transitory computer-readable medium of claims 16 through 17, wherein the identified input text objects comprise text embedded within image objects.
19. The non-transitory computer-readable medium of claims 16 through 18, wherein an appearance characteristic of the identified input text objects includes one or more of a font style, color, size, shape, typeface, underlining, dimension, relative location, and orientation.
20. The non-transitory computer-readable medium of claims 16 through 19, wherein replacing comprises removing the identified input text objects and inserting the generated output text objects.
21. The non-transitory computer-readable medium of claims 16 through 20, further comprising:
identifying, at the processing device, input image objects in the input content;
determining, at the processing device, an appearance characteristic of the identified input image objects; and
generating, at the processing device, output image objects that have an appearance characteristic that substantially matches the determined appearance characteristic of the identified input image objects;
wherein generating the output content further comprises replacing the identified input image objects with the generated output image objects.
22. The non-transitory computer-readable medium of claim 21, wherein an appearance characteristic of the identified input image objects includes non-text image details of the identified input image objects.
23. The non-transitory computer-readable medium of claims 16 through 22, further comprising:
identifying, at the processing device, an input background in the input content;
determining, at the processing device, an appearance characteristic of the identified input background; and
generating, at the processing device, an output background that has an appearance characteristic that substantially matches the determined appearance characteristic of the identified input background;
wherein generating the output content further comprises replacing the identified input background with the generated output background.
24. The non-transitory computer-readable medium of claim 23, wherein an appearance characteristic of the identified input background includes one or more of a color, a shading, and a simulated texture.
25. The non-transitory computer-readable medium of claims 16 through 24, wherein the processing device comprises any of a server and a client device.
26. The non-transitory computer-readable medium of claims 16 through 25, wherein a first subset of the method steps is performed by a server and a second subset of the method steps is performed by a client device.
27. A user equipment device comprising:
a non-transitory memory storage comprising instructions; and
one or more processors in communication with the memory storage, wherein the one or more processors execute the instructions to cause the device to:
receive, at the user equipment device, input content;
identify, at the user equipment device, input text objects in the input content;
translate selective ones of the identified input text objects from an input language to an output language;
determine, at the user equipment device, an appearance characteristic of the selective ones of the identified input text objects;
generate, at the user equipment device, output text objects that comprise the translated selective ones of the identified input text objects and that have an appearance characteristic that substantially matches the determined appearance characteristic of the selective ones of the identified input text objects; and
instantly generate, at the user equipment device, output content by replacing the selective ones of the identified input text objects with the generated output text objects.
28. The user equipment device of claim 27, wherein the one or more processors execute the instructions to further cause the device to provide a user interface for specifying the selective ones of the identified input text objects.
29. The user equipment device of claim 28, wherein the one or more processors execute the instructions to further cause the device to provide the user interface for specifying the output language.
30. The user equipment device of claim 27, wherein the one or more processors execute the instructions to further cause the device to determine from one or more of a user profile and system settings the selective ones of the identified input text objects.
31. The user equipment device of claims 27 through 29, wherein the one or more processors execute the instructions to further cause the device to receive from the user interface the output language.
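By way of further example, and not limitation, the following Python sketch illustrates how a processing device might selectively perform appearance preservation based on an estimated computing capability of a user processing device, and how a contrast adjustment might respond to a user's ambient environment, in the manner of claims 13 through 15 above. The threshold, the capability score scale, and the illuminance values are hypothetical values chosen only for illustration.

```python
# Illustrative, non-limiting sketch only. The threshold, the capability score
# scale, and the ambient-light values below are hypothetical.
from typing import Dict


def choose_processing_mode(estimated_capability: float, threshold: float = 0.5) -> str:
    """Return 'plain' for a first (lower) power type device and
    'appearance_preserving' for a second (higher) power type device."""
    return "appearance_preserving" if estimated_capability >= threshold else "plain"


def adjust_for_ambient_environment(style: Dict[str, float], ambient_lux: float) -> Dict[str, float]:
    """Raise the contrast of output text objects in a bright environment."""
    adjusted = dict(style)
    if ambient_lux > 10_000:  # roughly outdoor daylight
        adjusted["contrast"] = min(1.0, adjusted.get("contrast", 0.5) + 0.3)
    return adjusted


if __name__ == "__main__":
    print(choose_processing_mode(0.8))                             # appearance_preserving
    print(adjust_for_ambient_environment({"contrast": 0.4}, 20_000))
```

Under this arrangement, a lower-power device could fall back to plain translated text, while a higher-power device preserves the appearance characteristics of the input text objects.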