WO2016166416A1 - Method and system for temporal predictions - Google Patents
Method and system for temporal predictions
- Publication number
- WO2016166416A1 (PCT/FI2016/050243)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document
- temporal
- event
- time
- collected
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
Definitions
- the present invention relates to computer-implemented methods and systems for predicting needs for resources.
- Truve discloses methods for automatic processing of information.
- Truve discloses a wide area network fact information service method, including: storing a plurality of canonical fact entries in different subject matter domains, wherein the canonical fact entries each correspond to an occurrence and each include an occurrence date for that occurrence, storing one or more fact descriptor entries for each of the canonical fact entries, ranking the canonical fact entries in each subject matter domain relative to each other based on the descriptor entries, the occurrence dates, and on at least one ranking measure for that subject matter domain, and wherein at least some of the occurrence dates are extracted from content meaning in one or more textual sources about the occurrence.
- An aspect of the present invention is a method according to claim 1.
- Other aspects include systems and computer program products according to the other independent claim.
- the dependent claims and the following description and drawings disclose various optional features, use cases or the like.
- the present disclosure is used to predict progress of resource requirements.
- resources whose requirements may be predicted, include workforce, education, energy resources, raw materials, construction materials, and transportation resources.
- the present disclosure can also be used to make predictions concerning market share, expansion, growth, and social or device connectivity.
- the source credibility value of a data source is repeatedly updated based on facts identified from documents collected from that data source.
- a data source initially associated with a low source credibility value may have its source credibility value increased if predictions based on documents collected from that data source turn out to be accurate, or vice versa.
- the determining of the associated document credibility value for each collected document, using the set of preselected criteria comprises one or more of: registering a first time stamp, which describes a publication time of the collected document; registering a second time stamp, which describes a collection time of the collected document; determining the data source of the collected document and setting the document credibility value of the collected document based on the source credibility value of the data source; determining and registering a place of publication of the collected document; and providing the collected document with a document tag, which comprises at least one time stamp and the data source.
- This embodiment utilizes the fact that some sources have been known to yield more accurate predictions, and thereby earn better source credibility than other sources do.
- the linguistic analysis comprises one or more of: removing stop words from the collected document; extracting nouns, adjectives and verbs from the collected document and searching selected keywords among the extracted nouns, adjectives and verbs; classifying at least extracted nouns from the collected document into pre-selected topical fields; extracting temporal specifiers, generating numerical forms of the extracted temporal specifiers and placing each of the generated numerical forms on a timeline; and counting occurrences of the keywords.
- This embodiment provides the additional benefit of defining temporal items for each processed content, which allows the content or section (a sentence of text) to be placed at a specific future point in time when there are multiple placements of the same content of the same industry at the same temporal position in the future. This increases the likelihood of the event/product/service being produced or being available at that predicted time, because opportunity creates investment for research and development resources to fulfill the predicted demand.
- the linguistic analysis comprises extracting temporal specifiers by using a pre-formed vocabulary, which comprises plural time-related expressions in a language of the collected document.
- This may involve a recursive self-learning system wherein, each time a process is performed, the next time there is the same or similar source text with temporal data, the process can be performed quicker and placement to a future time point is made according to the content and industry in question.
- the self-learning ability of the system may be based on recursive placement of temporal positions in future time. Eventually there will be large numbers (up to thousands) of temporal terms or specifiers that the system has processed.
- the method may comprise ranking the collected documents by importance in each of multiple topical fields, based on occurrences of keywords, document temporal match and the document credibility value of the document.
- the method may comprise presenting the collected documents in the ranked order in each of multiple topical fields.
- the method may comprise repeating the prediction in one or more selected periods of time, and estimating a change of the time range of the at least one event based on a comparison of the one or more repeated predictions with one or more earlier predictions.
- the estimating of the change may comprise repeating the collecting of at least some of the documents from one or more selected data sources; repeating the extracting of at least one temporal specifier; updating the determining of the time range of the at least one event based on a comparison of each temporal specifier obtained from the repeated extraction with a corresponding temporal specifier extracted earlier.
- Figure 1 is a flow chart showing basic steps of a method according to an embodiment of the present disclosure
- Figure 2 is a flow chart illustrating an embodiment for determination of a document credibility value
- Figure 3 is a flow chart illustrating an embodiment for a linguistic analysis
- Figure 4a shows a fragment of a vocabulary usable in an embodiment of the present disclosure
- Figure 4b shows a fragment of a stop word list usable in an embodiment of the present disclosure
- Figure 5 illustrates an extraction step of a linguistic analysis being applied on a fragment from a document
- Figure 6 is a flow chart illustrating additional steps for updating and assessing predictions
- Figures 7a through 7d illustrate change of a point of time as a result of an updated prediction
- Figure 8 shows an IT system for implementing embodiments of the present disclosure.
- Figure 9 shows a report generated by a method according to an embodiment of the present disclosure.
- Figure 8 shows an IT system for implementing embodiments of the present disclosure.
- the presently disclosed methods are carried out by executing program code instructions 34 in a processor (shown as laptop computer) 30, which uses data network(s), such as the internet, to access cloud-based documents 12 provided by multiple data sources 50.
- Reference number 36 denotes a database, which stores sets of documents deemed relevant for selected predictions.
- Reference number 32 denotes a vocabulary (typically one of many). Each vocabulary 32 is specific to a human language and within each language there may be multiple vocabularies each of which is specific to a topical field, such as electronics, chemistry, and so on.
- Figure 4a shows a fragment of a vocabulary 32 usable in an embodiment of the present disclosure.
- Each vocabulary 32 comprises, as exhaustively as possible, the words of the language and indicates a word class for each word, such as noun, adjective, verb, and so on.
- the vocabularies of the present disclosure indicate temporal specifiers and stop words, which may be implemented as extensions to traditional grammatical word classes or as distinct lists. Examples of temporal specifiers include "today", "this week", "next year", and so on.
- the system may also comprise a data source register (not shown separately), which stores identifications and access resources (such as URLs) of data sources.
- the data source register may also store topical fields and credibility values for the data sources. For instance, a well-known news agency may have an initial credibility value of 0.99, while the initial credibility value for a tabloid magazine may be much lower, such as 0.1.
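A minimal sketch of such a data source register, assuming a simple in-memory mapping. The names `DataSource`, `register` and `sources_for_field`, as well as the example URLs, are hypothetical and do not come from the disclosure; only the credibility values 0.99 and 0.1 follow the example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """One entry of a hypothetical data source register."""
    name: str
    url: str                      # access resource for the data source
    credibility: float            # source credibility value in [0, 1]
    topical_fields: list = field(default_factory=list)


# Illustrative entries; the URLs and topical fields are placeholders.
register = {
    "news-agency": DataSource("news-agency", "https://news.example.org", 0.99,
                              ["politics", "technology"]),
    "tabloid": DataSource("tabloid", "https://tabloid.example.org", 0.1,
                          ["entertainment"]),
}


def sources_for_field(register, topical_field):
    """Return the sources associated with a topical field, most credible first."""
    matches = [s for s in register.values() if topical_field in s.topical_fields]
    return sorted(matches, key=lambda s: s.credibility, reverse=True)
```

Keeping the credibility value on the register entry means a later update of a source's credibility is picked up by every subsequent lookup.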
- Figure 4b shows a fragment of a stop word list usable in an embodiment of the present disclosure. Stop words are words that are ignored in searches.
- the illustrated stop word list comprises tuples, each having an ID field, stop word, creation date and modification date.
- the steps for generating a prediction may be compressed into six main phases:
- A. Scanning which comprises searching for temporal specifiers and keywords within the documents used as source material.
- Processing which comprises forming a view of the process of events and time range in the future.
- the process being described herein begins from registering a keyword in memory at 102.
- the keyword means a word indicating the selected topic, such as "fusion reactor”.
- There are typically several keywords, which may be hierarchically coupled in the dictionary. For instance, "fusion reactor" may belong in a broader keyword "nuclear", which in turn belongs in a broader keyword "energy technology" and onwards to "energy industry".
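The hierarchical coupling of keywords can be sketched as a chain of parent links. This is an illustration only: the `BROADER` mapping and the function `broader_chain` are hypothetical names, and the disclosure does not specify how the dictionary stores the hierarchy.

```python
# Hypothetical parent links: each keyword points to its broader keyword.
BROADER = {
    "fusion reactor": "nuclear",
    "nuclear": "energy technology",
    "energy technology": "energy industry",
}


def broader_chain(keyword, broader=BROADER):
    """Return the keyword followed by all of its broader keywords."""
    chain = [keyword]
    seen = {keyword}
    while chain[-1] in broader:
        parent = broader[chain[-1]]
        if parent in seen:  # guard against accidental cycles in the mapping
            break
        chain.append(parent)
        seen.add(parent)
    return chain
```

A query for "fusion reactor" could then also be matched against documents filed under any keyword in the returned chain.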
- some or all of the keywords relating to the topic being predicted may be obtained from a user.
- each keyword is classified into a topical field at 104.
- classification may be automatic, based on a pre-generated keyword list, or it may be performed or assisted by one or more users.
- the keyword "fusion reactor” may be classified in topical fields “technology” and/or "energy technology”.
- collection of information begins at 106.
- the collection of information typically comprises queries to document data bases, search engines, or the like, via the internet, whereby information, typically in the form of documents relating to the selected keywords, may be collected automatically.
- Some implementations of the process may involve associating topical fields with data sources, which are more than usually relevant to the topical field and whose contents are ideally collected for generation of the prediction. Alternatively, information may be collected from all data sources associated with a given topical field.
- Collection of information typically involves use of one or more internet search engines, such as Google (Google, Inc.) or Bing (Microsoft Corp.) which search for documents relevant to the keywords given in the query to the search engine.
- the documents, whose identifiers (URLs) are returned by the search engines may be stored on a cloud server.
- the stored information may be automatically analyzed in order to determine relevant documents, while new information may be stored to replace information stored earlier, so as to keep storage requirements within reasonable limits. Keyword-based selection of information to be stored is preferably applied prior to storing the information, so as to avoid storing documents which are irrelevant to the selected keyword(s).
- the information collection may also comprise use of filters to prevent storing of duplicates of documents. After storing the documents, a credibility value is determined for each document at 108.
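A minimal sketch of keyword-based selection combined with a duplicate filter, applied before a document is stored. The function name `should_store` and the use of a SHA-256 digest over normalized text are assumptions made for illustration; the disclosure does not say how duplicates are detected.

```python
import hashlib


def should_store(text, keywords, seen_hashes):
    """Decide whether a collected document should be stored.

    A document is kept only if it mentions at least one keyword and its
    normalized text has not been stored before.
    """
    normalized = " ".join(text.lower().split())
    if not any(k.lower() in normalized for k in keywords):
        return False  # irrelevant to the selected keyword(s)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False  # duplicate after case and whitespace normalization
    seen_hashes.add(digest)
    return True
```

This only catches exact duplicates after normalization; near-duplicates, such as the same news item with a different byline, would need a fuzzier comparison.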
- FIG. 2 is a flow chart illustrating an embodiment for determination of a document credibility value, which is shown as a single step 108 in Figure 1. Some steps of Figure 2 may be omitted in some implementations.
- Step 202 comprises registering a first time stamp, which describes a publication time of the collected document.
- Step 204 comprises registering a second time stamp, which describes a collection time of the collected document.
- Step 206 comprises determining the data source of the collected document.
- Step 208 comprises setting the document's credibility value.
- a document's initial credibility value may correspond to or be based on the source credibility value of the data source the document was collected from. Alternatively or additionally, the document's initial credibility value may be based on the first and/or second time stamps.
- a document's credibility value may be lowered as the document ages, as indicated by the first and/or second time stamps.
- Step 210 comprises determining and registering a place of publication of the collected document.
- the document's credibility value may be adjusted based on place of publication, with or without regard to the topic. For instance, a region having universities, research centers or respected publishers in a given topical field may increase the credibility value of documents published in that region and topical field.
- Step 212 comprises providing the collected document with a document tag, which comprises at least one time stamp and an indication of the data source the document was collected from.
- the document tag may be in the following form: document name; first time stamp; second time stamp; credibility value; place of origin.
- the fields of the document tag may be populated as follows, for example: "fissionpower.pdf; 22:01 20-01-2015; 08:03 22-01-2015; 0.56; London".
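The tag format above can be parsed and serialized as sketched here. The class `DocumentTag`, the function `parse_tag` and the constant `TIME_FORMAT` are hypothetical names; the time stamp layout (hours:minutes, then day-month-year) is inferred from the single example in the text and may not hold in general.

```python
from dataclasses import dataclass
from datetime import datetime

TIME_FORMAT = "%H:%M %d-%m-%Y"  # assumed from the example "22:01 20-01-2015"


@dataclass
class DocumentTag:
    name: str
    published: datetime   # first time stamp (publication time)
    collected: datetime   # second time stamp (collection time)
    credibility: float
    place: str

    def serialize(self):
        """Render the tag in the semicolon-separated form used in the text."""
        return "; ".join([
            self.name,
            self.published.strftime(TIME_FORMAT),
            self.collected.strftime(TIME_FORMAT),
            f"{self.credibility:.2f}",
            self.place,
        ])


def parse_tag(text):
    """Parse 'name; first time stamp; second time stamp; credibility; place'."""
    name, published, collected, credibility, place = [
        part.strip() for part in text.split(";")
    ]
    return DocumentTag(
        name,
        datetime.strptime(published, TIME_FORMAT),
        datetime.strptime(collected, TIME_FORMAT),
        float(credibility),
        place,
    )
```

Splitting on the semicolon assumes that document names and places never contain one; a production format would need escaping or a structured encoding.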
- a document's credibility value may be positively affected by increasing temporal concentrations of documents in the document's topical field. For instance, 100 documents relating to topic "fusion reactor" published or collected over one hour may result in a higher increase of the credibility value of those documents, compared with the same number of documents published or collected over one month.
- keywords and temporal specifiers within a document may also affect the document's credibility value. For instance, keywords and temporal specifiers in headings and/or abstracts may boost a document's credibility value more than the same keywords and temporal specifiers in other sections.
- documents may be classified into categories which have associated initial credibility values. For instance, "user reports" may score 0.1, "opinions" 0.2, "articles" 0.5, "researches" 0.8 and "legal notifications" 0.9.
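The credibility rules described above can be combined as sketched here. The category scores are the ones given in the text; everything else is an assumption for illustration: averaging the category score with the source credibility, and lowering credibility with age by exponential decay with a one-year half-life, are not specified by the disclosure.

```python
from datetime import datetime

# Category scores as listed in the text.
CATEGORY_CREDIBILITY = {
    "user reports": 0.1,
    "opinions": 0.2,
    "articles": 0.5,
    "researches": 0.8,
    "legal notifications": 0.9,
}


def initial_credibility(category, source_credibility):
    """Average the category score and the source credibility (assumed rule)."""
    return (CATEGORY_CREDIBILITY[category] + source_credibility) / 2


def aged_credibility(credibility, published, now, half_life_days=365.0):
    """Halve the credibility for every `half_life_days` of age (assumed rule)."""
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    return credibility * 0.5 ** (age_days / half_life_days)


# Example: an article from a credible source, evaluated one year after publication.
start = initial_credibility("articles", 0.9)
later = aged_credibility(start, datetime(2015, 1, 1), datetime(2016, 1, 1))
```

Other adjustments named in the text, such as place of publication or temporal concentration of documents, could be applied as further multiplicative factors on the same value.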
- Figure 3 is a flow chart illustrating steps (process phases) for a linguistic analysis, which is cursorily shown as 110 - 112 in Figure 1.
- the process phases relate to the exemplary document fragment shown in Figure 5.
- Step 302 of Figure 3 comprises removal of fill words, which are indicated by strikethrough format in Figure 5.
- An illustrative but non-exhaustive list of fill words comprises prepositions, articles and some conjunctions. These are largely irrelevant to the automatic linguistic analysis.
- Nouns, adjectives, verbs and keywords are extracted at 304.
- the nouns, including proper names are shown in italic text and adjectives in plaintext.
- Verbs are indicated by underlining and keywords by a black background.
- the extraction utilizes the stored vocabulary 32 shown in Figure 8.
- each of "fusion reactor” and “nuclear” have two occurrences in the document fragment of Figure 5.
- the document may be classified into multiple topical fields based on the extracted nouns.
- proper names may be relevant, such as “Lockheed Martin”, “Tom McGuire”, “Reuters”, “Lockheed's Skunk Works” and “America's”.
- "Reuters” may have been recognized as the document's source at 206.
- Other nouns describing topical fields may include "Defense contractor", “hydrogen”, “helium”, “sun”, “industry”, “government”, “atom” and “earth”.
- Step 308 comprises extracting temporal specifiers from the document. These include “three years", “now” and “today”. Temporal specifiers are utilized to determine a start point and/or end point of an individual event of a selected topic. In principle, the duration of an individual event may range from seconds to infinity, but in practical implementations it is a limited period of time. Most human languages have hundreds of temporal specifiers each, which are included in the vocabulary as exhaustively as possible (within reason). These are detected as they occur in documents.
- Steps 308 and 310 respectively comprise generating numerical forms of the extracted temporal specifiers and placing each of the generated numerical forms on a timeline.
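The extraction of temporal specifiers, their conversion to numerical forms and their placement on a timeline can be sketched as follows. The vocabulary fragment, the day-based approximations for months and years, and the choice to read every duration as an offset into the future from a reference date (for example the publication date) are all assumptions; the actual vocabulary of the disclosure is far larger and language-specific.

```python
import re
from datetime import date, timedelta

# Hypothetical vocabulary fragment: specifier -> offset in days from the reference date.
TEMPORAL_VOCABULARY = {
    "today": 0,
    "now": 0,
    "next week": 7,
    "next year": 365,
}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
UNITS_IN_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}  # coarse approximations

DURATION = re.compile(r"\b(one|two|three|four|five|\d+) (day|week|month|year)s?\b")


def extract_timeline(text, reference):
    """Return (specifier, date) pairs sorted along a timeline."""
    lowered = text.lower()
    points = []
    # Fixed expressions from the vocabulary.
    for specifier, offset in TEMPORAL_VOCABULARY.items():
        if re.search(r"\b" + re.escape(specifier) + r"\b", lowered):
            points.append((specifier, reference + timedelta(days=offset)))
    # Durations such as "three years", read as offsets into the future.
    for match in DURATION.finditer(lowered):
        amount, unit = match.groups()
        count = NUMBER_WORDS.get(amount) or int(amount)
        days = count * UNITS_IN_DAYS[unit]
        points.append((match.group(0), reference + timedelta(days=days)))
    return sorted(points, key=lambda point: point[1])
```

The numerical form here is a calendar date; an implementation that tracks accuracy would more likely store a start and end point for each specifier.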
- Some implementations comprise detection of duplicate information, so as to eliminate duplicate entry of information, which might wrongly bias predictions.
- Wildcard searches may be used to detect temporal specifiers in slightly altered forms.
- Temporal specifiers detected in a document may also affect the document's credibility value. For instance, if a document predicts, or contributes to prediction of, an event occurring in a narrow time range, this results in a higher credibility value for the document, compared to a prediction of an occurrence in a broader time range.
- temporal specifiers in a document indicating a narrow time range are considered more accurate than are temporal specifiers in a document indicating a broad time range.
- Keywords in the vicinity of temporal specifiers having a sufficiently high accuracy, e.g. a calendar day, may form events for a selected topic, which events will be presented to a user.
- For instance, a document relating to the topic "fusion reactor" may have an accurate temporal specifier, such as "20th June 2015", occurring near a keyword "prototype test". This would result in the keyword "prototype test" being presented to the user.
- an event may be formed by a word, which has not been registered as a keyword but which occurs in documents with a sufficiently high frequency near a specific temporal specifier.
- An event extracted from documents may be assigned a credibility value based on the likelihood of the event occurring at the predicted point of time. For instance, if the event is "US presidential elections on 8th November 2016", the event may be assigned a high credibility value, such as 0.95 for example, as a result of the accurate temporal specifier frequently occurring in the documents. Furthermore, the topical field of "politics" for the event also contributes to the credibility value. In contrast, an event "First fusion reactor on the market by 2035" would be assigned a low credibility value because the temporal specifiers given to this event by various documents have an extremely broad variation and the range of time extends a long way into the future.
- the extracted keywords are counted with respect to occurrences.
- one document may comprise 20 occurrences of "fusion reactor", which is a strong indication of that document's association to the given topic.
- the more occurrences a keyword has in a document the more relevant is the document to the topic being predicted.
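Counting keyword occurrences can be sketched with a whole-phrase, case-insensitive match. The function name `count_keywords` is hypothetical, and the disclosure does not state whether matching is case-sensitive or how inflected forms are handled.

```python
import re


def count_keywords(text, keywords):
    """Count non-overlapping, case-insensitive whole-phrase occurrences of each keyword."""
    counts = {}
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        counts[keyword] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts
```

The resulting counts can serve directly as a relevance signal, since the text treats a higher count as a stronger association with the topic.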
- a combination of keyword and temporal data in the same sentence has a high credibility. For instance, if an established supplier of power plants (expressed as a keyword) estimates that a live fusion reactor is operable in 2025, this has a higher credibility than the temporal data alone.
- the documents are classified into topical fields based on the keywords at 114.
- the documents classified into a topical field are ranked by importance based on the credibility values. A higher credibility value results in a higher importance ranking and vice versa.
- a time range for an event is determined based on the temporal specifier of the document with the highest credibility value.
- this document, optionally accompanied by other documents based on a set of selection criteria, is/are presented for the user, grouped into topical fields and ranked by importance.
- the documents (or icons of documents) may be placed on a timeline based on the temporal specifiers contained in the documents.
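The ranking and time range selection described above can be sketched as follows, assuming each document is a plain dictionary with hypothetical keys `fields`, `credibility`, `start` and `end`. The function names are likewise illustrative.

```python
from datetime import date


def rank_by_field(documents):
    """Group documents by topical field and sort each group by credibility, highest first."""
    ranked = {}
    for doc in documents:
        for topical_field in doc["fields"]:
            ranked.setdefault(topical_field, []).append(doc)
    for group in ranked.values():
        group.sort(key=lambda d: d["credibility"], reverse=True)
    return ranked


def time_range_for_event(ranked_docs):
    """Take the time range from the most credible document of a ranked group."""
    best = ranked_docs[0]
    return best["start"], best["end"]


# Illustrative documents; names, values and dates are placeholders.
docs = [
    {"name": "a", "fields": ["energy"], "credibility": 0.4,
     "start": date(2030, 1, 1), "end": date(2035, 1, 1)},
    {"name": "b", "fields": ["energy", "technology"], "credibility": 0.9,
     "start": date(2025, 1, 1), "end": date(2026, 1, 1)},
]
ranked = rank_by_field(docs)
energy_range = time_range_for_event(ranked["energy"])
```

A document classified into several topical fields appears in each of the corresponding groups, so it can rank differently relative to its neighbours in each field.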
- Figure 9 shows a report generated by a method according to an embodiment of the present disclosure.
- the linguistic analysis may comprise an assessment of a degree to which the information contained in the document is fact (as opposed to prediction) with respect to the future. This assessment may be based on temporal specifiers which, when pointing to the future indicate that the document likely describes the future, and when pointing to the past or present indicate that the document likely describes fact.
- a definite assessment of factual content may be based on the volume of the document and the accuracy of its temporal specifiers. For instance, if one thousand documents report an event having occurred precisely on 2nd February 2015, the information contained in the documents may be considered factual, provided that at least some of the documents originate from credible sources.
- a prediction generated by a method according to the present disclosure is compared with new facts discovered at the time of generating a subsequent prediction, wherein the new facts have emerged between the predictions. For instance, assume that an earlier prediction predicts an event occurring "mid 2016" and a new prediction is generated near the end of 2016. If the event did occur in mid-2016, the documents contributing to the correct prediction may have their credibility values increased, as do the data sources from which these documents were collected. Conversely, if the event failed to occur at the predicted time, the documents and corresponding data sources may have their credibility values decreased. The increase or decrease of credibility values based on correct or incorrect predictions may also be applied to credibility values of other events described in the affected documents.
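The feedback of prediction outcomes into credibility values can be sketched as a bounded update. The update rule, moving the value a fixed fraction of the way toward 1.0 or 0.0, and the `rate` parameter are assumptions; the disclosure states only that values are increased or decreased.

```python
def update_credibility(value, prediction_correct, rate=0.1):
    """Move a credibility value toward 1.0 after a correct prediction, toward 0.0 otherwise.

    With 0 <= rate <= 1 the result always stays within [0, 1].
    """
    target = 1.0 if prediction_correct else 0.0
    return value + rate * (target - value)


def apply_outcome(document_credibility, source_credibility, prediction_correct):
    """Update a document and the data source it was collected from together."""
    return (
        update_credibility(document_credibility, prediction_correct),
        update_credibility(source_credibility, prediction_correct),
    )
```

Because each step is proportional to the remaining distance to the target, a single outcome cannot push a value outside the valid range, and repeated correct predictions approach 1.0 gradually.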
- Figure 6 is a flow chart illustrating additional steps for updating and assessing predictions. Comparison between predictions generated at different times aims at determining a change of the predicted time range with respect to earlier prediction(s). Steps 600, 602 and 604 generally correspond to the methods described in Figures 1, 2 and 3, respectively, and will not be described again.
- the components of change to be determined comprise a direction and speed of the change of the time range. With new predictions, the time range may remain static, move closer towards the present, or move farther away into the future. Based on the direction and speed of the change, it is possible to estimate a time range for the next prediction, if the change of time range indicates movement towards the present.
- If the change of the time range indicates movement farther away into the future, an assessment can be made at 608 that the event predicted for the time range is unlikely to materialize in that time range.
- the old and new time ranges are stored at 610, and by comparing these, the credibility value of the updated time range may be re-assessed at 612.
- the comparison shown in Figure 6 may be repeated several times. As a result, the predicted time range continually progresses with new predictions, and its accuracy is likely to improve.
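The direction and speed of the change between two predictions can be sketched by comparing the midpoints of the old and new time ranges. Using the midpoint, and expressing speed as days of shift per day elapsed between predictions, are assumptions made for this illustration.

```python
from datetime import date


def range_shift(old_range, new_range, old_made, new_made):
    """Return (direction, speed) of the change between two predicted time ranges.

    Speed is the shift of the range midpoint in days per day elapsed
    between the two predictions; negative values mean movement toward the present.
    """
    def midpoint(time_range):
        start, end = time_range
        return start.toordinal() + (end.toordinal() - start.toordinal()) / 2

    shift_days = midpoint(new_range) - midpoint(old_range)
    elapsed = (new_made - old_made).days
    if elapsed <= 0:
        raise ValueError("the new prediction must be made after the old one")
    if shift_days == 0:
        direction = "static"
    elif shift_days < 0:
        direction = "closer"
    else:
        direction = "farther"
    return direction, shift_days / elapsed


# Example: the predicted range moves one year closer over one year of updates.
example = range_shift(
    (date(2030, 1, 1), date(2032, 1, 1)),
    (date(2029, 1, 1), date(2031, 1, 1)),
    date(2015, 1, 1),
    date(2016, 1, 1),
)
```

A result of "farther" corresponds to the assessment at 608 that the event is unlikely to materialize in the predicted range.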
- Figures 7a through 7d illustrate change of a point of time 14 as a result of an updated prediction.
- information is extracted from documents 12 to identify at least one event 16 and its point of time 14.
- the events are "product A", "technology B" and "market C".
- the documents 12 contain information related to these events, and temporal specifiers from which the document-specific points of time 14 can be derived. These can point to generally the same point of time, or they may point to widely scattered points of time. Based on the credibility values of the documents, the document-specific points of time 14 yield a time range 20, in which the event is expected to occur.
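One way to derive a time range from document-specific points of time and credibility values is a credibility-weighted mean with a spread of one weighted standard deviation on each side. This is an assumed rule chosen for illustration; the disclosure does not give a formula, and `weighted_time_range` is a hypothetical name.

```python
import math
from datetime import date


def weighted_time_range(points):
    """Compute a time range from (date, credibility) pairs.

    The range is the credibility-weighted mean date plus or minus one
    weighted standard deviation (an assumed rule, not from the source).
    """
    total = sum(weight for _, weight in points)
    if total <= 0:
        raise ValueError("at least one point needs a positive credibility")
    mean = sum(d.toordinal() * w for d, w in points) / total
    variance = sum(w * (d.toordinal() - mean) ** 2 for d, w in points) / total
    spread = math.sqrt(variance)
    return (
        date.fromordinal(round(mean - spread)),
        date.fromordinal(round(mean + spread)),
    )
```

Under this rule, widely scattered points of time produce a wide range, while points from credible documents that agree with each other pull the range narrow.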
- In Figure 7b, when the prediction is updated n times, more documents 12 are identified, and these have still more points of time 14. Based on these, a new time range 20' is formed, which may deviate in time from the old time range 20.
- Figures 7c and 7d show predictions, which have been updated n+m or n+m+p times, respectively. Each new prediction introduces new documents and points of time for the next updated prediction. When the time ranges change, they usually get narrower, as shown by value T in Figure 7d. In some cases the width T of the time range may even increase.
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Strategic Management (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Entrepreneurship & Innovation (AREA)
- General Business, Economics & Management (AREA)
- Economics (AREA)
- Human Resources & Organizations (AREA)
- Marketing (AREA)
- Operations Research (AREA)
- General Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Accounting & Taxation (AREA)
- Development Economics (AREA)
- Finance (AREA)
- Tourism & Hospitality (AREA)
- Game Theory and Decision Science (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A prediction process, comprising: registering keywords, each characterizing selected topics; classifying the keywords into topical fields; collecting documents (12), having associated source credibility values; determining for each document an associated document credibility value. For each document (12), the temporal specifier(s) and keyword(s) are extracted by linguistic analysis. The temporal specifiers describe timings of events. A document-specific point of time (14) for the event is determined based on the temporal specifiers, and the document is classified into topical field(s) based on its keywords. A prediction is generated by determining a time range (20) of events (16) relating to selected topics, based on occurrences of the document-specific points of time (14) of the topics (16). Documents (12) relating to the event (16) of the selected topic are presented, based on correlation of the document-specific points of time (14), the correlation being computed from the time range (20) and document credibility values.
Description
METHOD AND SYSTEM FOR TEMPORAL PREDICTIONS
PARENT CASE INFORMATION
The present invention claims priority from FI20155271, titled "Menetelma valittujen aiheiden ennustamiseksi ohjelmallisia valineita kayttaen", the contents of which are incorporated herein by reference.
FIELD
The present invention relates to computer-implemented methods and systems for predicting needs for resources.
BACKGROUND OF THE INVENTION
Predicting future development, such as needs for various resources, has traditionally required large amounts of brainwork. Experts have studied publications relating to various topics, based on which they have drafted their predictions. Due to exponentially increasing amounts of information and topics, this is a monumental task.
US 8468153 (Staffan Truve) discloses methods for automatic processing of information. Truve discloses a wide area network fact information service method, including: storing a plurality of canonical fact entries in different subject matter domains, wherein the canonical fact entries each correspond to an occurrence and each include an occurrence date for that occurrence, storing one or more fact descriptor entries for each of the canonical fact entries, ranking the canonical fact entries in each subject matter domain relative to each other based on the descriptor entries, the occurrence dates, and on at least one ranking measure for that subject matter domain, and wherein at least some of the occurrence dates are extracted from content meaning in one or more textual sources about the occurrence.
SUMMARY
It is an object of the present invention to provide alternative or improved methods, systems and computer program products with respect to accuracy of prediction and/or applicability to specific fields, such as predictive estimation of resource needs.
An aspect of the present invention is a method according to claim 1. Other aspects include systems and computer program products according to the other independent claim. The dependent claims and the following description and drawings disclose various optional features, use cases or the like.
In a preferred but non-restrictive use case, the present disclosure is used to predict progress of resource requirements. An illustrative but non-exhaustive list of resources, whose requirements may be predicted, includes workforce, education, energy resources, raw materials, construction materials, and transportation resources. The present disclosure can also be used to make predictions concerning market share, expansion, growth, and social or device connectivity.
According to an optional embodiment, the source credibility value of a data source is repeatedly updated based on facts identified from documents collected from that data source. As a result, a data source initially associated with a low source credibility value may have its source credibility value increased if predictions based on documents collected from that data source turn out to be accurate, or vice versa.
According to another optional embodiment, the determining of the associated document credibility value for each collected document, using the set of pre-selected criteria, comprises one or more of: registering a first time stamp, which describes a publication time of the collected document; registering a second time stamp, which describes a collection time of the collected document; determining the data source of the collected document and setting the document credibility value of the collected document based on the source credibility value of the data source; determining and registering a place of publication of the collected document; and providing the collected document with a document tag, which comprises at least one time stamp and the data source. This embodiment utilizes the fact that some sources have been known to yield more accurate predictions, and thereby earn better source credibility than other sources do. For instance, some individuals may have a high capability in defining future events, and these are referred to as superforecasters. Some organisations may employ or utilize superforecasters, and may therefore be even more precise and capable of making better predictions. This embodiment therefore democratises such sources, allowing anyone or any group to be a successful superforecaster.
According to another optional embodiment, the linguistic analysis comprises one or more of: removing stop words from the collected document; extracting nouns, adjectives and verbs from the collected document and searching selected keywords among the extracted nouns, adjectives and verbs; classifying at least extracted nouns from the collected document into pre-selected topical fields; extracting temporal specifiers, generating numerical forms of the extracted temporal specifiers and placing each of the generated numerical forms on a timeline; and counting occurrences of the keywords. This embodiment provides the additional benefit of defining temporal items for each processed content, which allows the content or section (sentence of text) to be placed at a specific future point in time, when there are multiple placements of the same content of the same industry in the same temporal position in the future. This increases the likelihood of the event/product/service being produced or being available at that predicted time, because opportunity creates investment for research and development resources to fulfill the predicted demand.
According to another optional embodiment, the linguistic analysis comprises extracting temporal specifiers by using a pre-formed vocabulary, which comprises plural time-related expressions in a language of the collected document. This may involve a recursive self-learning system wherein, each time a process is performed, the next time there is the same or similar source text with temporal data, the process can be performed quicker and placement to a future time point is made according to the content and industry in question. The self-learning ability of the system may be based on recursive placement of temporal positions in future time. Eventually there will be large numbers (up to thousands) of temporal terms or specifiers that the system has processed.
Still further optional embodiments contribute to better topical match and temporal match to desired content. For instance, the method may comprise ranking the collected documents by importance in each of multiple topical fields, based on occurrences of keywords, document temporal match and the document credibility value of the document. Alternatively or additionally, the method may comprise presenting the collected documents in the ranked order in each of multiple topical fields. Alternatively or additionally, the method may comprise repeating the prediction in one or more selected periods of time, and estimating a change of the time range of the at least one event based on a comparison of the one or more repeated predictions with one or more earlier predictions. Alternatively or additionally, the estimating of the change may comprise repeating the collecting of at least some of the documents from one or more selected data sources; repeating the extracting of at least one temporal specifier; updating the determining of the time range of the at least one event based on a comparison of each temporal specifier obtained from the repeated extraction with a corresponding temporal specifier extracted earlier.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following section, specific embodiments of the invention will be described in greater detail in connection with illustrative but non-restrictive examples. A reference is made to the following drawings:
Figure 1 is a flow chart showing basic steps of a method according to an embodiment of the present disclosure;
Figure 2 is a flow chart illustrating an embodiment for determination of a document credibility value;
Figure 3 is a flow chart illustrating an embodiment for a linguistic analysis;
Figure 4a shows a fragment of a vocabulary usable in an embodiment of the present disclosure;
Figure 4b shows a fragment of a stop word list usable in an embodiment of the present disclosure;
Figure 5 illustrates an extraction step of a linguistic analysis being applied on a fragment from a document;
Figure 6 is a flow chart illustrating additional steps for updating and assessing predictions;
Figures 7a through 7d illustrate change of a point of time as a result of an updated prediction;
Figure 8 shows an IT system for implementing embodiments of the present disclosure; and
Figure 9 shows a report generated by a method according to an embodiment of the present disclosure.
DETAILED DESCRIPTION OF SOME SPECIFIC EMBODIMENTS
Figure 8 shows an IT system for implementing embodiments of the present disclosure. In the illustrated implementation, the presently disclosed methods are carried out by executing program code instructions 34 in a processor (shown as a laptop computer) 30, which uses data network(s), such as the internet, to access cloud-based documents 12 provided by multiple data sources 50. Reference number 36 denotes a database, which stores sets of documents deemed relevant for selected predictions. Reference number 32 denotes a vocabulary (typically one of many). Each vocabulary 32 is specific to a human language, and within each language there may be multiple vocabularies, each of which is specific to a topical field, such as electronics, chemistry, and so on.
Figure 4a shows a fragment of a vocabulary 32 usable in an embodiment of the present disclosure. Each vocabulary 32 comprises, as exhaustively as possible, the words of the language and indicates a word class for each word, such as noun, adjective, verb, and so on. In addition, the vocabularies of the present disclosure indicate temporal specifiers and stop words, which may be implemented as extensions to traditional grammatical word classes or as distinct lists. Examples of temporal specifiers include "today", "this week", "next year", and so on. The system may also comprise a data source register (not shown separately), which stores identifications and access resources (such as URLs) of data sources. The data source register may also store topical fields and credibility values for the data sources. For instance, a well-known news agency may have an initial credibility value of 0.99, while the initial credibility value for a tabloid magazine may be much lower, such as 0.1.
Figure 4b shows a fragment of a stop word list usable in an embodiment of the present disclosure. Stop words are words that are ignored in searches. The illustrated stop word list comprises tuples, each having an ID field, stop word, creation date and modification date.
Referring to Figure 1, an exemplary process according to the present disclosure will be described next. The exemplary process being described relates to the topic of "fusion reactor".
The steps for generating a prediction may be compressed into six main phases:
A. Scanning, which comprises searching for temporal specifiers and keywords within the documents used as source material.
B. Tempo build-up of cumulative temporal specifiers, in linear or exponential mode, of an instance per time segment.
C. Tagging, which comprises all detected features.
D. Processing, which comprises forming a view of the process of events and time range in the future.
E. Storing, which comprises storing events and their time ranges.
F. Re-measuring stored values constantly, from A to E, calculating the temporal estimate for every instance, and measuring the change of the temporal estimate: whether its position moves closer to the current time, stays unchanged, or moves further away from the current time.
As shown in Figure 1, the process being described herein begins from registering a keyword in memory at 102. In this context, the keyword means a word indicating the selected topic, such as "fusion reactor". There are typically several keywords, which may be hierarchically coupled in the dictionary. For instance, "fusion reactor" may belong in a broader keyword "nuclear", which in turn belongs in a broader keyword "energy technology" and onwards to "energy industry". In other exemplary processes according to the present disclosure some or all of the keywords relating to the topic being predicted may be obtained from a user.
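The hierarchical coupling of keywords can be illustrated with a minimal sketch. The `BROADER` mapping and the `broader_chain` function are hypothetical names introduced here for illustration; the disclosure does not prescribe a data structure, and only the single chain from the example above is encoded.

```python
# Hypothetical parent links; only the chain named in the description is encoded.
BROADER = {
    "fusion reactor": "nuclear",
    "nuclear": "energy technology",
    "energy technology": "energy industry",
}


def broader_chain(keyword, broader=BROADER):
    """Return the keyword followed by each successively broader keyword."""
    chain = [keyword]
    seen = {keyword}
    while chain[-1] in broader:
        parent = broader[chain[-1]]
        if parent in seen:  # guard against accidental cycles in the mapping
            break
        chain.append(parent)
        seen.add(parent)
    return chain
```

Under these assumptions, a document matching "fusion reactor" could also be associated with every broader keyword in the returned chain.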
After selecting and registering the one or more keywords, each keyword is classified into a topical field at 104. Such classification may be automatic, based on a pre-generated keyword list, or it may be performed or assisted by one or more users. For instance, the keyword "fusion reactor" may be classified in topical fields "technology" and/or "energy technology".
After registration and classification of the one or more keywords, collection of information begins at 106. The collection of information typically comprises queries to document databases, search engines, or the like, via the internet, whereby information, typically in the form of documents relating to the selected keywords, may be collected automatically. Some implementations of the process may involve associating topical fields with data sources, which are more than usually relevant to the topical field and whose contents are ideally collected for generation of the prediction. Alternatively, information may be collected from all data sources associated with a given topical field.
Collection of information typically involves use of one or more internet search engines, such as Google (Google, Inc.) or Bing (Microsoft Corp.), which search for documents relevant to the keywords given in the query to the search engine. The documents, whose identifiers (URLs) are returned by the search engines, may be stored on a cloud server. The stored information may be automatically analyzed in order to determine relevant documents, while new information may be stored to replace information stored earlier, so as to keep storage requirements within reasonable limits. Keyword-based selection of information to be stored is preferably applied prior to storing the information, so as to avoid storing documents which are irrelevant to the selected keyword(s). The information collection may also comprise use of filters to prevent storing of duplicates of documents. After storing the documents, a credibility value is determined for each document at 108.
Figure 2 is a flow chart illustrating an embodiment for determination of a document credibility value, which is shown as a single step 108 in Figure 1. Some steps of Figure 2 may be omitted in some implementations. Step 202 comprises registering a first time stamp, which describes a publication time of the collected document. Step 204 comprises registering a second time stamp, which describes a collection time of the collected document. Step 206 comprises determining the data source of the collected document. Step 208 comprises setting the document's credibility value. In some implementations, a document's initial credibility value may correspond to or be based on the source credibility value of the data source the document was collected from. Alternatively or additionally, the document's initial credibility value may be based on the first and/or second time stamps. For instance, a document's credibility value may be lowered as the document ages, as indicated by the first and/or second time stamps. Step 210 comprises determining and registering a place of publication of the collected document. In some implementations, the document's credibility value may be adjusted based on place of publication, with or without regard to the topic. For instance, a region having universities, research centers or respected publishers in a given topical field may increase the credibility value of documents published in that region and topical field. Step 212 comprises providing the collected document with a document tag, which comprises at least one time stamp and an indication of the data source the document was collected from. For example, the document tag may be in the following form: document name; first time stamp; second time stamp; credibility value; place of origin. The fields of the document tag may be populated as follows, for example: "fissionpower.pdf; 22:01 20-01-2015; 08:03 22-01-2015; 0.56; London".
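The document tag form given above lends itself to a small sketch. The `DocumentTag` class, its field names and the `parse_tag` function are assumptions made for illustration; only the field order and the example string come from the description, and the time format is inferred from that example.

```python
from dataclasses import dataclass
from datetime import datetime

# Inferred from the example "22:01 20-01-2015" (hour:minute day-month-year).
TIME_FORMAT = "%H:%M %d-%m-%Y"


@dataclass
class DocumentTag:
    name: str
    published: datetime   # first time stamp (publication time)
    collected: datetime   # second time stamp (collection time)
    credibility: float
    place: str

    def render(self) -> str:
        """Serialize the tag as a semicolon-separated string."""
        return "; ".join([
            self.name,
            self.published.strftime(TIME_FORMAT),
            self.collected.strftime(TIME_FORMAT),
            f"{self.credibility:.2f}",
            self.place,
        ])


def parse_tag(text: str) -> DocumentTag:
    """Parse a tag string in the five-field form described above."""
    name, published, collected, credibility, place = (
        part.strip() for part in text.split(";")
    )
    return DocumentTag(
        name,
        datetime.strptime(published, TIME_FORMAT),
        datetime.strptime(collected, TIME_FORMAT),
        float(credibility),
        place,
    )
```

Parsing the example tag and rendering it again reproduces the original string, which is a convenient check that no field is lost in storage.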
Alternatively or additionally, a document's credibility value may be positively affected by increasing temporal concentrations of documents in the document's topical field. For instance, 100 documents relating to topic "fusion reactor" published or collected over one hour may result in a higher increase of the credibility value of those documents, compared with the same number of documents published or collected over one month.
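One possible way to express such a concentration effect is sketched below. The saturating formula, the function name `concentration_boost` and the parameters `max_boost` and `half_rate` are assumptions for illustration only; the description states merely that a denser burst of documents should yield a larger increase.

```python
from datetime import datetime, timedelta


def concentration_boost(timestamps, max_boost=0.2, half_rate=10.0):
    """Return a credibility boost in [0, max_boost) that grows with documents per day.

    half_rate is the (assumed) rate in documents per day at which half of
    max_boost is granted.
    """
    if len(timestamps) < 2:
        return 0.0
    span_days = (max(timestamps) - min(timestamps)).total_seconds() / 86400
    span_days = max(span_days, 1 / 24)  # floor at one hour to avoid division by zero
    rate = len(timestamps) / span_days
    return max_boost * rate / (rate + half_rate)
```

With these assumed parameters, 100 documents spread over one hour receive a boost close to `max_boost`, whereas the same number spread over about a month receive a much smaller one.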
Alternatively or additionally, placement of keywords and temporal specifiers within a document may also affect the document's credibility value. For instance, keywords and temporal specifiers in headings and/or abstracts may boost a document's credibility value more than the same keywords and temporal specifiers in other sections.
Still further, documents may be classified into categories which have associated initial credibility values. For instance, "user reports" may score 0.1, "opinions" 0.2, "articles" 0.5, "researches" 0.8 and "legal notifications" 0.9.
Referring back to Figure 1, after assignation of the document credibility values, the process continues at 110, in which temporal specifiers, which indicate points of time in the documents, are extracted. Keywords are extracted at 112.
Figure 3 is a flow chart illustrating steps (process phases) for a linguistic analysis, which is cursorily shown as 110 - 112 in Figure 1. The process phases relate to the exemplary document fragment shown in Figure 5.
Step 302 of Figure 3 comprises removal of fill words, which are indicated by strikethrough format in Figure 5. An illustrative but non-exhaustive list of fill words comprises prepositions, articles and some conjunctions. These are largely irrelevant to the automatic linguistic analysis. Nouns, adjectives, verbs and keywords are extracted at 304. In Figure 5, the nouns, including proper names, are shown in italic text and adjectives in plaintext. Verbs are indicated by underlining and keywords by a black background. The extraction utilizes the stored vocabulary 32 shown in Figure 8. Among the keywords identified in step 102 of Figure 1, each of "fusion reactor" and "nuclear" has two occurrences in the document fragment of Figure 5. In step 306 of Figure 3, the document may be classified into multiple topical fields based on the extracted nouns. In this process, proper names may be relevant, such as "Lockheed Martin", "Tom McGuire", "Reuters", "Lockheed's Skunk Works" and "America's". Among these, "Reuters" may have been recognized as the document's source at 206. Other nouns describing topical fields may include "Defense contractor", "hydrogen", "helium", "sun", "industry", "government", "atom" and "earth".
Step 308 comprises extracting temporal specifiers from the document. These include "three years", "now" and "today". Temporal specifiers are utilized to determine a start point and/or end point of an individual event of a selected topic. In principle, the duration of an individual event may range from seconds to infinity, but in practical implementations it is a limited period of time. Most human languages have hundreds of temporal specifiers each, which are included in the vocabulary as exhaustively as possible (within reason). These are detected as they occur in documents. Steps 310 and 312 respectively comprise generating numerical forms of the extracted temporal specifiers and placing each of the generated numerical forms on a timeline. For instance, in a publication issued 5th March 2015, "three years" may be converted to a numerical form of "20150305-20160805-20180305" (start-mid-end). There are typically multiple varieties or groups of temporal specifiers. One group is formed by absolute expressions, such as numerically expressed points of time (eg dates or date-times). Another group is formed by relative expressions, such as "in three years" or "soon after second quarter". Such linguistics, which may vary based on culture and can be separately defined, are evaluated from the document's publication date into the future. A third group is formed by relative expressions based on a verbally expressed point of time, such as "a week following Christmas Eve". Assuming a publication year of 2015, this is converted to a numerical form of "20151224-20151231". A fourth group is formed by relative, repetitive expressions, such as "next year". In 2015 this would be expressed as "20160101-20160630-20161231". A fifth group is formed by periods of time based on a particular point of time, such as "this year, probably in mid-July". This would be converted to a numerical form of "20150101-20150715-20151231". It is also possible to include an indication of time. For instance, a start-mid-end expression for the last six hours ("new year's eve") of the year 2016 can be expressed as "201612311800-201612312100-201701010000".
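The conversion of relative temporal specifiers into start-mid-end numerical forms can be sketched as follows. The functions `resolve` and `encode_range`, the small `NUMBER_WORDS` table and the handful of supported expressions are assumptions for illustration; a real vocabulary would cover far more expressions. The description does not state how the mid value is derived, so the arithmetic midpoint used as a default here is also an assumption and may differ slightly from the mid values in the examples above; an explicit mid value can be passed instead.

```python
import re
from datetime import date, timedelta

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def encode_range(start, end, mid=None):
    """Format a time range as 'start-mid-end' using YYYYMMDD parts."""
    if mid is None:
        # Assumption: arithmetic midpoint, rounded down to a whole day.
        mid = start + timedelta(days=(end - start).days // 2)
    return "-".join(d.strftime("%Y%m%d") for d in (start, mid, end))


def resolve(specifier, published):
    """Map a few relative specifiers to (start, end) dates."""
    text = specifier.lower().strip()
    if text == "this year":
        return date(published.year, 1, 1), date(published.year, 12, 31)
    if text == "next year":
        return date(published.year + 1, 1, 1), date(published.year + 1, 12, 31)
    match = re.fullmatch(r"(?:in )?(\w+) years?", text)
    if match:
        token = match.group(1)
        count = int(token) if token.isdigit() else NUMBER_WORDS.get(token)
        if count is not None:
            end_year = published.year + count
            try:
                end = published.replace(year=end_year)
            except ValueError:  # 29 February in a non-leap target year
                end = published.replace(year=end_year, day=28)
            return published, end
    raise ValueError(f"unsupported temporal specifier: {specifier!r}")
```

For a publication dated 5th March 2015, `resolve("three years", ...)` yields the start and end dates of the first example, and passing `mid=date(2015, 7, 15)` to `encode_range` reproduces the "this year, probably in mid-July" form.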
Some implementations comprise detection of duplicate information, so as to eliminate duplicate entry of information, which might wrongly bias predictions. Wildcard searches may be used to detect temporal specifiers in slightly altered forms. Temporal specifiers detected in a document may also affect the document's credibility value. For instance, if a document predicts, or contributes to prediction of, an event occurring in a narrow time range, this results in a higher credibility value for the document, compared to a prediction of an occurrence in a broader time range. In other words, temporal specifiers in a document indicating a narrow time range are considered more accurate than are temporal specifiers in a document indicating a broad time range. Keywords in the vicinity of temporal specifiers having a sufficiently high accuracy (eg a calendar day) may form events for a selected topic, which events will be presented to a user. Consider an illustrative example, wherein a document relating to topic "fusion reactor" has an accurate temporal specifier, such as "20th June 2015", occurring near a keyword "prototype test". This would result in the keyword "prototype test" being presented to the user. Alternatively, an event may be formed by a word, which has not been registered as a keyword but which occurs in documents with a sufficiently high frequency near a specific temporal specifier.
An event extracted from documents may be assigned a credibility value based on the likelihood of the event occurring at the predicted point of time. For instance, if the event is "US presidential elections on 8th November 2016", the event may be assigned a high credibility value, such as 0.95 for example, as a result of the accurate temporal specifier frequently occurring in the documents. Furthermore, the topical field of "politics" for the event also contributes to the credibility value. In contrast, an event "First fusion reactor on the market by 2035" would be assigned a low credibility value because the temporal specifiers given to this event by various documents have an extremely broad variation and the range of time extends a long way into the future.
At 314 the extracted keywords are counted with respect to occurrences. For instance, one document may comprise 20 occurrences of "fusion reactor", which is a strong indication of that document's association to the given topic. The more occurrences a keyword has in a document, the more relevant is the document to the topic being predicted. A combination of keyword and temporal data in the same sentence has a high credibility. For instance, if an established supplier of power plants (expressed as a keyword) estimates that a live fusion reactor is operable in 2025, this has a higher credibility than the temporal data alone.
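Stop word removal followed by keyword counting can be sketched as below. The tiny `STOP_WORDS` set, the tokenization by a simple regular expression and the function name `count_keywords` are assumptions for illustration; the disclosure relies on a stored vocabulary 32 and stop word list rather than on this simplified approach.

```python
import re
from collections import Counter

# Illustrative only; a real stop word list would be far longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is"}


def count_keywords(text, keywords, stop_words=STOP_WORDS):
    """Count occurrences of (possibly multi-word) keywords after stop word removal."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stop_words]
    counts = Counter()
    for keyword in keywords:
        # Drop stop words from the keyword too, so it can match the filtered tokens.
        parts = [p for p in keyword.lower().split() if p not in stop_words]
        if not parts:
            counts[keyword] = 0
            continue
        n = len(parts)
        counts[keyword] = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == parts
        )
    return counts
```

A document with a higher count for a given keyword would, under the description above, be treated as more relevant to the corresponding topic.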
Returning to Figure 1, after the linguistic analysis shown as 120, the documents are classified into topical fields based on the keywords at 114. The documents classified into a topical field are ranked by importance based on the credibility values. A higher credibility value results in a higher importance ranking and vice versa. At 116 a time range for an event is determined based on the temporal specifier of the document with the highest credibility value. At 118 this document, optionally accompanied by other documents based on a set of selection criteria, is presented to the user, grouped into topical fields and ranked by importance. The documents (or icons of documents) may be placed on a timeline based on the temporal specifiers contained in the documents. Finally, Figure 9 shows a report generated by a method according to an embodiment of the present disclosure.
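Steps 114 through 116 can be sketched as a grouping and sorting operation. The `Doc` record, its field names and the functions `rank_by_field` and `event_time_range` are hypothetical; the sketch assumes each document has already been assigned one topical field, a credibility value and a time range in numerical form.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    topical_field: str
    credibility: float
    time_range: tuple  # (start, end) as YYYYMMDD strings


def rank_by_field(docs):
    """Group documents by topical field, most credible first within each field."""
    ranked = {}
    for doc in docs:
        ranked.setdefault(doc.topical_field, []).append(doc)
    for field_docs in ranked.values():
        field_docs.sort(key=lambda d: d.credibility, reverse=True)
    return ranked


def event_time_range(ranked, topical_field):
    """Take the time range from the highest-credibility document in the field."""
    return ranked[topical_field][0].time_range
```

The ranked lists can then be presented per topical field, with the first entry of each list supplying the time range for the event.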
In some implementations, the linguistic analysis may comprise an assessment of a degree to which the information contained in the document is fact (as opposed to prediction) with respect to the future. This assessment may be based on temporal specifiers which, when pointing to the future indicate that the document likely describes the future, and when pointing to the past or present indicate that the document likely describes fact. A definite assessment of factual content may be based on the volume of the document and the accuracy of its temporal specifiers. For instance, if one thousand documents report an event having occurred precisely on 2nd February 2015, the information contained in the documents may be considered factual, provided that at least some of the documents originate from credible sources.
In some use cases, a prediction generated by a method according to the present disclosure is compared with new facts discovered at the time of generating a subsequent prediction, wherein the new facts have emerged between the predictions. For instance, assume that an earlier prediction predicts an event occurring "mid 2016" and a new prediction is generated near the end of 2016. If the event did occur in mid-2016, the documents contributing to the correct prediction may have their credibility values increased, as do the data sources from which these documents were collected. Conversely, if the event failed to occur at the predicted time, the documents and corresponding data sources may have their credibility values decreased. The increase or decrease of credibility values based on correct or incorrect predictions may also be applied to credibility values of other events described in the affected documents.
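The feedback from verified outcomes to credibility values can be sketched as follows. The fixed `step` of 0.05, the clamping to the interval from 0 to 1, and the dictionary-based bookkeeping are assumptions for illustration; the description does not specify by how much credibility values are increased or decreased.

```python
def update_credibility(value, correct, step=0.05):
    """Raise or lower a credibility value by an assumed step, clamped to [0, 1]."""
    adjusted = value + step if correct else value - step
    return min(1.0, max(0.0, adjusted))


def apply_outcome(doc_cred, source_cred, doc_sources, contributing, correct):
    """Update the documents that contributed to a prediction and their data sources.

    doc_cred and source_cred map names to credibility values and are modified
    in place; doc_sources maps each document name to its data source name.
    """
    for doc in contributing:
        doc_cred[doc] = update_credibility(doc_cred[doc], correct)
    for source in {doc_sources[doc] for doc in contributing}:
        source_cred[source] = update_credibility(source_cred[source], correct)
```

Each data source is adjusted once per outcome in this sketch, regardless of how many of its documents contributed; weighting by the number of contributing documents would be an equally plausible design.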
Figure 6 is a flow chart illustrating additional steps for updating and assessing predictions. Comparison between predictions generated at different times aims at determining a change of the predicted time range with respect to earlier prediction(s). Steps 600, 602 and 604 generally correspond to the methods described in Figures 1, 2 and 3, respectively, and will not be described again. At 606, the components of change to be determined comprise a direction and speed of the change of the time range. With new predictions, the time range may remain static, move closer towards the present, or move farther away into the future. Based on the direction and speed of the change, it is possible to estimate a time range for the next prediction, if the change of time range indicates movement towards the present. On the other hand, if the change of the time range indicates movement farther away into the future, an assessment can be made at 608 that the event predicted for the time range is unlikely to materialize in that time range. After the assessment, the old and new time ranges are stored at 610, and by comparing these, the credibility value of the updated time range may be re-assessed at 612. In some implementations the comparison shown in Figure 6 may be repeated several times. As a result, the predicted time range continually progresses with new predictions, and its accuracy is likely to improve.
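The direction and speed of change determined at 606 can be sketched by comparing the midpoints of two time ranges. Measuring the shift between midpoints, and expressing speed as days of shift per day elapsed between predictions, are assumptions made for illustration; the function names are hypothetical.

```python
from datetime import date, timedelta


def midpoint(time_range):
    """Arithmetic midpoint of a (start, end) date pair, rounded down to a day."""
    start, end = time_range
    return start + timedelta(days=(end - start).days // 2)


def range_change(old_range, new_range, old_made, new_made):
    """Return (direction, speed) of the change between two predictions.

    direction is 'static', 'closer' (towards the present) or 'farther';
    speed is the signed midpoint shift in days per day elapsed.
    """
    elapsed = (new_made - old_made).days
    if elapsed <= 0:
        raise ValueError("the new prediction must be made after the old one")
    shift = (midpoint(new_range) - midpoint(old_range)).days
    if shift == 0:
        direction = "static"
    elif shift < 0:
        direction = "closer"
    else:
        direction = "farther"
    return direction, shift / elapsed
```

A "farther" result would correspond to the assessment at 608 that the event is unlikely to materialize in the originally predicted time range.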
Figures 7a through 7d illustrate change of a point of time 14 as a result of an updated prediction. As shown in Figure 7a, information is extracted from documents 12 to identify at least one event 16 and its point of time 14. In the example of Figure 7a, such events are "product A", "technology B" and "market C". The documents 12 contain information related to these events, and temporal specifiers from which the document-specific points of time 14 can be derived. These can point to generally the same point of time, or they may point to widely scattered points of time. Based on the credibility values of the documents, the document-specific points of time 14 yield a time range 20, in which the event is expected to occur.
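One way to derive a time range 20 from credibility-weighted document-specific points of time 14 is sketched below. Using a credibility-weighted mean plus or minus one weighted standard deviation is an assumption made for illustration; the description does not specify how the points of time are combined, and the function name `time_range` is hypothetical.

```python
from datetime import date
from math import sqrt


def time_range(points):
    """Derive (start, end) dates from (point_of_time, credibility) pairs.

    Assumption: the range is the credibility-weighted mean date plus or minus
    one credibility-weighted standard deviation.
    """
    total = sum(weight for _, weight in points)
    if total <= 0:
        raise ValueError("at least one point with positive credibility is required")
    mean = sum(d.toordinal() * weight for d, weight in points) / total
    variance = sum(weight * (d.toordinal() - mean) ** 2 for d, weight in points) / total
    spread = sqrt(variance)
    return date.fromordinal(round(mean - spread)), date.fromordinal(round(mean + spread))
```

Under this assumption, points of time that cluster around the same date give a narrow range, widely scattered points give a broad one, and low-credibility documents pull the range less than high-credibility ones.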
As shown in Figure 7b, when the prediction is updated n times, more documents 12 are identified, and these have still more points of time 14. Based on these, a new time range 20' is formed, which may deviate in time from the old time range 20. Figures 7c and 7d show predictions, which have been updated n+m or n+m+p times, respectively. Each new prediction introduces new documents and points of time for the next updated prediction. When the time ranges change, they usually get narrower, as shown by value T in Figure 7d. In some cases the width T of the time range may even increase.
Those skilled in the art will realize that the inventive principle may be modified in various ways without departing from the scope of the present invention as defined by the following claims.
Claims
1. A method comprising performing the following steps on a data processing system, which comprises at least one processor, memory having program code instructions, and input-output peripherals for communicating with data sources over one or more networks:
- registering a plurality of keywords, each of which characterizes a selected topic;
- classifying each of the plurality of keywords into a pre-selected topical field;
- collecting documents (12) from the data sources, wherein each of the data sources has an associated source credibility value;
- determining for each collected document an associated document credibility value by using a set of pre-selected criteria;
- for each of several documents (12):
- extracting from the document at least one temporal specifier and at least one keyword by a linguistic analysis, wherein the at least one temporal specifier describes timing of at least one event, which event relates to a topic and is mentioned in the document;
- determining a document-specific point of time (14) for the at least one event based on the temporal specifier extracted from the document;
- classifying the document into one or more topical fields based on occurrences of keywords in the document;
- generating a prediction by determining a time range (20) of at least one event (16), which relates to a selected topic, based on a concentration of occurrences of the document-specific points of time (14) of the selected topic (16); and
- presenting at least one document (12) relating to the event (16) of the selected topic, based on correlation of the document-specific point of time (14) of the document (12), wherein the correlation is computed from the time range (20) of at least one event (16) and document credibility value of the document.
2. The method of claim 1, wherein the determining of the associated document credibility value for each collected document (12), using the set of pre-selected criteria comprises one or more of:
- registering (202) a first time stamp, which describes a publication time of the collected document (12);
- registering (204) a second time stamp, which describes a collection time of the collected document (12);
- determining (206) the data source of the collected document (12) and setting the document credibility value of the collected document (12) based on the source credibility value of the data source;
- determining and registering a place of publication of the collected document (12);
- providing the collected document (12) with a document tag, which indicates at least one time stamp and the data source.
3. The method of claim 1 or 2, wherein the linguistic analysis comprises one or more of:
- removing stop words from the collected document (12);
- extracting nouns, adjectives and verbs from the collected document (12) and searching selected keywords among the extracted nouns, adjectives and verbs;
- classifying at least extracted nouns from the collected document (12) into pre-selected topical fields;
- extracting temporal specifiers, generating numerical forms of the extracted temporal specifiers and placing each of the generated numerical forms on a timeline; and
- counting occurrences of the keywords.
4. The method of any one of the preceding claims, wherein the linguistic analysis comprises extracting temporal specifiers by using a pre-formed vocabulary, which comprises plural time-related expressions in a language of the collected document.
5. The method of any one of the preceding claims, further comprising ranking the collected documents (12) by importance in each of multiple topical fields, based on occurrences of keywords, temporal match and the document credibility value of the document.
6. The method of claim 5, further comprising presenting the collected documents (12) in the ranked order in each of multiple topical fields.
7. The method of any one of the preceding claims, further comprising repeating the prediction in one or more selected periods of time, and estimating a change (18) of the time range (20) of the at least one event (16) based on a comparison of the one or more repeated predictions with one or more earlier predictions.
8. The method of claim 7, wherein the estimating the change (18) comprises:
- repeating the collecting of at least some of the documents from one or more selected data sources;
- repeating the extracting of at least one temporal specifier;
- updating the determining of the time range (20) of the at least one event (16) based on a comparison of each temporal specifier obtained from the repeated extraction with a corresponding temporal specifier extracted earlier.
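The comparison of a repeated prediction with an earlier one in claims 7 and 8 can be sketched as a difference of range boundaries. Representing a time range as a `(start, end)` pair of numbers and the name `estimate_change` are assumptions, not details taken from the claims.

```python
def estimate_change(earlier_range, repeated_range):
    """Return (shift of start, shift of end) between two predictions.

    Each range is a (start, end) pair in the same numeric unit, e.g. years.
    Positive values mean the event is now expected later than before.
    """
    start_shift = repeated_range[0] - earlier_range[0]
    end_shift = repeated_range[1] - earlier_range[1]
    return start_shift, end_shift


# An earlier prediction of 2018..2020 that moves to 2019..2022 on repetition.
change = estimate_change((2018, 2020), (2019, 2022))
```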
9. The method of any one of the preceding claims, further comprising using the generated prediction to predict progress of one or more resource requirements.
10. The method of claim 9, wherein the one or more resource requirements relate to one or more of workforce, education, energy resources, raw materials, construction materials, and transportation resources.
11. A processor comprising:
- at least one processing unit;
- a memory system for storing program code instructions and data;
- input-output peripherals for communicating with data sources over one or more networks;
- wherein the program code instructions, when executed by the at least one processing unit, cause the processor to carry out the following steps:
- registering a plurality of keywords, each of which characterizes a selected topic;
- classifying each of the plurality of keywords into a pre-selected topical field;
- collecting documents (12) from the data sources, wherein each of the data sources has an associated source credibility value;
- determining for each collected document an associated document credibility value by using a set of pre-selected criteria;
- for each of several documents (12):
- extracting from the document at least one temporal specifier and at least one keyword by a linguistic analysis, wherein the at least one temporal specifier describes timing of at least one event, which event relates to a topic and is mentioned in the document;
- determining a document-specific point of time (14) for the at least one event based on the temporal specifier extracted from the document;
- classifying the document into one or more topical fields based on occurrences of keywords in the document;
- generating a prediction by determining a time range (20) of at least one event (16), which relates to a selected topic, based on a concentration of occurrences of the document-specific points of time (14) of the selected topic (16); and
- presenting at least one document (12) relating to the event (16) of the selected topic, based on correlation of the document-specific point of time (14) of the document (12), wherein the correlation is computed from the time range (20) of at least one event (16) and document credibility value of the document.
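The step of determining a time range from a concentration of document-specific points of time can be illustrated with a sliding-window sketch. The claim does not state how concentration is measured; the fixed-width window, the numeric representation of points of time, and the name `densest_time_range` are assumptions.

```python
def densest_time_range(points, window):
    """Return (start, end) of the span, at most `window` wide, holding the most points.

    points: document-specific points of time as numbers (e.g. years).
    window: maximum width of the span in the same unit (an assumed parameter).
    """
    if not points:
        raise ValueError("at least one point of time is required")
    ordered = sorted(points)
    best_count, best_range = 0, (ordered[0], ordered[0])
    left = 0
    for right, value in enumerate(ordered):
        # Advance the left edge until the span fits within `window`.
        while value - ordered[left] > window:
            left += 1
        count = right - left + 1
        if count > best_count:
            best_count, best_range = count, (ordered[left], value)
    return best_range


# Six documents mention an event; four of them cluster around 2020..2022.
time_range = densest_time_range([2017, 2020, 2021, 2021, 2022, 2030], window=2)
```

Outliers such as 2017 and 2030 fall outside the densest span and so do not widen the predicted range.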
12. A tangible memory for a processor, which comprises at least one processing unit and input-output peripherals for communicating with data sources over one or more networks, the tangible memory comprising program code instructions which, when executed by the at least one processing unit, cause the processor to carry out the following steps:
- registering a plurality of keywords, each of which characterizes a selected topic;
- classifying each of the plurality of keywords into a pre-selected topical field;
- collecting documents (12) from the data sources, wherein each of the data sources has an associated source credibility value;
- determining for each collected document an associated document credibility value by using a set of pre-selected criteria;
- for each of several documents (12):
- extracting from the document at least one temporal specifier and at least one keyword by a linguistic analysis, wherein the at least one temporal specifier describes timing of at least one event, which event relates to a topic and is mentioned in the document;
- determining a document-specific point of time (14) for the at least one event based on the temporal specifier extracted from the document;
- classifying the document into one or more topical fields based on occurrences of keywords in the document;
- generating a prediction by determining a time range (20) of at least one event (16), which relates to a selected topic, based on a concentration of occurrences of the document-specific points of time (14) of the selected topic (16); and
- presenting at least one document (12) relating to the event (16) of the selected topic, based on correlation of the document-specific point of time (14) of the document (12), wherein the correlation is computed from the time range (20) of at least one event (16) and document credibility value of the document.
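The correlation used when presenting documents is stated to depend on the time range and the document credibility value, without a formula. One possible reading is sketched below; the closeness function, the `tolerance` parameter, and the name `temporal_correlation` are assumptions.

```python
def temporal_correlation(point, time_range, credibility, tolerance=1.0):
    """Assumed score: credibility scaled by closeness of the point to the range.

    point: document-specific point of time (number, e.g. a year).
    time_range: (start, end) of the predicted range in the same unit.
    credibility: document credibility value, assumed to lie in 0..1.
    tolerance: distance at which closeness halves (an assumed parameter).
    """
    start, end = time_range
    if start <= point <= end:
        distance = 0.0
    else:
        distance = min(abs(point - start), abs(point - end))
    closeness = 1.0 / (1.0 + distance / tolerance)
    return credibility * closeness


inside = temporal_correlation(2021, (2020, 2022), credibility=0.8)
outside = temporal_correlation(2024, (2020, 2022), credibility=0.8)
```

A document whose point of time lies inside the range keeps its full credibility as its score, while one lying outside is discounted by its distance from the nearer boundary.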
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| FI20155271 | 2015-04-13 | | |
| FI20155271 | 2015-04-13 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2016166416A1 (en) | 2016-10-20 |
Family
ID=57127172
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/FI2016/050243 (WO2016166416A1, ceased) | 2015-04-13 | 2016-04-13 | Method and system for temporal predictions |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2016166416A1 (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20080040321A1 (en) * | 2006-08-11 | 2008-02-14 | Yahoo! Inc. | Techniques for searching future events |
| US20080086363A1 (en) * | 2006-10-06 | 2008-04-10 | Accenture Global Services Gmbh | Technology event detection, analysis, and reporting system |
| US20100299324A1 (en) * | 2009-01-21 | 2010-11-25 | Truve Staffan | Information service for facts extracted from differing sources on a wide area network |
| US20140006328A1 (en) * | 2012-06-29 | 2014-01-02 | Yahoo! Inc. | Method or system for ranking related news predictions |
Non-Patent Citations (1)
| Title |
|---|
| JOHANSSON F. ET AL.: "Detecting Emergent Conflicts through Web Mining and Visualization", 2011 EUROPEAN INTELLIGENCE AND SECURITY INFORMATICS CONFERENCE (EISIC), 12 September 2011 (2011-09-12), pages 346 - 353, XP032066153, ISBN: 978-1-4577-1464-1 * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10824682B2 (en) | Enhanced online user-interaction tracking and document rendition | |
| Bornmann et al. | The application of bibliometrics to research evaluation in the humanities and social sciences: An exploratory study using normalized Google Scholar data for the publications of a research institute | |
| US11222310B2 (en) | Automatic tagging for online job listings | |
| JP4809441B2 (en) | Estimating search category synonyms from user logs | |
| US7783630B1 (en) | Tuning of relevancy ranking for federated search | |
| US9535911B2 (en) | Processing a content item with regard to an event | |
| US8131705B2 (en) | Relevancy scoring using query structure and data structure for federated search | |
| US8990241B2 (en) | System and method for recommending queries related to trending topics based on a received query | |
| US8051088B1 (en) | Document analysis | |
| US9235563B2 (en) | Systems and processes for identifying features and determining feature associations in groups of documents | |
| CN1777892A (en) | Navigate within websites and similar sources of information | |
| CN107967290A (en) | A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data | |
| US20060230005A1 (en) | Empirical validation of suggested alternative queries | |
| US8914359B2 (en) | Ranking documents with social tags | |
| Jepsen et al. | Characteristics of scientific Web publications: Preliminary data gathering and analysis | |
| Kavila et al. | An automatic legal document summarization and search using hybrid system | |
| US20150193444A1 (en) | System and method to determine social relevance of Internet content | |
| van Hoof et al. | Googling politics? Comparing five computational methods to identify political and news-related searches from web browser histories | |
| WO2016166416A1 (en) | Method and system for temporal predictions | |
| KR101418744B1 (en) | System and method for searching weak signal | |
| Jain et al. | Organizing query completions for web search | |
| CN109902099B (en) | Public opinion tracking method and device based on graphic and text big data and computer equipment | |
| van Hoofa et al. | Googling Politics? The Computational Identification of Political and News-related Searches from Web Browser Histories | |
| Van den Hoven et al. | Beyond reported history: Strikes that never happened | |
| US20150046437A1 (en) | Search Method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 16779664; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 20/12/2017) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 16779664; Country of ref document: EP; Kind code of ref document: A1 |