
CN111159361B - Method and device for acquiring article and electronic equipment - Google Patents

Method and device for acquiring article and electronic equipment

Info

Publication number
CN111159361B
Authority
CN
China
Prior art keywords
articles
article
keywords
word
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911422953.6A
Other languages
Chinese (zh)
Other versions
CN111159361A (en
Inventor
徐磊
袁力
邸烁
胡坤歌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Aershan Block Chain Alliance Technology Co ltd
Original Assignee
Beijing Aershan Block Chain Alliance Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Aershan Block Chain Alliance Technology Co ltd
Priority to CN201911422953.6A
Publication of CN111159361A
Application granted
Publication of CN111159361B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and a device for acquiring articles, and an electronic device. The method comprises the following steps: acquiring keywords of a designated tonality; searching according to the keywords to obtain articles corresponding to the keywords; performing word segmentation on the articles to obtain word segmentation files of the articles, wherein each word segmentation file comprises a plurality of word sequences of the article; comparing each word sequence in the word segmentation file with the keywords of the designated tonality, and calculating the similarity between the word sequences and the keywords; selecting a specified number of word sequences with the highest similarity as target keywords; and continuing to search according to the target keywords to obtain articles corresponding to the target keywords until the number of articles reaches a preset threshold, and storing the searched articles in an article database. The application automatically acquires articles of a user-designated tonality through crawler technology, which is efficient and improves the user experience.

Description

Method and device for acquiring article and electronic equipment
Technical Field
The application relates to the technical field of web crawlers, and in particular to a method and a device for acquiring articles and an electronic device.
Background
At present, Internet articles are rich in variety, novel in content, and huge in volume; new media websites emerge one after another, and media content takes many different forms. Different users have different reading needs: each user prefers to read articles and media of a specific tonality, and automatically pushing articles of that tonality to users is a main task of much media software. Existing methods mainly obtain articles of a user-specified tonality through the Word2Vec tool. Although such methods can obtain these articles, they are inefficient, which results in a poor reading experience for the user.
Disclosure of Invention
In view of the above, the present application aims to provide a method and an apparatus for acquiring articles, and an electronic device, which automatically acquire articles of a user-designated tonality through crawler technology and are efficient, thereby improving the user experience.
In a first aspect, an embodiment of the present application provides a method for acquiring an article, applied to a server, where the method includes:
acquiring keywords of a designated tonality;
searching according to the keywords to obtain articles corresponding to the keywords;
performing word segmentation processing on the article to obtain a word segmentation file of the article; wherein the word segmentation file comprises a plurality of word sequences of the article;
comparing each word sequence in the word segmentation file with the keywords of the designated tonality, and calculating the similarity between the word sequences and the keywords;
selecting a specified number of word sequences with highest similarity as target keywords;
and continuing searching according to the target keywords to obtain articles corresponding to the target keywords until the number of the articles reaches a preset threshold, and storing the searched articles into an article database.
With reference to the first aspect, an embodiment of the present application provides a first possible implementation manner of the first aspect, where the step of comparing each word sequence in the word segmentation file with the keywords of a designated tonality, and calculating a similarity between the word sequence and the keywords includes:
inputting the word segmentation file into a pre-trained word training model to output word vectors of each word sequence;
and respectively calculating the similarity between the word sequence and the keywords through the word vector and the keyword vector of the keywords with designated tonality.
With reference to the first possible implementation manner of the first aspect, the embodiment of the present application provides a second possible implementation manner of the first aspect, where after inputting the word segmentation file into a pre-trained word training model, the method further includes:
outputting article vectors of articles corresponding to the word segmentation files through the pre-trained word training model;
and calculating the similarity between the searched articles and the articles stored in the article database according to the article vector, so as to check whether the searched articles are duplicates.
With reference to the first aspect, an embodiment of the present application provides a third possible implementation manner of the first aspect, where the step of searching according to the keyword to obtain an article corresponding to the keyword includes:
acquiring a designated website address input by a user;
searching on the website corresponding to the designated website address according to the keyword to obtain an article corresponding to the keyword.
With reference to the third possible implementation manner of the first aspect, the embodiment of the present application provides a fourth possible implementation manner of the first aspect, wherein the step of searching, on the website corresponding to the specified website address, according to the keyword to obtain an article corresponding to the keyword further includes:
inputting the keywords into a preset crawler program;
searching on the website corresponding to the designated website address through the crawler program according to the keywords to obtain articles corresponding to the keywords.
In a second aspect, an embodiment of the present application further provides an apparatus for acquiring an article, which is applied to a server, where the apparatus includes:
the acquisition module is used for acquiring keywords of a designated tonality;
the first search module is used for searching according to the keywords to obtain articles corresponding to the keywords;
the processing module is used for carrying out word segmentation processing on the article to obtain a word segmentation file of the article; wherein the word segmentation file comprises a plurality of word sequences of the article;
the calculation module is used for comparing each word sequence in the word segmentation file with the keywords of the designated tonality, and calculating the similarity between the word sequences and the keywords;
the selection module is used for selecting the specified number of word sequences with highest similarity as target keywords;
and the second searching module is used for continuing searching according to the target keywords to obtain articles corresponding to the target keywords until the number of the articles reaches a preset threshold value, and storing the searched articles into an article database.
With reference to the second aspect, an embodiment of the present application provides a first possible implementation manner of the second aspect, where the calculation module is further configured for:
inputting the word segmentation file into a pre-trained word training model to output word vectors of each word sequence;
and respectively calculating the similarity between the word sequence and the keywords through the word vector and the keyword vector of the keywords with designated tonality.
With reference to the first possible implementation manner of the second aspect, an embodiment of the present application provides a second possible implementation manner of the second aspect, where after the word segmentation file is input into the pre-trained word training model, the apparatus is further configured for:
outputting article vectors of articles corresponding to the word segmentation files through the pre-trained word training model;
and calculating the similarity between the searched articles and the articles stored in the article database according to the article vector, so as to check whether the searched articles are duplicates.
In a third aspect, an embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the steps of the method for acquiring an article according to the first aspect are implemented when the processor executes the computer program.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of obtaining an article according to the first aspect.
The embodiment of the application has the following beneficial effects:
The embodiments of the application provide a method and a device for acquiring articles, and an electronic device. The method comprises the following steps: acquiring keywords of a designated tonality; searching according to the keywords to obtain articles corresponding to the keywords; performing word segmentation on the articles to obtain word segmentation files of the articles, wherein each word segmentation file comprises a plurality of word sequences of the article; comparing each word sequence in the word segmentation file with the keywords of the designated tonality, and calculating the similarity between the word sequences and the keywords; selecting a specified number of word sequences with the highest similarity as target keywords; and continuing to search according to the target keywords to obtain articles corresponding to the target keywords until the number of articles reaches a preset threshold, and storing the searched articles in an article database. The application automatically acquires articles of a user-designated tonality through crawler technology, which is efficient and improves the user experience.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and drawings.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for acquiring an article according to an embodiment of the present application;
FIG. 2 is a flowchart of another method for acquiring an article according to an embodiment of the present application;
FIG. 3 is a flowchart of another method for acquiring an article according to an embodiment of the present application;
FIG. 4 is a flowchart of another method for acquiring an article according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for acquiring an article according to an embodiment of the present application.
Reference numerals:
10-an acquisition module; 20-a first search module; 30-a processing module; 40-a calculation module; 50-a selection module; 60-a second search module.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Currently, Internet articles are rich in variety, novel in content, and huge in volume; new media websites emerge one after another, and media content takes many different forms. Different users therefore have different reading needs, each preferring articles and media of a particular tonality. For example, positive-energy media content is generally well received by the general public. How to obtain articles of a user-designated tonality through semantic analysis is thus a major task for much media software.
In the field of natural language processing, text similarity has long been a popular research direction. The main existing tool for linguistic and semantic similarity is Word2Vec: sentences are segmented and all words are stored, and a neural network then learns a word vector for each word in an unsupervised manner using the skip-gram model and negative sampling. This finally yields the word vector of each word and a description vector of the article content, so that the similarity of two words or two articles can be judged by comparing their word vectors or content vectors.
However, although the Word2Vec tool can obtain articles of a user-specified tonality, it is inefficient, which results in a poor reading experience for the user. To address this technical problem, the embodiments of the present application provide a method and an apparatus for acquiring articles, and an electronic device, which automatically acquire articles of a user-designated tonality through crawler technology.
In practical applications, as web crawler technology has matured, high-performance crawlers have taken increasingly varied forms and can crawl the content of all public websites on the Internet. In the present application, the crawler is scheduled through a queue rather than through functions calling themselves: self-calls easily lead to deep recursion, which occupies a large amount of memory and degrades system performance. The crawler in the embodiments of the present application therefore follows queue scheduling as its basic design, which makes it efficient and improves the user experience.
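The queue-based scheduling described above can be illustrated with a minimal Python sketch. It is not the crawler of the embodiments: the function name `crawl`, the `fetch_links` callback, the `max_pages` cap, and the toy link graph `SITE` are all assumptions introduced for illustration, and no real HTTP requests are made.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(start_urls: Iterable[str],
          fetch_links: Callable[[str], list[str]],
          max_pages: int = 100) -> list[str]:
    """Visit pages using an explicit FIFO queue instead of recursive self-calls."""
    queue = deque(start_urls)   # pages waiting to be visited
    seen = set(queue)           # pages already queued, to avoid revisiting
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for real page fetching (hypothetical data).
SITE = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
print(crawl(["a"], lambda u: SITE.get(u, [])))
```

Because pending pages live in the queue rather than on the call stack, memory use grows with the number of queued URLs and not with the depth of the link chain.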
For the sake of understanding the present embodiment, a detailed description is first given below of a method for acquiring an article according to an embodiment of the present application.
Embodiment one:
the embodiment of the application provides a method for acquiring an article, which is applied to a server. Fig. 1 is a flowchart of a method for acquiring an article according to an embodiment of the present application, as shown in fig. 1, the method includes the following steps:
Step S102, acquiring keywords of a designated tonality;
The designated tonality is the style orientation and writing style of an article, and the keywords are words that express the designated tonality of the article. For a positive-energy tonality, the keywords may be words such as "efficient", "effective", and "learning"; for an optimistic tonality, the keywords may be words such as "optimistic", "active", and "lively"; for a melancholy tonality, the keywords may be words such as "sinking", "falling", and "depression". In practical applications, the user can set the designated tonality of the articles and the keywords of that tonality, which is not limited in the embodiments of the present application.
Step S104, searching is carried out according to the keywords, and articles corresponding to the keywords are obtained;
After the designated tonality and its keywords have been set according to the user's reading preference, a search is performed according to the keywords using crawler technology to obtain the articles corresponding to the keywords. For example, a website is automatically searched for "depression" using crawler technology, so that articles corresponding to "depression", that is, melancholy articles, are obtained.
Step S106, performing word segmentation processing on the articles to obtain word segmentation files of the articles; wherein the word segmentation file comprises a plurality of word sequences of the articles;
Optionally, word segmentation splits the article content into individual words according to the rules of the Chinese language, yielding the word segmentation file of the article. For example, segmenting the sentence "I love Beijing" yields a word segmentation file comprising "I", "love", and "Beijing". In practical applications, the specific word segmentation method and rules can be set according to the actual situation, which is not limited in the embodiments of the present application.
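As an illustration of word segmentation, the sketch below uses forward maximum matching against a small dictionary. The text does not name a segmentation algorithm, so this choice, the function name `segment`, the toy vocabulary `VOCAB`, and the sentence 我爱北京 (assumed here to be the Chinese sentence behind the example above) are all assumptions.

```python
def segment(text: str, vocabulary: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the vocabulary matches.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


VOCAB = {"我", "爱", "北京"}  # hypothetical toy dictionary
print(segment("我爱北京", VOCAB))
```

A production system would more likely rely on an established segmentation library with a full dictionary; the sketch only shows the shape of the input and output.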
Step S108, each word sequence in the word segmentation file is respectively compared with the keywords of the designated tonality, and the similarity between the word sequences and the keywords is calculated;
Specifically, after the word segmentation file is obtained, a word vector can be obtained for each word sequence. A word vector is a multidimensional real-valued vector that describes a word; the higher its dimension, the more finely it can represent the meaning of the word. Word vectors can be obtained by unsupervised training based on a neural network or a regression algorithm. Similarly, a keyword vector can be obtained for the keywords of the designated tonality, so that the similarity between a word sequence and the keywords can be calculated from the word vector and the keyword vector. The similarity is generally expressed by the angle between the word vector and the keyword vector: the smaller the angle, the higher the similarity, and the larger the angle, the lower the similarity.
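The angle-based similarity described above is commonly computed as the cosine of the angle between the two vectors, cos θ = (u · v) / (|u| |v|). The following sketch assumes that measure; the function name and the convention of returning 0.0 for a zero vector are illustrative choices, not part of the described method.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v: 1.0 for the same direction, 0.0 when orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


word_vector = [0.2, 0.8, 0.1]      # hypothetical word vector
keyword_vector = [0.25, 0.7, 0.0]  # hypothetical keyword vector
print(cosine_similarity(word_vector, keyword_vector))
```

A smaller angle gives a cosine closer to 1, which matches the statement that a smaller angle means higher similarity.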
Step S110, selecting a specified number of word sequences with highest similarity as target keywords;
Specifically, the specified number of word sequences with the highest similarity replace the originally set keywords of the designated tonality and become the target keywords, and the search is performed again through the crawler technology.
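Selecting the specified number of most similar word sequences is a top-N selection. A minimal sketch follows, assuming the similarities have already been computed and collected in a dictionary; the function name and the sample scores are hypothetical.

```python
import heapq


def select_target_keywords(similarities: dict[str, float], count: int) -> list[str]:
    """Return the `count` words with the highest similarity, best first."""
    return heapq.nlargest(count, similarities, key=similarities.get)


# Hypothetical similarity scores between word sequences and the keywords.
scores = {"optimistic": 0.91, "weather": 0.12, "lively": 0.84, "table": 0.05}
print(select_target_keywords(scores, 2))
```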
And step S112, searching is continued according to the target keywords to obtain articles corresponding to the target keywords until the number of the articles reaches a preset threshold value, and the searched articles are stored in an article database.
At this point, the designated website is searched again through the crawler technology to obtain the articles corresponding to the target keywords. When the number of articles reaches the preset threshold, the searched articles are stored in the article database and displayed for the user to read. The embodiments of the present application thus automatically acquire articles of the designated tonality through crawler technology; because the crawler follows queue scheduling as its basic design, the method is efficient and improves the user experience.
The method for acquiring an article provided by the embodiments of the application comprises the following steps: acquiring keywords of a designated tonality; searching according to the keywords to obtain articles corresponding to the keywords; performing word segmentation on the articles to obtain word segmentation files of the articles, wherein each word segmentation file comprises a plurality of word sequences of the article; comparing each word sequence in the word segmentation file with the keywords of the designated tonality, and calculating the similarity between the word sequences and the keywords; selecting a specified number of word sequences with the highest similarity as target keywords; and continuing to search according to the target keywords to obtain articles corresponding to the target keywords until the number of articles reaches a preset threshold, and storing the searched articles in an article database. The application automatically acquires articles of a user-designated tonality through crawler technology, which is efficient and improves the user experience.
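The overall loop of steps S102 to S112 might be sketched as follows. This is an illustrative outline under several assumptions, not the implementation of the embodiments: `search`, `segment`, and `similarity` are injected stand-ins for the crawler, the word segmenter, and the vector comparison; candidate words are scored against the original keywords; words already used as keywords are skipped so the loop makes progress; and `max_rounds` is an added safety cap.

```python
from typing import Callable, Iterable


def acquire_articles(
    seed_keywords: Iterable[str],
    search: Callable[[list[str]], list[str]],
    segment: Callable[[str], list[str]],
    similarity: Callable[[str, str], float],
    top_n: int,
    threshold: int,
    max_rounds: int = 10,  # assumption: safety cap, not part of the described method
) -> list[str]:
    """Search, score the words found, pick new target keywords, and repeat."""
    seeds = list(seed_keywords)
    keywords = list(seeds)
    used = set(seeds)
    collected: list[str] = []
    for _ in range(max_rounds):
        new_articles = [a for a in search(keywords) if a not in collected]
        if not new_articles:
            break
        collected.extend(new_articles)
        if len(collected) >= threshold:
            break  # enough articles; the caller would now store them
        scores: dict[str, float] = {}
        for article in new_articles:
            for word in segment(article):
                if word not in used:
                    scores[word] = max(similarity(word, k) for k in seeds)
        if not scores:
            break
        # The most similar words become the target keywords for the next round.
        keywords = sorted(scores, key=scores.get, reverse=True)[:top_n]
        used.update(keywords)
    return collected


# Toy stand-ins (hypothetical data) so the sketch can run without a network.
CORPUS = {
    "optimistic": ["optimistic people stay cheerful"],
    "cheerful": ["cheerful songs feel lively"],
    "lively": ["lively markets attract visitors"],
}
RELATED = {"optimistic", "cheerful", "lively"}


def toy_search(keywords: list[str]) -> list[str]:
    return [a for k in keywords for a in CORPUS.get(k, [])]


def toy_similarity(word: str, keyword: str) -> float:
    return 0.8 if word in RELATED else 0.1


result = acquire_articles(["optimistic"], toy_search, str.split,
                          toy_similarity, top_n=1, threshold=3)
print(result)
```

In the toy run, each round discovers one new related word, which becomes the next search keyword until three articles have been collected.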
On the basis of fig. 1, the embodiment of the present application provides another method for acquiring an article, and fig. 2 is a flowchart of another method for acquiring an article provided by the embodiment of the present application, as shown in fig. 2, where the method includes the following steps:
Step S202, acquiring keywords of a designated tonality;
step S204, searching is carried out according to the keywords, and articles corresponding to the keywords are obtained;
step S206, performing word segmentation processing on the articles to obtain word segmentation files of the articles; wherein the word segmentation file comprises a plurality of word sequences of the articles;
the steps S202 to S206 may refer to the steps S102 to S106, and the embodiments of the present application are not described in detail herein.
Step S208, inputting the word segmentation file into a pre-trained word training model to output word vectors of each word sequence;
specifically, the word segmentation file is input into a pre-trained word training model, so that a word vector of each word sequence is obtained, and the similarity between the word sequence and the keywords is calculated through the word vector and the keyword vector.
In addition, after the word segmentation file is input into the pre-trained word training model, the method further comprises the following steps: outputting article vectors of the articles corresponding to the word segmentation files through the pre-trained word training model; and calculating the similarity between the searched articles and the articles stored in the article database according to the article vectors, so as to check whether the searched articles are duplicates. An article vector is similar to a word vector: it is a multidimensional real-valued vector that describes the content of an article.
Step S210, calculating the similarity between the word sequence and the keywords through the word vector and the keyword vector of the keywords with designated tonality;
Specifically, the similarity between a word sequence and the keywords can be obtained from the angle between the word vector and the keyword vector. Similarly, the embodiments of the application can obtain the similarity between a searched article and the articles stored in the article database from the angle between the article vector and the vectors of the stored articles. The searched article can thus be checked for duplication, which prevents duplicate articles from being sent to the user and improves the user's reading experience.
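The duplicate check on article vectors can be sketched in the same way. The text does not state at what similarity two articles count as duplicates, so the cutoff of 0.95, the function names, and the sample vectors below are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_duplicate(candidate: Sequence[float],
                 stored_vectors: Sequence[Sequence[float]],
                 cutoff: float = 0.95) -> bool:
    """True if the candidate article vector is nearly parallel to any stored one."""
    return any(cosine_similarity(candidate, s) >= cutoff for s in stored_vectors)


# Hypothetical article vectors already in the article database.
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(is_duplicate([0.99, 0.01, 0.0], stored))  # nearly identical to the first stored vector
print(is_duplicate([0.5, 0.5, 0.7], stored))    # points in a clearly different direction
```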
Step S212, selecting a specified number of word sequences with highest similarity as target keywords;
step S214, searching is continued according to the target keywords to obtain articles corresponding to the target keywords until the number of the articles reaches a preset threshold, and the searched articles are stored in an article database.
Further, on the basis of fig. 1, another method for acquiring an article is provided in the embodiment of the present application, and fig. 3 is a flowchart of another method for acquiring an article provided in the embodiment of the present application, as shown in fig. 3, where the method includes the following steps:
Step S302, acquiring keywords of a designated tonality;
step S304, acquiring a designated website address input by a user;
step S306, searching according to the keywords on the websites corresponding to the appointed website addresses to obtain articles corresponding to the keywords;
Specifically, the keywords are first input into a preset crawler program; the crawler program then searches the website corresponding to the designated website address according to the keywords to obtain the articles corresponding to the keywords. In practical applications, after the keywords of the designated tonality are obtained, the user inputs the designated website address, and the crawler program automatically searches the website corresponding to that address according to the keywords. The designated website address may cover all Internet addresses or be the address of an individual article-reading website.
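Restricting the keyword search to the designated website might look like the sketch below. It assumes the crawler has already downloaded page text into a mapping from URL to content; the function name `search_site`, the host comparison, the substring match, and the example URLs are all hypothetical.

```python
from urllib.parse import urlparse


def search_site(site_address: str, keywords: list[str],
                pages: dict[str, str]) -> list[str]:
    """Return URLs on the designated site whose text contains any of the keywords."""
    host = urlparse(site_address).netloc
    return [
        url for url, text in pages.items()
        if urlparse(url).netloc == host and any(k in text for k in keywords)
    ]


# Hypothetical pages already fetched by the crawler program.
pages = {
    "https://example.com/a": "an efficient way to learn",
    "https://example.com/b": "weather report",
    "https://other.example/c": "efficient cooking",
}
print(search_site("https://example.com", ["efficient", "learning"], pages))
```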
Step S308, performing word segmentation processing on the articles to obtain word segmentation files of the articles; wherein the word segmentation file comprises a plurality of word sequences of the articles;
Step S310, each word sequence in the word segmentation file is respectively compared with the keywords of the designated tonality, and the similarity between the word sequences and the keywords is calculated;
step S312, selecting a specified number of word sequences with highest similarity as target keywords;
step S314, searching is continued according to the target keywords to obtain articles corresponding to the target keywords until the number of the articles reaches a preset threshold, and the searched articles are stored in an article database.
The steps S308 to S314 may refer to the steps S106 to S112, and the embodiments of the present application are not described in detail herein.
For ease of understanding, an example is given here. As shown in FIG. 4, the user chooses to read positive-energy articles. The user first sets positive-energy keywords, such as "efficient" and "learning", and also sets the address of the designated website, such as the address of a current mainstream article website. The initial parameters of the crawler program are then configured. Next, the crawler program searches the website corresponding to the designated website address according to the keywords to obtain articles related to the positive-energy keywords, and the searched articles are downloaded to a local database. Word segmentation is then performed on the searched articles to obtain word segmentation files, each comprising a plurality of word sequences of the article. The word segmentation files are input into the pre-trained word training model to obtain the word vector of each word sequence and the article vector of the article corresponding to each word segmentation file; the similarity between the word sequences and the keywords, and the similarity between the searched articles and the articles stored in the article database, are calculated and stored. Finally, the specified number of word sequences with the highest similarity are selected as target keywords to replace the original positive-energy keywords, and the crawler program searches the website corresponding to the designated website address again until the number of articles reaches the preset threshold, whereupon the searched articles are stored in the article database. The preset threshold may be 3 or any other value and may be set according to the actual situation, which is not limited in the embodiments of the present application.
On the basis of the foregoing embodiment, an embodiment of the present application further provides a device for acquiring an article, applied to a server. Fig. 5 is a schematic diagram of a device for acquiring an article according to an embodiment of the present application. As shown in fig. 5, the device includes:
an acquisition module 10, configured to acquire keywords of a specified tonality;
a first search module 20, configured to search according to the keywords to obtain articles corresponding to the keywords;
a processing module 30, configured to perform word segmentation processing on the articles to obtain word segmentation files of the articles, wherein each word segmentation file comprises a plurality of word sequences of an article;
a calculation module 40, configured to compare each word sequence in the word segmentation file with the keywords of the specified tonality and calculate the similarity between the word sequence and the keywords;
a selection module 50, configured to select a specified number of word sequences with the highest similarity as target keywords;
a second search module 60, configured to continue searching according to the target keywords to obtain articles corresponding to the target keywords until the number of articles reaches a preset threshold, and to store the searched articles in the article database.
Further, the calculation module 40 is further configured to:
input the word segmentation file into a pre-trained word training model to output the word vector of each word sequence;
and calculate the similarity between each word sequence and the keywords from the word vector and the keyword vector of the keywords of the specified tonality.
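The similarity formula is not stated at this point. The claims refer to the included angle between vectors, so one common reading, offered here only as an assumption, is cosine similarity. With $\mathbf{w}$ the word vector of a word sequence, $\mathbf{k}$ the keyword vector, and $n$ the vector dimension (all symbols introduced for illustration):

```latex
\operatorname{sim}(\mathbf{w}, \mathbf{k})
  = \cos\theta
  = \frac{\mathbf{w} \cdot \mathbf{k}}{\lVert \mathbf{w} \rVert \, \lVert \mathbf{k} \rVert}
  = \frac{\sum_{i=1}^{n} w_i k_i}
         {\sqrt{\sum_{i=1}^{n} w_i^{2}} \, \sqrt{\sum_{i=1}^{n} k_i^{2}}}
```

For instance, with $\mathbf{w} = (1, 0)$ and $\mathbf{k} = (1, 1)$ the similarity is $1/\sqrt{2} \approx 0.707$; vectors pointing in the same direction score 1 and orthogonal vectors score 0.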
Further, after the word segmentation file is input into the pre-trained word training model, the device is further configured to:
output, through the pre-trained word training model, the article vectors of the articles corresponding to the word segmentation files;
and calculate, according to the article vectors, the similarity between the searched articles and the articles stored in the article database, so as to judge whether the searched articles are duplicates.
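The duplicate check on article vectors might look like the sketch below. It rests on assumptions: the source says only that the pre-trained model outputs an article vector, so averaging word vectors in `article_vector` is a simple stand-in, and the function names, the `min_similarity` cutoff of 0.95, and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def article_vector(
    words: List[str], word_vectors: Dict[str, Vector], dim: int
) -> List[float]:
    """Average the vectors of known words (an assumed stand-in for the model)."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def is_duplicate(
    candidate: Vector, stored: List[Vector], min_similarity: float = 0.95
) -> bool:
    """Treat the candidate as a duplicate if any stored vector is close enough."""
    return any(cosine(candidate, s) >= min_similarity for s in stored)


# Toy vectors, purely illustrative.
WORD_VECTORS = {
    "learning": (1.0, 0.0), "habits": (0.8, 0.2),
    "storm": (0.0, 1.0), "warning": (0.1, 0.9),
}
stored_vectors = [article_vector(["learning", "habits"], WORD_VECTORS, 2)]
reordered = article_vector(["habits", "learning"], WORD_VECTORS, 2)
unrelated = article_vector(["storm", "warning"], WORD_VECTORS, 2)
```

Here `reordered` would be flagged as a duplicate of the stored article, while `unrelated` would not. In practice the cutoff would need tuning against real article vectors.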
The device for acquiring an article provided by the embodiment of the present application performs the following operations: acquiring keywords of a specified tonality; searching according to the keywords to obtain articles corresponding to the keywords; performing word segmentation processing on the articles to obtain word segmentation files of the articles, wherein each word segmentation file comprises a plurality of word sequences of an article; comparing each word sequence in the word segmentation file with the keywords of the specified tonality and calculating the similarity between the word sequence and the keywords; selecting a specified number of word sequences with the highest similarity as target keywords; and continuing to search according to the target keywords to obtain articles corresponding to the target keywords until the number of articles reaches a preset threshold, and storing the searched articles in an article database. According to the present application, articles of the user-specified tonality are acquired automatically through crawler technology, which is efficient and improves the user experience.
An embodiment of the present application further provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for acquiring an article provided by the foregoing embodiment.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the method for acquiring an article of the foregoing embodiment.
The computer program product provided by the embodiment of the present application includes a computer-readable storage medium storing program code. The instructions included in the program code may be used to perform the method described in the foregoing method embodiment; for the specific implementation, refer to the method embodiment, which is not repeated here.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding process in the foregoing method embodiment for the specific working process of the apparatus described above, which is not described herein again.
In addition, in the description of the embodiments of the present application, unless otherwise explicitly specified and limited, the terms "mounted," "coupled," and "connected" are to be construed broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be internal communication between two elements. Those of ordinary skill in the art will understand the specific meaning of the above terms in the present application according to the specific circumstances.
If the functions are implemented as software functional units and sold or used as a stand-alone product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In the description of the present application, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present application and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present application. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific implementations of the present application and are not intended to limit its scope; the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present application, any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be included in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (6)

1. A method for obtaining an article, the method being applied to a server, the method comprising:
acquiring keywords of a specified tonality;
searching according to the keywords to obtain articles corresponding to the keywords;
performing word segmentation processing on the article to obtain a word segmentation file of the article; wherein the word segmentation file comprises a plurality of word sequences of the article;
inputting the word segmentation file into a pre-trained word training model to output word vectors of each word sequence;
calculating the similarity between the word sequence and the keywords through the word vector and the keyword vector of the keywords of the specified tonality;
outputting article vectors of articles corresponding to the word segmentation files through the pre-trained word training model;
calculating the similarity between the searched articles and the articles stored in the article database according to the article vector, so as to judge whether the searched articles are duplicates;
selecting a specified number of word sequences with highest similarity as target keywords;
and continuing to search according to the target keywords to obtain articles corresponding to the target keywords until the number of the articles reaches a preset threshold, and storing the searched articles into the article database, wherein the similarity between the searched articles and the articles stored in the article database is obtained from the included angle between the article vectors and the vectors of the articles stored in the article database, so that the searched articles are judged for duplication and duplicate articles are prevented from being sent to a user for reading.
2. The method of claim 1, wherein the step of searching according to the keyword to obtain the article corresponding to the keyword comprises:
acquiring a designated website address input by a user;
searching on the website corresponding to the designated website address according to the keyword to obtain an article corresponding to the keyword.
3. The method of claim 2, wherein the step of searching on the website corresponding to the specified website address according to the keyword to obtain the article corresponding to the keyword further comprises:
inputting the keywords into a preset crawler program;
searching on the website corresponding to the designated website address through the crawler program according to the keywords to obtain articles corresponding to the keywords.
4. An apparatus for obtaining an article, for application to a server, the apparatus comprising:
the acquisition module is used for acquiring keywords of a specified tonality;
the first search module is used for searching according to the keywords to obtain articles corresponding to the keywords;
the processing module is used for carrying out word segmentation processing on the article to obtain a word segmentation file of the article; wherein the word segmentation file comprises a plurality of word sequences of the article;
the calculation module is used for inputting the word segmentation file into a pre-trained word training model so as to output word vectors of each word sequence; calculating the similarity between the word sequence and the keywords through the word vector and the keyword vector of the keywords of the specified tonality; outputting article vectors of articles corresponding to the word segmentation files through the pre-trained word training model; and calculating the similarity between the searched articles and the articles stored in the article database according to the article vector, so as to judge whether the searched articles are duplicates;
the selection module is used for selecting the specified number of word sequences with highest similarity as target keywords;
and the second search module is used for continuing searching according to the target keywords to obtain articles corresponding to the target keywords until the number of the articles reaches a preset threshold value, and storing the searched articles into the article database, wherein the similarity between the searched articles and the articles stored in the article database is obtained from the included angle between the article vectors and the vectors of the articles stored in the article database, so that the searched articles are judged for duplication and duplicate articles are prevented from being sent to a user for reading.
5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of obtaining articles according to any of the preceding claims 1-3 when the computer program is executed.
6. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the method of obtaining an article according to any of the preceding claims 1-3.
CN201911422953.6A 2019-12-30 2019-12-30 Method and device for acquiring article and electronic equipment Active CN111159361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911422953.6A CN111159361B (en) 2019-12-30 2019-12-30 Method and device for acquiring article and electronic equipment

Publications (2)

Publication Number Publication Date
CN111159361A CN111159361A (en) 2020-05-15
CN111159361B true CN111159361B (en) 2023-10-20

Family

ID=70560716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911422953.6A Active CN111159361B (en) 2019-12-30 2019-12-30 Method and device for acquiring article and electronic equipment

Country Status (1)

Country Link
CN (1) CN111159361B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651674B (en) * 2020-06-03 2023-08-25 北京妙医佳健康科技集团有限公司 Bidirectional searching method and device and electronic equipment
CN112784042B (en) * 2021-01-12 2024-07-26 北京明略软件系统有限公司 Text similarity calculation method and system combining article structure and aggregation word vector
CN112765962B (en) * 2021-01-15 2022-08-30 上海微盟企业发展有限公司 Text error correction method, device and medium
CN113176878B (en) * 2021-06-30 2021-10-08 深圳市维度数据科技股份有限公司 Automatic query method, device and equipment
CN115329051B (en) * 2022-10-17 2022-12-20 成都大学 A multi-view news information fast retrieval method, system, storage medium and terminal

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6647383B1 (en) * 2000-09-01 2003-11-11 Lucent Technologies Inc. System and method for providing interactive dialogue and iterative search functions to find information
CN101059806A (en) * 2007-06-06 2007-10-24 华东师范大学 Word sense based local file searching method
CN104516902A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Semantic information acquisition method and corresponding keyword extension method and search method
CN105095203A (en) * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Methods for determining and searching synonym, and server
CN110019669A (en) * 2017-10-31 2019-07-16 北京国双科技有限公司 A kind of text searching method and device
CN111143516A (en) * 2019-12-30 2020-05-12 广州探途网络技术有限公司 Article search result display method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Weiguo; Xu Weimin. Personalized query expansion model based on latent semantic analysis. Computer Engineering. 2010, (No. 21), full text. *

Also Published As

Publication number Publication date
CN111159361A (en) 2020-05-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant