CN111813711B - Method and device for reading training sample data, storage medium and electronic equipment
- Publication number
- CN111813711B (application CN202010891955.6A)
- Authority
- CN
- China
- Prior art keywords
- target
- data
- sample data
- sequence
- data blocks
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0866—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0893—Caches characterised by their organisation or structure
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
- G06F2212/1021—Hit rate improvement
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The invention discloses a method and a device for reading training sample data in a cloud technology scenario, a storage medium and electronic equipment, and in particular relates to data query in a database scenario. The method includes the following steps: reading sample data blocks from a target data set stored in a storage space according to a first data block sequence, performing model training on a target model according to the read sample data blocks, and caching the read sample data blocks into a target cache space; after the sample data blocks have been read from the target data set according to the first data block sequence, adjusting the first data block sequence to a second data block sequence; and, according to the second data block sequence, reading sample data blocks in the target data set from the target cache space and the storage space, and performing model training on the target model according to the read sample data blocks in the target data set. The invention solves the technical problem of a low data read hit rate.
Description
Technical Field
The invention relates to the field of computers, in particular to a method and a device for reading training sample data, a storage medium and electronic equipment.
Background
In data reading scenarios, the data set is often much larger than the memory, so if the memory alone is used as a cache, the whole data set cannot be cached. To address this problem, the prior art proposes the following two schemes:

In the first scheme, FScache, a single-machine cache system for remote file systems, supports multi-level caching across memory and SSD; it targets deep learning model training, where the data set far exceeds the memory.

In the second scheme, Memcached, a distributed cache system, also supports multi-level caching; through the Memcached protocol it provides cache acceleration and makes effective use of compute-node resources for caching.

Both schemes use classical cache replacement strategies: first-in-first-out (FIFO) or least recently used (LRU).

However, both schemes rely on multi-level memory-and-SSD caching. As the data volume required for deep learning model training keeps growing, the configured SSD cannot satisfy large-data training scenarios, and with a low-capacity SSD neither scheme can guarantee a high data read hit rate.

That is, the prior art suffers from a low data read hit rate.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for reading training sample data, a storage medium and electronic equipment, which at least solve the technical problem of low hit rate of data reading.
According to an aspect of an embodiment of the present invention, there is provided a method for reading training sample data, including: reading sample data blocks from a target data set stored in a storage space according to a first data block sequence, performing model training on a target model according to the read sample data blocks, and caching the read sample data blocks into a target cache space until the sample data blocks cached in the target cache space reach the cache upper limit of the target cache space; after the sample data blocks have been read from the target data set according to the first data block sequence, adjusting the first data block sequence to a second data block sequence, wherein the first data block sequence is different from the second data block sequence; and reading sample data blocks in the target data set from the target cache space and the storage space according to the second data block sequence, and performing model training on the target model according to the read sample data blocks in the target data set, wherein reading the sample data blocks from the target cache space and the storage space according to the second data block sequence includes: reading a sample data block from the target cache space when it is found in the target cache space, and reading it from the storage space when it is not found in the target cache space.
According to another aspect of the embodiments of the present invention, there is also provided an apparatus for reading training sample data, including: a first reading unit, configured to read sample data blocks from a target data set stored in a storage space according to a first data block sequence, perform model training on a target model according to the read sample data blocks, and cache the read sample data blocks into a target cache space until the sample data blocks cached in the target cache space reach the cache upper limit of the target cache space; an updating unit, configured to adjust the first data block sequence to a second data block sequence after the sample data blocks have been read from the target data set according to the first data block sequence, wherein the first data block sequence is different from the second data block sequence; and a second reading unit, configured to read sample data blocks in the target data set from the target cache space and the storage space according to the second data block sequence, and perform model training on the target model according to the read sample data blocks in the target data set, wherein reading the sample data blocks from the target cache space and the storage space according to the second data block sequence includes: reading a sample data block from the target cache space when it is found in the target cache space, and reading it from the storage space when it is not found in the target cache space.
According to still another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, where the computer program is configured to execute the above-mentioned reading method of training sample data when running.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the method for reading the training sample data by using the computer program.
In the embodiment of the invention, sample data blocks are read from a target data set stored in a storage space according to a first data block sequence, model training is performed on a target model according to the read sample data blocks, and the read sample data blocks are cached into a target cache space until the sample data blocks cached there reach the cache upper limit of the target cache space; after the sample data blocks have been read according to the first data block sequence, the first data block sequence is adjusted to a different second data block sequence; sample data blocks in the target data set are then read from the target cache space and the storage space according to the second data block sequence and used for model training of the target model, a block being read from the target cache space when it is found there and from the storage space when it is not. Because fixed data whose use order does not change is read directly from the target cache space, the efficiency of data reading is improved while a higher data block hit rate is obtained, thereby achieving the technical effect of improving the data read hit rate and solving the technical problem of a low data read hit rate.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic flowchart of an alternative method for reading training sample data according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an alternative method for reading training sample data according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an alternative method for reading training sample data according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an alternative method for reading training sample data according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an alternative method for reading training sample data according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating an alternative method for reading training sample data according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating an alternative method for reading training sample data according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating an alternative method for reading training sample data according to an embodiment of the present invention;
FIG. 9 is a diagram illustrating an alternative method for reading training sample data according to an embodiment of the present invention;
FIG. 10 is a diagram illustrating an alternative method for reading training sample data according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of an alternative apparatus for reading training sample data according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of an alternative apparatus for reading training sample data according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of an alternative apparatus for reading training sample data according to an embodiment of the present invention;
FIG. 14 is a schematic structural diagram of an alternative apparatus for reading training sample data according to an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software and networks within a wide area network or a local area network to realize computation, storage, processing and sharing of data.

Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied in the cloud computing business model; it can form a resource pool that is used on demand and is flexible and convenient. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites and other web portals, require large amounts of computing and storage resources. With the rapid development and application of the internet industry, each article may carry its own identification mark that needs to be transmitted to a background system for logic processing; data at different levels are processed separately, and industrial data of all kinds need strong system background support, which can only be realized through cloud computing.

A database can be regarded, in short, as an electronic filing cabinet, a place for storing electronic files; a user can add, query, update and delete the data in the files. A "database" is a collection of data that is stored together in a manner that can be shared by multiple users, has as little redundancy as possible, and is independent of applications.

A Database Management System (DBMS) is computer software designed for managing databases, and generally provides basic functions of storage, retrieval, security assurance and backup. Database management systems may be classified by the database model they support, such as relational or Extensible Markup Language (XML); by the type of computer supported, such as server cluster or mobile phone; by the query language used, such as Structured Query Language (SQL) or XQuery; by performance emphasis, such as maximum size or maximum operating speed; or by other classification schemes. Regardless of the classification used, some DBMSs span categories, for example by supporting multiple query languages simultaneously.
As an optional implementation, as shown in FIG. 1, the method for reading training sample data includes:
s102, reading sample data blocks from a target data set stored in a storage space according to a first data block sequence, performing model training on a target model according to the read sample data blocks, and caching the read sample data blocks to a target cache space until the sample data blocks cached in the target cache space reach the upper cache limit of the target cache space;
s104, after the sample data blocks are read from the target data set according to the sequence of the first data blocks, adjusting the sequence of the first data blocks into the sequence of the second data blocks, wherein the sequence of the first data blocks is different from the sequence of the second data blocks;
s106, according to the second data block sequence, reading sample data blocks in the target data set from the target cache space and the storage space, and performing model training on the target model according to the read sample data blocks in the target data set, wherein according to the second data block sequence, reading the sample data blocks in the target data set from the target cache space and the storage space comprises: and under the condition that the sample data blocks in the target data set are inquired in the target cache space, reading the sample data blocks in the target data set from the target cache space, and under the condition that the sample data blocks in the target data set cannot be inquired in the target cache space, reading the sample data blocks in the target data set from the storage space.
Optionally, the method for reading training sample data may be, but is not limited to, applied in a deep learning scenario, and may be, but is not limited to, a dedicated data caching scheme (DCDL) that includes, but is not limited to, a cache replacement policy, specifically, but not limited to, a cache-once policy (COO). A sample data block may be, but is not limited to, a unit of data holding a certain amount of sample data. The target cache space may be, but is not limited to, a cache area that caches data in the form of a linked list, including a list head, a list middle, a list tail and the like; the data in a sample data block may be, but is not limited to, cached into the target cache space, specifically by inserting the data of the currently acquired sample data block at the linked list head of the target cache space. Optionally, the data in the target cache space may be, but is not limited to, fixed, and the data in a sample data block may be, but is not limited to, fixed. The data block sequence may, but is not limited to, leave the stored or cached data unaffected; in other words, when the data block sequence changes, the data in the target cache space and/or the sample data blocks does not change. Optionally, the ratio of the data size of the sample data blocks to the capacity of the target cache space may be, but is not limited to, a fixed ratio, for example 1.5:1, 2:1, 3:1, 3:2, etc.
It should be noted that, sample data blocks are read from a target data set stored in a storage space according to a first data block sequence, model training is performed on a target model according to the read sample data blocks, and the read sample data blocks are cached in a target cache space until the sample data blocks cached in the target cache space reach a cache upper limit of the target cache space, where optionally, the cache of the target cache space has an upper limit, and the upper limit may cause the target cache space to be unable to cache all the sample data blocks.
Further, after the sample data blocks are read from the target data set according to the sequence of the first data blocks, adjusting the sequence of the first data blocks into the sequence of the second data blocks, wherein the sequence of the first data blocks is different from the sequence of the second data blocks; according to the second data block sequence, reading sample data blocks in the target data set from the target cache space and the storage space, and performing model training on the target model according to the read sample data blocks in the target data set, wherein in the process of reading the sample data blocks in the target data set from the target cache space and the storage space according to the second data block sequence, the following two scenarios may be included, but not limited to: 1) under the condition that the sample data blocks in the target data set are inquired in the target cache space, reading the sample data blocks in the target data set from the target cache space; 2) and under the condition that the sample data blocks in the target data set cannot be inquired in the target cache space, reading the sample data blocks in the target data set from the storage space. Through the technical means corresponding to the two scenes, all the sample data blocks are read, and the read sample data blocks are subjected to model training on the target model. Optionally, the model training of the target model by using the read sample data block may be, but is not limited to, an execution logic for performing real-time reading and real-time training without waiting for the sample data block to be completely read, or may be, but is not limited to, performing unified training after all the sample data blocks are read.
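To make the read flow concrete, here is a minimal Python sketch of the two scenarios above, assuming dict-backed stand-ins for the storage space and the target cache space (the names and interfaces are illustrative, not from the patent):

```python
def read_blocks(order, cache, storage):
    """Yield sample data blocks in the given order, preferring the cache.

    `cache` and `storage` are dict-like stand-ins mapping block IDs to
    sample data blocks; `order` is the current data block sequence.
    """
    for block_id in order:
        if block_id in cache:        # scenario 1: block found in the cache
            yield cache[block_id]
        else:                        # scenario 2: cache miss, fall back
            yield storage[block_id]  # to the (slower) storage space
```

Later sketches in this section reuse `read_blocks`.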
To further illustrate, an alternative example, as shown in FIG. 2, includes a target storage area 202 (storage space), wherein the target storage area 202 includes a plurality of data, such as data block A, B, C, D, E, F, G; a target cache area 204 (target cache space) is also included, where the target cache area 204 includes a linked list head 2042, a linked list bottom 2044, and other cache regions (spaces).
Further by way of example, optionally, as shown in FIG. 3, a first data set 302 (representing sample data blocks, where shading may, but is not limited to, indicate that data has been read from the target storage area 202 into the target cache area 204) is determined in the target storage area 202, where the first data set 302 includes data blocks A, B, C, D. Optionally, the first data set 302 may be, but is not limited to being, determined randomly or according to the current data block sequence, for example by taking the first n data blocks in the current order, where n is the number of data blocks in the first data set 302;
Further, optionally, the data blocks in the first data set 302 are inserted into the target cache area 204 in sequence, specifically at the linked list head 2042 of the target cache area 204. Each region in the target cache area 204 may be, but is not limited to, used for caching a single data block; for example, after data block A of the first data set 302 is inserted at the linked list head 2042 and data block B is then inserted at the linked list head 2042, data block A is moved to the next region of the target cache area 204, and the data cached at the current linked list head 2042 is data block B;

Furthermore, the number of data blocks in the first data set 302 and the number of cache regions in the target cache area 204 are both 4. Optionally, after the data blocks of the first data set 302 have been inserted into the target cache area 204 in sequence, the data cached at the linked list bottom 2044 should be the first-inserted data block A, the data cached at the linked list head 2042 should be the last-inserted data block D, and likewise the remaining cache regions of the target cache area 204 cache the corresponding data blocks according to their insertion order.

For further illustration, optionally, as shown in FIG. 4, the first data set 302 in the target storage area 202 may be, but is not limited to being, determined from the data blocks to be inserted 402, where the data blocks to be inserted 402 may, but are not limited to, represent data to be inserted into the target cache area 204 for pre-caching. Specifically, the data in the target storage area 202 is inserted into the target cache area 204 in the current data block sequence until a data block is cached at the linked list bottom 2044, or in other words, until all cache regions of the target cache area 204 hold data blocks (the cache upper limit is reached); from then on, the data blocks to be inserted 402 are no longer inserted into the target cache area 204, and the data cached in the current target cache area 204 serves as the first data set 302. For example, if data block A, data block B, data block C and data block D are cached in the current target cache area 204, the first data set 302 consists of data blocks A, B, C and D, while the remaining data blocks E, F and G stay in the target storage area 202.
Further by way of example, as shown in FIG. 5, optionally, there is included the target storage area 502 (storage space), which holds data block A, data block B, data block C, data block D, data block E, data block F and data block G, and a target cache area 504 (target cache space) that caches data block A, data block B, data block C and data block D from the target storage area 502;

Further, the first target data 506 (e.g., data block A, data block D) is read directly from the target cache area 504 according to the current data block sequence (the second data block sequence), and the data that cannot be read from the target cache area 504 (e.g., data block B, data block C, data block E, data block F, data block G) is read from the target storage area 502 as the second target data 508; the target data 510, whose use order follows the current data block sequence (the second data block sequence), is then assembled from the first target data 506 and the second target data 508.
For further example, as shown in FIG. 6, there are optionally included the target storage area 602 (storage space), a first data block sequence 604 indicating the order of the data blocks in the target storage area 602, a second data block sequence 606 obtained by updating the first data block sequence 604, and a target cache area 608 (target cache space) for caching part of the data in the target storage area 602. For ease of understanding, the target cache area 608 is shown at different times (e.g., t1 and t2) as the target cache area 6082 at time t1 and the target cache area 6084 at time t2; it should be noted that both denote the same target cache area 608;

Specifically, for example, part of the data blocks in the target storage area 602 (e.g., data blocks A, B, C, D and E) are pre-cached in the target cache area 608 in advance; the current data block sequence of the target storage area 602, namely the first data block sequence 604, is acquired and updated to obtain the second data block sequence 606. According to the second data block sequence 606, the pre-cached data blocks are fetched from the target cache area 608. For example, data blocks amounting to 1 TB of data (data blocks A, B, C, D, E, F, G, H, I, J) are stored in the target storage area 602, and data blocks amounting to 500 GB (data blocks A, B, C, D, E) are pre-cached in the target cache area 608. First the 500 GB of data indicated by the first half of the second data block sequence 606 (e.g., D, J, H, A, E) is read, and then the 500 GB indicated by its second half (e.g., C, G, B, I, F). Specifically, in the target cache area 6082 at time t1, the hits among the requested data blocks D, J, H, A and E, namely data blocks A, D and E, are read; in the target cache area 6084 at time t2, the hits among the requested data blocks C, G, B, I and F, namely data blocks C and B, are read;

Further, data blocks C and B, which were not read from the target cache area 6082 at time t1, and data blocks E, D and A, which were not read from the target cache area 6084 at time t2, are read from the target storage area 602.
According to the embodiment provided by the application, sample data blocks are read from the target data set stored in the storage space according to the first data block sequence, model training is performed on the target model according to the read sample data blocks, and the read sample data blocks are cached into the target cache space until the sample data blocks cached there reach its cache upper limit; after the sample data blocks have been read according to the first data block sequence, the first data block sequence is adjusted to a different second data block sequence; sample data blocks are then read from the target cache space and the storage space according to the second data block sequence and used for model training of the target model, a block being read from the target cache space when it is found there and from the storage space when it is not. By directly reading fixed data whose use order does not change from the target cache space, the read efficiency of data reading is improved while a higher data block hit rate is obtained, thereby achieving the technical effect of improving the data read hit rate.
As an alternative, adjusting the first data block order to the second data block order includes:
s1, performing a first scrambling operation on sample data blocks arranged according to the sequence of the first data block in the target data set to obtain a second data block sequence, wherein the first scrambling operation is used for changing the use sequence of the sample data blocks in the target data set; or
And S2, performing a second scrambling operation on the sample data chunks arranged in the target data set according to the initial data chunk sequence to obtain a second data chunk sequence, wherein the initial data chunk sequence is the arrangement sequence of the sample data chunks in the target data set stored in the storage space, and the second scrambling operation is used for changing the use sequence of the sample data chunks in the target data set.
It should be noted that, a first scrambling operation is performed on sample data chunks arranged according to a first data chunk order in the target data set to obtain a second data chunk order, where the first scrambling operation is used to change a use order of the sample data chunks in the target data set. Optionally, the scrambling operation may be, but is not limited to, randomly scrambling the data block order.
Or, performing a second scrambling operation on the sample data chunks arranged in the target data set according to the initial data chunk sequence to obtain a second data chunk sequence, where the initial data chunk sequence is the arrangement sequence of the sample data chunks in the target data set stored in the storage space, and the second scrambling operation is used to change the use sequence of the sample data chunks in the target data set, in other words, the scrambling operation may be performed based on, but not limited to, the current first data chunk sequence arrangement of the sample data chunks, and may also be performed based on, but not limited to, the most original initial data chunk sequence arrangement of the sample data chunks. Optionally, the initial data chunk order arrangement may be, but is not limited to, an initialization order before the read operation is performed on the specimen data chunks.
To further illustrate, optionally, as shown in FIG. 7, a first scrambling operation is performed based on the initial data block order 702 to obtain a first data block order 704, and a second scrambling operation is performed based on the initial data block order 702 to obtain a second data block order 706.

To further illustrate, optionally, as shown in FIG. 8, a first scrambling operation is performed based on the initial data block order 702 to obtain a first data block order 704, and a second scrambling operation is performed based on the first data block order 704 to obtain a second data block order 802.
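A sketch of the two scrambling variants, under the assumption that the scrambling operation is a uniform random shuffle (the patent only requires that the use order changes; the block IDs are illustrative):

```python
import random

def scramble(order):
    """Return a new use order with the data blocks randomly reordered."""
    return random.sample(order, len(order))

initial_order = ["A", "B", "C", "D", "E"]    # order as stored (illustrative)
first_order = scramble(initial_order)        # first scrambling operation

second_order = scramble(first_order)         # FIG. 8: based on the first order
# second_order = scramble(initial_order)     # FIG. 7: based on the initial order
```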
By the embodiment provided by the application, a first scrambling operation is performed on the sample data blocks arranged in the target data set according to the first data block sequence to obtain the second data block sequence, the first scrambling operation being used for changing the use order of the sample data blocks in the target data set; or a second scrambling operation is performed on the sample data blocks arranged according to the initial data block sequence to obtain the second data block sequence, the initial data block sequence being the arrangement order of the sample data blocks as stored in the storage space and the second scrambling operation being used for changing the use order of the sample data blocks in the target data set. By performing the scrambling operation flexibly, a data block sequence that meets the scrambling requirement is obtained, achieving the effect of flexibly obtaining the data block sequence.
As an optional scheme, according to the second data block sequence, reading sample data blocks in the target data set from the target cache space and the storage space, and performing model training on the target model according to the read sample data blocks in the target data set, including:
s1, reading all sample data blocks in the target data set from the target cache space and the storage space according to the second data block sequence to obtain a first sample data block sequence, wherein the first sample data block sequence comprises all sample data blocks arranged according to the second data block sequence; inputting the first sample data block sequence into a target model to perform model training on the target model; or
And S2, reading all sample data blocks in the target data set from the target cache space and the storage space according to the second data block sequence, and, each time one sample data block or one group of sample data blocks is read, inputting the read block or group into the target model so as to perform model training on the target model.
It should be noted that, according to the second data block sequence, all sample data blocks in the target data set are read from the target cache space and the storage space to obtain a first sample data block sequence, where the first sample data block sequence includes all sample data blocks arranged according to the second data block sequence; inputting the first sample data block sequence into a target model to perform model training on the target model, in other words, uniformly inputting all sample data blocks into the target model to perform model training on the target model;
Alternatively, all sample data blocks in the target data set are read from the target cache space and the storage space according to the second data block sequence, and each time one sample data block or one group of sample data blocks is read, the read block or group is input into the target model for model training; in other words, as soon as a sample data block has been read it is input into the target model for training, without waiting for all sample data blocks to be read.
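The two training modes might look like this, reusing `read_blocks` from above; `model.train_on` is a hypothetical interface standing in for one training step, not an API from the patent:

```python
def train_batched(model, order, cache, storage):
    # Read the whole sample data block sequence first, then train once
    # on all blocks together.
    blocks = list(read_blocks(order, cache, storage))
    model.train_on(blocks)

def train_streaming(model, order, cache, storage):
    # Feed each block to the model as soon as it is read, without
    # waiting for the remaining blocks.
    for block in read_blocks(order, cache, storage):
        model.train_on([block])
```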
To further illustrate, alternatively, as shown in FIG. 5, the first target data 506 read from the target cache area 504 and the second target data 508 read from the target storage area 502 are combined into the target data 510 arranged in the second data block sequence (not shown).
According to the embodiment provided by the application, all sample data blocks in the target data set are read from the target cache space and the storage space according to the second data block sequence to obtain a first sample data block sequence containing all sample data blocks arranged in that order, and the first sample data block sequence is input into the target model for model training; or all sample data blocks are read according to the second data block sequence and, each time one block or one group of blocks is read, the read block or group is input into the target model for training. By combining the target cache space and the storage space to acquire the sample data blocks arranged in the second data block sequence, and by completing the model training in a flexible training mode, the effect of improving model training efficiency is achieved.
As an optional scheme, reading sample data chunks from a target data set stored in a storage space according to a first data chunk sequence, and performing model training on a target model according to the read sample data chunks, including:
s1, reading all sample data blocks in a target data set from the target data set stored in the storage space according to the sequence of the first data blocks to obtain a second sample data block sequence, wherein the second sample data block sequence comprises all sample data blocks arranged according to the sequence of the first data blocks; inputting the second sample data block sequence into the target model to perform model training on the target model; or,
And S2, reading all sample data blocks in the target data set from the target data set stored in the storage space according to the first data block sequence, and, each time one sample data block or one group of sample data blocks is read, inputting the read block or group into the target model so as to perform model training on the target model.
It should be noted that, according to the first data block sequence, reading all sample data blocks in the target data set from the target data set stored in the storage space to obtain a second sample data block sequence, where the second sample data block sequence includes all sample data blocks arranged according to the first data block sequence; inputting the second sample data block sequence into the target model to perform model training on the target model;
Alternatively, all sample data blocks in the target data set are read from the target data set stored in the storage space according to the first data block sequence, and each time one sample data block or one group of sample data blocks is read, the read block or group is input into the target model for model training. Optionally, when reading sample data blocks from the storage space, model training may be, but is not limited to being, performed on one block or one group of blocks at a time as it is read, or the read blocks may be, but are not limited to being, input into the target model together once all sample data blocks have been read.
For further example, the data obtained by such reading is optionally applied to the training process of deep learning as follows (a sketch follows the list):
1. and reading the data, namely reading the target data (set) which is sequentially arranged according to the second data block.
2. Data parsing and preprocessing, such as decompression of images, data enhancement (e.g., flipping, scaling, random cropping, noise, etc.), out-of-order, batch processing, etc., are required for models in the Computer Vision (CV) domain.
3. And (4) data loading, namely loading the preprocessed data onto a training accelerator (GPU/TPU).
4. And model training, wherein the training accelerator is trained by using the loaded data.
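A sketch of the four stages, reusing `read_blocks` from above; `preprocess`, `accelerator.load` and `model.train_on` are hypothetical placeholders for the stages named in the list, not APIs from the patent:

```python
def preprocess(block):
    # Stand-in for parsing and preprocessing: decompression, data
    # enhancement (flipping, scaling, random cropping, noise),
    # shuffling and batching would happen here.
    return block

def training_pipeline(order, cache, storage, accelerator, model):
    for block in read_blocks(order, cache, storage):  # 1. data reading
        batch = preprocess(block)                     # 2. parse/preprocess
        device_batch = accelerator.load(batch)        # 3. load to GPU/TPU
        model.train_on(device_batch)                  # 4. model training
```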
According to the embodiment provided by the application, all sample data blocks in the target data set are read from the target data set stored in the storage space according to the first data block sequence to obtain a second sample data block sequence containing all sample data blocks arranged in that order, and the second sample data block sequence is input into the target model for model training; or all sample data blocks are read according to the first data block sequence and, each time one block or one group of blocks is read, the read block or group is input into the target model for training. Reading the data stably and flexibly achieves the purpose of flexibly providing stable read data to the neural network model to be trained, thereby improving the stability of neural network model training.
As an optional scheme,
performing model training on the target model according to the read sample data blocks, wherein the model training comprises the following steps: using sample data blocks arranged according to the sequence of the first data blocks in the target data set to perform a first round of training on a target model;
according to the second data block sequence, reading the sample data blocks in the target data set from the target cache space and the storage space, wherein the reading comprises the following steps: during or after the first round of training of the target model, reading sample data blocks in the target data set from the target cache space and the storage space according to the sequence of the second data blocks;
performing model training on the target model according to the read sample data blocks in the target data set, wherein the model training comprises the following steps: and performing a second round of training on the target model by using the sample data blocks in the target data set, which are arranged according to the sequence of the second data blocks.
It should be noted that, sample data blocks arranged in the target data set according to the first data block order are used to perform a first round of training on the target model, and optionally, the sample data blocks used in the first round of training are directly read from the target data set stored in the storage space, so that the efficiency of the first round of training is low;
further, during or after the first round of training of the target model, according to the sequence of the second data blocks, sample data blocks in the target data set are read from the target cache space and the storage space, optionally, during or after the first round of training, the sample data blocks in the target data set are read from the target cache space and the storage space, that is, the sample data blocks are not all read directly from the target data set stored in the storage space, and the target cache space is also read, so that the efficiency is improved;
and then, using the sample data blocks arranged in the target data set according to the sequence of the second data blocks to perform a second round of training on the target model, optionally, using the sample data blocks with higher reading efficiency to further perform a second round of training, wherein in comparison, the efficiency of the second round of training is obviously higher than that of the first round of training.
For example, optionally, in the process of performing a first round of training on the neural network model to be trained, sample data blocks are searched in the target cache space according to the sequence of the second data blocks, and in the case that the sample data blocks are searched and the first round of training is completed, the read sample data blocks arranged according to the sequence of the second data blocks are used to perform a second round of training on the neural network model to be trained.
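Putting the two rounds together, a sketch reusing `read_blocks` and `scramble` from above, with the same hypothetical `model.train_on` hook; the round-1 caching shown is the cache-until-full behavior described earlier:

```python
def train_two_rounds(model, storage, first_order, cache, capacity):
    # Round 1: read from storage in the first data block sequence,
    # caching each block until the target cache space reaches its
    # upper limit (cache once, never replaced later).
    for block_id in first_order:
        block = storage[block_id]
        if len(cache) < capacity:
            cache[block_id] = block
        model.train_on([block])
    # The order adjustment can run during round 1; shown here after it.
    second_order = scramble(first_order)
    # Round 2: read in the second data block sequence via the cache,
    # falling back to storage on a miss.
    for block in read_blocks(second_order, cache, storage):
        model.train_on([block])
```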
By way of further illustration, in a deep learning training scenario it may, but is not limited to, hold that during each pass of training that uses all samples of the training set once (an epoch), each sample of the data set is visited exactly once, and the probability of being revisited is the same for every sample. Given these characteristics, the COO algorithm may be applied, but is not limited to being applied, by treating the memory and the Solid State Disk (SSD) together as one common storage space: no cache replacement is needed in the memory or the SSD, and only the data cached the first time is kept.
Alternatively, the specific process of the COO algorithm is as follows (a sketch follows the list):

1. New data is inserted directly at the head of the linked list.

2. When the linked list is full, new data is not inserted and is discarded directly.
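A sketch of the COO policy, using an `OrderedDict` to stand in for the linked list (an illustrative choice, not the patent's implementation):

```python
from collections import OrderedDict

class CacheOnceCache:
    """Cache-once (COO) replacement policy: new data is inserted at the
    head of the list; once the list is full, new data is discarded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # stands in for the linked list

    def insert(self, key, value):
        if len(self.entries) >= self.capacity:
            return                     # list full: discard the new data
        self.entries[key] = value
        self.entries.move_to_end(key, last=False)   # place at the head

    def get(self, key):
        return self.entries.get(key)   # the block on a hit, None on a miss
```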
For the nth epoch (n >= 2), the hit rate HitRate of the COO algorithm is shown in the following Equation 1:

HitRate = Sizeof(cache) / Sizeof(data)    (Equation 1)

where Sizeof(cache) represents the size of the cache and Sizeof(data) represents the total size of the data that needs to be accessed.
Through the embodiment provided by the application, the model training is carried out on the target model according to the read sample data blocks, and the method comprises the following steps: using sample data blocks arranged according to the sequence of the first data blocks in the target data set to perform a first round of training on a target model; according to the second data block sequence, reading the sample data blocks in the target data set from the target cache space and the storage space, wherein the reading comprises the following steps: during or after the first round of training of the target model, reading sample data blocks in the target data set from the target cache space and the storage space according to the sequence of the second data blocks; performing model training on the target model according to the read sample data blocks in the target data set, wherein the model training comprises the following steps: and performing a second round of training on the target model by using the sample data blocks which are arranged in the target data set according to the sequence of the second data blocks, and completing a plurality of rounds of training by using the sample data blocks with higher reading efficiency, thereby achieving the purpose of improving the training speed of the neural network and realizing the effect of improving the training efficiency of the neural network.
As an optional scheme, updating the current data block sequence of the sample data blocks to obtain the second data block sequence includes:
in the process of using the sample data blocks arranged according to the sequence of the first data block in the target data set to perform the first round of training on the target model, the sequence of the first data block is adjusted to the sequence of the second data block.
It should be noted that, in the process of performing a first round of training on the target model by using the sample data chunks arranged in the target data set according to the order of the first data chunks, the order of the first data chunks is adjusted to the order of the second data chunks. Optionally, in the process of performing the first round of training on the target model, the adjustment operation of the data block sequence is performed in parallel, so that the execution efficiency of the overall operation is improved.
For further example, the second data block sequence may optionally be obtained while the first round of training is being performed on the neural network model to be trained.
According to the embodiment provided by the application, in the process of using the sample data blocks arranged in the target data set according to the sequence of the first data blocks to perform the first round of training on the target model, the sequence of the first data blocks is adjusted to the sequence of the second data blocks, and the first round of training and the sequence adjustment are performed in parallel, so that the purpose of synchronously performing the training operation of the neural network model and the acquisition operation of the sequence of the data blocks is achieved, and the effect of improving the training efficiency of the neural network model is realized.
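A sketch of running the order adjustment in parallel with the first round of training, using a background thread; `scramble` is reused from above and `model.train_on` remains a hypothetical hook:

```python
import threading

def round_one_with_parallel_shuffle(model, storage, first_order, cache, capacity):
    result = {}

    def shuffle_worker():
        # Adjust the first data block sequence to the second one while
        # the first round of training is still running.
        result["second_order"] = scramble(first_order)

    worker = threading.Thread(target=shuffle_worker)
    worker.start()
    for block_id in first_order:          # first round of training
        block = storage[block_id]
        if len(cache) < capacity:
            cache[block_id] = block       # cache once, until full
        model.train_on([block])
    worker.join()
    return result["second_order"]
```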
As an optional scheme, after the sample data blocks in the target data set are read from the target cache space and the storage space according to the second data block sequence and model training is performed on the target model according to the read sample data blocks, the method further includes:
s1, adjusting the second data block sequence to a third data block sequence, wherein the second data block sequence is different from the third data block sequence;
s2, according to a third data block sequence, reading sample data blocks in the target data set from the target cache space and the storage space, and performing model training on the target model according to the read sample data blocks in the target data set, wherein according to the third data block sequence, reading the sample data blocks in the target data set from the target cache space and the storage space comprises: and under the condition that the sample data blocks in the target data set are inquired in the target cache space, reading the sample data blocks in the target data set from the target cache space, and under the condition that the sample data blocks in the target data set cannot be inquired in the target cache space, reading the sample data blocks in the target data set from the storage space.
It should be noted that, the second data block order is adjusted to a third data block order, where the second data block order is different from the third data block order; according to a third data block sequence, reading sample data blocks in the target data set from the target cache space and the storage space, and performing model training on the target model according to the read sample data blocks in the target data set, wherein according to the third data block sequence, reading the sample data blocks in the target data set from the target cache space and the storage space comprises: and under the condition that the sample data blocks in the target data set are inquired in the target cache space, reading the sample data blocks in the target data set from the target cache space, and under the condition that the sample data blocks in the target data set cannot be inquired in the target cache space, reading the sample data blocks in the target data set from the storage space.
Further by way of example, an optional example, as shown in FIG. 9, includes a target storage area 902 (storage space), a first data block sequence 906 indicating the use order of the data blocks in the current target storage area 902, and a second data block sequence 908 and a third data block sequence 910 updated and obtained based on the first data block sequence 906; also included are a target cache area 904 (target cache space) and the behavior of that same target cache area 904 at different times (e.g., times t1, t2, t3, t4, t5, t6); optionally, based on data read efficiency, the first data block sequence 906, the second data block sequence 908 and the third data block sequence 910 may each be, but are not limited to being, divided into a first half and a second half;

Specifically, optionally, for example, at time t1, no data block of the target storage area 902 has yet been inserted into the target cache area 904, so the target cache area 9042 at time t1 is empty and the hit rate for the data blocks corresponding to the first half of the first data block sequence 906 is 0; at time t2, part of the data blocks of the target storage area 902 have been inserted into the target cache area 904, but these are the first half of the data blocks of the target storage area 902, corresponding to the first half of the first data block sequence 906, so the hit rate of the target cache area 9044 at time t2 for the data blocks corresponding to the second half of the first data block sequence 906 is also 0;
further, optionally, for example, at time t3, the first data block sequence 906 is updated to the second data block sequence 908, and the data in the target cache area 904 remains unchanged, in the above case, if the data block a, the data block D, and the data block E in the target cache area 9046 at time t3 hit the data block a, the data block D, and the data block E corresponding to the first half of the second data block sequence 908, the data block a, the data block D, and the data block E are directly read in the target cache area 904, and the data block B and the data block C corresponding to the first half of the second data block sequence 908 and not hit in the target cache area 9046 at time t3 are read in the target storage area 902;
Further, at time t4, data blocks B and C in the target cache area 9048 hit the corresponding blocks in the second half of the second data block order 908 and are read directly from the target cache area 904, while data block A, data block D, and data block E, which belong to the second half of the second data block order 908 but are not hit in the target cache area 9048 at time t4, are read from the target storage area 902.
Furthermore, at time t5 the second data block order 908 has been updated to the third data block order 910 while the data blocks in the target cache area 904 remain unchanged. In this case, data blocks A, B, C, and D in the target cache area 9050 at time t5 hit the corresponding blocks in the first half of the third data block order 910 and are read directly from the target cache area 904, while data block E, which belongs to the first half of the third data block order 910 but is not hit in the target cache area 9050 at time t5, is read from the target storage area 902.
Further, at time t6, data block E in the target cache area 9052 hits the corresponding block in the second half of the third data block order 910 and is read directly from the target cache area 904, while data block A, data block B, data block C, and data block D, which belong to the second half of the third data block order 910 but are not hit in the target cache area 9052 at time t6, are read from the target storage area 902.
By the embodiment provided in the application, the second data block order is adjusted to a third data block order, the second data block order being different from the third data block order; the sample data blocks in the target data set are then read from the target cache space and the storage space according to the third data block order, and the target model is trained according to the read sample data blocks, where a block found in the target cache space is read from the cache and a block not found there is read from the storage space. Because the data to be read in the target cache space is fixed, reading proceeds in a stable, high-hit-rate manner, so that both data reading efficiency and hit rate are taken into account and kept in step; a minimal sketch of this reading loop follows.
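To make the hit-or-miss reading loop concrete, the following is a minimal Python sketch. It assumes, purely for illustration, that the cache is a plain dict keyed by block id and that read_from_storage is a hypothetical accessor for the slower storage space; neither name comes from the embodiment itself.

```python
def read_epoch(block_order, cache, read_from_storage):
    """Read one epoch's sample data blocks, preferring the fixed cache.

    block_order: block ids in this epoch's (shuffled) use order.
    cache: dict of block id -> block data; under the cache-once policy
           its contents stay fixed once filled, only the order changes.
    read_from_storage: callable that fetches a block from storage.
    """
    for block_id in block_order:
        if block_id in cache:              # hit: read from the cache
            yield cache[block_id]
        else:                              # miss: fall back to storage
            yield read_from_storage(block_id)
```

Because the cached contents never change, the fraction of hits per epoch depends only on how the shuffled order lines up with the fixed cache, which is exactly the behavior traced at times t1 to t6 above.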
As an optional scheme, reading the sample data blocks in the target data set from the target cache space includes:
a group of sample data blocks in the target data set is acquired from the target cache space in parallel using multiple threads.
Optionally, multithreading may be, but is not limited to, a technique for implementing the concurrent execution of multiple threads in software or hardware; with hardware support, a multithreaded computer can execute more than one thread at the same time, thereby improving overall processing performance.
It should be noted that acquiring a group of sample data blocks in the target data set from the target cache space in parallel using multiple threads effectively improves data acquisition efficiency and thus the overall processing performance of data reading.
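As an illustration only, a parallel group fetch might look like the following sketch, assuming fetch_block is a thread-safe accessor (for example, a read of an immutable cache entry); the function name and signature are assumptions, not part of the embodiment.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_group_parallel(block_ids, fetch_block, num_threads=4):
    """Fetch a group of sample data blocks in parallel.

    With I/O-bound fetches the threads genuinely overlap, which is
    where the improvement in data acquisition efficiency comes from.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map preserves the input order of block_ids in the results
        return list(pool.map(fetch_block, block_ids))
```

For example, fetch_group_parallel(ids, cache.get) would read a group of cached blocks concurrently.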
Further, optionally, besides acquiring a group of sample data blocks from the target cache space in parallel, multiple threads may also be used in parallel within a specific neural network training scenario. Optionally, as shown in fig. 10, while the accelerator performs model training, the data preparation required for the next round of training proceeds synchronously on the CPU, specifically including the data reading and preprocessing processes shown in the figure; a sketch of this overlap follows.
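A minimal sketch of the overlap, assuming batches is an iterator of raw batches and that preprocess and train_step are hypothetical stand-ins for the CPU-side preparation and the accelerator-side training step:

```python
from concurrent.futures import ThreadPoolExecutor

def train_pipelined(batches, preprocess, train_step):
    """Overlap CPU-side data preparation with accelerator training.

    While train_step runs on batch n, a background worker reads and
    preprocesses batch n + 1, mirroring the overlap of fig. 10.
    Assumes batches yields at least one raw batch.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(preprocess, next(batches))
        for raw in batches:
            ready = future.result()                # wait for prepared data
            future = pool.submit(preprocess, raw)  # prepare the next batch
            train_step(ready)                      # train on current batch
        train_step(future.result())                # train on the final batch
```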
Performing data reading and preprocessing in parallel with multiple threads, executed concurrently on the CPU, effectively improves the overall training efficiency of the neural network model; the improvement can be shown with a more direct calculation:
optionally, for the nth epoch, the total training time required by the present scheme is shown in the following formula 2:
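The formula image itself is not reproduced in this text; the display below is therefore only a plausible reconstruction from the symbol definitions that follow, assuming the $M$ block reads are divided evenly among the $K$ threads and that CPU-side preprocessing and loading overlap the accelerator-side training, so that only the slower of the two contributes:

$$T^{(n)}_{\text{total}} \approx \frac{M}{K}\,t_{\text{read}} + \max\!\left(t_{\text{prep}} + t_{\text{load}},\; t_{\text{train}}\right)$$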
where $T^{(n)}_{\text{total}}$ indicates the total training time required for the nth epoch, $t_{\text{read}}$ the time required to read a single data block, $t_{\text{prep}}$ the time required for data preprocessing, $t_{\text{load}}$ the time required for data loading, and $t_{\text{train}}$ the time of a single epoch of training; $M$ represents the number of data blocks to be read, and $K$ represents the number of threads.
According to the embodiment provided by the application, a group of sample data blocks in the target data set are acquired from the target cache space in parallel by using a plurality of threads, so that the aim of reading data in a shorter time is fulfilled, and the effect of improving the data reading efficiency is realized.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiments of the present invention, there is also provided a device for reading training sample data, which is used for implementing the method for reading training sample data. As shown in fig. 11, the apparatus includes:
the first reading unit 1102 is configured to read sample data blocks from a target data set stored in a storage space according to a first data block sequence, perform model training on a target model according to the read sample data blocks, and cache the read sample data blocks in a target cache space until the sample data blocks cached in the target cache space reach an upper cache limit of the target cache space;
an updating unit 1104, configured to adjust a first data chunk order to a second data chunk order after the sample data chunks are read from the target data set according to the first data chunk order, where the first data chunk order is different from the second data chunk order;
a second reading unit 1106, configured to read, according to a second data block order, sample data blocks in the target data set from the target cache space and the storage space, and perform model training on the target model according to the read sample data blocks in the target data set, where reading, according to the second data block order, sample data blocks in the target data set from the target cache space and the storage space includes: and under the condition that the sample data blocks in the target data set are inquired in the target cache space, reading the sample data blocks in the target data set from the target cache space, and under the condition that the sample data blocks in the target data set cannot be inquired in the target cache space, reading the sample data blocks in the target data set from the storage space.
Optionally, the reading apparatus for training sample data may be, but is not limited to being, applied in a deep learning scenario, and may be, but is not limited to, a unique data caching scheme (DCDL); the apparatus may include, but is not limited to, a Cache replacement policy, specifically, but not limited to, a Cache once policy (COO). A sample data block may be, but is not limited to, a unit holding a certain amount of sample data in the data set. The target cache space may be, but is not limited to, a cache area that caches data in linked-list form, including a linked-list head, a linked-list middle, a linked-list tail, and the like; the data in a sample data block may be, but is not limited to being, cached in the target cache space, specifically by inserting the data of the currently acquired sample data block at the linked-list head of the target cache space. Optionally, the data in the target cache space may be, but is not limited to being, fixed, and the data in a sample data block may likewise be fixed. The data block order does not affect the data stored or cached in the target cache space and/or the sample data blocks; in other words, when the data block order changes, the data in the target cache space and/or the sample data blocks does not change, as sketched below.
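As an illustrative sketch only, the linked-list-style, insert-at-head, cache-once behavior described above could be modeled as follows, with OrderedDict standing in for the head/middle/tail linked list; the class and method names are assumptions for illustration:

```python
from collections import OrderedDict

class CacheOnceList:
    """Linked-list-style cache: new blocks are inserted at the head,
    and once capacity is reached the contents stay fixed (cache once).
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._list = OrderedDict()  # front of the dict = linked-list head

    def insert(self, block_id, block):
        # Cache-once policy: never evict and never replace once full.
        if block_id not in self._list and len(self._list) < self.capacity:
            self._list[block_id] = block
            self._list.move_to_end(block_id, last=False)  # move to head

    def get(self, block_id):
        return self._list.get(block_id)  # None signals a cache miss
```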
It should be noted that sample data blocks are read from the target data set stored in the storage space according to the first data block order, the target model is trained according to the read sample data blocks, and the read sample data blocks are cached in the target cache space until the sample data blocks cached there reach the cache upper limit of the target cache space; optionally, the target cache space has a cache upper limit, which may prevent it from caching all of the sample data blocks.
Further, after the sample data blocks are read from the target data set according to the first data block order, the first data block order is adjusted to the second data block order, the two orders being different; the sample data blocks in the target data set are then read from the target cache space and the storage space according to the second data block order, and the target model is trained according to the read sample data blocks. The reading process may include, but is not limited to, the following two scenarios: 1) when a sample data block in the target data set is found in the target cache space, it is read from the target cache space; 2) when a sample data block in the target data set cannot be found in the target cache space, it is read from the storage space. Through the technical means corresponding to these two scenarios, all the sample data blocks are read, and the target model is trained on them. Optionally, training the target model with the read sample data blocks may, but is not limited to, follow an execution logic of reading and training in real time without waiting for all sample data blocks to be read, or may, but is not limited to, perform unified training after all sample data blocks have been read.
For a specific embodiment, reference may be made to the example shown in the above method for reading training sample data, which is not described herein again in this example.
According to the embodiment provided in the application, sample data blocks are read from the target data set stored in the storage space according to the first data block order, the target model is trained according to the read sample data blocks, and the read sample data blocks are cached in the target cache space until the sample data blocks cached there reach the cache upper limit of the target cache space; after the sample data blocks are read from the target data set according to the first data block order, the first data block order is adjusted to the second data block order, the two orders being different; the sample data blocks in the target data set are then read from the target cache space and the storage space according to the second data block order, and the target model is trained according to the read sample data blocks, where a block found in the target cache space is read from the cache and a block not found there is read from the storage space. Because the data in the target cache space is fixed and only the use order changes, cached data can be read directly, which improves reading efficiency while maintaining a high data block hit rate, thereby achieving the technical effect of improving the data reading hit rate.
As an alternative, as shown in fig. 12, the updating unit 1104 includes:
a first operation module 1202, configured to perform a first scrambling operation on sample data chunks arranged according to a first data chunk sequence in a target data set to obtain a second data chunk sequence, where the first scrambling operation is used to change a use sequence of the sample data chunks in the target data set; or
A second operation module 1204, configured to perform a second scrambling operation on sample data chunks arranged in the target data set according to the initial data chunk sequence to obtain a second data chunk sequence, where the initial data chunk sequence is the arrangement sequence of the sample data chunks in the target data set stored in the storage space, and the second scrambling operation is used to change the use sequence of the sample data chunks in the target data set.
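For illustration, both scrambling operations amount to a random permutation of block ids; a minimal sketch, assuming block orders are Python lists of block ids:

```python
import random

def first_scramble(first_order):
    """Derive the second order by shuffling the current first order."""
    second_order = list(first_order)   # copy; the input stays intact
    random.shuffle(second_order)
    return second_order

def second_scramble(initial_order):
    """Derive the second order by shuffling the initial stored order
    of the sample data blocks in the storage space."""
    second_order = list(initial_order)
    random.shuffle(second_order)
    return second_order
```

Either way, only the use order changes; the blocks themselves, and whatever is cached, stay put.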
For a specific embodiment, reference may be made to the example shown in the above method for reading training sample data, which is not described herein again in this example.
As an alternative, as shown in fig. 13, the second reading unit 1106 includes:
a first reading module 1302, configured to read all sample data chunks in the target data set from the target cache space and the storage space according to a second data chunk sequence, to obtain a first sample data chunk sequence, where the first sample data chunk sequence includes all sample data chunks arranged according to the second data chunk sequence; inputting the first sample data block sequence into a target model to perform model training on the target model; or
And the second reading module 1304 is configured to read all sample data chunks in the target data set from the target cache space and the storage space according to the sequence of the second data chunks, and input one or one group of read sample data chunks into the target model every time one or one group of sample data chunks is read, so as to perform model training on the target model.
For a specific embodiment, reference may be made to the example shown in the above method for reading training sample data, which is not described herein again in this example.
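The two reading modes of the second reading unit could be sketched as follows, assuming read_block resolves a block id via the cache-or-storage logic above and that model.fit is a hypothetical training entry point, neither of which comes from the embodiment:

```python
def train_collect_then_fit(block_order, read_block, model):
    """Mode 1: read all blocks into a sequence, then train once."""
    sequence = [read_block(b) for b in block_order]
    model.fit(sequence)

def train_streaming(block_order, read_block, model, group_size=1):
    """Mode 2: feed each block (or group) to the model as soon as it
    is read, without waiting for the full sequence."""
    group = []
    for b in block_order:
        group.append(read_block(b))
        if len(group) == group_size:
            model.fit(group)
            group = []
    if group:                # flush any remaining partial group
        model.fit(group)
```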
As an alternative, as shown in fig. 14, the first reading unit 1102 includes:
a third reading module 1402, configured to read all sample data chunks in the target data set from the target data set stored in the storage space according to the first data chunk sequence, to obtain a second sample data chunk sequence, where the second sample data chunk sequence includes all sample data chunks arranged according to the first data chunk sequence; inputting the second sample data block sequence into the target model to perform model training on the target model; or
The fourth reading module 1404 is configured to read all sample data chunks in the target data set from the target data set stored in the storage space according to the sequence of the first data chunks, and input one or one group of read sample data chunks into the target model every time one or one group of sample data chunks is read, so as to perform model training on the target model.
For a specific embodiment, reference may be made to the example shown in the above method for reading training sample data, which is not described herein again in this example.
As an alternative scheme,
the first reading unit 1102 includes: the first training module is used for performing first round training on the target model by using the sample data blocks which are arranged in the target data set according to the sequence of the first data blocks;
a second reading unit 1106 including: the second training module is used for reading sample data blocks in the target data set from the target cache space and the storage space according to the sequence of the second data blocks in the process of or after the first round of training of the target model;
the third training module is used for performing model training on the target model according to the read sample data blocks in the target data set, and comprises: and performing a second round of training on the target model by using the sample data blocks in the target data set, which are arranged according to the sequence of the second data blocks.
For a specific embodiment, reference may be made to the example shown in the above method for reading training sample data, which is not described herein again in this example.
As an alternative, the updating unit 1104 includes:
and the adjusting module is used for adjusting the sequence of the first data blocks to the sequence of the second data blocks in the process of performing first round training on the target model by using the sample data blocks arranged in the target data set according to the sequence of the first data blocks.
For a specific embodiment, reference may be made to the example shown in the above method for reading training sample data, which is not described herein again in this example.
As an optional scheme, the apparatus further comprises:
the adjusting unit is used for reading the sample data blocks in the target data set from the target cache space and the storage space according to the second data block sequence, and adjusting the second data block sequence to be a third data block sequence after performing model training on the target model according to the read sample data blocks in the target data set, wherein the second data block sequence is different from the third data block sequence;
the third reading unit is configured to, after the sample data chunks in the target data set are read from the target cache space and the storage space according to the second data chunk sequence, and the target model is subjected to model training according to the sample data chunks in the read target data set, read the sample data chunks in the target data set from the target cache space and the storage space according to the third data chunk sequence, and perform model training on the target model according to the sample data chunks in the read target data set, where the reading of the sample data chunks in the target data set from the target cache space and the storage space according to the third data chunk sequence includes: and under the condition that the sample data blocks in the target data set are inquired in the target cache space, reading the sample data blocks in the target data set from the target cache space, and under the condition that the sample data blocks in the target data set cannot be inquired in the target cache space, reading the sample data blocks in the target data set from the storage space.
For a specific embodiment, reference may be made to the example shown in the above method for reading training sample data, which is not described herein again in this example.
As an alternative, the second reading unit 1106 includes:
and the acquisition module is used for acquiring a group of sample data blocks in the target data set from the target cache space in parallel by using a plurality of threads.
For a specific embodiment, reference may be made to the example shown in the above method for reading training sample data, which is not described herein again in this example.
According to a further aspect of the embodiments of the present invention, there is also provided an electronic device for implementing the method for reading the training sample data, as shown in fig. 15, the electronic device includes a memory 1502 and a processor 1504, the memory 1502 stores a computer program, and the processor 1504 is configured to execute the steps in any one of the above method embodiments through the computer program.
Optionally, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, reading sample data blocks from a target data set stored in the storage space according to the sequence of the first data blocks, performing model training on a target model according to the read sample data blocks, and caching the read sample data blocks into a target cache space until the sample data blocks cached in the target cache space reach the upper cache limit of the target cache space;
s2, after the sample data blocks are read from the target data set according to the sequence of the first data blocks, adjusting the sequence of the first data blocks into the sequence of the second data blocks, wherein the sequence of the first data blocks is different from the sequence of the second data blocks;
s3, according to the second data block sequence, reading sample data blocks in the target data set from the target cache space and the storage space, and performing model training on the target model according to the read sample data blocks in the target data set, wherein according to the second data block sequence, reading the sample data blocks in the target data set from the target cache space and the storage space comprises: and under the condition that the sample data blocks in the target data set are inquired in the target cache space, reading the sample data blocks in the target data set from the target cache space, and under the condition that the sample data blocks in the target data set cannot be inquired in the target cache space, reading the sample data blocks in the target data set from the storage space.
Alternatively, it can be understood by those skilled in the art that the structure shown in fig. 15 is only an illustration, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 15 does not limit the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in FIG. 15, or have a different configuration than shown in FIG. 15.
The memory 1502 may be configured to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for reading training sample data in the embodiments of the present invention, and the processor 1504 executes various functional applications and data processing by running the software programs and modules stored in the memory 1502, that is, implements the method for reading training sample data. The memory 1502 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 1502 can further include memory located remotely from the processor 1504, which can be coupled to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1502 may be used for storing information such as a sample data chunk, an order of a first data chunk, and an order of a second data chunk. As an example, as shown in fig. 15, the memory 1502 may include, but is not limited to, a first reading unit 1102, an updating unit 1104, and a second reading unit 1106 of the reading device for training sample data. In addition, the device may further include, but is not limited to, other module units in the above-mentioned device for reading training sample data, which is not described in detail in this example.
Optionally, the transmission device 1506 is used for receiving or transmitting data via a network. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 1506 includes a Network adapter (NIC) that can be connected to a router via a Network cable and other Network devices to communicate with the internet or a local area Network. In one example, the transmission device 1506 is a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In addition, the electronic device further includes: a display 1508, for displaying the sample data chunks, the order of the first data chunks, and the order of the second data chunks; and a connection bus 1510 for connecting the respective module parts in the above-described electronic apparatus.
According to a further aspect of an embodiment of the present invention, there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:
s1, reading sample data blocks from a target data set stored in the storage space according to the sequence of the first data blocks, performing model training on a target model according to the read sample data blocks, and caching the read sample data blocks into a target cache space until the sample data blocks cached in the target cache space reach the upper cache limit of the target cache space;
s2, after the sample data blocks are read from the target data set according to the sequence of the first data blocks, adjusting the sequence of the first data blocks into the sequence of the second data blocks, wherein the sequence of the first data blocks is different from the sequence of the second data blocks;
s3, according to the second data block sequence, reading sample data blocks in the target data set from the target cache space and the storage space, and performing model training on the target model according to the read sample data blocks in the target data set, wherein according to the second data block sequence, reading the sample data blocks in the target data set from the target cache space and the storage space comprises: and under the condition that the sample data blocks in the target data set are inquired in the target cache space, reading the sample data blocks in the target data set from the target cache space, and under the condition that the sample data blocks in the target data set cannot be inquired in the target cache space, reading the sample data blocks in the target data set from the storage space.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be substantially or partially implemented in the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, and including instructions for causing one or more computer devices (which may be personal computers, servers, or network devices) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a division of a logic function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (15)
1. A method for reading training sample data is characterized by comprising the following steps:
reading sample data blocks from a target data set stored in a storage space according to a first data block sequence, performing model training on a target model according to the read sample data blocks, and caching the read sample data blocks to a target cache space until the sample data blocks cached in the target cache space reach the upper cache limit of the target cache space;
after the sample data blocks are read from the target data set according to the first data block sequence, adjusting the first data block sequence to a second data block sequence, wherein the first data block sequence is different from the second data block sequence;
according to the second data block sequence, reading sample data blocks in the target data set from the target cache space and the storage space, and performing model training on the target model according to the read sample data blocks in the target data set, wherein according to the second data block sequence, reading the sample data blocks in the target data set from the target cache space and the storage space includes: and under the condition that the sample data blocks in the target data set are inquired in the target cache space, reading the sample data blocks in the target data set from the target cache space, and under the condition that the sample data blocks in the target data set are not inquired in the target cache space, reading the sample data blocks in the target data set from the storage space.
2. The method of claim 1, wherein the adjusting the first data block order to the second data block order comprises:
performing a first scrambling operation on the sample data blocks arranged according to the first data block sequence in the target data set to obtain a second data block sequence, wherein the first scrambling operation is used for changing the use sequence of the sample data blocks in the target data set; or
And performing a second scrambling operation on the sample data chunks arranged in the target data set according to the initial data chunk sequence to obtain a second data chunk sequence, where the initial data chunk sequence is the arrangement sequence of the sample data chunks in the target data set stored in the storage space, and the second scrambling operation is used to change the use sequence of the sample data chunks in the target data set.
3. The method of claim 1, wherein the reading sample data chunks in the target data set from the target cache space and the storage space according to the second data chunk order, and performing model training on the target model according to the read sample data chunks in the target data set comprises:
reading all sample data blocks in the target data set from the target cache space and the storage space according to the sequence of the second data blocks to obtain a first sample data block sequence, wherein the first sample data block sequence comprises all sample data blocks arranged according to the sequence of the second data blocks; inputting the first sample data block sequence into the target model to perform model training on the target model; or
And reading all sample data blocks in the target data set from the target cache space and the storage space according to the sequence of the second data blocks, and inputting the read sample data block or group of sample data blocks into the target model every time one or group of sample data blocks are read so as to perform model training on the target model.
4. The method of claim 1, wherein reading sample data chunks from a target data set stored in a storage space in an order of a first data chunk, and performing model training on a target model according to the read sample data chunks comprises:
reading all sample data blocks in the target data set from the target data set stored in the storage space according to the sequence of the first data blocks to obtain a second sample data block sequence, wherein the second sample data block sequence comprises all sample data blocks arranged according to the sequence of the first data blocks; inputting the second sample data block sequence into the target model to perform model training on the target model; or
And according to the sequence of the first data blocks, reading all sample data blocks in the target data set from the target data set stored in the storage space, and inputting the read sample data blocks or a group of sample data blocks into the target model every time one or a group of sample data blocks are read so as to perform model training on the target model.
5. The method of claim 1,
performing model training on a target model according to the read sample data blocks, including: using sample data blocks arranged according to the first data block sequence in the target data set to perform a first round of training on the target model;
the reading of the sample data blocks in the target data set from the target cache space and the storage space according to the second data block sequence includes: during or after the first round of training of the target model, reading sample data blocks in the target data set from the target cache space and the storage space according to the sequence of the second data blocks;
performing model training on the target model according to the read sample data blocks in the target data set, wherein the model training comprises the following steps: and performing a second round of training on the target model by using the sample data blocks in the target data set, which are arranged according to the second data block sequence.
6. The method of claim 1, wherein the adjusting the first data block order to the second data block order comprises:
and in the process of using the sample data blocks arranged in the target data set according to the sequence of the first data block to perform first round training on the target model, adjusting the sequence of the first data block to the sequence of a second data block.
7. The method according to any one of claims 1 to 6, wherein after reading the sample data blocks in the target data set from the target cache space and the storage space in the second data block order, and performing model training on the target model according to the read sample data blocks in the target data set, the method further comprises:
adjusting the second data block order to a third data block order, wherein the second data block order is different from the third data block order;
according to the third data block sequence, reading sample data blocks in the target data set from the target cache space and the storage space, and performing model training on the target model according to the read sample data blocks in the target data set, wherein according to the third data block sequence, reading the sample data blocks in the target data set from the target cache space and the storage space includes: and under the condition that the sample data blocks in the target data set are inquired in the target cache space, reading the sample data blocks in the target data set from the target cache space, and under the condition that the sample data blocks in the target data set are not inquired in the target cache space, reading the sample data blocks in the target data set from the storage space.
8. The method of any one of claims 1 to 6, wherein said reading sample data blocks in the target data set from the target cache space comprises:
and using a plurality of threads to acquire a group of sample data blocks in the target data set from the target cache space in parallel.
9. An apparatus for reading training sample data, comprising:
the first reading unit is used for reading sample data blocks from a target data set stored in a storage space according to a first data block sequence, performing model training on a target model according to the read sample data blocks, and caching the read sample data blocks into a target cache space until the sample data blocks cached in the target cache space reach the upper cache limit of the target cache space;
an updating unit, configured to adjust, after the sample data chunks are read from the target data set according to the first data chunk order, the first data chunk order to a second data chunk order, where the first data chunk order is different from the second data chunk order;
a second reading unit, configured to read sample data chunks in the target data set from the target cache space and the storage space according to the second data chunk order, and perform model training on the target model according to the read sample data chunks in the target data set, where reading the sample data chunks in the target data set from the target cache space and the storage space according to the second data chunk order includes: and under the condition that the sample data blocks in the target data set are inquired in the target cache space, reading the sample data blocks in the target data set from the target cache space, and under the condition that the sample data blocks in the target data set are not inquired in the target cache space, reading the sample data blocks in the target data set from the storage space.
10. The apparatus of claim 9, wherein the updating unit comprises:
a first operation module, configured to perform a first scrambling operation on sample data chunks arranged according to the first data chunk order in the target data set, to obtain the second data chunk order, where the first scrambling operation is used to change a use order of the sample data chunks in the target data set; or
A second operation module, configured to perform a second scrambling operation on sample data chunks arranged in the target data set according to an initial data chunk order to obtain a second data chunk order, where the initial data chunk order is an arrangement order of the sample data chunks in the target data set stored in the storage space, and the second scrambling operation is used to change a use order of the sample data chunks in the target data set.
11. The apparatus of claim 9, wherein the second reading unit comprises:
a first reading module, configured to read all sample data chunks in the target data set from the target cache space and the storage space according to the second data chunk sequence to obtain a first sample data chunk sequence, where the first sample data chunk sequence includes all sample data chunks arranged according to the second data chunk sequence; inputting the first sample data block sequence into the target model to perform model training on the target model; or
And the second reading module is used for reading all sample data blocks in the target data set from the target cache space and the storage space according to the sequence of the second data blocks, and inputting the read sample data block or group of sample data blocks into the target model every time one or group of sample data blocks are read so as to perform model training on the target model.
12. The apparatus of claim 9, wherein the first reading unit comprises:
a third reading module, configured to read all sample data chunks in the target data set from the target data set stored in the storage space according to the first data chunk sequence to obtain a second sample data chunk sequence, where the second sample data chunk sequence includes all sample data chunks arranged according to the first data chunk sequence; inputting the second sample data block sequence into the target model to perform model training on the target model; or
And a fourth reading module, configured to read all sample data chunks in the target data set from the target data set stored in the storage space according to the sequence of the first data chunks, and input the read sample data chunk or sample data chunks into the target model every time one or a group of sample data chunks are read, so as to perform model training on the target model.
13. The apparatus of claim 9,
the first reading unit includes: the first training module is used for performing first round training on the target model by using sample data blocks which are arranged in the target data set according to the sequence of the first data blocks;
the second reading unit includes: a second training module, configured to read sample data chunks in the target data set from the target cache space and the storage space according to the second data chunk order during or after the first round of training on the target model;
a third training module, configured to perform model training on the target model according to the read sample data chunks in the target data set, where the model training includes: and performing a second round of training on the target model by using the sample data blocks in the target data set, which are arranged according to the second data block sequence.
14. A computer-readable storage medium comprising a stored program, wherein the program when executed performs the method of any of claims 1 to 8.
15. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method as claimed in any one of claims 1 to 8 by means of the computer program.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010891955.6A CN111813711B (en) | 2020-08-31 | 2020-08-31 | Method and device for reading training sample data, storage medium and electronic equipment |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010891955.6A CN111813711B (en) | 2020-08-31 | 2020-08-31 | Method and device for reading training sample data, storage medium and electronic equipment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111813711A CN111813711A (en) | 2020-10-23 |
| CN111813711B true CN111813711B (en) | 2020-12-29 |
Family
ID=72859733
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010891955.6A Active CN111813711B (en) | 2020-08-31 | 2020-08-31 | Method and device for reading training sample data, storage medium and electronic equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111813711B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113392309B (en) * | 2021-01-04 | 2025-08-26 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and storage medium |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103907095A (en) * | 2011-07-11 | 2014-07-02 | 内存技术有限责任公司 | Mobile memory cache read optimization |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6889288B2 (en) * | 2002-12-02 | 2005-05-03 | Emc Corporation | Reducing data copy operations for writing data from a network to storage of a cached data storage system by organizing cache blocks as linked lists of data fragments |
| CN105989012B (en) * | 2015-01-28 | 2019-12-13 | 深圳市腾讯计算机系统有限公司 | page display method, device, mobile terminal and system |
| US10042764B2 (en) * | 2016-06-27 | 2018-08-07 | International Business Machines Corporation | Processing commands in a directory-based computer memory management system |
-
2020
- 2020-08-31 CN CN202010891955.6A patent/CN111813711B/en active Active
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103907095A (en) * | 2011-07-11 | 2014-07-02 | 内存技术有限责任公司 | Mobile memory cache read optimization |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111813711A (en) | 2020-10-23 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112948025B (en) | Data loading method and device, storage medium, computing equipment and computing system | |
| CN104133783B (en) | Method and device for processing distributed cache data | |
| US20170060922A1 (en) | Method and device for data search | |
| CN103593442B (en) | The De-weight method and device of daily record data | |
| US10790862B2 (en) | Cache index mapping | |
| CN113392863A (en) | Method and device for acquiring machine learning training data set and terminal | |
| CN110516119A (en) | A method, device and storage medium for organizing and scheduling natural resource scene data | |
| CN115905168A (en) | Adaptive compression method and compression apparatus, computer device, storage medium | |
| CN112667847A (en) | Data caching method, data caching device and electronic equipment | |
| CN110515979A (en) | Data query method, device, equipment and storage medium | |
| CN111813711B (en) | Method and device for reading training sample data, storage medium and electronic equipment | |
| WO2024017283A1 (en) | Model training system and method and related device | |
| US20190057120A1 (en) | Efficient Key Data Store Entry Traversal and Result Generation | |
| US20170293661A1 (en) | Bucket skiplists | |
| HK40031303A (en) | Method and apparatus for reading training sample data, storage medium and electronic device | |
| HK40031303B (en) | Method and apparatus for reading training sample data, storage medium and electronic device | |
| CN119003458A (en) | Log query method, device, medium, electronic equipment and program product | |
| CN110442616B (en) | Page access path analysis method and system for large data volume | |
| US11966393B2 (en) | Adaptive data prefetch | |
| JP2018511131A (en) | Hierarchical cost-based caching for online media | |
| US11809992B1 (en) | Applying compression profiles across similar neural network architectures | |
| US10185729B2 (en) | Index creation method and system | |
| CN111240843B (en) | Data acquisition method and device, electronic equipment and storage medium | |
| CN113792031A (en) | Method, system, device and medium for processing key value pair data | |
| CN107679093A (en) | A kind of data query method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |
| REG | Reference to a national code |
Ref country code: HK Ref legal event code: DE Ref document number: 40031303 Country of ref document: HK |