
WO2025001399A1 - Method and apparatus for data check, and device, medium and product - Google Patents

Method and apparatus for data check, and device, medium and product

Info

Publication number
WO2025001399A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
memory
groups
encoding
bits
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/086034
Other languages
French (fr)
Chinese (zh)
Inventor
陈智勇
夏天
信恒超
谭小飞
黄天强
梁传增
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202311126669.0A (CN119226028A)
Application filed by Huawei Technologies Co Ltd
Publication of WO2025001399A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's

Definitions

  • Embodiments of the present application relate to the field of memory technology, and more specifically to methods, devices, equipment, media, and products for verifying data in a memory.
  • HPC: high-performance computing.
  • HBM is a high-performance dynamic random access memory based on 3D stacking technology.
  • HBM technology achieves large-capacity, high-bit-width memory arrays by stacking multiple Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM, or DDR for short) dies in three-dimensional space.
  • HBM supports multi-channel parallel reading of memory, which significantly improves the reading speed.
  • An embodiment of the present application provides a data verification solution.
  • a method for data verification includes: obtaining N groups of data from a memory, each group of the N groups of data including a check bit, wherein N is a positive integer greater than or equal to 2; aggregating the N groups of data to obtain aggregated data, the aggregated data including an aggregated check bit, and the aggregated check bit being an aggregation of the check bits of the N groups of data; performing error correction code (ECC) encoding on the aggregated data to obtain coded data; decomposing the coded data into N groups of coded data, each group of the N groups of coded data including a check bit; and writing the N groups of coded data into the memory respectively.
  • this method can support memory ECC functions with stronger error detection and correction capabilities compared to traditional technologies, and meet higher error correction and error detection requirements.
  • obtaining N groups of data from the memory includes: reading data from N channels or N memory bank groups of the memory, wherein the data of each channel or each memory bank group is a group of data.
  • the data as the coding unit can be aggregated from the spatial dimension to obtain a larger coding unit.
  • performing ECC encoding on the aggregated data includes: determining a method for obtaining N groups of data from a memory according to a preset encoding method; and performing ECC encoding on the aggregated data according to the preset encoding method.
  • performing ECC encoding on the aggregated data according to a preset encoding method includes: presetting the encoding method according to a user input, the input including an ECC performance requirement for a memory, thereby automatically providing an optimal encoding method under hardware conditions based on the ECC performance requirement.
  • the method further comprises: in response to the encoding method being updated, acquiring data from the memory according to the updated encoding method to update the check bit of the acquired data.
  • the relevant configuration can be dynamically updated without restarting in response to demand changes, thereby avoiding business interruption caused by updating the configuration.
  • the encoding method includes Reed-Solomon (RS) encoding.
  • RS encoding can detect and correct multi-bit errors, significantly improving the error detection and correction capabilities of the generated ECC code.
  • the method further comprises: determining a corresponding decoding method according to the encoding method; and when reading data from the memory for use, generating error-corrected data bits for the data based on the check bits of the read data according to the decoding method.
  • the method further includes: determining the number of bits that are different between the corrected data bits and the read data bits; if the number of bits satisfies a threshold condition, generating a prompt indicating that there may be an error in the corrected data bits.
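The miscorrection check described above can be sketched as a bit count over the XOR of the data as read and the data as corrected. This is a minimal sketch: the function names, the integer representation of the data bits, and the specific threshold condition (more flipped bits than `max_flips`) are assumptions for illustration, since the text only says that the count must satisfy "a threshold condition".

```python
from typing import Optional


def differing_bits(read_bits: int, corrected_bits: int) -> int:
    """Count the bit positions where the decoder changed the data."""
    return bin(read_bits ^ corrected_bits).count("1")


def check_correction(read_bits: int, corrected_bits: int,
                     max_flips: int = 1) -> Optional[str]:
    """Return a prompt string if the decoder flipped suspiciously many bits.

    max_flips is a hypothetical threshold; a real system would derive it
    from the configured code and the expected failure modes.
    """
    flips = differing_bits(read_bits, corrected_bits)
    if flips > max_flips:
        return (f"warning: decoder changed {flips} bits; "
                "the corrected data may itself be wrong")
    return None
```

A decoder that changes many bits at once is more likely to have decoded to the wrong codeword, which is why the count is treated as a warning signal rather than as proof of an error.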
  • a device for data verification, comprising: an acquisition module, configured to acquire N groups of data from a memory, each group of the N groups of data comprising a check bit, wherein N is a positive integer greater than or equal to 2; an aggregation module, configured to aggregate the N groups of data to obtain aggregated data, the aggregated data comprising an aggregated check bit, the aggregated check bit being an aggregation of the check bits of the N groups of data; an encoding module, configured to perform error correction code (ECC) encoding on the aggregated data to obtain encoded data; a decomposition module, configured to decompose the encoded data into N groups of encoded data, each group of the N groups of encoded data comprising a check bit; and a writing module, configured to write the N groups of encoded data into the memory respectively.
  • the acquisition module includes a first reading module, which is configured to read data from N channels or N memory bank groups of the memory, wherein the data of each channel or each memory bank group is a group of data.
  • the encoding module includes: a determination module, configured to determine a method for obtaining N groups of data from a memory according to a preset encoding method; and an ECC encoding module, configured to perform ECC encoding on the aggregated data according to a preset encoding method.
  • the ECC encoding module is configured to preset the encoding method according to user input, where the input includes ECC performance requirements for the memory.
  • the apparatus further comprises an updating module, wherein the updating module is configured to: in response to the encoding method being updated, obtain data from the memory according to the updated encoding method to update a check bit of the obtained data.
  • the encoding scheme includes Reed-Solomon encoding.
  • the device also includes: a decoding method determination module, configured to determine a corresponding decoding method according to an encoding method; and an error correction module, configured to generate error-corrected data bits for the data based on the check bits of the read data according to the decoding method when reading data from the memory for use.
  • the device also includes: a difference bit number determination module, configured to determine the number of bits that differ between the error-corrected data bits and the data bits of the read data; and a prompt module, configured to generate a prompt when the number of differing bits meets a threshold condition, the prompt indicating that there may be an error in the error-corrected data bits.
  • an electronic device including a processor and a memory, wherein computer instructions are stored in the memory, and when the computer instructions are executed by the processor, the electronic device performs actions according to the method in the first aspect or any embodiment thereof.
  • a computer-readable storage medium stores computer-executable instructions, which, when executed by an electronic device, enable the electronic device to perform the operation of the method according to the first aspect or any of its embodiments.
  • a computer program product is provided.
  • the computer program product is tangibly stored on a computer-readable medium and includes computer-executable instructions, which, when executed, implement the operations of the method according to the first aspect or any of its embodiments.
  • FIG. 1 is a schematic diagram showing an example environment in which various embodiments of the present application can be implemented.
  • FIG. 2 shows a flowchart of an example method for data verification according to some embodiments of the present application.
  • FIG. 3 shows a schematic diagram of a non-limiting example of reading data for aggregation according to some embodiments of the present application.
  • FIG. 4 is a schematic diagram showing a non-limiting example of aggregating multiple groups of data for ECC encoding according to some embodiments of the present application.
  • FIG. 5 is a schematic diagram showing a non-limiting example of decomposing and writing ECC-encoded aggregate data into a memory according to some embodiments of the present application.
  • FIG. 6 shows a flowchart of an example method for performing error detection and correction on data according to some embodiments of the present application.
  • FIG. 7 shows a flowchart of an example method for data verification according to some embodiments of the present application.
  • FIG. 8 shows a schematic block diagram of a message transmission device according to some embodiments of the present application.
  • FIG. 9 shows a schematic block diagram of an example device that can be used to implement embodiments of the present application.
  • HBM is based on 3D stacking technology, which can provide multi-channel parallel reading of memory. In application fields that require HPC, HBM is widely used due to the high demand for memory bandwidth in high-computing systems.
  • the structure of HBM means that it has no separate redundant dies, and its redundancy ratio is also lower than that of general DDR. Due to the limited redundancy ratio and other reasons, the ECC algorithm that conventional HBM can support is relatively simple and has limited error detection and correction capabilities. For example, for 128 bits of data, an 8-bit Hamming code ECC can detect and correct a one-bit error. Detecting two-bit errors as well requires a 9-bit Hamming code ECC.
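The check-bit counts quoted above follow from the standard Hamming bound: r check bits can locate a single error among k data bits when 2^r >= k + r + 1, and one extra overall parity bit adds double-error detection. The short sketch below reproduces the arithmetic; the function names are chosen for illustration only.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Single-error correction plus double-error detection: one extra parity bit."""
    return hamming_check_bits(data_bits) + 1


print(hamming_check_bits(128))  # 8
print(secded_check_bits(128))   # 9
```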
  • Some HBM providers provide 16 redundant bits for ECC check bits for 128-bit data.
  • This redundancy ratio can support 8-bit basic Hamming code check bits and 8-bit parity check bits, where each parity check bit is calculated for 16 bits of 128-bit data.
  • This method can achieve single-die error correction, but its error correction requires multiple trial calculations, resulting in a large delay.
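The parity half of this 8+8 layout can be sketched as one parity bit per 16-bit slice of the 128-bit data. The text does not say which 16 bits each parity bit covers, so the contiguous segmentation below is an assumption, as are the function name and the integer representation of the data.

```python
def segment_parity(data: int, total_bits: int = 128, segment_bits: int = 16) -> int:
    """Return one even-parity bit per segment, packed into an integer.

    Bit i of the result is the parity of data bits
    [i * segment_bits, (i + 1) * segment_bits). Contiguous segments are an
    assumed layout; hardware may interleave the covered bits differently.
    """
    mask = (1 << segment_bits) - 1
    parity = 0
    for seg in range(total_bits // segment_bits):
        chunk = (data >> (seg * segment_bits)) & mask
        parity |= (bin(chunk).count("1") & 1) << seg
    return parity
```

With the defaults, the result is 8 bits wide, which together with an 8-bit Hamming code fills the 16 redundant bits mentioned above.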
  • the redundancy ratio of the memory needs to be increased to 8:1, which increases the cost of the hardware.
  • Hamming code-based ECC cannot correct multi-bit errors. If more than two bits of errors occur, missed detection or miscorrection will still occur, that is, it cannot achieve the ability to correct multi-bit errors required to cope with hard failures.
  • an embodiment of the present disclosure provides a data verification scheme.
  • the scheme aggregates multiple groups of data read from the memory unit (for example, from dimensions such as space and time) to obtain a larger coding unit for ECC encoding.
  • the scheme can support ECC-related functions for memory with stronger error detection and correction capabilities than traditional technologies, thereby significantly improving the reliability of ECC error detection and correction of the memory and reducing uncorrectable errors, meeting higher error correction and error detection requirements.
  • although HBM may be used as an example in this description, the scheme according to the embodiments of the present application is also applicable to various other memory forms, such as DDR, mobile DDR, and storage class memory (SCM).
  • Example environment 100 includes a computing device 110.
  • Computing device 110 includes a processor 120 and a memory 130.
  • Processor 120 may include a CPU, a GPU, various types of dedicated computing units (such as, AI chips, etc.), various controllers, or a suitable combination of the foregoing items.
  • Processor 120 can perform various processing actions according to computer program instructions stored in or loaded into memory 130.
  • the memory 130 may be a random access memory (RAM), such as a dynamic random access memory (DRAM), which may support fast random read and write from the processor 120, thereby directly supporting
  • the processor 120 can quickly read or write data to a random address in the memory 130.
  • when the processor 120 performs a process, it can read data required for the execution (such as instructions and/or calculation parameters) from the memory 130 and write data (such as the output of the calculation) to the memory 130.
  • the computing device 110 also includes one or more memory controllers 140 as part of the processor 120.
  • the memory controller 140 is a component of the computing device 110 used to control and manage data transmission between a computing unit such as a CPU/GPU and the memory 130.
  • the processor 120 reads data from and writes data to the memory 130 via the memory controller 140.
  • Each memory controller 140 can be implemented as a separate chip, or integrated into another larger chip, such as a controller built into a CPU or a north bridge.
  • data may be transmitted between the processor 120 and the memory 130 in a burst transmission manner.
  • the burst transmission has a certain burst length.
  • the memory controller 140 may continuously read/write a number of consecutive storage units equal to the burst length according to a specified starting address without continuously providing addresses.
  • a unit burst length is capable of reading 64 bits of data from the memory 130 via a single channel
  • the processor 120 may be capable of reading 128 bits of data from the storage area via a single channel.
  • the memory 130 may include multiple bank groups, each of which can be read and written independently, so that data can be read/written from multiple bank groups at the same time.
  • the memory 130 may be an HBM based on 3D stacking technology, in which multiple bank groups are stacked together in a three-dimensional manner.
  • the memory 130 also supports virtual channels. Virtual channel technology can split each physical channel into multiple virtual channels, so that the number of data units that can be transmitted in parallel between the processor 120 and the memory 130 is greater than the number of physical channels.
  • the embodiments of the present application are not limited to the specific number of memory controllers, memory bank groups, or channels/virtual channels.
  • the memory 130 supports an error-correcting code (ECC) check function.
  • the processor 120 can, for example, use the memory controller 140 to calculate the corresponding ECC.
  • the calculated ECC is also written to the memory 130 as a check bit. Later, when the group of data bits is read back for use by the processor 120, the corresponding check bit is also read together.
  • the memory controller 140 then recalculates the ECC for the read data bits and compares it with the ECC in the read check bits. If the two do not match, the memory controller 140 can determine that there is an error in the read data bits. In this case, the memory controller 140 can perform further error correction decoding on the data bits to determine which one or more bits are incorrect. Then, the erroneous bits are corrected. For binary data bits, this correction can mean that the erroneous bits are flipped. The correction of the data enables the processor 120 to correctly perform subsequent tasks, thereby ensuring the normal operation of the computing device 110.
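The write-then-check cycle described above can be illustrated with a small single-error-correcting Hamming code. This is a minimal sketch, not the encoder of the application: the function names and the bit layout (check bits at power-of-two positions, counted from 1) are assumptions. For a Hamming code, the syndrome computed here is equivalent to XOR-ing the recalculated check bits with the stored ones: zero means they match, and a non-zero value is the position of the flipped bit.

```python
def hamming_encode(data_bits):
    """Encode a list of 0/1 data bits; check bits sit at positions 1, 2, 4, ..."""
    r = 1
    while 2 ** r < len(data_bits) + r + 1:
        r += 1
    n = len(data_bits) + r
    code = [0] * (n + 1)          # index 0 unused so positions are 1-based
    remaining = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):       # not a power of two: a data position
            code[pos] = next(remaining)
    for p in range(r):
        mask = 1 << p
        parity = 0
        for pos in range(1, n + 1):
            if pos & mask and pos != mask:
                parity ^= code[pos]
        code[mask] = parity       # makes the parity over the covered group even
    return code[1:]


def syndrome(code):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            s ^= pos
    return s


def correct(code):
    """Flip the bit the syndrome points at (valid for single-bit errors only)."""
    s = syndrome(code)
    fixed = list(code)
    if 0 < s <= len(fixed):
        fixed[s - 1] ^= 1
    return fixed, s
```

As the surrounding text notes, this kind of code only handles one flipped bit reliably; two or more flipped bits can produce a syndrome that points at the wrong position.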
  • the memory 130 may include a data portion 150 specifically storing actual data bits and a check portion 160 storing check bits.
  • the processor 120 may write the check bits of the data bits into the check portion 160 based on the corresponding read and write specifications. It should be understood that the data portion 150 and the check portion 160 are shown only as examples. In some embodiments, there may be no separate dies storing check bits in the memory 130.
  • the memory 130 may also store data bits and corresponding check bits in other appropriate structures.
  • the processor 120 includes a paired ECC encoding circuit and decoding circuit (not shown), for example, in the memory controller 140 or in another part thereof.
  • a paired encoding circuit and decoding circuit can implement the encoding and decoding functions of a specific ECC algorithm, for example, a Hamming code algorithm or a Reed-Solomon encoding algorithm.
  • the encoding circuit can be used to calculate the ECC of the data bit.
  • the decoding circuit can decode the calculated ECC to determine the error bit in the corresponding data bit and correct it.
  • the memory controller 140 may include multiple pairs of encoding circuits and decoding circuits, and switch between different circuits according to the system settings to perform the ECC function using different algorithms.
  • the data unit targeted by ECC encoding is called a codeword.
  • Each codeword includes actual data bits and redundant check bits including ECC.
  • the number of check bits needs to meet certain conditions, which are usually limited by the achievable hardware redundancy ratio between the capacity of the data portion 150 and the check portion 160.
  • a maximum of 8 additional check bits can support the detection and correction of one bit of error. If it is necessary to ensure reliable detection of two-bit errors, the redundancy ratio needs to be increased. If two-bit errors are to be detected and one-bit errors are to be corrected, 128-bit data requires 9 bits of ECC.
  • the processor 120 can increase the size of the coding unit according to the method of an embodiment of the present application, as will be described in detail later.
  • the architecture and functions in the example environment 100 are described for exemplary purposes only, and do not imply any limitation on the scope of the present application.
  • other devices, systems or components not shown may also exist in the example environment 100.
  • the embodiments of the present application may also be applied to other environments with different and/or other functions.
  • the processor 120 and the memory 130 may be packaged in the same SoC.
  • the read, write and data verification actions for the memory 130 may also be performed by the base die. The following will generally describe the embodiments of the present application in the context of the computing device 110 using the processor 120 to perform actions.
  • Figure 2 shows a flow chart of an example method 200 for data verification according to some embodiments of the present application.
  • the example method 200 can be performed, for example, by the computing device 110 as shown in Figure 1. It should be understood that the method 200 can also include additional actions not shown, and the scope of the present application is not limited in this respect.
  • the method 200 is described in detail below in conjunction with the example environment 100 of Figure 1.
  • N groups of data are obtained from the memory, each of which includes a check bit.
  • N is an integer greater than or equal to 2.
  • the computing device 110 can read N groups of data from the memory 130 to the processor 120 via the memory controller 140, wherein each of the N groups of data includes actual data bits and a check bit.
  • the check bit can be used to store the ECC check code of the group of data bits. At this time, the check bit can be empty or may need to be updated.
  • the processor 120 may read N groups of data in parallel from N memory bank groups or N channels of the memory 130.
  • the N channels may be virtual channels.
  • the N groups of data read in parallel may be aggregated for encoding as described below. Compared with ECC encoding of a single group of data obtained from each channel, this reading method can obtain a larger encoding unit without increasing latency.
  • Each of the L time slices is the duration of a burst transmission of data read from the memory.
  • the burst transmission has a certain burst length.
  • the duration of a burst transmission can be used as a unit time for configuration.
  • the N groups of data are aggregated to obtain aggregated data, the aggregated data including an aggregated check bit, the aggregated check bit being an aggregation of the check bits of the N groups of data.
  • the processor 120 may aggregate the N groups of data obtained at 210 to obtain aggregated data, including an aggregated check bit of the check bits of the N groups of data.
  • the processor 120 may concatenate the data bits of the N groups of data and the check bits of the N groups of data to obtain aggregated data. It should be understood that such aggregation may be a logical aggregation, as long as the processor 120 is able to know which bits in the aggregated data are data bits and which bits are check bits.
  • the aggregated data is ECC-encoded to obtain encoded data.
  • the processor 120 may perform ECC encoding on the aggregated data obtained at 220 to obtain encoded data.
  • the ECC encoding may be Hamming code encoding and its various variants.
  • the processor 120 may calculate a Hamming code for the aggregated data, calculate a parity bit for each of a plurality of segments of the aggregated data, and use both as part of the check bits of the encoded data.
  • the ECC encoding may be RS encoding with a stronger error correction and detection capability.
  • the coded data is decomposed into N groups of coded data, each of which includes a check bit.
  • the processor 120 may decompose the coded data obtained at 230 into N groups of coded data, each of which includes a check bit.
  • this decomposition may be a logical decomposition, as long as the processor 120 can correctly process (e.g., transmit) the N groups of coded data in groups.
  • the N groups of coded data are written to the memory separately.
  • the processor 120 may write the N groups of coded data decomposed at 240 to the memory 130 separately.
  • the processor 120 may write data to the memory 130 in a manner corresponding to the manner in which the data is read from the memory 130 at 210.
  • the processor 120 may write the N groups of data to N memory bank groups or N channels (or virtual channels) in parallel.
  • method 200 can support memory ECC functions with stronger error detection and correction capabilities (such as RS encoding with stronger capabilities than Hamming code) compared to traditional methods, thereby significantly improving the reliability of memory ECC error detection and correction functions and reducing the occurrence of uncorrectable errors (UCE).
  • method 200 can also meet the increased demand by aggregating data without changing hardware characteristics such as hardware redundancy ratio after the error correction and error detection requirements of memory ECC increase.
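The five actions of method 200 can be sketched end to end as follows. This is an illustrative model only: the dictionary standing in for the memory, the function names, and the group sizes (128 data bits and 8 check bits per group, N = 4) are assumptions. The encoder is deliberately a stand-in: CRC-32 from the Python standard library is used only because it yields exactly 32 check bits for four aggregated groups. It detects errors but cannot correct them, unlike the Hamming or RS codes the text describes.

```python
import zlib


def aggregate(groups):
    """Action 220: concatenate all data bits, then all check bits."""
    data = b"".join(d for d, _ in groups)
    check = b"".join(c for _, c in groups)
    return data, check


def encode(data: bytes, check_len: int) -> bytes:
    """Action 230 (stand-in): CRC-32 over the aggregated data bits."""
    if check_len != 4:
        raise ValueError("this stand-in only fills exactly 32 check bits")
    return zlib.crc32(data).to_bytes(4, "big")


def decompose(data: bytes, check: bytes, n: int):
    """Action 240: split data and check bits back into n equal groups."""
    d, c = len(data) // n, len(check) // n
    return [(data[i * d:(i + 1) * d], check[i * c:(i + 1) * c]) for i in range(n)]


def reencode(memory: dict, addresses):
    """Run actions 210 to 250 against a dict that models the memory."""
    groups = [memory[a] for a in addresses]           # 210: obtain N groups
    data, old_check = aggregate(groups)               # 220: aggregate
    new_check = encode(data, len(old_check))          # 230: ECC-encode
    coded = decompose(data, new_check, len(groups))   # 240: decompose
    for address, group in zip(addresses, coded):      # 250: write back
        memory[address] = group
    return coded


# Four groups of 16 data bytes, each with one empty check byte.
memory = {i: (bytes([i]) * 16, b"\x00") for i in range(4)}
coded = reencode(memory, [0, 1, 2, 3])
```

Each group keeps its original size after the round trip, so the hardware redundancy ratio is unchanged even though the check value now covers all four groups.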
  • FIG. 3 shows a schematic diagram 300 of a non-limiting example of reading data for aggregation according to some embodiments of the present application.
  • a processor 310 is shown in the schematic diagram 300, which communicates with an HBM 320 supporting multiple channels.
  • the processor 310 and the HBM 320 may be example implementations of the processor 120 and the memory 130 in FIG. 1, respectively, and the actions described in FIG. 3 may be example implementations of the action 210 in the method 200 of FIG. 2.
  • the processor 310 reads data in parallel via a plurality of controllers 330-1 to 330-I (individually or collectively referred to as controller 330).
  • controller 330-1 is taken as an example, which is capable of controlling the data transmission of channels 340-1 and 340-2.
  • controller 330-1 can read a group of unit data from each channel of channels 340-1 and 340-2 into its own buffer. The duration of a data cycle depends on the burst length of the burst transmission and the hardware conditions of the relevant components. Therefore, the group of unit data can be regarded as data read in a unit time slice.
  • the controller 330-1 can read data 311 from the channel 340-1 and read data 312 from the channel 340-2.
  • the controller 330-1 can read data 313 from the channel 340-1 and read data 314 from the channel 340-2.
  • Each group of data in the data 311 to 314 includes actual data bits and additional check bits. It should be understood that, depending on the scenario, the check bits at this time can be a null value reserved for later writing to the HBM 320.
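The read pattern of FIG. 3 (two channels, two time slices, four groups) can be modeled as a loop over time slices and channels. This is a sketch under stated assumptions: `read_burst` is a hypothetical callback standing in for one burst read on one channel, and the sequential inner loop only models the ordering, since the controller reads the channels of one time slice in parallel.

```python
def read_groups(read_burst, channels, time_slices):
    """Collect N = len(channels) * time_slices groups, slice by slice.

    read_burst(channel, time_slice) is a hypothetical callback that returns
    one group (data bits plus check bits) from one burst transfer.
    """
    groups = []
    for t in range(time_slices):
        for channel in channels:   # parallel in hardware, sequential here
            groups.append(((t, channel), read_burst(channel, t)))
    return groups


# With channels 340-1 and 340-2 and two time slices, the four results
# correspond to data 311, 312, 313 and 314 in that order.
groups = read_groups(lambda ch, t: f"{ch}@{t}", ["340-1", "340-2"], 2)
```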
  • if the controller 330-1 calculates the check bits separately for each group of data as a codeword, then for the example codeword of 128+8 bits, Hamming code encoding that detects and corrects single-bit errors can be used.
  • the controller 330-1 can aggregate the data 311 to 314 into a longer codeword for ECC encoding.
  • the longer codeword includes 32 check bits, so that 4 RS8 redundant encodings can be supported.
  • RS encoding is widely used in DDR4 and DDR5. It is a symbol-based algorithm that can be used to correct errors in one or more symbols, each symbol can be 8 bits, for example.
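To make the symbol-based idea concrete, the sketch below computes 4 RS check symbols of 8 bits each over 64 data bytes (4 groups of 128 bits), matching the 32 check bits mentioned above. It is a textbook systematic Reed-Solomon encoder over GF(2^8), not the circuit of the application: the field polynomial 0x11D, the generator roots, and all function names are assumptions. Only encoding and syndrome checking are shown; locating and correcting symbols needs a full decoder.

```python
def _build_tables(prim=0x11D):
    """Exponent and log tables for GF(2^8) with an assumed primitive polynomial."""
    exp, log = [0] * 512, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        x <<= 1
        if x & 0x100:
            x ^= prim
    for i in range(255, 512):
        exp[i] = exp[i - 255]
    return exp, log


EXP, LOG = _build_tables()


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def generator_poly(nsym):
    """Product of (x + alpha^i) for i in 0..nsym-1, highest degree first."""
    g = [1]
    for i in range(nsym):
        nxt = [0] * (len(g) + 1)
        for j, coef in enumerate(g):
            nxt[j] ^= coef
            nxt[j + 1] ^= gf_mul(coef, EXP[i])
        g = nxt
    return g


def rs_encode(msg: bytes, nsym: int) -> bytes:
    """Append nsym check symbols: the remainder of msg * x^nsym divided by g."""
    g = generator_poly(nsym)
    rem = list(msg) + [0] * nsym
    for i in range(len(msg)):
        coef = rem[i]
        if coef:
            for j in range(1, len(g)):
                rem[i + j] ^= gf_mul(g[j], coef)
    return bytes(msg) + bytes(rem[-nsym:])


def syndromes(codeword, nsym):
    """Evaluate the codeword at each generator root; all zero means no error seen."""
    out = []
    for i in range(nsym):
        y = 0
        for c in codeword:
            y = gf_mul(y, EXP[i]) ^ c
        out.append(y)
    return out


codeword = rs_encode(bytes(range(64)), 4)   # 64 data symbols + 4 check symbols
```

Because each symbol is a whole byte, a burst of flipped bits inside one byte counts as a single symbol error, which is what makes this family of codes attractive for multi-bit faults.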
  • the processor 310 aggregates data based on the data read in parallel by each controller 330. It should be understood that this is for illustration purposes only.
  • the computing device may set the configuration information of the encoding method according to the parameter values (e.g., the number of channels and the number of time slices) input by the user.
  • the input includes ECC performance requirements for the memory, for example, the correctable rate, the expected error correction delay, and the expected missed detection rate.
  • the computing device can then adapt the configuration parameters of the encoding method based on these performance requirements and the hardware conditions of the memory 130.
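One way to picture how channel and time-slice counts translate into code strength is the arithmetic below. It is a hypothetical sketch: the class, its field names, and the selection rule are not from the application. It assumes 128 data bits and 8 check bits per group, 8-bit RS symbols, and the standard RS property that 2t check symbols can correct up to t symbol errors. Preferring fewer time slices is also an assumption, on the reasoning that groups read in parallel add no latency while additional time slices do.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EccLayout:
    channels: int
    time_slices: int
    data_bits_per_group: int = 128   # example size from the text
    check_bits_per_group: int = 8    # example redundancy from the text
    symbol_bits: int = 8             # RS8-style symbols

    @property
    def groups(self) -> int:
        return self.channels * self.time_slices

    @property
    def data_bits(self) -> int:
        return self.groups * self.data_bits_per_group

    @property
    def check_symbols(self) -> int:
        return self.groups * self.check_bits_per_group // self.symbol_bits

    @property
    def correctable_symbols(self) -> int:
        return self.check_symbols // 2


def smallest_layout(channels: int, required_correctable: int,
                    max_slices: int = 8) -> Optional[EccLayout]:
    """Pick the fewest time slices that reach the required correction strength."""
    for slices in range(1, max_slices + 1):
        layout = EccLayout(channels, slices)
        if layout.correctable_symbols >= required_correctable:
            return layout
    return None
```

For two channels, `smallest_layout(2, 2)` selects two time slices, which gives the four-group, 32-check-bit codeword used in the FIG. 3 example.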
  • FIG. 4 illustrates a schematic diagram 400 of a non-limiting example of aggregating multiple sets of data for ECC encoding according to some embodiments of the present application, wherein multiple sets of data 411 to 414 are shown read at multiple time slices and/or from multiple channels.
  • the actions described in FIG. 4 may be example implementations of actions 220 and 230 in the method 200 of FIG. 2 , and the multiple sets of data 411 to 414 may be examples of data 311 to 314 read from the HBM 320 in the example of FIG. 3 .
  • FIG. 4 will continue to be described below in the context of the processor 310 in FIG. 3 performing the actions.
  • each group of data in data 411 to 414 includes a group of data bits and a group of check bits, such as data bit 425 and check bit 435 of data 411.
  • the check bit can be empty at this time.
  • the check bit can be a placeholder added by the processor according to the hardware redundancy ratio of the memory.
  • the processor 310 can aggregate the data 411 to 414 before performing the calculation.
  • the processor 310 concatenates the data bits of the data 411 to 414 together to form an aggregate 445 of data bits, and concatenates the check bits of the data 411 to 414 together to form an aggregate 455 of check bits.
  • the aggregate 445 of data bits and the aggregate 455 of check bits are aggregated (in this example, concatenated) to form final aggregate data 465.
  • Aggregate data 465 includes aggregate data bits 475 and aggregate check bits 485.
  • the processor 310 can perform ECC encoding on the aggregate data 465 to obtain encoded data. Specifically, the processor 310 can calculate the value of the check bits of the aggregate data 465 according to a preset ECC encoding method and put it into the aggregate check bits 485. In some embodiments, the processor 310 can perform ECC encoding on the aggregate data 465 according to a preset encoding method. For example, the processor 310 can configure its ECC encoding circuit according to the parameters in the encoding settings to perform ECC encoding on the aggregate data. In some embodiments, the encoding method can be RS encoding with specific parameters, which specify various properties of the RS encoding algorithm and codewords. In some embodiments, the processor 310 may include multiple ECC encoding circuits. According to the preset encoding method, the processor 310 can switch to the corresponding encoding circuit, configured according to the set parameters, to perform ECC encoding on the aggregate data.
  • FIG. 5 shows a schematic diagram 500 of a non-limiting example of decomposing and writing ECC-encoded aggregate data to a memory according to some embodiments of the present application.
  • the actions described in FIG. 5 may be an example implementation of actions 240 and 250 in the method 200 of FIG. 2, and the aggregate data 565 may be an example of the aggregate data 465 in the example of FIG. 4.
  • FIG. 5 will continue to be described below in the context of the processor 310 in FIG. 3 performing actions.
  • After performing ECC encoding on the aggregate data 565, the processor 310 also needs to write each data bit and check bit of the encoded aggregate data 565 to the corresponding address. Therefore, the processor 310 can perform the reverse of the above-mentioned aggregation process, and write the encoded data back (for example, via the controller 330) to the HBM 320 in a manner analogous to reading the N groups of data from the HBM.
  • the aggregated data 565 may be aggregated from multiple sets of data (e.g., data 411 to 414 in FIG. 4), and its data bits 575 include multiple parts from the multiple sets of data, such as part 545.
  • the number of check bits 585 of the aggregated data 565 is likewise equal to the total number of check bits of the multiple sets of data.
  • the check bits 585 include the ECC values calculated for the data bits 575.
  • the check bits 585 may be considered to include multiple parts corresponding to the check bits of each set of data in the multiple sets of data, such as 555.
  • the processor 310 can decompose the data bits 575 into multiple groups of data bits again, and decompose the check bits 585 into multiple groups of check bits.
  • the processor 310 can then combine each group of data bits in the multiple groups of data bits with the corresponding check bits to re-obtain multiple groups of data, in this example, data 511 to 514.
  • Each group of data in the data 511 to 514 has the same structure as the multiple groups of data previously obtained by the processor 310 and includes a portion of the calculated check bits 585.
  • the processor 310 can rewrite N groups of data into the HBM 320 according to the corresponding read and write specifications in an analogous manner to the previous reading of multiple groups of data.
  • the controller 330-1 of the processor 310 can write data 511 to 514 into the HBM 320 in two time slices in parallel via the channel 340-1 and the channel 340-2.
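  • The aggregate, encode, and decompose steps described for FIG. 4 and FIG. 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the circuit described in the application: the names `Group`, `xor_fold_check`, and `encode_and_decompose` are hypothetical, and the XOR-fold check function is only a placeholder for the RS encoder, chosen so that the example stays short.

```python
from typing import Callable, List, Tuple

# One group as read from memory: (data bits, check bits), both as bytes.
Group = Tuple[bytes, bytes]


def xor_fold_check(data: bytes, check_len: int) -> bytes:
    """Placeholder check function (NOT Reed-Solomon): XOR-fold the data
    into check_len bytes. A real design would call an RS encoder here."""
    out = bytearray(check_len)
    for i, value in enumerate(data):
        out[i % check_len] ^= value
    return bytes(out)


def encode_and_decompose(
    groups: List[Group],
    check_fn: Callable[[bytes, int], bytes] = xor_fold_check,
) -> List[Group]:
    """Aggregate N groups, compute check bits over the aggregated data bits,
    then split the result back into N groups with the original layout."""
    # Aggregation: concatenate the data bits; the aggregated check field is
    # as long as all per-group check fields together.
    data = b"".join(d for d, _ in groups)
    check_len = sum(len(c) for _, c in groups)

    # Encoding over the long codeword.
    check = check_fn(data, check_len)

    # Decomposition: each group keeps its data bits and receives its slice
    # of the newly computed check bits.
    result: List[Group] = []
    offset = 0
    for d, c in groups:
        result.append((d, check[offset:offset + len(c)]))
        offset += len(c)
    return result
```

  • Each returned group has the same shape as the group that was read, so it can be written back through the same channel and time slice it came from.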
  • FIG. 6 shows a flowchart of an example method 600 for error detection and error correction of data according to some embodiments of the present application.
  • the example method 600 can be performed, for example, by a computing device 110 as shown in Figure 1. It should be understood that the method 600 can also include additional actions not shown, and some actions in the method 600 can be omitted, and the scope of the present application is not limited in this respect.
  • the method 600 is described in detail below in conjunction with the example environment 100 of Figure 1.
  • the processor 120 can read the corresponding amount of encoded data from the memory 130 into its cache line, such as the cache line of the controller 140.
  • the encoded data includes data bits to be used by the program and corresponding check bits.
  • the check bits can be generated by the processor 120 based on the method of the embodiment of the present application (for example, the method described with respect to Figures 2 to 5).
  • the processor 120 can determine the corresponding decoding mode. For example, when the computing device 110 is started, the processor 120 can configure the corresponding ECC encoding and decoding path according to the current ECC encoding configuration information. The processor 120 can then read the amount of encoded data suitable for the current decoding mode to perform decoding according to the decoding mode. For example, the processor 120 can read multiple groups of encoded data from (multiple) channels and/or (multiple) time slices in a manner similar to the above-mentioned acquisition of multiple groups of data to combine into codewords suitable for the current decoding mode.
  • the processor 120 may perform error detection on the encoded data read, and if necessary, attempt to generate error-corrected data bits for the data based on the read check bits. Specifically, at 615, the processor 120 may generate check bits for the data bits of the encoded data read based on the current decoding method. Then, at 620, the processor 120 may compare the check bits generated at 615 with the check bits of the encoded data read at 610 to determine whether the two sets of check bits match.
  • If the two sets of check bits match, the processor 120 can determine that the data bits of the encoded data read are the same as when they were written, that is, there are no errors in the data bits. In this case, the method 600 can proceed to 625, where the checked data is provided to the application process for further use. Then, the method 600 can return to 610, where the processor 120 can read the next batch of data of the decoding unit size and its check bits.
  • If the two sets of check bits do not match, the processor 120 may determine that there are errors in the data bits. In this case, the processor 120 needs to attempt error correction on the data bits. Therefore, the method 600 proceeds to 630, where the processor 120 may decode the read check bits according to the currently active ECC decoding path (e.g., the RS decoding circuit path). This decoding process is intended to determine the location of the erroneous bits in the corresponding data bits for correction.
  • The errors in the data may turn out to be uncorrectable, that is, the data may contain an uncorrectable error (UCE). If the processor 120 finds a UCE at 635, the method 600 proceeds to 640, where the processor 120 can trigger a UCE interrupt. If no UCE is found, the method 600 proceeds to 645, where the processor 120 can generate the result of error correction. That is, the processor 120 can correct the errors in the data bits based on the result of decoding the check bits to generate error-corrected data bits. For an erroneous binary data bit, this means flipping its value.
  • A specific ECC decoding method has a certain error detection and correction capability.
  • When the errors in the data bits exceed this capability, miscorrection may occur in addition to UCE. That is, ECC decoding of the check code with the current decoding method can produce a definite result, yet that result may itself contain errors.
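  • The regenerate, compare, decode, and correct sequence of method 600 can be illustrated with a small code. The sketch below is an assumption-laden stand-in: it uses an extended Hamming code (single-error correction, double-error detection) over four data bits instead of the RS decoding path, and the names `check_bits`, `read_and_check`, and `UncorrectableError` are hypothetical.

```python
from typing import List, Tuple


class UncorrectableError(Exception):
    """Stands in for the UCE interrupt triggered at 640."""


# Syndrome value -> index of the erroneous data bit (Hamming positions 3, 5, 6, 7).
SYNDROME_TO_DATA_INDEX = {3: 0, 5: 1, 6: 2, 7: 3}


def check_bits(data: List[int]) -> List[int]:
    """Compute three Hamming parity bits plus one overall parity bit."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    p0 = d1 ^ d2 ^ d3 ^ d4 ^ p1 ^ p2 ^ p3
    return [p1, p2, p3, p0]


def read_and_check(data: List[int], stored: List[int]) -> Tuple[List[int], bool]:
    """Return (data bits, was_corrected); raise UncorrectableError on a UCE."""
    # 615/620: regenerate check bits from the read data bits and compare.
    regenerated = check_bits(data)
    if regenerated == stored:
        return list(data), False  # 625: no error detected

    # 630: decode. The syndrome locates a single erroneous bit.
    syndrome = sum((regenerated[i] ^ stored[i]) << i for i in range(3))
    overall = 0
    for bit in list(data) + list(stored):
        overall ^= bit

    # 635/640: a mismatch with even overall parity means two bits are wrong.
    if overall == 0:
        raise UncorrectableError("double-bit error detected")

    # 645: flip the located data bit; a syndrome that points at a check bit
    # leaves the data bits unchanged.
    corrected = list(data)
    index = SYNDROME_TO_DATA_INDEX.get(syndrome)
    if index is not None:
        corrected[index] ^= 1
    return corrected, True
```

  • With three or more flipped bits this decoder can still return a confident but wrong result, which is the miscorrection case that motivates a reliability check on corrected data.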
  • method 600 may further detect the reliability of the corrected data.
  • processor 120 may determine whether to perform a reliability test on the corrected data. For example, processor 120 may turn this function on or off according to a pre-set configuration, thereby implementing two different error correction modes. If it is determined that the reliability test is not to be performed, method 600 proceeds to 655, where processor 120 may provide the corrected data to the application process for further use. At 655, processor 120 may also record a relevant error correction log. Then, method 600 may return to 610.
  • If it is determined that the reliability test is to be performed, the method 600 proceeds to 660, where the processor 120 can determine the difference between the error-corrected data bits generated at 645 and the read data bits. In some embodiments, the processor 120 can determine the number of bits that differ between the error-corrected data bits and the data bits of the read data. In some embodiments, the processor 120 can also determine the distribution of these differing bits in the data bits. Then, at 665, the processor 120 can determine whether the difference meets a threshold condition.
  • For example, the processor 120 may determine whether the number of bits that differ between the error-corrected data bits and the data bits of the read data exceeds a threshold.
  • The threshold is associated with the maximum number of bits that the ECC encoding and decoding method in use can correct in a group of data bits. For example, if an ECC method that can correct at most one bit of data generates a correction result that corrects two bits, then the result meets the threshold condition, that is, the correction result may include an error.
  • The threshold condition may also relate to the distribution of the error bits. For example, the ECC method may reliably detect errors in j consecutive bits but, when the error bits are discontinuous, reliably detect errors in only a smaller number k of bits. If the error-corrected data bits correct errors in i discontinuous bits, the processor 120 can still determine that the result meets the threshold condition.
  • If the difference does not meet the threshold condition, the method proceeds to 655. If the difference meets the threshold condition, the method proceeds to 670, where the processor 120 can generate a hint indicating that there may be an error in the corrected data bits. Then, at 675, the processor 120 can provide the corrected data with the hint (e.g., as an additional mark) to the application process for use. The processor 120 can also record the relevant error correction log for tracking and analysis. Then, the method 600 can return to 610.
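  • The reliability test at 660 and 665 amounts to counting the bits that the correction changed and comparing that count with a limit. The following sketch assumes byte-oriented data and hypothetical names (`differing_bits`, `may_be_miscorrected`); treating a contiguous run of changed bits as a burst with its own limit is one possible reading of the distribution-dependent condition, not the only one.

```python
from typing import List, Optional


def differing_bits(read: bytes, corrected: bytes) -> List[int]:
    """Return the ascending bit positions at which the two buffers differ."""
    positions: List[int] = []
    for byte_index, (a, b) in enumerate(zip(read, corrected)):
        diff = a ^ b
        for bit in range(8):
            if diff & (1 << bit):
                positions.append(byte_index * 8 + bit)
    return positions


def may_be_miscorrected(
    read: bytes,
    corrected: bytes,
    max_scattered_bits: int,
    max_burst_bits: Optional[int] = None,
) -> bool:
    """Return True when the correction changed more bits than the code can
    be trusted to fix, so a hint should accompany the corrected data."""
    positions = differing_bits(read, corrected)
    if not positions:
        return False
    # The changed bits form a burst when they occupy one contiguous run.
    span = positions[-1] - positions[0] + 1
    is_burst = max_burst_bits is not None and span == len(positions)
    limit = max_burst_bits if is_burst else max_scattered_bits
    return len(positions) > limit
```

  • For a code trusted to correct one bit, a result that changed two bits returns True, which corresponds to proceeding to 670 and attaching the hint.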
  • the processor 120 can trigger an interrupt.
  • Method 600 can be used in combination with the ECC encoding process described above according to FIGS. 2 to 5 , so as to achieve an ECC function with stronger error detection and correction performance than the conventional method under the same hardware conditions.
  • the above process of detecting and prompting the reliability of the error-corrected data can also be used separately with other ECC decoding processes to provide the related benefits described above.
  • the computing device 110 can read data and perform ECC functions on the data according to a preset encoding method.
  • these encoding methods can be updated due to changes in ECC-related requirements for the computing device 110.
  • the update of the encoding method can occur during system operation, and the update does not require the system to be restarted for it to take effect.
  • FIG. 7 shows a flowchart of an example method 700 for data verification according to some embodiments of the present application, wherein the ECC encoding configuration for data verification is updated during system operation.
  • the example method 700 may be executed, for example, by the computing device 110 shown in FIG. 1. It should be understood that the method 700 may also include additional actions not shown, and some actions in the method 700 may be omitted, and the scope of the present application is not limited in this respect.
  • The method 700 is described in detail below in conjunction with the example environment 100 of FIG. 1.
  • the method 700 begins with the system on the computing device 110 starting at 710.
  • the computing device 110 can configure the components that perform ECC encoding and decoding based on the preset encoding method to initialize the ECC error correction and error detection functions.
  • The computing device 110 can be configured based on a configuration file that stores relevant parameters of the encoding method. The values of these parameters can be default values, either unchanged or restored, or values previously entered by a user.
  • the configuration parameters may include the number of channels that should be used to read data in parallel and/or the number of time slices over which the data read should be aggregated.
  • the computing device 110 may configure its ECC encoding function to encode based on N groups of data read in parallel from M channels or M groups of memory banks aggregated over L time slices, as described above with respect to method 200.
  • the computing device 110 may include multiple ECC encoding circuits and corresponding decoding circuits, such as a circuit implementing a Hamming code algorithm and a circuit implementing a RS algorithm. According to the encoding algorithm specified in the configuration file, the computing device 110 may enable and configure the corresponding circuit and perform ECC-related functions.
  • the computing device 110 may also perform other ECC-related configurations.
  • the computing device 110 may include a function of enabling or disabling reliability detection of error-corrected data after it is generated, as described in detail above with respect to method 600.
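  • The configuration parameters listed above can be collected in one record. The sketch below is illustrative only: the class name `EccConfig` and its field names are hypothetical, and the validation simply mirrors the relation N = L*M with N of at least 2 stated for method 200.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EccConfig:
    """Hypothetical container for the ECC encoding configuration."""

    channels: int                    # M: channels (or bank groups) read in parallel
    time_slices: int                 # L: burst transfers aggregated per codeword
    algorithm: str                   # e.g. "RS" or "Hamming"
    reliability_check: bool = False  # enable the post-correction reliability test

    def __post_init__(self) -> None:
        if self.channels < 1 or self.time_slices < 1:
            raise ValueError("channels and time_slices must be positive")
        if self.groups_per_codeword < 2:
            raise ValueError("at least two groups must be aggregated")

    @property
    def groups_per_codeword(self) -> int:
        """N = L * M groups are aggregated into one long codeword."""
        return self.channels * self.time_slices
```

  • Two channels over two time slices and four channels over one time slice both aggregate four groups, so they yield codewords of the same length while the second spans fewer time slices.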
  • the processor 120 of the computing device 110 begins to run various system and application processes, and continuously communicates with the memory 130 to read and write data, etc. During this period, the configured ECC-related functions are utilized.
  • the processor 120 (for example, in its internal buffer) can perform ECC encoding on the data to be written to the memory 130, and write the data bits of the data and the calculated check bits to the memory 130 according to the corresponding read and write specifications. And, when reading the encoded data for use, the processor 120 can use the configured ECC-related functions to read in the data bits and corresponding check bits of the data for decoding, thereby performing ECC error correction and error detection on the data.
  • the computing device 110 can perform ECC encoding on the aggregated long codeword and correspondingly perform ECC decoding based on the configured ECC error correction and error detection functions according to the method actions described above with respect to Figures 2 to 6, which will not be repeated here.
  • the process can be performed by the memory controller 140 or another component of the processor.
  • the computing device 110 performs relevant error handling actions, such as triggering an interrupt.
  • the computing device 110 may receive an update indication for the ECC encoding method, for example, via a user interface.
  • the update indication may include update information for one or more parameters, such as the number of channels for reading the data to be aggregated in parallel, the number of time slices for reading the data to be aggregated, and the encoding algorithm.
  • the computing device 110 may configure the relevant components according to the update to switch to a new encoding method. For example, the computing device 110 may switch the encoding circuit to another encoding circuit, and correspondingly switch the decoding circuit to a decoding circuit paired with another encoding circuit.
  • the processor 120 of the computing device 110 may perform subsequent ECC encoding and decoding actions according to the new encoding method.
  • the computing device 110 can cause the processor 120 to update the check bits of the encoded data in the memory 130 with the updated encoding method.
  • the processor 120 can read the encoded data from the memory 130 into its buffer according to the updated encoding method to update the check bit of the data.
  • the processor 120 can use the number of channels and the number of time slices indicated in the updated configuration to read and aggregate the long codeword for ECC encoding, and use the updated encoding method to calculate the check bit of the long codeword. Then, the processor 120 can write the data bits and check bits of the long codeword back to the corresponding addresses, thereby keeping the data bits unchanged and overwriting the previous check bits. The processor 120 can continue to read and aggregate the next long codeword to update the check bit until the required update is completed.
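  • The check-bit refresh that follows a configuration update can be sketched as a loop over the stored groups. Everything here is a simplified assumption: memory is modeled as a list of (data bits, check bits) pairs, `reencode_all` and `sum_check` are hypothetical names, and the additive checksum is a placeholder for whichever encoding algorithm the updated configuration selects.

```python
from typing import Callable, List, Sequence, Tuple

Group = Tuple[bytes, bytes]  # (data bits, check bits)


def sum_check(data: bytes, check_len: int) -> bytes:
    """Placeholder check function (not a real ECC): truncated byte sum."""
    return (sum(data) % (256 ** check_len)).to_bytes(check_len, "big")


def reencode_all(
    groups: Sequence[Group],
    groups_per_codeword: int,
    check_fn: Callable[[bytes, int], bytes] = sum_check,
) -> List[Group]:
    """Recompute check bits under a new configuration, codeword by codeword.
    Data bits are kept unchanged; only the check bits are overwritten."""
    if len(groups) % groups_per_codeword:
        raise ValueError("group count must be a multiple of groups_per_codeword")
    updated: List[Group] = []
    for start in range(0, len(groups), groups_per_codeword):
        # Read and aggregate the next long codeword.
        batch = groups[start:start + groups_per_codeword]
        data = b"".join(d for d, _ in batch)
        check = check_fn(data, sum(len(c) for _, c in batch))
        # Write back: same data bits, new slice of check bits per group.
        offset = 0
        for d, c in batch:
            updated.append((d, check[offset:offset + len(c)]))
            offset += len(c)
    return updated
```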
  • Method 700 provides customizable ECC error detection and correction functions that are flexible and configurable in multiple dimensions such as space, time, and encoding algorithms. In this way, multiple ECC encoding and decoding methods that meet different needs of users can be provided under the same hardware environment.
  • the relevant configuration can be dynamically updated without restarting in response to changes in demand, thereby avoiding business interruptions caused by updating ECC-related function configurations.
  • the method can be used for operational configuration optimization of data centers, etc. Multiple ECC-related function configurations can be adopted and tested for a period of time to clarify the most important memory failure modes and determine the best configuration parameters.
  • the user can configure the ECC related functions according to the requirements of memory error detection and correction in its application scenario.
  • the user can directly input parameter values to configure the parameters according to the requirements.
  • the user can configure the encoding algorithm to be a Hamming code encoding algorithm for cost considerations.
  • The more powerful RS encoding algorithm can be applied to the aggregated long codeword to achieve error correction capability for multiple consecutive symbols.
  • the user can enter a configuration value to specify RS8 encoding based on the aggregation of two time slices of data read in parallel from two channels. However, if less time delay is to be achieved without changing the encoding algorithm, the user can modify the configuration to aggregate the data of one time slice read in parallel from four channels for RS8 encoding.
  • user input may include ECC performance requirements for memory 130.
  • the computing device 110 may automatically adapt the parameters of the encoding method that can meet the performance requirements and meet the hardware and other constraints according to these ECC performance requirements and the hardware constraints of the memory 130, and use these parameters to set or update the ECC related function configuration.
  • these performance requirements may include the expected delay caused by ECC error correction, and the expected reliability indicators of the corresponding data using the generated verification data, such as the expected missed detection rate or the false correction rate.
  • the hardware constraints of the memory 130 may include the redundancy ratio between the capacity used to store the original data and the capacity used to store the verification data in the memory 130.
  • The computing device 110 may provide multiple preset parameter combinations, together with rules that map requirements to these parameter combinations. In this way, the computing device 110 may select the corresponding parameter combination based on the mapping rules to configure ECC-related functions.
  • the computing device 110 may implement a configuration planning function. This function takes performance requirements and constraints as input, and outputs the optimal configuration parameters for extreme reliability that meet various constraints, such as the number of channels and time slices on which the aggregated data is based, and the ECC encoding algorithm to be used. The optimal configuration parameters can then be used to configure ECC related functions.
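  • A configuration planning function of this kind can be pictured as filtering preset parameter combinations by the constraints and then choosing the strongest remaining one. The sketch below is hypothetical throughout: `Candidate`, `plan`, the `strength` ranking, and the `check_ratio` field are invented for illustration, and a real planner would derive such figures from the properties of the code and the memory rather than from hand-written values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """One preset parameter combination (all values are illustrative)."""

    channels: int
    time_slices: int
    algorithm: str
    check_ratio: float  # check capacity needed per unit of data capacity
    strength: int       # made-up relative ranking of error handling capability


def plan(
    candidates: List[Candidate],
    max_time_slices: int,
    available_check_ratio: float,
) -> Optional[Candidate]:
    """Return the strongest candidate that fits the delay and redundancy
    constraints, or None when no candidate fits."""
    feasible = [
        c for c in candidates
        if c.time_slices <= max_time_slices
        and c.check_ratio <= available_check_ratio
    ]
    return max(feasible, key=lambda c: c.strength, default=None)
```

  • Here the delay requirement is expressed as a cap on aggregated time slices and the hardware constraint as the redundancy ratio the memory offers; both are simplifications of the requirements the text lists.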
  • Fig. 8 shows a schematic block diagram of a data verification device 800 according to some embodiments of the present application.
  • the device 800 may be implemented as or included in the computing device 110 of Fig. 1.
  • the device 800 may include multiple modules to perform corresponding actions in, for example, method 200, method 600, and method 700.
  • the device 800 includes the following modules: an acquisition module 810, configured to acquire N groups of data from a memory, each group of data in the N groups of data includes a check bit, wherein N is a positive integer greater than or equal to 2; an aggregation module 820, configured to aggregate the N groups of data to obtain aggregated data, the aggregated data includes an aggregated check bit, and the aggregated check bit is an aggregation of the check bits of the N groups of data; an encoding module 830, configured to perform error correction code ECC encoding on the aggregated data to obtain encoded data; a decomposition module 840, configured to decompose the encoded data into N groups of encoded data, each group of encoded data in the N groups of encoded data includes a check bit; and a writing module 850, configured to write the N groups of encoded data into the memory respectively.
  • the acquisition module 810 includes a first reading module, which is configured to read data from N channels or N memory bank groups of the memory, wherein the data of each channel or each memory bank is a group of data.
  • the encoding module 830 includes: a determination module configured to determine a method for obtaining N groups of data from a memory according to a preset encoding method; and an ECC encoding module configured to perform ECC encoding on the aggregated data according to a preset encoding method.
  • The ECC encoding module is configured to preset the encoding method according to user input, where the input includes ECC performance requirements for the memory.
  • the apparatus 800 further includes an updating module, wherein the updating module is configured to: in response to the encoding method being updated, obtain data from the memory according to the updated encoding method to update the check bit of the obtained data.
  • the encoding scheme includes Reed-Solomon encoding.
  • the device 800 also includes: a decoding method determination module, configured to determine a corresponding decoding method according to an encoding method; and an error correction module, configured to generate error-corrected data bits for the data based on the check bits of the read data according to the decoding method when reading data from the memory for use.
  • the device 800 also includes: a difference bit number determination module, which determines the number of different bits between the corrected data bits and the data bits of the read data; a prompt module, which is configured to generate a prompt when the different number of bits meets a threshold condition, and the prompt indicates that there may be errors in the corrected data bits.
  • FIG. 9 shows a schematic block diagram of an example device 900 that can be used to implement an embodiment of the present application.
  • The device 900 can be used to implement the functions of the computing device 110 shown in FIG. 1.
  • the device 900 includes a computing unit 901, which can perform various appropriate actions and processes according to computer program instructions stored in a random access memory (RAM) 903 and/or a read-only memory (ROM) 902 or computer program instructions loaded from a storage unit 908 into the RAM 903 and/or the ROM 902.
  • Various programs and data required for the operation of the device 900 can also be stored in the RAM 903 and/or the ROM 902.
  • The computing unit 901 and the RAM 903 can respectively implement the functions of the processor 120 and the memory 130 shown in FIG. 1.
  • the computing unit 901 and the RAM 903 and/or the ROM 902 are connected to each other via a bus 904.
  • An input/output (I/O) interface 905 is also connected to the bus 904.
  • a number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906, such as a keyboard, a mouse, etc.; an output unit 907, such as various types of displays, speakers, etc.; a storage unit 908, such as a disk, an optical disk, etc.; and a communication unit 909, such as a network card, a modem, a wireless communication transceiver, etc.
  • the communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
  • the computing unit 901 may be a variety of general and/or special processing components with processing and computing capabilities. It may implement the functions of the processor 120 in FIG. 1 . Some examples of the computing unit 901 include, but are not limited to, a CPU, a GPU, various dedicated AI computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc.
  • the computing unit 901 performs the various methods and processes described above, such as methods 200, 600, and 700.
  • methods 200, 600, and 700 may be implemented as a computer software program, which is tangibly contained in a machine-readable medium, such as a storage unit 908.
  • part or all of the computer program may be loaded and/or installed on the device 900 via RAM and/or ROM and/or a communication unit 909.
  • the computer program When the computer program is loaded into the RAM and/or ROM and executed by the computing unit 901, one or more steps of the methods 200, 600, or 700 described above may be performed.
  • The computing unit 901 may be configured to execute the method 200, 600, or 700 in any other appropriate manner (e.g., by means of firmware).
  • The above embodiments can be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • When implemented in software, they can be implemented in whole or in part in the form of a computer program product.
  • The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a server or terminal, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • The computer instructions can be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave, etc.) means.
  • The computer-readable storage medium can be any available medium that can be accessed by a server or terminal, or a data storage device, such as a server or data center, that integrates one or more available media.
  • The available medium can be a magnetic medium (such as a floppy disk, a hard disk, or a tape), an optical medium (such as a digital video disc (DVD)), or a semiconductor medium (such as a solid-state drive).


Abstract

The embodiments of the present disclosure relate to a method and apparatus for a data check, and a device, a medium and a product. The method comprises: acquiring N groups of data from a memory, wherein each group of data among the N groups of data comprises a check bit, N being a positive integer greater than or equal to 2. The method further comprises: performing aggregation on the N groups of data to obtain aggregated data, and performing error correction code (ECC) encoding on the aggregated data to obtain encoded data, wherein the aggregated data comprises an aggregated check bit, and the aggregated check bit is an aggregation of the check bits of the N groups of data. The method further comprises: decomposing the encoded data into N groups of encoded data, among which each group of data comprises a check bit, and respectively writing the N groups of encoded data into the memory. Therefore, a plurality of groups of data, each read from the memory as a unit, are aggregated to obtain a larger encoding unit, such that a memory ECC function with stronger error detection and error correction capabilities can be supported under the same hardware condition.

Description

Method, device, equipment, medium and product for data verification

This application claims priority to Chinese patent application No. 202310792806.8, filed with the Chinese Patent Office on June 29, 2023 and entitled "A method for data error correction", and to Chinese patent application No. 202311126669.0, filed with the Chinese Patent Office on August 31, 2023 and entitled "Method, device, equipment, medium and product for data verification", the entire contents of which are incorporated into this application by reference.

Technical Field

Embodiments of the present application relate to the field of memory technology, and more specifically to methods, devices, equipment, media, and products for verifying data in a memory.

Background Art

In compute-intensive application fields such as artificial intelligence (AI), the demand for high performance and for combined central processing unit (CPU) and graphics processing unit (GPU) computing power has driven the rapid development of high-performance computing (HPC) systems with computing power at or above the E-level (supporting 10^18 floating-point operations per second). Such systems place high demands on memory bandwidth, which in turn has promoted the widespread use of, for example, high-bandwidth memory (HBM) in HPC systems.

HBM is a high-performance dynamic random access memory based on 3D stacking technology. HBM technology achieves large-capacity, high-bit-width memory arrays by stacking multiple double data rate synchronous dynamic random access memory (DDR SDRAM, or DDR for short) dies in three-dimensional space. HBM supports multi-channel parallel reading of memory, which significantly improves the read speed.

发明内容Summary of the invention

本申请的实施例提供了一种数据校验方案。An embodiment of the present application provides a data verification solution.

在第一方面,提供了一种用于数据校验的方法。该方法包括:从存储器获取N组数据,该N组数据中的每组数据包括校验位,其中N为大于等于2的正整数;对N组数据进行聚合得到聚合数据,该聚合数据包括聚合校验位,聚合校验位为N组数据的校验位的聚合;对聚合数据进行纠错码(error correction code,ECC)编码,得到编码数据;将编码数据分解为N组编码数据,该N组编码数据中的每组数据包括校验位;以及将N组编码数据分别写入存储器。由此,能够将从存储器单位地读取的多组数据聚合,来获得更大的编码单元用于ECC编码。从而,在同样的冗余比等硬件条件下,该方法相比传统技术能够支持检错和纠错能力更强的内存ECC功能,满足更高的纠错和检错需求。In a first aspect, a method for data verification is provided. The method includes: obtaining N groups of data from a memory, each group of data in the N groups of data includes a check bit, wherein N is a positive integer greater than or equal to 2; aggregating the N groups of data to obtain aggregated data, the aggregated data includes an aggregated check bit, and the aggregated check bit is an aggregation of the check bits of the N groups of data; error correction code (ECC) encoding the aggregated data to obtain coded data; decomposing the coded data into N groups of coded data, each group of data in the N groups of coded data includes a check bit; and writing the N groups of coded data into the memory respectively. Thus, multiple groups of data read from the memory unit can be aggregated to obtain a larger coding unit for ECC coding. Therefore, under the same hardware conditions such as redundancy ratio, this method can support memory ECC functions with stronger error detection and correction capabilities compared to traditional technologies, and meet higher error correction and error detection requirements.

In some embodiments of the first aspect, obtaining the N groups of data from the memory includes: reading data from N channels or N bank groups of the memory, where the data of each channel or each bank group is one group of data. In this way, the data serving as the encoding unit can be aggregated along the spatial dimension to obtain a larger encoding unit.

In some embodiments of the first aspect, obtaining the N groups of data from the memory includes: reading data from M channels or M bank groups of the memory in L time slices, where N = L*M and a time slice is the duration of one burst transfer for reading data from the memory. In this way, the data to be encoded can be aggregated along both the temporal and spatial dimensions to obtain a larger encoding unit.

In some embodiments of the first aspect, performing ECC encoding on the aggregated data includes: determining, according to a preset encoding scheme, the manner in which the N groups of data are obtained from the memory; and performing ECC encoding on the aggregated data according to the preset encoding scheme. This provides a flexible, configurable, and customizable ECC error detection and correction function that can offer multiple ECC encoding and decoding schemes to meet different user needs in the same hardware environment.

In some embodiments of the first aspect, performing ECC encoding on the aggregated data according to the preset encoding scheme includes: presetting the encoding scheme according to a user input, where the input includes an ECC performance requirement for the memory. In this way, the encoding scheme that is optimal under the given hardware conditions can be determined automatically based on the ECC performance requirement.

In some embodiments of the first aspect, the method further includes: in response to the encoding scheme being updated, obtaining data from the memory according to the updated encoding scheme to update the check bits of the obtained data. In this way, the relevant configuration can be updated dynamically as requirements change, without a restart, which avoids service interruption caused by a configuration update.

In some embodiments of the first aspect, the encoding scheme includes Reed-Solomon (RS) encoding. RS encoding can detect and correct multi-bit errors, which significantly improves the error detection and correction capability of the generated ECC.

In some embodiments of the first aspect, the method further includes: determining a corresponding decoding scheme according to the encoding scheme; and, when data is read from the memory for use, generating error-corrected data bits for the data according to the decoding scheme and based on the check bits of the read data. This decoding can be used together with the ECC encoding process that aggregates multiple groups of data, so that, under the same hardware conditions, an ECC function with stronger error detection and correction capability and performance than conventional methods can be achieved.

In some embodiments of the first aspect, the method further includes: determining the number of bits that differ between the error-corrected data bits and the data bits of the read data; and, if that number satisfies a threshold condition, generating a hint indicating that the error-corrected data bits may contain errors. This yields a more cautious error correction result and leaves the decision of whether to use risky data to the upper-layer application, so that the risk of miscorrection is taken into account while the service keeps running as continuously as possible.
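The miscorrection hint described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names are hypothetical, data bits are modeled as Python integers, and the threshold condition is assumed to be "more changed bits than a configured limit", which the text does not specify.

```python
def count_changed_bits(read_bits: int, corrected_bits: int) -> int:
    """Count bit positions where the corrected data differs from the data as read."""
    return bin(read_bits ^ corrected_bits).count("1")


def miscorrection_hint(read_bits: int, corrected_bits: int, threshold: int) -> bool:
    """Return True when the decoder changed more bits than the assumed threshold.

    A True result only flags that the corrected data may be wrong; the caller
    (for example an upper-layer application) decides whether to use it.
    """
    return count_changed_bits(read_bits, corrected_bits) > threshold
```

A caller would compare the raw and corrected data after decoding and forward the flag together with the corrected data.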

In a second aspect, an apparatus for data check is provided. The apparatus includes: an obtaining module configured to obtain N groups of data from a memory, where each of the N groups of data includes check bits and N is an integer greater than or equal to 2; an aggregation module configured to aggregate the N groups of data to obtain aggregated data, where the aggregated data includes aggregated check bits, and the aggregated check bits are the aggregation of the check bits of the N groups of data; an encoding module configured to perform ECC encoding on the aggregated data to obtain encoded data; a decomposition module configured to decompose the encoded data into N groups of encoded data, where each of the N groups of encoded data includes check bits; and a writing module configured to write the N groups of encoded data to the memory respectively.

In some embodiments of the second aspect, the obtaining module includes a first reading module configured to read data from N channels or N bank groups of the memory, where the data of each channel or each bank group is one group of data.

In some embodiments of the second aspect, the obtaining module includes a second reading module configured to read data from M channels or M bank groups of the memory in L time slices, where N = L*M and a time slice is the duration of one burst transfer for reading data from the memory.

In some embodiments of the second aspect, the encoding module includes: a determination module configured to determine, according to a preset encoding scheme, the manner in which the N groups of data are obtained from the memory; and an ECC encoding module configured to perform ECC encoding on the aggregated data according to the preset encoding scheme.

In some embodiments of the second aspect, the ECC encoding module is configured to preset the encoding scheme according to a user input, where the input includes an ECC performance requirement for the memory.

In some embodiments of the second aspect, the apparatus further includes an update module configured to: in response to the encoding scheme being updated, obtain data from the memory according to the updated encoding scheme to update the check bits of the obtained data.

In some embodiments of the second aspect, the encoding scheme includes Reed-Solomon encoding.

In some embodiments of the second aspect, the apparatus further includes: a decoding scheme determination module configured to determine a corresponding decoding scheme according to the encoding scheme; and an error correction module configured to, when data is read from the memory for use, generate error-corrected data bits for the data according to the decoding scheme and based on the check bits of the read data.

In some embodiments of the second aspect, the apparatus further includes: a differing-bit-count determination module configured to determine the number of bits that differ between the error-corrected data bits and the data bits of the read data; and a hint module configured to generate a hint when that number satisfies a threshold condition, where the hint indicates that the error-corrected data bits may contain errors.

In a third aspect, an electronic device is provided. The electronic device includes a processor and a memory, where the memory stores computer instructions that, when executed by the processor, cause the electronic device to perform the actions of the method according to the first aspect or any embodiment thereof.

In a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores computer-executable instructions that, when executed by an electronic device, cause the electronic device to perform the operations of the method according to the first aspect or any embodiment thereof.

In a fifth aspect, a computer program product is provided. The computer program product is tangibly stored on a computer-readable medium and includes computer-executable instructions that, when executed, implement the operations of the method according to the first aspect or any embodiment thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other features, advantages, and aspects of the embodiments of the present application will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements, where:

FIG. 1 is a schematic diagram of an example environment in which various embodiments of the present application can be implemented;

FIG. 2 is a flowchart of an example method for data check according to some embodiments of the present application;

FIG. 3 is a schematic diagram of a non-limiting example of reading data for aggregation according to some embodiments of the present application;

FIG. 4 is a schematic diagram of a non-limiting example of aggregating multiple groups of data for ECC encoding according to some embodiments of the present application;

FIG. 5 is a schematic diagram of a non-limiting example of decomposing ECC-encoded aggregated data and writing it to a memory according to some embodiments of the present application;

FIG. 6 is a flowchart of an example method for performing error detection and correction on data according to some embodiments of the present application;

FIG. 7 is a flowchart of an example method for data check according to some embodiments of the present application;

FIG. 8 is a schematic block diagram of a message transmission apparatus according to some embodiments of the present application; and

FIG. 9 is a schematic block diagram of an example device that can be used to implement embodiments of the present application.

DETAILED DESCRIPTION

Embodiments of the present application are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present application are shown in the drawings, it should be understood that the present application can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present application. It should be understood that the drawings and embodiments of the present application are for illustrative purposes only and are not intended to limit the scope of protection of the present application.

In the description of the embodiments of the present application, the term "including" and similar terms should be understood as open-ended inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The terms "one embodiment" and "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", and the like may refer to different objects or to the same object. Other explicit and implicit definitions may also be included below.

HBM is based on 3D stacking technology and provides multi-channel parallel reads of memory. HBM is widely used in application fields that require HPC, because high-compute systems demand high memory bandwidth. The structure of HBM means that it has no separate redundant dies, and its redundancy ratio is lower than that of ordinary DDR. Because of the limited redundancy ratio and other reasons, the ECC algorithms that conventional HBM can support are relatively simple and have limited error detection and correction capability. For example, for 128 bits of data, an 8-bit Hamming-code ECC can detect and correct a single-bit error. Detecting two-bit errors as well requires a 9-bit Hamming-code ECC.

Some HBM vendors provide 16 redundant bits per 128 bits of data for ECC check bits. This redundancy ratio can support 8 basic Hamming-code check bits plus 8 parity bits, where each parity bit is computed over 16 of the 128 data bits. This approach can achieve single-die error correction, but the correction requires multiple trial computations, which leads to high latency. In addition, implementing this approach requires raising the redundancy ratio of the memory to 8:1, which increases the hardware cost. Moreover, Hamming-code-based ECC cannot correct multi-bit errors. If errors occur in more than two bits, missed detection or miscorrection can still occur; that is, the approach cannot provide the multi-bit error correction capability needed to cope with hard failures.

On the other hand, as memory technology advances, the amount of charge stored in a cell decreases, and the probability that a transient fault causes a multi-bit error increases. Factors that increase the probability of multi-bit errors may also be introduced during memory design and manufacturing. At the same time, the HPC field in which HBM is commonly used places high demands on data reliability and latency. The capability and performance of conventional simple encoding methods for HBM therefore may not meet application requirements. Furthermore, more and more dedicated chips package the CPU together with memory such as HBM in a system on chip (SoC). A high memory failure rate may directly lead to the return of the entire SoC, which makes after-sales cost and service risk high. How to improve the capability and performance of ECC error correction and detection for memories such as HBM, under constraints such as a limited hardware redundancy ratio and given latency requirements, is therefore a challenge.

To solve the above problems and other problems, embodiments of the present application provide a data check solution. The solution aggregates multiple groups of data, each read from the memory as one unit (for example, across the spatial and temporal dimensions), into a larger encoding unit for ECC encoding. Under the same hardware conditions, such as the same redundancy ratio, the solution can therefore support memory ECC functions with stronger error detection and correction capability than conventional techniques, which significantly improves the reliability of ECC error detection and correction for the memory, reduces uncorrectable errors, and meets more demanding error correction and detection requirements. It should be understood that although HBM may be used as an example in the description herein, the solution according to the embodiments of the present application is also applicable to various other memory types, such as DDR, mobile DDR, and storage class memory (SCM).

Reference is first made to FIG. 1, which shows a schematic diagram of an example environment 100 in which various embodiments of the present application can be implemented. The example environment 100 includes a computing device 110. The computing device 110 includes a processor 120 and a memory 130. The processor 120 may include a CPU, a GPU, various types of dedicated computing units (such as AI chips), various controllers, or a suitable combination of the foregoing. The processor 120 can perform various processing actions according to computer program instructions stored in or loaded into the memory 130.

The memory 130 may be a random access memory (RAM), such as a dynamic random access memory (DRAM), which supports fast random reads and writes by the processor 120 and thereby directly supports the computation of the processor 120. That is, given an arbitrary address in the memory 130, the processor 120 can quickly read data from or write data to that address. Therefore, while performing processing, the processor 120 can read the data required for execution (such as instructions and/or computation parameters) from the memory 130 and write data (such as computation outputs) to the memory 130.

The computing device 110 further includes one or more memory controllers 140 as part of the processor 120. The memory controller 140 is the component of the computing device 110 that controls and manages data transfer between computing units, such as the CPU or GPU, and the memory 130. The processor 120 reads data from and writes data to the memory 130 via the memory controller 140. Each memory controller 140 may be implemented as a separate chip or integrated into another, larger chip, for example as a controller built into a CPU or a north bridge.

In some embodiments of the present application, data may be transferred between the processor 120 and the memory 130 by burst transfer. A burst transfer has a certain burst length. Within the duration of one burst transfer (hereinafter also referred to as one data cycle), the memory controller 140 can, starting from a specified address, read or write a number of consecutive storage units equal to the burst length without supplying an address for each unit. As a non-limiting example, if a unit burst length allows 64 bits of data to be read from the memory 130 via a single channel, then during one data cycle with a burst length of 2, the processor 120 can read 128 bits of data from the memory 130 via a single channel.

In some embodiments, there are multiple parallel communication channels between the memory 130 and the memory controller 140, so that the processor 120 can read or write data via the multiple channels simultaneously. For example, the memory 130 may include multiple bank groups, each of which can be read and written independently, so that data can be read from or written to multiple bank groups at the same time. For example, the memory 130 may be an HBM based on 3D stacking technology, in which multiple bank groups are stacked together in three dimensions.

In some such embodiments, there may be multiple memory controllers 140, where each memory controller 140 controls data transfer for a subset of the bank groups and can read data from multiple bank groups in parallel via multiple channels. In some such embodiments, the memory 130 also supports virtual channels. Virtual channel technology can split each physical channel into multiple virtual channels, so that the number of data items that can be transferred in parallel between the processor 120 and the memory 130 exceeds the number of physical channels. The embodiments of the present application are not limited to any specific number of memory controllers, bank groups, or channels/virtual channels.

In the embodiments of the present application, the memory 130 supports an error-correcting code (ECC) check function. In this case, for a group of data bits to be written to the memory 130, the processor 120 can, for example, use the memory controller 140 to compute the corresponding ECC. When the group of data bits is written to the memory 130, the computed ECC is written to the memory 130 along with it as check bits. Later, when the group of data bits is read back for use by the processor 120, the corresponding check bits are read as well.

The memory controller 140 then recomputes the ECC for the read data bits and compares it with the ECC in the read check bits. If the two do not match, the memory controller 140 can determine that the read data bits contain an error. In this case, the memory controller 140 can perform further error correction decoding on the data bits to determine which bit or bits are incorrect. The erroneous bits are then corrected. For binary data bits, correction means flipping the erroneous bits. Correcting the data allows the processor 120 to perform subsequent tasks correctly, which ensures the normal operation of the computing device 110.
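The detect-then-flip behavior on the read path can be illustrated with a toy code. The sketch below uses a textbook Hamming(7,4) code purely as a stand-in; it is not the code used by the memory controller 140, and the function names are hypothetical. A nonzero syndrome plays the role of the mismatch between the recomputed ECC and the stored ECC, and its value gives the position of the single erroneous bit.

```python
def hamming74_encode(nibble: int) -> list[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword (list index 0 is position 1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused; positions 1..7
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    return code[1:]


def hamming74_correct(codeword: list[int]) -> tuple[list[int], int]:
    """Return (corrected codeword, syndrome); syndrome 0 means no error was detected."""
    code = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions holding a 1
    if syndrome:
        code[syndrome] ^= 1             # flip the single erroneous bit
    return code[1:], syndrome
```

As with any single-error-correcting code, two or more flipped bits would be miscorrected here, which is the limitation that motivates stronger codes over larger encoding units.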

To implement this function, the memory 130 may optionally include a data portion 150 dedicated to storing the actual data bits and a check portion 160 for storing the check bits. The processor 120 can write the check bits of the data bits to the check portion 160 according to the corresponding read/write specification. It should be understood that the data portion 150 and the check portion 160 are shown only as examples. In some embodiments, the memory 130 may have no separate die for storing check bits. The memory 130 may also store the data bits and the corresponding check bits in other suitable structures.

In addition, the processor 120 includes paired ECC encoding and decoding circuits (not shown), for example in the memory controller 140 or in another part of the processor. A pair of encoding and decoding circuits can implement the encoding and decoding functions of a specific ECC algorithm, such as a Hamming code algorithm or a Reed-Solomon encoding algorithm. The encoding circuit can be used to compute the ECC of the data bits. The decoding circuit can decode using the ECC to identify and correct erroneous bits in the corresponding data bits. In some embodiments, the memory controller 140 may include multiple pairs of encoding and decoding circuits and switch between them according to system settings, so as to perform the ECC function using different algorithms.

The data unit that ECC encoding operates on is called a codeword. Each codeword includes the actual data bits and redundant check bits that carry the ECC. Achieving a given error correction capability requires a sufficient number of check bits, and that number is usually limited by the hardware redundancy ratio achievable between the capacities of the data portion 150 and the check portion 160. Take an ECC function using a basic Hamming code at a hardware redundancy ratio of 16:1 as an example. For a codeword containing 128 data bits, at most 8 check bits can be attached, which supports detecting and correcting a single-bit error. Reliably detecting two-bit errors requires a higher redundancy ratio: detecting two-bit errors while correcting single-bit errors takes 9 ECC bits for 128 data bits. One option is to provide a higher hardware redundancy ratio, that is, to increase the capacity of the check portion 160 while keeping the capacity of the data portion 150 unchanged. This option increases the hardware cost. To avoid increasing the hardware cost, the size of the encoding unit needs to be increased instead. For example, the processor 120 can increase the size of the encoding unit according to the method of the embodiments of the present application, as described in detail below.
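The check-bit arithmetic above follows from the standard Hamming bound: a single-error-correcting code over k data bits needs the smallest r with 2^r >= k + r + 1, and one extra overall parity bit adds double-error detection. The following sketch reproduces the numbers in the text and shows why a larger encoding unit helps at a fixed redundancy ratio. The function names are illustrative only.

```python
def hamming_check_bits(data_bits: int, detect_double: bool = False) -> int:
    """Minimum check bits for single-error correction over data_bits data bits.

    With detect_double=True, one overall parity bit is added so that two-bit
    errors are also detected (SEC-DED).
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1 if detect_double else r


def check_bit_budget(data_bits: int, redundancy_ratio: int) -> int:
    """Check bits available for data_bits data bits at a data:check ratio of redundancy_ratio:1."""
    return data_bits // redundancy_ratio
```

At a 16:1 ratio, a 128-bit unit has a budget of 8 check bits, one short of the 9 needed for SEC-DED. Aggregating two such units gives 256 data bits with a budget of 16 check bits, while SEC-DED over 256 bits needs only 10, which leaves room for a stronger code without changing the hardware redundancy ratio.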

It should be understood that the architecture and functions of the example environment 100 are described for illustrative purposes only and do not imply any limitation on the scope of the present application. The example environment 100 may also contain other devices, systems, or components that are not shown. In addition, the embodiments of the present application may be applied to other environments with different and/or additional functions. For example, in some high-performance computing systems, the processor 120 and the memory 130 may be packaged in the same SoC. As another example, the read, write, and data check actions for the memory 130 may be performed by a base die. The embodiments of the present application are generally described below in the context of the computing device 110 performing actions using the processor 120.

Reference is now made to FIG. 2, which shows a flowchart of an example method 200 for data check according to some embodiments of the present application. The example method 200 can be performed, for example, by the computing device 110 shown in FIG. 1. It should be understood that the method 200 may also include additional actions that are not shown, and the scope of the present application is not limited in this respect. The method 200 is described in detail below with reference to the example environment 100 of FIG. 1.

At 210, N groups of data are obtained from a memory, where each of the N groups of data includes check bits and N is an integer greater than or equal to 2. For example, the computing device 110 can read N groups of data from the memory 130 into the processor 120 via the memory controller 140, where each of the N groups of data includes actual data bits and check bits. The check bits can be used to store the ECC check code of that group of data bits. At this point, the check bits may be empty or may need to be updated.

In some embodiments, the processor 120 can read the N groups of data in parallel from N bank groups or N channels of the memory 130. In some embodiments, the N channels may be virtual channels. The N groups of data read in parallel in this way can be aggregated for encoding as described below. Compared with performing ECC encoding on the single group of data obtained from each channel, this way of reading obtains more data for the encoding unit without increasing latency.

In some embodiments, the processor 120 can read the N groups of data from M channels or M bank groups of the memory in L time slices, where L and M are integers and N = L*M. Each of the L time slices is the duration of one burst transfer for reading data from the memory. As is known to those skilled in the art, the burst transfer has a certain burst length. Once the hardware capability and the burst length are fixed, the duration of one burst transfer can be used as the unit of time for configuration. An example of reading data from multiple channels and/or multiple time slices for ECC encoding is described in more detail below with reference to FIG. 3.
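The relationship N = L*M can be pictured as a grid of (time slice, channel) pairs, each of which yields one group of data. The sketch below enumerates such a read plan. The function name and the ordering (all channels of one time slice before the next time slice) are assumptions made for illustration; the text does not fix an ordering.

```python
from itertools import product


def read_plan(time_slices: int, channels: int) -> list[tuple[int, int]]:
    """List the (time_slice, channel) pair for each of the N = L * M groups."""
    return list(product(range(time_slices), range(channels)))
```

For example, L = 2 time slices over M = 4 channels gives N = 8 groups for one encoding unit, at the cost of waiting for two burst transfers instead of one.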

Continuing with reference to FIG. 2, at 220, the N groups of data are aggregated to obtain aggregated data. The aggregated data includes aggregated check bits, which are an aggregation of the check bits of the N groups of data. For example, the processor 120 may aggregate the N groups of data obtained at 210 to obtain aggregated data that includes aggregated check bits formed from the check bits of the N groups.

In some embodiments, the processor 120 may concatenate the data bits of the N groups of data with one another, and the check bits of the N groups of data with one another, to obtain the aggregated data. It should be understood that this aggregation may be a logical aggregation, as long as the processor 120 knows which bits in the aggregated data are data bits and which are check bits.

At 230, ECC encoding is performed on the aggregated data to obtain encoded data. For example, the processor 120 may perform ECC encoding on the aggregated data obtained at 220 to obtain the encoded data. In some embodiments, the ECC encoding may be Hamming code encoding or one of its variants. For example, the processor 120 may calculate a Hamming code for the aggregated data, calculate a parity bit for each of a plurality of segments of the aggregated data, and use both as part of the check bits of the encoded data. In some embodiments, the ECC encoding may be RS encoding, which has stronger error detection and correction capability. An example of aggregating N groups of data for ECC encoding is described in more detail below in conjunction with FIG. 4.

Continuing with reference to FIG. 2, at 240, the encoded data is decomposed into N groups of encoded data, each of which includes check bits. For example, the processor 120 may decompose the encoded data obtained at 230 into N groups of encoded data, each of which includes check bits. Similarly, this decomposition may be a logical decomposition, as long as the processor 120 can correctly process (for example, transmit) the N groups of encoded data group by group.

At 250, the N groups of encoded data are separately written to the memory. For example, the processor 120 may write the N groups of encoded data obtained by the decomposition at 240 to the memory 130 separately. The processor 120 may write data to the memory 130 in a manner corresponding to the manner in which the data was read from the memory 130 at 210. In some embodiments, the processor 120 may write the N groups of data in parallel to N memory bank groups or N channels (or virtual channels). In some embodiments, the processor 120 may write the N groups of data to M channels or M memory bank groups over L time slices, where L and M are integers and N=L*M. An example of decomposing encoded data is described in more detail below in conjunction with FIG. 5.

Using method 200, multiple single groups of data, each read from the memory as a unit, are aggregated across dimensions such as space and time, so that a larger encoding unit is obtained for ECC encoding. As a result, under the same hardware conditions, method 200 can support a memory ECC function with stronger error detection and correction capability than conventional methods (for example, RS encoding, which is more capable than a Hamming code), thereby significantly improving the reliability of memory ECC error detection and correction and reducing the occurrence of uncorrectable errors (UCE). Moreover, when the error detection and correction requirements on memory ECC increase, method 200 can meet the increased requirements by aggregating data, without changing hardware characteristics such as the hardware redundancy ratio.

Referring now to FIG. 3, a schematic diagram 300 of a non-limiting example of reading data for aggregation according to some embodiments of the present application is shown. The schematic diagram 300 shows a processor 310 that communicates with an HBM 320 supporting multiple channels. The processor 310 and the HBM 320 may be example implementations of the processor 120 and the memory 130 in FIG. 1, respectively, and the actions described with respect to FIG. 3 may be an example implementation of action 210 in the method 200 of FIG. 2.

In the schematic diagram 300, the processor 310 reads data in parallel via a plurality of controllers 330-1 to 330-I (individually or collectively referred to as controller 330). For purposes of illustration, take the controller 330-1 as an example; it controls data transmission on channels 340-1 and 340-2. In the data cycle of each burst transmission, the controller 330-1 may read one group of unit data from each of the channels 340-1 and 340-2 into its own buffer. The duration of a data cycle depends on the burst length of the burst transmission and the hardware conditions of the relevant components. Therefore, the group of unit data can be regarded as the data read in one unit time slice.

As shown, in a first time slice, the controller 330-1 may read data 311 from the channel 340-1 and data 312 from the channel 340-2. In a second time slice, the controller 330-1 may read data 313 from the channel 340-1 and data 314 from the channel 340-2. Each group of data among the data 311 to 314 includes actual data bits and additional check bits. It should be understood that, depending on the scenario, the check bits at this point may be null values reserved for later writing to the HBM 320.

Conventionally, the controller 330-1 would treat each piece of data as one codeword and calculate its check bits separately. For an example codeword of 128+8 bits, Hamming code encoding that detects and corrects single-bit errors can then be used. In the example of FIG. 3, however, the controller 330-1 may aggregate the data 311 to 314 into one longer codeword for ECC encoding. In this example, the longer codeword includes 32 check bits and can therefore support RS8 redundant coding with four 8-bit check symbols. RS encoding is widely used in DDR4 and DDR5. It is a symbol-based algorithm that can correct errors in one or more symbols, where each symbol may be, for example, 8 bits.

According to Reed-Solomon theory, when these 32 check bits are calculated using RS8 redundant coding based on finite-field arithmetic, errors in up to two arbitrary RS symbols can be corrected, that is, up to 16 erroneous bits, which covers failure modes involving more errors. In this way, the error detection and correction capability for the memory is significantly improved, and the occurrence of UCE is reduced.
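The arithmetic behind these figures can be written out as follows. This is a worked sketch that assumes four aggregated groups of 128 data bits and 8 check bits each and 8-bit RS symbols, as in the example above; the symbols n, k, and t are the usual codeword length, message length, and correctable-symbol count, and are not notation used by the application itself.

```latex
\begin{align*}
n - k &= \frac{4 \times 8\ \text{check bits}}{8\ \text{bits per symbol}} = 4\ \text{check symbols},\\
t &= \left\lfloor \frac{n - k}{2} \right\rfloor = 2\ \text{correctable symbols},\\
t \times 8 &= 16\ \text{bits at most, when the errors fall within two symbols}.
\end{align*}
```

Under the same assumptions, the codeword has 64 data symbols plus 4 check symbols, 68 symbols in total, which is within the 255-symbol limit of an RS code over GF(2^8).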

In this non-limiting example, the processor 310 aggregates data on the basis of the data read in parallel by each controller 330. It should be understood that this is for illustration only. In some embodiments, a computing device (for example, the computing device 110) may determine, according to a preset encoding mode, the manner in which the N groups of data for aggregation are obtained from the memory. For example, the number of channels to be used for reading data in parallel and/or the number of time slices whose data is to be aggregated may be preset.

In some embodiments, the computing device may set the configuration information of the encoding mode according to parameter values input by a user (for example, the number of channels and the number of time slices). In some such embodiments, the input includes ECC performance requirements for the memory, for example, a correctable rate, an expected error correction latency, and an expected missed-detection rate. The computing device may then adapt the configuration parameters of the encoding mode based on these performance requirements and the hardware conditions of the memory 130.

Referring now to FIG. 4, a schematic diagram 400 of a non-limiting example of aggregating multiple groups of data for ECC encoding according to some embodiments of the present application is shown. The schematic diagram 400 shows multiple groups of data 411 to 414 read in multiple time slices and/or from multiple channels. The actions described with respect to FIG. 4 may be example implementations of actions 220 and 230 in the method 200 of FIG. 2, and the multiple groups of data 411 to 414 may be examples of the data 311 to 314 read from the HBM 320 in the example of FIG. 3. For purposes of illustration, FIG. 4 is described below in the context of the processor 310 of FIG. 3 performing the actions.

As shown in FIG. 4, each group of data among the data 411 to 414 includes a set of data bits and a set of check bits, for example the data bits 425 and the check bits 435 of the data 411. In some embodiments, the check bits may be empty at this point. In some embodiments, the check bits at this point may be placeholder bits added by the processor according to the hardware redundancy ratio of the memory.

Instead of calculating the check bit values separately for each group of data, the processor 310 may aggregate the data 411 to 414 before performing the calculation. In the example of FIG. 4, to aggregate the data, the processor 310 concatenates the data bits of the data 411 to 414 to form an aggregate 445 of data bits, and concatenates the check bits of the data 411 to 414 to form an aggregate 455 of check bits. The aggregate 445 of data bits and the aggregate 455 of check bits are then aggregated (in this example, concatenated) to form the final aggregated data 465. The aggregated data 465 includes aggregated data bits 475 and aggregated check bits 485.
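The concatenation described above can be sketched as follows. The representation of each group as a `(data, check)` pair of byte strings, the function name `aggregate`, and the 128+8-bit group size (borrowed from the FIG. 3 example) are assumptions made for illustration only.

```python
def aggregate(groups):
    """Concatenate the data parts, then the check parts, of N groups.

    Each group is assumed to be a (data_bytes, check_bytes) pair. The two
    return values together describe the aggregated codeword: all data bits
    first, followed by all (still empty) check bits.
    """
    agg_data = b"".join(data for data, _ in groups)
    agg_check = b"".join(check for _, check in groups)
    return agg_data, agg_check


# Four groups of 128 data bits and 8 placeholder check bits each.
groups = [(bytes([i]) * 16, b"\x00") for i in range(4)]
agg_data, agg_check = aggregate(groups)
assert len(agg_data) * 8 == 512   # aggregated data bits
assert len(agg_check) * 8 == 32   # aggregated check bits
```

Keeping the two parts as separate values mirrors the point that the aggregation may be purely logical, as long as data bits and check bits remain distinguishable.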

The processor 310 may then perform ECC encoding on the aggregated data 465 to obtain encoded data. Specifically, the processor 310 may calculate the check bit values of the aggregated data 465 according to a preset ECC encoding mode and place them in the aggregated check bits 485. For example, the processor 310 may configure its ECC encoding circuit according to the parameters in the encoding settings in order to perform ECC encoding on the aggregated data. In some embodiments, the encoding mode may be RS encoding with specific parameters that specify various properties of the RS encoding algorithm and the codeword. In some embodiments, the processor 310 may include multiple ECC encoding circuits. According to the preset encoding mode, the processor 310 may switch to the corresponding encoding circuit and configure it according to the set parameters in order to perform ECC encoding on the aggregated data.

Referring now to FIG. 5, a schematic diagram 500 of a non-limiting example of decomposing ECC-encoded aggregated data and writing it to a memory according to some embodiments of the present application is shown. The actions described with respect to FIG. 5 may be example implementations of actions 240 and 250 in the method 200 of FIG. 2, and the aggregated data 565 may be an example of the aggregated data 465 in the example of FIG. 4. For ease of explanation, FIG. 5 is described below in the context of the processor 310 of FIG. 3 performing the actions.

After performing ECC encoding on the aggregated data 565, the processor 310 still needs to write the individual data bits and check bits of the encoded aggregated data 565 to the corresponding addresses. The processor 310 may therefore perform the inverse of the aggregation process described above and write the encoded data back to the HBM 320 (for example, via the controller 330) in a manner analogous to the way the N groups of data were read from the HBM.

As shown in FIG. 5, the aggregated data 565 may have been aggregated from multiple groups of data (for example, the data 411 to 414 in FIG. 4), and its data bits 575 include multiple portions from the multiple groups of data, for example portion 545. In addition, the number of check bits 585 of the aggregated data 565 equals the total number of check bits of the multiple groups of data. After the processor has ECC-encoded the aggregated data 565 as one long codeword, the check bits 585 contain the ECC value calculated for the data bits 575. The check bits 585 can be regarded as including multiple portions corresponding to the check bits of each of the multiple groups of data, for example portion 555.

On this basis, the processor 310 may decompose the data bits 575 back into multiple sets of data bits and decompose the check bits 585 into multiple sets of check bits. The processor 310 may then combine each set of data bits with the corresponding set of check bits to obtain multiple groups of data again, in this example the data 511 to 514. Each group among the data 511 to 514 has the same structure as the groups of data previously obtained by the processor 310 and includes a portion of the calculated check bits 585.
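The decomposition can be sketched as the inverse of a simple concatenation. The function name `split_codeword`, the byte-string representation, and the assumption that all groups have equal size are illustrative choices, not details stated in the application.

```python
def split_codeword(agg_data, agg_check, num_groups):
    """Split aggregated data and check bytes back into N (data, check) pairs.

    Assumes every group has the same size, so both parts must divide
    evenly by the number of groups.
    """
    if num_groups < 1:
        raise ValueError("num_groups must be positive")
    if len(agg_data) % num_groups or len(agg_check) % num_groups:
        raise ValueError("aggregated sizes must be multiples of num_groups")
    data_size = len(agg_data) // num_groups
    check_size = len(agg_check) // num_groups
    return [
        (
            agg_data[i * data_size:(i + 1) * data_size],
            agg_check[i * check_size:(i + 1) * check_size],
        )
        for i in range(num_groups)
    ]


# 64 data bytes and 4 computed check bytes split into four 16+1 byte groups.
parts = split_codeword(bytes(range(64)), b"\x0a\x0b\x0c\x0d", 4)
assert parts[0] == (bytes(range(16)), b"\x0a")
assert parts[3] == (bytes(range(48, 64)), b"\x0d")
```

Each resulting pair carries one slice of the check bits computed over the whole codeword, so a single group can no longer be verified in isolation; all N groups have to be read back together for decoding.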

In this way, the processor 310 can write the N groups of data back to the HBM 320 according to the corresponding read/write specification, in a manner analogous to the way the multiple groups of data were previously read. Taking FIG. 3 as an example, the controller 330-1 of the processor 310 may write the data 511 to 514 to the HBM 320 over two time slices, in parallel via the channel 340-1 and the channel 340-2.

It should be understood that the aggregated and decomposed data structures described with respect to FIG. 4 and FIG. 5 are only logical examples. Various implementation variants that conform to such a logical structure also fall within the scope of the present application, as long as the processor can correctly operate on and transmit the corresponding bits of the data. It should also be understood that the numbers of controllers, time slices, and channels shown in FIG. 3 to FIG. 5 are only examples; as described above, they may be determined based on input parameters and/or performance requirements for the encoding configuration, and the encoding configuration can be changed. An example in which the ECC encoding configuration is changed is described in more detail below in conjunction with FIG. 7.

After the data in the memory has been ECC-encoded, when the processor 120 subsequently needs to obtain the ECC-encoded data for use (for example, by an application process running on the processor 120), it can use the check bits of the encoded data to detect whether an error occurred in the actual data bits of the encoded data when they were written, and to correct the error. Referring now to FIG. 6, a flowchart of an example method 600 for error detection and correction of data according to some embodiments of the present application is shown. The example method 600 may be performed, for example, by the computing device 110 shown in FIG. 1. It should be understood that the method 600 may also include additional actions not shown, and some actions in the method 600 may be omitted; the scope of the present application is not limited in this respect. The method 600 is described in detail below in conjunction with the example environment 100 of FIG. 1.

At 610, for example according to an instruction of an application program, the processor 120 may read a corresponding amount of encoded data from the memory 130 into its cache line, for example a cache line of the controller 140. The encoded data includes the data bits to be used by the program and the corresponding check bits. When these data bits were previously written, the check bits may have been generated by the processor 120 based on the methods of the embodiments of the present application (for example, the methods described with respect to FIG. 2 to FIG. 5).

According to the currently configured ECC encoding mode, the processor 120 may determine the corresponding decoding mode. For example, when the computing device 110 starts up, the processor 120 may configure the corresponding ECC encoding and decoding paths according to the current ECC encoding configuration information. The processor 120 may then read, according to the decoding mode, the amount of encoded data suited to the current decoding mode for decoding. For example, the processor 120 may read multiple groups of encoded data from one or more channels and/or over one or more time slices, in a manner similar to the acquisition of multiple groups of data described above, and combine them into a codeword suitable for the current decoding mode.

For the encoded data that has been read, the processor 120 may perform error detection and, where necessary, attempt to generate error-corrected data bits for the data based on the read check bits. Specifically, at 615, the processor 120 may generate check bits for the data bits of the read encoded data based on the current decoding mode. Then, at 620, the processor 120 may compare the check bits generated at 615 with the check bits of the encoded data read at 610 to determine whether the two sets of check bits match.

If the two sets of check bits match, the processor 120 may determine that the data bits of the read encoded data are the same as when they were written, that is, that there are no errors in these data bits. In this case, the method 600 may proceed to 625, where the checked data is provided to the application process for further use. The method 600 may then return to 610, where the processor 120 may read the next batch of data of the decoding-unit size together with its check bits.

If the two sets of check bits do not match, the processor 120 may determine that there are errors in these data bits. In this case, the processor 120 needs to attempt error correction on these data bits to correct the erroneous bits. The method 600 therefore proceeds to 630, where the processor 120 may decode the read check bits according to the currently active ECC decoding path (for example, an RS decoding circuit path). This decoding process aims to determine the positions of the erroneous bits in the corresponding data bits so that they can be corrected.

Different ECC algorithms with different parameters have different error correction capabilities. In some cases, errors in the data cannot be corrected, that is, the data contains a UCE. If the processor 120 finds a UCE at 635, the method 600 proceeds to 640, where the processor 120 may trigger a UCE interrupt. If no UCE is found, the method 600 proceeds to 645, where the processor 120 may generate the error correction result. That is, the processor 120 may correct the errors in the checked data bits based on the result of decoding the check bits, so as to generate error-corrected data bits. For a single binary erroneous data bit, this may mean flipping its value.
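The control flow of steps 615 to 645 can be sketched as below. To keep the example small, a toy Hamming(7,4) code over four data bits stands in for the RS decoding path; that substitution, the exception `UncorrectableError` (standing in for the UCE interrupt), and all function names are assumptions for illustration and are not taken from the application.

```python
class UncorrectableError(Exception):
    """Stands in for the UCE interrupt triggered at step 640."""


def read_with_ecc(data, stored_check, compute_check, correct):
    """Steps 615 to 645: verify the read data and correct it if needed."""
    if compute_check(data) == stored_check:    # 615 and 620: recompute, compare
        return data                            # 625: deliver unchanged
    corrected = correct(data, stored_check)    # 630: decode to locate errors
    if corrected is None:                      # 635: decoder could not correct
        raise UncorrectableError("UCE")        # 640
    return corrected                           # 645


def hamming_check(d):
    """Toy Hamming(7,4) check bits over four data bits."""
    return (d[0] ^ d[1] ^ d[3], d[0] ^ d[2] ^ d[3], d[1] ^ d[2] ^ d[3])


# Syndrome produced by a single flipped data bit, mapped to that bit's index.
SYNDROME_TO_BIT = {(1, 1, 0): 0, (1, 0, 1): 1, (0, 1, 1): 2, (1, 1, 1): 3}


def hamming_correct(d, stored_check):
    """Correct a single-bit error; assumes at most one bit was flipped."""
    syndrome = tuple(a ^ b for a, b in zip(hamming_check(d), stored_check))
    fixed = list(d)
    if syndrome in SYNDROME_TO_BIT:  # otherwise a check bit was hit, data is intact
        fixed[SYNDROME_TO_BIT[syndrome]] ^= 1
    return tuple(fixed)


original = (1, 0, 1, 1)
check = hamming_check(original)
damaged = (1, 1, 1, 1)  # data bit 1 flipped after writing
assert read_with_ecc(damaged, check, hamming_check, hamming_correct) == original
```

The toy code never reports an uncorrectable pattern on its own, so the UCE branch is only reached when the supplied `correct` callable returns `None`. It also silently miscorrects when more than one bit is flipped, which is the risk the reliability check described next is meant to address.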

As described above, a particular ECC decoding mode has a certain error detection and correction capability. When the errors present in the data bits it checks exceed that capability, miscorrection may occur in addition to a UCE. That is, ECC decoding of the check code using the current decoding mode may produce a result, but that result may in fact contain errors.

In some cases, the method 600 may further check the reliability of the error-corrected data. At 650, the processor 120 may determine whether to perform a reliability check on the error-corrected data. For example, the processor 120 may turn this function on or off according to a preset configuration, thereby implementing two different error correction modes. If it is determined not to perform the reliability check, the method 600 proceeds to 655, where the processor 120 may provide the error-corrected data to the application process for further use. At 655, the processor 120 may also record a related error correction log. The method 600 may then return to 610.

If it is determined to perform the reliability check, the method 600 proceeds to 660, where the processor 120 may determine the difference between the error-corrected data bits generated at 645 and the data bits that were read. In some embodiments, the processor 120 may determine the number of bits that differ between the error-corrected data bits and the data bits of the read data. In some embodiments, the processor 120 may also determine the distribution of these differing bits within the data bits. Then, at 665, the processor 120 may determine whether the difference satisfies a threshold condition.

In some embodiments, the processor 120 may determine whether the number of bits that differ between the error-corrected data bits and the data bits of the read data exceeds a threshold. The threshold is associated with the maximum number of bits that the ECC encoding and decoding mode in use can correct within a set of data bits. For example, if an ECC mode that can correct at most one bit produces a correction result that corrects two bits, the result satisfies the threshold condition, meaning that the correction result may contain errors.

In some embodiments, the threshold condition also relates to the distribution of the erroneous bits. For example, an ECC method may be able to reliably detect errors in j consecutive bits but, when the erroneous bits are not consecutive, only reliably detect errors in a smaller number k of bits. In that case, if the error-corrected data bits correct errors in i non-consecutive bits, where i is greater than k, the processor 120 may still determine that the result satisfies the threshold condition.

If the difference does not satisfy the threshold condition, the method proceeds to 655. If the difference satisfies the threshold condition, the method proceeds to 670, where the processor 120 may generate a hint indicating that the error-corrected data bits may contain errors. Then, at 675, the processor 120 may provide the error-corrected data together with the hint (for example, as an additional flag) to the application process for use. The processor 120 may also record a related error correction log here for tracking and analysis. The method 600 may then return to 610.
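The bit-count form of the reliability check at 660 to 675 can be sketched as follows. The function names, the list-of-bits representation, and the boolean `suspect` flag (standing in for the hint attached to the data) are illustrative assumptions; the distribution-based condition is not modelled here.

```python
def count_changed_bits(read_bits, corrected_bits):
    """Step 660: number of positions where the corrected bits differ."""
    if len(read_bits) != len(corrected_bits):
        raise ValueError("bit sequences must have the same length")
    return sum(a != b for a, b in zip(read_bits, corrected_bits))


def check_reliability(read_bits, corrected_bits, max_correctable_bits):
    """Steps 665 to 675: return (corrected_bits, suspect).

    suspect is True when more bits were changed than the ECC mode is
    assumed to be able to correct, so the result may be a miscorrection.
    """
    changed = count_changed_bits(read_bits, corrected_bits)
    return corrected_bits, changed > max_correctable_bits


# A mode assumed to correct at most one bit "corrected" two bits: flag it.
_, suspect = check_reliability([0, 0, 0, 0], [1, 1, 0, 0], max_correctable_bits=1)
assert suspect is True
```

An application receiving the pair can then decide whether to use or discard the data when `suspect` is set.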

In this way, more cautious error correction results can be provided to upper-layer applications, and the decision whether to use risky data can be left to the upper-layer application, so that continuous operation of the service is maintained as far as possible while the risk of miscorrection is taken into account. For example, when the data is an image to be displayed by an application, errors affecting a few pixels may not affect the display, so the application may choose to use the data. As another example, when an application has high requirements for data correctness, it may discard error-corrected data that is flagged as possibly erroneous. In other embodiments of the present application, for example in a secure server where data correctness is critical, in order to ensure data reliability as far as possible, the processor 120 may raise an interrupt instead of generating the hint at 670 after determining that the error-corrected data bits may contain errors.

The method 600 may be used in combination with the ECC encoding process described above with respect to FIG. 2 to FIG. 5, so as to achieve, under the same hardware conditions, an ECC function with stronger error detection and correction performance than conventional methods. In addition, the above process of checking the reliability of error-corrected data and providing a hint may also be used on its own with other ECC decoding processes to provide the related benefits described above.

As described above, the computing device 110 may read data and perform the ECC function on that data according to a preset encoding mode. The encoding mode may be updated in response to, for example, changes in the ECC-related requirements on the computing device 110. In some embodiments, the encoding mode may be updated while the system is running, and the update takes effect without restarting the system.

The setting and updating of the encoding mode are now described with reference to FIG. 7. FIG. 7 shows a flowchart of an example method 700 for data check according to some embodiments of the present application, in which the ECC encoding configuration used for the data check is updated while the system is running. The example method 700 may be performed, for example, by the computing device 110 shown in FIG. 1. It should be understood that the method 700 may also include additional actions not shown, and some actions in the method 700 may be omitted; the scope of the present application is not limited in this respect. The method 700 is described in detail below in conjunction with the example environment 100 of FIG. 1.

The method 700 begins at 710 with the startup of the system on the computing device 110. Here, the computing device 110 may configure the components that perform ECC encoding and decoding based on a preset encoding mode, so as to initialize the ECC error correction and detection functions. For example, the computing device 110 may perform the configuration based on a configuration file that stores the relevant parameters of the encoding mode. The values of these parameters may be unchanged or restored default values, or may be values previously input by a user.

In some embodiments, the configuration parameters may include the number of channels to be used for reading data in parallel and/or the number of time slices over which the read data is to be aggregated. For example, the computing device 110 may configure its ECC encoding function to encode on the basis of N groups of data, aggregated over L time slices and read in parallel from M channels or M memory bank groups, as described above with respect to method 200.
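The parameters above can be pictured as a small configuration record. The following is a minimal sketch, not the format used by the application: the class name `EccConfig` and its field names are hypothetical, and the relation N = L*M is taken from the embodiment in which data is read from M channels over L time slices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EccConfig:
    """Hypothetical container for the encoding-method parameters."""
    channels: int            # M: channels (or memory bank groups) read in parallel
    time_slices: int         # L: time slices whose data is aggregated
    algorithm: str = "RS"    # name of the encoding algorithm, e.g. "RS" or "Hamming"

    def __post_init__(self) -> None:
        if self.channels < 1 or self.time_slices < 1:
            raise ValueError("channels and time_slices must be positive")
        if self.groups < 2:
            raise ValueError("at least two groups must be aggregated (N >= 2)")

    @property
    def groups(self) -> int:
        """N, the number of data groups aggregated into one long codeword."""
        return self.channels * self.time_slices


# Two channels over two time slices gives N = 4 groups per long codeword.
cfg = EccConfig(channels=2, time_slices=2)
print(cfg.groups)
```

Making the record immutable is a design choice of this sketch: an update would then be expressed as replacing the whole configuration rather than mutating one field at a time.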

In some embodiments, the computing device 110 may include, for example as part of the processor 120, multiple ECC encoding circuits and corresponding decoding circuits, such as a circuit implementing the Hamming code algorithm and a circuit implementing the RS algorithm. According to the encoding algorithm specified in the configuration file, the computing device 110 may enable and configure the corresponding circuits to perform the ECC-related functions.

It should be understood that, in addition to the above configurations that affect the encoding method, the computing device 110 may also perform other ECC-related configurations. For example, depending on requirements such as the missed-detection rate for ECC decoding, the computing device 110 may enable or disable the function of performing reliability detection on error-corrected data after it is generated, as detailed above with respect to method 600.

At 720, the processor 120 of the computing device 110 begins to run various system and application processes, and continuously communicates with the memory 130 to read and write data. During this period, using the configured ECC-related functions, the processor 120 may (for example, in its internal buffer) perform ECC encoding on data to be written to the memory 130, and write the data bits of the data together with the calculated check bits to the memory 130 according to the corresponding read and write specifications. Likewise, when reading encoded data for use, the processor 120 may use the configured ECC-related functions to read in the data bits and the corresponding check bits of the data for decoding, thereby performing ECC error correction and error detection on the data.

Based on the configured ECC error correction and error detection functions, the computing device 110 may perform ECC encoding on the aggregated long codeword, and correspondingly perform ECC decoding, according to the method actions described above with respect to FIGS. 2 to 6, which are not repeated here. Depending on the implementation, this process may be performed by the memory controller 140 or by another component of the processor. Although not shown in FIG. 7, if a UCE is found during decoding, the computing device 110 performs the relevant error handling actions, such as triggering an interrupt.

During system operation, at 730, the computing device 110 may receive an update indication for the ECC encoding method, for example via a user interface. The update indication may include update information for one or more parameters, such as the number of channels from which the data to be aggregated is read in parallel, the number of time slices over which the data to be aggregated is read, and the encoding algorithm. In response to receiving the update indication, at 740, the computing device 110 may configure the relevant components according to the update so as to switch to the new encoding method. For example, the computing device 110 may switch from one encoding circuit to another encoding circuit, and correspondingly switch the decoding circuit to the decoding circuit paired with that other encoding circuit. After switching to the new encoding method, the processor 120 of the computing device 110 may perform subsequent ECC encoding and decoding actions according to the new encoding method.

When the encoding method is updated during system operation, the memory 130 may, after the update, still contain data whose check bits were encoded in the pre-update manner. If the processor 120 reads such data for error detection and correction, it will use the updated decoding method to perform these actions, which will produce wrong results or no results at all. To address this problem, at 750, the computing device 110 may cause the processor 120 to update the check bits of the already encoded data in the memory 130 using the updated encoding method.
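The mismatch described above can be illustrated with a toy model. The sketch below is an assumption-laden illustration only: the two "codes" are trivial checksums standing in for real ECC circuits such as Hamming or RS, and the names `EccEngine`, `CODECS`, `parity_encode` and `xor_encode` are hypothetical.

```python
from typing import Callable, Dict, Tuple

Encoder = Callable[[bytes], bytes]
Verifier = Callable[[bytes, bytes], bool]


def parity_encode(data: bytes) -> bytes:
    """Toy code A: one byte holding the overall bit parity of the data."""
    return bytes([sum(bin(b).count("1") for b in data) % 2])


def xor_encode(data: bytes) -> bytes:
    """Toy code B: one byte holding the XOR of all data bytes."""
    acc = 0
    for b in data:
        acc ^= b
    return bytes([acc])


# Each entry pairs an encoder with its matching verifier (decoder stand-in).
CODECS: Dict[str, Tuple[Encoder, Verifier]] = {
    "parity": (parity_encode, lambda d, c: parity_encode(d) == c),
    "xor": (xor_encode, lambda d, c: xor_encode(d) == c),
}


class EccEngine:
    """Holds the currently active encoder/verifier pair."""

    def __init__(self, active: str) -> None:
        self.switch(active)

    def switch(self, name: str) -> None:
        # Encoder and verifier are always switched together, as a pair.
        self.encode, self.verify = CODECS[name]
        self.active = name


engine = EccEngine("parity")
data = b"\x03\x01"
stale_check = engine.encode(data)          # check bits written before the update
engine.switch("xor")                       # runtime update, no restart
print(engine.verify(data, stale_check))    # False: stale check bits do not match
```

The data itself is intact in this example; only the check bits are out of date, which is why recomputing them under the new method resolves the mismatch.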

To this end, the processor 120 may read the encoded data from the memory 130 into its buffer according to the updated encoding method, in order to update the check bits of that data. Following the ECC encoding process described above, the processor 120 may use the number of channels and the number of time slices indicated in the updated configuration to read and aggregate a long codeword for ECC encoding, and use the updated encoding method to calculate the check bits of that long codeword. The processor 120 may then write the data bits and check bits of the long codeword back to the corresponding addresses, thereby keeping the data bits unchanged while overwriting the previous check bits. The processor 120 may continue to read and aggregate the next long codeword and update its check bits until the required update is complete.
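The read, aggregate, re-encode, decompose and write-back loop can be sketched as follows. This is a simplified illustration under stated assumptions: memory is modelled as a list of (data, check) byte pairs, every group is assumed to carry at least one check byte, and `fold_check` is a toy XOR fold standing in for the real ECC encoder. The names `Group`, `fold_check` and `refresh_check_bits` are hypothetical.

```python
from typing import Callable, List, Tuple

Group = Tuple[bytes, bytes]  # (data bits, check bits) of one stored group


def fold_check(data: bytes, width: int) -> bytes:
    """Toy stand-in for an ECC encoder: XOR-fold data into `width` check bytes."""
    if width < 1:
        raise ValueError("width must be at least 1")
    out = bytearray(width)
    for i, b in enumerate(data):
        out[i % width] ^= b
    return bytes(out)


def refresh_check_bits(memory: List[Group], n: int,
                       encode: Callable[[bytes, int], bytes] = fold_check) -> None:
    """Recompute check bits in place, aggregating n groups per long codeword."""
    for start in range(0, len(memory), n):
        groups = memory[start:start + n]
        long_data = b"".join(d for d, _ in groups)    # aggregate the data bits
        width = sum(len(c) for _, c in groups)        # aggregate check capacity
        new_check = encode(long_data, width)
        offset = 0
        for i, (d, c) in enumerate(groups):           # decompose and write back
            # Data bits are kept; only the check bits are overwritten.
            memory[start + i] = (d, new_check[offset:offset + len(c)])
            offset += len(c)


mem = [(b"\x01\x02", b"\xff"), (b"\x04\x08", b"\xff"),
       (b"\x10\x20", b"\xff"), (b"\x40\x80", b"\xff")]
refresh_check_bits(mem, n=2)
print(mem[0], mem[1])
```

Each group keeps the same number of check bytes it had before, so the write-back fits the existing layout and does not move any data bits.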

The method 700 provides customizable ECC error detection and correction functions that are flexibly configurable in multiple dimensions, such as space, time, and encoding algorithm. In this way, multiple ECC encoding and decoding methods that meet different user needs can be provided in the same hardware environment. Moreover, the relevant configuration can be updated dynamically, without a restart, as requirements change, thereby avoiding the service interruption that would otherwise be caused by updating the ECC-related function configuration. The method can be used, for example, to optimize the operational configuration of data centers. Multiple ECC-related function configurations can each be adopted and run on a trial basis for a period of time, in order to identify the dominant memory failure modes and determine the best configuration parameters.

As described above, given the hardware constraints of the memory system (for example, the redundancy ratio between data bits and check bits that it can support), a user can configure the ECC-related functions according to the memory error detection and correction requirements of the user's application scenario. In some embodiments, the user can configure the parameters by directly entering parameter values as required.

For example, where the processor 120 includes multiple pairs of encoding and decoding circuits, for a scenario requiring 1-bit error correction the user may, for cost reasons, configure the encoding algorithm to be the Hamming code algorithm. For higher error detection requirements, the user may instead configure the encoding algorithm to be the more capable RS encoding algorithm and encode on the basis of aggregated long codewords, so as to achieve error correction capability for multiple consecutive symbols. As another non-limiting example, the user may enter configuration values specifying that RS8 encoding is to be performed on the basis of aggregating two time slices of data read in parallel from two channels. However, to achieve lower latency without changing the encoding algorithm, the user may modify the configuration so that RS8 encoding is performed on one time slice of data read in parallel from four channels.

In other embodiments, instead of directly entering parameter values, the user input may include ECC performance requirements for the memory 130. Based on these ECC performance requirements and the hardware constraints of the memory 130, the computing device 110 may automatically determine parameters of an encoding method that meets the performance requirements and complies with the hardware and other constraints, and use these parameters to set or update the ECC-related function configuration. In some embodiments, the performance requirements may include the expected latency caused by ECC error correction, and expected reliability indicators for checking the corresponding data using the generated check data, such as an expected missed-detection rate or miscorrection rate. In some embodiments, the hardware constraints of the memory 130 may include the redundancy ratio between the capacity in the memory 130 used to store original data and the capacity used to store check data.

In some embodiments, the computing device 110 may provide multiple preset parameter combinations, together with mapping rules from requirements to these parameter combinations. The computing device 110 may then use the mapping rules to select the corresponding parameter combination for configuring the ECC-related functions. In some embodiments, the computing device 110 may implement a configuration planning function. This function takes the performance requirements and constraints as input, and outputs the configuration parameters that maximize reliability while satisfying the constraints, such as the number of channels and the number of time slices on which data aggregation is based and the ECC encoding algorithm to be used. These optimal configuration parameters can then be used to configure the ECC-related functions.
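One way to picture such a configuration planning function is as a filter followed by a ranking over preset candidates. The sketch below is a hypothetical illustration only: the `Candidate` fields, the use of the time-slice count as a latency proxy, and every numeric value in the candidate list are assumptions made for this example and do not come from the application.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    channels: int             # M
    time_slices: int          # L, used here as a crude latency proxy
    algorithm: str
    check_ratio: float        # check capacity / data capacity needed (illustrative)
    reliability_score: int    # higher is better (illustrative ranking only)


def plan(candidates: List[Candidate], max_time_slices: int,
         max_check_ratio: float) -> Optional[Candidate]:
    """Return the most reliable candidate that meets the constraints, if any."""
    feasible = [c for c in candidates
                if c.time_slices <= max_time_slices
                and c.check_ratio <= max_check_ratio]
    if not feasible:
        return None
    # Prefer higher reliability; break ties in favour of fewer time slices.
    return max(feasible, key=lambda c: (c.reliability_score, -c.time_slices))


PRESETS = [
    Candidate(1, 1, "Hamming", 0.125, 1),
    Candidate(2, 2, "RS", 0.125, 3),
    Candidate(4, 1, "RS", 0.125, 3),
]
print(plan(PRESETS, max_time_slices=2, max_check_ratio=0.125))
```

Returning `None` when nothing is feasible leaves it to the caller to decide whether to relax a requirement or keep the current configuration.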

FIG. 8 shows a schematic block diagram of a data verification apparatus 800 according to some embodiments of the present application. The apparatus 800 may be implemented as, or included in, the computing device 110 of FIG. 1. The apparatus 800 may include multiple modules for performing corresponding actions in, for example, method 200, method 600, and method 700.

As shown in the figure, the apparatus 800 includes the following modules: an acquisition module 810, configured to acquire N groups of data from a memory, each of the N groups of data including check bits, where N is a positive integer greater than or equal to 2; an aggregation module 820, configured to aggregate the N groups of data to obtain aggregated data, the aggregated data including aggregated check bits, the aggregated check bits being an aggregation of the check bits of the N groups of data; an encoding module 830, configured to perform error correction code (ECC) encoding on the aggregated data to obtain encoded data; a decomposition module 840, configured to decompose the encoded data into N groups of encoded data, each of the N groups of encoded data including check bits; and a writing module 850, configured to write the N groups of encoded data into the memory respectively.

In some embodiments, the acquisition module 810 includes a first reading module configured to read data from N channels or N memory bank groups of the memory, where the data of each channel or each memory bank group constitutes one group of data.

In some embodiments, the acquisition module 810 includes a second reading module configured to read data from M channels or M memory bank groups of the memory in L time slices, where N=L*M, and where a time slice is the duration of one burst transmission for reading data from the memory.
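The relation N=L*M can be made concrete with a small sketch. The function name `read_groups`, the `bursts[t][ch]` layout, and the slice-major ordering of the resulting groups are assumptions of this example; the application does not prescribe an ordering.

```python
from typing import List, Sequence


def read_groups(bursts: Sequence[Sequence[bytes]],
                channels: int, time_slices: int) -> List[bytes]:
    """Collect one group per (time slice, channel) pair.

    bursts[t][ch] is the data of one burst read from channel ch in slice t.
    """
    return [bursts[t][ch]
            for t in range(time_slices)
            for ch in range(channels)]


# L = 2 time slices, M = 2 channels, so N = 4 groups are aggregated.
bursts = [[b"a0", b"b0"],
          [b"a1", b"b1"]]
groups = read_groups(bursts, channels=2, time_slices=2)
print(len(groups), groups)
```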

In some embodiments, the encoding module 830 includes: a determination module, configured to determine, according to a preset encoding method, the manner of acquiring the N groups of data from the memory; and an ECC encoding module, configured to perform ECC encoding on the aggregated data according to the preset encoding method.

In some embodiments, the ECC encoding module is configured to preset the encoding method according to a user input, the input including ECC performance requirements for the memory.

In some embodiments, the apparatus 800 further includes an update module configured to: in response to the encoding method being updated, acquire data from the memory according to the updated encoding method in order to update the check bits of the acquired data.

In some embodiments, the encoding method includes Reed-Solomon encoding.

In some embodiments, the apparatus 800 further includes: a decoding method determination module, configured to determine a corresponding decoding method according to the encoding method; and an error correction module, configured to, when data is read from the memory for use, generate error-corrected data bits for the data based on the check bits of the read data according to the decoding method.

In some embodiments, the apparatus 800 further includes: a differing-bit-count determination module, configured to determine the number of bits that differ between the error-corrected data bits and the data bits of the read data; and a prompt module, configured to generate a prompt when the number of differing bits satisfies a threshold condition, the prompt indicating that the error-corrected data bits may contain errors.
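The differing-bit check amounts to a Hamming distance followed by a threshold test. In the sketch below, the function names are hypothetical and the threshold condition is assumed to be "more than `max_bits` bits changed"; the application leaves the exact condition open.

```python
from typing import Optional


def differing_bits(read: bytes, corrected: bytes) -> int:
    """Count the bit positions in which the two byte strings differ."""
    if len(read) != len(corrected):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(read, corrected))


def check_correction(read: bytes, corrected: bytes,
                     max_bits: int) -> Optional[str]:
    """Return a warning if the decoder changed more bits than expected."""
    n = differing_bits(read, corrected)
    if n > max_bits:  # assumed threshold condition
        return f"corrected data may be wrong: {n} bits changed (limit {max_bits})"
    return None


# 0x00 versus 0x03 differ in two bit positions.
print(differing_bits(b"\x00", b"\x03"))
print(check_correction(b"\x00", b"\x03", max_bits=1))
```

A decoder that rewrites many more bits than its nominal correction capability is a plausible sign of miscorrection, which is why the prompt only flags the result as suspect instead of rejecting it.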

FIG. 9 shows a schematic block diagram of an example device 900 that can be used to implement embodiments of the present application. The device 900 can be used to implement the functions of the computing device 110 shown in FIG. 1. As shown in the figure, the device 900 includes a computing unit 901, which can perform various appropriate actions and processing according to computer program instructions stored in a random access memory (RAM) 903 and/or a read-only memory (ROM) 902, or computer program instructions loaded from a storage unit 908 into the RAM 903 and/or the ROM 902. The RAM 903 and/or the ROM 902 may also store various programs and data required for the operation of the device 900. As a non-limiting example, the computing unit 901 and the RAM 903 may respectively implement the functions of the processor 120 and the memory 130 shown in FIG. 1. The computing unit 901 and the RAM 903 and/or the ROM 902 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

Multiple components in the device 900 are connected to the I/O interface 905, including: an input unit 906, such as a keyboard or a mouse; an output unit 907, such as various types of displays and speakers; a storage unit 908, such as a magnetic disk or an optical disc; and a communication unit 909, such as a network card, a modem, or a wireless communication transceiver. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

The computing unit 901 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. It may implement the functions of the processor 120 in FIG. 1. Some examples of the computing unit 901 include, but are not limited to, a CPU, a GPU, various dedicated AI computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The computing unit 901 performs the methods and processing described above, such as methods 200, 600, and 700. For example, in some embodiments, methods 200, 600, and 700 may be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the RAM and/or the ROM and/or the communication unit 909. When the computer program is loaded into the RAM and/or the ROM and executed by the computing unit 901, one or more steps of the methods 200, 600, or 700 described above may be performed. Alternatively, in some other embodiments, the computing unit 901 may be configured to perform the methods 200, 600, or 700 in any other appropriate manner (for example, by means of firmware).

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a server or a terminal, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that can be accessed by a server or a terminal, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a digital video disk (DVD)), or a semiconductor medium (such as a solid-state drive).

In addition, although the operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In certain circumstances, multitasking and parallel processing may be advantageous. Similarly, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present application. Certain features described in the context of separate embodiments may also be implemented in combination in a single implementation. Conversely, various features described in the context of a single implementation may also be implemented in multiple implementations, individually or in any suitable sub-combination.

Although the subject matter has been described in language specific to structural features and/or methodological actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely example forms of implementing the claims.

Claims (13)

1. A method for data verification, comprising:
acquiring N groups of data from a memory, each of the N groups of data including check bits, wherein N is a positive integer greater than or equal to 2;
aggregating the N groups of data to obtain aggregated data, the aggregated data including aggregated check bits, the aggregated check bits being an aggregation of the check bits of the N groups of data;
performing error correction code (ECC) encoding on the aggregated data to obtain encoded data;
decomposing the encoded data into N groups of encoded data, each of the N groups of encoded data including check bits; and
writing the N groups of encoded data into the memory respectively.

2. The method according to claim 1, characterized in that acquiring N groups of data from the memory comprises:
reading data from N channels or N memory bank groups of the memory, wherein the data of each channel or each memory bank group constitutes one group of data.

3. The method according to claim 1, characterized in that acquiring N groups of data from the memory comprises:
reading data from M channels or M memory bank groups of the memory in L time slices, wherein N=L*M, and wherein the time slice is the duration of one burst transmission for reading data from the memory.

4. The method according to claim 1, characterized in that performing error correction code (ECC) encoding on the aggregated data comprises:
determining, according to a preset encoding method, the manner of acquiring the N groups of data from the memory; and
performing ECC encoding on the aggregated data according to the preset encoding method.

5. The method according to claim 4, characterized in that performing ECC encoding on the aggregated data according to the preset encoding method comprises:
presetting the encoding method according to a user input, the input including ECC performance requirements for the memory.

6. The method according to claim 4, characterized by further comprising:
in response to the encoding method being updated, acquiring data from the memory according to the updated encoding method to update the check bits of the acquired data.

7. The method according to claim 4, characterized in that the encoding method includes Reed-Solomon encoding.

8. The method according to claim 4, characterized by further comprising:
determining a corresponding decoding method according to the encoding method; and
when reading data from the memory for use, generating, according to the decoding method, error-corrected data bits for the data based on the check bits of the read data.

9. The method according to claim 8, characterized by further comprising:
determining the number of bits that differ between the error-corrected data bits and the data bits of the read data; and
if the number of bits satisfies a threshold condition, generating a prompt, the prompt indicating that the error-corrected data bits may contain errors.

10. An apparatus for data verification, characterized by comprising:
an acquisition module, configured to acquire N groups of data from a memory, each of the N groups of data including check bits, wherein N is a positive integer greater than or equal to 2;
an aggregation module, configured to aggregate the N groups of data to obtain aggregated data, the aggregated data including aggregated check bits, the aggregated check bits being an aggregation of the check bits of the N groups of data;
an encoding module, configured to perform error correction code (ECC) encoding on the aggregated data to obtain encoded data;
a decomposition module, configured to decompose the encoded data into N groups of encoded data, each of the N groups of encoded data including check bits; and
writing the N groups of data into the memory respectively, wherein N is a positive integer greater than or equal to 2.

11. An electronic device, characterized by comprising a processor and a memory, the memory storing computer instructions which, when executed by the processor, cause the electronic device to perform the method according to any one of claims 1 to 9.

12. A computer storage medium, characterized in that the computer-readable storage medium stores instructions which, when executed by an electronic device, cause the electronic device to perform the method according to any one of claims 1 to 9.

13. A computer program product, characterized in that the computer program product comprises instructions which, when executed by an electronic device, cause the electronic device to perform the method according to any one of claims 1 to 9.
PCT/CN2024/086034 2023-06-29 2024-04-03 Method and apparatus for data check, and device, medium and product Pending WO2025001399A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202310792806 2023-06-29
CN202310792806.8 2023-06-29
CN202311126669.0A CN119226028A (en) 2023-06-29 2023-08-31 Method, apparatus, device, medium and program product for data verification
CN202311126669.0 2023-08-31

Publications (1)

Publication Number Publication Date
WO2025001399A1 (en) 2025-01-02

Family

ID=93937084

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/086034 Pending WO2025001399A1 (en) 2023-06-29 2024-04-03 Method and apparatus for data check, and device, medium and product

Country Status (1)

Country Link
WO (1) WO2025001399A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5922080A (en) * 1996-05-29 1999-07-13 Compaq Computer Corporation, Inc. Method and apparatus for performing error detection and correction with memory devices
CN102339641A (en) * 2010-07-23 2012-02-01 北京兆易创新科技有限公司 Error checking and correcting verification module and data reading-writing method thereof
CN103811076A (en) * 2012-11-01 2014-05-21 三星电子株式会社 Memory module, memory system having the same, and methods of reading therefrom and writing thereto
US20150143201A1 (en) * 2013-11-19 2015-05-21 International Business Machines Corporation Error-correcting code distribution for memory systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24830049

Country of ref document: EP

Kind code of ref document: A1