WO2024130889A1 - Data compression method, system and device, and computer readable storage medium - Google Patents
Data compression method, system and device, and computer readable storage medium Download PDFInfo
- Publication number
- WO2024130889A1 WO2024130889A1 PCT/CN2023/086109 CN2023086109W WO2024130889A1 WO 2024130889 A1 WO2024130889 A1 WO 2024130889A1 CN 2023086109 W CN2023086109 W CN 2023086109W WO 2024130889 A1 WO2024130889 A1 WO 2024130889A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- target
- information entropy
- linear function
- value
- function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present application relates to the technical field of data compression, and in particular to a data compression method, system, device and computer-readable storage medium.
- Data compression can potentially reduce the storage space of data, increase the logical capacity of storage devices, and thus reduce the storage and transmission costs of data, so it has a high technical appeal.
- Data compression is a computationally intensive operation, and the actual data reduction effect it achieves also depends on the compressibility of the data to be compressed. If compression is performed on incompressible data, the size of the compressed data will not become smaller, which wastes the high computational overhead. Therefore, if the compressibility of data can be accurately predicted, thereby avoiding compressing incompressible data, the high compression computational overhead can be avoided.
- information entropy is a key indicator for measuring data compressibility and predicting data compression rate.
- the calculation of information entropy requires logarithmic operations, which has a certain computational complexity and complex hardware implementation logic, which makes the calculation efficiency of information entropy low, thus affecting the efficiency of data compression.
- this application provides a data compression method, which can solve the technical problem of how to improve data compression efficiency to a certain extent.
- This application also provides a data compression system, device and computer-readable storage medium. The specific scheme is as follows:
- a data compression method comprising:
- the target information entropy is less than a preset value. If the target information entropy is less than the preset value, the target data is compressed; if the target information entropy is greater than or equal to the preset value, the target data is not compressed.
- the determining of the target linear function of the function x*log 2 (x) under a preset approximate condition comprises:
- k(x) represents a function for determining the value of K that makes the value of x fall within the range of [2 K ,2 K+1 ).
- the information entropy calculation formula without logarithmic operation is determined based on the target linear function, including:
- the information entropy calculation formula includes:
- H represents the target information entropy
- N represents the total number of bytes
- M represents the total number of character types
- f(i) represents the occurrence probability of the i-th type of character, where the value range of i is [0,M].
- the determining of the target linear function of the function x*log 2 (x) under a preset approximate condition comprises:
- k(x) represents a function for determining the K value that makes the value of x between [2 K ,2 K+1 ); a represents an adjustment parameter.
- it also includes:
- the value of the adjustment parameter is determined based on the value of 1.5*2 K.
- the information entropy calculation formula without logarithmic operation is determined based on the target linear function, including:
- the information entropy calculation formula includes:
- H represents the target information entropy
- N represents the total number of bytes
- M represents the total number of character types
- f(i) represents the occurrence probability of the i-th type of character, where the value range of i is [0,M].
- the calculation process of k(x) includes:
- the process of calculating the target information entropy of the target data based on the total number of bytes and the occurrence probability by using the information entropy calculation formula includes:
- a data compression system comprising:
- a first acquisition module used for acquiring target data to be compressed
- a first statistical module used for counting the total number of bytes in the target data and the occurrence probability of each type of character
- a first determination module is used to determine a target linear function of the function x*log 2 (x) under a preset approximate condition
- a second determination module is used to determine an information entropy calculation formula without logarithmic operation based on the target linear function
- a first calculation module configured to calculate the target information entropy of the target data based on the total number of bytes and the occurrence probability by using the information entropy calculation formula
- the first judgment module is used to judge whether the target information entropy is less than a preset value. If the target information entropy is less than the preset value, the target data is compressed; if the target information entropy is greater than or equal to the preset value, the target data is not compressed.
- a data compression device comprising:
- a processor is used to implement the steps of any of the above data compression methods when executing the computer program.
- a computer-readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the steps of any of the above-mentioned data compression methods.
- the present application provides a data compression method, which obtains target data to be compressed; counts the total number of bytes in the target data and the probability of occurrence of each type of character; determines the target linear function of the function x*log 2 (x) under preset approximate conditions; determines the information entropy calculation formula without logarithmic operation based on the target linear function; calculates the target information entropy of the target data based on the total number of bytes and the probability of occurrence through the information entropy calculation formula; determines whether the target information entropy is less than a preset value, if the target information entropy is less than the preset value, the target data is compressed, if the target information entropy is greater than or equal to the preset value, the target data is not compressed.
- the target linear function of the function x*log 2 (x) under preset approximate conditions, that is, to transform the function x*log 2 (x) into a target linear function without logarithmic operation and with simpler logic, so that the information entropy calculation formula without logarithmic operation can be determined based on the target linear function, and then the target information entropy of the target data can be calculated without complex logarithmic operation, which improves the calculation efficiency of the target information entropy, and further improves the efficiency of data compression based on the target information entropy.
- the data compression system, device and computer-readable storage medium provided in this application also solve the corresponding technical problems.
- FIG1 is a flow chart of a data compression method provided in an embodiment of the present application.
- Figure 2 is a simulation graph of x*log 2 (x);
- FIG3 is a simulation comparison result diagram of x*log 2 (x) and x*k(x)+(2*x-2 k(x)+1 );
- FIG4 is a schematic diagram of the structure of a data compression system provided in an embodiment of the present application.
- FIG5 is a schematic diagram of the structure of a data compression device provided in an embodiment of the present application.
- FIG. 6 is another schematic diagram of the structure of a data compression device provided in an embodiment of the present application.
- FIG. 1 is a flow chart of a data compression method provided in an embodiment of the present application.
- Step S101 Acquire target data to be compressed.
- the target data to be compressed may be obtained first, and the type and size of the target data may be determined according to the specific application scenario, and this application does not make any specific limitation here.
- Step S102 Count the total number of bytes in the target data and the occurrence probability of each type of character.
- the total number of bytes in the target data and the probability of occurrence of each type of character can be counted.
- computers often use one byte (also known as 1Byte), that is, 8-bit characters, to represent them.
- 1Byte also known as 1Byte
- the probability of occurrence of each type of character is the probability of occurrence of each unique character, that is, each value in the range of 0 to 255.
- Step S103 Determine a target linear function of the function x*log 2 (x) under a preset approximate condition.
- Step S104 Determine an information entropy calculation formula that does not contain logarithmic operations based on the target linear function.
- the existing information entropy calculation process involves logarithmic calculations, which are time-consuming and will result in low information entropy calculation efficiency; on the other hand, the use of information entropy to measure data compressibility or predict data compression rate does not require an accurate information entropy value, and a high-precision information entropy approximation has little effect on the accuracy of compression rate prediction; therefore, the present application herein determines a target linear function of the function x*log 2 (x) required in the information entropy calculation process under preset approximate conditions, and determines an information entropy calculation formula that does not contain logarithmic calculations based on the target linear function, so as to eliminate the logarithmic calculation in the information entropy calculation process, and obtain an information entropy calculation formula that does not affect the compression rate prediction.
- Step S105 Calculate the target information entropy of the target data based on the total number of bytes and the probability of occurrence using the information entropy calculation formula.
- the target information entropy of the target data can be calculated based on the total number of bytes and the probability of occurrence through the information entropy calculation formula.
- Step S106 Determine whether the target information entropy is less than a preset value. If the target information entropy is less than the preset value, execute step S107: compress the target data; if the target information entropy is greater than or equal to the preset value, execute step S108: do not compress the target data.
- the target information entropy of the target data After calculating the target information entropy of the target data based on the total number of bytes and the probability of occurrence through the information entropy calculation formula, it is possible to determine whether the target information entropy is less than a preset value. If the target information entropy is less than the preset value, the target data is compressed. If the target information entropy is greater than or equal to the preset value, the target data is not compressed. It should be noted that the specific value of the preset value that determines whether to compress the target data can be determined according to the application scenario, and this application does not make specific limitations here.
- the present application provides a data compression method, which comprises the following steps: obtaining target data to be compressed; counting the total number of bytes in the target data and the probability of occurrence of each type of character; determining a target linear function of the function x*log 2 (x) under a preset approximate condition; determining an information entropy calculation formula without logarithmic calculation based on the target linear function; and calculating the target data based on the total number of bytes and the probability of occurrence through the information entropy calculation formula. ; determine whether the target information entropy is less than a preset value. If the target information entropy is less than the preset value, the target data is compressed.
- the target data is not compressed.
- an information entropy calculation formula that does not contain logarithmic operations can be determined based on the target linear function, and then the target information entropy of the target data can be calculated without performing complex logarithmic operations, thereby improving the calculation efficiency of the target information entropy, and further improving the efficiency of data compression based on the target information entropy.
- the logarithm is based on 2 and the unit is bits;
- Information entropy is the average amount of information contained in each message received, that is:
- information entropy refers to byte entropy.
- the basic unit of data storage is a byte (also known as 1Byte), which includes 8 data bits, that is, it consists of 8 bits.
- the value range of n is [0,255], with a total of 256 values.
- N the total number of bytes of the input message
- the probability of occurrence of each character i is f(i) (where i ⁇ [0,255])
- the information entropy calculation formula is:
- H represents the target information entropy, that is, the information entropy of the target data
- N represents the total number of bytes, that is, the total number of bytes of the target data
- M represents the total number of character types, which is the total number of unique characters in the target data, that is, the total number of characters with different values
- f(i) represents the probability of occurrence of the i-th type of character, that is, the probability of the i-th type of character appearing in the target data, where the value range of i is [0, M];
- N ⁇ log 2 (N) and f(i)log 2 (f(i)) can be uniformly expressed as x*log 2 (x), that is, it is only necessary to determine the target linear function of the function x*log 2 (x) under the preset approximate condition, without the need to perform the complex logarithmic operation in the information entropy calculation formula;
- the preset approximate condition can be understood as calculating the target linear function by dividing the interval to improve the accuracy.
- the interval [2 K ,2 K+1 ) to which the value of x belongs can be set, and then x*log 2 (x) is converted into an approximate target linear function based on the interval;
- the slope R (y 2 -y 1 )/(x 2 -x 1 ) is obtained.
- x*log 2 (x) ⁇ k(x)*2 k(x) +(x-2 k(x) )*(k(x)+2) x*k(x)+(2*x-2 k(x)+1 ).
- x*k(x) corresponds to the integer part of x*log 2 (x), that is, the operation of 5*2
- (2*x-2 k(x)+1 ) corresponds to the operation of the decimal part of x*log 2 (x), that is, the operation of 0.32*5.
- x*log 2 (x) is close to linear distribution
- x*k(x) is a linear distribution
- (2*x-2 k(x)+1 ) is a compensatory approximation of the true x*log 2 (x) based on linear distribution.
- the information entropy calculation formula can be:
- a fixed value can be preset for the adjustment parameter a according to the changes of x and log 2 (x), for example, different a values can be given in the intervals of [2,4], [4,8], [8,16], [16,32], etc.
- the comparison result is shown in FIG3 , wherein the upper line represents x*k(x)+(2*x-2 k(x)+1 ).
- the value of the adjustment parameter can also be determined based on the value of 1.5*2 K.
- the function k(x) is defined for finding the integer K for the value x.
- the value of x is between the interval [ 2K , 2K +1 ).
- the operation process of k(x) can be as follows: convert the value of x into a binary 32-bit integer; count the number of values with prefix 0 in the 32-bit integer; and take the difference between 31 and the number as the K value corresponding to k(x).
- k(x) is the logarithmic base of x, which is a very small number relative to x
- x*k(x) is a binary multiplication that can be converted into a simple shift-add operation, that is, the operation result of x*k(x) is determined based on the shift-add operation; the operation of taking the lower (k(x)-1) bits of x is obtained to obtain the operation result of (x-2 k(x) ); in this way, the calculation complexity of the entire information entropy is reduced, and it is convenient to realize the calculation of information entropy in hardware.
- the information entropy calculation method of the present application is about 15% faster than the standard information entropy calculation method, and the error is within 2%, so that
- FIG. 4 is a schematic diagram of the structure of a data compression system provided in an embodiment of the present application.
- a first acquisition module 101 is used to acquire target data to be compressed
- the first statistical module 102 is used to count the total number of bytes in the target data and the occurrence probability of each type of character
- a first determination module 103 is used to determine a target linear function of the function x*log 2 (x) under a preset approximate condition
- a second determination module 104 is used to determine an information entropy calculation formula without logarithmic calculation based on the target linear function
- the first calculation module 105 is used to calculate the target information entropy of the target data based on the total number of bytes and the probability of occurrence through an information entropy calculation formula
- the first judgment module 106 is used to judge whether the target information entropy is less than a preset value. If the information entropy is less than the preset value, the target data is compressed. If the target information entropy is greater than or equal to the preset value, the target data is not compressed.
- the first determination module may include:
- a first determining unit used to determine a target linear function of the function x*log 2 (x) under a preset approximate condition
- k(x) represents a function for determining the value of K that makes the value of x fall within the range of [2 K ,2 K+1 ).
- the second determination module may include:
- a second determining unit is used to determine an information entropy calculation formula without logarithmic calculation based on the target linear function
- the information entropy calculation formula includes:
- H represents the target information entropy, that is, the information entropy of the target data
- N represents the total number of bytes, that is, the total number of bytes of the target data
- M represents the total number of character types, which is the total number of unique characters in the target data, that is, the total number of characters with different values
- f(i) represents the probability of occurrence of the i-th type of character, that is, the probability of the i-th type of character appearing in the target data, where the value range of i is [0,M].
- the first determination module may include:
- a third determining unit is used to determine a target linear function of the function x*log 2 (x) under a preset approximate condition
- k(x) represents a function for determining the K value that makes the value of x between [2 K ,2 K+1 ); a represents an adjustment parameter.
- the third determination module is used to determine the value of the adjustment parameter based on the value of 1.5*2 K.
- the second determination module may include:
- the fourth determining unit is used to determine the information entropy operation without logarithmic operation based on the target linear function. Calculation formula;
- the information entropy calculation formula includes:
- H represents the target information entropy, that is, the information entropy of the target data
- N represents the total number of bytes, that is, the total number of bytes of the target data
- M represents the total number of character types, which is the total number of unique characters in the target data, that is, the total number of characters with different values
- f(i) represents the probability of occurrence of the i-th type of character, that is, the probability of the i-th type of character appearing in the target data, where the value range of i is [0,M].
- the calculation process of k(x) includes:
- the difference between 31 and the individual value is taken as the K value corresponding to k(x).
- a first calculation module calculates the target information entropy of target data based on the total number of bytes and the probability of occurrence through an information entropy calculation formula, and is used to: determine the operation result of x*k(x) based on a shift-add operation; and obtain the operation result of (x-2 k(x) ) by taking the lower (k(x)-1) bits of x.
- the present application also provides a data compression device and a computer-readable storage medium, both of which have the corresponding effects of a data compression method provided in an embodiment of the present application.
- Figure 5 is a schematic diagram of the structure of a data compression device provided in an embodiment of the present application.
- a data compression device provided in an embodiment of the present application includes a memory 201 and a processor 202.
- the memory 201 stores a computer program.
- the processor 202 executes the computer program, the steps of the data compression method described in any of the above embodiments are implemented.
- Another data compression device may also include: an input port 203 connected to the processor 202, used to transmit commands input from the outside to the processor 202; a display unit 204 connected to the processor 202, used to display the processing results of the processor 202 to the outside; a communication module 205 connected to the processor 202, used to realize the communication between the data compression device and the outside.
- the display unit 204 can be a display panel, a laser scanning display, etc.; the communication mode adopted by the communication module 205 includes but is not limited to mobile high-definition link technology (HML), universal serial bus, etc. (USB), High-Definition Multimedia Interface (HDMI), wireless connection: Wireless Fidelity technology (WiFi), Bluetooth communication technology, low-power Bluetooth communication technology, and communication technology based on IEEE802.11s.
- An embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored.
- the computer program is executed by a processor, the steps of the data compression method described in any of the above embodiments are implemented.
- the computer-readable storage medium involved in the present application includes random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.
- RAM random access memory
- ROM read-only memory
- electrically programmable ROM electrically erasable programmable ROM
- registers hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.
- each embodiment is described in a progressive manner, and each embodiment focuses on the differences from other embodiments.
- the same or similar parts between the embodiments can be referred to each other.
- the description is relatively simple, and the relevant parts can be referred to the method part.
- the steps of the method or algorithm described in conjunction with the embodiments disclosed herein may be implemented directly using hardware, a software module executed by a processor, or a combination of the two.
- the software module may be placed in a random access memory (RAM), a memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
Description
本申请要求于2022年12月22日提交中国专利局、申请号为202211658252.4、发明名称为“一种数据压缩方法、系统、设备及计算机可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims priority to the Chinese patent application filed with the China Patent Office on December 22, 2022, with application number 202211658252.4 and invention name “A data compression method, system, device and computer-readable storage medium”, the entire contents of which are incorporated by reference in this application.
本申请涉及数据压缩技术领域,特别涉及一种数据压缩方法、系统、设备及计算机可读存储介质。The present application relates to the technical field of data compression, and in particular to a data compression method, system, device and computer-readable storage medium.
数据压缩可以潜在地减小数据的存储空间,提高存储设备的逻辑容量,从而降低数据的存储和传输成本,因此具有很高的技术吸引力。数据压缩是一项计算密集的操作,其实际达到的数据缩减效果也取决于待压缩数据的可压缩性。如果对不可压缩的数据执行压缩操作,压缩后的数据大小并不会变小,这使得高额的计算开销白白浪费掉。因此,如果可以准确预测数据的可压缩性,从而避免对不可压缩的数据进行压缩,就可以避免浪费高昂的压缩计算开销。Data compression can potentially reduce the storage space of data, increase the logical capacity of storage devices, and thus reduce the storage and transmission costs of data, so it has a high technical appeal. Data compression is a computationally intensive operation, and the actual data reduction effect it achieves also depends on the compressibility of the data to be compressed. If compression is performed on incompressible data, the size of the compressed data will not become smaller, which wastes the high computational overhead. Therefore, if the compressibility of data can be accurately predicted, thereby avoiding compressing incompressible data, the high compression computational overhead can be avoided.
在数据压缩过程中,信息熵是度量数据可压缩性和预测数据压缩率的关键指标。信息熵的计算需要进行对数操作,具有一定的计算复杂度,并且硬件实现逻辑复杂,由此使得信息熵的运算效率较低,继而影响数据压缩的效率。In the process of data compression, information entropy is a key indicator for measuring data compressibility and predicting data compression rate. The calculation of information entropy requires logarithmic operations, which has a certain computational complexity and complex hardware implementation logic, which makes the calculation efficiency of information entropy low, thus affecting the efficiency of data compression.
综上所述,如何提高数据压缩的效率是目前本领域技术人员亟待解决的问题。In summary, how to improve the efficiency of data compression is a problem that needs to be solved urgently by those skilled in the art.
发明内容Summary of the invention
有鉴于此,本申请的目的在于提供一种数据压缩方法,其能在一定程度上解决如何提高数据压缩效率的技术问题。本申请还提供了一种数据压缩系统、设备及计算机可读存储介质。其具体方案如下: In view of this, the purpose of this application is to provide a data compression method, which can solve the technical problem of how to improve data compression efficiency to a certain extent. This application also provides a data compression system, device and computer-readable storage medium. The specific scheme is as follows:
一种数据压缩方法,包括:A data compression method, comprising:
获取待压缩的目标数据;Obtain target data to be compressed;
统计所述目标数据中的字节总数及每类字符的出现概率;Counting the total number of bytes in the target data and the occurrence probability of each type of character;
确定函数x*log2(x)在预设近似条件下的目标线性函数;Determine the target linear function of the function x*log 2 (x) under the preset approximate conditions;
基于所述目标线性函数确定不含对数运算的信息熵运算公式;Determine an information entropy calculation formula without logarithmic calculation based on the target linear function;
通过所述信息熵运算公式,基于所述字节总数及所述出现概率,计算所述目标数据的目标信息熵;Calculate the target information entropy of the target data based on the total number of bytes and the occurrence probability by using the information entropy calculation formula;
判断所述目标信息熵是否小于预设值,若所述目标信息熵小于所述预设值,则对所述目标数据进行压缩,若所述目标信息熵大于等于所述预设值,则不对所述目标数据进行压缩。It is determined whether the target information entropy is less than a preset value. If the target information entropy is less than the preset value, the target data is compressed; if the target information entropy is greater than or equal to the preset value, the target data is not compressed.
优选的,所述确定函数x*log2(x)在预设近似条件下的目标线性函数,包括:Preferably, the determining of the target linear function of the function x*log 2 (x) under a preset approximate condition comprises:
确定函数x*log2(x)在所述预设近似条件下的所述目标线性函数;Determine the target linear function of function x*log 2 (x) under the preset approximate condition;
所述目标线性函数包括:
y=x*k(x)+2(x-2k(x));The target linear function includes:
y=x*k(x)+2(x-2 k(x) );
其中,k(x)表示确定使得x的值介于[2K,2K+1)的K值的函数。Here, k(x) represents a function for determining the value of K that makes the value of x fall within the range of [2 K ,2 K+1 ).
优选的,所述基于所述目标线性函数确定不含对数运算的信息熵运算公式,包括:Preferably, the information entropy calculation formula without logarithmic operation is determined based on the target linear function, including:
基于所述目标线性函数确定不含对数运算的所述信息熵运算公式;Determine the information entropy calculation formula without logarithmic operation based on the target linear function;
所述信息熵运算公式包括:
The information entropy calculation formula includes:
其中,H表示所述目标信息熵;N表示所述字节总数;M表示所述字符的总类型数;f(i)表示第i类字符的所述出现概率,其中i的取值范围是[0,M]。Among them, H represents the target information entropy; N represents the total number of bytes; M represents the total number of character types; f(i) represents the occurrence probability of the i-th type of character, where the value range of i is [0,M].
优选的,所述确定函数x*log2(x)在预设近似条件下的目标线性函数,包括:Preferably, the determining of the target linear function of the function x*log 2 (x) under a preset approximate condition comprises:
确定函数x*log2(x)在所述预设近似条件下的所述目标线性函数;Determine the target linear function of function x*log 2 (x) under the preset approximate condition;
所述目标线性函数包括:
y=x*k(x)+((2-a)x-2k(x)+1);The target linear function includes:
y=x*k(x)+((2-a)x-2 k(x)+1 );
其中,k(x)表示确定使得x的值介于[2K,2K+1)的K值的函数;a表示调节参数。Wherein, k(x) represents a function for determining the K value that makes the value of x between [2 K ,2 K+1 ); a represents an adjustment parameter.
优选的,还包括:Preferably, it also includes:
基于1.5*2K的值确定所述调节参数的值。The value of the adjustment parameter is determined based on the value of 1.5*2 K.
优选的,所述基于所述目标线性函数确定不含对数运算的信息熵运算公式,包括:Preferably, the information entropy calculation formula without logarithmic operation is determined based on the target linear function, including:
基于所述目标线性函数确定不含对数运算的所述信息熵运算公式;Determine the information entropy calculation formula without logarithmic operation based on the target linear function;
所述信息熵运算公式包括:
The information entropy calculation formula includes:
其中,H表示所述目标信息熵;N表示所述字节总数;M表示所述字符的总类型数;f(i)表示第i类字符的所述出现概率,其中i的取值范围是[0,M]。Among them, H represents the target information entropy; N represents the total number of bytes; M represents the total number of character types; f(i) represents the occurrence probability of the i-th type of character, where the value range of i is [0,M].
优选的,k(x)的运算过程包括:Preferably, the calculation process of k(x) includes:
将x的值转换为二进制的32位整数;Convert the value of x to a 32-bit binary integer;
统计所述32位整数中前缀0的个数值;Count the number of prefix 0s in the 32-bit integer;
将31与所述个数值的差值作为k(x)所对应的K值。The difference between 31 and the numerical value is taken as the K value corresponding to k(x).
优选的,所述通过所述信息熵运算公式,基于所述字节总数及所述出现概率,计算所述目标数据的目标信息熵的过程中,包括:Preferably, the process of calculating the target information entropy of the target data based on the total number of bytes and the occurrence probability by using the information entropy calculation formula includes:
基于移位相加操作确定x*k(x)的运算结果;Determine the operation result of x*k(x) based on the shift-add operation;
取x低(k(x)-1)位的操作,得到(x-2k(x))的运算结果。Take the lower (k(x)-1) bits of x and get the result of (x-2 k(x) ).
一种数据压缩系统,包括:A data compression system, comprising:
第一获取模块,用于获取待压缩的目标数据;A first acquisition module, used for acquiring target data to be compressed;
第一统计模块,用于统计所述目标数据中的字节总数及每类字符的出现概率;A first statistical module, used for counting the total number of bytes in the target data and the occurrence probability of each type of character;
第一确定模块,用于确定函数x*log2(x)在预设近似条件下的目标线性函数; A first determination module is used to determine a target linear function of the function x*log 2 (x) under a preset approximate condition;
第二确定模块,用于基于所述目标线性函数确定不含对数运算的信息熵运算公式;A second determination module is used to determine an information entropy calculation formula without logarithmic operation based on the target linear function;
第一计算模块,用于通过所述信息熵运算公式,基于所述字节总数及所述出现概率,计算所述目标数据的目标信息熵;A first calculation module, configured to calculate the target information entropy of the target data based on the total number of bytes and the occurrence probability by using the information entropy calculation formula;
第一判断模块,用于判断所述目标信息熵是否小于预设值,若所述目标信息熵小于所述预设值,则对所述目标数据进行压缩,若所述目标信息熵大于等于所述预设值,则不对所述目标数据进行压缩。The first judgment module is used to judge whether the target information entropy is less than a preset value. If the target information entropy is less than the preset value, the target data is compressed; if the target information entropy is greater than or equal to the preset value, the target data is not compressed.
一种数据压缩设备,包括:A data compression device, comprising:
存储器,用于存储计算机程序;Memory for storing computer programs;
处理器,用于执行所述计算机程序时实现如上任一所述数据压缩方法的步骤。A processor is used to implement the steps of any of the above data compression methods when executing the computer program.
一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,所述计算机程序被处理器执行时实现如上任一所述数据压缩方法的步骤。A computer-readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the steps of any of the above-mentioned data compression methods.
本申请提供的一种数据压缩方法,获取待压缩的目标数据;统计目标数据中的字节总数及每类字符的出现概率;确定函数x*log2(x)在预设近似条件下的目标线性函数;基于目标线性函数确定不含对数运算的信息熵运算公式;通过信息熵运算公式,基于字节总数及出现概率,计算目标数据的目标信息熵;判断目标信息熵是否小于预设值,若目标信息熵小于预设值,则对目标数据进行压缩,若目标信息熵大于等于预设值,则不对目标数据进行压缩。本申请中,在统计目标数据中的字节总数及每类字符的出现概率之后,需确定函数x*log2(x)在预设近似条件下的目标线性函数,也即将函数x*log2(x)转变为不含对数运算且逻辑更简单的目标线性函数,这样,便可以基于目标线性函数确定出不含对数运算的信息熵运算公式,继而可以在不进行复杂对数运算的情况下,计算出目标数据的目标信息熵,提高了目标信息熵的计算效率,进而可以提高基于目标信息熵对数据进行压缩的效率。本申请提供的一种数据压缩系统、设备及计算机可读存储介质也解决了相应技术问题。 The present application provides a data compression method, which obtains target data to be compressed; counts the total number of bytes in the target data and the probability of occurrence of each type of character; determines the target linear function of the function x*log 2 (x) under preset approximate conditions; determines the information entropy calculation formula without logarithmic operation based on the target linear function; calculates the target information entropy of the target data based on the total number of bytes and the probability of occurrence through the information entropy calculation formula; determines whether the target information entropy is less than a preset value, if the target information entropy is less than the preset value, the target data is compressed, if the target information entropy is greater than or equal to the preset value, the target data is not compressed. In the present application, after counting the total number of bytes in the target data and the probability of occurrence of each type of character, it is necessary to determine the target linear function of the function x*log 2 (x) under preset approximate conditions, that is, to transform the function x*log 2 (x) into a target linear function without logarithmic operation and with simpler logic, so that the information entropy calculation formula without logarithmic operation can be determined based on the target linear function, and then the target information entropy of the target data can be calculated without complex logarithmic operation, which improves the calculation efficiency of the target information entropy, and further improves the efficiency of data compression based on the target information entropy. The data compression system, device and computer-readable storage medium provided in this application also solve the corresponding technical problems.
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for use in the embodiments or the description of the prior art will be briefly introduced below. Obviously, the drawings described below are merely embodiments of the present application. For ordinary technicians in this field, other drawings can be obtained based on the provided drawings without paying any creative work.
图1为本申请实施例提供的一种数据压缩方法的流程图;FIG1 is a flow chart of a data compression method provided in an embodiment of the present application;
图2为x*log2(x)的仿真图形;Figure 2 is a simulation graph of x*log 2 (x);
图3为x*log2(x)及x*k(x)+(2*x-2k(x)+1)的仿真对比结果图;FIG3 is a simulation comparison result diagram of x*log 2 (x) and x*k(x)+(2*x-2 k(x)+1 );
图4为本申请实施例提供的一种数据压缩系统的结构示意图;FIG4 is a schematic diagram of the structure of a data compression system provided in an embodiment of the present application;
图5为本申请实施例提供的一种数据压缩设备的结构示意图;FIG5 is a schematic diagram of the structure of a data compression device provided in an embodiment of the present application;
图6为本申请实施例提供的一种数据压缩设备的另一结构示意图。FIG. 6 is another schematic diagram of the structure of a data compression device provided in an embodiment of the present application.
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。The following will be combined with the drawings in the embodiments of the present application to clearly and completely describe the technical solutions in the embodiments of the present application. Obviously, the described embodiments are only part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by ordinary technicians in this field without creative work are within the scope of protection of this application.
请参阅图1,图1为本申请实施例提供的一种数据压缩方法的流程图。Please refer to FIG. 1 , which is a flow chart of a data compression method provided in an embodiment of the present application.
本申请实施例提供的一种数据压缩方法,可以包括以下步骤:A data compression method provided in an embodiment of the present application may include the following steps:
步骤S101:获取待压缩的目标数据。Step S101: Acquire target data to be compressed.
实际应用中,可以先获取待压缩的目标数据,目标数据的类型及规模等可以根据具体应用场景来确定,本申请在此不做具体限定。In practical applications, the target data to be compressed may be obtained first, and the type and size of the target data may be determined according to the specific application scenario, and this application does not make any specific limitation here.
步骤S102:统计目标数据中的字节总数及每类字符的出现概率。Step S102: Count the total number of bytes in the target data and the occurrence probability of each type of character.
实际应用中,在获取待压缩的目标数据之后,便可以统计目标数据中的字节总数及每类字符的出现概率,以字符为例,计算机常用一个字节(又称1Byte)即8比特(bit)的字符来表示,那么字符的种类有2^8=256种,也即字符在数学上的取值范围是[0,255],这里每类字符的出现概率,是每个独一无二的字符也即0到255这个范围内的每个值分别出现的概率,以 为后续计算目标数据的信息熵做准备。In practical applications, after obtaining the target data to be compressed, the total number of bytes in the target data and the probability of occurrence of each type of character can be counted. Taking characters as an example, computers often use one byte (also known as 1Byte), that is, 8-bit characters, to represent them. Then there are 2^8=256 types of characters, that is, the mathematical value range of characters is [0,255]. Here, the probability of occurrence of each type of character is the probability of occurrence of each unique character, that is, each value in the range of 0 to 255. Prepare for the subsequent calculation of the information entropy of the target data.
步骤S103:确定函数x*log2(x)在预设近似条件下的目标线性函数。Step S103: Determine a target linear function of the function x*log 2 (x) under a preset approximate condition.
步骤S104:基于目标线性函数确定不含对数运算的信息熵运算公式。Step S104: Determine an information entropy calculation formula that does not contain logarithmic operations based on the target linear function.
实际应用中,一方面考虑到现有信息熵运算过程中携带对数运算,而对数运算耗时较长,会导致信息熵的运算效率较低;另一方面考虑到利用信息熵来度量数据可压缩性或者预测数据压缩率并不需要精确的信息熵值,高精度的信息熵近似值对压缩率预测的准确性影响微乎其微;所以本申请在此通过确定信息熵运算过程中所需的函数x*log2(x)在预设近似条件下的目标线性函数,并基于目标线性函数确定不含对数运算的信息熵运算公式,以消除信息熵运算过程中的对数运算,且得到不影响压缩率预测的信息熵运算公式。In practical applications, on the one hand, the existing information entropy calculation process involves logarithmic calculations, which are time-consuming and will result in low information entropy calculation efficiency; on the other hand, the use of information entropy to measure data compressibility or predict data compression rate does not require an accurate information entropy value, and a high-precision information entropy approximation has little effect on the accuracy of compression rate prediction; therefore, the present application herein determines a target linear function of the function x*log 2 (x) required in the information entropy calculation process under preset approximate conditions, and determines an information entropy calculation formula that does not contain logarithmic calculations based on the target linear function, so as to eliminate the logarithmic calculation in the information entropy calculation process, and obtain an information entropy calculation formula that does not affect the compression rate prediction.
步骤S105:通过信息熵运算公式,基于字节总数及出现概率,计算目标数据的目标信息熵。Step S105: Calculate the target information entropy of the target data based on the total number of bytes and the probability of occurrence using the information entropy calculation formula.
实际应用中,在基于目标线性函数确定不含对数运算的信息熵运算公式至,便可以通过信息熵运算公式,基于字节总数及出现概率,计算目标数据的目标信息熵。In practical applications, after determining the information entropy calculation formula without logarithmic operation based on the target linear function, the target information entropy of the target data can be calculated based on the total number of bytes and the probability of occurrence through the information entropy calculation formula.
步骤S106:判断目标信息熵是否小于预设值,若目标信息熵小于预设值,则执行步骤S107:对目标数据进行压缩;若目标信息熵大于等于预设值,则执行步骤S108:不对目标数据进行压缩。Step S106: Determine whether the target information entropy is less than a preset value. If the target information entropy is less than the preset value, execute step S107: compress the target data; if the target information entropy is greater than or equal to the preset value, execute step S108: do not compress the target data.
实际应用中,在通过信息熵运算公式,基于字节总数及出现概率,计算目标数据的目标信息熵之后,便可以判断目标信息熵是否小于预设值,若目标信息熵小于预设值,则对目标数据进行压缩,若目标信息熵大于等于预设值,则不对目标数据进行压缩。需要说明的是,决定是否对目标数据进行压缩的预设值的具体数值可以根据应用场景来确定,本申请在此不做具体限定。In practical applications, after calculating the target information entropy of the target data based on the total number of bytes and the probability of occurrence through the information entropy calculation formula, it is possible to determine whether the target information entropy is less than a preset value. If the target information entropy is less than the preset value, the target data is compressed. If the target information entropy is greater than or equal to the preset value, the target data is not compressed. It should be noted that the specific value of the preset value that determines whether to compress the target data can be determined according to the application scenario, and this application does not make specific limitations here.
本申请提供的一种数据压缩方法,获取待压缩的目标数据;统计目标数据中的字节总数及每类字符的出现概率;确定函数x*log2(x)在预设近似条件下的目标线性函数;基于目标线性函数确定不含对数运算的信息熵运算公式;通过信息熵运算公式,基于字节总数及出现概率,计算目标数据 的目标信息熵;判断目标信息熵是否小于预设值,若目标信息熵小于预设值,则对目标数据进行压缩,若目标信息熵大于等于预设值,则不对目标数据进行压缩。本申请中,在统计目标数据中的字节总数及每类字符的出现概率之后,需确定函数x*log2(x)在预设近似条件下的目标线性函数,也即将函数x*log2(x)转变为不含对数运算且逻辑更简单的目标线性函数,这样,便可以基于目标线性函数确定出不含对数运算的信息熵运算公式,继而可以在不进行复杂对数运算的情况下,计算出目标数据的目标信息熵,提高了目标信息熵的计算效率,进而可以提高基于目标信息熵对数据进行压缩的效率。The present application provides a data compression method, which comprises the following steps: obtaining target data to be compressed; counting the total number of bytes in the target data and the probability of occurrence of each type of character; determining a target linear function of the function x*log 2 (x) under a preset approximate condition; determining an information entropy calculation formula without logarithmic calculation based on the target linear function; and calculating the target data based on the total number of bytes and the probability of occurrence through the information entropy calculation formula. ; determine whether the target information entropy is less than a preset value. If the target information entropy is less than the preset value, the target data is compressed. If the target information entropy is greater than or equal to the preset value, the target data is not compressed. In the present application, after counting the total number of bytes in the target data and the probability of occurrence of each type of character, it is necessary to determine the target linear function of the function x*log 2 (x) under preset approximate conditions, that is, to transform the function x*log 2 (x) into a target linear function that does not contain logarithmic operations and has simpler logic. In this way, an information entropy calculation formula that does not contain logarithmic operations can be determined based on the target linear function, and then the target information entropy of the target data can be calculated without performing complex logarithmic operations, thereby improving the calculation efficiency of the target information entropy, and further improving the efficiency of data compression based on the target information entropy.
在可示例性的应用场景下,本申请实施例提供的一种数据压缩方法中,考虑到现有中一个系统S内存在多个事件S={E1,...,En},每个事件的概率分布P={p1,...,pn},则每个事件本身的信息为:
Ie=-log2pi;In an exemplary application scenario, in a data compression method provided by an embodiment of the present application, considering that there are multiple events S = {E 1 , ..., En } in an existing system S, and the probability distribution of each event P = {p 1 , ..., p n }, the information of each event itself is:
I e = -log 2 p i ;
其中,对数以2为底,单位是比特;The logarithm is based on 2 and the unit is bits;
信息熵是接收的每条消息中包含的信息的平均量,即:
Information entropy is the average amount of information contained in each message received, that is:
在数据压缩上下文中,信息熵指的是字节熵。数据存储基础单位是一个字节(又称1Byte),包括8个数据位,即由8比特(bit)组成,则上式中n的取值范围为[0,255],共256种取值。设置输入消息的字节总数为N,每个字符i的出现概率(可以理解为出现次数)为f(i)(其中,i∈[0,255]),那么In the context of data compression, information entropy refers to byte entropy. The basic unit of data storage is a byte (also known as 1Byte), which includes 8 data bits, that is, it consists of 8 bits. In the above formula, the value range of n is [0,255], with a total of 256 values. Suppose the total number of bytes of the input message is N, and the probability of occurrence of each character i (which can be understood as the number of occurrences) is f(i) (where i∈[0,255]), then
字符x的出现概率为p(i)=f(i)/N。则 The probability of occurrence of character x is p(i) = f(i)/N. Then
信息熵运算公式为:
The information entropy calculation formula is:
其中,H表示目标信息熵,即目标数据的信息熵;N表示字节总数,即目标数据的字节总数;M表示字符的总类型数,是目标数据中独一无二的字符也即值不相同的字符的总类型数;f(i)表示第i类字符的出现概率,即第i类字符在目标数据中出现的概率,其中i的取值范围是[0,M];Among them, H represents the target information entropy, that is, the information entropy of the target data; N represents the total number of bytes, that is, the total number of bytes of the target data; M represents the total number of character types, which is the total number of unique characters in the target data, that is, the total number of characters with different values; f(i) represents the probability of occurrence of the i-th type of character, that is, the probability of the i-th type of character appearing in the target data, where the value range of i is [0, M];
对上述信息熵运算公式两边乘以N后可得:
Multiplying both sides of the above information entropy calculation formula by N yields:
此时,N·log2(N)与f(i)log2(f(i))的形式可以统一表示为x*log2(x),也即只需确定函数x*log2(x)在预设近似条件的目标线性函数,而无需进行信息熵运算公式中复杂的对数运算;x*log2(x)的仿真图形如图2所示,由图2可以看出,x*log2(x)接近线性分布y=kx;At this time, the forms of N·log 2 (N) and f(i)log 2 (f(i)) can be uniformly expressed as x*log 2 (x), that is, it is only necessary to determine the target linear function of the function x*log 2 (x) under the preset approximate condition, without the need to perform the complex logarithmic operation in the information entropy calculation formula; the simulation graph of x*log 2 (x) is shown in FIG2 , and it can be seen from FIG2 that x*log 2 (x) is close to the linear distribution y=kx;
预设近似条件可以理解为通过分区间进行目标线性函数的计算,以提升精度,可以设定x的值所属的区间[2K,2K+1),再基于该区间将x*log2(x)转换为近似的目标线性函数;The preset approximate condition can be understood as calculating the target linear function by dividing the interval to improve the accuracy. The interval [2 K ,2 K+1 ) to which the value of x belongs can be set, and then x*log 2 (x) is converted into an approximate target linear function based on the interval;
具体的,x*log2(x)在点2K处的值为K·2K,在点2K+1处的值为(K+1)·2K+1,应用斜率公式确定目标线性函数,得到斜率R为:
Specifically, the value of x*log 2 (x) at point 2 K is K·2 K , and the value at point 2 K+1 is (K+1)·2 K+1 . Apply the slope formula to determine the target linear function, and the slope R is:
根据线性分布的公式求斜率R=(y2-y1)/(x2-x1),此时线性方程可以表示为y-y1=R(x-x1),则与函数x*log2(x)近似的目标线性函数可以表示为:
y-K·2K=(K+2)(x-2K);According to the linear distribution formula, the slope R = (y 2 -y 1 )/(x 2 -x 1 ) is obtained. At this time, the linear equation can be expressed as yy 1 = R(xx 1 ), and the target linear function that approximates the function x*log 2 (x) can be expressed as:
yK· 2K =(K+2)(x- 2K );
假设k(x)表示确定使得x的值介于[2K,2K+1)的K值的函数,也即k(x)表示求取值为整数且小于等于x的对数底的K值的函数,则目标线性函数可以表示为:
y=k(x)*2k(x)+(x-2k(x))*(2k(x)+2)=x*k(x)+2(x-2k(x));Assuming that k(x) represents a function that determines the K value that makes the value of x between [2 K ,2 K+1 ), that is, k(x) represents a function that finds the K value that is an integer and less than or equal to the logarithmic base of x, the target linear function can be expressed as:
y=k(x)*2k (x) +(x-2k (x) )*(2k (x) +2)=x*k(x)+2(x-2k (x) );
此时,信息熵运算公式可以表示为:
At this time, the information entropy calculation formula can be expressed as:
再进一步的,考虑到: Furthermore, considering:
x*log2(x)≈k(x)*2k(x)+(x-2k(x))*(k(x)+2)=x*k(x)+(2*x-2k(x)+1),对x*log2(x)及x*k(x)+(2*x-2k(x)+1)进行对比可知,(2*x-2k(x)+1)是log2(x)小数部分的一个估计,为了便于理解,假设x的值为5,则k(x)=2,log2(x)=2.32,那么,x*k(x)对应x*log2(x)中的整数部分,也即5*2的运算,(2*x-2k(x)+1)对应的是x*log2(x)中小数部分的运算,也即0.32*5的运算。根据前述可知,x*log2(x)接近线性分布,而x*k(x)是一个线性分布,(2*x-2k(x)+1)是基于线性分布对真实x*log2(x)的一个补偿逼近,所以为了进一步提升目标线性函数与x*log2(x)的相似精度,可以在(2*x-2k(x)+1)中引入调节参数a,将目标线性函数进一步调整为:
y=x*k(x)+((2-a)x-2k(x)+1);x*log 2 (x)≈k(x)*2 k(x) +(x-2 k(x) )*(k(x)+2)=x*k(x)+(2*x-2 k(x)+1 ). By comparing x*log 2 (x) and x*k(x)+(2*x-2 k(x)+1 ), we can see that (2*x-2 k(x)+1 ) is an estimate of the decimal part of log 2 (x). For ease of understanding, assuming that the value of x is 5, then k(x)=2, log 2 (x)=2.32. Then, x*k(x) corresponds to the integer part of x*log 2 (x), that is, the operation of 5*2, and (2*x-2 k(x)+1 ) corresponds to the operation of the decimal part of x*log 2 (x), that is, the operation of 0.32*5. As mentioned above, x*log 2 (x) is close to linear distribution, while x*k(x) is a linear distribution. (2*x-2 k(x)+1 ) is a compensatory approximation of the true x*log 2 (x) based on linear distribution. Therefore, in order to further improve the similarity accuracy between the target linear function and x*log 2 (x), the adjustment parameter a can be introduced in (2*x-2 k(x)+1 ) to further adjust the target linear function to:
y=x*k(x)+((2-a)x-2 k(x)+1 );
此时,信息熵运算公式可以为:
At this time, the information entropy calculation formula can be:
需要说明的是,具体应用场景中,可以预先根据x和log2(x)的变化,为调节参数a预设固定的值,比如在[2,4]、[4,8]、[8,16]、[16,32]等区间给定不同的a值等,此外,对比x*log2(x)及x*k(x)+(2*x-2k(x)+1)的仿真图形之后,对比结果如图3所示,其中位于上方的线条表征x*k(x)+(2*x-2k(x)+1),由仿真图可知,x*log2(x)与x*k(x)+(2*x-2k(x)+1)在[2K,2K+1]区间上的最大差值点为1.5*2K,所以为了进一步提升调节参数a的精度,还可以基于1.5*2K的值来确定调节参数的值等。It should be noted that, in a specific application scenario, a fixed value can be preset for the adjustment parameter a according to the changes of x and log 2 (x), for example, different a values can be given in the intervals of [2,4], [4,8], [8,16], [16,32], etc. In addition, after comparing the simulation graphs of x*log 2 (x) and x*k(x)+(2*x-2 k(x)+1 ), the comparison result is shown in FIG3 , wherein the upper line represents x*k(x)+(2*x-2 k(x)+1 ). It can be seen from the simulation graph that the maximum difference point between x*log 2 (x) and x*k(x)+(2*x-2 k(x)+1 ) in the interval [2 K ,2 K+1 ] is 1.5*2 K . Therefore, in order to further improve the accuracy of the adjustment parameter a, the value of the adjustment parameter can also be determined based on the value of 1.5*2 K.
具体应用场景中,为了寻找整数K,将对值x寻找整数K定义为函数k(x),x的值介于区间[2K,2K+1),k(x)的运算过程可以如下:将x的值转换为二进制的32位整数;统计32位整数中前缀0的个数值;将31与个数值的差值作为k(x)所对应的K值。In a specific application scenario, in order to find the integer K, the function k(x) is defined for finding the integer K for the value x. The value of x is between the interval [ 2K , 2K +1 ). The operation process of k(x) can be as follows: convert the value of x into a binary 32-bit integer; count the number of values with prefix 0 in the 32-bit integer; and take the difference between 31 and the number as the K value corresponding to k(x).
为便于理解,假设x的值为12,转换为二进制数后为12=1100b,最高位为1的位是第3位,其4至31bit的值均为0,那么K=32-(31-4+1)=4;For ease of understanding, assume that the value of x is 12, which is converted to binary as 12=1100b. The highest bit is 1 at the 3rd bit, and the values of bits 4 to 31 are all 0, so K=32-(31-4+1)=4.
当然,也可以采用下述公式来计算k(x):
Of course, the following formula can also be used to calculate k(x):
具体应用场景中,为了提高运算效率,在通过信息熵运算公式,基于字节总数及出现概率,计算目标数据的目标信息熵的过程中,k(x)是x的对数底,相对x是一个非常小的数,那么x*k(x)是二进制的乘法可以转化为简单的移位相加操作,即基于移位相加操作确定x*k(x)的运算结果;取x低(k(x)-1)位的操作,得到(x-2k(x))的运算结果;以此来降低整个信息熵的计算复杂度,且便于以硬件形式实现信息熵的计算。此外,经试验后可知,本申请的信息熵计算方法比标准信息熵计算方法快15%左右,且误差在2%以内,使得数据压缩效率快且误差小。In a specific application scenario, in order to improve the operation efficiency, in the process of calculating the target information entropy of the target data based on the total number of bytes and the probability of occurrence through the information entropy calculation formula, k(x) is the logarithmic base of x, which is a very small number relative to x, then x*k(x) is a binary multiplication that can be converted into a simple shift-add operation, that is, the operation result of x*k(x) is determined based on the shift-add operation; the operation of taking the lower (k(x)-1) bits of x is obtained to obtain the operation result of (x-2 k(x) ); in this way, the calculation complexity of the entire information entropy is reduced, and it is convenient to realize the calculation of information entropy in hardware. In addition, after experiments, it can be known that the information entropy calculation method of the present application is about 15% faster than the standard information entropy calculation method, and the error is within 2%, so that the data compression efficiency is fast and the error is small.
请参阅图4,图4为本申请实施例提供的一种数据压缩系统的结构示意图。Please refer to FIG. 4 , which is a schematic diagram of the structure of a data compression system provided in an embodiment of the present application.
本申请实施例提供的一种数据压缩系统,可以包括:A data compression system provided in an embodiment of the present application may include:
第一获取模块101,用于获取待压缩的目标数据;A first acquisition module 101 is used to acquire target data to be compressed;
第一统计模块102,用于统计目标数据中的字节总数及每类字符的出现概率;The first statistical module 102 is used to count the total number of bytes in the target data and the occurrence probability of each type of character;
第一确定模块103,用于确定函数x*log2(x)在预设近似条件下的目标线性函数;A first determination module 103 is used to determine a target linear function of the function x*log 2 (x) under a preset approximate condition;
第二确定模块104,用于基于目标线性函数确定不含对数运算的信息熵运算公式;A second determination module 104 is used to determine an information entropy calculation formula without logarithmic calculation based on the target linear function;
第一计算模块105,用于通过信息熵运算公式,基于字节总数及出现概率,计算目标数据的目标信息熵;The first calculation module 105 is used to calculate the target information entropy of the target data based on the total number of bytes and the probability of occurrence through an information entropy calculation formula;
第一判断模块106,用于判断目标信息熵是否小于预设值,若目标信 息熵小于预设值,则对目标数据进行压缩,若目标信息熵大于等于预设值,则不对目标数据进行压缩。The first judgment module 106 is used to judge whether the target information entropy is less than a preset value. If the information entropy is less than the preset value, the target data is compressed. If the target information entropy is greater than or equal to the preset value, the target data is not compressed.
本申请实施例提供的一种数据压缩系统,第一确定模块可以包括:In a data compression system provided by an embodiment of the present application, the first determination module may include:
第一确定单元,用于确定函数x*log2(x)在预设近似条件下的目标线性函数;A first determining unit, used to determine a target linear function of the function x*log 2 (x) under a preset approximate condition;
目标线性函数包括:
y=x*k(x)+2(x-2k(x));The target linear functions include:
y=x*k(x)+2(x-2 k(x) );
其中,k(x)表示确定使得x的值介于[2K,2K+1)的K值的函数。Here, k(x) represents a function for determining the value of K that makes the value of x fall within the range of [2 K ,2 K+1 ).
本申请实施例提供的一种数据压缩系统,第二确定模块可以包括:In a data compression system provided by an embodiment of the present application, the second determination module may include:
第二确定单元,用于基于目标线性函数确定不含对数运算的信息熵运算公式;A second determining unit is used to determine an information entropy calculation formula without logarithmic calculation based on the target linear function;
信息熵运算公式包括:
The information entropy calculation formula includes:
其中,H表示目标信息熵,即目标数据的信息熵;N表示字节总数,即目标数据的字节总数;M表示字符的总类型数,是目标数据中独一无二的字符也即值不相同的字符的总类型数;f(i)表示第i类字符的出现概率,即第i类字符在目标数据中出现的概率,其中i的取值范围是[0,M]。Among them, H represents the target information entropy, that is, the information entropy of the target data; N represents the total number of bytes, that is, the total number of bytes of the target data; M represents the total number of character types, which is the total number of unique characters in the target data, that is, the total number of characters with different values; f(i) represents the probability of occurrence of the i-th type of character, that is, the probability of the i-th type of character appearing in the target data, where the value range of i is [0,M].
本申请实施例提供的一种数据压缩系统,第一确定模块可以包括:In a data compression system provided by an embodiment of the present application, the first determination module may include:
第三确定单元,用于确定函数x*log2(x)在预设近似条件下的目标线性函数;A third determining unit is used to determine a target linear function of the function x*log 2 (x) under a preset approximate condition;
目标线性函数包括:
y=x*k(x)+((2-a)x-2k(x)+1);The target linear functions include:
y=x*k(x)+((2-a)x-2 k(x)+1 );
其中,k(x)表示确定使得x的值介于[2K,2K+1)的K值的函数;a表示调节参数。Wherein, k(x) represents a function for determining the K value that makes the value of x between [2 K ,2 K+1 ); a represents an adjustment parameter.
本申请实施例提供的一种数据压缩系统,还可以包括:A data compression system provided in an embodiment of the present application may further include:
第三确定模块,用于基于1.5*2K的值确定调节参数的值。The third determination module is used to determine the value of the adjustment parameter based on the value of 1.5*2 K.
本申请实施例提供的一种数据压缩系统,第二确定模块可以包括:In a data compression system provided by an embodiment of the present application, the second determination module may include:
第四确定单元,用于基于目标线性函数确定不含对数运算的信息熵运 算公式;The fourth determining unit is used to determine the information entropy operation without logarithmic operation based on the target linear function. Calculation formula;
信息熵运算公式包括:
The information entropy calculation formula includes:
其中,H表示目标信息熵,即目标数据的信息熵;N表示字节总数,即目标数据的字节总数;M表示字符的总类型数,是目标数据中独一无二的字符也即值不相同的字符的总类型数;f(i)表示第i类字符的出现概率,即第i类字符在目标数据中出现的概率,其中i的取值范围是[0,M]。Among them, H represents the target information entropy, that is, the information entropy of the target data; N represents the total number of bytes, that is, the total number of bytes of the target data; M represents the total number of character types, which is the total number of unique characters in the target data, that is, the total number of characters with different values; f(i) represents the probability of occurrence of the i-th type of character, that is, the probability of the i-th type of character appearing in the target data, where the value range of i is [0,M].
本申请实施例提供的一种数据压缩系统,k(x)的运算过程包括:In a data compression system provided by an embodiment of the present application, the calculation process of k(x) includes:
将x的值转换为二进制的32位整数;Convert the value of x to a 32-bit binary integer;
统计32位整数中前缀0的个数值;Count the number of leading 0s in 32-bit integers;
将31与个数值的差值作为k(x)所对应的K值。The difference between 31 and the individual value is taken as the K value corresponding to k(x).
本申请实施例提供的一种数据压缩系统,第一计算模块通过信息熵运算公式,基于字节总数及出现概率,计算目标数据的目标信息熵的过程中,用于:基于移位相加操作确定x*k(x)的运算结果;取x低(k(x)-1)位的操作,得到(x-2k(x))的运算结果。In a data compression system provided by an embodiment of the present application, a first calculation module calculates the target information entropy of target data based on the total number of bytes and the probability of occurrence through an information entropy calculation formula, and is used to: determine the operation result of x*k(x) based on a shift-add operation; and obtain the operation result of (x-2 k(x) ) by taking the lower (k(x)-1) bits of x.
本申请还提供了一种数据压缩设备及计算机可读存储介质,其均具有本申请实施例提供的一种数据压缩方法具有的对应效果。请参阅图5,图5为本申请实施例提供的一种数据压缩设备的结构示意图。The present application also provides a data compression device and a computer-readable storage medium, both of which have the corresponding effects of a data compression method provided in an embodiment of the present application. Please refer to Figure 5, which is a schematic diagram of the structure of a data compression device provided in an embodiment of the present application.
本申请实施例提供的一种数据压缩设备,包括存储器201和处理器202,存储器201中存储有计算机程序,处理器202执行计算机程序时实现如上任一实施例所描述数据压缩方法的步骤。A data compression device provided in an embodiment of the present application includes a memory 201 and a processor 202. The memory 201 stores a computer program. When the processor 202 executes the computer program, the steps of the data compression method described in any of the above embodiments are implemented.
请参阅图6,本申请实施例提供的另一种数据压缩设备中还可以包括:与处理器202连接的输入端口203,用于传输外界输入的命令至处理器202;与处理器202连接的显示单元204,用于显示处理器202的处理结果至外界;与处理器202连接的通信模块205,用于实现数据压缩设备与外界的通信。显示单元204可以为显示面板、激光扫描使显示器等;通信模块205所采用的通信方式包括但不局限于移动高清链接技术(HML)、通用串行总线 (USB)、高清多媒体接口(HDMI)、无线连接:无线保真技术(WiFi)、蓝牙通信技术、低功耗蓝牙通信技术、基于IEEE802.11s的通信技术。Please refer to FIG6 . Another data compression device provided in the embodiment of the present application may also include: an input port 203 connected to the processor 202, used to transmit commands input from the outside to the processor 202; a display unit 204 connected to the processor 202, used to display the processing results of the processor 202 to the outside; a communication module 205 connected to the processor 202, used to realize the communication between the data compression device and the outside. The display unit 204 can be a display panel, a laser scanning display, etc.; the communication mode adopted by the communication module 205 includes but is not limited to mobile high-definition link technology (HML), universal serial bus, etc. (USB), High-Definition Multimedia Interface (HDMI), wireless connection: Wireless Fidelity technology (WiFi), Bluetooth communication technology, low-power Bluetooth communication technology, and communication technology based on IEEE802.11s.
本申请实施例提供的一种计算机可读存储介质,计算机可读存储介质中存储有计算机程序,计算机程序被处理器执行时实现如上任一实施例所描述数据压缩方法的步骤。An embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored. When the computer program is executed by a processor, the steps of the data compression method described in any of the above embodiments are implemented.
本申请所涉及的计算机可读存储介质包括随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质。The computer-readable storage medium involved in the present application includes random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.
本申请实施例提供的数据压缩系统、设备及计算机可读存储介质中相关部分的说明请参见本申请实施例提供的数据压缩方法中对应部分的详细说明,在此不再赘述。另外,本申请实施例提供的上述技术方案中与现有技术中对应技术方案实现原理一致的部分并未详细说明,以免过多赘述。For the description of the relevant parts of the data compression system, device and computer-readable storage medium provided in the embodiments of the present application, please refer to the detailed description of the corresponding parts in the data compression method provided in the embodiments of the present application, which will not be repeated here. In addition, the parts of the above technical solutions provided in the embodiments of the present application that are consistent with the implementation principles of the corresponding technical solutions in the prior art are not described in detail to avoid excessive elaboration.
本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其它实施例的不同之处,各个实施例之间相同或相似部分互相参见即可。对于实施例公开的装置而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。In this specification, each embodiment is described in a progressive manner, and each embodiment focuses on the differences from other embodiments. The same or similar parts between the embodiments can be referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant parts can be referred to the method part.
专业人员还可以进一步意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。Professionals may further appreciate that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. In order to clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been generally described in the above description according to function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Professionals and technicians may use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of this application.
结合本文中所公开的实施例描述的方法或算法的步骤可以直接用硬件、处理器执行的软件模块,或者二者的结合来实施。软件模块可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。 The steps of the method or algorithm described in conjunction with the embodiments disclosed herein may be implemented directly using hardware, a software module executed by a processor, or a combination of the two. The software module may be placed in a random access memory (RAM), a memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。Finally, it should be noted that, in this article, relational terms such as first and second, etc. are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, article or device. In the absence of further restrictions, the elements defined by the sentence "comprise a ..." do not exclude the presence of other identical elements in the process, method, article or device including the elements.
以上对本申请所提供的数据压缩方案进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。 The data compression scheme provided by the present application is introduced in detail above. Specific examples are used in this article to illustrate the principles and implementation methods of the present application. The description of the above embodiments is only used to help understand the method of the present application and its core idea. At the same time, for those skilled in the art, according to the idea of the present application, there will be changes in the specific implementation methods and application scope. In summary, the content of this specification should not be understood as a limitation on the present application.
Claims (11)
y=x*k(x)+2(x-2k(x));The target linear function includes:
y=x*k(x)+2(x-2 k(x) );
The information entropy calculation formula includes:
y=x*k(x)+((2-a)x-2k(x)+1);The target linear function includes:
y=x*k(x)+((2-a)x-2 k(x)+1 );
The information entropy calculation formula includes:
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211658252.4 | 2022-12-22 | ||
| CN202211658252.4A CN116032290A (en) | 2022-12-22 | 2022-12-22 | Data compression method, system, equipment and computer readable storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024130889A1 true WO2024130889A1 (en) | 2024-06-27 |
Family
ID=86072466
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2023/086109 Ceased WO2024130889A1 (en) | 2022-12-22 | 2023-04-04 | Data compression method, system and device, and computer readable storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN116032290A (en) |
| WO (1) | WO2024130889A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119300086A (en) * | 2024-12-16 | 2025-01-10 | 西安微合智联科技有限公司 | Data compression method, device, electronic device and storage medium |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220035526A1 (en) * | 2020-07-31 | 2022-02-03 | EMC IP Holding Company LLC | Data compression method, electronic device and computer program product |
| CN114039607A (en) * | 2021-11-09 | 2022-02-11 | 山东云海国创云计算装备产业创新中心有限公司 | Multi-character limited entropy coding method, device, equipment and readable medium |
| CN114337678A (en) * | 2020-09-29 | 2022-04-12 | 华为技术有限公司 | Data compression method, device, equipment and storage medium |
| US20220224947A1 (en) * | 2019-08-15 | 2022-07-14 | Huawei Technologies Co., Ltd. | Coding method and related device |
-
2022
- 2022-12-22 CN CN202211658252.4A patent/CN116032290A/en active Pending
-
2023
- 2023-04-04 WO PCT/CN2023/086109 patent/WO2024130889A1/en not_active Ceased
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220224947A1 (en) * | 2019-08-15 | 2022-07-14 | Huawei Technologies Co., Ltd. | Coding method and related device |
| US20220035526A1 (en) * | 2020-07-31 | 2022-02-03 | EMC IP Holding Company LLC | Data compression method, electronic device and computer program product |
| CN114337678A (en) * | 2020-09-29 | 2022-04-12 | 华为技术有限公司 | Data compression method, device, equipment and storage medium |
| CN114039607A (en) * | 2021-11-09 | 2022-02-11 | 山东云海国创云计算装备产业创新中心有限公司 | Multi-character limited entropy coding method, device, equipment and readable medium |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119300086A (en) * | 2024-12-16 | 2025-01-10 | 西安微合智联科技有限公司 | Data compression method, device, electronic device and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116032290A (en) | 2023-04-28 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11249721B2 (en) | Multiplication circuit, system on chip, and electronic device | |
| CN111966649B (en) | A lightweight online file storage method and device for efficient deduplication | |
| CN110688088B (en) | General nonlinear activation function computing device and method for neural network | |
| CN111310890B (en) | Optimization method and device of deep learning model and terminal equipment | |
| CN114065704A (en) | Data compression method, electronic device and computer program product | |
| US11755540B2 (en) | Chunking method and apparatus | |
| WO2024130889A1 (en) | Data compression method, system and device, and computer readable storage medium | |
| CN115129791A (en) | A data compression storage method, device and device | |
| CN114567396B (en) | Wireless communication method, nonlinear function fitting method, terminal and device | |
| WO2024093963A1 (en) | Battery life determination method and apparatus, electronic device, and storage medium | |
| US20240137043A1 (en) | Data compression method and apparatus, and data decompression method and apparatus | |
| CN114840565B (en) | Sampling query method, device, electronic device and computer-readable storage medium | |
| CN119629657B (en) | Communication networking method and system of wireless network card | |
| CN114547030A (en) | Multi-stage time sequence data compression method and device, electronic equipment and storage medium | |
| WO2022095430A1 (en) | Sliding window configuration method and apparatus, computer device and storage medium | |
| CN117099109A (en) | Compression technique for deep neural network weights | |
| US20190334552A1 (en) | Encoding method and encoder | |
| CN107783990B (en) | Data compression method and terminal | |
| US20220050614A1 (en) | System and method for approximating replication completion time | |
| US11435956B2 (en) | Method, electronic device, and computer program product for data compression | |
| CN117999550A (en) | A self-adaptive table building method, device, electronic device and storage medium | |
| CN115237984A (en) | QPS statistical method, QPS statistical device, computer equipment and medium for distributed system | |
| CN111949110B (en) | Processing method and device for minimizing energy consumption in mobile edge calculation | |
| Taurone et al. | Zip to Zip-It: Compression to Achieve Local Differential Privacy | |
| CN114297318A (en) | Data processing method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23905081 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 23905081 Country of ref document: EP Kind code of ref document: A1 |