WO2023200509A1 - Sparsifying vectors for neural network models based on overlapping windows - Google Patents
Info
- Publication number
- WO2023200509A1 (PCT/US2023/011804)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- elements
- vector
- window
- selecting
- absolute value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
Definitions
- the present disclosure relates to computing hardware. More particularly, the present disclosure relates to techniques for sparsifying neural network parameters and/or activations.
- a neural network is a machine learning model used for a variety of different applications (e.g., image classification, computer vision, natural language processing, speech recognition, writing recognition, etc.).
- a neural network may be trained for a particular purpose by running datasets through it, comparing results from the neural network to known results, and updating the network based on the differences.
- FIG. 1 illustrates an example neural network model according to some embodiments.
- FIGS. 2A-2E illustrate an example of sparsifying a vector according to some embodiments.
- FIG. 3 illustrates a process for sparsifying a vector according to some embodiments.
- FIG. 4 depicts a simplified block diagram of an example computer system according to some embodiments.
- FIG. 5 illustrates a neural network processing system according to some embodiments.
- a neural network model includes numerous parameters arranged in the form of vectors and/or matrices. These vectors and/or matrices can be sparsified using an overlapping window technique.
- a neural network model can include vectors of weight values and vectors of activation values. To sparsify any one (or both) of these vectors, a window having a defined length is used to select a subset of elements in the vector. Next, the element having the highest absolute value is selected from the selected subset of elements. The window is slid across the vector by a defined number of elements.
- the process is repeated (e.g., selecting a subset of elements in the vector, selecting the element having the highest absolute value from the selected subset of elements, and sliding the window across the vector) until the window has moved across the entire vector.
- Elements that are not selected are modified to a defined value (e.g., zero).
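- as a minimal, illustrative sketch only (the function and parameter names sparsify_overlapping_windows, window_len, stride, and fill_value are not from the disclosure), the procedure described above can be expressed in Python as follows; zeroing each selected element in a working copy mirrors the step shown later in FIGS. 2A-2E, so an already-selected element is not picked again by a later, overlapping window:

```python
def sparsify_overlapping_windows(vector, window_len, stride, fill_value=0.0):
    """Keep the highest-absolute-value element of each overlapping window and
    set every other element of the vector to fill_value."""
    n = len(vector)
    work = list(vector)   # working copy; selected elements are zeroed here so a
                          # later, overlapping window does not pick them again
    selected = {}         # index -> original value of each selected element
    for start in range(0, n, stride):
        # The window may wrap around the end of the vector (see FIG. 2D).
        indices = [(start + i) % n for i in range(window_len)]
        best = max(indices, key=lambda i: abs(work[i]))
        selected[best] = vector[best]
        work[best] = fill_value
    return [selected.get(i, fill_value) for i in range(n)]
```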
- a vector sparsification technique that uses an overlapping window to select non-zero elements from the vector may be implemented using fewer resources (e.g., computing hardware, memory, etc.) compared to conventional vector sparsification approaches that have similar levels of accuracy.
- vector sparsification techniques based on overlapping windows provide better accuracy than conventional vector sparsification approaches implemented on the same given amount of hardware.
- FIG. 1 illustrates an example neural network model 100 according to some embodiments.
- neural network model 100 includes input layer 105, hidden layer 110, and output layer 115.
- Input layer 105 includes two nodes 120 and 125.
- Each of the nodes 120 and 125 is configured to receive input data and send the input data to each of the nodes in hidden layer 110.
- node 120 receives input data 130 and sends it to each of the nodes 140-150.
- Node 125 receives input data 135 and sends it to each of the nodes 140-150.
- hidden layer 110 includes three nodes 140-150.
- Each of the nodes 140-150 receives data from each of the nodes 120 and 125 comprised of a particular weight value multiplied by the value of the corresponding input data.
- node 140 receives data comprised of input data 130 multiplied by weight value W1 and input data 135 multiplied by weight value W2.
- node 145 receives data comprised of input data 130 multiplied by weight value W3 and input data 135 multiplied by weight value W4.
- Node 150 receives data comprised of input data 130 multiplied by weight value W5 and input data 135 multiplied by weight value W6.
- Each of the nodes 140-150 is configured to apply an activation function to the sum of the received inputs and, based on the activation function, generates an output that is sent to each of the nodes 155 and 160.
- output layer 115 includes nodes 155 and 160.
- Each of the nodes 155 and 160 receives data from each of the nodes 140-150 comprised of a particular weight value multiplied by the value of the corresponding output generated by one of the nodes 140-150.
- node 155 receives data comprised of the output generated by node 140 multiplied by weight value W7, the output generated by node 145 multiplied by weight value W9, and the output generated by node 150 multiplied by weight value W11.
- node 160 receives data comprised of the output generated by node 140 multiplied by weight value W8, the output generated by node 145 multiplied by weight value W10, and the output generated by node 150 multiplied by weight value W12.
- Each of the nodes 155 and 160 applies an activation function to the sum of the received inputs and, based on the activation function, generates an output.
- node 155 generates output 165 based on the sum of its received inputs while node 160 generates output 170 based on the sum of its received inputs.
- the calculations in neural network model 100 can be implemented using vectors of values.
- the inputs for nodes 140-150 may be generated by multiplying a vector of input data 130 and 135 by a vector of weight values W1-W6.
- the inputs for nodes 155 and 160 can be generated by multiplying a vector of the activation function outputs generated by nodes 140-150 by a vector of weight values W7-W12.
- the sparsification techniques described herein may be applied to one of the vectors of weight values, the vector of activation outputs, each of the vectors of weight values and the vector of activation outputs, or any combination thereof.
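- as an illustration only, the layer computations of FIG. 1 can be written as vector and matrix operations; the input and weight values below are hypothetical, and ReLU is assumed as the activation function (the disclosure does not specify one):

```python
import numpy as np

# Hypothetical values for inputs 130 and 135.
x = np.array([0.5, -1.0])

# Hidden-layer weights W1-W6, one row per node 140, 145, 150.
W_hidden = np.array([[0.2, 0.7],    # W1, W2 -> node 140
                     [-0.4, 0.1],   # W3, W4 -> node 145
                     [0.9, -0.3]])  # W5, W6 -> node 150
hidden = np.maximum(W_hidden @ x, 0.0)    # activation outputs of nodes 140-150

# Output-layer weights W7-W12, one row per node 155 and 160.
W_out = np.array([[0.6, -0.2, 0.5],   # W7, W9, W11 -> node 155
                  [0.3, 0.8, -0.7]])  # W8, W10, W12 -> node 160
outputs = np.maximum(W_out @ hidden, 0.0)  # outputs 165 and 170
```

Either the rows of W_hidden and W_out or the activation vector hidden could then be sparsified as described above.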
- FIGS. 2A-2E illustrate an example of sparsifying a vector 200 according to some embodiments.
- the example demonstrates how vector 200 is sparsified using an overlapping window approach.
- vector 200 includes sixteen elements 202-232. Each of the elements 202-232 stores a particular value.
- vector 200 will be sparsified to a 75% level. That is, four elements with non-zero values will be selected from vector 200 and any unselected elements in vector 200 will be modified to a defined value (e.g., zero).
- the example starts by using a window 235 having a defined length (a window of eight elements in this example) to select the first eight elements 202-216 in vector 200.
- From the elements selected by window 235, the element having the highest absolute value is selected and stored in a storage 240. As depicted, element 212 has the highest absolute value among all the elements in window 235. As such, element 212 is selected and stored in storage 240.
- the value of the selected element 212 is changed to a defined value.
- the defined value is zero (“0”).
- window 235 is slid across vector 200 by a defined number of elements and a new set of elements are selected using window 235.
- the defined number of elements is four.
- window 235 is slid across vector 200 by four elements and elements 210-224 are now selected by window 235.
- the element having the highest absolute value is selected from the elements selected by window 235 and stored in a storage 245.
- element 218 has the highest absolute value among all the elements in window 235. Therefore, element 218 is selected and stored in storage 245.
- the value of the selected element 218 is modified to the defined value. As depicted, the value of element 218 has been modified to zero. Window 235 is then slid across vector 200 by the defined number of elements and a new set of elements are selected using window 235. As illustrated in FIG. 2C, window 235 is slid across vector 200 by four elements and elements 218-232 are now selected by window 235. The element having the highest absolute value is selected from the elements selected by window 235 and stored in a storage 250. Here, element 230 has the highest absolute value among all the elements in window 235, as depicted in FIG. 2C. Hence, element 230 is selected and stored in storage 250.
- window 235 is slid across vector 200 by the defined number of elements and a new set of elements are selected using window 235.
- window 235 extends past the end of vector 200 and only elements 226-232 are selected by window 235.
- the first four elements in vector 200 are used to fill in the missing elements in window 235.
- window 235 is slid across vector 200 by four elements, and the first four elements 202-208 fill the last four positions of window 235. As such, elements 226-232 and 202-208 are now selected by window 235.
- the element having the highest absolute value is selected from the elements selected by window 235 and stored in a storage 255.
- element 204 has the highest absolute value among all the elements in window 235.
- element 204 is selected and stored in storage 255.
- FIG. 2E illustrates vector 200 after it has been sparsified using the technique explained above.
- the values of elements 204, 212, 218, and 230, which were the elements selected at each stage shown in FIGS. 2A-2D, are still stored in vector 200.
- the values of the unselected elements 202, 206-210, 214, 216, 220-228, and 232 have been modified to a defined sparsity value (zero in this example).
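- a brief usage sketch of the sparsify_overlapping_windows function given earlier reproduces this behavior; the element values below are hypothetical (the disclosure does not list the values shown in FIGS. 2A-2E), but a sixteen-element vector, a window of eight, and a slide of four yield exactly four retained elements, with the final window wrapping around to the start of the vector:

```python
# Hypothetical values for elements 202-232 of vector 200.
vector_200 = [0.1, 0.9, -0.2, 0.3, -0.5, 0.8, 0.0, 0.4,
              -0.7, 0.6, 0.2, -0.1, 0.3, -0.4, 0.5, -0.6]

# The windows cover indices 0-7, 4-11, 8-15, and 12-15 plus the wrapped 0-3.
sparse = sparsify_overlapping_windows(vector_200, window_len=8, stride=4)

# Four of the sixteen elements are retained, i.e., 75% of the vector is zeroed.
assert sum(1 for v in sparse if v != 0.0) == 4
```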
- the sparsification technique described above by reference to FIGS. 2A-2E may be implemented in hardware using multiplexers as well as other components necessary to accomplish the intended functionality described above. While the vector used in this example includes a certain number of elements, one of ordinary skill in the art will understand that the techniques shown in this example may be applied to sparsify vectors having any number of elements. Moreover, one of ordinary skill in the art will appreciate that windows having different lengths can be used in different embodiments.
- FIG. 3 illustrates a process 300 for sparsifying a vector according to some embodiments.
- in some embodiments, process 300 may be performed by a computing device (e.g., computer system 400) or by artificial intelligence (AI) hardware (e.g., a neural network processor 511).
- Process 300 begins by using, at 310, a window to select a first set of elements in a vector of elements. Referring to FIG. 2A as an example, window 235 is used to select the set of elements 202-216.
- process 300 selects, at 320, a first element from the first set of elements having the highest absolute value.
- element 212 is selected from the set of elements 202-216 because it is the element having the highest absolute value.
- Process 300 then slides, at 330, the window along the vector by a defined number of elements. Referring to FIGS. 2A and 2B as an example, window 235 in FIG. 2A is slid along vector 200 by four elements to the position shown in FIG. 2B.
- process 300 then uses, at 340, the window to select a second set of elements in the vector, wherein the first set of elements and the second set of elements share at least one common element.
- window 235 is used to select the set of elements 210-224.
- process 300 selects, at 350, a second element from the second set of elements having the highest absolute value.
- element 218 is selected from the set of elements 210-224 as it is the element having the highest absolute value.
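- purely as an illustrative sketch with hypothetical values (a twelve-element vector, a window of eight, and a slide of four), the two windows of process 300 can be written out directly, which also makes the shared elements of the first and second sets explicit:

```python
vector = [3.0, -8.0, 1.0, 4.0, -2.0, 7.0, 0.5, -6.0, 9.0, 2.0, -1.0, 5.0]
window_len, stride = 8, 4

first_set = list(range(0, window_len))                       # operation 310
first_pick = max(first_set, key=lambda i: abs(vector[i]))    # operation 320
second_set = list(range(stride, stride + window_len))        # operation 330; window now selects the second set
second_pick = max(second_set, key=lambda i: abs(vector[i]))  # operation 350

assert (first_pick, second_pick) == (1, 8)
# Because the slide (4) is smaller than the window length (8), the two
# windows overlap in elements 4 through 7.
assert set(first_set) & set(second_set) == set(range(4, 8))
```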
- FIG. 4 depicts a simplified block diagram of an example computer system 400, which can be used to implement the techniques described in the foregoing disclosure.
- computer system 400 includes one or more processors 402 that communicate with a number of peripheral devices via a bus subsystem 404. These peripheral devices may include a storage subsystem 406 (e.g., comprising a memory subsystem 408 and a file storage subsystem 410) and a network interface subsystem 416. Some computer systems may further include user interface input devices 412 and/or user interface output devices 414.
- Bus subsystem 404 can provide a mechanism for letting the various components and subsystems of computer system 400 communicate with each other as intended. Although bus subsystem 404 is shown schematically as a single bus, alternative embodiments of the bus subsystem can utilize multiple busses.
- Network interface subsystem 416 can serve as an interface for communicating data between computer system 400 and other computer systems or networks.
- Embodiments of network interface subsystem 416 can include, e.g., Ethernet, a Wi-Fi and/or cellular adapter, a modem (telephone, satellite, cable, ISDN, etc.), digital subscriber line (DSL) units, and/or the like.
- Storage subsystem 406 includes a memory subsystem 408 and a file/disk storage subsystem 410.
- Subsystems 408 and 410 as well as other memories described herein are examples of non-transitory computer-readable storage media that can store executable program code and/or data that provide the functionality of embodiments of the present disclosure.
- Memory subsystem 408 includes a number of memories including a main random access memory (RAM) 418 for storage of instructions and data during program execution and a read-only memory (ROM) 420 in which fixed instructions are stored.
- File storage subsystem 410 can provide persistent (e.g., non-volatile) storage for program and data files, and can include a magnetic or solid-state hard disk drive, an optical drive along with associated removable media (e.g., CD-ROM, DVD, Blu-Ray, etc.), a removable flash memory-based drive or card, and/or other types of storage media known in the art.
- computer system 400 is illustrative and many other configurations having more or fewer components than system 400 are possible.
- FIG. 5 illustrates a neural network processing system 500 according to some embodiments.
- neural networks may be implemented and trained in a hardware environment comprising one or more neural network processors.
- a neural network processor may refer to various graphics processing units (GPU) (e.g., a GPU for processing neural networks produced by Nvidia Corp®), field programmable gate arrays (FPGA) (e.g., FPGAs for processing neural networks produced by Xilinx®), or a variety of application specific integrated circuits (ASICs) or neural network processors comprising hardware architectures optimized for neural network computations, for example.
- servers 502, which may comprise architectures illustrated in FIG. 4 above, may be coupled to a plurality of controllers 510(1)-510(M) over a communication network 501 (e.g., switches, routers, etc.). Controllers 510(1)-510(M) may also comprise architectures illustrated in FIG. 4 above. Each controller 510(1)-510(M) may be coupled to one or more NN processors, such as processors 511(1)-511(N) and 512(1)-512(N), for example. In some embodiments, NN processors 511(1)-511(N) and 512(1)-512(N) may be used to implement AI processor(s) 135.
- NN processors 511(1)-511(N) and 512(1)-512(N) may include a variety of configurations of functional processing blocks and memory optimized for neural network processing, such as training or inference.
- the NN processors are optimized for neural network computations.
- Server 502 may configure controllers 510 with NN models as well as input data to the models, which may be loaded and executed by NN processors 511(1)-511(N) and 512(1)-512(N) in parallel, for example.
- Models may include layers and associated weights as described above, for example.
- NN processors may load the models and apply the inputs to produce output results.
- NN processors may also implement training algorithms described herein, for example.
- the present disclosure includes systems, methods, and apparatuses for sparsifying vectors for neural network models based on overlapping windows.
- the techniques described herein may be embodied in non-transitory machine-readable medium storing a program executable by a computer system, the program comprising sets of instructions for performing the techniques described herein.
- a system includes a set of processing units and a non-transitory machine-readable medium storing instructions that when executed by at least one processing unit in the set of processing units cause the at least one processing unit to perform the techniques described above.
- the non-transitory machine-readable medium may be memory, for example, which may be coupled to one or more controllers or one or more artificial intelligence processors, for example.
- the present disclosure includes a non-transitory machine-readable medium storing a program executable by at least one processing unit of a device, the program comprising sets of instructions for using a window to select a first set of elements in a vector of elements; selecting a first element from the first set of elements having the highest absolute value; sliding the window along the vector by a defined number of elements; using the window to select a second set of elements in the vector, wherein the first set of elements and the second set of elements share at least one common element; and selecting a second element from the second set of elements having the highest absolute value.
- the vector is a first vector
- the program further comprises a set of instructions for multiplying the selected first and second elements in the vector with corresponding first and second elements in a second vector of elements.
- the first and second vectors are parameters in a neural network model.
- the first element in the first set of elements is included in the second set of elements, wherein selecting the second element from the second set of elements comprises selecting an element other than the first element from the second set of elements having the highest absolute value.
- the program further comprises a set of instructions for, after selecting the first element from the first set of elements and before selecting the second element from the second set of elements, storing the first element and modifying the value of the first element in the vector to a defined value.
- the second set of elements comprises a third set of elements from a first end of the vector and a fourth set of elements from a second end of the vector.
- the first element and the second element are different elements in the vector.
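- as a hypothetical sketch only (the function name sparse_dot, the values, and the accumulation of the products into a single dot-product sum are illustrative assumptions, not taken from the disclosure), multiplying the retained elements of the sparsified first vector with the corresponding elements of a second vector can look like this:

```python
def sparse_dot(selected, second_vector):
    """Multiply each retained (index, value) pair of the sparsified first
    vector with the corresponding element of the dense second vector and
    accumulate the products."""
    return sum(value * second_vector[index] for index, value in selected.items())

# Hypothetical example: two elements survived sparsification of an eight-element vector.
selected = {2: 0.9, 6: -0.7}                   # index -> retained value
second_vector = [0.1, 0.4, -0.3, 0.2, 0.5, -0.6, 0.8, 0.0]
result = sparse_dot(selected, second_vector)   # 0.9*(-0.3) + (-0.7)*0.8 = -0.83
```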
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Physics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Complex Calculations (AREA)
- Image Processing (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23707551.0A EP4508552A1 (en) | 2022-04-14 | 2023-01-30 | Sparsifying vectors for neural network models based on overlapping windows |
| CN202380028157.2A CN118891617A (en) | 2022-04-14 | 2023-01-30 | Sparsification of vectors for neural networks based on overlapping windows |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202263331188P | 2022-04-14 | 2022-04-14 | |
| US63/331,188 | 2022-04-14 | ||
| US17/827,222 US20230334284A1 (en) | 2022-04-14 | 2022-05-27 | Sparsifying vectors for neural network models based on overlapping windows |
| US17/827,222 | 2022-05-27 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023200509A1 true WO2023200509A1 (en) | 2023-10-19 |
Family ID: 85382963
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2023/011804 Ceased WO2023200509A1 (en) | 2022-04-14 | 2023-01-30 | Sparsifying vectors for neural network models based on overlapping windows |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP4508552A1 (en) |
| TW (1) | TW202341064A (en) |
| WO (1) | WO2023200509A1 (en) |
- 2023-01-30 WO PCT/US2023/011804 patent/WO2023200509A1/en not_active Ceased
- 2023-01-30 EP EP23707551.0A patent/EP4508552A1/en active Pending
- 2023-03-13 TW TW112109088A patent/TW202341064A/en unknown
Non-Patent Citations (1)
| Title |
|---|
| XI JIANGBO ET AL: "Wide Sliding Window and Subsampling Network for Hyperspectral Image Classification", REMOTE SENSING, vol. 13, no. 7, 28 March 2021 (2021-03-28), pages 1 - 19, XP093042062, DOI: 10.3390/rs13071290 * |
Also Published As
| Publication number | Publication date |
|---|---|
| TW202341064A (en) | 2023-10-16 |
| EP4508552A1 (en) | 2025-02-19 |
Similar Documents
| Publication | Title |
|---|---|
| US11893469B2 (en) | Position masking for transformer models |
| US12412088B2 (en) | Reducing operations for training neural networks |
| EP4186003A1 (en) | Compressing tokens based on positions for transformer models |
| US11537890B2 (en) | Compressing weights for distributed neural networks |
| US20240004718A1 (en) | Compiling tensor operators for neural network models based on tensor tile configurations |
| WO2023200509A1 (en) | Sparsifying vectors for neural network models based on overlapping windows |
| US20230334284A1 (en) | Sparsifying vectors for neural network models based on overlapping windows |
| CN113159297B (en) | Neural network compression method, device, computer equipment and storage medium |
| US20220076127A1 (en) | Forcing weights of transformer model layers |
| US11886983B2 (en) | Reducing hardware resource utilization for residual neural networks |
| WO2024249165A1 (en) | Optimizing deep neural network models based on sparsification and quantization |
| CN118891617A (en) | Sparsification of vectors for neural networks based on overlapping windows |
| US12307372B2 (en) | Data-aware model pruning for neural networks |
| US11853717B2 (en) | Accelerating processing based on sparsity for neural network hardware processors |
| US20230376725A1 (en) | Model customization of transformers for improved efficiency |
| US11954448B2 (en) | Determining position values for transformer models |
| US20220405571A1 (en) | Sparsifying narrow data formats for neural networks |
| US20220138524A1 (en) | Training neural networks based on dual pipeline architectures |
| US11610120B2 (en) | Systems and methods for training a neural network |
| EP4189601A1 (en) | Stash balancing in model parallelism |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23707551; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 202380028157.2; Country of ref document: CN |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023707551; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2023707551; Country of ref document: EP; Effective date: 20241114 |