KR102746289B1

KR102746289B1 - Hybrid memory device and managing method therefor

Info

Publication number: KR102746289B1
Application number: KR1020220172820A
Authority: KR
Inventors: 김광선; 조성준; 홍정민
Original assignee: 포항공과대학교 산학협력단
Priority date: 2022-12-12
Filing date: 2022-12-12
Publication date: 2024-12-23
Anticipated expiration: 2042-12-12
Also published as: KR20240087273A

Abstract

A method for managing a hybrid memory according to one embodiment of the present invention includes: receiving a plurality of first data to be cached from a first memory, which is a nonvolatile storage class memory (SCM); storing tag information of the plurality of first data in aggregate in a first area within a cache line of a randomly accessible second memory; and sequentially storing the plurality of first data in a second area within the cache line of the second memory.

Description

HYBRID MEMORY DEVICE AND MANAGING METHOD THEREFOR

The present invention relates to a high bandwidth memory (HBM) device and a management method thereof, and more particularly to a high bandwidth memory device that uses heterogeneous memory, i.e., hybrid memory, to resolve the performance degradation of HBM caused by memory capacity limits, and to a management method thereof.

The material described in this section merely provides background information for the present embodiment and does not constitute prior art.

Memory devices are used in many types of electronic devices, such as computers, mobile phones, PDAs, data loggers, game consoles, and navigation devices. These electronic devices use various types of memory devices, for example NAND flash memory, NOR flash memory, SRAM (Static Random Access Memory), DRAM (Dynamic Random Access Memory), and PCM (Phase Change Memory). As operating speeds and cache line sizes increase, memory devices may be packaged in packages that conform to standardized specifications.

For example, a structure that can serve as main memory in a computing platform may include a plurality of DRAM memories mounted in parallel. In this case, a read/write request delivered to the parallel memory modules within the memory device may be split across those modules, so that each memory module serves a subset of the total cache line request. Such memory devices generally have specific intrinsic parameters associated with, for example, read/write timing, memory page size, and/or addressing protocol.

High Bandwidth Memory (HBM) was introduced for GPU applications that require large amounts of data, such as deep learning. HBM is used in high-end GPUs and provides larger capacity and higher memory bandwidth than conventional GDDR.

Demand paging can be used to run GPU applications that require large amounts of data within limited GPU memory. NVIDIA's Unified Memory is a representative example; it allows the required pages to be moved between the CPU and the GPU without much effort from the programmer.

To increase the total memory capacity available to GPUs, multiple GPUs are sometimes connected through high-speed links such as NVIDIA's NVLink.

The HBM used in existing high-end GPUs is smaller than the memory capacity required by important workloads such as deep learning and large-scale graph analysis, so data moves repeatedly between the CPU and the GPU and performance degrades.

Such conventional demand paging suffers significant performance degradation because of the long latency of page fault handling in the host driver and the limited PCIe bandwidth. To address this degradation, prefetch and eviction policies that exploit the regularity and locality of memory access patterns are being studied. In irregular workloads such as graph analysis, however, locality is low and access patterns are irregular, so existing prefetch and eviction policies are inefficient.

Connecting a plurality of GPUs through high-speed links has the problem that, as the total memory capacity grows, the GPU hardware cost increases linearly because of the additional cost of link interfaces and switches.

Korean Registered Patent No. KR 10-2231792, "Hybrid Memory Module and Operating Method Thereof" (published March 18, 2021)

An object of the present invention, which addresses the above problems, is to propose a hybrid memory-based high bandwidth memory device and a management method with increased memory capacity, without increasing the number of DRAM modules in the high bandwidth memory.

An object of the present invention is to propose a hybrid memory-based high bandwidth memory device and a management method that provide a larger memory capacity than a conventional high bandwidth memory while providing data access performance equivalent to that of a conventional high bandwidth memory composed only of DRAM modules.

An object of the present invention is to reduce the performance degradation caused by conventional memory over-subscription by using a high bandwidth memory that includes a heterogeneous memory stack.

An object of the present invention is to propose a hybrid memory-based high bandwidth memory device and a management method that provide sufficient memory capacity even for GPU-based applications requiring large amounts of data.

An object of the present invention is to propose a hybrid memory-based high bandwidth memory device and a management method that provide improved performance, even for GPU-based applications requiring large amounts of data, by effectively adjusting the access frequency according to the characteristics of each memory.

A method for managing a hybrid memory according to one embodiment for achieving the object of the present invention includes: receiving a plurality of first data to be cached from a first memory, which is a nonvolatile storage class memory (SCM); storing tag information of the plurality of first data in aggregate in a first area within a cache line of a randomly accessible second memory; and sequentially storing the plurality of first data in a second area within the cache line of the second memory.

The method for managing a hybrid memory according to one embodiment of the present invention may further include storing error control information of the plurality of first data in a third area within the cache line of the second memory, the third area being adjacent to the first area.

The first area may be sized so that it can be read in its entirety with a single access to the second memory.
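As a non-limiting illustration, the cache line layout described above can be sketched as follows. The sizes `TAG_BYTES`, `BLOCKS_PER_LINE`, `BLOCK_BYTES`, and `ACCESS_BYTES`, and the helper names `pack_line` and `read_tags`, are hypothetical values and names chosen only for this sketch; they are not taken from the specification. The third area for error control information is omitted for brevity, and unused slots simply read back as tag 0.

```python
# Hypothetical sizes, chosen only for illustration.
TAG_BYTES = 2          # bytes of tag information per cached block
BLOCKS_PER_LINE = 8    # number of first data items held in one cache line
BLOCK_BYTES = 32       # size of one cached block
ACCESS_BYTES = 32      # bytes returned by a single access to the second memory


def pack_line(entries):
    """Pack (tag, data) pairs into one cache line.

    All tags are gathered at the front (first area); the data blocks
    follow in the same order (second area).
    """
    if len(entries) > BLOCKS_PER_LINE:
        raise ValueError("too many blocks for one cache line")
    tag_area = bytearray(TAG_BYTES * BLOCKS_PER_LINE)
    data_area = bytearray(BLOCK_BYTES * BLOCKS_PER_LINE)
    for i, (tag, data) in enumerate(entries):
        if len(data) != BLOCK_BYTES:
            raise ValueError("each block must be BLOCK_BYTES long")
        tag_area[i * TAG_BYTES:(i + 1) * TAG_BYTES] = tag.to_bytes(TAG_BYTES, "little")
        data_area[i * BLOCK_BYTES:(i + 1) * BLOCK_BYTES] = data
    return bytes(tag_area) + bytes(data_area)


def read_tags(line):
    """Decode every tag of the line from the first area alone."""
    tag_area = line[:TAG_BYTES * BLOCKS_PER_LINE]
    return [
        int.from_bytes(tag_area[i:i + TAG_BYTES], "little")
        for i in range(0, len(tag_area), TAG_BYTES)
    ]


line = pack_line([(0x1A2B, b"\x01" * BLOCK_BYTES), (0x0003, b"\x02" * BLOCK_BYTES)])
tags = read_tags(line)
```

With these illustrative sizes the aggregated tag area occupies 16 bytes, which fits within one assumed 32-byte access, so all tags of the line can be checked before any data block is read.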

The method for managing a hybrid memory according to one embodiment of the present invention may further include providing the tag information of the plurality of first data stored in the first area to a host when the host requests information on the plurality of first data.

The method for managing a hybrid memory according to one embodiment of the present invention may further include the host storing, in a partial area of a level 2 cache, the tag information of the plurality of first data stored in the first area as identification information for the plurality of first data.

The first memory and the second memory may share a common address of a first number of bits. The tag information of the plurality of first data may include offset information, of a second number of bits, between the portion of a first address on the first memory (where the plurality of first data are stored) that remains after excluding the common address and a second address on the second memory.
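One possible reading of this addressing scheme is sketched below, purely as an assumption: the low `COMMON_BITS` bits of an address are shared by both memories, and the tag holds the remaining upper bits of the first-memory address as a `TAG_BITS`-wide offset. The bit widths and the function names `make_tag` and `scm_address` are hypothetical; the specification does not fix how the offset is encoded.

```python
COMMON_BITS = 20   # hypothetical "first number" of shared address bits
TAG_BITS = 4       # hypothetical "second number" of offset bits


def make_tag(scm_addr):
    """Split a first-memory (SCM) address into (common address, tag offset)."""
    common = scm_addr & ((1 << COMMON_BITS) - 1)
    tag = scm_addr >> COMMON_BITS
    if tag >= (1 << TAG_BITS):
        raise ValueError("address does not fit in COMMON_BITS + TAG_BITS")
    return common, tag


def scm_address(dram_addr, tag):
    """Rebuild the first-memory address from the second-memory address and the tag."""
    return (tag << COMMON_BITS) | dram_addr


original = (5 << COMMON_BITS) | 0x12345
common, tag = make_tag(original)
rebuilt = scm_address(common, tag)
```

Under this assumption only `TAG_BITS` bits per cached block need to be stored in the first area, because the shared bits are implied by where the block sits in the second memory.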

A method for managing a hybrid memory according to one embodiment of the present invention includes: when data that is stored in a nonvolatile first memory and corresponds to a first request from a host is not stored in a randomly accessible second memory, calculating a device sensitivity between the first memory and the second memory for the first request; and determining, based on the device sensitivity, whether to bypass the second memory and access the first memory directly for the first request. The device sensitivity is calculated based on a first cost of the first memory, which depends on the access type of the first request, and a second cost of the second memory for processing the first request.

In the step of calculating the device sensitivity, the device sensitivity may be divided by the number of consecutive data items included in the first request.

In the step of calculating the device sensitivity, the device sensitivity may be discretized to one of the values between a predetermined minimum value and an observed maximum value.
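The sensitivity calculation can be illustrated with the following minimal sketch. The cost table `SCM_COST`, the constant `DRAM_COST`, the use of a cost ratio to combine the two costs, and the linear quantization into a fixed number of levels are all assumptions made for illustration; the passage above states only which inputs the sensitivity depends on, that it is divided by the number of consecutive data items, and that it is discretized between a predetermined minimum and an observed maximum.

```python
# Hypothetical relative access costs; real values depend on the devices used.
SCM_COST = {"read": 3.0, "write": 10.0}   # first cost, by access type
DRAM_COST = 1.0                           # second cost


def raw_sensitivity(access_type, run_length):
    """Cost ratio of the two memories, divided by the number of consecutive items."""
    if run_length < 1:
        raise ValueError("run_length must be at least 1")
    return (SCM_COST[access_type] / DRAM_COST) / run_length


class Discretizer:
    """Map a value onto integer levels between a fixed minimum and the observed maximum."""

    def __init__(self, minimum, levels):
        self.minimum = minimum
        self.maximum = minimum   # grows as larger values are observed
        self.levels = levels

    def __call__(self, value):
        self.maximum = max(self.maximum, value)
        if self.maximum == self.minimum:
            return 0
        clamped = max(value, self.minimum)
        fraction = (clamped - self.minimum) / (self.maximum - self.minimum)
        return round(fraction * (self.levels - 1))


quantize = Discretizer(minimum=0.0, levels=4)
single_write = quantize(raw_sensitivity("write", 1))
burst_write = quantize(raw_sensitivity("write", 4))
```

In this sketch a single write maps to the highest level, while a four-item sequential write maps to a lower level, reflecting that the per-item cost of the first memory is amortized over consecutive data.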

The method for managing a hybrid memory according to one embodiment of the present invention may further include: when it is determined, based on the device sensitivity, not to bypass the second memory, calculating an affinity of the first request for the second memory; and determining whether the first request is to replace a part of a cache line of the second memory, based on a comparison between the affinity of the first request for the second memory and the affinity, for the second memory, of the victim data to be evicted from the second memory.

In the step of calculating the affinity of the first request for the second memory, the affinity may be determined based on the access frequency of the data corresponding to the first request and the device sensitivity of the first request.

In the step of calculating the affinity of the first request for the second memory, the affinity may be discretized to one of the values between a predetermined minimum value and an observed maximum value.

The method for managing a hybrid memory according to one embodiment of the present invention may further include, when it is determined that the first request is not to replace a part of the cache line of the second memory, bypassing the second memory and accessing the first memory directly to process the first request.

The method for managing a hybrid memory according to one embodiment of the present invention may further include, when it is determined that the first request is not to replace a part of the cache line of the second memory, reducing the affinity of the victim data for the second memory according to a predetermined condition, in response to the event in which the data corresponding to the first request is not stored in the second memory.
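The miss-handling flow described above can be summarized in the following sketch. The threshold `BYPASS_THRESHOLD`, the decay step `DECAY`, the product used to combine access frequency and sensitivity, and the choice to bypass when sensitivity is low are assumptions for illustration only; the specification leaves the exact formula and the "predetermined condition" open. The sensitivity is passed in as an already discretized integer.

```python
from dataclasses import dataclass

BYPASS_THRESHOLD = 1   # hypothetical: below this level, go straight to the first memory
DECAY = 1              # hypothetical decrement applied to a surviving victim


@dataclass
class Victim:
    """Replacement candidate in the second memory and its current affinity."""
    affinity: int


def affinity(access_count, sensitivity):
    """Assumed combination of access frequency and device sensitivity."""
    return access_count * sensitivity


def handle_miss(sensitivity, access_count, victim):
    """Decide how to serve a request that missed in the second memory."""
    # Step 1: bypass decision based on device sensitivity alone.
    if sensitivity < BYPASS_THRESHOLD:
        return "bypass"
    # Step 2: compare the request's affinity with the victim's affinity.
    if affinity(access_count, sensitivity) > victim.affinity:
        return "replace"
    # Step 3: the victim stays, but its affinity decays on every such miss,
    # so a block that is no longer reused is eventually replaced.
    victim.affinity = max(0, victim.affinity - DECAY)
    return "bypass"


victim = Victim(affinity=6)
first = handle_miss(sensitivity=0, access_count=5, victim=victim)
second = handle_miss(sensitivity=2, access_count=2, victim=victim)
third = handle_miss(sensitivity=2, access_count=3, victim=victim)
```

The decay in step 3 is the reason the third request in the usage lines succeeds in replacing the victim: the earlier unsuccessful miss has already lowered the victim's affinity.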

A hybrid memory device according to one embodiment of the present invention includes: a first memory, which is a nonvolatile storage class memory (SCM); a randomly accessible second memory; and a memory controller. The memory controller stores tag information of a plurality of first data to be cached from the first memory in aggregate in a first area within a cache line of the second memory, and sequentially stores the plurality of first data in a second area within the cache line of the second memory.

The memory controller may store error control information of the plurality of first data in a third area within the cache line of the second memory, the third area being adjacent to the first area.

The memory controller may provide the tag information of the plurality of first data stored in the first area to a host when the host requests information on the plurality of first data.

The memory controller may provide the tag information of the plurality of first data to the host so that the host stores, in a partial area of a level 2 cache, the tag information of the plurality of first data stored in the first area as identification information for the plurality of first data.

The first memory and the second memory may share a common address of a first number of bits. The tag information of the plurality of first data may include offset information, of a second number of bits, between the portion of a first address on the first memory (where the plurality of first data are stored) that remains after excluding the common address and a second address on the second memory.

According to an embodiment of the present invention, using heterogeneous memory within a high bandwidth memory makes it possible to increase memory capacity without increasing the number of DRAM modules, and to provide data access performance equivalent to that of a memory composed solely of DRAM modules.

According to an embodiment of the present invention, using a high bandwidth memory that includes a heterogeneous memory stack (HMS) can reduce the performance degradation caused by conventional memory over-subscription.

According to an embodiment of the present invention, performance can be improved by providing sufficient memory capacity even for GPU applications that require large amounts of data, such as deep learning and large-scale graph analysis.

According to an embodiment of the present invention, performance can be improved, even for GPU applications that require large amounts of data, by reducing accesses to the memory with a high access cost (the slow memory).

FIG. 1 is a conceptual diagram illustrating a hybrid memory-based high bandwidth memory device according to one embodiment of the present invention.
FIG. 2 is a conceptual diagram illustrating a hybrid memory device and its peripheral devices according to one embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method for managing a hybrid memory device according to one embodiment of the present invention.
FIG. 4 is a conceptual diagram illustrating the CTC concept, a scheme for storing tags, in relation to a hybrid memory device according to one embodiment of the present invention.
FIG. 5 is a conceptual diagram illustrating the ATIC structure, a scheme for storing tags in a hybrid memory device according to one embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method for managing a hybrid memory device according to one embodiment of the present invention.
FIG. 7 is a flowchart illustrating a portion of the method of FIG. 6 in more detail.
FIG. 8 is a conceptual diagram illustrating a process for calculating device affinity in a hybrid memory device according to one embodiment of the present invention.
FIG. 9 is a conceptual diagram illustrating an example of a memory controller, memory management device, or computing system within a generalized hybrid memory device capable of performing at least a portion of the processes of FIGS. 1 through 8.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

In the embodiments of the present application, "at least one of A and B" may mean "at least one of A or B" or "at least one of combinations of one or more of A and B." Likewise, in the embodiments of the present application, "one or more of A and B" may mean "one or more of A or B" or "one or more of combinations of one or more of A and B."

When a component is said to be "connected" or "coupled" to another component, it should be understood that it may be directly connected or coupled to that other component, but that intervening components may also be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of a feature, number, step, operation, component, part, or combination thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Meanwhile, technology known before the filing date of this application may, where necessary, be included as part of the configuration of the claimed invention, and is described in this specification to the extent that it does not obscure the gist of the present invention. However, in describing the configuration of the claimed invention, a detailed description of matters that were known before the filing date and that a person skilled in the art would readily understand could obscure the gist of the present invention, so excessively detailed descriptions of known technology are omitted.

For example, the technique of stacking DRAM and storage class memory (SCM) and using TSVs (through-silicon vias) to connect the corresponding terminals of the DRAM and the SCM may rely on technology known before the filing of the present invention, and at least some of this known technology may be applied as elemental technology needed to practice the present invention.

However, the present invention is not intended to claim rights to such known technology, and the content of the known technology may be included as part of the present invention to the extent that it does not depart from the gist of the present invention.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted.

FIG. 1 is a conceptual diagram illustrating a hybrid memory-based high bandwidth memory device according to one embodiment of the present invention.

In the embodiment of the present invention illustrated in FIG. 1, a high bandwidth memory (HBM) (130) based on a heterogeneous memory stack (HMS) may be configured by integrating, through 3D stacking, an SCM (storage class memory) stack corresponding to a first memory (110) and a DRAM stack corresponding to a second memory (120) into a single chip.

Communication and interaction between the HBM (130) of FIG. 1 and external elements, including the GPU (200), may follow the basic configuration of existing high bandwidth memory. Referring to FIG. 1, the HBM (130) may be implemented by replacing the DRAM dies corresponding to the upper ranks of an existing HBM with SCM dies.

To mitigate the performance degradation caused by the SCM in the heterogeneous memory stack-based HBM (130) of FIG. 1, the DRAM corresponding to the second memory (120) may be understood to operate as a cache for the SCM corresponding to the first memory (110). Because SCM generally has a higher memory density than DRAM, the heterogeneous memory stack-based HBM (130) illustrated in FIG. 1 can provide more memory capacity than an existing HBM.

In general, HBM is a technology that aims to reduce the heat and energy consumption that arise when data is transferred by serial communication under the existing DDRx standards, and to reduce data transfer time and energy consumption by transferring high-bandwidth data at once over a dense, wide bus.

In general, HBM is much faster than conventional DDRx-based memory technology while consuming less power and occupying less space. It is attracting particular attention in resource-intensive high-performance computing (HPC) and artificial intelligence (AI) applications. However, its use is limited by its high price, by thermal management issues, and by the occasional need to modify applications.

In general, HBM performance is being improved by widening the bus to increase bandwidth and by stacking dies vertically with a TSV structure to shorten the physical time required to move data.

According to one embodiment of the present invention, the hybrid memory device can be used as GPU memory when executing GPU applications that require large amounts of data, such as deep learning and large-scale graph analysis.

Specifically, according to one embodiment of the present invention, a heterogeneous memory stack (HMS) is used as the memory of the GPU (200) when executing GPU applications that require large amounts of data, such as deep learning and large-scale graph analysis, which reduces the performance degradation caused by memory oversubscription compared with the prior art.

도 2는 본 발명의 일 실시예에 따른 하이브리드 메모리 장치(100) 및 그 주변 장치 간의 상호작용을 도시하는 개념도이다.FIG. 2 is a conceptual diagram illustrating the interaction between a hybrid memory device (100) and its peripheral devices according to one embodiment of the present invention.

어플리케이션 레이어(300)는 주어진 태스크를 GPU(200)에 할당하여 GPU(200)에서 할당된 태스크를 수행할 수 있도록 제어할 수 있다. The application layer (300) can assign a given task to the GPU (200) and control the GPU (200) so that it performs the assigned task.

어플리케이션 레이어(300)는 별도의 프로세서를 포함할 수 있다. GPU(200)는 어플리케이션 레이어(300)와 별도로 적어도 하나 이상의 프로세서를 포함할 수 있다. The application layer (300) may include a separate processor. The GPU (200) may include at least one processor separately from the application layer (300).

어플리케이션 레이어(300)의 프로세서에 대응하는 시스템 메모리/스토리지(320)가 제공될 수 있다. 본 발명의 일 실시예에 따른 메모리 컨트롤러(140)는 시스템 메모리/스토리지(320)로부터 읽은 데이터를 고대역폭 메모리(130) 내의 제1 메모리(110) 또는 제2 메모리(120)에 저장할 수 있다. 이때 메모리 컨트롤러(140)는 일종의 통신 인터페이스와 유사한 역할을 부분적으로 수행할 수도 있다. A system memory/storage (320) corresponding to the processor of the application layer (300) may be provided. A memory controller (140) according to an embodiment of the present invention may store data read from the system memory/storage (320) in a first memory (110) or a second memory (120) within a high bandwidth memory (130). In this case, the memory controller (140) may also partially serve as a kind of communication interface.

본 발명의 일 실시예에 따른 하이브리드 메모리 장치(100)는, 비휘발성 스토리지 클래스 메모리(SCM)인 제1 메모리(110); 랜덤 액세스가 가능한 제2 메모리(120); 및 메모리 컨트롤러(140)를 포함한다. 메모리 컨트롤러(140)는, 제1 메모리(110)로부터 캐슁(caching)할 복수개의 제1 데이터의 태그 정보를 제2 메모리(120)의 캐쉬라인(cacheline) 내의 제1 영역에 집합적으로(aggregately) 저장하고, 복수개의 제1 데이터를 제2 메모리(120)의 캐쉬라인 내의 제2 영역에 순차적으로 저장한다. A hybrid memory device (100) according to one embodiment of the present invention includes a first memory (110) which is a nonvolatile storage class memory (SCM); a second memory (120) which enables random access; and a memory controller (140). The memory controller (140) collectively stores tag information of a plurality of first data to be cached from the first memory (110) in a first area within a cache line of the second memory (120), and sequentially stores the plurality of first data in a second area within a cache line of the second memory (120).

고대역폭 메모리(HBM)(130)은 제1 메모리(110) 및 제2 메모리(120)를 포함하는 이기종 적층 메모리를 의미할 수 있다. HBM(130) 및 메모리 컨트롤러(140)를 포함한 구조가 하이브리드 메모리 장치(100)로 불릴 수 있다. 다만 이는 본 발명의 일 실시예에 불과한 것으로서, 본 발명의 다른 실시예에서는 메모리 컨트롤러(140)는 하이브리드 메모리 장치(100) 내에 포함될 수도 있고, 하이브리드 메모리 장치(100) 및 GPU(200) 사이에 배치되어 하이브리드 메모리 장치(100)의 메모리 관리 장치로서 기능할 수도 있다. High bandwidth memory (HBM) (130) may refer to a heterogeneous stacked memory including a first memory (110) and a second memory (120). A structure including the HBM (130) and the memory controller (140) may be called a hybrid memory device (100). However, this is only one embodiment of the present invention, and in another embodiment of the present invention, the memory controller (140) may be included in the hybrid memory device (100) or may be arranged between the hybrid memory device (100) and the GPU (200) to function as a memory management device of the hybrid memory device (100).

메모리 컨트롤러(140)는, 제1 영역에 인접한 제2 메모리(120)의 캐쉬라인 내의 제3 영역에 복수개의 제1 데이터의 오류 제어 정보를 저장할 수 있다. 오류 제어 정보는 오류 제어 코드(ECC, Error Control Code)를 포함할 수 있다. 실시예에 따라서는 ECC는 Error Correction Code로 불리기도 한다. The memory controller (140) can store error control information of a plurality of first data in a third area within a cache line of a second memory (120) adjacent to the first area. The error control information can include an error control code (ECC). In some embodiments, ECC is also called Error Correction Code.

메모리 컨트롤러(140)는, 호스트(host)로부터 복수개의 제1 데이터에 대한 정보가 요청될 때 제1 영역에 저장된 복수개의 제1 데이터에 대한 태그 정보를 호스트로 제공할 수 있다. The memory controller (140) can provide tag information on a plurality of first data stored in a first area to the host when information on a plurality of first data is requested from the host.

메모리 컨트롤러(140)는, 호스트(host)가 레벨 2 캐쉬(cache)(220)의 일부 영역에 제1 영역에 저장된 복수개의 제1 데이터에 대한 태그 정보를 복수개의 제1 데이터에 대한 식별 정보로서 저장하도록 복수개의 제1 데이터에 대한 태그 정보를 호스트로 제공할 수 있다. The memory controller (140) can provide tag information for a plurality of first data to the host so that the host can store tag information for a plurality of first data stored in a first area in some area of a level 2 cache (220) as identification information for the plurality of first data.

L2 cache(220)는 일반적으로 GPU(200) 내에 배치되거나, GPU(200)와 매우 가깝게 배치될 수 있다. 따라서 GPU(200) 측에서 L2 cache(220)에 접근하는 비용은 하이브리드 메모리 장치(100)에 접근하는 비용보다 현저하게 작을 수 있다. The L2 cache (220) is typically placed within the GPU (200) or very close to the GPU (200). Therefore, the cost of accessing the L2 cache (220) from the GPU (200) side may be significantly lower than the cost of accessing the hybrid memory device (100).

일반적으로 HBM이 타겟팅하는 데이터 센터, 서버, 인공신경망 연산을 위한 메모리-데이터 연산은 지속적으로 더 높은 대역폭과 더 큰 메모리 용량을 요구하고 있다. Memory-to-data operations for data centers, servers, and artificial neural network operations, which HBM typically targets, are continually demanding higher bandwidth and larger memory capacities.

일반적으로 종래 기술의 HBM은 대부분 DRAM 기반으로 구성되어 있으나, 메모리-데이터 연산의 특성 상 규칙적인 액세스가 가능한 경우가 빈번함을 가정하여 본 발명의 일 실시예에서는 메모리 용량을 증가시키기 위하여 도 1의 HBM(130) 구조에서 DRAM의 일부를 스토리지 클래스 메모리(SCM, Storage-Class Memory)로 대체하는 방안을 제안한다. In general, most of the HBMs of the prior art are configured based on DRAM. However, assuming that regular access is frequently possible due to the characteristics of memory-data operations, one embodiment of the present invention proposes a method of replacing a portion of the DRAM in the HBM (130) structure of FIG. 1 with a storage-class memory (SCM) in order to increase memory capacity.

이때 SCM은 DRAM보다 경제성을 가지는 대신 느린 응답속도, 큰 쓰기 비용 등의 단점을 함께 가지고 있으므로, overall 성능을 향상시키기 위해서는 SCM의 동작이 드러나지 않도록 DRAM 파트와 SCM 파트 간의 데이터 저장/액세스를 효과적으로 관리할 필요가 있다. At this time, SCM has disadvantages such as slow response speed and large write cost while being more economical than DRAM, so in order to improve overall performance, it is necessary to effectively manage data storage/access between DRAM part and SCM part so that SCM operation is not revealed.

본 발명은 DRAM-SCM이 결합된 이기종 적층 메모리(HMS) 기반 하이브리드 메모리 장치(100) 내의 HBM(130)으로서 DRAM 파트와 SCM 파트 간의 데이터 저장/액세스를 효과적으로 관리하는 점을 특징으로 한다. The present invention is characterized in that the HBM (130) in the hybrid memory device (100), which is based on a heterogeneous memory stack (HMS) combining DRAM and SCM, effectively manages data storage/access between the DRAM part and the SCM part.

일반적으로 하이브리드 메모리 모듈이라 함은 주요한 데이터 기억장치로서 휘발성 메모리(예를 들어, 동적 임의 접근 기억장치(DRAM; dynamic random-access memory))와 비휘발성 메모리(예를 들어, 플래시 메모리(flash memory))를 모두 포함하는 메모리 모듈을 의미한다. 하이브리드 메모리 모듈의 하나의 예는 DRAM과 플래시 메모리를 통합한 하이브리드 듀얼 인라인 메모리 모듈(DIMM; dual in-line memory module)이다. 일반적인 구성에서는, DRAM은 플래시 메모리에 저장된 데이터를 위한 캐쉬(cache) 메모리로서 사용될 수 있다. DRAM 캐쉬에 빠르게 접근하기 위해서, DRAM 캐쉬의 메타데이터(metadata)는 하이브리드 메모리 모듈의 정적 임의 접근 기억장치(SRAM; static random-access memory)에 저장될 수 있다. 이 때 SRAM은 DRAM보다 더욱 높은 레벨(CPU에 가까운 레벨)의 캐쉬로 기능할 수 있다. Generally, a hybrid memory module refers to a memory module that includes both volatile memory (e.g., dynamic random-access memory (DRAM)) and non-volatile memory (e.g., flash memory) as the primary data storage devices. An example of a hybrid memory module is a hybrid dual in-line memory module (DIMM) that integrates DRAM and flash memory. In a typical configuration, DRAM can be used as a cache memory for data stored in flash memory. To quickly access the DRAM cache, metadata of the DRAM cache can be stored in the static random-access memory (SRAM) of the hybrid memory module. In this case, the SRAM can function as a cache at a higher level (closer to the CPU) than DRAM.

그러나, DRAM 캐쉬의 메타데이터를 위해 요구되는 저장 공간의 크기는 SRAM의 사용 가능한 저장 공간의 크기보다 클 수 있다. 하이브리드 DIMM에 집적된 SRAM의 기억 용량은 그것의 가격 때문에 상대적으로 작게 유지될 수 있다. SRAM의 제한된 저장 공간의 크기로 인하여, DRAM 캐쉬의 전체 메타데이터는 SRAM에 들어가지 못하고, 결과적으로 SRAM에 들어가지 못하는 메타데이터의 남은 부분은 DRAM에 저장되어야 한다. 이러한 경우, DRAM에 저장된 메타데이터에 대한 느린 접근 속도는 데이터에 접근하는 경우 성능의 저하를 야기할 수 있다.However, the storage space required for metadata in the DRAM cache may be larger than the available storage space of the SRAM. The memory capacity of the SRAM integrated in the hybrid DIMM may be kept relatively small due to its price. Due to the limited storage space size of the SRAM, the entire metadata of the DRAM cache cannot fit into the SRAM, and as a result, the remaining part of the metadata that cannot fit into the SRAM must be stored in the DRAM. In this case, the slow access speed to the metadata stored in the DRAM may cause a performance degradation when accessing the data.

이러한 문제점에 대한 해결책으로, 몇몇 접근 방식이 제안되었다. 첫 번째 접근 방식은 SRAM에 저장되는 메타데이터의 크기를 줄이는 것이다. 예를 들어, 메타데이터의 크기는 SRAM에 저장되는 캐쉬 라인(line)들의 개수를 줄이거나 캐쉬라인의 크기를 줄임으로써 줄일 수 있다. 그러나 이러한 경우, 줄어든 캐쉬라인의 크기 및/또는 줄어든 한번에 저장되는 캐쉬라인의 개수는 히트율(hit rate)에 부정적인 영향을 줄 수 있고 캐쉬 미스(miss)의 경우 플래시 메모리로부터 복수의 페이지(page) 읽기가 필요할 수 있다. 다른 예에서는, 캐쉬 연관성(cache associativity)은 태그 비트들(tag bits)과 교체 비트들(replacement bits)을 줄임으로써 감소될 수 있지만, 이러한 접근 방식도 히트율에 부정적인 영향을 줄 수 있다. 다른 예에서는, 교체 정책은 교체 비트들을 이용하지 않고 교체될 수 있다. To address these issues, several approaches have been proposed. The first approach is to reduce the size of the metadata stored in the SRAM. For example, the metadata size can be reduced by reducing the number of cache lines stored in the SRAM or by reducing the size of the cache lines. However, in these cases, the smaller cache line size and/or the smaller number of cache lines stored at one time may lower the hit rate, and a cache miss may require multiple page reads from the flash memory. In another example, the cache associativity can be reduced by reducing the tag bits and replacement bits, but this approach may also lower the hit rate. In yet another example, the replacement policy can be swapped for one that does not use replacement bits.

이러한 종래 기술의 예시에서도 테스트 결과들은 메타데이터 크기를 줄이기 위한 이러한 효과들의 조합이, 요구되는 메타데이터 크기 일부만을 감소시키는 점이 드러나고 있다. 따라서, 메타데이터를 저장하기 위한 SRAM의 제한된 크기에 대한 문제점은 플래시 메모리의 데이터 기억 용량과 DRAM 캐쉬의 크기가 증가함에 따라 계속될 수 있다.Even in these prior art examples, the test results show that the combination of these effects to reduce metadata size only reduces a portion of the required metadata size. Therefore, the problem of limited size of SRAM for storing metadata may continue as the data storage capacity of flash memory and the size of DRAM cache increase.

본 발명은 이러한 종래 기술들과 차별화되는 해결 방안으로서, 한 번의 DRAM cache probe의 대상의 크기를 줄이지 않고도 cache probe의 빈도를 줄일 수 있는 방법을 제안한다. 즉, cacheline의 크기 또는 L2 cache(220)에 저장되는 cacheline들의 개수를 줄이지 않고도 cache probe의 빈도를 줄임으로써 overall 메모리 액세스 성능을 개선하는 하이브리드 메모리 및 그 관리 방법을 제안한다.The present invention proposes a method for reducing the frequency of cache probes without reducing the size of a target of a single DRAM cache probe as a solution that is differentiated from such conventional technologies. That is, the present invention proposes a hybrid memory and its management method that improves the overall memory access performance by reducing the frequency of cache probes without reducing the size of a cacheline or the number of cachelines stored in an L2 cache (220).

본 발명의 일 실시예에 따른 하이브리드 메모리 및 관리 방법은, SCM의 용량과 DRAM 캐쉬의 크기가 증가하는 경우에도 적용될 수 있으며, 여전히 cache probe의 빈도를 줄이고 overall 메모리 액세스 성능을 개선할 수 있다. The hybrid memory and management method according to one embodiment of the present invention can be applied even when the capacity of the SCM and the size of the DRAM cache increase, and can still reduce the frequency of cache probes and improve the overall memory access performance.

본 발명의 일 실시예에 따른 HBM(130)을 포함하는 하이브리드 메모리 장치(100)는, 이기종 적층 메모리에 포함되는 DRAM과 SCM의 특성 파라미터에 기반하여 효과적으로 GPU(200)에 DRAM-like 성능을 제공할 수 있다. 이때 본 발명의 일 실시예에 따른 하이브리드 메모리 장치(100) 및 관리 방법은 DRAM 캐쉬 바이패스 메커니즘을 이용하여 GPU(200) 측에서 바라본 SCM에 의한 메모리 접근 성능 저하를 최소화하고 DRAM으로만 이루어진 종래 기술의 HBM과 유사한 메모리 접근 대역폭 등 overall 메모리 접근 성능을 제공할 수 있다. A hybrid memory device (100) including an HBM (130) according to one embodiment of the present invention can effectively provide DRAM-like performance to a GPU (200) based on characteristic parameters of a DRAM and an SCM included in a heterogeneous stacked memory. At this time, the hybrid memory device (100) and management method according to one embodiment of the present invention can minimize the degradation of memory access performance due to the SCM as viewed from the GPU (200) side by using a DRAM cache bypass mechanism and provide overall memory access performance, such as memory access bandwidth, similar to that of a conventional HBM composed only of DRAM.

기존의 DRAM만 포함하는, CPU에 최적화된 DRAM 캐쉬의 바이패스 연구들과 달리 본 발명의 HBM(130)을 포함하는 하이브리드 메모리 장치(100)는 GPU 워크로드의 메모리 접근 패턴과 긴 쓰기 시간 등의 SCM의 특성을 반영한 DRAM 캐쉬 바이패스 메커니즘을 사용하여 메모리 대역폭을 효율적으로 사용할 수 있다. 본 발명의 하이브리드 메모리 장치(100)의 DRAM 캐쉬 바이패스 메커니즘에는 메모리 접근의 spatial 로컬리티(locality)와 접근 타입(읽기 또는 쓰기)을 반영한 device-sensitivity score와 페이지의 접근 빈도를 device-sensitivity score에 곱한 DRAM-affinity score 값이 사용될 수 있다. Unlike studies on bypass of CPU-optimized DRAM caches that include only conventional DRAMs, the hybrid memory device (100) including HBM (130) of the present invention can efficiently use memory bandwidth by using a DRAM cache bypass mechanism that reflects the memory access pattern of GPU workload and the characteristics of SCM such as long write time. The DRAM cache bypass mechanism of the hybrid memory device (100) of the present invention may use a device-sensitivity score that reflects the spatial locality of memory access and the access type (read or write) and a DRAM-affinity score value that multiplies the access frequency of a page by the device-sensitivity score.

본 발명의 일 실시예에 따른 하이브리드 메모리 장치(100)의 DRAM 캐쉬 바이패스 메커니즘은 SCM-aware DRAM 캐쉬 바이패스 정책이라고 이해될 수 있다. DRAM과 SCM 간의 성능 차이, 캐쉬 미스된 접근의 타입, 캐쉬 미스된 접근의 전체적인 크기가 다차원적으로 고려되어 DRAM 캐쉬를 바이패스할 지 여부가 결정될 수 있다. The DRAM cache bypass mechanism of the hybrid memory device (100) according to one embodiment of the present invention may be understood as an SCM-aware DRAM cache bypass policy. Whether to bypass the DRAM cache may be determined by multidimensionally considering the performance difference between the DRAM and the SCM, the type of cache-missed access, and the overall size of the cache-missed access.

본 발명의 일 실시예에 따른 하이브리드 메모리 장치(100)의 DRAM 캐쉬 바이패스 메커니즘은 DRAM 캐쉬 미스가 발생했을 때 실행되며 캐쉬 미스된 접근에 대해 device-sensitivity score를 계산하여 메모리의 채널 별로 관리되는 평균 device-sensitivity score와 비교하는 첫 번째 단계가 진행될 수 있다. 비교 과정은 각 score를 discretize한 후 진행될 수 있다. 캐쉬 미스된 접근이 채널 평균값보다 클 경우만 두 번째 단계로 진행되고 그렇지 않은 경우 바로 DRAM 캐쉬를 바이패스할 수도 있다. 두 번째 단계에서는 캐쉬 미스된 접근의 device-sensitivity score에 해당 페이지의 접근 빈도를 곱하여 DRAM-affinity score를 계산하고 discretize 된 DRAM-affinity score를 빅팀(victim) 후보의 discretize 된 DRAM-affinity score와 비교할 수 있다. 캐쉬 미스된 접근의 값이 빅팀(victim) 후보보다 클 경우만 DRAM 캐쉬 라인 교체가 발생하고 그 외의 경우 DRAM 캐쉬를 바이패스할 수 있다. The DRAM cache bypass mechanism of the hybrid memory device (100) according to one embodiment of the present invention is executed when a DRAM cache miss occurs. In a first step, a device-sensitivity score may be calculated for the cache-missed access and compared with the average device-sensitivity score maintained for each memory channel. The comparison may be performed after discretizing each score. The mechanism may proceed to the second step only if the score of the cache-missed access is greater than the channel average; otherwise the DRAM cache may be bypassed immediately. In the second step, a DRAM-affinity score may be calculated by multiplying the device-sensitivity score of the cache-missed access by the access frequency of the corresponding page, and the discretized DRAM-affinity score may be compared with the discretized DRAM-affinity score of a victim candidate. DRAM cache line replacement may occur only if the score of the cache-missed access is greater than that of the victim candidate; otherwise the DRAM cache may be bypassed.
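As a non-limiting illustration, the two-step bypass decision described above may be sketched as follows. The names `MissedAccess`, `discretize`, and `should_bypass`, as well as the bucket width `step`, are hypothetical and are not taken from the embodiment; the sketch only assumes that the scores are available as numbers and that discretization maps each score onto an integer bucket.

```python
from dataclasses import dataclass


def discretize(score: float, step: float = 1.0) -> int:
    """Map a raw score onto an integer bucket (the bucket width is an assumption)."""
    return int(score // step)


@dataclass
class MissedAccess:
    device_sensitivity: float  # device-sensitivity score of the cache-missed access
    page_access_count: int     # access frequency of the page being accessed


def should_bypass(miss: MissedAccess,
                  channel_avg_sensitivity: float,
                  victim_affinity: float,
                  step: float = 1.0) -> bool:
    """Return True to bypass the DRAM cache, False to replace the victim line."""
    # Step 1: compare the discretized device-sensitivity score with the
    # per-channel average; only a strictly greater score moves on to step 2.
    if discretize(miss.device_sensitivity, step) <= discretize(channel_avg_sensitivity, step):
        return True

    # Step 2: DRAM-affinity score = device-sensitivity score x page access frequency,
    # compared (after discretization) with the victim candidate's score.
    affinity = miss.device_sensitivity * miss.page_access_count
    if discretize(affinity, step) > discretize(victim_affinity, step):
        return False  # replace the victim line in the DRAM cache
    return True
```

In this sketch an access whose discretized score merely equals the channel average is bypassed, which follows the "greater than" wording of the description.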

이와 같은 본 발명의 하이브리드 메모리 및 관리 방법의 구체적인 구성이 도 3 이하의 실시예에서 설명된다. The specific configuration of the hybrid memory and management method of the present invention is described in the embodiments below in FIG. 3.

도 3은 본 발명의 일 실시예에 따른 하이브리드 메모리 장치(100)의 관리 방법을 도시하는 동작 흐름도이다.FIG. 3 is a flowchart illustrating a management method of a hybrid memory device (100) according to one embodiment of the present invention.

본 발명의 일 실시예에 따른 하이브리드 메모리의 관리 방법은, 비휘발성 스토리지 클래스 메모리(SCM)인 제1 메모리(110)로부터 캐슁(caching)할 복수개의 제1 데이터를 수신하는 단계(S410); 복수개의 제1 데이터의 태그 정보를 랜덤 액세스가 가능한 제2 메모리(120)의 캐쉬라인(cacheline) 내의 제1 영역에 집합적으로(aggregately) 저장하는 단계(S420); 및 복수개의 제1 데이터를 제2 메모리(120)의 캐쉬라인 내의 제2 영역에 순차적으로 저장하는 단계(S430)를 포함한다. A method for managing a hybrid memory according to one embodiment of the present invention includes a step (S410) of receiving a plurality of first data to be cached from a first memory (110) which is a non-volatile storage class memory (SCM); a step (S420) of collectively storing tag information of the plurality of first data in a first area within a cache line of a second memory (120) which enables random access; and a step (S430) of sequentially storing the plurality of first data in a second area within a cache line of the second memory (120).

본 발명의 일 실시예에 따른 하이브리드 메모리의 관리 방법은, 제1 영역에 인접한 제2 메모리(120)의 캐쉬라인 내의 제3 영역에 복수개의 제1 데이터의 오류 제어 정보를 저장하는 단계(S440)를 더 포함할 수 있다. A method for managing a hybrid memory according to one embodiment of the present invention may further include a step (S440) of storing error control information of a plurality of first data in a third area within a cache line of a second memory (120) adjacent to the first area.

제1 영역은 제2 메모리(120)에 대한 한 번의 접근으로 모두 읽을 수 있는 크기의 영역일 수 있다. The first area may be an area of a size that can be read entirely with a single access to the second memory (120).

제1 메모리(110)와 제2 메모리(120)는 제1 개수의 비트의 공통 주소를 공유할 수 있고, 복수개의 제1 데이터의 태그 정보는 복수개의 제1 데이터가 저장되는 제1 메모리(110) 상의 제1 주소 중 공통 주소를 제외한 제1 주소와 제2 메모리(120) 상의 제2 주소 간의 제2 개수의 비트의 오프셋 정보를 포함할 수 있다. The first memory (110) and the second memory (120) can share a common address of a first number of bits, and the tag information of the plurality of first data can include offset information of a second number of bits between a first address excluding the common address among the first addresses on the first memory (110) where the plurality of first data are stored and a second address on the second memory (120).

도 4는 본 발명의 일 실시예에 따른 하이브리드 메모리 장치(100)와 관련되어, 태그가 저장되는 방식인 CTC 개념을 도시하는 개념도이다.FIG. 4 is a conceptual diagram illustrating the CTC concept, which is a method for storing tags, in relation to a hybrid memory device (100) according to one embodiment of the present invention.

본 발명의 일 실시예에 따른 하이브리드 메모리 장치(100)에서는 meta-data 정보 tagging을 L2 cache(220)(일반적으로 SRAM)에 저장할 수 있다. In a hybrid memory device (100) according to one embodiment of the present invention, metadata tag information can be stored in the L2 cache (220) (typically SRAM).

본 발명의 일 실시예에 따른 하이브리드 메모리 장치(100)에서는 DRAM cache probe로 인한 트래픽을 저감하기 위해서 L2 cache(220) 공간의 일부 영역에 DRAM cache tags를 저장하는 CTC (configurable Tag Cache) 개념을 제안할 수 있다. In a hybrid memory device (100) according to one embodiment of the present invention, a CTC (configurable Tag Cache) concept that stores DRAM cache tags in a part of the L2 cache (220) space may be proposed to reduce traffic caused by DRAM cache probes.

도 2의 실시예에, 및 후술할 도 6 내지 도 8의 실시예를 통하여 설명되는 DRAM 캐쉬 바이패스 메커니즘을 사용해도 발생하는 과도한 DRAM 캐쉬 probe 트래픽을 줄이기 위해 본 발명의 일 실시예에서는 DRAM 캐쉬의 태그들을 적은 비용으로 온 칩에 저장하는 Configurable Tag Cache (CTC)가 사용될 수 있다. 이를 위해 본 발명의 일 실시예에서는 GPU(200)의 L2 캐쉬(220)의 일부 way를 CTC로 사용할 수 있으며 사용하는 way는 유동적으로 조정될 수 있다. In order to reduce excessive DRAM cache probe traffic that occurs even when using the DRAM cache bypass mechanism described in the embodiment of FIG. 2 and through the embodiments of FIGS. 6 to 8 described below, a Configurable Tag Cache (CTC) that stores tags of a DRAM cache on-chip at a low cost may be used in one embodiment of the present invention. To this end, in one embodiment of the present invention, some ways of the L2 cache (220) of the GPU (200) may be used as the CTC, and the ways used may be flexibly adjusted.

도 4의 CTC는 GPU(200)의 L2 cache(220)의 cache residency control 기능을 이용하여 구현될 수 있다. 일반적으로 알려진 L2 cache residency control 기능은 cache에서 저장하거나 제어할 데이터에 대한 관리 권한을 제공하는 기능이다. 다만 본 발명의 일 실시예에 따른 CTC는 DRAM cache에 대한 cache probe를 위한 tags를 집합적으로 저장한다는 점에서 종래 기술과는 차별화될 수 있다. The CTC of FIG. 4 can be implemented by utilizing the cache residency control function of the L2 cache (220) of the GPU (200). The generally known L2 cache residency control function is a function that provides management authority for data to be stored or controlled in the cache. However, the CTC according to one embodiment of the present invention can be differentiated from the conventional technology in that it collectively stores tags for cache probe for the DRAM cache.

도 4를 참조하면 L2 캐쉬(220)의 일부 영역(222)에는 L2 cache 본래의 동작을 위한 데이터만이 저장될 수 있다. Referring to Fig. 4, only data for the original operation of the L2 cache can be stored in some areas (222) of the L2 cache (220).

L2 캐쉬(220)의 나머지 일부 영역(224)에는 L2 cache 본래의 동작 외에 Tags cache로서 필요한 tags를 저장할 수 있는 CTC 영역이 설정될 수 있다. 이때 영역(224)의 크기는 필요에 따라 조정될 수 있다. In the remaining part of the L2 cache (220) (224), a CTC area that can store necessary tags as a Tags cache in addition to the original operation of the L2 cache can be set. At this time, the size of the area (224) can be adjusted as needed.

예를 들어, 영역(222)에서는 L2 cache 본래의 동작을 위해 128 Byte 크기로 구분된 공간이 설정되고, 영역(224)에서는 tags를 저장하기 위하여 32 Byte 크기로 구분된 공간이 설정될 수 있다. For example, in area (222), a space divided into 128 bytes in size may be set for the original operation of the L2 cache, and in area (224), a space divided into 32 bytes in size may be set for storing tags.
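The adjustable split between the ordinary L2 region (222) and the CTC region (224) may be illustrated with the following sketch. The 128-byte and 32-byte granularities come from the example above; the class `L2Partition`, its fields, and the assumption that each CTC way holds 128-byte lines subdivided into 32-byte tag sectors are hypothetical and serve only to show how the capacities change when the number of CTC ways is adjusted.

```python
from dataclasses import dataclass

L2_LINE_BYTES = 128    # granularity of the ordinary L2 region (222), from the example
TAG_SECTOR_BYTES = 32  # granularity of the CTC region (224), from the example


@dataclass(frozen=True)
class L2Partition:
    num_sets: int  # hypothetical number of sets in the L2 cache
    num_ways: int  # hypothetical total associativity
    ctc_ways: int  # ways currently reserved for the CTC

    def __post_init__(self):
        if not 0 <= self.ctc_ways <= self.num_ways:
            raise ValueError("ctc_ways must lie between 0 and num_ways")

    def data_lines(self) -> int:
        """Number of 128-byte lines left for ordinary L2 operation."""
        return self.num_sets * (self.num_ways - self.ctc_ways)

    def tag_sectors(self) -> int:
        """Number of 32-byte tag sectors available in the CTC region."""
        sectors_per_line = L2_LINE_BYTES // TAG_SECTOR_BYTES
        return self.num_sets * self.ctc_ways * sectors_per_line

    def resize(self, ctc_ways: int) -> "L2Partition":
        """Return a new partition with a different number of CTC ways."""
        return L2Partition(self.num_sets, self.num_ways, ctc_ways)
```

For instance, under these assumptions a cache with 1024 sets and 16 ways that reserves 2 ways for the CTC keeps 14336 data lines and offers 8192 tag sectors; doubling the reserved ways doubles the tag sectors at the expense of data lines.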

도 5는 본 발명의 일 실시예에 따른 하이브리드 메모리 장치(100)에서 태그가 저장되는 방식인 ATIC 구조를 도시하는 개념도이다.FIG. 5 is a conceptual diagram illustrating an ATIC structure, which is a method for storing tags in a hybrid memory device (100) according to one embodiment of the present invention.

도 5에 도시된 실시예에 따르면, L2 cache(220) 내에서 Tags가 저장되는 방식인 ATIC (Aggregated Tags-Inside-Cacheline) 구조가 도시된다. According to the embodiment illustrated in FIG. 5, an ATIC (Aggregated Tags-Inside-Cacheline) structure is illustrated, which is a method in which tags are stored within an L2 cache (220).

도 4의 CTC 개념을 차용한 메모리 구조에서 캐쉬 미스가 발생했을 때 DRAM 캐쉬 probe가 발생하는데, 이 때 발생하는 트래픽을 줄이기 위해 본 발명의 일 실시예에서는 Aggregated Tags-Inside-Cacheline (ATIC) 구조를 사용하여 DRAM 캐쉬 태그를 관리할 수 있다. 본 발명의 일 실시예의 ATIC 구조에서는 한 DRAM row에 해당하는 DRAM 캐쉬 태그들을 첫 번째 단일 32B 컬럼의 ECC 부분에 집합적으로(aggregately) 저장할 수 있다. 이 방식을 통해 본 발명의 일 실시예에서는 한 번의 DRAM 접근으로 한 DRAM row에 해당하는 DRAM 캐쉬 라인들의 태그 정보들을 가지고 올 수 있다. 본 발명의 일 실시예에서는 첫 번째 단일 32B 컬럼 데이터의 ECC 정보를 같은 캐쉬 라인에 속한 다음 컬럼들의 여분의 ECC 부분에 저장하여 관리할 수 있다.In a memory structure that borrows the CTC concept of FIG. 4, when a cache miss occurs and a DRAM cache probe occurs, an embodiment of the present invention can manage DRAM cache tags using an Aggregated Tags-Inside-Cacheline (ATIC) structure to reduce the traffic that occurs at this time. In the ATIC structure of an embodiment of the present invention, DRAM cache tags corresponding to one DRAM row can be stored collectively in the ECC part of the first single 32B column. Through this method, an embodiment of the present invention can obtain tag information of DRAM cache lines corresponding to one DRAM row with one DRAM access. An embodiment of the present invention can store and manage ECC information of the first single 32B column data in the spare ECC parts of the following columns belonging to the same cache line.

도 5를 참조하면, DRAM 기반 제2 메모리(120) 내의 하나의 row에 저장되는 복수개의 제1 데이터에 대응하는 DRAM cache tags를 첫번째 단일 32Byte 컬럼의 ECC 부분인 제1 영역(510)에 집약적/집합적으로 저장함으로써 DRAM cache probe 시 발생하는 트래픽을 저감할 수 있다. Referring to FIG. 5, by intensively/collectively storing DRAM cache tags corresponding to a plurality of first data stored in one row within a DRAM-based second memory (120) in the first area (510), which is the ECC portion of the first single 32-byte column, the traffic occurring during a DRAM cache probe can be reduced.

본 발명의 일 실시예에 따른 하이브리드 메모리 장치(100)의 메모리 컨트롤러(140)는, 제1 메모리(110)로부터 캐슁(caching)할 복수개의 제1 데이터의 태그 정보를 제2 메모리(120)의 캐쉬라인(cacheline) 내의 제1 영역(510)에 집합적으로(aggregately) 저장하고, 복수개의 제1 데이터를 제2 메모리(120)의 캐쉬라인 내의 제2 영역(520)에 순차적으로 저장한다. A memory controller (140) of a hybrid memory device (100) according to one embodiment of the present invention collectively stores tag information of a plurality of first data to be cached from a first memory (110) in a first area (510) within a cache line of a second memory (120), and sequentially stores the plurality of first data in a second area (520) within a cache line of the second memory (120).

메모리 컨트롤러(140)는, 제1 영역(510)에 인접한 제2 메모리(120)의 캐쉬라인 내의 제3 영역(530)에 복수개의 제1 데이터의 오류 제어 정보를 저장할 수 있다. 오류 제어 정보는 오류 제어 코드(ECC, Error Control Code)를 포함할 수 있다. 실시예에 따라서는 ECC는 Error Correction Code로 불리기도 한다. The memory controller (140) can store error control information of a plurality of first data in a third area (530) within a cache line of a second memory (120) adjacent to the first area (510). The error control information can include an error control code (ECC). In some embodiments, ECC is also called Error Correction Code.

제1 영역(510)은 제2 메모리(120)에 대한 한 번의 접근으로 모두 읽을 수 있는 크기의 영역일 수 있다. 제2 메모리(120)에 대한 한 번의 접근으로 읽을 수 있는 크기는, 예를 들어, 하나의 바이트(byte), 16/32/64/128-bit 워드(word) 중 어느 하나일 수 있다. The first area (510) may be an area of a size that can be read entirely with a single access to the second memory (120). The size that can be read with a single access to the second memory (120) may be, for example, one byte or one of a 16/32/64/128-bit word.

예를 들어, 제2 메모리(120)의 하나의 row가 저장하는 복수개의 제1 데이터의 태그를 제1 영역(510)의 크기만큼의 집합적 태그 정보로서 생성하고, 집합적 태그 정보를 제1 영역(510)에 저장할 수 있다. 제1 영역(510)에 저장된 집합적 태그 정보가, 복수개의 제1 데이터가 제2 메모리(120)에 저장되어 있는지 제1 메모리(110)에 저장되어 있는 지를 체크하기 위한 태그 정보로서 제2 메모리(120)로부터 L2 캐쉬(220)로 전달될 수 있다. 이러한 태그 정보의 전달을 위한 캐쉬 프로브(cache probe) 동작 시 본 발명의 일 실시예에 따른 집합적 태그 정보, 즉, ATIC 구조에 의하면 ATIC 내의 제1 영역(510)에 대한 한 번의 접근만으로 복수개의 제1 데이터에 대한 DRAM 캐쉬 프로브가 완료될 수 있다. For example, tags of a plurality of first data stored in one row of the second memory (120) may be generated as collective tag information of the size of the first area (510), and the collective tag information may be stored in the first area (510). The collective tag information stored in the first area (510) may be transferred from the second memory (120) to the L2 cache (220) as tag information for checking whether the plurality of first data are stored in the second memory (120) or the first memory (110). In a cache probe operation for transferring such tag information, according to an embodiment of the present invention, the collective tag information, that is, according to the ATIC structure, a DRAM cache probe for the plurality of first data may be completed with only one access to the first area (510) in the ATIC.

즉, 복수개의 제1 데이터에 대한 캐쉬 프로브 시 종래 기술에서 여러 차례 캐쉬 프로브가 이루어지는 데 반해 본 발명의 실시예에서는 제1 영역(510)에 대한 한 번의 접근에 의하여 복수개의 제1 데이터에 대한 캐쉬 프로브를 완료할 수 있으므로 메모리 액세스의 횟수를 줄이고 메모리 액세스 비용을 저감할 수 있다. That is, while in the prior art, multiple cache probes are performed when performing cache probes on multiple first data, in the embodiment of the present invention, cache probes on multiple first data can be completed by a single access to the first area (510), thereby reducing the number of memory accesses and lowering memory access costs.
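The reduction in probe accesses can be made concrete with a small model. In the sketch below, `AticRow` and `probe_row` are hypothetical names, and the number of cache lines per DRAM row is a free parameter chosen by the caller rather than a value from the embodiment. The model keeps the aggregated tags in one tag area (corresponding to the first area (510)) and counts how many times that area is read.

```python
class AticRow:
    """Toy model of one DRAM row with an aggregated tag area."""

    def __init__(self, lines_per_row: int):
        self.tag_area = [None] * lines_per_row   # first area (510): one tag per slot
        self.data_area = [None] * lines_per_row  # second area (520): cached data
        self.tag_reads = 0                       # number of accesses to the tag area

    def fill(self, slot: int, tag: int, data: bytes) -> None:
        """Cache `data` in `slot` and record its tag in the aggregated area."""
        self.tag_area[slot] = tag
        self.data_area[slot] = data

    def read_tag_area(self) -> list:
        """A single access returns every tag of the row."""
        self.tag_reads += 1
        return list(self.tag_area)


def probe_row(row: AticRow, wanted: dict) -> dict:
    """Check several (slot -> expected tag) pairs with one tag-area access."""
    tags = row.read_tag_area()
    return {slot: tags[slot] == tag for slot, tag in wanted.items()}
```

Probing three cache lines of the same row in this model costs one tag-area read, whereas a layout that keeps each tag next to its own data would need one read per probed line.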

앞서 설명한 대로 도 4의 실시예는 캐쉬 프로브를 위한 DRAM 액세스 횟수를 절감할 수 있다. 또한 도 5의 실시예는 SRAM 캐쉬의 위치에 무관하게 한번의 DRAM 접근으로 한 DRAM row에 대응하는 DRAM 캐쉬 라인들의 태그를 모두 SRAM 캐쉬에 로드할 수 있어 역시 캐쉬 프로브를 위한 DRAM 액세스 횟수를 절감할 수 있다. As described above, the embodiment of FIG. 4 can reduce the number of DRAM accesses for cache probes. In addition, the embodiment of FIG. 5 can load all tags of the DRAM cache lines corresponding to one DRAM row into the SRAM cache with a single DRAM access, regardless of the location of the SRAM cache, thereby also reducing the number of DRAM accesses for cache probes.

이때 도 5의 실시예와 도 4의 실시예가 결합됨으로써 캐쉬 프로브를 위한 overall 메모리 액세스 비용(SRAM, DRAM을 모두 포함하는)을 더욱 절감할 수 있다. At this time, by combining the embodiment of Fig. 5 and the embodiment of Fig. 4, the overall memory access cost (including both SRAM and DRAM) for cache probe can be further reduced.

이기종 적층 메모리 기반 HBM(130)에서 DRAM을 SCM의 캐쉬처럼 이용하는 경우에는 GPU(200) 어플리케이션이 타겟이기 때문에 64-128B의 캐쉬 라인을 사용하던 일반적인 DRAM 캐쉬 기술과 달리 256B의 좀 더 큰 캐쉬 라인을 사용하여 GPU 워크로드의 높은 spatial 로컬리티를 활용하고 SCM의 높은 activation 비용을 완화할 수 있다. In the case of using DRAM as a cache for SCM in heterogeneous stacked memory-based HBM (130), since the target is GPU (200) applications, unlike the general DRAM cache technology that uses cache lines of 64-128B, a larger cache line of 256B can be used to utilize the high spatial locality of GPU workload and alleviate the high activation cost of SCM.

본 발명의 ATIC 구조는 이처럼 GPU(200)를 위한 큰 캐쉬라인에 대해서도 효과적인 캐쉬 프로브 저감 효과를 제공할 수 있다. The ATIC structure of the present invention can provide an effective cache probe reduction effect even for a large cache line for a GPU (200).

한편, 도 4 및 도 5의 구성은 NVRAM 기반 SCM을 전제로 할 때 DRAM에 대한 태그 프로브 액세스를 절감하는 효과를 더욱 높일 수 있다. Meanwhile, the configurations of FIGS. 4 and 5 can further increase the effect of reducing tag probe access to DRAM when assuming an NVRAM-based SCM.

NVRAM 기반 SCM은 상변화 메모리(PCM, Phase Change Memory) 등 적어도 row 기반 랜덤 액세스가 가능하고, in-place writing이 가능하며 별도의 erase 동작을 필요로 하지 않는 종류의 메모리를 의미할 수 있다. NVRAM-based SCM can mean a type of memory that enables at least row-based random access, in-place writing, and does not require a separate erase operation, such as phase change memory (PCM).

예를 들어, 제1 메모리(110)와 제2 메모리(120)는 제1 개수의 비트의 공통 주소를 공유할 수 있다. SCM과 DRAM이 direct-mapped cache organization을 채택할 수 있다. For example, the first memory (110) and the second memory (120) may share a common address of a first number of bits. The SCM and DRAM may adopt a direct-mapped cache organization.

복수개의 제1 데이터의 태그 정보는 복수개의 제1 데이터가 저장되는 제1 메모리(110) 상의 제1 주소 중 공통 주소를 제외한 제1 주소와 제2 메모리(120) 상의 제2 주소 간의 제2 개수의 비트의 오프셋 정보를 포함할 수 있다. Tag information of a plurality of first data may include offset information of a second number of bits between a first address excluding a common address among first addresses on a first memory (110) where a plurality of first data are stored and a second address on a second memory (120).

예를 들어 SCM이 DRAM의 크기의 n배인 경우, SCM 영역 및 DRAM 영역 간의 주소 매핑을 1:n으로 구현할 수 있다. 이때 제1 메모리(110) 상의 제1 주소 중 제2 메모리(120)의 제2 주소와 공통적으로 공유되는 공통 주소 필드가 존재하고, 제2 주소 이외의 주소 필드를 제1 메모리(110)의 제1 주소의 오프셋 정보로서 구현할 수 있다. 공통 주소를 제1 개수의 비트만큼 구성하면, 오프셋 정보는 제2 개수의 비트만큼 구현될 수 있다. For example, if the SCM is n times the size of the DRAM, the address mapping between the SCM area and the DRAM area can be implemented as 1:n. At this time, among the first addresses on the first memory (110), there is a common address field that is shared in common with the second address of the second memory (120), and an address field other than the second address can be implemented as offset information of the first address of the first memory (110). If the common address is configured as the first number of bits, the offset information can be implemented as the second number of bits.
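The relationship between the common address and the offset information may be sketched as follows for a direct-mapped organization. The sketch assumes that n is a power of two and that the common address occupies the low-order bits of the SCM address while the offset occupies the high-order bits; neither the bit positions nor the function names `offset_bits`, `split_scm_address`, and `join_scm_address` are specified by the embodiment.

```python
from typing import Tuple


def offset_bits(n: int) -> int:
    """Bits needed to tell apart the n SCM locations mapped to one DRAM location."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


def split_scm_address(scm_addr: int, common_bits: int) -> Tuple[int, int]:
    """Split an SCM line address into (common address, offset).

    The common address selects the DRAM location; only the offset has to be
    kept as tag information.
    """
    common = scm_addr & ((1 << common_bits) - 1)
    offset = scm_addr >> common_bits
    return common, offset


def join_scm_address(common: int, offset: int, common_bits: int) -> int:
    """Rebuild the SCM line address from the DRAM location and the stored tag."""
    return (offset << common_bits) | common
```

With 4 common bits and an SCM four times the size of the DRAM, the tag needs only 2 bits: SCM address 38 (binary 10 0110) maps to DRAM location 6 with offset 2. Storing only the offset is what lets more tags fit into a tag area of fixed size.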

메모리 컨트롤러(140)는 호스트로부터 요구되는 데이터의 공통 주소 부분을 제외한 오프셋 정보만을 cache tags 정보로서 관리할 수 있으므로 ATIC 구조 및 CTC 개념과 결합하면 더욱 많은 수의 제1 데이터를 한번의 DRAM cache probe를 통하여 관리할 수 있으므로 큰 데이터를 요구하는 어플리케이션에서 강점을 가지며, DRAM cache probe 빈도를 줄일 수 있다. The memory controller (140) can manage only the offset information excluding the common address part of the data requested from the host as cache tags information, so when combined with the ATIC structure and the CTC concept, a larger number of first data can be managed through a single DRAM cache probe, so it has an advantage in applications requiring large data, and the frequency of DRAM cache probes can be reduced.

본 발명의 일 실시예가 direct-mapped cache organization을 채택하고 있지만 본 발명의 사상은 이에 국한되지는 않으며 본 발명의 다른 실시예에서는 set-associative cache organization에서도 주소를 효과적으로 관리함으로써 소기의 목적을 달성할 수 있도록 변형될 수 있을 것이다. Although one embodiment of the present invention adopts a direct-mapped cache organization, the spirit of the present invention is not limited thereto, and other embodiments of the present invention may be modified to achieve the intended purpose by effectively managing addresses in a set-associative cache organization as well.

Another conventional technique that can be compared with an embodiment of the present invention combining the embodiments of FIGS. 4 and 5 is a method of compressing and storing the tags used for a cache probe. However, compressing tags is inefficient because it consumes computing resources for compression and decompression and also requires separate cache-level hardware resources for that purpose. In contrast, the collective tag information according to an embodiment of the present invention requires no separate computing or hardware resources for compression or decompression, so it is more efficient than the conventional technique and can significantly reduce memory access costs.

FIG. 6 is a flowchart illustrating a management method of a hybrid memory device (100) according to one embodiment of the present invention.

FIG. 7 is a flowchart illustrating a portion of the method of FIG. 6 in more detail.

A method for managing a hybrid memory according to one embodiment of the present invention can be performed by a memory controller (140).

A method for managing a hybrid memory according to one embodiment of the present invention includes: when data that is stored in a non-volatile first memory (110) and corresponds to a first request of a host is not stored in a second memory (120) capable of random access (S610), calculating a device sensitivity between the first memory (110) and the second memory (120) for the first request (S620); and determining, based on the device sensitivity, whether to bypass the second memory (120) and directly access the first memory (110) for the first request (S630). The device sensitivity is calculated based on a first cost of the first memory (110) according to the type of access of the first request, and a second cost of the second memory (120) for processing the first request.

In the step of calculating the device sensitivity (S620), the device sensitivity can be calculated by dividing by the number of consecutive data included in the first request.

In the step of calculating the device sensitivity (S620), the device sensitivity can be discretized into a value between a predetermined minimum value and an observed maximum value.

A method for managing a hybrid memory according to one embodiment of the present invention may further include: calculating an affinity of the first request for the second memory (120) when it is not determined to bypass the second memory (120) based on the device sensitivity (S630: Yes) (S640); and determining whether to replace a part of a cache line of the second memory (120) with data of the first request, based on a comparison between the affinity of the first request for the second memory (120) and the affinity, for the second memory (120), of the victim data to be evicted from the second memory (120) (S650).

In the step of calculating the affinity of the first request for the second memory (120) (S640), the affinity for the second memory (120) can be determined based on the access frequency of the data corresponding to the first request and the device sensitivity of the first request.

In the step of calculating the affinity of the first request for the second memory (120) (S640), the affinity for the second memory (120) can be discretized into a value between a predetermined minimum value and an observed maximum value.

A method for managing a hybrid memory according to one embodiment of the present invention may further include, when it is not determined to replace a part of the cache line of the second memory (120) with data of the first request (S650: No), bypassing the second memory (120) and directly accessing the first memory (110) to process the first request (S680).

A method for managing a hybrid memory according to one embodiment of the present invention may further include, when it is not determined to replace a part of the cache line of the second memory (120) with data of the first request (S650: No), reducing the affinity of the victim data for the second memory (120) according to a predetermined condition, in response to the event that the data corresponding to the first request was not stored in the second memory (120) (S670).

The DRAM cache bypass mechanism of the hybrid memory device (100) according to one embodiment of the present invention is executed when a DRAM cache miss occurs (S610). In a first step (S630), a device-sensitivity score is calculated for the cache-missed access (S620) and compared with the average device-sensitivity score maintained for each memory channel. The comparison may be performed after each score is discretized. Only when the score of the cache-missed access is greater than the channel average does the process continue to a second step (S640 to S650); otherwise the DRAM cache may be bypassed immediately (S680). In the second step (S640 to S650), a DRAM-affinity score is calculated by multiplying the device-sensitivity score of the cache-missed access by the access frequency of the corresponding page (S640), and the discretized DRAM-affinity score can be compared with the discretized DRAM-affinity score of the victim candidate (S650). Only when the value for the cache-missed access is greater than that of the victim candidate (S650: Yes) does DRAM cache line replacement occur (S660); in all other cases the DRAM cache can be bypassed (S680).
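The two-step decision described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the controller's implementation: the names `Decision`, `decide_on_miss`, and `to_level` are hypothetical, the default discretizer (`int`, i.e., truncation) is only a placeholder, and the victim-affinity adjustment of step S670 is omitted.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    BYPASS = "bypass"    # S680: serve the request directly from the first memory (SCM)
    REPLACE = "replace"  # S660: replace the DRAM cache line with the requested data


def decide_on_miss(
    sensitivity_score: float,
    access_frequency: int,
    channel_avg_sensitivity: float,
    victim_affinity_level: int,
    to_level: Callable[[float], int] = int,  # placeholder discretizer (assumption)
) -> Decision:
    """Decide what to do with a cache-missed access (called after S610/S620)."""
    # First step (S630): compare discretized sensitivity with the channel average.
    if to_level(sensitivity_score) <= to_level(channel_avg_sensitivity):
        return Decision.BYPASS

    # Second step (S640): DRAM-affinity = device sensitivity x page access frequency.
    affinity_level = to_level(sensitivity_score * access_frequency)

    # S650: replace only if the new data is strictly "hotter" than the victim.
    if affinity_level > victim_affinity_level:
        return Decision.REPLACE
    return Decision.BYPASS
```

For instance, a request with score 955 and access frequency 3 on a channel whose average is 100 would replace a victim whose level is 500, whereas a request with score 23.25 would bypass the cache at the first step.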

A cache is a memory for temporary storage of data in a data processing device. Typically, a cache is a small, high-speed memory that stores copies of a subset of the data held in a backing storage device. The backing storage device is typically a large, low-speed memory or data storage device. The data in the cache is used by cache clients such as a central processing unit (CPU). CPU performance improves when frequently used data is available in the cache, because the latency associated with reading data from the backing storage device is avoided. Each entry in the cache contains the data itself together with a tag relating to the location of the original data in the backing storage device, a valid bit, and optionally one or more status bits. The size of the cache is determined by the need to store the tags, valid bits, and status bits in addition to the data itself.

Applications that need to cache large amounts of data may be provided with a large cache. For example, a large off-chip cache outside the CPU may require a large amount of tag storage. The tag storage requirements of a large off-chip DRAM cache can often exceed what can practically be stored in on-chip SRAM (for example, the Level 2 (L2) cache (220)). Storing the tags in DRAM, however, can incur a very large latency penalty: when both tags and data are read from DRAM, the penalty can be very large because DRAM accesses are much slower than SRAM accesses.
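A back-of-the-envelope calculation shows why tag storage for a large DRAM cache can outgrow on-chip SRAM. The function below is a minimal sketch; the name `tag_store_bytes` and all numeric values in the usage note are illustrative assumptions and do not come from this document.

```python
def tag_store_bytes(cache_bytes: int, line_bytes: int,
                    tag_bits: int, state_bits: int) -> int:
    """Bytes of metadata (tag + valid/status bits) needed for a cache.

    One metadata record is kept per cache line; the total is rounded up
    to whole bytes.
    """
    if line_bytes <= 0 or cache_bytes % line_bytes != 0:
        raise ValueError("cache size must be a positive multiple of the line size")
    num_lines = cache_bytes // line_bytes
    total_bits = num_lines * (tag_bits + state_bits)
    return (total_bits + 7) // 8
```

Under the assumed values of a 16 GiB cache with 64-byte lines, 14 tag bits, and 2 state bits, `tag_store_bytes(16 * 2**30, 64, 14, 2)` evaluates to 512 MiB of metadata, far more than a typical on-chip SRAM can hold.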

FIG. 8 is a conceptual diagram illustrating a process for calculating device affinity in a hybrid memory device (100) according to one embodiment of the present invention.

In FIG. 8, each time interval is shown only to explain the concept; the intervals are not drawn to scale.

In a typical direct-mapped DRAM cache organization, the victim is evicted whenever a cache miss occurs. In the conventional eviction process, the utility of the victim and of the new cache line is not considered.

In an embodiment of the present invention, the device sensitivity can be calculated (S620) by considering the type of the first request issued by the host (read, write, or erase), the spatial locality or size (burst size) of the first request, and the access costs of the first memory (110) and the second memory (120), which vary with the type of the first request.

Here, the physical characteristics of each of the first memory (110) and the second memory (120), the latency for row access, and the physical time required to process each request can be considered as access costs.

In the method for managing the hybrid memory device (100) according to one embodiment of the present invention, the device sensitivity calculated in step S620 can be computed as in Equation 1 below.

[Equation 1]

$$\mathrm{DeviceSensitivityScore} = \frac{\mathrm{Latency}_{SCM} - \mathrm{Latency}_{DRAM}}{\mathrm{NumColumnsAccessed}}$$

Here, DeviceSensitivityScore denotes the device sensitivity score, Latency_SCM denotes the latency for row access of the SCM-based first memory (110), and Latency_DRAM denotes the latency for row access of the DRAM-based second memory (120).

NumColumnsAccessed is the number of columns requested at once, i.e., information corresponding to the spatial locality and/or burst size of the first request.

When the device sensitivity score for the request to read four columns sequentially, as illustrated in FIG. 8, is calculated based on Equation 1, it is the difference (=93) between the read latency of the SCM (the "ACT" time interval of the SCM) and the read latency of the DRAM (the "ACT" time interval of the DRAM), divided by the number of columns requested at once (=4). As FIG. 8 shows, the PRECHARGE time interval, which completes the memory operation and prepares for the next one, is included in the latency calculation.

That is, if the first request reads four consecutive columns, the device sensitivity is 93/4 = 23.25, or approximately 23.

When the device sensitivity score for the single-column write request illustrated in FIG. 8 is calculated based on Equation 1, it is the difference (=955) between the write latency of the SCM (the "ACT" time interval of the SCM) plus write recovery and the write latency of the DRAM (the "ACT" time interval of the DRAM) plus write recovery, divided by the number of columns requested at once (=1). As FIG. 8 shows, the PRECHARGE time interval, which completes the memory operation and prepares for the next one, is included in the latency calculation.

That is, if the first request writes a single column, the device sensitivity is 955/1 = 955.

In an embodiment where the first memory (110) is a general non-volatile memory, the device sensitivity score may be even higher when the first memory (110) must be erased.

That is, the device sensitivity can differ depending on whether the first request is a read, a write, or a write that requires erasing. In addition, because the access cost is spread across however many columns are accessed consecutively in one request, the device sensitivity can be obtained by dividing the device-dependent difference in latency (including recovery and precharge) by the number of consecutively requested columns.
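The calculation described above can be reproduced with a few lines of code. This is a minimal sketch: the function name `device_sensitivity_score` is hypothetical, and only the latency gaps (93 and 955) come from the text. The absolute latency values in the usage note are placeholders chosen to reproduce those gaps.

```python
def device_sensitivity_score(latency_scm: float, latency_dram: float,
                             num_columns_accessed: int) -> float:
    """Latency gap between SCM and DRAM, amortized over the columns accessed.

    Each latency is assumed to already include any write recovery and
    precharge time that applies to the request type.
    """
    if num_columns_accessed <= 0:
        raise ValueError("num_columns_accessed must be positive")
    return (latency_scm - latency_dram) / num_columns_accessed
```

With placeholder latencies of 143 (SCM) and 50 (DRAM), a four-column read yields `device_sensitivity_score(143, 50, 4) == 23.25`; with 1005 and 50, a single-column write yields `955.0`. The large gap between the two results reflects how much more a write benefits from being served by the DRAM cache.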

In step S630, the per-channel average device sensitivity is compared with the device sensitivity calculated for the first request (S620) to determine whether to bypass the second memory (120) (S680).

If the device sensitivity score is not large, the average access cost per piece of data will not be large either, so the second memory (120) can be bypassed and the first request can be processed by directly accessing the first memory (110) (S680). In this case, even if another cache miss occurs, the cost of processing the first request by directly accessing the SCM-based first memory (110) is not much greater than the cost of processing it via the second memory (120), so the GPU (200) may not perceive a significant performance degradation.

Referring again to FIG. 7, even when the device sensitivity is high and it is therefore advantageous to process the first request via the second memory (120) acting as a cache, the utility of the victim to be evicted from the second memory (120) can be compared with the utility of the data required by the first request. In this case, it is necessary to consider whether the data to be evicted is hot data or cold data.

The DRAM affinity score calculated at this time can be discretized and stored in part of the fields of the ATIC structure shown in FIG. 5.

For example, when a 10-bit ECC is required for 32-byte data in FIG. 5, spare bits exist in proportion to the ECC bits saved by the direct-mapped technique, and the DRAM affinity level can be stored in this spare bit space. The DRAM affinity score or the device sensitivity score can be discretized to make storage, comparison, and management easier.
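One simple way to discretize a score so that it fits in a few spare bits is uniform binning between the predetermined minimum and the observed maximum. The sketch below is an assumption made for illustration: the text does not specify the binning rule or the number of levels, and the function name `discretize` is hypothetical.

```python
def discretize(score: float, min_value: float, max_observed: float,
               num_levels: int) -> int:
    """Map a raw score to an integer level in [0, num_levels - 1].

    Uniform binning is assumed; scores outside the range are clamped.
    """
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    if max_observed <= min_value:
        return 0  # no spread observed yet, so everything falls in the lowest level
    clamped = min(max(score, min_value), max_observed)
    fraction = (clamped - min_value) / (max_observed - min_value)
    # The top of the range maps to the highest level, not one past it.
    return min(int(fraction * num_levels), num_levels - 1)
```

With two spare bits (`num_levels=4`) and a range of 0 to 955, a score of 23.25 falls in level 0 and a score of 955 falls in level 3, so the stored levels can be compared directly without keeping the raw scores.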

In step S640, the DRAM affinity score can be calculated by multiplying the device sensitivity score of the first request by the activation counter of the corresponding page. The activation counter of the present invention is an indicator of whether the page holds hot or cold data, and corresponds to a kind of access frequency. In one embodiment of the present invention a per-page activation counter is maintained, but in other embodiments the activation counter may be maintained at a granularity other than a page.

If the DRAM affinity level of the first request calculated in step S640 is greater than the DRAM affinity level of the victim candidate, the data of the first request is hotter and more likely to be accessed again, so the cache line of the second memory (120) is replaced with the data of the first request (S660), and the victim can be evicted to the first memory (110).

If the DRAM affinity level of the first request calculated in step S640 is not greater than the DRAM affinity level of the victim candidate, the data of the first request is colder and unlikely to be accessed again, so the second memory (120) can be bypassed without evicting the victim candidate (S680).

Because the victim candidate that remains in the second memory (120) acting as a cache has now been involved in a cache miss, its activation counter needs to be decreased. Accordingly, the DRAM affinity level of the remaining victim candidate can be decreased with a predetermined probability p_dec, or a corresponding operation can be performed on the activation counter (S670).
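The probabilistic adjustment of step S670 can be sketched as follows. This is a minimal illustration under assumptions: the function name `decay_victim_affinity` is hypothetical, and lowering the level by exactly one step with a floor at `min_level` is a choice made for the example, since the text only states that the level is decreased with probability p_dec.

```python
import random


def decay_victim_affinity(level: int, p_dec: float,
                          rng: random.Random = random.Random(),
                          min_level: int = 0) -> int:
    """Return the victim's affinity level after a miss that did not evict it.

    With probability p_dec the level drops by one step (assumed step size);
    otherwise it is left unchanged. The level never drops below min_level.
    """
    if not 0.0 <= p_dec <= 1.0:
        raise ValueError("p_dec must be a probability in [0, 1]")
    if level > min_level and rng.random() < p_dec:
        return level - 1
    return level
```

A small p_dec lets a victim survive several conflicting misses before its level falls, while a p_dec close to 1 makes a retained line vulnerable to replacement soon after it stops being accessed.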

FIG. 9 is a conceptual diagram illustrating an example of a memory controller, a memory management device, or a computing system within a generalized hybrid memory device capable of performing at least a portion of the processes of FIGS. 1 through 8.

The memory management device may be included within the hybrid memory device, or may be located outside the hybrid memory device to manage and control each memory element within it.

Although omitted from the drawings, in the embodiments of FIGS. 1 to 8 a processor and a memory are also electronically connected to each component, and the operation of each component can be controlled or managed by the processor.

At least a portion of the method according to one embodiment of the present invention can be executed by the computing system (1000) of FIG. 9.

Referring to FIG. 9, a computing system (1000) according to one embodiment of the present invention may include a processor (1100), a memory (1200), a communication interface (1300), a storage device (1400), an input interface (1500), an output interface (1600), and a bus (1700).

The computing system (1000) according to one embodiment of the present invention may include at least one processor (1100) and a memory (1200) storing instructions that direct the at least one processor (1100) to perform at least one step. At least some steps of a method according to one embodiment of the present invention may be performed by the at least one processor (1100) loading the instructions from the memory (1200) and executing them.

The processor (1100) may be a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which methods according to embodiments of the present invention are performed.

Each of the memory (1200) and the storage device (1400) may consist of at least one of a volatile storage medium and a non-volatile storage medium. For example, the memory (1200) may consist of at least one of a read only memory (ROM) and a random access memory (RAM).

In addition, the computing system (1000) may include a communication interface (1300) that communicates over a wireless network.

In addition, the computing system (1000) may further include a storage device (1400), an input interface (1500), an output interface (1600), and the like.

In addition, the components included in the computing system (1000) may be connected by a bus (1700) and communicate with one another.

Examples of the computing system (1000) of the present invention include a communication-capable desktop computer, laptop computer, notebook, smart phone, tablet PC, mobile phone, smart watch, smart glasses, e-book reader, portable multimedia player (PMP), portable game console, navigation device, digital camera, digital multimedia broadcasting (DMB) player, digital audio recorder, digital audio player, digital video recorder, digital video player, and personal digital assistant (PDA).

The operation of the method according to an embodiment of the present invention can be implemented as a computer-readable program or code on a computer-readable recording medium. The computer-readable recording medium includes all types of recording devices that store information readable by a computer system. The computer-readable recording medium can also be distributed over network-connected computer systems so that the computer-readable program or code is stored and executed in a distributed manner.

In addition, the computer-readable recording medium may include hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. The program instructions may include not only machine code produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like.

Although some aspects of the present invention have been described in the context of an apparatus, they also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Similarly, aspects described in the context of a method also represent a description of a corresponding block, item, or feature of a corresponding apparatus. Some or all of the method steps may be performed by (or using) a hardware device such as a microprocessor, a programmable computer, or an electronic circuit. In some embodiments, at least one of the most important method steps may be performed by such a device.

In embodiments, a programmable logic device (for example, a field-programmable gate array) may be used to perform some or all of the functions of the methods described herein. In embodiments, a field-programmable gate array may operate together with a microprocessor to perform one of the methods described herein. In general, the methods are preferably performed by some hardware device.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that various modifications and changes may be made to the present invention without departing from the spirit and scope of the present invention as set forth in the claims below.

Claims

A method for managing a hybrid memory, performed by a memory controller that manages the hybrid memory, the method comprising:
receiving a plurality of first data to be cached from a first memory, which is a non-volatile storage class memory (SCM);
collectively storing tag information of the plurality of first data in a first area within a cache line of a second memory capable of random access; and
sequentially storing the plurality of first data in a second area within the cache line of the second memory,
wherein the first memory and the second memory share a common address of a first number of bits,
wherein the tag information of the plurality of first data includes offset information, of a second number of bits, between a first address on the first memory at which the plurality of first data are stored, excluding the common address, and a second address on the second memory, and
wherein the tag information of the plurality of first data has a size that allows it to be collectively stored in the first area, the first area being of a size that can be read with a single access to the second memory.

The method of claim 1, further comprising:
storing error control information of the plurality of first data in a third area within the cache line of the second memory, the third area being adjacent to the first area.

(Deleted)

The method of claim 1, further comprising:
providing the tag information for the plurality of first data stored in the first area to a host when information on the plurality of first data is requested by the host.

The method of claim 1, further comprising:
storing, by a host, the tag information for the plurality of first data stored in the first area in a part of a level 2 cache as identification information for the plurality of first data.

(Deleted)

A method for managing a hybrid memory, performed by a memory controller that manages the hybrid memory, the method comprising:
calculating, when data corresponding to a first request of a host is data stored in a non-volatile first memory and is not stored in a second memory capable of random access, a device sensitivity between the first memory and the second memory for the first request;
determining, based on the device sensitivity, whether to bypass the second memory and directly access the first memory for the first request;
calculating an affinity of the first request for the second memory when it is not determined, based on the device sensitivity, to bypass the second memory; and
determining whether to replace a part of a cache line of the second memory for the first request, based on a result of comparing the affinity of the first request for the second memory with an affinity, for the second memory, of victim data to be evicted from the second memory,
wherein the device sensitivity is calculated based on a first cost of the first memory according to an access type of the first request and a second cost of the second memory for processing the first request.
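As an illustration only, the miss-handling flow recited above can be sketched in Python. The claim does not give formulas, so the following are assumptions made for the example: device sensitivity is taken as the ratio of the first-memory cost to the second-memory cost, bypass is chosen when that ratio falls below a hypothetical `bypass_threshold`, and replacement requires the request's affinity to be strictly greater than the victim's. The names `Decision`, `device_sensitivity`, and `decide_on_miss` are likewise hypothetical.

```python
from enum import Enum


class Decision(Enum):
    BYPASS = "bypass the second memory and access the first memory directly"
    REPLACE = "evict the victim data and cache the requested data"


def device_sensitivity(first_cost, second_cost):
    """Assumed form: cost ratio of the first memory to the second memory."""
    return first_cost / second_cost


def decide_on_miss(first_cost, second_cost, affinity_fn,
                   victim_affinity, bypass_threshold):
    """Decide how to serve a request that missed in the second memory.

    affinity_fn maps the device sensitivity to the request's affinity; it is
    only called when the sensitivity check does not already choose bypass.
    """
    sensitivity = device_sensitivity(first_cost, second_cost)
    if sensitivity < bypass_threshold:
        return Decision.BYPASS  # caching would gain too little (assumed rule)
    request_affinity = affinity_fn(sensitivity)
    if request_affinity > victim_affinity:
        return Decision.REPLACE
    return Decision.BYPASS
```

Passing the affinity as a function mirrors the claimed ordering, in which affinity is computed only after the sensitivity check has declined to bypass.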

The method of claim 7,
wherein, in the calculating of the device sensitivity,
the device sensitivity is calculated by dividing by the number of consecutive data included in the first request.

The method of claim 7,
wherein, in the calculating of the device sensitivity,
the device sensitivity is discretized into any one value between a predetermined minimum value and an observed maximum value.
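As an illustration only, the normalization by the number of consecutive data and the discretization recited in the two preceding claims can be sketched in Python. The claims do not state what quantity is divided or how many discrete levels are used; this sketch assumes a raw sensitivity is divided by the count, and that the result is mapped onto a hypothetical number of integer levels spanning the range from the predetermined minimum to the observed maximum. The function names are hypothetical.

```python
def per_datum_sensitivity(raw_sensitivity, consecutive_count):
    """Assumed reading: spread a request's sensitivity over its data items."""
    if consecutive_count < 1:
        raise ValueError("a request contains at least one datum")
    return raw_sensitivity / consecutive_count


def discretize(value, minimum, observed_max, levels):
    """Map value onto integer levels 0..levels-1 over [minimum, observed_max]."""
    if levels < 2:
        raise ValueError("need at least two levels")
    if observed_max <= minimum:
        return 0  # no spread observed yet, so everything is the lowest level
    clamped = min(max(value, minimum), observed_max)
    fraction = (clamped - minimum) / (observed_max - minimum)
    # fraction == 1.0 would index one past the end, so cap at the top level.
    return min(int(fraction * levels), levels - 1)
```

The same `discretize` helper would apply equally to an affinity value, since a later claim recites the same discretization for affinity.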

(Deleted)

The method of claim 7,
wherein, in the calculating of the affinity of the first request for the second memory,
the affinity for the second memory is determined based on an access frequency of the data corresponding to the first request and the device sensitivity of the first request.

The method of claim 7,
wherein, in the calculating of the affinity of the first request for the second memory,
the affinity for the second memory is discretized into any one value between a predetermined minimum value and an observed maximum value.

The method of claim 7, further comprising:
processing the first request by bypassing the second memory and directly accessing the first memory for the first request when it is determined not to replace a part of the cache line of the second memory for the first request.

The method of claim 13, further comprising:
reducing, according to a predetermined condition, the affinity of the victim data for the second memory in response to an event in which the data corresponding to the first request is not stored in the second memory, when it is determined not to replace a part of the cache line of the second memory for the first request.
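As an illustration only, the reduction of the victim data's affinity can be sketched in Python. The claim leaves the "predetermined condition" open; this sketch assumes one possible condition, namely that the affinity drops by one on every `decay_period`-th miss that did not lead to a replacement, and never below a floor. The function name and both parameters are hypothetical.

```python
def age_victim_affinity(affinity, misses_without_replacement,
                        decay_period=4, floor=0):
    """Lower the victim's affinity by one on every decay_period-th such miss."""
    if decay_period < 1:
        raise ValueError("decay_period must be positive")
    due = (misses_without_replacement > 0
           and misses_without_replacement % decay_period == 0)
    if due:
        return max(affinity - 1, floor)
    return affinity
```

Aging of this kind keeps data that was once valuable from holding its place in the second memory indefinitely while new requests keep being bypassed.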

A hybrid memory device comprising:
a first memory, which is a non-volatile storage class memory (SCM);
a second memory capable of random access; and
a memory controller,
wherein the memory controller
collectively stores tag information of a plurality of first data to be cached from the first memory in a first area within a cache line of the second memory, and
sequentially stores the plurality of first data in a second area within the cache line of the second memory,
wherein the first memory and the second memory share a common address of a first number of bits,
wherein the tag information of the plurality of first data includes offset information, of a second number of bits, between a first address on the first memory at which the plurality of first data are stored, excluding the common address, and a second address on the second memory, and
wherein the tag information of the plurality of first data has a size that allows it to be collectively stored in the first area, the first area being sized to be read with a single access to the second memory.

The hybrid memory device of claim 15,
wherein the memory controller
stores error control information of the plurality of first data in a third area, adjacent to the first area, within the cache line of the second memory.

(Deleted)

The hybrid memory device of claim 15,
wherein the memory controller
provides, to a host, the tag information of the plurality of first data stored in the first area when information on the plurality of first data is requested by the host.

The hybrid memory device of claim 18,
wherein the memory controller
provides the tag information of the plurality of first data to the host so that the host stores the tag information of the plurality of first data stored in the first area in a part of a level 2 cache as identification information for the plurality of first data.

(Deleted)