WO2021210998A1 - Malicious domain hosting type classification systems and methods - Google Patents
Malicious domain hosting type classification systems and methods
- Publication number
- WO2021210998A1 (PCT/QA2021/050004)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- domain
- domains
- malicious
- private
- public
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1483—Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/21—Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/2119—Authenticating web pages, e.g. with suspicious links
Definitions
- the present application relates generally to domain classification. More specifically, the present application provides a software-based classifier built on a machine learning model that distinguishes between public and private malicious URL hosting apex domains.
- Appropriate mitigation actions against a malicious website may differ greatly depending on how that site is hosted. If it is hosted under a private apex domain, where all its subdomains and pages are under the direct control of the apex domain owner, the malicious website could be blocked at the apex domain level. If it is hosted under a public apex domain (e.g., a web hosting service provider), it would be more appropriate to block at the subdomain level. Further, in the former case, the private apex domain may be legitimate but compromised, or may be attacker-generated, which, again, would warrant different mitigation actions. Attacker-owned apex domains could be blocked permanently, while compromised domains may be blocked only temporarily.
- the present application provides a software-based classifier built on a machine learning model that distinguishes between two kinds of malicious URL hosting apex domains: public and private.
- This classification helps security professionals specify which domain levels to block, the whole apex domain in the case of private apexes or specific subdomains in the case of public ones.
- the classifier is built on a machine learning model that differentiates attacker-owned hosting domains from compromised hosting domains. This distinction is crucial to help security operators take the appropriate mitigation actions. For example, attacker-owned domains could be blocked permanently, whereas compromised ones could be blocked only temporarily.
- in light of the technical features set forth herein, and without limitation, in a first aspect of the disclosure in the present application, which may be combined with any other aspect unless specified otherwise, a system includes a display and a memory in communication with a processor.
- the processor may be configured to identify a malicious domain from a set of received domains; determine, using a model, whether the identified malicious domain is a public domain or a private domain; determine, if the identified malicious domain is a private domain, using a model, whether the private domain is a compromised domain or an attacker-owned domain; and display the determined malicious domain hosting type on the display, the determined malicious hosting type being a public domain, a compromised private domain, or an attacker-owned private domain.
- a method includes identifying a malicious domain from a set of received domains. It may be determined, using a model, whether the identified malicious domain is a public domain or a private domain. If the identified malicious domain is a private domain, it may be determined, using a model, whether the private domain is a compromised domain or an attacker-owned domain.
- the determined malicious domain hosting type may be displayed. In this aspect, the determined malicious hosting type is a public domain, a compromised private domain, or an attacker-owned private domain.
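The three-way decision described in the aspects above can be sketched as a small pipeline. This is a minimal illustration, not the claimed implementation: the names `HostingType` and `classify_hosting_types` are hypothetical, and the three predicates stand in for the malicious domain identifier and the two trained models, which are not reproduced here.

```python
from enum import Enum
from typing import Callable, Iterable, Iterator, Tuple


class HostingType(Enum):
    PUBLIC = "public domain"
    COMPROMISED_PRIVATE = "compromised private domain"
    ATTACKER_OWNED_PRIVATE = "attacker-owned private domain"


def classify_hosting_types(
    domains: Iterable[str],
    is_malicious: Callable[[str], bool],
    is_public: Callable[[str], bool],
    is_compromised: Callable[[str], bool],
) -> Iterator[Tuple[str, HostingType]]:
    """Yield (domain, hosting type) for every domain judged malicious."""
    for domain in domains:
        if not is_malicious(domain):
            continue  # only malicious domains are classified further
        if is_public(domain):
            yield domain, HostingType.PUBLIC
        elif is_compromised(domain):
            # private apex, legitimate owner, abused by an attacker
            yield domain, HostingType.COMPROMISED_PRIVATE
        else:
            yield domain, HostingType.ATTACKER_OWNED_PRIVATE


# Toy stand-ins for the models (assumptions for illustration only).
toy_public = {"000webhostapp.com"}
toy_compromised = {"questionpro.com"}
received = ["000webhostapp.com", "questionpro.com", "getbinance.org", "nsa.gov"]

results = dict(
    classify_hosting_types(
        received,
        is_malicious=lambda d: d != "nsa.gov",
        is_public=lambda d: d in toy_public,
        is_compromised=lambda d: d in toy_compromised,
    )
)
for domain, hosting_type in results.items():
    print(domain, "->", hosting_type.value)
```

The second model is consulted only for domains the first model does not mark as public, which mirrors the conditional step in the described method.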
- FIG. 1 illustrates a block diagram of an example system for malicious domain hosting type classification, according to an aspect of the present disclosure.
- FIG. 2 illustrates a flow chart of an example method for malicious domain hosting type classification, according to an aspect of the present disclosure.
- FIG. 3 illustrates a graph of a comparison of VT URL intelligence against SA and GSB.
- FIG. 4 illustrates a graph showing the AUC of the ROC curve is 96% for GT1.
- FIG. 5 illustrates a graph showing the AUC of the ROC curve is 99% for GT2.
- FIG. 6 illustrates a table showing various features of the five features groups that the private domain classifier may take into account, according to an aspect of the present disclosure.
- FIG. 7 illustrates a correlation matrix for class labels, domain duration, quantity of scanner count, and Alexa rank, according to an aspect of the present disclosure.
- FIG. 8 illustrates a graph showing the ROC curves and feature importance in an example in which the private domain classifier is a Random Forest classifier, according to an aspect of the present disclosure.
- FIG. 9 illustrates a graph showing the ROC curve for an example in which the private domain classifier is a Random Forest classifier, according to an aspect of the present disclosure.
- FIG. 10 illustrates a graph showing the CDF of the number of FQDNs per apex during a period for likely benign domains and malicious domains.
- FIG. 11A illustrates a graph showing the number of FQDNs per apex for the two categories of apex domains, public and private.
- FIG. 11B illustrates a graph showing the average Alexa ranking distribution for public and private apex domains.
- FIG. 11C illustrates a graph showing the domain lifetime distribution for public and private apex domains.
- FIG. 12A illustrates a graph showing the number of FQDNs per apex for compromised and attacker-owned domains.
- FIG. 12B illustrates a graph showing the average Alexa rank distribution for compromised and attacker-owned apex domains.
- FIG. 12C illustrates a graph showing the domain lifetime distribution for compromised and attacker-owned apex domains.
- FIG. 13 illustrates a feature correlation matrix for the features used in the public domain classifier, according to an aspect of the present disclosure.
- FIGS. 14A and 14B illustrate graphs showing the feature importance for a Random Forest-based public domain classifier for the two datasets GT1 and GT2.
- FIGS. 15A and 15B illustrate graphs showing the t-SNE for a Random Forest-based public domain classifier for the two datasets GT1 and GT2.
- FIGS. 16A and 16B illustrate graphs showing the precision-recall for a Random Forest-based public domain classifier for the two datasets GT1 and GT2.
- FIGS. 17A and 17B illustrate graphs showing the feature importance for a Random Forest-based private domain classifier 140 for the two datasets GT1 and GT2.
- FIGS. 18A and 18B illustrate graphs showing the t-SNE for a Random Forest-based private domain classifier 140 for the two datasets GT1 and GT2.
- FIGS. 19A and 19B illustrate graphs showing the precision-recall for a Random Forest-based private domain classifier 140 for the two datasets GT1 and GT2.
- the present application provides a new and innovative malicious domain hosting type classification system and method.
- Early knowledge of what hosting type a malicious URL is coming from helps security operators take appropriate actions.
- the distinction between public and private apex domains has a profound impact on the inference and prediction of malicious domains, especially when it relies on the association of subdomains belonging to the same apex domain. Further, once malicious websites are detected, the actions against the hosting apex domains would be different depending on whether they are public or private.
- the provided classification system identifies public and private apex domains based on a key observation that subdomains of private apex domains have more consistent behavior and properties compared to those of public apex domains.
- the classification system may determine whether a hosting domain marked as malicious is compromised or attacker-owned. For example, once the provided system identifies a malicious website as hosted in a private apex domain, the provided system may further classify the apex domain based on its owner.
- a malicious website is either created by attackers on their own registered domains (e.g. getbinance.org) or on compromised benign domains (e.g. questionpro.com). In the latter case, legitimate domains exploited for malicious activities are victim domains. Takedown strategies and who should be contacted differ depending on the type of the apex domain.
- Early detection of compromised domains helps owners identify root causes, take corrective measures, and control reputation damage, while Security Operation Center (SOC) teams may temporarily block such victim domains to protect their users.
- attacker-owned domains would require completely different actions. They are usually first blacklisted to contain the immediate damage. They could be further shut down through third-party takedown services, domain registration deletion, or ownership transfer if they are involved in cybersquatting.
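The mitigation guidance given in this disclosure (block public apexes at the subdomain level, block compromised private apexes temporarily, block attacker-owned private apexes permanently) can be written down as a lookup table. The sketch below is illustrative only: the `Mitigation` type, the string keys, and the `"unspecified"` duration for public domains are assumptions, since the text does not state how long a malicious subdomain of a public apex should stay blocked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    block_level: str  # "subdomain" or "apex"
    duration: str     # "temporary", "permanent", or "unspecified"


# Hypothetical keys; the mapping itself follows the guidance in the text.
_MITIGATIONS = {
    "public": Mitigation(block_level="subdomain", duration="unspecified"),
    "compromised_private": Mitigation(block_level="apex", duration="temporary"),
    "attacker_owned_private": Mitigation(block_level="apex", duration="permanent"),
}


def suggest_mitigation(hosting_type: str) -> Mitigation:
    """Return the suggested blocking scope for a malicious hosting type."""
    try:
        return _MITIGATIONS[hosting_type]
    except KeyError:
        raise ValueError(f"unknown hosting type: {hosting_type!r}") from None


print(suggest_mitigation("compromised_private"))
```

A table keeps the policy in one place, so an operator could adjust it without touching the classifiers.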
- the inventors have found that the provided classifier achieves a 97.2% accuracy with 97.7% precision and 95.6% recall with respect to identifying public and private apex domains. In addition, the inventors have found that the provided classifier achieves a 96.4% accuracy with 99.1% precision and 92.6% recall with respect to determining whether a malicious hosting domain is compromised or attacker-owned.
- an apex domain is defined as a public apex domain if its subdomains (e.g., alice.000webhostapp.com) or pages (e.g., sites.google.com/alice) are not created by, and are not under the control of, the owner of the apex domain (e.g., 000webhostapp.com).
- an apex domain is defined as a private apex domain if its subdomains (e.g., careers.nsa.gov) are created and managed by the owner of the apex domain (e.g., nsa.gov).
- FIG. 1 illustrates a block diagram of an example system 100.
- the example system 100 may include an example classification system 110 that classifies the hosting type of a malicious domain.
- the example classification system 110 may automatically label malicious websites (i.e., URLs) as attacker-owned public domains (e.g., 000webhostapp.com), compromised (private) domains (e.g., questionpro.com), or attacker-owned (private) domains (e.g., getbinance.org).
- the classification system 110 may be in communication with at least one reputation system 160 over a network 150.
- the network 150 can include, for example, the Internet or some other data network, including, but not limited to, any suitable wide area network or local area network.
- the reputation system 160 may be any suitable blacklist or reputation system that provides a reputation (e.g., whether they are malicious) of websites, or URLs.
- the reputation system 160 is the VirusTotal (VT) system.
- VT is a known reputation service that provides aggregated intelligence on any URL by consulting third-party anti-virus tools and URL/domain reputation services. Each of these tools is referred to herein as a scanner. VT aggregates the query results every second and makes them available for subscribed users as a feed.
- the reputation system 160 may be generated/maintained by Google Safe Browsing (GSB), Phishtank, Anti-Phishing Working Group (APWG), McAfee Site Advisor (SA), or other suitable blacklists or reputation systems.
- the classification system 110 may be in communication with more than one blacklist or reputation system.
- the classification system 110 may include a processor in communication with a memory 114.
- the processor may be a CPU 112, an ASIC, or any other similar device.
- the classification system 110 may include a display 116.
- the display 116 may be any suitable display for displaying information.
- the classification system 110 may include a malicious domain identifier 120.
- the malicious domain identifier 120 may identify a malicious domain based on information received from the reputation system 160.
- the classification system 110 may include a public domain classifier 130.
- the public domain classifier 130 may determine whether a malicious domain is a public apex domain or a private apex domain.
- the classification system 110 may include a private domain classifier 140.
- the private domain classifier 140 may determine whether a private apex domain is compromised or attacker-owned.
- Each of the malicious domain identifier 120, the public domain classifier 130, and the private domain classifier 140 may be implemented by software executed by the CPU 112. In other examples, the components of the classification system 110 may be combined, rearranged, removed, or provided on a separate device or server.
- the public domain classifier 130 may be a Random Forest classifier. In other examples, the public domain classifier may be a Support Vector Classification (SV), Extra Tree (ET), Logistic Regression (LR), Decision Tree (DT), Gradient Boosting (GB), Ada Boosting (AB), or K-Neighbors (KN) classifier.
- the private domain classifier 140 may be a Random Forest classifier or an Extra Tree (ET) classifier.
- the public domain classifier may be a Support Vector Classification (SV), Logistic Regression (LR), Decision Tree (DT), Gradient Boosting (GB), Ada Boosting (AB), or K-Neighbors (KN) classifier.
- FIG. 2 shows a flow chart of an example method 200 for classifying the hosting type of a malicious domain.
- Although the example method 200 is described with reference to the flowchart illustrated in FIG. 2, it will be appreciated that many other methods of performing the acts associated with the method 200 may be used. For example, the order of some of the blocks may be changed, certain blocks may be combined with other blocks, and some of the blocks described are optional.
- the method 200 may be performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software, or a combination of both.
- the example method 200 may begin with identifying a malicious domain (block 202).
- the malicious domain identifier 120 may identify a malicious domain.
- the malicious domain identifier 120 may identify the malicious domain from a set of received URLs (e.g., from the reputation system 160). Out of all the URLs marked by at least one scanner of the reputation system 160, a domain may be identified that is highly likely to be malicious. In some aspects, a domain may be identified as highly likely to be malicious if a threshold number of scanners marks the domain.
- the provided classification system may utilize historical VirusTotal (VT) URL feed information when identifying a malicious domain.
- a URL marked by five or more scanners may be identified as malicious.
- a different threshold quantity of scanners marking a URL may be utilized to identify a URL as malicious.
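The scanner-count rule above amounts to a simple threshold test. The following is a minimal sketch, assuming scanner verdicts are available as a mapping from scanner name to a boolean flag; the function name and input shape are hypothetical, while the default of five comes from the text.

```python
from typing import Mapping


def is_likely_malicious(scanner_verdicts: Mapping[str, bool], threshold: int = 5) -> bool:
    """True when at least `threshold` scanners mark the URL as malicious."""
    positives = sum(1 for flagged in scanner_verdicts.values() if flagged)
    return positives >= threshold


# Hypothetical scanner names, for illustration only.
verdicts = {"scanner_a": True, "scanner_b": True, "scanner_c": True,
            "scanner_d": True, "scanner_e": True, "scanner_f": False}
print(is_likely_malicious(verdicts))               # five positives, default threshold
print(is_likely_malicious(verdicts, threshold=6))  # stricter threshold
```

Exposing the threshold as a parameter reflects the statement that a different quantity of scanners may be used.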
- FIG. 3 illustrates a graph of a comparison of VT URL intelligence against SA and GSB, where "ext mal" corresponds to the percentage of URLs from each class of VT that are marked as malicious by SA or GSB.
- the majority of VT marked malicious URLs are not identified as malicious by either SA or GSB.
- for URLs marked by five or more scanners (#scanners ≥ 5), the majority (over 70%) are in agreement with the external intelligence from SA and GSB.
- the malicious domain identifier 120 continuously profiles domains observed in the VT URL feed.
- the malicious domain identifier 120 incrementally builds an aggregated record for each Fully Qualified Domain Name (FQDN).
- the profile record for a given FQDN may include the first time seen, the time last seen, number of times scanned, number of times marked malicious, and/or corresponding URLs and VT scan summaries.
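The incremental profile described above can be sketched as a per-FQDN record that is updated each time a scan arrives. This is an illustrative sketch: the `FqdnProfile` fields mirror the attributes listed in the text, but the record layout, the `update_profiles` helper, and the rule that a scan counts as malicious when at least one scanner flags it are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional, Set
from urllib.parse import urlsplit


@dataclass
class FqdnProfile:
    fqdn: str
    first_seen: datetime
    last_seen: datetime
    times_scanned: int = 0
    times_marked_malicious: int = 0
    urls: Set[str] = field(default_factory=set)


def update_profiles(profiles: Dict[str, FqdnProfile], url: str,
                    scanned_at: datetime, positives: int) -> Optional[FqdnProfile]:
    """Fold one scan result into the aggregated record for the URL's FQDN."""
    fqdn = urlsplit(url).hostname  # lowercased host, or None if absent
    if not fqdn:
        return None
    profile = profiles.get(fqdn)
    if profile is None:
        profile = FqdnProfile(fqdn, first_seen=scanned_at, last_seen=scanned_at)
        profiles[fqdn] = profile
    profile.first_seen = min(profile.first_seen, scanned_at)
    profile.last_seen = max(profile.last_seen, scanned_at)
    profile.times_scanned += 1
    if positives > 0:  # assumption: any positive scanner marks the scan malicious
        profile.times_marked_malicious += 1
    profile.urls.add(url)
    return profile


profiles: Dict[str, FqdnProfile] = {}
update_profiles(profiles, "http://alice.000webhostapp.com/login", datetime(2019, 8, 2), 6)
update_profiles(profiles, "http://alice.000webhostapp.com/index", datetime(2019, 8, 1), 0)
print(profiles["alice.000webhostapp.com"])
```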
- the malicious domain identifier 120 may identify a malicious domain from URLs received from the reputation system 160 when the reputation system 160 is a blacklist or reputation system other than VT. In some aspects, the malicious domain identifier 120 may identify a malicious domain from URLs received from more than one reputation system 160. For instance, the malicious domain identifier 120 may crosscheck results from one reputation system 160 with results from another reputation system 160.
- it may then be determined whether an identified malicious domain is a public domain or a private domain (block 204).
- the public domain classifier 130 may determine whether the identified malicious domain is a public domain or a private domain.
- Publicly available lists such as browser public suffix list, CDN lists, dynamic DNS lists, popular webhosting domains or proxy services can be useful, but they can also be prohibitively restrictive as they are slow to keep up-to-date, and thus tend to mistakenly include many non-existent domains and meanwhile miss newly appeared public domains.
- a ground truth data set for the public domain classifier 130 may be collected as follows.
- Publicly available lists including the public suffix list, popular web hosting providers and CDN lists, and dynamic DNS lists, may be aggregated and the intersection with apex domains in datasets DS1 and DS2 may be taken.
- Potential public domains may be identified by searching over our datasets for the keywords likely to be used by public apex domains such as hosting, free, web, share, upload, drop, cdn, file, photo, and proxy. Random samples of five hundred apex domains may be taken from DS1 and DS2 respectively.
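The two candidate-collection steps above (intersecting known public lists with the dataset, then searching for indicative keywords) can be sketched as set operations. The function name and input shapes are hypothetical; the keyword list is the one given in the text. The output is only a tentative set that still needs the manual verification described in the disclosure.

```python
from typing import Iterable, Set

PUBLIC_KEYWORDS = ("hosting", "free", "web", "share", "upload",
                   "drop", "cdn", "file", "photo", "proxy")


def tentative_public_apexes(dataset_apexes: Iterable[str],
                            known_public_lists: Iterable[Iterable[str]],
                            keywords: Iterable[str] = PUBLIC_KEYWORDS) -> Set[str]:
    """Collect apex domains that are candidates for the public ground truth."""
    dataset = {apex.lower() for apex in dataset_apexes}
    listed: Set[str] = set()
    for public_list in known_public_lists:
        listed.update(apex.lower() for apex in public_list)
    keywords = tuple(keywords)
    from_lists = dataset & listed  # listed apexes actually seen in the data
    from_keywords = {apex for apex in dataset
                     if any(keyword in apex for keyword in keywords)}
    return from_lists | from_keywords


candidates = tentative_public_apexes(
    ["000webhostapp.com", "nsa.gov", "duckdns.org"],
    [["duckdns.org", "unseen-provider.example"]],
)
print(sorted(candidates))
```

Taking the intersection first drops listed domains that never appear in the dataset, which addresses the stale-entry problem noted for public lists.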
- a tentative private domain ground truth data may be collected by randomly selecting 1000 apex domains from each dataset (DS1 and DS2) that are mutually exclusive from the tentative public dataset. From these tentative ground truth sets, manual verification may be done to create the final ground truth sets. For each apex domain, a confidence score can be assigned between 50 and 100 to indicate a confidence in the label, with 100 being the most confident and 50 being undecided. To improve the quality of labeling, two domain experts performed the labeling for all the domains and the domains with conflicting labels were excluded.
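The two-expert labeling step can be sketched as a reconciliation function that keeps only the domains on which both experts agree. The function name and data layout are hypothetical, and carrying the lower of the two confidence scores forward is an assumption made for illustration; the text says only that conflicting labels were excluded.

```python
from typing import Dict, Tuple

Label = Tuple[str, int]  # (class label, confidence score between 50 and 100)


def reconcile_labels(expert_a: Dict[str, Label],
                     expert_b: Dict[str, Label]) -> Dict[str, Label]:
    """Keep domains labeled identically by both experts; drop conflicts."""
    agreed: Dict[str, Label] = {}
    for domain in expert_a.keys() & expert_b.keys():
        label_a, confidence_a = expert_a[domain]
        label_b, confidence_b = expert_b[domain]
        if label_a == label_b:
            # assumption: keep the more cautious of the two confidence scores
            agreed[domain] = (label_a, min(confidence_a, confidence_b))
    return agreed


first = {"000webhostapp.com": ("public", 100), "nsa.gov": ("private", 90),
         "ambiguous.example": ("public", 60)}
second = {"000webhostapp.com": ("public", 95), "nsa.gov": ("private", 100),
          "ambiguous.example": ("private", 55)}
ground_truth = reconcile_labels(first, second)
print(ground_truth)
```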
- the public domain classifier 130 may take into account at least some of the features detailed in the Table 1 below to determine whether a malicious domain is a public domain or a private domain.
- FIGS. 4 and 5 illustrate graphs showing the AUCs of the two ROC curves are 96% and 99% for GT1 and GT2 respectively, demonstrating high degrees of separability of the two classes.
- the FQDNs associated with such public domains are created by attackers and the number of such FQDNs could be utilized to assess the reputation of public domains.
- a public domain may be categorized into one of seven groups: Dynamic DNS, Web Proxy Services, CDN, Web hosting, Blogging and content hosting, content-sharing services, and shorteners and forms.
- if the identified malicious domain is a private domain, it may be determined whether the private domain is a compromised domain or an attacker-owned domain (block 206). For instance, the private domain classifier 140 may determine whether the private domain is a compromised domain or an attacker-owned domain.
- compromised domains have very different contents compared to the main website and the auxiliary information such as hosting IPs are different for the main website (reputed hosting provider) and the domain under consideration (bullet proof hosting).
- attacker-owned domains have relatively new registration information, are likely to utilize fast flux networks, are short-lived (likely to be NX domain), and are blacklisted.
- AC-GT1 (AC stands for attacker-owned/ compromised) and AC-GT2 may be manually created from the private domains identified from DS1 and DS2 respectively using our public/private classifier.
- a random sample of 2500 domains from each of DS1 and DS2 may be selected.
- manual inspection may be performed of each sample and a confidence score may be provided to indicate how confident the domain experts are about the label.
- the following information and sources are manually inspected to decide if a malicious apex is compromised or attacker-owned.
- auxiliary information was checked such as registration information including historical WHOIS records, hosting information, and PDNS information.
- the detailed report from two threat intelligence platforms, riskiq.com and otx.alienvault.com was also checked. Further, detailed reports were inspected from two reputation services, GSB and SA.
- In order to identify compromised domains, deviations of the visual and auxiliary information in the apex domain and the domain under consideration were relied upon.
- the inventors observed that compromised domains have very different contents compared to the main website and the auxiliary information such as hosting IPs are different for the main website (reputed hosting provider) and the domain under consideration (bullet proof hosting).
- attacker-owned domains have relatively new registration information, are likely to utilize fast flux networks, are short-lived (likely to be NX domain), and are blacklisted. After manual verification, the high-confidence labels were selected.
- the private domain classifier 140 takes into account at least five groups of features: lexical, VT report, VT profile, PDNS, and Alexa features.
- Lexical features capture the properties of the URL under consideration.
- VT report features include those attributes that are directly available from VT reports. VT profile features are extracted from the VT NOD system, and PDNS features are extracted from the Farsight Passive DNS DB.
- Most of the lexical, Alexa and PDNS features are known from previous research of detecting malicious domains or URLs.
- the table illustrated in FIG. 6 shows various features of the five features groups that the private domain classifier 140 may take into account.
- the novel features that the private domain classifier takes into account include VT_Duration, Positive_Count, Domain_Malicious, #Total_Scans, #Benign_Scans, Sibling_Malicious, SOA_Domains_Nos, and SOA_Domain.
- VT Report Features are directly extracted from the VT reports.
- the inventors observed that the VT_Duration feature for compromised domains tends to be higher than that for attacker-owned domains.
- compromised domains are in general harder to detect by existing systems as attackers are exploiting the reputation of legitimate domains. For the same reason, the inventors observed that the number of scanners that mark a compromised site as malicious is less than that for attacker-owned sites. Positive_Count captures this observation.
- attackers more often use compromised domains as a redirection site in order to evade detection.
- VT profile features capture the intuition that almost all subdomains and scans of attacker-owned domains are malicious whereas only some of the subdomains and scans of compromised domains are malicious.
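The intuition behind the VT profile features can be illustrated by computing, for one apex domain, the fraction of its subdomains and the fraction of its scans that were marked malicious. This is a sketch under stated assumptions: the function name, the input shape (FQDN mapped to per-scan positive scanner counts), and the exact ratio definitions are hypothetical and are not the feature definitions of FIG. 6.

```python
from typing import Mapping, Sequence, Tuple


def malicious_ratios(fqdn_scans: Mapping[str, Sequence[int]]) -> Tuple[float, float]:
    """Return (share of FQDNs ever marked malicious, share of malicious scans)."""
    total_fqdns = len(fqdn_scans)
    malicious_fqdns = sum(
        1 for scans in fqdn_scans.values() if any(p > 0 for p in scans)
    )
    total_scans = sum(len(scans) for scans in fqdn_scans.values())
    malicious_scans = sum(
        1 for scans in fqdn_scans.values() for p in scans if p > 0
    )
    fqdn_ratio = malicious_fqdns / total_fqdns if total_fqdns else 0.0
    scan_ratio = malicious_scans / total_scans if total_scans else 0.0
    return fqdn_ratio, scan_ratio


# Toy data: every scan of the first apex is positive, few of the second.
attacker_like = {"a.evil.example": [7, 9], "b.evil.example": [6]}
compromised_like = {"www.shop.example": [0, 0, 0],
                    "blog.shop.example": [0, 5],
                    "mail.shop.example": [0]}
print(malicious_ratios(attacker_like))
print(malicious_ratios(compromised_like))
```

Under this toy data the attacker-like apex has both ratios at 1.0, while the compromised-like apex has much lower ratios, which is the separation the features are meant to capture.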
- the number of authoritative name servers and the number of SOA domains capture the observation that attacker-owned domains change their hosting providers more often than benign domains in order to evade detection or takedown. Additionally, attackers drop-catch domains to exploit the residual trust in them, which also results in a domain being associated with multiple name servers. Comparison of apex domains with name server domains and SOA features captures the observation that benign domains are more likely to be hosted in their own servers compared to attacker-owned ones.
- the present disclosure improves upon several lexical features presented in previous works. Specifically, the inventors observed that attacker-owned domains more often use these squatting methods to impersonate brands compared to compromised domains.
- the present disclosure profiles Alexa Top 1M domains over 1 year to identify Alexa top 1000 brands to detect combosquatting, levelsquatting and target embedding domains, which are shown to be more than a hundred times more prevalent than more traditional squatting types.
- Features Brand, Similar, and Pop Keywords capture new squatting tactics used by attackers.
- In addition to VT features shown in the table of FIG. 6, the private domain classifier 140 considers three new classes of features, PDNS, Alexa and lexical features, to improve classification performance. It indeed improves the performance metrics, and as shown in FIG. 7, several classifiers including GB, ET and RF perform quite well, resulting in an accuracy slightly above 90% with 10-fold cross validation for AC-GT1.
- FIG. 8 illustrates a graph showing the ROC curves and feature importance in an example in which the private domain classifier 140 is a Random Forest classifier. The private domain classifier 140 achieves 90.6% accuracy with 94.7% precision and 86.1% recall. An important consideration in building robust machine learning models is that the model should generalize to different ground truth datasets.
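For readers less familiar with the reported metrics, accuracy, precision, and recall follow from the four confusion-matrix counts using their standard definitions. The sketch below uses made-up counts purely to show the arithmetic; it does not reproduce the reported figures, and the function name is hypothetical.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative counts only (not taken from the disclosure).
metrics = classification_metrics(tp=8, fp=2, fn=1, tn=9)
print(metrics)
```

High precision with lower recall, as reported for the private domain classifier, means that few domains are wrongly flagged as the positive class while some positive-class domains are missed.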
- FIG. 9 illustrates a graph showing the ROC curve for an example in which the private domain classifier 140 is a Random Forest classifier.
- the inventors have made various insights of the VT URL Feed dataset that help determine the features used in the public domain classifier 130 and the private domain classifier 140.
- the VT URL Feed dataset contains 814,678,956 unique URLs in the period from Aug. 1, 2019 to Nov. 18, 2019. Note that the same URL may be scanned multiple times in a given day. Each new scan is considered a different one. However, if VT is simply queried multiple times to retrieve an existing report instead of triggering new scans, it does not change the scan ID. Hence, such multiple reports with the same scan ID are considered as one record. It was observed that the daily average of observed likely benign scans (i.e., #scanners = 0) is 89.3% of the total number of scans, which is around 4.8M.
- the inventors observed that, on average, malicious URLs are scanned 6 times whereas benign URLs are scanned only twice. This follows general user behavior where the more suspicious the URLs are, the more they are checked. Another observation was that the daily average scan count is roughly twice the average URL count.
- the inventors also compared the coverage of malicious websites in the inventors’ dataset with that of typical blacklists and reputation services. While there are many VT reports with 1 or 2 #scanners, on average 45.7% of the malicious scans have 5 or more #scanners (i.e. the top two areas in the Figure). The inventors focused on categorizing scans with 5 or more #scanners, which corresponds to 1659K weekly malicious reports on average. This corresponds to 276K malicious websites weekly on average. In comparison, Google Transparency Report and Phishtank report around 50K and 4K per week respectively. This shows that the classification system 110 is trained on a much larger set of malicious URLs compared to popular blacklists and thus has a higher impact.
- VT scanners assign each malicious URL one of the following class labels: malicious, malware, phishing, mining and suspicious site. Since VT scanners most of the time assign conflicting class labels, a simple majority voting heuristic may be used to derive the final class label for a malicious website. For example, the inventors took random samples of 100 websites of each class type and manually cross-checked them against several publicly available blacklists or APIs including Phishtank, GSB and SA. The inventors’ manual inspection showed that more than 98% of the labels using majority voting are in agreement with external intelligence, validating the heuristic. While malware and phishing sites dominate the reported malicious websites, there are only a few malicious mining sites and suspicious sites in the dataset.
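The majority voting heuristic can be sketched in a few lines. The function name is hypothetical, and the tie-breaking behavior (the label encountered first among equally frequent labels wins) is an assumption of this sketch; the text does not say how ties are resolved.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_label(scanner_labels: Iterable[str]) -> Optional[str]:
    """Return the class label assigned by the most scanners, or None if empty."""
    counts = Counter(scanner_labels)
    if not counts:
        return None
    # most_common keeps first-seen order among ties (assumed tie-break rule)
    label, _ = counts.most_common(1)[0]
    return label


print(majority_label(["phishing", "malware", "phishing", "malicious", "phishing"]))
```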
- the determined malicious domain hosting type may then be displayed (block 208).
- the classification system 110 may display the determined malicious domain hosting type on the display 116.
- the displayed malicious domain hosting type may be displayed with the malicious domain URL.
- the determined malicious domain hosting type may be a public domain (e.g., an attacker-owned public domain), a compromised private domain, or an attacker-owned private domain.
- Security operators may view the determined malicious domain hosting type and the malicious domain's URL on the display 116 to determine and take the appropriate actions.
- The inventors’ analysis identified 6,675 malicious public apex domains and 725,325 malicious private apex domains in both datasets. In other words, only 0.91% of the apex domains in the VT URL feed are public. However, the inventors observed a high proportion of URLs and scans belonging to these public apex domains. Out of all reports, 46.5% of URLs are hosted on public apex domains. This observation is consistent with the fact that public apex domains host many subdomains, whereas private apex domains generally host only a few.
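The 0.91% figure follows directly from the two apex domain counts given above. As a quick check, a small sketch of the arithmetic (the function name `public_share_percent` is illustrative and does not come from the source):

```python
def public_share_percent(public_apex_count, private_apex_count):
    """Percentage of all apex domains that are public."""
    total = public_apex_count + private_apex_count
    return 100.0 * public_apex_count / total


# Counts reported in the text: 6,675 public and 725,325 private apex domains.
share = public_share_percent(6_675, 725_325)
print(f"{share:.2f}%")  # prints "0.91%"
```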
- FIG. 11A illustrates a graph showing the number of FQDNs per apex for the two categories of apex domains, public and private. More than 80% of public apex domains have more than 20 FQDNs, whereas 95% of private apex domains have fewer than 10 FQDNs. While many of the public domains have a large number of subdomains, there is a long tail of public domains with a huge number of subdomains (over 200K).
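The per-apex FQDN counts plotted in FIG. 11A could be computed along the following lines. This is a sketch under assumptions: the input is taken to be `(fqdn, apex)` pairs in which the apex has already been resolved (for instance with a public suffix list, which is not implemented here), and the function name `fqdns_per_apex` is hypothetical.

```python
from collections import defaultdict


def fqdns_per_apex(pairs):
    """Count distinct FQDNs observed under each apex domain.

    `pairs` is an iterable of (fqdn, apex) tuples. Names are lowercased and
    a trailing root dot is dropped so duplicates are not counted twice.
    """
    seen = defaultdict(set)
    for fqdn, apex in pairs:
        seen[apex.lower()].add(fqdn.lower().rstrip("."))
    return {apex: len(names) for apex, names in seen.items()}


# Hypothetical observations; the third entry duplicates the first.
observations = [
    ("a.example.com", "example.com"),
    ("b.example.com", "example.com"),
    ("A.example.com.", "example.com"),
    ("shop.test.org", "test.org"),
]
counts = fqdns_per_apex(observations)
```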
- FIG. 11B illustrates a graph showing the average Alexa ranking distribution for public and private apex domains. For unranked domains, the insignificant rank of 1 million was assigned for better visualization. It is not surprising that public domains have better (numerically lower) average Alexa rankings than private domains, as public apex domains are accessed more frequently by users. An interesting result is that half of the public domains are not popular (unranked), showing that attackers also create subdomains on less popular public domains to launch attacks. Since public apex domains host many benign domains, current registration-based and domain-reputation-based systems and inference-based systems may inadvertently blacklist public apex domains, disrupting benign sites.
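One way to compute an average rank while applying the 1 million placeholder for unranked domains is sketched below. The representation of an unranked observation as `None`, the treatment of an empty list as unranked, and the function name `average_rank` are all assumptions; the source only states that unranked domains were assigned a rank of 1 million.

```python
UNRANKED_RANK = 1_000_000  # placeholder rank stated in the text


def average_rank(observed_ranks, unranked_rank=UNRANKED_RANK):
    """Average a domain's rank observations, filling gaps with the placeholder.

    `observed_ranks` is a list of integers or None (None = unranked in that
    observation). An empty list is treated as fully unranked (an assumption).
    """
    if not observed_ranks:
        return float(unranked_rank)
    filled = [unranked_rank if r is None else r for r in observed_ranks]
    return sum(filled) / len(filled)


ranked_avg = average_rank([100, 300])          # always ranked
mixed_avg = average_rank([500_000, None])      # ranked once, unranked once
never_ranked_avg = average_rank([None, None])  # never ranked
```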
- The domain lifetime can be estimated by taking the lifetime of the PDNS footprint for each apex domain.
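The lifetime estimate described above can be sketched as follows. This assumes each PDNS record for an apex domain carries a first-seen and a last-seen date, as passive DNS data commonly does, and that the footprint lifetime is the span from the earliest first-seen to the latest last-seen date. The record layout and the function name `apex_lifetime_days` are hypothetical.

```python
from datetime import date


def apex_lifetime_days(pdns_records):
    """Estimate an apex domain's lifetime in days from its PDNS footprint.

    `pdns_records` is a list of (first_seen, last_seen) date pairs, one per
    PDNS record under the apex. Returns 0 when there are no records.
    """
    if not pdns_records:
        return 0
    earliest = min(first for first, _ in pdns_records)
    latest = max(last for _, last in pdns_records)
    return (latest - earliest).days


# Hypothetical footprint with two overlapping records.
records = [
    (date(2019, 1, 1), date(2019, 3, 1)),
    (date(2019, 2, 1), date(2019, 12, 31)),
]
lifetime = apex_lifetime_days(records)  # 364 days
```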
- FIG. 11C illustrates a graph showing the domain lifetime distribution for public and private apex domains.
- The inventors observed that public domains are longer lived than private domains, even though a large majority of them are unranked sites, providing a free platform for attackers to launch their attacks over prolonged time periods. Further, around 10% of private domains are very short-lived, indicating that they are likely to be attacker-owned domains.
- The private domain classifier 140 detects that 65.6% of the apex domains in the VT URL feed are compromised, indicating that there are more compromised websites than attacker-owned ones. This observation is consistent with prior work done on phishing websites and public threat intelligence reports.
- FIG. 12A illustrates a graph showing #FQDNs per apex for compromised and attacker-owned domains.
- FIG. 12B illustrates a graph showing the average Alexa rank distribution for compromised and attacker-owned apex domains. As expected, most of the attacker-owned domains have either a low Alexa rank or no rank. However, it is interesting to note that there are some attacker-owned domains with Alexa ranking below 100K. Another interesting observation is that about 10% of compromised domains are unranked, indicating that attackers launch attacks from less popular benign websites as well, which could be utilized to launch attacks such as DDoS that do not require reputed sites.
- FIG. 12C illustrates a graph showing the domain lifetime distribution for compromised and attacker-owned apex domains. It is not surprising that compromised domains in general live longer than attacker-owned ones. However, about 40% of attacker-owned domains are active for more than 200 days, indicating that there is a need to develop better techniques to detect these malicious domains early and take appropriate actions. One reason for their long duration is that attackers register domains and park them for a while as an evasive technique.
- FIG. 13 illustrates a feature correlation matrix for the features used in the public domain classifier 130.
- FIGS. 14A and 14B illustrate graphs showing the feature importance for a Random Forest-based public domain classifier 130 for the two datasets GT1 and GT2. The feature importance graphs indicate which features are important in constructing the model.
- FIGS. 15A and 15B illustrate graphs showing the t-SNE for a Random Forest-based public domain classifier 130 for the two datasets GT1 and GT2.
- The t-SNE graphs utilize a nonlinear dimensionality reduction technique to embed the feature vectors into a two-dimensional space for visualization. They show how the two classes are clustered based on the features collected.
- One reason for the better performance in the second ground truth set is that, as shown in FIGS.
- FIGS. 16A and 16B illustrate graphs showing the precision-recall for a Random Forest-based public domain classifier 130 for the two datasets GT1 and GT2.
- FIGS. 17A and 17B illustrate graphs showing the feature importance for a Random Forest-based private domain classifier 140 for the two datasets GT1 and GT2.
- FIGS. 18A and 18B illustrate graphs showing the t-SNE for a Random Forest-based private domain classifier 140 for the two datasets GT1 and GT2.
- FIGS. 19A and 19B illustrate graphs showing the precision-recall for a Random Forest-based private domain classifier 140 for the two datasets GT1 and GT2.
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP21787700.0A EP4136592A4 (en) | 2020-04-13 | 2021-04-13 | Malicious domain hosting type classification systems and methods |
| JP2022562116A JP7686667B2 (en) | 2020-04-13 | 2021-04-13 | Malicious domain hosting type classification system and method |
| CN202180028248.7A CN115812200A (en) | 2020-04-13 | 2021-04-13 | Malicious domain hosting type classification system and method |
| AU2021257379A AU2021257379A1 (en) | 2020-04-13 | 2021-04-13 | Malicious domain hosting type classification systems and methods |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202063009151P | 2020-04-13 | 2020-04-13 | |
| US63/009,151 | 2020-04-13 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021210998A1 (en) | 2021-10-21 |
Family
ID=78084361
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/QA2021/050004 Ceased WO2021210998A1 (en) | 2020-04-13 | 2021-04-13 | Malicious domain hosting type classification systems and methods |
Country Status (5)
| Country | Link |
|---|---|
| EP (1) | EP4136592A4 (en) |
| JP (1) | JP7686667B2 (en) |
| CN (1) | CN115812200A (en) |
| AU (1) | AU2021257379A1 (en) |
| WO (1) | WO2021210998A1 (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR102645870B1 (en) * | 2023-07-24 | 2024-03-12 | 주식회사 누리랩 | Method and apparatus for detecting url associated with phishing site using artificial intelligence algorithm |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20180343272A1 (en) * | 2017-05-26 | 2018-11-29 | Qatar Foundation | Method to identify malicious web domain names thanks to their dynamics |
| US20190095512A1 (en) * | 2015-08-07 | 2019-03-28 | Cisco Technology, Inc. | Domain classification based on domain name system (dns) traffic |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8521667B2 (en) * | 2010-12-15 | 2013-08-27 | Microsoft Corporation | Detection and categorization of malicious URLs |
| US10176335B2 (en) | 2012-03-20 | 2019-01-08 | Microsoft Technology Licensing, Llc | Identity services for organizations transparently hosted in the cloud |
| WO2018163464A1 (en) | 2017-03-09 | 2018-09-13 | 日本電信電話株式会社 | Attack countermeasure determination device, attack countermeasure determination method, and attack countermeasure determination program |
| WO2018164701A1 (en) * | 2017-03-10 | 2018-09-13 | Visa International Service Association | Identifying malicious network devices |
| JP6869833B2 (en) | 2017-07-05 | 2021-05-12 | Kddi株式会社 | Identification device, identification method, identification program, model generation device, model generation method and model generation program |
2021
- 2021-04-13 JP JP2022562116A patent/JP7686667B2/en active Active
- 2021-04-13 AU AU2021257379A patent/AU2021257379A1/en active Pending
- 2021-04-13 EP EP21787700.0A patent/EP4136592A4/en active Pending
- 2021-04-13 WO PCT/QA2021/050004 patent/WO2021210998A1/en not_active Ceased
- 2021-04-13 CN CN202180028248.7A patent/CN115812200A/en active Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4136592A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| AU2021257379A1 (en) | 2022-11-03 |
| EP4136592A4 (en) | 2024-05-15 |
| JP2023525653A (en) | 2023-06-19 |
| EP4136592A1 (en) | 2023-02-22 |
| CN115812200A (en) | 2023-03-17 |
| JP7686667B2 (en) | 2025-06-02 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21787700; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2022562116; Country of ref document: JP; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 2021257379; Country of ref document: AU; Date of ref document: 20210413; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2021787700; Country of ref document: EP; Effective date: 20221114 |