WO2024223052A1 - Apparatus and method for auditing rule-based command risk assessment systems - Google Patents
- Publication number
- WO2024223052A1 (PCT/EP2023/061225)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- command
- commands
- rules
- classification
- documentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/554—Detecting local intrusion or implementing counter-measures involving event detection and direct action
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
Definitions
- The present disclosure relates to risk assessment systems, in particular to the auditing of rules in command risk assessment systems.
- Rule-based systems for command risk assessment can apply sets of whitelist/blacklist rules to classify and handle commands.
- a common solution for the above scenario is schematically illustrated in Fig. 2, where a command interception system 203, deployed on network gateways between an operator 201 and a production host 202, contains a rule-based classifier, where IF-THEN-ELSE rules define which commands are allowed (or whitelisted) and which commands are blocked (or blacklisted).
- the rules are stored at a rule database 204 and may be defined based on the expertise of O&M operators and the historical records of executed commands. In large-scale environments, this approach requires to regularly maintain and update existing rules to reflect the dynamicity in the environment. New, unseen commands may be executed, for which the existing rules may not be suitable.
- An auditing system can be used in the command risk assessment system to evaluate whether the risk assignment performed by a rule-based classifier is correct.
- Direct auditing systems rely on human expertise (for example in the form of administrator 206) to confirm if the risk assignment performed by a rule-based classifier is correct. If they are exclusively based on human feedback, they cannot evaluate if a past operator judgement is wrong, because their judgement represents the ground truth. Therefore, such systems require high quality labels and cannot correct past misjudgements. Moreover, when no existing rule can match the current command exactly, the risk cannot be evaluated and human intervention is required.
- Auditing systems based on classification apply machine learning classifiers to predict the risk of commands when an existing rule-based classifier does not provide a rule or provides an incorrect rule. By providing a more accurate classification, rules can then be created and updated using the classifier knowledge. Such approaches cannot generalize to unseen commands that are significantly different from the commands used for training the classifier. This renders the auditing task difficult. Moreover, the design choices applied often do not enable the classifier to analyse the temporal and contextual relationships between command tokens, which are important to understand the behaviour of commands. Classification by similarity matching is very sensitive to the choice of the similarity function and to the availability of a large quantity of labelled commands, and may not be able to analyse all commands executed in a real-world environment.
- an apparatus for auditing rules in a rule-based command risk assessment system, the apparatus being communicatively connectable to one or more data stores storing one or more commands and a set of pre-existing rules for classifying the commands, the apparatus comprising one or more processors configured to, for each of at least one of the commands: receive a first predicted classification for the respective command, the first predicted classification being determined based on the set of pre-existing rules; form a second predicted classification for the respective command based on external documentation for the respective command; and if the first prediction differs from the second prediction, output a notification.
- an existing set of classification rules can be challenged and revised. This may provide a supervision system that allows the rules to reflect the correct risk of executed commands over time. This may improve risk assessment quality and reduce incident frequency.
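A minimal sketch of this auditing loop; `rule_predict`, `doc_predict` and `notify` are hypothetical callables standing in for the rule-based classifier, the documentation-based classifier and the notification channel, and are not names used in the disclosure:

```python
def audit_commands(commands, rule_predict, doc_predict, notify):
    """Audit historical rule-based predictions against documentation-based ones.

    For each command: take the first (rule-based) prediction, form the second
    (documentation-based) prediction, and notify when they differ.
    """
    mismatches = []
    for command in commands:
        first = rule_predict(command)   # classification from the pre-existing rules
        second = doc_predict(command)   # classification from external documentation
        if first != second:             # predictions differ: flag for revision
            notify(command, first, second)
            mismatches.append(command)
    return mismatches
```

The returned list of mismatching commands corresponds to the commands for which a notification is output.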
- the data store(s) may be, for example, a data management system or a database system, and may be stored in the cloud or in hardware and may be a computer-readable medium, such as a non-transitory computer-readable medium.
- the external documentation may be stored at the data store(s).
- the data store(s) may be communicatively coupled to a data network such as the internet to allow the external documentation stored at the data store(s) to be periodically updated. This may allow the external documentation to be kept up-to-date in view of code or software updates and the like.
- the set of rules may comprise one or more rules; preferably multiple rules.
- the one or more commands may be historical commands issued by an operator of a target host controlled by an access control system implementing the rule-based command risk assessment system according to the set of pre-existing rules, each historical command having a corresponding first predicted classification determined based on the set of pre-existing rules. This may allow the apparatus to audit sets of rules used in a rule-based risk assessment system to determine whether the rules are appropriate for classifying commands. If the set of rules does not provide appropriate classifications, the set of rules can be updated.
- the first classification and the second classification may each comprise an indication of a level of risk relating to the execution of the respective command.
- the classifications may be ‘Safe’ or ‘Risky’. This may allow the system to prevent the execution of potentially risky operations, by classifying risky operations as such before their execution.
- the external documentation may comprise one or more of the following: documentation describing standard operating system commands, files or system calls; internally-defined scripts, programs or aliases; and third-party programs, tools or internal filenames. This may allow the system to take into account information contained within the external documentation relating to the command in determining whether the set of rules are appropriate for classifying the command correctly.
- the one or more processors may be configured to process the external documentation or information derived therefrom (for example, a cleaned or compressed version of extracted information) and the command to form the second predicted classification. This may allow the system to take into account information contained within the external documentation relating to the command and a description of the command in determining the second prediction.
- the one or more processors may be configured to: retrieve relevant documentation for the respective command from the external documentation; extract information relevant to the respective command from the relevant documentation; and in dependence on the extracted information, form a command description for the respective command.
- the command description may be an explanation summary of the respective command. This may allow the operation of the command to be expressed.
- the command description may have a natural language format.
- the command may have a computer programming language format.
- the one or more processors may be configured to input the command description to an artificial intelligence model and form the second predicted classification in dependence on an output of the artificial intelligence model.
- the artificial intelligence model may comprise a language model such as BERT or GPT3
- the language model can generalize to unseen commands, subcommands, and flag combinations and preserve the sequential nature of the executed commands, hence considering the context in which flags and sub-commands are being used. This may improve the accuracy of classifications.
- the artificial intelligence model may comprise a first head for processing the command description.
- the artificial intelligence model may comprise a second head for processing the respective command.
- the first head may be an English language head and the second head may be a BERT head, such as a BERT model trained on a programming language such as Bash. Other models and languages may be used. This may allow the context of the command to be captured.
- the one or more processors may be configured to form the second predicted classification in dependence on a third classification prediction formed in dependence on one or more risk indications corresponding to one or more rules of the set of rules selected based on similarity with the respective command.
- the third classification prediction may be formed by aggregating the risk indications corresponding to rules retrieved based on similarity with the respective command.
- the third classification may be the output of a similarity search classification system. This may allow similar rules to be used to form the second classification when no existing rule can match the current command exactly.
- the one or more processors may be configured to form the second predicted classification by aggregating the output of the artificial intelligence model and the third classification prediction. This may improve the accuracy of the final prediction.
- the one or more processors may be configured to output respective notifications for multiple commands.
- the notifications may be pre-grouped by similarity and reported to an administrator of the system. This may improve the efficiency of the notification process.
- the one or more processors may be configured to, in dependence on the notification, update one or more of the rules of the pre-existing set of rules. This may allow the set of rules to be continuously updated to ensure that the set of rules are appropriate for classifying commands.
- a method for auditing rules in a rule-based command risk assessment system for an apparatus being communicatively connectable to one or more data stores storing one or more commands and a set of pre-existing rules for classifying the commands, the method comprising, for each of at least one of the commands: receiving a first predicted classification for the respective command, the first predicted classification being determined based on the set of pre-existing rules; forming a second predicted classification for the respective command based on external documentation for the respective command; and if the first prediction differs from the second prediction, outputting a notification.
- the method may allow an existing set of classification rules to be challenged and revised, such that the rules reflect the correct risk of executed commands over time. This may improve risk assessment quality and reduce incident frequency.
- the above method may be a computer-implemented method that is implemented on-device (for example, on a laptop or smartphone) or externally, such as on cloud-based services.
- a computer program comprising instructions that when executed by a computer cause the computer to perform the methods above.
- a computer-readable storage medium having stored thereon computer readable instructions that when executed at a computer (for example, comprising one or more processors) cause the computer to perform the methods above.
- the computer-readable storage medium may be a non-transitory computer-readable storage medium.
- the computer may be implemented as a system of interconnected devices.
- Fig. 1 illustrates classifying operations into ‘Risky’ and ‘Safe’ groups.
- Fig. 2 schematically illustrates an application scenario of a rule-based command risk assessment system and the corresponding auditing system.
- Fig. 3 schematically illustrates an exemplary system in which the artificial intelligence-based auditing system of embodiments of the present disclosure may be applied.
- Fig. 4 schematically illustrates an overview of an exemplary auditing system and its interaction with an existing rule-based risk assessment system.
- Fig. 5 schematically illustrates an exemplary documentation system of the auditing system.
- Fig. 6 schematically illustrates an exemplary classifier of the auditing system.
- Fig. 7 schematically illustrates an exemplary similarity search system that can optionally be implemented by the auditing system.
- Fig. 8 shows an exemplary operation sequence diagram for the auditing system described herein.
- Fig. 9 shows an example of a method for auditing rules in a rule-based command risk assessment system.
- Fig. 10 shows an example of an apparatus for auditing rules in a rule-based command risk assessment system and some of its associated components.
- the present disclosure concerns a system for auditing a rule-based risk assessment system for commands, which may be carried out in a cloud or other large-scale computing infrastructure.
- the system can analyse historical risk predictions to verify if past block/allow decisions for commands are correct based on additional external knowledge.
- the system can report possible discrepancies to system administrators.
- the system compares the output of the risk assessment system and the auditing prediction to evaluate the quality of the prediction provided by the existing system. If the predictions differ, the event is reported to system administrators for revision.
- Natural Language Processing (NLP) techniques are used for command risk classification.
- NLP problems include speech recognition, text summarization, and sentiment analysis. Sentiment analysis is a category of text classification used to determine the polarity of a given text, where the polarity can represent different aspects of language, such as positivity/negativity, objectivity/subjectivity, or hateful/supportive content.
- LLMs: transformer-based Large Language Models
- BERT: Bidirectional Transformers for Language Understanding
- GPT3: Generative Pre-trained Transformer 3
- LLMs can be applied to documentation text to evaluate the polarity in terms of dangerous/safe behaviour, i.e. if the English documentation thoroughly describes the behaviour of a command, the LLM may be able to classify whether that command constitutes a risk for the host system.
- a similar argument applies to the remote command itself which, as an instance of a programming language with predefined lemmas, syntax, context and structural dependencies, resembles a natural language, such as English, in many respects.
- Fig. 3 schematically illustrates a use-case diagram for an exemplary risk assessment system 300.
- An operator 301 accesses a production system for maintenance through a proxy 302.
- the proxy intercepts operations from a Secure Shell (SSH) bastion to be executed on a target host 303 and forwards them to the risk assessment system 304, where the risk of the operation represented by the command is evaluated.
- SSH: Secure Shell
- the risk assessment system 304 uses internal configuration resources, for example a rule management system 305, for estimating operation risk. If a request for an operation is safe, it is forwarded to the target host 303 and the result of the operation is forwarded back to the operator 301. However, if the requested operation is risky, the operation is blocked and the operator is notified that the requested operation is not permitted. Risk prediction classifications are stored in a command database 306 with the corresponding command for offline analysis.
- the auditing system 350 retrieves historical risk prediction classifications and current command pairs and analyses them, by estimating a new risk class, based on command documentation. If the historical prediction differs from the new risk class, the command is reported.
- the command may be reported to an expert reviewer (such as a system administrator) for revision.
- the expert reviewer can then decide to act and modify the runtime behaviour of the risk assessment system, to better reflect the correct risk class as reported by the auditing system. To do so, the expert reviewer can modify the rule management system which provides prediction capability to the risk assessment system.
- Anomalous predictions that are reported may be grouped by similarity, so that similar predictions are reported together.
- A complete diagram of an exemplary operating environment 400 is depicted in Fig. 4.
- the operator 401 is an engineer working on an O&M task.
- the engineer may perform actions on a terminal (not shown), approved through the access control system 402.
- the target host 403 may be a machine in production environment where the operator performs their tasks. Its access is controlled by the access control system 402.
- the administrator 404 may be an engineer from the O&M department. The administrator may be manually responsible for keeping the risk assessment system accurate and up-to-date. The system may also be automatically updated based on the notifications that are output by the system.
- Access control system 402 manages the execution of commands to the target host 403. It receives commands from the operator terminal and requests a risk prediction from the rule-based classifier 405. If the command is classified as SAFE, the command is forwarded to the target host 403 and the command output is forwarded back to the operator terminal. Otherwise, the command is blocked and an error message is reported to the operator. In any case, the access control system 402 stores the command and the corresponding prediction in the command database 406.
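The access control flow described above can be sketched as follows; `classify`, `execute` and `store` are hypothetical callables standing in for the rule-based classifier 405, the target host 403 and the command database 406:

```python
def handle_command(command, classify, execute, store):
    """Sketch of the access control flow: classify, forward or block, record.

    SAFE commands are forwarded to the target host; others are blocked with an
    error message. In either case the command/prediction pair is stored for
    offline auditing.
    """
    prediction = classify(command)
    if prediction == "SAFE":
        result = execute(command)                   # forward to the target host
    else:
        result = "ERROR: operation not permitted"   # block and report to operator
    store(command, prediction)                      # persist pair for offline auditing
    return result
```

The stored pairs are exactly the command-prediction pairs the auditing system later retrieves.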
- the rule-based classifier 405 evaluates the risk associated with each individual command.
- the rule-based classifier 405 can evaluate the risk associated with each command using a set of predefined rules.
- Rule database 407 stores the classification rules for evaluating commands.
- the rules may be IF-THEN-ELSE rules that define which commands are allowed (or whitelisted) and which commands are blocked (or blacklisted).
- Each rule is composed of a pattern and a label (for example, SAFE or RISKY). If a command matches the rule pattern, it is assigned the corresponding rule label.
- Rules are fetched from the rule database 407 and compared with the incoming command to find rule matches for the command. Based on the matched rules, the command is classified.
- the classification gives an indication of a level of risk relating to the execution of the respective command.
- a risk label corresponding to the classification is selected.
- the label can be, for example, SAFE or RISKY.
- a default behaviour for unmatched commands is defined (e.g. RISKY).
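A toy illustration of this rule-matching scheme; the use of regular expressions as the pattern syntax is an assumption (the disclosure does not fix one), and unmatched commands fall back to a default label:

```python
import re

def classify_by_rules(command, rules, default="RISKY"):
    """Return the label of the first rule whose pattern matches the command.

    Each rule is a (pattern, label) pair. If no rule matches, the default
    behaviour for unmatched commands applies.
    """
    for pattern, label in rules:
        if re.search(pattern, command):
            return label
    return default
```

For example, with `rules = [(r"^ls\b", "SAFE"), (r"rm\s+-rf", "RISKY")]`, the command `ls -la` is classified SAFE, while an unmatched command such as `reboot` receives the RISKY default.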
- Command database 406 stores past risk predictions determined based on the set of rules in the rule database 407, including a respective command and its corresponding predicted classification.
- the command-prediction pairs can be used for offline auditing of the rule-based classifier.
- the documentation system 408 comprises a command knowledge base, in this example in the form of documentation database 409, which stores external documentation of terminal-executable programs; and a command analysis module, in this example in the form of a command describer system 410, to produce explanation summaries of terminal commands, shown at 411.
- the documentation system 408 extracts relevant information regarding the executed command from command documentation, and summarizes the information in the form of a command description in natural language.
- the natural language is English, but other languages may be used.
- the AI classifier 412 takes a raw command and its generated text command description as input and predicts a classification for the command.
- the classification may have a corresponding risk label that indicates the risk level of the executed command.
- the AI model of the AI classifier can be trained on historical command predictions, retrieved from the command database 406.
- the system compares the output of the rule-based system 405 and the AI-based classifier 412 to audit the rules.
- a similarity search module 413 additionally classifies commands through similarity analysis with existing rules. Classification rules are retrieved from the rule database 407. The rules that are most similar to the input command are identified by means of a similarity function. Using the labels of the most similar rules, a risk prediction is obtained by aggregation of responses (for example using averaging or voting). The system integrates the existing rule knowledge into the auditing system and provides a fall-back solution in case of missing documentation. The similarity search system allows potential limitations in the documentation knowledge base to be overcome, by exploiting the existing system rules to generalize risk predictability in proximal cases.
- Responses from the similarity search system and AI-based classifier are aggregated to produce a final output response.
- Response aggregator 414 receives the output of the similarity search system 413 and the AI classifier 412 and aggregates the responses to generate a final risk prediction for the command.
- the final risk prediction is the output of the Al classifier 412.
- Prediction comparator 415 retrieves the historical prediction of the rule-based predictor and compares it to the final prediction produced by the response aggregator 414. If the predictions differ, it reports the issue to the administrator 404 to signal the potential misprediction, as shown at 416.
- the documentation database 409 is a document-oriented database where documentation files of terminal-executable programs can be stored in a structured format.
- a single documentation file can describe in detail how to use a single program, including syntax usage, program behaviour, and available options.
- the documentation can be organized in paragraphs, so that each paragraph either describes the general usage or specific option and/or argument.
- the documentation can be imported in a preliminary setup step using an import script, which takes standardized documentation files as input. Documentation that does not adhere to the standardized format required by the import script can be converted. Internally-defined commands can also be included to improve auditing quality. They can be documented directly in the standardized format, or may be converted.
- the documentation system acts as a command describer, which is a software module that produces explanation summaries of terminal commands. Given an incoming command, the command describer retrieves the relevant documentation pages from the documentation database, extracts information relevant to the current command instance, and produces a short text describing the command functionality in the English language.
- External command documentation included in the documentation database 409 may, for example, comprise any of the following:
- the documentation system 408 comprises three submodules.
- the first module is a parser program 501, which produces an Abstract Syntax Tree (AST) representation of the incoming command 500, where command tokens may be, for example, tree nodes with program, option, or argument tags.
- AST: Abstract Syntax Tree
- the second module is a matcher program 502, which associates elements in the AST with documents in the documentation database, by matching AST element names with documentation file titles and aliases. After the relevant documents are identified, for each document relevant paragraphs are extracted, by matching options and arguments of the AST with the paragraphs of the documentation. General description paragraphs for the program (synopsis, usage, main functionality) are preferably included in the final selection.
- the third module is a post-processing program 503, which concatenates the selected paragraphs and/or cleans the obtained text, for example by removing HTML tags, links, author and copyright information, and/or by removing trailing spaces and normalizing text.
- the final output is a single string describing the behaviour of the input command in a natural language format.
- the command description is specific to the set of programs, options and arguments specified. The description is sent to the Al model as additional source of information for risk classification. In case no relevant documents are identified, no command description is returned.
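The parse-match-postprocess pipeline might be sketched as follows; the `docs` mapping (program name to paragraphs keyed by option, with a `_synopsis` entry for the general description) is a hypothetical stand-in for the documentation database, and a flat token split stands in for the AST:

```python
import re
import shlex

def describe_command(command, docs):
    """Toy command describer: parse the command, match documentation
    paragraphs, and post-process them into one description string.

    Returns None when no relevant documents are identified, mirroring the
    "no command description" case.
    """
    tokens = shlex.split(command)                       # crude stand-in for an AST
    program = tokens[0]
    options = [t for t in tokens[1:] if t.startswith("-")]
    pages = docs.get(program)
    if pages is None:
        return None                                     # no relevant documents found
    selected = [pages.get("_synopsis", "")]             # general description first
    selected += [pages[o] for o in options if o in pages]
    text = " ".join(p for p in selected if p)
    return re.sub(r"\s+", " ", text).strip()            # normalize whitespace
```

With `docs = {"ls": {"_synopsis": "list directory contents", "-l": "use a long listing format"}}`, describing `ls -l` yields the synopsis followed by the paragraph for the `-l` option.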
- the incoming command is parsed to recognize named entities.
- the output is an AST with entities (programs, options, arguments) as nodes. Documentation corresponding to matched entities is retrieved.
- the system queries the documentation database with program names and files found in the command, and returns exact matches. Relevant text information is extracted to construct a concise command description.
- Elements in the AST are associated with database matches. From the matches, the following can be selected:
- paragraphs associated with the flags and parameters specified, e.g. -I, -r.
- the extracted text can be cleaned and merged to obtain a concise command description. This cleaning process may remove redundant information (such as links, copyright, author information).
- the extracted text may be concatenated, for example by merging selected paragraphs into a single block of text.
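A minimal sketch of this cleaning and concatenation step; the specific patterns removed here (HTML tags, links, copyright notices) are illustrative assumptions:

```python
import re

def clean_documentation_text(paragraphs):
    """Merge selected paragraphs into a single block of text and clean it."""
    text = " ".join(paragraphs)                        # concatenate paragraphs
    text = re.sub(r"<[^>]+>", "", text)                # strip HTML tags
    text = re.sub(r"https?://\S+", "", text)           # strip links
    text = re.sub(r"(?i)copyright[^.]*\.", "", text)   # drop copyright notices
    return re.sub(r"\s+", " ", text).strip()           # normalize spaces
```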
- the command description is sent to the Al model as additional source for risk classification prediction.
- the natural language model is, in this example, an English language model.
- Fig. 6 summarizes the main elements of an exemplary AI classifier 412.
- the AI classifier is a neural network model based on a BERT model.
- Other model architectures may be used to classify the commands in dependence on the external documentation.
- the AI classifier comprises a BERT head 601 pretrained on an English corpus (referred to herein as the English BERT head).
- the head may implement an alternative natural language.
- This first head 601 is for processing the documentation text (the command description) produced by the documentation system.
- This head is used in combination with a full tokenizer 602.
- the full tokenizer 602 combines base tokenization (for example, based on spacing and punctuation) with WordPiece tokenization (see https://ai.googleblog.com/2021/12/a-fast-wordpiece-tokenization-system.html).
- the full tokenizer 602 allows the English text to be efficiently encoded to a numeric representation.
- the classifier also comprises a BERT head 603 pretrained on a corpus of Bash commands (referred to herein as the Bash BERT head).
- the head may implement an alternative computer programming language.
- This second head 603 is for processing the command string.
- This head is used in combination with a byte-pair encoder (BPE) tokenizer 604, which can adapt the set of encoding tokens to the input provided during pre-training and to the desired vocabulary size. It is capable of preserving the sequential nature of the executed commands, thereby preserving the context of the arguments, options and flags.
- BPE: byte-pair encoder
- Byte-pair tokenization and encoding specialized for the Bash language is therefore employed by the AI-based classifier. It considers the relationships between the tokens appearing together often, thereby improving the classification results.
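A toy illustration of byte-pair merge learning on a command corpus. Real BPE tokenizers start from a byte or character vocabulary and apply many thousands of merges; this sketch only shows the core rule of repeatedly merging the most frequent adjacent token pair, so tokens that often appear together end up encoded as a single unit:

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn byte-pair merges from a corpus of commands (toy illustration)."""
    sequences = [list(cmd) for cmd in corpus]   # start from character tokens
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))     # count adjacent token pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        for seq in sequences:                   # apply the merge everywhere
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges
```

On a corpus such as `["rm -rf", "rm -r"]`, the first learned merge is the frequent pair `r` + `m`, reflecting that `rm` behaves as one unit in commands.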
- the AI classifier can be set up in two steps: pre-training and fine-tuning, as shown in Fig. 6.
- pre-training the BERT heads are pre-trained on corresponding large corpora of unlabelled text (English-language documents for the English head, Bash commands for the Bash head).
- Exemplary tasks employed for pre-training are masked token prediction and next sentence prediction (see Devlin, Jacob, et al., "BERT: Pre-training of deep bidirectional transformers for language understanding", arXiv preprint arXiv:1810.04805 (2018)).
- the BPE tokenizer 604 can be trained on the Bash corpus to recognize and encode tokens present in Bash commands, while the full tokenizer 602 for English may not require a training step.
- the pre-trained models are then able to capture the complex relationship between tokens and learn the syntax constructs of the two languages (English and Bash in this example).
- the classification network is connected to the BERT heads to learn the classification task of determining classifications for the commands, such as RISKY and SAFE. Labelled commands stored in the command database can be used for the training.
- the pretrained models are presented with the commands, the corresponding command description and their associated label. The models can then be trained during the fine-tuning stage to discriminate the SAFE from RISKY commands using a back-propagation algorithm.
- the AI classifier 412 may be employed as follows.
- the respective command to classify is retrieved from the command database 406.
- the command is tokenized using the BPE tokenizer 604 and fed into the pre-trained Bash BERT head 603.
- the command description is retrieved from the documentation system 408. If the command description is available, the command description is tokenized using the full tokenizer 602 and fed into the pre-trained English BERT head 601.
- An exemplary classification network, shown at 607, connecting the two BERT heads, comprises a fully connected layer mapping the BERT outputs 605, 606 to a class prediction, such as ‘RISKY’, as shown in Fig. 6 for a particular command.
- the output vectors 605 and 606 produced by the two BERT heads are concatenated.
- a fully-connected layer applies a linear transformation to map the concatenated inputs to the number of prediction classes (for example, SAFE or RISKY).
- the output can be scaled using a SoftMax activation to obtain class probabilities.
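A pure-Python sketch of this classification network, with illustrative weights: the two head outputs are concatenated, a fully connected layer maps them to the prediction classes, and a numerically stable softmax yields class probabilities. Passing a zero vector for the English head output reproduces the disabled-head mode described below:

```python
import math

def classify(bert_out_en, bert_out_bash, weights, bias, classes=("SAFE", "RISKY")):
    """Concatenate the two BERT head outputs, apply a linear layer and softmax.

    `weights` has one row per class; all numeric values are illustrative.
    """
    x = list(bert_out_en) + list(bert_out_bash)         # concatenate head outputs
    logits = [sum(w * v for w, v in zip(row, x)) + b    # fully connected layer
              for row, b in zip(weights, bias)]
    m = max(logits)                                     # numerically stable softmax
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return classes[probs.index(max(probs))], probs
```

In a real implementation the linear layer and softmax would be trained jointly with the BERT heads during fine-tuning; here fixed weights simply illustrate the data flow.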
- the Bash BERT head can operate while the English BERT head is disabled, as described below.
- both BERT heads 601, 603 and the classification network 607 process the tokenized inputs 605, 606 to output a classification prediction.
- only the Bash BERT head 603 and the classification network 607 process the tokenized input 605 to output a classification prediction.
- the input of the classification network from the English BERT head is set to 0.
- the solution therefore utilizes two language models (Bash for commands, English for documentation) to classify executed commands, for example as RISKY or SAFE.
- a similarity search system can also be used to overcome potential limitations in the documentation knowledge base, which may occur in some implementations, by exploiting the existing system rules to generalize risk predictability in proximal cases.
- the similarity search thus provides a fall-back solution based on the most similar rules, so that a risk classification from the auditing system is always available.
- a diagram of an exemplary similarity search system is shown in Fig. 7.
- the similarity search system 413 implements a k-Nearest-Neighbours (KNN) algorithm to classify incoming commands based on the rules stored in the rule database 407.
- Each rule is associated with a blocking/allowing policy, which can be used as training labels for the classification algorithm.
- the stored rules are retrieved from the rule database and an efficient tree-like (k-d tree) data structure is fitted to enable fast neighbour queries during the inference phase.
- the incoming command is used as query to return the most similar matching rules.
- a similarity distance (for example, the Jaccard, Hamming or Levenshtein distance) can be used; the Levenshtein distance, for instance, measures the minimum number of insertion, deletion, or replacement changes necessary to convert the command into a potential match.
- their associated labels are used to predict the class of the incoming command by majority voting. Therefore, rules similar to the input command can be identified, and the risk classifications of the identified rules can be combined to predict the risk of the input command.
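A minimal sketch of the similarity-search fall-back described above, assuming the Levenshtein distance and a brute-force neighbour search in place of the fitted k-d tree; the example rules and labels are hypothetical:

```python
from collections import Counter

def levenshtein(a, b):
    """Minimum number of insertions, deletions, or replacements to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # replacement
        prev = cur
    return prev[-1]

# Hypothetical rule database: (pattern, label) pairs whose blocking/allowing
# policies act as training labels for the KNN classifier.
RULES = [
    ("rm -rf /", "RISKY"),
    ("rm -rf /tmp/cache", "RISKY"),
    ("ls -la /home", "SAFE"),
    ("cat /var/log/syslog", "SAFE"),
    ("df -h", "SAFE"),
]

def knn_classify(command, k=3):
    """Return the majority-vote label of the k most similar rules."""
    neighbours = sorted(RULES, key=lambda r: levenshtein(command, r[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify("rm -rf /etc"))  # → RISKY (the nearest rules are the rm -rf patterns)
```

A production implementation would fit a k-d tree (or similar index) over vectorized rules so that neighbour queries do not scan the whole rule database.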
- each classification prediction (T) is considered in proportion to its weight:
- Y_final = argmax(p_final)
- p_final = w_BERT * p_BERT + w_similarity * p_similarity
- where p_BERT, p_similarity and p_final are the probability vectors for each class,
- and w_BERT, w_similarity are scalar constants with unit sum. This requires tracking the class probabilities (p) for each predictor, in addition to the prediction classes (T).
- alternatively, a recall-first aggregation may be enabled, in which the riskiest prediction class is selected.
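The weighted aggregation above can be worked through as follows; the weight values are illustrative, chosen only to satisfy the unit-sum constraint:

```python
import numpy as np

CLASSES = ["SAFE", "RISKY"]

def aggregate(p_bert, p_similarity, w_bert=0.7, w_similarity=0.3):
    """Weighted aggregation of the two class-probability vectors.

    w_bert and w_similarity are scalar constants with unit sum; the
    values used here are assumptions, not taken from the disclosure.
    """
    assert abs(w_bert + w_similarity - 1.0) < 1e-9   # unit-sum constraint
    p_final = w_bert * np.asarray(p_bert) + w_similarity * np.asarray(p_similarity)
    return CLASSES[int(np.argmax(p_final))], p_final

label, p = aggregate([0.4, 0.6], [0.8, 0.2])
# p_final = 0.7*[0.4, 0.6] + 0.3*[0.8, 0.2] = [0.52, 0.48] → SAFE
```

A recall-first variant would instead return RISKY whenever either predictor assigns RISKY its highest probability.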
- once the final prediction is determined, it is compared with the historical prediction performed by the rule-based system for the same command, as stored in the command database 406. If the two classification predictions differ, an alert is output, which can be notified to the system administrator to report an anomalous prediction.
- the administrator can choose to intervene on the rule database to improve prediction quality of the existing risk assessment system. Alternatively, the rule database may be automatically updated.
- Fig. 8 shows an operation sequence diagram for the exemplary system described above. The system components are numbered as described previously.
- the prediction job is started and an instruction is sent to the command database 406 to retrieve one of the historical commands and its corresponding classification (determined according to the rules in the rule database 407).
- the command is sent to the similarity search system 413, which queries the rule database 407 at 803.
- similar rules are returned to the similarity search system 413.
- a similarity prediction is sent to response aggregator 414 at 805.
- the command is also sent to the command descriptor 410.
- the documentation in the documentation database 409 is queried at 807 and documentation matches for the command are sent to the documentation system at 808.
- a command description is formed, which is sent to the AI classifier 412 at 809.
- the AI classifier 412 also receives the command at 810.
- the classification prediction formed by the AI classifier 412 is sent to the response aggregator 414 at 811.
- the final auditing classification prediction is sent to the prediction comparator 415 at 812.
- the prediction comparator also receives the rule-based classification prediction from the command database 406 at 813.
- the two classification predictions (the first being determined based on the rules and the second being determined by the AI classifier as described herein) are compared. If they differ, a notification is output at 815.
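The sequence of steps 801-815 can be sketched as a single prediction-job function; the callable parameters are hypothetical stand-ins for the similarity search system 413, the AI classifier 412 and the response aggregator 414:

```python
def prediction_job(command, rule_based_label,
                   similarity_predict, ai_predict, aggregate):
    """Run one auditing pass for a historical command.

    similarity_predict, ai_predict and aggregate are stand-ins for
    components 413, 412 and 414; each returns class probabilities or,
    for aggregate, the final label.
    """
    p_similarity = similarity_predict(command)    # steps 802-805
    p_ai = ai_predict(command)                    # steps 806-811
    final_label = aggregate(p_ai, p_similarity)   # step 812
    if final_label != rule_based_label:           # steps 813-815
        return (f"ALERT: {command!r} classified {rule_based_label} by the "
                f"rules but {final_label} by the auditing system")
    return None                                   # predictions agree

# Toy predictors: probabilities are ordered as [SAFE, RISKY].
alert = prediction_job(
    "rm -rf /", "SAFE",
    lambda c: [0.1, 0.9],
    lambda c: [0.2, 0.8],
    lambda a, s: "RISKY" if a[1] + s[1] > a[0] + s[0] else "SAFE")
# labels disagree, so an alert string is returned
```

In the described system the alert would be forwarded to the prediction comparator 415 and notified to the administrator rather than simply returned.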
- Fig. 9 shows a flowchart illustrating the steps of an exemplary method 900 for auditing rules in a rule-based command risk assessment system.
- the apparatus implementing the method is communicatively connectable to one or more data stores for storing one or more commands and a set of pre-existing rules for classifying the commands.
- the method comprises the following steps.
- the method comprises receiving a first predicted classification for the respective command, the first predicted classification being determined based on the set of pre-existing rules.
- the method comprises forming a second predicted classification for the respective command based on external documentation for the respective command.
- the method comprises, if the first prediction differs from the second prediction, outputting a notification.
- Fig. 10 shows an example of an apparatus 1001 for performing the methods described herein.
- the apparatus 1001 may comprise at least one processor, such as processor 1002, and at least one memory, such as memory 1003.
- the memory stores in a non-transient way code that is executable by the processor(s) to implement the apparatus in the manner described herein.
- the apparatus may be associated with multiple processors and each processor may be used to perform a different function.
- One or more of the processors or memories associated with the apparatus may, for example, be based in the cloud.
- Internally deployed rules in risk assessment systems are traditionally based on the past knowledge of operators, which may not cover all combinations of arguments and/or options of past commands, and equally may not cover new or unseen commands.
- the traditional classification risk described by the whitelist/blacklist rule may differ from the concrete risk of the command.
- a dangerous command may be considered safe, allowing the operators to run it and causing damage to the system; or a safe command may be unjustifiably blocked, resulting in an unnecessary blocking action which may hinder prompt operations.
- the approach described herein advantageously leverages previous knowledge coming from an existing set of classification rules and additionally utilises external sources of knowledge coming from command documentation to evaluate the inherent risk of commands, so that an existing classification rule can be challenged and revised.
- risk predictions considered inaccurate can be reported, and the corresponding rules can be updated accordingly.
- the solution for the auditing of rules in a rule-based risk classification system described herein thus combines external information and embedded knowledge from different sources, to provide a more accurate command classification. It provides a mechanism to dynamically discover and revise the security rules to govern a cloud infrastructure and adopts an auditing mechanism to evaluate the existing rules and suggest improvements. As a result, the quality of the existing rules for classification is enhanced, the effort required to revise the existing rules is reduced, and incident frequency may be reduced.
- the solution combines documentation and operator risk knowledge to classify commands by means of a sequential neural network model, which can capture the relationships and context of commands expressed in CLI language.
- the system can produce more accurate command classifications, resulting in fewer dangerous operations executed.
- the language model can generalize to unseen commands, subcommands, and flag combinations.
- the present approach supports the discovery of both false positive and false negative predictions during the auditing process, by comparing all predictions for all samples, rather than only the positive predictions or those for specific rules.
- the approach can work on all valid commands, including commands resulting from concatenation of multiple commands via pipes and other possibilities allowed by the command-line language (&, xargs, ).
- Embodiments of the present disclosure also overcome several limitations of existing auditing systems.
- Previous ML-based direct command classifiers are trained on a large quantity of labelled commands, but cannot generalize to commands that are not sufficiently similar to the ones shown during the training phase. That is because the classifier learns to recognize the risky patterns inside the command string. When the strings change significantly, the classifier is no longer able to match the corresponding risk patterns, and cannot classify accurately.
- a classifier trained on labels provided by operators is trained to agree with the operators’ opinion, and therefore cannot signal disagreement with it.
- Man-in-the-loop auditing systems report to humans when existing rules are not applicable and no risk assessment is available. Such systems therefore require constant human intervention to provide feedback and improve existing classification systems.
- Embodiments of the present disclosure can reduce the number of commands reported for revision, by only reporting commands when a concrete revision action is deemed necessary because there is a discrepancy in the predictions, and by avoiding overreporting due to lack of knowledge, thanks to the fall-back mechanism that allows a risk classification to always be produced.
- Previous solutions for auditing also generally cannot capture context effectively, because they apply algorithms that do not consider the order of tokens and their relationships in a sequence, and they cannot discover past wrong judgements, especially false negative predictions. They are unable to correctly evaluate unseen commands because of a generalization gap, a lack of applicable rules, or a lack of neighbour commands/patterns for similarity classification.
- using language models such as BERT or GPT3 for classifying commands by extracting information from documentation makes it possible to capture the context of the command.
- the language model can generalize to unseen commands, subcommands, and flag combinations and preserve the sequential nature of the executed commands, hence considering the context in which flags and sub-commands are being used.
- the use of byte-pair tokenization and encoding prior to classifying the executed commands which considers correlation between tokens, and a transformer architecture, can also help to effectively model and process text sequences.
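The byte-pair mechanism mentioned above can be illustrated with a toy merge-learning loop; a real BPE tokenizer learns thousands of merges over a large corpus, so this sketch only shows the pairing step and its inputs are hypothetical:

```python
from collections import Counter

def bpe_merges(corpus_tokens, num_merges=3):
    """Learn a few byte-pair merges over a toy corpus of command strings.

    Each merge replaces the most frequent adjacent symbol pair with a
    single combined symbol, capturing correlation between tokens.
    """
    seqs = [list(tok) for tok in corpus_tokens]   # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))       # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]       # most frequent pair
        merges.append(a + b)
        for seq in seqs:                          # apply the merge in place
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, seqs

merges, tokens = bpe_merges(["rm -rf", "rm -f", "rmdir"])
# the first learned merge is "rm", the prefix shared by all three commands
```

Frequent command fragments thus become single tokens, which helps the downstream transformer generalize across flag and subcommand combinations.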
- the solution can also be extended to other CLI tools such as databases, interactive shell languages, and network device management and is not only restricted to the UNIX CLI and cloud environments.
- the system may be deployed to prevent employees from running commands that, for example, can damage Windows OS. Risky commands can be blocked and require additional permissions to be executed.
- the method may also be implemented in such systems as proxy servers, firewall systems, and databases and can be used to automatically review access control policies, including exceptions and whitelisted operations (defined in the access points as rules), to prevent harmful connections from being initiated.
- the method may also be used as part of standard operating procedure (SOP) verification. Given a natural language description of the SOP (acting as documentation) and a list of commands or operations executed (acting as the command), the system can be used to verify if the SOP was followed.
- the system may also be used in the medical industry to automatically review diagnostic tests to verify that important diagnoses have not been overlooked.
- the medical review system may take a diagnostic exam outcome as input and revise the decision of medical personnel to prevent false negative (FN) diagnoses of high-risk diseases.
Abstract
An apparatus (1001) for auditing rules in a rule-based command risk assessment system, the apparatus being communicatively connectable to one or more data stores (406, 407, 1003) storing one or more commands and a set of pre-existing rules for classifying the commands, and comprising one or more processors (1002) configured to, for each of at least one of the commands: receive (901) a first predicted classification for the respective command, the first predicted classification being determined based on the set of pre-existing rules; form (902) a second predicted classification for the respective command based on external documentation for the respective command; and if the first prediction differs from the second prediction, output (903) a notification. This may provide a supervision system that allows the rules to reflect the correct risk of executed commands over time. This may improve risk assessment quality and reduce incident frequency.
Description
APPARATUS AND METHOD FOR AUDITING RULE-BASED COMMAND RISK ASSESSMENT SYSTEMS
TECHNICAL FIELD
The present disclosure relates to risk assessment systems, in particular to the auditing of rules in command risk assessment systems.
BACKGROUND
The growth of security threats has encouraged organizations to develop effective security solutions to safeguard digital applications, data and resources. Solutions target the prevention of incidents and malicious attacks. In order to prevent failures and threats in large IT systems, resulting from the execution of high-risk commands in the command-line interface by operators, it is desirable to automatically recognize and possibly block high-risk commands.
In large IT systems, operations and maintenance (O&M) personnel can access production systems via a remote terminal to perform maintenance operations. Incorrect or malicious operations may result in major system failures and data losses. In this context, security solutions are desirable to prevent execution of potentially risky operations, by classifying risky operations as such before their execution. One example of risk classification is shown in the table of Fig. 1, where operations (in the left-hand column) are classified into ‘Risky’ and ‘Safe’ groups.
Rule-based systems for command risk assessment can apply sets of whitelist/blacklist rules to classify and handle commands. A common solution for the above scenario is schematically illustrated in Fig. 2, where a command interception system 203, deployed on network gateways between an operator 201 and a production host 202, contains a rule-based classifier, where IF-THEN-ELSE rules define which commands are allowed (or whitelisted) and which commands are blocked (or blacklisted). The rules are stored at a rule database 204 and may be defined based on the expertise of O&M operators and the historical records of executed commands. In large-scale environments, this approach requires existing rules to be regularly maintained and updated to reflect the dynamicity of the environment. New, unseen commands may be executed, for which the existing rules may not be suitable. Moreover, the risk level assigned to commands by operators based on their expertise may diverge from the inherent risk of commands, as described in their technical documentation.
An auditing system, as shown at 205, can be used in the command risk assessment system to evaluate whether the risk assignment performed by a rule-based classifier is correct.
Previous solutions in the area of auditing systems can generally be divided into two main categories: direct auditing systems and auditing systems based on classification.
Direct auditing systems rely on human expertise (for example in the form of administrator 206) to confirm if the risk assignment performed by a rule-based classifier is correct. If they are exclusively based on human feedback, they cannot evaluate if a past operator judgement is wrong, because their judgement represents the ground truth. Therefore, such systems require high quality labels and cannot correct past misjudgements. Moreover, when no existing rule can match the current command exactly, the risk cannot be evaluated and human intervention is required.
Auditing systems based on classification apply machine learning classifiers to predict the risk of commands, when an existing rule-based classifier does not provide a rule or provides an incorrect rule. By providing a more accurate classification, rules can then be created and updated using the classifier knowledge. Such approaches cannot generalize to unseen commands that are significantly different to the commands used for training the classifier. This renders the auditing task difficult. Moreover, the design choices applied often do not enable the classifier to analyse the temporal and contextual relationships between command tokens, which are important to understand the behaviour of commands. Classification by similarity matching is very sensitive to the choice of the similarity function and to the availability of a large quantity of labelled commands, and may not be able to analyse all commands executed in a real-world environment.
It is desirable to develop an approach that can overcome at least some of the above issues.
SUMMARY
According to a first aspect, there is provided an apparatus for auditing rules in a rule-based command risk assessment system, the apparatus being communicatively connectable to one or more data stores storing one or more commands and a set of pre-existing rules for classifying the commands, and comprising one or more processors configured to, for each of at least one of the commands: receive a first predicted classification for the respective command, the first predicted classification being determined based on the set of pre-existing rules; form a second predicted classification for the respective command based on external documentation for the respective command; and if the first prediction differs from the second prediction, output a notification.
By leveraging previous knowledge coming from an existing set of classification rules and additionally utilising external sources of knowledge from command documentation to evaluate the inherent risk of commands, an existing set of classification rules can be challenged and revised. This may provide a supervision system that allows the rules to reflect the correct risk of executed commands over time. This may improve risk assessment quality and reduce incident frequency.
The data store(s) may be, for example, a data management system or a database system, and may be stored in the cloud or in hardware and may be a computer-readable medium, such as a non-transitory computer-readable medium. The external documentation may be stored at the data store(s). The data store(s) may be communicatively coupled to a data network such as the internet to allow the external documentation stored at the data store(s) to be periodically updated. This may allow the external documentation to be kept up-to-date in view of code or software updates and the like. The set of rules may comprise one or more rules; preferably multiple rules.
The one or more commands may be historical commands issued by an operator of a target host controlled by an access control system implementing the rule-based command risk assessment system according to the set of pre-existing rules, each historical command having a corresponding first predicted classification determined based on the set of pre-existing rules. This may allow the apparatus to audit sets of rules used in a rule-based risk assessment system to determine whether the rules are appropriate for classifying commands. If the set of rules does not provide appropriate classifications, the set of rules can be updated.
The first classification and the second classification may each comprise an indication of a level of risk relating to the execution of the respective command. For example, the classifications may be ‘Safe’ or ‘Risky’. This may allow the system to prevent the execution of potentially risky operations, by classifying risky operations as such before their execution.
The external documentation may comprise one or more of the following: documentation describing standard operating system commands, files or system calls; internally-defined scripts, programs or aliases; and third-party programs, tools or internal filenames. This may allow the system to take into account information contained within the external documentation relating to the command in determining whether the set of rules are appropriate for classifying the command correctly.
The one or more processors may be configured to process the external documentation or information derived therefrom (for example, a cleaned or compressed version of extracted information) and the command to form the second predicted classification. This may allow the system to take into account information contained within the external documentation relating to the command and a description of the command in determining the second prediction.
The one or more processors may be configured to: retrieve relevant documentation for the respective command from the external documentation; extract information relevant to the respective command from the relevant documentation; and in dependence on the extracted information, form a command description for the respective command. The command description may be an explanation summary of the respective command. This may allow the operation of the command to be expressed.
The command description may have a natural language format. The command may have a computer programming language format. The one or more processors may be configured to input the command description to an artificial intelligence model and form the second predicted classification in dependence on an output of the artificial intelligence model. Using language models (such as BERT or GPT3) for classifying commands by extracting information from documentation makes it possible to capture the context of the command. The language model can generalize to unseen commands, subcommands, and flag combinations and preserve the sequential nature of the executed commands, hence considering the context in which flags and sub-commands are being used. This may improve the accuracy of classifications.
The artificial intelligence model may comprise a first head for processing the command description. The artificial intelligence model may comprise a second head for processing the respective command. For example, the first head may be an English language head and the second head may be a BERT head, such as a BERT model in a programming language, such as Bash. Other models and languages may be used. This may allow the context of the command to be captured.
The one or more processors may be configured to form the second predicted classification in dependence on a third classification prediction formed in dependence on one or more risk indications corresponding to one or more rules of the set of rules selected based on similarity with the respective command. The third classification prediction may be formed by aggregating the risk indications corresponding to rules retrieved based on similarity with the respective command. The third classification may be the output of a similarity search classification system. This may allow similar rules to be used to form the second classification when no existing rule can match the current command exactly.
The one or more processors may be configured to form the second predicted classification by aggregating the output of the artificial intelligence model and the third classification prediction. This may improve the accuracy of the final prediction.
The one or more processors may be configured to output respective notifications for multiple commands. The notifications may be pre-grouped by similarity and reported to an administrator of the system. This may improve the efficiency of the notification process.
The one or more processors may be configured to, in dependence on the notification, update one or more of the rules of the pre-existing set of rules. This may allow the set of rules to be continuously updated to ensure that the set of rules is appropriate for classifying commands.
According to a second aspect, there is provided a method for auditing rules in a rule-based command risk assessment system for an apparatus being communicatively connectable to one or more data stores storing one or more commands and a set of pre-existing rules for classifying the commands, the method comprising, for each of at least one of the commands: receiving a first predicted classification for the respective command, the first predicted classification being determined based on the set of pre-existing rules; forming a second predicted classification for the respective command based on external documentation for the respective command; and if the first prediction differs from the second prediction, outputting a notification.
By leveraging previous knowledge coming from an existing set of classification rules and additionally utilising external sources of knowledge from command documentation to evaluate the inherent risk of commands, the method allows an existing set of classification rules to be challenged and revised, such that the rules reflect the correct risk of executed commands over time. This may improve risk assessment quality and reduce incident frequency.
The above method may be a computer-implemented method that is implemented on-device (for example, on a laptop or smartphone) or externally, such as on cloud-based services.
According to a further aspect, there is provided a computer program comprising instructions that when executed by a computer cause the computer to perform the methods above.
According to a further aspect, there is provided a computer-readable storage medium having stored thereon computer readable instructions that when executed at a computer (for example, comprising one or more processors) cause the computer to perform the methods above. The computer-readable storage medium may be a non-transitory computer-readable storage medium. The computer may be implemented as a system of interconnected devices.
BRIEF DESCRIPTION OF THE FIGURES
The present disclosure will now be described by way of example with reference to the accompanying drawings.
In the drawings:
Fig. 1 illustrates classifying operations into ‘Risky’ and ‘Safe’ groups.
Fig. 2 schematically illustrates an application scenario of a rule-based command risk assessment system and the corresponding auditing system.
Fig. 3 schematically illustrates an exemplary system in which the artificial intelligence-based auditing system of embodiments of the present disclosure may be applied.
Fig. 4 schematically illustrates an overview of an exemplary auditing system and its interaction with an existing rule-based risk assessment system.
Fig. 5 schematically illustrates an exemplary documentation system of the auditing system.
Fig. 6 schematically illustrates an exemplary classifier of the auditing system.
Fig. 7 schematically illustrates an exemplary similarity search system that can optionally be implemented by the auditing system.
Fig. 8 shows an exemplary operation sequence diagram for the auditing system described herein.
Fig. 9 shows an example of a method for auditing rules in a rule-based command risk assessment system.
Fig. 10 shows an example of an apparatus for auditing rules in a rule-based command risk assessment system and some of its associated components.
DETAILED DESCRIPTION
The present disclosure concerns a system for auditing a rule-based risk assessment system for commands, which may be carried out in a cloud or other large-scale computing infrastructure. The system can analyse historical risk predictions to verify if past block/allow decisions for commands are correct based on additional external knowledge. The system can report possible discrepancies to system administrators. The system compares the output of the risk assessment system and the auditing prediction to evaluate the quality of the prediction provided by the existing system. If the predictions differ, the event is reported to system administrators for revision.
In the embodiments described below, Natural Language Processing (NLP) techniques are used for command risk classification. Common NLP problems include speech recognition, text summarization, and sentiment analysis, a category of text classification used to determine the polarity of a given text, where the polarity can represent different aspects of language, such as positivity/negativity, objectivity/subjectivity, or hateful/supportive content. Recent advances in neural network-based methods (also known as deep learning) have encouraged the development of several neural-based architectures specialized for speech and text, such as the transformer and other transformer-based Large Language Models (LLMs), such as Bidirectional Transformers for Language Understanding (BERT) (see Devlin, Jacob, et al., "BERT: Pre-training of deep bidirectional transformers for language understanding", arXiv preprint arXiv:1810.04805 (2018)) or GPT3 (see Brown, Tom, et al., "Language models are few-shot learners", Advances in Neural Information Processing Systems 33 (2020): 1877-1901). Thanks to the large quantity of text data and effective representation learning, these models are able to capture relationships between different subunits of a text, called tokens (such as words, punctuation or letters). Transformer-based architectures may also be applied to documentation text and CLI commands. Similar to other sentiment analysis tasks, LLMs can be applied on documentation text to evaluate its polarity in terms of dangerous/safe behaviour, i.e. if the English documentation of a command thoroughly describes the behaviour of the command, the LLM may be able to classify whether such a command can constitute a risk for the host system. A similar argument applies to the remote command itself which, as an instance of a programming language with predefined lemmas, syntax, context and structural dependencies, resembles in many ways a natural language, such as English.
Fig. 3 schematically illustrates a use-case diagram for an exemplary risk assessment system 300. An operator 301 accesses a production system for maintenance through a proxy 302. In this example, the proxy intercepts operations from a Secure Shell (SSH) bastion to be executed on a target host 303 and forwards them to the risk assessment system 304, where the risk of the operation represented by the command is evaluated.
The risk assessment system 304 uses internal configuration resources, for example a rule management system 305, for estimating operation risk. If a request for an operation is safe, it is forwarded to the target host 303 and the result of the operation is forwarded back to the operator 301. However, if the requested operation is risky, the operation is blocked and the operator is notified that the requested operation is not permitted. Risk prediction classifications are stored in a command database 306 with the corresponding command for offline analysis.
In embodiments of the present disclosure, the auditing system 350 retrieves historical risk prediction classifications and current command pairs and analyses them, by estimating a new risk class based on command documentation. If the historical prediction differs from the new risk class, the command is reported. The command may be reported to an expert reviewer (such as a system administrator) for revision. The expert reviewer can then decide to act and modify the runtime behaviour of the risk assessment system, to better reflect the correct risk class as reported by the auditing system. To do so, the expert reviewer can modify the rule management system which provides prediction capability to the risk assessment system. Anomalous predictions that are reported may be grouped by similarity, and similar predictions may be reported together.
A complete diagram of an exemplary operating environment 400 is depicted in Fig. 4.
In this example, the operator 401 is an engineer working on an O&M task. The engineer may perform actions on a terminal (not shown), approved through the access control system 402. The target host 403 may be a machine in the production environment where the operator performs their tasks. Its access is controlled by the access control system 402. The administrator 404 may be an engineer from the O&M department. The administrator may be responsible for manually keeping the risk assessment system accurate and up-to-date. The system may also be automatically updated based on the notifications that are output by the system.
Access control system 402 manages the execution of commands to the target host 403. It receives commands from the operator terminal and requests a risk prediction from the rule-based classifier 405. If the command is classified as SAFE, the command is forwarded to the target host 403 and the command output is forwarded back to the operator terminal. Otherwise, the command is blocked and an error message is reported to the operator. In any case, the access control system 402 stores the command and the corresponding prediction in the command database 406.
The rule-based classifier 405 evaluates the risk associated with each individual command. The rule-based classifier 405 can evaluate the risk associated with each command using a set of predefined rules. Rule database 407 stores the classification rules for evaluating commands. The rules may be IF-THEN-ELSE rules that define which commands are allowed (or whitelisted) and which commands are blocked (or blacklisted). Each rule is composed of a
pattern and a label (for example, SAFE or RISKY). If a command matches the rule pattern, it is assigned the corresponding rule label.
Rules are fetched from the rule database 407 and compared with the incoming command to find rule matches for the command. Based on the matched rules, the command is classified. The classification gives an indication of a level of risk relating to the execution of the respective command. A risk label corresponding to the classification is selected. The label can be, for example, SAFE or RISKY. A default behaviour for unmatched commands is defined (e.g. RISKY).
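By way of non-limiting illustration, the rule-matching behaviour described above can be sketched as follows in Python. The rule patterns, labels and data structures here are illustrative assumptions and do not form part of the rule database 407:

```python
import re

# Illustrative rules only: each pairs a regex pattern with a risk label.
# The actual rule format in the rule database is implementation-specific.
RULES = [
    (re.compile(r"^ls(\s|$)"), "SAFE"),
    (re.compile(r"^cat\s"), "SAFE"),
    (re.compile(r"\brm\s+-rf\b"), "RISKY"),
    (re.compile(r"\bshutdown\b"), "RISKY"),
]

DEFAULT_LABEL = "RISKY"  # default behaviour for unmatched commands


def classify_command(command: str) -> str:
    """Return the label of the first matching rule, or the default label."""
    for pattern, label in RULES:
        if pattern.search(command):
            return label
    return DEFAULT_LABEL
```

Unmatched commands fall through to the default label, mirroring the default behaviour (e.g. RISKY) described above.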
Command database 406 stores past risk predictions determined based on the set of rules in the rule database 407, including a respective command and its corresponding predicted classification. The command-prediction pairs can be used for offline auditing of the rule-based classifier.
The documentation system 408 comprises a command knowledge base, in this example in the form of documentation database 409, which stores external documentation of terminal-executable programs; and a command analysis module, in this example in the form of a command describer system 410, to produce explanation summaries of terminal commands, shown at 411.
The documentation system 408 extracts relevant information regarding the executed command from command documentation, and summarizes the information in the form of a command description in natural language. In this example, the natural language is English, but other languages may be used.
The AI classifier 412 takes a raw command and its generated text command description as input and predicts a classification for the command. The classification may have a corresponding risk label that indicates the risk level of the executed command. The AI model of the AI classifier can be trained on historical command predictions retrieved from the command database 406.
The system compares the output of the rule-based system 405 and the AI-based classifier 412 to audit the rules.
In this example, a similarity search module 413 additionally classifies commands through similarity analysis with existing rules. Classification rules are retrieved from the rule database 407. The rules that are most similar to the input command are identified by means of a similarity function. Using the labels of the most similar rules, a risk prediction is obtained by aggregation of responses (for example using averaging or voting). The system integrates the existing rule knowledge into the auditing system and provides a fall-back solution in case of missing documentation. The similarity search system makes it possible to overcome potential limitations in the documentation knowledge base, by exploiting the existing system rules to generalize risk predictability in proximal cases.
Responses from the similarity search system and the AI-based classifier are aggregated to produce a final output response. Response aggregator 414 receives the output of the similarity search system 413 and the AI classifier 412 and aggregates the responses to generate a final risk prediction for the command.
Where the similarity search system is not used, the final risk prediction is the output of the AI classifier 412.
Prediction comparator 415 retrieves the historical prediction of the rule-based predictor and compares it to the final prediction produced by the response aggregator 414. If the predictions differ, it reports the issue to the administrator 404 to signal the potential misprediction, as shown at 416.
An example of the documentation system 408 is depicted in Fig. 5. The documentation database 409 is a document-oriented database where documentation files of terminal-executable programs can be stored in a structured format. A single documentation file can describe how to use a single program in detail, including syntax usage, program behaviour, and available options. The documentation can be organized in paragraphs, so that each paragraph describes either the general usage or a specific option and/or argument. The documentation can be imported in a preliminary setup step using an import script, which takes standardized documentation files as input. Documentation that does not adhere to the standardized format required by the import script can be converted. Internally-defined commands can also be included to improve auditing quality. They can be documented directly in the standardized format, or may be converted.
Some examples of the different data sources employed can be summarized as follows:
- documentation pages describing standard operating system commands, files, and system calls;
- internally-defined scripts, programs and aliases;
- additional third-party programs and tools frequently used in the IT environment;
- additional third-party or internal filenames frequently used in the IT environment.
The documentation system acts as a command describer, which is a software module that produces explanation summaries of terminal commands. Given an incoming command, the command describer retrieves the relevant documentation pages from the documentation database, extracts the information relevant to the current command instance, and produces a short text describing the command functionality in the English language.
External command documentation included in the documentation database 409 may, for example, comprise any of the following:
- documentation pages describing common Linux commands (‘manpages’), which are freely available online and are directly importable into the documentation database;
- additional third-party commands that are documented can be converted to the manpages format;
- scripts, internal commands and aliases can be documented directly in manpages format to improve auditing quality;
- frequent filenames.
In the example shown in Fig. 5, the documentation system 408 comprises three submodules.
The first module is a parser program 501, which produces an Abstract Syntax Tree (AST) representation of the incoming command 500, where command tokens may be, for example, tree nodes with program, option, or argument tags.
The second module is a matcher program 502, which associates elements in the AST with documents in the documentation database, by matching AST element names with documentation file titles and aliases. After the relevant documents are identified, for each document relevant paragraphs are extracted, by matching options and arguments of the AST with the paragraphs of the documentation. General description paragraphs for the program (synopsis, usage, main functionality) are preferably included in the final selection.
The third module is a post-processing program 503, which concatenates the selected paragraphs and/or cleans the obtained text, for example by removing HTML tags, links, author and copyright information, and/or by removing trailing spaces and normalizing text. The final output is a single string describing the behaviour of the input command in a natural language format. The command description is specific to the set of programs, options and arguments
specified. The description is sent to the AI model as an additional source of information for risk classification. In case no relevant documents are identified, no command description is returned.
Therefore, the incoming command is parsed to recognize named entities. The output is an AST with entities (programs, options, arguments) as nodes. Documentation corresponding to matched entities is retrieved. The system queries the documentation database with program names and files found in the command, and returns exact matches. Relevant text information is extracted to construct a concise command description. Elements in the AST are associated with database matches. From the matches, the following can be selected:
- Base description (usage, main functionality)
- If a program, the paragraphs associated with the flags and parameters specified (e.g. -l, -r).
The extracted text can be cleaned and merged to obtain a concise command description. This cleaning process may remove redundant information (such as links, copyright and author information). The extracted text may be concatenated, for example by merging selected paragraphs into a single block of text. The command description is sent to the AI model as an additional source for risk classification prediction. The natural language model (in this example, the English language model) can learn to extract semantic patterns from the text, such as ‘dangerous’ keywords (erase, delete, stop, disable) and ‘safe’ keywords (query, print, return), which can help to indicate the risk posed by the command.
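A minimal sketch of such a command describer is given below, assuming a toy in-memory documentation store in place of the documentation database 409. The documentation content, the flag-expansion logic and the function names are hypothetical and shown only to illustrate the retrieve-extract-clean flow:

```python
import re
import shlex

# Toy stand-in for the documentation database: program name ->
# {"description": ..., "options": {flag: paragraph}}. Content is hypothetical.
DOCS = {
    "rm": {
        "description": "rm removes the specified files.",
        "options": {
            "-r": "Remove directories and their contents recursively.",
            "-f": "Ignore nonexistent files, never prompt.",
        },
    },
}


def describe(command: str) -> str:
    """Build a natural-language command description, or '' when no
    relevant documentation is found (mirroring the no-match case above)."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in DOCS:
        return ""
    doc = DOCS[tokens[0]]
    paragraphs = [doc["description"]]  # base description (usage, functionality)
    for token in tokens[1:]:
        if token.startswith("-"):
            # Expand combined short flags such as "-rf" into "-r" and "-f".
            flags = [token] if token.startswith("--") else ["-" + c for c in token[1:]]
            for flag in flags:
                if flag in doc["options"]:
                    paragraphs.append(doc["options"][flag])
    # Post-processing: strip HTML tags and normalize whitespace.
    text = " ".join(paragraphs)
    text = re.sub(r"<[^>]+>", "", text)
    return re.sub(r"\s+", " ", text).strip()
```

The resulting single string is what would be forwarded to the AI model as an additional input for classification.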
Fig. 6 summarizes the main elements of an exemplary AI classifier 412. In this example, the AI classifier is a neural network model based on a BERT model. Other model architectures may be used to classify the commands in dependence on the external documentation.
In this example, the AI classifier comprises a BERT head 601 pretrained on an English corpus (referred to herein as the English BERT head). In alternative implementations, the head may implement an alternative natural language. This first head 601 is for processing the documentation text (the command description) produced by the documentation system. This head is used in combination with a full tokenizer 602. In this example, the full tokenizer 602 combines base tokenization (for example, based on spacing and punctuation) with WordPiece tokenization (see https://ai.googleblog.com/2021/12/a-fast-wordpiece-tokenization-system.html). The full tokenizer 602 efficiently encodes the English text to a numeric representation.
The classifier also comprises a BERT head 603 pretrained on a corpus of Bash commands (referred to herein as the Bash BERT head). In alternative implementations, the head may implement an alternative computer programming language. This second head 603 is for processing the command string. This head is used in combination with a byte-pair encoding (BPE) tokenizer 604, which can adapt the set of encoding tokens to the input provided during pre-training and to the desired vocabulary size. It is capable of preserving the sequential nature of the executed commands, thereby preserving the context of the arguments, options and flags. Byte-pair tokenization and encoding specialized for the Bash language is therefore employed by the AI-based classifier. It considers the relationships between tokens that frequently appear together, thereby improving the classification results.
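The byte-pair approach can be illustrated with the following simplified sketch, which learns character-pair merges from a small corpus of commands. Production BPE tokenizers operate on much larger corpora and vocabularies; the corpus and function names here are purely illustrative:

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn byte-pair merges from a corpus of strings.

    Each string starts as a sequence of single characters; at each step
    the most frequent adjacent pair is merged into a single token.
    """
    sequences = [list(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = a + b
        for seq in sequences:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [merged]
                else:
                    i += 1
    return merges


def tokenize(text, merges):
    """Apply the learned merges, in order, to segment a new string."""
    seq = list(text)
    for a, b in merges:
        merged = a + b
        i = 0
        while i < len(seq) - 1:
            if seq[i] == a and seq[i + 1] == b:
                seq[i:i + 2] = [merged]
            else:
                i += 1
    return seq
```

Because merges are learned from command text, frequently co-occurring character sequences (such as common flags) tend to become single tokens, which is the property exploited by the BPE tokenizer 604.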
The AI classifier can be set up in two steps: pre-training and fine-tuning, as shown in Fig. 6. During pre-training, the BERT heads are pre-trained on corresponding large corpora of unlabelled text (English-language documents for the English head, Bash commands for the Bash head). Exemplary tasks employed for pre-training are masked token prediction and next sentence prediction (see Devlin, Jacob, et al., "BERT: Pre-training of deep bidirectional transformers for language understanding", arXiv preprint arXiv:1810.04805 (2018)). In this same step, the BPE tokenizer 604 can be trained on the Bash corpus to recognize and encode tokens present in Bash commands, while the full tokenizer 602 for English may not require a training step.
The pre-trained models are then able to capture the complex relationship between tokens and learn the syntax constructs of the two languages (English and Bash in this example). During fine-tuning, the classification network is connected to the BERT heads to learn the classification task of determining classifications for the commands, such as RISKY and SAFE. Labelled commands stored in the command database can be used for the training. The pretrained models are presented with the commands, the corresponding command description and their associated label. The models can then be trained during the fine-tuning stage to discriminate the SAFE from RISKY commands using a back-propagation algorithm.
During execution, the AI classifier 412 may be employed as follows. The respective command to classify is retrieved from the command database 406. The command is tokenized using the BPE tokenizer 604 and fed into the pre-trained Bash BERT head 603. The command description is retrieved from the documentation system 408. If the command description is available, the command description is tokenized using the full tokenizer 602 and fed into the pre-trained English BERT head 601.
An exemplary classification network, shown at 607, connecting the two BERT heads, comprises a fully connected layer mapping the BERT outputs 605, 606 to a class prediction, such as ‘RISKY’, as shown in Fig. 6 for a particular command. The output vectors 605 and 606 produced by the two BERT heads are concatenated. Then, a fully-connected layer applies a linear transformation to map the concatenated inputs to the number of prediction classes (for example, SAFE or RISKY). The output can be scaled using a SoftMax activation to obtain class probabilities. In case no relevant documentation is available for a specific command, the Bash BERT head can operate while the English BERT head is disabled, as described below.
In case a command description is available, both BERT heads 601, 603 and the classification network 607 process the tokenized inputs to output a classification prediction. In case no command description is available, only the Bash BERT head 603 and the classification network 607 process the tokenized command to output a classification prediction. In this case, the input of the classification network from the English BERT head is set to 0.
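The behaviour of the classification network 607 can be sketched numerically as follows, using toy two-dimensional head outputs in place of real (high-dimensional) BERT output vectors. The weights, dimensions and function names are illustrative assumptions, not trained parameters:

```python
import math


def softmax(xs):
    """Scale logits to class probabilities (numerically stable form)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def predict_class(bash_vec, english_vec, weights, bias,
                  labels=("SAFE", "RISKY")):
    """Concatenate the two head outputs, apply a fully connected layer,
    and return (predicted label, class probabilities)."""
    # If no command description is available, the English-head input is
    # zeroed, as described above.
    if english_vec is None:
        english_vec = [0.0] * (len(weights[0]) - len(bash_vec))
    x = list(bash_vec) + list(english_vec)  # concatenation of 605 and 606
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return labels[probs.index(max(probs))], probs
```

The returned probability vector corresponds to the per-class probabilities used later by the weighted aggregation method.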
The solution therefore utilizes two language models (Bash for commands, English for documentation) to classify executed commands, for example as RISKY or SAFE.
Optionally, a similarity search system can also be used to overcome potential limitations in the documentation knowledge base, which may occur in some implementations, by exploiting the existing system rules to generalize risk predictability in proximal cases. The similarity search thus provides a fall-back solution based on the most similar rules, so that a risk classification from the auditing system is always available.
A diagram of an exemplary similarity search system is shown in Fig. 7.
In this example, the similarity search system 413 implements a k-Nearest-Neighbours (KNN) algorithm to classify incoming commands based on the rules stored in the rule database 407. Each rule is associated with a blocking/allowing policy, which can be used as training labels for the classification algorithm. In one implementation, during training, the rules are retrieved from the database and an efficient tree-like (k-d tree) data structure is fitted to enable fast neighbour queries during the inference phase. During inference, the incoming command is used as a query to return the most similar matching rules. To evaluate similarity, a similarity distance (for example, Jaccard, Hamming or Levenshtein distance) can be used; the Levenshtein distance, for instance, measures the minimum number of insertions, deletions, or replacements necessary to convert the command into a potential match. Once the top k matches are retrieved, their associated labels are used to predict the class of the incoming command by majority voting.
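A simplified sketch of this similarity classification, using the Levenshtein distance and majority voting (and a linear scan in place of the k-d tree), might read as follows; the rule set shown is illustrative only:

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or replacements."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # replacement
        prev = cur
    return prev[-1]


def knn_classify(command, rules, k=3):
    """rules: list of (pattern, label) pairs. Returns the majority label
    of the k rules closest to the command in edit distance."""
    ranked = sorted(rules, key=lambda r: levenshtein(command, r[0]))
    votes = Counter(label for _pattern, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

This yields a risk label even for commands with no documentation match, providing the fall-back behaviour described above.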
Therefore, rules similar to the input command can be identified, and the risk classifications of the identified rules can be combined to predict the risk of the input command.
When both predictions from the AI-based classifier 412 and the similarity search system 413 are available, responses are aggregated to generate a single prediction. Depending on the objectives of the auditing system, different aggregation methods may be defined.
Using a weighting aggregation method, each classification prediction (y) is considered in proportion to its weight:

y_final = argmax(p_final), where p_final = w_BERT · p_BERT + w_similarity · p_similarity

where p_BERT, p_similarity and p_final are probability vectors over the classes, and w_BERT and w_similarity are scalar constants with unit sum. This requires tracking the class probabilities (p) for each predictor, in addition to the prediction classes (y).
Alternatively, using a recall-first weighting aggregation method, the riskiest prediction class is selected:
y_final = max(y_BERT, y_similarity), with "RISKY" > "SAFE"
By default, the recall-first aggregation may be enabled.
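The two aggregation methods can be sketched as follows; the weight values, label ordering and function names used here are illustrative assumptions:

```python
# Risk ordering used by the recall-first method: RISKY outranks SAFE.
RISK_ORDER = {"SAFE": 0, "RISKY": 1}


def weighted_aggregate(p_bert, p_similarity, w_bert=0.5, w_similarity=0.5,
                       labels=("SAFE", "RISKY")):
    """Weighting aggregation: combine per-class probability vectors.

    w_bert and w_similarity are assumed to sum to one.
    """
    p_final = [w_bert * pb + w_similarity * ps
               for pb, ps in zip(p_bert, p_similarity)]
    return labels[p_final.index(max(p_final))]


def recall_first_aggregate(y_bert, y_similarity):
    """Recall-first aggregation: select the riskier of the two labels."""
    return max(y_bert, y_similarity, key=RISK_ORDER.__getitem__)
```

The weighted method requires the per-class probabilities from each predictor, whereas the recall-first method needs only the predicted labels.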
After the final prediction is determined, it is compared with the historical prediction performed by the rule-based system for the same command, as stored in the command database 406. If the two classification predictions differ, an alert is output, which can be notified to the system administrator to report an anomalous prediction. The administrator can choose to intervene on the rule database to improve prediction quality of the existing risk assessment system. Alternatively, the rule database may be automatically updated.
Fig. 8 shows an operation sequence diagram for the exemplary system described above. The system components are numbered as described previously.
At 801, the prediction job is started and an instruction is sent to the command database 406 to retrieve a historical command and its corresponding classification (determined according to the rules in the rule database 407). At 802, the command is sent to the similarity search system 413, which queries the rule database 407 at 803. At 804, similar
rules are returned to the similarity search system 413. A similarity prediction is sent to response aggregator 414 at 805.
At 806, the command is also sent to the command describer 410. The documentation of the documentation database 409 is queried at 807 and documentation matches for the command are sent to the documentation system at 808. A command description is formed, which is sent to the AI classifier 412 at 809.
The AI classifier 412 also receives the command at 810. The classification prediction formed by the AI classifier 412 is sent to the response aggregator 414 at 811. The final auditing classification prediction is sent to the prediction comparator 415 at 812. The prediction comparator also receives the rule-based classification prediction from the command database 406 at 813.
At 814, the two classification predictions (the first being determined based on the rules and the second being determined by the AI classifier as described herein) are compared. If they differ, a notification is output at 815.
Fig. 9 shows a flowchart illustrating the steps of an exemplary method 900 for auditing rules in a rule-based command risk assessment system. The apparatus implementing the method is communicatively connectable to one or more data stores for storing one or more commands and a set of pre-existing rules for classifying the commands. For each of at least one of the commands, the method comprises the following steps. At step 901, the method comprises receiving a first predicted classification for the respective command, the first predicted classification being determined based on the set of pre-existing rules. At step 902, the method comprises forming a second predicted classification for the respective command based on external documentation for the respective command. At step 903, the method comprises, if the first prediction differs from the second prediction, outputting a notification.
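The three steps of method 900 can be sketched as a simple audit loop. The predictor callables below stand in for the rule-based classifier and the documentation-based classifier, respectively, and are assumptions for the purpose of illustration:

```python
def audit(commands, rule_predict, audit_predict, notify):
    """For each command, compare the rule-based prediction with the
    documentation-based prediction and notify on any discrepancy."""
    reports = []
    for command in commands:
        first = rule_predict(command)    # step 901: rule-based classification
        second = audit_predict(command)  # step 902: documentation-based classification
        if first != second:              # step 903: output a notification
            notify(command, first, second)
            reports.append((command, first, second))
    return reports
```

In the system described above, `rule_predict` would read historical predictions from the command database, and `audit_predict` would be the aggregated output of the AI classifier and the similarity search system.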
Fig. 10 shows an example of an apparatus 1001 for performing the methods described herein. The apparatus 1001 may comprise at least one processor, such as processor 1002, and at least one memory, such as memory 1003. The memory stores in a non-transient way code that is executable by the processor(s) to implement the apparatus in the manner described herein. The apparatus may be associated with multiple processors and each processor may be used to perform a different function. One or more of the processors or memories associated with the apparatus may, for example, be based in the cloud.
Internally deployed rules in risk assessment systems are traditionally based on the past knowledge of operators, which may not cover all combinations of arguments and/or options of past commands, and equally may not cover new or unseen commands. Moreover, even in the case of commands covered and classified by the existing set of rules, the classification risk described by the whitelist/blacklist rule may differ from the concrete risk of the command. In that case, either a dangerous command may be considered safe, allowing the operators to run it and cause damage to the system; or a safe command may be unjustifiably blocked, resulting in an unnecessary blocking action, which may hinder prompt operations.
The approach described herein advantageously leverages previous knowledge coming from an existing set of classification rules and additionally utilises external sources of knowledge coming from command documentation to evaluate the inherent risk of commands, so that an existing classification rule can be challenged and revised. In embodiments of the present disclosure, by evaluating the predictions made by the rule-based system and comparing them to internal predictions computed using an external knowledge base, risk predictions considered inaccurate can be reported, and the corresponding rules can be updated accordingly.
The solution for the auditing of rules in a rule-based risk classification system described herein thus combines external information and embedded knowledge from different sources, to provide a more accurate command classification. It provides a mechanism to dynamically discover and revise the security rules to govern a cloud infrastructure and adopts an auditing mechanism to evaluate the existing rules and suggest improvements. As a result, the quality of the existing rules for classification is enhanced, the effort required to revise the existing rules is reduced, and incident frequency may be reduced.
The solution combines documentation and operator risk knowledge to classify commands by means of sequential neural network model, which can capture the relationships and context of commands expressed in CLI language. The system can produce more accurate command classifications, resulting in fewer dangerous operations executed. The language model can generalize to unseen commands, subcommands, and flag combinations.
Existing rules can be re-used to predict the risk of unseen commands. This allows the system to cover cases where no documentation can be found, or the command risk cannot be assessed otherwise. The ability of the system to re-use existing knowledge makes it possible to take full advantage of existing resources, and reduces human intervention and the need for re-training.
The solution can improve itself over time by training on new samples of data observed from real environments. It can provide a higher grade of security to the target hosts. Furthermore, it can adapt itself to new application command line interfaces (CLIs) if they are newly introduced to the environment.
The present approach operates for both discovery of false positive and false negative predictions during the auditing process, via comparison of all predictions for all samples, not only the positive predictions or for specific rules.
The approach can work on all valid commands, including commands resulting from concatenation of multiple commands via pipes and other possibilities allowed by the commandline language (&, xargs, ...).
Embodiments of the present disclosure also overcome several limitations of existing auditing systems. Previous ML-based direct command classifiers are trained on a large quantity of labelled commands, but cannot generalize beyond commands that are sufficiently similar to the ones shown during the training phase. That is because the classifier learns to recognize the risky patterns inside the command string. When the strings change significantly, the classifier is no longer able to match the corresponding risk patterns, and cannot classify accurately. Moreover, a classifier trained on labels provided by operators is trained to agree with the operators’ opinion, and therefore cannot signal disagreement with it. Man-in-the-loop auditing systems report to humans when existing rules are not applicable and no risk assessment is available. Such systems therefore require constant human intervention to provide feedback and improve existing classification systems. Embodiments of the present disclosure can reduce the number of commands reported for revision, by only reporting commands when it is deemed that a concrete revision action must be taken, because there is a discrepancy in the predictions, and by avoiding overreporting due to lack of knowledge, thanks to a fall-back mechanism that ensures a risk classification can always be produced.
Previous solutions for auditing also generally cannot effectively capture the context, because they apply algorithms that do not consider the order of tokens and their relationships in a sequence, and they cannot discover past wrong judgements, especially false negative predictions. They are unable to correctly evaluate unseen commands, because of a generalization gap and a lack of applicable rules or neighbour commands/patterns for similarity classification. In implementations of the present disclosure, using language models (such as BERT or GPT-3) to classify commands by extracting information from documentation makes it possible to capture the context of the command. The language model can generalize to unseen commands, subcommands, and flag combinations and preserve the sequential nature of the executed
commands, hence considering the context in which flags and sub-commands are being used. The use of byte-pair tokenization and encoding prior to classifying the executed commands, which considers correlation between tokens, and a transformer architecture, can also help to effectively model and process text sequences.
The solution can also be extended to other CLI tools such as databases, interactive shell languages, and network device management and is not only restricted to the UNIX CLI and cloud environments.
In other implementations, the system may be deployed to prevent employees from running commands that, for example, can damage Windows OS. Risky commands can be blocked and require additional permissions to be executed. The method may also be implemented in such systems as proxy servers, firewall systems, and databases and can be used to automatically review access control policies, including exceptions and whitelisted operations (defined in the access points as rules), to prevent harmful connections from being initiated. The method may also be used as part of standard operating procedure (SOP) verification. Given a natural language description of the SOP (acting as documentation) and a list of commands or operations executed (acting as the command), the system can be used to verify if the SOP was followed. The system may also be used in the medical industry to automatically review diagnostic tests to verify that important diagnoses have not been overlooked. The medical review system may take a diagnostic exam outcome as input and revise the decision of medical personnel to prevent false negative (FN) diagnoses of high-risk diseases.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present disclosure may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the disclosure.
Claims
1. An apparatus (1001) for auditing rules in a rule-based command risk assessment system, the apparatus being communicatively connectable to one or more data stores (406, 407, 1003) storing one or more commands and a set of pre-existing rules for classifying the commands, and comprising one or more processors (1002) configured to, for each of at least one of the commands: receive (901) a first predicted classification for the respective command, the first predicted classification being determined based on the set of pre-existing rules; form (902) a second predicted classification for the respective command based on external documentation for the respective command; and if the first prediction differs from the second prediction, output (903) a notification.
2. The apparatus (1001) as claimed in claim 1, wherein the one or more commands are historical commands issued by an operator (301, 401) of a target host (303, 403) controlled by an access control system (302, 402) implementing the rule-based command risk assessment system (304, 405) according to the set of pre-existing rules, each historical command having a corresponding first predicted classification determined based on the set of pre-existing rules.
3. The apparatus (1001) as claimed in any preceding claim, wherein the first classification and the second classification each comprise an indication of a level of risk relating to the execution of the respective command.
4. The apparatus (1001) as claimed in any preceding claim, wherein the external documentation comprises one or more of the following: documentation describing standard operating system commands, files or system calls; internally-defined scripts, programs or aliases; and third-party programs, tools or internal filenames.
5. The apparatus (1001) as claimed in any preceding claim, wherein the one or more processors are configured to process the external documentation or information derived therefrom and the command to form the second predicted classification.
6. The apparatus (1001) as claimed in any preceding claim, wherein the one or more processors are configured to: retrieve relevant documentation for the respective command from the external documentation;
extract information relevant to the respective command from the relevant documentation; and in dependence on the extracted information, form a command description for the respective command.
7. The apparatus (1001) as claimed in claim 6, wherein the command description has a natural language format and the command has a computer programming language format.
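The retrieve/extract/describe pipeline of claims 6 and 7 can be sketched as follows. The documentation store (a dict keyed by executable name) and the flag-matching extraction heuristic are assumptions made for illustration:

```python
# Illustrative sketch of claims 6-7: retrieve documentation for a command,
# extract the fragments relevant to it, and form a natural-language command
# description from a computer-programming-language command.

def retrieve_documentation(command, doc_store):
    """Retrieve the man-page-like entry for the command's executable."""
    executable = command.split()[0]
    return doc_store.get(executable, "")

def extract_relevant(command, documentation):
    """Extract documentation lines mentioning any flag the command uses,
    falling back to the first line (the summary) when no flag matches."""
    flags = [token for token in command.split()[1:] if token.startswith("-")]
    lines = documentation.splitlines()
    matching = [line for line in lines if any(flag in line for flag in flags)]
    return matching or lines[:1]

def form_description(command, doc_store):
    """Form the natural-language command description of claim 7."""
    documentation = retrieve_documentation(command, doc_store)
    relevant = extract_relevant(command, documentation)
    return f"The command `{command}` does the following: " + " ".join(relevant)
```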
8. The apparatus (1001) as claimed in claim 6 or claim 7, wherein the one or more processors are configured to input the command description to an artificial intelligence model (412) and form the second predicted classification in dependence on an output of the artificial intelligence model.
9. The apparatus (1001) as claimed in claim 8, wherein the artificial intelligence model comprises a first head (601) for processing the command description and a second head (603) for processing the respective command.
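The two-head arrangement of claims 8 and 9 (one head for the command description, one for the raw command, combined into a single output) can be sketched without any particular machine-learning framework. The keyword features, token sets, and equal weighting below are toy assumptions standing in for learned model parameters:

```python
# Minimal sketch of the claim-9 two-head model: head 1 (601) scores the
# natural-language description, head 2 (603) scores the raw command, and a
# shared combination layer produces the model output used in claim 8.

RISKY_WORDS = {"delete", "remove", "overwrite", "format"}
RISKY_TOKENS = {"rm", "dd", "mkfs", "-rf"}

def description_head(description):
    """Head 1: fraction of risky words in the description."""
    words = description.lower().split()
    return sum(word.strip(".,") in RISKY_WORDS for word in words) / max(len(words), 1)

def command_head(command):
    """Head 2: fraction of risky tokens in the command itself."""
    tokens = command.split()
    return sum(token in RISKY_TOKENS for token in tokens) / max(len(tokens), 1)

def model_output(command, description, weight=0.5):
    """Combine both heads into a single risk score in [0, 1]."""
    return weight * description_head(description) + (1 - weight) * command_head(command)
```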
10. The apparatus (1001) as claimed in any preceding claim, wherein the one or more processors are configured to form the second predicted classification in dependence on a third classification prediction formed in dependence on one or more risk indications corresponding to one or more rules of the set of rules selected based on similarity with the respective command.
11. The apparatus (1001) as claimed in claim 10 as dependent on claim 8 or claim 9, wherein the one or more processors are configured to form the second predicted classification by aggregating the output of the artificial intelligence model (412) and the third classification prediction.
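Claims 10 and 11 can be read together as: select the pre-existing rules most similar to the command, form a third prediction from their risk indications, then aggregate that with the AI model's output. A sketch under assumed choices (shared-token similarity, numeric risk indications, linear aggregation):

```python
# Hedged sketch of claims 10-11: a third classification prediction from
# similar rules' risk indications, aggregated with the model output to form
# the second predicted classification.

def similar_rules(command, rules, k=2):
    """Select up to k rules whose pattern shares the most tokens with the command."""
    tokens = set(command.split())
    ranked = sorted(rules, key=lambda rule: -len(tokens & set(rule[0].split())))
    return ranked[:k]

def third_prediction(command, rules):
    """Average the numeric risk indications of the most similar rules."""
    selected = similar_rules(command, rules)
    return sum(risk for _, risk in selected) / len(selected)

def aggregate(model_score, third_score, alpha=0.5):
    """Second predicted classification as a weighted aggregate of both scores."""
    return alpha * model_score + (1 - alpha) * third_score
```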
12. The apparatus (1001) as claimed in any preceding claim, wherein the one or more processors are configured to output respective notifications for multiple commands, and wherein the notifications are pre-grouped by similarity and reported to an administrator of the system.
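The pre-grouping of notifications by similarity in claim 12 could, as one illustrative choice, key groups on the command's executable name so an administrator reviews each family of disagreements once:

```python
# Toy sketch of claim 12: group (command, first, second) notifications by the
# command's executable name before reporting. The grouping key is an
# illustrative assumption; any similarity measure could be substituted.
from collections import defaultdict

def group_notifications(notifications):
    """Return {executable: [notifications]} for administrator review."""
    groups = defaultdict(list)
    for command, first, second in notifications:
        groups[command.split()[0]].append((command, first, second))
    return dict(groups)
```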
13. The apparatus (1001) as claimed in any preceding claim, wherein the one or more processors are configured to, in dependence on the notification, update one or more of the rules of the pre-existing set of rules.
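One simple way to realise the rule update of claim 13 is to re-label every rule that matched the notified command with the documentation-based classification. The (pattern, risk) rule representation is an assumption carried over for illustration:

```python
# Hypothetical sketch of claim 13: update pre-existing rules in dependence on
# a (command, first_prediction, second_prediction) notification.

def update_rules(rules, notification):
    """Replace the risk of every rule matching the notified command with the
    documentation-based (second) prediction; leave other rules unchanged."""
    command, first_prediction, second_prediction = notification
    return [
        (pattern, second_prediction if pattern in command else risk)
        for pattern, risk in rules
    ]
```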
14. A method (900) for auditing rules in a rule-based command risk assessment system, for an apparatus (1001) communicatively connectable to one or more data stores (406, 407, 1003) storing one or more commands and a set of pre-existing rules for classifying the commands, the method comprising, for each of at least one of the commands: receiving (901) a first predicted classification for the respective command, the first predicted classification being determined based on the set of pre-existing rules; forming (902) a second predicted classification for the respective command based on external documentation for the respective command; and if the first predicted classification differs from the second predicted classification, outputting (903) a notification.
15. A computer-readable storage medium (1003) having stored thereon computer-readable instructions that, when executed by a computer, cause the computer to perform the method (900) of claim 14.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2023/061225 WO2024223052A1 (en) | 2023-04-28 | 2023-04-28 | Apparatus and method for auditing rule-based command risk assessment systems |
| CN202380097594.XA CN121079683A (en) | 2023-04-28 | 2023-04-28 | An apparatus and method for auditing rule-based command risk assessment systems |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2023/061225 WO2024223052A1 (en) | 2023-04-28 | 2023-04-28 | Apparatus and method for auditing rule-based command risk assessment systems |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024223052A1 (en) | 2024-10-31 |
Family
ID=86332098
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2023/061225 Pending WO2024223052A1 (en) | 2023-04-28 | 2023-04-28 | Apparatus and method for auditing rule-based command risk assessment systems |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN121079683A (en) |
| WO (1) | WO2024223052A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120804324A (en) * | 2025-09-15 | 2025-10-17 | 马上消费金融股份有限公司 | Text processing method, computer device and medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070156670A1 (en) * | 2005-12-29 | 2007-07-05 | Blue Jungle | Techniques of optimizing policies in an information management system |
| US20150269383A1 (en) * | 2014-01-22 | 2015-09-24 | Object Security LTD | Automated and adaptive model-driven security system and method for operating the same |
Family Application Events
- 2023-04-28: WO PCT/EP2023/061225 filed (WO2024223052A1), active, pending
- 2023-04-28: CN 202380097594.XA filed (CN121079683A), active, pending
Non-Patent Citations (2)
| Title |
|---|
| BROWN, TOM ET AL.: "Language models are few-shot learners", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, vol. 33, 2020, pages 1877 - 1901 |
| DEVLIN, JACOB ET AL.: "Bert: Pre-training of deep bidirectional transformers for language understanding", ARXIV:1810.04805, 2018 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN121079683A (en) | 2025-12-05 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23723185; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023723185; Country of ref document: EP |