CN109299610B

CN109299610B - Method for verifying and identifying unsafe and sensitive input in android system

Info

Publication number: CN109299610B
Application number: CN201811163790.XA
Authority: CN
Inventors: 杨珉; 杨哲慜; 张磊; 何郁郁; 张振宇; 洪庚; 张源
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2018-10-02
Filing date: 2018-10-02
Publication date: 2021-03-30
Anticipated expiration: 2038-10-02
Also published as: CN109299610A

Abstract

The invention belongs to the technical field of program security analysis and vulnerability mining, and in particular relates to a method for identifying insecure sensitive input verification in the Android system. The method comprises three parts. First, input verification identification: interrupt branches are extracted from the program code and, by analyzing code structure features, independent program branches containing an interrupt instruction are found, in order to judge whether the current program execution is intended to verify input. Second, sensitive input verification identification: natural language processing is used to perform semantic-based clustering on a large number of input parameters, and then, given a small number of known sensitive parameters, machine learning is used to infer other unknown sensitive parameters. Finally, vulnerability identification: the input verifications that involve sensitive parameters are checked against security rules to determine whether they are unsafe. By identifying this type of input verification, the system-level security vulnerabilities it gives rise to can be determined, which is of great significance for strengthening the security of mobile systems and preventing system-level attacks.

Description

Method for verifying and identifying unsafe and sensitive input in android system

Technical Field

The invention belongs to the technical field of program security analysis and vulnerability discovery. It relates to natural language processing, machine learning, and static information flow analysis techniques, and in particular to a method for identifying unsafe input verification in the Android system.

Background

Over 60% of mobile devices run the Android system, on which a large number of applications related to our daily lives are running. To implement their functions, applications can read and operate Android system resources, such as the GPS device and the screen display, and perform sensitive operations, such as sending and deleting SMS messages. In the Android system, these resources and sensitive operations are managed by more than 100 system services. Access control in these services therefore plays an important role in the security of the overall system.

In the present invention we have performed empirical studies on a special set of key security checks in system services, which we define as sensitive input validation. The android system contains at least 700 different sensitive input verifications compared to 351 permissions contained in the system. They are used in large numbers for various purposes, such as to prevent general applications from accessing sensitive system level devices by restricting device names.

The present invention differs from conventional input validation research. Traditional research focuses on a narrow and well-defined set of sensitive inputs, such as Web inputs that may cause SQL injection attacks, and user-space pointers passed to the Linux kernel that may cause memory leak attacks. In the Android system, by contrast, it is not known in advance which inputs should be validated. The present invention therefore addresses a setting in which neither the inputs that need validation nor the places where they need to be validated are known. Specifically, this follows from three properties of sensitive input validation in Android: (1) Unstructured. Android permission checks rely on system-defined interfaces (such as Context). Sensitive input validation has no such interface: any input to a public method of a system service may be subject to sensitive input validation (a conditional statement that checks a parameter). (2) Ambiguously defined. Unlike permission checking, which is described in detail in the documentation of the Android permission model, there is no publicly available source that defines how sensitive input validation should be performed in Android system services. Thus, it is not clear whether a given input needs to be validated, or whether its validation is implemented correctly. (3) Fragmented. Sensitive input validation is scattered across a large number of Java classes. For example, in Android 7.0, our evaluation shows that sensitive input validations are dispersed across 173 different Java classes, while Android permission checks are concentrated in 6 classes. Furthermore, even within the same service method, sensitive input validations are often scattered across various execution paths, restricting system operations in a fine-grained manner.

Thus, although sensitive input verification in Android services is important, its security is often overlooked because of inadequate design and implementation. First, system developers may be confused about the system security model. An Android system service may incorrectly trust input from an ordinary application, and sometimes input validation is even placed in the application process (in the Android SDK). Second, system developers may neglect input verification when customizing the Android system. Nevertheless, there is currently no way to automatically identify sensitive input verifications in the Android system, or the security vulnerabilities that they give rise to.

Disclosure of Invention

The invention aims to provide a new method for identifying unsafe sensitive input verification, driven by code structure analysis and semantic analysis, which is suitable for automatically identifying, at large scale, the unsafe sensitive input verifications contained in the code of an Android system.

The invention provides a method for identifying insecure sensitive input verification, which identifies the insecure data sources that input verification depends on. It comprises three parts: input verification identification based on code structure analysis, sensitive input verification identification based on natural language processing and machine learning, and vulnerability identification based on security rules.

(I) Input verification identification based on code structure analysis. Interrupt branches in the program code, such as a branch that throws an exception, are extracted. The code structure features are analyzed to find the independent program branches that contain an interrupt instruction, so as to judge whether the current program execution is intended to check input.

(II) Sensitive input verification identification based on natural language processing and machine learning. Natural language processing is used to perform semantic-based clustering on a large number of input parameters, so that synonymous parameters are clustered together; then a small number of known sensitive parameters are specified, and machine learning is used to infer other unknown sensitive parameters.

(III) Vulnerability identification based on security rules. The input verifications that involve sensitive parameters are checked against the security rules to determine whether they are unsafe input verifications.

The overall design architecture of the present invention is shown in Fig. 1. The three parts of the invention are described in detail below:

(I) Input verification recognition based on code structure analysis

Since input validation is the core problem of the present invention, a method is needed to automatically identify and study input validation in the Android system. This problem is challenging because input validations are neither performed through predefined system interfaces nor identifiable through fixed APIs (as permission checks are). The present invention identifies input validation by its inherent code structure features. In particular, the first requirement for input validation is that the input must reach a compare statement through the data flow and be compared against some predefined value, or against a result obtained dynamically from other APIs. Different actions are then taken based on the result of the comparison. However, unlike a general program branch statement, input validation not only compares the input with other data, but also immediately interrupts program execution when validation fails. For example, throwing a SecurityException when validation fails causes the program to exit immediately. Thus, the present invention needs to know which termination actions are typically taken when validation fails. After analyzing a number of actual input validations in the Android system, the present invention summarizes the following four interrupt operations: (1) Throwing an exception. A straightforward way to signal that an application input violates input validation is to throw an exception such as SecurityException or IllegalArgumentException. (2) Returning a constant. The system service uses predefined constants to indicate that the caller failed input validation, and returns such a constant as the return value in the interrupt branch. (3) Logging and returning. Log information is useful for monitoring the operation of the system. In the interrupt branch, the service typically records some information about the illegal input and then returns. (4) Reclaiming resources and returning. In some cases, the system service needs to reclaim allocated resources first and then return directly.

Using these four interrupt operations, the method of the invention identifies input validation as follows: first, all program branch statements in a system service that can receive application input are determined; then, code structure analysis is used to judge whether these branch statements contain an interrupt branch; finally, some branch statements generate a large number of program branches depending on the input. Such branches are generally used to handle different input cases and are not intended to check the input, so they are deleted from the recognition result.
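The identification procedure above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the statement kinds, the `Branch` and `Conditional` types, and the `MAX_BRANCHES` cutoff used to discard many-way dispatch statements are all assumptions made for illustration, and a real analysis would operate on Jimple statements rather than on string labels.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical, simplified statement kinds for the body of one program branch.
THROW, RETURN_CONST, RETURN_VOID, LOG, RELEASE, OTHER = (
    "throw", "return_const", "return_void", "log", "release", "other")


@dataclass
class Branch:
    stmts: List[str]  # statement kinds in execution order


@dataclass
class Conditional:
    uses_app_input: bool  # does the compared value come from application input?
    branches: List[Branch] = field(default_factory=list)


def interrupt_kind(branch: Branch) -> Optional[str]:
    """Return which of the four interrupt operations the branch performs, if any."""
    s = branch.stmts
    if not s:
        return None
    if s[-1] == THROW:
        return "throw_exception"
    if s == [RETURN_CONST]:
        return "return_constant"
    if s[-1] in (RETURN_CONST, RETURN_VOID) and len(s) > 1:
        body = set(s[:-1])
        if body <= {LOG}:
            return "log_and_return"
        if body <= {LOG, RELEASE}:
            return "reclaim_and_return"
    return None


# Assumed cutoff: statements with more branches are treated as input dispatch.
MAX_BRANCHES = 2


def is_input_validation(cond: Conditional) -> bool:
    if not cond.uses_app_input:
        return False
    if len(cond.branches) > MAX_BRANCHES:
        return False  # many-way branch: handles input cases, does not check them
    return any(interrupt_kind(b) is not None for b in cond.branches)
```

In this sketch a conditional counts as an input validation only if it consumes application input, has few branches, and at least one branch matches one of the four interrupt patterns.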

(II) sensitive input verification recognition based on natural language processing and machine learning

Currently, there is no efficient way to distinguish sensitive input validations from all input validations. Understanding the processing logic applied to the input parameters in a system service, together with the corresponding operation type, would give a more accurate and complete result. However, this kind of analysis requires a large amount of a priori knowledge describing which operations in the system are sensitive, and such knowledge is often difficult to obtain. The present invention therefore takes a different approach. Using machine learning, a small set of known sensitive input validations is marked as starting samples, and the rest are learned automatically by association rule mining.

The traditional way to mark sensitive inputs is to use the semantic information of variable names. For example, the identity of the caller is represented by the sensitive variable "packageName". However, the Android system manages a large number of system resources and uses many different variable names to represent them. It is difficult to confirm their sensitivity without fully understanding the entire Android system. Thus, the present invention specifies a few initial known sensitive input validations and then uses association rule mining to automatically discover other potentially sensitive input validations. This method is chosen because sensitive input validations are correlated: they are usually located in the same service method. Taking "packageName" and "uid" as an example, the Android system often uses them together to verify the identity of an application. Thus, their sensitivities are likely to be positively correlated. The detailed method is as follows:

Pre-grouping of input validations. One important requirement of association rule mining is that enough samples (occurrences) of any given variable must be observed. However, if each unique variable name is handled separately, variables such as flag1 and flag2 may each occur only once in the codebase, making association rule mining ineffective. Thus, if variables share a common term (or prefix/suffix), they can simply be grouped together because they are semantically highly related. To this end, the invention pre-groups input validations by their input parameters in two steps: (1) Splitting variable names and extracting stems. Normally, an input parameter name is a sequence of words delimited by letter case. For example, 'componentName' can be divided into 'component' and 'name', and 'groupOwnerAddress' can be divided into 'group', 'owner' and 'address'. These long names can therefore be broken into separate words. In addition, for each separate word, the present invention attempts to further identify the base word. For example, words such as 'types' and 'subtype' are derived from the base word 'type', and the prefix 'm' of the words 'mflag' and 'mname' should also be deleted. After this step, the present invention obtains the root words of each input parameter. (2) Normalizing variable names. A normalized name is obtained by merging the root words of each input parameter. However, even after words are segmented and stemmed, it is difficult to avoid meaningless qualifiers, which in turn distort the final names. For example, the variable "linkaddress" may be divided into "link" and "address", and both "address" and the qualifier "link" are treated as root words. To delete qualifiers, the present invention calculates the frequency of occurrence of each pair of words. If two words often occur together, only the more popular word is retained. After these steps, the present invention groups all input validations in the Android system: input validations whose input parameters have the same normalized name are placed in the same group, for use in the subsequent machine learning based on association rule mining.
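The two pre-grouping steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the invention: the `BASE_WORDS` table is a hypothetical stand-in for a real stemmer, the member prefix is assumed to follow the `mFlag` camel-case convention, and the `min_pair_count` cutoff that decides when two words "often occur together" is an assumed parameter.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical stand-in for a real stemmer: maps derived words to base words.
BASE_WORDS = {"types": "type", "subtype": "type",
              "addresses": "address", "flags": "flag"}


def split_name(name):
    """Step 1: split a variable name on letter case and reduce words to base words."""
    name = re.sub(r"^m(?=[A-Z])", "", name)  # drop the member prefix, e.g. mFlag
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [BASE_WORDS.get(p.lower(), p.lower()) for p in parts]


def normalize_names(names, min_pair_count=2):
    """Step 2: merge root words, dropping the rarer word of frequent pairs."""
    roots = {n: split_name(n) for n in names}
    word_freq = Counter(w for ws in roots.values() for w in set(ws))
    pair_freq = Counter(p for ws in roots.values()
                        for p in combinations(sorted(set(ws)), 2))
    normalized = {}
    for name, words in roots.items():
        kept = list(words)
        for a, b in combinations(sorted(set(words)), 2):
            if pair_freq[(a, b)] >= min_pair_count:
                # Keep the more popular word; treat the other as a qualifier.
                loser = a if word_freq[a] < word_freq[b] else b
                if loser in kept and len(kept) > 1:
                    kept.remove(loser)
        normalized[name] = "_".join(kept)
    return normalized


def pre_group(names):
    """Group parameter names that share the same normalized name."""
    groups = defaultdict(list)
    for name, norm in normalize_names(names).items():
        groups[norm].append(name)
    return {k: sorted(v) for k, v in groups.items()}
```

With the parameter names `linkAddress`, `linkAddresses`, `address`, `ipAddress`, `mFlag` and `flags`, the pair ("address", "link") occurs twice, so the less frequent word "link" is dropped as a qualifier, while "ip" is kept because it co-occurs with "address" only once.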

Learning new sensitive input validations. Without a priori knowledge, it is difficult to ascertain whether a validation involves any sensitive input. However, system developers tend to place similar input validations close to each other. For example, the validations of "packageName" and "uid" are typically adjacent. Thus, the present invention uses the proximity of input validations as the feature for association mining. Specifically, the set of sensitive input validations is extended by association rule mining. First, the distance between each pair of input validations is calculated. Two input validations are considered adjacent if they occur in two basic blocks that share a common edge. Then, if two input validation groups contain multiple adjacent pairs, the groups are associated with each other. In this way, starting from a few known sensitive input validations, the present invention iteratively collects all associated groups until no new group can be discovered. The method can effectively find a large number of sensitive input validations.
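The iterative expansion described above can be sketched as a fixed-point computation. The following Python sketch is illustrative only: the function name, the representation of validations as identifiers mapped to group names, and the `min_pairs` cutoff standing in for "multiple adjacent pairs" are assumptions, not details given by the invention.

```python
from collections import Counter


def expand_sensitive_groups(seed_groups, adjacent_pairs, group_of, min_pairs=2):
    """Grow the set of sensitive groups from a few seeds.

    adjacent_pairs: pairs of validation ids found in adjacent basic blocks.
    group_of: maps each validation id to its pre-grouped (normalized) name.
    min_pairs: assumed number of adjacent pairs needed to associate two groups.
    """
    # Count how many adjacent validation pairs connect each two distinct groups.
    link = Counter()
    for a, b in adjacent_pairs:
        ga, gb = group_of[a], group_of[b]
        if ga != gb:
            link[frozenset((ga, gb))] += 1

    sensitive = set(seed_groups)
    changed = True
    while changed:  # iterate until no new group can be discovered
        changed = False
        for pair, count in link.items():
            if count >= min_pairs and len(pair & sensitive) == 1:
                sensitive |= pair
                changed = True
    return sensitive
```

For instance, if the seed is the "package" group, and "package" is linked to "uid" by two adjacent pairs while "uid" is linked to "user" by two more, both "uid" and "user" are collected; a group linked by a single adjacent pair is not.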

(III) vulnerability identification based on security rules

The invention detects the vulnerabilities caused by unsafe input verification in the Android system along two different dimensions, and formulates corresponding security rules for each.

First, the present invention looks for unsafe input validation in each android system through intra-system analysis.

Erroneously trusting data provided by an application. Some system services verify caller identity based on input parameters, but because these parameters come from an application, they can be forged, and all of these parameters should be untrusted. Thus, if sensitive input verification verifies sensitive data provided by an application, such sensitive input verification is not secure.

Erroneously trusting code in an application process. Because input validation is unstructured, input validations are often placed in application processes. In particular, the Android SDK, which runs within the application process, often includes various checks on input parameters. Typically, the Android SDK packages data from an application and forwards it to an Android system service process. During this packaging, a large number of input validations are used to check for illegal parameters, and many of them are sensitive. However, these sensitive input validations are omitted in the system services. They are insecure because an application can bypass the Android SDK and access system services directly. Furthermore, the Android SDK is traditionally understood to include only its exposed interfaces, but in fact the unexposed interfaces labeled @hide or @SystemApi are also within the reach of an application, since the application can still access these hidden interfaces via reflection.

Second, the invention searches for inconsistent sensitive input validation across multiple Android systems through inter-system analysis. In order to find sensitive input validations weakened by third-party vendors, the invention detects inconsistencies in sensitive input validation between the original Android system and third-party customized systems. First, similar system methods must be found in the different systems so that their input validations can be compared. The conventional approach of judging similarity only by class names and method names is not applicable here, because many vendor-customized systems introduce new system services that perform functions similar to the original Android services but use very different method names and offer significantly weaker security. Therefore, the invention clusters the public interfaces of different systems according to the similarity of method behavior. In particular, static taint analysis is used to represent the behavior of a method by its data dependency graph. When the behavior similarity of two method interfaces is higher than a threshold, the invention classifies them as similar methods. Then, by comparing whether similar methods have the same sensitive input validations, the invention can find many sensitive input validations that have been omitted by third-party vendors.

By identifying input validation, the invention can determine the system-level security vulnerabilities that it gives rise to, which is of great significance for strengthening the security of mobile systems and preventing system-level attacks. Specifically, traditional mining of identity verification vulnerabilities in the Android system targets the Android permission system, whereas the invention considers identity verification in the Android system from a new angle, namely that of input validation. Quantitatively, the Android system contains only about 350 permissions, while the number of sensitive input validations is as high as 700. In terms of identification difficulty, the interfaces for permission checking are well defined by the Android system and are distributed in only a few Java classes, whereas sensitive input validation is widely distributed and the Java classes of any system service may contain it. Furthermore, technically, recognizing sensitive inputs directly in the system code depends heavily on expert experience, especially given the large volume of Android system code. Extracting sensitive parameters directly from the Android system is therefore nearly impossible. To solve this problem, the invention is the first to propose that sensitive inputs can be identified by identifying sensitive input validations: in general, the input parameters used in sensitive input validation are themselves sensitive. Finally, the security rules for identifying vulnerabilities are formulated from a deep understanding of the Android system's layered model, which is of great significance for analyzing the system architecture and enhancing its security.

Drawings

Fig. 1 shows the overall framework of the system.

Fig. 2 shows sample code of sensitive input verification.

Fig. 3 shows sensitive input keywords and clusters.

Fig. 4 shows the security rules.

Detailed Description

The invention designs and implements a new method for identifying unsafe input verification that combines natural language processing and machine learning. This section describes the implementation details of the framework.

Input verification recognition based on code structure analysis

The Android system is analyzed using the Soot framework, a mature tool for decompiling and analyzing Java programs. First, the Android system image is decompressed and all Java class files are extracted from it. Soot is then used to decompile them, yielding the intermediate representation (files in Jimple format) of the system code. Then, from the decompiled Jimple code, all Android system services, the methods in these services, and their input variables are extracted to serve as the code information to be analyzed. When extracting system services, the present invention considers not only all system services declared and registered in Java classes, but also the system services that they use. This enables the invention to cover some of the system services that are implemented in native code.

For the extracted system services, the invention finds all the public methods they contain by analyzing their interface definitions, and then performs path-sensitive data flow taint analysis on each method. For inter-procedural taint analysis, the invention prunes a large number of unreachable nodes by filtering out the nodes protected by system-level permissions, thereby greatly reducing the time overhead of path-sensitive taint analysis and the complexity of path traversal.

Sensitive input verification identification based on natural language processing and machine learning

The invention uses the Java-based Stanford Parser for natural language processing. The Stanford Parser is a widely used syntactic parsing tool: it can parse the structure of a sentence, assign part-of-speech tags to the tokens in the sentence, and provide several ways of presenting the dependency relations among the tokens. It is therefore chosen to perform lexical analysis and dependency analysis. In addition, the present invention uses WordNet for longest word matching to identify the valid root word of each word. Specifically, characters are matched successively until the longest string that can be matched is found.
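Longest word matching can be illustrated with a small greedy segmenter. The sketch below is an assumption-laden illustration: it uses a tiny hypothetical vocabulary set in place of WordNet, and the function name `longest_match_segments` does not come from the invention.

```python
def longest_match_segments(token, vocabulary):
    """Greedily split a lowercase token into the longest known words.

    vocabulary is a plain set of words standing in for a dictionary lookup;
    characters that start no known word are kept as single-character segments.
    """
    segments = []
    i = 0
    while i < len(token):
        match = None
        # Try the longest candidate first and shrink until a known word is found.
        for j in range(len(token), i, -1):
            if token[i:j] in vocabulary:
                match = token[i:j]
                break
        if match is None:
            match = token[i]
        segments.append(match)
        i += len(match)
    return segments
```

Because the longest candidate is tried first, "address" is preferred over the shorter word "add" when both are in the vocabulary.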

The present invention uses association rule mining for machine learning. The feature used is the distance between input validations. The distance threshold is set to 3, i.e., if two input validations can be found within 3 basic blocks of each other, they are considered related.
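The distance check can be sketched as a bounded breadth-first search over the control flow graph. The sketch below rests on assumptions that the text does not settle: distance is counted in edges between basic blocks, edges are followed in both directions, and the graph is given as a plain dictionary from block name to successor names.

```python
from collections import defaultdict, deque


def within_distance(cfg, block_a, block_b, threshold=3):
    """Return True if block_b is at most `threshold` edges away from block_a.

    cfg maps a basic block name to the list of its successor block names.
    """
    # Build an undirected view so that predecessors are also explored.
    neighbors = defaultdict(set)
    for block, successors in cfg.items():
        for succ in successors:
            neighbors[block].add(succ)
            neighbors[succ].add(block)

    seen = {block_a}
    frontier = deque([(block_a, 0)])
    while frontier:
        block, dist = frontier.popleft()
        if block == block_b:
            return True
        if dist < threshold:
            for nxt in neighbors[block]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, dist + 1))
    return False
```

Two validations located in blocks for which this check succeeds would be counted as a related pair when groups are associated.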

Vulnerability identification based on security rules

The present invention uses behavior similarity as the feature for finding similar system service methods. The similarity threshold is set to 0.7. Experiments with 4 systems customized by third-party vendors showed that 0.7 is the largest threshold that still finds similar methods. A larger threshold, e.g. 0.8, would find only identical methods rather than similar ones, which does not meet the requirements of the present invention. The present invention then compares the input validations of similar methods by set difference, i.e., checks whether one method lacks some input validation relative to the other, to find system methods whose input validation has been weakened.
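The threshold test and the set-difference comparison can be sketched as follows. The text does not state how behavior similarity is computed, so the Jaccard index over data dependency edges used here is an assumption, as are the function names and the dictionary layout of each method's `deps` and `validations` sets.

```python
def behavior_similarity(deps_a, deps_b):
    """Jaccard index over two sets of data dependency edges (assumed metric)."""
    if not deps_a and not deps_b:
        return 1.0
    return len(deps_a & deps_b) / len(deps_a | deps_b)


def weakened_validations(original, customized, threshold=0.7):
    """List validations present in an original method but missing in a similar
    method of a customized system.

    Both arguments map a method name to a dict with two sets:
    'deps' (data dependency edges) and 'validations' (validated parameter groups).
    """
    findings = []
    for o_name, o in original.items():
        for c_name, c in customized.items():
            if behavior_similarity(o["deps"], c["deps"]) > threshold:
                missing = o["validations"] - c["validations"]  # set difference
                if missing:
                    findings.append((o_name, c_name, sorted(missing)))
    return findings
```

A customized method that shares four of five dependency edges with an original method (similarity 0.8) would be paired with it, and any validation that only the original performs would be reported.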

Based on this framework, the invention implements a tool for mining vulnerabilities caused by unsafe input verification in the Android system. The effectiveness of the proposed method is demonstrated by detecting and analyzing actual systems. First, through static analysis, the tool covers most system services in the Android system, including those that are partially implemented in native code. Second, the tool discovers 20 system-level vulnerabilities in 8 Android systems. For example, the system service AccessibilityManagerService identifies the application by the untrusted parameter packageName; by forging this parameter, malicious software can bypass identity verification and gain unauthorized access, leading to attacks such as interface hijacking and password leakage. The system service WindowManagerService uses the untrusted parameter Toast_Type to identify the window type; by forging this parameter, malicious software can bypass permission verification and construct a system-level window, leading to window phishing attacks against any application. In addition, there are other vulnerabilities that lead to attacks such as privilege escalation, information leakage, and system log clearing. These vulnerabilities cannot be covered by traditional work because they stem from unsafe sensitive input verification, and the invention fills this gap in the research field.

Claims

1. A method for identifying unsafe sensitive input verification in an Android system, characterized in that the unsafe data sources relied on when verifying input are identified, the specific steps being:

(1) Input verification identification based on code structure analysis: first extracting the interrupt branches in the program code and, after analyzing the code structure features, finding the independent program branches containing an interrupt instruction, so as to judge whether the current program execution contains the intention of verifying input;

(2) Sensitive input verification identification based on natural language processing and machine learning: using natural language processing to perform semantic-based clustering on a large number of input parameters, so that synonymous parameters are clustered together; then, by specifying a small number of known sensitive parameters, using machine learning to infer other unknown sensitive parameters; finally,

(3) Vulnerability identification based on security rules: checking whether the input verifications containing sensitive parameters meet the security rules, so as to determine whether they are unsafe input verifications;

wherein, in the input verification identification based on code structure analysis:

First, the following four interrupt operations are summarized: (1) throwing an exception, that is, the direct way to indicate that the application input violates input validation is to throw an exception; (2) returning a constant, wherein the system service uses predefined constants to indicate that the caller failed input validation, and returns such a constant as the return value in the interrupt branch; (3) logging and returning, wherein some information about the illegal input is recorded in the interrupt branch before returning; (4) reclaiming resources and returning, wherein in some cases the system service needs to reclaim the allocated resources first and then return directly;

With the identification of these four interrupt operations, the input verification identification process is as follows: first, determining all program branch statements in the system service that can accept application input; then, determining through code structure analysis whether these branch statements contain an interrupt branch; in addition, some branch statements generate a large number of program branches according to different inputs, and these are deleted from the recognition results.

2. The method for identifying unsafe sensitive input verification in an Android system according to claim 1, characterized in that, in the sensitive input verification identification based on natural language processing and machine learning:

By leveraging machine learning, a small set of known sensitive input validations are marked as starting samples, and association rule mining techniques are used to let machine learning learn the rest automatically;

When marking sensitive inputs, specify a few initial known sensitive input validations, and then use association rule mining technology to automatically discover other potentially sensitive input validations; the specific approach is as follows:

Pre-grouping of input validations: input validations are pre-grouped by their input parameters in two steps: (1) splitting variable names and extracting word stems; (2) normalizing variable names, wherein the stems of each input parameter are combined to obtain a normalized name and, to remove qualifiers, the frequency of occurrence of each pair of words is calculated and, if two words frequently co-occur, only the more popular word is kept; in this way, input validations whose input parameters have the same normalized name are all placed in the same group, for use in the subsequent machine learning based on association rule mining;

Learning new sensitive input validations: the sensitive input validation set is extended by mining association rules; first, the distance between each pair of input validations is calculated; if two input validations occur in two basic blocks with a common edge, the two input validations are considered adjacent to each other; then, if two input validation groups contain multiple adjacent pairs, the groups are associated together; in this way, starting from a few known sensitive input validations, all relevant groups are collected iteratively until no new group can be discovered.

3. The method for identifying unsafe sensitive input verification in an Android system according to claim 2, characterized in that, in the vulnerability identification based on security rules:

First, look for insecure input validation in every Android system through in-system analysis; including:

Incorrectly trusting data provided by the application: some system services verify the caller's identity based on input parameters, but because these parameters come from the application and can be forged, none of them should be trusted; therefore, if a sensitive input validation checks sensitive data provided by the application, that sensitive input validation is not secure;

Incorrectly trusting code in application processes: because of the unstructured nature of input validation, input validations are often placed into application processes;

Second, finding inconsistent sensitive input validations across multiple Android systems through inter-system analysis; in order to find sensitive input validations weakened by third-party vendors, inconsistent sensitive input validations between the original Android system and third-party customized systems are detected; first, similar system methods are found among the different systems in order to compare whether their input validations are consistent; the public interfaces of the different systems are clustered according to the similarity of method behavior; specifically, static taint analysis is used to represent the behavior of a method based on its data dependency graph; when the behavior similarity of two method interfaces is higher than the threshold, they are classified as similar methods; then, by comparing whether similar methods have the same sensitive input validations, sensitive input validations omitted by third-party vendors are found.