WO2022041406A1 - Ocr and transfer learning-based app violation monitoring method - Google Patents

Info

Publication number
WO2022041406A1
Authority
WO
WIPO (PCT)
Prior art keywords
app
violation
ocr
transfer learning
monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2020/120724
Other languages
French (fr)
Chinese (zh)
Inventor
蔡树彬
明仲
林旭恒
吴东阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Publication of WO2022041406A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/09 Recognition of logos
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Definitions

  • the invention relates to the technical field of data monitoring, in particular to an APP violation monitoring method based on OCR and transfer learning.
  • with network data collection, the semi-structured and unstructured data on target Internet pages can be extracted accurately and in batches, converted into structured records, and saved in a local database for internal use or external network publishing, thereby acquiring external information.
  • however, network data collection has no way to collect mobile APP data; and because websites differ in complexity and in their anti-crawling measures, the success rate of data crawling cannot be guaranteed.
  • the short text handled by a short text classification model refers to text of no more than 200 words, such as microblogs, chat messages, news topics, opinion comments, question texts, mobile phone messages, literature summaries, etc.
  • the purpose of the short text classification task is to automatically process the short text input by the user to obtain valuable classification output.
  • the short text classification model is trained by supervised learning, which often requires massive data as support and a heavy manual labeling workload.
  • the main purpose of the present invention is to provide an APP violation monitoring method based on OCR and transfer learning, which aims to solve the problem that target information cannot be quickly and effectively obtained in the prior art.
  • the present invention provides an APP violation monitoring method based on OCR and transfer learning, and the APP violation monitoring method based on OCR and transfer learning includes the following steps:
  • a sample set is constructed through keywords and regular expressions, and manual annotation is performed;
  • the scores of different APPs are counted to obtain the violation scores of the APPs.
  • in the method for monitoring APP violations based on OCR and transfer learning, regularly updating the APK and performing data collection of the corresponding APP according to the updated APK specifically includes:
  • the APK of each application is crawled, and the APK of the application store is updated regularly, and the data collection of the corresponding APP is carried out according to the updated APK.
  • the data collection method specifically includes: using a crawler to directly capture packets of promotional data and using an Appium script to automatically take screenshots of pages.
  • a corpus is constructed for supervising the training of the deep learning model.
  • the construction process of the corpus includes:
  • a keyword-based training corpus is constructed and labeled manually for generating the corpus.
  • the described APP violation monitoring method based on OCR and transfer learning, wherein, the scores of different APPs are counted to obtain the APP violation scores, specifically including:
  • the violation score of the APP is obtained by weighted average:
  • f1 to fk are the weights configured for the violation items of each dimension
  • x1 to xk are the actual numbers of abnormal quality inspection results of the violation items in each dimension
  • n is the total number of dimensions, and different dimensions represent different violation scenarios.
  • after the scores of different APPs are counted to obtain the violation scores of the APPs, the method further includes:
  • the tasks include: APP crawling timed task, APP screenshot timed task, and violation monitoring timed task.
  • the present invention also provides an intelligent terminal, wherein the intelligent terminal includes: a memory, a processor, and an APP violation monitoring program based on OCR and transfer learning that is stored in the memory and runs on the processor.
  • when executed by the processor, the APP violation monitoring program based on OCR and transfer learning implements the steps of the above-mentioned OCR and transfer learning-based APP violation monitoring method.
  • the present invention also provides a storage medium, wherein the storage medium stores an APP violation monitoring program based on OCR and transfer learning, and when the APP violation monitoring program based on OCR and transfer learning is executed by a processor, the steps of the above-mentioned OCR and transfer learning-based APP violation monitoring method are implemented.
  • the present invention periodically updates the APK, and collects data corresponding to the APP according to the updated APK.
  • the data collection includes data packet capture and page screenshots; text recognition and extraction are performed on the screenshots based on the OCR algorithm; for the recognized text content, keywords and regular expressions are used to construct a sample set, which is manually annotated; the manually annotated sample set is input into a pre-trained deep learning model for model adjustment, and business scenarios are divided to achieve text violation discrimination in different scenarios; based on the discrimination results output by the deep learning model, the scores of different APPs are counted to obtain each APP's violation score.
  • the present invention effectively and quickly detects the illegal use of the APP by collecting and analyzing the data of the APP.
  • FIG. 1 is a schematic diagram of a cross-platform, multi-language mobile terminal automated testing framework based on Client/Server architecture;
  • FIG. 2 is a schematic diagram of the architecture of PaddleHub, the pre-trained model management and transfer learning tool;
  • FIG. 3 is a flowchart of the CTPN algorithm;
  • FIG. 4 is a schematic diagram of the framework of the mobile terminal monitoring system based on OCR and transfer learning;
  • FIG. 5 is a schematic diagram of a microservice architecture;
  • FIG. 6 is a flowchart of a preferred embodiment of the method for monitoring APP violations based on OCR and transfer learning of the present invention;
  • FIG. 7 is a schematic diagram of the execution process of a preferred embodiment of the APP violation monitoring method based on OCR and transfer learning of the present invention;
  • FIG. 8 is a schematic diagram of a configuration path table structure formed when a screenshot is taken in a preferred embodiment of the APP violation monitoring method based on OCR and transfer learning of the present invention;
  • FIG. 9 is a schematic diagram of a monitoring function for real-time monitoring of APP publicity data in a preferred embodiment of the APP violation monitoring method based on OCR and transfer learning of the present invention;
  • FIG. 10 is a schematic diagram of an operating environment of a preferred embodiment of an intelligent terminal of the present invention.
  • Appium is a cross-platform, multi-language mobile automated testing framework based on the Client/Server architecture, and it supports both Android and iOS. As shown in Figure 1, any client (such as Java, Python, or JavaScript) that uses an Appium client library implementing the WebDriver JSON Wire Protocol can send commands to the Appium server, which parses them and performs operations such as screenshots and clicks on the mobile device. In addition, Appium Desktop helps developers easily and accurately obtain the coordinates and the XPath (a language used to locate a part of an XML document) path attribute information of each APP control.
  • Transfer Learning is a sub-field of deep learning.
  • the goal of this research field is to use the similarity between data, tasks, or models to transfer the knowledge learned in the old field and apply it to the new field.
  • PaddleHub is an open source pre-trained model management and transfer learning tool.
  • developers can use high-quality pre-trained models combined with the Fine-tune API to quickly complete the entire process from transfer learning to application deployment and complete model convergence in a shorter time, while enabling the model to have better generalization capabilities.
  • the OCR algorithm can generally be divided into two parts: text detection (detecting the area where the text is located) and character recognition (recognizing the text in the area).
  • CTPN (Connectionist Text Proposal Network) combines an RNN and a CNN seamlessly to improve detection accuracy (the CNN extracts deep features, and the RNN recognizes sequence features), so that the effect, speed and robustness of text detection are qualitatively improved.
  • the algorithm in the CTPN paper can be implemented with the Keras+TensorFlow framework: specifically, a series of proposals (pre-selection boxes) is generated from the feature map output by the VGG16 convolution for detection, and the CTPN text detection model is trained on the VOC2007_text_detection dataset.
  • the text area of the image can be detected using the CTPN algorithm.
  • by setting vertical anchors and a fine-grained detection strategy, this algorithm can detect text lines of multiple sizes and aspect ratios on a single-scale image. This design also limits CTPN: it detects horizontal text relatively well, but its detection of text in other directions is relatively poor.
  • step (1): VGG16, a classic convolutional neural network (CNN) model, is used to extract features.
  • a sliding window of size 3×3 is set on the basis of the fifth layer.
  • each window yields a feature vector of length 3×3×C.
  • step (2): the convolution feature obtained in step (1) is used as the input of a 256-dimensional bidirectional LSTM (two 128-dimensional LSTMs), and an output of length W×256 is obtained.
  • the LSTM is introduced to solve the vanishing-gradient problem of the RNN layer and to further extend the RNN layer.
  • the output layer contains three outputs: 2k vertical coordinates, 2k scores, and k side-refinements; a standard non-maximum suppression (NMS) algorithm is then used to filter out duplicate text boxes.
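The filtering step named above can be illustrated with a minimal, framework-free sketch of standard greedy non-maximum suppression. This is not the patent's implementation: the box format `(x1, y1, x2, y2)`, the function names, and the default IoU threshold of 0.5 are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    candidate_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    candidate_scores = [0.9, 0.8, 0.7]
    print(nms(candidate_boxes, candidate_scores))
```

In this example the second box overlaps the first heavily and has a lower score, so only the first and third boxes survive.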
  • DenseNet is one of the character recognition algorithms. It uses ReLU as the activation function and three Dense Block layers for computation. The Dense Blocks are connected through Transition structures to form the DenseNet network, which is trained with CTC loss.
  • within each Dense Block, the convolution operation is performed and the result is passed to the Transition structure, where the parameters are consolidated and reduced by pooling before being passed to the next Dense Block, to achieve higher accuracy.
  • when the accuracy and the loss value plateau and start to oscillate, the learning rate needs to be reduced exponentially, which can make the accuracy rise sharply.
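As a rough illustration of reducing the learning rate exponentially once training stalls, the sketch below multiplies the learning rate by a fixed factor each time the loss fails to improve for a few epochs. The function names, the `patience` and `min_delta` parameters, and the default values are assumptions; the text does not state the schedule actually used.

```python
from typing import List


def decayed_learning_rate(initial_lr: float, decay_rate: float, num_decays: int) -> float:
    """Exponential decay: lr = initial_lr * decay_rate ** num_decays."""
    return initial_lr * decay_rate ** num_decays


def plateau_schedule(
    losses: List[float],
    initial_lr: float = 0.001,
    decay_rate: float = 0.1,
    patience: int = 2,
    min_delta: float = 1e-4,
) -> List[float]:
    """Return the learning rate used at each epoch.

    The rate is decayed once the loss has failed to improve by at least
    min_delta for `patience` consecutive epochs (a simple plateau test).
    """
    rates: List[float] = []
    best = float("inf")
    stale_epochs = 0
    num_decays = 0
    for loss in losses:
        rates.append(decayed_learning_rate(initial_lr, decay_rate, num_decays))
        if loss < best - min_delta:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                num_decays += 1
                stale_epochs = 0
    return rates


if __name__ == "__main__":
    print(plateau_schedule([1.0, 0.5, 0.5, 0.5, 0.4], initial_lr=1.0, decay_rate=0.5))
```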
  • the technology stack used by the system includes front-end, back-end, algorithm and operation and maintenance, wherein the front-end, back-end, algorithm and operation and maintenance can be classified as follows:
  • Front-end: Vue, ElementUI, Vuex, Axios
  • the system adopts the design idea of separating front and back ends.
  • the front end builds the management system interface based on Vue, combined with ElementUI, Vuex, Axios and other technologies. A configuration center and a registration center are implemented, the backend interfaces are accessed uniformly through the SpringGateWay gateway component, and the backend services are monitored uniformly through SpringAdmin.
  • the timed task backend implements distributed timed tasks based on Java combined with the open source framework XXL-JOB.
  • the algorithm backend implements the OCR text recognition algorithm based on Flask combined with TensorFlow and Keras, and fine-tunes PaddleHub's pre-trained model to generate a violation detection model.
  • in the backend data persistence layer, MySQL is used together with Mybatis Plus to add, delete, modify and query data, MongoDb is used to store HTTP packet capture data, and Redis is used to cache hot data and implement distributed locks.
  • the backend is divided into three categories: web backend, timed task backend and algorithm backend.
  • Web backend: based on Java, it provides basic interfaces for adding, deleting, modifying and querying various data, including but not limited to data and model management.
  • Timed task backend: based on Java and XXL-JOB, it is used to execute timed tasks periodically.
  • Algorithm backend: the OCR algorithm, the violation detection algorithm and the semantic similarity algorithm are implemented based on Python, CTPN, CRNN and PaddleHub.
  • drawing on the idea of microservices, the backend can be abstractly subdivided into five categories: data query services, timed task services, data collection services, data violation detection services, and data analysis services.
  • the data collection service, the data violation detection service and the data analysis service provide the basic functions for the first two.
  • the microservice architecture is shown in Figure 5.
  • a mobile terminal violation monitoring system based on OCR and transfer learning is constructed, and each module of the system is integrated based on the technical architecture to ensure the maintainability and scalability of the system.
  • the method for monitoring APP violations based on OCR and transfer learning includes the following steps:
  • Step S10 update the APK regularly, and perform data collection of the corresponding APP according to the updated APK, where the data collection includes data packet capture and page screenshots.
  • the APK of each application is crawled with the help of the Jsoup library, and the APK of the application store is updated regularly, and the data of the corresponding APP is collected according to the updated APK.
  • the collected application stores include Huawei App Store, App Store, Baidu App Assistant, 360 Mobile Assistant, Pea Pod, PP Assistant, Sogou Mobile Assistant, etc.
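The "update regularly, then collect for updated APKs" logic can be sketched as a simple version comparison. This is only an illustration in Python (the text names the Java library Jsoup for the crawling itself); the function name, the dictionary layout, and the package names in the example are hypothetical.

```python
from typing import Dict, List


def apps_needing_collection(
    known_versions: Dict[str, str], store_versions: Dict[str, str]
) -> List[str]:
    """Return the packages whose version in the store differs from the version
    recorded at the last collection, plus packages not seen before.

    known_versions: package name -> version recorded locally (hypothetical layout)
    store_versions: package name -> version currently listed in the app store
    """
    return sorted(
        package
        for package, version in store_versions.items()
        if known_versions.get(package) != version
    )


if __name__ == "__main__":
    recorded = {"com.example.fund": "1.0", "com.example.bank": "2.3"}
    listed = {"com.example.fund": "1.1", "com.example.bank": "2.3", "com.example.new": "0.9"}
    # Only the updated and the newly listed packages are scheduled for data collection.
    print(apps_needing_collection(recorded, listed))
```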
  • the purpose of data collection is to monitor the relevant behavior of the APP, such as the fund overview, fund manager, fund announcement, fund promotion carousel, fund promotion activity introduction and other content in the APP.
  • the data collection method specifically includes: using a crawler to directly capture packets of publicity data and using an Appium script to automatically take screenshots of pages.
  • for example, when packets are captured with the crawler, Fiddler is used to capture packets for some APPs, such as Tiantian Fund, China Asset Manager, Egg Roll Fund, GF Yitaojin, Flush Fund, Xingquan Fund, Cathay Fund, Haomai Funds, Profit Funds, etc.
  • the difficulty of automated screenshots based on Appium is how to ensure the stability and comprehensiveness of automated click scripts and the accuracy of OCR recognition.
  • the present invention mainly locates control elements through text, and additionally uses XPath paths obtained with Appium Desktop to assist in locating control elements.
  • the Appium configuration path is abstracted into five fields, namely application, root path, sub-path, exception log, and enable status.
  • the specific format definitions of the root path and subpath are as follows:
  • the three paths (home-0, current-0 and com.hctforgf.gff:id/risk_warn_tv-5-0) separated by the
  • the specific numbers and their meanings are shown in the table below:
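As a hedged illustration of the configuration record and of the example path segments quoted above, the sketch below models the five fields as a dataclass and splits one segment into its locator text and trailing numbers. The class name, field names, and `split_segment` are hypothetical, and the meaning of the numbers is deliberately left uninterpreted because the table that defines them is not reproduced in this text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AppiumPathConfig:
    """The five fields named in the text; the field names here are assumptions."""
    application: str
    root_path: str
    sub_path: str
    exception_log: str = ""
    enabled: bool = True


def split_segment(segment: str) -> Tuple[str, List[int]]:
    """Split one path segment such as 'home-0' into its locator and trailing numbers.

    Only the trailing '-<digits>' parts are peeled off, so hyphens inside the
    locator itself are preserved.
    """
    parts = segment.split("-")
    numbers: List[int] = []
    while len(parts) > 1 and parts[-1].isdigit():
        numbers.insert(0, int(parts.pop()))
    return "-".join(parts), numbers


if __name__ == "__main__":
    for segment in ("home-0", "current-0", "com.hctforgf.gff:id/risk_warn_tv-5-0"):
        print(split_segment(segment))
```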
  • for the carousel images of the APP, methods are provided to locate the carousel by text and by coordinates; locating the carousel by text is in essence still locating it by coordinates, and the carousel can be slid left and right with the help of coordinates.
  • the problem to consider here is that the automatic sliding of the APP carousel combined with the sliding performed by the script can result in an abnormal number of captured carousel images.
  • an additional parameter of the carousel locating method defines the total number of carousel images. Image similarity is used to avoid processing the same carousel image more than once, and the method exits only after the corresponding total number of carousel images has been obtained.
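The deduplication loop described above might look like the following sketch. It assumes each screenshot has already been reduced to a flat sequence of grayscale pixel values, and it uses a simple per-pixel agreement ratio as the similarity measure; the text does not specify which image similarity measure is used, so `similarity`, `collect_carousel`, and the thresholds are illustrative assumptions.

```python
from typing import List, Sequence

# Assumed representation: a screenshot flattened to grayscale pixel values (0-255).
Image = Sequence[int]


def similarity(a: Image, b: Image, tolerance: int = 8) -> float:
    """Fraction of pixel positions whose values differ by at most `tolerance`."""
    if len(a) != len(b) or len(a) == 0:
        return 0.0
    close = sum(1 for p, q in zip(a, b) if abs(p - q) <= tolerance)
    return close / len(a)


def collect_carousel(screenshots: List[Image], total: int, threshold: float = 0.95) -> List[Image]:
    """Keep distinct carousel images until `total` of them have been collected.

    A screenshot is skipped when it is at least `threshold` similar to an image
    that was already kept, which absorbs repeats caused by the carousel sliding
    on its own between scripted swipes.
    """
    kept: List[Image] = []
    for shot in screenshots:
        if len(kept) >= total:
            break
        if all(similarity(shot, seen) < threshold for seen in kept):
            kept.append(shot)
    return kept


if __name__ == "__main__":
    a, b, c = [0] * 16, [255] * 16, [128] * 16
    print(len(collect_carousel([a, a, b, a, c, b], total=3)))
```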
  • Step S20 performing text recognition and extraction on the screenshot based on the OCR algorithm.
  • the OCR algorithm is used to identify and extract the text to obtain the required information.
  • Step S30 constructing a sample set by using keywords and regular expressions for the recognized text content, and performing manual annotation.
  • keywords and regular expressions (a regular expression is a kind of logical formula for string operations, that is, some pre-defined specific characters and combinations of these specific characters are used to form a "rule string", and this "rule string" expresses a filtering logic for strings) are used to construct a sample set, and reliability is improved through manual annotation.
  • Step S40 Input the manually labeled sample set into the pre-trained deep learning model to adjust the model, and realize the violation judgment of text in different scenarios by dividing the business scenarios.
  • the method further includes: constructing a corpus for supervising the training of the deep learning model. Specifically: obtaining a plurality of keywords, and matching the keywords; constructing training corpus based on keywords, and manually labeling them for generating the corpus.
  • the fund sales violation discrimination model is trained in a supervised manner, so a corpus needs to be constructed for the training of a supervised deep learning model. Since the richness of the corpus will directly affect the accuracy of the semantic labels, in order to ensure the quality of the semantic label samples, the corpus of the illegal propaganda of fund sales is collected manually from the network.
  • the keywords include, for example, high yield, zero risk, cash red envelopes, quick purchases, guaranteed, etc.
  • the keywords are matched by means of including, not including, greater than, less than, equal to, and regular expressions, to construct a keyword-based training corpus, which is then manually labeled for model training.
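The matching modes listed above can be sketched with Python's standard `re` module. The mode names, the rule format, and in particular the choice to apply "greater than" and "less than" to the first number found in the text are assumptions made for illustration; the text does not say what the numeric comparisons operate on. The selected texts would then go to manual labeling.

```python
import re
from typing import List, Optional, Tuple

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def first_number(text: str) -> Optional[float]:
    """Return the first decimal number in the text, or None (assumed comparison target)."""
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


def matches(text: str, mode: str, value: str) -> bool:
    """Apply one matching rule to one piece of recognized text."""
    if mode == "contains":
        return value in text
    if mode == "not_contains":
        return value not in text
    if mode == "equals":
        return text == value
    if mode == "regex":
        return re.search(value, text) is not None
    if mode in ("greater_than", "less_than"):
        number = first_number(text)
        if number is None:
            return False
        return number > float(value) if mode == "greater_than" else number < float(value)
    raise ValueError(f"unknown mode: {mode}")


def build_candidates(texts: List[str], rules: List[Tuple[str, str]]) -> List[str]:
    """Select the texts hit by at least one rule; these become labeling candidates."""
    return [t for t in texts if any(matches(t, mode, value) for mode, value in rules)]


if __name__ == "__main__":
    recognized = [
        "zero risk, high yield fund",
        "annualized return 12.5%",
        "fund manager profile",
        "cash red envelopes for new users",
    ]
    rules = [("contains", "zero risk"), ("regex", r"red\s+envelopes?"), ("greater_than", "10")]
    print(build_candidates(recognized, rules))
```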
  • the system uses the PaddleHub framework to build a violation monitoring model.
  • model fine-tuning based on transfer learning (i.e., model retraining) is then carried out.
  • separate training is carried out for different violation scenarios (such as exaggerating income or slandering other fund managers) to obtain a violation discrimination model for each scenario; finally, a classification model that judges whether the data is illegal in each scenario is obtained, and this model can be used for data classification.
  • Step S50 according to the discrimination result output by the deep learning model, count the scores of different APPs, and obtain the violation scores of the APPs.
  • the violation score of the APP is obtained by the weighted average (that is, the weighted average represents the violation score of the APP):
  • f1 to fk are the weights configured for the violation items of each dimension
  • x1 to xk are the actual numbers of abnormal quality inspection results of the violation items in each dimension
  • n is the total number of dimensions
  • the weighted average score of each dimension is used as the average score, for example, the violation score is calculated in five dimensions, such as illegal promised income, lack of reasonable risk warning, slandering other fund managers, promise to guarantee capital and income, and exaggerated income.
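The scoring step can be sketched as follows. The formula itself is not reproduced in this text, so the form used here, score = (f1·x1 + ... + fk·xk) / n with n taken as the number of dimensions, is an assumption inferred from the symbol definitions above; the function name, dimension labels, and weights are likewise hypothetical.

```python
from typing import Dict


def violation_score(weights: Dict[str, float], abnormal_counts: Dict[str, int]) -> float:
    """Weighted average over violation dimensions (assumed form).

    weights:          dimension -> configured weight f_i
    abnormal_counts:  dimension -> number of abnormal results x_i
    Dimensions without any abnormal result count as zero.
    """
    n = len(weights)
    if n == 0:
        raise ValueError("at least one dimension is required")
    weighted_sum = sum(w * abnormal_counts.get(dim, 0) for dim, w in weights.items())
    return weighted_sum / n


if __name__ == "__main__":
    # Five example dimensions taken from the text; the weights are made up.
    weights = {
        "illegal promised income": 2.0,
        "lack of reasonable risk warning": 1.0,
        "slandering other fund managers": 1.0,
        "promise to guarantee capital and income": 2.0,
        "exaggerated income": 1.0,
    }
    counts = {
        "illegal promised income": 3,
        "lack of reasonable risk warning": 1,
        "exaggerated income": 4,
    }
    print(violation_score(weights, counts))
```

Keeping the weights in a per-dimension mapping makes it straightforward to add a new violation scenario without changing the scoring code.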
  • the invention crawls mobile terminal APP data automatically and regularly; for data that cannot be crawled on the mobile terminal, Appium simulates touch-screen operations to take screenshots; OCR technology is used to perform text detection and character recognition on the pictures obtained from the screenshots, which solves the problem of fundraising.
  • the invention can collect and monitor the network fund publicity data uniformly, and the data and model results that can be obtained subsequently can be made into a knowledge base, which can be used in big data analysis, knowledge map and other fields.
  • correspondingly, the present invention also provides an intelligent terminal, which includes a processor 10, a memory 20 and a display 30.
  • FIG. 10 only shows some components of the smart terminal, but it should be understood that it is not required to implement all the shown components, and more or fewer components may be implemented instead.
  • the memory 20 may be an internal storage unit of the smart terminal, such as a hard disk or a memory of the smart terminal.
  • the memory 20 may also be an external storage device of the smart terminal, such as a plug-in hard disk equipped on the smart terminal, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, etc.
  • the memory 20 may also include both an internal storage unit of the smart terminal and an external storage device.
  • the memory 20 is used to store application software and various types of data installed in the smart terminal, such as program codes for installing the smart terminal.
  • the memory 20 can also be used to temporarily store data that has been output or is to be output.
  • an APP violation monitoring program 40 based on OCR and transfer learning is stored on the memory 20, and the APP violation monitoring program 40 based on OCR and transfer learning can be executed by the processor 10, so as to implement the APP violation monitoring method based on OCR and transfer learning of this application.
  • the processor 10 may be a central processing unit (CPU), a microprocessor or another data processing chip, used for running program code or processing data stored in the memory 20, for example executing the APP violation monitoring method based on OCR and transfer learning.
  • the display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like.
  • the display 30 is used for displaying information on the smart terminal and for displaying a visual user interface.
  • the components 10-30 of the intelligent terminal communicate with each other through the system bus.
  • when the processor 10 executes the APP violation monitoring program 40 based on OCR and transfer learning in the memory 20, the following steps are implemented:
  • a sample set is constructed through keywords and regular expressions, and manual annotation is performed;
  • the scores of different APPs are counted to obtain the violation scores of the APPs.
  • data collection of the corresponding APP is performed according to the updated APK, which specifically includes:
  • the APK of each application is crawled, and the APK of the application store is updated regularly, and the data collection of the corresponding APP is carried out according to the updated APK.
  • the data collection method specifically includes: using a crawler to directly capture packets of publicity data and using an Appium script to automatically take screenshots of pages.
  • the manual annotated sample set is input into a pre-trained deep learning model for model adjustment, and the text violation judgment in different scenarios is realized by dividing the business scenarios, which also includes:
  • a corpus is constructed for supervising the training of the deep learning model.
  • the construction process of the corpus includes:
  • a keyword-based training corpus is constructed and labeled manually for generating the corpus.
  • the violation scores of the APPs are obtained, including:
  • the violation score of the APP is obtained by weighted average:
  • f1 to fk are the weights configured for the violation items of each dimension
  • x1 to xk are the actual numbers of abnormal quality inspection results of the violation items in each dimension
  • n is the total number of dimensions, and different dimensions represent different violation scenarios.
  • the scores of different APPs are counted to obtain the violation scores of the APPs, and the following further includes:
  • the tasks include: APP crawling timed tasks, APP screenshot timed tasks, and violation monitoring timed tasks.
  • the present invention also provides a storage medium, wherein the storage medium stores an APP violation monitoring program based on OCR and transfer learning, and when the APP violation monitoring program based on OCR and transfer learning is executed by a processor, the steps of the above-mentioned APP violation monitoring method based on OCR and transfer learning are implemented.
  • the present invention provides an APP violation monitoring method based on OCR and transfer learning.
  • the method includes: updating the APK regularly, and collecting data of the corresponding APP according to the updated APK, where the data collection includes data packet capture and page screenshots; performing text recognition and extraction on the screenshots based on the OCR algorithm; for the recognized text content, constructing a sample set through keywords and regular expressions and labeling it manually; inputting the manually labeled sample set into the pre-trained deep learning model to adjust the model, and realizing the violation judgment of texts in different scenarios by dividing the business scenarios; and, according to the judgment results output by the deep learning model, counting the scores of different APPs to obtain the APP violation scores.
  • the present invention effectively and quickly detects the illegal use of the APP by collecting and analyzing the data of the APP.
  • the storage medium may be a memory, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed in the present invention is an OCR and transfer learning-based app violation monitoring method. Said method comprises: periodically updating an APK, and performing data acquisition on a corresponding app according to the updated APK, the data acquisition comprising the capture of data packets and the acquisition of page screenshots; performing text recognition and extraction on the screenshots on the basis of an OCR algorithm; for the recognized text content, constructing a sample set by means of keywords and regular expressions, and performing manual annotation on same; inputting the manually annotated sample set into a pre-trained deep learning model for model adjustment, and implementing violation determination of text in different scenarios by means of division of service scenarios; and collecting statistics of scores of different apps according to the determination results outputted by the deep learning model, so as to obtain violation scores of the apps. By acquiring and analyzing data of apps, the present invention effectively and quickly detects the violation usage situations of the apps.

Description

An APP Violation Monitoring Method Based on OCR and Transfer Learning

Technical Field

The present invention relates to the technical field of data monitoring, and in particular to an APP violation monitoring method based on OCR and transfer learning.

Background

Public-opinion monitoring platforms perform real-time automatic collection, analysis, summarization, and monitoring of massive volumes of online public-opinion information, identify the key information in it, and promptly notify the relevant personnel so that they can respond immediately. Such a platform directly supports guiding public opinion correctly and gathering netizens' views. However, these platforms collect only public-opinion data and cannot detect specific content; they also generally examine only website data and do not examine mobile data.

Web data extraction, driven by user-defined task configurations, pulls semi-structured and unstructured data from target web pages in batches and with precision, converts it into structured records, and saves it in a local database for internal use or external publication, so that external information can be acquired quickly. However, it generally collects only web data and has no way to collect data from mobile APPs; moreover, websites differ in complexity and in their anti-crawling measures, so the success rate of crawling cannot be guaranteed.

Short text refers to text of no more than 200 characters, such as microblog posts, chat messages, news topics, opinion comments, question texts, mobile phone text messages, and literature abstracts. The purpose of short-text classification is to process user-entered short text automatically and produce useful classification output. However, short-text classification models rely on supervised learning, which typically requires massive amounts of data and a heavy manual labeling workload.

In other words, the prior art cannot obtain target information quickly and effectively. For example: the data required for monitoring a particular kind of information violation cannot be obtained; if an APP has anti-crawling measures, crawlers cannot be used to collect its data; promotional material contains many images, and data in image format cannot be processed; data samples for violation monitoring are insufficient, and even once network data has been obtained it requires extensive manual labeling; supervised deep learning models need massive training data and, to perform well, substantial machine resources for training; and there is no platform for viewing and comparing violation-monitoring data.

Therefore, the prior art still needs to be improved and developed.

Summary of the Invention

The main purpose of the present invention is to provide an APP violation monitoring method based on OCR and transfer learning, aiming to solve the problem that target information cannot be obtained quickly and effectively in the prior art.

To achieve the above purpose, the present invention provides an APP violation monitoring method based on OCR and transfer learning, comprising the following steps:

periodically updating the APK, and collecting data from the corresponding APP according to the updated APK, the data collection including packet capture and page screenshots;

performing text recognition and extraction on the screenshots based on an OCR algorithm;

constructing a sample set from the recognized text content using keywords and regular expressions, and manually labeling it;

inputting the manually labeled sample set into a pre-trained deep learning model for model adjustment, and determining text violations in different scenarios by dividing business scenarios;

counting the scores of different APPs according to the determination results output by the deep learning model, to obtain the violation score of each APP.

Optionally, in the APP violation monitoring method based on OCR and transfer learning, periodically updating the APK and collecting data from the corresponding APP according to the updated APK specifically includes:

crawling the APK of each application using Java and the Jsoup library, periodically updating the app-store APKs, and collecting data from the corresponding APP according to the updated APK.

Optionally, in the APP violation monitoring method based on OCR and transfer learning, the data collection specifically includes: using a crawler to capture promotional data packets directly, and using Appium scripts to take automated page screenshots.

Optionally, in the APP violation monitoring method based on OCR and transfer learning, before inputting the manually labeled sample set into the pre-trained deep learning model for model adjustment and determining text violations in different scenarios by dividing business scenarios, the method further includes:

constructing a corpus for supervising the training of the deep learning model.

Optionally, in the APP violation monitoring method based on OCR and transfer learning, the construction of the corpus includes:

obtaining a plurality of keywords and matching on the keywords;

constructing a keyword-based training corpus and manually labeling it to generate the corpus.

Optionally, in the APP violation monitoring method based on OCR and transfer learning, counting the scores of different APPs to obtain the violation score of each APP specifically includes:

The violation score of the APP is obtained as a weighted average:

Figure PCTCN2020120724-appb-000001

where

Figure PCTCN2020120724-appb-000002

denotes the weighted average, f1 to fk are the weights configured for the violation items of each dimension, x1 to xk are the actual numbers of abnormal quality-inspection results for the violation items of each dimension, and n is the total number of dimensions; different dimensions represent different violation scenarios.
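As a minimal sketch of the scoring step, the snippet below assumes the formula image takes the common textbook form of a weighted average, (x1·f1 + ... + xk·fk) / n, with n used as the divisor as the text describes. The function name `violation_score` and the sample weights and counts are hypothetical; the formula image in the patent remains authoritative.

```python
from typing import Sequence


def violation_score(weights: Sequence[float],
                    abnormal_counts: Sequence[int],
                    n: int) -> float:
    """Weighted average of per-dimension abnormal counts.

    Assumption: score = (x1*f1 + ... + xk*fk) / n, where n is the divisor
    the text describes as the total number of dimensions.
    """
    if len(weights) != len(abnormal_counts):
        raise ValueError("weights and abnormal_counts must have the same length")
    if n <= 0:
        raise ValueError("n must be positive")
    weighted_sum = sum(f * x for f, x in zip(weights, abnormal_counts))
    return weighted_sum / n


# Hypothetical example: three violation scenarios (dimensions).
weights = [2.0, 1.0, 0.5]
abnormal_counts = [3, 4, 2]
score = violation_score(weights, abnormal_counts, n=3)
```

Keeping the weights in configuration rather than in code would let each violation scenario be re-weighted without retraining the model.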

Optionally, in the APP violation monitoring method based on OCR and transfer learning, after counting the scores of different APPs according to the determination results output by the deep learning model to obtain the violation score of each APP, the method further includes:

setting a scheduled start for all tasks.

Optionally, in the APP violation monitoring method based on OCR and transfer learning, the tasks include: a scheduled APP crawling task, a scheduled APP screenshot task, and a scheduled violation monitoring task.

In addition, to achieve the above purpose, the present invention further provides an intelligent terminal comprising: a memory, a processor, and an APP violation monitoring program based on OCR and transfer learning that is stored in the memory and executable on the processor. When executed by the processor, the program implements the steps of the APP violation monitoring method based on OCR and transfer learning described above.

In addition, to achieve the above purpose, the present invention further provides a storage medium storing an APP violation monitoring program based on OCR and transfer learning. When executed by a processor, the program implements the steps of the APP violation monitoring method based on OCR and transfer learning described above.

The present invention periodically updates the APK and collects data from the corresponding APP according to the updated APK, the data collection including packet capture and page screenshots; performs text recognition and extraction on the screenshots based on an OCR algorithm; constructs a sample set from the recognized text content using keywords and regular expressions, and manually labels it; inputs the manually labeled sample set into a pre-trained deep learning model for model adjustment, and determines text violations in different scenarios by dividing business scenarios; and counts the scores of different APPs according to the determination results output by the deep learning model to obtain the violation score of each APP. By collecting and analyzing APP data, the present invention detects APP violations effectively and quickly.

Brief Description of the Drawings

Figure 1 is a schematic diagram of a cross-platform, multi-language mobile automated testing framework based on a client/server architecture;

Figure 2 is a schematic diagram of the architecture of PaddleHub, a pre-trained model management and transfer learning tool;

Figure 3 is a flowchart of the CTPN algorithm;

Figure 4 is a schematic diagram of the framework of the mobile monitoring system based on OCR and transfer learning;

Figure 5 is a schematic diagram of the microservice architecture;

Figure 6 is a flowchart of a preferred embodiment of the APP violation monitoring method based on OCR and transfer learning of the present invention;

Figure 7 is a schematic diagram of the execution process of a preferred embodiment of the APP violation monitoring method based on OCR and transfer learning of the present invention;

Figure 8 is a schematic diagram of the configuration path table structure formed when taking screenshots in a preferred embodiment of the APP violation monitoring method based on OCR and transfer learning of the present invention;

Figure 9 is a schematic diagram of the monitoring function for real-time monitoring of APP promotional data in a preferred embodiment of the APP violation monitoring method based on OCR and transfer learning of the present invention;

Figure 10 is a schematic diagram of the operating environment of a preferred embodiment of the intelligent terminal of the present invention.

Detailed Description

To make the purposes, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it.

Appium is a cross-platform, multi-language automated testing framework for mobile devices based on a client/server architecture; it supports both Android and iOS. As shown in Figure 1, any client (for example in Java, Python, or JavaScript) that implements the WebDriver JSON Wire Protocol used by the Appium client libraries can send commands, which the Appium server parses and executes on the mobile device, such as taking screenshots, tapping, and pulling down. In addition, Appium Desktop helps developers obtain the coordinates and XPath attributes of each control in the APP conveniently and accurately (XPath is a language for addressing parts of an XML document).

Transfer learning is a subfield of deep learning whose goal is to exploit similarities between data, tasks, or models so that knowledge learned in an old domain can be applied to a new one. Model transfer is usually achieved by fine-tuning a pre-trained model so that it adapts to data from the new domain. As shown in Figure 2, PaddleHub is an open-source tool for pre-trained model management and transfer learning. With PaddleHub, developers can combine high-quality pre-trained models with the Fine-tune API to complete the full workflow from transfer learning to application deployment quickly, reach model convergence in less time, and obtain better generalization.

An OCR algorithm generally has two parts: text detection (finding the regions that contain text) and character recognition (recognizing the text within those regions). One example of a detection model is CTPN (Connectionist Text Proposal Network). It greatly simplifies the detection pipeline and combines an RNN with a CNN to improve detection accuracy (the CNN extracts deep features and the RNN models the features as a sequence), which substantially improves the quality, speed, and robustness of text detection. Because text is a sequence made up of "characters, parts of characters, and multiple characters", a text target is not an independent, closed object but depends on its context. CTPN therefore uses an RNN (Recurrent Neural Network) to predict text positions from contextual information.

The algorithm from the CTPN paper can be implemented with the Keras + TensorFlow framework. Specifically, the feature map output by the VGG16 convolutions is used to generate a series of proposals (candidate boxes) for detection, and the CTPN text detection model is trained on the VOC2007_text_detection dataset. The CTPN algorithm detects the text regions of an image. By using vertical anchors and a fine-grained detection strategy, it can detect text lines of multiple sizes and aspect ratios in a single-scale image. The same design also means that CTPN works well mainly on horizontal text and performs relatively poorly on text in other orientations. However, most of the images that this system collects through Appium's automated screenshots are horizontal, have little background interference, have almost no overlapping edges, and are only slightly tilted, so the CTPN text detection algorithm achieves high accuracy here. Its steps, shown in Figure 3, are as follows:

(1) VGG16 (a classic CNN model) is used as the feature extractor to obtain a feature map of size W×H×C. A 3×3 sliding window is applied on the fifth layer, and each window yields a feature vector of length 3×3×C.

(2) The convolutional features from step (1) are fed into a 256-dimensional bidirectional LSTM (two 128-dimensional LSTMs), producing an output of length W×256. The LSTM is introduced to address the vanishing-gradient problem of the RNN layer and to extend the RNN layer further.

(3) The output layer produces three outputs: 2k vertical coordinates, 2k scores, and k side-refinement offsets. A standard non-maximum suppression (NMS) algorithm is then used to filter out redundant text boxes.
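The NMS filtering in step (3) can be illustrated with a minimal pure-Python sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2) and an IoU threshold of 0.5; the function names, box values, and threshold are illustrative choices, not taken from the patent or from any CTPN implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping proposals and one separate proposal.
boxes = [(0, 0, 16, 30), (1, 0, 17, 30), (40, 0, 56, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)
```

Here the second box overlaps the first almost entirely and is dropped, while the third box is kept because it does not overlap either of the others.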

Once the regions containing text have been located, the next task is to recognize the text in each region. DenseNet is one such character recognition algorithm. It uses ReLU as the activation function and three Dense Block layers; the Dense Blocks are connected through Transition structures to form the DenseNet network, which is then trained with CTC loss to produce the model. After data passes through a Dense Block, it is convolved and passed to the Transition structure, which consolidates and normalizes the parameters; pooling reduces the parameter count before the data is passed to the next Dense Block, which yields higher accuracy. During training, when accuracy and loss plateau and begin to oscillate, the learning rate needs to be reduced exponentially, which can produce an immediate and substantial jump in accuracy.
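The learning-rate adjustment mentioned above can be sketched as a simple reduce-on-plateau rule. This is an assumption about how such a schedule might look, not the patent's training code: the function name, the `factor`, `patience`, and `min_delta` parameters, and the sample losses are all hypothetical. Multiplying the rate by a constant factor each time a plateau is detected gives an exponential decrease.

```python
from typing import List, Sequence


def decay_on_plateau(losses: Sequence[float], lr: float = 1e-3,
                     factor: float = 0.1, patience: int = 2,
                     min_delta: float = 1e-4) -> List[float]:
    """Return the learning rate used at each epoch.

    The rate is multiplied by `factor` whenever the loss has failed to
    improve by at least `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    schedule: List[float] = []
    for loss in losses:
        schedule.append(lr)
        if loss < best - min_delta:
            best = loss      # real improvement: reset the counter
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor  # plateau detected: shrink the rate
                stale = 0
    return schedule


# Hypothetical loss curve that stalls at 0.8 for two epochs.
schedule = decay_on_plateau([1.0, 0.8, 0.8, 0.8, 0.5])
```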

Further, as shown in Figure 4, the technology stack used by the system covers the front end, back end, algorithms, and operations, classified as follows:

Front end: Vue, ElementUI, Vuex, Axios;

Back end: Java, SpringBoot, SpringCloud, Nacos, SpringGateWay, SpringAdmin, Feign, XXL-JOB, Mybatis-Plus, Maven;

Algorithms: Flask, TensorFlow, Keras, Pytorch, CTPN, CRNN;

Operations: Docker, Linux;

The system separates the front end from the back end. The front end is built on Vue, with ElementUI, Vuex, and Axios used to build the management interface. The back ends are deployed uniformly as Docker images on Linux. The Web back end is based on Java with SpringCloud to implement a microservice architecture: Nacos provides the configuration center and registry, the SpringGateWay gateway component provides unified access to back-end interfaces, and SpringAdmin provides unified monitoring of back-end services. The scheduled-task back end uses Java with the open-source framework XXL-JOB to implement distributed scheduled tasks. The algorithm back end uses Flask with TensorFlow and Keras to implement the OCR text recognition algorithm, and fine-tunes PaddleHub pre-trained models to produce the violation detection model.

In the data persistence layer, MySQL is the database and Mybatis Plus is used to add, delete, modify, and query data; MongoDB stores the captured HTTP packet data; and Redis caches hot data and implements distributed locks.

According to functional requirements, the back end is divided into three kinds: the Web back end, the scheduled-task back end, and the algorithm back end.

Web back end: Java-based; provides the basic interfaces for adding, deleting, modifying, and querying all kinds of data, including but not limited to data and model management. Scheduled-task back end: based on Java and XXL-JOB; runs scheduled tasks periodically. Algorithm back end: based on Python, CTPN, CRNN, and PaddleHub; implements the OCR algorithm, the violation detection algorithm, and the semantic similarity algorithm.

Following the microservice approach, the back end can be further divided into five kinds of service: data query, scheduled task, data collection, data violation detection, and data analysis. The data query and scheduled-task services face the user directly, while the data collection, data violation detection, and data analysis services provide the underlying functions for those two. The microservice architecture is shown in Figure 5.

Using the frameworks above, a mobile violation monitoring system based on OCR and transfer learning is built, and its modules are integrated according to this architecture to keep the system maintainable and extensible.

The APP violation monitoring method based on OCR and transfer learning according to a preferred embodiment of the present invention is shown in Figures 6 and 7 and includes the following steps:

Step S10: periodically update the APK, and collect data from the corresponding APP according to the updated APK, the data collection including packet capture and page screenshots.

Specifically, the APK of each application is crawled using Java and the Jsoup library, the app-store APKs are updated periodically, and data is collected from the corresponding APP according to the updated APK. The app stores covered include the Xiaomi App Store, Yingyongbao, Baidu Mobile Assistant, 360 Mobile Assistant, Wandoujia, PP Assistant, and Sogou Mobile Assistant.
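The periodic update can be thought of as a comparison between the versions already downloaded and the versions currently listed by the app store. The sketch below illustrates only that decision, in Python rather than the Java and Jsoup mentioned in the text; the function name, the use of integer version codes, and the package names are hypothetical.

```python
from typing import Dict, List


def apks_to_update(local_versions: Dict[str, int],
                   store_versions: Dict[str, int]) -> List[str]:
    """Return the packages whose store version code is newer than the
    local copy, or that have not been downloaded yet."""
    return sorted(
        pkg for pkg, store_code in store_versions.items()
        if local_versions.get(pkg, -1) < store_code
    )


# Hypothetical package names and version codes.
local = {"com.example.funda": 120, "com.example.fundb": 45}
store = {"com.example.funda": 121, "com.example.fundb": 45,
         "com.example.fundc": 7}
pending = apks_to_update(local, store)
```

Only the packages returned by such a check would need to be re-downloaded before the next round of data collection.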

The purpose of data collection is to monitor the relevant behavior of the APP, for example by collecting content such as fund overviews, fund managers, fund announcements, fund promotion carousel images, and descriptions of fund promotion activities.

The data collection specifically includes: using a crawler to capture promotional data packets directly, and using Appium scripts to take automated page screenshots.

For example, when collecting data with a crawler, the data of some APPs is gathered by capturing packets with Fiddler; these include Tiantian Fund, ChinaAMC Fund Manager, Danjuan Fund, GF Yitaojin, Tonghuashun Fund, Xingquan Fund, Guotai Fund, Howbuy Fund, and Lide Fund. First, the URL and parameters of the fund list interface are obtained through Fiddler, and the fund codes it returns are used to retrieve the fund overview and fund announcement data. Then the corresponding HTTP requests are constructed in Python and the results are stored in a MongoDB database. Python's requests library makes it straightforward to build these requests and retrieve the corresponding results.
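As a rough sketch of the request-building step, the snippet below constructs a query URL from a fund code and wraps a JSON response body into a document suitable for storage. The endpoint, the parameter names `fundCode` and `page`, and the document layout are all hypothetical, since the real interface URLs are discovered through packet capture. To stay self-contained it uses only the standard library and does not send the request or connect to a database.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; real URLs come from the captured packets.
BASE_URL = "https://example.com/api/fund/announcements"


def build_request_url(fund_code: str, page: int = 1) -> str:
    """Build the query URL for one fund's announcement list."""
    query = urlencode({"fundCode": fund_code, "page": page})
    return f"{BASE_URL}?{query}"


def to_document(fund_code: str, raw_body: str) -> dict:
    """Wrap a JSON response body into a document ready to be stored."""
    return {"fund_code": fund_code, "payload": json.loads(raw_body)}


url = build_request_url("000001")
doc = to_document("000001", '{"title": "sample announcement"}')
```

In a real pipeline the URL would be fetched with `requests.get(url)` and the resulting document saved with pymongo's `insert_one`, matching the requests and MongoDB combination described in the text.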

For example, the difficulty of Appium-based automated page screenshots lies in ensuring the stability and coverage of the automated click scripts and the accuracy of OCR recognition. The present invention locates control elements mainly by their text, and additionally uses Appium Desktop to obtain XPath paths to assist in locating control elements.

For example, the fund APP screens that need screenshot monitoring share two traits: shallow navigation depth and few buttons to search for on each screen. An Appium configuration path table can therefore cover the automated screenshot needs of most APPs. The resulting configuration path table structure is shown in Figure 8.

To make the scripts general, the Appium configuration path is abstracted into five fields: application, root path, sub-path, exception log, and enabled status. An example of the format of the root path and sub-path is:

【首页-0|活期-0|com.hctforgf.gff:id/risk_warn_tv-5-0】 (in English: 【Home-0|Current-0|com.hctforgf.gff:id/risk_warn_tv-5-0】);

Here, the three paths separated by the | symbol (首页-0, 活期-0, and com.hctforgf.gff:id/risk_warn_tv-5-0) represent three controls that are clicked in sequence. The 0 in 首页-0 and 活期-0 means the control is located by its text. In com.hctforgf.gff:id/risk_warn_tv-5-0, the 5 means the control is located by a specified method, and the 0 means the control is located by element ID. The general path format is therefore text-locator mode digit-extra parameter|text-locator mode digit-extra parameter|text-locator mode digit-extra parameter. The specific digits and their meanings are shown in the table below:

Figure PCTCN2020120724-appb-000003

Figure PCTCN2020120724-appb-000004
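A configuration path in the format described above can be split into click steps with a small parser. The sketch below is one possible reading of that format, assuming each segment is `target-mode` or `target-mode-extra` with a numeric mode; the `PathStep` class and `parse_path` function are hypothetical names, and the meaning of each mode digit is left to the table, which is only available as an image.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PathStep:
    target: str           # visible text or resource id of the control
    mode: int             # locator mode digit
    extra: Optional[str]  # optional extra parameter, kept as a string


def parse_path(path: str) -> List[PathStep]:
    """Split a '|'-separated configuration path into ordered click steps."""
    steps: List[PathStep] = []
    for segment in path.split("|"):
        parts = segment.rsplit("-", 2)
        if len(parts) == 3 and parts[1].isdigit():
            # target-mode-extra
            steps.append(PathStep(parts[0], int(parts[1]), parts[2]))
        else:
            # target-mode (the target itself may contain hyphens)
            target, mode = segment.rsplit("-", 1)
            steps.append(PathStep(target, int(mode), None))
    return steps


steps = parse_path("首页-0|活期-0|com.hctforgf.gff:id/risk_warn_tv-5-0")
```

One limitation of this reading is that a text label ending in a hyphen and a digit would be ambiguous, so a production format might need an escape rule or a different separator.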

For carousel images in an APP, the carousel can be located either by text or by coordinates. Locating by text ultimately resolves to locating by coordinates, and the coordinates are then used to swipe the carousel left and right. One issue to consider is that the APP's automatic carousel rotation and the script's swiping can happen at the same time, which yields an abnormal number of captured images. To handle this, the extra parameter of the carousel locator mode defines the total number of carousel images. Using that total together with image similarity, the script avoids processing the same carousel image twice and exits only after the expected number of distinct images has been captured.
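The deduplication loop can be sketched as follows, with screenshots stood in for by flat lists of grayscale pixel values. The similarity measure, the 0.9 threshold, and the `max_swipes` safety cap are assumptions added for illustration; the patent does not specify which image similarity method is used.

```python
from typing import Iterable, List, Sequence

Image = Sequence[int]  # flattened grayscale pixels standing in for a screenshot


def similarity(a: Image, b: Image) -> float:
    """Fraction of pixels that differ by at most 10 grey levels."""
    if len(a) != len(b) or not a:
        return 0.0
    close = sum(1 for p, q in zip(a, b) if abs(p - q) <= 10)
    return close / len(a)


def collect_carousel(frames: Iterable[Image], total: int,
                     threshold: float = 0.9,
                     max_swipes: int = 50) -> List[Image]:
    """Keep swiping until `total` distinct carousel images are captured.

    A frame counts as new only if it is not similar to any frame already
    kept; `max_swipes` is a hypothetical cap that prevents an endless loop.
    """
    unique: List[Image] = []
    if total <= 0:
        return unique
    for swipes, frame in enumerate(frames):
        if swipes >= max_swipes:
            break
        if all(similarity(frame, seen) < threshold for seen in unique):
            unique.append(frame)
            if len(unique) == total:
                break
    return unique


# Simulated capture sequence with repeats caused by auto-rotation.
a, b, c = [0] * 16, [100] * 16, [200] * 16
captured = collect_carousel([a, a, b, a, c, b], total=3)
```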

Step S20: perform text recognition and extraction on the screenshots based on an OCR algorithm.

Specifically, after all screenshots have been obtained, the OCR algorithm is used to recognize and extract their text in order to obtain the required information.

Step S30: construct a sample set from the recognized text content using keywords and regular expressions, and annotate it manually.

Specifically, a sample set is constructed from the recognized text content using keywords and regular expressions, and manual annotation is applied to improve reliability. (A regular expression is a logical formula for operating on strings: predefined specific characters, and combinations of those characters, form a "rule string" that expresses a filtering logic for strings.)
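As an illustration of step S30, the sketch below selects candidate samples from OCR output and leaves the label empty for the manual annotation pass. The function name `build_candidates`, the record layout, the sample sentences, and the percentage pattern are assumptions made for this example; the two keywords are taken from the keyword list given later in the description.

```python
import re
from typing import Dict, List, Optional, Pattern


def build_candidates(
    lines: List[str],
    keywords: List[str],
    patterns: List[Pattern[str]],
) -> List[Dict[str, Optional[object]]]:
    """Keep OCR lines that hit a keyword or a regular expression.

    Each record stores which rules fired and a label of None, to be
    filled in later by a human annotator.
    """
    samples = []
    for text in lines:
        hits = [kw for kw in keywords if kw in text]
        hits += [p.pattern for p in patterns if p.search(text)]
        if hits:
            samples.append({"text": text, "hits": hits, "label": None})
    return samples


ocr_lines = ["本基金零风险,高收益", "欢迎使用", "七日年化收益率 5.2%"]
keywords = ["高收益", "零风险"]
patterns = [re.compile(r"\d+(?:\.\d+)?%")]  # hypothetical: any percentage

candidates = build_candidates(ocr_lines, keywords, patterns)
for record in candidates:
    print(record)
```

Recording which rule fired alongside the text gives annotators some context for why a line was selected, which is one possible way to support the reliability goal stated above.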

Step S40: input the manually annotated sample set into a pre-trained deep learning model for model adjustment, and divide the business scenarios so that violations in text can be judged separately for each scenario.

Before step S40, the method further includes constructing a corpus for supervising the training of the deep learning model. Specifically: obtain a plurality of keywords and match on the keywords; then construct a keyword-based training corpus and label it manually to generate the corpus.

For example, the fund sales violation discrimination model is trained in a supervised manner, so a corpus has to be constructed for training the supervised deep learning model. Because the richness of the corpus directly affects the accuracy of the semantic labels, corpus material on non-compliant fund sales promotion is collected manually from the internet to ensure the quality of the labeled samples.

By analyzing and summarizing online violation cases, keywords for screening were compiled, such as 高收益 (high yield), 零风险 (zero risk), 现金红包 (cash red envelope), 欲购从速 (buy now while available), and 有保证 (guaranteed). Based on these keywords, matching is performed using conditions such as contains, does not contain, greater than, less than, equal to, and regular expressions. A keyword-based training corpus is constructed in this way and labeled manually for use in model training.
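The matching conditions listed above (contains, does not contain, greater than, less than, equal to, regular expression) could be combined as in the following sketch. The description does not say what the numeric comparisons are applied to, so this example assumes they compare the first percentage found in the text against a threshold; the rule names, the `matches` function, and the sample sentences are likewise hypothetical.

```python
import operator
import re
from typing import Callable, Dict, List, Optional, Tuple

_PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def first_percent(text: str) -> Optional[float]:
    """Return the first percentage in the text, or None if there is none."""
    found = _PERCENT.search(text)
    return float(found.group(1)) if found else None


def _numeric(compare: Callable[[float, float], bool]) -> Callable[[str, float], bool]:
    """Build a rule that compares the first percentage with a threshold."""
    def check(text: str, threshold: float) -> bool:
        value = first_percent(text)
        return value is not None and compare(value, threshold)
    return check


RULES: Dict[str, Callable] = {
    "contains": lambda text, arg: arg in text,
    "not_contains": lambda text, arg: arg not in text,
    "regex": lambda text, arg: re.search(arg, text) is not None,
    "gt": _numeric(operator.gt),
    "lt": _numeric(operator.lt),
    "eq": _numeric(operator.eq),
}


def matches(text: str, conditions: List[Tuple[str, object]]) -> bool:
    """True when the text satisfies every (rule name, argument) pair."""
    return all(RULES[name](text, arg) for name, arg in conditions)


high_return = [("contains", "收益"), ("gt", 8.0)]
print(matches("年化收益率高达 12%", high_return))
print(matches("年化收益率 3%", high_return))
print(matches("零风险", [("not_contains", "风险提示"), ("regex", "零风险|保本")]))
```

Texts that satisfy a rule set would then be passed on for manual labeling, as the paragraph above describes.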

Further, deep learning requires a very large amount of data for model training, so training a model directly from scratch does not yield an effective model. The system therefore uses the PaddleHub framework to build the violation monitoring model: starting from the pre-trained ERNIE model, which has been trained on massive amounts of web data, the model is fine-tuned through transfer learning (that is, retrained). During training, separate training runs are carried out for different violation scenarios (such as exaggerating returns or disparaging other fund managers), giving a separate violation discrimination model for each scenario. The final result is, for each scenario, a classification model that judges whether data is in violation.

In other words, to obtain a classification model, a large amount of labeled data is fed into the model for training. The model uses the data and labels to adjust the weights of its own neural network nodes, and the resulting model with its specific weights can then be used to classify data.

Step S50: according to the discrimination results output by the deep learning model, compile the scores of the different APPs to obtain each APP's violation score.

Specifically, the violation score of the APP is given by a weighted average:

Figure PCTCN2020120724-appb-000005

Here, the symbol shown in Figure PCTCN2020120724-appb-000006 denotes the weighted average, f1~fk are the weights configured for the violation items of each dimension, x1~xk are the actual numbers of abnormal quality-inspection results for the violation items of each dimension, and n is the total number of dimensions. Different dimensions represent different violation scenarios (for example "lacking a reasonable risk warning", "promising guaranteed principal and returns", or "exaggerating returns"). The dimension score is obtained by looking up the score for the range in which the number of abnormal quality-inspection results falls. In the same way, the weighted average over five dimensions (n = 5) can be computed as the average score, for example by computing the violation score over the five dimensions of illegally promising returns, lacking a reasonable risk warning, disparaging other fund managers, promising guaranteed principal and returns, and exaggerating returns.
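A small numerical sketch of the scoring step follows. The formula itself survives here only as an image reference, so this example assumes the conventional weighted mean, the sum of f_i * x_i divided by the sum of the weights; the patent's exact denominator may differ. The function name, the weights, and the abnormal counts are hypothetical values chosen for illustration, and the range lookup that converts counts into dimension scores is not modeled.

```python
from typing import Sequence


def violation_score(weights: Sequence[float], counts: Sequence[float]) -> float:
    """Weighted mean of per-dimension abnormal counts.

    Assumption: normalised by the sum of the weights, as in the
    conventional weighted average.
    """
    if len(weights) != len(counts):
        raise ValueError("weights and counts must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(f * x for f, x in zip(weights, counts)) / total_weight


# Five dimensions named in the description; the numbers are made up.
dimensions = [
    "illegally promising returns",
    "lacking a reasonable risk warning",
    "disparaging other fund managers",
    "promising guaranteed principal and returns",
    "exaggerating returns",
]
weights = [1, 1, 1, 1, 1]
abnormal_counts = [2, 0, 1, 0, 2]

score = violation_score(weights, abnormal_counts)
print(score)
```

With equal weights the result reduces to the plain average of the abnormal counts; raising the weight of a dimension makes violations in that scenario count more heavily toward the APP's score.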

Further, once the above steps are implemented, a scheduled start is configured for all tasks. As shown in Figure 9, these include a scheduled APP crawling task, a scheduled APP screenshot task, and a scheduled violation monitoring task, so that APP promotional data is monitored in real time.
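The three scheduled tasks can be illustrated with Python's standard `sched` module. This is a sketch under stated assumptions rather than the patent's scheduler: the 24-unit interval, the two repetitions, the priority ordering (crawl, then screenshot, then monitoring), and the helper names are all hypothetical, and a fake clock replaces real time so that the example finishes immediately.

```python
import sched
from typing import Callable, List


class FakeClock:
    """Deterministic clock so the schedule runs without real waiting."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def schedule_periodic(
    scheduler: sched.scheduler,
    interval: float,
    priority: int,
    task: Callable[[], None],
    repeats: int,
) -> None:
    """Run `task` every `interval` time units, `repeats` times in total."""

    def _run(remaining: int) -> None:
        task()
        if remaining > 1:
            scheduler.enter(interval, priority, _run, (remaining - 1,))

    scheduler.enter(interval, priority, _run, (repeats,))


def run_demo() -> List[str]:
    clock = FakeClock()
    scheduler = sched.scheduler(clock.time, clock.sleep)
    log: List[str] = []
    # Lower priority numbers run first when tasks are due at the same time.
    for priority, name in enumerate(["crawl", "screenshot", "monitor"], start=1):
        schedule_periodic(scheduler, 24.0, priority, lambda n=name: log.append(n), 2)
    scheduler.run()
    return log


print(run_demo())
```

In a real deployment the scheduler would be created with `time.time` and `time.sleep` instead of the fake clock, and each task would call the crawling, screenshot, or monitoring routine.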

The present invention crawls mobile APP data automatically on a schedule. For data that cannot be crawled on mobile, Appium simulates touch-screen operations to take screenshots. OCR is applied to the screenshots for character detection and recognition, which solves the problem of detecting characters in fund promotional images. For violation monitoring of promotional text, a corpus construction method based on keywords and regular expressions is proposed. For the problem that deep learning models need large training corpora, a transfer learning model is used, fine-tuning a pre-trained model on annotated data. A dataset is built to train the deep learning model, producing a model able to classify fund text as compliant or in violation. Finally, the monitoring results are used for statistical analysis of violations across app stores and fund APPs, achieving automated violation monitoring for fund APPs.

The present invention allows unified collection and monitoring of online fund promotional data. The collected data and the model results can later be built into a knowledge base for use in big data analysis, knowledge graphs, and other fields.

Further, as shown in Figure 10, based on the above APP violation monitoring method based on OCR and transfer learning, the present invention also provides a smart terminal comprising a processor 10, a memory 20, and a display 30. Figure 10 shows only some components of the smart terminal; it should be understood that not all of the components shown are required, and more or fewer components may be implemented instead.

In some embodiments the memory 20 may be an internal storage unit of the smart terminal, such as its hard disk or internal memory. In other embodiments the memory 20 may be an external storage device of the smart terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card fitted to the smart terminal. Further, the memory 20 may include both an internal storage unit and an external storage device of the smart terminal. The memory 20 is used to store application software installed on the smart terminal and various types of data, such as the program code installed on the smart terminal. The memory 20 may also be used to temporarily store data that has been output or is about to be output. In one embodiment, the memory 20 stores an APP violation monitoring program 40 based on OCR and transfer learning, which can be executed by the processor 10 to implement the APP violation monitoring method based on OCR and transfer learning of this application.

In some embodiments the processor 10 may be a Central Processing Unit (CPU), a microprocessor, or another data processing chip, used to run the program code stored in the memory 20 or to process data, for example to execute the APP violation monitoring method based on OCR and transfer learning.

In some embodiments the display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display 30 is used to display information on the smart terminal and to display a visual user interface. The components 10-30 of the smart terminal communicate with each other over a system bus.

In one embodiment, when the processor 10 executes the APP violation monitoring program 40 based on OCR and transfer learning in the memory 20, the following steps are implemented:

regularly updating APKs, and collecting data for the corresponding APP according to the updated APK, the data collection including data packet capture and page screenshots;

performing text recognition and extraction on the screenshots based on an OCR algorithm;

constructing a sample set from the recognized text content using keywords and regular expressions, and annotating it manually;

inputting the manually annotated sample set into a pre-trained deep learning model for model adjustment, and dividing the business scenarios so that violations in text are judged separately for each scenario;

according to the discrimination results output by the deep learning model, compiling the scores of the different APPs to obtain each APP's violation score.

The regular updating of APKs and the data collection for the corresponding APP according to the updated APK specifically include:

crawling the APK of each application using Java with the Jsoup library, regularly updating the app-store APKs, and collecting data for the corresponding APP according to the updated APK.

The data collection specifically includes: using a crawler to capture promotional data packets directly, and using Appium scripts to take automated page screenshots.

Before the manually annotated sample set is input into the pre-trained deep learning model for model adjustment, with violations in text judged separately for each business scenario, the steps further include:

constructing a corpus for supervising the training of the deep learning model.

The construction of the corpus includes:

obtaining a plurality of keywords and matching on the keywords;

constructing a keyword-based training corpus and labeling it manually to generate the corpus.

Compiling the scores of the different APPs to obtain each APP's violation score specifically includes:

obtaining the violation score of the APP as a weighted average:

Figure PCTCN2020120724-appb-000007

Here, the symbol shown in Figure PCTCN2020120724-appb-000008 denotes the weighted average, f1~fk are the weights configured for the violation items of each dimension, x1~xk are the actual numbers of abnormal quality-inspection results for the violation items of each dimension, and n is the total number of dimensions; different dimensions represent different violation scenarios.

After the scores of the different APPs are compiled according to the discrimination results output by the deep learning model to obtain each APP's violation score, the steps further include:

configuring a scheduled start for all tasks.

The tasks include: a scheduled APP crawling task, a scheduled APP screenshot task, and a scheduled violation monitoring task.

The present invention also provides a storage medium storing an APP violation monitoring program based on OCR and transfer learning which, when executed by a processor, implements the steps of the APP violation monitoring method based on OCR and transfer learning described above.

In summary, the present invention provides an APP violation monitoring method based on OCR and transfer learning. The method includes: regularly updating APKs, and collecting data for the corresponding APP according to the updated APK, the data collection including data packet capture and page screenshots; performing text recognition and extraction on the screenshots based on an OCR algorithm; constructing a sample set from the recognized text content using keywords and regular expressions, and annotating it manually; inputting the manually annotated sample set into a pre-trained deep learning model for model adjustment, and dividing the business scenarios so that violations in text are judged separately for each scenario; and, according to the discrimination results output by the deep learning model, compiling the scores of the different APPs to obtain each APP's violation score. By collecting and analyzing APP data, the present invention detects non-compliant use of APPs effectively and quickly.

Of course, those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware (such as a processor or controller). The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium may be a memory, a magnetic disk, an optical disc, or the like.

It should be understood that the application of the present invention is not limited to the above examples. Those of ordinary skill in the art can make improvements or modifications based on the above description, and all such improvements and modifications fall within the scope of protection of the appended claims of the present invention.

Claims (10)

1. An APP violation monitoring method based on OCR and transfer learning, characterized in that the method comprises:
regularly updating APKs, and collecting data for the corresponding APP according to the updated APK, the data collection including data packet capture and page screenshots;
performing text recognition and extraction on the screenshots based on an OCR algorithm;
constructing a sample set from the recognized text content using keywords and regular expressions, and annotating it manually;
inputting the manually annotated sample set into a pre-trained deep learning model for model adjustment, and dividing the business scenarios so that violations in text are judged separately for each scenario;
according to the discrimination results output by the deep learning model, compiling the scores of the different APPs to obtain each APP's violation score.

2. The APP violation monitoring method based on OCR and transfer learning according to claim 1, characterized in that the regular updating of APKs and the data collection for the corresponding APP according to the updated APK specifically comprise:
crawling the APK of each application using Java with the Jsoup library, regularly updating the app-store APKs, and collecting data for the corresponding APP according to the updated APK.

3. The APP violation monitoring method based on OCR and transfer learning according to claim 1, characterized in that the data collection specifically comprises: using a crawler to capture promotional data packets directly, and using Appium scripts to take automated page screenshots.

4. The APP violation monitoring method based on OCR and transfer learning according to claim 1, characterized in that, before the manually annotated sample set is input into the pre-trained deep learning model for model adjustment with violations in text judged separately for each business scenario, the method further comprises:
constructing a corpus for supervising the training of the deep learning model.

5. The APP violation monitoring method based on OCR and transfer learning according to claim 4, characterized in that the construction of the corpus comprises:
obtaining a plurality of keywords and matching on the keywords;
constructing a keyword-based training corpus and labeling it manually to generate the corpus.

6. The APP violation monitoring method based on OCR and transfer learning according to claim 1, characterized in that compiling the scores of the different APPs to obtain each APP's violation score specifically comprises:
obtaining the violation score of the APP as a weighted average:
Figure PCTCN2020120724-appb-100001
where the symbol shown in Figure PCTCN2020120724-appb-100002 denotes the weighted average, f1~fk are the weights configured for the violation items of each dimension, x1~xk are the actual numbers of abnormal quality-inspection results for the violation items of each dimension, n is the total number of dimensions, and different dimensions represent different violation scenarios.

7. The APP violation monitoring method based on OCR and transfer learning according to claim 1, characterized in that, after the scores of the different APPs are compiled according to the discrimination results output by the deep learning model to obtain each APP's violation score, the method further comprises:
configuring a scheduled start for all tasks.

8. The APP violation monitoring method based on OCR and transfer learning according to claim 7, characterized in that the tasks comprise: a scheduled APP crawling task, a scheduled APP screenshot task, and a scheduled violation monitoring task.

9. A smart terminal, characterized in that the smart terminal comprises: a memory, a processor, and an APP violation monitoring program based on OCR and transfer learning that is stored on the memory and can run on the processor, wherein the program, when executed by the processor, implements the steps of the APP violation monitoring method based on OCR and transfer learning according to any one of claims 1-8.

10. A storage medium, characterized in that the storage medium stores an APP violation monitoring program based on OCR and transfer learning which, when executed by a processor, implements the steps of the APP violation monitoring method based on OCR and transfer learning according to any one of claims 1-8.
PCT/CN2020/120724 2020-08-25 2020-10-14 Ocr and transfer learning-based app violation monitoring method Ceased WO2022041406A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010862575.X 2020-08-25
CN202010862575.XA CN112101335B (en) 2020-08-25 2020-08-25 APP violation monitoring method based on OCR and transfer learning

Publications (1)

Publication Number Publication Date
WO2022041406A1 true WO2022041406A1 (en) 2022-03-03

Family

ID=73753383

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/120724 Ceased WO2022041406A1 (en) 2020-08-25 2020-10-14 Ocr and transfer learning-based app violation monitoring method

Country Status (2)

Country Link
CN (1) CN112101335B (en)
WO (1) WO2022041406A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114978936A (en) * 2022-05-24 2022-08-30 身边云(北京)信息服务有限公司 Method, system and storage medium for upgrading shared service platform
CN114996309A (en) * 2022-05-24 2022-09-02 天元大数据信用管理有限公司 Method and system for realizing policy matching efficient query by using intermediate table
CN115185520A (en) * 2022-07-21 2022-10-14 武汉众邦银行股份有限公司 Configured report development method based on Springboot + vue framework
CN115641021A (en) * 2022-10-17 2023-01-24 北京知道创宇信息技术股份有限公司 Violation detection method and device, electronic equipment and storage medium
CN116664825A (en) * 2023-06-26 2023-08-29 北京智源人工智能研究院 Self-supervised comparative learning method and system for point cloud object detection in large scenes
CN116912867A (en) * 2023-09-13 2023-10-20 之江实验室 Textbook structure extraction method and device combining automatic annotation and recall completion
CN117197816A (en) * 2023-06-19 2023-12-08 珠海盈米基金销售有限公司 User material identification method and system
CN117235150A (en) * 2023-09-25 2023-12-15 中国科学院计算技术研究所 Small sample time series data extrapolation analysis method and system for complex scenarios
CN117272113A (en) * 2023-10-10 2023-12-22 深圳福恋智能信息科技有限公司 Method and system for detecting illegal behaviors based on virtual social network
CN117541269A (en) * 2023-12-08 2024-02-09 北京中数睿智科技有限公司 Third party module data real-time monitoring method and system based on intelligent large model
CN119415361A (en) * 2024-10-31 2025-02-11 北京百度网讯科技有限公司 Method, device, electronic device, and medium for determining application download behavior
CN120302088A (en) * 2025-04-10 2025-07-11 北京飞鹰互动科技有限公司 A monitoring method and system for Internet live broadcast violations based on image and voice recognition

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686022A (en) * 2020-12-30 2021-04-20 平安普惠企业管理有限公司 Method and device for detecting illegal corpus, computer equipment and storage medium
CN112948830B (en) * 2021-03-12 2023-11-10 安天科技集团股份有限公司 File risk identification method and device
CN113076339B (en) * 2021-03-18 2024-08-20 北京沃东天骏信息技术有限公司 Data caching method, device, equipment and storage medium
CN113221890A (en) * 2021-05-25 2021-08-06 深圳市瑞驰信息技术有限公司 OCR-based cloud mobile phone text content supervision method, system and system
CN113326376A (en) * 2021-05-28 2021-08-31 南京大学 Code review opinion quality evaluation system and method based on machine learning
CN113568823A (en) * 2021-09-27 2021-10-29 深圳市永达电子信息股份有限公司 Employee operation behavior monitoring method, system and computer readable medium
CN113888760B (en) * 2021-09-29 2024-04-23 平安银行股份有限公司 Method, device, equipment and medium for monitoring violation information based on software application

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210484A (en) * 2019-04-19 2019-09-06 成都三零凯天通信实业有限公司 System and method for detecting and identifying poor text of view image based on deep learning
CN110210542A (en) * 2019-05-24 2019-09-06 厦门美柚信息科技有限公司 Picture character identification model training method, device and character identification system
US20200005071A1 (en) * 2019-08-15 2020-01-02 Lg Electronics Inc. Method and apparatus for recognizing a business card using federated learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108205674B (en) * 2017-12-22 2022-04-15 广州爱美互动网络科技有限公司 Social APP content identification method, electronic device, storage medium and system
CN109492143A (en) * 2018-09-21 2019-03-19 平安科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN110275958B (en) * 2019-06-26 2021-07-27 北京市博汇科技股份有限公司 Website information identification method and device and electronic equipment
CN110837615A (en) * 2019-11-05 2020-02-25 福建省趋普物联科技有限公司 Artificial intelligent checking system for advertisement content information filtering
CN111400132B (en) * 2020-03-09 2023-08-18 北京版信通技术有限公司 Automatic monitoring method and system for on-shelf APP

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHENG RUI: "Research on Intelligent Accumulator Based on Convolutional Neural Network", CHINA MASTER’S THESES FULL-TEXT DATABASE, 1 May 2019 (2019-05-01), pages 1 - 82, XP055902621, ISSN: 1674-0246, DOI: 10.27774/d.cnki.gzygx.2020.000014 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114978936A (en) * 2022-05-24 2022-08-30 身边云(北京)信息服务有限公司 Method, system and storage medium for upgrading shared service platform
CN114996309A (en) * 2022-05-24 2022-09-02 天元大数据信用管理有限公司 Method and system for realizing policy matching efficient query by using intermediate table
CN115185520A (en) * 2022-07-21 2022-10-14 武汉众邦银行股份有限公司 Configured report development method based on Springboot + vue framework
CN115641021A (en) * 2022-10-17 2023-01-24 北京知道创宇信息技术股份有限公司 Violation detection method and device, electronic equipment and storage medium
CN117197816A (en) * 2023-06-19 2023-12-08 珠海盈米基金销售有限公司 User material identification method and system
CN116664825A (en) * 2023-06-26 2023-08-29 北京智源人工智能研究院 Self-supervised comparative learning method and system for point cloud object detection in large scenes
CN116912867A (en) * 2023-09-13 2023-10-20 之江实验室 Textbook structure extraction method and device combining automatic annotation and recall completion
CN116912867B (en) * 2023-09-13 2023-12-29 之江实验室 Textbook structure extraction method and device combining automatic annotation and recall completion
CN117235150A (en) * 2023-09-25 2023-12-15 中国科学院计算技术研究所 Small sample time series data extrapolation analysis method and system for complex scenarios
CN117272113A (en) * 2023-10-10 2023-12-22 深圳福恋智能信息科技有限公司 Method and system for detecting illegal behaviors based on virtual social network
CN117541269A (en) * 2023-12-08 2024-02-09 北京中数睿智科技有限公司 Third party module data real-time monitoring method and system based on intelligent large model
CN119415361A (en) * 2024-10-31 2025-02-11 北京百度网讯科技有限公司 Method, device, electronic device, and medium for determining application download behavior
CN120302088A (en) * 2025-04-10 2025-07-11 北京飞鹰互动科技有限公司 A monitoring method and system for Internet live broadcast violations based on image and voice recognition

Also Published As

Publication number Publication date
CN112101335A (en) 2020-12-18
CN112101335B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN112101335B (en) APP violation monitoring method based on OCR and transfer learning
US8972397B2 (en) Auto-detection of historical search context
CN111107048B (en) Phishing website detection method and device and storage medium
CN104765874B (en) Method and device for detecting click cheating
US12266203B2 (en) Multiple input machine learning framework for anomaly detection
WO2019218514A1 (en) Method for extracting webpage target information, device, and storage medium
WO2018235252A1 (en) Analyzer, method of analyzing log and recording medium
CN110972499A (en) Annotation System for Neural Networks
US11176403B1 (en) Filtering detected objects from an object recognition index according to extracted features
CN114513355A (en) Malicious domain name detection method, device, equipment and storage medium
CN112818200A (en) Data crawling and event analyzing method and system based on static website
US20200394263A1 (en) Representation learning for tax rule bootstrapping
CN115296892B (en) Data information service system
CN114780891B (en) A Method and Apparatus for Website Key Resource Analysis Based on Page Rendering Contribution
CN118211102A (en) Intelligent disease category analysis method and device, electronic equipment and storage medium
CN117540368A (en) Data leakage detection method, device, equipment and storage medium
CN115186240A (en) Social network user alignment method, device and medium based on relevance information
CN119718736B (en) Page monitoring method, device, computer equipment and storage medium
US20230186664A1 (en) Method for text recognition
US12154356B2 (en) Automated key-value pair extraction
US20250284480A1 (en) Techniques for updating content for software applications using vector tagging
US20240427824A1 (en) Zyft A Decentralised Edge-based Search Engine for Products and Services
US20250335165A1 (en) Systems and methods for processing web platform source code
US20250335219A1 (en) On-screen application object detection
US20240233426A9 (en) Method of classifying a document for a straight-through processing

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16/06/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20951074

Country of ref document: EP

Kind code of ref document: A1