WO2025019561A2

WO2025019561A2 - Topic transfer technique for neural networks

Info

Publication number: WO2025019561A2
Application number: PCT/US2024/038343
Authority: WO
Inventors: Sven Dietrich; Pavel Laskov; Shoufu Luo; Saskia Laura SCHRÖER; Jeremy D. Seideman
Original assignee: Liechtenstein, University of; Research Foundation of City University of New York
Current assignee: Liechtenstein, University of; Research Foundation of City University of New York
Priority date: 2023-07-17
Filing date: 2024-07-17
Publication date: 2025-01-23
Anticipated expiration: 2026-01-17
Also published as: WO2025019561A3

Abstract

A method for deriving topics from known textual sources to apply those models to unknown textual sources comprises accepting a first input from a first data source that has data related to at least one specified subject, thereby producing a first accepted input; grouping the accepted first input into first groups using a language model; labeling the first groups with identifiers; receiving a second input from a second data source; searching the accepted second input using a language model, thereby producing searched second input; identifying a first group for at least some of the searched second input; and creating at least one new group for any searched second input that was not identified with a first group.

Description

TOPIC TRANSFER TECHNIQUE FOR NEURAL NETWORKS RELATED APPLICATIONS [0001] This application claims priority to U.S. Provisional Application Serial No. 63/514,009, filed July 17, 2023, and entitled “Topic Transfer Technique,” the entirety of which is incorporated by reference herein. FIELD [0002] The present concepts relate generally to machine learning, and more specifically, to topic transfer techniques using topic modeling when searching for topics in a text corpus. BACKGROUND [0003] Artificial Intelligence (AI) is a growing innovation due at least to its ability to simplify and automate human-like tasks by permitting computers to “learn” from experience. While efforts are being made to implement AI to protect computers, networks, and systems, AI is also being considered by cybercriminals in an offensive manner, namely, to attack computers by password guessing, phishing, identity theft, and so on. [0004] Underground forums are electronic meeting places, for example, chat rooms, communities, and the like, which allow users to interact with communities that exchange information on illicit activities. There is interest in computer security to search for evidence of an attacker’s inclination to use AI for malicious purposes in underground forums. The reason one may expect to find such evidence in underground forums is that the typical cybercriminal has no expertise in AI and lacks the skill set to implement AI. Furthermore, experience shows that attackers seldom re-invent the wheel and prefer ready-to-use tools, if they exist. Since open-source AI tools are becoming increasingly available, attackers are very likely to search for, apply, and customize them. Hence, discussions in underground forums may reveal attackersb curiosity with respect to AI technologies, akin to AI researchers extensively using legitimate forums such as Stack Exchange to boost their learning curves. [0005] The characterization of potential evidence for AI usage, however, is inherently difficult. It is virtually impossible to manually describe the key traits of AI to search for in underground forums. Aside from looking at an extremely large list of terms of the cited subject matter, forum discussions and other text corpora need to take into account both alternate ways of constructing sentences, the ways that posts or threads can be constructed, etc. While language models are trained on the language itself (e.g., a model trained in English), also concerning is how discussion topics can be constructed using specific terminology and word order/grouping. One can only derive that by examining other legitimate forums (i.e., forums that we know are discussing a specific topic so we can use it as a baseline) to learn the topics that could be discussed and how they might be constructed. As described below, embodiments of the topic transfer techniques can apply this information to the unknown corpus to allow for topic discovery. [0006] Topic modeling is an unsupervised machine learning technique, i.e., it does not require training and instead relies on pre-trained models, that is used for detecting word and phrase patterns within the text of collected data sets such as documents, and automatically clustering terms, or topics, into groups that characterize the documents from where the words of interest are detected. One application where topic modeling can be used identify the topics relates to underground forums, for example, the topic “Artificial Intelligence” by detecting patterns and recurring words. [0007] However, it is challenging to define some terms such as the term “Artificial Intelligence,” or to define some topics of interest and search for these topics in completely unrelated data sets. It is particularly challenging to model the topics, for example, to understand the key topics discussed in legitimate AI forums or other data sources.^^Even if a topic model can be created for a first data source, for example, a legitimate AI discussion forum, it is desirable to assess the occurrence of the modeled topic from the first data source used for searching for topics in a different second data source such as an underground forum. SUMMARY [0008] In another aspect, a method for deriving topics from known textual sources to apply those models to unknown textual sources, the method comprising steps of: accepting a first input from a first data source that has data related to at least one specified subject, thereby producing a first accepted input; grouping the accepted first input into first groups using a language model; labeling the first groups with identifiers; receiving a second input from a second data source; searching the accepted second input using a language model, thereby producing searched second input; identifying a first group for at least some of the searched second input; and creating at least one new group for any searched second input that was not identified with a first group. [0009] In an aspect of the present inventive concept, a method for electronic searching for topics in a text corpus, comprising: applying a topic model technique to a first set of data from a first data source; capturing, by the topic model, at least one topic from the first set of data; receiving a second set of data from a second data source unrelated to the first data source; searching for topics of interest in the second set of data, including: applying a prediction function to generate a probability distribution of the at least one topic from the first set of data is present in the second set of data; and applying a heuristic to define a threshold for determining a topic relevance in response to analyzing the probability distribution. BRIEF DESCRIPTION OF THE DRAWINGS [00010] The above and further advantages of this invention may be better understood by referring to the following description in conjunction with the accompanying drawings, in which like numerals indicate like structural elements and features in various figures. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. In the drawings: [00011] FIG.1 is a block diagram of a system for performing a topic transfer operation, in accordance with some embodiments. [00012] FIG.2 is a flow diagram of a method for searching for topics in underground forum threads, in accordance with some embodiments. [00013] FIG.3 is a flow diagram of a method for deriving topics from a known textual source and applying a model of the derived topics to an unknown text source, in accordance with some embodiments. [00014] FIG.4 is a flow diagram of a method for topic transfer, in accordance with some embodiments. [00015] FIG.5 is a diagram of a system for performing a topic transfer operation, in accordance with some embodiments. [00016] FIG.6 is a table displaying an assignment of various topics to meta-topics, in accordance with some embodiments. [00017] FIG.7 is a table displaying a coverage rate of a 50-topic model for a plurality of different experiments, in accordance with some embodiments. DETAILED DESCRIPTION [00018] As described above, it would be desirable to assess the occurrence of a modeled topic in a different data source such as an online discussion forum without any a priori assumption of topics presented in the different data source. Although underground forums are described by way of example herein, applications of the present inventive concept are not limited thereto and may include other data sources unrelated to Artificial Intelligence. It is challenging to define some terms such as “Artificial Intelligence,” or to define some topics which are of question and search for these topics in completely unrelated data sets, for example, underground forum threads which are unrelated to legitimate AI forums. Topic modeling techniques such as Contextualized Topic Modeling (CTM) may be used for identifying topics in the legitimate AI forums. However, the issue remains how to search for the identified topics in a corpus of unrelated stored documents such as forum threads of a data source other than the data source for which the topic model is generated. [00019] In brief overview, provided in accordance with embodiments of the present inventive concept are systems and methods that can operate as an extension of topic modeling, and permit a generated model to transfer knowledge from one data source, or domain, to another, generally referred to as a topic transfer. [00020] Instead of starting from scratch, topic models can leverage pre- trained knowledge and adapt it to new tasks or domains. To achieve this, the systems and methods in accordance with some embodiments rely on pretrained knowledge, topic models, and topic transfer techniques to derive topics from known textual sources and apply them to unknown and unclassified text corresponding to new tasks or domains. One application of interest is to address the abuse of AI by cybercriminals who use underground forums to gather information about AI. In some embodiments, the systems and methods deploy and extend topic modeling techniques known from Natural Language Processing (NLP) or the like to generate the search algorithm to search for identified topics in unrelated text using the topic model generated for the topics extracted from the known sources. After the model is built and the topics of interest are derived from the known textual sources, the model can be applied to unknown and unclassified text to derive the topics used in the unknown and unclassified text. [00021] FIG.1 is a block diagram of a system 10 for performing an electronic topic transfer operation, in accordance with some embodiments. In some embodiments, as shown, the system 10 includes some or all of a topic identification module 110, a topic extraction module 120, a topic transfer module 130, topic relevance model 140, and a labeling module 150. Some or all of the topic identification module 110, topic extraction module 120, topic transfer module 130, topic relevance module 140, and labeling module 150 may include program code that is stored in a memory and executed by a processor of the system 10. [00022] The topic identification module 110 is constructed and arranged to train one or more topic models for a data source 102, for example, where legitimate AI discussions in the form of data exchanges may occur. The topic identification module 110 may also perform other topic model processing operations. Topic modeling may be performed to describe a set of topics contained in a pre-defined document corpus from the data source 102, for example, news articles, legal documents, white papers, and so on. In doing so, the topic identification module 110 identifies topics and word groupings in the text corpus. The topic identification module 110 may have a first input that receives the corpus and a second input that receives a pre-trained language model 104. The CTM, which includes a combination of representations of the text using pre-trained words and/or sentence embeddings from the pre-trained language model 104 along with topic modeling allows a more granular and precise understanding of documents, sentences, or words. In some embodiments, the pre-trained language model 104 may use pre-trained representations of language such as Bidirectional Encoder Representations from Transformers (BERT), or Sentence-BERT (SBERT), which further advances BERT by adjusting the BERT architecture to derive embeddings for sentences, to support topic modeling and simplify the analysis of text documents. In these embodiments, SBERT 104 can be combined with CTM 110. [00023] In some embodiments, the topic identification module 110 stores and executes at least one CTM for receiving one or more pre-trained language models 104. The CTM may be configured to identify the most likely topic for a text document based on the topic distribution calculated for that document. Although those distributions are useful when compared to related documents, when compared to unrelated documents, the CTM will return the most likely topic based on the training data, but the topic might not be as relevant. The topic transfer module 130 (described below) when communicating with the CTM mitigates this problem by rejecting topics whose likelihood is insufficiently high, which makes CTM applicable to a broad range of problems in which searching for given topics in potentially unrelated texts is appropriate. [00024] The topic extraction module 120 extracts concepts from text regarding topics of interest in accordance with the trained topic model provided by the topic identification module 110. For example, extracted topics of interest may include AI topics that pertain to AI research, techniques, algorithms, and/or other topics of interest to a party who plans to implement and use AI, including actors interested in using AI for cybercrime purposes. In embodiments where the topic extraction module 120 is used on AI-related data, the purpose of the extraction of topics from legitimate AI discussion forums is to exploit this knowledge for the exploration of underground forums with regard to abuse of AI. In some embodiments, the topic extraction module 120 may arrange the topics into groups, each having an identifier to confirm that a received input is an accepted input, i.e., having a topic that is identified as a topic of interest. [00025] The topic transfer module 130 applies a model generated from one source data to another source of data. In doing so, the topic transfer module 130 receives via a computer input the unprocessed (i.e., unknown and unclassified) text from the other data sources 112, analyzes the extracted topics to determine if the topics appear in one or more other data sources 112 such as an underground forum, and performs a topic transfer operation where the extracted topics of the legitimate forum, or first data source 102 or domain, processed by the topic model are used for searching for topics in a second data source 112 or domain such as an underground forum. [00026] In some embodiments, the topic transfer module 130 includes a prediction module 132 and a heuristic threshold module 134. The prediction module 132 can generate an assessment whether the topics captured by the model are related to or relevant with respect to the received text from the other data source(s) 112. The heuristic threshold module 134 may incorporate a customized or parameterized heuristic to define a threshold for determining topic relevance in response to analyzing probability distributions produced by the prediction module 132. [00027] The topic transfer module 130 can perform the topic transfer operation where any determined occurrence of the topic in the underground forum or other data source 112 can be made without any prior knowledge or assumption that any of the topics identified via the topic model are present in the other data source 112. In other words, instead of identifying topics for a fixed data corpus, the topic transfer module 130 determines if a given topic (extracted from some text corpus) is present in the other data source 112. [00028] In particular, the topic identification module 110 and topic extraction module 120 may perform a “blind” topic assignment operation in that if using a CTM model alone will result in topic probabilities for a new document that are always normalized. Therefore, the topic identification module 110 will always find the most relevant topic for a new document even if all topics in an analyzed text corpus are irrelevant. The topic transfer module 130 addresses this issue with a heuristic (described in greater detail in FIG.4). For example, if a document is unrelated to the topics captured by the model, the probability distribution computed by the model is approximately uniform. As a result, for n topics inferred by the model, their probabilities are approximately 1/n. Conversely, if some topics are relevant for the document, their probabilities should be significantly larger than 1/n. The heuristic threshold module 134 can define a threshold above which one can claim that a given topic is relevant for a new document. For example, the threshold may be set to 10/n, i.e., 10 times higher than the uniform topic probability, to achieve a reasonable trade-off between the true positive and the false positive rates. [00029] The labeling module 150 is configured to assign labels to topics identified via the topic model. Labeling may be performed after a topic transfer is performed to an unlabeled target domain. While the labeling module 150 can automatically label topics, in some embodiments, topic labels are manually reviewed and assigned. Topics identified by the extraction module 120 can be assigned labels, either by input from human experts or automated labeling via previously derived topics and associated keywords, for example, grouping topics under a topic label entitled “Artificial Intelligence” if the topics are identified. Labels can be determined from previously derived topics and associated keywords. [00030] FIG.2 is a flow diagram of a method 200 for searching for topics in underground forum threads, in accordance with some embodiments. Method 200 may be implemented by some or all of the computer system 10 of FIG.1. [00031] At step 210, a training dataset is received from a first data source. The dataset can be data from sources that are related to specific subjects that can be broken into topics. A training dataset may include data from a known discussion of the field of interest, for example, AI tools and discussions used to define the types of topics that are discussed. As is well-known, a model is trained according to a process by which one starts to identify those topics that could be discussed. One example is SBERT, which is a pre-trained module that represents language itself; the additional training fine-tunes it for the discussion domain. The pre-trained model may be initially trained on a large corpus of text by feeding it the text and allowing it to identify/derive relationships between words, encoding them into a format that can be used by a neural network. SBERT itself uses a neural network, for example, a Siamese neural network, to identify and encode words and phrases. The method 200 uses such a neural network that incorporates SBERT or SBERT-like models to process input. [00032] At step 220, a topic model is built on the training dataset. In some embodiments, the knowledge transfer facilitates building topic models on a target dataset by using knowledge extracted from other data corpora via, e.g., topic embodiments, or more specifically, pretrained word/sentence embeddings from the pre-trained language model 104. The topic model can be used for detecting word and phrase patterns with the received dataset(s), and automatically cluster word groups and similar expressions that best characterize the dataset(s). [00033] At step 230, the generated topic model is used to perform a topic transfer operation to determine if any topics from the dataset from which the model is built appear in another, or target, dataset, for example, an underground forum at a different data source from the training dataset. In doing so, the topic model during the topic transfer operation in response to receiving and analyzing an unseen document of the target dataset can determine if the known topics of the document are relevant to the topics captured by the model. In some embodiments, the topic transfer operation includes a threshold heuristic (described herein) that accepts or rejects topics identified in the target dataset, which enables the system to search for extracted topics via a CTM or the link in unseen and unrelated documents, e.g. underground forums. [00034] FIG.3 is a flow diagram of a method 300 for deriving topics from a known textual source and applying a model of the derived topics to an unknown text source, in accordance with some embodiments. In describing the method 300, reference is made to the computer system 10 of FIG.1. [00035] At step 310, a first input is received by the system 10 from a first data source that has data related to a topic of interest. For example, the first data source may be a legitimate forum that includes topic data regarding artificial intelligence. The data can be training data. [00036] At step 320, the first input is grouped into a first group using a language model, for example, BERT, SBERT, RoBERTa, and so on, which can derive topics from the first input. Here, the topics can be grouped with identifiers. For example, the first input can include related to specific subjects that can be broken into topics by a language model such as such as RoBERTa or the like. The topics are not necessarily labeled at this point, but rather grouped with identifiers. [00037] After the topics have been identified in step 320, the method 300 can proceed to step 330, where labels can be assigned to the identified topics. Labels can be assigned to the identified topics either through the use of human experts or automated labeling via previously derived topics and associated keywords. [00038] At step 340, a second input is received from a second data source that is unrelated to the first data source, although the topic model generated from the first input may be used in step 350 for searching for a topic of interest in the text corpus of the second data source. At step 350, the text corpus of the second data source is searched using the extracted topics using the language model. For example, the topics captured in the training data source, e.g., a legitimate forum, can be used for detecting the same or similar topics in the target training source. [00039] At step 360, at least one of the first groups is identified for at least some of the searched second input, in particular, relevant topics found in response to the search of the text corpus of the second data source as a third input. For example, a first group identified as “Artificial Intelligence Education” may be identified for topic data detected from the search of data received from the second data source from the third input in step 350. [00040] There may be topic data that was identified in the search in step 350 but not identified with any of the first groups in steps 320 and 330. In this case, a new group may be created for this unidentified topic data. [00041] FIG.4 is a flow diagram of a method 400 for topic transfer, in accordance with some embodiments. In describing the method 400, reference is made to the computer system 10 of FIG.1. [00042] At step 410, the system 10 receives a text corpus from a data source, for example, the (i.e., unknown and unclassified) text from a data source 112 different than the data source 102 from which a topic model is generated. [00043] At step 420, the text corpus is searched for topics. As described below, some identified topics may be related to topics of the training data source 102 on which the topic model is built. [00044] At step 430, a transfer operation is performed. More specifically, the topic model, given an unseen document as an input, can compute the probability distribution for all known topics in that document. [00045] At step 440, a determination is made whether a known topic identified in the text corpus, e.g., a document, is relevant. If a document is unrelated to the topics captured by the model, the probability distribution computed by the model is approximately uniform. As a result, for n topics inferred by the topic model, their probabilities are approximately 1/n if none of the inferred topics is relevant. Conversely, if some topics are relevant for the document, their probabilities should be significantly larger than 1/n. Hence, the system 10 can define a cutoff threshold above which one can claim that a given topic is relevant for a new document. Following an empirical investigation of topic probability distributions on various data corpora, in some embodiments, the cutoff threshold can be set to 10/n, i.e., 10 times higher than the average topic probability, to achieve a reasonable tradeoff between the true positive and the false positive rates. [00046] FIG.5 is a diagram of a system 500 for performing a topic transfer operation, in accordance with some embodiments. In some embodiments, the system 500 is similar to or the same as the system 10 of FIG.1 and includes components, modules, and the like that are similar to or the same as the system 10 of FIG.1. Details of the system are therefore not repeated for brevity. [00047] In the application illustrated in FIG.5, the system 500 receives a first input from a legitimate AI forum and a second input including a pre-trained language model that is trained on a sufficient amount of the dataset so that the CTM can be used to build topic models for conversations in the legitimate AI discussion forum. The pre- trained language model provides a benefit from general language understanding when defining a topic, Accordingly, the CTM is formed from a combination of representations of the first input text using pre-trained words and/or sentence embeddings from the second input, and allows a more granular and precise understanding of documents, sentences, or words due to the advantage of using Large Language Models, i.e. pre-trained words and/or sentence embeddings. [00048] The topic transfer system 500 provides an extension of topic modeling by applying a prediction function of the CTM to assess the likelihood of the learned set of topics from the first input being present in underground forum threads. In particular, a topic transfer operation is performed where the topic model, given an unseen document from the underground forum as a third input, can compute the probability distribution for all known topics in that document, for example, described in Hendrycks, D., Gimpel, K.: A Baseline for Detecting Misclassified and Out of Distribution Examples in Neural Networks. In: ICLR (2017) incorporated by reference herein in its entirety. By looking at the possible topics, the system 500 can determine if a topic is relevant to the data through the examination of the identified topics. CTM techniques will identify the more relevant topic in the data, whether or not that topic is relevant to what is being examined. The system 500 can calculate that probability by examining the appearance of topics against all topics that are discovered. [00049] The third input may include social media posts, emails, chats, open-ended survey responses, and so on. The topic transfer system 500 can apply a simple heuristic to determine if any of the AI topics identified by the model are found in the third input, i.e., from the underground forum. Here, a cutoff threshold can be determined above which one can claim that a given topic is relevant for a new document. For example, the cutoff threshold can be set to 10/n, i.e., 10 times higher than the average topic probability, to achieve a reasonable tradeoff between the true positive and the false positive rates. [00050] As part of the topic transfer, the topic modeling extended with a new heuristic technique enables the system 500 to search for extracted topics via CTM in unseen and unrelated documents, e.g. underground forums. Here, there is no a priori assumption that any of the topics identified via CTM are present in the searched source, e.g. underground forum. Indeed, the topic transfer technique makes this essentially a search of the model topics in different data sets, and thus can distinguish between (i) related data, and thus assign it to one of the topics, and (ii) unrelated data, and thus not assign it to any topic. As a result, the system 500 can be used to identify attackers’ conversations in underground forums which are potentially related to AI topics. [00051] The topic transfer by the system 500 from one data set to another, e.g., legitimate forums to underground forums is a unique feature of the present inventive concept. It is well-known that challenges exist with defining some terms, e.g., Artificial Intelligence, or to define some topics which are of question and search for these topics in completely unrelated data sets. To address this, the system includes a CTM (e.g., 110 shown in FIG.1) to model the topics. Subsequently, the system 500 (e.g., using the topic transfer module 130 and topic relevance model 140 of FIG.1) can search for the identified topics in unrelated text. The topic transfer operation occurs given a topic model created for one data source (legitimate AI discussions), the system 500 can assess the occurrence of such topics in a different data source (underground forums), without any a priori assumption of topics present therein. Using the model output, the system 500 can then take data from various other sources, process the data, and then determine which of the precalculated topics are present in the data. Once the topics that are present have been identified, it is possible to assign labels to those topics, either through the use of human experts or automated labeling via previously derived topics and associated keywords. [00052] The system 500 of FIG.5 was used to perform several different experiments, namely, a Self-Check Experiment (SCE), a Positive Control Experiment (PCE), Negative Control Experiment (NCE), and a Search Experiment (SE). [00053] In the Self-Check Experiment (SCE), the model was applied to the training data set, where the system 500 could measure the breadth of the topic coverage on the original data set. Here, no topic transfer was performed. In the Positive Control Experiment (PCE), a model was tested on another AI forum, i.e., a topic transfer operation was performed to another AI forum. The expected outcome was that the learned topics are identified with a high true positive rate. In the Negative Control Experiment (NCE), the model was tested on another dataset unrelated to AI, i.e., a topic transfer operation was performed to a data corpus from a different domain. The expected outcome was that few matches can be found and hence a low false positive rate is obtained. In the Search Experiment (SE), the model was tested on underground forum threads, i.e., a topic transfer operation was performed to the unlabeled target domain. [00054] During the experiments, the topics for which a match was found can be claimed to be present in the threads if the control experiments (SCE, PCE, NCE) are sound. Due to the lack of ground truth for underground forums, an additional manual review may be performed of a selected number of samples, since the manual review of the entire data set is unfeasible. For each of the underground forums, a manual review was performed on 10 threads assigned to each AI topic, and 10 randomly chosen unassigned threads, not discussing any AI topic. [00055] In preparation of the experiments, four data sets were used. The first data set derived from a legitimate question-and-answer discussion forum for topics from diverse fields, referred to as Stack Exchange. Stack Exchange is a combination of two specific forums, namely, Stack Exchange Data Science with a focus on Data Science and AI, respectively. Stack Exchange was formed by extracting the data dump for both forums and combining them into one data set. The data set includes 92,576 threads and was used for the topic model training and the SCE experiment. [00056] The second data set pertains to another legitimate online discussion forum, referred to as Kaggle, that attracts users with different backgrounds in Machine Learning (ML) to participate in competitions and solve challenges. On Kaggle, individuals can download data sets and explore code with Kaggle notebooks. The extracted data consists of 242,217 threads and is used for PCE experiment. [00057] The third data set pertains to Presidential Speeches. For the NCE experiment, a data set was collected with 622 speeches given by the Presidents of the United States from 1789 to 2010. This data set is unrelated to either AI or underground forums, as required by the experimental design. [00058] For the analysis of underground forums, the fourth data set consists of 91 million posts from 32 forums, spanning both Clearnet (CN) and Darknet (DN)6 sources, resulting in 18 underground forums with a total of 7.2 million threads. The underground forums were grouped into five categories: – DN Cybercrime Forums (DN-CF): Four forums focusing on malware development, hiring a hacker, selling of malware/drugs, social engineering. – CN Cybercrime Forums (CN-CF): Seven forums focusing on hacking, exploits, requests for hackers, malware development, human manipulation, selling drugs. – CN Gaming Forums (CN-GF): Three forums focusing on hacking in games. – CN Cybercrime and Gaming Forums (CN-CGF): Two forums focusing on hacking tutorials, advanced hacking, scam, malicious software, game hacking. - Other Forums (CN-O): Two different forums, one focusing on illegal downloads of movies, etc., and the other on black hat marketing techniques. Another example may include a CrimeBB data set, described in “CrimeBB: Enabling Cybercrime Research on Underground Forums at Scale", Sergio Pastrana, Daniel R. Thomas, Alice Hutchings, and Richard Clayton. ACM The Web Conference (WWW) , Lyon, France, April 2018, incorporated by reference herein in its entirety. [00059] To improve the data quality, several preprocessing steps where performed by embodiments of a system of the present inventive concepts. [00060] During the experiments, the data analysis was performed at the thread level, since entire discussions yield more powerful topics compared to single posts. To create threads, various posts were merged into a single thread. For example, the Stack Exchange, mergers were based on post id, for Kaggle based on forum topic id, and for underground forums based on thread id. [00061] In some cases, data cleansing was performed. CrimeBB already removed the so-called “junk” data, so additional data cleansing was performed to reinforce consistency. The system 500 removed attachment information, e.g. file attachments, web URIs, etc., removed named XML or HTML tags, special characters such as hex characters, and forum- specific markup tags such as IMG, added during the CrimeBB data collection process. The system 500 also filtered out threads that (i) contain less than 50 words and (ii) more than 10,000 words, which overall accounts for 0.15% of all threads. Also, stop words were removed only for the topic word matrix creation, which defines the words belonging to a specific topic. Stop words were retrieved from several AI frameworks, and the system 500 examined their union and removed duplicates, resulting in 714 stop words. [00062] A set of Topic Model Evaluation Criteria was applied. Here, the quality of topics obtained from a text corpus was compared quantitatively and qualitatively, and taken as a reference for hyperparameter tuning. To evaluate topic quality quantitatively, topic coherence was empirically measured by means of Normalized Pointwise Mutual Information (NPMI) and topic diversity (TD). A coherent topic has words with high mutual information, which implies better interpretability by humans. TD refers to the percentage of unique words among the top k words of all topics. Overall topic quality (TQ) is measured as the product of NPMI and TD. For the qualitative evaluation of topic quality, an intuitive visual representation of the semantics of extracted topics was provided by word clouds. The words are randomly positioned in a picture, their size was chosen according to the importance of a word in a topic. To further simplify the perception of topics, a manual review was performed, and topic labels were assigned. Initial annotations for each of the 50 topics were separately conducted by three individuals based on the word clouds and subsequently discussed. In 45 out of the 50 cases, the differences only arose in the wording of the topic label. In all cases, the differences were resolved by consensus. [00063] Table 1: Evaluation of vocab parameter averaged over 10 runs. Hyper- NPMI TD TQ= Hyper- NPMI TD TQ=NPMI*MD *

increase performance. For the evaluation, The quantitative and qualitative topic quality metrics in Table 1 were introduced. Several pre-trained language models were available as input to CTM. Since topic modeling was applied to English data, the pre- training model used was a monolingual SBERT model. More precisely, the SBERT model was trained on data including Stack Exchange duplicate questions, and thus adequate for the AI context. This was confirmed based on the word clouds from the CTM output. [00065] Vocabulary size refers to the most frequent words from the documents. The topic model clusters topics based on the words included in the “vocab list,” whose length is defined by this hyperparameter. Number of topics is the number of topics the model attempts to identify. To define vocabulary size and number of topics, the number of topics were selected based on a vocabulary size of 5,000, since a higher number of vocabularies includes more words in the vocabulary list, based on which the characterizing words for each topic are selected. Second, the number of topics were fixed and evaluated the vocabulary parameter. NPMI ranges from í1.0 to 1.0; higher values correspond to higher coherence. TD close to 1.0 indicates more diverse topics. [00066] Based on Table 1 on the left, TQ is similar whether the number of topics is set to 30, 40, and 50; the latter selected as the number of topics as 50 topics provide the highest coherence and more granular topics. The word clouds were examined and observed more interpretable topics for parameter 50 which is in line with the higher NPMI. Based on Table 1 on the right, 5,000 was selected as the vocab parameter since the overall TQ is similar a higher number of vocabularies is preferred. The number of samples were collected to estimate the final distribution of topics, with better results for higher values. The default is set to 20; and this was increased to 100 to enhance the consistency of results and reduce variance. [00067] As part of topic grouping performed by the system 500, 50 topics were identified and grouped by the CTM into ten meta-topics to simplify the interpretation of results (shown in FIG.6). This step was performed manually (i.e., independently of the automated portions of the system 500) to ensure a semantic coherence of meta- topics with human judgment. At the thread level, the granularity of 50 topics is significant, since single forum threads discuss specific topics related to AI. However, to answer a research question, this level of granularity was not required. For interpretation of the results, three types of meta-topics were distinguished: (1) AI Core Topics (ڃ); (2) AI Supporting Topics (^), and (3) AI Surrounding Topics (Ƒ). AI Supporting Topics are less concerned about the statistical perspective of algorithms such as, for example, loss function optimization, but rather focus on the application of AI and respective use cases. [00068] The complete quantitative results of the experimental evaluation are summarized in FIG.7 showing the results of all four experiments described above. For each experiment (table column, multiple forum categories for SE), the coverage rate, i.e., the percentage of threads with the identified AI meta-topics, was presented (in the respective table rows). The results of the specific experiments are further discussed as follows. [00069] Referring again to the SCE, the main outcome of the SCE is that the 50 topics derived from the Stack Exchange data set appear in approximately 50% of its threads. Stack Exchange forums are very heterogeneous and are likely to contain more than 50 topics. Hence, it is not surprising that only about half of this data corpus is covered by a 50-topic model. A thread can remain without a topic assignment if (i) it discusses some topics unrelated to AI9, (ii) it discusses some AI-related topics which are different from the ones selected during model training, or (iii) it is weakly related to a large number of learned topics and hence none of them passes the likelihood threshold. A characteristic example for the latter case is a thread with a topic probability of 18.70% for Data Input Formatting, meta-topic Data Preparation, which does not exceed the threshold. A weak discussion about data preparation is present, but due to the focus on the manual part, it is correctly denoted as unclassified. [00070] It is also interesting to see that discussions on Stack Exchange have a substantial focus on technical topics, either related to core AI or the supporting infrastructure. Topics related to “surrounding” questions, such as support requests and human-technology interaction, are not very prominent on Stack Exchange. [00071] A goal of the PCE is to demonstrate that a topic model attains similar coverage rates on a different AI-related data set. FIG.7 reveals that 28.91% of threads in Kaggle discuss at least one of the 50 topics learned from Stack Exchange. The difference in coverage rates is in line with expectations. Even though Stack Exchange and Kaggle are both platforms for AI-related discussions, these data sets also have some inherent differences. Kaggle’s focus lies on AI challenges and competitions, and it is hence not surprising that a large share of threads on Kaggle was not among the 50 topics derived from Stack Exchange. Also, the difference in coverage rates of specific meta-topics highlights the diversity of these two forums. For example, the coverage rates of discussions on Python/AI Setup and Support Requests are higher on Kaggle than on Stack Exchange. [00072] With further regard to the NCE, the coverage rates in the NCE can be interpreted as false positive rates, and hence expected to be low. Results show that “AI-related topics” are found in about 5% of Presidential Speeches. These hits were mainly on meta-topics Human-Technology Interaction (4.82%), and very few on Learning Algorithms (0.32%), the topic Reinforcement Learning to be precise. [00073] A manual examination revealed some correlations between the vocabulary of Presidential Speeches and certain AI-related topics. For example, such words as world, people, human, law, life, and system are part of the topic Human- Technology Interaction and obviously not uncommon in Presidential Speeches. The two hits on the topic Reinforcement Learning, barely exceeding the 20% threshold, contained such words as policy, action, and reward. [00074] The main finding of the experiments are the results of the Search Experiment (SE) described above. FIG.7 illustrates that the coverage rates for AI- related topics in underground forums range between 10% and 20%. This reveals that such topics have, in general, some significance; certainly, less than in legitimate AI forums but also certainly more than in the NCE. Besides the two general topics Support Request and Human-Technology Interaction, a large share of discussions focus on using or learning AI techniques, such as Python/AI Setup (20.94%), and AI Education (8.90%), followed by discussions on AI Core Topics like Learning Algorithms (1.61%), Data Preparation (0.91%), NLP (0.64%) and Model Training, Tuning and Evaluation (0.12%). These findings are further elucidated according to the manual investigation of selected threads corresponding to all found meta-topics10. AI Core Topics. In underground forums, observed was a relatively small percentage of threads associated with AI Core Topics. Still, the identified threads for AI Core Topics cover a large spectrum of AI techniques. [00075] For the meta-topic Learning Algorithms, observed were threads in DN- CF, referring to the use of experimentation techniques which help researchers in testing performance and discovery of related security problems, specifically highlighting Hidden Markov Modeling. Threads from CN-CF discuss Regression Analysis, Binary Tree Search, Linear Tree Search, Genetic Algorithms, Q-learning, and Reinforcement Learning. The discussions are very precise, e.g., about the difference between Q-Learning and Temporal Difference Learning in the context of Reinforcement Learning. On the other hand, threads on gaming forums discuss path- finding algorithms, such as Aכ, Dijkstra’s Algorithm, Bresenham’s Line Algorithm, etc. In some cases, the discussions even include shared Python code. NLP. In this category, observed were many threads discussing password cracking and dictionary lists, and a strong interest in the word semantics. As an example, a user was looking for “good” dictionary lists, which are commonly used for password brute-force attacks. In another thread, someone offers to sell the Oxford English Dictionary as an XML file. This thread advertises that the dictionary contains 290,000 words and compares it to the well-known WordNet lexical database, which has 150,000 words. In other threads, the use of statistical algorithms for password cracking was discussed, with one thread referring to a GitHub repository related to “Probable Wordlists” Also observed was a high interest in open-source intelligence, e.g., for phishing as well as (ab)use of tools designed for defensive purposes. As NLP can be instrumental for crafting phishing content, it is not surprising to see NLP discussions related to phishing kits in cybercrime forums. Some threads assigned to NLP discuss illegitimate Search Engine Optimization (SEO) techniques used to improve search engine ranking based on content analysis of sites. Users discuss practices such as keyword usage, keyword bloating, keyword density, and TF-IDF. [00076] Overall, it was identified that adversaries are mostly interested in NLP techniques when it comes to AI Core Topics. Even if Learning Algorithms is not a prevalent topic in underground forums in general, based on the above examples, it can be deduced that underground forums are still a place to seek answers in regard to AI Core Topics. This is also confirmed by results presented below in the section regarding AI Education. AI Supporting Topics. Data Preparation, AI Education, and Python/AI Setup are categorized as topics supporting AI. [00077] Data Preparation is a critical step for applying many AI techniques. The experiments produced clear examples of interest in data analytics among discussion threads, and threads mostly fall into this category as they focus on preliminary data processing. Threads were found related to logarithmic scaling, vectorization, dimensionality reduction, and transformation through matrix multiplication. Overall, discussions on Data Preparation are mostly related to three main areas: (i) formatting issues, in particular, in Python; (ii) explanations on the handling of data in a certain mathematical way; and (iii) regex functions to identify patterns. Many of these discussions refer to the handling of data dumps for malicious purposes, preparing data for phishing campaigns or specific code injection attacks. This reveals attackers’ activities related to handling leaked data, raising the question of what adversaries will do next with the preprocessed data dumps. Interestingly, when a question classified as Data Preparation is too specific, e.g., addresses Python coding related to data preprocessing, users are often referred to Stack Overflow. [00078] With regard to AI Education, there are four topics in this category: Skill Requirements/ Learning, AI Resources, Object Detection (AI Toy Problem), and Demand Forecasting (AI Toy Problem). Object Detection and Demand Forecasting are common toy problems referred to by AI tutorials to illustrate relevant Python libraries. However, even for such elementary problems one may observe characteristic traits of offensive intent. For instance, in one conversation tagged as Object Detection, a user wants to apply color and image recognition to detect an item and click on it, further revealing that she wants to gain this understanding to build a bot. In another thread, a user asks about circumventing image plagiarism detection and inquires if this is achievable with AI. [00079] Discussions on AI-related programming languages and on how to learn these languages are also prevalent in underground forums. A common recommendation for implementation of various hacking techniques is Python, and specifically Luna for game hacking. These discussions often go beyond programming languages, e.g., users inquire about AI usage in general, discuss specific AI techniques, and ask about easy-to-use AI libraries, such as numpy, pandas, keras, pytorch, scikit-learn, etc. Also, the use of CPU vs. GPU for specific algorithms and libraries was compared. Similarly to the topic Data Preparation, observed are users referring others to the Clearnet (CN) if they think the questions are too AI-specific. [00080] During the experiments, some users asked for advice on AI and hacking techniques in the same thread. One user, for instance, is looking for general advice on Neural Networks, rootkits, making money online, and hiding online presence. These terms clearly reveal an offensive purpose behind the intended use of AI. Some users also share knowledge about certain AI techniques, e.g., Google’s research on determining the location of any image, weather forecasting using Time Series and Neural Networks, and cheating in online games using Neural Networks. An example of AI-assisted online game cheating from the threads is AimBot, which uses object detection to automatically identify a target and to point the respective weapon used in the game at this target. [00081] Additionally, observed was a large number of posts looking for skilled developers specializing in AI. Python programming, Big Data and AI are the most common requirements in such threads. In some cases. users express explicit interest in hiring individuals with proficiency in Data Mining, NLP, and Database Management, with an additional required knowledge of the Darknet (DN). Python/AI Setup. The meta-topic Python/AI Setup often appears in underground forums in connection with the Python IDA, various Python packages, AI library issues as well as system setup for AI development. Similarly, there were discussions on setting up or configuring the system environment for various hacking purposes, not necessarily in a clear context of AI. [00082] Surrounding topics often occur in the context of AI discussions but do not directly address AI techniques. Examples of such surrounding topics are threads in which users ask for help or recommendations, or discuss the impact of technology on society. Support Request. For CN-CGF, observed was a very high occurrence of Support Requests. The support requests on the DN and CN forums are of diverse nature, e.g., getting access to someone’s website, hacking someone’s account, asking about the correct link to another DN forum, or phishing. For CN-GF, the discussions are more related to attacking gaming servers or asking about specific hacks. Also observed were a large number of topics related to the setup of virtual environments, as well as to Python related topics such as code requirements, library updates, and installation. Threads on this topic primarily discuss the threat of AI to society from different viewpoints, e.g. the role and the impact of robots, or whether robots can outsmart humans with their intelligence. [00083] In sum, the finding from these experiments revealed a substantial presence of discussions on AI in the 18 underground forums that were analyzed. Based on the topic modeling applied to the legitimate forum Stack Exchange, 50 topics were extracted by the system 500 of FIG.5 and grouped into 10 meta-topics, which were further categorized into Core AI, Supporting AI, and Surrounding AI. [00084] The majority of topic matches that were discovered in the underground forums were concerned with Surrounding AI subjects, especially with the meta-topic Support Requests. Support requests are a natural element of any discussion forum, and hence it comes as no surprise that such topics show up in the topic model created over legitimate forums. It is, however, quite striking that the share of Support Requests discussions in underground forums was almost 10 times larger than in legitimate forums (underground forums: 41.71%, Stack Exchange: 0.84%, Kaggle: 3.72%). This fact may suggest that in general participants in underground forums have lower professional skills compared to those in legitimate forums. Supporting AI topics appear to be the second most important in underground forums, whereas the Core AI topics are the least prominent. In contrast, the majority of discussions in legitimate AI- related forums clearly revolve around the Core AI topics. This observation reveals that the focus of offensive AI is not to develop new AI methods but rather to reuse existing ones for malicious goals. To achieve their goal at the lowest possible cost – including the least intellectual effort – the attackers’ best strategy is to build appropriate infrastructure and learn how to use existing AI tools. These are precisely the main topics observed in the Supporting AI category. Still, it was also observed attackers asking for AI services, which might be an indicator of an “as-a-service” demand without any yet-established offerings in the underground economy. [00085] In the experiments above, underground forums were studied with regard to attackers’ inclination to abuse AI. The systems and methods in accordance with embodiments of the present inventive concept have demonstrated the high potential for the deployment of advanced NLP techniques such as CTM to analyze underground forums. Further, CTM has been extended with a topic transfer heuristic to make it applicable to previously unseen documents from different domains. This technique has a strong potential of becoming a viable tool in a much broader scope of threat intelligence. As more and more data can be put to service for security-related research, finding intrinsic semantic relations in such data can improve the understanding of complex technical artifacts and operational practices underlying modern attacks. [00086] With the ongoing proliferation of AI in all spheres of modern society, AI’s dual-use potential becomes ever apparent. In cybersecurity, AI has proven to be a viable defense instrument; however, it is also believed to have an equally strong potential for being abused. As described herein, clear signs were observed in underground forums that the offensive side is demonstrating increasing interest in AI. Even though the technical skill level in the “offensive AI community” currently seems to be inferior to that of legitimate AI users, concerns should arise, in particular, with respect to the potential re-purposing of open-source AI tools for nefarious goals. [00087] While the invention has been described with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made, and equivalents may be substituted for elements thereof to adapt to particular situations without departing from the scope of the disclosure. Therefore, it is intended that the claims are not limited to the particular embodiments disclosed, but that the claims will include all embodiments falling within the scope and spirit of the appended claims.

Claims

What is claimed is: 1. A method for electronic searching for topics in a text corpus, comprising: applying a topic model technique to a first set of data from a first data source; capturing, by the topic model, at least one topic from the first set of data; receiving a second set of data from a second data source unrelated to the first data source; searching for topics of interest in the second set of data, including: applying a prediction function to generate a probability distribution of the at least one topic from the first set of data is present in the second set of data; and applying a heuristic to define a threshold for determining a topic relevance in response to analyzing the probability distribution.

2. The method of claim 1, wherein the topic model technique is a Contextualized Topic Modeling (CTM) technique.

3. The method of claim 1, wherein the topic model technique builds a topic model for conversations in the first data source, and wherein applying the heuristic extends the topic model to search for unseen and unrelated documents in the second data source.

4. The method of claim 3, wherein the first data source includes an artificial intelligence (AI)-related legitimate forum

5. The method of claim 3, wherein the second data source includes an underground forum.

6. The method of claim 1, wherein the threshold is set to 10 times higher than an average topic probability.

7. The method of claim 1, wherein if the second set of data includes topics that are unrelated to the topics captured by the model, the probability distribution computed by the model is uniform.

8. A method for deriving topics from known textual sources to apply those models to unknown textual sources, the method comprising steps of: accepting a first input from a first data source that has data related to at least one specified subject, thereby producing a first accepted input; grouping the accepted first input into first groups using a language model; labeling the first groups with identifiers; receiving a second input from a second data source; searching the accepted second input using a language model, thereby producing searched second input; identifying a first group for at least some of the searched second input; creating at least one new group for any searched second input that was not identified with a first group.

9. The method of claim 8, wherein the topic model technique is a Contextualized Topic Modeling (CTM) technique.

10. The method of claim 8, wherein the topic model technique builds a topic model for conversations in the first data source, and wherein applying the heuristic extends the topic model to search for unseen and unrelated documents in the second data source.

11. The method of claim 10, wherein the first data source includes an artificial intelligence (AI)-related legitimate forum 12. The method of claim 10, wherein the second data source includes an underground forum.

12. The method of claim 8, wherein the threshold is set to 10 times higher than an average topic probability.

13. The method of claim 8, wherein if the second set of data includes topics that are unrelated to the topics captured by the model, the probability distribution computed by the model is uniform.