Niloth-p/Search-Engine
Implementation of a text-based information retrieval system - a domain-specific search engine. The search module is built according to the Vector Space model of Information Retrieval. The program builds the index of all the files, then retrieves the top 10 documents (if any) relevant to the queries input by the user. Ranked retrieval is performed using the tf-idf scoring model.

USAGE:
Command to extract .txt files from all the .html files in the corpus:

    python CommonExtractor.py

Before executing the command, create an empty directory 'ExtractedText' within the same folder as the CommonExtractor.py program.

Command to index all the files in the folder where you want to use the search engine:

    python Indexer.py

The indexer program needs to be run just once for a file system.

Command that gets one or more queries from the user and returns the top 10 relevant documents:

    python runner.py

The runner program needs to be run every time a query is processed. The results of the latest query to the search engine are displayed and stored in a file named DocNames.txt.

Single- and multiple-term queries are handled using the tf-idf indexes. Phrase queries are yet to be implemented, although the postings list construction is already done. (Hedged sketches of the extraction, indexing, and file-writing steps appear after the dataset section below.)

PACKAGES REQUIRED:
The packages os, re, pickle, collections, math, nltk, urllib, and bs4 are used in this program. To install any required package, type

    pip install <package-name>

FILES IN THE REPOSITORY:
IR Assignment-3.pdf - the document describing the problem given as the assignment for the Information Retrieval course
DesignDoc.docx - the design document
stopwords.dat - the stopwords list data file

FOLDER 'Generated':
Files generated by the program during runtime, attached for reference:
postingslist.txt - the postings list stored as an object using pickle
postingslistreadable.txt - the postings list, for our reference
idf.txt - the idf index stored as an object using pickle
idfreadable.txt - the idf index, for our reference
tf.txt - the tf index stored as an object using pickle
tfreadable.txt - the tf index, for our reference
ranks.txt - the ranks list of the latest query, stored as an object using pickle
ranksreadable.txt - the ranks list of the latest query, for our reference
scores.txt - the scores dictionary of the latest query, stored as an object using pickle
scoresreadable.txt - the scores dictionary of the latest query, for our reference
The generation of these files can be turned off by commenting out the lines calling writeToFile() and writeToHumanReadableFile().

MY DATASET:
I have used the 'Wiki Small' corpus from the webpage of the book "Search Engines: Information Retrieval in Practice": http://www.search-engines-book.com/collections/
The corpus was created from a snapshot of the English Wikipedia downloaded from static.wikipedia.org in September 2008. It contains 6043 articles as .html files, which I extracted as .txt files. The .html files are categorized by the consecutive letters of their filenames, but in my ExtractedText folder I have removed the categorization and collected all the files together.
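For reference, a minimal sketch of what the .html-to-.txt extraction step could look like, assuming bs4's BeautifulSoup is used to strip the markup (bs4 appears in the package list above). The function name extract_all, the directory-walking logic, and the source directory name WikiSmall are illustrative assumptions, not the actual code of CommonExtractor.py.

    import os
    from bs4 import BeautifulSoup

    def extract_all(src_dir, out_dir="ExtractedText"):
        # walk the corpus and convert every .html file to plain text
        for root, _, files in os.walk(src_dir):
            for name in files:
                if not name.endswith(".html"):
                    continue
                path = os.path.join(root, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    soup = BeautifulSoup(f.read(), "html.parser")
                # get_text() drops all tags, leaving only the article text
                text = soup.get_text(separator=" ")
                out_name = os.path.splitext(name)[0] + ".txt"
                with open(os.path.join(out_dir, out_name), "w", encoding="utf-8") as f:
                    f.write(text)

    extract_all("WikiSmall")  # hypothetical corpus directory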
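Likewise, a hedged sketch of tf-idf index construction in the spirit of Indexer.py, assuming the tf index maps each term to its per-document frequency and idf uses the standard log(N/df) weighting; the identifier build_index is an illustrative assumption.

    import math
    import os
    import pickle
    from collections import Counter, defaultdict

    def build_index(direc):
        tf = defaultdict(dict)   # term -> {document name: term frequency}
        df = Counter()           # term -> number of documents containing it
        docs = os.listdir(direc)
        for doc in docs:
            with open(os.path.join(direc, doc), encoding="utf-8", errors="ignore") as f:
                tokens = f.read().lower().split()
            for term, freq in Counter(tokens).items():
                tf[term][doc] = freq
                df[term] += 1
        # rarer terms get higher idf weights
        idf = {term: math.log10(len(docs) / df[term]) for term in df}
        return tf, idf

    tf, idf = build_index("ExtractedText")
    with open("tf.txt", "wb") as f:
        pickle.dump(tf, f)
    with open("idf.txt", "wb") as f:
        pickle.dump(idf, f)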
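The dual-write pattern behind the generated files above, in which each structure is pickled for reuse and also dumped as readable text, might look roughly like this; the function names writeToFile and writeToHumanReadableFile come from this README, but their bodies here are assumptions.

    import pickle
    import pprint

    def writeToFile(obj, filename):
        # binary pickle, reloaded by later runs of the programs
        with open(filename, "wb") as f:
            pickle.dump(obj, f)

    def writeToHumanReadableFile(obj, filename):
        # pretty-printed copy kept purely for manual inspection
        with open(filename, "w", encoding="utf-8") as f:
            f.write(pprint.pformat(obj))

    writeToFile({"retrieval": 2}, "tf.txt")
    writeToHumanReadableFile({"retrieval": 2}, "tfreadable.txt")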
IMPLEMENTATION ON OTHER DATASETS:
If you need to extract .txt files from .html files: in CommonExtractor.py, change the directory variable within main to the suitable directory name, and if you are using Linux, manage the / and \ path separators suitably in the traverse function. Create the directory ExtractedText if you are about to use CommonExtractor.py.
Create a directory containing all the files to be searched, and change the global variable 'direc' in Indexer.py and runner.py to the name of your directory. Change the value of N in runner.py to the number of documents in your dataset.

RUNNING TIME:
CommonExtractor.py - approx. 217 seconds (for 6043 articles)
Indexer.py - approx. 443 seconds
runner.py - under 1 second, varies for each query

PRECISION: varies for each query, and is displayed at the end of the program (the standard definitions are sketched after the query example below)
RECALL: varies for each query, and is displayed at the end of the program

USER INPUT:
The user can input single- or multiple-term queries in runner.py, but phrase queries have not yet been handled. A sketch of how such a query could be scored follows.
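A minimal sketch of how the ranked retrieval in runner.py could score a free-text query against the pickled tf and idf indexes and return the top 10 documents. The index file names follow the list above; the exact scoring formula (summing tf * idf per matching term) is an assumption.

    import pickle
    from collections import defaultdict

    with open("tf.txt", "rb") as f:
        tf = pickle.load(f)    # term -> {document name: term frequency}
    with open("idf.txt", "rb") as f:
        idf = pickle.load(f)   # term -> inverse document frequency

    def top10(query):
        scores = defaultdict(float)
        for term in query.lower().split():
            for doc, freq in tf.get(term, {}).items():
                scores[doc] += freq * idf.get(term, 0.0)
        # highest-scoring documents first, truncated to ten
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]

    for doc, score in top10(input("Enter query: ")):
        print(doc, score)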
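For reference, the standard precision and recall definitions; this README does not say how the set of truly relevant documents is obtained, so the relevant set in the example below is hypothetical.

    def precision(retrieved, relevant):
        # fraction of retrieved documents that are relevant
        return len(set(retrieved) & set(relevant)) / len(retrieved)

    def recall(retrieved, relevant):
        # fraction of relevant documents that were retrieved
        return len(set(retrieved) & set(relevant)) / len(relevant)

    retrieved = ["doc1.txt", "doc2.txt", "doc3.txt"]   # hypothetical top results
    relevant = ["doc2.txt", "doc4.txt"]                # hypothetical judgments
    print(precision(retrieved, relevant))  # 0.333...
    print(recall(retrieved, relevant))     # 0.5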