GitHub - Niloth-p/Search-Engine: Implementation of a text based information retrieval system - a domain specific search engine, according to the Vector Space model. Ranked retrieval uses tf-idf scoring.

MIT license

Implementation of a text-based information retrieval system: a domain-specific search engine.
The search module is built according to the Vector Space model of Information Retrieval.
The program builds an index of all the files, then retrieves the top 10 documents (if any) relevant to the queries entered by the user.
Ranked retrieval uses the tf-idf scoring model.
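
As an illustration of tf-idf ranked retrieval, here is a minimal sketch. It assumes log-scaled term frequency, idf = log10(N / df), and a plain sum of weights over the query terms with no length normalization; the weighting actually used in Indexer.py and runner.py may differ. The names build_index and rank are hypothetical and do not come from the repository.

```python
import math
from collections import Counter


def build_index(docs):
    """docs maps a document name to its list of tokens (assumed input shape)."""
    # Term frequencies per document.
    tf = {name: Counter(tokens) for name, tokens in docs.items()}
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    n_docs = len(docs)
    idf = {term: math.log10(n_docs / df_t) for term, df_t in df.items()}
    return tf, idf


def rank(query_tokens, tf, idf, top_k=10):
    """Return up to top_k (document, score) pairs, best score first."""
    scores = {}
    for name, counts in tf.items():
        score = sum(
            (1 + math.log10(counts[t])) * idf[t]
            for t in query_tokens
            if t in idf and counts[t] > 0
        )
        if score > 0:
            scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


docs = {
    "a.txt": ["cat", "sat", "cat"],
    "b.txt": ["dog", "sat"],
    "c.txt": ["bird"],
}
tf, idf = build_index(docs)
print(rank(["cat"], tf, idf))
```

Documents with a score of zero are dropped, so a query can return fewer than 10 results or none at all.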


Command to extract .txt files from all the .html files in the corpus:
python CommonExtractor.py
Before executing the command, create an empty directory 'ExtractedText'
in the same folder as the CommonExtractor.py program.
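
To show what the extraction step does conceptually, here is a minimal sketch that turns an HTML string into plain text. The repository uses bs4 for this; the sketch swaps in the standard library's html.parser so that it runs without extra packages, and the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
print(html_to_text(page))
```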

Command to index all the files in the folder where you want to use the search engine:
python Indexer.py
The indexer program needs to be run just once for a file system.

Command to run the program that takes one or more queries from the user and returns the top 10 relevant documents:
python runner.py
The runner program needs to be run every time a query needs to be processed.

The results of the latest query to the search engine are displayed and also stored in a file named DocNames.txt

Single-term and multiple-term queries are handled using the tf-idf indexes.
Phrase queries are not yet implemented, although the postings list is already built.
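
For reference, here is a minimal sketch of how a phrase query could be answered from a postings list. It assumes the postings list records the position of every token in every document, which this README does not state; build_postings and phrase_match are hypothetical names, not functions from the repository.

```python
from collections import defaultdict


def build_postings(docs):
    """docs maps a name to its token list; returns term -> {doc: [positions]}."""
    postings = defaultdict(dict)
    for name, tokens in docs.items():
        for pos, term in enumerate(tokens):
            postings[term].setdefault(name, []).append(pos)
    return postings


def phrase_match(phrase_tokens, postings):
    """Return the sorted names of documents containing the tokens consecutively."""
    if not phrase_tokens or any(t not in postings for t in phrase_tokens):
        return []
    hits = []
    for name, starts in postings[phrase_tokens[0]].items():
        for start in starts:
            # Every later token must appear at the matching offset.
            if all(
                start + i in postings[t].get(name, ())
                for i, t in enumerate(phrase_tokens[1:], start=1)
            ):
                hits.append(name)
                break
    return sorted(hits)


docs = {"a.txt": ["new", "york", "city"], "b.txt": ["york", "new", "city"]}
print(phrase_match(["new", "york"], build_postings(docs)))
```

The position lookup is a linear scan, which keeps the sketch short; a larger corpus would want sorted position lists merged in step.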


PACKAGES REQUIRED:

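The program uses the packages os, re, pickle, collections, math, urllib, nltk and bs4.
os, re, pickle, collections, math and urllib are part of the Python standard library;
nltk and bs4 must be installed separately (bs4 is distributed on PyPI as beautifulsoup4).
To install a required package, type
pip install <package-name>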


FILES IN THE REPOSITORY:

IR Assignment-3.pdf - the problem statement given as the assignment for the Information Retrieval course
DesignDoc.pdf - the design document
stopwords.dat - the data file containing the list of stopwords

Folder GENERATED:
Files generated by the program during runtime, attached for reference:
postingslist.txt - the postingslist stored as an object using pickle
postingslistreadable.txt - the postingslist for our reference
idf.txt - the idf index stored as an object using pickle
idfreadable.txt - the idf index for our reference
tf.txt - the tf index stored as an object using pickle
tfreadable.txt - the tf index for our reference

ranks.txt - the ranks list stored as an object using pickle	(of the latest query)
ranksreadable.txt - the ranks list for our reference (of the latest query)
scores.txt - the scores dictionary stored as an object using pickle (of the latest query)
scoresreadable.txt - the scores dictionary for our reference (of the latest query)

The generation of these files can be turned off by commenting out the calls to
writeToFile() and writeToHumanReadableFile()
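
As a rough illustration of the two kinds of output files, here is a minimal sketch of a pickle writer and a human-readable writer. The snake_case functions below are hypothetical stand-ins; the bodies and signatures of the repository's writeToFile() and writeToHumanReadableFile() are not shown in this README and may differ.

```python
import pickle


def write_to_file(obj, path):
    """Store obj as a pickle so a later run can reload it unchanged."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def write_to_human_readable_file(mapping, path):
    """Write one 'key<TAB>value' line per entry, sorted by key."""
    with open(path, "w", encoding="utf-8") as fh:
        for key in sorted(mapping):
            fh.write(f"{key}\t{mapping[key]}\n")


def read_from_file(path):
    """Reload a pickled object; only use this on files you created yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

The pickle file is what a later run would reload; the readable file is only for inspection.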


MY DATASET:

I have used the corpus 'Wiki Small' from the webpage of the book
"Search Engines: Information Retrieval in Practice":
http://www.search-engines-book.com/collections/
The corpus was created from a snapshot of the English Wikipedia downloaded from static.wikipedia.org in September 2008.
The corpus contains 6043 articles as .html files, which I extracted as .txt files.
The .html files are categorized by the successive letters of their filenames,
but in my ExtractedText folder, I have removed this categorization and collected all the files together.


IMPLEMENTATION ON OTHER DATASETS:

To run the search engine on another dataset:

If you need to extract .txt files from .html files:
In CommonExtractor.py, change the directory variable in main to the name of your directory.
If you are using Linux, adjust the path separators (/ and \) in the traverse function accordingly.
Create the directory ExtractedText before running CommonExtractor.py.
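
One way to avoid editing path separators by hand is to build paths with os.path.join and walk the corpus with os.walk. The sketch below is an assumption about how such a traversal could look, not the repository's traverse function; find_html_files and output_path are hypothetical names.

```python
import os


def find_html_files(root):
    """Yield the path of every .html file below root, on any operating system."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if filename.lower().endswith(".html"):
                yield os.path.join(dirpath, filename)


def output_path(html_path, out_dir="ExtractedText"):
    """Map an .html path to a flat .txt path in out_dir, creating out_dir if needed."""
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(html_path))[0]
    return os.path.join(out_dir, stem + ".txt")
```

Because the output folder is flat, two .html files with the same name in different subfolders would map to the same .txt file.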

Create a directory containing all the files to be searched, and
change the global variable 'direc' in Indexer.py and runner.py to the name of that directory.
Change the value of N in runner.py to the number of documents in your dataset.
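
If you do not know the number of documents offhand, a short sketch like the following can count them. It assumes the documents are .txt files placed directly inside the directory; count_documents is a hypothetical helper and is not part of runner.py.

```python
import os


def count_documents(direc, extension=".txt"):
    """Count the files directly inside direc whose names end with extension."""
    return sum(
        1
        for entry in os.scandir(direc)
        if entry.is_file() and entry.name.lower().endswith(extension)
    )
```

The returned number is the value to assign to N.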


RUNNING TIME :
	CommonExtractor.py : approx 217 seconds (for 6043 articles)
	Indexer.py : approx 443 seconds
	runner.py : <1s, varies for each query
PRECISION : varies for each query, and is displayed at the end of the program
RECALL : varies for each query, and is displayed at the end of the program
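
For reference, precision and recall for a single query can be computed as in the minimal sketch below. It assumes a known set of relevant documents for the query; how runner.py decides relevance is not described in this README, and precision_recall is a hypothetical name.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / retrieved
    recall    = relevant retrieved / relevant
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


print(precision_recall(["a", "b", "c", "d"], ["a", "c", "e"]))
```

Both values are defined as 0.0 when their denominator is empty, which is a convention chosen for this sketch.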


USER INPUT:
The user can enter single-term or multiple-term queries in runner.py,
but phrase queries are not yet handled.
