Logic.core.utility package


Logic.core.utility.crawler module#

class IMDbCrawler(crawling_threshold=1000)#

Bases: object

Put your own User-Agent string in the headers before crawling.

crawl(URL)#

Make a get request to the URL and return the response

Parameters:

URL (str) – The URL of the site

Returns:

The response of the get request

Return type:

requests.models.Response

crawl_page_info(URL)#

Main logic of the crawler. It crawls the page, extracts the information of the movie, and uses the related links of the movie to crawl more movies.

Parameters:

URL (str) – The URL of the site

extract_movie_info(res, movie, URL)#

Extract the information of the movie from the response and save it in the movie instance.

Parameters:
  • res (requests.models.Response) – The response of the get request

  • movie (dict) – The instance of the movie

  • URL (str) – The URL of the site

extract_top_250()#

Extract the top 250 movies from the top 250 page and use them as seed for the crawler to start crawling.

get_budget()#

Get the budget of the movie from box office section of the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The budget of the movie

Return type:

str

get_countries_of_origin()#

Get the countries of origin of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The countries of origin of the movie

Return type:

List[str]

get_director()#

Get the directors of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The directors of the movie

Return type:

List[str]

get_first_page_summary()#

Get the first page summary of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The first page summary of the movie

Return type:

str

get_genres()#

Get the genres of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The genres of the movie

Return type:

List[str]

get_gross_worldwide()#

Get the gross worldwide of the movie from box office section of the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The gross worldwide of the movie

Return type:

str

get_id_from_URL(URL)#

Get the id from the URL of the site. The id is the path segment that comes directly after title. For example, the id for the movie https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1 is tt0111161.

Parameters:

URL (str) – The URL of the site

Returns:

The id of the site

Return type:

str
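The id extraction described above can be sketched with a regular expression. This is only an illustration, not the package's implementation; the function name get_id_from_url and the choice to raise ValueError on a non-matching URL are assumptions.

```python
import re


def get_id_from_url(url: str) -> str:
    """Return the IMDb title id (for example 'tt0111161') that follows '/title/'."""
    match = re.search(r"/title/(tt\d+)", url)
    if match is None:
        # Assumption: a URL without a title id is treated as an error.
        raise ValueError(f"no IMDb title id found in {url!r}")
    return match.group(1)


print(get_id_from_url("https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1"))
```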

get_imdb_instance()#
get_languages()#

Get the languages of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The languages of the movie

Return type:

List[str]

get_mpaa()#

Get the MPAA of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The MPAA of the movie

Return type:

str

get_rating()#

Get the rating of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The rating of the movie

Return type:

str

Get the related links of the movie from the More like this section of the page from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The related links of the movie

Return type:

List[str]

get_release_year()#

Get the release year of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The release year of the movie

Return type:

str

Get the link to the review page of the movie. Example: if https://www.imdb.com/title/tt0111161/ is the movie page, then https://www.imdb.com/title/tt0111161/reviews is the review page.

get_reviews_with_scores()#

Get the reviews of the movie from the soup. Reviews structure: [[review, score]]

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The reviews of the movie

Return type:

List[List[str]]

get_stars()#

Get the stars of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The stars of the movie

Return type:

List[str]

get_summary()#

Get the summary of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The summary of the movie

Return type:

List[str]

Get the link to the summary page of the movie. Example: if https://www.imdb.com/title/tt0111161/ is the movie page, then https://www.imdb.com/title/tt0111161/plotsummary is the summary page.

Parameters:

url (str) – The URL of the site

Returns:

The URL of the summary page

Return type:

str
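Both the summary link and the review link are the movie URL with a different final path segment, so one helper can build either. The sketch below assumes the URL layout shown in the examples; the name build_subpage_link is hypothetical and not part of the package.

```python
def build_subpage_link(url: str, subpage: str) -> str:
    """Build the URL of a movie subpage such as 'plotsummary' or 'reviews'."""
    # Take the segment right after '/title/' and drop any trailing path or query.
    movie_id = url.split("/title/")[1].split("/")[0]
    return f"https://www.imdb.com/title/{movie_id}/{subpage}"


print(build_subpage_link("https://www.imdb.com/title/tt0111161/", "plotsummary"))
print(build_subpage_link("https://www.imdb.com/title/tt0111161/", "reviews"))
```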

get_synopsis()#

Get the synopsis of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The synopsis of the movie

Return type:

List[str]

get_title()#

Get the title of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The title of the movie

Return type:

str

get_writers()#

Get the writers of the movie from the soup

Parameters:

soup (BeautifulSoup) – The soup of the page

Returns:

The writers of the movie

Return type:

List[str]

headers = {'User-Agent': None}#
read_from_file_as_json()#

Read the crawled files from json

start_crawling()#

Start crawling the movies until the crawling threshold is reached.

Todo:
  • Replace WHILE_LOOP_CONSTRAINTS with the proper constraints for the while loop.

  • Replace NEW_URL with the new URL to crawl.

  • Replace THERE_IS_NOTHING_TO_CRAWL with the condition to check if there is nothing to crawl.

  • Delete help variables.

ThreadPoolExecutor is used to make the crawler faster by crawling pages on multiple threads. You are free to use it or not. If you use it, remember to make access to the shared resources thread-safe.
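One possible shape for a threshold-bounded, multi-threaded crawl loop is sketched below. It is not the package's implementation: FAKE_LINKS is a made-up link graph that stands in for fetching and parsing real pages so the example runs offline, and crawl_all, crawl_one and the variable names are hypothetical. A single Lock guards every shared structure.

```python
from collections import deque
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from threading import Lock

# Hypothetical link graph used instead of real HTTP requests.
FAKE_LINKS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a", "e"],
    "d": [],
    "e": ["f"],
    "f": [],
}


def crawl_all(seeds, threshold, max_workers=4):
    lock = Lock()               # guards added, not_crawled and crawled
    added = set(seeds)          # every id that was ever queued
    not_crawled = deque(seeds)  # ids waiting to be crawled
    crawled = []                # ids that have been crawled

    def crawl_one(page_id):
        links = FAKE_LINKS.get(page_id, [])  # stand-in for fetch + parse
        with lock:
            crawled.append(page_id)
            for link in links:
                if link not in added:
                    added.add(link)
                    not_crawled.append(link)

    submitted = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = set()
        while True:
            with lock:
                # Submit queued pages until the threshold is reached.
                while not_crawled and submitted < threshold:
                    pending.add(pool.submit(crawl_one, not_crawled.popleft()))
                    submitted += 1
            if not pending:
                break  # nothing running and nothing left to submit
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any exception from a worker
    return crawled


print(sorted(crawl_all(["a"], threshold=100)))
```

Counting submitted tasks, rather than finished ones, keeps the loop from overshooting the threshold while workers are still running.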

top_250_URL = 'https://www.imdb.com/chart/top/'#
write_to_file_as_json()#

Save the crawled files into json

main()#

Logic.core.utility.evaluation module#

class Evaluation(name: str)#

Bases: object

cacluate_DCG(actual: List[List[str]], predicted: List[List[str]]) float#

Calculates the Discounted Cumulative Gain (DCG) of the predicted results

Parameters:
  • actual (List[List[str]]) – The actual results

  • predicted (List[List[str]]) – The predicted results

Returns:

The DCG of the predicted results

Return type:

float

cacluate_MRR(actual: List[List[str]], predicted: List[List[str]]) float#

Calculates the Mean Reciprocal Rank of the predicted results

Parameters:
  • actual (List[List[str]]) – The actual results

  • predicted (List[List[str]]) – The predicted results

Returns:

The MRR of the predicted results

Return type:

float

cacluate_NDCG(actual: List[List[str]], predicted: List[List[str]]) float#

Calculates the Normalized Discounted Cumulative Gain (NDCG) of the predicted results

Parameters:
  • actual (List[List[str]]) – The actual results

  • predicted (List[List[str]]) – The predicted results

Returns:

The NDCG of the predicted results

Return type:

float
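As a rough illustration of DCG and NDCG for a single query, the sketch below assumes binary relevance (a predicted document has gain 1 if it appears in the actual list, otherwise 0) and a log2 rank discount. The package may grade relevance differently; the function names dcg and ndcg are hypothetical.

```python
import math
from typing import List


def dcg(gains: List[float]) -> float:
    """Sum of gains discounted by log2(rank + 1), with ranks starting at 1."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg(actual: List[str], predicted: List[str]) -> float:
    """DCG of the predicted ranking divided by the DCG of an ideal ranking."""
    relevant = set(actual)
    gains = [1.0 if doc in relevant else 0.0 for doc in predicted]
    # Ideal ranking: all relevant documents first, cut to the predicted length.
    ideal = [1.0] * min(len(relevant), len(predicted))
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


print(ndcg(["d1", "d2"], ["d3", "d1", "d2"]))
```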

cacluate_RR(actual: List[List[str]], predicted: List[List[str]]) float#

Calculates the Reciprocal Rank of the predicted results

Parameters:
  • actual (List[List[str]]) – The actual results

  • predicted (List[List[str]]) – The predicted results

Returns:

The Reciprocal Rank of the predicted results

Return type:

float
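The relationship between RR and MRR can be sketched as follows, assuming that each inner list holds the results of one query and that a predicted document counts as relevant when it appears in the matching actual list. The function names are hypothetical.

```python
from typing import List


def reciprocal_rank(actual: List[str], predicted: List[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    relevant = set(actual)
    for rank, doc in enumerate(predicted, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(actual: List[List[str]], predicted: List[List[str]]) -> float:
    """Average of the per-query reciprocal ranks."""
    if not predicted:
        return 0.0
    total = sum(reciprocal_rank(a, p) for a, p in zip(actual, predicted))
    return total / len(predicted)


print(mean_reciprocal_rank([["a"], ["b"]], [["x", "a"], ["b", "y"]]))
```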

calculate_AP(actual: List[List[str]], predicted: List[List[str]]) float#

Calculates the Average Precision of the predicted results

Parameters:
  • actual (List[List[str]]) – The actual results

  • predicted (List[List[str]]) – The predicted results

Returns:

The Average Precision of the predicted results

Return type:

float

calculate_F1(actual: List[List[str]], predicted: List[List[str]]) float#

Calculates the F1 score of the predicted results

Parameters:
  • actual (List[List[str]]) – The actual results

  • predicted (List[List[str]]) – The predicted results

Returns:

The F1 score of the predicted results

Return type:

float

calculate_MAP(actual: List[List[str]], predicted: List[List[str]]) float#

Calculates the Mean Average Precision of the predicted results

Parameters:
  • actual (List[List[str]]) – The actual results

  • predicted (List[List[str]]) – The predicted results

Returns:

The Mean Average Precision of the predicted results

Return type:

float
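A minimal sketch of AP for one query and MAP over several queries is given below. It assumes binary relevance and normalizes AP by the total number of relevant documents; some implementations divide by the number of relevant documents retrieved instead. The function names are hypothetical.

```python
from typing import List


def average_precision(actual: List[str], predicted: List[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    relevant = set(actual)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(predicted, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumption: normalize by the number of relevant documents.
    return precision_sum / len(relevant)


def mean_average_precision(actual: List[List[str]], predicted: List[List[str]]) -> float:
    """Average of the per-query AP values."""
    if not predicted:
        return 0.0
    total = sum(average_precision(a, p) for a, p in zip(actual, predicted))
    return total / len(predicted)


print(mean_average_precision([["a", "b"], ["c"]], [["a", "x", "b"], ["y", "c"]]))
```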

calculate_evaluation(actual: List[List[str]], predicted: List[List[str]])#

Calls all the functions that calculate the evaluation metrics

Parameters:
  • actual (List[List[str]]) – The actual results

  • predicted (List[List[str]]) – The predicted results

calculate_precision(actual: List[List[str]], predicted: List[List[str]]) float#

Calculates the precision of the predicted results

Parameters:
  • actual (List[List[str]]) – The actual results

  • predicted (List[List[str]]) – The predicted results

Returns:

The precision of the predicted results

Return type:

float

calculate_recall(actual: List[List[str]], predicted: List[List[str]]) float#

Calculates the recall of the predicted results

Parameters:
  • actual (List[List[str]]) – The actual results

  • predicted (List[List[str]]) – The predicted results

Returns:

The recall of the predicted results

Return type:

float
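Precision, recall and F1 for a single query can be sketched with set operations. The sketch treats both lists as sets of document ids, which is an assumption; how the package aggregates over several queries is not stated here. The function name is hypothetical.

```python
from typing import List, Tuple


def precision_recall_f1(actual: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 for one query."""
    relevant, retrieved = set(actual), set(predicted)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


print(precision_recall_f1(["a", "b", "c", "d"], ["a", "b", "x"]))
```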

log_evaluation(precision, recall, f1, ap, map, dcg, ndcg, rr, mrr)#

Logs the evaluation metrics using Weights & Biases (wandb)

Parameters:
  • precision (float) – The precision of the predicted results

  • recall (float) – The recall of the predicted results

  • f1 (float) – The F1 score of the predicted results

  • ap (float) – The Average Precision of the predicted results

  • map (float) – The Mean Average Precision of the predicted results

  • dcg (float) – The Discounted Cumulative Gain of the predicted results

  • ndcg (float) – The Normalized Discounted Cumulative Gain of the predicted results

  • rr (float) – The Reciprocal Rank of the predicted results

  • mrr (float) – The Mean Reciprocal Rank of the predicted results

print_evaluation(precision, recall, f1, ap, map, dcg, ndcg, rr, mrr)#

Prints the evaluation metrics

Parameters:
  • precision (float) – The precision of the predicted results

  • recall (float) – The recall of the predicted results

  • f1 (float) – The F1 score of the predicted results

  • ap (float) – The Average Precision of the predicted results

  • map (float) – The Mean Average Precision of the predicted results

  • dcg (float) – The Discounted Cumulative Gain of the predicted results

  • ndcg (float) – The Normalized Discounted Cumulative Gain of the predicted results

  • rr (float) – The Reciprocal Rank of the predicted results

  • mrr (float) – The Mean Reciprocal Rank of the predicted results

Logic.core.utility.preprocess module#

class Preprocessor(documents: list)#

Bases: object

normalize(text: str)#

Normalize the text by converting it to lowercase, stemming, lemmatizing, etc.

Parameters:

text (str) – The text to be normalized.

Returns:

The normalized text.

Return type:

str

preprocess()#

Preprocess the text using the methods in the class.

Returns:

The preprocessed documents.

Return type:

List[str]

Remove links from the text.

Parameters:

text (str) – The text to be processed.

Returns:

The text with links removed.

Return type:

str

remove_punctuations(text: str)#

Remove punctuations from the text.

Parameters:

text (str) – The text to be processed.

Returns:

The text with punctuations removed.

Return type:

str

remove_stopwords(text: str)#

Remove stopwords from the text.

Parameters:

text (str) – The text to remove stopwords from.

Returns:

The list of words with stopwords removed.

Return type:

list

tokenize(text: str)#

Tokenize the words in the text.

Parameters:

text (str) – The text to be tokenized.

Returns:

The list of words.

Return type:

list
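The preprocessing steps documented for this class can be chained roughly as below. This is a dependency-free sketch, not the package's code: the stopword set is a tiny illustrative list, stemming and lemmatization are left out, tokenization is plain whitespace splitting, and all function names are hypothetical.

```python
import re
import string
from typing import List

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "this"}
LINK_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def strip_links(text: str) -> str:
    return LINK_PATTERN.sub(" ", text)


def strip_punctuation(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))


def drop_stopwords(text: str) -> List[str]:
    return [word for word in text.split() if word not in STOPWORDS]


def run_pipeline(documents: List[str]) -> List[str]:
    """Lowercase, remove links and punctuation, then drop stopwords."""
    processed = []
    for doc in documents:
        text = strip_punctuation(strip_links(doc.lower()))
        processed.append(" ".join(drop_stopwords(text)))
    return processed


print(run_pipeline(["The Shawshank Redemption is great! See https://www.imdb.com/title/tt0111161/ now."]))
```

Links are removed before punctuation so that the URL is still recognizable when the pattern runs.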

Logic.core.utility.scorer module#

class Scorer(index, number_of_documents)#

Bases: object

compute_score_with_unigram_model(query, document_id, smoothing_method, document_lengths, alpha, lamda)#

Calculates the score of a single document based on the unigram model.

Parameters:
  • query (str) – The query to search for.

  • document_id (str) – The document to calculate the score for.

  • smoothing_method (str (bayes | naive | mixture)) – The method used for smoothing the probabilities in the unigram model.

  • document_lengths (dict) – A dictionary of the document lengths. The keys are the document IDs, and the values are the document’s length in that field.

  • alpha (float, optional) – The parameter used in bayesian smoothing method. Defaults to 0.5.

  • lamda (float, optional) – The parameter used in some smoothing methods to balance between the document probability and the collection probability. Defaults to 0.5.

Returns:

The Unigram score of the document for the query.

Return type:

float

compute_scores_with_unigram_model(query, smoothing_method, document_lengths=None, alpha=0.5, lamda=0.5)#

Calculates the scores for each document based on the unigram model.

Parameters:
  • query (str) – The query to search for.

  • smoothing_method (str (bayes | naive | mixture)) – The method used for smoothing the probabilities in the unigram model.

  • document_lengths (dict) – A dictionary of the document lengths. The keys are the document IDs, and the values are the document’s length in that field.

  • alpha (float, optional) – The parameter used in bayesian smoothing method. Defaults to 0.5.

  • lamda (float, optional) – The parameter used in some smoothing methods to balance between the document probability and the collection probability. Defaults to 0.5.

Returns:

A dictionary of the document IDs and their scores.

Return type:

dict
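The three smoothing options could be read as shown in the sketch below, which scores one document with a query log-likelihood. The interpretations are assumptions, not taken from the package: "bayes" is treated as Dirichlet-prior smoothing weighted by alpha, "mixture" as linear interpolation with lamda, and "naive" as add-one smoothing. The function works on plain token lists rather than on the index, and its name is hypothetical.

```python
import math
from collections import Counter
from typing import List


def unigram_log_score(query: List[str], doc: List[str], collection: List[str],
                      method: str = "mixture", alpha: float = 0.5, lamda: float = 0.5) -> float:
    """Log-probability of the query under a smoothed unigram model of the document."""
    if not doc or not collection:
        raise ValueError("doc and collection must be non-empty")
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    doc_len, coll_len, vocab_size = len(doc), len(collection), len(coll_tf)
    score = 0.0
    for term in query:
        p_coll = coll_tf[term] / coll_len
        if method == "bayes":      # assumed: Dirichlet prior with weight alpha
            p = (doc_tf[term] + alpha * p_coll) / (doc_len + alpha)
        elif method == "mixture":  # assumed: linear interpolation with lamda
            p = lamda * doc_tf[term] / doc_len + (1 - lamda) * p_coll
        elif method == "naive":    # assumed: add-one (Laplace) smoothing
            p = (doc_tf[term] + 1) / (doc_len + vocab_size)
        else:
            raise ValueError(f"unknown smoothing method: {method}")
        if p == 0:
            return float("-inf")   # term unseen in both document and collection
        score += math.log(p)
    return score


print(unigram_log_score(["a"], ["a", "b"], ["a", "b", "c", "c"], method="mixture"))
```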

compute_scores_with_vector_space_model(query, method)#

Computes the scores of the documents with the vector space model

Parameters:
  • query (List[str]) – The query to be scored

  • method (str ((n|l)(n|t)(n|c).(n|l)(n|t)(n|c))) – The method to use for searching.

Returns:

A dictionary of the document IDs and their scores.

Return type:

dict
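The three-letter codes in method can be read as term frequency (n: natural, l: logarithmic), document frequency (n: none, t: idf) and normalization (n: none, c: cosine). The sketch below assumes that the part before the dot applies to the document and the part after it to the query, and that natural logarithms are used. It works on plain term-frequency dicts rather than the index; the function names are hypothetical.

```python
import math
from typing import Dict


def weight_vector(tfs: Dict[str, int], dfs: Dict[str, int], n_docs: int, method: str) -> Dict[str, float]:
    """Apply a three-letter weighting scheme such as 'lnc' or 'ltc' to raw term frequencies."""
    tf_mode, idf_mode, norm_mode = method
    weights = {}
    for term, tf in tfs.items():
        weight = 1 + math.log(tf) if tf_mode == "l" and tf > 0 else float(tf)
        if idf_mode == "t":
            df = dfs.get(term, 0)
            weight *= math.log(n_docs / df) if df else 0.0
        weights[term] = weight
    if norm_mode == "c":
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm:
            weights = {term: w / norm for term, w in weights.items()}
    return weights


def vsm_score(query_tfs: Dict[str, int], doc_tfs: Dict[str, int], dfs: Dict[str, int],
              n_docs: int, method: str = "lnc.ltc") -> float:
    """Dot product of the weighted document and query vectors."""
    document_method, query_method = method.split(".")
    doc_weights = weight_vector(doc_tfs, dfs, n_docs, document_method)
    query_weights = weight_vector(query_tfs, dfs, n_docs, query_method)
    return sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())


print(vsm_score({"a": 1, "b": 1}, {"a": 2, "c": 1}, {"a": 2, "b": 5, "c": 1}, n_docs=10))
```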

compute_socres_with_okapi_bm25(query, average_document_field_length, document_lengths)#

Computes the scores of the documents with Okapi BM25

Parameters:
  • query (List[str]) – The query to be scored

  • average_document_field_length (float) – The average length of the documents in the index.

  • document_lengths (dict) – A dictionary of the document lengths. The keys are the document IDs, and the values are the document’s length in that field.

Returns:

A dictionary of the document IDs and their scores.

Return type:

dict

get_idf(term)#

Returns the inverse document frequency of a term.

Parameters:

term (str) – The term to get the inverse document frequency for.

Returns:

The inverse document frequency of the term.

Return type:

float

Note

It would be better to store the document frequencies (dfs) in a separate dict during preprocessing.

get_list_of_documents(query)#

Returns a list of documents that contain at least one of the terms in the query.

Parameters:

query (List[str]) – The query to be scored

Returns:

A list of documents that contain at least one of the terms in the query.

Return type:

list

Note

The current approach is not optimal but we use it due to the indexing structure of the dict we’re using. If we had pairs of (document_id, tf) sorted by document_id, we could improve this.

We could initialize a list of pointers, each pointing to the first element of each list. Then, we could iterate through the lists in parallel.
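The pointer-based idea from the note can be sketched with heapq.merge, which lazily advances through several sorted lists in parallel. The sketch assumes each posting list is a list of (document_id, tf) pairs already sorted by document_id, which is the structure the note wishes for rather than the one the index uses; the function name is hypothetical.

```python
import heapq
from typing import List, Tuple


def union_sorted_postings(postings_lists: List[List[Tuple[str, int]]]) -> List[str]:
    """Return the sorted union of document ids from several sorted posting lists."""
    merged: List[str] = []
    # heapq.merge yields the pairs in sorted order without concatenating the lists.
    for doc_id, _tf in heapq.merge(*postings_lists):
        if not merged or merged[-1] != doc_id:
            merged.append(doc_id)  # skip ids already emitted by another list
    return merged


print(union_sorted_postings([[("d1", 2), ("d3", 1)], [("d2", 1), ("d3", 4)]]))
```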

get_okapi_bm25_score(query, document_id, average_document_field_length, document_lengths)#

Returns the Okapi BM25 score of a document for a query.

Parameters:
  • query (List[str]) – The query to be scored

  • document_id (str) – The document to calculate the score for.

  • average_document_field_length (float) – The average length of the documents in the index.

  • document_lengths (dict) – A dictionary of the document lengths. The keys are the document IDs, and the values are the document’s length in that field.

Returns:

The Okapi BM25 score of the document for the query.

Return type:

float
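For orientation, a common form of the Okapi BM25 score is sketched below. The parameters k1 and b and their values are assumptions (they do not appear in the signature above), as is the "+ 1" inside the idf logarithm, which keeps the idf non-negative. The function takes plain dicts instead of the index, and its name is hypothetical.

```python
import math
from typing import Dict, List


def bm25_score(query: List[str], doc_tfs: Dict[str, int], doc_len: float, avg_len: float,
               dfs: Dict[str, int], n_docs: int, k1: float = 1.5, b: float = 0.75) -> float:
    """Sum over query terms of idf times a length-normalized, saturating tf component."""
    score = 0.0
    for term in set(query):
        tf = doc_tfs.get(term, 0)
        df = dfs.get(term, 0)
        if tf == 0 or df == 0:
            continue  # the term contributes nothing to this document
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        length_norm = 1 - b + b * doc_len / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


print(bm25_score(["a"], {"a": 1}, doc_len=100, avg_len=100, dfs={"a": 2}, n_docs=10))
```

When a document has exactly the average length and the term occurs once, the tf component equals 1 and the score reduces to the idf.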

get_query_tfs(query)#

Returns the term frequencies of the terms in the query.

Parameters:

query (List[str]) – The query to get the term frequencies for.

Returns:

A dictionary of the term frequencies of the terms in the query.

Return type:

dict

get_vector_space_model_score(query, query_tfs, document_id, document_method, query_method)#

Returns the Vector Space Model score of a document for a query.

Parameters:
  • query (List[str]) – The query to be scored

  • query_tfs (dict) – The term frequencies of the terms in the query.

  • document_id (str) – The document to calculate the score for.

  • document_method (str (n|l)(n|t)(n|c)) – The method to use for the document.

  • query_method (str (n|l)(n|t)(n|c)) – The method to use for the query.

Returns:

The Vector Space Model score of the document for the query.

Return type:

float

Logic.core.utility.snippet module#

class Snippet(number_of_words_on_each_side=5)#

Bases: object

find_snippet(doc, query)#

Find snippet in a doc based on a query.

Parameters:
  • doc (str) – The retrieved document from which the snippet should be extracted.

  • query (str) – The query on which the snippet should be based.

Returns:

  • final_snippet (str) – The final extracted snippet. IMPORTANT: The keyword should be wrapped by *** on both sides. For example: Shawshank ***redemption*** is one of … (for query: redemption)

  • not_exist_words (list) – Words in the query which don’t exist in the doc.
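A simplified reading of the snippet extraction is sketched below. It assumes whitespace tokenization, case-insensitive matching, only the first occurrence of each query word, and " ... " as the separator between windows; the marker string is a parameter whose default is an assumption. The function name is hypothetical, and stop-word removal from the query is left out.

```python
from typing import List, Tuple


def sketch_snippet(doc: str, query: str, window: int = 5, marker: str = "***") -> Tuple[str, List[str]]:
    """Return a snippet with highlighted query words and the query words missing from doc."""
    doc_tokens = doc.split()
    lowered = [token.lower() for token in doc_tokens]
    pieces: List[str] = []
    not_exist_words: List[str] = []
    for word in query.lower().split():
        if word not in lowered:
            not_exist_words.append(word)
            continue
        i = lowered.index(word)  # first occurrence only
        start = max(0, i - window)
        end = min(len(doc_tokens), i + window + 1)
        highlighted = f"{marker}{doc_tokens[i]}{marker}"
        pieces.append(" ".join(doc_tokens[start:i] + [highlighted] + doc_tokens[i + 1:end]))
    return " ... ".join(pieces), not_exist_words


print(sketch_snippet("The Shawshank Redemption is one of the best movies", "redemption godfather", window=2))
```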

remove_stop_words_from_query(query)#

Remove stop words from the input string.

Parameters:

query (str) – The query to remove stop words from.

Returns:

The query without stop words.

Return type:

str

Logic.core.utility.spell_correction module#

class SpellCorrection(all_documents)#

Bases: object

find_nearest_words(word)#

Find correct form of a misspelled word.

Parameters:

word (str) – The misspelled word.

Returns:

5 nearest words.

Return type:

list of str

jaccard_score(first_set, second_set)#

Calculate jaccard score.

Parameters:
  • first_set (set) – First set of shingles.

  • second_set (set) – Second set of shingles.

Returns:

Jaccard score.

Return type:

float

shingle_word(word, k=2)#

Convert a word into a set of shingles.

Parameters:
  • word (str) – The input word.

  • k (int) – The size of each shingle.

Returns:

A set of shingles.

Return type:

set
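Shingling and the Jaccard score fit in a few lines. The sketch below uses character shingles of length k and returns 0.0 when both sets are empty, which is an assumption about the edge case; it is an illustration and not the package's implementation.

```python
from typing import Set


def shingle_word(word: str, k: int = 2) -> Set[str]:
    """Return the set of all substrings of length k in word."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}


def jaccard_score(first_set: Set[str], second_set: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = first_set | second_set
    return len(first_set & second_set) / len(union) if union else 0.0


print(shingle_word("word"))
print(jaccard_score(shingle_word("word"), shingle_word("ward")))
```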

shingling_and_counting(all_documents)#

Shingle all words of the corpus and count the term frequency (TF) of each word.

Parameters:

all_documents (list of str) – The input documents.

Returns:

  • all_shingled_words (dict) – A dictionary from words to their shingle sets.

  • word_counter (dict) – A dictionary from words to their TFs.

spell_check(query)#

Find correct form of a misspelled query.

Parameters:

query (str) – The misspelled query.

Returns:

Correct form of the query.

Return type:

str
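One way the pieces of this class could fit together is sketched below: the five candidates with the highest Jaccard score are selected, and the final choice multiplies the Jaccard score by the candidate's TF normalized by the highest TF among the candidates. That ranking rule is an assumption, as are the helper names and the decision to leave known words untouched.

```python
from collections import Counter
from typing import Dict, List, Set


def shingles(word: str, k: int = 2) -> Set[str]:
    return {word[i:i + k] for i in range(len(word) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_words(word: str, word_counter: Dict[str, int], top_n: int = 5) -> List[str]:
    """The top_n vocabulary words with the highest Jaccard score against word."""
    target = shingles(word)
    ranked = sorted(word_counter, key=lambda c: jaccard(target, shingles(c)), reverse=True)
    return ranked[:top_n]


def spell_check(query: str, word_counter: Dict[str, int]) -> str:
    corrected = []
    for word in query.lower().split():
        candidates = nearest_words(word, word_counter)
        if word in word_counter or not candidates:
            corrected.append(word)  # known word, or nothing to correct it with
            continue
        max_tf = max(word_counter[c] for c in candidates)
        # Assumed ranking: Jaccard score weighted by normalized term frequency.
        best = max(candidates,
                   key=lambda c: jaccard(shingles(word), shingles(c)) * word_counter[c] / max_tf)
        corrected.append(best)
    return " ".join(corrected)


counts = Counter({"redemption": 5, "reduction": 2, "shawshank": 3})
print(spell_check("shawshenk redemtion", counts))
```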