Logic.core.utility package#
Logic.core.utility.crawler module#
- class IMDbCrawler(crawling_threshold=1000)#
Bases:
object
Put your own user agent in the headers.
- crawl(URL)#
Make a get request to the URL and return the response
- Parameters:
URL (str) – The URL of the site
- Returns:
The response of the get request
- Return type:
requests.models.Response
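A minimal sketch of what crawl might look like, assuming the standard requests library and a placeholder User-Agent value (the class expects you to supply your own):

import requests

HEADERS = {'User-Agent': 'my-imdb-crawler/0.1'}  # placeholder; put your own user agent here

def crawl(URL: str) -> requests.models.Response:
    # Issue a GET request with the crawler's headers and return the raw response object.
    return requests.get(URL, headers=HEADERS, timeout=10)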
- crawl_page_info(URL)#
Main logic of the crawler. It crawls the page and extracts the movie's information, then uses the related links of the movie to crawl more movies.
- Parameters:
URL (str) – The URL of the site
- extract_movie_info(res, movie, URL)#
Extract the information of the movie from the response and save it in the movie instance.
- extract_top_250()#
Extract the top 250 movies from the top 250 page and use them as seed for the crawler to start crawling.
- get_budget()#
Get the budget of the movie from box office section of the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The budget of the movie
- Return type:
- get_countries_of_origin()#
Get the countries of origin of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The countries of origin of the movie
- Return type:
List[str]
- get_director()#
Get the directors of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The directors of the movie
- Return type:
List[str]
- get_first_page_summary()#
Get the first page summary of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The first page summary of the movie
- Return type:
- get_genres()#
Get the genres of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The genres of the movie
- Return type:
List[str]
- get_gross_worldwide()#
Get the gross worldwide of the movie from box office section of the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The gross worldwide of the movie
- Return type:
- get_id_from_URL(URL)#
Get the id from the URL of the site. The id is what comes exactly after "title". For example, the id for the movie https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1 is tt0111161.
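A hedged sketch of the id extraction; it simply keeps the path segment that follows title/ (illustrative, not necessarily the class's exact implementation):

def get_id_from_URL(URL: str) -> str:
    # e.g. "https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1" -> "tt0111161"
    return URL.split('/title/')[1].split('/')[0]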
- get_imdb_instance()#
- get_languages()#
Get the languages of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The languages of the movie
- Return type:
List[str]
- get_mpaa()#
Get the MPAA of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The MPAA of the movie
- Return type:
- get_rating()#
Get the rating of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The rating of the movie
- Return type:
- get_related_links()#
Get the related links of the movie from the "More like this" section of the page from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The related links of the movie
- Return type:
List[str]
- get_release_year()#
Get the release year of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The release year of the movie
- Return type:
- get_review_link()#
Get the link to the review page of the movie. Example: https://www.imdb.com/title/tt0111161/ is the movie page and https://www.imdb.com/title/tt0111161/reviews is its review page.
- get_reviews_with_scores()#
Get the reviews of the movie from the soup. Reviews structure: [[review, score]]
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The reviews of the movie
- Return type:
List[List[str]]
- get_stars()#
Get the stars of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The stars of the movie
- Return type:
List[str]
- get_summary()#
Get the summary of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The summary of the movie
- Return type:
List[str]
- get_summary_link()#
Get the link to the summary page of the movie. Example: https://www.imdb.com/title/tt0111161/ is the movie page and https://www.imdb.com/title/tt0111161/plotsummary is its summary page.
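Both link helpers only append a fixed suffix to the movie's base URL; a minimal sketch (the bodies below are assumptions):

def get_review_link(url: str) -> str:
    # https://www.imdb.com/title/tt0111161/ -> https://www.imdb.com/title/tt0111161/reviews
    return url.rstrip('/') + '/reviews'

def get_summary_link(url: str) -> str:
    # https://www.imdb.com/title/tt0111161/ -> https://www.imdb.com/title/tt0111161/plotsummary
    return url.rstrip('/') + '/plotsummary'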
- get_synopsis()#
Get the synopsis of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The synopsis of the movie
- Return type:
List[str]
- get_title()#
Get the title of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The title of the movie
- Return type:
- get_writers()#
Get the writers of the movie from the soup
- Parameters:
soup (BeautifulSoup) – The soup of the page
- Returns:
The writers of the movie
- Return type:
List[str]
- headers = {'User-Agent': None}#
- read_from_file_as_json()#
Read the crawled files from json
- start_crawling()#
Start crawling the movies until the crawling threshold is reached.
Todo: replace WHILE_LOOP_CONSTRAINTS with the proper constraints for the while loop, replace NEW_URL with the new URL to crawl, replace THERE_IS_NOTHING_TO_CRAWL with the condition that checks whether there is nothing left to crawl, and delete the helper variables.
ThreadPoolExecutor is used to make the crawler faster by crawling pages on multiple threads. You are free to use it or not; if you do, don't forget to guard access to the shared resources (see the sketch below).
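A rough sketch of the threaded crawl loop described above, assuming a shared URL queue guarded by a lock; all names below are placeholders for the helper variables the todo asks you to design, and extract_related_links is a stub:

from concurrent.futures import ThreadPoolExecutor
from threading import Lock

crawling_threshold = 1000
queued = ['https://www.imdb.com/title/tt0111161/']   # seed URLs, e.g. from the top-250 page
seen, crawled = set(queued), []
lock = Lock()

def extract_related_links(url):
    return []   # stub; the real crawler parses the "More like this" section of the page

def crawl_page_info(url):
    for link in extract_related_links(url):
        with lock:                               # safe access to the shared queue and sets
            if link not in seen:
                seen.add(link)
                queued.append(link)
    with lock:
        crawled.append(url)

with ThreadPoolExecutor(max_workers=20) as executor:
    while len(crawled) < crawling_threshold:
        with lock:
            if not queued:                       # simplified; a real loop would wait for in-flight tasks
                break
            url = queued.pop(0)
        executor.submit(crawl_page_info, url)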
- top_250_URL = 'https://www.imdb.com/chart/top/'#
- write_to_file_as_json()#
Save the crawled files into json
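Persistence is plain JSON serialization; a minimal sketch assuming the crawled movies are kept as a list of dicts and written to an IMDB_crawled.json file (both assumptions):

import json

def write_to_file_as_json(crawled, path='IMDB_crawled.json'):
    # Save the crawled movie dicts to disk.
    with open(path, 'w') as f:
        json.dump(crawled, f)

def read_from_file_as_json(path='IMDB_crawled.json'):
    # Load previously crawled movie dicts from disk.
    with open(path) as f:
        return json.load(f)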
- main()#
Logic.core.utility.evaluation module#
- class Evaluation(name: str)#
Bases:
object
- cacluate_DCG(actual: List[List[str]], predicted: List[List[str]]) float #
Calculates the Discounted Cumulative Gain (DCG) of the predicted results
- cacluate_MRR(actual: List[List[str]], predicted: List[List[str]]) float #
Calculates the Mean Reciprocal Rank of the predicted results
- cacluate_NDCG(actual: List[List[str]], predicted: List[List[str]]) float #
Calculates the Normalized Discounted Cumulative Gain (NDCG) of the predicted results
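A sketch of DCG and NDCG for a single query, under the assumption of binary relevance (gain 1 if a predicted id appears in the actual list); the class methods additionally average over all queries:

import math
from typing import List

def dcg(actual: List[str], predicted: List[str]) -> float:
    # Binary gain (1 if the document is relevant), discounted by log2(position + 1).
    return sum((1.0 if doc in actual else 0.0) / math.log2(pos + 1)
               for pos, doc in enumerate(predicted, start=1))

def ndcg(actual: List[str], predicted: List[str]) -> float:
    # Normalize by the DCG of an ideal ranking (all relevant documents ranked first).
    ideal = dcg(actual, actual)
    return dcg(actual, predicted) / ideal if ideal > 0 else 0.0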
- cacluate_RR(actual: List[List[str]], predicted: List[List[str]]) float #
Calculates the Reciprocal Rank of the predicted results
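Reciprocal Rank is 1 over the rank of the first relevant result, and MRR averages it over queries; a sketch with the same binary-relevance assumption:

from typing import List

def reciprocal_rank(actual: List[str], predicted: List[str]) -> float:
    # 1 / rank of the first relevant document, 0 if none is relevant.
    for rank, doc in enumerate(predicted, start=1):
        if doc in actual:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(actual: List[List[str]], predicted: List[List[str]]) -> float:
    return sum(reciprocal_rank(a, p) for a, p in zip(actual, predicted)) / len(actual)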
- calculate_AP(actual: List[List[str]], predicted: List[List[str]]) float #
Calculates the Average Precision of the predicted results
- calculate_F1(actual: List[List[str]], predicted: List[List[str]]) float #
Calculates the F1 score of the predicted results
- calculate_MAP(actual: List[List[str]], predicted: List[List[str]]) float #
Calculates the Mean Average Precision of the predicted results
- calculate_evaluation(actual: List[List[str]], predicted: List[List[str]])#
Call all of the functions above to calculate the evaluation metrics
- calculate_precision(actual: List[List[str]], predicted: List[List[str]]) float #
Calculates the precision of the predicted results
- calculate_recall(actual: List[List[str]], predicted: List[List[str]]) float #
Calculates the recall of the predicted results
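Per-query precision, recall, and F1 under set semantics; a minimal sketch (the class methods take List[List[str]] inputs and average these values over all queries):

from typing import List

def precision(actual: List[str], predicted: List[str]) -> float:
    hits = sum(1 for doc in predicted if doc in actual)
    return hits / len(predicted) if predicted else 0.0

def recall(actual: List[str], predicted: List[str]) -> float:
    hits = sum(1 for doc in predicted if doc in actual)
    return hits / len(actual) if actual else 0.0

def f1(actual: List[str], predicted: List[str]) -> float:
    p, r = precision(actual, predicted), recall(actual, predicted)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0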
- log_evaluation(precision, recall, f1, ap, map, dcg, ndcg, rr, mrr)#
Use Wandb to log the evaluation metrics
- Parameters:
precision (float) – The precision of the predicted results
recall (float) – The recall of the predicted results
f1 (float) – The F1 score of the predicted results
ap (float) – The Average Precision of the predicted results
map (float) – The Mean Average Precision of the predicted results
dcg (float) – The Discounted Cumulative Gain of the predicted results
ndcg (float) – The Normalized Discounted Cumulative Gain of the predicted results
rr (float) – The Reciprocal Rank of the predicted results
mrr (float) – The Mean Reciprocal Rank of the predicted results
- print_evaluation(precision, recall, f1, ap, map, dcg, ndcg, rr, mrr)#
Prints the evaluation metrics
- Parameters:
precision (float) – The precision of the predicted results
recall (float) – The recall of the predicted results
f1 (float) – The F1 score of the predicted results
ap (float) – The Average Precision of the predicted results
map (float) – The Mean Average Precision of the predicted results
dcg (float) – The Discounted Cumulative Gain of the predicted results
ndcg (float) – The Normalized Discounted Cumulative Gain of the predicted results
rr (float) – The Reciprocal Rank of the predicted results
mrr (float) – The Mean Reciprocal Rank of the predicted results
Logic.core.utility.preprocess module#
- class Preprocessor(documents: list)#
Bases:
object
- normalize(text: str)#
Normalize the text by lowercasing it, stemming, lemmatization, etc.
- preprocess()#
Preprocess the text using the methods in the class.
- Returns:
The preprocessed documents.
- Return type:
List[str]
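A hedged sketch of what normalize and preprocess could look like, assuming NLTK for tokenization, stopword removal, and lemmatization (the actual class may use different tools, and the usual nltk data downloads are required):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)       # strip punctuation and link characters
    tokens = [lemmatizer.lemmatize(t) for t in word_tokenize(text)
              if t not in stop_words]
    return ' '.join(tokens)

def preprocess(documents):
    return [normalize(doc) for doc in documents]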
Logic.core.utility.scorer module#
- class Scorer(index, number_of_documents)#
Bases:
object
- compute_score_with_unigram_model(query, document_id, smoothing_method, document_lengths, alpha, lamda)#
Calculates the score of a single document based on the unigram model.
- Parameters:
query (str) – The query to search for.
document_id (str) – The document to calculate the score for.
smoothing_method (str (bayes | naive | mixture)) – The method used for smoothing the probabilities in the unigram model.
document_lengths (dict) – A dictionary of the document lengths. The keys are the document IDs, and the values are the document’s length in that field.
alpha (float, optional) – The parameter used in the Bayesian smoothing method. Defaults to 0.5.
lamda (float, optional) – The parameter used in some smoothing methods to balance between the document probability and the collection probability. Defaults to 0.5.
- Returns:
The Unigram score of the document for the query.
- Return type:
float
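A sketch of the per-document unigram score with the three smoothing options named above; tf, cf, doc_len, and collection_len are placeholder statistics pulled from the index, and the exact meaning of "bayes" and "mixture" here is an assumption:

import math

def unigram_score(query_terms, tf, cf, doc_len, collection_len,
                  smoothing_method='mixture', alpha=0.5, lamda=0.5):
    # tf[term]: term count in the document, cf[term]: term count in the whole collection
    score = 0.0
    for term in query_terms:
        p_doc = tf.get(term, 0) / doc_len if doc_len else 0.0
        p_coll = cf.get(term, 0) / collection_len if collection_len else 0.0
        if smoothing_method == 'naive':
            p = p_doc
        elif smoothing_method == 'bayes':          # Dirichlet-style smoothing with parameter alpha
            p = (tf.get(term, 0) + alpha * p_coll) / (doc_len + alpha)
        else:                                      # 'mixture': Jelinek-Mercer interpolation
            p = lamda * p_doc + (1 - lamda) * p_coll
        score += math.log(p) if p > 0 else math.log(1e-12)   # avoid log(0)
    return score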
- compute_scores_with_unigram_model(query, smoothing_method, document_lengths=None, alpha=0.5, lamda=0.5)#
Calculates the scores for each document based on the unigram model.
- Parameters:
query (str) – The query to search for.
smoothing_method (str (bayes | naive | mixture)) – The method used for smoothing the probabilities in the unigram model.
document_lengths (dict) – A dictionary of the document lengths. The keys are the document IDs, and the values are the document’s length in that field.
alpha (float, optional) – The parameter used in the Bayesian smoothing method. Defaults to 0.5.
lamda (float, optional) – The parameter used in some smoothing methods to balance between the document probability and the collection probability. Defaults to 0.5.
- Returns:
A dictionary of the document IDs and their scores.
- Return type:
dict
- compute_scores_with_vector_space_model(query, method)#
Compute the documents' scores with the vector space model.
- compute_socres_with_okapi_bm25(query, average_document_field_length, document_lengths)#
Compute the documents' scores with Okapi BM25.
- Parameters:
query – The query to search for.
average_document_field_length (float) – The average length of the documents in the index.
document_lengths (dict) – A dictionary of the document lengths. The keys are the document IDs, and the values are the document’s length in that field.
- Returns:
A dictionary of the document IDs and their scores.
- Return type:
dict
- get_idf(term)#
Returns the inverse document frequency of a term.
- Parameters:
term (str) – The term to get the inverse document frequency for.
- Returns:
The inverse document frequency of the term.
- Return type:
float
Note
It would have been better to store document frequencies (dfs) in a separate dict during preprocessing.
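A minimal idf sketch; df (document frequency per term) and N (number of documents) are the statistics the note above suggests precomputing:

import math

def get_idf(term, df, N):
    # Standard inverse document frequency: log(N / df(term)); 0 if the term never occurs.
    return math.log(N / df[term]) if df.get(term) else 0.0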
- get_list_of_documents(query)#
Returns a list of documents that contain at least one of the terms in the query.
- Parameters:
query (List[str]) – The query to be scored
- Returns:
A list of documents that contain at least one of the terms in the query.
- Return type:
list
Note
The current approach is not optimal but we use it due to the indexing structure of the dict we’re using. If we had pairs of (document_id, tf) sorted by document_id, we could improve this.
We could initialize a list of pointers, each pointing to the first element of each list. Then, we could iterate through the lists in parallel.
- get_okapi_bm25_score(query, document_id, average_document_field_length, document_lengths)#
Returns the Okapi BM25 score of a document for a query.
- Parameters:
query (List[str]) – The query to be scored
document_id (str) – The document to calculate the score for.
average_document_field_length (float) – The average length of the documents in the index.
document_lengths (dict) – A dictionary of the document lengths. The keys are the document IDs, and the values are the document’s length in that field.
- Returns:
The Okapi BM25 score of the document for the query.
- Return type:
float
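A sketch of the classic Okapi BM25 term sum with conventional k1 and b defaults (the class may tune these differently); tf and df are per-document and collection statistics passed in as plain dicts:

import math

def okapi_bm25_score(query, tf, df, N, doc_len, avg_doc_len, k1=1.5, b=0.75):
    # tf: term frequencies in this document, df: document frequencies, N: number of documents
    score = 0.0
    for term in query:
        if term not in tf or not df.get(term):
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score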
- get_query_tfs(query)#
Returns the term frequencies of the terms in the query.
- get_vector_space_model_score(query, query_tfs, document_id, document_method, query_method)#
Returns the Vector Space Model score of a document for a query.
- Parameters:
query (List[str]) – The query to be scored
query_tfs (dict) – The term frequencies of the terms in the query.
document_id (str) – The document to calculate the score for.
document_method (str (n|l)(n|t)(n|c)) – The method to use for the document.
query_method (str (n|l)(n|t)(n|c)) – The method to use for the query.
- Returns:
The Vector Space Model score of the document for the query.
- Return type:
float
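A sketch of the lnc/ltc-style weighting implied by the (n|l)(n|t)(n|c) method strings: weight the query and document vectors term by term, optionally cosine-normalize, and take the dot product. The plain-dict statistics and the query-terms-only normalization are simplifying assumptions:

import math

def weight(tf, df, N, tf_scheme, idf_scheme):
    w = (1 + math.log(tf)) if tf_scheme == 'l' and tf > 0 else tf   # n: raw tf, l: 1 + log(tf)
    if idf_scheme == 't' and df:
        w *= math.log(N / df)                                        # t: multiply by idf
    return w

def vector_space_score(query_tfs, doc_tfs, df, N, document_method='lnc', query_method='ltc'):
    q_vec, d_vec = {}, {}
    for term, qtf in query_tfs.items():
        q_vec[term] = weight(qtf, df.get(term, 0), N, query_method[0], query_method[1])
        d_vec[term] = weight(doc_tfs.get(term, 0), df.get(term, 0), N,
                             document_method[0], document_method[1])
    if query_method[2] == 'c':        # c: cosine-normalize the query vector
        norm = math.sqrt(sum(v * v for v in q_vec.values())) or 1.0
        q_vec = {t: v / norm for t, v in q_vec.items()}
    if document_method[2] == 'c':     # simplification: normalizes over query terms only
        norm = math.sqrt(sum(v * v for v in d_vec.values())) or 1.0
        d_vec = {t: v / norm for t, v in d_vec.items()}
    return sum(q_vec[t] * d_vec[t] for t in q_vec)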
Logic.core.utility.snippet module#
- class Snippet(number_of_words_on_each_side=5)#
Bases:
object
- find_snippet(doc, query)#
Find snippet in a doc based on a query.
- Parameters:
doc (str) – The document to find the snippet in.
query (str) – The query to find the snippet for.
- Returns:
final_snippet (str) – The final extracted snippet. IMPORTANT: The keyword should be wrapped with *** on both sides. For example: Shawshank ***redemption*** is one of … (for query: redemption)
not_exist_words (list) – Words in the query which don’t exist in the doc.
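A rough sketch of the snippet logic: for each query word found in the document, keep a window of number_of_words_on_each_side tokens around its first occurrence and wrap the keyword with ***; merging of overlapping windows is omitted here:

def find_snippet(doc, query, number_of_words_on_each_side=5):
    tokens, query_words = doc.split(), query.lower().split()
    pieces, not_exist_words = [], []
    for qw in query_words:
        positions = [i for i, t in enumerate(tokens) if t.lower() == qw]
        if not positions:
            not_exist_words.append(qw)
            continue
        i = positions[0]
        window = tokens[max(0, i - number_of_words_on_each_side): i + number_of_words_on_each_side + 1]
        window = ['***' + w + '***' if w.lower() == qw else w for w in window]
        pieces.append(' '.join(window))
    return ' ... '.join(pieces), not_exist_words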
Logic.core.utility.spell_correction module#
- class SpellCorrection(all_documents)#
Bases:
object
- find_nearest_words(word)#
Find the correct form of a misspelled word.
- jaccard_score(first_set, second_set)#
Calculate the Jaccard score of two shingle sets.
- shingle_word(word, k=2)#
Convert a word into a set of shingles.
- shingling_and_counting(all_documents)#
Shingle all words of the corpus and count TF of each word.
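The spell-correction pipeline above is shingle-based: break each word into character k-grams, compare a misspelled word against candidate words with the Jaccard score, and (in the real class) also weight candidates by their term frequency. A minimal sketch of the first two steps with a toy usage example:

def shingle_word(word, k=2):
    # e.g. "whale" -> {"wh", "ha", "al", "le"}
    return {word[i:i + k] for i in range(len(word) - k + 1)}

def jaccard_score(first_set, second_set):
    union = first_set | second_set
    return len(first_set & second_set) / len(union) if union else 0.0

# usage: rank candidate words by shingle similarity to the misspelled word
candidates = ['whale', 'while', 'wheel']
word = 'whal'
print(sorted(candidates, key=lambda w: jaccard_score(shingle_word(word), shingle_word(w)), reverse=True))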