Logic.core.indexer package#

Logic.core.indexer.LSH module#

class MinHashLSH(documents, num_hashes)#

Bases: object

build_characteristic_matrix()#

Build the characteristic matrix representing the presence of shingles in documents.

Returns:

The binary characteristic matrix.

Return type:

ndarray
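
To make the matrix shape concrete, the sketch below builds a small binary shingle-by-document matrix with NumPy; the shingle sets and the row/column ordering are illustrative assumptions, not output of this class.

import numpy as np

# Hypothetical shingle sets for three documents (illustrative only).
shingled_docs = [
    {"ab", "bc", "cd"},
    {"bc", "cd", "de"},
    {"ef", "fg"},
]

# Rows correspond to distinct shingles, columns to documents.
all_shingles = sorted(set().union(*shingled_docs))
matrix = np.zeros((len(all_shingles), len(shingled_docs)), dtype=int)
for col, shingles in enumerate(shingled_docs):
    for row, shingle in enumerate(all_shingles):
        matrix[row, col] = 1 if shingle in shingles else 0

print(matrix)  # a 1 marks "this shingle appears in this document"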

jaccard_score(first_set, second_set)#

Calculate the Jaccard score of two sets.

Parameters:
  • first_set (set) – Shingle set of the first document.

  • second_set (set) – Shingle set of the second document.

Returns:

Jaccard score.

Return type:

float
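
For reference, the Jaccard score of two sets is the size of their intersection divided by the size of their union. A minimal standalone sketch (the helper name is illustrative, not this class method):

def jaccard_sketch(first_set: set, second_set: set) -> float:
    # |A ∩ B| / |A ∪ B|, treated as 0.0 when both sets are empty.
    union = first_set | second_set
    if not union:
        return 0.0
    return len(first_set & second_set) / len(union)

print(jaccard_sketch({"ab", "bc", "cd"}, {"bc", "cd", "de"}))  # 0.5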

jaccard_similarity_test(buckets, all_documents)#

Test your near-duplicate detection code based on Jaccard similarity.

Parameters:
  • buckets (dict) – A dictionary mapping bucket IDs to lists of document indices.

  • all_documents (list) – The input documents for similarity analysis.

lsh_buckets(signature, bands=10, rows_per_band=10)#

Group documents into Locality-Sensitive Hashing (LSH) buckets based on Min-Hash signatures.

Parameters:
  • signature (ndarray) – Min-Hash signatures for documents.

  • bands (int) – Number of bands for LSH.

  • rows_per_band (int) – Number of rows per band.

Returns:

A dictionary mapping bucket IDs to lists of document indices.

Return type:

dict
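
The banding idea is to split each signature column into bands of rows_per_band rows and hash each slice; documents that collide in any band fall into the same bucket. A minimal sketch under the assumption that bucket keys are hashes of (band id, band slice), which may differ from this method's exact keying:

import numpy as np
from collections import defaultdict

def lsh_buckets_sketch(signature: np.ndarray, bands: int = 10, rows_per_band: int = 10) -> dict:
    # signature has shape (bands * rows_per_band, num_documents).
    buckets = defaultdict(list)
    num_docs = signature.shape[1]
    for band in range(bands):
        start = band * rows_per_band
        band_slice = signature[start:start + rows_per_band, :]
        for doc in range(num_docs):
            # Include the band id in the key so different bands never share a bucket.
            key = hash((band, tuple(band_slice[:, doc])))
            buckets[key].append(doc)
    return dict(buckets)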

min_hash_signature()#

Perform Min-Hashing to generate hash signatures for documents.

Returns:

The Min-Hash signatures matrix.

Return type:

ndarray
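
A common way to compute Min-Hash signatures is to apply random row permutations to the characteristic matrix and, for each document, record the first permuted row that contains a 1. A permutation-based sketch, assuming that approach (the class may use hash functions instead):

import numpy as np

def min_hash_signature_sketch(characteristic: np.ndarray, num_hashes: int, seed: int = 0) -> np.ndarray:
    # Rows of `characteristic` are shingles, columns are documents.
    rng = np.random.default_rng(seed)
    num_shingles, num_docs = characteristic.shape
    signature = np.full((num_hashes, num_docs), np.inf)
    for h in range(num_hashes):
        permutation = rng.permutation(num_shingles)
        for doc in range(num_docs):
            rows_with_one = np.where(characteristic[:, doc] == 1)[0]
            if rows_with_one.size:
                signature[h, doc] = permutation[rows_with_one].min()
    return signature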

perform_lsh()#

Perform the entire Locality-Sensitive Hashing (LSH) process.

Returns:

A dictionary mapping bucket IDs to lists of document indices.

Return type:

dict

shingle_document(document, k=2)#

Convert a document into a set of shingles.

Parameters:
  • document (str) – The input document.

  • k (int) – The size of each shingle.

Returns:

A set of shingles.

Return type:

set
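
With k=2, character-level shingling just slides a window of length k over the document string; whether this class shingles characters or words is an implementation choice, so treat the sketch below (and the commented end-to-end usage) as assumptions:

def shingle_document_sketch(document: str, k: int = 2) -> set:
    # Character-level k-shingles of the input string.
    return {document[i:i + k] for i in range(len(document) - k + 1)}

print(shingle_document_sketch("abcd"))  # {'ab', 'bc', 'cd'} (set order may vary)

# Hypothetical end-to-end usage, assuming the constructor signature shown above:
# lsh = MinHashLSH(documents=["first movie summary", "second movie summary"], num_hashes=100)
# buckets = lsh.perform_lsh()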

Logic.core.indexer.document_lengths_index module#

class DocumentLengthsIndex(path='index/')#

Bases: object

get_documents_length(where)#

Gets the documents’ lengths for the specified field.

Parameters:

where (str) – The field to get the document lengths for.

Returns:

A dictionary of document lengths. The keys are document IDs, and the values are each document’s length in the specified field (where).

Return type:

dict
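
A hedged usage sketch, assuming the default index/ path already contains the stored indexes and that the field name matches one of the Indexes enum values documented below:

from Logic.core.indexer.document_lengths_index import DocumentLengthsIndex

# "summaries" is assumed to match Indexes.SUMMARIES.value from indexes_enum.
lengths_index = DocumentLengthsIndex(path="index/")
summary_lengths = lengths_index.get_documents_length("summaries")
# summary_lengths maps each document ID to the length of its summaries field.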

store_document_lengths_index(path, index_name)#

Stores the document lengths index to a file.

Parameters:
  • path (str) – The path to the directory where the indexes are stored.

  • index_name (Indexes) – The name of the index to store.

Logic.core.indexer.index module#

class Index(preprocessed_documents: list)#

Bases: object

add_document_to_index(document: dict)#

Add a document to all the indexes

Parameters:

document (dict) – Document to add to all the indexes

check_add_remove_is_correct()#

Check whether adding and removing documents works correctly.

check_if_index_loaded_correctly(index_type: str, loaded_index: dict)#

Check if the index is loaded correctly

Parameters:
  • index_type (str) – Type of index to check (documents, stars, genres, summaries)

  • loaded_index (dict) – The loaded index

Returns:

True if index is loaded correctly, False otherwise

Return type:

bool

check_if_indexing_is_good(index_type: str, check_word: str = 'good')#

Checks whether the indexing is correct. Do not change this function; use it to verify your indexing.

Parameters:
  • index_type (str) – Type of index to check (documents, stars, genres, summaries)

  • check_word (str) – The word to check in the index

Returns:

True if indexing is good, False otherwise

Return type:

bool

check_if_key_exists(index_before_add, index, key)#

delete_dummy_keys(index_before_add, index, key)#
get_posting_list(word: str, index_type: str)#

Get the posting list of a word.

Parameters:
  • word (str) – The word to look up.

  • index_type (str) – The type of index to check (documents, stars, genres, summaries).

Returns:

The posting list of the word. You should return the list of document IDs that contain the word and ignore the tf.

Return type:

list
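
Because the term-based indexes store {term: {document_id: tf}}, extracting a posting list amounts to taking the keys of the inner dictionary. A standalone sketch with made-up index contents:

# Illustrative {term: {document_id: tf}} index; IDs and counts are made up.
summaries_index = {
    "good": {"tt0111161": 3, "tt0068646": 1},
    "prison": {"tt0111161": 5},
}

def get_posting_list_sketch(word: str, index: dict) -> list:
    # Return the document IDs containing the word, ignoring term frequencies.
    return list(index.get(word, {}).keys())

print(get_posting_list_sketch("good", summaries_index))  # ['tt0111161', 'tt0068646']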

index_documents()#

Index the documents based on the document ID. In other words, create a dictionary where the key is the document ID and the value is the document.

Returns:

The index of the documents based on the document ID.

Return type:

dict

index_genres()#

Index the documents based on the genres.

Returns:

The index of the documents based on the genres. You should also store each term’s tf in each document. So the index type is: {term: {document_id: tf}}

Return type:

dict

index_stars()#

Index the documents based on the stars.

Returns:

The index of the documents based on the stars. You should also store each term’s tf in each document. So the index type is: {term: {document_id: tf}}

Return type:

dict

index_summaries()#

Index the documents based on the summaries (not first_page_summary).

Returns:

The index of the documents based on the summaries. You should also store each term’s tf in each document. So the index type is: {term: {document_id: tf}}

Return type:

dict
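
To make the {term: {document_id: tf}} shape concrete, the sketch below builds such an inverted index from two hypothetical preprocessed documents; the field names id and summaries are assumptions about the document dictionaries:

from collections import Counter, defaultdict

# Hypothetical preprocessed documents; the field names are assumptions.
documents = [
    {"id": "tt0111161", "summaries": ["two imprisoned men bond over a number of years"]},
    {"id": "tt0068646", "summaries": ["the aging patriarch of an organized crime dynasty"]},
]

index = defaultdict(dict)
for doc in documents:
    tokens = []
    for summary in doc["summaries"]:
        tokens.extend(summary.split())
    for term, tf in Counter(tokens).items():
        index[term][doc["id"]] = tf

print(index["of"])  # {'tt0111161': 1, 'tt0068646': 1}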

load_index(path: str)#

Loads the index from a file (such as a JSON file)

Parameters:

path (str) – Path to the file to load

remove_document_from_index(document_id: str)#

Remove a document from all the indexes

Parameters:

document_id (str) – ID of the document to remove from all the indexes

store_index(path: str, index_name: str = None)#

Stores the index in a file (such as a JSON file)

Parameters:
  • path (str) – Path at which to store the file

  • index_name (str) – Name of the index to store (documents, stars, genres, summaries)
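
Storing and loading an index as JSON is typically a json.dump / json.load round trip. A minimal sketch of that pattern (the file name is an assumption, not necessarily what store_index uses):

import json
from pathlib import Path

# Hypothetical {term: {document_id: tf}} index and file name.
index = {"good": {"tt0111161": 3}}
path = Path("index")
path.mkdir(exist_ok=True)

with open(path / "summaries_index.json", "w") as f:
    json.dump(index, f)

with open(path / "summaries_index.json") as f:
    loaded = json.load(f)

assert loaded == index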

Logic.core.indexer.index_reader module#

class Index_reader(path: str, index_name: Indexes, index_type: Index_types = None)#

Bases: object

get_index()#

Gets the index from the file.

Returns:

The index.

Return type:

dict
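
A hedged usage sketch of the reader, combining it with the enums from indexes_enum documented below and assuming the index files already exist under index/:

from Logic.core.indexer.index_reader import Index_reader
from Logic.core.indexer.indexes_enum import Indexes, Index_types

summaries_index = Index_reader("index/", index_name=Indexes.SUMMARIES).get_index()
tiered_stars_index = Index_reader(
    "index/", index_name=Indexes.STARS, index_type=Index_types.TIERED
).get_index()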

Logic.core.indexer.indexes_enum module#

class Index_types(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)#

Bases: Enum

DOCUMENT_LENGTH = 'document_length'#
METADATA = 'metadata'#
TIERED = 'tiered'#

class Indexes(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)#

Bases: Enum

DOCUMENTS = 'documents'#
GENRES = 'genres'#
STARS = 'stars'#
SUMMARIES = 'summaries'#

Logic.core.indexer.metadata_index module#

class Metadata_index(path='index/')#

Bases: object

create_metadata_index()#

Creates the metadata index.

get_average_document_field_length(where)#

Returns the average field length over all documents in the index.

Parameters:

where (str) – The field to get the document lengths for.
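
A minimal sketch of computing an average field length from a document-lengths dictionary like the one returned by DocumentLengthsIndex.get_documents_length (the numbers are made up):

# Hypothetical document lengths for one field.
document_lengths = {"tt0111161": 120, "tt0068646": 95, "tt0071562": 110}

average_length = sum(document_lengths.values()) / len(document_lengths)
print(average_length)  # 108.33...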

read_documents()#

Reads the documents.

store_metadata_index(path)#

Stores the metadata index to a file.

Parameters:

path (str) – The path to the directory where the indexes are stored.

Logic.core.indexer.tiered_index module#

class Tiered_index(path='index/')#

Bases: object

convert_to_tiered_index(first_tier_threshold: int, second_tier_threshold: int, index_name)#

Convert the current index to a tiered index.

Parameters:
  • first_tier_threshold (int) – The threshold for the first tier

  • second_tier_threshold (int) – The threshold for the second tier

  • index_name (Indexes) – The name of the index to read.

Returns:

The tiered index with the structure {"first_tier": dict, "second_tier": dict, "third_tier": dict}

Return type:

dict
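
A sketch of the tiering idea: split each term's postings by term frequency, sending the highest-tf documents to the first tier and so on. The thresholds and the greater-or-equal comparison are assumptions about how the tiers are cut:

# Hypothetical {term: {document_id: tf}} index.
index = {"good": {"d1": 9, "d2": 4, "d3": 1}}

def convert_to_tiered_sketch(index: dict, first_tier_threshold: int, second_tier_threshold: int) -> dict:
    tiers = {"first_tier": {}, "second_tier": {}, "third_tier": {}}
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            if tf >= first_tier_threshold:
                tier = "first_tier"
            elif tf >= second_tier_threshold:
                tier = "second_tier"
            else:
                tier = "third_tier"
            tiers[tier].setdefault(term, {})[doc_id] = tf
    return tiers

print(convert_to_tiered_sketch(index, first_tier_threshold=5, second_tier_threshold=2))
# {'first_tier': {'good': {'d1': 9}}, 'second_tier': {'good': {'d2': 4}}, 'third_tier': {'good': {'d3': 1}}}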

store_tiered_index(path, index_name)#

Stores the tiered index to a file.