- Find the movie dataset
- Configure an EC2 web server provided by AWS. This includes installing Apache, MySQL, phpMyAdmin, and a Python interpreter (for later development).
- Deploy the movie dataset into the MySQL DBMS.
- Develop the front-end page in HTML and CSS.
- Develop the back end in PHP. PHP accesses the data stored in MySQL and calculates the TF-IDF values for each movie's overview, so that each movie is assigned an n-dimensional vector. When a user types in keywords, PHP converts the query into a new TF-IDF vector. Finally, we calculate the similarity between the user query and each movie from their TF-IDF vectors and return all results in descending order of similarity.
Scripts on GitHub: GitHub Pages
TF-IDF is short for Term Frequency-Inverse Document Frequency. It is a weighting scheme used mostly for ranking in search engines. It treats each unique word in a document as a term. In our project, we produce a TF-IDF vector for each movie description. Each TF-IDF vector records how important a given word is to that document. As the name indicates, it consists of two components, TF and IDF.
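Compute TF: Term Frequency

The term frequency (TF) measures how frequently a term appears in a document. We compute it using the standard formula:

$$\mathrm{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$

Notice that the more often the term appears in the document, the higher the TF score.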
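As a rough illustration of the term frequency step, here is a minimal Python sketch (the project's back end is PHP; Python is used here only for brevity). It assumes the common length-normalized definition, count divided by total terms, and a naive whitespace tokenizer. The names `tokenize` and `term_frequency` are hypothetical and do not come from the project scripts.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; a real tokenizer
    # would also strip punctuation.
    return text.lower().split()


def term_frequency(text):
    """Return {term: count / total_terms} for a single document."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# "day" appears 2 times out of 7 tokens.
tf = term_frequency("a beautiful day is a good day")
```

Under this definition, a term that occurs once in a long overview gets a lower TF than the same term occurring once in a short overview.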
Compute IDF: Inverse Document Frequency

The inverse document frequency (IDF) measures how rare a term is across all documents. We compute it using the standard formula:

$$\mathrm{IDF}(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$

Through IDF, we penalize common words by reducing their weight. For example, common words like “the”, “is”, or “a” may appear in all of the documents; they carry almost no importance, which means they are not relevant to a user query.
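The penalty on common words can be seen in a small Python sketch. It assumes the unsmoothed natural-log form, log(N / df), which may differ from the variant used in the project's PHP code. The function name and the three sample overviews are made up for illustration.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: log(N / df)}, where df counts documents containing the term."""
    n_docs = len(documents)
    doc_freq = {}
    for text in documents:
        # set() so a term is counted at most once per document
        for term in set(text.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical overviews: "the" occurs in every one of them.
overviews = [
    "the beautiful day",
    "the long night",
    "the day after",
]
idf = inverse_document_frequency(overviews)
```

In this toy corpus, “the” appears in all three documents, so its IDF is log(3/3) = 0, while “beautiful” appears in only one and receives the largest weight.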
Combined TF-IDF score

Finally, we compute the TF-IDF relevance score of a term by multiplying the two numbers above together:

$$\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$
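Putting the two components together, the following Python sketch builds one sparse TF-IDF vector per document. It assumes length-normalized TF and unsmoothed log IDF. `tfidf_vectors` is a hypothetical helper, not the project's PHP implementation.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vectors, idf); each vector is a sparse {term: tf * idf} dict."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors, idf


# Hypothetical overviews used only for illustration.
vectors, idf = tfidf_vectors([
    "the beautiful day",
    "the long night",
    "the day after",
])
```

Storing each vector as a dictionary keeps only the non-zero entries, which suits short overviews drawn from a large vocabulary.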
Implement cosine similarity

After this, we have a sparse matrix covering all terms and all documents (movie descriptions). When a user submits a query, we treat it as a new vector; computing the cosine similarity between the query vector and every document vector completes the search. We return all movie titles whose similarity value is greater than zero, in descending order of similarity.
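The ranking step could look roughly like the Python sketch below. The vectors are hand-made sparse dictionaries with invented weights, and the movie titles are placeholders. Because cosine similarity never exceeds one, the sketch keeps every score above zero, which is an assumption about the intended cutoff.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank(query_vec, movie_vecs):
    """Return (title, score) pairs with score > 0, highest score first."""
    scored = [(title, cosine_similarity(query_vec, vec))
              for title, vec in movie_vecs.items()]
    hits = [(title, score) for title, score in scored if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical TF-IDF weights, for illustration only.
movies = {
    "Movie A": {"beautiful": 0.5, "day": 0.5},
    "Movie B": {"beautiful": 0.2, "night": 0.9},
    "Movie C": {"car": 1.0},
}
query = {"beautiful": 0.7, "day": 0.7}
results = rank(query, movies)
```

Here "Movie A" points in the same direction as the query and ranks first, "Movie B" shares only one term and ranks second, and "Movie C" shares no terms and is dropped.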
This is the main movie search page. Taking the keywords “beautiful day” as an example, the page returns a list of results, each consisting of a movie title and overview, in ranked order. The top-ranked results clearly contain all or part of the keywords, “beautiful” and “day” in this example. The bottom-ranked results also contain all or part of the keywords (in most cases part), but their overviews apparently consist of more words/terms than the top ones. This indicates that our term frequency mechanism works.