Please refer to the previous blog for data acquisition and processing as well as sever and DBMS configuration. For the classifier developing, SVM and Naive Bayes have been tested on the movie dataset. Due to the similar accuracy and better time complexity for Naive Bayes, we finally developed the classifier by Naive Bayes. Html/CSS, Php and Python are used for developing the classifier. Procedure are listed as following,
Since this classifier is developed depended on LAMP, runing python directly on Apache is not a good choice. We use an intermediate php script to call the python script running which is similiar to run python through bash.
The Naive Bayes Classifier was trained in python by movie dataset, and using 10% to test the accuracy of the classifier. The movie dataset include two numeric columns, budget and revenue, one character column, genre. We use the equation of Naive Bayes to calculate the probability of a movie with certain budget and revenue belonging to certain genres, and pick up the genre to be the movie genre which has the largest probability. For example, if we want to check if a movie is action with budget $15,000,000 and revenue $356,296,601, we can calculate the probability by the formula below. We also have to calculate other probabilities for other genres with the same budget and revenue. If probability of action is the highest, we conclude a movie with budget $15,000,000 and revenue $356,296,601 belong to genre action.
We save the trained classifier as a file for future quick access.
A simple webpage is developed by html/css. A php script developed for handling user input and call another python script to run the classifier retrieving final genre.
Scripts on Github: GitHub
Naive Bayes Algorithm
Bayes theorem provides a way of calculating the posterior probability, P(c|x), from P©, P(x), and P(x|c). Naive Bayes classifier assume that the effect of the value of a predictor (x) on a given class © is independent of the values of other predictors. This assumption is called class conditional independence.
In our case, we want to calculate the probability for a given genre by providing budget and revenue. The equation can be written as: The equation can be further rewrite as: Since our attribute is not a category, we need to use the distribution of the numerical variable to have a good guess of the frequency. We take the drama and action budget distribution as an example. From the two chart above, we can see action movie is prone to have a higher budget compare to drama.
This is the main page for user typing. After user type in budget and revenue, it will return the most likely genre by the giving two values along with the probability.