Document Topic Prediction


Numerous sources on the Internet generate massive amounts of news every day. As user demand for information grows, news must be classified so that users can reach relevant information quickly and effectively. A machine learning model for automated news classification can be used to identify the topics of untracked news and/or to make personalised recommendations based on a user's prior interests. The goal of this project is to create models that take the content of a news article as input and output the corresponding news topic.


DATA

The dataset consists of 2225 raw text documents from the BBC news website, corresponding to stories in five topical areas (business, entertainment, politics, sport, tech) during 2004-2005.
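
The raw documents can be gathered into a single CSV with topic, title and content columns. The sketch below is only an illustration: it assumes one sub-folder per topic and the title on the first line of each file, neither of which is stated here, and the function name `build_csv` is hypothetical.

```python
import csv
from pathlib import Path


def build_csv(raw_dir, out_csv):
    """Combine per-topic .txt files into one CSV (topic, title, content).

    Assumed layout (not confirmed by the project): raw_dir/<topic>/<name>.txt,
    with the article title on the first non-empty line of each file.
    """
    rows = []
    for path in sorted(Path(raw_dir).glob("*/*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines:
            continue  # skip empty files
        rows.append({
            "topic": path.parent.name,        # folder name is used as the label
            "title": lines[0],
            "content": " ".join(lines[1:]),   # remaining lines form the body
        })

    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["topic", "title", "content"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `build_csv("raw", "news.csv")` would return the number of documents written; both paths are placeholders.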

Technical Aspect

This is a simple Flask app for news topic prediction based on NLP and machine learning. The trained model (web/model.pkl) takes text as input and predicts the news topic.

Tools: Python (Flask, NumPy, Pandas, NLTK)

The project is implemented in the following steps.
  1. Data Preparation
    • Combined data from multiple .txt files into a single CSV file with 3 columns (topic, title and content).
  2. Data Preprocessing
    • The textual data (document content) was preprocessed by removing unnecessary characters and stopwords. After lemmatization, a bag-of-words model was used to vectorize the text. The vectorizer was saved as a pickle file for use in the Flask app.
  3. Model Building
    • Split the feature-extracted dataset into a train set and a test set.
    • Trained 3 models (Multinomial NB, SVM and Decision Tree) for making predictions. The best model was Multinomial NB (highest accuracy and fewest misclassifications), which was saved as a pickle file.
  4. Building a Flask web app for the end user
    • Used the model.pkl file to predict the news topic.
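
Steps 2 and 3 can be illustrated with a small, dependency-free sketch. It is not the project's code: the stopword list is a tiny stand-in for a fuller one such as NLTK's, lemmatization is omitted, the multinomial naive Bayes is written out by hand rather than taken from a library, and the documents and function names are invented for the example.

```python
import math
import pickle
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; a real run would use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "was", "for", "by"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens only, drop stopwords (no lemmatization)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_vocabulary(docs):
    """Map every word seen in the training documents to a column index."""
    words = sorted({tok for doc in docs for tok in preprocess(doc)})
    return {word: i for i, word in enumerate(words)}


def vectorize(doc, vocab):
    """Bag of words: a vector of word counts; words outside the vocabulary are ignored."""
    row = [0] * len(vocab)
    for tok, n in Counter(preprocess(doc)).items():
        if tok in vocab:
            row[vocab[tok]] = n
    return row


def train_nb(X, y, alpha=1.0):
    """Fit multinomial naive Bayes with add-alpha smoothing; returns plain dicts."""
    model = {"classes": sorted(set(y)), "log_prior": {}, "log_likelihood": {}}
    for c in model["classes"]:
        rows = [x for x, label in zip(X, y) if label == c]
        model["log_prior"][c] = math.log(len(rows) / len(X))
        totals = [sum(col) + alpha for col in zip(*rows)]  # smoothed word counts
        denom = sum(totals)
        model["log_likelihood"][c] = [math.log(t / denom) for t in totals]
    return model


def predict_nb(model, x):
    """Return the class with the highest log posterior for one count vector."""
    def score(c):
        return model["log_prior"][c] + sum(
            n * lp for n, lp in zip(x, model["log_likelihood"][c]))
    return max(model["classes"], key=score)


# Toy train/test split (invented documents, for illustration only).
train_docs = [
    "The striker scored in the cup final",
    "The team won the league match",
    "Shares rose as the bank reported profits",
    "The company said profits and shares fell",
]
train_topics = ["sport", "sport", "business", "business"]
test_docs = ["The team scored in the match", "The bank said shares fell"]
test_topics = ["sport", "business"]

vocab = build_vocabulary(train_docs)
model = train_nb([vectorize(d, vocab) for d in train_docs], train_topics)
predictions = [predict_nb(model, vectorize(d, vocab)) for d in test_docs]
accuracy = sum(p == t for p, t in zip(predictions, test_topics)) / len(test_topics)

# Serialise vocabulary and model together, then restore them the way a web app
# would at start-up (the project writes pickle files; bytes are used here instead).
restored = pickle.loads(pickle.dumps({"vocab": vocab, "model": model}))
```

Keeping the vocabulary and the model parameters in the same pickle mirrors the point made in step 2: the text must be vectorized at prediction time exactly as it was during training.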

Demo


