Use the fake-train.tab data set and load it with the Corpus widget. The data set contains 2725 news articles from 2016, each labelled either REAL or FAKE.

Load the data and preprocess it. Use the default settings, but add Normalization (with the Lemmagen lemmatizer) between Tokenization and Filtering. Important: under Filtering, check Most frequent tokens and set the value to 1000. This keeps only the 1000 most frequent tokens, which is necessary for speed; otherwise your computer might choke.
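The whole exercise is done in the Orange canvas and needs no code. If you prefer scripting, the sketch below is a rough approximation of the Corpus + Preprocess Text steps with pandas, NLTK and scikit-learn rather than Orange's own preprocessors. The column names "text" and "label" are assumptions (check the actual header of fake-train.tab), Lemmagen is replaced by NLTK's WordNet lemmatizer for illustration only, and stopword removal from the default Filtering is left out for brevity.

```python
# A rough scripted approximation of the Corpus + Preprocess Text widgets.
# Assumed column names: "text" for the document body, "label" for the class.
import re

import pandas as pd
from nltk.stem import WordNetLemmatizer  # requires a one-time nltk.download("wordnet")
from sklearn.feature_extraction.text import CountVectorizer

# Orange .tab files have two extra header rows (attribute types and flags)
# right after the column names, so skip them when loading with pandas.
data = pd.read_csv("fake-train.tab", sep="\t", skiprows=[1, 2])
texts, labels = data["text"], data["label"]

lemmatizer = WordNetLemmatizer()

def lemmatized_tokens(doc):
    # Lowercase, tokenize on letter sequences, then lemmatize each token.
    return [lemmatizer.lemmatize(tok) for tok in re.findall(r"[a-z]+", doc.lower())]

# max_features=1000 plays the role of "Most frequent tokens = 1000" in Filtering.
vectorizer = CountVectorizer(tokenizer=lemmatized_tokens, max_features=1000)
X = vectorizer.fit_transform(texts)
print(X.shape)  # (number of documents, 1000)
```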

  1. Try Bag of Words with the Count option and observe the Logistic Regression model (in terms of accuracy and explainability). Compare it to the model built with the TF-IDF transformation. Which one is better, and why? (A scripted sketch of this comparison is shown after the list.)

  2. Now explore which words are significant for FAKE and REAL news with Word Enrichment. Are the results different from those of Logistic Regression? If yes, why? If no, why not? (A hand-rolled enrichment sketch is also shown after the list.)
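
For the first question, a minimal scripted comparison might look like the sketch below. It reuses `texts` and `labels` from the preprocessing sketch, skips the lemmatization step for brevity, and the 10-fold cross-validation is my own choice; in Orange you would use Bag of Words, Logistic Regression and Test and Score instead. Treat it as an approximation of the widget workflow, not the exact thing.

```python
# Compare Count vs. TF-IDF bag-of-words features with Logistic Regression,
# roughly mirroring the Bag of Words -> Logistic Regression -> Test and Score flow.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

for name, vec in [("Count", CountVectorizer(max_features=1000)),
                  ("TF-IDF", TfidfVectorizer(max_features=1000))]:
    pipe = make_pipeline(vec, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, texts, labels, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")

# For explainability, fit on all the data and inspect the model's coefficients.
pipe = make_pipeline(CountVectorizer(max_features=1000),
                     LogisticRegression(max_iter=1000))
pipe.fit(texts, labels)
words = pipe.named_steps["countvectorizer"].get_feature_names_out()
coefs = pipe.named_steps["logisticregression"].coef_[0]
ranked = sorted(zip(coefs, words))  # pipe.classes_ tells which class positive coefficients favour
print("words pulling towards one class:", [w for _, w in ranked[:10]])
print("words pulling towards the other:", [w for _, w in ranked[-10:]])
```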
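
For the second question: Word Enrichment essentially asks, for each word, whether it appears in the selected documents (e.g. the FAKE ones, picked with Select Rows) more often than chance would predict, using a hypergeometric test. The sketch below reproduces that idea by hand for the FAKE subset; the 1000-word vocabulary and the variable names are my own illustration, and the FDR correction that the widget also reports is omitted here.

```python
# Hand-rolled word enrichment: for each word, test whether it occurs in more
# FAKE documents than expected by chance, using a hypergeometric test.
import numpy as np
from scipy.stats import hypergeom
from sklearn.feature_extraction.text import CountVectorizer

binary_vec = CountVectorizer(max_features=1000, binary=True)  # document frequencies
X = binary_vec.fit_transform(texts).toarray()
words = binary_vec.get_feature_names_out()

fake = (labels == "FAKE").to_numpy()
N = X.shape[0]               # all documents in the corpus
K = int(fake.sum())          # documents in the selected (FAKE) subset
n = X.sum(axis=0)            # documents containing each word
k = X[fake].sum(axis=0)      # FAKE documents containing each word

# P(at least k of the K FAKE documents contain the word), under the null
# hypothesis that word occurrence is independent of the FAKE/REAL label.
p_values = hypergeom.sf(k - 1, N, n, K)

for i in np.argsort(p_values)[:15]:
    print(f"{words[i]:<15} p = {p_values[i]:.2e}  ({k[i]}/{n[i]} docs are FAKE)")
```

Words with very small p-values here should largely match the Word Enrichment output; compare them with the strongest Logistic Regression coefficients when answering the question.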
