Data Mining (HSE)
| Topic | Name | Description | 
|---|---|---|
| Every section below contains a few papers (or even Wikipedia pages) that are easy to read, without much math. Some students asked more advanced questions (e.g. about the relation between Bayesian modeling and logistic regression), so here is a list of some more advanced books. | ||
| Exam | ||
| Visualizations (and Getting to Know Orange) | ||
| (If the video doesn't play: it happens to me, too. There seems to be something wrong on YT's side. I hope it resolves; otherwise I'll reupload.) | ||
| A blog post about how statistical tests work and why we have to be very careful when using them in data mining. | ||
| A shorter and drier version of the same. | ||
| A famous, juicy paper with general arguments against null-hypothesis testing. | ||
| Introduction to predictive modelling | ||
| The page contains the crux of the lecture. Its title and the fact that the first link on the page points to a completely unrelated kind of decision trees demonstrate why classification tree is a better term than decision tree. [Mandatory reading, with a grain of salt; you most certainly don't need to know that "It is also possible for a tree to be sampled using MCMC." :) ] | ||
| We have spent a lot of time explaining the concept of entropy. The Wikipedia page is rather mathematical and dry, but may be a good antidote to the less formal and less organized exposition at the lecture. :) | ||
| Quinlan is the author of one of the first and most influential algorithms for induction of classification trees. The article is more of historical interest, but it shows the thinking of the pioneers of AI. After some philosophy in the first two sections, it explains the reasoning behind the tree-induction algorithms. | ||
| Model Performance | ||
| Lists all kinds of scores; useful as a reference. | ||
| Use this page as a list of different sampling techniques. | ||
| A very accessible paper about ROC curves. | ||
| Linear models for classification | ||
| A more mathematical (compared to our lecture), but still friendly explanation of logistic regression. Read the first 6 pages, that is, section 12.1 and the (complete) section 12.2. You don't have to know the formulas, but you need to understand their meaning. (This is Chapter 12 from Advanced Data Analysis from an Elementary Point of View. Download the draft from the author's site while it's free.) | ||
| A quick derivation of the Naive Bayesian classifier, and derivation and explanation of nomograms. | ||
| Other types of classifiers | ||
| The best-known book about kernel methods like SVM. Warning: lots of mathematics. Not required reading for this class. | ||
| In contrast to SVM, random forests are so easy to explain and understand that they don't require additional literature. But if anybody is interested, here's the definitive paper about them. | ||
| ... and this is the paper from the lesser-known inventor of the method. It was Breiman (above paper), though, who thoroughly examined the method and gave it a name. Neither this paper nor the one above is required reading. | ||
| Regularization | ||
| We are just telling you about this book because we must do it at some point. It is too difficult for this course, but it provides an overview of machine learning methods from a statistical point of view. Chapters 4 and 5 should not be too difficult, and you can read them to better understand linear models and regularization. You can download the book for free. | ||
| Clustering | ||
| Obligatory reading: sections 8.2 (you may skip 8.2.6), 8.3 (skip 8.3.3), The Silhouette Coefficient (pg. 541). Everything else is also quite easy to read, so we recommend it. | ||
| Text mining | ||
| Comprehensive overview of text mining techniques and algorithms. [obligatory] | ||
| Why regular expressions can be very helpful. [optional read] | ||
| Why using TF-IDF is a good idea. [technical, interesting read] | ||
| Notes from the text mining lecture. | ||
| A book on sentiment analysis and opinion mining. Freely available at the link. | ||
| Projections and embeddings | ||
| The chapter is particularly interesting because of some nice examples at the end. | ||
| See the example in the introduction. You can also read the Methods section, if you're curious. | ||
| Embeddings ... and a practical case | ||
| Assignments | ||