Project # 2
(Due Date: 31st December, 2012)

You are required to analyze a real-world data mining problem using techniques studied in this course. The problem can be related to banks, hospitals, cinemas, restaurants, super-stores, academic institutions, etc. If you can't find data on your own, you can pick any competition of interest from the following website:

http://www.kaggle.com/

The website lists several competitions posted by real companies, who also pay a cash reward for the best solution to their problem. The competition deadlines posted on Kaggle may differ, but you need to make sure that you submit your Project # 2 to me by 31st of December.

In addition to submitting a comprehensive report describing your analysis, you will be required to give a detailed presentation (10-15 mins.) on the problem you chose, the nature of the data and how you cleaned/prepared it, and your findings. The presentations will be held on the 1st of January, 2013, but you are required to submit your report by 31st December (midnight).

As with Project 1, this is a group-based project (max. 2 persons), but it is also fine to do it alone. The report will be submitted via Turnitin, and IBA's zero-tolerance policy towards plagiarism applies. Any two reports found similar will result in a straight F for both groups, and further action will be decided by the Examination department.


Project # 1
(Due Date: 19th November, 2012)

You are required to analyze a data set posted on Kaggle.com (the premier platform for data mining competitions). The competition is about predicting whether a car purchased by a car dealer is a "Good Buy" or a "Bad Buy". You can read more about the data set at
https://www.kaggle.com/c/DontGetKicked

The data set (provided below) is a slight modification of the original data set. The training and testing data sets have approximately 50,000 and 30,000 records, respectively.

Training Data:

Testing Data:

Variable List:

You need to explain every step you took to clean, normalize, and discretize the data: how you performed feature selection and why you decided to retain or remove certain features; why you chose a particular discretization method (if you discretized the data at all) or why you did not feel the need to discretize it; how missing values were handled; and so on.
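If you prefer to script part of the preparation rather than doing everything in KNIME, a minimal R sketch of median imputation and equal-frequency discretization is given below. The file name training.csv and the column name VehOdo are assumptions based on the original Kaggle data; adjust them to whatever the provided files actually contain.

# Minimal R sketch: median imputation and equal-frequency discretization.
# "training.csv" and "VehOdo" are assumed names; change them to match the
# provided data set.
train <- read.csv("training.csv", stringsAsFactors = TRUE)

# Replace missing values in every numeric column with that column's median
num_cols <- names(train)[sapply(train, is.numeric)]
for (col in num_cols) {
  train[[col]][is.na(train[[col]])] <- median(train[[col]], na.rm = TRUE)
}

# Discretize the odometer reading into four equal-frequency bins
breaks <- unique(quantile(train$VehOdo, probs = seq(0, 1, 0.25), na.rm = TRUE))
train$VehOdoBin <- cut(train$VehOdo, breaks = breaks, include.lowest = TRUE)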



The evaluation of different models (with different combinations of features) will be based on the Area Under the Curve (AUC) of the ROC Curve, an option available in KNIME. But feel free to make use of any other available tools, be it MS Excel, Weka, SQL Server, R, etc. Just make sure that they are mentioned in your explanation.
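For those who use R instead of KNIME's ROC Curve view, a minimal sketch with the pROC package follows; the label and score vectors are toy placeholders for your actual test labels and predicted probabilities.

library(pROC)

# Toy stand-ins: replace with the true class labels of the test records and
# the predicted probabilities produced by your model
set.seed(1)
labels <- rbinom(200, 1, 0.3)
scores <- runif(200)

roc_obj <- roc(labels, scores)
auc(roc_obj)    # Area Under the ROC Curve used for evaluation
plot(roc_obj)   # visual comparison of competing models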



In addition to the AUC value from the ROC Curve, also make use of the F-measure for both classes (bad buys and good buys) when deciding on the best model.
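The per-class F-measure is reported by KNIME's Scorer node, or it can be computed directly; a minimal R sketch (with toy label/prediction vectors standing in for your real ones) is:

# F-measure (F1) for one class, computed from true labels and predictions
f_measure <- function(actual, predicted, positive) {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# Toy data: 1 = bad buy, 0 = good buy; replace with your model's output
set.seed(1)
actual    <- rbinom(100, 1, 0.3)
predicted <- rbinom(100, 1, 0.3)

f_measure(actual, predicted, positive = 1)   # F-measure for the bad-buy class
f_measure(actual, predicted, positive = 0)   # F-measure for the good-buy class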



You will submit a report describing your analysis. A few sample research papers posted in the Reading List section will make it clear how to present your findings. Also prepare a short presentation (max. 5 mins.) that you will deliver in class. The deadline for the report, however, is 19th November.



It is a group-based project (max. 2 persons), but it is also fine to do it alone. The report will be submitted via Turnitin, and IBA's zero-tolerance policy towards plagiarism applies. Any two reports found similar will result in a straight F for both groups, and further action will be decided by the Examination department.

Assignment # 1
(Due Date: October 23, 2012)

You need to present a case study (or a research paper) describing a data mining application. You can work in a group of 2. I have posted several case studies on the "Reading List" page of this wiki, but feel free to search Google or any other database to find a case study of your own choice. You will have 10 minutes to present your findings in front of the whole class.

Groups and Topics:

1. Maira Ata, Uzma Zehra - Using A Decision Tree and Neural Net to Identify Severe Weather Radar Characteristics
2. Nisar Hussain, Umair Hakeem - Data Mining Application in Enrollment Management: A Case Study
3. Daniyal Khaliq, Nayyar Mashkoor - External Search Marketing Program: A Return on Investment Approach
4. Erum Shahid, Mehwish Khatri - Data Mining in Mobile Communication
5. Nasrullah Khan - Cloud based Social and Sensor Data Fusion
6. Jehanzaib Chaudhri - Data Mining and Wireless Sensor Network for Agriculture Pest Prediction
7. Muhammad Zubair Noor - Combining Naive Bayes and Decision Tree for Adaptive Intrusion Detection
8. Muhammad Hassan-ur-Rehman - Neural Network Model for the Prediction of Thrombo-embolic Stroke
9. Taimoor Waseem - Naive Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages
10. Syed Aamir Ali - Analysis of Heart Diseases Dataset using Neural Network Approach
11. Usman Arif - Data Mining Techniques for Detection of Fraudulent Financial Statements
12. Daniyal Khan - Minnesota Intrusion Detection System