CSE4DMI Data Mining Sem 2 2014, Assignment One
20 Marks (Due Thursday 4 September 2014, 9:30am)

Copying, Plagiarism: Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. The Department of Computer Science and Computer Engineering at La Trobe University treats plagiarism very seriously. When it is detected, penalties are strictly imposed.

This is an INDIVIDUAL assignment.

Part I (10 marks)

In this part, we are going to build a decision tree classifier to predict the ages of abalones from their measurements. The dataset can be found in the CSV file abalone.csv.

1. Before creating the classifier, the abalone dataset has to be pre-processed according to the following criteria:

   1) The sex attribute has three possible values (M, F and I). Encode them as integers: M = 1, F = 2 and I = 3. (1.5 marks)

   2) The rings attribute is the number of rings an abalone has. The age can be estimated as the number of rings + 1.5. From this estimated age, each sample is assigned to an age group according to the table below:

      Age                     Class label
      <= 5.5                  <= 5.5
      > 5.5 and <= 8.5        (5.5, 8.5]
      > 8.5 and <= 11.5       (8.5, 11.5]
      > 11.5 and <= 14.5      (11.5, 14.5]
      > 14.5 and <= 17.5      (14.5, 17.5]
      > 17.5 and <= 20.5      (17.5, 20.5]
      > 20.5                  > 20.5

      Replace the rings column with the age group column. The age group attribute is the class label. (Hint: You can use any method to set the age group for each sample, including an Excel formula or a MATLAB script.) (1.5 marks)

After pre-processing, the dataset is divided into a training dataset and a testing dataset. Download the program “DataSplit.exe” and execute it. Enter your student ID and specify the locations of the dataset file and the destination folder. The dataset will be split for you when you click the “OK” button. Note that your training and testing datasets are unique to you, so make sure you enter your student ID correctly.

Show only your pre-processed training and testing datasets.
Only the first 20 rows of each dataset are required in your answer. Also, please submit your MATLAB source code (as a MATLAB script file) with the assignment answer. (2 marks) No marks will be given to your answer unless the relevant source code is submitted.

2. Load both the training and testing datasets from Q1 into the MATLAB workspace. It is recommended to separate the class label (i.e. the age group attribute) from the other attributes, so that all the class labels of a dataset are stored in one matrix. As a result, there are four matrices after the import process: two for the attribute values of the two datasets, and two for the class labels of these datasets.

   a. Build a decision tree classifier (using the age group attribute as the class label). Show the decision tree. (1 mark)

   b. Use the built classifier to predict the age groups for the samples in the testing dataset. Show the predicted class labels for the first 20 rows of the testing dataset. (1 mark)

   c. Using the testing dataset, evaluate the error rate, sensitivity, specificity, and the confusion matrix. (1 mark)

Please submit your MATLAB source code with the assignment answer. (2 marks) No marks will be given to your answer unless the relevant source code is submitted.

Part II

3. The table below shows, for each interviewee, the education level, the annual income, and whether they are currently continuing their education:

   ID    Education level    Annual Income    Continuing Education?
   1     Tertiary           35000            Y
   2     Secondary          28000            N
   3     Secondary          40000            Y
   4     Tertiary           52000            N
   5     Postgrad           31000            Y
   6     Secondary          47000            N
   7     Secondary          22000            Y
   8     Secondary          19000            N
   9     Postgrad           22000            Y
   10    Tertiary           44000            Y
   11    Tertiary           20000            Y
   12    Postgrad           32000            N
   13    Secondary          62000            Y
   14    Postgrad           30000            Y
   15    Tertiary           55000            Y

   a. Calculate the Gini index for the education level attribute, with multi-way split. Show your steps. (1 mark)

   b. Calculate the Gini index for the annual income attribute, for each of the following split points:
      i. <= 25000 and > 25000
      ii. <= 35000 and > 35000
      iii. <= 45000 and > 45000
      iv. <= 55000 and > 55000
      Show your steps. (1 mark)

   c. Calculate the entropy for the education level attribute, with multi-way split. Show your steps. (1 mark)

   d. Using the results from (c), calculate the information gain for the education level attribute, with multi-way split. (1 mark)

   e. Explain why the attribute with the maximum information gain is selected as the splitting attribute, in terms of the physical meaning of information gain. (1 mark)

4. a. Plot the receiver operating characteristic (ROC) curves for classifiers 1 and 2 using the following information:

      Instance    Classifier 1 P1(1|A)    Classifier 2 P2(1|A)    True class
      1           0.28                    0.67                    0
      2           0.63                    0.81                    1
      3           0.44                    0.25                    0
      4           0.26                    0.6                     1
      5           0.36                    0.45                    0
      6           0.62                    0.39                    0
      7           0.71                    0.78                    1
      8           0.66                    0.17                    0
      9           0.94                    0.88                    1
      10          0.49                    0.73                    1

      Px(1|A) denotes the probability, computed by classifier x, that an instance belongs to class 1 given its attribute A. (2 marks)

   b. What does it mean when a segment of the ROC curve is below the diagonal? (1 mark)

   c. Calculate the area under the curve (AUC) for both curves. Which classifier is better? Why? (2 marks)

Dataset Reference:
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
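
Note on checking the hand calculations in Question 3: the steps can be cross-checked numerically. The following is a minimal Python sketch, used here purely for illustration (Part I still requires MATLAB). It assumes the standard definitions: the Gini index of a node is 1 minus the sum of squared class proportions, entropy is the negative sum of p * log2(p), and a multi-way split is scored by the sample-weighted average over its child nodes. The function names and the class counts are hypothetical and are not taken from the assignment table.

```python
from math import log2


def gini(counts):
    """Gini index of one node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of one node; empty classes contribute zero."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def split_impurity(children, measure):
    """Sample-weighted average impurity of the child nodes of a split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


def information_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child nodes."""
    return entropy(parent) - split_impurity(children, entropy)


# Hypothetical counts in the form [class Y, class N]; NOT the assignment data.
parent = [5, 5]
children = [[4, 1], [1, 4]]

print("Gini before split:", round(gini(parent), 4))
print("Gini after split: ", round(split_impurity(children, gini), 4))
print("Information gain: ", round(information_gain(parent, children), 4))
```

To check your own working, replace parent and children with the class counts you tallied by hand; each inner list holds the counts of one child node.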