Data Mining Assignment

Introduction

Data mining is the process of working with large data sets to identify patterns and establish relationships in order to solve problems through data analysis. In this assignment, you will design and implement a complete data mining (DM) processing pipeline and learn how to go from preprocessing through analysis to drawing insights from a given dataset. The assignment involves the following tasks:
1. Dataset Selection: Select one of the datasets listed below. Once you have chosen a dataset, study it in detail and identify the key questions you want to answer or the insights you want to draw from it.
   1. https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset
   2. https://www.kaggle.com/uciml/mushroom-classification
   3. https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
2. Data Preprocessing and Visualization: Perform exploratory data analysis with suitable visualizations, and identify and apply preprocessing techniques suited to the dataset. Implement at least 2 data preprocessing techniques studied in class, in addition to any data cleaning that is required, and show the results.
3. Data Analysis: Based on the insights you wish to draw from the dataset, identify the key DM tasks that apply to it, such as association analysis, clustering, classification, or outlier analysis. Implement at least 3 data analysis techniques, with their corresponding algorithms, studied in class and show the results.
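
As an illustration of what task 2 might look like when written from scratch, here is a minimal sketch of two common preprocessing steps: mean imputation of missing values and min-max normalization. Whether these two count among the techniques studied in class is an assumption, the function names are hypothetical, and the toy values are made up rather than taken from any of the datasets.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy column with one missing entry (illustrative values only).
column = [20.0, None, 30.0, 40.0]
filled = impute_mean(column)
scaled = min_max_scale(filled)
print(filled)
print(scaled)
```

In a notebook, each step would normally be followed by a short markdown cell and a plot showing the column before and after the transformation.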

Weightage of Individual Components

Total Assignment Weightage: 20%
1. Data Preprocessing and Visualization: 5%
2. Data Analysis (at least 3 data mining tasks learnt in class): 15%

Main Deliverable and Deadline

  1. The main deliverable is executable code in the form of Jupyter (IPython) notebooks, with detailed markdown text explaining the tasks performed and the results obtained, along with comments and plots. Notebooks should run smoothly and be self-contained, with all necessary information specified in them.
  2. Submissions must be made on CMS (https://elearn.bits-pilani.ac.in) by November 15th, midnight. No late submissions will be accepted.
  3. Only one team member should make the submission, and the team number and member information should be clearly mentioned at the beginning of the notebook.
  4. Please update the submission status clearly in the shared spreadsheet.

A Few Important Things to Remember

  1. Teams cannot be changed once decided; submissions from changed teams will not be accepted or marked.
  2. You must use Python for the project. Libraries may be used only for data handling and visualization; for data processing and analysis, standard libraries are not allowed, and you must provide implementations from scratch. We will run the notebooks for marking, and those that fail to produce the desired results/plots will be penalized, so set paths and similar settings accordingly.
  3. No email submissions will be accepted.
  4. Emails related to team formation will not be entertained.
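
To give a sense of what "from scratch" in rule 2 can mean in practice, here is a minimal sketch of k-means clustering written in plain Python with no analysis libraries. It is a generic illustration under stated assumptions, not the implementation from any submission: the function names are hypothetical, the centroids are naively initialised to the first `k` points, and the four sample points are made up.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100):
    """Cluster points into k groups; returns (centroids, labels)."""
    # Assumption: naive initialisation with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: give each point the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Made-up sample with two visually obvious groups.
sample = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, labels = k_means(sample, k=2)
print(centroids)
print(labels)
```

A real submission would likely want a better initialisation strategy and a way to choose `k`; both are left out here to keep the sketch short.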

Assignment Submission

The assignment submission can be found here (https://github.com/Akhilsudh/BITS-Assignment/tree/master/Semester%203/Data%20Mining).

The readme for this assignment:

 _____                _
|  __ \              | |               
| |__) |___  __ _  __| |_ __ ___   ___ 
|  _  // _ \/ _` |/ _` | '_ ` _ \ / _ \
| | \ \  __/ (_| | (_| | | | | | |  __/
|_|  \_\___|\__,_|\__,_|_| |_| |_|\___|


Data mining assignment by Group 13:
| Roll No.    | Name              |
| ----------- | ----------------- |
| 2021MT<xxx> | dolor sit amet    |
| 2021MT<xxx> | Akhil S           |
| 2021MT<xxx> | commodo consequat |
| 2021MT<xxx> | totam rem aperiam |


Make sure all the Python files submitted along with the notebook are placed in the same working directory as the notebook.
These Python files hold the custom implementations of all the algorithms used in this assignment.

The Python files the notebook depends on are:
1. GaussianNaiveBayes.py
2. DecisionTree.py
3. KMeansClustering.py
4. dbscan.py
5. lof.py

The dataset, taken from Kaggle, is the file healthcare-dataset-stroke-data.csv.
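
Since the notebook only runs when these files sit next to it, a quick check at the top of the notebook can catch a misplaced file early. The following is a minimal sketch of such a check; the helper `missing_files` is hypothetical and not part of the submission, and it assumes the dataset file also lives in the working directory.

```python
from pathlib import Path

# File names as listed in this readme.
REQUIRED_FILES = [
    "GaussianNaiveBayes.py",
    "DecisionTree.py",
    "KMeansClustering.py",
    "dbscan.py",
    "lof.py",
    "healthcare-dataset-stroke-data.csv",
]


def missing_files(directory="."):
    """Return the required files that are not present in the given directory."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


missing = missing_files()
if missing:
    print("Missing from working directory:", ", ".join(missing))
else:
    print("All required files are present.")
```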


Tags: !DMIndex Assignments