DYNAMISM(E) - Biannual Student Magazine 1

Phase 1: Questions

Are we asking the right questions? Questions are critical before we do any type of data analysis or analytics. Mistaking the type of question being considered is the most common error in data analysis. Please look here: http://science.sciencemag.org/content/early/2015/02/25/science.aaa6146.full In this article, Jeff Leek and Roger D. Peng argue for the importance of asking the right questions. Here, I summarize some of the key questions we might ask during our (big) data analytics process.

11 key questions before you start analysing your data

Data Source: What was the source of your data, or how was the data collected? Please read more on this topic: https://hbr.org/2015/10/the-two-questions-you-need-to-ask-your-data-analysts
Error structure: What type of error can you expect at this stage? Did you consider errors associated with the data, such as human error, sampling error or technical error?
Right Data: Is it the right data for your analysis? There has already been a lot of discussion on this aspect. Please look here: http://www.mckinsey.com/business-functions/business-technology/our-insights/three-keys-to-building-a-data-driven-strategy
Sample representation: How well do the sample data represent the population? Sometimes it is hard to get a grip on this aspect, but keeping it in the back of our minds helps with downstream analysis.
Number of samples/objects/individuals (n): How many samples (individuals/objects) are there? The number of samples affects further statistical analysis. We will discuss this part later on under "power analysis".
Number of features/variables (p): How many features are in the data set?
Number of samples (n) vs. number of features (p): What is the size of your data set? Is it a p>n or an n>p situation? If p>n (assuming features are in columns and rows represent samples), we call it wide data; otherwise, tall data.
Features/Variables: Are you interested in finding out which features or groups of similar features are important for prediction? Do you have dependent and/or independent variable(s)?
Samples: How are the samples related? Do you want to find structure or patterns in the samples?
Software: Which software are you going to use? R, Python or another scripting language, or an object-oriented programming language like Java or C++?
Data types: What is the type of your data? Is it qualitative or quantitative?

All of the above questions help you decide on the right method(s) for the downstream data analytics process. Please share your views on this aspect. In the next post, I will discuss phase 2 of the data analytics process.

In this part, I will discuss the initial pre-processing steps, mainly "missing value" treatment of the data set.

Why Data Pre-processing?

Data in the real world is dirty, noisy and incomplete, lacking attribute values or certain attributes of interest. It has been said that 80% of the time in data analysis (data analytics) is spent on cleaning and preparing the data. So, it is important to make the data as error-free as possible. We will discuss some of the pre-processing / data cleaning steps.

Missing value: It is a very common problem in data analysis (or data analytics). Data can go missing during a) data extraction, b) data collection or c) data documentation.
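
To make a few of these questions concrete (n, p, wide vs. tall, data types and missing values), here is a minimal sketch in Python using only the standard library. The function name summarize, the example rows and the choice to treat None and the empty string as "missing" are all assumptions made for illustration; a real data set may encode missing values differently (for example as "NA" or -999).

```python
def summarize(rows, missing_markers=(None, "")):
    """Summarize a data set given as a list of dicts (one dict per sample).

    Hypothetical helper: None and "" are assumed to mark missing values.
    """
    # Features are the union of all keys seen across samples.
    features = sorted({name for row in rows for name in row})
    n, p = len(rows), len(features)

    types, missing = {}, {}
    for name in features:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v not in missing_markers]
        missing[name] = len(values) - len(present)
        if not present:
            types[name] = "unknown"
        elif all(isinstance(v, (int, float)) and not isinstance(v, bool)
                 for v in present):
            types[name] = "quantitative"
        else:
            types[name] = "qualitative"

    return {
        "n": n,
        "p": p,
        # Wide data if p > n, otherwise tall data.
        "layout": "wide" if p > n else "tall",
        "types": types,
        "missing": missing,
    }


# Made-up example rows, not real data.
rows = [
    {"age": 34, "city": "Pune", "income": 52000.0},
    {"age": None, "city": "Delhi", "income": 61000.0},
    {"age": 29, "city": "", "income": None},
]

summary = summarize(rows)
print(summary)
```

A summary like this does not answer the questions about data source, error structure or sample representation; those still need knowledge of how the data was collected.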
