ExploriPy

Exploratory Data Analysis (EDA) is a crucial step in data science: it generates the insights and statistical measures needed to build predictive models. EDA is time-consuming and requires a thorough analysis of a dataset to summarize its main characteristics. An initial analysis of the data is always needed, followed by a deeper, domain-specific analysis based on the initial insights. Currently, there is no comprehensive Python library that performs the initial data analysis and statistical tests and presents the results in an output that can be easily interpreted and shared with stakeholders. Although several individual packages are available for statistical tests, interpreting their output requires a certain level of statistical knowledge.
ExploriPy significantly reduces a data analyst's effort in the initial EDA. It performs automated EDA and statistical tests, including Analysis of Variance, the Chi-Square Test of Independence, Weight of Evidence, Information Value, and Tukey's Honestly Significant Difference test. It provides an easy interpretation of these test results, based on industry-standard assumptions. It expects a Pandas DataFrame, along with a list of categorical variables, as input. The output is a presentable HTML document containing the results of the analysis and statistical tests, shown through interactive charts and tables (with an option to download as CSV). The ExploriPy package is available on the Python Package Index.

Installation Steps
Usage
Parameters
Output
    List of Features
    Null Values
    Target Variable
    Categorical Vs Target
    Continuous Vs Target

Installation Steps

pip install ExploriPy

Usage

import pandas as pd
from ExploriPy import EDA

# Load the training data, treating the string 'nan' as a missing value
df = pd.read_csv('BigMartSales_Train.csv', na_values='nan')
CategoricalFeatures = ['Item_Identifier', 'Outlet_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']
eda = EDA(df, CategoricalFeatures, OtherFeatures=['Outlet_Establishment_Year'], title='Exploratory Data Analysis for Big Mart Sales III - Based on Item_Outlet_Sales')
eda.TargetAnalysis('Item_Outlet_Sales')  # For Target Specific Analysis

Parameters

Parameter for TargetAnalysis:

Output

The output of the package is an HTML file with the following features.

List of Features

Null Values

Percentage of null values in each column. The same data is also shown as a bar chart.
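
To make the measure concrete, here is a minimal sketch of a per-column null percentage computed with plain pandas. It is not ExploriPy's own code: the helper name null_percentages and the toy data are hypothetical, and the package may compute or round the figure differently.

```python
import pandas as pd


def null_percentages(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of null values per column, highest first.

    Hypothetical helper for illustration; not part of ExploriPy.
    """
    # isna() flags missing cells; the mean of the booleans is the null fraction
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Toy data (made up for this sketch, not the Big Mart dataset)
toy = pd.DataFrame({
    "Item_Weight": [9.3, None, 17.5, None],
    "Outlet_Size": ["Medium", None, "Small", "High"],
    "Item_Type": ["Dairy", "Snacks", "Dairy", "Meat"],
})

null_pct = null_percentages(toy)
print(null_pct)
```

The resulting Series can be passed to any plotting library to reproduce a bar chart similar to the one in the report.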

Target Variable

Info displayed for Target Variable:

For Categorical Target Variable:

For Continuous Target Variable:

Categorical Vs Target

List of the top 30 categories, with their count and percentage, for every categorical variable.
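
A table of this kind can be approximated with pandas as sketched below. The helper name top_categories is hypothetical, and the sketch assumes the percentage is taken over all rows, including rows where the value is null; the report itself may use a different denominator.

```python
import pandas as pd


def top_categories(series: pd.Series, n: int = 30) -> pd.DataFrame:
    """Count and percentage of the n most frequent categories.

    Hypothetical helper for illustration; not part of ExploriPy.
    """
    # value_counts() sorts by frequency (descending) and skips nulls by default
    counts = series.value_counts().head(n)
    return pd.DataFrame({
        "count": counts,
        # Assumption: percentage is relative to all rows in the column
        "percentage": counts / len(series) * 100,
    })


# Toy column (made up for this sketch)
item_type = pd.Series(["Dairy", "Snacks", "Dairy", "Meat", "Dairy"])
summary = top_categories(item_type, n=2)
print(summary)
```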

For Target Categorical Feature:
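
Weight of Evidence (WoE) and Information Value (IV), two of the measures named in the introduction, apply to a categorical feature paired with a binary target. The sketch below shows one common formulation, WoE = ln(share of non-events / share of events) per category, with IV as the sum of (share of non-events - share of events) x WoE. The function name woe_iv, the 0/1 target encoding, and the toy data are assumptions made for illustration; ExploriPy's sign convention and its handling of categories with zero counts may differ.

```python
import numpy as np
import pandas as pd


def woe_iv(feature: pd.Series, target: pd.Series):
    """Return (per-category table, total IV) for a binary 0/1 target.

    Hypothetical helper for illustration; not part of ExploriPy.
    Assumes every category contains both events (1) and non-events (0),
    otherwise the logarithm is undefined or infinite.
    """
    # Rows: categories; columns: target values 0 (non-event) and 1 (event)
    counts = pd.crosstab(feature, target)
    dist_non_event = counts[0] / counts[0].sum()
    dist_event = counts[1] / counts[1].sum()
    woe = np.log(dist_non_event / dist_event)
    iv_part = (dist_non_event - dist_event) * woe
    table = pd.DataFrame({"woe": woe, "iv_part": iv_part})
    return table, float(iv_part.sum())


# Toy data (made up for this sketch)
outlet = pd.Series(["A", "A", "A", "A", "B", "B", "B", "B"])
bought = pd.Series([1, 0, 0, 0, 1, 1, 1, 0])
woe_table, iv = woe_iv(outlet, bought)
print(woe_table)
print(iv)
```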



For Target Continuous Feature:
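
Analysis of Variance, named in the introduction, is the usual test for a categorical feature against a continuous target. As a rough sketch of what the test measures, the one-way ANOVA F statistic can be computed by hand with pandas, as below. The function name one_way_anova_f and the toy data are hypothetical, the sketch assumes no null values, and it does not compute a p-value (scipy.stats.f_oneway is one existing option for that). It is not taken from ExploriPy's source.

```python
import pandas as pd


def one_way_anova_f(df: pd.DataFrame, category: str, target: str) -> float:
    """One-way ANOVA F statistic of `target` across the groups of `category`.

    Hypothetical helper for illustration; not part of ExploriPy.
    """
    groups = df.groupby(category)[target]
    grand_mean = df[target].mean()
    n_total = len(df)
    k = groups.ngroups

    # Between-group sum of squares: group size times squared mean offset
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    # Within-group sum of squares: squared distance to each row's group mean
    ss_within = ((df[target] - groups.transform("mean")) ** 2).sum()

    return float((ss_between / (k - 1)) / (ss_within / (n_total - k)))


# Toy data (made up for this sketch)
toy = pd.DataFrame({
    "Outlet_Type": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Item_Outlet_Sales": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})
f_stat = one_way_anova_f(toy, "Outlet_Type", "Item_Outlet_Sales")
print(f_stat)
```

A large F value indicates that the group means differ more than the within-group spread would suggest. A post-hoc test such as Tukey's HSD then identifies which pairs of groups differ.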

Continuous Vs Target

Continuous Vs Target Categorical Feature

Continuous Vs Target Continuous Feature



Authors & License

The ExploriPy package is released under the MIT License. It was developed by Kunjithapatham Sivakumar, Shashank Shekhar and Sajan Bhagat. Pull requests to the GitHub repo are highly encouraged!