01-05-2021

EDA Python Cheat Sheet

EDA Python Pandas
Python For Data Science Cheat Sheet Lists Also See NumPy Arrays
EDA Python Cheat Sheet Download
Exploratory Data Analysis (EDA) And Data Visualization With ...

This PySpark cheat sheet with code samples covers the basics, such as initializing Spark in Python, loading data, sorting, and repartitioning. Apache Spark is generally known as a fast, general-purpose, open-source engine for big data processing.

Python Exploratory Data Analysis cheat sheet, reading a CSV file: use header=None when the columns are not labeled in your CSV file.

Exploratory data analysis (EDA) with Python: multiple libraries are available to perform basic EDA, but this post uses pandas for data manipulation and matplotlib for plotting graphs, with Jupyter notebooks for writing code and recording findings. A Jupyter notebook is a kind of diary for data analysis.

DataCamp has created a Seaborn cheat sheet, a handy one-page reference for those who are ready to get started with this data visualization library. The cheat sheet presents the five basic steps that you can go through to make beautiful statistical graphs in Python.

For data analysis, Exploratory Data Analysis (EDA) should be your first step. Exploratory Data Analysis helps us to −

Gain insight into a data set.
Understand the underlying structure.
Extract important parameters and the relationships that hold between them.
Test underlying assumptions.

Understanding EDA Using a Sample Data Set

To understand EDA using Python, we can take sample data either directly from a website or from a local disk. The sample used here is the red variant of the Wine Quality data set, which is publicly available from the UCI Machine Learning Repository, and we will try to gain as much insight into it as possible using EDA.

Running the loading script in a Jupyter notebook displays the first rows of the data set. The script does the following −

First, import the necessary library, pandas in this case.
Read the CSV file using the read_csv() function of the pandas library; each value in the given data set is separated by the delimiter “;”.
Return the first five observations from the data set with the “.head()” function provided by the pandas library. Similarly, the “.tail()” function returns the last five observations.
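The steps above can be sketched as follows. This is a minimal illustration, not the original script: the rows in `sample_csv` are hypothetical stand-in values that reuse a few Wine Quality column names, so the snippet runs without downloading anything. With the real file, you would pass its path to `read_csv()` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in rows (illustrative values only), semicolon-delimited
# like the Wine Quality file. Replace sample_csv with the path to the real CSV.
sample_csv = io.StringIO(
    '"residual sugar";"density";"alcohol";"quality"\n'
    "20.7;1.0010;8.8;6\n"
    "1.6;0.9940;9.5;6\n"
    "6.9;0.9951;10.1;6\n"
    "8.5;0.9956;9.9;6\n"
    "7.0;0.9949;9.6;5\n"
    "1.2;0.9908;12.8;7\n"
)

# The delimiter is ";" rather than the default ",", so pass sep=";".
data = pd.read_csv(sample_csv, sep=";")

print(data.head())  # first five observations
print(data.tail())  # last five observations
```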

The “.shape” attribute gives the total number of rows and columns in the data set.
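A short sketch of this, again using hypothetical stand-in rows rather than the real file:

```python
import io

import pandas as pd

# Hypothetical stand-in rows (illustrative values only).
sample_csv = io.StringIO(
    '"residual sugar";"density";"alcohol";"quality"\n'
    "20.7;1.0010;8.8;6\n"
    "1.6;0.9940;9.5;6\n"
    "6.9;0.9951;10.1;6\n"
    "8.5;0.9956;9.9;6\n"
    "7.0;0.9949;9.6;5\n"
    "1.2;0.9908;12.8;7\n"
)
data = pd.read_csv(sample_csv, sep=";")

# shape is an attribute (no parentheses): a (rows, columns) tuple.
n_rows, n_cols = data.shape
print(f"{n_rows} rows, {n_cols} columns")
```

For this six-row sample the tuple is (6, 4). The real data set will report its own dimensions.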

To find out which columns the data set contains, what their types are, and whether any of them have missing values, use the info() function.
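A minimal sketch of the call, using hypothetical stand-in rows. The `dtypes` and `isnull()` lines are an addition here that show the same facts in a form that is easy to check programmatically.

```python
import io

import pandas as pd

# Hypothetical stand-in rows (illustrative values only).
sample_csv = io.StringIO(
    '"residual sugar";"density";"alcohol";"quality"\n'
    "20.7;1.0010;8.8;6\n"
    "1.6;0.9940;9.5;6\n"
    "6.9;0.9951;10.1;6\n"
    "8.5;0.9956;9.9;6\n"
    "7.0;0.9949;9.6;5\n"
    "1.2;0.9908;12.8;7\n"
)
data = pd.read_csv(sample_csv, sep=";")

# info() prints column names, non-null counts, and dtypes.
data.info()

# The same facts as values: dtype per column and missing-value count per column.
column_types = data.dtypes
missing_per_column = data.isnull().sum()
print(column_types)
print(missing_per_column)
```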

By observing the above data, we can conclude −

The data contains only float and integer values.
All the column variables are non-null (no empty or missing values).

Another useful function provided by pandas is describe(), which reports the count, mean, standard deviation, minimum and maximum values, and the quantiles of the data.
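A minimal sketch of describe() on hypothetical stand-in rows. The final line, which puts the mean next to the median (the 50% row), is an illustrative addition.

```python
import io

import pandas as pd

# Hypothetical stand-in rows (illustrative values only).
sample_csv = io.StringIO(
    '"residual sugar";"density";"alcohol";"quality"\n'
    "20.7;1.0010;8.8;6\n"
    "1.6;0.9940;9.5;6\n"
    "6.9;0.9951;10.1;6\n"
    "8.5;0.9956;9.9;6\n"
    "7.0;0.9949;9.6;5\n"
    "1.2;0.9908;12.8;7\n"
)
data = pd.read_csv(sample_csv, sep=";")

# One column of summary statistics per numeric column of the data set.
summary = data.describe()
print(summary)

# Compare mean and median (the "50%" row) side by side.
print(summary.loc[["mean", "50%"]])
```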

From the above, we can conclude that the mean value of each column is less than the median value (shown as 50% in the index column).
There is a large difference between the 75% and max values of the predictors “residual sugar”, “free sulfur dioxide” and “total sulfur dioxide”.
Together, these two observations indicate that there are extreme values (outliers) in our data set.

A couple of key insights we can get from the dependent variable are as follows −

On the “quality” score scale, 1 is at the bottom (poor) and 10 is at the top (best).
From the above we can conclude that no observation scores 1 (poor), 2, 9, or 10 (best). All the scores are between 3 and 8.
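One way to list the distinct scores that actually occur is `unique()` on the “quality” column. The original code is not shown, so this is an assumed approach. The sketch runs on hypothetical stand-in rows that contain only a few of the scores.

```python
import io

import pandas as pd

# Hypothetical stand-in rows (illustrative values only).
sample_csv = io.StringIO(
    '"residual sugar";"density";"alcohol";"quality"\n'
    "20.7;1.0010;8.8;6\n"
    "1.6;0.9940;9.5;6\n"
    "6.9;0.9951;10.1;6\n"
    "8.5;0.9956;9.9;6\n"
    "7.0;0.9949;9.6;5\n"
    "1.2;0.9908;12.8;7\n"
)
data = pd.read_csv(sample_csv, sep=";")

# Distinct quality scores present in the data, sorted for readability.
observed_scores = sorted(data["quality"].unique().tolist())
print(observed_scores)
```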

The processed data above gives the number of observations for each quality score, in descending order.
Most of the quality scores are in the range 5 to 7.
The fewest observations fall in the categories at the extremes of the observed range.
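Counts like these can be produced with `value_counts()`, which sorts by frequency in descending order by default. The original code is not shown, so this is an assumed approach. The sketch uses hypothetical stand-in rows.

```python
import io

import pandas as pd

# Hypothetical stand-in rows (illustrative values only).
sample_csv = io.StringIO(
    '"residual sugar";"density";"alcohol";"quality"\n'
    "20.7;1.0010;8.8;6\n"
    "1.6;0.9940;9.5;6\n"
    "6.9;0.9951;10.1;6\n"
    "8.5;0.9956;9.9;6\n"
    "7.0;0.9949;9.6;5\n"
    "1.2;0.9908;12.8;7\n"
)
data = pd.read_csv(sample_csv, sep=";")

# Number of observations per quality score, most frequent first.
quality_counts = data["quality"].value_counts()
print(quality_counts)
```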

Data Visualizations

To check missing values −

We can check for missing values in our wine quality CSV data set with the help of the seaborn library.
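A minimal sketch of such a check is a heatmap of `isnull()`. The `viridis` colour map is an assumption, chosen because its low end is purple, which matches the description of the output. The rows are hypothetical stand-ins.

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in rows (illustrative values only).
sample_csv = io.StringIO(
    '"residual sugar";"density";"alcohol";"quality"\n'
    "20.7;1.0010;8.8;6\n"
    "1.6;0.9940;9.5;6\n"
    "6.9;0.9951;10.1;6\n"
    "8.5;0.9956;9.9;6\n"
    "7.0;0.9949;9.6;5\n"
    "1.2;0.9908;12.8;7\n"
)
data = pd.read_csv(sample_csv, sep=";")

# isnull() yields a True/False table; any True (missing) cell would be drawn
# in a contrasting colour against the uniform background.
missing_mask = data.isnull()
ax = sns.heatmap(missing_mask, cbar=False, yticklabels=False, cmap="viridis")
ax.set_title("Missing values")
# Use plt.savefig("missing.png") or an interactive backend to view the figure.
```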

Output

From the above we can see that there are no missing values in the data set. If there were any, they would appear as a differently coloured shade on the purple background.
Try it with a different data set that does have missing values and you will notice the difference.

To check correlation

To check the correlation between the different variables in the data set, extend the existing code with a correlation plot.
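A minimal sketch of a correlation plot: compute the pairwise correlations with `DataFrame.corr()` and draw them with seaborn's `heatmap()`. The `Blues` colour map is an assumption, chosen so that higher (positive) values get darker shades. The rows are hypothetical stand-ins.

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in rows (illustrative values only).
sample_csv = io.StringIO(
    '"residual sugar";"density";"alcohol";"quality"\n'
    "20.7;1.0010;8.8;6\n"
    "1.6;0.9940;9.5;6\n"
    "6.9;0.9951;10.1;6\n"
    "8.5;0.9956;9.9;6\n"
    "7.0;0.9949;9.6;5\n"
    "1.2;0.9908;12.8;7\n"
)
data = pd.read_csv(sample_csv, sep=";")

# Pairwise Pearson correlation between all numeric columns.
corr = data.corr()

plt.figure(figsize=(6, 5))
ax = sns.heatmap(corr, cmap="Blues", annot=False)
# Use plt.savefig("correlation.png") or an interactive backend to view the figure.
```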

Output

Above, positive correlation is represented by dark shades and negative correlation by lighter shades.
Set annot=True, and the output will show the correlation values between features in the grid cells.

We can generate another correlation matrix with annot=True by modifying the existing code.
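A minimal sketch of the annotated version, on hypothetical stand-in rows. The `fmt=".2f"` argument and the `Blues` colour map are assumptions added for readability, not settings taken from the original code.

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in rows (illustrative values only).
sample_csv = io.StringIO(
    '"residual sugar";"density";"alcohol";"quality"\n'
    "20.7;1.0010;8.8;6\n"
    "1.6;0.9940;9.5;6\n"
    "6.9;0.9951;10.1;6\n"
    "8.5;0.9956;9.9;6\n"
    "7.0;0.9949;9.6;5\n"
    "1.2;0.9908;12.8;7\n"
)
data = pd.read_csv(sample_csv, sep=";")
corr = data.corr()

plt.figure(figsize=(6, 5))
# annot=True writes each correlation coefficient into its grid cell.
ax = sns.heatmap(corr, cmap="Blues", annot=True, fmt=".2f")
# Use plt.savefig("correlation_annotated.png") or an interactive backend to view it.
```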

Output

From the above we can see that there is a strong positive correlation between density and residual sugar, and a strong negative correlation between density and alcohol.
Also, there is no correlation between free sulfur dioxide and quality.
