Python Data Science Handbook

Introduktion

IPython: Beyond Normal Python8 Ämnen

Introduction to NumPy9 Ämnen

Understanding Data Types in Python

The Basics of NumPy Arrays

Computation on NumPy Arrays: Universal Functions

Aggregations: Min, Max, and Everything In Between

Computation on Arrays: Broadcasting

Comparisons, Masks, and Boolean Logic

Fancy Indexing

Sorting Arrays

Structured Data: NumPy's Structured Arrays

Understanding Data Types in Python

Data Manipulation with Pandas13 Ämnen

Introducing Pandas Objects

Data Indexing and Selection

Operating on Data in Pandas

Handling Missing Data

Hierarchical Indexing

Combining Datasets: Concat and Append

Combining Datasets: Merge and Join

Aggregation and Grouping

Pivot Tables

Vectorized String Operations

Working with Time Series

HighPerformance Pandas: eval() and query()

Further Resources

Introducing Pandas Objects

Visualization with Matplotlib15 Ämnen

Simple Line Plots

Simple Scatter Plots

Visualizing Errors

Density and Contour Plots

Histograms, Binnings, and Density

Customizing Plot Legends

Customizing Colorbars

Multiple Subplots

Text and Annotation

Customizing Ticks

Customizing Matplotlib: Configurations and Stylesheets

ThreeDimensional Plotting in Matplotlib

Geographic Data with Basemap

Visualization with Seaborn

Further Resources

Simple Line Plots

Machine Learning15 Ämnen

What Is Machine Learning?

Introducing ScikitLearn

Hyperparameters and Model Validation

Feature Engineering

In Depth: Naive Bayes Classification

In Depth: Linear Regression

InDepth: Support Vector Machines

InDepth: Decision Trees and Random Forests

In Depth: Principal Component Analysis

InDepth: Manifold Learning

In Depth: kMeans Clustering

In Depth: Gaussian Mixture Models

InDepth: Kernel Density Estimation

Application: A Face Detection Pipeline

Further Machine Learning Resources

What Is Machine Learning?

Appendix: Figure Code
atplotlib has proven to be an incredibly useful and popular visualization tool, but even avid users will admit it often leaves much to be desired. There are several valid complaints about Matplotlib that often come up:
 Prior to version 2.0, Matplotlib’s defaults are not exactly the best choices. It was based off of MATLAB circa 1999, and this often shows.
 Matplotlib’s API is relatively low level. Doing sophisticated statistical visualization is possible, but often requires a lot of boilerplate code.
 Matplotlib predated Pandas by more than a decade, and thus is not designed for use with Pandas
DataFrame
s. In order to visualize data from a PandasDataFrame
, you must extract eachSeries
and often concatenate them together into the right format. It would be nicer to have a plotting library that can intelligently use theDataFrame
labels in a plot.
An answer to these problems is Seaborn. Seaborn provides an API on top of Matplotlib that offers sane choices for plot style and color defaults, defines simple highlevel functions for common statistical plot types, and integrates with the functionality provided by Pandas DataFrame
s.
To be fair, the Matplotlib team is addressing this: it has recently added the plt.style
tools discussed in Customizing Matplotlib: Configurations and Style Sheets, and is starting to handle Pandas data more seamlessly. The 2.0 release of the library will include a new default stylesheet that will improve on the current status quo. But for all the reasons just discussed, Seaborn remains an extremely useful addon.
Seaborn Versus Matplotlib
Here is an example of a simple randomwalk plot in Matplotlib, using its classic plot formatting and colors. We start with the typical imports:In [1]:
import matplotlib.pyplot as plt plt.style.use('classic') %matplotlib inline import numpy as np import pandas as pd
Now we create some random walk data:In [2]:
# Create some data rng = np.random.RandomState(0) x = np.linspace(0, 10, 500) y = np.cumsum(rng.randn(500, 6), 0)
And do a simple plot:In [3]:
# Plot the data with Matplotlib defaults plt.plot(x, y) plt.legend('ABCDEF', ncol=2, loc='upper left');
Although the result contains all the information we’d like it to convey, it does so in a way that is not all that aesthetically pleasing, and even looks a bit oldfashioned in the context of 21stcentury data visualization.
Now let’s take a look at how it works with Seaborn. As we will see, Seaborn has many of its own highlevel plotting routines, but it can also overwrite Matplotlib’s default parameters and in turn get even simple Matplotlib scripts to produce vastly superior output. We can set the style by calling Seaborn’s set()
method. By convention, Seaborn is imported as sns
:In [4]:
import seaborn as sns sns.set()
Now let’s rerun the same two lines as before:In [5]:
# same plotting code as above! plt.plot(x, y) plt.legend('ABCDEF', ncol=2, loc='upper left');
Ah, much better!
Exploring Seaborn Plots
The main idea of Seaborn is that it provides highlevel commands to create a variety of plot types useful for statistical data exploration, and even some statistical model fitting.
Let’s take a look at a few of the datasets and plot types available in Seaborn. Note that all of the following could be done using raw Matplotlib commands (this is, in fact, what Seaborn does under the hood) but the Seaborn API is much more convenient.
Histograms, KDE, and densities
Often in statistical data visualization, all you want is to plot histograms and joint distributions of variables. We have seen that this is relatively straightforward in Matplotlib:In [6]:
data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000) data = pd.DataFrame(data, columns=['x', 'y']) for col in 'xy': plt.hist(data[col], normed=True, alpha=0.5)
Rather than a histogram, we can get a smooth estimate of the distribution using a kernel density estimation, which Seaborn does with sns.kdeplot
:In [7]:
for col in 'xy': sns.kdeplot(data[col], shade=True)
Histograms and KDE can be combined using distplot
:In [8]:
sns.distplot(data['x']) sns.distplot(data['y']);
If we pass the full twodimensional dataset to kdeplot
, we will get a twodimensional visualization of the data:In [9]:
sns.kdeplot(data);
We can see the joint distribution and the marginal distributions together using sns.jointplot
. For this plot, we’ll set the style to a white background:In [10]:
with sns.axes_style('white'): sns.jointplot("x", "y", data, kind='kde');
There are other parameters that can be passed to jointplot
—for example, we can use a hexagonally based histogram instead:In [11]:
with sns.axes_style('white'): sns.jointplot("x", "y", data, kind='hex')