Python Data Science Handbook

Data Manipulation with Pandas
September 22, 2020
In the previous chapter, we dove into detail on NumPy and its ndarray object, which provides efficient storage and manipulation of dense typed arrays in Python. Here we’ll build on this knowledge by looking in detail at the data structures provided by the Pandas library. Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
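As a minimal sketch of that description, the snippet below builds a small DataFrame with row and column labels, columns of different types, and one missing value. The column names, row labels, and values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: names and values are invented for this sketch.
df = pd.DataFrame(
    {
        "item": ["apple", "pear", "plum"],   # text column
        "price": [1.5, np.nan, 0.8],         # float column with one missing value
        "in_stock": [True, False, True],     # boolean column
    },
    index=["a", "b", "c"],                   # explicit row labels
)

print(df.dtypes)                   # each column keeps its own dtype
print(df.loc["b", "item"])         # label-based access to a single cell
print(df["price"].isna().sum())    # number of missing prices
```

Each column carries its own type, and the missing price is stored as NaN rather than forcing the whole table into a single dtype.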
As we saw, NumPy’s ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s time.
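To make the contrast concrete, here is a small sketch of two operations that rely on labels rather than on array position: arithmetic that aligns on index labels, and a grouped sum. The labels and values are hypothetical and chosen only to keep the result easy to check by hand.

```python
import numpy as np
import pandas as pd

# Two hypothetical labeled series with partially overlapping labels.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

# Addition aligns on labels, not positions; "a" has no partner in s2,
# so the result at "a" is marked as missing (NaN).
total = s1 + s2
print(total)

# A grouping operation: sum the values that share the same key.
df = pd.DataFrame({"key": ["x", "y", "x", "y"], "value": [1, 2, 3, 4]})
sums = df.groupby("key")["value"].sum()
print(sums)
```

With plain NumPy arrays, adding a length-3 array to a length-2 array would raise a broadcasting error, and the grouped sum would need manual masking for each key.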
In this chapter, we will focus on the mechanics of using Series, DataFrame, and related structures effectively. We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus.
Installing and Using Pandas
Installing Pandas on your system requires NumPy to be installed, and building the library from source additionally requires the appropriate tools to compile the C and Cython sources on which Pandas is built. Details on this installation can be found in the Pandas documentation. If you followed the advice outlined in the Preface and used the Anaconda stack, you already have Pandas installed.
Once Pandas is installed, you can import it and check the version:
In [1]: import pandas
        pandas.__version__
Out[1]: '0.18.1'
Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd:
In [2]: import pandas as pd
This import convention will be used throughout the remainder of this book.
Reminder about Built-In Documentation
As you read through this chapter, don’t forget that IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature) as well as the documentation of various functions (using the ? character). (Refer back to Help and Documentation in IPython if you need a refresher on this.)
For example, to display all the contents of the pandas namespace, you can type this:
In [3]: pd.<TAB>
And to display Pandas’s built-in documentation, you can use this:
In [4]: pd?
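Tab completion and the ? character are IPython features. Outside IPython, a rough equivalent can be sketched with standard Python introspection, as below. The variable names are illustrative, and the exact list of names depends on the installed Pandas version.

```python
import pandas as pd

# Rough stand-in for pd.<TAB>: list the public names in the pandas namespace.
public_names = [name for name in dir(pd) if not name.startswith("_")]
print(len(public_names))
print(public_names[:5])

# Rough stand-in for pd?: read the package docstring directly.
# (help(pd) would page through the same text interactively.)
doc = pd.__doc__ or ""
first_line = doc.strip().splitlines()[0] if doc.strip() else ""
print(first_line)
```

The same approach works for individual objects; for example, help(pd.DataFrame) shows the documentation that pd.DataFrame? displays in IPython.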
More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/.