Python Data Science Handbook

Introduktion

IPython: Beyond Normal Python8 Ämnen

Introduction to NumPy9 Ämnen

Understanding Data Types in Python

The Basics of NumPy Arrays

Computation on NumPy Arrays: Universal Functions

Aggregations: Min, Max, and Everything In Between

Computation on Arrays: Broadcasting

Comparisons, Masks, and Boolean Logic

Fancy Indexing

Sorting Arrays

Structured Data: NumPy's Structured Arrays

Understanding Data Types in Python

Data Manipulation with Pandas13 Ämnen

Introducing Pandas Objects

Data Indexing and Selection

Operating on Data in Pandas

Handling Missing Data

Hierarchical Indexing

Combining Datasets: Concat and Append

Combining Datasets: Merge and Join

Aggregation and Grouping

Pivot Tables

Vectorized String Operations

Working with Time Series

HighPerformance Pandas: eval() and query()

Further Resources

Introducing Pandas Objects

Visualization with Matplotlib15 Ämnen

Simple Line Plots

Simple Scatter Plots

Visualizing Errors

Density and Contour Plots

Histograms, Binnings, and Density

Customizing Plot Legends

Customizing Colorbars

Multiple Subplots

Text and Annotation

Customizing Ticks

Customizing Matplotlib: Configurations and Stylesheets

ThreeDimensional Plotting in Matplotlib

Geographic Data with Basemap

Visualization with Seaborn

Further Resources

Simple Line Plots

Machine Learning15 Ämnen

What Is Machine Learning?

Introducing ScikitLearn

Hyperparameters and Model Validation

Feature Engineering

In Depth: Naive Bayes Classification

In Depth: Linear Regression

InDepth: Support Vector Machines

InDepth: Decision Trees and Random Forests

In Depth: Principal Component Analysis

InDepth: Manifold Learning

In Depth: kMeans Clustering

In Depth: Gaussian Mixture Models

InDepth: Kernel Density Estimation

Application: A Face Detection Pipeline

Further Machine Learning Resources

What Is Machine Learning?

Appendix: Figure Code
Introduction to NumPy
september 28, 2020
This chapter, along with chapter 3, outlines techniques for effectively loading, storing, and manipulating inmemory data in Python. The topic is very broad: datasets can come from a wide range of sources and a wide range of formats, including be collections of documents, collections of images, collections of sound clips, collections of numerical measurements, or nearly anything else. Despite this apparent heterogeneity, it will help us to think of all data fundamentally as arrays of numbers.
For example, images–particularly digital images–can be thought of as simply twodimensional arrays of numbers representing pixel brightness across the area. Sound clips can be thought of as onedimensional arrays of intensity versus time. Text can be converted in various ways into numerical representations, perhaps binary digits representing the frequency of certain words or pairs of words. No matter what the data are, the first step in making it analyzable will be to transform them into arrays of numbers. (We will discuss some specific examples of this process later in Feature Engineering)
For this reason, efficient storage and manipulation of numerical arrays is absolutely fundamental to the process of doing data science. We’ll now take a look at the specialized tools that Python has for handling such numerical arrays: the NumPy package, and the Pandas package (discussed in Chapter 3).
This chapter will cover NumPy in detail. NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python’s builtin list
type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size. NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.
If you followed the advice outlined in the Preface and installed the Anaconda stack, you already have NumPy installed and ready to go. If you’re more the doityourself type, you can go to http://www.numpy.org/ and follow the installation instructions found there. Once you do, you can import NumPy and doublecheck the version:In [1]:
import numpy numpy.__version__
Out[1]:
'1.11.1'
For the pieces of the package discussed here, I’d recommend NumPy version 1.8 or later. By convention, you’ll find that most people in the SciPy/PyData world will import NumPy using np
as an alias:In [2]:
import numpy as np
Throughout this chapter, and indeed the rest of the book, you’ll find that this is the way we will import and use NumPy.
Reminder about Built In Documentation
As you read through this chapter, don’t forget that IPython gives you the ability to quickly explore the contents of a package (by using the tabcompletion feature), as well as the documentation of various functions (using the ?
character – Refer back to Help and Documentation in IPython).
For example, to display all the contents of the numpy namespace, you can type this:
In [3]: np.<TAB>
And to display NumPy’s builtin documentation, you can use this:
In [4]: np?
More detailed documentation, along with tutorials and other resources, can be found at http://www.numpy.org.