pygram11

pygram11 is a small Python library for creating simple histograms quickly. The backend is written in C++14 with some help from pybind11 and accelerated with OpenMP.


Installation

Requirements

The only requirement to use pygram11 is NumPy. If you install binaries from conda-forge or PyPI, NumPy will be installed as a required dependency.

Extras for Source Builds

When building from source, all you need is a C++ compiler with C++14 support. The setup.py script will test to see if OpenMP is available. If it’s not, then the installation will abort. Most Linux distributions with modern GCC versions should provide OpenMP automatically (search the web to see how to install OpenMP from your distribution’s package manager). On macOS you’ll want to install libomp from Homebrew to use OpenMP with the Clang compiler shipped by Apple.

Install Options

PyPI

$ pip install pygram11

conda-forge

Installations from conda-forge provide a build that used OpenMP.

$ conda install pygram11 -c conda-forge

Note

On macOS the OpenMP libraries from LLVM (libomp) and Intel (libiomp) can clash if your conda environment includes the Intel Math Kernel Library (MKL) package distributed by Anaconda. You may need to install the nomkl package to prevent the clash (Intel MKL accelerates many linear algebra operations, but does not impact pygram11).

Source

$ pip install git+https://github.com/douglasdavis/pygram11.git@main

Quick Start

Jumping In

The main purpose of pygram11 is to be a faster, near drop-in replacement for numpy.histogram() and numpy.histogram2d() with support for uncertainties. The NumPy functions always return the bin counts and the bin edges, while pygram11 functions return the bin counts and the standard error on the bin counts (if weights are not used, the second return value from pygram11 functions will be None). Therefore, if one only cares about the bin counts, the libraries are completely interchangeable.

These two function calls will provide the same result:

import numpy as np
import pygram11 as pg
rng = np.random.default_rng(123)
x = rng.standard_normal(10000)
counts1, __ = np.histogram(x, bins=20, range=(-3, 3))
counts2, __ = pg.histogram(x, bins=20, range=(-3, 3))
np.testing.assert_allclose(counts1, counts2)

If one cares about the statistical uncertainty on the bin counts, or the ability to retain under- and over-flow counts, then pygram11 is a great replacement. Check out a blog post which describes how to recreate this behavior in pure NumPy; with pygram11 it is as simple as:

data = rng.standard_normal(10000)
weights = rng.uniform(0.1, 0.9, data.shape[0])
counts, err = pg.histogram(data, bins=10, range=(-3, 3), weights=weights, flow=True)
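For comparison, the same quantities can be sketched with NumPy alone. This is a minimal reference under stated assumptions, not pygram11's implementation: the helper name weighted_hist_with_flow is hypothetical, bins are assumed to have fixed width, and the error in each bin is taken as the square root of the sum of squared weights, as described in the API reference.

```python
import numpy as np


def weighted_hist_with_flow(x, weights, bins, range):
    """NumPy-only reference: weighted counts and errors with under/overflow."""
    xmin, xmax = range
    # Map each value to a fixed width bin index.
    idx = np.floor((x - xmin) / (xmax - xmin) * bins).astype(np.int64)
    # Fold out-of-range entries into the first and last bins (the "flow" behavior).
    idx = np.clip(idx, 0, bins - 1)
    # Sum of weights per bin, and sum of squared weights per bin.
    counts = np.bincount(idx, weights=weights, minlength=bins)
    sumw2 = np.bincount(idx, weights=weights**2, minlength=bins)
    return counts, np.sqrt(sumw2)


rng = np.random.default_rng(123)
data = rng.standard_normal(10000)
weights = rng.uniform(0.1, 0.9, data.shape[0])
counts, err = weighted_hist_with_flow(data, weights, bins=10, range=(-3, 3))
```

Because every entry lands in some bin when flow is included, the bin counts sum to the total weight.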

The pygram11.histogram() and pygram11.histogram2d() functions in the pygram11 API are meant to provide an easy transition from NumPy to pygram11. The next couple of sections summarize the structure of the pygram11 API.

Core pygram11 Functions

pygram11 provides a simple set of functions for calculating histograms:

pygram11.fix1d(x[, bins, range, weights, …])

Histogram data with fixed (uniform) bin widths.

pygram11.fix1dmw(x, weights[, bins, range, flow])

Histogram data with multiple weight variations and fixed width bins.

pygram11.var1d(x, bins[, weights, density, flow])

Histogram data with variable bin widths.

pygram11.var1dmw(x, weights, bins[, flow])

Histogram data with multiple weight variations and variable width bins.

pygram11.fix2d(x, y[, bins, range, weights, …])

Histogram the x, y data with fixed (uniform) binning.

pygram11.var2d(x, y, xbins, ybins[, …])

Histogram the x, y data with variable width binning.

You’ll see that the API specific to pygram11 is a bit more specialized than the NumPy histogramming API (shown below).

Histogramming a normal distribution:

>>> rng = np.random.default_rng(123)
>>> h, __ = pygram11.fix1d(rng.standard_normal(10000), bins=25, range=(-3, 3))

See the API reference for more examples.

NumPy-like Functions

For convenience a NumPy-like API is also provided (not one-to-one, see the API reference).

pygram11.histogram(x[, bins, range, …])

Histogram data in one dimension.

pygram11.histogram2d(x, y[, bins, range, …])

Histogram data in two dimensions.

Supported Types

Conversions between NumPy array types can take some time when calculating histograms.

In [1]: import numpy as np

In [2]: import pygram11 as pg

In [3]: rng = np.random.default_rng(123)

In [4]: x = rng.standard_normal(2_000_000)

In [5]: %timeit pg.histogram(x, bins=30, range=(-4, 4))
1.95 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit pg.histogram(x.astype(np.float32), bins=30, range=(-4, 4))
2.33 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

You can see the type conversion increases this calculation time by about 20%. The backend C++ functions prohibit type conversions of the input data. If an array with an unsupported numpy.dtype is passed to pygram11, a TypeError will be raised. Supported numpy.dtypes for data are:

  • numpy.float64 (a C/C++ double)

  • numpy.int64 (a C/C++ int64_t)

  • numpy.uint64 (a C/C++ uint64_t)

  • numpy.float32 (a C/C++ float)

  • numpy.int32 (a C/C++ int32_t)

  • numpy.uint32 (a C/C++ uint32_t)

and for weights:

  • numpy.float64

  • numpy.float32
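Since unsupported dtypes raise a TypeError, one option is to convert once, up front, rather than on every call. The sketch below is only an illustration: the helper as_supported and the choice of numpy.float64 as the fallback are assumptions, not part of pygram11.

```python
import numpy as np

# The dtypes listed above as supported for the data array.
SUPPORTED_DATA_DTYPES = tuple(
    np.dtype(t)
    for t in (np.float64, np.int64, np.uint64, np.float32, np.int32, np.uint32)
)


def as_supported(x):
    """Return x unchanged if its dtype is supported, else a float64 copy.

    Hypothetical helper; pygram11 itself does not convert for you.
    """
    x = np.asarray(x)
    if any(x.dtype == dt for dt in SUPPORTED_DATA_DTYPES):
        return x
    return x.astype(np.float64)


small = np.arange(10, dtype=np.int16)  # int16 is not in the supported list
converted = as_supported(small)
untouched = as_supported(np.ones(4, dtype=np.float32))
```

Converting once and reusing the result avoids paying the conversion cost shown in the timing above on each histogram call.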

OpenMP Configuration

For small datasets OpenMP acceleration introduces unnecessary overhead. The C++ backend utilizes OpenMP parallel loops if the data size is above a threshold for a respective histogramming situation. By default these thresholds are 10,000 for fixed width histograms and 5,000 for variable width histograms. The thresholds can be configured with dynamic variables in the pygram11 module:

  • FIXED_WIDTH_PARALLEL_THRESHOLD_1D

  • FIXED_WIDTH_PARALLEL_THRESHOLD_2D

  • FIXED_WIDTH_MW_PARALLEL_THRESHOLD_1D

  • VARIABLE_WIDTH_PARALLEL_THRESHOLD_1D

  • VARIABLE_WIDTH_PARALLEL_THRESHOLD_2D

  • VARIABLE_WIDTH_MW_PARALLEL_THRESHOLD_1D

An example changing the threshold:

>>> import pygram11
>>> import numpy as np
>>> rng = np.random.default_rng(123)
>>> x = rng.uniform(-3.0, 3.0, 6000)
>>> bins = np.array([-3.1, -2.5, -2.0, 0.1, 0.2, 2.1, 3.0])
>>> result = pygram11.histogram(x, bins=bins)  # will use OpenMP
>>> pygram11.VARIABLE_WIDTH_PARALLEL_THRESHOLD_1D = 7500
>>> result = pygram11.histogram(x, bins=bins)  # now will _not_ use OpenMP
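Because the thresholds are plain module-level variables, a change stays in effect until it is reset. A small context manager can restore the previous value automatically. This is a sketch under assumptions: temporary_threshold is a hypothetical helper, and a stand-in namespace replaces the pygram11 module so the snippet runs without it installed.

```python
import contextlib
import types


@contextlib.contextmanager
def temporary_threshold(module, name, value):
    """Set module.<name> to value inside the with-block, then restore it."""
    old = getattr(module, name)
    setattr(module, name, value)
    try:
        yield
    finally:
        setattr(module, name, old)


# Stand-in for the pygram11 module (assumption, for illustration only);
# 5000 is the documented default for variable width histograms.
fake_pg = types.SimpleNamespace(VARIABLE_WIDTH_PARALLEL_THRESHOLD_1D=5000)

with temporary_threshold(fake_pg, "VARIABLE_WIDTH_PARALLEL_THRESHOLD_1D", 7500):
    inside = fake_pg.VARIABLE_WIDTH_PARALLEL_THRESHOLD_1D
after = fake_pg.VARIABLE_WIDTH_PARALLEL_THRESHOLD_1D
```

With the real module, passing pygram11 in place of fake_pg should behave the same way, given that the thresholds are described as dynamic variables of the module.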

Some shortcuts exist to completely disable or enable OpenMP: pygram11.disable_omp() and pygram11.force_omp() (see the API reference).

Benchmarks

Setup

There are a number of Python modules providing APIs for histogram calculations. Here we see how pygram11 performs in comparison to numpy, fast-histogram, and boost-histogram. Tests were performed on an Intel i7-8850H 2.60 GHz processor (6 physical cores, 12 threads).

Fast-histogram does not provide calculations for variable width bins, so, when benchmarking variable width bins, we only compare to NumPy and boost-histogram.

Results

The results clearly show that pygram11 is most useful for input arrays exceeding about 5,000 elements. This makes sense because the pygram11 backend has a clear and simple overhead: to take advantage of N available threads we make N result arrays, fill them individually (splitting the loop over the input data N times), and finally combine the results (one per thread) into a final single result that is returned.
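The split, fill, and combine pattern described above can be sketched in Python. This is an illustration only: the real backend is C++ with OpenMP, and the function chunked_histogram, the use of ThreadPoolExecutor, and the default of four workers are assumptions made for the example.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def chunked_histogram(x, bins, range, nthreads=4):
    """Histogram x by filling one partial result per worker, then merging."""
    # Split the input into one chunk per worker.
    chunks = np.array_split(x, nthreads)

    def fill(chunk):
        # Each worker fills its own result array; no shared state.
        return np.histogram(chunk, bins=bins, range=range)[0]

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        partials = list(pool.map(fill, chunks))
    # Combine the per-worker results into the final counts.
    return np.sum(partials, axis=0)


rng = np.random.default_rng(123)
x = rng.standard_normal(100_000)
merged = chunked_histogram(x, bins=30, range=(-4, 4))
reference = np.histogram(x, bins=30, range=(-4, 4))[0]
```

The extra allocation and the final merge are the fixed cost that makes this approach pay off only for larger inputs.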

For one dimensional histograms with fixed width bins pygram11 becomes the most performant calculation for arrays with about 5,000 or more elements (up to about 3x faster than the next best option and over 10x faster than NumPy). Fast-histogram is a bit more performant for smaller arrays, while pygram11 is always faster than NumPy and boost-histogram.

[Figure: benchmark results for one dimensional fixed width histograms (_images/fixed1d.png)]

For two dimensional histograms with fixed width bins pygram11 becomes the most performant calculation for arrays with about 10,000 or more elements (up to about 3x faster than the next best option and almost 100x faster than NumPy). Fast-histogram is again faster for smaller inputs, while pygram11 is always faster than NumPy and almost always faster than boost-histogram.

[Figure: benchmark results for two dimensional fixed width histograms (_images/fixed2d.png)]

For one dimensional histograms with variable width bins pygram11 becomes the most performant option for arrays with about 10,000 or more elements (up to about 8x faster than the next best option and about 13x faster than NumPy).

[Figure: benchmark results for one dimensional variable width histograms (_images/var1d.png)]

For two dimensional histograms with variable width bins pygram11 becomes the most performant option for arrays with about 5,000 or more elements (up to 10x faster than the next best option).

[Figure: benchmark results for two dimensional variable width histograms (_images/var2d.png)]

API Reference

pygram11.fix1d

pygram11.fix1d(x, bins=10, range=None, weights=None, density=False, flow=False)

Histogram data with fixed (uniform) bin widths.

Parameters
  • x (numpy.ndarray) – Data to histogram.

  • bins (int) – The number of bins.

  • range ((float, float), optional) – The minimum and maximum of the histogram axis. If None, min and max of x will be used.

  • weights (numpy.ndarray, optional) – The weights for each element of x. If weights are absent, the second return type will be None.

  • density (bool) – Normalize histogram counts as value of PDF such that the integral over the range is unity.

  • flow (bool) – Include under/overflow in the first/last bins.

Raises
  • ValueError – If x and weights have incompatible shapes.

  • TypeError – If x or weights are unsupported types

Returns

  • numpy.ndarray – The resulting histogram bin counts.

  • numpy.ndarray, optional – The standard error of each bin count, \(\sqrt{\sum_i w_i^2}\). The return is None if weights are not used.

Examples

A histogram of x with 20 bins between 0 and 100:

>>> h, __ = fix1d(x, bins=20, range=(0, 100))

When weights are absent the second return is None. The same data, now histogrammed with weights and over/underflow included:

>>> rng = np.random.default_rng(123)
>>> w = rng.uniform(0.1, 0.9, x.shape[0])
>>> h, stderr = fix1d(x, bins=20, range=(0, 100), weights=w, flow=True)
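The density option described in the parameters can be illustrated with NumPy alone. This sketch assumes unweighted data that lies entirely inside the range and fixed width bins. How pygram11 treats under/overflow entries when density is requested is not covered here.

```python
import numpy as np

rng = np.random.default_rng(123)
x = rng.uniform(0, 100, 5000)

counts, edges = np.histogram(x, bins=20, range=(0, 100))
width = edges[1] - edges[0]

# Divide by the total count and the bin width so the histogram
# integrates to one over the range.
density = counts / (counts.sum() * width)
integral = np.sum(density * width)

# NumPy's own density normalization, for comparison.
density_np, _ = np.histogram(x, bins=20, range=(0, 100), density=True)
```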

pygram11.fix1dmw

pygram11.fix1dmw(x, weights, bins=10, range=None, flow=False)

Histogram data with multiple weight variations and fixed width bins.

The weights array must have a total number of rows equal to the length of the input data. The number of columns in the weights array is equal to the number of weight variations. (The weights array must be an M x N matrix where M is the length of x and N is the number of weight variations).

Parameters
  • x (numpy.ndarray) – Data to histogram.

  • weights (numpy.ndarray) – The weight variations for the elements of x, first dimension is the length of x, second dimension is the number of weights variations.

  • bins (int) – The number of bins.

  • range ((float, float), optional) – The minimum and maximum of the histogram axis. If None, min and max of x will be used.

  • flow (bool) – Include under/overflow in the first/last bins.

Raises
  • ValueError – If x and weights have incompatible shapes (if x.shape[0] != weights.shape[0]).

  • ValueError – If weights is not a two dimensional array.

  • TypeError – If x or weights are unsupported types

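Returns

  • numpy.ndarray – The bin counts, one column per weight variation.

  • numpy.ndarray – The standard error of each bin count, \(\sqrt{\sum_i w_i^2}\), one column per weight variation.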

Examples

Multiple histograms of x using 20 different weight variations:

>>> rng = np.random.default_rng(123)
>>> x = rng.standard_normal(10000)
>>> twenty_weights = np.abs(rng.standard_normal((x.shape[0], 20)))
>>> h, err = fix1dmw(x, twenty_weights, bins=50, range=(-3, 3))

h and err are now shape (50, 20). Each column represents the histogram of the data using its respective weight.
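A NumPy-only reference for the same layout loops over the weight columns. It is a sketch for checking shapes and values, not a description of how the backend does the work; the names h_ref and err_ref are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)
x = rng.standard_normal(10000)
w = np.abs(rng.standard_normal((x.shape[0], 20)))

h_ref = np.empty((50, 20))
err_ref = np.empty((50, 20))
for j in range(w.shape[1]):
    # One weighted histogram per weight variation (column j).
    h_ref[:, j], _ = np.histogram(x, bins=50, range=(-3, 3), weights=w[:, j])
    # The error per bin is the square root of the summed squared weights.
    sumw2, _ = np.histogram(x, bins=50, range=(-3, 3), weights=w[:, j] ** 2)
    err_ref[:, j] = np.sqrt(sumw2)
```

The loop passes over the data once per weight variation, which is the repeated work a dedicated multiweight function can avoid.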

pygram11.fix2d

pygram11.fix2d(x, y, bins=10, range=None, weights=None, flow=False)

Histogram the x, y data with fixed (uniform) binning.

The two input arrays (x and y) must be the same length (shape).

Parameters
  • x (numpy.ndarray) – First entries in data pairs to histogram.

  • y (numpy.ndarray) – Second entries in data pairs to histogram.

  • bins (int or (int, int)) – If int, both dimensions will have that many bins; if tuple, the number of bins for each dimension.

  • range (Sequence[Tuple[float, float]], optional) – Axis limits in the form [(xmin, xmax), (ymin, ymax)]. If None the input data min and max will be used.

  • weights (array_like, optional) – The weights for each data element. If weights are absent, the second return value will be None.

  • flow (bool) – Include over/underflow.

Raises
  • ValueError – If x and y have incompatible shapes.

  • ValueError – If the shape of weights is incompatible with x and y

  • TypeError – If x, y, or weights are unsupported types

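Returns

  • numpy.ndarray – The bin counts.

  • numpy.ndarray, optional – The standard error of each bin count, \(\sqrt{\sum_i w_i^2}\). The return is None if weights are not used.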

Examples

A histogram of (x, y) with 20 bins between 0 and 100 in the x dimension and 10 bins between 0 and 50 in the y dimension:

>>> h, __ = fix2d(x, y, bins=(20, 10), range=((0, 100), (0, 50)))

The same data, now histogrammed weighted (via w):

>>> h, err = fix2d(x, y, bins=(20, 10), range=((0, 100), (0, 50)), weights=w)
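The examples above leave x, y, and w undefined. A self-contained NumPy reference with the same binning might look like the following. The input data is invented for illustration, and the error is computed from a second histogram of the squared weights.

```python
import numpy as np

rng = np.random.default_rng(123)
x = rng.uniform(0, 100, 5000)
y = rng.uniform(0, 50, 5000)
w = rng.uniform(0.1, 0.9, x.shape[0])

binning = dict(bins=(20, 10), range=((0, 100), (0, 50)))

# Weighted bin counts.
h_ref, _, _ = np.histogram2d(x, y, weights=w, **binning)
# Sum of squared weights per bin gives the variance of each bin count.
sumw2, _, _ = np.histogram2d(x, y, weights=w**2, **binning)
err_ref = np.sqrt(sumw2)
```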

pygram11.var1d

pygram11.var1d(x, bins, weights=None, density=False, flow=False)

Histogram data with variable bin widths.

Parameters
  • x (numpy.ndarray) – Data to histogram

  • bins (numpy.ndarray) – Bin edges

  • weights (numpy.ndarray, optional) – The weights for each element of x. If weights are absent, the second return type will be None.

  • density (bool) – Normalize histogram counts as value of PDF such that the integral over the range is unity.

  • flow (bool) – Include under/overflow in the first/last bins.

Raises
  • ValueError – If the array of bin edges is not monotonically increasing.

  • ValueError – If x and weights have incompatible shapes.

  • TypeError – If x or weights are unsupported types

Returns

  • numpy.ndarray – The bin counts.

  • numpy.ndarray, optional – The standard error of each bin count, \(\sqrt{\sum_i w_i^2}\). The return is None if weights are not used.

Examples

A simple histogram with variable width bins:

>>> rng = np.random.default_rng(123)
>>> x = rng.standard_normal(1000)
>>> edges = np.array([-3.0, -2.5, -1.5, -0.25, 0.25, 2.0, 3.0])
>>> h, __ = var1d(x, edges)
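With variable width bins the bin index can no longer be computed arithmetically. It has to be searched for in the sorted edges. The sketch below shows one way to do this with numpy.searchsorted. It follows numpy.histogram's edge convention (last bin closed on the right) and is not taken from the pygram11 backend; var_hist is a hypothetical name.

```python
import numpy as np


def var_hist(x, edges):
    """Unweighted histogram with variable width bins via binary search."""
    nbins = len(edges) - 1
    # side="right" places a value equal to an edge in the bin starting there.
    idx = np.searchsorted(edges, x, side="right") - 1
    # Keep values equal to the rightmost edge inside the last bin.
    idx[x == edges[-1]] = nbins - 1
    # Drop entries below the first edge or above the last edge.
    inside = (idx >= 0) & (idx < nbins)
    return np.bincount(idx[inside], minlength=nbins)


rng = np.random.default_rng(123)
x = rng.standard_normal(1000)
edges = np.array([-3.0, -2.5, -1.5, -0.25, 0.25, 2.0, 3.0])
h_ref = var_hist(x, edges)
h_np, _ = np.histogram(x, bins=edges)
```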

pygram11.var1dmw

pygram11.var1dmw(x, weights, bins, flow=False)

Histogram data with multiple weight variations and variable width bins.

The weights array must have a total number of rows equal to the length of the input data. The number of columns in the weights array is equal to the number of weight variations. (The weights array must be an M x N matrix where M is the length of x and N is the number of weight variations).

Parameters
  • x (numpy.ndarray) – Data to histogram.

  • weights (numpy.ndarray) – Weight variations for the elements of x, first dimension is the length of x, second dimension is the number of weight variations.

  • bins (numpy.ndarray) – Bin edges.

  • flow (bool) – Include under/overflow in the first/last bins.

Raises
  • ValueError – If the array of bin edges is not monotonically increasing.

  • ValueError – If x and weights have incompatible shapes.

  • ValueError – If weights is not a two dimensional array.

  • TypeError – If x or weights are unsupported types

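Returns

  • numpy.ndarray – The bin counts, one column per weight variation.

  • numpy.ndarray – The standard error of each bin count, \(\sqrt{\sum_i w_i^2}\), one column per weight variation.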

Examples

Using three different weight variations:

>>> rng = np.random.default_rng(123)
>>> x = rng.standard_normal(10000)
>>> weights = np.abs(rng.standard_normal((x.shape[0], 3)))
>>> edges = np.array([-3.0, -2.5, -1.5, -0.25, 0.25, 2.0, 3.0])
>>> h, err = var1dmw(x, weights, edges)
>>> h.shape
(6, 3)
>>> err.shape
(6, 3)

pygram11.var2d

pygram11.var2d(x, y, xbins, ybins, weights=None, flow=False)

Histogram the x, y data with variable width binning.

The two input arrays (x and y) must be the same length (shape).

Parameters
  • x (numpy.ndarray) – First entries in data pairs to histogram.

  • y (numpy.ndarray) – Second entries in data pairs to histogram.

  • xbins (numpy.ndarray) – Bin edges for the x dimension.

  • ybins (numpy.ndarray) – Bin edges for the y dimension.

  • weights (array_like, optional) – The weights for each data element. If weights are absent, the second return value will be None.

  • flow (bool) – Include under/overflow.

Raises
  • ValueError – If x and y have different shape.

  • ValueError – If either bin edge definition is not monotonically increasing.

  • TypeError – If x, y, or weights are unsupported types

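Returns

  • numpy.ndarray – The bin counts.

  • numpy.ndarray, optional – The standard error of each bin count, \(\sqrt{\sum_i w_i^2}\). The return is None if weights are not used.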

Examples

A histogram of (x, y) where the edges are defined by a numpy.logspace() in both dimensions:

>>> bins = numpy.logspace(0.1, 1.0, 10, endpoint=True)
>>> h, __ = var2d(x, y, bins, bins)

pygram11.histogram

pygram11.histogram(x, bins=10, range=None, weights=None, density=False, flow=False)

Histogram data in one dimension.

Parameters
  • x (array_like) – Data to histogram.

  • bins (int or array_like) – If int: the number of bins; if array_like: the bin edges.

  • range ((float, float), optional) – The minimum and maximum of the histogram axis. If None with integer bins, min and max of x will be used. If bins is an array this is expected to be None.

  • weights (array_like, optional) – Weight variations for the elements of x. For single weight histograms the shape must be the same shape as x. For multiweight histograms the first dimension is the length of x, second dimension is the number of weights variations.

  • density (bool) – Normalize histogram counts as value of PDF such that the integral over the range is unity.

  • flow (bool) – Include under/overflow in the first/last bins.

Raises
  • ValueError – If bins defines edges while range is also not None.

  • ValueError – If the array of bin edges is not monotonically increasing.

  • ValueError – If x and weights have incompatible shapes.

  • ValueError – If multiweight histogramming is detected and weights is not a two dimensional array.

  • TypeError – If x or weights are unsupported types

Returns

  • numpy.ndarray – The bin counts.

  • numpy.ndarray, optional – The standard error of each bin count, \(\sqrt{\sum_i w_i^2}\). The return is None if weights are not used.

See also

fix1d

Used for no weight or single weight fixed bin width histograms.

fix1dmw

Used for multiweight fixed bin width histograms.

var1d

Used for no weight or single weight variable bin width histograms.

var1dmw

Used for multiweight variable bin width histograms.

Examples

A simple fixed width histogram:

>>> h, __ = histogram(x, bins=20, range=(0, 100))

And with variable width histograms and weights:

>>> h, err = histogram(x, bins=[-3, -2, -1.5, 1.5, 3.5], weights=w)
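The "See also" entries suggest a simple selection rule: integer bins mean fixed width, an array of edges means variable width, and a two dimensional weights array means multiweight. The sketch below encodes that rule as a hypothetical helper, pick_backend. It mirrors the documentation, not necessarily the exact checks inside pygram11.histogram().

```python
import numpy as np


def pick_backend(bins, weights=None):
    """Name the specialized function matching the bins/weights combination."""
    # An integer number of bins implies fixed (uniform) bin widths.
    fixed = isinstance(bins, (int, np.integer))
    # A 2D weights array implies multiple weight variations.
    multiweight = weights is not None and np.ndim(weights) == 2
    if fixed:
        return "fix1dmw" if multiweight else "fix1d"
    return "var1dmw" if multiweight else "var1d"


x = np.zeros(100)
single = np.ones(100)
multi = np.ones((100, 3))

choice_a = pick_backend(20)
choice_b = pick_backend(20, weights=multi)
choice_c = pick_backend([-3, -2, -1.5, 1.5, 3.5], weights=single)
choice_d = pick_backend(np.array([0.0, 1.0, 4.0]), weights=multi)
```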

pygram11.histogram2d

pygram11.histogram2d(x, y, bins=10, range=None, weights=None, flow=False)

Histogram data in two dimensions.

This function provides an API very similar to numpy.histogram2d(). Keep in mind that the returns are different.

Parameters
  • x (array_like) – Array representing the x coordinate of the data to histogram.

  • y (array_like) – Array representing the y coordinate of the data to histogram.

  • bins (int or array_like or [int, int] or [array, array], optional) –

    The bin specification:
    • If int, the number of bins for the two dimensions (nx = ny = bins).

    • If array_like, the bin edges for the two dimensions (x_edges = y_edges = bins).

    • If [int, int], the number of bins in each dimension (nx, ny = bins).

    • If [array_like, array_like], the bin edges in each dimension (x_edges, y_edges = bins).

  • range (array_like, shape(2,2), optional) – The edges of this histogram along each dimension. If bins is not integral, then this parameter is ignored. If None, the default is [[x.min(), x.max()], [y.min(), y.max()]].

  • weights (array_like) – An array of weights associated to each element \((x_i, y_i)\) pair. Each pair of the data will contribute its associated weight to the bin count.

  • flow (bool) – Include over/underflow.

Raises
  • ValueError – If x and y have different shape or either bin edge definition is not monotonically increasing.

  • ValueError – If the shape of weights is not compatible with x and y.

  • TypeError – If x, y, or weights are unsupported types

See also

fix2d

Used for no weight or single weight fixed bin width histograms.

var2d

Used for no weight or single weight variable bin width histograms.

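Returns

  • numpy.ndarray – The bin counts.

  • numpy.ndarray, optional – The standard error of each bin count, \(\sqrt{\sum_i w_i^2}\). The return is None if weights are not used.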

Examples

>>> h, err = histogram2d(x, y, weights=w)

pygram11.force_omp

pygram11.force_omp()

Force OpenMP acceleration by minimizing the parallel thresholds.

The default behavior is to avoid OpenMP acceleration for input data with length below about 10,000 for fixed width histograms and 5,000 for variable width histograms. This function forces all thresholds to be 1 (always use OpenMP acceleration).

pygram11.disable_omp

pygram11.disable_omp()

Disable OpenMP acceleration by maximizing the parallel thresholds.

The default behavior is to avoid OpenMP acceleration for input data with length below about 10,000 for fixed width histograms and 5,000 for variable width histograms. This function forces all thresholds to be sys.maxsize (never use OpenMP acceleration).

pygram11.bin_centers

pygram11.bin_centers(bins, range=None)

Construct array of center values for each bin.

Parameters
  • bins (int or array_like) – Number of bins or bin edges array.

  • range ((float, float), optional) – The minimum and maximum of the histogram axis.

Returns

Array of bin centers.

Return type

numpy.ndarray

Raises

ValueError – If bins is an integer and range is undefined (None).

Examples

The centers given the number of bins and max/min:

>>> bin_centers(10, range=(-3, 3))
array([-2.7, -2.1, -1.5, -0.9, -0.3,  0.3,  0.9,  1.5,  2.1,  2.7])

Or given bin edges:

>>> bin_centers([0, 1, 2, 3, 4])
array([0.5, 1.5, 2.5, 3.5])

pygram11.bin_edges

pygram11.bin_edges(bins, range)

Construct bin edges given number of bins and axis limits.

Parameters
  • bins (int) – Total number of bins.

  • range ((float, float)) – Minimum and maximum of the histogram axis.

Returns

Edges defined by the number of bins and axis limits.

Return type

numpy.ndarray

Examples

>>> bin_edges(bins=8, range=(-2, 2))
array([-2. , -1.5, -1. , -0.5,  0. ,  0.5,  1. ,  1.5,  2. ])
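For reference, the same edges and centers can be computed directly in NumPy. This sketch reproduces the values shown in the examples for bin_edges and bin_centers. It makes no claim about how the pygram11 helpers are implemented.

```python
import numpy as np

# Edges for 8 bins on (-2, 2): one more edge than there are bins.
edges = np.linspace(-2, 2, 8 + 1)

# Centers for 10 bins on (-3, 3): midpoints of neighboring edges.
edges10 = np.linspace(-3, 3, 10 + 1)
centers10 = 0.5 * (edges10[1:] + edges10[:-1])

# Centers from an explicit array of edges.
explicit = np.array([0, 1, 2, 3, 4])
centers_explicit = 0.5 * (explicit[1:] + explicit[:-1])
```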