tafra began life as a thought experiment: how could we reduce the idea
of a dataframe (as expressed in libraries like pandas or languages
like R) to its useful essence, while carving away the cruft?
The original proof of concept
stopped at "group by".
This library expands on the proof of concept to produce a practically
useful tafra, which we hope you may find to be a helpful lightweight
substitute for certain uses of pandas.
A tafra is, more-or-less, a set of named columns or dimensions.
Each of these is a typed numpy array of consistent length, representing
the values for each column by rows.
The library provides lightweight syntax for manipulating rows and columns, support for managing data types, iterators for rows and sub-frames, pandas-like "transform" support and conversion from pandas Dataframes, and SQL-style "group by" and join operations.
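The data model can be sketched in a few lines of plain numpy (an illustration of the concept, not tafra's internal representation):

```python
import numpy as np

# The concept in miniature: a "tafra" is essentially a mapping of column
# name -> typed numpy array, with all arrays sharing one row count.
data = {
    'x': np.array([1, 2, 3, 4]),
    'y': np.array(['one', 'two', 'one', 'two']),
}
rows = {len(col) for col in data.values()}
assert rows == {4}  # every column must have the same length
```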
Install from conda-forge (includes a pre-built C extension, so no compiler is needed):

    conda install tafra -c conda-forge

Or install from PyPI with pip:

    pip install tafra

Note
conda install provides a pre-built binary with the C extension already
compiled for your platform. pip install from PyPI will attempt to
compile the C extension from source; if no C compiler is available, the
package installs without it and falls back to pure Python + numpy.
To build from source (including the optional C extension):
git clone https://github.com/petbox-dev/tafra.git
cd tafra
pip install -e .

Requirements:
- Python >=3.9
- numpy >=2.1
- A C compiler (optional, for the _accel extension):
  - Windows: Visual Studio Build Tools (with the Windows SDK) or MinGW-w64
  - Linux: gcc (usually pre-installed, or apt install build-essential)
  - macOS: Xcode Command Line Tools (xcode-select --install)
If no C compiler is available, the package installs without the extension and falls back to pure Python + numpy at runtime. To verify the C extension is active:
>>> from tafra._accel import groupby_sum
>>> print("C extension active")

To build a distributable wheel:
pip install build
python -m build

On Windows, the C extension requires the MSVC compiler to find the Windows SDK headers.
If you get fatal error C1083: Cannot open include file: 'io.h', the
Windows SDK include/lib paths are not set. Two options:
1. Use a Developer Command Prompt (recommended): open "Developer Command Prompt for VS" or "Developer PowerShell for VS" from the Start menu. This runs vcvarsall.bat automatically and sets all required paths.
2. Use MinGW-w64 instead of MSVC:
python setup.py build_ext --inplace --compiler=mingw32
MinGW-w64 can be installed via conda (conda install m2w64-gcc -c conda-forge) or from winlibs.com.
If building with python -m build (which creates an isolated environment),
use --no-isolation to inherit your shell's environment variables, or run
from a Developer Command Prompt:
python -m build --no-isolation

>>> import numpy as np
>>> from tafra import Tafra
>>> t = Tafra({
... 'x': np.array([1, 2, 3, 4]),
... 'y': np.array(['one', 'two', 'one', 'two']),
... })
>>> t.pformat()
Tafra(data = {
'x': array([1, 2, 3, 4]),
'y': array(['one', 'two', 'one', 'two'])},
dtypes = {
'x': 'int', 'y': 'str'},
rows = 4)
>>> print('List:', '\n', t.to_list())
List:
[array([1, 2, 3, 4]), array(['one', 'two', 'one', 'two'], dtype=object)]
>>> print('Records:', '\n', tuple(t.to_records()))
Records:
((1, 'one'), (2, 'two'), (3, 'one'), (4, 'two'))
>>> gb = t.group_by(
... ['y'], {'x': sum}
... )
>>> print('Group By:', '\n', gb.pformat())
Group By:
Tafra(data = {
'x': array([4, 6]), 'y': array(['one', 'two'])},
dtypes = {
'x': 'int', 'y': 'str'},
rows = 2)

group_by reduces: it returns one row per group and applies the given aggregation functions:
>>> tf.group_by(['wellid'], {'total_oil': (np.sum, 'oil')})
# Returns: one row per wellid, with summed oil

partition splits: it returns all original rows, grouped into sub-Tafras
for independent processing (e.g., multiprocessing):
>>> from concurrent.futures import ProcessPoolExecutor
>>> def forecast_well(tf):
... """Run a forecast on one well's production data."""
... # tf contains all rows for a single well, sorted by date
... return compute_forecast(tf['date'], tf['oil'])
>>> parts = tf.partition(['wellid'], sort_by=['date'])
>>> with ProcessPoolExecutor(max_workers=4) as pool:
... results = list(pool.map(
... forecast_well, [sub for _, sub in parts]))
>>> combined = Tafra.concat(results)

With 8 workers and ~13 ms of work per group, partition achieves ~5x
speedup over serial execution. For light aggregations (sum, mean, std),
group_by is 10-100x faster — use it instead. See
numerical.rst for
detailed benchmarks.
chunks splits by row count (for data-parallel workloads where group
integrity doesn't matter):
>>> for chunk in tf.chunks(n=4, sort_by=['date']):
...     process(chunk)

Have some code that works with pandas, or just a way of doing things
that you prefer? tafra is flexible:
>>> import pandas as pd
>>> df = pd.DataFrame(np.c_[
... np.array([1, 2, 3, 4]),
... np.array(['one', 'two', 'one', 'two'])
... ], columns=['x', 'y'])
>>> t = Tafra.from_dataframe(df)

And going back is just as simple:
>>> df = pd.DataFrame(t.data)

Note
Benchmarks collected with tafra 2.1.0. See
numerical.rst
for full benchmarks against pandas 2.3/3.0 and polars 1.39.
Lightweight means performant. By minimizing abstraction to access the
underlying numpy arrays, tafra provides dramatic speedups over
pandas and polars on construction and access:
# Construction: 100k rows, 5 columns
Tafra(): 0.02 ms
pd.DataFrame(): 2.80 ms # 140x slower
pl.DataFrame(): 0.04 ms # 2x slower
# Column access: 100k rows, per access
tf['x']: 0.13 µs
df['x']: 1.81 µs # 14x slower (pandas 2.3)
plf['x']:  0.70 µs  # 5x slower

tafra uses vectorized numpy operations (np.bincount,
ufunc.reduceat) and an optional C extension (single-pass aggregation,
hash joins) for GroupBy and joins. With the C extension:
# GroupBy: 10k rows, 50 groups, sum + mean
Tafra+C: 0.15 ms
pandas: 0.73 ms # 5x slower
polars: 0.60 ms # 4x slower
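The bincount-based approach mentioned above can be sketched in a few lines of plain numpy (an illustration of the technique, not tafra's actual implementation):

```python
import numpy as np

# Vectorized group-by-sum: factorize the keys, then sum values per
# group id in a single pass with np.bincount.
keys = np.array(['one', 'two', 'one', 'two'])
vals = np.array([1, 2, 3, 4])

# uniq holds each distinct key; inv maps every row to its group id
# (here inv == [0, 1, 0, 1]).
uniq, inv = np.unique(keys, return_inverse=True)

# Per-group sums in one pass: [4.0, 6.0]
sums = np.bincount(inv, weights=vals)
```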
# Transform: 10k rows, 50 groups
Tafra+C: 0.06 ms
pandas: 0.60 ms # 10x slower
polars: 1.67 ms # 28x slower
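A grouped "transform" follows the same idea: compute a per-group aggregate, then broadcast it back onto every row by fancy indexing. Again, this is an illustrative sketch, not tafra's internal code:

```python
import numpy as np

keys = np.array(['one', 'two', 'one', 'two'])
vals = np.array([1.0, 2.0, 3.0, 4.0])

uniq, inv = np.unique(keys, return_inverse=True)
group_sums = np.bincount(inv, weights=vals)  # per-group sums: [4.0, 6.0]
per_row = group_sums[inv]                    # row shape: [4.0, 6.0, 4.0, 6.0]
```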
# Equi inner join: 1k x 1k
Tafra+C: 0.08 ms
pandas: 0.93 ms # 12x slower
polars:   1.53 ms  # 19x slower

Important note: if you assign directly to the Tafra.data or Tafra._data attributes, you must call Tafra._coalesce_dtypes afterwards to ensure the typing is consistent.
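For reference, the hash-join idea mentioned in the benchmarks above can be sketched in pure Python: build a hash table over one side's keys, then probe it with the other side. This is an illustration of the technique only, not the C extension's implementation:

```python
# Equi inner join via a hash table (build on the right, probe with the left).
left = {'id': [1, 2, 3], 'x': [10, 20, 30]}
right = {'id': [2, 3, 3], 'y': ['a', 'b', 'c']}

# Build: map each right-side key to the row indices holding it.
index: dict[int, list[int]] = {}
for i, key in enumerate(right['id']):
    index.setdefault(key, []).append(i)

# Probe: for each left row, emit one output row per matching right row.
out = []
for i, key in enumerate(left['id']):
    for j in index.get(key, []):
        out.append((key, left['x'][i], right['y'][j]))
# out -> [(2, 20, 'a'), (3, 30, 'b'), (3, 30, 'c')]
```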