Developer Documentation for Numbuf

Numbuf is a library for the fast serialization of primitive Python objects (lists, tuples, dictionaries, NumPy arrays) to the Apache Arrow format.

class numbuf::DictBuilder

Builds dictionaries of key/value pairs. The sequences of keys and values are built separately, using a pair of SequenceBuilders. The resulting Arrow representation can be obtained via the Finish method.

Public Functions

SequenceBuilder &keys()

Builder for the keys of the dictionary.

SequenceBuilder &vals()

Builder for the values of the dictionary.

std::shared_ptr<arrow::StructArray> Finish(std::shared_ptr<arrow::Array> list_data, std::shared_ptr<arrow::Array> tuple_data, std::shared_ptr<arrow::Array> dict_data)

Construct an Arrow StructArray representing the dictionary. The array contains a field “keys” for the keys and a field “vals” for the values.

Parameters
  • list_data -

    List containing the data from nested lists in the value list of the dictionary

  • dict_data -

    List containing the data from nested dictionaries in the value list of the dictionary
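As a rough illustration of the keys/vals layout described above, the following sketch stores a dictionary as two parallel sequences. `ToyDictBuilder` is a hypothetical name used only for illustration, plain `std::vector` stands in for SequenceBuilder, and nothing here calls Numbuf or Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for DictBuilder (not the Numbuf class): keys and
// values are collected in two separate, equally long sequences, so that
// position i of keys pairs with position i of vals.
class ToyDictBuilder {
 public:
  std::vector<std::string>& keys() { return keys_; }
  std::vector<int64_t>& vals() { return vals_; }

  // This toy Finish only checks that both sequences have the same
  // length and zips them into key/value pairs.
  std::vector<std::pair<std::string, int64_t>> Finish() const {
    assert(keys_.size() == vals_.size());
    std::vector<std::pair<std::string, int64_t>> out;
    for (std::size_t i = 0; i < keys_.size(); ++i) {
      out.emplace_back(keys_[i], vals_[i]);
    }
    return out;
  }

 private:
  std::vector<std::string> keys_;
  std::vector<int64_t> vals_;
};
```

Keeping keys and values in separate sequences means each side can be built and stored as its own column, which is the shape the “keys” and “vals” fields suggest.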

class numbuf::SequenceBuilder

A Sequence is a heterogeneous collection of elements. It can contain scalar Python types, lists, tuples, dictionaries and tensors.

Public Functions

Status Append()

Appending a None value to the sequence.

Status Append(bool data)

Appending a boolean to the sequence.

Status Append(int64_t data)

Appending an int64_t to the sequence.

Status Append(uint64_t data)

Appending a uint64_t to the sequence.

Status Append(const char *data, int32_t length)

Appending a string to the sequence.

Status Append(float data)

Appending a float to the sequence.

Status Append(double data)

Appending a double to the sequence.

arrow::Status Append(const std::vector<int64_t> &dims, uint8_t *data)

Appending a tensor to the sequence.

Parameters
  • dims -

    A vector of dimensions

  • data -

    A pointer to the start of the data block. The length of the data block will be the product of the dimensions
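The relationship between dims and the size of the data block can be sketched as follows. `NumElements` is a hypothetical helper written for this illustration and is not part of Numbuf.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: the number of elements described by a shape is
// the product of all its dimensions. An empty shape yields 1.
int64_t NumElements(const std::vector<int64_t>& dims) {
  int64_t n = 1;
  for (int64_t d : dims) {
    assert(d >= 0);  // negative dimensions make no sense here
    n *= d;
  }
  return n;
}
```

For example, a tensor with dims {2, 3, 4} has 24 elements, so the caller must make sure that `data` points to a block holding that many.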

Status AppendList(int32_t size)

Add a sublist to the sequence. The data contained in the sublist will be specified in the “Finish” method.

To construct l = [[11, 22], 33, [44, 55]] you would, for example, run:

    list = ListBuilder();
    list.AppendList(2);
    list.Append(33);
    list.AppendList(2);
    list.Finish([11, 22, 44, 55]);
    list.Finish();

Parameters
  • size -

    The size of the sublist
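To make the split between AppendList and Finish more concrete, here is a minimal self-contained sketch of the same idea: the outer sequence records only the size of each sublist, and the sublist contents arrive later as one flat buffer. `ToyListBuilder` and `Slot` are hypothetical names, and the code does not use Numbuf or Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One slot of the outer sequence: either a scalar value or a sublist of
// `size` elements whose contents live in a separate flat buffer.
struct Slot {
  bool is_list;
  int64_t scalar;  // used when is_list is false
  int32_t size;    // used when is_list is true
};

// Hypothetical toy builder mirroring the AppendList/Finish split.
class ToyListBuilder {
 public:
  void Append(int64_t value) { slots_.push_back(Slot{false, value, 0}); }
  void AppendList(int32_t size) { slots_.push_back(Slot{true, 0, size}); }

  // Rebuild the nested structure from the flat sublist data. Scalars are
  // returned as one-element vectors to keep the toy simple.
  std::vector<std::vector<int64_t>> Finish(
      const std::vector<int64_t>& list_data) const {
    std::vector<std::vector<int64_t>> out;
    std::size_t offset = 0;
    for (const Slot& s : slots_) {
      if (s.is_list) {
        const std::size_t n = static_cast<std::size_t>(s.size);
        assert(offset + n <= list_data.size());
        out.emplace_back(list_data.begin() + offset,
                         list_data.begin() + offset + n);
        offset += n;  // the next sublist starts where this one ended
      } else {
        out.push_back(std::vector<int64_t>{s.scalar});
      }
    }
    assert(offset == list_data.size());  // all sublist data was consumed
    return out;
  }

 private:
  std::vector<Slot> slots_;
};
```

With the calls AppendList(2), Append(33), AppendList(2) and the flat buffer {11, 22, 44, 55}, the first sublist takes elements 0 and 1 of the buffer and the second takes elements 2 and 3.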

std::shared_ptr<DenseUnionArray> Finish(std::shared_ptr<arrow::Array> list_data, std::shared_ptr<arrow::Array> tuple_data, std::shared_ptr<arrow::Array> dict_data)

Finish building the sequence and return the result.
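The return type is a dense union array. In Arrow's dense union layout, every element carries a type id and an offset into the child array for that type. The sketch below imitates that layout with plain vectors. `ToyDenseUnion`, its tags and its method names are assumptions made for illustration; the type ids and children that Numbuf actually uses are not shown in this document.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy dense-union layout (not Arrow's classes): type_ids[i] says which
// child holds element i, and offsets[i] is its position in that child.
struct ToyDenseUnion {
  enum Tag : int8_t { kInt = 0, kDouble = 1, kString = 2 };

  std::vector<int8_t> type_ids;
  std::vector<int32_t> offsets;
  std::vector<int64_t> ints;
  std::vector<double> doubles;
  std::vector<std::string> strings;

  void AppendInt(int64_t v) {
    Record(kInt, ints.size());
    ints.push_back(v);
  }
  void AppendDouble(double v) {
    Record(kDouble, doubles.size());
    doubles.push_back(v);
  }
  void AppendString(const std::string& v) {
    Record(kString, strings.size());
    strings.push_back(v);
  }

  // Read element i back, assuming it was appended as an integer.
  int64_t IntAt(std::size_t i) const {
    assert(type_ids[i] == kInt);
    return ints[static_cast<std::size_t>(offsets[i])];
  }

  std::size_t length() const { return type_ids.size(); }

 private:
  // Store the tag and the position the value will take in its child.
  void Record(Tag tag, std::size_t child_size) {
    type_ids.push_back(tag);
    offsets.push_back(static_cast<int32_t>(child_size));
  }
};
```

Because each type has its own child array, a heterogeneous sequence can be stored as a handful of homogeneous columns plus the two small index buffers.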

template <typename T>
class numbuf::TensorBuilder

This is a class for building a dataframe where each row corresponds to a Tensor (= multidimensional array) of numerical data. There are two columns: “dims”, which contains an array of dimensions for each Tensor, and “data”, which contains the data buffer of the Tensor as a flattened array.

Public Functions

Status Append(const std::vector<int64_t> &dims, const elem_type *data)

Append a new tensor.

Parameters
  • dims -

    The dimensions of the Tensor

  • data -

    Pointer to the beginning of the data buffer of the Tensor. The total length of the buffer is sizeof(elem_type) * product of dims[i] over i
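The two-column layout and the buffer-size rule can be sketched with a small template. `ToyTensorBuilder` is a hypothetical stand-in that copies into `std::vector` instead of building Arrow arrays, and its accessors (`nbytes`, `dims`, `data`) are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy version of a tensor column builder: one row per
// tensor, with the shape in "dims" and the flattened values in "data".
template <typename T>
class ToyTensorBuilder {
 public:
  using elem_type = T;

  // Copy one tensor. The number of elements read from `data` is the
  // product of the entries of `dims`.
  void Append(const std::vector<int64_t>& dims, const elem_type* data) {
    std::size_t count = 1;
    for (int64_t d : dims) {
      assert(d >= 0);
      count *= static_cast<std::size_t>(d);
    }
    dims_.push_back(dims);
    data_.emplace_back(data, data + count);
  }

  // Size in bytes of the data buffer of row i:
  // sizeof(elem_type) times the number of elements.
  std::size_t nbytes(std::size_t i) const {
    return sizeof(elem_type) * data_[i].size();
  }

  int32_t length() const { return static_cast<int32_t>(dims_.size()); }
  const std::vector<int64_t>& dims(std::size_t i) const { return dims_[i]; }
  const std::vector<elem_type>& data(std::size_t i) const { return data_[i]; }

 private:
  std::vector<std::vector<int64_t>> dims_;    // one shape per row
  std::vector<std::vector<elem_type>> data_;  // one flat buffer per row
};
```

Appending a 2 x 3 tensor of double through this sketch stores the shape {2, 3} in one column and six values in the other, for a buffer of 6 * sizeof(double) bytes.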

std::shared_ptr<Array> Finish()

Convert the tensors to an Arrow StructArray.

int32_t length()

Number of tensors in the column.