Utilities

DataSet

class summit.utils.dataset.DataSet(data=None, index=None, columns=None, metadata_columns=[], units=None, dtype=None, copy=False)[source]

A representation of a dataset

This is essentially a pandas DataFrame with a set of "metadata" columns that are removed when the DataFrame is converted to a numpy array

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

Parameters
  • data (ndarray (structured or homogeneous), Iterable, dict, or DataFrame) –

    Dict can contain Series, arrays, constants, or list-like objects

    Changed in version 0.23.0: If data is a dict, argument order is maintained for Python 3.6 and later.

  • index (Index or array-like) – Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided

  • columns (Index or array-like) – Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided

  • metadata_columns (array-like) – A list of metadata columns that are already contained in the columns parameter.

  • units (array-like, optional) – A list of objects representing the units of the columns

  • dtype (dtype, default None) – Data type to force. Only a single dtype is allowed. If None, infer

  • copy (boolean, default False) – Copy data from inputs. Only affects DataFrame / 2d ndarray input

See also

DataFrame.from_records

Constructor from tuples, also record arrays.

DataFrame.from_dict

From dicts of Series, arrays, or dicts.

DataFrame.from_items

From sequence of (key, value) pairs. See also pandas.read_csv, pandas.read_table, and pandas.read_clipboard.

Examples

>>> data_columns = ["tau", "equiv_pldn", "conc_dfnb", "temperature"]
>>> metadata_columns = ["strategy"]
>>> columns = data_columns + metadata_columns
>>> values = [[1.5, 0.5, 0.1, 30.0, "test"]]
>>> ds = DataSet(values, columns=columns, metadata_columns=metadata_columns)
>>> values = {("tau", "DATA"): [1.5, 10.0],
...           ("equiv_pldn", "DATA"): [0.5, 3.0],
...           ("conc_dfnb", "DATA"): [0.1, 4.0],
...           ("temperature", "DATA"): [30.0, 100.0],
...           ("strategy", "METADATA"): ["test", "test"]}
>>> ds = DataSet(values)

Notes

Based on https://notes.mikejarrett.ca/storing-metadata-in-pandas-dataframes/

property data_columns

Names of the data columns

data_to_numpy() → numpy.ndarray[source]

Return the data columns as a numpy array, with the metadata columns removed
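
A minimal usage sketch (hedged; the example values and column names below are illustrative only):

>>> from summit.utils.dataset import DataSet
>>> ds = DataSet([[1.5, "test"], [10.0, "test"]], columns=["tau", "strategy"], metadata_columns=["strategy"])
>>> arr = ds.data_to_numpy()  # numpy array holding only the "tau" data column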

static from_df(df: pandas.core.frame.DataFrame, metadata_columns: List = [], units: List = [])[source]

Create a DataSet from a pandas DataFrame

Parameters
  • df (pandas.DataFrame) – Dataframe to be converted to a DataSet

  • metadata_columns (list, optional) – names of the columns in the dataframe that are metadata columns

  • units (list, optional) – A list of objects representing the units of the columns
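
A hedged sketch of the conversion (the DataFrame contents below are illustrative, not from the source):

>>> import pandas as pd
>>> from summit.utils.dataset import DataSet
>>> df = pd.DataFrame({"temperature": [30.0, 100.0], "strategy": ["test", "test"]})
>>> ds = DataSet.from_df(df, metadata_columns=["strategy"])
>>> data_cols = ds.data_columns  # expected to contain only "temperature"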

classmethod from_dict(d)[source]

Construct DataFrame from dict of array-like or dicts.

Creates DataFrame object from dictionary by columns or by index allowing dtype specification.

Parameters
  • data (dict) – Of the form {field : array-like} or {field : dict}.

  • orient ({'columns', 'index'}, default 'columns') – The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.

  • dtype (dtype, default None) – Data type to force, otherwise infer.

  • columns (list, default None) – Column labels to use when orient='index'. Raises a ValueError if used with orient='columns'.

Returns

Return type

DataFrame

See also

DataFrame.from_records

DataFrame from structured ndarray, sequence of tuples or dicts, or DataFrame.

DataFrame

DataFrame object creation using constructor.

Examples

By default the keys of the dict become the DataFrame columns:

>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Specify orient='index' to create the DataFrame using dictionary keys as rows:

>>> data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data, orient='index')
       0  1  2  3
row_1  3  2  1  0
row_2  a  b  c  d

When using the ‘index’ orientation, the column names can be specified manually:

>>> pd.DataFrame.from_dict(data, orient='index',
...                        columns=['A', 'B', 'C', 'D'])
       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d

insert(loc, column, value, type='DATA', units=None, allow_duplicates=False)[source]

Insert column into DataFrame at specified location.

Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.

Parameters
  • loc (int) – Insertion index. Must verify 0 <= loc <= len(columns).

  • column (str, number, or hashable object) – Label of the inserted column.

  • value (int, Series, or array-like) –

  • type (str, optional) – Second-level column label, 'DATA' (default) or 'METADATA'.

  • units (optional) – Units of the inserted column.

  • allow_duplicates (bool, optional) –
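
A hedged sketch of inserting a new data column (column names and values are illustrative):

>>> from summit.utils.dataset import DataSet
>>> ds = DataSet([[1.5, "a"], [10.0, "b"]], columns=["tau", "label"], metadata_columns=["label"])
>>> ds.insert(0, "temperature", [30.0, 100.0], type="DATA")  # new data column at position 0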

property metadata_columns

Names of the metadata columns
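
A brief hedged sketch of both column-name properties (expected values in the comments assume the two-column DataSet below):

>>> from summit.utils.dataset import DataSet
>>> ds = DataSet([[1.5, "test"]], columns=["tau", "strategy"], metadata_columns=["strategy"])
>>> names = ds.data_columns      # expected: ["tau"]
>>> meta = ds.metadata_columns   # expected: ["strategy"]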

static read_csv(filepath_or_buffer, **kwargs)[source]

Create a DataSet from a CSV file
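
A hedged sketch (the file path is hypothetical):

>>> from summit.utils.dataset import DataSet
>>> ds = DataSet.read_csv("experiments.csv")  # hypothetical file path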

standardize(small_tol=1e-05, return_mean=False, return_std=False, **kwargs) → numpy.ndarray[source]

Standardize data columns by removing the mean and scaling to unit variance

The standard score of each data column is calculated as:

z = (x - u) / s

where u is the mean and s is the standard deviation of each data column

Parameters
  • small_tol (float, optional) – The minimum value of any value in the final scaled array. This is used to prevent very small values that will cause issues in later calculations. Defaults to 1e-5.

  • return_mean (bool, optional) – Return an array with the mean of each column in the DataSet

  • return_std (bool, optional) – Return an array with the standard deviation of each column in the DataSet

  • mean (array, optional) – Pass a precalculated array of means for the columns

  • std (array, optional) – Pass a precalculated array of standard deviations for the columns

Returns

standard – Numpy array of the standardized data columns

Return type

np.ndarray

Notes

This method does not change the internal values of the data columns in place.
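
A hedged usage sketch (the DataSet below is illustrative; the tuple unpacking assumes the means and standard deviations are returned alongside the scaled array when both flags are set):

>>> from summit.utils.dataset import DataSet
>>> ds = DataSet([[1.5, "a"], [10.0, "b"]], columns=["tau", "label"], metadata_columns=["label"])
>>> scaled = ds.standardize()  # standardized data columns as a numpy array
>>> scaled, mean, std = ds.standardize(return_mean=True, return_std=True)  # assumed return order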

to_dict(**kwargs)[source]

Convert the DataFrame to a dictionary.

The type of the key-value pairs can be customized with the parameters (see below).

Parameters
  • orient (str {'dict', 'list', 'series', 'split', 'records', 'index'}) –

    Determines the type of the values of the dictionary.

    • ’dict’ (default) : dict like {column -> {index -> value}}

    • ’list’ : dict like {column -> [values]}

    • ’series’ : dict like {column -> Series(values)}

    • ’split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}

    • ’records’ : list like [{column -> value}, … , {column -> value}]

    • ’index’ : dict like {index -> {column -> value}}

    Abbreviations are allowed. s indicates series and sp indicates split.

  • into (class, default dict) – The collections.abc.Mapping subclass used for all Mappings in the return value. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.

Returns

Return a collections.abc.Mapping object representing the DataFrame. The resulting transformation depends on the orient parameter.

Return type

dict, list or collections.abc.Mapping

See also

DataFrame.from_dict

Create a DataFrame from a dictionary.

DataFrame.to_json

Convert a DataFrame to JSON format.

Examples

>>> df = pd.DataFrame({'col1': [1, 2],
...                    'col2': [0.5, 0.75]},
...                   index=['row1', 'row2'])
>>> df
      col1  col2
row1     1  0.50
row2     2  0.75
>>> df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}

You can specify the return orientation.

>>> df.to_dict('series')
{'col1': row1    1
         row2    2
Name: col1, dtype: int64,
'col2': row1    0.50
        row2    0.75
Name: col2, dtype: float64}
>>> df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
 'data': [[1, 0.5], [2, 0.75]]}
>>> df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
>>> df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}

You can also specify the mapping type.

>>> from collections import OrderedDict, defaultdict
>>> df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
             ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])

If you want a defaultdict, you need to initialize it:

>>> dd = defaultdict(list)
>>> df.to_dict('records', into=dd)
[defaultdict(<class 'list'>, {'col1': 1, 'col2': 0.5}),
 defaultdict(<class 'list'>, {'col1': 2, 'col2': 0.75})]

zero_to_one(small_tol=1e-05, return_min_max=False) → numpy.ndarray[source]

Scale the data columns between zero and one

Each of the data columns is scaled between zero and one based on the maximum and minimum values of each column

Parameters

  • small_tol (float, optional) – The minimum value of any value in the final scaled array. This is used to prevent very small values that will cause issues in later calculations. Defaults to 1e-5.

  • return_min_max (bool, optional) – If True, also return the minimum and maximum values of each data column. Defaults to False.

Returns

  • scaled (numpy.ndarray) – A numpy array with the scaled data columns

  • If return_min_max is True, a tuple of (scaled, mins, maxes) is returned instead

Notes

This method does not change the internal values of the data columns in place.
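
A hedged usage sketch (illustrative DataSet; the tuple form follows the Returns description above):

>>> from summit.utils.dataset import DataSet
>>> ds = DataSet([[1.5, "a"], [10.0, "b"]], columns=["tau", "label"], metadata_columns=["label"])
>>> scaled = ds.zero_to_one()
>>> scaled, mins, maxes = ds.zero_to_one(return_min_max=True)  # per the Returns section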

Multi-Objective

summit.utils.multiobjective.hypervolume(pointset, ref)[source]

Compute the absolute hypervolume of a pointset according to the reference point ref.
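
A hedged usage sketch (the point set and reference point are illustrative; ref is assumed here to be a point that is worse than every member of the point set):

>>> from summit.utils.multiobjective import hypervolume
>>> points = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
>>> hv = hypervolume(points, ref=[4.0, 4.0])  # volume dominated by the points relative to ref (assumption)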

summit.utils.multiobjective.pareto_efficient(data, maximize=True)[source]

Find the Pareto-efficient points

Parameters
  • data (array-like) – An (n_points, n_data) array

  • maximize (bool, optional) – Whether the problem is a maximization or minimization problem. Defaults to maximization (i.e., True)

Returns

  • data – An array with the Pareto front values

  • indices – An array with the indices of the Pareto points in the original data array

Return type

data, indices
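
A hedged usage sketch (example data is illustrative; under maximization the point [1.0, 1.0] is dominated by [2.0, 2.0] and should be excluded from the front):

>>> import numpy as np
>>> from summit.utils.multiobjective import pareto_efficient
>>> data = np.array([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0], [1.0, 1.0]])
>>> front, indices = pareto_efficient(data, maximize=True)
>>> # front holds the non-dominated rows; indices gives their positions in data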