Utilities ¶

Contents

Utilities
- DataSet
- Multi-Objective

DataSet ¶

class summit.utils.dataset.DataSet(data=None, index=None, columns=None, metadata_columns=[], units=None, dtype=None, copy=False)[source]¶

A representation of a dataset

This is essentially a pandas DataFrame with a set of “metadata” columns that are removed when the DataFrame is converted to a numpy array.

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

Parameters

data (ndarray (structured or homogeneous), Iterable, dict, or DataFrame) –
Dict can contain Series, arrays, constants, or list-like objects

Changed in version 0.23.0: If data is a dict, argument order is maintained for Python 3.6 and later.
index (Index or array-like) – Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided
columns (Index or array-like) – Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided
metadata_columns (Array-like) – A list of metadata columns that are already contained in the columns parameter.
dtype (dtype, default None) – Data type to force. Only a single dtype is allowed. If None, infer
copy (boolean, default False) – Copy data from inputs. Only affects DataFrame / 2d ndarray input

See also

DataFrame.from_records: Constructor from tuples, also record arrays.
DataFrame.from_dict: From dicts of Series, arrays, or dicts.
DataFrame.from_items: From sequence of (key, value) pairs.
pandas.read_csv, pandas.read_table, pandas.read_clipboard.

Examples

>>> data_columns = ["tau", "equiv_pldn", "conc_dfnb", "temperature"]
>>> metadata_columns = ["strategy"]
>>> columns = data_columns + metadata_columns
>>> values = [[1.5, 0.5, 0.1, 30.0, "test"]]
>>> ds = DataSet(values, columns=columns, metadata_columns=metadata_columns)
>>> values = {("tau", "DATA"): [1.5, 10.0],
...           ("equiv_pldn", "DATA"): [0.5, 3.0],
...           ("conc_dfnb", "DATA"): [0.1, 4.0],
...           ("temperature", "DATA"): [30.0, 100.0],
...           ("strategy", "METADATA"): ["test", "test"]}
>>> ds = DataSet(values)

Notes

Based on https://notes.mikejarrett.ca/storing-metadata-in-pandas-dataframes/

property data_columns¶: Names of the data columns

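data_to_numpy() → int[source]¶: Return the data columns as a numpy array, with the metadata columns removed (the int return annotation in the signature appears to be a mistake in the source)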
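
The split between data and metadata columns can be sketched with plain pandas. This is an illustration rather than Summit's implementation: it reuses the (name, type) column tuples from the Examples above, and the level names "NAME" and "TYPE" are labels assumed here for readability.

```python
import pandas as pd

# Column labels are (name, type) tuples, as in the dict form of the Examples.
values = {
    ("tau", "DATA"): [1.5, 10.0],
    ("temperature", "DATA"): [30.0, 100.0],
    ("strategy", "METADATA"): ["test", "test"],
}
df = pd.DataFrame(values)

# Assumption: the two column levels are labelled NAME and TYPE.
df.columns = df.columns.set_names(["NAME", "TYPE"])

# Boolean mask selecting only the columns tagged as DATA.
is_data = df.columns.get_level_values("TYPE") == "DATA"

# Numeric array without the metadata column, plus the surviving column names.
data_array = df.loc[:, is_data].to_numpy(dtype=float)
data_names = list(df.columns.get_level_values("NAME")[is_data])
```

Dropping the string-valued "strategy" column is what allows the remaining values to be converted to a single floating-point array.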

static from_df(df: pandas.core.frame.DataFrame, metadata_columns: List = [], units: List = [])[source]¶

Create a DataSet from a pandas DataFrame

Parameters

df (pandas.DataFrame) – Dataframe to be converted to a DataSet
metadata_columns (list, optional) – names of the columns in the dataframe that are metadata columns
units (list, optional) – A list of objects representing the units of the columns

classmethod from_dict(d)[source]¶

Construct DataFrame from dict of array-like or dicts.

Creates DataFrame object from dictionary by columns or by index allowing dtype specification.

Parameters

data (dict) – Of the form {field : array-like} or {field : dict}.
orient ({'columns', 'index', 'tight'}, default 'columns') –
The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’. If ‘tight’, assume a dict with keys [‘index’, ‘columns’, ‘data’, ‘index_names’, ‘column_names’].

New in version 1.4.0: ‘tight’ as an allowed value for the orient argument
dtype (dtype, default None) – Data type to force, otherwise infer.
columns (list, default None) – Column labels to use when orient='index'. Raises a ValueError if used with orient='columns' or orient='tight'.

Returns

Return type

DataFrame

See also

DataFrame.from_records: DataFrame from structured ndarray, sequence of tuples or dicts, or DataFrame.
DataFrame: DataFrame object creation using constructor.
DataFrame.to_dict: Convert the DataFrame to a dictionary.

Examples

By default the keys of the dict become the DataFrame columns:

>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Specify orient='index' to create the DataFrame using dictionary keys as rows:

>>> data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data, orient='index')
       0  1  2  3
row_1  3  2  1  0
row_2  a  b  c  d

When using the ‘index’ orientation, the column names can be specified manually:

>>> pd.DataFrame.from_dict(data, orient='index',
...                        columns=['A', 'B', 'C', 'D'])
       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d

Specify orient='tight' to create the DataFrame using a ‘tight’ format:

>>> data = {'index': [('a', 'b'), ('a', 'c')],
...         'columns': [('x', 1), ('y', 2)],
...         'data': [[1, 3], [2, 4]],
...         'index_names': ['n1', 'n2'],
...         'column_names': ['z1', 'z2']}
>>> pd.DataFrame.from_dict(data, orient='tight')
z1     x  y
z2     1  2
n1 n2
a  b   1  3
   c   2  4

insert(loc, column, value, type='DATA', units=None, allow_duplicates=False)[source]¶

Insert column into DataFrame at specified location.

Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.

Parameters

loc (int) – Insertion index. Must verify 0 <= loc <= len(columns).
column (str, number, or hashable object) – Label of the inserted column.
value (Scalar, Series, or array-like) – Content of the inserted column.
allow_duplicates (bool, optional, default lib.no_default) – Allow a column with a label that already exists to be inserted.

See also

Index.insert: Insert new item by index.

Examples

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
>>> df.insert(1, "newcol", [99, 99])
>>> df
   col1  newcol  col2
0     1      99     3
1     2      99     4
>>> df.insert(0, "col1", [100, 100], allow_duplicates=True)
>>> df
   col1  col1  newcol  col2
0   100     1      99     3
1   100     2      99     4

Note that pandas uses index alignment when value is a Series:

>>> df.insert(0, "col0", pd.Series([5, 6], index=[1, 2]))
>>> df
   col0  col1  col1  newcol  col2
0   NaN   100     1      99     3
1   5.0   100     2      99     4

property metadata_columns¶: Names of the metadata columns

static read_csv(filepath_or_buffer, **kwargs)[source]¶: Create a DataSet from a csv

standardize(small_tol=1e-05, return_mean=False, return_std=False, **kwargs) → numpy.ndarray[source]¶

Standardize data columns by removing the mean and scaling to unit variance

The standard score of each data column is calculated as:

z = (x - u) / s

where u is the mean and s is the standard deviation of each data column.

Parameters

small_tol (float, optional) – The smallest value allowed in the final scaled array. This prevents very small values that would cause issues in later calculations. Defaults to 1e-5.
return_mean (bool, optional) – Return an array with the mean of each column in the DataSet
return_std (bool, optional) – Return an array with the standard deviation of each column in the DataSet
mean (array, optional) – Pass a precalculated array of means for the columns
std (array, optional) – Pass a precalculated array of standard deviations for the columns

Returns

standard – Numpy array of the standardized data columns

Return type

np.ndarray

Notes

This method does not change the internal values of the data columns in place.
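
The standard score described above can be sketched in plain Python. This is an illustration, not Summit's implementation: the helper name standardize_columns is hypothetical, the population standard deviation is assumed, and setting values whose magnitude falls below small_tol to zero is an assumption about how the tolerance is applied.

```python
from statistics import fmean, pstdev


def standardize_columns(rows, small_tol=1e-5):
    """Return (standardized_rows, means, stds) for equal-length numeric rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Assumption: population standard deviation (divide by n, not n - 1).
    stds = [pstdev(col) for col in columns]
    standardized = []
    for row in rows:
        # A constant column (s == 0) would divide by zero; not handled here.
        scores = [(x - u) / s for x, u, s in zip(row, means, stds)]
        # Assumption: magnitudes below small_tol are snapped to zero.
        standardized.append([0.0 if abs(z) < small_tol else z for z in scores])
    return standardized, means, stds


# Two rows of (tau, temperature) values taken from the DataSet example.
rows = [[1.5, 30.0], [10.0, 100.0]]
z, means, stds = standardize_columns(rows)
```

The input rows are left untouched and a new list is returned, mirroring the note that the method does not modify the data columns in place.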

to_dict(**kwargs)[source]¶

Convert the DataFrame to a dictionary.

The type of the key-value pairs can be customized with the parameters (see below).

Parameters

orient (str {'dict', 'list', 'series', 'split', 'tight', 'records', 'index'}) –
Determines the type of the values of the dictionary.
- ’dict’ (default) : dict like {column -> {index -> value}}
- ’list’ : dict like {column -> [values]}
- ’series’ : dict like {column -> Series(values)}
- ’split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
- ’tight’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values], ‘index_names’ -> [index.names], ‘column_names’ -> [column.names]}
- ’records’ : list like [{column -> value}, … , {column -> value}]
- ’index’ : dict like {index -> {column -> value}}
Abbreviations are allowed. s indicates series and sp indicates split.

New in version 1.4.0: ‘tight’ as an allowed value for the orient argument
into (class, default dict) – The collections.abc.Mapping subclass used for all Mappings in the return value. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.

Returns

Return a collections.abc.Mapping object representing the DataFrame. The resulting transformation depends on the orient parameter.

Return type

dict, list or collections.abc.Mapping

See also

DataFrame.from_dict: Create a DataFrame from a dictionary.
DataFrame.to_json: Convert a DataFrame to JSON format.

Examples

>>> df = pd.DataFrame({'col1': [1, 2],
...                    'col2': [0.5, 0.75]},
...                   index=['row1', 'row2'])
>>> df
      col1  col2
row1     1  0.50
row2     2  0.75
>>> df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}

You can specify the return orientation.

>>> df.to_dict('series')
{'col1': row1    1
         row2    2
Name: col1, dtype: int64,
'col2': row1    0.50
        row2    0.75
Name: col2, dtype: float64}

>>> df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
 'data': [[1, 0.5], [2, 0.75]]}

>>> df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]

>>> df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}

>>> df.to_dict('tight')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
 'data': [[1, 0.5], [2, 0.75]], 'index_names': [None], 'column_names': [None]}

You can also specify the mapping type.

>>> from collections import OrderedDict, defaultdict
>>> df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
             ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])

If you want a defaultdict, you need to initialize it:

>>> dd = defaultdict(list)
>>> df.to_dict('records', into=dd)
[defaultdict(<class 'list'>, {'col1': 1, 'col2': 0.5}),
 defaultdict(<class 'list'>, {'col1': 2, 'col2': 0.75})]

zero_to_one(small_tol=1e-05, return_min_max=False) → numpy.ndarray[source]¶

Scale the data columns between zero and one

Each of the data columns is scaled between zero and one based on the maximum and minimum values of each column

Parameters

small_tol (float, optional) – The smallest value allowed in the final scaled array. This prevents very small values that would cause issues in later calculations. Defaults to 1e-5.
return_min_max (bool, optional) – Also return the minimum and maximum of each data column. Defaults to False.

Returns

scaled (numpy.ndarray) – A numpy array with the scaled data columns.
If return_min_max is True, a tuple of (scaled, mins, maxes) is returned instead.

Notes

This method does not change the internal values of the data columns in place.
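
Min-max scaling of each column can be sketched in plain Python. This is an illustration, not Summit's implementation: the helper name scale_zero_to_one is hypothetical, and setting values whose magnitude falls below small_tol to zero is an assumption about how the tolerance is applied.

```python
def scale_zero_to_one(rows, small_tol=1e-5):
    """Return (scaled_rows, mins, maxes) for equal-length numeric rows."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxes = [max(col) for col in columns]
    scaled = []
    for row in rows:
        # A constant column (max == min) would divide by zero; not handled here.
        new_row = [(x - lo) / (hi - lo) for x, lo, hi in zip(row, mins, maxes)]
        # Assumption: magnitudes below small_tol are snapped to zero.
        scaled.append([0.0 if abs(v) < small_tol else v for v in new_row])
    return scaled, mins, maxes


rows = [[1.5, 30.0], [10.0, 100.0], [5.75, 47.5]]
scaled, mins, maxes = scale_zero_to_one(rows)
```

Returning the minima and maxima alongside the scaled values corresponds to the return_min_max=True case, and makes it possible to map scaled values back to the original range.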

Multi-Objective ¶

summit.utils.multiobjective.hypervolume(pointset, ref)[source]¶: Compute the absolute hypervolume of a pointset according to the reference point ref.
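
For intuition, the hypervolume of a two-objective pointset is the area dominated by the points and bounded by the reference point. The sketch below is not Summit's implementation: the helper name hypervolume_2d is hypothetical, it handles only two objectives, and it assumes both objectives are minimized, which may differ from the convention used by the function above.

```python
def hypervolume_2d(points, ref):
    """Area dominated by 2-D points (both minimized), bounded by ref."""
    # Only points strictly better than ref in both objectives contribute.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        # Sweeping by increasing x, a point adds area only if it lowers y.
        if y < best_y:
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area


area = hypervolume_2d([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)], ref=(4.0, 4.0))
```

Dominated points add nothing to the total, so the value depends only on the pareto front and the chosen reference point.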

summit.utils.multiobjective.pareto_efficient(data, maximize=True)[source]¶

Find the pareto-efficient points

Parameters

data (array-like) – An (n_points, n_data) array
maximize (bool, optional) – Whether the problem is a maximization or minimization problem. Defaults to maximization (i.e., True)

Returns

data is an array with the pareto front values.
indices is an array with the indices of the pareto points in the original data array.

Return type

data, indices
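
A point is pareto-efficient when no other point is at least as good in every objective and strictly better in at least one. The sketch below shows that definition in plain Python with a simple pairwise comparison. It is not Summit's implementation, and the helper name pareto_front is hypothetical.

```python
def pareto_front(points, maximize=True):
    """Return (front_points, indices) of the non-dominated points."""
    # Flip signs for minimization so "larger is better" always holds below.
    sign = 1.0 if maximize else -1.0
    scored = [[sign * v for v in p] for p in points]

    def dominates(a, b):
        # a dominates b: no worse in every objective, better in at least one.
        return all(x >= y for x, y in zip(a, b)) and any(
            x > y for x, y in zip(a, b)
        )

    indices = [
        i
        for i, p in enumerate(scored)
        if not any(dominates(q, p) for j, q in enumerate(scored) if j != i)
    ]
    return [points[i] for i in indices], indices


points = [[1.0, 5.0], [2.0, 4.0], [1.5, 3.0], [3.0, 1.0]]
front_max, idx_max = pareto_front(points, maximize=True)
front_min, idx_min = pareto_front(points, maximize=False)
```

The pairwise check costs O(n²) comparisons, which is adequate for small pointsets but slow for large ones.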
