class summit.utils.dataset.DataSet(data=None, index=None, columns=None, metadata_columns=[], units=None, dtype=None, copy=False)[source]

A representation of a dataset

This is essentially a pandas DataFrame with a set of "metadata" columns that are removed when the DataFrame is converted to a numpy array

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

  • data (ndarray (structured or homogeneous), Iterable, dict, or DataFrame) –

    Dict can contain Series, arrays, constants, or list-like objects

    Changed in version 0.23.0: If data is a dict, argument order is maintained for Python 3.6 and later.

  • index (Index or array-like) – Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided

  • columns (Index or array-like) – Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided

  • metadata_columns (array-like) – A list of metadata columns that are already contained in the columns parameter.

  • units (array-like, optional) – A list of objects representing the units of the columns

  • dtype (dtype, default None) – Data type to force. Only a single dtype is allowed. If None, infer

  • copy (boolean, default False) – Copy data from inputs. Only affects DataFrame / 2d ndarray input

See also

DataFrame.from_records
    Constructor from tuples, also record arrays.

DataFrame.from_dict
    From dicts of Series, arrays, or dicts.

DataFrame.from_items
    From sequence of (key, value) pairs.

pandas.read_csv, pandas.read_table, pandas.read_clipboard


>>> data_columns = ["tau", "equiv_pldn", "conc_dfnb", "temperature"]
>>> metadata_columns = ["strategy"]
>>> columns = data_columns + metadata_columns
>>> values = [[1.5, 0.5, 0.1, 30.0, "test"]]
>>> ds = DataSet(values, columns=columns, metadata_columns=["strategy"])
>>> values = {("tau", "DATA"): [1.5, 10.0],
...           ("equiv_pldn", "DATA"): [0.5, 3.0],
...           ("conc_dfnb", "DATA"): [0.1, 4.0],
...           ("temperature", "DATA"): [30.0, 100.0],
...           ("strategy", "METADATA"): ["test", "test"]}
>>> ds = DataSet(values)


Based on https://notes.mikejarrett.ca/storing-metadata-in-pandas-dataframes/
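The underlying mechanism can be sketched in plain pandas, without summit: a second column level tags each column as DATA or METADATA, and metadata columns are dropped before numeric conversion. The column names below are illustrative, and the level names ("NAME", "TYPE") are an assumption for the sketch.

```python
# Illustrative sketch (plain pandas, no summit): a DataSet is essentially a
# DataFrame whose columns carry a second level tagging each column as
# DATA or METADATA; metadata columns are dropped before numeric conversion.
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("tau", "DATA"), ("temperature", "DATA"), ("strategy", "METADATA")],
    names=["NAME", "TYPE"],
)
df = pd.DataFrame([[1.5, 30.0, "test"], [10.0, 100.0, "test"]], columns=columns)

# Keep only DATA columns, then convert to a purely numeric array
data_only = df.loc[:, df.columns.get_level_values("TYPE") == "DATA"]
arr = data_only.to_numpy(dtype=float)
```

This mirrors the pattern described in the linked note: the metadata travels with the frame but never reaches numeric code.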

property data_columns

Names of the data columns


Return dataframe with the metadata columns removed

static from_df(df: pandas.core.frame.DataFrame, metadata_columns: List = [], units: List = [])[source]

Create a DataSet from a pandas DataFrame

  • df (pandas.DataFrame) – Dataframe to be converted to a DataSet

  • metadata_columns (list, optional) – names of the columns in the dataframe that are metadata columns

  • units (list, optional) – A list of objects representing the units of the columns

classmethod from_dict(d)[source]

Construct DataFrame from dict of array-like or dicts.

Creates DataFrame object from dictionary by columns or by index allowing dtype specification.

  • data (dict) – Of the form {field : array-like} or {field : dict}.

  • orient ({'columns', 'index', 'tight'}, default 'columns') –

    The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’. If ‘tight’, assume a dict with keys [‘index’, ‘columns’, ‘data’, ‘index_names’, ‘column_names’].

    New in version 1.4.0: ‘tight’ as an allowed value for the orient argument

  • dtype (dtype, default None) – Data type to force, otherwise infer.

  • columns (list, default None) – Column labels to use when orient='index'. Raises a ValueError if used with orient='columns' or orient='tight'.


Return type

DataFrame
See also

DataFrame.from_records
    DataFrame from structured ndarray, sequence of tuples or dicts, or DataFrame.

DataFrame
    DataFrame object creation using constructor.

DataFrame.to_dict
    Convert the DataFrame to a dictionary.


By default the keys of the dict become the DataFrame columns:

>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Specify orient='index' to create the DataFrame using dictionary keys as rows:

>>> data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data, orient='index')
       0  1  2  3
row_1  3  2  1  0
row_2  a  b  c  d

When using the ‘index’ orientation, the column names can be specified manually:

>>> pd.DataFrame.from_dict(data, orient='index',
...                        columns=['A', 'B', 'C', 'D'])
       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d

Specify orient='tight' to create the DataFrame using a ‘tight’ format:

>>> data = {'index': [('a', 'b'), ('a', 'c')],
...         'columns': [('x', 1), ('y', 2)],
...         'data': [[1, 3], [2, 4]],
...         'index_names': ['n1', 'n2'],
...         'column_names': ['z1', 'z2']}
>>> pd.DataFrame.from_dict(data, orient='tight')
z1     x  y
z2     1  2
n1 n2
a  b   1  3
   c   2  4
insert(loc, column, value, type='DATA', units=None, allow_duplicates=False)[source]

Insert column into DataFrame at specified location.

Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.

  • loc (int) – Insertion index. Must verify 0 <= loc <= len(columns).

  • column (str, number, or hashable object) – Label of the inserted column.

  • value (Scalar, Series, or array-like) –

  • allow_duplicates (bool, optional, default False) –

See also

Index.insert
    Insert new item by index.


>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
>>> df.insert(1, "newcol", [99, 99])
>>> df
   col1  newcol  col2
0     1      99     3
1     2      99     4
>>> df.insert(0, "col1", [100, 100], allow_duplicates=True)
>>> df
   col1  col1  newcol  col2
0   100     1      99     3
1   100     2      99     4

Notice that pandas uses index alignment in case of value from type Series:

>>> df.insert(0, "col0", pd.Series([5, 6], index=[1, 2]))
>>> df
   col0  col1  col1  newcol  col2
0   NaN   100     1      99     3
1   5.0   100     2      99     4
property metadata_columns

Names of the metadata columns

static read_csv(filepath_or_buffer, **kwargs)[source]

Create a DataSet from a csv

standardize(small_tol=1e-05, return_mean=False, return_std=False, **kwargs) → numpy.ndarray[source]

Standardize data columns by removing the mean and scaling to unit variance

The standard score of each data column is calculated as:

z = (x - u) / s

where u is the mean of the columns and s is the standard deviation of each data column

  • small_tol (float, optional) – The minimum absolute value allowed in the final scaled array. This is used to prevent very small values that will cause issues in later calculations. Defaults to 1e-5.

  • return_mean (bool, optional) – Return an array with the mean of each column in the DataSet

  • return_std (bool, optional) – Return an array with the standard deviation of each column in the DataSet

  • mean (array, optional) – Pass a precalculated array of means for the columns

  • std (array, optional) – Pass a precalculated array of standard deviations for the columns


standard – Numpy array of the standardized data columns

Return type

numpy.ndarray


This method does not change the internal values of the data columns in place.
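The formula z = (x - u) / s can be sketched in plain numpy; the small_tol handling shown here (zeroing entries below the tolerance) is an assumption about the intended behavior, not summit's exact implementation.

```python
# Sketch of column-wise standardization, z = (x - u) / s, in plain numpy.
# The small_tol clipping is an assumed interpretation of the parameter.
import numpy as np

def standardize(x, small_tol=1e-5):
    u = x.mean(axis=0)   # mean of each data column
    s = x.std(axis=0)    # standard deviation of each data column
    z = (x - u) / s
    # Zero out entries below the tolerance to avoid numerical noise downstream
    z[np.abs(z) < small_tol] = 0.0
    return z, u, s

x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
z, u, s = standardize(x)
```

Passing precalculated mean and std arrays (as the mean and std keyword arguments allow) follows the same arithmetic with u and s supplied by the caller.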


to_dict(orient='dict', into=<class 'dict'>)

Convert the DataFrame to a dictionary.

The type of the key-value pairs can be customized with the parameters (see below).

  • orient (str {'dict', 'list', 'series', 'split', 'tight', 'records', 'index'}) –

    Determines the type of the values of the dictionary.

    • ’dict’ (default) : dict like {column -> {index -> value}}

    • ’list’ : dict like {column -> [values]}

    • ’series’ : dict like {column -> Series(values)}

    • ’split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}

    • ’tight’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values], ‘index_names’ -> [index.names], ‘column_names’ -> [column.names]}

    • ’records’ : list like [{column -> value}, … , {column -> value}]

    • ’index’ : dict like {index -> {column -> value}}

    Abbreviations are allowed. s indicates series and sp indicates split.

    New in version 1.4.0: ‘tight’ as an allowed value for the orient argument

  • into (class, default dict) – The collections.abc.Mapping subclass used for all Mappings in the return value. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.


Return a collections.abc.Mapping object representing the DataFrame. The resulting transformation depends on the orient parameter.

Return type

dict, list or collections.abc.Mapping

See also

DataFrame.from_dict
    Create a DataFrame from a dictionary.

DataFrame.to_json
    Convert a DataFrame to JSON format.


>>> df = pd.DataFrame({'col1': [1, 2],
...                    'col2': [0.5, 0.75]},
...                   index=['row1', 'row2'])
>>> df
      col1  col2
row1     1  0.50
row2     2  0.75
>>> df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}

You can specify the return orientation.

>>> df.to_dict('series')
{'col1': row1    1
         row2    2
Name: col1, dtype: int64,
'col2': row1    0.50
        row2    0.75
Name: col2, dtype: float64}
>>> df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
 'data': [[1, 0.5], [2, 0.75]]}
>>> df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
>>> df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}
>>> df.to_dict('tight')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
 'data': [[1, 0.5], [2, 0.75]], 'index_names': [None], 'column_names': [None]}

You can also specify the mapping type.

>>> from collections import OrderedDict, defaultdict
>>> df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
             ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])

If you want a defaultdict, you need to initialize it:

>>> dd = defaultdict(list)
>>> df.to_dict('records', into=dd)
[defaultdict(<class 'list'>, {'col1': 1, 'col2': 0.5}),
 defaultdict(<class 'list'>, {'col1': 2, 'col2': 0.75})]
zero_to_one(small_tol=1e-05, return_min_max=False) → numpy.ndarray[source]

Scale the data columns between zero and one

Each of the data columns is scaled between zero and one based on the maximum and minimum values of each column


small_tol (float, optional) – The minimum absolute value allowed in the final scaled array. This is used to prevent very small values that will cause issues in later calculations. Defaults to 1e-5.


  • scaled (numpy.ndarray) – A numpy array with the scaled data columns

  • If return_min_max is True, returns a tuple of (scaled, mins, maxes)


This method does not change the internal values of the data columns in place.
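The min-max scaling can be sketched in plain numpy; as with standardize, the small_tol handling below is an assumed interpretation, not summit's exact code.

```python
# Sketch of column-wise min-max scaling to [0, 1] in plain numpy.
# The small_tol clipping is an assumed interpretation of the parameter.
import numpy as np

def zero_to_one(x, small_tol=1e-5):
    mins = x.min(axis=0)
    maxes = x.max(axis=0)
    scaled = (x - mins) / (maxes - mins)
    # Zero out entries below the tolerance to avoid numerical noise downstream
    scaled[np.abs(scaled) < small_tol] = 0.0
    return scaled, mins, maxes

x = np.array([[1.0, 5.0], [3.0, 10.0], [2.0, 7.5]])
scaled, mins, maxes = zero_to_one(x)
```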


summit.utils.multiobjective.hypervolume(pointset, ref)[source]

Compute the absolute hypervolume of a pointset according to the reference point ref.
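For intuition, a minimal two-dimensional sketch (not summit's implementation, which handles the general case) computes the dominated area for a mutually non-dominated minimization pointset by sweeping along the first objective:

```python
# 2D hypervolume sketch for minimization: sum the rectangles between
# consecutive points (sorted by the first objective) and the reference point.
# Assumes the points are mutually non-dominated and all dominate ref.
import numpy as np

def hypervolume_2d(points, ref):
    pts = np.asarray(points, dtype=float)
    pts = pts[np.argsort(pts[:, 0])]  # sweep along the first objective
    hv = 0.0
    for i, (f1, f2) in enumerate(pts):
        # width until the next point (or the reference) along objective 1
        next_f1 = pts[i + 1, 0] if i + 1 < len(pts) else ref[0]
        hv += (next_f1 - f1) * (ref[1] - f2)
    return hv
```

For example, the front {(1, 3), (2, 2), (3, 1)} with reference point (4, 4) dominates an area of 6.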

summit.utils.multiobjective.pareto_efficient(data, maximize=True)[source]

Find the Pareto-efficient points

  • data (array-like) – An (n_points, n_data) array

  • maximize (bool, optional) – Whether the problem is a maximization or minimization problem. Defaults to maximization (i.e., True)


  • data – An array with the Pareto front values

  • indices – An array with the indices of the Pareto points in the original data array

Return type

data, indices
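The underlying idea can be sketched in plain numpy (an illustrative reimplementation, not summit's code): keep every point that no other point dominates, reducing minimization to maximization by negation.

```python
# Sketch of Pareto-efficiency filtering: a point is dominated if another
# point is at least as good in every objective and strictly better in one.
import numpy as np

def pareto_efficient(data, maximize=True):
    arr = np.asarray(data, dtype=float)
    if not maximize:
        arr = -arr  # reduce minimization to maximization
    keep = np.ones(arr.shape[0], dtype=bool)
    for i in range(arr.shape[0]):
        if keep[i]:
            # drop every point that point i dominates
            dominated = np.all(arr[i] >= arr, axis=1) & np.any(arr[i] > arr, axis=1)
            keep[dominated] = False
    indices = np.flatnonzero(keep)
    return np.asarray(data, dtype=float)[indices], indices
```

The double loop over pairwise comparisons is O(n^2) in the number of points, which is typically acceptable for the result sets produced by experimental optimization.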