malt.data.dataset.Dataset

class malt.data.dataset.Dataset(molecules: Optional[List] = None)

Bases: torch.utils.data.dataset.Dataset

A collection of Molecules with functionalities to be compatible with training and optimization.

Parameters

molecules (List[malt.Molecule]) – A list of Molecules.

featurize(molecules)

Featurize all molecules in the dataset.

view()

Generate a torch.utils.data.DataLoader from this Dataset.

__init__(molecules: Optional[List] = None) → None

Methods

__init__([molecules])

append(molecule)

Append a molecule to the dataset.

apply(function)

Apply a function to all molecules in the dataset.

batch(*args, **kwargs)

clone()

Return a copy of self.

erase_annotation()

Erase the metadata.

featurize_all()

Featurize all molecules in dataset.

shuffle([seed])

Shuffle the dataset and return it.

split(partition)

Split the dataset according to some partition.

view([collate_fn, by])

Provide a data loader from this dataset.

Attributes

lookup

Return the mapping from SMILES strings to molecules.

smiles

Return the list of SMILES strings in the dataset.

append(molecule)

Append a molecule to the dataset.

Alias of append for molecules.

Note

  • This appends in place.

Parameters

molecule (Molecule) – The molecule to be appended.

apply(function)

Apply a function to all molecules in the dataset.

Parameters

function (Callable) – The function to be applied to all molecules in this dataset in place.

Examples

>>> from malt.molecule import Molecule
>>> from malt.data.dataset import Dataset
>>> molecule = Molecule("CC")
>>> dataset = Dataset([molecule])
>>> fn = lambda molecule: Molecule(
...     smiles=molecule.smiles, metadata={"name": "john"},
... )
>>> dataset = dataset.apply(fn)
>>> dataset[0]["name"]
'john'

clone()

Return a copy of self.

erase_annotation()

Erase the metadata.

featurize_all()

Featurize all molecules in dataset.

property lookup

Return the mapping from SMILES strings to molecules.

shuffle(seed=None)

Shuffle the dataset and return it.
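
The seed argument suggests that shuffling can be made reproducible. The sketch below shows that general pattern with the standard library only. It is an illustration under that assumption: `shuffled` is a hypothetical helper, not malt code, and it returns a copy, whereas the method is documented as shuffling the dataset and returning it.

```python
import random
from typing import List, Optional, Sequence


def shuffled(items: Sequence, seed: Optional[int] = None) -> List:
    """Hypothetical helper: return a shuffled copy; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    result = list(items)
    rng.shuffle(result)
    return result


smiles = ["C", "CC", "CCC", "CCCC"]
first = shuffled(smiles, seed=2666)
second = shuffled(smiles, seed=2666)
```

Fixing the seed before splitting is a common way to keep train and test sets stable across runs.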

property smiles

Return the list of SMILES strings in the dataset.
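
The lookup and smiles properties can be pictured with plain Python containers. This is a minimal sketch, assuming each molecule exposes a `smiles` attribute as in the examples on this page; `ToyMolecule` is a hypothetical stand-in for malt.Molecule so that the snippet runs without malt installed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyMolecule:
    """Hypothetical stand-in for malt.Molecule, holding only what this sketch needs."""
    smiles: str
    metadata: dict = field(default_factory=dict)


molecules = [ToyMolecule("CC"), ToyMolecule("C", {"name": "methane"})]

# Analogue of the `smiles` property: one SMILES string per molecule, in order.
smiles: List[str] = [molecule.smiles for molecule in molecules]

# Analogue of the `lookup` property: SMILES string -> molecule object.
lookup: Dict[str, ToyMolecule] = {molecule.smiles: molecule for molecule in molecules}
```

A mapping keyed on SMILES holds one entry per distinct string, so duplicate SMILES would collapse in this sketch; how malt treats that case is not stated on this page.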

split(partition)

Split the dataset according to some partition.

Parameters

partition (Sequence[Union[int, float]]) – Relative sizes of the splits, for example [1, 1] for two equal parts.

Returns

List of datasets split according to the partition.

Return type

List[Dataset]

Examples

>>> from malt.molecule import Molecule
>>> from malt.data.dataset import Dataset
>>> dataset = Dataset([Molecule("CC"), Molecule("C")])
>>> dataset0, dataset1 = dataset.split([1, 1])
>>> dataset0[0].smiles
'CC'

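
The example above is consistent with partition entries acting as relative weights rather than absolute counts. A minimal sketch of that idea in plain Python follows, assuming sizes come from proportional scaling with truncation; `split_by_ratio` is a hypothetical helper and not malt's implementation.

```python
from typing import List, Sequence, Union


def split_by_ratio(items: Sequence, partition: Sequence[Union[int, float]]) -> List[list]:
    """Hypothetical helper: cut `items` into consecutive chunks sized by relative weights."""
    total = sum(partition)
    # Assumption: each chunk size is the truncated proportional share.
    sizes = [int(len(items) * weight / total) for weight in partition]
    chunks, start = [], 0
    for size in sizes:
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


smiles = ["CC", "C", "CCC", "CCCC"]
train, test = split_by_ratio(smiles, [3, 1])
```

With truncation, some items can be left over when the weights do not divide the length evenly; whether malt drops or redistributes them is not stated on this page.
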
view(collate_fn: Optional[Callable] = None, by: Union[Iterable, str] = ['g', 'y'], *args, **kwargs)

Provide a data loader from this dataset.

Parameters

  • collate_fn (Optional[Callable]) – The function to gather data molecules.

  • assay (Union[None, str]) – Batch data from molecules using key provided to filter metadata.

  • by (Union[Iterable, str])

Returns

Resulting data loader.

Return type

torch.utils.data.DataLoader
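
In PyTorch, a collate_fn receives the list of samples drawn for one batch and returns the batched object. The sketch below shows that contract in plain Python. The dict-based samples and the `collate_by_keys` helper are assumptions made for illustration; the keys 'g' and 'y' are borrowed from the default value of `by`, and nothing here should be read as malt's actual batching logic.

```python
from typing import Dict, List, Sequence


def collate_by_keys(samples: List[Dict], keys: Sequence[str] = ("g", "y")) -> tuple:
    """Hypothetical collate function: gather the fields named in `keys` into per-field lists."""
    return tuple([sample[key] for sample in samples] for key in keys)


# Toy samples; in malt the items would be Molecule objects, not dicts.
samples = [{"g": "graph-0", "y": 0.1}, {"g": "graph-1", "y": 0.7}]
graphs, targets = collate_by_keys(samples)
```

A function with this shape, taking a list of samples and returning one batch, is what torch.utils.data.DataLoader expects for its collate_fn argument.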