
Datasets

Dataset

Dataset(
    type,
    name=None,
    resource=None,
    training_resource=None,
    testing_resource=None,
    validation_resource=None,
    derived_from=None,
    dataframe=None,
    training_dataframe=None,
    testing_dataframe=None,
    validation_dataframe=None,
)

Users should not instantiate a dataset directly but rather use the provided static methods origin(), clean(), and modeling().

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `type` | `DatasetType` | The type of dataset. | required |
| `name` | `str \| None` | The name of the dataset. | `None` |
| `resource` | `Resource \| None` | A single resource (for origin and clean datasets). | `None` |
| `training_resource` | `Resource \| None` | The resource for the training set (for modeling datasets). | `None` |
| `testing_resource` | `Resource \| None` | The resource for the testing set (for modeling datasets). | `None` |
| `validation_resource` | `Resource \| None` | The resource for the validation set (optional, for modeling datasets). | `None` |
| `derived_from` | `list[int \| Dataset] \| None` | A list of datasets (or ids) from which this dataset is derived. | `None` |
| `dataframe` | `DataFrame \| None` | A pandas dataframe for clean and origin datasets. | `None` |
| `training_dataframe` | `DataFrame \| None` | A pandas dataframe for the training set (for modeling datasets). | `None` |
| `testing_dataframe` | `DataFrame \| None` | A pandas dataframe for the testing set (for modeling datasets). | `None` |
| `validation_dataframe` | `DataFrame \| None` | A pandas dataframe for the validation set (optional, for modeling datasets). | `None` |

Source code in src/vectice/models/dataset.py
def __init__(
    self,
    type: DatasetType,
    name: str | None = None,
    resource: Resource | None = None,
    training_resource: Resource | None = None,
    testing_resource: Resource | None = None,
    validation_resource: Resource | None = None,
    derived_from: list[int | Dataset] | None = None,
    dataframe: DataFrame | None = None,
    training_dataframe: DataFrame | None = None,
    testing_dataframe: DataFrame | None = None,
    validation_dataframe: DataFrame | None = None,
    # attachments: str | list[str] | None = None,
    # properties: dict[str, str | int] | list[Property] | Property | None = None,
):
    """Initialize a dataset.

    Users should not instantiate a dataset directly but rather use the provided static methods
    [`origin()`][vectice.models.dataset.Dataset.origin],
    [`clean()`][vectice.models.dataset.Dataset.clean], and
    [`modeling()`][vectice.models.dataset.Dataset.modeling].

    Parameters:
        type: The type of dataset.
        name: The name of the dataset.
        resource: A single resource (for origin and clean datasets).
        training_resource: The resource for the training set (for modeling datasets).
        testing_resource: The resource for the testing set (for modeling datasets).
        validation_resource: The resource for the validation set (optional, for modeling datasets).
        derived_from: A list of datasets (or ids) from which this dataset is derived.
        dataframe: A pandas dataframe for clean and origin datasets.
        training_dataframe: A pandas dataframe for the training set (for modeling datasets).
        testing_dataframe: A pandas dataframe for the testing set (for modeling datasets).
        validation_dataframe: A pandas dataframe for the validation set (optional, for modeling datasets).
    """
    derived_from_ids = []
    for df in derived_from or []:
        if isinstance(df, Dataset):
            if df.latest_version_id is None:
                raise ValueError(
                    f"Dataset '{df.name}' does not have a version id. "
                    "Was it registered in Vectice (assigned to a step)?"
                )
            derived_from_ids.append(df.latest_version_id)
        else:
            derived_from_ids.append(df)
    self._type = type
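    # Assumes `from datetime import datetime`. The uncalled `datetime.time`
    # would embed an object repr in the name rather than a timestamp.
    self._name = name or f"dataset {datetime.now()}"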
    self._resource = resource
    self._training_resource = training_resource
    self._testing_resource = testing_resource
    self._validation_resource = validation_resource
    self._derived_from = derived_from_ids
    self._latest_version_id: int | None = None

    # self._properties = self._format_properties(properties) if properties else None
    # self._attachments = self._format_attachments(attachments) if attachments else None

    if self._type is DatasetType.MODELING:
        if self._training_resource is None or self._testing_resource is None:
            raise ValueError("You cannot create a modeling dataset without both training and testing sets")

        self._training_resource.usage = DatasetSourceUsage.TRAINING
        self._testing_resource.usage = DatasetSourceUsage.TESTING
        if self._validation_resource:
            self._validation_resource.usage = DatasetSourceUsage.VALIDATION

    self._fill_file_dataframe(
        [
            (resource, dataframe, "dataframe"),
            (training_resource, training_dataframe, "training_dataframe"),
            (testing_resource, testing_dataframe, "testing_dataframe"),
            (validation_resource, validation_dataframe, "validation_dataframe"),
        ]
    )

derived_from property

derived_from: list[int]

The ids of the dataset versions from which this dataset is derived.

Returns:

| Type | Description |
| --- | --- |
| `list[int]` | The ids of the dataset versions from which this dataset is derived. |
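
The constructor accepts a mix of ids and dataset objects in `derived_from` and stores only ids. The following is a minimal sketch of that resolution step, not the library's code: `StubDataset` and `resolve_derived_from` are hypothetical names introduced for illustration, and the stub only carries the two attributes the check relies on.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class StubDataset:
    """Hypothetical stand-in for a Dataset; not part of vectice."""

    name: str
    latest_version_id: Optional[int] = None


def resolve_derived_from(items: Optional[List[Union[int, StubDataset]]]) -> List[int]:
    """Turn a mix of ids and dataset objects into a flat list of version ids."""
    ids: List[int] = []
    for item in items or []:
        if isinstance(item, StubDataset):
            # An object without a version id has not been registered yet.
            if item.latest_version_id is None:
                raise ValueError(f"Dataset '{item.name}' does not have a version id.")
            ids.append(item.latest_version_id)
        else:
            # Plain integers are taken as ids and passed through unchanged.
            ids.append(item)
    return ids


registered = StubDataset(name="origin", latest_version_id=42)
print(resolve_derived_from([7, registered]))  # [7, 42]
```

Passing an object whose `latest_version_id` is still `None` raises `ValueError`, which mirrors the constructor's behavior for datasets that have not yet been registered.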

latest_version_id writable property

latest_version_id: int | None

The id of the latest version of this dataset.

Returns:

| Type | Description |
| --- | --- |
| `int \| None` | The id of the latest version of this dataset. |

name writable property

name: str

The dataset's name.

Returns:

| Type | Description |
| --- | --- |
| `str` | The dataset's name. |
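
When no name is passed, the constructor falls back to a generated one. The sketch below shows one way such a fallback could work, assuming a timestamp is the intended suffix. `default_dataset_name` is a hypothetical helper, not a vectice function.

```python
from datetime import datetime
from typing import Optional


def default_dataset_name(name: Optional[str] = None) -> str:
    """Return the given name, or a timestamped fallback when none is given."""
    # datetime.now() is called here; referencing datetime.time without
    # calling it would interpolate an object repr, not a timestamp.
    return name or f"dataset {datetime.now()}"


print(default_dataset_name("my clean dataset"))  # my clean dataset
print(default_dataset_name())  # "dataset " followed by the current timestamp
```

Because the fallback uses `or`, an empty string is treated the same way as `None`.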

resource property

resource: Resource | tuple[
    Resource, Resource, Resource | None
]

The dataset's resource.

Returns:

| Type | Description |
| --- | --- |
| `Resource \| tuple[Resource, Resource, Resource \| None]` | The dataset's resource. |
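
The return type has two shapes: a single resource, or a three-element tuple whose last entry may be `None`. Callers that want to treat both uniformly could normalize the value first. This is a sketch under that reading of the type annotation; `as_resource_list` is a hypothetical helper, and plain strings stand in for `Resource` objects.

```python
from typing import List, Optional, Tuple, TypeVar, Union

R = TypeVar("R")


def as_resource_list(resource: Union[R, Tuple[R, R, Optional[R]]]) -> List[R]:
    """Normalize either return shape into a list of the resources present."""
    if isinstance(resource, tuple):
        # Tuple shape: drop the optional entry when it is None.
        return [r for r in resource if r is not None]
    # Single-resource shape.
    return [resource]


print(as_resource_list("origin.csv"))  # ['origin.csv']
print(as_resource_list(("train.csv", "test.csv", None)))  # ['train.csv', 'test.csv']
```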

type property

type: DatasetType

The dataset's type.

Returns:

| Type | Description |
| --- | --- |
| `DatasetType` | The dataset's type. |

clean staticmethod

clean(
    resource, name=None, derived_from=None, dataframe=None
)

Create a clean dataset.

Examples:

from vectice import Dataset, FileResource

dataset = Dataset.clean(
    name="my clean dataset",
    resource=FileResource(path="clean_dataset.csv"),
)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `resource` | `Resource` | The resource for the clean dataset. | required |
| `name` | `str \| None` | The name of the dataset. | `None` |
| `derived_from` | `list[int \| Dataset] \| None` | A list of datasets (or ids) from which this dataset is derived. | `None` |
| `dataframe` | `DataFrame \| None` | A pandas dataframe allowing Vectice to compute more metadata about this dataset. | `None` |

Source code in src/vectice/models/dataset.py
@staticmethod
def clean(
    resource: Resource,
    name: str | None = None,
    derived_from: list[int | Dataset] | None = None,
    dataframe: DataFrame | None = None,
    # properties: dict[str, str | int] | list[Property] | Property | None = None,
    # attachments: str | list[str] | None = None,
) -> Dataset:
    """Create a clean dataset.

    Examples:
        ```python
        from vectice import Dataset, FileResource

        dataset = Dataset.clean(
            name="my clean dataset",
            resource=FileResource(path="clean_dataset.csv"),
        )
        ```

    Parameters:
        resource: The resource for the clean dataset.
        name: The name of the dataset.
        derived_from: A list of datasets (or ids) from which this dataset is derived.
        dataframe: A pandas dataframe allowing Vectice to compute more metadata about this dataset.
    """
    return Dataset(
        type=DatasetType.CLEAN,
        name=name,
        resource=resource,
        derived_from=derived_from,
        dataframe=dataframe,
        # properties=properties,
        # attachments=attachments,
    )

modeling staticmethod

modeling(
    training_resource,
    testing_resource,
    validation_resource=None,
    name=None,
    training_dataframe=None,
    testing_dataframe=None,
    validation_dataframe=None,
)

Create a modeling dataset.

Examples:

from vectice import Dataset, FileResource

dataset = Dataset.modeling(
    name="my modeling dataset",
    training_resource=FileResource(path="training_dataset.csv"),
    testing_resource=FileResource(path="testing_dataset.csv"),
    validation_resource=FileResource(path="validation_dataset.csv"),
)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `training_resource` | `Resource` | The resource for the training set (for modeling datasets). | required |
| `testing_resource` | `Resource` | The resource for the testing set (for modeling datasets). | required |
| `validation_resource` | `Resource \| None` | The resource for the validation set (optional, for modeling datasets). | `None` |
| `name` | `str \| None` | The name of the dataset. | `None` |
| `training_dataframe` | `DataFrame \| None` | A pandas dataframe allowing Vectice to compute more metadata about the training set. | `None` |
| `testing_dataframe` | `DataFrame \| None` | A pandas dataframe allowing Vectice to compute more metadata about the testing set. | `None` |
| `validation_dataframe` | `DataFrame \| None` | A pandas dataframe allowing Vectice to compute more metadata about the validation set. | `None` |

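
A modeling dataset requires both a training and a testing resource, and each resource is labeled with its role. The validation resource is optional. The sketch below illustrates that check with stand-in types: `StubUsage`, `StubResource`, and `tag_modeling_resources` are hypothetical names, not vectice classes or functions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StubUsage(Enum):
    """Hypothetical stand-in for DatasetSourceUsage."""

    TRAINING = "TRAINING"
    TESTING = "TESTING"
    VALIDATION = "VALIDATION"


@dataclass
class StubResource:
    """Hypothetical stand-in for a Resource; only tracks a path and a usage."""

    path: str
    usage: Optional[StubUsage] = None


def tag_modeling_resources(training, testing, validation=None):
    """Check the required splits, then label each resource with its role."""
    if training is None or testing is None:
        raise ValueError("You cannot create a modeling dataset without both training and testing sets")
    training.usage = StubUsage.TRAINING
    testing.usage = StubUsage.TESTING
    if validation is not None:
        # The validation split is optional and only tagged when present.
        validation.usage = StubUsage.VALIDATION
    return training, testing, validation


train, test, val = tag_modeling_resources(StubResource("train.csv"), StubResource("test.csv"))
print(train.usage, test.usage, val)  # StubUsage.TRAINING StubUsage.TESTING None
```
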
Source code in src/vectice/models/dataset.py
@staticmethod
def modeling(
    training_resource: Resource,
    testing_resource: Resource,
    validation_resource: Resource | None = None,
    name: str | None = None,
    training_dataframe: DataFrame | None = None,
    testing_dataframe: DataFrame | None = None,
    validation_dataframe: DataFrame | None = None,
    # properties: dict[str, str | int] | list[Property] | Property | None = None,
    # attachments: str | list[str] | None = None,
) -> Dataset:
    """Create a modeling dataset.

    Examples:
        ```python
        from vectice import Dataset, FileResource

        dataset = Dataset.modeling(
            name="my modeling dataset",
            training_resource=FileResource(path="training_dataset.csv"),
            testing_resource=FileResource(path="testing_dataset.csv"),
            validation_resource=FileResource(path="validation_dataset.csv"),
        )
        ```

    Parameters:
        training_resource: The resource for the training set (for modeling datasets).
        testing_resource: The resource for the testing set (for modeling datasets).
        validation_resource: The resource for the validation set (optional, for modeling datasets).
        name: The name of the dataset.
        training_dataframe: A pandas dataframe allowing Vectice to compute more metadata about the training set.
        testing_dataframe: A pandas dataframe allowing Vectice to compute more metadata about the testing set.
        validation_dataframe: A pandas dataframe allowing Vectice to compute more metadata about the validation set.
    """
    return Dataset(
        type=DatasetType.MODELING,
        name=name,
        training_resource=training_resource,
        testing_resource=testing_resource,
        validation_resource=validation_resource,
        training_dataframe=training_dataframe,
        testing_dataframe=testing_dataframe,
        validation_dataframe=validation_dataframe,
        # properties=properties,
        # attachments=attachments,
    )

origin staticmethod

origin(resource, name=None, dataframe=None)

Create an origin dataset.

Examples:

from vectice import Dataset, FileResource

dataset = Dataset.origin(
    name="my origin dataset",
    resource=FileResource(path="origin_dataset.csv"),
)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `resource` | `Resource` | The resource for the origin dataset. | required |
| `name` | `str \| None` | The name of the dataset. | `None` |
| `dataframe` | `DataFrame \| None` | A pandas dataframe allowing Vectice to compute more metadata about this dataset. | `None` |

Source code in src/vectice/models/dataset.py
@staticmethod
def origin(
    resource: Resource,
    name: str | None = None,
    dataframe: DataFrame | None = None,
    # properties: dict[str, str | int] | list[Property] | Property | None = None,
    # attachments: str | list[str] | None = None,
) -> Dataset:
    """Create an origin dataset.

    Examples:
        ```python
        from vectice import Dataset, FileResource

        dataset = Dataset.origin(
            name="my origin dataset",
            resource=FileResource(path="origin_dataset.csv"),
        )
        ```

    Parameters:
        resource: The resource for the origin dataset.
        name: The name of the dataset.
        dataframe: A pandas dataframe allowing Vectice to compute more metadata about this dataset.
    """
    return Dataset(
        type=DatasetType.ORIGIN,
        name=name,
        resource=resource,
        dataframe=dataframe,
        # properties=properties,
        # attachments=attachments,
    )