GCS resource

GcsDataWrapper

GcsDataWrapper(
    gcs_client,
    bucket_name,
    resource_paths,
    name,
    usage=None,
    derived_from=None,
    inputs=None,
    type=None,
)

Bases: DataWrapper

Deprecated. Wrap columnar data and its metadata in GCS.

Wrappers are deprecated. Use Dataset and GCSResource instead.

Vectice stores only metadata (data about your dataset), which you communicate through a DataWrapper. Vectice does not store your actual dataset.

This DataWrapper wraps data that you have stored in Google Cloud Storage. You assign it to a step.

from vectice import DatasetType, GcsDataWrapper, connect
from google.cloud.storage import Client

my_service_account_file = "MY_SERVICE_ACCOUNT_JSON_PATH"  # (1)
gcs_client = Client.from_service_account_json(json_credentials_path=my_service_account_file)  # (2)

my_project = connect(...)  # (3)
my_phase = my_project.phase(...)  # (4)
my_iter = my_phase.iteration()  # (5)

my_iter.step_my_data = GcsDataWrapper(
    gcs_client,
    bucket_name="my_bucket",
    resource_paths="my_folder/my_filename",
    name="My origin dataset name",
    type=DatasetType.ORIGIN,
)
  1. See Service account credentials.
  2. See GCS docs.
  3. See connection.
  4. See phases.
  5. See iterations.

Note that these three concepts are distinct, even if easily conflated:

  • Where the data is stored
  • The format at rest (in storage)
  • The format when loaded in a running Python program

Notably, the statistics collectors provided by Vectice operate only on the last of these (the in-memory format), and only when the data is loaded as a pandas DataFrame.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| gcs_client | Client | The google.cloud.storage.Client used to interact with Google Cloud Storage. | required |
| bucket_name | str | The name of the bucket to get data from. | required |
| resource_paths | str \| list[str] | The paths of the resources to get. | required |
| name | str | The name of the DataWrapper (local to Vectice). | required |
| usage | DatasetSourceUsage \| None | The usage of the dataset. | None |
| derived_from | list[int] \| None | The list of dataset ids to create a new dataset from. | None |
| inputs | list[int] \| None | Deprecated. Use derived_from instead. | None |
| type | DatasetType \| None | The type of the dataset. | None |

Source code in src/vectice/models/datasource/datawrapper/gcs_data_wrapper.py
@deprecate(
    parameter="inputs",
    warn_at="23.1",
    fail_at="23.2",
    remove_at="23.3",
    reason="The 'inputs' parameter is renamed 'derived_from'. "
    "Using 'inputs' will raise an error in v{fail_at}. "
    "The parameter will be removed in v{remove_at}.",
)
def __init__(
    self,
    gcs_client: Client,
    bucket_name: str,
    resource_paths: str | list[str],
    name: str,
    usage: DatasetSourceUsage | None = None,
    derived_from: list[int] | None = None,
    inputs: list[int] | None = None,
    type: DatasetType | None = None,
):
    """Initialize a GCS data wrapper.

    Parameters:
        gcs_client: The `google.cloud.storage.Client` used
            to interact with Google Cloud Storage.
        bucket_name: The name of the bucket to get data from.
        resource_paths: The paths of the resources to get.
        name: The name of the DataWrapper (local to Vectice).
        usage: The usage of the dataset.
        derived_from: The list of dataset ids to create a new dataset from.
        inputs: Deprecated. Use `derived_from` instead.
        type: The type of the dataset.
    """
    if not derived_from and inputs:
        derived_from = inputs

    self.bucket_name = bucket_name
    self.resource_paths = resource_paths if isinstance(resource_paths, list) else [resource_paths]
    self.gcs_client = gcs_client
    super().__init__(name=name, type=type, usage=usage, derived_from=derived_from)
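
The constructor above does two small pieces of argument handling before delegating to DataWrapper: it falls back to the deprecated inputs value only when derived_from is empty, and it wraps a single resource path in a list. The following is a minimal standalone sketch of that logic, not part of the Vectice API; the helper names `resolve_derived_from` and `normalize_resource_paths` are hypothetical and exist only for illustration.

```python
from __future__ import annotations


def normalize_resource_paths(resource_paths: str | list[str]) -> list[str]:
    """Hypothetical helper: return the paths as a list, wrapping a lone string."""
    return resource_paths if isinstance(resource_paths, list) else [resource_paths]


def resolve_derived_from(
    derived_from: list[int] | None = None,
    inputs: list[int] | None = None,
) -> list[int] | None:
    """Hypothetical helper: prefer derived_from, else the deprecated inputs.

    Mirrors `if not derived_from and inputs: derived_from = inputs`,
    so an empty or missing derived_from is replaced by a non-empty inputs.
    """
    if not derived_from and inputs:
        return inputs
    return derived_from


# A single path and a list of paths both end up as a list.
print(normalize_resource_paths("my_folder/my_filename"))
print(normalize_resource_paths(["a.csv", "b.csv"]))

# derived_from wins whenever it is non-empty.
print(resolve_derived_from(derived_from=[3], inputs=[1, 2]))
print(resolve_derived_from(inputs=[1, 2]))
```

One consequence worth noting: because the check is `not derived_from`, passing an empty list for derived_from together with a non-empty inputs still selects inputs.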