GCS resource

GCSResource

GCSResource(gcs_client, bucket_name, resource_paths)

Bases: Resource

Wrap columnar data and its metadata in GCS.

Vectice stores metadata (data about your dataset), which is communicated through a resource. Vectice does not store the dataset itself.

This resource wraps data that you have stored in Google Cloud Storage. You assign it to a step.

from vectice import GCSResource
from google.cloud.storage import Client

my_service_account_file = "MY_SERVICE_ACCOUNT_JSON_PATH"  # (1)
gcs_client = Client.from_service_account_json(json_credentials_path=my_service_account_file)  # (2)
gcs_resource = GCSResource(
    gcs_client,
    bucket_name="my_bucket",
    resource_paths="my_folder/my_filename",
)
  1. See Service account credentials.
  2. See GCS docs.

Note that these three concepts are distinct, even if easily conflated:

  • Where the data is stored
  • The format at rest (in storage)
  • The format when loaded in a running Python program

Notably, the statistics collectors provided by Vectice operate only on the last of these, and only when the data is loaded as a pandas DataFrame.
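To make the distinction between the three concepts concrete, here is a minimal sketch. It assumes the object at rest is a CSV file; `load_csv_bytes` is a hypothetical helper, not part of Vectice, and the inline bytes stand in for content downloaded from the bucket.

```python
import io

import pandas as pd


def load_csv_bytes(raw: bytes) -> pd.DataFrame:
    """Parse CSV bytes (the format at rest) into a pandas DataFrame (the format in memory)."""
    return pd.read_csv(io.BytesIO(raw))


# Where the data is stored: a GCS bucket. With a real client, the bytes could
# come from something like (assumed names, not executed here):
#   gcs_client.bucket("my_bucket").blob("my_folder/my_filename").download_as_bytes()
# The literal below is a stand-in so the sketch runs without credentials.
raw = b"id,value\n1,10\n2,20\n"

# Format when loaded in a running Python program: a pandas DataFrame.
df = load_csv_bytes(raw)
```

The same bucket could just as well hold Parquet or JSON; only the parsing step would change, while the in-memory form stays a DataFrame.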

Parameters:

Name            Type             Description                                                                  Default
gcs_client      Client           The google.cloud.storage.Client used to interact with Google Cloud Storage.  required
bucket_name     str              The name of the bucket to get data from.                                     required
resource_paths  str | list[str]  The paths of the resources to get.                                           required
Source code in src/vectice/models/resource/gcs_resource.py
def __init__(
    self,
    gcs_client: Client,
    bucket_name: str,
    resource_paths: str | list[str],
):
    """Initialize a GCS resource.

    Parameters:
        gcs_client: The `google.cloud.storage.Client` used
            to interact with Google Cloud Storage.
        bucket_name: The name of the bucket to get data from.
        resource_paths: The paths of the resources to get.
    """
    super().__init__()
    self.bucket_name = bucket_name
    self.resource_paths = resource_paths if isinstance(resource_paths, list) else [resource_paths]
    self.gcs_client = gcs_client
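The constructor above accepts either a single path or a list of paths and always stores a list. The standalone sketch below isolates that normalization so it can be tried without Vectice or GCS installed; `normalize_paths` is a hypothetical name used only for illustration, and the file names are made up.

```python
from typing import List, Union


def normalize_paths(resource_paths: Union[str, List[str]]) -> List[str]:
    """Return resource_paths as a list, wrapping a single string in a one-element list."""
    return resource_paths if isinstance(resource_paths, list) else [resource_paths]


# A single path is wrapped in a list.
single = normalize_paths("my_folder/my_filename")

# A list of paths is passed through unchanged.
several = normalize_paths(["my_folder/a.csv", "my_folder/b.csv"])
```

Because of this, code that reads `resource_paths` from a resource can iterate over it without first checking whether one path or several were supplied.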