Estimator#

class pai.estimator.Estimator(image_uri: str, command: Union[str, List[str]], source_dir: Optional[str] = None, git_config: Optional[Dict[str, str]] = None, job_type: str = 'PyTorchJob', hyperparameters: Optional[Dict[str, Any]] = None, environments: Optional[Dict[str, str]] = None, requirements: Optional[List[str]] = None, base_job_name: Optional[str] = None, max_run_time: Optional[int] = None, checkpoints_path: Optional[str] = None, output_path: Optional[str] = None, metric_definitions: Optional[List[Dict[str, str]]] = None, instance_type: Optional[str] = None, instance_count: Optional[int] = None, user_vpc_config: Optional[UserVpcConfig] = None, experiment_config: Optional[ExperimentConfig] = None, resource_id: Optional[str] = None, session: Optional[Session] = None, **kwargs)#

Bases: EstimatorBase

The Estimator object is responsible for submitting TrainingJobs.

The Estimator helps to run a training script in the PAI Training Service with a specific image.

Example:

est = Estimator(
    source_dir="./train/src/",
    command="python train.py",
    image_uri=training_image_uri,
    instance_type="ecs.c6.xlarge",
    hyperparameters={
        "n_estimators": 50,
        "objective": "binary:logistic",
        "max_depth": 5,
        "eval_metric": "auc",
    },
    output_path="oss://{YOUR_BUCKET_NAME}/pai/training_job/output_path",
)

est.fit(inputs={
    "train": "oss://{YOUR_BUCKET_NAME}/path/to/train-data",
    "test": "oss://{YOUR_BUCKET_NAME}/path/to/test-data",
})

print(est.model_data())

Estimator constructor.

Parameters:
  • image_uri (str) -- The image used in the training job. It can be an image provided by PAI or a user-customized image. To view the images provided by PAI, please refer to the document: https://help.aliyun.com/document_detail/202834.htm.

  • command (Union[str, List[str]]) -- The command used to run the training job.

  • source_dir (str, optional) --

    The local source code directory used in the training job. The directory will be packaged and uploaded to an OSS bucket, then downloaded to the /ml/usercode directory in the training job container. If there is a requirements.txt file in the source code directory, the corresponding dependencies will be installed before the training script runs.

    If 'git_config' is provided, 'source_dir' should be a relative path to a directory in the Git repo. Given the following GitHub repo directory structure:

    |----- README.md
    |----- src
             |----- train.py
             |----- test.py
    

    if you want to use the 'src' directory as the source code directory, you can set source_dir='./src/'.

  • git_config (Dict[str, str]) --

    Git configuration used to clone the repo, including repo, branch, commit, username, password and token. The repo field is required; all other fields are optional. repo specifies the Git repository. If branch is not provided, the default value 'master' is used. If commit is not provided, the latest commit on the specified branch is used. username, password and token are used for authentication. For example, the following config:

    git_config = {
        'repo': 'https://github.com/modelscope/modelscope.git',
        'branch': 'master',
        'commit': '9bfc4a9d83c4beaf8378d0a186261ffc1cd9f960'
    }
    

    results in cloning the git repo specified in 'repo', then checking out the 'master' branch, and checking out the specified commit.

  • job_type (str) -- The type of job, which can be TFJob, PyTorchJob, XGBoostJob, etc.

  • hyperparameters (dict, optional) -- A dictionary that represents the hyperparameters used in the training job. The hyperparameters will be stored as a JSON dictionary in /ml/input/config/hyperparameters.json in the training container.
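    For reference, a training script can read these values back with the standard library alone. A minimal sketch (the path is the mount location described above; note that hyperparameter values may arrive serialized as strings, so cast them as needed):

    ```python
    import json

    # Default location of the hyperparameters file inside the
    # training container, as described above.
    HYPERPARAMETERS_PATH = "/ml/input/config/hyperparameters.json"

    def load_hyperparameters(path: str = HYPERPARAMETERS_PATH) -> dict:
        """Load the training job hyperparameters from the mounted JSON file."""
        with open(path) as f:
            return json.load(f)

    # Inside the training container:
    #   hps = load_hyperparameters()
    #   n_estimators = int(hps["n_estimators"])
    ```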

  • environments (Dict[str, str], optional) -- A dictionary that maps environment variable names to their values. This optional field allows you to provide a set of environment variables that will be applied to the context where the code is executed.

  • requirements (list, optional) -- An optional list of strings that specifies the Python package dependencies with their versions. Each string in the list should be in the format 'package' or 'package==version'. This is similar to the contents of a requirements.txt file used in Python projects. If a requirements.txt file is present in the user code directory, the dependencies specified in requirements directly override any conflicting entries.

  • instance_type (str) -- The machine instance type used to run the training job. To view the supported machine instance types, please refer to the document: https://help.aliyun.com/document_detail/171758.htm#section-55y-4tq-84y. If the instance_type is "local", the training job is executed locally using Docker.

  • max_run_time (int, optional) -- The maximum time in seconds that the training job can run. The training job will be terminated after the time is reached (Default None).

  • instance_count (int) -- The number of machines used to run the training job.

  • base_job_name (str, optional) -- The base name used to generate the training job name.

  • checkpoints_path (str, optional) -- An OSS URI that stores the checkpoint of the training job. If provided, the OSS URI will be mounted to the directory /ml/output/checkpoints/.

  • user_vpc_config (pai.estimator.UserVpcConfig, optional) -- The VPC configuration used to enable the training job instance to connect to the specified user VPC. If provided, an Elastic Network Interface (ENI) will be created and attached to the training job instance, allowing the instance to access the resources within the specified VPC. Defaults to None.

  • experiment_config (pai.estimator.ExperimentConfig, optional) -- The experiment configuration used to construct the relationship between the training job and the experiment. If provided, the training job will belong to the specified experiment, in which case the training job will use the experiment's artifact_uri as the default output path. Defaults to None.

  • output_path (str, optional) --

    An OSS URI to store the outputs of the training jobs. If not provided, an OSS URI will be generated using the default OSS bucket in the session. When the estimator.fit method is called, a specific OSS URI under the output_path for each channel is generated and mounted to the training container.

    A completed training container directory structure example:

    /ml
    |-- usercode                        // User source code directory.
    |   |-- requirements.txt
    |   `-- train.py
    |-- input                           // TrainingJob input.
    |   |-- config
    |   |   `-- hyperparameters.json    // Hyperparameters for the
    |   |                               // TrainingJob, in JSON
    |   |                               // dictionary format.
    |   `-- data                        // TrainingJob input channels.
    |       |                           // Each directory under
    |       |                           // `/ml/input/data/` is an input
    |       |                           // channel, and the directory
    |       |                           // name is the channel name.
    |       |-- test-data
    |       |   `-- test.csv
    |       `-- train-data
    |           `-- train.csv
    `-- output                          // TrainingJob output channels.
        |                               // Each directory under
        |                               // `/ml/output/` is an output
        |                               // channel, and the directory
        |                               // name is the channel name.
        |-- model
        `-- checkpoints


  • metric_definitions (List[Dict[str, str]], optional) --

    A list of dictionaries that defines the metrics used to evaluate the training jobs. Each dictionary contains two keys: "Name" for the name of the metric, and "Regex" for the regular expression used to extract the metric from the logs of the training job. The regular expression should contain only one capture group that is responsible for extracting the metric value.

    Example:

    metric_definitions=[
        {
            "Name": "accuracy",
            "Regex": r".*accuracy="
                     r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*",
        },
        {
            "Name": "train-accuracy",
            "Regex": r".*validation_0-auc="
                     r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*",
        },
    ]
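
    A metric regex can be exercised locally with Python's re module before submitting a job. A small sketch that applies the accuracy pattern above to a hypothetical log line:

    ```python
    import re

    # Same shape as the "accuracy" metric definition above: a single
    # capture group that extracts the numeric metric value.
    pattern = r".*accuracy=([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*"

    line = "epoch 3: accuracy=0.9231 loss=0.1835"
    match = re.match(pattern, line)
    print(match.group(1))  # -> 0.9231
    ```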
    

  • session (Session, optional) -- A PAI session instance used for communicating with PAI service.

training_image_uri() → str#

Return the Docker image to use for training.

The fit() method, which performs the model training, calls this method to find the image to use for training.

Returns:

The URI of the Docker image.

Return type:

str

fit(inputs: Optional[Dict[str, Any]] = None, wait: bool = True, show_logs: bool = True)#

Submit a training job with the given input data.

Parameters:
  • inputs (Dict[str, Any]) -- A dictionary representing the input data for the training job. Each key/value pair in the dictionary is an input channel, the key is the channel name, and the value is the input data. The input data can be an OSS URI or a NAS URI object and will be mounted to the /ml/input/data/{channel_name} directory in the training container.

  • wait (bool) -- Specifies whether to block until the training job is completed, either succeeded, failed, or stopped. (Default True).

  • show_logs (bool) -- Specifies whether to show the logs produced by the training job (Default True).

Raises:

UnExpectedStatusException -- If the training job fails.
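
Since each input channel is mounted at /ml/input/data/{channel_name} inside the training container, a training script can resolve its data directories with a small helper. A sketch (the channel names are simply the keys passed to fit):

```python
import os

# Root directory where PAI mounts input channels inside the
# training container.
INPUT_DATA_ROOT = "/ml/input/data"

def channel_dir(channel_name: str, root: str = INPUT_DATA_ROOT) -> str:
    """Return the directory where an input channel is mounted."""
    return os.path.join(root, channel_name)

# For est.fit(inputs={"train": ..., "test": ...}), the script would read:
#   channel_dir("train")  -> "/ml/input/data/train"
#   channel_dir("test")   -> "/ml/input/data/test"
```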

class pai.estimator.AlgorithmEstimator(algorithm_name: Optional[str] = None, algorithm_version: Optional[str] = None, algorithm_provider: Optional[str] = None, algorithm_spec: Optional[Dict[str, Any]] = None, hyperparameters: Optional[Dict[str, Any]] = None, environments: Optional[Dict[str, str]] = None, requirements: Optional[List[str]] = None, base_job_name: Optional[str] = None, max_run_time: Optional[int] = None, output_path: Optional[str] = None, instance_type: Optional[str] = None, instance_count: Optional[int] = None, user_vpc_config: Optional[UserVpcConfig] = None, session: Optional[Session] = None, instance_spec: Optional[Dict[str, Union[int, str]]] = None, **kwargs)#

Bases: EstimatorBase

Handles training jobs with algorithms.

The AlgorithmEstimator provides a simple way for submitting training jobs with algorithms.

Example:

# Create an AlgorithmEstimator with built-in algorithms
est = AlgorithmEstimator(
    algorithm_name="pai-algorithm-test",
    algorithm_version="0.1.0",
    algorithm_provider="pai",
)

# Inspect the definition of hyperparameters, input channels and output channels
print(est.hyperparameter_definitions)
print(est.input_channel_definitions)
print(est.output_channel_definitions)
print(est.supported_instance_types)

# Submit a training job
est.fit(
    inputs={
        "train": "oss://bucket/path/to/train/data",
        "test": "oss://bucket/path/to/test/data",
    },
)

# Inspect all output data
print(est.get_outputs_data())

Initialize an AlgorithmEstimator.

Parameters:
  • algorithm_name (str, optional) -- The name of the registered algorithm. If not provided, the algorithm_spec must be provided.

  • algorithm_version (str, optional) -- The version of the algorithm. If not provided, the latest version of the algorithm will be used. If algorithm_name is not provided, this argument will be ignored.

  • algorithm_provider (str, optional) -- The provider of the algorithm. Currently, only "pai" or None are supported. Set it to "pai" to retrieve a PAI official algorithm. If not provided, the default provider is the user's PAI account. If algorithm_name is not provided, this argument will be ignored.

  • algorithm_spec (Dict[str, Any], optional) -- A temporary algorithm spec. Required if algorithm_name is not provided.

  • hyperparameters (dict, optional) -- A dictionary that represents the hyperparameters used in the training job. Default hyperparameters will be retrieved from the algorithm definition.

  • environments -- A dictionary that maps environment variable names to their values. This optional field allows you to provide a set of environment variables that will be applied to the context where the code is executed.

  • requirements (list, optional) -- An optional list of strings that specifies the Python package dependencies with their versions. Each string in the list should be in the format 'package' or 'package==version'. This is similar to the contents of a requirements.txt file used in Python projects. If a requirements.txt file is present in the user code directory, the dependencies specified in requirements directly override any conflicting entries.

  • base_job_name (str, optional) -- The base name used to generate the training job name. If not provided, a default job name will be generated.

  • max_run_time (int, optional) -- The maximum time in seconds that the training job can run. The training job will be terminated after the time is reached (Default None).

  • output_path (str, optional) -- An OSS URI to store the outputs of the training jobs. If not provided, an OSS URI will be generated using the default OSS bucket in the session. When the estimator.fit method is called, a specific OSS URI under the output_path for each channel is generated and mounted to the training container.

  • instance_type (str, optional) -- The machine instance type used to run the training job. If not provided, the default instance type will be retrieved from the algorithm definition. To view the supported machine instance types, please refer to the document: https://help.aliyun.com/document_detail/171758.htm#section-55y-4tq-84y.

  • instance_count (int, optional) -- The number of machines used to run the training job. If not provided, the default instance count will be retrieved from the algorithm definition.

  • user_vpc_config (pai.estimator.UserVpcConfig, optional) -- The VPC configuration used to enable the training job instance to connect to the specified user VPC. If provided, an Elastic Network Interface (ENI) will be created and attached to the training job instance, allowing the instance to access the resources within the specified VPC. Defaults to None.

  • session (pai.session.Session, optional) -- A PAI session object used for interacting with PAI Service.

set_hyperparameters(**kwargs)#

Set hyperparameters for the algorithm training.

property hyperparameter_definitions: List[Dict[str, Any]]#

Get the hyperparameter definitions from the algorithm spec.

property input_channel_definitions: List[Dict[str, Any]]#

Get the input channel definitions from the algorithm spec.

property output_channel_definitions: List[Dict[str, Any]]#

Get the output channel definitions from the algorithm spec.

property supported_instance_types: List[str]#

Get the supported instance types from the algorithm spec.

fit(inputs: Optional[Dict[str, Any]] = None, wait: bool = True, show_logs: bool = True)#

Submit a training job with the given input data.

Parameters:
  • inputs (Dict[str, Any]) -- A dictionary representing the input data for the training job. Each key/value pair in the dictionary is an input channel, the key is the channel name, and the value is the input data. The input data can be an OSS URI or a NAS URI object and will be mounted to the /ml/input/data/{channel_name} directory in the training container.

  • wait (bool) -- Specifies whether to block until the training job is completed, either succeeded, failed, or stopped. (Default True).

  • show_logs (bool) -- Specifies whether to show the logs produced by the training job (Default True).

Raises:

UnExpectedStatusException -- If the training job fails.

get_outputs_data() → Dict[str, str]#

Show all output data paths.

Returns:

A dictionary of all output data paths.

Return type:

dict[str, str]

class pai.common.configs.UserVpcConfig(vpc_id: str, security_group_id: str, switch_id: Optional[str] = None, extended_cidrs: Optional[List[str]] = None)#

Bases: object

UserVpcConfig is used to give a training job access to resources in your VPC.

Initialize UserVpcConfig.

Parameters:
  • vpc_id (str) -- Specifies the ID of the VPC that the training job instance connects to.

  • security_group_id (str) -- The ID of the security group that the training job instances belong to.

  • switch_id (str, optional) -- The ID of the vSwitch to which the instance belongs. Defaults to None.

  • extended_cidrs (List[str], optional) -- The CIDR blocks configured for the ENI of the training job instance. If not specified, the CIDR block will be configured to be the same as the VPC's network segment, which means that the training job instance can access all resources in the VPC. Defaults to None.

class pai.estimator.FileSystemInput(file_system_id: str, directory_path: Optional[str] = None)#

Bases: FileSystemInputBase

FileSystemInput is used to mount a Standard/Extreme NAS file system for a TrainingJob.

Examples:

est = Estimator(
    image_uri="<TrainingImageUri>",
    command="sh train.sh",
    instance_type="ecs.c6.xlarge",
)

est.fit({
    "input": FileSystemInput(
        file_system_id="<FileSystemId>",
        directory_path="/path/to/data/"),
})

to_input_uri()#

Convert the FileSystemInput to an input URI for the TrainingJob.

class pai.estimator.CpfsFileSystemInput(file_system_id: str, protocol_service_id: str, export_id: str)#

Bases: FileSystemInputBase

CpfsFileSystemInput is used to mount a CPFS file system for a TrainingJob.

For more details about CPFS, please refer to the documentation: https://help.aliyun.com/product/111536.html

Examples:

est = Estimator(
    image_uri="<TrainingImageUri>",
    command="sh train.sh",
    instance_type="ecs.c6.xlarge",
)

est.fit(
    inputs={"train": CpfsFileSystemInput(
        file_system_id="<FileSystemId>",
        protocol_service_id="<ProtocolServiceId>",
        export_id="<ExportId>",
    )},
)

Initialize CpfsFileSystemInput.

Parameters:
  • file_system_id (str) -- CPFS file system id.

  • protocol_service_id (str) -- CPFS protocol service id.

  • export_id (str) -- CPFS export id.

to_input_uri()#

Convert the CpfsFileSystemInput instance to an input URI for the TrainingJob.