Docker Autoscaler executor

Before you use the Docker Autoscaler executor, see the feedback issue about GitLab Runner autoscaling for a list of known issues.

The Docker Autoscaler executor is an autoscale-enabled Docker executor that creates instances on demand to accommodate the jobs that the runner manager processes. It wraps the Docker executor, so all Docker executor options and features are supported.

The Docker Autoscaler uses fleeting plugins to autoscale. Fleeting is an abstraction for a group of autoscaled instances, and it uses plugins that support cloud providers such as Google Cloud, AWS, and Azure.

Install a fleeting plugin

To install the plugin for your target platform, see Install the fleeting plugin. For specific configuration details, see the documentation for the respective plugin project.

Configure Docker Autoscaler

The Docker Autoscaler executor wraps the Docker executor, so all Docker executor options and features are supported.

To configure the Docker Autoscaler, in the config.toml:

In the [runners] section, specify the executor as docker-autoscaler.
In the following sections, configure the Docker Autoscaler based on your requirements:
- [runners.docker]
- [runners.autoscaler]
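
As a minimal sketch of the steps above, the skeleton below shows where each setting lives. The runner name, image, and plugin value are placeholders borrowed from the examples later on this page, not recommended defaults, and the url and token settings are omitted for brevity.

```toml
[[runners]]
  name = "docker autoscaler example"   # placeholder name
  executor = "docker-autoscaler"       # step 1: select the executor

  # Step 2: Docker executor options
  [runners.docker]
    image = "busybox:latest"           # placeholder default image

  # Step 2: autoscaler options
  [runners.autoscaler]
    plugin = "aws"                     # placeholder; use the plugin for your platform
```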

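Dedicated autoscaling groups for each runner configuration

Each Docker Autoscaler configuration must have its own dedicated autoscaling resource:

For AWS, a dedicated Auto Scaling group
For GCP, a dedicated instance group
For Azure, a dedicated scale set

Do not share these autoscaling resources across:

Multiple runner managers (separate GitLab Runner installations)
Multiple [[runners]] entries in the same runner manager's config.toml

The Docker Autoscaler tracks instance state, and that state must stay synchronized with the cloud provider's autoscaling resource. When multiple systems try to manage the same autoscaling resource, they issue conflicting scaling commands, which can cause unpredictable behavior, job failures, and potentially higher costs.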
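
To illustrate, the following sketch assumes one runner manager on AWS with two [[runners]] entries. The runner names and the Auto Scaling group names (asg-small-jobs and asg-large-jobs) are hypothetical, and other required settings such as url, token, and [runners.docker] are omitted. The point is only that each entry references its own group.

```toml
[[runners]]
  name = "small jobs"                  # hypothetical runner name
  executor = "docker-autoscaler"

  [runners.autoscaler]
    plugin = "aws"

    [runners.autoscaler.plugin_config]
      name = "asg-small-jobs"          # hypothetical group, used by this entry only

[[runners]]
  name = "large jobs"                  # hypothetical runner name
  executor = "docker-autoscaler"

  [runners.autoscaler]
    plugin = "aws"

    [runners.autoscaler.plugin_config]
      name = "asg-large-jobs"          # a different group; do not reuse "asg-small-jobs"
```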

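Example: AWS autoscaling for 1 job per instance

Prerequisites:

An AMI with Docker Engine installed. To let the runner manager access the Docker socket on the AMI, the user it connects as must be a member of the docker group.
The AMI does not need GitLab Runner installed. Do not register instances launched from the AMI as runners in GitLab.
An AWS Auto Scaling group. The runner handles scaling, so use "none" for the scaling policy. Enable instance scale-in protection.
An IAM policy with the correct permissions.

This configuration supports:

Capacity per instance: 1
Use count: 1
Idle scale: 5
Idle time: 20 minutes
Maximum number of instances: 10

With both the capacity and the use count set to 1, each job gets a secure, ephemeral instance that other jobs cannot affect. As soon as the job completes, the instance it ran on is deleted.

With an idle scale of 5, the runner tries to keep 5 whole instances available for future demand (because the capacity per instance is 1). These instances are kept for at least 20 minutes.

The runner's concurrent field is set to 10 (maximum number of instances * capacity per instance).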

concurrent = 10

[[runners]]
  name = "docker autoscaler example"
  url = "https://gitlab.com"
  token = "<token>"
  shell = "sh"                                        # use powershell or pwsh for Windows AMIs

  # uncomment for Windows AMIs when the Runner manager is hosted on Linux
  # environment = ["FF_USE_POWERSHELL_PATH_RESOLVER=1"]

  executor = "docker-autoscaler"

  # Docker Executor config
  [runners.docker]
    image = "busybox:latest"

  # Autoscaler config
  [runners.autoscaler]
    plugin = "aws" # in GitLab 16.11 and later, ensure you run `gitlab-runner fleeting install` to automatically install the plugin

    # in GitLab 16.10 and earlier, manually install the plugin and use:
    # plugin = "fleeting-plugin-aws"

    capacity_per_instance = 1
    max_use_count = 1
    max_instances = 10

    [runners.autoscaler.plugin_config] # plugin specific configuration (see plugin documentation)
      name             = "my-docker-asg"               # AWS Autoscaling Group name
      profile          = "default"                     # optional, default is 'default'
      config_file      = "/home/user/.aws/config"      # optional, default is '~/.aws/config'
      credentials_file = "/home/user/.aws/credentials" # optional, default is '~/.aws/credentials'

    [runners.autoscaler.connector_config]
      username          = "ec2-user"
      use_external_addr = true

    [[runners.autoscaler.policy]]
      idle_count = 5
      idle_time = "20m0s"

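Example: Google Cloud instance group for 1 job per instance

Prerequisites:

A VM image with Docker Engine installed, such as COS.
The VM image does not need GitLab Runner installed. Do not register instances launched from the VM image as runners in GitLab.
A single-zone Google Cloud instance group. For Autoscaling mode, select Do not autoscale. The runner handles autoscaling, not the Google Cloud instance group.
Multi-zone instance groups are not currently supported. An issue exists to support multi-zone instance groups in the future.
An IAM policy with the correct permissions. If you deploy the runner in a GKE cluster, you can add an IAM binding between the Kubernetes service account and the GCP service account. Adding this binding with the iam.workloadIdentityUser role lets you authenticate to GCP without a key file in credentials_file.

This configuration supports:

Capacity per instance: 1
Use count: 1
Idle scale: 5
Idle time: 20 minutes
Maximum number of instances: 10

The runner's concurrent field is set to 10 (maximum number of instances * capacity per instance).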

concurrent = 10

[[runners]]
  name = "docker autoscaler example"
  url = "https://gitlab.com"
  token = "<token>"
  shell = "sh"                                        # use powershell or pwsh for Windows Images

  # uncomment for Windows Images when the Runner manager is hosted on Linux
  # environment = ["FF_USE_POWERSHELL_PATH_RESOLVER=1"]

  executor = "docker-autoscaler"

  # Docker Executor config
  [runners.docker]
    image = "busybox:latest"

  # Autoscaler config
  [runners.autoscaler]
    plugin = "googlecloud" # for >= 16.11, ensure you run `gitlab-runner fleeting install` to automatically install the plugin

    # for versions < 17.0, manually install the plugin and use:
    # plugin = "fleeting-plugin-googlecompute"

    capacity_per_instance = 1
    max_use_count = 1
    max_instances = 10

    [runners.autoscaler.plugin_config] # plugin specific configuration (see plugin documentation)
      name             = "my-docker-instance-group" # Google Cloud Instance Group name
      project          = "my-gcp-project"
      zone             = "europe-west1-b"           # zone of the single-zone instance group
      credentials_file = "/home/user/.config/gcloud/application_default_credentials.json" # optional, default is '~/.config/gcloud/application_default_credentials.json'

    [runners.autoscaler.connector_config]
      username          = "runner"
      use_external_addr = true

    [[runners.autoscaler.policy]]
      idle_count = 5
      idle_time = "20m0s"

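Example: Azure scale set for 1 job per instance

Prerequisites:

An Azure VM image with Docker Engine installed.
The VM image does not need GitLab Runner installed. Do not register instances launched from the VM image as runners in GitLab.
An Azure scale set with the autoscaling policy set to manual. The runner handles the scaling.

This configuration supports:

Capacity per instance: 1
Use count: 1
Idle scale: 5
Idle time: 20 minutes
Maximum number of instances: 10

With both the capacity and the use count set to 1, each job gets a secure, ephemeral instance that other jobs cannot affect. When the job completes, the instance it ran on is deleted immediately.

With the idle scale set to 5, the runner keeps 5 instances available for future demand (because the capacity per instance is 1). These instances are kept for at least 20 minutes.

The runner's concurrent field is set to 10 (maximum number of instances * capacity per instance).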

concurrent = 10

[[runners]]
  name = "docker autoscaler example"
  url = "https://gitlab.com"
  token = "<token>"
  shell = "sh"                                        # use powershell or pwsh for Windows images

  # uncomment for Windows images when the Runner manager is hosted on Linux
  # environment = ["FF_USE_POWERSHELL_PATH_RESOLVER=1"]

  executor = "docker-autoscaler"

  # Docker Executor config
  [runners.docker]
    image = "busybox:latest"

  # Autoscaler config
  [runners.autoscaler]
    plugin = "azure" # for >= 16.11, ensure you run `gitlab-runner fleeting install` to automatically install the plugin

    # for versions < 17.0, manually install the plugin and use:
    # plugin = "fleeting-plugin-azure"

    capacity_per_instance = 1
    max_use_count = 1
    max_instances = 10

    [runners.autoscaler.plugin_config] # plugin specific configuration (see plugin documentation)
      name = "my-docker-scale-set"
      subscription_id = "9b3c4602-cde2-4089-bed8-889e5a3e7102"
      resource_group_name = "my-resource-group"

    [runners.autoscaler.connector_config]
      username = "azureuser"
      password = "my-scale-set-static-password"
      use_static_credentials = true
      timeout = "10m"
      use_external_addr = true

    [[runners.autoscaler.policy]]
      idle_count = 5
      idle_time = "20m0s"
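
The concurrent arithmetic used in the examples above carries over if you change the capacity. As a hypothetical variation (not one of the documented configurations), the fragment below assumes 2 jobs per instance with the same maximum of 10 instances, so the formula gives a concurrent value of 20. Plugin, connector, and policy settings are omitted.

```toml
concurrent = 20                        # max_instances (10) * capacity_per_instance (2)

[[runners]]
  executor = "docker-autoscaler"

  [runners.autoscaler]
    capacity_per_instance = 2          # hypothetical: two jobs can share one instance
    max_instances = 10
```

With a capacity above 1, jobs share an instance, so the per-job isolation described for the 1-job-per-instance examples no longer holds.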

Troubleshooting

`ERROR: error during connect: ssh tunnel: EOF ()`

When an instance is removed by an external source (for example, an autoscaling group or an automated script), jobs fail with the following error:

ERROR: Job failed (system failure): error during connect: Post "http://internal.tunnel.invalid/v1.43/containers/xyz/wait?condition=not-running": ssh tunnel: EOF ()

The GitLab Runner logs also show an instance unexpectedly removed error for the instance ID assigned to the job:

ERROR: instance unexpectedly removed    instance=<instance_id> max-use-count=9999 runner=XYZ slots=map[] subsystem=taskscaler used=45

To resolve this error, check the events related to the instance on your cloud provider platform. For example, on AWS, check the CloudTrail event history for the event source ec2.amazonaws.com.

`ERROR: Preparation failed: unable to acquire instance: context deadline exceeded`

When you use the AWS fleeting plugin, jobs might intermittently fail with the following error:

ERROR: Preparation failed: unable to acquire instance: context deadline exceeded

This problem often appears in the AWS CloudWatch logs as a reserved instance count that oscillates between 0 and 1:

"2024-07-23T18:10:24Z","instance_count:1,max_instance_count:1000,acquired:0,unavailable_capacity:0,pending:0,reserved:0,idle_count:0,scale_factor:0,scale_factor_limit:0,capacity_per_instance:1","required scaling change",
"2024-07-23T18:10:25Z","instance_count:1,max_instance_count:1000,acquired:0,unavailable_capacity:0,pending:0,reserved:1,idle_count:0,scale_factor:0,scale_factor_limit:0,capacity_per_instance:1","required scaling change",
"2024-07-23T18:11:15Z","instance_count:1,max_instance_count:1000,acquired:0,unavailable_capacity:0,pending:0,reserved:0,idle_count:0,scale_factor:0,scale_factor_limit:0,capacity_per_instance:1","required scaling change",
"2024-07-23T18:11:16Z","instance_count:1,max_instance_count:1000,acquired:0,unavailable_capacity:0,pending:0,reserved:1,idle_count:0,scale_factor:0,scale_factor_limit:0,capacity_per_instance:1","required scaling change",

To resolve this error, ensure that the AZRebalance process is disabled for your Auto Scaling group in AWS.