Overview

When calling the AWS SDK from Go (AWS SDK for Go v2), you sometimes want to retry an SDK call.

There are several ways to do this, so I wrote them up.

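Prerequisites

The examples in this article use Go v1.18.7 and aws-sdk-go-v2 v1.17.5.

Retries can also be configured centrally for every SDK call (when the client instance is created), but this article assumes that you want to set or change the retry behavior per SDK call (= per API call).

Example: use different retry behavior for s3.ListObjects and iam.DeleteRole.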

Retry patterns

Options.RetryMaxAttempts

The simplest option comes first.

The Options struct, which is used both when creating an AWS SDK client and when calling an API, has two retry-related parameters: RetryMaxAttempts and RetryMode.

Example

input := &iam.DeleteRoleInput{
    RoleName: roleName,
}

optFn := func(o *iam.Options) {
    o.RetryMaxAttempts = 3
    o.RetryMode = aws.RetryModeStandard
}

_, err := i.client.DeleteRole(ctx, input, optFn)

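Setting these is enough to make the SDK retry with exponential backoff, up to the number of attempts given in RetryMaxAttempts. Note that RetryMaxAttempts is the maximum number of attempts including the first call, and that only errors the SDK considers retryable are retried.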
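To make "exponential backoff" a little more concrete, here is a minimal standalone sketch of how the wait before each retry can grow and be capped, with a random factor (jitter) applied. It is not the SDK's actual code. The names backoffCeiling and jitteredDelay, the 1-second base, and the 20-second cap are my own assumptions for illustration.

```go
package main

import (
    "fmt"
    "math/rand"
    "time"
)

const (
    baseDelay         = time.Second      // ceiling for the first retry (assumed value)
    defaultMaxBackoff = 20 * time.Second // cap on any single wait (assumed value)
)

// rng is a locally seeded source for this demo. *rand.Rand is not safe for
// concurrent use, so real code would need a lock or one source per goroutine.
var rng = rand.New(rand.NewSource(time.Now().UnixNano()))

// backoffCeiling returns the upper bound of the wait before the given retry
// (retry 1 is the first retry): 1s, 2s, 4s, ... capped at maxBackoff.
// It assumes maxBackoff > 0.
func backoffCeiling(retry int, maxBackoff time.Duration) time.Duration {
    d := baseDelay
    for i := 1; i < retry && d < maxBackoff; i++ {
        d *= 2
    }
    if d > maxBackoff {
        return maxBackoff
    }
    return d
}

// jitteredDelay picks a random wait in [0, ceiling), often called "full jitter".
func jitteredDelay(retry int, maxBackoff time.Duration) time.Duration {
    ceiling := backoffCeiling(retry, maxBackoff)
    return time.Duration(rng.Int63n(int64(ceiling)))
}

func main() {
    for retry := 1; retry <= 6; retry++ {
        fmt.Printf("retry %d: ceiling=%v, wait=%v\n",
            retry, backoffCeiling(retry, defaultMaxBackoff), jitteredDelay(retry, defaultMaxBackoff))
    }
}
```

The ceiling doubles on each retry until it reaches the cap, and the actual wait is drawn at random below that ceiling so that many clients do not retry at the same instant.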

Options.Retryer

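Options also has a Retryer parameter, which lets you implement a more fine-grained retry algorithm.

When it is set, the retry behavior you specify (implement) there is applied instead of the RetryMaxAttempts and RetryMode settings described above.

The value assigned to Options.Retryer implements the Retryer or RetryerV2 interface.

These interfaces have methods for the retry decision (IsErrorRetryable), the maximum number of attempts (MaxAttempts), and the sleep time (RetryDelay), and implementing them lets you customize the retry behavior.

Concretely, IsErrorRetryable lets you specify in more detail when a retry should happen, and RetryDelay lets you build logic such as waiting a random number of seconds (jitter) rather than using plain exponential backoff.

Code from the SDK module (not an example implementation)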

type Retryer interface {
    // IsErrorRetryable returns if the failed attempt is retryable. This check
    // should determine if the error can be retried, or if the error is
    // terminal.
    IsErrorRetryable(error) bool

    // MaxAttempts returns the maximum number of attempts that can be made for
    // an attempt before failing. A value of 0 implies that the attempt should
    // be retried until it succeeds if the errors are retryable.
    MaxAttempts() int

    // RetryDelay returns the delay that should be used before retrying the
    // attempt. Will return error if the if the delay could not be determined.
    RetryDelay(attempt int, opErr error) (time.Duration, error)

    // GetRetryToken attempts to deduct the retry cost from the retry token pool.
    // Returning the token release function, or error.
    GetRetryToken(ctx context.Context, opErr error) (releaseToken func(error) error, err error)

    // GetInitialToken returns the initial attempt token that can increment the
    // retry token pool if the attempt is successful.
    GetInitialToken() (releaseToken func(error) error)
}

// RetryerV2 is an interface to determine if a given error from an attempt
// should be retried, and if so what backoff delay to apply. The default
// implementation used by most services is the retry package's Standard type.
// Which contains basic retry logic using exponential backoff.
//
// RetryerV2 replaces the Retryer interface, deprecating the GetInitialToken
// method in favor of GetAttemptToken which takes a context, and can return an error.
//
// The SDK's retry package's Attempt middleware, and utilities will always
// wrap a Retryer as a RetryerV2. Delegating to GetInitialToken, only if
// GetAttemptToken is not implemented.
type RetryerV2 interface {
    Retryer

    // GetInitialToken returns the initial attempt token that can increment the
    // retry token pool if the attempt is successful.
    //
    // Deprecated: This method does not provide a way to block using Context,
    // nor can it return an error. Use RetryerV2, and GetAttemptToken instead.
    GetInitialToken() (releaseToken func(error) error)

    // GetAttemptToken returns the send token that can be used to rate limit
    // attempt calls. Will be used by the SDK's retry package's Attempt
    // middleware to get a send token prior to calling the temp and releasing
    // the send token after the attempt has been made.
    GetAttemptToken(context.Context) (func(error) error, error)
}

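With this approach, define a struct named Retryer in a separate file.

The retry decision that IsErrorRetryable has to implement is defined by the caller and passed in through the constructor, so the same Retryer can be reused across different calls.

The logic for waiting a random number of seconds before retrying is in RetryDelay, around the middle of the example below.

Example (retryer_options.go)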

package retryer

import (
    "context"
    "math/rand"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
)

const MaxRetryCount = 10

var _ aws.RetryerV2 = (*Retryer)(nil)

type Retryer struct {
    isErrorRetryableFunc func(error) bool
    delayTimeSec         int
}

func NewRetryer(isErrorRetryableFunc func(error) bool, delayTimeSec int) *Retryer {
    return &Retryer{
        isErrorRetryableFunc: isErrorRetryableFunc,
        delayTimeSec:         delayTimeSec,
    }
}

func (r *Retryer) IsErrorRetryable(err error) bool {
    return r.isErrorRetryableFunc(err)
}

func (r *Retryer) MaxAttempts() int {
    return MaxRetryCount
}

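func init() {
    // Seed the global source once at startup instead of on every call.
    // (From Go 1.20 the global source is seeded automatically and rand.Seed is deprecated.)
    rand.Seed(time.Now().UnixNano())
}

// RetryDelay ignores the attempt number and the error, and waits a random
// whole number of seconds in [1, delayTimeSec] (1 second if delayTimeSec <= 1).
func (r *Retryer) RetryDelay(int, error) (time.Duration, error) {
    waitTime := 1
    if r.delayTimeSec > 1 {
        waitTime += rand.Intn(r.delayTimeSec)
    }
    return time.Duration(waitTime) * time.Second, nil
}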

func (r *Retryer) GetRetryToken(context.Context, error) (func(error) error, error) {
    return func(error) error { return nil }, nil
}

func (r *Retryer) GetInitialToken() func(error) error {
    return func(error) error { return nil }
}

func (r *Retryer) GetAttemptToken(context.Context) (func(error) error, error) {
    return func(error) error { return nil }, nil
}

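Then use it to specify the retry behavior when calling the SDK.

In the example below, the variable retryable holds a function that implements the decision "retry if the error returned by the SDK contains the message api error Throttling: Rate exceeded".

optFn is a function that sets Options.Retryer to a Retryer instance built from that retryable function and SleepTimeSec, the upper bound on the wait between retries. It is passed as the third argument of the SDK call (DeleteRole here).

Example (caller) (iam.go)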

const SleepTimeSec = 5

...
...

input := &iam.DeleteRoleInput{
    RoleName: roleName,
}

retryable := func(err error) bool {
    return strings.Contains(err.Error(), "api error Throttling: Rate exceeded")
}
optFn := func(o *iam.Options) {
    o.Retryer = retryer.NewRetryer(retryable, SleepTimeSec)
}

_, err := i.client.DeleteRole(ctx, input, optFn)
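The string match above works, but it depends on the exact wording of the error message. As a possible alternative, here is a minimal sketch that inspects an error code through errors.As instead. The codedError interface, the fakeAPIError type, and the wrapOp helper are hypothetical stand-ins so that the example runs with only the standard library. In real code the analogous interface would be smithy-go's APIError, which exposes ErrorCode().

```go
package main

import (
    "errors"
    "fmt"
)

// codedError is a hypothetical stand-in for an SDK API error that exposes a
// machine-readable error code.
type codedError interface {
    error
    ErrorCode() string
}

// fakeAPIError is a hypothetical error type used only for this demo.
type fakeAPIError struct {
    code    string
    message string
}

func (e *fakeAPIError) Error() string     { return "api error " + e.code + ": " + e.message }
func (e *fakeAPIError) ErrorCode() string { return e.code }

// wrapOp wraps err the way an operation-level error might wrap an API error.
func wrapOp(err error) error {
    return fmt.Errorf("operation error IAM: DeleteRole: %w", err)
}

// isThrottling reports whether err, or any error it wraps, carries the
// "Throttling" error code.
func isThrottling(err error) bool {
    var ce codedError
    return errors.As(err, &ce) && ce.ErrorCode() == "Throttling"
}

func main() {
    throttled := wrapOp(&fakeAPIError{code: "Throttling", message: "Rate exceeded"})
    denied := wrapOp(&fakeAPIError{code: "AccessDenied", message: "not authorized"})
    fmt.Println(isThrottling(throttled), isThrottling(denied), isThrottling(nil))
}
```

A function like isThrottling has the same func(error) bool shape as retryable, so it could be passed to NewRetryer in the same way.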

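Go generics (hand-rolled retry)

Options.Retryer follows the officially provided retry mechanism while still letting you define the logic freely, but the approach in this section gives you even more freedom.

It uses generics, a relatively new Go feature (added in Go 1.18).

With Options.Retryer, it is hard to freely shape the error message that is output when the call still fails after all retries (for example, to include the name of the resource that caused the error).

Note on "hard": when the maximum number of attempts is exceeded, the SDK returns an error wrapped in its MaxAttemptsError type (see the AWS SDK for Go v2 source), so you can still build any error message you like by handling that error at the call site. With the approach below, however, the customized error message is returned as is, without that extra handling at the call site.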
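For reference, the call-site handling mentioned in the note could look roughly like the sketch below. To keep it runnable with only the standard library, maxAttemptsError is a hypothetical stand-in that mirrors what I understand the SDK's MaxAttemptsError to look like (an attempt count plus the wrapped error). The describeFailure helper and errRateExceeded are likewise made up for illustration.

```go
package main

import (
    "errors"
    "fmt"
)

// maxAttemptsError is a hypothetical stand-in for the SDK's MaxAttemptsError:
// it records how many attempts were made and wraps the last error.
type maxAttemptsError struct {
    Attempt int
    Err     error
}

func (e *maxAttemptsError) Error() string {
    return fmt.Sprintf("exceeded maximum number of attempts, %d, %v", e.Attempt, e.Err)
}

func (e *maxAttemptsError) Unwrap() error { return e.Err }

// errRateExceeded is a demo error standing in for a throttling response.
var errRateExceeded = errors.New("Rate exceeded")

// describeFailure builds a caller-specific message. When retries were
// exhausted it includes the resource name and the attempt count.
func describeFailure(resource string, err error) error {
    var mae *maxAttemptsError
    if errors.As(err, &mae) {
        return fmt.Errorf("failed to delete %s after %d attempts: %w", resource, mae.Attempt, mae.Err)
    }
    return fmt.Errorf("failed to delete %s: %w", resource, err)
}

func main() {
    exhausted := &maxAttemptsError{Attempt: 3, Err: errRateExceeded}
    fmt.Println(describeFailure("my-role", exhausted))
    fmt.Println(describeFailure("my-role", errRateExceeded))
}
```

This works, but every call site that wants a custom message has to repeat this kind of unwrapping.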

This section describes a way to handle such cases more flexibly.

First, define a generic retry function in a separate file.

Example (retryer_generics.go)

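package retryer

import (
    "context"
    "fmt"
    "math/rand"
    "strconv"
    "time"
)

// RetryInput bundles everything Retry needs for one API call.
// MaxRetryCount is the constant defined in retryer_options.go (same package).
// T: Input type for API Request.
// U: Output type for API Response.
// V: Options type for API Request.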
type RetryInput[T, U, V any] struct {
    Ctx              context.Context
    SleepTimeSec     int
    TargetResource   *string
    Input            *T
    ApiOptions       []func(*V)
    ApiCaller        func(ctx context.Context, input *T, optFns ...func(*V)) (*U, error)
    RetryableChecker func(error) bool
}

// T: Input type for API Request.
// U: Output type for API Response.
// V: Options type for API Request.
func Retry[T, U, V any](
    in *RetryInput[T, U, V],
) (*U, error) {
    retryCount := 0

    for {
        output, err := in.ApiCaller(in.Ctx, in.Input, in.ApiOptions...)
        if err == nil {
            return output, nil
        }

        if in.RetryableChecker(err) {
            retryCount++
            if err := waitForRetry(in.Ctx, retryCount, in.SleepTimeSec, in.TargetResource, err); err != nil {
                return nil, err
            }
            continue
        }
        return nil, err
    }
}

func waitForRetry(ctx context.Context, retryCount int, sleepTimeSec int, targetResource *string, err error) error {
    if retryCount > MaxRetryCount {
        // Guard against a nil TargetResource so that building the message cannot panic.
        resource := "<unknown resource>"
        if targetResource != nil {
            resource = *targetResource
        }
        errorDetail := err.Error() + "\nRetryCount(" + strconv.Itoa(MaxRetryCount) + ") over, but failed to delete. "
        return fmt.Errorf("RetryCountOverError: %v, %v", resource, errorDetail)
    }

    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-time.After(getRandomSleepTime(sleepTimeSec)):
    }
    return nil
}

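func init() {
    // Seed the global source once at startup instead of on every call.
    // (From Go 1.20 the global source is seeded automatically and rand.Seed is deprecated.)
    rand.Seed(time.Now().UnixNano())
}

// getRandomSleepTime returns a random whole number of seconds in
// [1, sleepTimeSec] (1 second if sleepTimeSec <= 1).
func getRandomSleepTime(sleepTimeSec int) time.Duration {
    waitTime := 1
    if sleepTimeSec > 1 {
        waitTime += rand.Intn(sleepTimeSec)
    }
    return time.Duration(waitTime) * time.Second
}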

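Now for the walkthrough. We define a type named RetryInput as the input, and a function named Retry that performs the retries.

The type parameters of RetryInput ([T, U, V any]) are filled in by the caller: iam.DeleteRoleInput for T, iam.DeleteRoleOutput for U, and iam.Options for V.

ApiCaller receives the actual SDK function itself (e.g. the DeleteRole method of the IAM client).

RetryableChecker is the function that defines the decision logic for when to retry.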

// T: Input type for API Request.
// U: Output type for API Response.
// V: Options type for API Request.
type RetryInput[T, U, V any] struct {
    Ctx              context.Context
    SleepTimeSec     int
    TargetResource   *string
    Input            *T
    ApiOptions       []func(*V)
    ApiCaller        func(ctx context.Context, input *T, optFns ...func(*V)) (*U, error)
    RetryableChecker func(error) bool
}

// T: Input type for API Request.
// U: Output type for API Response.
// V: Options type for API Request.
func Retry[T, U, V any](
    in *RetryInput[T, U, V],
) (*U, error) {
    retryCount := 0

    for {
        output, err := in.ApiCaller(in.Ctx, in.Input, in.ApiOptions...)
        if err == nil {
            return output, nil
        }

        if in.RetryableChecker(err) {
            retryCount++
            if err := waitForRetry(in.Ctx, retryCount, in.SleepTimeSec, in.TargetResource, err); err != nil {
                return nil, err
            }
            continue
        }
        return nil, err
    }
}

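The waitForRetry function returns an error with a custom message once the maximum retry count, MaxRetryCount, has been exceeded.

It also uses the context passed as an argument: on every retry it checks whether the context has been cancelled (Done), that is, whether some other part of the program hit an error and the program should terminate abnormally. If the context is cancelled, it skips the sleep before the next retry and returns ctx.Err() instead.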

func waitForRetry(ctx context.Context, retryCount int, sleepTimeSec int, targetResource *string, err error) error {
    if retryCount > MaxRetryCount {
        // Guard against a nil TargetResource so that building the message cannot panic.
        resource := "<unknown resource>"
        if targetResource != nil {
            resource = *targetResource
        }
        errorDetail := err.Error() + "\nRetryCount(" + strconv.Itoa(MaxRetryCount) + ") over, but failed to delete. "
        return fmt.Errorf("RetryCountOverError: %v, %v", resource, errorDetail)
    }

    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-time.After(getRandomSleepTime(sleepTimeSec)):
    }
    return nil
}
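As a side note on the select in waitForRetry: in the Go versions this article targets, the timer created by time.After is not released until it fires, even if the context is cancelled first. For waits of a few seconds this hardly matters, but a variant using time.NewTimer can stop the timer on early return. The following is a minimal sketch of that idea. The sleepWithContext and cancelledContext names are my own, not part of the code above.

```go
package main

import (
    "context"
    "fmt"
    "time"
)

// sleepWithContext waits for d, or returns ctx.Err() early if ctx is
// cancelled first. The timer is stopped when the function returns.
func sleepWithContext(ctx context.Context, d time.Duration) error {
    timer := time.NewTimer(d)
    defer timer.Stop()

    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-timer.C:
        return nil
    }
}

// cancelledContext returns a context that is already cancelled (demo helper).
func cancelledContext() context.Context {
    ctx, cancel := context.WithCancel(context.Background())
    cancel()
    return ctx
}

func main() {
    // Cancelled context: returns immediately instead of sleeping for an hour.
    fmt.Println(sleepWithContext(cancelledContext(), time.Hour))

    // Live context: sleeps for the full duration and returns nil.
    fmt.Println(sleepWithContext(context.Background(), 10*time.Millisecond))
}
```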

The getRandomSleepTime function used by waitForRetry above contains the logic that adjusts the sleep time between retries.

Here it waits for a random duration (jitter) within the specified upper bound (sleepTimeSec).

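func init() {
    // Seed the global source once at startup instead of on every call.
    // (From Go 1.20 the global source is seeded automatically and rand.Seed is deprecated.)
    rand.Seed(time.Now().UnixNano())
}

// getRandomSleepTime returns a random whole number of seconds in
// [1, sleepTimeSec] (1 second if sleepTimeSec <= 1).
func getRandomSleepTime(sleepTimeSec int) time.Duration {
    waitTime := 1
    if sleepTimeSec > 1 {
        waitTime += rand.Intn(sleepTimeSec)
    }
    return time.Duration(waitTime) * time.Second
}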

Finally, here is an example of the code that calls this Retry function.

Example (caller) (iam.go)

    input := &iam.DeleteRoleInput{
        RoleName: roleName,
    }

    retryable := func(err error) bool {
        return strings.Contains(err.Error(), "api error Throttling: Rate exceeded")
    }

    _, err := retryer.Retry(
        &retryer.RetryInput[iam.DeleteRoleInput, iam.DeleteRoleOutput, iam.Options]{
            Ctx:              ctx,
            SleepTimeSec:     SleepTimeSec,
            TargetResource:   roleName,
            Input:            input,
            ApiCaller:        i.client.DeleteRole,
            RetryableChecker: retryable,
        },
    )
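If you want to try the idea without AWS credentials, the following self-contained sketch applies the same pattern to a fake client. Everything in it (deleteInput, deleteOutput, options, fakeClient, and the simplified retry function) is a hypothetical stand-in written for this illustration. Waiting between attempts is left out to keep it short, and maxAttempts is assumed to be at least 1.

```go
package main

import (
    "context"
    "errors"
    "fmt"
)

// Hypothetical stand-ins for an SDK's input, output, and options types.
type deleteInput struct{ Name string }
type deleteOutput struct{ Deleted bool }
type options struct{ Region string }

var errThrottled = errors.New("Throttling: Rate exceeded")

func isThrottled(err error) bool { return errors.Is(err, errThrottled) }

// fakeClient fails with errThrottled for the first `failures` calls.
type fakeClient struct {
    failures int
    calls    int
}

func (c *fakeClient) Delete(ctx context.Context, in *deleteInput, optFns ...func(*options)) (*deleteOutput, error) {
    c.calls++
    if c.calls <= c.failures {
        return nil, errThrottled
    }
    return &deleteOutput{Deleted: true}, nil
}

// retry calls fn until it succeeds, returns a non-retryable error, or has
// been called maxAttempts times. T, U, and V tie the input, output, and
// options types of fn together at compile time.
func retry[T, U, V any](
    ctx context.Context,
    maxAttempts int,
    in *T,
    fn func(context.Context, *T, ...func(*V)) (*U, error),
    retryable func(error) bool,
    optFns ...func(*V),
) (*U, error) {
    var lastErr error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        out, err := fn(ctx, in, optFns...)
        if err == nil {
            return out, nil
        }
        if !retryable(err) {
            return nil, err
        }
        lastErr = err
    }
    return nil, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
    c := &fakeClient{failures: 2}
    out, err := retry[deleteInput, deleteOutput, options](
        context.Background(), 5, &deleteInput{Name: "my-role"}, c.Delete, isThrottled,
    )
    if err != nil {
        fmt.Println("failed:", err)
        return
    }
    fmt.Println("deleted:", out.Deleted, "calls:", c.calls)
}
```

Passing a function whose input, output, or options types do not match the type arguments fails to compile, which is the type relationship the generic signature is there to enforce.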

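The distinguishing feature of this approach is that, by using generics, even a hand-written function can tie the input, output, and options types together, so the retry logic stays reusable while the relationship between those types is checked by the compiler.

That said, unless you have a particular reason not to, I think it is better to use Options.Retryer, since the SDK already provides it.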