gRPC with NLB: watch the idle timeout or your connections die

2017-10-22

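I tried the recently announced AWS Network Load Balancer (NLB) with gRPC and got badly bitten by its idle timeout, so here is what I learned.

Background

AWS recently released the Network Load Balancer, or NLB.

New Network Load Balancer – Effortless Scaling to Millions of Requests per Second | Amazon Web Services Blog

NLB balances load at the TCP level and delivers high performance without pre-warming, so it pairs better with gRPC than AWS's existing load balancers.

We put NLB on the gRPC path of a live-streaming service built as microservices, and some services started getting connection errors. This post shares what we found.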

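Symptoms

Some services started returning errors like rpc error: code = Unavailable desc = transport is closing
It happened in Java services as well as Go services
It seemed to occur on paths where gRPC calls are infrequent
It did not occur when Client and Server were connected directly, bypassing the NLB
The NLB's Monitoring tab showed a nonzero TCP_ELB_Reset_Count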

Cause

The documentation covers this (at the time of writing, only the English documentation does).

Connection Idle Timeout

For each request that a client makes through a load balancer, the load balancer maintains two connections. A front-end connection is between a client and the load balancer, and a back-end connection is between the load balancer and a target. For each front-end connection, the load balancer manages an idle timeout that is triggered when no data is sent over the connection for a specified time period. If no data has been sent or received by the time that the idle timeout period elapses, the front-end connection is broken. If a client sends data after the idle timeout period has elapsed, it receives a TCP RST packet to indicate that the connection is no longer valid.

Elastic Load Balancing sets the idle timeout value to 350 seconds. You cannot modify this value. Your targets can use TCP keepalive packets to reset the idle timeout.

Network Load Balancers - Elastic Load Balancing
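
The quoted documentation names TCP keepalive as one way to reset the idle timeout. As a rough sketch of what that looks like at the socket level in Go, the standard library's net.Dialer has a KeepAlive field. The 150s interval, the helper name newKeepaliveDialer, and the loopback listener standing in for the load balancer are all assumptions made for illustration, and the exact probe timing depends on the operating system.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

// keepaliveInterval is an assumed value, chosen to sit well under the 350s idle timeout.
const keepaliveInterval = 150 * time.Second

// newKeepaliveDialer (hypothetical helper) returns a dialer whose connections
// have TCP keepalive enabled with the given probe interval.
func newKeepaliveDialer(interval time.Duration) *net.Dialer {
	return &net.Dialer{
		Timeout:   5 * time.Second,
		KeepAlive: interval,
	}
}

func main() {
	// A loopback listener stands in for the load balancer endpoint.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	conn, err := newKeepaliveDialer(keepaliveInterval).Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	fmt.Println("connected with TCP keepalive every", keepaliveInterval)
}
```

Note that this is TCP-level keepalive, which is a different mechanism from the gRPC-level Keepalive settings used in the rest of this post.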

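As the nonzero TCP_ELB_Reset_Count suggests, it appears the NLB drops a connection once it has been idle for more than 350s, and answers the next data sent on it with an RST packet (a TCP reset).

So a service that talks infrequently finds its connection already torn down when it finally sends a request, which is most likely what produced the errors.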

Fix

First, I added Keepalive settings on the Client side.

Here are Client samples in Go and Java.

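Go

// Ping the server after 150s without activity, even when no RPC is in flight.
conn, err := grpc.Dial(address, grpc.WithKeepaliveParams(keepalive.ClientParameters{
    Time:                150 * time.Second, // shorter than the NLB's 350s idle timeout
    PermitWithoutStream: true,              // ping even with no active streams
}))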

grpc-go/keepalive.go at master · grpc/grpc-go

Java

// Ping every 150s, including while no RPC is in flight.
return NettyChannelBuilder
    .forAddress(hostName, port)
    .enableKeepAlive(true)
    .keepAliveTime(150, TimeUnit.SECONDS)
    .keepAliveWithoutCalls(true)
    .build();

The timeout is 350s, so I set the interval to 150s, a little under half.
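
One way to read the "a little under half" choice: with an interval below half the timeout, two pings fit inside a single idle window, so one lost ping would not yet let the timeout fire. That reading is my own assumption, not something stated in the AWS documentation. The sketch below only does the arithmetic, and pingsBeforeTimeout is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

const (
	nlbIdleTimeout    = 350 * time.Second // fixed by the NLB, per the quoted documentation
	keepaliveInterval = 150 * time.Second // the value chosen in this post
)

// pingsBeforeTimeout returns how many keepalive pings fall strictly inside one
// idle-timeout window, counting from the last traffic on the connection.
func pingsBeforeTimeout(timeout, interval time.Duration) int {
	if interval <= 0 {
		return 0
	}
	// Subtract one nanosecond so a ping landing exactly on the timeout is not counted.
	return int((timeout - 1) / interval)
}

func main() {
	n := pingsBeforeTimeout(nlbIdleTimeout, keepaliveInterval)
	fmt.Printf("%d pings fit inside one %v idle window\n", n, nlbIdleTimeout)
}
```

With the values from this post, the pings at 150s and 300s both land before the 350s mark.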

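This fixed the Client-side disconnects at 350s, but during testing I found that connections were still sometimes dropped after roughly 10 to 20 minutes of idle time, for reasons I could not explain.

Adding Keepalive settings on the Server side as well resolved that. The Server is written in Go.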

// Server-side keepalive: ping the client after 150s without activity.
server := grpc.NewServer(
    grpc.KeepaliveParams(keepalive.ServerParameters{
        Time: 150 * time.Second,
    }),
)

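With these changes in place, things have been stable so far.

Open questions

The errors have quieted down for now, but there are several things I have not fully understood, so I am noting them here.

If you know the details or manage to figure them out, I would love to hear from you!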

Downsides of a short Keepalive interval

It seems fine so far, but I have not properly measured the effect on load and the like, so this is unknown.

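Disconnects after 10 to 20 minutes

I still do not know what causes this. Here are the conditions I tested.

With 6 connections open in parallel to a single server, each making calls at an interval of 1min, 5min, 6min, 10min, 20min, or 30min, errors occurred at intervals of 20min and above
It did not occur when Client and Server were connected directly, bypassing the NLB

Baffling.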
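
For anyone who wants to repeat this kind of check, the setup above could be sketched roughly as follows: one goroutine per interval, each sleeping and then making a call, with failures counted per interval. Everything here is a stand-in. probe, stubCall, scaledIntervals, and failFrom are hypothetical names, the intervals are shrunk from minutes to milliseconds so the sketch finishes quickly, and stubCall fails on purpose for the longer intervals only to exercise the bookkeeping. It does not model the NLB, and a real run would replace it with an RPC over a long-lived connection.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// failFrom is the idle gap at which the stub starts failing (illustrative only).
const failFrom = 20 * time.Millisecond

// scaledIntervals mirrors the 1, 5, 6, 10, 20 and 30 minute intervals from the
// test above, shrunk to milliseconds.
var scaledIntervals = []time.Duration{
	1 * time.Millisecond, 5 * time.Millisecond, 6 * time.Millisecond,
	10 * time.Millisecond, 20 * time.Millisecond, 30 * time.Millisecond,
}

// stubCall stands in for one RPC made after the given idle gap.
func stubCall(idle time.Duration) error {
	if idle >= failFrom {
		return errors.New("transport is closing")
	}
	return nil
}

// probe runs one goroutine per interval. Each goroutine sleeps for its interval,
// calls fn, and repeats this rounds times. It returns the failed-call count
// per interval; intervals with no failures are absent from the map.
func probe(intervals []time.Duration, rounds int, fn func(time.Duration) error) map[time.Duration]int {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		failures = make(map[time.Duration]int)
	)
	for _, iv := range intervals {
		wg.Add(1)
		go func(iv time.Duration) {
			defer wg.Done()
			for i := 0; i < rounds; i++ {
				time.Sleep(iv)
				if err := fn(iv); err != nil {
					mu.Lock()
					failures[iv]++
					mu.Unlock()
				}
			}
		}(iv)
	}
	wg.Wait()
	return failures
}

func main() {
	failures := probe(scaledIntervals, 2, stubCall)
	for _, iv := range scaledIntervals {
		fmt.Printf("interval %v: %d failed calls\n", iv, failures[iv])
	}
}
```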

EnforcementPolicy behavior

The Server side has a setting called EnforcementPolicy, which apparently lets the server refuse Keepalive pings that arrive too frequently.

grpc-go/keepalive.go at master · grpc/grpc-go

// EnforcementPolicy is used to set keepalive enforcement policy on the server-side.
// Server will close connection with a client that violates this policy.
type EnforcementPolicy struct {
	// MinTime is the minimum amount of time a client should wait before sending a keepalive ping.
	MinTime time.Duration // The current default value is 5 minutes.
	// If true, server expects keepalive pings even when there are no active streams(RPCs).
	PermitWithoutStream bool // false by default.
}

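It looks like the server closes the connection to any Client whose Keepalive interval is below MinTime.

The default is 5min, so I expected the 150s setting used here to get the connection closed, but it connected normally. The behavior could change in a future version, so I would like to know what the correct behavior is.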
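
This does not answer why the default policy let the connection through. If the enforcement does start to bite, though, one option might be to set the policy explicitly so that the Server permits the Client's interval. The following is a minimal sketch using grpc.KeepaliveEnforcementPolicy from grpc-go. The 2 minute MinTime (a margin below the Client's 150s) and the port are assumptions, not tested values.

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer builds a server that pings idle clients every 150s and explicitly
// permits client pings as often as every 2 minutes, even with no active RPCs.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time: 150 * time.Second,
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             2 * time.Minute, // assumed margin below the client's 150s
			PermitWithoutStream: true,            // matches the client's PermitWithoutStream
		}),
	)
}

func main() {
	lis, err := net.Listen("tcp", ":50051") // hypothetical port
	if err != nil {
		log.Fatal(err)
	}
	// No services are registered here; a real server would register them before Serve.
	log.Fatal(newServer().Serve(lis))
}
```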

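Wrapping up

I have not finished verifying everything or arrived at settings I can call optimal, but this is a problem you are likely to hit when introducing NLB, so I am sharing it as it stands.

If you suspect you have connections that sit idle for a long time, I recommend testing before rolling this out to production.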

Hori Blog

The blog of Ryota Hori, a freelance backend engineer.
Lately it has been more of a personal blog than a tech blog.