Problem Overview
Our services are deployed on Linode servers in the Zeabur Tokyo (NRT1) region. We have observed intermittent outbound DNS resolution failures in our containers. Every few tens of minutes to a few hours, the service experiences batch DNS resolution failures (ENOTFOUND). This issue is reproducible across multiple projects, leading us to suspect an issue at the platform infrastructure level.
Affected Services
Project Details
- Project 1: tradeyai-v04 (ID: 69cb7770b8fdd4fc5297eb93)
  - Service 1: agent-manager (ID: 69cb7777c2fdbebd296df903)
- Project 2: TradeyAI-v03 (ID: 69c101eb1066986b9a175bcb)
  - Service 2: market-data-fetcher
- Region: Tokyo (NRT1)
- Shared outbound IP: 172.104.74.15 (Linode)
- Target endpoint: api.hyperliquid.xyz (public API)
Symptom Details
1. Failure Pattern
- Each time a DNS resolution fails, 4–5 concurrent fetch requests return ENOTFOUND simultaneously.
- The failure event is not isolated but occurs in clusters.
- Each request retries 4 times with exponential backoff (intervals: 1s/2s/4s/8s, ~15 seconds total).
- The name remains unresolvable for the entire ~15-second retry window, which exceeds typical transient DNS jitter.
2. Frequency
- Approximately 8% of cycles failed over the last 48 hours.
- Most recent occurrence: 2026-04-14 19:12:03 UTC.
3. Failure Timeline (UTC)
2026-04-14 19:12:03 cycle 569 ENOTFOUND x5
2026-04-14 18:34:45 cycle 562 ENOTFOUND x5
2026-04-14 18:02:13 cycle 556 ENOTFOUND x5
2026-04-14 15:23:59 cycle 526 ENOTFOUND x5
2026-04-14 12:09:02 cycle 491 ENOTFOUND x5
Excluded Possibilities
1. ✅ Not a CoreDNS (10.43.0.20) single-point issue
- Implemented cacheable-lookup + undici setGlobalDispatcher on the Node side.
- Forced all outbound HTTP DNS to use public resolvers (1.1.1.1 / 8.8.8.8).
- Bypassed container default nameservers.
- Failure rate decreased but did not reach zero after configuration.
2. ✅ Not a client retry issue
- Single ENOTFOUND event lasts >15 seconds.
- Exceeds typical DNS jitter window.
- 4 retries implemented with 15-second total duration.
3. ✅ Not an application-layer bug
- The global dispatcher behind globalThis.fetch has been replaced.
- All outbound calls use a cached public resolver.
- Issue reproduces across multiple projects and services.
Technical Details
DNS Configuration:
- Primary resolver: 1.1.1.1:53 (Cloudflare)
- Secondary resolver: 8.8.8.8:53 (Google)
- Protocol: UDP + TCP
- Caching strategy: cacheable-lookup with undici
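The caching strategy can be pictured as a TTL cache sitting in front of the resolver. The class below is a simplified stand-in for illustration; the real cacheable-lookup honors per-record TTLs, negative caching, and more.

```javascript
// Simplified stand-in for the DNS cache: maps hostname -> { address, expiresAt }.
// Not the actual cacheable-lookup implementation.
class DnsCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;           // injectable clock, useful for testing
    this.entries = new Map();
  }

  get(hostname) {
    const entry = this.entries.get(hostname);
    if (!entry || entry.expiresAt <= this.now()) {
      this.entries.delete(hostname);
      return undefined;       // miss or expired: caller must re-resolve
    }
    return entry.address;
  }

  set(hostname, address) {
    this.entries.set(hostname, { address, expiresAt: this.now() + this.ttlMs });
  }
}
```

A cache of this shape explains why failures arrive in bursts: once an entry expires during an outage window, every concurrent request re-resolves and fails together.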
Failure Characteristics:
- Target endpoint: api.hyperliquid.xyz (public endpoint, no special restrictions)
- Failure type: ENOTFOUND (DNS resolution failure, not connection timeout)
- Concurrency: 4–5 requests fail simultaneously (not random single-point failures)
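The distinction between DNS failure and connection timeout is what our error accounting keys on. A hypothetical classifier (the name `classifyFetchError` is an assumption for illustration) might look like:

```javascript
// Hypothetical helper: classify outbound fetch errors so DNS failures
// (ENOTFOUND / EAI_AGAIN) are counted separately from connection problems.
// Node's fetch (undici) often wraps the system error in err.cause.
function classifyFetchError(err) {
  const code = err && (err.code || (err.cause && err.cause.code));
  if (code === 'ENOTFOUND' || code === 'EAI_AGAIN') return 'dns';
  if (code === 'ETIMEDOUT' || code === 'ECONNREFUSED' || code === 'ECONNRESET') {
    return 'connection';
  }
  return 'other';
}
```

All of the clustered failures reported here land in the 'dns' bucket, never the 'connection' one.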
Items for Confirmation
1. Are outbound requests to 1.1.1.1:53 / 8.8.8.8:53 (UDP and TCP) being rate-limited, temporarily blocked, or routed through unstable egress nodes on Linode Tokyo at certain times?
2. Has there been any node/network-level incident in the Tokyo (NRT1) region during these times?
   - E.g., rolling restarts, CNI restarts, egress NAT switching, firewall rule changes, etc.
3. Is it recommended to switch to a stable egress/DNS endpoint provided by your platform at the container level?
4. Are there any known outbound connection issues with Linode Tokyo?
   - A similar case on the forum (SUP-11316) involved blocked outbound TCP connections in Linode Tokyo.
Impact Assessment
- Service availability: ~92% of fetch cycles succeed (8% failure rate).
- User Impact: Data synchronization delays, loss of trading signals.
- Urgency: High (Production environment affected).