Problem Overview
Our services are deployed on Linode servers in the Zeabur Tokyo (NRT1) region. We have observed intermittent outbound DNS resolution failures in our containers. Every few tens of minutes to a few hours, the service experiences batch DNS resolution failures (ENOTFOUND). This issue is reproducible across multiple projects, leading us to suspect an issue at the platform infrastructure level.
Affected Services
Project Details
- Project 1: tradeyai-v04 (ID: 69cb7770b8fdd4fc5297eb93)
  - Service 1: agent-manager (ID: 69cb7777c2fdbebd296df903)
- Project 2: TradeyAI-v03 (ID: 69c101eb1066986b9a175bcb)
  - Service 2: market-data-fetcher
- Region: Tokyo (NRT1)
- Shared outbound IP: 172.104.74.15 (Linode)
- Target endpoint: api.hyperliquid.xyz (public API)
Symptom Details
1. Failure Pattern
- Each time a DNS resolution fails, 4–5 concurrent fetch requests return ENOTFOUND simultaneously.
- The failure event is not isolated but occurs in clusters.
- Each request retries 4 times with exponential backoff (intervals: 1s/2s/4s/8s, ~15 seconds total).
- The name remains unresolvable for the entire ~15-second retry window, which exceeds typical transient DNS jitter.
2. Frequency
- Approximately 8% of cycles failed over the last 48 hours.
- Most recent occurrence: 2026-04-14 19:12:03 UTC.
3. Failure Timeline (UTC)
2026-04-14 19:12:03 cycle 569 ENOTFOUND x5
2026-04-14 18:34:45 cycle 562 ENOTFOUND x5
2026-04-14 18:02:13 cycle 556 ENOTFOUND x5
2026-04-14 15:23:59 cycle 526 ENOTFOUND x5
2026-04-14 12:09:02 cycle 491 ENOTFOUND x5
Excluded Possibilities
1. ✅ Not a CoreDNS (10.43.0.20) single-point issue
- Implemented cacheable-lookup + undici setGlobalDispatcher on the Node side.
- Forced all outbound HTTP DNS to use public resolvers (1.1.1.1 / 8.8.8.8).
- Bypassed container default nameservers.
- Failure rate decreased but did not reach zero after configuration.
2. ✅ Not a client retry issue
- Single ENOTFOUND event lasts >15 seconds.
- Exceeds typical DNS jitter window.
- 4 retries implemented with 15-second total duration.
3. ✅ Not an application-layer bug
- The global dispatcher behind globalThis.fetch has been replaced.
- All outbound calls use a cached public resolver.
- Issue reproduces across multiple projects and services.
Technical Details
DNS Configuration:
- Primary resolver: 1.1.1.1:53 (Cloudflare)
- Secondary resolver: 8.8.8.8:53 (Google)
- Protocol: UDP + TCP
- Caching strategy: cacheable-lookup with undici
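The caching strategy can be pictured as a TTL cache sitting in front of the resolver. The class below is a simplified stand-in for illustration; the real cacheable-lookup honors per-record TTLs, negative caching, and more.

```javascript
// Simplified stand-in for the DNS cache: maps hostname -> { address, expiresAt }.
// Not the actual cacheable-lookup implementation.
class DnsCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;           // injectable clock, useful for testing
    this.entries = new Map();
  }

  get(hostname) {
    const entry = this.entries.get(hostname);
    if (!entry || entry.expiresAt <= this.now()) {
      this.entries.delete(hostname);
      return undefined;       // miss or expired: caller must re-resolve
    }
    return entry.address;
  }

  set(hostname, address) {
    this.entries.set(hostname, { address, expiresAt: this.now() + this.ttlMs });
  }
}
```

A cache of this shape explains why failures arrive in bursts: once an entry expires during an outage window, every concurrent request re-resolves and fails together.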
Failure Characteristics:
- Target endpoint: api.hyperliquid.xyz (public endpoint, no special restrictions)
- Failure type: ENOTFOUND (DNS resolution failure, not connection timeout)
- Concurrency: 4–5 requests fail simultaneously (not random single-point failures)
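The distinction between DNS failure and connection timeout is what our error accounting keys on. A hypothetical classifier (the name `classifyFetchError` is an assumption for illustration) might look like:

```javascript
// Hypothetical helper: classify outbound fetch errors so DNS failures
// (ENOTFOUND / EAI_AGAIN) are counted separately from connection problems.
// Node's fetch (undici) often wraps the system error in err.cause.
function classifyFetchError(err) {
  const code = err && (err.code || (err.cause && err.cause.code));
  if (code === 'ENOTFOUND' || code === 'EAI_AGAIN') return 'dns';
  if (code === 'ETIMEDOUT' || code === 'ECONNREFUSED' || code === 'ECONNRESET') {
    return 'connection';
  }
  return 'other';
}
```

All of the clustered failures reported here land in the 'dns' bucket, never the 'connection' one.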
Items for Confirmation
1. Are outbound requests to 1.1.1.1:53 / 8.8.8.8:53 (UDP and TCP) being rate-limited, temporarily blocked, or routed through unstable egress nodes on Linode Tokyo at certain times?
2. Has there been any node/network-level incident in the Tokyo (NRT1) region during these times?
   - E.g., rolling restarts, CNI restarts, egress NAT switching, firewall rule changes, etc.
3. Is it recommended to switch to a stable egress/DNS endpoint provided by your platform at the container level?
4. Are there any known outbound connection issues with Linode Tokyo?
   - A similar case on the forum (SUP-11316) involved blocked outbound TCP connections in Linode Tokyo.
Impact Assessment
- Service availability: ~92% of fetch cycles succeed (8% failure rate).
- User Impact: Data synchronization delays, loss of trading signals.
- Urgency: High (Production environment affected).