We were the Cloudflare masters. What happened?
We used Cloudflare for years, with both free and paid features (though not Enterprise), and leveraged everything the dashboard offered, including the fancy Cloudflare Workers, which was a great feature.
It was a perfect setup (browser “onload events” within 1-2 seconds*, globally).
Or so we thought.
BTW: We never used any Cloudflare feature that could create issues if Cloudflare itself cancelled our account. We had seen too many comments about cancelled accounts (Cloudflare can do so at any time), so we treated this as a risk-assessment decision.
But back to the perfect setup.
The first signs of trouble took us by surprise: complaints from some clients about slow loading times kept growing. GTmetrix and other tests showed that all was good. We told clients that website performance, as seen from the Western World overseas, was all good. Nothing wrong.
Then a much larger VIP client with a different audience signed up with us for a new website. We deployed our well-established setup, and the problems began to show.
Around this time, Cloudflare announced that they had a PoP* in Jakarta. We thought this was perfect.
All good. But we were totally wrong.
The VIP client and their website visitors complained about slow, stuck, or frozen web pages and backend. They sent us screencasts, and initially we thought it was a local network issue. But more and more complaints arrived, and we were under pressure to respond.
So, serious investigations started.
For traffic originating from established Indonesian ASNs, we identified the following:
- Traffic was (mis)routed via the Hong Kong PoP and then back to the Singapore PoP.
- Traffic suffered permanent packet loss between the overseas peering and the Hong Kong PoP.
- Traffic had significant bandwidth limitations.
- The Jakarta PoP (the nearest one, both geographically and network-wise) was never engaged.
- The second-best PoP, Singapore, was never directly engaged.
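A minimal sketch of one way to check which PoP answers a request from a given network. It assumes the site is proxied by Cloudflare and that the host serves the plain-text `/cdn-cgi/trace` page, whose `colo` field names the PoP by airport code. The helper names and the sample body are illustrative assumptions, not output from the investigation described here.

```python
from urllib.request import urlopen


def parse_trace(body: str) -> dict:
    """Parse the key=value lines of a /cdn-cgi/trace response into a dict."""
    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' separator
            fields[key.strip()] = value.strip()
    return fields


def serving_pop(hostname: str, timeout: float = 10.0) -> str:
    """Return the 'colo' code seen from this machine (needs network access)."""
    with urlopen(f"https://{hostname}/cdn-cgi/trace", timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return parse_trace(body).get("colo", "unknown")


# Offline example with a made-up response body (values are illustrative only).
sample = "h=www.example.com\nip=203.0.113.7\ncolo=HKG\nloc=ID\n"
fields = parse_trace(sample)
print(fields["colo"], fields["loc"])
```

Run from a connection inside the affected network, `serving_pop("www.example.com")` would show whether a visitor located in Indonesia (`loc=ID`) is answered by a distant PoP such as `HKG` instead of a nearby one.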
Status today, after 6 months: Better or not?
Linux:~$ mtr -4 --report a.random.paid.cloudflare.host
Start: 2020-09-04T14:00:09+0800
HOST: Linux                    Loss%   Snt   Last    Avg   Best   Wrst  StDev
  1.|-- 4g-telkom-lte.rr.com    0.0%    10    0.5    0.6    0.4    0.9    0.2
  2.|-- ???                    100.0    10    0.0    0.0    0.0    0.0    0.0
  3.|-- ???                    100.0    10    0.0    0.0    0.0    0.0    0.0
  4.|-- ???                    100.0    10    0.0    0.0    0.0    0.0    0.0
  5.|-- ???                    100.0    10    0.0    0.0    0.0    0.0    0.0
  6.|-- ???                    100.0    10    0.0    0.0    0.0    0.0    0.0
  7.|-- ???                    100.0    10    0.0    0.0    0.0    0.0    0.0
  8.|-- ???                    100.0    10    0.0    0.0    0.0    0.0    0.0
  9.|-- 188.8.131.52            0.0%    10   17.6   19.7   15.4   26.9    3.4
 10.|-- 184.108.40.206          0.0%    10   57.1   53.6   45.3   65.4    6.0
 11.|-- 220.127.116.11         10.0%    10  100.6   52.9   43.5  100.6   18.2
 12.|-- 18.104.22.168           0.0%    10  264.2  229.5  216.8  264.2   15.9
 13.|-- ???                    100.0    10    0.0    0.0    0.0    0.0    0.0
 14.|-- 22.214.171.124         90.0%    10  269.7  269.7  269.7  269.7    0.0
 15.|-- 126.96.36.199          90.0%    10  235.9  235.9  235.9  235.9    0.0
 16.|-- 188.8.131.52           20.0%    10  219.9  222.1  219.5  236.2    5.7
Linux:~$
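To read a report like this one without eyeballing every row, the hop lines can be parsed and summarized. The following is a rough sketch, assuming the standard `mtr --report` column order (Loss%, Snt, Last, Avg, Best, Wrst, StDev); the function names are hypothetical, and the sample reuses a few rows from the report above.

```python
import re

# One hop row: "<n>.|-- <host> <loss>[%] <snt> <last> <avg> <best> <wrst> <stdev>"
HOP_RE = re.compile(
    r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def parse_mtr_report(text):
    """Return one dict per hop row, ignoring header and prompt lines."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append({
                "hop": int(m.group(1)),
                "host": m.group(2),
                "loss": float(m.group(3)),  # percent
                "avg": float(m.group(6)),   # milliseconds
            })
    return hops


def summarize(hops):
    """Report loss at the final hop and the largest average-latency jump."""
    responding = [h for h in hops if h["loss"] < 100.0]
    jump_hop, jump_ms = None, 0.0
    for prev, cur in zip(responding, responding[1:]):
        delta = cur["avg"] - prev["avg"]
        if delta > jump_ms:
            jump_hop, jump_ms = cur["hop"], delta
    return {
        "final_loss": hops[-1]["loss"],
        "jump_hop": jump_hop,
        "jump_ms": round(jump_ms, 1),
    }


sample = """\
  9.|-- 188.8.131.52    0.0%  10  17.6  19.7  15.4  26.9  3.4
 10.|-- 184.108.40.206  0.0%  10  57.1  53.6  45.3  65.4  6.0
 11.|-- 220.127.116.11 10.0%  10 100.6  52.9  43.5 100.6 18.2
 12.|-- 18.104.22.168   0.0%  10 264.2 229.5 216.8 264.2 15.9
 13.|-- ???            100.0  10   0.0   0.0   0.0   0.0  0.0
 16.|-- 188.8.131.52   20.0%  10 219.9 222.1 219.5 236.2  5.7
"""
summary = summarize(parse_mtr_report(sample))
print(summary)
```

The sketch separates two signals on purpose. Loss reported only at intermediate hops can simply mean a router deprioritizes ICMP replies, so loss at the final hop is the more telling number, while the largest latency jump points at the segment where the detour begins.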
Long story short: Cloudflare denied for months that there was any problem. They wrote to us that they have a perfect-quality setup, but we knew there was real trouble; they simply weren’t owning it. We involved multiple Cloudflare contacts we had, but nothing got better. After months they came back and said they had found issues with permanent packet loss etc., and that if we wanted to use the Jakarta PoP, we would need the Enterprise plan, for USD 1,000 per month per domain (or per 3 domains, I have already forgotten). We had around 60 domains involved at this time.
While Cloudflare was clearly ignoring the issue, we analyzed the peering, operators, IXs* etc. involved and made commercial agreements with the operators of the ASNs involved for dedicated CDNs* carrying traffic to our infrastructure in Singapore.
Meanwhile, we no longer used Cloudflare’s proxy technology, and all plans were downgraded to the free tier, yet we still had our DNS* zone management up and running at Cloudflare.
But Cloudflare is not a flexible DNS operator and doesn’t speak GeoDNS*, not even as an option with some limitations. After further communication with them, we got another “brilliant” email from Cloudflare, but we were already on the way to a professional DNS operator.
Nowadays we see the same Cloudflare issues with other ASNs in other regions, which tells us Cloudflare is doing … what? Luckily, this is not our problem anymore.
Meanwhile, we have developed a well-balanced solution and are back where we came from: global “onload events” in visitors’ browsers of around 1-2 seconds for our clients’ WordPress sites, even on mobile networks – much better than before. We have a well-developed Multi-CDN deployment in place for our clients, which is both compelling and fast.
BTW: Timeline of this involuntary project: around 6 months, from initial investigation to a deployable standard solution for our clients. It was a good, but unwanted, research project at the same time. Maybe next time our trouble tickets to vendors should be more arrogant or aggressive; otherwise nobody takes them seriously. What we learned too: Western World companies regularly misjudge South East Asian regions (maybe except for SG and ANZ).
Any questions? Please let me know.
P.S. As you can see above, Cloudflare’s issues still persist, and I ask myself why Cloudflare didn’t just tell us: “No support and/or no budget for Indonesia for non-Enterprise plans”, or something similar. Anything would have been more helpful, since this was a time-consuming process.
What is a Multi-CDN?
Depending on where the traffic originates, we forward it via various networks and CDNs that are optimally tailored to that origin and to the website visitor.
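The selection logic behind such a setup can be pictured roughly as follows. This is only an illustrative sketch, not our production configuration: the ASN numbers come from the range reserved for documentation, and the CDN hostnames, table names and `pick_cdn` function are made up for the example.

```python
from typing import Optional

# Hypothetical routing tables: most specific match (ASN) first, then country.
ASN_ROUTES = {
    64496: "cdn-a.example.net",  # e.g. a mobile operator with a dedicated path
    64497: "cdn-b.example.net",
}
COUNTRY_ROUTES = {
    "ID": "cdn-a.example.net",
    "SG": "cdn-sg.example.net",
}
DEFAULT_CDN = "cdn-global.example.net"


def pick_cdn(asn: Optional[int], country: Optional[str]) -> str:
    """Choose a CDN endpoint from the visitor's origin network and country."""
    if asn is not None and asn in ASN_ROUTES:
        return ASN_ROUTES[asn]
    if country is not None and country.upper() in COUNTRY_ROUTES:
        return COUNTRY_ROUTES[country.upper()]
    return DEFAULT_CDN


print(pick_cdn(64497, "ID"))  # ASN match wins over the country rule
print(pick_cdn(None, "sg"))   # falls back to the country rule
print(pick_cdn(None, None))   # unknown origin uses the default
```

In practice a decision like this is typically made at the DNS layer (GeoDNS) rather than in application code, so the lookup above stands in for the answer a GeoDNS-capable operator would return.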
What is your solution today?
Our Multi-CDN setup. More details on our price list or on request.
For which kind of business or website is this relevant?
For any kind of business where you pay or get paid per click or visitor, as in Digital Marketing, or where you have a mid to high volume of visitors and need conversions.
Do you still use Cloudflare?
Yes, but only for audiences originating from Western World countries, where we don’t have issues like those described above, or for sites with micro traffic or micro content. So very limited.
- onload event – Fires when the browser finishes loading all content within an HTML document/web page, including window, frames, objects and images – https://en.wikipedia.org/wiki/DOM_events
- PoP – point of presence
- ASN – It’s basically a large local or regional or national network – https://en.wikipedia.org/wiki/Autonomous_system_(Internet)
- IX – Internet Exchange, where network operators exchange internet traffic – https://en.wikipedia.org/wiki/Internet_exchange_point
- CDN – Content Delivery Network – https://en.wikipedia.org/wiki/Content_delivery_network
- DNS – basically a translator from www.domain.com to an IP address – https://en.wikipedia.org/wiki/Domain_Name_System
- GeoDNS – https://en.wikipedia.org/wiki/GeoDNS