Ways to monitor and manage critical DNS infrastructure

How do we monitor what is happening to the Canadian (and global) DNS?

One of CIRA’s most important functions is to ensure that the DNS for the .CA top-level domain is always working. Without it, the more than 2.5 million registered .CA websites would effectively cease to work. In addition to investing in the latest technology, maintaining strong registrar relations, and having an ever-vigilant security team, CIRA also helps make the Internet’s fabric strong because, as a DNS operator, we are part of it.

From a DNS perspective, one of the things we have done is to help encourage an active and successful network of Internet Exchange Points. This is good for Canada, and it also provides locations where we can peer our own DNS infrastructure. By peering locally we offer better performance for .CA and help protect .CA from DDoS attacks. This matters because every organization that manages critical infrastructure is a target for the kind of attack that hit global DNS operator Dyn late last year. The work isn’t just in building out a strong infrastructure, but also in measuring and monitoring the health of the DNS so we can continually improve it.

So how do we monitor what is happening to the Canadian (and global) DNS? For starters, we belong to a few groups that actively share issues in real time, including the Canadian Cyber Incident Response Centre (CCIRC), the DNS Operations, Analysis, and Research Center (DNS-OARC), the Council of European National Top-Level Domain Registries (CENTR), and a host of ethical and black hat hacking groups. These organizations keep us informed about what is going on and help ensure a coordinated response when cyberattacks do happen.

From a measurement standpoint, one of the most valuable technical tools we use comes from the RIPE (Réseaux IP Européens) Network Coordination Centre (RIPE NCC) and its Atlas project. That is a lot of acronyms and organization names to make sense of if you haven’t heard of them before, so we will simplify. RIPE is doing AMAZING things by building the largest Internet measurement network ever made, and everyone can participate by contributing either an anchor or a probe to the network. You can read all about that here.

By being part of the RIPE Atlas, CIRA can use these probes to measure the DNS response time for the .CA TLD at the registry level and also the DNS response time for .CA domains held by registrants. We do this by testing and monitoring the responses to DNS queries that originate from multiple locations around the world. If you aren’t familiar with how queries are answered, we made a video a couple of years back that is still relevant today. For registrants, we monitor the DNS servers that have .CA domains delegated to them and watch for signs of active DDoS attacks. When we instead see consistent problems with a specific operator, we will often reach out to see if we can help. An operator in this case may be an organization that manages the DNS for lots of domains, like a hosting company, or a large organization with many potential users. Unfortunately, with over 100,000 name servers answering for .CA, it is difficult to call every single local shopkeeper with a problem. This is a report we review daily (with the actual name servers blurred out so as not to identify any specific companies):

Daily report of top name server outages showing the failed domain, the number of tests failed, and the current status, for illustrative purposes. Actual name servers have been blurred out so as not to single out any organizations.
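As a rough illustration of how probe results could be rolled up into a report like the one above, here is a minimal Python sketch. It is not CIRA's actual tooling: the record layout (`name_server`, `probe_id`, `rt_ms`), the function name `summarise_probe_results`, and the sample values are all assumptions made for this example, and a failed test is simply represented as a missing response time.

```python
from collections import defaultdict
from statistics import median


def summarise_probe_results(results):
    """Roll up per-probe DNS test results into one row per name server.

    Each result is a dict with the (hypothetical) keys:
      name_server: host name of the server that was queried
      probe_id:    identifier of the probe that sent the query
      rt_ms:       response time in milliseconds, or None if the test failed
    """
    per_server = defaultdict(lambda: {"tests": 0, "failed": 0, "rts": []})
    for r in results:
        stats = per_server[r["name_server"]]
        stats["tests"] += 1
        if r["rt_ms"] is None:
            stats["failed"] += 1
        else:
            stats["rts"].append(r["rt_ms"])

    report = [
        {
            "name_server": ns,
            "tests": stats["tests"],
            "failed": stats["failed"],
            # Median is less sensitive to a single slow probe than the mean.
            "median_rt_ms": median(stats["rts"]) if stats["rts"] else None,
        }
        for ns, stats in per_server.items()
    ]
    # Servers with the most failed tests come first, as in a "top outages" view.
    report.sort(key=lambda row: (-row["failed"], row["name_server"]))
    return report


# Invented sample data, for illustration only.
SAMPLE = [
    {"name_server": "ns1.example.ca", "probe_id": 1, "rt_ms": 12.0},
    {"name_server": "ns1.example.ca", "probe_id": 2, "rt_ms": 30.0},
    {"name_server": "ns2.example.ca", "probe_id": 1, "rt_ms": None},
    {"name_server": "ns2.example.ca", "probe_id": 2, "rt_ms": None},
    {"name_server": "ns2.example.ca", "probe_id": 3, "rt_ms": 45.0},
]

if __name__ == "__main__":
    for row in summarise_probe_results(SAMPLE):
        print(row)
```

Sorting by failed tests puts the name servers most likely to need a follow-up at the top, which matches the idea of reviewing only the worst offenders each day.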

 

The RIPE Atlas helps us with our D-Zone Anycast DNS service too

Image of a traceroute analysis using RIPE Atlas showing that North American queries to D-Zone stay in North America

North American DNS traffic answered close to the query

Importantly, we are also able to use the RIPE Atlas to help choose our own DNS service providers. In addition to maintaining our own global DNS, we back it up with other suppliers, because having redundancy in every part of this critical registry function helps keep it resilient to outages of any type. What we are generally looking to see is that DNS queries are answered close to where they originate. This type of geo-fencing improves the end-user experience and can also create scenarios where bad traffic stays where it originates.

While this seems obvious, here is a similar analysis that we ran using European and North American queries to a potential DNS supplier, and you can see that answers are frequently crossing the ocean. Fully 50% of European queries in our test were answered in North America, so we asked ourselves whether the reason was that the infrastructure in the US was really good. But a similar test run from North America showed a third of traffic was answered in Europe. When managing hundreds of millions of queries, this type of inefficiency adds up, and every millisecond counts in today's multi-service websites.

Image of a traceroute analysis using RIPE Atlas showing that 50% of European queries are being answered from North America

50% of European queries answered in North America...

Image of a traceroute analysis using RIPE Atlas showing that North American queries are being answered in Europe

...while 30% of North American queries are answered in Europe.
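The percentages in an analysis like this come down to simple counting: for each region a query starts in, what share of answers came from somewhere else? The Python sketch below shows that arithmetic under stated assumptions. The function name `cross_region_share`, the region labels, and the observation counts are invented for illustration and are not the data behind the images above.

```python
from collections import Counter


def cross_region_share(observations):
    """Return, per query region, the fraction of queries answered elsewhere.

    observations is an iterable of (query_region, answer_region) pairs,
    for example as derived from where a traceroute to the name server ends.
    """
    totals = Counter()
    elsewhere = Counter()
    for query_region, answer_region in observations:
        totals[query_region] += 1
        if query_region != answer_region:
            elsewhere[query_region] += 1
    return {region: elsewhere[region] / totals[region] for region in totals}


# Invented observations: 4 European probes and 10 North American probes.
OBSERVATIONS = (
    [("EU", "EU")] * 2 + [("EU", "NA")] * 2
    + [("NA", "NA")] * 7 + [("NA", "EU")] * 3
)

if __name__ == "__main__":
    for region, share in cross_region_share(OBSERVATIONS).items():
        print(f"{region}: {share:.0%} of queries answered in another region")
```

A supplier that keeps traffic local would show shares close to zero for every region; large shares in both directions, as in the comparison above, suggest the routing rather than the quality of any one site is the issue.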

In conclusion, operating the DNS is one of our most important functions, and the tools we use to monitor it on a daily basis are the same tools we use to have discussions with customers of our DNS services. As a dedicated service provider, it is our job to care about this critical infrastructure that most IT departments simply "set and forget". If you are interested in us running a similar analysis on your service, please book a meeting.
