Not every domain registry's zone files are available through CZDS, unfortunately.
Not every domain listed in a zone file represents a "website".
Choosing a random domain from a zone file, prefixing it with "http://", and having PHP send a GET request certainly does not have a 100% chance of returning a web page.
(Might be interesting to calculate the probability.)
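One rough way to get at it: sample a few hundred names from the scraped list, send each a GET with a short timeout, and count how many come back with HTML. A minimal Python sketch, assuming a hypothetical domains.txt with one bare domain per line:

    # Rough estimate of P(random domain from the list -> working web page).
    # Assumes a hypothetical domains.txt with one bare domain name per line.
    import random
    import requests

    with open("domains.txt") as f:
        domains = [line.strip().rstrip(".") for line in f if line.strip()]

    sample = random.sample(domains, 500)
    hits = 0
    for name in sample:
        try:
            r = requests.get(f"http://{name}", timeout=5)
            if r.ok and "text/html" in r.headers.get("Content-Type", ""):
                hits += 1
        except requests.RequestException:
            pass  # DNS failure, refused connection, timeout, ...

    print(f"{hits}/{len(sample)} = {hits / len(sample):.1%} returned an HTML page")

Judging by the reports further down the thread (parked pages, timeouts, unresolvable names), the hit rate is well below 100%.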
Seems like the author is not even filtering out the A records corresponding to the NS entries in the zone files, e.g., something like a.ns.domain.tld. Sending a GET request to such subdomains is obviously not going to return a web page.
As for clicking the button over 200 million times (assuming the zone files list about 200 million domains in total), that might violate the ICANN Zone File Access Agreement. Unless the terms have changed, one of the restrictions used to be against redistributing the data. This project would not be redistributing the IP address data, but if the user logs the names, there's an argument it could be redistribution of the name data.
It's true that this doesn't list all the websites that are registered, nor do all the domains lead to a working website. However, I think that most of the invalid websites are not caused by NS entries. As for the Zone File Access Agreement, it prohibits uses that allow access to a significant portion of the data. An immense amount of time would have to be spent scraping data to get any portion that could be considered significant.
Also, there are alternative, publicly accessible ways to get most of this public zone file data now, so I am not sure that restriction in the access agreement is anything more than an historical artifact at this point.
You could use publicly available scan data for ports 80 and 443 to pare down the list of "websites".
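For instance (a sketch, assuming the set of IPv4 addresses seen listening on 80/443 has already been pulled from a public scan dataset such as Rapid7's Project Sonar or Censys; both file names are made up):

    # Keep only domains whose A record points at an IP that public scan data
    # shows listening on 80/443. One IP per line, one domain per line.
    import socket

    with open("open_http_ips.txt") as f:
        open_ips = {line.strip() for line in f if line.strip()}

    kept = []
    with open("domains.txt") as f:
        for name in (line.strip().rstrip(".") for line in f if line.strip()):
            try:
                ip = socket.gethostbyname(name)
            except socket.gaierror:
                continue  # no A record at all -> definitely not a website
            if ip in open_ips:
                kept.append(name)

    with open("domains_with_webserver.txt", "w") as out:
        out.write("\n".join(kept))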
The goal of exposing the non-popular web is worthwhile.
You could port scan the entire IPv4 address space (minus all reserved addresses), send a GET request to every host that responds, and filter for valid HTML. It would take no more than 5 hours on a low-end PC, and a lot less if you get a small AWS instance.
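The scan itself is the kind of job you would hand to a dedicated scanner such as masscan or ZMap rather than plain Python, but the per-host check is easy to sketch (assuming port 80 only, a short timeout, and a naive "looks like HTML" test):

    # Naive per-host probe: does this IP answer HTTP on port 80 with HTML-ish content?
    import socket

    def looks_like_website(ip, timeout=2.0):
        try:
            with socket.create_connection((ip, 80), timeout=timeout) as sock:
                sock.settimeout(timeout)
                sock.sendall(b"GET / HTTP/1.0\r\nHost: " + ip.encode() + b"\r\n\r\n")
                reply = sock.recv(4096)
        except OSError:
            return False  # closed port, timeout, unreachable host, ...
        return reply.startswith(b"HTTP/") and b"<html" in reply.lower()

In practice the throughput comes from massive parallelism, which is exactly what the dedicated scanners provide.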
I started this project after doing some research about how DNS works and learning about the CZDS, where any interested individual can request access to DNS zone files. I realized that I could turn this into a website, especially since I couldn't find anything similar on the internet. I used their Python API to download all the zone files, then wrote a Python script to scrape them into one file with only the domain names. I then stored these in a MySQL database on my web server, and used AJAX + PHP to retrieve and redirect to the domain. One thing I think is cool about this is that it gives you a sense of the websites that constitute most of the internet, not just the most popular ones. And unless you've clicked the button over 200 million times, you are almost certainly going to get a website you've never seen before.
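The exact scripts aren't posted, but the zone-file-to-domain-list step could look roughly like this (a sketch, not the author's code; CZDS zone files are plain master-file text, so keeping only the owner names of NS records gives one name per delegated domain and skips nameserver glue; the paths are made up):

    # Collapse the downloaded zone files into one deduplicated list of domain names.
    # Keeping only the owner names of NS records gives one entry per delegated
    # domain and skips nameserver glue records such as a.ns.domain.tld.
    import glob

    domains = set()
    for path in glob.glob("zonefiles/*.txt"):
        with open(path, errors="replace") as zone:
            for line in zone:
                parts = line.split()
                # typical CZDS line: "example.com. 86400 in ns ns1.registrar.net."
                if len(parts) >= 5 and parts[3].lower() == "ns":
                    domains.add(parts[0].rstrip(".").lower())

    with open("domains.txt", "w") as out:
        out.write("\n".join(sorted(domains)))

From there the deduplicated list can be bulk-loaded into MySQL, and the PHP side only needs to pick one row at random and answer the AJAX call with a redirect target.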
I wonder if it was just chance or due to the growth of the internet, but over half the sites I got were in Chinese. I found that really interesting, since the Chinese side of the internet is usually so far removed from the English-speaking world, and we have no idea how the internet is growing or being used over there.
My first reaction was "hey, this is Dutch, I understand this".
Well, the landing page is a very long list of statements about ngazz. Apparently ngazz is, was, remains, excites, grabs you by the throat, stirs, laughs, and a lot of other things. One of these "ngazz" words is a link to the actual page. It's a small blog/promo page for ngazz, a band playing a fusion of rock and jazz.
I did "Jump In"s for about 10 minutes. Thought I'd provide some feedback.
I <3 the idea. I've personally wanted to see something like this for a while as I continuously visit the same 10 sites every day (as do most people).
Some feedback for the next iteration:
- Maybe ping sites first to see if they're down before jumping? I hit a few 500s and 404s during my jumps (a minimal check is sketched below).
- Possibly show "content" sites? Said another way - I jumped into a few business pages XD (lawyers, doctors, and such) and those aren't the most interesting.
UX and speed both seem pretty decent. Thanks for sharing!
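On the "ping first" suggestion above, a lightweight liveness check before redirecting could be as simple as a HEAD request with a short timeout. A Python sketch (the project's backend is PHP, so treat this as a stand-in):

    # Quick liveness check before redirecting: HEAD the site with a short timeout
    # and treat any status below 400 as "probably a real page".
    import requests

    def is_alive(domain, timeout=3.0):
        for scheme in ("https", "http"):
            try:
                r = requests.head(f"{scheme}://{domain}", timeout=timeout,
                                  allow_redirects=True)
                if r.status_code < 400:
                    return True
            except requests.RequestException:
                continue
        return False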
Thanks for the feedback! I'm thinking about ways to prevent loading broken websites. I'm not sure it's possible to filter for only a certain type of website, though; I think there are way too many sites for that.
I would say 75%+ of all the working sites were parked or expired pages. I would suggest removing or re-redirecting any sites that resolve to known registrar parking-page IPs (perhaps only if those IPs are distinct from the registrar's web hosting cluster IPs, where actual hosting customers' websites might live). That might be a good start to at least prune a lot of the parked sites.
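A first pass could look something like this (a sketch; the addresses shown are placeholders from the TEST-NET range, not a real registrar's parking IPs):

    # Drop domains whose A record lands on a known registrar parking IP.
    import socket

    PARKING_IPS = {"192.0.2.10", "192.0.2.11"}  # placeholders, not real parking IPs

    def is_parked(domain):
        try:
            return socket.gethostbyname(domain) in PARKING_IPS
        except socket.gaierror:
            return True  # doesn't even resolve; treat like a dead/parked name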
Many of the sites understandably time out or can't be resolved, or are just "under construction" or parking pages. However, I did come across this Japanese site: http://mottainou.com/
For some reason the topic, design, and color scheme makes me very nostalgic.
I ended up on 8sectnformats.online, which redirects to https://www.google.com/#spf=1600155592494. I couldn't find anything about them online. Does anyone know what they were?
Half of the sites were basically offline. Amusingly, I found that "Ma' Business Adviser Services DOT COM!" is now available, although it might actually be intended for Massachusetts.
Some links won't work, and that's unavoidable. I'm working on a way to redirect to another site if the first one is broken. In the meantime, just press the button again for a new link.
You must click the button each time to request a website. The loading page is just a placeholder, so reloading that page will not bring you anywhere. I'm not sure if that's what you were doing, but hopefully that helps.
Yes, I'm thinking about a way to implement that. It's too many domains to filter in advance, but it might be possible to redirect the user to a new site if the current one is dead.
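One server-side way to do that, sketched in Python as a stand-in for the PHP endpoint: draw a few random names, return the first one that answers, and let the client redirect there.

    # Fallback picker: draw a handful of random domains, return the first one
    # that responds, or None so the client can ask the user to press the button again.
    import random
    import requests

    def pick_live_domain(domains, attempts=5):
        for name in random.sample(domains, min(attempts, len(domains))):
            try:
                r = requests.head(f"http://{name}", timeout=3, allow_redirects=True)
                if r.status_code < 400:
                    return name
            except requests.RequestException:
                continue
        return None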
To "click the button" once from the command line:
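(The project's endpoint isn't named anywhere in this thread, so the URL below is a placeholder; the idea is just to request the redirect without following it and print where it points.)

    # Hypothetical: replace the URL with the site's actual "jump" endpoint.
    import requests

    resp = requests.get("https://example.com/jump", allow_redirects=False, timeout=10)
    print(resp.status_code, resp.headers.get("Location"))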