1. robots.txt

What is robots.txt?

A robots.txt file tells crawlers and other bots which pages or files they may or may not request from your site. It is usually located in the top-level directory, such as example.com/robots.txt. A robots.txt file generally looks something like this:

User-agent: BadBot
Disallow: /cgi-bin/
Disallow: /tmp

User-agent: Google
Disallow:

User-agent: *
Disallow: /joe/

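In the example above, BadBot is banned from visiting anything under /cgi-bin/ or any URL whose path starts with /tmp. Google is allowed to visit everything on the site. All other bots are not allowed to visit anything under /joe/. That last rule does not bind BadBot or Google, because a bot follows only the group that matches its own user agent and falls back to the * group when no other group matches.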
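As a rough check of how these rules are read, the sketch below feeds the example file to Python's standard urllib.robotparser module and asks which requests each bot may make. The helper name allowed, the host example.com, and the user agent OtherBot are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, as they would appear in the file.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /cgi-bin/
Disallow: /tmp

User-agent: Google
Disallow:

User-agent: *
Disallow: /joe/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, path: str) -> bool:
    """Return True if the parsed rules let user_agent request path."""
    # example.com is a placeholder host; only the path is matched.
    return parser.can_fetch(user_agent, "http://example.com" + path)


print(allowed("BadBot", "/cgi-bin/search"))   # False
print(allowed("BadBot", "/tmpfile"))          # False: path starts with /tmp
print(allowed("Google", "/joe/"))             # True: empty Disallow allows all
print(allowed("OtherBot", "/joe/page.html"))  # False: falls under the * group
print(allowed("OtherBot", "/index.html"))     # True
```

Note that this only reports what a polite crawler would do; nothing in it stops a bot from requesting the page anyway.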

Misuse and abuse

At this point, you may start to wonder how robots.txt rules are enforced. In short, they are enforced by the honor system: the file only affects well-behaved robots that choose to follow the rules given. In some cases, the server may block misbehaving bots when it detects them, but generally, /robots.txt rules are not actively enforced and rely on bots respecting the standard. Malicious bots (such as vulnerability scanners and email scrapers) typically ignore /robots.txt, making it pointless as a defense against them.

Even so, some misinformed developers (and CTF designers mimicking them) will list the locations of secret data that they wish to keep from the world in their /robots.txt, believing the world's bots will respect their wishes. Thus, /robots.txt makes for a great place to start when looking for vulnerabilities on a website.
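One way to act on this is to pull every listed path out of a robots.txt file and review the results by hand. The following is a minimal sketch that works on text you have already downloaded; the function name listed_paths and the sample paths are hypothetical.

```python
def listed_paths(robots_txt: str) -> list[str]:
    """Collect the unique, non-empty Allow/Disallow paths from robots.txt text."""
    paths: list[str] = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if not sep:
            continue  # not a "field: value" line
        if field.strip().lower() in ("allow", "disallow"):
            path = value.strip()
            if path and path not in paths:
                paths.append(path)
    return paths


# Hypothetical file contents, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/        # made-up path
Disallow: /backup.zip
Disallow:
"""

print(listed_paths(SAMPLE_ROBOTS))  # ['/admin/', '/backup.zip']
```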

2. Sitemaps

What is a sitemap?

A sitemap is a file in which a website provides information about its pages, videos, and other files, and the relationships between them. Search engines like Google read this file to crawl the website more intelligently. A sitemap looks something like this (though your browser may display XML differently):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/foo.html</loc>
    <lastmod>2018-06-04</lastmod>
  </url>
</urlset>

Leakage

Since a sitemap is a listing of all the URLs that a website wants search engines to know about, users can also check it for URLs that aren't linked to anywhere on the website. For example, Apple reportedly leaked the names of its 2018 iPhone models before launch because the corresponding URLs were included in its sitemap. Similarly, sitemaps can reveal other interesting content that may not have been intended for users to see.
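To review a sitemap for such URLs, it helps to reduce it to a plain list of its <loc> entries. The sketch below does this with Python's standard xml.etree.ElementTree, using the sample sitemap shown earlier; the function name sitemap_urls is an assumption, and fetching the file is left out.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap format, in ElementTree's {uri}tag notation.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# The sample sitemap from this section.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/foo.html</loc>
    <lastmod>2018-06-04</lastmod>
  </url>
</urlset>
"""

print(sitemap_urls(SITEMAP))  # ['http://www.example.com/foo.html']
```

Comparing this list against the links actually reachable from the site's pages is one way to spot entries that were never meant to be found.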

3. What is security.txt?

/.well-known/security.txt is a file that lists the channels through which people can report security issues they find in a website. It usually provides a contact email, a link to the site's security policy, and a link to an acknowledgements page.
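The file is made of simple "Field: value" lines, so it is easy to read programmatically. Below is a minimal sketch of a parser; the function name parse_security_txt and the sample contents (addresses and URLs on example.com) are made up for illustration and do not come from a real site.

```python
def parse_security_txt(text: str) -> dict[str, list[str]]:
    """Map each lowercased field name to the list of values given for it."""
    fields: dict[str, list[str]] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition(":")
        if sep:
            fields.setdefault(name.strip().lower(), []).append(value.strip())
    return fields


# Hypothetical file contents, for illustration only.
SAMPLE_SECURITY_TXT = """\
# Made-up example
Contact: mailto:security@example.com
Policy: https://example.com/security-policy
Acknowledgments: https://example.com/hall-of-fame
"""

info = parse_security_txt(SAMPLE_SECURITY_TXT)
print(info["contact"])  # ['mailto:security@example.com']
```

Values are kept in lists because a field such as Contact may appear more than once.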