How to shield your site from public search engines and Yale’s Google Search Appliance
There are three options to prevent search engine “crawlers” from finding and indexing your site content:
- Using a “robots.txt” file to control search access over whole directories of your site
- HTML “meta” tags used to control search engine access to specific pages within your site
- Asking us to include your site or server URL in our “do not search” list on the Yale Google Search Appliance
Using Robots.txt files
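These are simple text files that you place in the main directory of your Web site to exclude or control search engine access. Crawlers look for the file only at the root level of a site, so a robots.txt file placed in a subdirectory is ignored; to restrict individual directories, list them in Disallow lines in the root-level file.
Use a text editor (or Microsoft Word, saving the document as plain text) to create a plain-text file named robots.txt, and enter this text in the file: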
User-agent: *
Disallow: /
Place the file in the same Web server directory as your Web site home page (the “root level” of your site). This will exclude your site from indexing by all of the major public search engines, and from Yale’s local Google Search Appliance. See the robots.txt reference site for further options on using robots.txt files to control how your site is searched.
http://www.robotstxt.org/wc/norobots.html
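Before you upload the file, it may help to confirm that the rules say what you intend. The following is a minimal sketch, not an official Yale tool: it uses Python's standard urllib.robotparser module to test the two-line file shown above. The helper name is_crawl_allowed and the www.example.edu address are placeholders invented for this illustration.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the robots.txt file shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper for illustration; it parses the text locally
    and makes no network requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # www.example.edu is a placeholder, not a real Yale address.
    allowed = is_crawl_allowed(
        ROBOTS_TXT, "Googlebot", "http://www.example.edu/index.html"
    )
    print("Crawling allowed:", allowed)
```

This checks only how a well-behaved crawler would read the rules. It does not show whether a particular search engine has already indexed your pages.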
Using HTML meta tags to control search engine access to specific pages
You can use standard HTML meta tags to tell search engine crawlers that you do not want them to index the content of the HTML file. Add this meta tag to the header area of your HTML page:
<head>
<meta name="robots" content="noindex, nofollow">
</head>
This will exclude the contents of that particular Web page from search engine indexing, and will prevent search engine crawlers from following links on the page.
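To confirm that a page actually carries the tag, you could scan its HTML for a robots meta tag. The sketch below is one possible check, assuming Python's standard html.parser module; the class RobotsMetaParser, the function robots_directives, and the sample page are hypothetical names made up for this example.

```python
from html.parser import HTMLParser

# Sample page for illustration, using the meta tag shown above.
SAMPLE_PAGE = """\
<html>
<head>
<meta name="robots" content="noindex, nofollow">
</head>
<body><p>Internal notes</p></body>
</html>
"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when an attribute has no value.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    found = robots_directives(SAMPLE_PAGE)
    print("noindex present:", "noindex" in found)
    print("nofollow present:", "nofollow" in found)
```

A page that returns an empty set carries no robots meta tag, so crawlers will treat it as indexable unless a robots.txt rule excludes it.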
The “do not search” list on Yale’s Google Search Appliance
You can contact the Yale Webmaster Team and tell us the Web site URL or Web server URL that you wish to exclude from Yale’s Google Search Appliance master index. This will exclude your site from local searches using the Yale Google Search Appliance, but will not exclude your site from public Web search services unless you also use other techniques (like robots.txt files or meta tags) to shield your content from public search engines. If you have confidential information within your Web site and need a higher level of access control, please contact the Yale Webmaster Team for alternatives for controlling the security of your Web content.
A warning about all these “no search” techniques
These search exclusion techniques do not work instantly. If your pages have been previously crawled and indexed, they may still appear in search results until the next time the search engine crawler visits the page and honors your “no search” request. For the Yale Google Search Appliance this currently takes up to about 72 hours, but on public search engines like Google.com the process of removing your content from their indexes may take much longer (sometimes weeks) because the large public search engines do not crawl your pages very often.