One thing (among many others) the unaware learn from your site is that the search engines don't index all of the web (far from it...), and that they have serious limitations. Nevertheless, I didn't find much information on your pages about these limitations, other than database capacity. So, I've decided to write this short essay about this (major) problem.
PS : My reference for the actual number of Apache servers can be studied here :
http://www.netcraft.com/survey/
Microsoft's servers lose the competition... as they deserve!
Rumsteack
Title : Search engines' limitations, or What search engines don't index, and why ?
These few lines should be useful for people who don't know the problems the
s.e. encounter when indexing web sites. It would be easy to think that
ALL the data on indexed pages is indeed searchable,
because it would be easy to think that
submitting a site to a s.e. necessarily gets ALL its content
indexed. Too easy a thought actually: that's not true at all!
There are several things which can prevent some parts of a site from being
indexed by a search engine spider.
First, the 'voluntary' things: you are aware of what you are doing.
In this category, we have the possible use of:
- robots.txt
- robots meta-tag
- special commands
- domain or password restrictions
Let's see each of them in detail.

- robots.txt
This text file convention was set up in 1994, following some complaints from the
sysadmins about web spiders indexing 'indelicate' matters. With it, a site owner
can indicate which parts of his site he doesn't want the well-behaved spiders to index.
This file is placed in the root directory of a site (that is the only place
where the robots look for it).
Actually, when a well-behaving robot visits a site, it first seeks the /robots.txt
file. Provided it finds one, it will obey it and won't index any page disallowed there.
Nevertheless, keep in mind that you can write your OWN bots, which you can make
as 'irresponsible' as you like (or you can use utilities like Teleport Pro, for
example, with their "obey robots.txt file" box unchecked !!)
The content of this file (which can be viewed with your browser, by the
way... See where I want to go? :-D ) consists of lines which all look like this :
<field>: <optionalspace> <value> <optionalspace>
# the field is case insensitive, but the value is case sensitive !
The possible fields are (one line per bot/directory, as previously said;
unrecognised fields are simply ignored) :
User-agent: the name of the spider(s) you (don't) like, or * (for all bots)
Disallow: the directories (or the files) you do not want the previous bot to
index; / forbids the whole site, while an empty value allows the whole site
# : for inserting comments in the file
Of course, multiple combinations are allowed, and you can even think of
sites which are "racist" toward some specific bots, and not toward others.
For example :
User-agent: Bot I like
Disallow:
User-agent: Bot I dislike
Disallow: /
User-agent: *
Disallow: /
A site with a robots.txt file of this kind can be a pain for search
engines. BUT, since you can view the file with your browser, you are able
to know which are the "restricted" parts of a site with no effort whatsoever!
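By the way, here is a minimal sketch (in Python, using the standard
urllib.robotparser module; the URL and the bot name are mere placeholders) of
the check a well-behaving bot performs before fetching anything :

import urllib.robotparser

# Fetch and parse the site's /robots.txt, like a polite spider would
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# May a spider named "Bot I like" fetch this page ?
print(rp.can_fetch("Bot I like", "http://www.example.com/private/page.html"))

An 'irresponsible' bot, of course, simply never performs this check.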
- robots meta-tag
Note that not every spider obeys this tag. That's one of the few
ways a web-page AUTHOR (a server admin is not needed anymore) can prevent some
pages of his site from being indexed.
This meta-tag (like all meta-tags, by the way) should be placed in the
head section of your page. It is one line, like this:
<meta name="robots" content=""> The content value can be : "all", "none" or mixes.
"index" mean that the page will be indexed.
"noindex" is the contrary.
"follow" means that the spider will follow the links present on the page.
"nofollow" is the contrary. "all" is equivalent to "index,
follow". It is also the default value, if
robots meta-tag is not present.
"none" is equivalent to "noindex, nofollow".
So, when searching for specific URLs (url: or u:), keep in mind that if
the spider behind the s.e. you use obeys this meta-tag, you can't
reach a page which has been protected with the "noindex" value.

- special commands
These commands enable the author to choose what he wants to be indexed on
a page. For example, he can choose
to keep some passages invisible to search engines (for whatever reason!).
Nevertheless, the only one I know of is the command capable of stopping
Infoseek's spider. Here it is :
<!--stopindex--> Write whatever you want here <!--startindex-->
Authors are consequently able to choose what they want to have indexed or not.
All the *special* words (mp3, appz, warez, hack,
commercial crap, smut sites nuking, destroy commercial bastards, :-D ) can
be deleted from the 'vision field' of the search engines, even if
the page itself is indexed. This is VERY useful to know when searching for - ahem -
special content.
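For instance, a page built like this (a hypothetical fragment; remember that,
as far as I know, only Infoseek's spider honoured these comments) :
<p>This paragraph will be indexed normally.</p>
<!--stopindex-->
<p>mp3 warez appz... none of these words will reach the index.</p>
<!--startindex-->
<p>And this paragraph will be indexed again.</p>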
- Domain or password restrictions
The most commonly known method is the .htaccess file.
Within this file, you set up a list of the resources you do not want
everyone to be able to see. In other words, you create a database of logins
and passwords, associated with the different parts of your site.
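A minimal sketch of such an .htaccess file (the realm name and the path are
placeholders; the logins and passwords themselves live in a separate
.htpasswd file) :

AuthType Basic
AuthName "private area"
AuthUserFile /home/mysite/.htpasswd
require valid-user

Any visitor (spider included) who cannot produce a valid login/password
receives a 401 error instead of the protected pages.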
The use of this file was introduced in the NCSA servers, and continued
with the Apache servers. So, given that there are
7.8+ million Apache servers in use, you see the huge quantity of data
which can be blocked from indexing, just with this (basic) method.
I wrote "just with this method" because there are many others: SQL and other
databases have similar possibilities as well.
Moreover, Javascript offers some blocking possibilities (yes, it is possible
to protect a site with vanilla javascript: see 'gate keeper' scripts).
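A minimal sketch of such a 'gate keeper' (hypothetical, and weak on purpose :
the password is nothing but the name of the hidden page) :

<form onsubmit="location.href = this.pwd.value + '.html'; return false;">
  <input type="password" name="pwd">
  <input type="submit" value="Enter">
</form>

A spider will never stumble on the hidden page, since no plain link points to
it; but as you can see, there is nothing here to crack, only something to guess.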
In all these cases, you will need some cracking/hacking skills if you want to
make these sites more *cooperative*.
Consequently, you see that some private parts (generally, for obvious reasons,
the most interesting ones) of a site can easily be protected (from the search
engine spiders, but also from YOU).
That's also finished for this part.

Given that my little knowledge of this lore ends here, let's talk about the
other things that our beloved tools don't like.
That is, let's talk about the content they CAN'T index. In fact, spiders not
only can be blocked by admins/authors through the above procedures, but also
have some problems when they bump into :
- HTML coding errors
- special file formats
- big files
- frames and image maps
- dynamic pages
I'm going to talk about the ones I know, but you should consider what
follows ONLY as a base to begin your own future work and research in this field.

- HTML coding errors
Search engines parse HTML files, like your browser does. But the s.e. may
not be as tolerant of the coding errors that your browser would gladly
allow to go by.
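A classic example (a hypothetical fragment) is a comment that is never closed :

<!-- navigation bar
<a href="links.html">my links page</a>

Your browser may still manage to render the link, but a strict parser is
entitled to treat everything after the <!-- as one endless comment, so the
link may never be followed nor the text indexed.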
A syntax checker, or better, a good knowledge of HTML, could be helpful to
avoid these mistakes, by the way (or to use them on purpose :-D )
- special file formats
There are certain kinds of files that the search engines simply don't
read. Actually, if a plug-in or an add-on software is needed to read a
file, then the search engine will (probably; that's not true for all of
them) ignore it.
Some examples are PDF, PostScript, RTF, Word, Excel, and PowerPoint files.
A links search can be of help here.
- big files
Though theoretically search engines offer full-text indexing, in practice
there is an upper limit, beyond which robots stop indexing a file.
- frames and image maps
These methods of enhancing your site cause problems for the s.e.: frames
and image maps are not indexed by all of them. So, the content (including
links) placed inside image maps and framed sites is not scanned by all of
the search engines.
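To see why, consider what a spider actually fetches when it visits a framed
site (a hypothetical fragment) : the frameset page itself contains no text
and no ordinary links, only references to the real content :

<frameset cols="20%,80%">
  <frame src="menu.html">
  <frame src="content.html">
</frameset>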
But maybe this is not so bad after all!
- dynamic pages
On some servers most of the web pages are dynamic (produced from
databases or other applications, in response to a request), and they do
not exist as static HTML files. Consequently, the search engines CANNOT index them.
To better understand the problem, we can say that all the URLs for
database access that contain a "?" symbol in them (a hypothetical example :
http://www.somesite.com/cgi-bin/search.cgi?topic=mp3) are not indexed by
any s.e. I hope you see what a huge amount of information you are losing here!!

That explains why you must use specific search engines when searching for
specific subjects: the big s.e. don't have any access to the specialised
databases that exist (whose content is called 'the Invisible Web'). So,
different targets, different tools, as someone I don't recall said :-D
I hope you have enjoyed this little text, and I hope you will now be aware
of the REAL limitations of the s.e. (that is, not only their size), which
oblige us to find OTHER tools for searching purposes (combing and the like).
Rumsteack, May 2000