One thing (among many others) the unaware learn from your site is that the search engines don't index all of the web (far from it...), and that they have serious limitations. Nevertheless, I didn't find much information on your pages about these limitations, other than database capacity. So, I've decided to write this short essay about this (major) problem.
PS : My reference for the actual number of Apache servers can be studied here :
http://www.netcraft.com/survey/
Microsoft's servers lose the competition... as they deserve!
Rumsteack
Title : Search engines' limitations, or What search engines don't index, and why ?
These few lines should be useful for people who don't know the problems the
s.e. encounter when indexing web sites. It would be easy to think that
ALL the data on indexed pages is indeed searchable,
because it would be easy to think that
submitting a site to a s.e. necessarily gets ALL its content
indexed. Too easy a thought actually: that's not true at all!
There are several things which can prevent some parts of a site from being
indexed by a search engine spider.
First, the 'voluntary' things: you are aware of what you are doing.
In this category, we have the possible use of:
- robots.txt
- robots meta-tag
- special commands
- domain or password restrictions
Let's see each of them in detail.

- robots.txt
This text file convention was set up in 1994, following some complaints from the
sysadmins about web spiders indexing 'indelicate' matters. With it, a site owner
can indicate which parts of his site he doesn't want the well-behaved spiders to index.
This file is placed in the root directory of a site (that is the only place
where the robots look for it).
Actually, when a well-behaving robot visits a site, it first seeks the /robots.txt
file. Provided it finds one, it will obey it and won't index any page disallowed there.
Nevertheless, keep in mind that you can write your OWN bots, which you can make
as 'irresponsible' as you like (or you can use utilities like Teleport Pro, for
example, with their "obey robots.txt file" box unchecked !!)
The content of this file (which can be viewed with your browser, by the
way... See where I want to go? :-D ) consists of lines which all look like this :
<field>: <optionalspace> <value> <optionalspace>
# the field is case insensitive, but the value is case sensitive !
The possible fields are (one line per bot/directory, as previously said;
unrecognised fields are simply ignored) :
User-agent: the name of the spider(s) you (don't) like, or * (for all bots)
Disallow: the directories (or the files) you do not want the previous bot to
index; / forbids the whole site, while an empty value allows the whole site
# : for inserting comments in the file
Of course, multiple combinations are allowed, and you can even think of
sites which are "racist" toward some specific bots, and not toward others.
For example :
User-agent: Bot I like
Disallow:
User-agent: Bot I dislike
Disallow: /
User-agent: *
Disallow: /
A site with a robots.txt file of this kind can be a pain for search
engines. BUT, since you can view the file with your browser, you are able
to know which are the "restricted" parts of a site with no effort whatsoever!
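By the way, here is a minimal sketch (in Python, using the standard
urllib.robotparser module; the URL and the bot name are mere placeholders) of
the check a well-behaving bot performs before fetching anything :

import urllib.robotparser

# Fetch and parse the site's /robots.txt, like a polite spider would
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# May a spider named "Bot I like" fetch this page ?
print(rp.can_fetch("Bot I like", "http://www.example.com/private/page.html"))

An 'irresponsible' bot, of course, simply never performs this check.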
- robots meta-tag
Note that not every spider obeys this tag. That's one of the few
ways a web-page AUTHOR (a server admin is not needed anymore) can prevent some
pages of his site from being indexed.
This meta-tag (like all meta-tags, by the way) should be placed in the
head section of your page. It is one line, like this:
<meta name="robots" content=""> The content value can be : "all", "none" or mixes.
"index" mean that the page will be indexed.
"noindex" is the contrary.
"follow" means that the spider will follow the links present on the page.
"nofollow" is the contrary. "all" is equivalent to "index,
follow". It is also the default value, if
robots meta-tag is not present.
"none" is equivalent to "noindex, nofollow".
So, when searching for specific URLs (url: or u:), keep in mind that if
the spider behind the s.e. you use obeys this meta-tag, you can't
reach a page which has been protected with the "noindex" value.

- special commands
These commands enable the author to choose what he wants to be indexed on
a page. For example, he can choose
to keep some passages invisible to search engines (for whatever reason!).
Nevertheless, the only one I know of is the command capable of stopping
Infoseek's spider. Here it is :
<!--stopindex--> Write whatever you want here <!--startindex-->
Authors are consequently able to choose what they want to have indexed or not.
All the *special* words (mp3, appz, warez, hack,
commercial crap, smut sites nuking, destroy commercial bastards, :-D ) can
be deleted from the 'vision field' of the search engines, even if
the page itself is indexed. This is VERY useful to know when searching for - ahem -
special content.
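For instance, a page built like this (a hypothetical fragment; remember that,
as far as I know, only Infoseek's spider honoured these comments) :
<p>This paragraph will be indexed normally.</p>
<!--stopindex-->
<p>mp3 warez appz... none of these words will reach the index.</p>
<!--startindex-->
<p>And this paragraph will be indexed again.</p>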
- Domain or password restrictions
The most commonly known method is the .htaccess file.
Within this file, you set up a list of the resources you do not want
everyone to be able to see. In other words, you create a database of logins
and passwords, associated with the different parts of your site.
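A minimal sketch of such an .htaccess file (the realm name and the path are
placeholders; the logins and passwords themselves live in a separate
.htpasswd file) :

AuthType Basic
AuthName "private area"
AuthUserFile /home/mysite/.htpasswd
require valid-user

Any visitor (spider included) who cannot produce a valid login/password
receives a 401 error instead of the protected pages.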
The use of this file was introduced in the NCSA servers, and continued
with the Apache servers. So, given that there are
7.8+ million Apache servers in use, you see the huge quantity of data
which can be blocked from indexing, just with this (basic) method.
I wrote "just with this method" because there are many others: SQL and other
databases have similar possibilities as well.
Moreover, Javascript offers some blocking possibilities (yes, it is possible
to protect a site with vanilla javascript: see 'gate keeper' scripts).
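A minimal sketch of such a 'gate keeper' (hypothetical, and weak on purpose :
the password is nothing but the name of the hidden page) :

<form onsubmit="location.href = this.pwd.value + '.html'; return false;">
  <input type="password" name="pwd">
  <input type="submit" value="Enter">
</form>

A spider will never stumble on the hidden page, since no plain link points to
it; but as you can see, there is nothing here to crack, only something to guess.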
In all these cases, you will need some cracking/hacking skills if you want to
make these sites more *cooperative*.
Consequently, you see that some private parts (generally, for obvious reasons,
the most interesting ones) of a site can easily be protected (from the search
engine spiders, but also from YOU).
That's also finished for this part.

Given that my little knowledge of this lore ends here, let's talk about the
other things that our beloved tools don't like.
That is, let's talk about the content they CAN'T index. In fact, spiders not
only can be blocked by admins/authors through the above procedures, but also
have some problems when they bump into :
- HTML coding errors
- special file formats
- big files
- frames and image maps
- dynamic pages
I'm going to talk about the ones I know, but you should consider what
follows ONLY as a base to begin your own future work and research in this field.

- HTML coding errors
Search engines parse HTML files, like your browser does. But the s.e. may
not be as tolerant of the coding errors that your browser would gladly
allow to go by.
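A classic example (a hypothetical fragment) is a comment that is never closed :

<!-- navigation bar
<a href="links.html">my links page</a>

Your browser may still manage to render the link, but a strict parser is
entitled to treat everything after the <!-- as one endless comment, so the
link may never be followed nor the text indexed.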
A syntax checker, or better, a good knowledge of HTML, could be helpful to
avoid these mistakes, by the way (or to use them on purpose :-D )
- special file formats
There are certain kinds of files that the search engines simply don't
read. Actually, if a plug-in or an add-on software is needed to read a
file, then the search engine will (probably; that's not true for all of
them) ignore it.
Some examples are PDF, PostScript, RTF, Word, Excel, and PowerPoint files.
A links search can be of help here.
- big files
Though theoretically search engines offer full-text indexing, in practice
there is an upper limit, beyond which robots stop indexing a file.
- frames and image maps
These methods of enhancing your site cause problems for the s.e.: frames
and image maps are not indexed by all of them. So, the content (including
links) placed inside image maps and framed sites is not scanned by all of
the search engines.
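To see why, consider what a spider actually fetches when it visits a framed
site (a hypothetical fragment) : the frameset page itself contains no text
and no ordinary links, only references to the real content :

<frameset cols="20%,80%">
  <frame src="menu.html">
  <frame src="content.html">
</frameset>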
But maybe this is not so bad after all!
- dynamic pages
On some servers most of the web pages are dynamic (produced from
databases or other applications, in response to a request), and they do
not exist as static HTML files. Consequently, the search engines CANNOT index them.
To better understand the problem, we can say that all the URLs for
database access that contain a "?" symbol in them (a hypothetical example :
http://www.somesite.com/cgi-bin/search.cgi?topic=mp3) are not indexed by
any s.e. I hope you see what a huge amount of information you are losing here!!

That explains why you must use specific search engines when searching for
specific subjects: the big s.e. don't have any access to the specialised
databases that exist (whose content is called 'the Invisible Web'). So,
different targets, different tools, as someone I don't recall said :-D
I hope you have enjoyed this little text, and I hope you will now be aware
of the REAL limitations of the s.e. (that is, not only their size), which
oblige us to find OTHER tools for searching purposes (combing and the like).
Rumsteack, May 2000