
HOW TO SEARCH THE WEB
by fravia+ ~ Letter 007 - (March - November) 1997

__The W3gate (image fetching)__ Search engines battles
Common errors and also How to evaluate the results of a search

Fetching sites and images: the W3gate

	A very interesting possibility is offered by the fantastic W3gate, 
a German server (how come German FTP services are so well developed?) 
that allows you INCREDIBLE sniffing on the WWW.
	Try for instance to send it the following email and you'll at once 
(well, as soon as you get the answer, say half an hour) understand what 
I mean:

To:	  w3mail@gmd.de
Subject:  nothing here
Text:	  get -a -img -l http://fravia.org/index.htm 

This IS the web-fetcher for all those that have slow connexions or that 
have been 'banned' from the Web for whatever reason.

If you need more info about the W3gate, just send a "help" message to the 
same address as above: w3mail@gmd.de
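
If you prefer to prepare such requests from a script, here is a minimal sketch in Python. It only BUILDS the message shown above, it does not send it. The function name build_w3gate_request and the sender address are my own inventions for illustration; the recipient, subject and command line are the ones given in this letter.

```python
from email.message import EmailMessage

W3GATE_ADDRESS = "w3mail@gmd.de"  # address given in this letter


def build_w3gate_request(url: str, sender: str) -> EmailMessage:
    """Build (but do not send) a W3gate fetch request for `url`."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = W3GATE_ADDRESS
    msg["Subject"] = "nothing here"
    # Command line copied from the example in this letter; the flags are
    # not interpreted here, they are just passed along to the gateway.
    msg.set_content(f"get -a -img -l {url}\n")
    return msg


if __name__ == "__main__":
    # "you@example.com" is a placeholder sender, not a real address.
    request = build_w3gate_request("http://fravia.org/index.htm", "you@example.com")
    print(request)
```

To actually mail it you would hand the message to your own mail server, for instance through smtplib's send_message; which server to use is up to you and is not covered here.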

__Search engines battles and spiders__
Each search engine uses a "crawler" or "spider" agent to gather web pages. Most have nicknames. You can tell if you have been visited by a crawler by checking your logs and looking for the various names which are often part of the crawler's host name.
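
Checking your logs by hand gets boring quickly, so here is a small Python sketch of the idea. It ASSUMES that every log line starts with the visitor's host name (as in the common log format) and simply looks for a few of the robot names listed in this letter inside that host name. The function name crawler_visits and the sample hosts are made up for illustration; adapt the name list and the field position to your own server.

```python
from collections import Counter

# A few robot names as listed in this letter (lower-cased for matching).
ROBOT_NAMES = ["scooter", "architext", "spidey"]


def crawler_visits(log_lines, robot_names=ROBOT_NAMES):
    """Count log lines whose host field contains a known robot name."""
    visits = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        host = fields[0].lower()  # assumption: host is the first field
        for name in robot_names:
            if name in host:
                visits[name] += 1
    return visits


if __name__ == "__main__":
    # Hypothetical log lines; the example.* hosts are placeholders.
    sample = [
        'scooter.example.com - - [20/Feb/1997:10:00:00 +0100] "GET /index.htm HTTP/1.0" 200 1024',
        'dialup42.example.net - - [20/Feb/1997:10:05:00 +0100] "GET /main.htm HTTP/1.0" 200 2048',
        'spidey.example.org - - [20/Feb/1997:11:00:00 +0100] "GET /robots.txt HTTP/1.0" 404 0',
    ]
    for name, count in crawler_visits(sample).items():
        print(name, count)
```

A plain substring match is crude: a host that merely happens to contain one of the names will be counted as well, so treat the result as a hint and not as proof.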

Do not believe that the better-known search engines are 
also the best ones... alliances (and money) unfortunately play 
a huge role in these matters. For example, Infoseek's strong tie 
to Netscape guarantees that many people use the service. The 
World Wide Web Worm has no Netscape tie and no major commercial
backing, so fewer people use it.

AltaVista partnered with Yahoo in June 1996, becoming Yahoo's
"preferred" search engine (see below). AltaVista is very
vulnerable to spammers because of its near real-time indexing:
it is easy to submit slightly different variations of the same 
page in an attempt to push others out of the
top ten. ROBOT NAME: SCOOTER

Excite was launched in late 1995 and grew quickly, eating
its competitors. In July 1996, Excite purchased the Magellan
search engine and directory. In November 1996, it acquired
Webcrawler. Magellan and Webcrawler have not yet been merged
with Excite (eventually Magellan will be). On January 22 
Webcrawler took over Magellan's top spot on the Netscape 
search page, where Excite also has a spot, giving it two
of the five top slots. ROBOT NAME: ARCHITEXT

HotBot was launched in May 1996 and represents Wired's entry
into the search engine competition. The site is powered by
the Inktomi search engine, but that does not mean that it is
the same as the UC Berkeley Inktomi catalog: it just uses the
same technology that created that catalog. ROBOT NAME: SPIDER

InfoSeek, around since early 1995, is well known and well
connected. In fall 1996 it introduced the new 'Ultrasmart / 
Ultraseek' index (the commercial idiots always choose awful, 
stupid names), with 50 million URLs. Ultraseek is
the same as Ultrasmart, plus some additional information
on the found sites. ROBOT NAME: SLURP THE WEB

Lycos, around since May 1994, is one of the oldest search 
engines. It was the FIRST engine to combat spamming attempts, 
in May 1996. ROBOT NAME: HOUND

Open Text is an index that has been around since early 1995,
and until June 1996 it was Yahoo's preferred search engine partner.
It's a search engine "in decline". ROBOT NAME: xxx

Webcrawler opened to the public in April 1994, having started as a
research project at the University of Washington. It was purchased by
AOL in March 1995, which used it as its preferred service until
November 1996, when Excite, a Webcrawler competitor, acquired
the service. ROBOT NAME: SPIDEY

Yahoo, around since late 1994, may be the oldest major web site
directory. It is a directory (not a search engine) based on
user submissions. If a search of Yahoo's catalog doesn't fish anything,
users should then consult a search engine: Yahoo pipes the
query to any of the major search engines with a click. There
are so many people using Yahoo that the search engines listed
FIRST on Yahoo's page have a strategic advantage over the others.
AltaVista is its preferred search engine.

Since Netscape Navigator is the browser that people use, and since
browsers have a search button that connects to a pre-defined page,
and since people are idiots that would not know how to change
such a setting even if you explained it to them (of course you
have YOUR OWN search engine page on YOUR HARDDISK connected to
that button; if you do not, be ashamed and copy at once my
main.htm to your harddisk, you'll later modify it as you
fancy), the page connected there IS important. Millions push
that button daily... search engines and directories had to
pay Netscape 5 million dollars each to have a top spot on that
page. AOL directs its suckers to Excite (strategic partner) and 
Webcrawler (formerly owned); Compuserve sends its suckers to
Lycos.

__Common errors__

[ERROR 400]
YOUR REQUEST COULD NOT BE UNDERSTOOD BY THE SERVER
Either your browser is malfunctioning or your Internet 
connection is unreliable

[ERROR 401]
YOU ARE UNAUTHORIZED TO ACCESS THAT DOCUMENT/WEBSITE
Proper authentication is required: ask the organisation that runs the site

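[ERROR 403]
ACCESS TO THAT DOCUMENT/WEBSITE IS FORBIDDEN
Contact the site maintainers

[ERROR 404]
THAT DOCUMENT COULD NOT BE FOUND ON THE SERVER
Check the URL you typed (punctuation AND capitalisation)
Slashes MUST be forward-facing (/)

[ERROR 505]
THE SERVER DOES NOT SUPPORT THE HTTP VERSION OF YOUR REQUEST
Contact the site maintainers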
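
If you want the standard wording for a numeric error code without remembering it, Python's standard library knows the official reason phrases. The following is only a small lookup sketch; the function name explain_status is mine, the phrases come from the library's http.HTTPStatus table.

```python
from http import HTTPStatus


def explain_status(code: int) -> str:
    """Return 'code phrase' for a known HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return f"{code} (not a status code known to this Python version)"
    return f"{status.value} {status.phrase}"


if __name__ == "__main__":
    for code in (400, 401, 403, 404, 505):
        print(explain_status(code))
```

Note that this only tells you what the code means; what to DO about it (retype the URL, ask the maintainers) is still up to you.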

__How to evaluate the results of a search__

This is usually the hardest and most time-consuming part of a search. The number of hits you obtain can range from none to hundreds of thousands, and their relevance or usefulness can vary from considerable to negligible. There are some things you can do to get more relevant hits out of a smaller total number.

Too many hits are caused by the use of queries that are too general. Try using more specific terms. The more exact your query, the better your results.
Too few hits are usually caused by too restrictive a query. Broaden your search by removing the least required keywords or operators.
Try starting with a subject search and continue down the path to the last relevant title. At this point, switch to a keyword search. This limits the search to the last subject title, which will reduce the hits and improve their relevancy.
Compose the query with the appropriate operators for the particular search tool that you select. A large number of irrelevant hits is often due to a powerful search engine being misguided by a badly composed query.
Narrow the scope of your search by choosing a specific field of search offered by the search engine, such as a time period or geographical area.
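
The "too few hits" advice can be sketched in a few lines of Python: start from the most restrictive query and drop the least required keyword at each step. This is only an illustration. It ASSUMES that your search tool accepts a plain AND operator (check its own syntax) and that you list your keywords from most to least important; the function name broaden and the sample keywords are mine.

```python
def broaden(keywords):
    """Yield queries from most to least restrictive.

    `keywords` is ordered from most to least important; each step drops
    the least required keyword still present.
    """
    terms = list(keywords)
    while terms:
        yield " AND ".join(terms)  # assumption: the engine accepts AND
        terms.pop()  # remove the least required keyword


if __name__ == "__main__":
    for query in broaden(["w3gate", "email", "fetch", "images"]):
        print(query)
```

Try the queries in the order given and stop at the first one that brings back a manageable number of hits.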

Success in any particular search query is usually more a question of which search tool has the best database for the subject and how the information is organized for retrieval. This is why it is often necessary to try a number of different search tools when searching for obscure information.

Some search engines list the hits by titles, some by brief text and some give you a choice. When available choose the brief text, as it is easier to evaluate. Even so, it is often necessary to click the link to see the entire document before you can assess its content. Some sites may not be of apparent interest, but will contain links that have great relevancy. Some searches yield the desired information quickly, and some you may just have to plod through. Another problem is caused by search engines that DO NOT list the DATES of the retrieved pages.
This is VERY BAD, because the 'volatility' of Internet will probably have caused the disappearance of many of those sites (I for instance don't even bother to check pages with a 'fetch-date' older than three months when I am confronted with many hits).
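
The three-months rule is easy to apply mechanically once you have the hits in a list. A minimal Python sketch, ASSUMING each hit is a (url, fetch_date) pair and taking "three months" as 90 days; hits without a date are dropped as well, which is my own choice. The function name fresh_hits and the sample URLs are made up.

```python
from datetime import date, timedelta


def fresh_hits(hits, today, max_age_days=90):
    """Keep hits whose fetch date is at most `max_age_days` old.

    `hits` is a list of (url, fetch_date) pairs; fetch_date may be None
    when the search engine did not list one, and such hits are dropped.
    """
    cutoff = today - timedelta(days=max_age_days)
    return [
        (url, fetched)
        for url, fetched in hits
        if fetched is not None and fetched >= cutoff
    ]


if __name__ == "__main__":
    # Hypothetical hits; example.com URLs are placeholders.
    hits = [
        ("http://example.com/new.htm", date(1997, 10, 15)),
        ("http://example.com/old.htm", date(1997, 2, 20)),
        ("http://example.com/undated.htm", None),
    ]
    for url, fetched in fresh_hits(hits, today=date(1997, 11, 1)):
        print(url, fetched)
```

When you have only a few hits there is no reason to be this strict: raise max_age_days or check the old pages anyway.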

As you gain experience, you will find out which search tools are most appropriate for your particular interests and how best to evaluate the hits.

Go ahead, enjoy!

fravia+, February-November 1997

FraVia 20 Feb 1997