Hi, it's me again on a sunny Monday. Today we will discuss how search engines work - crawling and indexing. It may be useful for you all; and for those who already know this, I apologize, I just wanted to share :D
A search engine's core functions are spidering, indexing, and link analysis. It returns results ranked according to what its spiders found while crawling your website. The very first thing it sees in the indexing process is a URL (Uniform Resource Locator).
So, how do search engines find your website:
* Submission via web form (Google)
* Submission via XML sitemap (the "Big 4")
* Paid Inclusion (Yahoo)
* Found Links (All)
But these days, submitting your website to search engines by hand is mostly a waste of your time. So what is the fastest way? One way is to get other sites to link back to ours. An XML sitemap is also a good thing, because it gives the search engine a one-stop "change list" of updated pages. There is a common mistake in using XML sitemaps, though: people think they can get special priority for certain pages. An XML sitemap is not going to tell search engines which pages you want to prioritize.
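For illustration, here is a minimal sitemap in the standard sitemaps.org format (the URL and date are made-up examples). The <lastmod> field is exactly the "change list" hint described above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/somepage.html</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
</urlset>
```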
Let us continue; now we talk about spiders and crawling. Spiders, robots, and crawlers are the same thing under different names: software agents sent out by the search engine to check your website. On your side there is the "Robots Exclusion Protocol" (the robots.txt file), whose job is to tell the search engine spider which pages may be checked and which may not. After the spider does this, it keeps all the information for the index later.
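To make the Robots Exclusion Protocol concrete, here is a minimal Python sketch of how a polite spider checks robots.txt before fetching a page. The rules and URLs are made-up examples, and real crawlers do much more:

```python
# A minimal sketch using Python's built-in urllib.robotparser.
# The robots.txt content and URLs here are made-up examples.
import urllib.robotparser

# An example robots.txt, parsed in memory (no network needed)
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# The spider asks before fetching each page
print(rp.can_fetch("Googlebot", "https://www.example.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/index.html"))         # True
```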
Fasten your seat belts, we are getting more technical... HTTP, the Hyper Text Transfer Protocol, is another set of rules the spider follows, because HTTP is the "secret language" of the web. It is the same protocol your web browser uses; you may not realize it, but spiders speak it too. They mainly use HEAD, GET, and POST requests. The spider sends a request to your website, and your website answers with a status code. You can check the server response using WebBug from www.cyberspyder.com (tip: use HTTP/1.1 only), or you can watch it in Firefox with the "Live HTTP Headers" plugin. This is not something you must master, but it is good to know a little about it.
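If you want to poke at this yourself without extra tools, here is a rough Python equivalent of what WebBug or Live HTTP Headers shows you (the hostname is just an example):

```python
# Send a HEAD request and print the status line and response headers.
# www.example.com is a made-up hostname.
import http.client

conn = http.client.HTTPConnection("www.example.com")
conn.request("HEAD", "/")        # HEAD asks for headers only, no page body
resp = conn.getresponse()

print(resp.status, resp.reason)  # e.g. "200 OK"
for name, value in resp.getheaders():
    print(f"{name}: {value}")
conn.close()
```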
A typical request from a search engine spider looks like this:
* GET /somepage.html HTTP/1.1
* Host: www.YourURL.com
* If-Modified-Since: (the date of the spider's last visit)
* User-Agent: (Googlebot, etc.)
And the response our website gives back is usually one of these:
* 200 OK - the server sends the page
* 301 Moved Permanently - the page has moved to a new URL
* 302 Found - the page is at a temporary URL
* 304 Not Modified - no change since the last check
* 404 Not Found / 410 Gone - the page is missing or has been removed
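Putting the request and response lists together, here is a minimal Python sketch of a spider-style conditional GET. The hostname, page, and date are made-up; a real spider would reuse the date from its previous visit:

```python
# A conditional GET with If-Modified-Since and a spider-style User-Agent.
# All names here (hostname, page, bot name, date) are made-up examples.
import http.client

headers = {
    "If-Modified-Since": "Mon, 01 Jan 2024 00:00:00 GMT",
    "User-Agent": "ExampleBot/1.0",
}

conn = http.client.HTTPConnection("www.example.com")
conn.request("GET", "/somepage.html", headers=headers)
resp = conn.getresponse()

if resp.status == 304:
    print("304 Not Modified - nothing changed, skip re-indexing")
elif resp.status == 200:
    body = resp.read()
    print("200 OK - fetch the body and re-index it")
else:
    print(resp.status, resp.reason)  # 301/302/404/410/5xx and friends
conn.close()
```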
Let us move to the next topic: parsing and indexing, or what happens after the spider has found our website. First, the search engine processes our pages:
* Strip out JavaScript
* Strip out most formatting
* Strip out IFRAME
* Keep only the tags that are meaningful for indexing: title, meta description, meta keywords, H1-H6, A, IMG (a toy parser for this step is sketched right after this list)
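Here is the promised toy parser: a hedged Python sketch that ignores everything except the meaningful tags listed above. Real engines are far more sophisticated; the HTML snippet at the bottom is just a made-up example:

```python
# A toy illustration of the parsing step, using Python's built-in html.parser.
from html.parser import HTMLParser

TEXT_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6", "a"}

class IndexParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None   # text-bearing tag we are inside, if any
        self.found = []       # (what, value) pairs worth indexing

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in TEXT_TAGS:
            self.current = tag
            if tag == "a" and "href" in attrs:
                self.found.append(("link", attrs["href"]))
        elif tag == "img" and "alt" in attrs:
            self.found.append(("img-alt", attrs["alt"]))
        elif tag == "meta" and attrs.get("name") in ("description", "keywords"):
            self.found.append(("meta-" + attrs["name"], attrs.get("content", "")))

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.found.append((self.current, data.strip()))

parser = IndexParser()
parser.feed('<title>My Page</title><h1>Hello</h1><a href="/about">About us</a>')
print(parser.found)
# [('title', 'My Page'), ('h1', 'Hello'), ('link', '/about'), ('a', 'About us')]
```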
Content is then stored during the indexing process. Basically, the index is built from "words" only; no pictures. When the links on your web page are stored, duplicate links and no-follow links are not kept in the search engine's special link index. One thing you need to watch is the quality of your links: do not link to pages that return 404/410/5xx status codes, because dead links reduce the quality of your website. A simple checker for this is sketched below.
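And here is the promised link checker: a small Python sketch that probes each outgoing link with a HEAD request and flags dead ones (the URLs are made-up examples):

```python
# Probe outgoing links and flag those returning 404/410/5xx.
# The URLs below are made-up examples.
import urllib.request
import urllib.error

links = [
    "https://www.example.com/good-page.html",
    "https://www.example.com/deleted-page.html",
]

for url in links:
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req) as resp:
            print(url, "->", resp.status)          # 200 is what you want
    except urllib.error.HTTPError as e:
        if e.code in (404, 410) or e.code >= 500:
            print(url, "-> broken link:", e.code)  # fix or remove these
        else:
            print(url, "->", e.code)
```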
Let us end with a few things you need to know:
* Search engines do not save whole web pages - they index the text on those pages
* Search engines do not search the live web - they only search pages that are already in their index
Hopefully this was helpful. I am waiting for your feedback in the comment box. Thanks!