What is a search engine and how does it work?

Hello, dear readers of the blog site. When the Internet was young, its few users could get by with their own bookmarks. However, as you remember, the network grew exponentially, and very soon it became much harder to navigate all its diversity.

Then directories appeared (Yahoo, Dmoz and others), in which their authors added various sites and sorted them into categories. This immediately made life easier for the then still not very numerous users of the global network. Many of these catalogs are still alive today.

But after some time, the size of their databases became so large that the developers first thought about creating a search within them, and then about creating an automated system for indexing all Internet content in order to make it accessible to everyone.

The main search engines of the Russian-speaking segment of the Internet

As you understand, this idea was implemented with stunning success, though everything turned out well only for a handful of chosen companies that managed not to get lost on the Internet. Almost all search engines from the first wave have now either disappeared, languished, or been bought by more successful competitors.

A search engine is a very complex and, importantly, very resource-intensive mechanism (meaning not only material resources, but human ones too). Behind the seemingly simple Yandex home page, or its ascetic analogue from Google, stand thousands of employees, hundreds of thousands of servers and billions in investments needed for this colossus to keep operating and remain competitive.

Entering this market now and starting from scratch is more utopia than a real business project. For example, Microsoft, one of the world's richest corporations, has been trying to gain a foothold in search for decades, and only now is its search engine Bing slowly beginning to meet expectations. Before that there was a whole string of failures and setbacks.

And that is with deep pockets; what can be said about entering this market without serious financial backing? For example, the domestic search engine Nigma has a lot of useful and innovative things in its arsenal, but its traffic is thousands of times lower than that of the Russian market leaders. For comparison, take a look at the daily Yandex audience:

In this regard, we can assume that the list of the main (best and luckiest) search engines of the RuNet and of the whole Internet has already been formed, and the only intrigue left is who will eventually swallow whom, or how their market shares will be distributed if they all manage to stay afloat.

The Russian search engine market is quite transparent: there are two or three main players and a couple of minor ones. A rather unique situation has developed in RuNet which, as I understand it, has been repeated in only two other countries in the world.

I am talking about the fact that Google, having come to Russia in 2004, has still not been able to take the lead. Google actually tried to buy Yandex around that time, but the deal fell through, and now Russia, along with the Czech Republic and China, is one of the places where the almighty Google, if not defeated, has at least met serious resistance.

In fact, anyone can see the current state of affairs among the best RuNet search engines. It is enough to paste this URL into your browser's address bar:

http://www.liveinternet.ru/stat/ru/searches.html?period=month;total=yes

The thing is that most Russian-language sites use the LiveInternet visitor counter, so its statistics are quite representative.

After entering this URL you will see a picture that is not very attractive or presentable, but one that reflects the essence of the matter well. Pay attention to the top five search engines sending traffic to Russian-language sites:

Yes, of course, not all resources with Russian-language content live in the RU zone. There are also the SU and RF zones, and general zones like COM or NET are full of projects aimed at RuNet, but the sample is still quite representative.

This data can be presented more colorfully, as someone once did for an online presentation:

This doesn't change the essence: there are a couple of leaders and several search engines trailing very, very far behind. By the way, I have already written about many of them. Sometimes it can be quite interesting to plunge into a success story or, conversely, to dig into the reasons for the failure of a once promising search engine.

So, in order of importance for Russia and the Runet as a whole, I will list them and give them brief characteristics:

    Google search has already become a household word for many people on the planet. In this search engine I liked the "translated results" option, when you received answers from all over the world but in your native language; unfortunately, it is no longer available (at least on google.ru).

    Lately I have also been puzzled by the quality of their output (the Search Engine Result Page). Personally, I always search first in Yandex, the "mirror of the RuNet" (force of habit, I suppose), and turn to Google only if I cannot find an intelligible answer there.

    Usually its results made me happy, but lately they have only puzzled me: sometimes real nonsense comes out. It is possible that the struggle to increase income from contextual advertising and the constant shuffling of search results aimed at discrediting SEO promotion may lead to the opposite effect. In any case, this search engine has a serious competitor on the RuNet.

    I think hardly anyone goes to Go.mail.ru specifically to search the RuNet. Nevertheless, traffic to entertainment projects from this search engine can be noticeably more than ten percent, so owners of such projects should pay attention to it.

However, in addition to the clear leaders in the search market of the Russian-language segment of the Internet, there are several more players whose share is quite low, but whose very existence makes it worth saying a few words about them.

Runet search engines from the second echelon


Internet-wide search engines

By and large, on the scale of the entire Internet there is only one serious player - Google. This is the undisputed leader, but it still has some competition.

First of all, it is still the same Bing, which has quite a good position in the American market, especially considering that its engine also powers all Yahoo services (together almost a third of the entire US search market).

Secondly, thanks to the huge share Chinese users make up of the total Internet audience, their main search engine, Baidu, wedges itself into the distribution of places on the world podium. It was launched in 2000 and its share is now about 80% of China's national audience.

It is difficult to say anything more definite about Baidu, but there are opinions online that places in its top results are occupied not only by the sites most relevant to the query, but also by those who paid for it (directly to the search engine, not to an SEO agency). Of course, this applies primarily to commercial results.

In general, looking at the statistics, it becomes clear why Google easily agrees to worsen its search results in exchange for increasing profits from contextual advertising. In fact, they are not afraid of user churn, because in most cases they have nowhere to go. This situation is somewhat sad, but we'll see what happens next.

By the way, to make life even harder for optimizers, and perhaps to protect the peace of mind of its users, Google has recently been encrypting the queries sent from users' browsers. Soon it will no longer be possible to see in visitor-counter statistics which queries brought Google users to your site.

Of course, in addition to the search engines mentioned in this publication, there are thousands of others: regional, specialized, exotic, and so on. Trying to list and describe them all in one article would be impossible, and probably unnecessary. Let us instead say a few words about how a search engine is built and how difficult and expensive it is to keep it up to date.

The vast majority of systems work on similar principles and pursue the same goal: to give users an answer to their question. Moreover, this answer must be relevant (matching the question), complete and, no less important, fresh.

Solving this problem is not so easy, especially considering that the search engine has to analyze the contents of billions of Internet pages on the fly, weed out the unnecessary ones, and form from the rest a list (the results page) where the most fitting answers to the user's question appear first.

This extremely complex task is solved by preliminary collection of information from these pages using various indexing robots. They collect links from already visited pages and load the content into the search engine's database. There are bots that index text: the regular one, and a fast bot that lives on news and frequently updated resources so that the freshest data is always present in the results.
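The crawling loop described above can be sketched in a few lines of Python. This is only a toy illustration, not a real bot: the `fetch` function, the page limit and the in-memory "database" are all invented for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags, like an indexing bot would."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(fetch, seed_urls, max_pages=100):
    """Breadth-first crawl: visit pages, store their content, queue new links.
    `fetch(url)` must return the page's HTML (a real bot would download it)."""
    queue, seen, database = deque(seed_urls), set(seed_urls), {}
    while queue and len(database) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        database[url] = html              # raw copy goes into the engine's database
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:      # links found on visited pages feed the queue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database
```

Real crawlers add politeness delays, robots.txt handling and revisit scheduling on top of this basic loop.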

In addition, there are robots that index images (for image search), favicons and site mirrors (for comparing them and possibly merging duplicates), as well as bots that check whether pages are still working, submitted by users or through webmaster tools.

The indexing process itself and the subsequent updates of the index databases are quite time-consuming, although Google does this much faster than its competitors; Yandex, at least, takes a week or two.

Typically, a search engine breaks the text content of a page into individual words, which are reduced to their base forms so that it can later give correct answers to questions asked in different morphological forms. Everything extra, such as HTML tags and whitespace, is deleted, and the remaining words are sorted alphabetically with their positions in the document recorded next to them.

This structure is called a reverse (inverted) index, and it allows searching not over the web pages themselves, but over structured data stored on the search engine's servers.
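As a toy illustration of this idea, here is a minimal reverse index in Python. Real engines additionally stem words to their base forms, which is skipped here; the function names are my own.

```python
import re

def build_reverse_index(pages):
    """pages: {doc_id: raw_html}. Returns {word: {doc_id: [positions]}}."""
    index = {}
    for doc_id, html in pages.items():
        text = re.sub(r"<[^>]+>", " ", html)      # strip HTML tags, as described above
        words = re.findall(r"\w+", text.lower())  # split the content into plain words
        for pos, word in enumerate(words):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

def lookup(index, word):
    """Answer 'which documents contain this word, and where' without rescanning pages."""
    return index.get(word.lower(), {})
```

The word positions stored next to each entry are what later let the engine handle phrase queries and judge how close query words are to each other.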

The number of such servers for Yandex (which searches mainly only for Russian-language sites and a little for Ukrainian and Turkish) is in the tens or even hundreds of thousands, and for Google (which searches in hundreds of languages) - in the millions.

Many servers have replicas, which both improve data safety and help increase the speed of request processing (by distributing the load). Just imagine the cost of maintaining all this machinery.

A user's request is sent by a load balancer to the server segment that is currently least loaded. The region from which the request came is determined, and the query itself is analyzed morphologically. If a similar query was entered recently, the user is served data from the cache so as not to load the servers again.

If the request has not yet been cached, it is passed to the area where the search engine's index database lives. In response, a list is formed of all pages that are at least somewhat related to the request; not only direct occurrences are taken into account, but also other morphological forms, synonyms and the like.
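A sketch of that request path in Python: the suffix-chopping "normalization" below is a crude stand-in for real morphological analysis, and `lru_cache` plays the role of the query cache, so a repeated (or morphologically similar) query never triggers the expensive index lookup twice.

```python
from functools import lru_cache

def normalize(query):
    """Crudely reduce word forms to a common key (real engines use proper lemmatization)."""
    words = []
    for word in query.lower().split():
        for suffix in ("ing", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        words.append(word)
    return tuple(sorted(words))

INDEX_LOOKUPS = {"count": 0}

@lru_cache(maxsize=1024)          # cached answers skip the index entirely
def run_query(normalized):
    INDEX_LOOKUPS["count"] += 1   # stands in for the expensive trip to the index servers
    return "results for: " + " ".join(normalized)

def handle(query):
    return run_query(normalize(query))
```

Note that caching the normalized form, not the raw string, is what lets "buying computers" and "computer buying" share one cache entry.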

These results then need to be ranked, and at this stage the ranking algorithm comes into play. The user's request is effectively multiplied into all possible interpretations, and answers to many derived queries are searched simultaneously (including through query-language operators, some of which are available to ordinary users).

As a rule, the results contain one page from each site (sometimes more). Ranking formulas are now very complex and take into account many factors. In addition, to fine-tune them, human assessors are used, who manually evaluate reference sites, which allows the operation of the algorithm as a whole to be adjusted.

In general, it is clearly a murky business. We could talk about this for a long time, but it is already obvious that keeping users satisfied with a search engine is very hard indeed. And there will always be those who dislike something, like you and me, dear readers.

Good luck to you! See you soon on the pages of the blog site


The best Internet search engines

An Internet search engine is a set of special search programs installed on a whole range of specialized machines. In simple terms, it is the same kind of website, with a set of programs, only running on special search servers. It is with the help of search engines that you find all the information you need, and there are a lot of them.

1. What is an Internet search engine

2. Popular search engines in our country

3. Popular search engines abroad

4. Unusual search engines

5. How to properly search for information on the Internet

The best search engines in our country:

http://www.yandex.ru

http://www.google.com

http://www.aport.ru

http://www.rambler.ru/

http://go.mail.ru

http://www.webalta.ru/

The last of these is the search engine most disliked by everyone for its intrusiveness.

Popular search engines abroad

http://www.altavista.com

http://www.alltheweb.com

http://www.bing.com

http://www.google.com
http://www.excite.com
http://www.lycos.com
http://www.mamma.com

http://www.yahoo.com

http://www.dmoz.com
http://www.hotbot.com
http://www.dogpile.com
http://www.netscape.com
http://www.msn.com
http://www.webcrawler.com
http://www.jayde.com
http://www.aol.com
http://www.euroseek.com
http://www.teoma.com
http://www.about.com
http://www.ixquick.com
http://www.lookle.com
http://www.metaeureka.com
http://www.searchspot.com
http://www.slider.com
http://www.allthesites.com
http://www.clickey.com
http://www.galaxy.com
http://brainysearch.com
http://www.orura.com

Each country has its own popular search engines.

Unusual search engines

  • DuckDuckGo (https://duckduckgo.com/) - a hybrid search engine with a privacy policy for the user and his search queries.

  • TinEye (http://tineye.com/) is a search engine specializing in searching images on the Internet. It has recently lost its relevance after Google introduced the same function in its image search.

  • Genon (http://www.genon.ru/) is a search engine that collects and creates content on its own website.

In almost every search engine, in addition to the search box, there are links to the most popular news sites, and sites of certain subjects.

How to properly search for information on the Internet

Each search engine has its own algorithms (rules) for searching for information.

To find some information on the Internet through a search engine, you need to enter a query in the search field. If you enter just one word, the query will give you thousands of links to sites where that word is mentioned.

Therefore, you should enter as specific a query as possible, consisting of two, three or more words.

Let's look at an example of a search engine query Yandex.

Let's say you want to find information on buying a computer. If you write the single word "Computer" in the search box, you will get 133 million answers.

You need to ask a more specific request. It is better to indicate which computer you want to buy and where (in which city).

Then the search engine will give you much fewer answers to your query.

The search engine doesn’t care at all whether you enter your query in capital or small letters.

Yandex distinguishes between nouns and adjectives, but completely ignores endings.

He is also completely indifferent to cases, plurals and the like.

To make the search more accurate, you need to put the query in quotation marks or put an exclamation mark before the word.

Now look at the same query, but without the exclamation marks.

Do you see the difference? With exclamation marks, the number of responses is not 2 million, but 186 thousand.

If you put an exclamation point in front of a word with a capital letter, you will be given answers that contain that particular word with a capital letter.

If the word is in the nominative case, and you need information on exactly that word, exactly as you wrote it, put two exclamation marks in front of it. For example: !!Ball.

The search will return answers for exactly the word "Ball" the way you wrote it: not "ball", not "balls", and with a capital letter.

If you write a phrase containing the word "on", Yandex will ignore it. For example, in the query "on the shelf" the search will be performed only on the word "shelf".

For it to be taken into account and not ignored, you need to put a plus sign before the word: "+on".

Each search engine has its own search algorithm, so if you use a specific search engine and want to learn to compose queries correctly, just type "search rules in Google" or "search rules in Yandex", follow a link in the results and read the necessary information.

To successfully maintain and develop a blog, we first of all need to know how search engines work and by what algorithms. A clear understanding of this will allow us to successfully solve the problems of website promotion in search engines. The conversation about search engine optimization is still ahead; for now, a little theory about search engines.

What are Internet search engines?

If we turn to Wikipedia, this is what we find out:

“A search engine is a software and hardware complex with a web interface that provides the ability to search for information on the Internet.”

And now in a language we understand. Let's say we urgently need information on a certain topic. So that we can quickly find it, search engines have been created - sites where, by entering a search query in the search form, we will be given a list of sites on which, with a high degree of probability, we will find what we are looking for. This list is called search results. It can consist of millions of pages with 10 sites on each. The main task of a webmaster is to get into at least the top ten.

Remember that when you search for something on the Internet, you usually find it on the first page of results, rarely moving to the second, let alone subsequent ones. So the higher a site ranks, the more visitors come to its pages. And high traffic (the number of visitors per day) is, among other things, an opportunity to earn well.

How do Internet search engines find information on the Internet and on what basis do they distribute places in search results?

In a few words, an Internet search engine is a whole web in which spider robots constantly scan the network and remember all the texts that appear on the Internet. Analyzing the collected data, search engines select the documents that best match the search query, i.e. the relevant ones, from which the results are formed.

The most interesting thing is that search engines cannot read. So how then do they find information? Search engine algorithms boil down to a few basic principles. First of all, they pay attention to the title and description of the article, paragraph headings, semantic highlights in the text and the density of keywords, which must necessarily correspond to the topic of the article. The more accurate this match is, the higher the site will appear in search results. In addition, the volume of information and many other factors must be taken into account. For example, the authority of a web resource, which depends on the number and authority of sites linking to it. The greater the authority, the higher the ranking.
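A toy Python version of such scoring, just to show how the factors named above (title matches, keyword density and link authority) could be combined; the weights are entirely invented for illustration.

```python
def score(page, query_words):
    """Combine the ranking factors mentioned above; the weights are made up."""
    title = page["title"].lower().split()
    body = page["body"].lower().split()
    title_hits = sum(word in title for word in query_words)
    density = sum(body.count(word) for word in query_words) / max(len(body), 1)
    return 3.0 * title_hits + 10.0 * density + 0.1 * page["inbound_links"]

def rank(pages, query):
    """Order pages by descending relevance score for the query."""
    query_words = query.lower().split()
    return sorted(pages, key=lambda page: score(page, query_words), reverse=True)
```

Even this crude formula captures the trade-off the article describes: a page with many inbound links can still lose to a page whose title and text actually match the query.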

A set of measures aimed at raising a site's position in the results for certain queries is called search engine optimization. It has become a whole science: SEO. But more on that later.

At the moment there are many search engines in the world. I'll name the most popular ones. In the west these are: Google, Bing and Yahoo. In Runet - Yandex, Mail.ru, Rambler and Nigma. Basically, users give preference to the world leader, and the Yandex system has become the most popular on the Russian-language Internet.

A little history. Google was created in 1997 by a native of Moscow Sergey Brin and his American friend Larry Page during their studies at Stanford University.

Google's peculiarity was that it brought the most relevant search results in a logical sequence to the first positions in search results, while other search engines simply compared the words in the query with the words on the web page.

On September 23 of the same year, the Yandex system was announced, which since 2000 began to exist as a separate company “Yandex”.

I won't bore you any longer; I hope it is now a little clearer what Internet search engines are. It is worth saying that their algorithms are constantly evolving. Every day search engines get better at identifying users' needs and showing the most relevant information, based on many factors (region, what the user has already searched for, which sites he visited along the way, where he went from them, and so on).

Soon Google and Yandex will know better than us what we need and what we think about!

The most popular web service of our time is the search engine. This is understandable: the days when the first Internet users could keep track of every new thing appearing online are long gone.

So much information appears and accumulates that it has become very difficult to find exactly what one needs. Imagine what searching would be like if the average user had to hunt for information who knows where; a manual search would not get you far.

Search engine, what is it?

It’s good if the user already knows sites that may have the necessary information, but what to do otherwise? In order to make life easier for a person in finding the necessary information on the Internet, search engines or simply search engines were invented. The search engine performs one very important function, without which the Internet would not be the same as we are used to seeing it - this is searching for information on the Internet.

A search engine is a special web site that, in response to users' queries, provides hyperlinks to pages of sites that answer the given search query.

To be a little more precise, it is a search for information on the Internet carried out by a software and hardware complex with a web interface for interacting with users.

For human interaction with the search engine, a web interface was created, that is, a visible and understandable shell. This approach of search engine developers makes searching easier for many people. As a rule, it is on the Internet that searches are carried out using search engines, but there are also search systems for FTP servers, certain types of goods on the World Wide Web, or news information or other search directions.

The search can be carried out not only by the text content of sites, but also by other types of information that a person can search for: images, videos, sound files, etc.

How does a search engine search?

Searching the Internet, like browsing websites, is done through a browser. The search itself is carried out only after the user enters a query in the search bar.

Any search system contains a software part on which the entire mechanism is based, called the search engine: a software package that provides the ability to search for information. A person forms a query and enters it into the search bar; the engine generates a results page, with the most relevant results, in the engine's opinion, placed higher.

Search relevance is the selection of the materials that best match the user's request and the placement of hyperlinks to them higher on the results page. The ordering of the results itself is called ranking.

So how does a search engine prepare its materials and find information? Each search engine has its own robot (bot) that collects information on the network; it also goes by synonyms such as crawler or spider. The work of the search system itself can be divided into three stages:

The first stage of a search engine's operation includes scanning sites on the global network and collecting copies of web pages on its own servers. This creates a huge amount of information that has not yet been processed and is not suitable for search results.

The second stage comes down to putting in order the information collected in the first stage. It is organized so that later a high-quality search can be performed in the shortest possible time, which is what users actually expect. This stage is called indexing: the pages are prepared for serving, and the resulting database is called the index.

The third stage is what determines the search results: after receiving a request from a client, the engine, based on the keywords (or words close to them) in the request, selects the most relevant information and serves it. Since there is a lot of information, the engine ranks it according to its algorithms.
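These three stages can be lined up in a short Python sketch. The `fetch` function and the word-count ranking are simplifications invented for the example.

```python
def stage1_collect(fetch, urls):
    """Stage 1: scan the listed pages and keep raw copies on 'our servers'."""
    return {url: fetch(url) for url in urls}

def stage2_index(raw_pages):
    """Stage 2 (indexing): reorganize the raw copies into a word -> urls map."""
    index = {}
    for url, text in raw_pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(url)
    return index

def stage3_serve(index, query):
    """Stage 3: answer a query from the index, ranking pages by matched words."""
    hits = {}
    for word in query.lower().split():
        for url in index.get(word, ()):
            hits[url] = hits.get(url, 0) + 1
    return sorted(hits, key=hits.get, reverse=True)
```

The key point the sketch shows is that stage 3 never rereads the pages themselves: it answers entirely from the prepared index, which is why indexing up front is worth the effort.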
The best search engine is considered the one that provides the material that most correctly answers the user's request. But even here results can be influenced by people interested in promoting their sites; such sites do appear in the results, though usually not for long.

Although world leaders have already been identified in many regions, search engines continue to develop their high-quality search. The better search they can provide, the more people will use it.

How to use the search engine?

What is a search engine and how it works is already clear, but how to use it correctly? Most sites always have a search bar, and next to it there is a Find or Search button. A query is entered into the search line, after which you need to press the search button or, as is more often the case, press the Enter key on the keyboard and in a matter of seconds you receive the result of the query in the form of a list.

But it’s not always possible to get the correct answer to a search query the first time. To ensure that the search for what you want does not become painful, you must correctly compose your search query and follow the recommendations described below.

We compose the search query correctly

Below are tips for using a search engine. Following a few tricks and rules when searching for information will get you the desired result much faster:

  1. Correct spelling of words ensures the maximum number of matches with the desired information object (Although modern search engines have already learned to correct spelling errors, this advice should not be neglected).
  2. By using synonyms in your query, you can cover a wider search range.
  3. Sometimes changing a word in the query text can bring better results; reformat the query.
  4. Bring specificity to your query, use exact occurrences of phrases that should define the main essence of the search.
  5. Experiment with keywords. Using keywords and phrases can help identify the main point, and the search engine will return more relevant results.

So a search engine is nothing more than an opportunity to find information of interest, usually completely free of charge, to learn something, understand something, or draw the right conclusion. Many people can no longer imagine life without voice search, where there is no need to type text: you simply speak your request, and the input device is a microphone. All this shows the constant development of search technologies and the need for them.

Hello, dear readers of the blog site. When doing SEO or, in other words, search engine optimization, whether professionally (promoting commercial projects for money) or as an amateur, you will inevitably run into the fact that you need to know how search engines work in general in order to successfully optimize your own or someone else's site for them.

The enemy, as they say, must be known by sight, although of course they (for RuNet this means Yandex and Google) are not enemies to us at all, but rather partners, because in most cases their share of traffic is the prevailing and main one. There are exceptions, of course, but they only confirm the rule.

What is a snippet and how search engines work?

But first you need to figure out what a snippet is, what it is for, and why its content is so important to the optimizer. In the search results it sits immediately below the link to the found document (whose text is taken from the page's Title tag, as I have already written):

Pieces of text from the document are usually used as the snippet. Ideally, it lets the user form an opinion about the page's content without visiting it (when it turns out well, which is not always the case).

The snippet is generated automatically, and it is not up to you which text fragments are used in it; importantly, the same web page will have different snippets for different queries.

However, the contents of the Description meta tag can sometimes be used as the snippet (especially in Google). This too depends on the query whose results it appears in.

The Description text can be shown, for example, when the query keywords coincide with the words you used in the description, or when the algorithm has not yet found suitable text fragments on your page for all the queries it ranks for in Yandex or Google.

So don't be lazy: fill in the Description for every article. In WordPress this can be done with the plugin described earlier (and I strongly recommend using it).

If you are a Joomla fan, there is similar material for you.

But a snippet cannot be obtained from the reverse index, because it stores only the words used on the page and their positions in the text. Precisely in order to build snippets of the same document for different results pages (different queries), our beloved Yandex and Google, in addition to the reverse index (needed for the search itself; more on it below), also store the direct index, i.e. a copy of the web page.

Having saved a copy of the document in their database, they can then conveniently cut the necessary snippets from it without going back to the original.
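The idea of cutting a snippet from a stored copy can be sketched in Python. This is a toy illustration of the principle, not the actual Yandex or Google logic; the scoring rule (pick the sentence sharing the most words with the query) is my own simplification:

```python
import re

def make_snippet(page_text: str, query: str, max_len: int = 160) -> str:
    """Pick the sentence that shares the most words with the query,
    the way a search engine might cut a snippet from its stored copy."""
    query_words = set(query.lower().split())
    sentences = re.split(r"(?<=[.!?])\s+", page_text)
    # Score each sentence by how many query words it contains.
    best = max(sentences, key=lambda s: len(query_words & set(s.lower().split())))
    return best[:max_len]

page = ("A snippet is a short text fragment shown under the link. "
        "Search engines store a copy of every page. "
        "The copy is used to cut snippets for different queries.")
print(make_snippet(page, "how snippets are cut from a copy"))
# → The copy is used to cut snippets for different queries.
```

Note that a different query would select a different sentence from the same stored copy, which is exactly why the same page gets different snippets for different queries.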

Thus, search engines store both the direct and the reverse index of a web page in their databases. By the way, you can indirectly influence snippet formation by optimizing the page text so that the algorithm picks exactly the fragment you have in mind. But we'll talk about that in another article in this section.

How search engines work in general

The essence of optimization is to "help" search engine algorithms raise the pages of the sites you promote to the highest possible positions in the results for certain queries.

I put the word "help" in quotation marks because our optimization actions do not really help; more often they actively prevent the algorithm from producing results that are truly relevant to the query.

But this is the bread and butter of optimizers, and until search algorithms become perfect, there will be opportunities, through internal and external optimization, to improve the positions of your pages in Yandex and Google results.

But before moving on to optimization methods, you need at least a superficial understanding of how search engines work, so that all further actions are done consciously, with an understanding of why they are needed and how those we are trying to outwit will react.

Clearly we will not be able to understand the entire logic of their work from start to finish, since much of the information is not disclosed, but at first an understanding of the fundamental principles will be enough. So let's get started.

How do search engines work, anyway? Oddly enough, the logic is essentially the same everywhere: information is collected about every web page on the network the engine can reach, then this data is cleverly processed so that it is convenient to search through. That's really all; formally the article could end here, but let's add some specifics.

First, let's clarify that a document is what we usually call a site page. It must have its own unique address (URL), and, notably, hash links (the part after #) do not create a new document.

Secondly, it is worth dwelling on the algorithms (methods) used to search the collected document database.

Direct and reverse index algorithms

Obviously, simply iterating through all the pages stored in the database is not optimal. This method is called the direct search algorithm, and while it guarantees finding the needed information without missing anything important, it is completely unsuitable for large volumes of data, because the search would take far too long.
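A direct search is just a linear scan of every stored document. A minimal sketch (the in-memory collection and page texts are invented for illustration):

```python
def direct_search(documents: dict[str, str], word: str) -> list[str]:
    """Naive direct search: scan the full text of every document.
    Runtime grows linearly with the total size of the collection."""
    return [doc_id for doc_id, text in documents.items()
            if word.lower() in text.lower().split()]

docs = {
    "page1": "search engines build an index",
    "page2": "the index speeds up search",
    "page3": "robots crawl the web",
}
print(direct_search(docs, "index"))  # → ['page1', 'page2']
```

With three documents this is instant; with billions, re-reading every page for every query is hopeless, which is what motivates the inverted index below.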

Therefore, to work effectively with large volumes of data, the inverse (inverted) index algorithm was developed. Remarkably, it is the one used by all major search engines in the world, so we will dwell on it in more detail and consider how it works.

With the reverse index algorithm, documents are converted into text files containing a list of all the words they contain.

Words in such lists (index files) are arranged alphabetically, and next to each word the places where it occurs on the web page are recorded as coordinates. Besides the position in the document, other parameters determining the word's significance are stored for each word.

If you remember, in many books (mostly technical or scientific) on the last pages there is a list of words used in this book, indicating the page numbers where they appear. Of course, this list does not include all the words used in the book, but nevertheless it can serve as an example of constructing an index file using inverted indexes.
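The book-index analogy translates directly into code. A minimal sketch of building an inverted index with word positions (sample documents invented for illustration; real engines also store the extra per-word significance parameters mentioned above):

```python
from collections import defaultdict

def build_inverted_index(documents: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map each word to the documents it occurs in, with word positions --
    like a book's index mapping words to page numbers."""
    index: dict[str, dict[str, list[int]]] = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

docs = {
    "page1": "search engines build an index",
    "page2": "the index speeds up search",
}
index = build_inverted_index(docs)
print(dict(index["index"]))   # → {'page1': [4], 'page2': [1]}
print(dict(index["search"]))  # → {'page1': [0], 'page2': [4]}
```

Answering a query now means looking up each query word in this dictionary, instead of re-reading every document.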

Note that search engines search for information not on the Internet itself, but in the reverse indexes of the web pages they have processed. They also store the direct indexes (the original text), since these are later needed for composing snippets, as discussed at the beginning of this article.

The reverse index algorithm is used by all systems because it speeds up the search, but it comes with inevitable information loss due to the distortions introduced when converting a document into an index file. For storage convenience, reverse index files are usually compressed in clever ways.

Mathematical model used for ranking

To search using reverse indexes, a mathematical model is employed that simplifies both finding the needed web pages (for a query entered by the user) and determining how relevant each found document is to that query. The better a document matches a query (the more relevant it is), the higher it should appear in the results.

This means the main task of the mathematical model is to find, in the reverse-index database, pages matching a given query and then sort them in descending order of relevance to that query.

A simple boolean model, in which a document matches whenever the search phrase occurs in it, will not do, because of the huge number of such web pages that would be presented to the user.

The search engine must not merely list all web pages containing the query words. It must present this list so that the documents most relevant to the user's query come first (sorting by relevance). This task is non-trivial and cannot be performed perfectly by default.

By the way, optimizers exploit the imperfection of any mathematical model, influencing the ranking of documents in the results one way or another (in favor of the sites they promote, of course). The mathematical model used by all search engines belongs to the vector class; it operates with the notion of the weight of a document with respect to a user's query.

In the basic vector model, the weight of a document for a given query is calculated from two main parameters: how often a given word appears in it (TF, term frequency) and how rarely that word occurs across all other pages in the collection (IDF, inverse document frequency).

By collection we mean the entire set of pages known to the search engine. Multiplying these two parameters gives the weight of the document for the given query.

Naturally, search engines use many additional coefficients beyond TF and IDF to calculate the weight, but the essence remains the same: a page's weight is greater the more often the query word appears in it (up to certain limits, beyond which the document may be flagged as spam) and the less frequently that word appears in all other documents indexed by the system.
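The basic TF × IDF weight described above can be computed in a few lines. A minimal sketch (the three-document collection is invented; real engines use far more refined formulas and normalizations):

```python
import math

def tf_idf_weight(word: str, doc: str, collection: list[str]) -> float:
    """Basic vector-model weight: TF (frequency of the word in the document)
    times IDF (how rare the word is across the whole collection)."""
    words = doc.lower().split()
    tf = words.count(word.lower()) / len(words)
    docs_with_word = sum(1 for d in collection if word.lower() in d.lower().split())
    idf = math.log(len(collection) / docs_with_word) if docs_with_word else 0.0
    return tf * idf

collection = [
    "search engines rank pages by relevance",
    "relevance is computed from word weights",
    "robots crawl pages on the web",
]
# "relevance" occurs in 2 of 3 documents, so its IDF (and weight) is modest;
# a word unique to one document would score higher.
print(round(tf_idf_weight("relevance", collection[0], collection), 4))
```

The key property to notice: a word that appears in every document gets IDF = log(1) = 0, i.e. zero weight, which is why ubiquitous words like "the" contribute nothing to ranking.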

Assessment of the quality of the formula by assessors

Thus, results for queries are generated entirely by formula, without human intervention. But no formula works perfectly, especially at first, so the operation of the mathematical model has to be monitored.

For this purpose, specially trained people (assessors) are employed, who review the results of the search engine that hired them for various queries and evaluate the quality of the current formula.

All their comments are taken into account by the people responsible for tuning the mathematical model, who make changes or additions to the formula. Assessors thus act as a feedback loop between the algorithm's developers and its users, which is necessary for improving quality.

The main criteria for assessing the quality of the formula are:

  1. Accuracy (precision) of the results is the proportion of relevant documents (those matching the query) among everything returned. The fewer off-topic web pages (for example, doorways) in the results, the better.
  2. Completeness (recall) of the results is the ratio of relevant web pages shown to the total number of relevant documents in the entire collection. That is, the whole database known to the engine may contain more pages matching a query than are actually shown; then we speak of incomplete results. Some relevant pages may have been caught by a filter, for example mistaken for doorways or other junk.
  3. Freshness of the results is the degree to which the real web page on the site still matches what the results say about it. For example, a document may no longer exist or may have changed significantly, yet still appear in the results despite being physically absent at the given address or no longer matching the query. Freshness depends on how often search robots rescan documents in the collection.
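The first two criteria, precision and completeness (recall), are simple ratios and can be sketched directly (the page identifiers below are invented for illustration):

```python
def precision_recall(returned: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of returned documents that are relevant.
    Recall (completeness): share of all relevant documents that were returned."""
    hits = returned & relevant
    precision = len(hits) / len(returned)
    recall = len(hits) / len(relevant)
    return precision, recall

# The engine returned four pages, one of which is an off-topic doorway;
# the whole collection actually contains five relevant pages.
returned = {"page1", "page2", "page3", "doorway1"}
relevant = {"page1", "page2", "page3", "page4", "page5"}
print(precision_recall(returned, relevant))  # → (0.75, 0.6)
```

Here the doorway hurts precision (3 of 4 returned pages are relevant), while the two relevant pages that fell under a filter hurt completeness (only 3 of 5 relevant pages were shown).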

How Yandex and Google collect their collection

Despite the apparent simplicity of web page indexing, there are many nuances you need to know and later use when optimizing (SEO) your own or client websites. Indexing of the network (gathering the collection) is performed by a specially designed program called a search robot (bot).

The robot receives an initial list of addresses to visit; it copies the contents of those pages and hands the content over for further processing to the algorithm, which converts it into reverse indexes.

The robot not only follows the list given to it in advance, but also follows links from those pages and indexes the documents they lead to. Thus, the robot behaves just like a regular user clicking through links.

So with the help of a robot it is possible to index everything a user could normally reach with a browser (search engines index directly visible documents, i.e. those any Internet user can see).
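The crawling process described above (start from a seed list, copy each page, follow the discovered links) can be sketched as a breadth-first traversal. The in-memory "web" below stands in for real HTTP requests and is entirely invented for illustration:

```python
from collections import deque

# A toy in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.com": ("start page", ["b.com", "c.com"]),
    "b.com": ("about indexing", ["c.com"]),
    "c.com": ("about snippets", []),
}

def crawl(seed_urls: list[str]) -> dict[str, str]:
    """Breadth-first crawl: visit the seeds, then follow links,
    skipping already-visited pages; collected text goes to the indexer."""
    frontier = deque(seed_urls)
    collected: dict[str, str] = {}
    while frontier:
        url = frontier.popleft()
        if url in collected or url not in WEB:
            continue
        text, links = WEB[url]
        collected[url] = text      # this is what gets handed to the indexer
        frontier.extend(links)     # behave like a user clicking links
    return collected

print(sorted(crawl(["a.com"])))  # → ['a.com', 'b.com', 'c.com']
```

Starting from a single seed, the robot reaches every page that is linked from somewhere; pages with no inbound links anywhere would stay invisible to it, which is why new sites benefit from incoming links.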

There are a number of features associated with indexing documents on the Internet (some of which we have already discussed).

The first feature is that, in addition to the reverse index created from the original document downloaded from the network, the search engine also stores a copy of the document itself; in other words, it also stores the direct index. Why? As mentioned earlier, this is needed to compose different snippets depending on the query entered.

How many pages from one site does Yandex index and show in the results?

I would like to draw your attention to a peculiarity of Yandex: the presence, in the results for a given query, of only one document from each site. Until recently, two pages from the same resource could not occupy different positions in the same results.

This was one of Yandex's fundamental rules. Even if a site has a hundred pages relevant to a query, only one (the most relevant) would appear in the results.

Yandex wants the user to receive varied information, rather than scrolling through several result pages filled with pages from the same site, one that failed to interest the user for whatever reason.

However, I must correct myself: as I was finishing this article I learned that Yandex has begun to allow, as an exception, a second document from the same resource in the results, if that page turns out to be "very good and relevant" (in other words, highly relevant to the query).

Notably, these additional results from the same site are numbered too, so some resources in lower positions get pushed out of the top. Here is an example of the new Yandex output:

Search engines strive to index all websites evenly, but this is often difficult because sites differ enormously in page count (one has ten pages, another ten million). What to do in this case?

Yandex is getting out of this situation by limiting the number of documents that it can put into the index from one site.

For projects with a second-level domain name, such as this site, the maximum number of pages the Runet mirror (i.e. Yandex) will index ranges from one hundred to one hundred fifty thousand (the exact number depends on its attitude toward the project).

For resources with a third-level domain name - from ten to thirty thousand pages (documents).

If you have a website on a second-level domain and you need, say, a million web pages indexed, the only way out is to create many subdomains.

Subdomains of a second-level domain might look like JOOMLA.site. Yandex can index somewhere over 200 subdomains per second-level domain (sometimes up to a thousand), so in this simple way you can get several million web pages into the index of the Runet mirror.
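The arithmetic behind "several million pages" follows directly from the limits quoted above (which are the article's figures, not official Yandex numbers):

```python
# Each third-level subdomain gets 10k-30k indexed pages,
# and a second-level domain can have roughly 200 indexed subdomains.
pages_per_subdomain = (10_000, 30_000)
subdomains = 200

low, high = (subdomains * p for p in pages_per_subdomain)
print(low, high)  # → 2000000 6000000
```

So 200 subdomains at 10-30 thousand pages each gives roughly 2 to 6 million indexed pages, on top of the 100-150 thousand allowed on the second-level domain itself.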

How Yandex treats sites in non-Russian domain zones

Due to the fact that until recently Yandex searched only in the Russian-language part of the Internet, it indexed mainly Russian-language projects.

Therefore, if you create a site outside the domain zones classified as Russian-language by default (RU, SU and UA), you should not expect fast indexing: Yandex will most likely find it no earlier than a month later. Subsequent indexing, however, will occur with the same frequency as in Russian-language domain zones.

I.e. the domain zone only affects how long it takes before indexing begins, not its subsequent frequency. By the way, what does this frequency depend on?

The logic of how search engines work to reindex pages comes down to approximately the following:

  1. Having found and indexed a new page, the robot visits it again the next day.
  2. If it compares the content with yesterday's and finds no differences, it will return only in three days.
  3. If nothing has changed by then, it will come back in a week, and so on.

Thus, over time, the frequency of the robot's visits to a page approaches the frequency of the page's updates. Moreover, the revisit interval can range, across different sites, from minutes to years.
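The adaptive schedule described above can be sketched as a back-off rule. This is a simplification with illustrative factors and bounds of my own choosing (I use doubling rather than the exact 1 → 3 → 7 day steps mentioned, and real engine schedules are not public):

```python
def next_revisit_interval(current_days: float, page_changed: bool) -> float:
    """Adaptive recrawl schedule: back off when a page is static,
    return to frequent visits when it changes.
    Factors and bounds are illustrative guesses, not real engine values."""
    if page_changed:
        return max(current_days / 2, 1 / (24 * 60))  # floor: about a minute
    return min(current_days * 2, 365)                # ceiling: about a year

# A page that never changes: 1 day -> 2 -> 4 -> 8 days between visits.
interval = 1.0
for _ in range(3):
    interval = next_revisit_interval(interval, page_changed=False)
print(interval)  # → 8.0
```

A frequently updated page is pushed toward the minute-scale floor, while a static one drifts toward the yearly ceiling, matching the "visit frequency converges to update frequency" behavior described above.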

That is how smart search engines are: they create an individual visit schedule for the pages of different resources. It is, however, possible to force a search engine to re-index a page on request, even if nothing has changed on it, but more on that in another article.

We will continue studying the principles of search in the next article, where we will look at the problems search engines face and various nuances, and much more, of course, that helps in one way or another.

Good luck to you! See you soon on the pages of the blog site
