25 minute read
HOW SEARCH ENGINES REALLY WORK
By Jeff Ferguson
The Internet and the World Wide Web we know today would be utterly unnavigable without search engines. But how do Google, Bing, and the other search engines really work?
The short answer: Search engines work by filling a database, or index, of billions of web pages through crawling. During the crawling process, computer programs called web crawlers, also known as bots or spiders, download the content of these pages along with the links to other web pages found on them. Search engines analyze these web pages using algorithms for factors such as topicality, quality, speed, mobile-friendliness, and more to determine the position, or rank, in the search engine results presented to users.
But let’s dive a little deeper. In this study, you will learn:
• About Machine Learning
• What is a Search Engine?
• How Do Search Engines Obtain Their Results?
• How Do Search Engines Understand Your Query?
• How Do Search Engines Rank Web Pages?
• Pulling It All Together: How Google Really Ranks Web Pages
• Summary
FIRST LET’S TALK ABOUT MACHINE LEARNING
Before we discuss how search engines work, you should understand a little about Machine Learning. According to Jim Sterne, author of Artificial Intelligence for Marketing, “Machine Learning is the automated creation of predictive models based on the structure of the given data.” To say that Machine Learning is a computer teaching itself how to do something is an oversimplification, but it’s an excellent place to start for this discussion.
While Machine Learning lives under the Artificial Intelligence (AI) canopy, the two should not be confused. Machine Learning is already in daily use in a myriad of computer systems; AI is the name given to a variety of technologies, some of which are still the stuff of science fiction.
Some Machine Learning systems run on an Unsupervised basis: they still require data to get started, which they use to find patterns and reveal correlations. There is also Supervised Machine Learning, whereby we ensure the algorithms are doing their jobs properly, like a teacher grading a student’s homework. Lastly, there is what is known as “Reinforcement Learning,” which drives automated systems that optimize marketing campaigns, such as Google Ads optimizing a handful of variables to get as many conversions as possible at the right price.
The big takeaway here is that these systems, while fixed in their purpose, are not fixed in their modeling, and evolve as they consume more data. Know that modern search engines are using Machine Learning systems of all three types in their attempts to organize the world’s knowledge.
WHAT IS A SEARCH ENGINE?
At this point in history, I’m sure just about everyone knows what a search engine is in theory; however, let’s talk a little about what a search engine is from a technical standpoint. The concept of a database, that is, a structured collection of information stored in a computer, has been around since the 1960s. That’s basically what a search engine is: a vast database of web pages combined with a set of algorithms, that is, collections of computer instructions, that decide which web pages to return and in what order they should appear when someone asks that database a question, or query.
HOW DO SEARCH ENGINES OBTAIN THEIR RESULTS?
That database is filled through the process of crawling, whereby a computer program visits a known web page and downloads the information found on that web page, a process known as parsing, into the database. The data collected during this process is not only the contents of the web page itself but also the links found on that page, which point to other web pages. The links found on that web page get added to a list of web pages for the crawler to visit at another time.
Despite the name, crawlers, also known as bots or spiders, do not move from page to page by way of the links found there; instead, the parser adds the newly discovered pages to a sort of “to-do list” to visit later. This to-do list is managed by what’s known as a scheduler, which is itself an algorithm that determines how vital those newly discovered web pages are in comparison to all the other web pages on the internet the crawler already knows about.
The parser then sends the information it obtained from the web page to what is known as an index (a process known as, well, indexing), which itself is a kind of database. However, an index is more a database of locations (or citations) of information along with brief descriptions of that information (called abstracts). These citations and brief descriptions are basically what search engines provide to you when you query them for information about a given topic.
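The crawl, parse, schedule, and index loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FAKE_WEB` dictionary stands in for the real web so the example runs without network access, the names `PageParser` and `crawl` are hypothetical, and the plain first-in, first-out queue is a stand-in for a real scheduler, which would prioritize pages rather than visit them in discovery order.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "https://example.com/": (
        '<title>Home</title>'
        '<a href="https://example.com/a">A</a>'
        '<a href="https://example.com/b">B</a>'
    ),
    "https://example.com/a": (
        '<title>Page A</title><p>About crawling.</p>'
        '<a href="https://example.com/b">B</a>'
    ),
    "https://example.com/b": (
        '<title>Page B</title><p>About indexing.</p>'
        '<a href="https://example.com/">Home</a>'
    ),
}


class PageParser(HTMLParser):
    """Collects the outgoing links and the text found on one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(seed):
    """Visit pages reachable from seed; return {url: short abstract}."""
    frontier = deque([seed])  # the "to-do list" of pages to visit later
    seen = {seed}             # every URL discovered so far
    index = {}                # citation (URL) -> abstract
    while frontier:
        url = frontier.popleft()
        html = FAKE_WEB.get(url)
        if html is None:
            continue  # page could not be fetched
        parser = PageParser()
        parser.feed(html)
        index[url] = " ".join(parser.text)[:80]
        for link in parser.links:
            if link not in seen:  # queue newly discovered pages only
                seen.add(link)
                frontier.append(link)
    return index
```

Calling `crawl("https://example.com/")` returns one index entry per reachable page. Note that the crawler never "follows" a link in the moment; it only queues it, which mirrors the to-do list behavior described above.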
If you would like to learn more about this process, I suggest Andrew Hogue’s excellent tech talk at Google from 2011 called The Structured Search Engine.
Just getting your web page into a search engine’s index is a substantial process, and search engines perform this action thousands and thousands of times a day for new and old web pages alike. Google and the other search engines have made this job of discovery a bit easier for themselves by allowing website owners to provide a list of web pages to them, a file known as a sitemap. Additionally, you can submit new individual pages to both Google and Bing via the Google Search Console and Bing Webmaster Tools websites, respectively.
As a website owner, it’s essential to understand this process. When the time comes to determine why your website may or may not receive any attention from the Organic search channel, one of the first questions that you should ask is, “Is our web page even indexed?”, that is, is the web page in question even in the search engine at all? If it’s not, then either the search engine crawlers simply haven’t reached your website yet, or there is something technical in nature keeping your web page from being crawled or indexed.
Once a web page is included in a search engine’s index, the search engine must then determine when and where that web page will appear to its users when they search for something. That is, the search engine needs to decide for which keywords and in what position your web page will appear, or rank, in a search result. This process is where a search engine’s ranking algorithms come into play. Every search engine’s ranking algorithm works a little differently; however, since Google dominates the search engine market in most of the English-speaking world (and beyond), we’ll focus on its ranking process for this discussion.
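For readers who have never seen a sitemap, here is a minimal sketch of generating one with Python’s standard library. The function name `build_sitemap` and the example URLs are assumptions made for illustration. The `urlset`, `url`, and `loc` elements and the namespace come from the public sitemaps.org protocol, which also defines optional fields that are left out here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap XML string listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = page_url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages on an example site.
print(build_sitemap(["https://example.com/", "https://example.com/about"]))
```

The resulting file is typically saved as `sitemap.xml` on the website and then submitted through the webmaster tools mentioned above.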
HOW DO SEARCH ENGINES UNDERSTAND YOUR QUERY?
Before Google can show you any results for your query, it must first determine what your question is about, that is, not only understanding the words in the query but the intent of those words as well. According to Google, “This involves steps as seemingly simple as interpreting spelling mistakes and extends to trying to understand the type of query you’ve entered by applying some of the latest research on natural language understanding.”
This task is more complicated than you may think. The English language is, frankly, a mess, and Google’s ability to decipher that mess has improved steadily over the years. Google continues, “For example, our synonym system helps Search know what you mean by establishing that multiple words mean the same thing. This capability allows Search to match the query ‘How to change a light bulb’ with pages describing how to replace a light bulb.”
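To make the light bulb example concrete, here is a toy sketch of synonym-aware matching. It is not how Google’s synonym system works internally, which has not been published in detail; the hand-written `SYNONYMS` table and the function names `expand_query` and `matches` are assumptions for illustration only.

```python
# Hypothetical, hand-written synonym table; a real system would learn
# these relationships from data rather than hard-code them.
SYNONYMS = {"change": {"replace", "swap"}}


def expand_query(query):
    """Return one set per query word: the word plus its known synonyms."""
    return [{word} | SYNONYMS.get(word, set()) for word in query.lower().split()]


def matches(query, page_text):
    """True if every query word, or one of its synonyms, is on the page."""
    page_words = set(page_text.lower().split())
    return all(group & page_words for group in expand_query(query))
```

With this table, `matches("How to change a light bulb", "how to replace a light bulb")` is true even though the word “change” never appears on the page, while a page about baking bread does not match.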
During the process, Google attempts to determine if the information you are looking for is broad or very specific, or if the query is about a local business. Google also tries its best to determine if your question requires more recent, or fresh, information.
“If you search for trending keywords, our freshness algorithms will interpret that as a signal that up-to-date information might be more useful than older pages,” Google continues. “This means that when you’re searching for the latest ‘premiership scores’, ‘Strictly Come Dancing’ results or ‘BP earnings’, you’ll see the latest information.”
And you thought Google just read your question as entered, didn’t you? Now that Google has figured out what you’re asking, it needs to determine which web pages answer that question the best, a process known as ranking. Again, this is no simple matter.
HOW DO SEARCH ENGINES RANK WEB PAGES?
Google reviews hundreds of traits, or signals, of a web page to determine when and where it should appear in its search results. Despite what anyone might tell you, no one outside of Google knows all these signals, nor do they know if any signal has a higher priority or importance than another. That said, when Google introduced RankBrain, one of its engineers admitted that it was the third most important ranking signal; as you’ll see, though, that isn’t all that helpful.
Google has been kind enough to define some of the groups of signals, which in and of themselves are algorithms dedicated to specific areas of interest to Google. The groups of signals in the above image were shown to me and a small group of SEO professionals by Google spokesperson Gary Illyes at the Search Marketing Summit in Sydney, Australia, in early 2019 in what was supposed to be a closed-door session. Gary’s one request was that we didn’t share this information, which is supposedly taught to Google’s engineers “on day one,” on Twitter or any other social media platform. However, this request was honored until just slightly after the session was completed, so I assume it is safe to share it here as well.
Not shown in this collection of algorithms are Personalization and Localization, which Google sometimes calls “Context and Settings.” Personalization is the dynamic adjustment of the search engine results page (SERP) based on your Google usage history. Similarly, Localization is the proactive adjustment of the SERP based on, well, your location. These two factors alone are enough to make tracking your web pages’ positions on Google frustrating.
Let’s look at the individual algorithms that make up Google’s ranking process.
TOPICALITY
Sometimes also referred to as “topical relevance,” this algorithm group’s function is perhaps the most crucial concept you must understand in Search Engine Optimization. If you, as the creator of a web page, want your content to appear for a given search result, then your web page must be about the topic searched for in the first place. This concept, for some, is Earth-shattering news.
Again, Google tells us precisely what they mean here: “The most basic signal that information is relevant is when a webpage contains the same keywords as your search query. If those keywords appear on the page, or if they appear in the headings or body of the text, the information is more likely to be relevant.”
However, before you run out and start stuffing your content with the same words repeatedly, which was, honest to goodness, an SEO strategy for years, it’s essential to understand this next sentence. Google’s modern Machine Learning based algorithms use a variety of signals beyond just a word’s appearance in a piece of content to determine if a web page is a relevant answer to a question. Thanks to this collection of signals, the concept of “keyword density” no longer matters, if it ever did at all.
Counter to what some SEO bloggers have published, this does not mean that Google looks at engagement metrics such as click-through rate or bounce rate for every page to help determine its rank. Instead, Google’s Machine Learning system has learned from this kind of information for thousands of web pages in aggregate over time and then looks for similarities in other content. While this doesn’t mean you shouldn’t concern yourself with metrics such as click-through rate and bounce rate for your content, it does mean that Google isn’t tracking these metrics for every web page in its index to determine that web page’s rank. Instead, focus on these metrics because they are a good indication that the readers of that content find it useful.
QUALITY
When I took some of my first computer science classes in high school, one of my teachers attempted to demonstrate the complexities of a computer program by having her students tell her, acting as a computer, how to make a peanut butter and jelly sandwich. The teacher would sit at the front of the class with jars of peanut butter and jelly, a knife, and a bag of sandwich bread, and say, “Where do I start? Tell me, the computer, what to do.”
The first student to take on this challenge would usually say something like, “Ok, first put the peanut butter on the bread,” only to have the teacher grab the entire jar of peanut butter and place it on the bag of bread. After a few giggles, the students would understand that they first needed to tell the computer to open the bag of bread, take a slice of bread from the bag, open the jar of peanut butter, use the knife to obtain some peanut butter, and so on. Even now, I’m simplifying these instructions, as she would sometimes get hung up on using the twist tie on the bread bag.
My point here is that getting a computer program to replicate human activity is incredibly complex. So, you can imagine how difficult it is to try and teach an algorithm the definition of something as multifaceted as quality. Google’s Machine Learning systems are once again used to bridge the gap between our human opinions on the subject of quality and a machine’s interpretation of that opinion.
After it was leaked onto the web in late 2015, Google released its Search Quality Evaluator Guidelines to the public in their entirety. Leaks of this document had occurred a few times since 2008. In 2013, Google even released an abridged version in response to the continued leaks; however, this was the first time that Google responded by publishing the entire 160-page guide to the public.
When the Guidelines were released, some SEO professionals treated them like the Dead Sea Scrolls. To calm the SEO community down a bit, Google’s Ben Gomes stated in a 2018 interview with CNBC, “You can view the rater guidelines as where we want the search algorithm to go. They don’t tell you how the algorithm is ranking results, but they fundamentally show what the algorithm should do.”
It is incredibly important here to point out that the Search Quality Evaluator Guidelines are not ranking signals. Instead, these guidelines are used by actual humans to check the accuracy of Google’s algorithms so that the Machine Learning systems used to assess the complicated concept of quality can continue to learn and improve their results.
Once again, does this mean you shouldn’t concern yourself with the matters we’re about to discuss? Of course not. As you will see, what Google is looking for here is solid advice for anyone trying to create quality content.
You, as a web page creator, may try your best to follow all the guidance provided in those quality guidelines, as if you were checking items on a to-do list, and still not end up on the first page of results. Google’s Machine Learning algorithm doesn’t have a specific way to track all these elements; however, it can find similarities in other measurable areas and rank that content accordingly. Let’s discuss what those elements are in more detail.
Beneficial Purpose – Although not added to the guidelines until 2018, this aspect has become the top priority in the process of determining quality content at Google. The guidelines specifically state that “websites and pages should be created to help users.” If your page is trying to harm or deceive Google’s users or to make money with no effort to help users, then Google’s quality raters are not going to rate your content well. Chances are Google’s algorithms wouldn’t like it much either. Google is not against your website selling products or services; it’s just that you need to be helpful in the process. As Google’s John Mueller stated in a 2011 Webmaster Central blog post, content creators should focus on providing “the best possible user experience” rather than treating the various aspects of Google’s algorithms as a checklist.
Your Money or Your Life (YMYL) – Google has stated in its guidelines that the accuracy of some content must be judged more critically than other content. This type of material, which they refer to as “Your Money or Your Life” pages, can “impact a person’s future happiness, health, financial stability, or safety.” Quoting directly from Google’s guidelines, this content takes the form of the following:
• News and current events: news about important topics such as international events, business, politics, science, technology, etc. Keep in mind that not all news articles are necessarily considered YMYL (e.g., sports, entertainment, and everyday lifestyle topics are generally not YMYL). Please use your judgment and knowledge of your locale.
• Civics, government, and law: information important to maintaining an informed citizenry, such as information about voting, government agencies, public institutions, social services, and legal issues (e.g., divorce, child custody, adoption, creating a will, etc.).
• Finance: financial advice or information regarding investments, taxes, retirement planning, loans, banking, or insurance, particularly webpages that allow people to make purchases or transfer money online.
• Shopping: information about or services related to research or purchase of goods/services, particularly webpages that allow people to make purchases online.
• Health and safety: advice or information about medical issues, drugs, hospitals, emergency preparedness, how dangerous an activity is, etc.
• Groups of people: information about or claims related to groups of people, including but not limited to those grouped on the basis of race or ethnic origin, religion, disability, age, nationality, veteran status, sexual orientation, gender or gender identity.
• Other: there are many other topics related to big decisions or important aspects of people’s lives which thus may be considered YMYL, such as fitness and nutrition, housing information, choosing a college, finding a job, etc. Please use your judgment.
To sum this up, if you’re trying to share facts, not opinions, about a topic, Google is going to take the evaluation of this content seriously, and so should you.
Expertise, Authoritativeness, Trustworthiness (E-A-T) – A close, yet less uptight, cousin to the YMYL content mentioned above, the concept of E-A-T has become a source of considerable discussion in the SEO community since its release. Research on the matter will reveal numerous explanations of the idea along with a few well-meaning articles on “How to Write E-A-T Content for Google” and the like. If you don’t understand the words “expertise,” “authoritativeness,” and “trustworthiness,” feel free to seek these blog posts out, as defining (and redefining) the words themselves seems to be their favorite pastime.
That said, the lesson you should learn from Google’s inclusion of these terms in their Guidelines is that Google is looking at more than your ability to construct a proper sentence when it comes to the concept of quality. Therefore, your content must prove that you have a respectable level of understanding of a given topic (Expertise), that others in your industry or community agree with your understanding by citing you as an expert (Authoritativeness), and that few disagree with that authority (Trustworthiness).
E-A-T is a dynamic concept. Someone writing a guide on foods you can grill during BBQ season doesn’t need to meet the same content standards as someone writing about cancer research. As Google states in their Guidelines, “Keep in mind that there are high E-A-T pages and websites of all types, even gossip websites, fashion websites, humor websites, forum and Q&A pages, etc.”
As I stated earlier, there are numerous blogs and slide presentations that try to turn E-A-T into a checklist of tactics (SEO professionals love a good list); however, the best way to learn about this concept is through example. Luckily, you can read the same standards that Google provides to its quality raters in the Guidelines themselves (specifically, section 4.6, “Examples of High Quality Pages,” in the 2019 edition of the Search Quality Evaluator Guidelines).
Just remember that these examples were written by humans, for humans, who are attempting to teach a computer to do their job. Avoid getting overly fixated on certain details or trying to attach a specific quantity to the quality raters’ actions. For example, when some SEOs read the section header “A Satisfying Amount of High-Quality Main Content” in these Guidelines, they try to assign a specific number of words that need to be written or prove that “longer is better,” but that is simply not the case. As Google’s John Mueller and a bevy of other Google employees will tell you, “Write for the readers, not us.”
PAGERANK
One of Google’s oldest algorithms, PageRank, is charged with evaluating the quality of inbound links to a website. Google’s long-held idea that “if other prominent websites link to the [web] page, that has proven to be a good sign that the information is well trusted,” is one of the things that has set the search engine apart from its competitors.
While many SEO tools and bloggers love to question the importance of links in Google’s algorithms, according to Google, they are still very much a part of the equation. Occasionally, in blog posts on the matter, an SEO will state that links have a “high correlation” to ranking, which is kind of a silly statement when Google has indicated that it has used inbound links in its algorithm since 1998. Claiming you confirmed this is like saying you figured out a Manhattan cocktail uses bourbon when there are recipes readily available.
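The idea that a link from a prominent page counts for more than a link from an obscure one can be sketched with the simplified, textbook form of PageRank. This is an illustration of the published concept only, not Google’s production system; the four-page link graph, the function name `pagerank`, and the iteration count are assumptions, and the damping factor of 0.85 is simply the value conventionally used in textbook treatments.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to; every
    link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: a, b, and d all link to c; c links back to a.
example_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example_links)
```

In this small graph, page `c` ends up with the highest score because three pages link to it, and page `a` outranks `b` and `d` because its single inbound link comes from the highly ranked `c`.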
RANKBRAIN
Introduced in 2015, RankBrain is, well, complicated. According to Danny Sullivan, when he still worked for Search Engine Land (he works for Google now), RankBrain is “mainly used as a way to interpret the searches that people submit to find pages that might not have the exact words that were searched for.”
Every day, Google processes something like three billion searches. Of those searches, anywhere from fifteen to twenty-five percent have never been done before. Let that sink in a little. That means that every day, Google sees 450 million to 750 million searches for the first time. While that may seem daunting, many of those previously unknown searches are close to inquiries made before. That’s where RankBrain gets involved.
Google had systems in place before to help with this sort of thing. Early in its history, it was able to start understanding the similarities between words like “bird” and “birds” through a process called stemming, that is, reducing a word down to its word stem, or root form. Additionally, in 2012 Google introduced the Knowledge Graph, which is a database of known facts that lets it quickly answer questions like, “Who was the third President of the United States?” (Thomas Jefferson) without having to refer you to a website. The Knowledge Graph also allowed Google to understand the connections to other facts.
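Stemming is easy to picture with a deliberately naive sketch. The suffix list and the function name `naive_stem` are assumptions made for illustration; real stemmers use far more careful rules, and nothing here reflects Google’s actual implementation.

```python
# Illustrative suffixes only, checked in this order.
SUFFIXES = ("ing", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this rule set, “bird” and “birds” both reduce to “bird,” and “searches” and “searching” both reduce to “search,” so a query using one form can be matched against pages using another.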
For instance, as Sullivan illustrated, “you can do a search like ‘when was the wife of Obama born’ and get an answer about Michelle Obama… without ever using her name.”
RankBrain was designed to take these concepts even further by looking for similarities between new and old searches, or, as Greg Corrado, a senior search scientist at Google, put it, “That phrase seems like something I’ve seen in the past, so I’m going to assume that you meant this.”
Some SEO bloggers claim that the introduction of RankBrain to the algorithm set was the point when Google first started understanding what SEO professionals call “search intent,” that is, Google’s alignment of search results with users’ purpose for searching. However, nothing officially reported by Google upon the release of RankBrain confirms this theory. There is also a lot of conjecture in the SEO community about whether you can really “optimize” for this algorithm or not.
Additionally, many SEO professionals focused on this algorithm’s importance in the overall collection of algorithms. A quoted Google representative said it was the third most important after “links” and “words,” as Sullivan put it, which we can safely assume to be PageRank and Topicality, respectively. I wouldn’t concern yourself too much with these theories; instead, just be thankful that you don’t need to write content with every possible variation of a word to appear for relevant searches.
SITE SPEED/CORE WEB VITALS
In 2010, Google first started using how quickly a web page loads on a desktop computer as a ranking signal. Why? Because slow-loading web pages are bad for business for both Google and the web page owners. In 2018, Google expanded this focus on site speed to include mobile web pages as well, further proving that they are not messing around in this area.
In early 2020, Google introduced a new set of tools to its Google Search Console, called the Core Web Vitals, and stated explicitly that the metrics found there would become ranking signals starting in 2021. Core Web Vitals absorbed the site speed metrics looked at previously and, thankfully, expanded into new areas that needed to be addressed (such as the sloppy way some web pages load advertisements and other images).
MOBILE
Depending on which study you read, anywhere from 60% to 70% of all searches start on a mobile device. Google has been pushing mobile-friendliness as a ranking signal since at least 2015; however, in 2018, they made it official by focusing on what they call the “Mobile-First Index” process. “Mobile-First” does not mean “mobile-only,” but instead that Google now looks at the mobile version of your website first during the process of evaluating your website.
After the introduction of this algorithm, the days of website owners not concerning themselves with mobile-friendliness were officially over. While the “Mobilegeddon” update wasn’t the bloodbath that the SEO press made it out to be, the importance of mobile cannot be overstated here. There have been numerous articles written on the subject of mobile optimization, most of them by the legendary Cindy Krum, who literally wrote the book on the subject, so I won’t spend any time on what to do here. Just know that it needs to be done.
Google’s focus on Site Speed and Mobile is a great example of the search engine forcing website owners to do what they should have been doing for years: making their websites easier to use. This requirement is kind of like the government telling you to wear a seat belt when riding in an automobile – you should do it because it keeps you safe and is a smart thing to do, but sometimes, people just want their “freedom” (to be ejected through the windshield of their car). As a website owner, you should want to make your website fast-loading and functional on mobile devices because your customers want that, but that wasn’t happening as much as it should. So, Google said, “if you want to show up in organic search results, you should do these things,” and now more website owners concern themselves with these matters.
PULLING IT ALL TOGETHER: HOW GOOGLE REALLY RANKS WEB PAGES
Now that you know all the various aspects of Google’s ranking system, here’s the most important lesson: how Google combines these multiple algorithms to determine the rank of a given web page. Most would assume that Google assigns a score for each of these areas and then simply adds them up for a total score that equates to rank for a given query, but they would be using the incorrect arithmetic operator. In fact, according to Gary Illyes, on that beautiful day in Sydney, Google assigns a score for each of these areas and then multiplies those scores for a total score that determines the rank for a web page in the results for a given query. To see why this is important, one has only to remember the difference in outcomes for 1 plus 0.1 versus 1 multiplied by 0.1, which are 1.1 and 0.1, respectively.
This mind-blowing news means that there is no specific priority for these various algorithms in the grand scheme of things, because a single weak score drags down the whole product. One could spend all their time making sure that their website was the fastest amongst their competitors but get dragged down by low-quality content. You could spend all your marketing budget on the best writers for your content only to be ranked lower because your website wasn’t optimized for mobile devices. There is no silver bullet when it comes to Search Engine Optimization.
IN SUMMARY
If you learn anything from this discussion of how search engines work, remember this: Search Engine Optimization is about doing all the marketing, website design, and public relations tactics that roll up to form SEO. These tactics aren’t about writing a specific number of words in an article or attempting to trick Google by abusing canonical tags, but about creating excellent content for your target audience on a properly built website. Search Engines are incredibly complex systems, but they are systems designed to bring out our best efforts. SEO is work, and lots of it.
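To make the add-versus-multiply point from this section concrete, here is a small sketch comparing the two ways of combining scores. Everything in it is hypothetical: the score values, the 0-to-1 scale, and the function names are assumptions for illustration, since Google has not published how it actually scores or combines these areas beyond the multiplication described above.

```python
from math import prod


def additive_score(scores):
    """Combine per-area scores by adding them (the common assumption)."""
    return sum(scores.values())


def multiplicative_score(scores):
    """Combine per-area scores by multiplying them (as described above)."""
    return prod(scores.values())


# Hypothetical scores on a 0-to-1 scale for two competing pages.
balanced = {"topicality": 0.7, "quality": 0.7, "pagerank": 0.7,
            "speed": 0.7, "mobile": 0.7}
lopsided = {"topicality": 1.0, "quality": 1.0, "pagerank": 1.0,
            "speed": 1.0, "mobile": 0.1}

for name, page in (("balanced", balanced), ("lopsided", lopsided)):
    print(name, round(additive_score(page), 3),
          round(multiplicative_score(page), 3))
```

Under addition, the lopsided page wins despite its terrible mobile score. Under multiplication, that one weak area pulls its total below that of the page that is merely decent everywhere, which is why no single area can be safely ignored.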