
10 Steps to Boost Your Site’s Crawlability and Indexability 

Nick Eubanks
Last Updated: Apr. 10, 2024

When you think of the first visitor to your website, who do you picture?

Most people assume it’s a potential customer who’s interested in your product or service, but actually, it’s a search engine crawler.

When I share this information with site owners, many don’t understand what I mean.

Here’s the thing: If Google and other search engines can’t crawl and index your website, even the best SEO practices won’t help you succeed.

So, I wrote this article to help you better understand what happens behind the scenes when you publish content on your website.

Let me start by explaining what crawlability and indexability are, and how they’re connected to SEO. 

What Is Crawlability?

Crawlability describes how easily search engine crawlers, like Googlebot, can access and navigate your website’s pages.

If your website is open for crawling and no errors are detected in the robots.txt file (I’ll talk about that later), your website content will go through the following steps before ranking on Google:

  1. Discovering
  2. Crawling
  3. Rendering 
  4. Indexing 
  5. Ranking

Crawlability is essential in this 5-step indexing process because crawled pages usually get into the Google index and become searchable. From there, you can take steps to improve your SEO.

On the other hand, if your website is not crawlable, it won’t be indexed. As a result, it won’t appear in search results.

What Is Indexability?

Indexability describes whether your website’s pages can be added to Google’s massive database of web pages (the Google index) and made searchable.

Once the crawling and rendering steps are completed, your web pages will be added to the Google index, making them eligible to appear in search results.

Only indexed pages can show up in organic search results! 

Why Crawlability and Indexability Are Important for SEO

If your website pages are not crawled and indexed, they won’t rank. 

No ranking — no organic website traffic. That’s pretty straightforward. 

Even though dealing with indexability requires some technical skills and Google Search Console (GSC) knowledge, it’s a must-do task for every site owner who wants to grow organic search as a traffic channel.

Remember that if your site is closed for indexing or can’t be easily indexed, some web pages may be omitted from organic search results, making it difficult for users to find your content through search engines. Therefore, I recommend prioritizing indexability issues to ensure all your web pages can be successfully discovered on Google. 

If you wonder whether your website has crawling or indexing issues, here’s how you can quickly check it. 

How to Check If Your Website Has Been Crawled 

If you have a verified domain property in Google Search Console, check the Crawl stats report for any crawling issues.

If you see sudden crawl spikes or drops, 4xx status codes, or any other anomalies, it’s a sign to take a closer look, since your website might have crawling issues.

GSC server response

Alternatively, you can use SEO tools like Semrush, Ahrefs, and Screaming Frog to check your site’s crawlability. 

For instance, Semrush’s Site Audit tool has a dedicated crawling report. 

Whether you run a site audit for your website or competitors, you can see all crawled pages, crawl depth (or click depth), page loading speed, crawl budget waste, and more. 

What’s great about Semrush is that you can check all crawling, indexing, and other technical errors for every website page separately. 

If you regularly audit your website — let’s say once a week — you can proactively discover and fix all the issues. 

Semrush site audit

There’s one more thing Semrush does exceptionally well — it helps you fix technical errors by showing hints and explaining how to fix every issue.

It’s a very convenient feature that helps website owners save time and manage technical issues independently from developers. 

Semrush Site audit how to

Keep in mind that the robots.txt file controls crawling.

If you don’t have a robots.txt file on your website, Google will try to crawl the entire site.

That’s bad for SEO because your website’s crawl budget will be wasted on irrelevant pages, such as the WordPress admin area, admin directories, or test servers, to name a few.

It may also happen that some web pages won’t be crawled since no crawl budget is left. Therefore, every website should have a robots.txt file telling search engine crawlers what can and can’t be crawled. 

One last piece of advice: don’t disallow JavaScript and CSS files in the robots.txt file, since Google won’t be able to see your website content properly. That can lead to Google flagging your pages as not mobile-friendly.

How to Check If Your Website Has Been Indexed

If your web pages are indexed, they’re eligible to rank in search results.

However, indexing doesn’t always happen automatically. Therefore, you should occasionally check your website indexing to ensure new web pages are added to the index while old pages remain searchable. 

If you have a verified domain property in Google Search Console, you can see all indexed and not indexed pages in the Page indexing report. That’s the most reliable data because Google provides it. 

As your website grows, the number of indexed pages on the graph should grow with it.

However, Google can exclude the following pages and files from indexing:

  • Images in WebP format (because they aren’t indexed as HTML pages)
  • Pages with poor mobile optimization
  • Duplicate pages

Therefore, I recommend checking this GSC report weekly to ensure you don’t miss any essential updates. 

GSC Indexing status

You can also check whether an individual page is indexed using an inspection tool in Google Search Console. I frequently use this strategy to check whether recently published blog posts are added to the index. 

GSC URL Crawl information

Alternatively, you can use the “site:” search operator to check any website’s indexed pages on Google.

It’s an easy and effective method that lets you see which pages are indexed and roughly how many pages a competitor has in organic search results.

Google site search

Get Your Website Crawled and Indexed

Now comes the most exciting part of this article. 

Implementing the following 10 steps will help ensure your website is successfully crawled and indexed.

I developed these strategies based on my experience working with startups, SaaS, and B2B companies. 

You can definitely find more practices, tips, and strategies online. However, not every strategy will yield the same results. Therefore, critically assess your time and resources and only listen to expert advice! 

Step #1: Create an XML Sitemap

An XML sitemap is like a roadmap for search engine crawlers, showing which pages of your website should be crawled and indexed.

I’ve come across the controversial claim that a sitemap isn’t necessary because crawlers can find all web pages on their own, provided the site architecture is set up correctly.

That can be true for relatively small websites with a click depth of three or fewer. But what about giant websites with thousands of pages? Even if you follow a strict internal linking strategy, it won’t guarantee the absence of orphan pages.

You can easily avoid this issue by creating a simple XML sitemap and placing it in the root folder on your website. 

If you run your website on a content management system (CMS) such as WordPress, SEO plugins like Yoast SEO or All in One SEO will help you create and update a sitemap.

Or, you can ask your web developer for help. 

If your website has a sitemap, you’ll find it at the following URL: site.com/sitemap.xml
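
If you’d like a quick, scripted look at what your sitemap declares, here’s a minimal Python sketch. It assumes the third-party requests package and a standard <urlset> sitemap; the example.com URL is a placeholder for your own domain.

  # Minimal sketch: fetch a sitemap and list the URLs it declares.
  import xml.etree.ElementTree as ET
  import requests  # third-party: pip install requests

  SITEMAP_URL = "https://www.example.com/sitemap.xml"  # placeholder domain
  NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

  response = requests.get(SITEMAP_URL, timeout=10)
  response.raise_for_status()

  urls = [loc.text for loc in ET.fromstring(response.content).findall(".//sm:loc", NS)]
  print(f"{len(urls)} URLs listed in the sitemap")
  for url in urls[:10]:
      print(url)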

XML Sitemap

Step #2: Create a Robots.txt File

Since many people don’t get the meaning of the robots.txt file (which plays a pretty important role), I’ll try to explain it creatively. 

Robots.txt is like a backstage pass at a huge concert hall, granting or denying access to certain areas. This way, journalists (crawlers) know where they can go and what photos they can take (which pages they can crawl).

If there is no backstage pass and the concert hall is open for anyone, journalists can access all areas and photograph everything on the go. 

The same will happen to your website if it does not have a robots.txt file — you won’t have control over what’s crawled and indexed on your website. Eventually, you might find admin panels, thank you pages, test servers, and more in the Google index. 

If you still don’t have a robots.txt file on your website, you can easily add it using SEO plugins for CMS. Or you can ask a web developer to place a robots.txt file in the root folder on your website.

Here’s how you can check the robots.txt file on any website: site.com/robots.txt
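
If you want to test how your robots.txt rules treat specific URLs, here’s a minimal sketch using only the Python standard library; the example.com URLs are placeholders.

  # Minimal sketch: check whether Googlebot is allowed to crawl given URLs,
  # based on the live robots.txt file.
  from urllib.robotparser import RobotFileParser

  parser = RobotFileParser("https://www.example.com/robots.txt")  # placeholder
  parser.read()

  for url in ["https://www.example.com/", "https://www.example.com/wp-admin/"]:
      allowed = parser.can_fetch("Googlebot", url)
      print(("ALLOWED " if allowed else "BLOCKED ") + url)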

If you want to dig deeper, Semrush has an in-depth guide on robots.txt and why it’s essential for SEO. 

robots txt

Step #3: View Rendered HTML File 

In web development, rendering refers to the process of taking HTML, CSS, and JavaScript code and displaying it as a visually appealing web page that users can interact with.

So, a rendered HTML file is the HTML code of a web page after it has been processed and displayed by a web browser.

Why do I recommend checking a rendered HTML file? Because Google also fetches CSS, JavaScript, and multimedia resources while rendering a web page. If something isn’t accessible to Google or returns errors, Google may flag your web pages as not mobile-friendly.

Eventually, your website may lose rankings and organic traffic. 

Google has an excellent article about the rendered source of a page if you want to learn more about it. 

Here is how you can check a rendered HTML file in Google Search Console:

  1. Inspect your target URL via Google Search Console.
  2. Click “Test live URL.” 
  3. Click “View tested page.”

The HTML tab will show you the rendered HTML file.

If you navigate to the “More info” tab, you’ll see various page resources that didn’t load for some reason. 

GSC Rendered HTML File
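
The GSC live test is the authoritative check, but if you want a rough scripted approximation, the sketch below lists the CSS and JavaScript resources a page references and flags any that fail to load or are blocked by robots.txt. It assumes the third-party requests and beautifulsoup4 packages, and the page URL is a placeholder.

  # Rough sketch: flag page resources that return errors or are blocked
  # by robots.txt (an approximation of the GSC resource check, not a
  # substitute for it).
  from urllib.parse import urljoin
  from urllib.robotparser import RobotFileParser
  import requests  # third-party: pip install requests
  from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

  PAGE_URL = "https://www.example.com/"  # placeholder

  robots = RobotFileParser(urljoin(PAGE_URL, "/robots.txt"))
  robots.read()

  soup = BeautifulSoup(requests.get(PAGE_URL, timeout=10).text, "html.parser")
  resources = [urljoin(PAGE_URL, tag.get("href") or tag.get("src"))
               for tag in soup.find_all(["link", "script"])
               if tag.get("href") or tag.get("src")]

  for resource in resources:
      status = requests.get(resource, timeout=10).status_code
      blocked = not robots.can_fetch("Googlebot", resource)
      if status != 200 or blocked:
          print(f"{status}  blocked={blocked}  {resource}")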

Step #4: Check Crawl Logs

If you think your website has crawling issues but haven’t been able to find them using the methods described above, check your server’s crawl logs. However, this method is a bit technical.

To download your log files (access.log), you need to access your web server via the File Transfer Protocol (FTP) using a client such as FileZilla. Log files are usually located in the “/logs/” or “/access_log/” folder.

Once you have the log file, you should retrieve the data from it. 

The data contained in the log file is plain text, so you can open it with any text editor.
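
As a quick illustration, here’s a minimal Python sketch that counts which URLs Googlebot requested in a raw access.log. It assumes the common/combined log format and uses only the standard library; the file name is a placeholder.

  # Minimal sketch: count Googlebot requests per URL in an access.log
  # written in the common/combined log format.
  from collections import Counter

  hits = Counter()
  with open("access.log", encoding="utf-8", errors="ignore") as log:  # placeholder path
      for line in log:
          if "Googlebot" not in line:
              continue
          parts = line.split('"')
          if len(parts) > 1:
              request = parts[1].split()  # e.g. ['GET', '/page/', 'HTTP/1.1']
              if len(request) >= 2:
                  hits[request[1]] += 1

  for path, count in hits.most_common(20):
      print(f"{count:>6}  {path}")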

However, I recommend using the Semrush Log File Analyzer, which generates detailed and visually appealing reports from your log file. 

Furthermore, Semrush has a detailed guide to help you download the log file. You can also contact the Semrush support team for further assistance with log files and data retrieval. 

Semrush Log File Analyzer

Step #5: Use Canonicals 

If your website has internal duplicate content, it will be challenging for Google to define what pages should be indexed and ranked.

Eventually, Google can randomly pick pages for ranking. However, those pages may not be the ones you want Google to index and rank. 

Every page on your website should have a self-referencing canonical tag. 

This way, you won’t send mixed or wrong signals to Google. Instead, search engine bots will know exactly what pages should be crawled, indexed, and ranked. 
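
For a quick spot check, the sketch below fetches a page and confirms whether its canonical tag points back to the page itself. It assumes the third-party requests and beautifulsoup4 packages; the URL is a placeholder.

  # Minimal sketch: check for a self-referencing canonical tag.
  import requests  # third-party: pip install requests
  from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

  url = "https://www.example.com/blog/sample-post/"  # placeholder
  soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

  canonical = soup.find("link", rel="canonical")
  if canonical is None:
      print("No canonical tag found")
  elif canonical.get("href", "").rstrip("/") == url.rstrip("/"):
      print("Self-referencing canonical:", canonical["href"])
  else:
      print("Canonical points elsewhere:", canonical["href"])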

If you believe your website has duplicate content, here is a detailed guide about canonicals and how to set them up correctly. 

Step #6: Check HTTP Status Requests 

Your web pages should return a 200 OK status code every time your web browser (the client) sends a request to the server. 

A 200 OK status code means the server received, understood, and accepted the request. So you, as a user, can successfully browse a web page. 

On the other hand, the 404 Not Found or 410 Gone status codes indicate the requested resources are unavailable and web pages aren’t accessible to users. 

Why is this important?

Displaying 404 or any other 4xx status code can cause Google to drop your rankings and eventually remove your web pages from the index. 

That’s why I recommend regularly checking your website HTTP status codes to ensure all essential pages return a 200 OK status. 
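
Between audits, you can run a quick scripted check on your most important URLs. The sketch below assumes the third-party requests package, and the URLs are placeholders.

  # Minimal sketch: batch-check HTTP status codes for key URLs.
  import requests  # third-party: pip install requests

  urls = [
      "https://www.example.com/",          # placeholders; list your key pages
      "https://www.example.com/pricing/",
      "https://www.example.com/blog/",
  ]

  for url in urls:
      try:
          status = requests.get(url, allow_redirects=False, timeout=10).status_code
      except requests.RequestException as error:
          status = f"ERROR: {error}"
      note = "" if status == 200 else "  <-- check this"
      print(f"{status}  {url}{note}")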

You can check all the technical issues, including 4xx errors, in the Semrush Site Audit report. The tool prioritizes the most critical issues so you’ll know what to handle first. 

Semrush site audit issues

The Google Search Console Pages report also shows 4xx errors on your website. 

I recommend checking this report daily to track your website performance and quickly manage new technical issues. 

GSC pages not indexed

Screaming Frog is another tool you can use to crawl your website and quickly discover HTTP status code errors. This feature is available even on the free plan.

Screaming Frog response codes

Step #7: Implement Internal Linking 

You might have heard that a smart internal linking strategy helps users understand where they are on the website and quickly navigate to the target web page.

However, internal linking isn’t only about user experience.

Internal links pass link juice from one page to another, boosting the organic performance of the target pages (which you will monitor in your SEO reports). 

Furthermore, search engine crawlers follow internal links to re-crawl existing pages and crawl new ones. So, you can significantly improve the indexing of your website if your web pages are properly interlinked. 

Search Engine Crawling Process

How do you improve internal linking? 

The following strategies have proven to work for my SEO clients’ websites. Therefore, I recommend focusing on these two: 

  1. Ensure all related articles, categories, and products are interlinked
  2. Review the click depth and add related internal links to web pages with a click depth over three

You can use any SEO tool to analyze and fix internal linking issues. However, I recommend using Semrush for its simplicity and powerful toolkit. 

Semrush has a dedicated internal linking report where you can see the click depth for all web pages, the number of incoming and outgoing internal links, and the prioritized list of issues with actionable recommendations. 

Semrush internal linking report
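
If you’d like a rough, scripted look at click depth before opening a dedicated tool, here’s a small breadth-first crawl sketch. It assumes the third-party requests and beautifulsoup4 packages; the start URL and page limit are placeholders, and it’s far simpler than a production crawler (no robots.txt handling, no politeness delays).

  # Rough sketch: estimate click depth with a small breadth-first crawl
  # of internal links, starting from the homepage.
  from collections import deque
  from urllib.parse import urljoin, urlparse
  import requests  # third-party: pip install requests
  from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

  START_URL = "https://www.example.com/"  # placeholder
  MAX_PAGES = 200                         # keep the sketch small
  domain = urlparse(START_URL).netloc

  depths = {START_URL: 0}
  queue = deque([START_URL])

  while queue and len(depths) < MAX_PAGES:
      page = queue.popleft()
      try:
          html = requests.get(page, timeout=10).text
      except requests.RequestException:
          continue
      for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
          url = urljoin(page, link["href"]).split("#")[0]
          if urlparse(url).netloc == domain and url not in depths:
              depths[url] = depths[page] + 1
              queue.append(url)

  for url, depth in sorted(depths.items(), key=lambda item: item[1]):
      if depth > 3:
          print(f"depth {depth}: {url}")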

Step #8: Check for Redirect Loops

I once accidentally broke a client’s website by incorrectly editing the .htaccess file. So I recommend that you never edit this file if you’re unsure of the consequences. 

Guess what I did wrong to break a website in a few minutes? 

I created an endless chain of redirects, preventing the browser from successfully loading the WordPress login page. Simply put, we lost access to the website, and no one could log in.

I was particularly afraid of one thing. If the issue with redirect loops wasn’t resolved for a while, Google could downgrade the website’s rankings and eventually remove the previously ranking pages from the index.

I’ve just described a critical scenario where the entire website is gone. However, your website might have internal redirect loops you are unaware of …

Several proven strategies exist for discovering redirect loops before they harm your website’s rankings.

First, you can use Semrush Site Audit to discover and analyze all redirect loops. 

The “Why and how to fix it” links in the report will guide you on how to fix these issues.

Semrush site audit redirect chains and loops

Second, you can use the Screaming Frog crawler.

Here’s how to do it:

  1. Crawl your website using Screaming Frog
  2. Choose the “Response Codes” tab
  3. Filter by “Redirection (3xx)”
  4. Select all links using the Ctrl+A combination
  5. View the source of redirection by choosing the “Inlinks” tab
  6. Click “Reports” > “Redirects” > “Redirect Chains” 

Once you complete all these steps, you can download a CSV file showing you all redirect chains and loops on your website. 

Screaming Frog redirect chains
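
For a quick, scripted sanity check on a single URL, the sketch below follows a redirect chain hop by hop and flags loops or overly long chains. It assumes the third-party requests package; the URL is a placeholder.

  # Minimal sketch: follow a redirect chain manually and detect loops.
  from urllib.parse import urljoin
  import requests  # third-party: pip install requests

  def trace_redirects(url, max_hops=10):
      seen = []
      while len(seen) < max_hops:
          seen.append(url)
          response = requests.get(url, allow_redirects=False, timeout=10)
          if response.status_code not in (301, 302, 303, 307, 308):
              return seen, "ok"
          url = urljoin(url, response.headers["Location"])
          if url in seen:
              return seen + [url], "redirect loop"
      return seen, "too many hops"

  chain, verdict = trace_redirects("https://www.example.com/old-page/")  # placeholder
  print(verdict)
  print(" -> ".join(chain))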

Step #9: Check Robots Meta Tags

The meta robots tag instructs search engine crawlers whether they can index and display your website’s content in search results. 

The most common directives used in the meta robots tag are “index/noindex” and “follow/nofollow.”

If a web page is open for crawling but has a “noindex” tag, search engine bots will crawl it but won’t add it to the index.

Therefore, use the meta robots tag carefully and ensure “noindex” tags are removed once they are no longer needed.
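
Here’s a minimal sketch that scans a handful of URLs for a “noindex” directive in either the meta robots tag or the X-Robots-Tag response header. It assumes the third-party requests and beautifulsoup4 packages; the URLs are placeholders.

  # Minimal sketch: flag pages that carry a "noindex" directive.
  import requests  # third-party: pip install requests
  from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

  urls = [
      "https://www.example.com/",               # placeholders
      "https://www.example.com/blog/new-post/",
  ]

  for url in urls:
      response = requests.get(url, timeout=10)
      header = response.headers.get("X-Robots-Tag", "")
      meta = BeautifulSoup(response.text, "html.parser").find(
          "meta", attrs={"name": "robots"})
      content = meta.get("content", "") if meta else ""
      if "noindex" in content.lower() or "noindex" in header.lower():
          print(f"NOINDEX    {url}")
      else:
          print(f"indexable  {url}")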

You can check all web pages excluded by the “noindex” tag in your Google Search Console account.

GSC robots meta tags

Step #10: Optimize Site Speed

Page loading speed is part of the user experience, which affects rankings. 

Even though poor page loading speed won’t get your website excluded from the index, I strongly recommend analyzing all technical errors related to site speed.

In particular, you should regularly check the Core Web Vitals components, including:

  • LCP: Largest Contentful Paint
  • INP (replaced FID in March 2024): Interaction to Next Paint
  • CLS: Cumulative Layout Shift

If you have any critical errors on your website, discuss them with your developer and SEO team. 

The PageSpeed Insights tool developed by Google can show you the current state of your website and recommend actionable page speed improvements. All you have to do is plug your URL into the search bar and hit “Analyze.”

Ensure you use the reports for mobile since Google has switched to mobile-first indexing. 
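
If you want to pull these numbers programmatically, the PageSpeed Insights API (v5) exposes similar data. The sketch below requests the mobile report and prints whatever field metrics are available; it assumes the third-party requests package, the URL is a placeholder, and low-traffic pages may return no field data at all.

  # Minimal sketch: query the PageSpeed Insights v5 API for mobile field data.
  import requests  # third-party: pip install requests

  API = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"
  params = {"url": "https://www.example.com/", "strategy": "mobile"}  # placeholder URL

  data = requests.get(API, params=params, timeout=60).json()
  metrics = data.get("loadingExperience", {}).get("metrics", {})

  if not metrics:
      print("No field data returned for this URL")
  for name, values in metrics.items():
      print(name, "->", values.get("percentile"), f"({values.get('category')})")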

Page speed insights

If you use Semrush Site Audit, check the Core Web Vitals score per page and recommended improvements.

Semrush core web vitals

Take the Next Step to Improve Your Website’s Indexability

There you have it: 10 proven steps to ensure your website is properly crawled and indexed. 

Now, it’s your turn to take action!

Try Semrush’s Site Audit Tool to find, analyze, and fix technical errors.