A simple API providing a curated set of metrics about the health of a webpage from a technical SEO point of view. Hopefully this API will act as a starting point for any engineer who would like to experiment and learn more about extracting insights from web pages for the purposes of SEO or testing.
GitHub Repository

Create an instance of an Express server, set it to use our `apiRouter` on the `/api` path, and listen on a port.
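A minimal sketch of that entry point (the file layout and port number are assumptions):

```ts
import express from 'express';
import { apiRouter } from './api-router'; // hypothetical module exporting our router

const app = express();

// Mount the API router on the /api path.
app.use('/api', apiRouter);

// Listen on a port (3000 is an arbitrary choice here).
app.listen(3000, () => console.log('API listening on port 3000'));
```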
Create an instance of Express Router and add a `get` route matcher on the `/page-health` path that uses our main handler function. This now makes your API route `/api/page-health`.
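And the router itself, assuming a hypothetical `pageHealthHandler` module:

```ts
import { Router } from 'express';
import { pageHealthHandler } from './page-health-handler'; // hypothetical handler module

export const apiRouter = Router();

// GET /api/page-health (the router itself is mounted on /api).
apiRouter.get('/page-health', pageHealthHandler);
```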
Now let's create our handler function. It needs to read the `?url=` query string parameter and validate it. If it checks out, we can pass it to our render service for processing; otherwise we send an error response so the user can correct the faulty parameter.

Here is how we validate the URL: the URL class throws an error when the passed string cannot be constructed into a URL instance, so it's an easy way to validate a string.
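Putting both together, a sketch of the handler and its validation (`renderPage` and the error shape are placeholders):

```ts
import { Request, Response } from 'express';
import { renderPage } from './render-service'; // hypothetical rendering service

// Returns the parsed URL when valid, or null when the string
// cannot be constructed into a URL instance.
function validateUrl(input: string): URL | null {
  try {
    return new URL(input);
  } catch {
    return null;
  }
}

export async function pageHealthHandler(req: Request, res: Response) {
  const url = validateUrl(String(req.query.url ?? ''));
  if (!url) {
    // Faulty or missing ?url= parameter: ask the user to correct it.
    res.status(400).json({ error: 'Please provide a valid ?url= parameter' });
    return;
  }
  // Hand the validated URL to the render service for processing.
  res.json(await renderPage(url.href));
}
```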
Now that we are done with the general scaffolding for the API side, let's build some software!
Our core service for the app is the rendering service, with a few simple duties: load the requested URL in headless Chrome, run each metric against the rendered page and its response, and return the collected results.

But, as always, the devil is in the details and there are a few gotchas involved:
We don't want to run too many of the APIs Chrome comes with out of the box, as they will slow down our service. To overcome this, we will use custom `launchOptions` for Puppeteer's `launch` function, so we can pass in a set of Chrome switches that disable most of what we won't need.

We won't go into much detail here, but you can check out the list of switches passed into Chrome for disabling various features and Web APIs. You can also see the full list of args that Chrome takes.
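As a sketch, the launch call could look like the following; the switches shown are just a few common examples, not the complete list used here:

```ts
import puppeteer from 'puppeteer';

// Somewhere inside our rendering service (an async context).
const browser = await puppeteer.launch({
  headless: true,
  args: [
    '--disable-extensions',            // skip the extension machinery
    '--disable-default-apps',
    '--disable-background-networking',
    '--disable-sync',
    '--mute-audio',
    '--no-first-run',
  ],
});
const page = await browser.newPage();
```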
We don't want to request any potentially large assets on the page, especially if they don't help us in the calculation of our metrics. So we need to abort the requests that are made to such resources.
We also don't want our test to run any analytics events, so we would need to create a list of typical analytics servers too.
And finally, we need to also block any request made to well known advertising servers, as they just slow down the whole process.
Now we can easily check whether the request URL or its ResourceType matches any of these lists. If so, we can `.abort()` the request; otherwise we simply `.continue()` it.
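A sketch of the interception logic, with purely illustrative blocklists:

```ts
// Hypothetical blocklists: resource types we never need, plus
// hostnames of typical analytics and advertising servers.
const BLOCKED_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);
const BLOCKED_HOSTS = ['google-analytics.com', 'doubleclick.net']; // examples only

await page.setRequestInterception(true);

page.on('request', (request) => {
  const url = new URL(request.url());
  const blocked =
    BLOCKED_TYPES.has(request.resourceType()) ||
    BLOCKED_HOSTS.some((host) => url.hostname.endsWith(host));

  // Abort anything on our blocklists; let everything else through.
  if (blocked) {
    request.abort();
  } else {
    request.continue();
  }
});
```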
This constitutes all of the heavy lifting that needs to be done before we can finally add some metrics to our API.
We want all of our metrics to have the same blueprint, as we simply want to pass them all as an array into our rendering service and get a consistent set of results from them. To do so we need to create a blueprint abstract class which they will all extend.
The base metric will set a few standards for our metrics:

- access to the `Page` and `Response` objects
- an abstract `getMetricValue` method, which provides the `name` and the final value of that metric; this is then used by the `getMetric` method from `BaseMetric` to add the other information we need for all the metrics
The easiest way of determining whether a link is internal is to check that it is not external. We simply do this by checking if the href's hostname is different from the current page's hostname; see the `isExternalLink` method below.
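A sketch of that check (resolving relative hrefs against the page URL is an assumption):

```ts
// A link is external when its hostname differs from the hostname
// of the page we are auditing.
function isExternalLink(href: string, pageUrl: string): boolean {
  try {
    return new URL(href, pageUrl).hostname !== new URL(pageUrl).hostname;
  } catch {
    return false; // unparseable hrefs are left to the health checker
  }
}
```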
After this we need 4 pieces of information about our links:

- the `href` attribute
- the attached event listeners
- the text content
- the health of the link
There is no "easy" way to extract all the event listeners attached to an element (inline or from script) without doing lots of work. Not to worry: the Chrome DevTools Protocol (CDP) can come to the rescue.
How, you ask? CDP, amongst other tools, has access to Chrome's console API. The console API contains a function in the global namespace called `getEventListeners`. This function is available when you open the developer toolbar in Chrome, but normally isn't part of the global namespace. It takes a node as an argument and returns an object containing all the event listeners attached to that element, making CDP an ideal choice for extracting our link data.
You can create a CDP session through Puppeteer's page object, which is perfect for us since we decided all our metrics will have access to the Page and Response objects. To make this a little more encapsulated, we can wrap all the functionality we need from CDP into a class called `CDPSessionClient`, so we won't pollute our metric class with its implementation details.
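A sketch of such a wrapper; the method names are our own invention, but the CDP commands are real. Note that `DOMDebugger.getEventListeners` wants a runtime `objectId`, which we resolve from the `nodeId` first:

```ts
import { Page, CDPSession } from 'puppeteer';

// Thin wrapper that keeps CDP plumbing out of the metric classes.
export class CDPSessionClient {
  private session!: CDPSession;

  async init(page: Page) {
    this.session = await page.target().createCDPSession();
    await this.session.send('DOM.enable');
  }

  // Interleaved [name, value, name, value, ...] array from CDP.
  async getAttributes(nodeId: number): Promise<string[]> {
    const { attributes } = await this.session.send('DOM.getAttributes', { nodeId });
    return attributes;
  }

  async getEventListeners(nodeId: number) {
    // Resolve the DOM nodeId into a runtime objectId first.
    const { object } = await this.session.send('DOM.resolveNode', { nodeId });
    return this.session.send('DOMDebugger.getEventListeners', {
      objectId: object.objectId!,
    });
  }

  async getOuterHTML(nodeId: number): Promise<string> {
    const { outerHTML } = await this.session.send('DOM.getOuterHTML', { nodeId });
    return outerHTML;
  }
}
```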
So now our `InternalLinks` metric class can extract the first three pieces of information for us. For health checking we will create our `LinkHealthChecker` class.
Getting the `href` attribute: for this we will use the `DOM.getAttributes` command in CDP. The response of this command is a single array containing attribute names and their values interleaved (yeah, a little weird, but it will do). So all we have to do is find the index of `href` in that response array, and the value at the next index will be the href value.
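A small helper along these lines (the name is hypothetical):

```ts
// DOM.getAttributes returns one flat array: [name, value, name, value, ...].
function getHref(attributes: string[]): string | undefined {
  // Walk name positions only (even indexes), so an attribute *value*
  // that happens to equal "href" cannot fool us.
  for (let i = 0; i < attributes.length; i += 2) {
    if (attributes[i] === 'href') return attributes[i + 1];
  }
  return undefined;
}
```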
Getting event listeners: as discussed earlier, we will use the debugger function for this through the `DOMDebugger.getEventListeners` command. Notice that unlike the normal DOM API, CDP mainly operates on nodeId (a unique DOM node identifier).
Getting text content: CDP does not have a command to get the text content of a node, so we use cheerio for that. Using `DOM.getOuterHTML` we can get the HTML of the link; however, this still contains markup, so by parsing it with cheerio we can use the `.text()` method to get the combined text content.
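For instance (`getTextContent` is a hypothetical helper name):

```ts
import * as cheerio from 'cheerio';

// The outerHTML of a link still contains markup, e.g.
// '<a href="/about"><span>About</span> us</a>'.
// Parsing it with cheerio lets .text() collapse it to "About us".
function getTextContent(outerHTML: string): string {
  const $ = cheerio.load(outerHTML);
  return $.root().text().trim();
}
```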
This class uses the raw information we gathered using CDP. The health checker considers a link healthy when it has an href that is not `#` based or `javascript:` based.
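A minimal sketch of the checker under those rules:

```ts
// A link counts as healthy when it has an href that is neither
// '#'-based nor 'javascript:'-based.
export class LinkHealthChecker {
  isHealthy(href: string | undefined): boolean {
    if (!href) return false;                 // no href at all
    if (href.startsWith('#')) return false;  // '#'-based pseudo link
    if (href.trim().toLowerCase().startsWith('javascript:')) return false;
    return true;
  }
}
```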
With simple metrics like this we don't have to go too deep into CDP, as we can simply rely on Puppeteer's high-level API.
Two things make a page NoIndex for bots:

- a `<meta name='robots|googlebot|bingbot....' content='noindex'/>` tag on the page
- an `X-Robots-Tag: noindex` in the response headers
As you can see, the robots meta tag is not mononymous (it can be named robots, googlebot, bingbot, and so on), so the easiest way to detect meta noindex is by looking for meta tags whose `content` attribute value is `noindex`. We can easily do this by passing a page function into `page.evaluate`.

For the headers, we need to use Puppeteer's response object, look for the `X-Robots-Tag` key, and check if its value is `noindex`.
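A sketch of both checks (header names come back lower-cased from Puppeteer's `headers()` map):

```ts
// Meta check runs in the page context via page.evaluate.
const metaNoIndex = await page.evaluate(() =>
  Array.from(document.querySelectorAll('meta[content]')).some((meta) =>
    (meta.getAttribute('content') ?? '').toLowerCase().includes('noindex')
  )
);

// Header check reads Puppeteer's response headers.
const headerNoIndex = (response.headers()['x-robots-tag'] ?? '')
  .toLowerCase()
  .includes('noindex');

const isNoIndex = metaNoIndex || headerNoIndex;
```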
One of the newer Web APIs is the Performance API, which provides access to much of the performance-related information you would want about a page. Each piece of information in this API is called an entry, and each entry has a type, which is the group the entry belongs to.
For our purposes we are focusing on page rendering performance, whose metrics belong to the entryType called `paint`. This is the category that contains metrics such as `first-paint` and `first-contentful-paint` (FCP).
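Reading those entries through `page.evaluate` might look like this:

```ts
// Each paint entry has a name ('first-paint' or 'first-contentful-paint')
// and a startTime in milliseconds.
const paintMetrics = await page.evaluate(() =>
  performance
    .getEntriesByType('paint')
    .map(({ name, startTime }) => ({ name, startTime }))
);
// e.g. [{ name: 'first-paint', startTime: 120.4 },
//       { name: 'first-contentful-paint', startTime: 180.7 }]
```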
As far as FCP goes, the lower the start time the better. Lighthouse uses a particular scoring system for this, which you may adopt.
Puppeteer's request object contains the chain of requests; however, if there were no redirects and the request was successful, the chain will be empty. For SEO purposes it's always nice to have the final link in the chain included too, so we always add the final link ourselves.
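A sketch using `redirectChain()`:

```ts
// Build the chain of redirect URLs, then append the final URL ourselves,
// since redirectChain() is empty when there were no redirects.
const chain = response
  .request()
  .redirectChain()
  .map((request) => request.url());

chain.push(response.url()); // always include the final link in the chain
```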
What makes a page responsive is a complex debate; however, for simplicity we propose that the minimum requirement for being responsive is for the width of the page to follow the width of the device.

So based on this, if a page at least has a viewport meta tag whose content attribute contains `width=device-width`, then we consider the page to be responsive.
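A sketch of that check:

```ts
// Minimal responsiveness check: does the viewport meta tag ask the
// page width to follow the device width?
const isResponsive = await page.evaluate(() => {
  const viewport = document.querySelector('meta[name="viewport"]');
  return (viewport?.getAttribute('content') ?? '').includes('width=device-width');
});
```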
For robots we will be using a library called robots-parser. This tool is especially useful when your site has a large robots.txt file and finding the specific rule that matches a URL can be hard.
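Basic usage looks roughly like this (the URLs are placeholders, and `robotsTxt` is assumed to have been fetched already):

```ts
import robotsParser from 'robots-parser';

const robots = robotsParser('https://example.com/robots.txt', robotsTxt);

robots.isAllowed('https://example.com/some/page', 'Googlebot'); // true or false
robots.getMatchingLineNumber('https://example.com/some/page');  // which rule matched
```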
You may wish to extend this so the API also allows you to pass custom user-agents to the parser, or even swap it out for the newly open-sourced Google robots.txt parser. That parser is written in C++, so you would have to use a child process or a tool like shell.js to interact with the C++ binary. You could also use node-gyp to create a native Node C++ add-on that calls `robots_main` directly from source. I may share some of these techniques in future posts.
For extracting schema, we use a library called Web Auto Extractor, which can extract Microdata, RDFa-lite, JSON-LD and some other random meta tags.
In future posts we may show how to validate this extracted data against actual schema definitions
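The extraction itself is tiny; in this sketch, `html` is assumed to come from Puppeteer's `page.content()`:

```ts
import WAE from 'web-auto-extractor';

const parsed = WAE().parse(html);

// parsed groups the structured data by format, e.g.
// parsed.jsonld, parsed.microdata, parsed.rdfa, parsed.metatags
```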
Puppeteer's response object contains a utility method for the status code, which is perfect for adding to our metrics.
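For example, a status metric under our blueprint could be as small as this (`StatusCodeMetric` is a hypothetical name):

```ts
import { BaseMetric } from './base-metric'; // hypothetical path to our blueprint

// The whole implementation is one call to Puppeteer's high-level API.
export class StatusCodeMetric extends BaseMetric {
  async getMetricValue() {
    return { name: 'status-code', value: this.response.status() };
  }
}
```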
Node's `natural` library has a good implementation of TF-IDF, with which we can extract the list of all the terms in the corpus in order of their importance. For our purposes we can limit this to the first 10 only.
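A sketch with the `natural` package (`corpus` is assumed to be the page's text content):

```ts
import natural from 'natural';

const tfidf = new natural.TfIdf();
tfidf.addDocument(corpus);

// listTerms(0) returns every term in document 0 ordered by importance;
// we keep only the top ten.
const topTerms = tfidf
  .listTerms(0)
  .slice(0, 10)
  .map((t) => ({ term: t.term, score: t.tfidf }));
```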
🏁 I hope you have enjoyed this post and find it useful. Feel free to change, extend and add your own metrics.