A simple API providing a curated set of metrics about the health of a webpage from a technical SEO point of view. Hopefully this API will act as a starting point for any engineer who would like to experiment and learn more about extracting insights from web pages for the purposes of SEO or testing.
GitHub Repository

Create an instance of an Express server, set it to use our `apiRouter` on the `/api` path, and listen on a port.
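A minimal sketch of that entry point (the file layout and port number are assumptions):

```ts
import express from 'express';
import { apiRouter } from './api-router'; // hypothetical module exporting our router

const app = express();

// Mount the API router on the /api path.
app.use('/api', apiRouter);

// Listen on a port (3000 is an arbitrary choice here).
app.listen(3000, () => console.log('API listening on port 3000'));
```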
Create an instance of Express Router and add a `get` route matcher on the `/page-health` path that uses our main handler function. This now makes your API route `/api/page-health`.
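And the router itself, assuming a hypothetical `pageHealthHandler` module:

```ts
import { Router } from 'express';
import { pageHealthHandler } from './page-health-handler'; // hypothetical handler module

export const apiRouter = Router();

// GET /api/page-health (the router itself is mounted on /api).
apiRouter.get('/page-health', pageHealthHandler);
```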
Now let's create our handler function. It needs to read the `?url=` query string parameter and validate it. If it checks out, we can pass it to our render service for processing; otherwise we send an error response so the user can correct the faulty parameter.

Here is how we validate the URL: the URL class throws an error when the passed string cannot be constructed into a URL instance, so it's an easy way to validate a string.
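Putting both together, a sketch of the handler and its validation (`renderPage` and the error shape are placeholders):

```ts
import { Request, Response } from 'express';
import { renderPage } from './render-service'; // hypothetical rendering service

// Returns the parsed URL when valid, or null when the string
// cannot be constructed into a URL instance.
function validateUrl(input: string): URL | null {
  try {
    return new URL(input);
  } catch {
    return null;
  }
}

export async function pageHealthHandler(req: Request, res: Response) {
  const url = validateUrl(String(req.query.url ?? ''));
  if (!url) {
    // Faulty or missing ?url= parameter: ask the user to correct it.
    res.status(400).json({ error: 'Please provide a valid ?url= parameter' });
    return;
  }
  // Hand the validated URL to the render service for processing.
  res.json(await renderPage(url.href));
}
```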
Now that we are done with the general scaffolding for the API side, let's build some software!
Our core service for the app is the rendering service, with a few simple duties: load the requested URL in headless Chrome, run each metric against the rendered page and its response, and return the collected results.

But, as always, the devil is in the details and there are a few gotchas involved:
We don't want to run too many of the APIs Chrome comes with out of the box, as they will slow down our service. To overcome this, we will use custom `launchOptions` for Puppeteer's `launch` function, so we can pass in a set of Chrome switches that disable most of what we won't need.

We won't go into much detail here, but you can check out the list of switches passed into Chrome for disabling various features and Web APIs. You can also see the full list of args that Chrome takes.
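As a sketch, the launch call could look like the following; the switches shown are just a few common examples, not the complete list used here:

```ts
import puppeteer from 'puppeteer';

// Somewhere inside our rendering service (an async context).
const browser = await puppeteer.launch({
  headless: true,
  args: [
    '--disable-extensions',            // skip the extension machinery
    '--disable-default-apps',
    '--disable-background-networking',
    '--disable-sync',
    '--mute-audio',
    '--no-first-run',
  ],
});
const page = await browser.newPage();
```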
We don't want to request any potentially large assets on the page, especially if they don't help us in the calculation of our metrics. So we need to abort the requests that are made to such resources.
We also don't want our test to run any analytics events, so we would need to create a list of typical analytics servers too.
And finally, we need to also block any request made to well known advertising servers, as they just slow down the whole process.
Now we can easily check whether the request URL or its ResourceType matches any of these lists. If so, we can `.abort()` the request; otherwise we simply `.continue()` it.
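A sketch of the interception logic, with purely illustrative blocklists:

```ts
// Hypothetical blocklists: resource types we never need, plus
// hostnames of typical analytics and advertising servers.
const BLOCKED_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);
const BLOCKED_HOSTS = ['google-analytics.com', 'doubleclick.net']; // examples only

await page.setRequestInterception(true);

page.on('request', (request) => {
  const url = new URL(request.url());
  const blocked =
    BLOCKED_TYPES.has(request.resourceType()) ||
    BLOCKED_HOSTS.some((host) => url.hostname.endsWith(host));

  // Abort anything on our blocklists; let everything else through.
  if (blocked) {
    request.abort();
  } else {
    request.continue();
  }
});
```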
This constitutes all of the heavy lifting that needs to be done before we can finally add some metrics to our API.
We want all of our metrics to have the same blueprint, as we simply want to pass them all as an array into our rendering service and get a consistent set of results from them. To do so we need to create a blueprint abstract class which they will all extend.
The base metric will set a few standards for our metrics:

- access to the `Page` and `Response` objects
- an abstract `getMetricValue` method, which provides the `name` and the final value of that metric; this is then used by the `getMetric` method from `BaseMetric` to add the other information we need for all the metrics
The easiest way of determining whether a link is internal is to check that it is not external. We simply do this by checking if the href's hostname is different from the current page's hostname; see the `isExternalLink` method below.
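A sketch of that check (resolving relative hrefs against the page URL is an assumption):

```ts
// A link is external when its hostname differs from the hostname
// of the page we are auditing.
function isExternalLink(href: string, pageUrl: string): boolean {
  try {
    return new URL(href, pageUrl).hostname !== new URL(pageUrl).hostname;
  } catch {
    return false; // unparseable hrefs are left to the health checker
  }
}
```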
After this we need 4 pieces of information about our links:

- the `href` attribute
- the attached event listeners
- the text content
- the health of the link
There is no "easy" way to extract all the event listeners attached to an element (inline or from script) without doing lots of work. Not to worry: the Chrome DevTools Protocol (CDP) can come to the rescue.
How, you ask? CDP, amongst other tools, has access to Chrome's console API. The console API contains a function in the global namespace called `getEventListeners`. This function is available when you open the developer toolbar in Chrome, but normally isn't part of the global namespace. It takes a node as an argument and returns an object containing all the event listeners attached to that element, making CDP an ideal choice for extracting our link data.
You can create a CDP session through Puppeteer's page object, which is perfect for us since we decided all our metrics will have access to the Page and Response objects. To make this a little more encapsulated, we can wrap all the functionality we need from CDP into a class called `CDPSessionClient`, so we won't pollute our metric class with its implementation details.
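A sketch of such a wrapper; the method names are our own invention, but the CDP commands are real. Note that `DOMDebugger.getEventListeners` wants a runtime `objectId`, which we resolve from the `nodeId` first:

```ts
import { Page, CDPSession } from 'puppeteer';

// Thin wrapper that keeps CDP plumbing out of the metric classes.
export class CDPSessionClient {
  private session!: CDPSession;

  async init(page: Page) {
    this.session = await page.target().createCDPSession();
    await this.session.send('DOM.enable');
  }

  // Interleaved [name, value, name, value, ...] array from CDP.
  async getAttributes(nodeId: number): Promise<string[]> {
    const { attributes } = await this.session.send('DOM.getAttributes', { nodeId });
    return attributes;
  }

  async getEventListeners(nodeId: number) {
    // Resolve the DOM nodeId into a runtime objectId first.
    const { object } = await this.session.send('DOM.resolveNode', { nodeId });
    return this.session.send('DOMDebugger.getEventListeners', {
      objectId: object.objectId!,
    });
  }

  async getOuterHTML(nodeId: number): Promise<string> {
    const { outerHTML } = await this.session.send('DOM.getOuterHTML', { nodeId });
    return outerHTML;
  }
}
```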
So now our `InternalLinks` metric class can extract the first three pieces of information for us. For health checking we will create our `LinkHealthChecker` class.
Getting the `href` attribute: for this we will use the `DOM.getAttributes` command in CDP. The response of this command is a single array containing attribute names and their values interleaved (yeah, a little weird, but it will do). So all we have to do is find the index of `href` in that response array, and the value at the next index will be the href value.
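A small helper along these lines (the name is hypothetical):

```ts
// DOM.getAttributes returns one flat array: [name, value, name, value, ...].
function getHref(attributes: string[]): string | undefined {
  // Walk name positions only (even indexes), so an attribute *value*
  // that happens to equal "href" cannot fool us.
  for (let i = 0; i < attributes.length; i += 2) {
    if (attributes[i] === 'href') return attributes[i + 1];
  }
  return undefined;
}
```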
Getting event listeners: as discussed earlier, we will use the debugger function for this through the `DOMDebugger.getEventListeners` command. Notice that unlike the normal DOM API, CDP mainly operates on nodeId (a unique DOM node identifier).
Getting text content: CDP does not have a command to get the text content of a node, so we use cheerio for that. Using `DOM.getOuterHTML` we can get the HTML of the link; however, this still contains markup, so by parsing it with cheerio we can use the `.text()` method to get the combined text content.
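For instance (`getTextContent` is a hypothetical helper name):

```ts
import * as cheerio from 'cheerio';

// The outerHTML of a link still contains markup, e.g.
// '<a href="/about"><span>About</span> us</a>'.
// Parsing it with cheerio lets .text() collapse it to "About us".
function getTextContent(outerHTML: string): string {
  const $ = cheerio.load(outerHTML);
  return $.root().text().trim();
}
```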
This class uses the raw information we gathered using CDP. The health checker considers a link healthy when it has an href that is not `#` based or `javascript:` based.
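A minimal sketch of the checker under those rules:

```ts
// A link counts as healthy when it has an href that is neither
// '#'-based nor 'javascript:'-based.
export class LinkHealthChecker {
  isHealthy(href: string | undefined): boolean {
    if (!href) return false;                 // no href at all
    if (href.startsWith('#')) return false;  // '#'-based pseudo link
    if (href.trim().toLowerCase().startsWith('javascript:')) return false;
    return true;
  }
}
```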
With simple metrics like this we don't have to go too deep into CDP, as we can simply rely on Puppeteer's high-level API.
Two things make a page NoIndex for bots:

- a `<meta name='robots|googlebot|bingbot....' content='noindex'/>` tag on the page
- an `X-Robots-Tag: noindex` in the response headers
As you can see, the robots meta tag is not mononymous (it can be named robots, googlebot, bingbot, and so on), so the easiest way to detect meta noindex is by looking for meta tags whose `content` attribute value is `noindex`. We can easily do this by passing a page function into `page.evaluate`.

For the headers, we need to use Puppeteer's response object, look for the `X-Robots-Tag` key, and check if its value is `noindex`.
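A sketch of both checks (header names come back lower-cased from Puppeteer's `headers()` map):

```ts
// Meta check runs in the page context via page.evaluate.
const metaNoIndex = await page.evaluate(() =>
  Array.from(document.querySelectorAll('meta[content]')).some((meta) =>
    (meta.getAttribute('content') ?? '').toLowerCase().includes('noindex')
  )
);

// Header check reads Puppeteer's response headers.
const headerNoIndex = (response.headers()['x-robots-tag'] ?? '')
  .toLowerCase()
  .includes('noindex');

const isNoIndex = metaNoIndex || headerNoIndex;
```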
One of the newer Web APIs is the Performance API, which provides access to much of the performance-related information you would want about a page. Each piece of information in this API is called an entry, and each entry has a type, which is the group the entry belongs to.
For our purposes we are focusing on page rendering performance, whose metrics belong to the entryType called `paint`. This is the category that contains metrics such as `first-paint` and `first-contentful-paint` (FCP).
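Reading those entries through `page.evaluate` might look like this:

```ts
// Each paint entry has a name ('first-paint' or 'first-contentful-paint')
// and a startTime in milliseconds.
const paintMetrics = await page.evaluate(() =>
  performance
    .getEntriesByType('paint')
    .map(({ name, startTime }) => ({ name, startTime }))
);
// e.g. [{ name: 'first-paint', startTime: 120.4 },
//       { name: 'first-contentful-paint', startTime: 180.7 }]
```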
As far as FCP goes, the lower the start time the better. Lighthouse uses a particular scoring system for this, which you may adopt.
Puppeteer's request object contains the chain of requests; however, if there were no redirects and the request was successful, the chain will be empty. For SEO purposes it's always nice to have the final link in the chain included too, so we always add the final link ourselves.
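A sketch using `redirectChain()`:

```ts
// Build the chain of redirect URLs, then append the final URL ourselves,
// since redirectChain() is empty when there were no redirects.
const chain = response
  .request()
  .redirectChain()
  .map((request) => request.url());

chain.push(response.url()); // always include the final link in the chain
```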
What makes a page responsive is a complex debate; however, for simplicity we propose that the minimum requirement for being responsive is for the width of the page to follow the width of the device.

So based on this, if a page at least has a viewport meta tag whose content attribute contains `width=device-width`, then we consider the page to be responsive.
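A sketch of that check:

```ts
// Minimal responsiveness check: does the viewport meta tag ask the
// page width to follow the device width?
const isResponsive = await page.evaluate(() => {
  const viewport = document.querySelector('meta[name="viewport"]');
  return (viewport?.getAttribute('content') ?? '').includes('width=device-width');
});
```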
For robots we will be using a library called robots-parser. This tool is especially useful when your site has a large robots.txt file and finding the specific rule that matches a URL can be hard.
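Basic usage looks roughly like this (the URLs are placeholders, and `robotsTxt` is assumed to have been fetched already):

```ts
import robotsParser from 'robots-parser';

const robots = robotsParser('https://example.com/robots.txt', robotsTxt);

robots.isAllowed('https://example.com/some/page', 'Googlebot'); // true or false
robots.getMatchingLineNumber('https://example.com/some/page');  // which rule matched
```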
You may wish to extend this so the API also allows you to pass custom user-agents to the parser, or even swap it out for the newly open-sourced Google robots.txt parser. That parser is written in C++, so you would have to use a child process or a tool like shell.js to interact with the C++ binary. You could also use node-gyp to create a native Node C++ add-on that calls `robots_main` directly from source. I may share some of these techniques in future posts.
For extracting schema, we use a library called Web Auto Extractor, which can extract Microdata, RDFa-lite, JSON-LD and some other random meta tags.
In future posts we may show how to validate this extracted data against actual schema definitions
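The extraction itself is tiny; in this sketch, `html` is assumed to come from Puppeteer's `page.content()`:

```ts
import WAE from 'web-auto-extractor';

const parsed = WAE().parse(html);

// parsed groups the structured data by format, e.g.
// parsed.jsonld, parsed.microdata, parsed.rdfa, parsed.metatags
```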
Puppeteer's response object contains a utility method for the status code, which is perfect for adding to our metrics.
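For example, a status metric under our blueprint could be as small as this (`StatusCodeMetric` is a hypothetical name):

```ts
import { BaseMetric } from './base-metric'; // hypothetical path to our blueprint

// The whole implementation is one call to Puppeteer's high-level API.
export class StatusCodeMetric extends BaseMetric {
  async getMetricValue() {
    return { name: 'status-code', value: this.response.status() };
  }
}
```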
Node's `natural` library has a good implementation of TF-IDF, with which we can extract the list of all the terms in the corpus in order of their importance. For our purposes we can limit this to the first 10 only.
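A sketch with the `natural` package (`corpus` is assumed to be the page's text content):

```ts
import natural from 'natural';

const tfidf = new natural.TfIdf();
tfidf.addDocument(corpus);

// listTerms(0) returns every term in document 0 ordered by importance;
// we keep only the top ten.
const topTerms = tfidf
  .listTerms(0)
  .slice(0, 10)
  .map((t) => ({ term: t.term, score: t.tfidf }));
```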
🏁 I hope you have enjoyed this post and find it useful. Feel free to change, extend and add your own metrics.