Hamlet Batista – Search Engine Land

How to audit sites inside corporate networks
Mon, 28 Sep 2020

Turn private (VPN required) URLs into temporary public ones that allow for page changes while hiding the content to preserve its privacy.

The post How to audit sites inside corporate networks appeared first on Search Engine Land.

There is a common problem when auditing staging enterprise sites inside corporate networks. 

If you work in-house, you first connect to the corporate network using a VPN client. Then, you need to run auditing tools to review the pages. 

The only tools that work are the ones you can run directly from your computer. For example, the Screaming Frog SEO Spider, a downloadable desktop program.

However, many enterprise sites have millions of pages, which makes crawling from your computer impractical due to time constraints or machine resources.

Enterprise cloud-based crawlers like DeepCrawl, Ryte and Oncrawl are better suited for this type of work, but they are not able to audit sites inside private networks.

This also leaves out many other valuable tools, like the URL Inspection tools from Google and Bing, which are critical for auditing JavaScript-driven content.

If you work agency-side, you have the extra complication that security and privacy compliance is now a requirement to work with enterprises. It is common to have to complete extensive security questionnaires before you are even considered as a vendor.

The content in the staging site inside the private network might not be ready to be opened to the public. 

Introducing network admin tools for SEO

In previous articles, I’ve mentioned the importance of being aware of tools and techniques used in the development and IT industries. In this article I’m going to continue to make the case for that.

Let me introduce a couple of tools that are familiar to network and system administrators: ngrok and mitmproxy.

We can use ngrok to turn private (VPN required) URLs into temporary, public ones. We can use mitmproxy to modify the pages on the fly, hiding or obfuscating content to preserve its privacy. This requires writing simple Python scripts.

Proxies and HTTP Tunnels

Before I dive in and play with the tools, let me go over their underlying concepts. 


“When navigating through different networks of the Internet, proxy servers and HTTP tunnels are facilitating access to content on the World Wide Web. A proxy can be on the user’s local computer, or anywhere between the user’s computer and a destination server on the Internet. This page outlines some basics about proxies and introduces a few configuration options.”

Proxies and HTTP tunnels are standard approaches to relaying requests and pages, making them available from one source site to another. Please review the linked article to learn more about the topic.

Ngrok creates HTTP tunnels, and mitmproxy is a reverse proxy.

These two use cases are a good fit for solving the problems I mentioned at the start.

Using Ngrok

Ngrok creates HTTP tunnels and is super simple to set up and use.

Let’s say your staging site is https://staging.internal-network.net:8080 and you are only able to open the page after you connect using the VPN client. 

You could expose this site temporarily so you can verify site ownership in Google Search Console and Bing Webmaster Tools, and run the URL Inspection tools (or enterprise crawlers) on the exposed URLs.

Here is how you do that:

  1. Download and install ngrok for your Mac or Windows PC. 
  2. Open a terminal window and launch ngrok. 

Ngrok is a command line tool, so you need to run it in a shell and pass parameters to make it work.

Now let’s create the HTTP tunnel and temporary URL.

./ngrok http staging.internal-network.net:8080 > ngrok.log 2>&1 &

Here I am asking ngrok to expose the web server that is only accessible from my computer at port 8080. I added output redirection to log any errors to ngrok.log, and the trailing ampersand runs the process in the background so I can keep typing commands.

tail ngrok.log

The log shows nothing, which means the tunnel should be working fine. Next, I need to get the public URL that was generated.

I need to make an API call to the service, which returns a JSON response that I need to parse. We are going to simplify this part by downloading another handy command-line tool, jq.

Assuming you also have curl, you can get the temporary URL with this command.

curl -s http://localhost:4040/api/tunnels | jq ".tunnels[0].public_url"

You should get a public ngrok.io URL that you can open in your web browser.


After you open it, you will see the internal site. Try using the Rich Results Test on it (the URL you get, not this example) and it should work. How cool is that?

As you don’t own the ngrok.io domain, you need to take an extra step in order to register with Google Search Console and Bing Webmaster Tools. 

You need to create an account and register a custom domain that you control. 

Before you create the tunnel, you need to authenticate.

./ngrok authtoken <token>

Then, you add another parameter to specify the custom domain while you create the tunnel.

./ngrok http -hostname=dev.yourdomain.com staging.internal-network.net:8080 > ngrok.log 2>&1 &

You will be able to register this subdomain and run the URL inspection tools (or your favorite enterprise crawler).

Using Mitmproxy

So, we learned to expose staging sites inside the corporate network using temporary public URLs. But, what if we couldn’t risk making the content public and inadvertently reveal unannounced news that could hurt a publicly listed company?

One option is to layer in a reverse proxy and use it to hide or obfuscate any private information in the HTML and/or images to preserve the company’s privacy.

Mitmproxy is an awesome proxy that, among other things, allows you to modify the HTTP traffic going through it on the fly, even HTTPS traffic, which is encrypted!

You can make simple text replacements from the command line, or arbitrary modifications by writing simple Python scripts.

Mitmproxy can operate in several modes; we are interested in its reverse proxy mode.

It is a Python package, so you can install it using pip:

pip install mitmproxy

Then call it like this:

mitmproxy -P 8081 --mode reverse:https://staging.internal-network.net:8080

Let me illustrate this powerful technique with one example. 

I’m going to reverse-proxy StackOverflow and change the text in their H1 from “People” to “SEOs.”

mitmproxy -P 8081 --mode reverse:https://stackoverflow.com/ --modify-body '/ people who code/ SEOs who code'

Let’s open the browser on http://localhost:8081 and see if it works.

Kaboom! Now tell me this isn’t exciting stuff :)

The idea is to replace any text or images that shouldn’t be exposed publicly.
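For replacements that go beyond what `--modify-body` can express, the logic can live in a small Python function wired into a mitmproxy addon script. Here is a minimal sketch; the term list and function names are hypothetical examples, not from the article:

```python
import re

# Hypothetical terms that should never leave the private network.
SENSITIVE_TERMS = ["Project Phoenix", "Q3 revenue"]

def redact(body):
    """Replace any sensitive string in an HTML body with a placeholder."""
    for term in SENSITIVE_TERMS:
        body = re.sub(re.escape(term), "[REDACTED]", body, flags=re.IGNORECASE)
    return body

# In a mitmproxy addon script (loaded with: mitmproxy -s redact.py ...),
# you would call redact() from the response event hook:
#
#   def response(flow):
#       if flow.response.text:
#           flow.response.text = redact(flow.response.text)
```

The same function could be extended to swap out image URLs for placeholders before the pages are exposed through the tunnel.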

You would then run ngrok, instructing it to connect to this reverse proxy at port 8081 instead of directly to the source server.

./ngrok http -hostname=dev.yourdomain.com localhost:8081 > ngrok.log 2>&1 &

The “mitm” in mitmproxy stands for man-in-the-middle attack, an information security concept describing an intercepting device or element in a two-way conversation. This device can sniff or tamper with the information transmitted.

As you can imagine, this could be used for nefarious purposes. Fortunately, in our case, we want to use it for good. We want to hide/obfuscate sensitive information from internal pages before exposing them publicly with ngrok.


How to evaluate content quality with BERT
Wed, 08 Jul 2020

Use a code-free deep learning toolkit to review grammar in your posts; it can serve as one of several proxies for content quality.

The post How to evaluate content quality with BERT appeared first on Search Engine Land.

Marie Haynes recently had a really insightful podcast interview with John Mueller.

I specifically enjoyed the conversation about BERT and its potential for content quality evaluation.

“M 26:40 – .. Is Google using BERT now to better understand now whether content is good?”

“J 27:00 – … It’s not so much to understand the quality of the content but more to understand what is this content about, what is this sentence about, what is this query about …”

Google has repeatedly said that BERT helps it understand natural language better. Assessing content quality the way humans do is still fairly complicated for machines.

“M 28:54 – … could Google treat that as a negative to say ‘oh this page looks like it was SEO-ed, these keywords are here for Google’ and make that an actual detriment to the page?”

“J 29:41 – … they’re just adding thousands of variations of the same keywords to a page and then our keyword stuffing algorithm might kick in and say well actually this looks like keyword stuffing …”

On the other hand, keyword stuffing is something that is easier for machines to spot. One way to check is to see if the text is written in a nonsensical way.

“J 29:41 – … But I guess with regards to BERT one of the things that that could be done because a lot of these algorithms are open-sourced, there’s a lot of documentation and reference material around them, is to try things out and to take some of this SEO text and throw it into one of these algorithms and see does the primary content get pulled out, are the entities able to be recognized properly and it’s not one to one the same to how we would do it because I’m pretty sure our algorithms are based on similar ideas but probably tuned differently but it can give you some insight into is this written in such a way that it’s actually too confusing for a system to understand what it is that they’re writing about.”

This is the part that got me excited. Trying this out is a great idea and precisely what we will do in this article.

Britney Muller from Moz shared a really good idea and Python notebook with the code to test it. 

We can use BERT, fine-tuned on the Corpus of Linguistic Acceptability (CoLA) dataset, for single-sentence classification.

This model can help us determine which sentences are grammatically correct and which aren’t. It could be used as one of several proxies for content quality. 

It is obviously not foolproof, but it can point us in the right direction.

Fine-tuning BERT on CoLA

The Colab notebook linked in Britney’s tweet is far too advanced for non-experts, so we are going to take a drastic shortcut!

We are going to use Ludwig, a very powerful and code-free deep learning toolkit from Uber, to do the same.

Here are the technical steps:

  1. Fetch a target page and extract the text.
  2. Split it into sentences.
  3. Use our model to predict whether each sentence is grammatically correct or not.
  4. Calculate and report the counts of grammatically correct and incorrect sentences.
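Steps 3 and 4 boil down to a small reporting helper. This sketch assumes a `predict` callable that wraps whatever model you trained; the function names here are my own, not from the notebook:

```python
def grammar_report(sentences, predict):
    """Run the grammar model over each sentence and tally the results.

    `predict` should return True when a sentence is grammatically
    acceptable, False otherwise.
    """
    incorrect = [s for s in sentences if not predict(s)]
    return {
        "total": len(sentences),
        "incorrect": len(incorrect),
        "incorrect_sentences": incorrect,
    }
```

With the CoLA-tuned model plugged in as `predict`, this produces the per-page counts reported later in the article.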

First, let’s build our predictive model.

I coded a simple-to-follow Google Colab notebook with all the steps.

Copy the notebook to your Google Drive and change the runtime type to GPU.

You can use the form at the top to test the code on your own articles. You might need to change the CSS selector to extract relevant text per target page. The one included works with SEL articles.

You should be able to run all the cells (one at a time) and see the evaluation in action.

Building the predictive model

When you compare the original notebook to the one that I created, you will find that we avoided having to write a lot of advanced deep learning code.

In order to create our cutting edge model with Ludwig, we need to complete four simple steps:

  1. Download and uncompress the CoLA dataset
  2. Create the Ludwig model definition with the appropriate settings and hyperparameters
  3. Run Ludwig to train the model
  4. Evaluate the model with held back data in the CoLA dataset

You should be able to follow each of these steps in the notebook. I will explain my choices here and some of the nuances needed to make it work.

Google Colab comes with TensorFlow 2.0 preinstalled, which is the latest version. But, Ludwig requires version 1.15.3.

Another important step is that you need to set up the GPU version of Tensorflow to finish the training quickly. 

We accomplish this with the next few lines of code:

!pip install tensorflow-gpu==1.15.3

%tensorflow_version 1.x
import tensorflow as tf; print(tf.__version__)

After this, you need to restart the runtime using the menu item: Runtime > Restart runtime.

Run the form again, then the line that imports pandas, and continue to the step where you need to install Ludwig.

The accuracy of the predictive model can vary widely, and it is heavily influenced by your choice of hyperparameters.

These are generally determined empirically by trial and error. To save time, I simply borrowed the ones from the Weights & Biases notebook.

In their visualization, the best combination results in a validation accuracy of 84%.

We added the same parameters to our model definition under the training section.


batch_size: 16
learning_rate: 0.00003
epochs: 3
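Putting those values in context, the model definition file might look roughly like this. This is a hedged sketch: the feature names and encoder option are assumptions on my part, and the exact schema depends on your Ludwig version, so check the Ludwig documentation before using it:

```yaml
input_features:
  - name: sentence
    type: text
    encoder: bert

output_features:
  - name: label
    type: binary

training:
  batch_size: 16
  learning_rate: 0.00003
  epochs: 3
```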

Next, we can train our BERT model on the CoLA dataset using a single command line.

!ludwig experiment --data_csv cola_dataset.csv --model_definition_file model_definition.yaml

We achieve a validation accuracy of 80%, which is slightly lower than the original notebook, but we put in significantly less effort!

Now, we have a powerful model that can classify sentences as grammatically correct or not.

I added additional code to the notebook to evaluate some test sentences, and it found 92 grammatically incorrect sentences out of 516.

The predictions on the grammatically incorrect sentences look pretty accurate.

Converting web pages to sentences to predict their grammatical correctness

Splitting text into sentences using regular expressions seems like a trivial thing to do, but there are many language nuances that make this approach impractical.

Fortunately, I found a fairly simple solution in this StackOverflow thread.
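If you want a dependency-free starting point, a regex-based splitter along the lines of that thread might look like this. It is a simplified sketch, not the exact StackOverflow code, and the abbreviation list is just an example:

```python
import re

# Periods in these abbreviations should not terminate a sentence.
_ABBREV = re.compile(r"\b(Mr|Mrs|Ms|Dr|Prof|St|vs|etc|e\.g|i\.e)\.", re.IGNORECASE)

def split_sentences(text):
    # Mask abbreviation periods with a placeholder so they survive the split.
    masked = _ABBREV.sub(lambda m: m.group(0).replace(".", "<prd>"), text)
    # Split after ., ! or ? when followed by whitespace and a capital or quote.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z\"'])", masked)
    # Restore the masked periods and trim whitespace.
    return [p.replace("<prd>", ".").strip() for p in parts if p.strip()]
```

A purpose-built library will handle more edge cases, but this is enough to feed sentences into the model.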

The technique works quite well. Now, we just need to feed these sentences to our grammar correctness predictive model.

Fortunately, it found only 4 of the 89 sentences in my last article grammatically incorrect.

Try this out on your own articles and let me know on Twitter how you do!


The dangers of misplaced third-party scripts
Thu, 09 Jan 2020

Most SEO tags, like the title and canonical, belong in the HTML HEAD; if they end up in the BODY, Google and other search engines will ignore them. Here's how to fix it.

The post The dangers of misplaced third-party scripts appeared first on Search Engine Land.

I was recently helping one of my team members diagnose a new prospective customer's site to find some low-hanging fruit to share with them.

When I checked their home page with our Chrome extension, I found a misplaced canonical tag. We added this type of detection a long time ago when I first encountered the issue.

What is a misplaced SEO tag, you might ask?

Most SEO tags like the title, meta description, canonical, etc. belong in the HTML HEAD. If they get placed in the HTML BODY, Google and other search engines will ignore them.

If you go to the Elements tab, you will find the SEO tags inside the <BODY> tag. But, these tags are supposed to be in the <HEAD>!

Why does something like this happen?

If we check the page using VIEW SOURCE, the canonical tag is placed correctly inside the HTML HEAD (line 56, while the <BODY> starts at line 139).

What is happening here?!

Is this an issue with Google Chrome?

The canonical is also placed in the BODY in Firefox.

We have the same issue with Internet Explorer.

Edge is no exception.

We have the same problem with other browsers.

HTML parsing vs. syntax highlighting

Why is the canonical placed correctly when we check VIEW SOURCE, but not when we check it in the Elements tab?

In order to understand this, I need to introduce a couple of developer concepts: lexical analysis and syntax analysis.

When we load a source page using VIEW SOURCE, the browser automatically color-codes programming tokens (HTML tags, HTML comments, etc.).

In order to do this, the browser performs basic lexical analysis to break the source page into HTML tokens.

This task is typically performed by a lexer. It is a simple, low-level task.

All programming language compilers and interpreters use a lexer that can break source text into language tokens.

When we load the source page with the Elements tab, the browser not only does syntax highlighting, but it also builds a DOM tree.

In order to build a DOM tree, it is not enough to know HTML tags and comments from regular text, you also need to know when a tag opens and closes, and their place in the tree hierarchy.

This syntactic analysis requires a parser.

An English spellchecker needs to perform a similar, two-phased analysis of the written text. First, it needs to translate text into nouns, pronouns, adverbs, etc. Then, it needs to apply grammar rules to make sure the part of speech tags are in the right order.

But why are the SEO tags placed in the HTML body?

Parsing HTML from Python

I wrote a Python script to fetch and parse some example pages with errors, find the canonical anywhere in the HTML, and print the DOM path where it was found.

After parsing the same page that shows misplaced SEO tags in the HTML Body, I find them correctly placed in the HTML head.
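My script used lxml, but the core idea can be sketched with nothing beyond Python's standard library. The class and function names below are my own, illustrative stand-ins: track the stack of open tags and report the path at which the canonical link appears.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they don't join the stack.
VOID_TAGS = {"meta", "link", "br", "img", "input", "hr"}

class CanonicalPathFinder(HTMLParser):
    """Record the DOM path of the first rel=canonical link tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags
        self.canonical_path = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and dict(attrs).get("rel") == "canonical":
            if self.canonical_path is None:
                self.canonical_path = "/".join(self.stack + [tag])
        elif tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def canonical_path(html):
    finder = CanonicalPathFinder()
    finder.feed(html)
    return finder.canonical_path
```

Calling canonical_path() on the fetched page source returns something like "html/head/link".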

What are we missing?

Invalid tags in the HTML head

Some HTML tags are only valid in the HTML BODY. For example, <DIV> and <SPAN> tags are invalid in the HTML head.

When I looked closely at the HTML HEAD in our example, I found a script with a hardcoded <SPAN>. This means the script was meant to be placed in the <BODY>, but the user incorrectly placed it in the head.

Maybe the instructions were not clear, the vendor omitted this information, or the user didn’t know how to do this in WordPress.

I tested this by moving the script to the BODY, but I still faced the misplaced canonical issue.

After a bit of trial and error, I found another script that, when moved to the BODY, made the issue disappear.

While the second script didn’t have any hardcoded invalid tags, it was likely writing one or more to the DOM.

In other words, it was doing it dynamically.

But why would inserting invalid tags cause the browser to push the rest of the HTML in the head to the body?

Web browser error tolerance

I created a few example HTML files with the problems I discussed and loaded them in Chrome to show you what happens.

In the first example, I commented out the opening BODY tag, effectively removing it.

You can see that Chrome added one automatically. 

Now, let’s see what happens if I add a <DIV> inside the HTML HEAD, which is invalid.

This is where it gets interesting. Chrome closed the HTML HEAD early and pushed the rest of the HEAD elements to the body, including our canonical tag and <DIV>.

In other words, Chrome assumed we forgot an opening <BODY> tag!

This should make it clear why misplaced tags in the HEAD can cause our SEO tags to end up in the BODY.

Now, let’s look at our second case where we don’t have a hardcoded invalid tag, but a script might write one dynamically.

Here you see that if a script writes an invalid tag in the HTML head, it will cause the browser to close it early as before. We have exactly the same problem!

We didn’t see the problem with our Python parser because lxml (the Python parsing library) doesn’t try to fix HTML errors.

Why do browsers do this?

Browsers need to render pages, which our Python script doesn't have to do. If browsers tried to render before correcting mistakes, the pages would look completely broken.

The web is full of pages that would completely break if web browsers didn’t compensate for errors.

This article from HTML5Rocks provides a fascinating look inside web browsers and helps explain the behavior we see in our examples.

“The HTML5 specification does define some of these requirements. (WebKit summarizes this nicely in the comment at the beginning of the HTML parser class.)

Unfortunately, we have to handle many HTML documents that are not well-formed, so the parser has to be tolerant about errors.

We have to take care of at least the following error conditions:

The element being added is explicitly forbidden inside some outer tag. In this case, we should close all tags up to the one which forbids the element, and add it afterward.”

Please read the full article, or at least the section on “Browser’s Error Tolerance,” to get better context.

How to fix this

Fortunately, fixing this problem is actually very simple. We have two alternatives: a lazy one and a proper one.

The proper fix is to track down scripts that insert invalid HTML tags in the head and move them to the HTML body.

The lazy, quickest fix is to move all SEO tags (and other important tags) before any third-party scripts, preferably right after the opening <HEAD> tag.

You can see how I do it here.

We still have the same invalid tag and script in the HTML head and the SEO tags are also in the head.

Is this a common problem?

I’ve seen this issue for many years now, and Patrick Stox has also reported the same problem happening often on enterprise sites.

One of the biggest misconceptions about technical SEO is that you do it once and you are done. That would be the case if the sites didn’t change, users/developers didn’t make mistakes and/or Googlebot behavior didn’t change either.

At the moment that is hardly the case.

I’ve been advocating that technical SEOs learn developer skills, and I hope this case study illustrates the growing importance of doing so.

If you enjoyed this tip, make sure to attend my SMX West session on Solving Complex JavaScript Issues And Leveraging Semantic HTML5 next month. Among other things, I will share advanced research on how Googlebot and Bingbot handle script and HTML issues like the ones I mentioned here.


Catching SEO errors during development using automated tests
Thu, 19 Sep 2019

Avoid the high cost of catching SEO issues in production by using automated testing techniques during development.

The post Catching SEO errors during development using automated tests appeared first on Search Engine Land.

Last June, I had the pleasure of presenting at SMX Advanced about one of my favorite topics: improving the collaboration between SEOs and developers.

While my session was about JavaScript for SEO, I took the opportunity to introduce a practice that I think can solve a painful business problem: the high cost of catching SEO issues in production when you can catch them during development using automated testing techniques.

How often do you learn about a noindex meta robots tag released to production on the wrong pages, causing a massive SEO traffic drop?

Let’s learn how we can prevent this error and similar ones from happening in the first place.

Automated testing in professional development

Modern professional developers need to add new features or fix bugs at a fast pace and often rely on automated testing to keep their code quality high.

During my session, I mentioned this as a perfect place to catch some SEO errors early, before their damage is too expensive.

In this article, we are going to explore this concept in detail, review some practical examples and outline the responsibilities of the developer and the SEO.

The anatomy of the front end of a modern web application

The front-end of modern web applications is generally built in a modular way using controllers, views, and components.

Controllers route page requests to the correct view of the app and the views are what you see when the page loads.

The views are further broken down into components. For example, in a search page, the grid of search results could be powered by one component.

These components can be rendered on the server side, on the client side, or on both, as is the case with hybrid rendering solutions.

SEO scope

It is important to understand these concepts because not every app controller, view or component requires SEO input or automated tests.

One way to tell is to ask if the component’s functionality should be visible or not to search engine crawlers.

For example, all components or actions behind a login form are not in the scope of SEO because search engine crawlers can’t see them.

The different types of automated tests

Automated testing is a broad topic, but when it comes to SEO concerns, there are two main types of automated tests we need to learn about: unit tests and end-to-end tests.

Developers generally write unit tests to perform individual component and method level checks. The idea is to verify each part of the application works as expected separately and in isolation.

However, while the individual parts can operate correctly, they could fail when put to work together. That is where integration tests (a.k.a. end-to-end tests) come into play. They test that the components work together as well.

We should write both types of tests to check for SEO issues during development.

Let’s review some practical examples.

Writing SEO unit tests

In preparation for my presentation, I coded an AngularJS app that monitors Google Trends topics. I focused on trying to optimize it for basic SEO best practices.

In Angular, we can use Jasmine to write unit tests. Let’s review what unit tests look like and what we can do with them.

As an example, let’s look at the Category Topics component in our app, which is responsible for listing the Google Trends topics for a selected category. 

We added these unit tests to check for basic SEO tags.

The tests above make sure the component sets proper canonical URLs, page titles and meta descriptions. 

You could easily extend this list to include other meta tags like meta robots and hreflang tags.

After you write tests like these, you generally need to execute them after you update the app.

Here is how you run them using Jasmine. In Angular, you type the command: ng test

Here is what the output looks like.

As developers add new features to the website or app and then run the tests, they can get immediate feedback when they forget to add important SEO tags or introduce incorrect ones.

Part of your ongoing work as an SEO is to make sure new relevant components are covered by unit tests.

Writing SEO integration tests

Next, let’s review some of the integration tests that I coded for our app so you can see what they look like.

In Angular, we can use Protractor to run end to end tests.

You might be wondering why we need two different tools to run automated tests.

End-to-end tests run exclusively in a web browser, automating it to perform the scripted actions we specify. This is very different from unit testing, where we can run just the specific back-end/front-end code that we are testing.

If we look at our example app’s category topics page, you can see we added end-to-end tests to check for prerendering issues.

The example tests check that our basic SEO tags work correctly after the page is rendered. This is a test that requires loading the page in the browser and waiting for the JavaScript code to execute.

One simple check we added was to make sure that key meta tags like the title and meta description didn’t come back null after rendering. Another test checks that the server-side tags and client-side rendered tags don't differ, as a mismatch could cause cloaking issues.
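The article's tests are written in Protractor, but the core check is framework-agnostic. As an illustration (the function names here are mine, not from the app), here is a Python sketch that extracts the key tags from the server-side and rendered HTML snapshots and fails if they are null or differ:

```python
import re

def extract_seo_tags(html):
    """Pull the title and meta description from an HTML snapshot."""
    title = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    desc = re.search(r'<meta\s+name="description"\s+content="(.*?)"', html, re.I)
    return {
        "title": title.group(1).strip() if title else None,
        "description": desc.group(1) if desc else None,
    }

def check_no_cloaking(server_html, rendered_html):
    """Raise if key SEO tags come back null after rendering or differ
    from the server-side HTML (a potential cloaking issue)."""
    server = extract_seo_tags(server_html)
    rendered = extract_seo_tags(rendered_html)
    for tag, value in rendered.items():
        assert value is not None, f"{tag} is null after rendering"
        assert value == server[tag], f"{tag} differs after rendering"
```

In practice, `server_html` would come from a plain HTTP fetch and `rendered_html` from the browser automation tool after JavaScript executes.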

Here is how you run them using Protractor. In Angular, you type the command: ng e2e

Prerendering JavaScript-based sites can lead to SEO issues that are hard to detect in production. Robust integration tests can provide a strong first line of defense.

Continuous integration

I didn’t cover this topic during my talk, but it is worth mentioning. Most development teams that write automated tests also implement a technique called continuous integration.

Continuous integration allows developers to push their code changes to a code repository and have each commit trigger a suite of automated tests. If the tests pass, the code is packaged for release and deployed automatically. But, if any of the tests fail, the packaging and release pipeline is halted.

Some continuous integration tools, like CircleCI, require you to add a simple test definitions file to your code repository and add the project to their service; they will then run all automated tests and the deployment pipeline, plus provide reporting.

As an SEO practitioner, you could ask your dev team to give you access so you can review failing SEO tests and check SEO test coverage to recommend any missing tests.

Shared responsibilities

Catching SEO errors during development can save companies a lot of money and headache, and it is a shared responsibility between developers and technical SEOs.

I created these two tables to help define some of the different responsibilities for unit tests and integration tests.

Resources to learn more

I used Angular examples, but automated testing is an established discipline in professional development. You can find equivalent tools and processes in most frameworks.

Here are a few to investigate further for your specific dev stack.


PWA: How to avoid partial rendering issues with service workers
Fri, 31 May 2019

When pages fail to render fully on the server side, the content shown to end users (or search bots) can be incomplete or inconsistent.

The post PWA: How to avoid partial rendering issues with service workers appeared first on Search Engine Land.

In preparation for my upcoming SMX Advanced session about The New Renaissance of JavaScript, I decided to code a progressive web app and try to optimize it for SEO. In particular, I was interested in reviewing all key rendering options (client side, server side, hybrid and dynamic) from a development/implementation perspective.

I learned six interesting insights that I will share during my talk. One of the insights addresses a painful problem that I see happening so often that I thought it was important to share it as soon as possible. So, here we go.

How partial rendering kills SEO performance

When you need to render JavaScript server side, there is a chance the page content won't be fully rendered. Let’s review a concrete example.

The category view-all page from the AngularJS site hadn't finished loading all product images after 20 seconds. In my tests, it took about 40 seconds to load fully.

Here is the problem with that: rendering services won’t wait forever for a page to finish loading. For example, Google’s dynamic rendering solution, Rendertron, by default won’t wait more than 10 seconds.

View-all pages are generally preferred by both users and search engines when they load fast. But, how do you load a page with over 400 product images fast?

Service workers to the rescue

Before I explain the solution, let’s review service workers and how they are applicable in this context. Detlev Johnson, who will be moderating our panel, wrote a great article on the topic.

When I think about service workers, I think about them as a content delivery network running in your web browser. A CDN helps speed up your site by offloading some of the website functionality to the network. One key functionality is caching, but most modern CDNs can do a lot more than that, like resizing/compressing images, blocking attacks, etc.

A mini-CDN in your browser is similarly powerful. It can intercept and programmatically cache the content from a PWA. One practical use case is that this allows the app to work offline. But what caught my attention was that, as a service worker operates separately from the main browser thread, it can also be used to offload the processes that slow page loading (and rendering) down.

So, here is the idea:

  1. Make an XHR request to get the initial list of products, one that returns fast (for example, page 1 of the full set)
  2. Register a service worker that intercepts this request, caches it, passes it through and makes subsequent requests in the background for the rest of the pages in the set. It should cache them all as well.
  3. Once all the results are loaded and cached, notify the page so that it gets updated.

The first time the page is rendered, it won’t get all the results, but it will get them on subsequent ones. Here is some code you can adapt to get this started.
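The original snippet isn't reproduced here, so below is a minimal sketch of the idea as a service worker script. The API path (/api/products), the page query parameter and the page count are all assumptions; adapt them to your app. The step that notifies the page of new results (for example, via client.postMessage) is omitted for brevity.

```javascript
// Hypothetical service worker (e.g., /ng-sw.js) implementing the steps above.
const CACHE = 'products-v1';
const TOTAL_PAGES = 5; // assumption: the app knows the size of the full set

// Build the URLs for the remaining pages of the result set (pages 2..N).
function remainingPageUrls(firstPageUrl, totalPages) {
  const urls = [];
  for (let page = 2; page <= totalPages; page += 1) {
    urls.push(firstPageUrl.replace(/page=\d+/, 'page=' + page));
  }
  return urls;
}

// Only wire up the fetch handler inside a real service worker context.
if (typeof self !== 'undefined' && typeof caches !== 'undefined') {
  self.addEventListener('fetch', (event) => {
    const url = event.request.url;
    if (!url.includes('/api/products')) return; // pass everything else through
    event.respondWith(
      caches.open(CACHE).then(async (cache) => {
        // Serve instantly from cache on repeat requests.
        const cached = await cache.match(event.request);
        if (cached) return cached;
        // Otherwise fetch page 1, cache it and pass it through...
        const response = await fetch(event.request);
        cache.put(event.request, response.clone());
        // ...while prefetching and caching the rest of the set in the background.
        remainingPageUrls(url, TOTAL_PAGES).forEach((pageUrl) => {
          fetch(pageUrl).then((res) => cache.put(pageUrl, res.clone()));
        });
        return response;
      })
    );
  });
}
```

Note that the handler passes through any request that doesn't match the products API, so the rest of the site is unaffected.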

I checked the page to see if they were doing something similar, but sadly they aren’t.

This approach will prevent the typical timeouts and errors from disrupting the page rendering, at the cost of perhaps some missing content during the initial page load. Subsequent page loads should have the latest information and load faster from the browser cache.

I checked Rendertron to see if this idea would be supported and found a pull request merged into its codebase that confirms support for the required feature.

However, as Google removed Googlebot from the list of bots supported in Rendertron by default, you need to add it back to get this to work.

Service workers limitations

When working with service workers and moving background work to them, you need to consider some constraints:

  1. Service workers require HTTPS
  2. Service workers intercept requests at the “directory level” they are installed in. For example, /test/ng-sw.js would only intercept requests under /test/*, while /ng-sw.js would intercept requests for the whole site.
  3. The background work shouldn’t require DOM access. There is also no access to the window, document or parent objects.

Example tasks that could run in the background using a service worker include data manipulation or traversal (such as sorting or searching), as well as data loading and data generation.

More potential rendering issues

In more general terms, when using hybrid or server-side rendering (with Node.js), the issues can include:

  1. XHR/Ajax requests timing out.
  2. Server overloaded (memory/CPU).
  3. Third-party scripts down.

When using dynamic rendering (with headless Chrome), in addition to the issues above, you can run into:

  1. The browser failing to load.
  2. Images taking too long to download and render.
  3. Higher latency.

The bottom line is that when you render pages server side and issues prevent full, correct rendering, the rendered content can have important discrepancies with the content shown to end users (or search bots).

There are three potential problems with this: 1) important content not getting indexed 2) accidental cloaking and 3) compliance issues.

We haven’t seen any client affected by accidental cloaking, but it could be a risk. We do, however, see compliance issues often. One example of a compliance issue is the one affecting sites selling on Google Shopping. The information in the product feed needs to match the information on the website. Google uses the same Googlebot for organic search and Google Shopping, so something as simple as missing product images can cause ads to get disapproved.

Additional resources

Please note that this is just one example of the insights I will be sharing during my session. Make sure to stop by so you don’t miss out on the rest.

I found the inspiration for my idea in this article. I also found other useful resources while researching for my presentation that I list below. I hope you find them helpful.

Developing Progressive Web Apps (PWAs) Course
JavaScript Concurrency
The Service Worker Lifecycle
Service Worker Demo

The post PWA: How to avoid partial rendering issues with service workers appeared first on Search Engine Land.

Brands can better understand users on third-party sites by using a keyword overlap analysis /brands-can-better-understand-users-on-third-party-sites-by-using-a-keyword-overlap-analysis-316157 Tue, 30 Apr 2019 18:17:24 +0000 /?p=316157 These scripts can help analyze cross-site branded traffic with overlapping keywords to capture untapped audiences.


If you are a manufacturer selling on your own site as well as through retail partners, it is likely you don’t have visibility into who is buying your products, or why they buy, beyond your own site. More importantly, you probably don’t have enough insight to improve your marketing messaging.

One technique you can use to identify and understand your users buying on third-party websites is to track your brand through organic search. You can then compare the brand searches on your site and the retail partner’s, see how big the overlap is, and see for how many of the overlapping keywords you rank above the retailer and vice versa. More importantly, you can see whether you are appealing to different audiences or competing for the same ones. Armed with these new insights, you could restructure your marketing messaging to unlock audiences you hadn’t tapped into before.

In previous articles, I’ve covered several useful data blending examples, but in this one, we will do something different. We will take a deeper dive into just one data blending example and perform what I call a cross-site branded keyword overlap analysis. As you will learn below, this type of analysis will help you understand the users buying your products from third-party retail partners.

In the Venn diagram above, you can see an example of the visualization we will put together in this article. It represents the number of overlapping keywords in organic search for the brand “Tommy Hilfiger” between their main brand site and Macy’s, a retail partner.

We recently performed this analysis for one of our clients and our findings surprised us. We discovered that while 60% of our client’s organic search traffic came from branded searches, as much as 30% of those searches were captured by four retail partners that also sell their products.

Armed with this evidence and with the knowledge that selling through their retail partners still made business sense, we provided guidance on how to improve their brand searches so they can compete more effectively, and change their messaging to appeal to a different customer than the one that buys from the retailers.

After my team conducted this analysis manually and I saw how valuable it is, I set out to automate the whole process in Python so we could easily reproduce it for all our manufacturing clients. Let me share the code snippets I wrote and walk you through their use.

Pulling branded organic search keywords

I am using the Semrush API to collect the branded keywords. I created a function that takes the API response and returns a pandas data frame. This function simplifies the process of collecting data for multiple domains.

Here is the code to get organic searches for “Tommy Hilfiger” going to Macy’s.

Here is the code to get organic searches for “Tommy Hilfiger” going to Tommy Hilfiger directly.
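The original snippets aren’t reproduced here, so below is a sketch of such a function using Python’s standard library plus pandas. The parameters follow Semrush’s documented domain_organic report (Ph = phrase, Po = position, Nq = volume, Tr = traffic); verify them against the current API docs before relying on this.

```python
from io import StringIO
from urllib.parse import urlencode
from urllib.request import urlopen

import pandas as pd

def build_semrush_url(domain, brand, api_key, database="us", limit=10000):
    """Build a domain_organic report URL filtered to branded phrases."""
    params = {
        "type": "domain_organic",
        "key": api_key,
        "domain": domain,
        "database": database,
        "display_limit": limit,
        "export_columns": "Ph,Po,Nq,Tr",       # phrase, position, volume, traffic
        "display_filter": f"+|Ph|Co|{brand}",  # keep phrases containing the brand
    }
    return "https://api.semrush.com/?" + urlencode(params)

def semrush_branded_keywords(domain, brand, api_key, **kwargs):
    """Pull branded organic keywords and return them as a data frame."""
    with urlopen(build_semrush_url(domain, brand, api_key, **kwargs)) as resp:
        text = resp.read().decode("utf-8")
    return pd.read_csv(StringIO(text), sep=";")  # Semrush returns ;-separated text

# df_macys = semrush_branded_keywords("macys.com", "tommy hilfiger", API_KEY)
# df_tommy = semrush_branded_keywords("tommy.com", "tommy hilfiger", API_KEY)
```

The two commented calls at the bottom show the intended usage for the Macy’s and Tommy Hilfiger pulls.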

Visualizing the branded keyword overlap

After we pull the searches for “Tommy Hilfiger” from both sites, we want to understand the size of the overlap. We accomplish this in the following lines of code:
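The snippet isn’t reproduced here, but the computation amounts to set operations on the keyword columns. Here is a minimal sketch, assuming Semrush’s “Ph” (phrase) column holds the keywords:

```python
def keyword_overlap(df_a, df_b, col="Ph"):
    """Return (common, only in A, only in B) keyword sets.

    col is the keyword column; "Ph" (phrase) in Semrush exports.
    """
    a, b = set(df_a[col]), set(df_b[col])
    return a & b, a - b, b - a
```

Calling it with the Tommy Hilfiger and Macy’s data frames yields the three counts discussed below.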

We can quickly see that the overlap is significant, with 4,601 keywords in common, 515 unique to Tommy Hilfiger and 125 unique to Macy’s.

Here is the code to visualize this overlap as the Venn diagram illustrated above.
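The plotting snippet isn’t shown, so here is a sketch using the third-party matplotlib-venn package; the set labels are assumptions. The subset-counting helper is kept separate so it can be reused without the plotting dependencies.

```python
def venn2_subsets(set_a, set_b):
    """Compute the (only A, only B, both) sizes that venn2 expects."""
    return (len(set_a - set_b), len(set_b - set_a), len(set_a & set_b))

def plot_overlap(set_a, set_b, labels=("tommy.com", "macys.com")):
    # The visualization needs two third-party packages:
    # pip install matplotlib matplotlib-venn
    import matplotlib.pyplot as plt
    from matplotlib_venn import venn2

    venn2(subsets=venn2_subsets(set_a, set_b), set_labels=labels)
    plt.show()
```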

Who ranks better for the overlapping keywords?

The most logical next question is: given how significant the overlap is, who commands the higher rankings for those keywords? How can we figure this out? With data blending, of course!

First, as we learned in my first data blending article, we will merge the two data frames, and we will use an inner join to keep only the keywords common in the two sets.

When we merge data frames that share column names, pandas repeats the columns, appending _x to the ones from the first data frame and _y to the ones from the second. So, Macy’s columns end with _x.

Here is how we create a new data frame with the overlapping branded keywords where Macy’s ranks higher.

Here is the corresponding data frame where Tommy Hilfiger ranks higher.
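The original snippets aren’t shown, so here is a runnable sketch with toy data frames standing in for the two keyword pulls (“Ph” = phrase, “Po” = position, per the Semrush export columns):

```python
import pandas as pd

# Toy stand-ins for the two keyword pulls
df_macys = pd.DataFrame({"Ph": ["tommy hilfiger boots", "tommy hilfiger outlet"],
                         "Po": [3, 12]})
df_tommy = pd.DataFrame({"Ph": ["tommy hilfiger outlet", "tommy hilfiger boots"],
                         "Po": [1, 8]})

# The inner join keeps only the overlapping keywords; pandas suffixes the
# repeated columns with _x (first frame, Macy's) and _y (second, Tommy)
df_overlap = df_macys.merge(df_tommy, on="Ph", how="inner")

# A lower position number means a higher ranking
macys_higher = df_overlap[df_overlap["Po_x"] < df_overlap["Po_y"]]
tommy_higher = df_overlap[df_overlap["Po_y"] < df_overlap["Po_x"]]
```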

Here we can see that while the overlap is big, Tommy Hilfiger ranks higher for many more branded keywords than Macy’s (3,173 vs. 1,075). So, is Tommy Hilfiger doing better? Not quite!

As you remember, we also pulled traffic numbers from the API. In the next snippet of code, we will check which keywords are pulling more traffic.
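The snippet isn’t reproduced here; the idea is simply to sum the traffic column on each side of the merged overlap frame. A sketch with a toy frame, assuming “Tr” is the Semrush traffic column and the _x/_y suffixes come from the merge (_x = Macy’s, _y = Tommy Hilfiger):

```python
import pandas as pd

# Toy merged overlap frame
df_overlap = pd.DataFrame({"Ph": ["kw1", "kw2"],
                           "Tr_x": [300, 120],
                           "Tr_y": [90, 260]})

# Total the traffic each side attracts from the overlapping keywords
macys_traffic = df_overlap["Tr_x"].sum()
tommy_traffic = df_overlap["Tr_y"].sum()
```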

Surprisingly, we see that while Macy’s performs better for fewer keywords than Tommy Hilfiger, when we add up the traffic, Macy’s attracts more visitors (75,026 vs. 66,415).

As you can see, sweating the details matters a lot in this type of analysis!

How different are the audiences?

Finally, let’s use the branded keywords unique to each site to learn any differences in the audiences that visit each site. We will simply strip the branded phrase from the keywords and create word clouds to understand them better. When we remove the branded phrase “Tommy Hilfiger,” we are left with the additional qualifiers that users use to indicate their intention.

I created a function to create and display the word clouds. Here is the code:
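The function isn’t reproduced here, so below is a sketch. Stripping the brand phrase is kept as a separate helper; the plotting part uses the third-party wordcloud and matplotlib packages.

```python
import re

def strip_brand(keywords, brand="tommy hilfiger"):
    """Remove the brand phrase, leaving only the qualifier words."""
    pattern = re.compile(re.escape(brand), re.IGNORECASE)
    return [" ".join(pattern.sub(" ", kw).split()) for kw in keywords]

def show_word_cloud(keywords, brand="tommy hilfiger"):
    # The visualization needs two third-party packages:
    # pip install wordcloud matplotlib
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    text = " ".join(strip_brand(keywords, brand))
    plt.imshow(WordCloud(background_color="white").generate(text))
    plt.axis("off")
    plt.show()
```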

Here is the word cloud with the most popular words left after you remove the phrase “Tommy Hilfiger” from Macy’s keywords.

Here is the corresponding word cloud when you do the same for the Tommy Hilfiger ones.

The main difference I see is that people looking for Tommy Hilfiger products on Macy’s have specific products in mind, like boots and curtains, while on the brand site, people primarily have the outlets in mind. This might indicate that they intend to visit a store rather than purchase online. It may also indicate that people going to the brand site are bargain hunters, while the ones going to Macy’s might not be. These are very interesting and powerful insights!

Given these insights, Tommy Hilfiger could review the SERPs, compare the difference in messaging between Macy’s and their brand site, and adjust it to appeal to their unique audience’s interests.

The post Brands can better understand users on third-party sites by using a keyword overlap analysis appeared first on Search Engine Land.

5 additional data blending examples for smarter SEO insights /5-additional-data-blending-examples-for-smarter-seo-insights-314645 Wed, 27 Mar 2019 12:39:26 +0000 /?p=314645 Once you preprocess columns to consistent formatting, additional data blending options include prioritizing pages with search clicks, mining internal site search for content gaps, analyzing traffic issues with 404 pages and more.


As I covered in my previous article, data blending can uncover really powerful insights that you would not be able to see otherwise.

When you start shifting your SEO work to be more data-driven, you will naturally look at all the data sources in your hands and might find it challenging to come up with new data blending ideas. Here is a simple shortcut I often use: I don’t start with the data sources I have (bottom up), but with the questions I need to answer, and then I compile the data I need (top down).

In this article, we will explore 5 additional SEO questions that we can answer with data blending. But before we dive in, I want to address some of the challenges you will face when putting this technique into practice.

Tony McCreath raised a very important frustration you can experience when data blending:

When you join separate datasets, the common columns need to be formatted the same way for this technique to work. However, this is rarely the case. You often need to preprocess the columns ahead of the join operation.

It is relatively easy to perform advanced data joins in Tableau, Power BI and similar business intelligence tools, but preprocessing the columns is where learning a little Python pays off.

Here are some of the most common preprocessing issues you will often see and how you can address them in Python.


URLs

Absolute or relative. You will often find both absolute and relative URLs. For example, Google Analytics URLs are relative, while URLs from SEO spider crawls are absolute. You can convert them all to either relative or absolute.

Here is how to convert relative URLs to absolute:

Here is how to convert absolute URLs to relative:
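The original snippets aren’t reproduced here; this sketch covers both directions using the standard library, assuming https://example.com as the site root:

```python
from urllib.parse import urljoin, urlparse

import pandas as pd

urls = pd.Series(["/shoes/", "/shirts/?color=blue"])

# Relative -> absolute: join each URL with the site root
absolute = urls.apply(lambda u: urljoin("https://example.com", u))

# Absolute -> relative: keep only the path (plus the query string, if any)
def to_relative(url):
    parts = urlparse(url)
    return parts.path + ("?" + parts.query if parts.query else "")

relative = absolute.apply(to_relative)
```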

Case sensitivity. Most URLs are case sensitive, but if the site is hosted on a Windows server, you will often find URLs with different capitalization that return the same content. You can convert them all to lowercase or uppercase.

Here is how to convert them to lowercase:

Here is how to convert them to uppercase:
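The snippets aren’t shown; with pandas, both conversions are one-liners on the URL column:

```python
import pandas as pd

urls = pd.Series(["/Shoes/", "/SHIRTS/?Color=Blue"])

lower = urls.str.lower()  # standardize on lowercase...
upper = urls.str.upper()  # ...or on uppercase, as long as you pick one
```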

Encoding. Sometimes the URLs come from the URL parameter of another source URL, and if they have query strings, they will be URL encoded. When you extract the parameter value, the library you use might or might not decode it for you.

Here is how to decode URL-encoded URLs:
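A sketch using the standard library’s unquote:

```python
from urllib.parse import unquote

import pandas as pd

urls = pd.Series(["/search%3Fq%3Dboots", "/shirts%2Fblue%20suede"])
decoded = urls.apply(unquote)  # %3F -> ?, %3D -> =, %2F -> /, %20 -> space
```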

Parameter handling. If the URLs have more than one URL parameter, you can face some of these issues:

  1. You might have parameters with no values.
  2. You might have redundant/unnecessary parameters.
  3. You might have parameters ordered differently.

Here is how we can address each one of these issues.
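The original snippet isn’t shown; one standard-library sketch handles all three issues in a single pass. The drop list is illustrative; substitute the parameters that are redundant on your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def normalize_params(url, drop=("utm_source", "utm_medium", "sessionid")):
    """Drop valueless and redundant parameters, then sort the remainder."""
    parts = urlparse(url)
    params = [(key, value)
              for key, value in parse_qsl(parts.query)  # 1: skips valueless params
              if key not in drop]                       # 2: drops unwanted params
    params.sort()                                       # 3: consistent ordering
    return urlunparse(parts._replace(query=urlencode(params)))
```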


Dates

Dates can come in many different formats. The main strategy is to parse them from their source format into Python datetime objects. You can optionally manipulate the datetime objects, for example, to sort the dates correctly or to localize them to a specific time zone. But, most importantly, you can easily format the datetime objects using a consistent convention.

Here are some examples:
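The examples aren’t reproduced here; a short sketch of the parse-then-format strategy (Google Analytics reports dates as YYYYMMDD, while Search Console exports use YYYY-MM-DD):

```python
from datetime import datetime

# Parse each source's own format into datetime objects...
ga_date = datetime.strptime("20190322", "%Y%m%d")
gsc_date = datetime.strptime("2019-03-22", "%Y-%m-%d")

# ...then format everything with one consistent convention before joining
consistent = [d.strftime("%Y-%m-%d") for d in (ga_date, gsc_date)]
```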


Keywords

Correctly matching keywords across different datasets can also be a challenge. You need to review the columns to see whether the keywords appear as users typed them or whether there has been any normalization.

For example, it is not uncommon for users to search by copying and pasting text. These searches can include hyphens, quotes, trademark symbols, etc. that would not normally appear when typed. And when typing, spacing and capitalization can be inconsistent across users.

To normalize keywords, you need to at least remove unnecessary characters and symbols, remove extra spacing and standardize on lowercase (or uppercase).

Here is how you would do that in Python:
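The snippet isn’t reproduced here; a minimal sketch covering those three steps:

```python
import re

def normalize_keyword(keyword):
    """Lowercase, strip symbols and collapse extra whitespace."""
    keyword = keyword.lower()
    keyword = re.sub(r"[^\w\s]", " ", keyword)  # hyphens, quotes, ™, etc.
    return " ".join(keyword.split())            # collapse runs of spaces
```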

Now that we know how to preprocess columns, let’s get to the fun part of the article and review some additional SEO data blending examples:

Error pages with search clicks

You have a massive list of 404 errors that you pulled from your web server logs because Google Search Console doesn’t make it easy to get the full list. Now you need to redirect most of them to recover the lost traffic. One approach is to prioritize the pages with search clicks, starting with the most popular ones!

Here is the data you’ll need:

Google Search Console: page, clicks

Web server log: HTTP request, status code = 404

Common columns (for the merge function): left_on: page, right_on: HTTP request.
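As a sketch of the merge (with toy data frames standing in for the two exports; the column names follow the spec above):

```python
import pandas as pd

# Toy stand-ins: GSC pages with clicks, and 404 requests from the server log
df_gsc = pd.DataFrame({"page": ["/a", "/b", "/c"], "clicks": [120, 5, 40]})
df_404 = pd.DataFrame({"request": ["/b", "/c", "/d"], "status": [404, 404, 404]})

# The inner join keeps only the 404 pages that still earn search clicks;
# sorting puts the most popular ones first in the redirect queue
priority = (df_gsc.merge(df_404, left_on="page", right_on="request")
                  .sort_values("clicks", ascending=False))
```

The later examples in this article follow the same merge pattern with their own columns.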

Pages missing Google Analytics tracking code

Some sites choose to insert tracking codes manually instead of placing them in web page templates. This can lead to traffic underreporting due to pages missing tracking codes. You could crawl the site to find such pages, but what if the pages are not linked from within the site? One approach is to compare the pages in Google Analytics and Google Search Console over the same time period. Any pages in the GSC dataset but missing from the GA dataset can potentially be missing the GA tracking script.

Here is the data you’ll need:

Google Search Console: date, page

Google Analytics: ga:date, ga:landingPagePath, filtered to Google organic searches.

Common columns (for the merge function): left_on: page, right_on: ga:landingPagePath.
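Finding rows present in only one of the two sets is an anti-join; with pandas it can be sketched with a left join and the indicator flag (toy frames stand in for the exports):

```python
import pandas as pd

df_gsc = pd.DataFrame({"page": ["/a", "/b", "/c"]})
df_ga = pd.DataFrame({"ga:landingPagePath": ["/a", "/c"]})

# indicator=True adds a _merge column telling which side each row came from;
# "left_only" rows are pages GSC sees but GA doesn't — tracking-code suspects
merged = df_gsc.merge(df_ga, left_on="page", right_on="ga:landingPagePath",
                      how="left", indicator=True)
missing_tracking = merged.loc[merged["_merge"] == "left_only", "page"]
```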

Excluding 404 pages from Google Analytics reports

One disadvantage of inserting tracking codes in templates is that Google Analytics page views can trigger when users end up on 404 pages. This is generally not a problem, but it can complicate your life when you are trying to analyze traffic issues and can’t tell which traffic is good and lands on actual page content, and which is bad and ends in errors. One approach is to compare the pages in Google Analytics with pages crawled from the website that return a 200 status code.

Here is the data you’ll need:

Website crawl: URL, status code = 200

Google Analytics: ga:landingPagePath

Common columns (for the merge function): left_on: URL, right_on: ga:landingPagePath

Mining internal site search for content gaps

Let’s say you review your internal site search reports in Google Analytics and find people coming from organic search yet performing one or more internal searches until they find their content. It might be that content pieces are missing that could drive those visitors directly from organic search. One approach is to compare your internal search keywords with the keywords from Google Search Console. The two datasets should use the same date range.

Here is the data you’ll need:

Google Analytics: ga:date, ga:searchKeyword, filtered to Google organic search.

Google Search Console: date, keyword

Common columns (for the merge function): left_on: ga:searchKeyword, right_on: keyword

Checking Google Shopping organic search performance

Google announced last month that products listed in Google Shopping feeds can now show up in organic search results. I think it would be useful to check how much traffic these listings get versus the regular organic listings. If you add extra tracking parameters to the URLs in your feed, you can use Google Search Console data to compare the same products appearing in regular listings vs. organic shopping listings.

Here is the data you’ll need:

Google Search Console: date, page, filtered to pages with the shopping tracking parameter

Google Search Console: date, page, filtered to pages without the shopping tracking parameter

Common columns (for the merge function): left_on: page, right_on: page

The post 5 additional data blending examples for smarter SEO insights appeared first on Search Engine Land.

5 practical data blending examples for smarter SEO insights /5-practical-data-blending-examples-for-smarter-seo-insights-312787 Fri, 22 Feb 2019 13:05:17 +0000 /?p=312787 Here's a step-by-step guide to blending data tables from different tools to uncover valuable new insights using Python (or SQL).


Sometimes we face questions that are hard to answer with the information from isolated tools. One powerful technique we can use is to combine data from different tools to discover valuable new insights.

You can use Google Data Studio to perform data blending, but note that it’s limited to only one type of blending technique: a left outer join (discussed below). I will cover a more comprehensive list of data blending techniques that you can do in Python (or SQL if you prefer it).

Let’s explore some practical SEO applications.

Overall approach

In order to blend separate data tables (think spreadsheets in Excel), the tables need to have one or more columns in common. For example, we could match the column ga:landingPagePath in a Google Analytics table with the page column in a Google Search Console table.

When we combine data tables this way, we have several options to compute the resulting table.

The Venn diagrams above illustrate standard set theory used to represent the membership of elements in the resulting set. Let’s discuss each example:

Full Outer Join: The elements in the resulting set include the union of all the elements in the source sets. All elements from both sides of the join are included, with joined information if they share a key, and blanks otherwise.

Inner Join: The elements in the resulting set include the intersection of all elements in the source sets. Only elements that share a key on both sides are included.

Left (Outer) Join: The elements in the resulting set include the intersection of all elements in the source sets and the elements only present in the first set. All elements on the left hand side are present, with additional joined information only if a key is shared with the right hand side.

Right (Outer) Join: The elements in the resulting set include the intersection of all elements in the source sets and the elements only present in the second set. All elements on the right hand side are present, with additional joined information only if a key is shared with the left hand side.

I’ll walk through an example of these joins below, but this topic is easier to learn by doing. Feel free to practice with this interactive tutorial.

Here are some practical SEO data blending use cases:

Adding conversion/revenue data to Google Search Console

Google Search Console is my must-have tool for technical SEO, but like me, you are probably frustrated that you can’t see revenue or conversion data in its reports. This is relatively easy to fix for landing pages by blending in data from Google Analytics.

Both data tables must use the same date range.

First, we’ll set up a Pandas DataFrame with some example Google Analytics data and call it df_a.

Google Analytics data table containing ga:landingPagePath, ga:revenue, ga:transactions (filtered to google organic search traffic)

Next, we’ll set up a DataFrame with some example Search Console data and call it df_b.

Google Search Console data table containing page, impressions, clicks, position

Now, we’ll use the Pandas merge function to combine the two, using first an inner join (the intersection of the two sets), and then using an outer join (the union).
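The original snippets aren’t reproduced here; this runnable sketch sets up small example frames with the columns described above (df_a for GA, df_b for GSC) and shows all four join types:

```python
import pandas as pd

# Example Google Analytics data (df_a)
df_a = pd.DataFrame({"ga:landingPagePath": ["/a", "/b"],
                     "ga:revenue": [1000.0, 250.0],
                     "ga:transactions": [10, 5]})

# Example Google Search Console data (df_b)
df_b = pd.DataFrame({"page": ["/a", "/c"],
                     "impressions": [5000, 900],
                     "clicks": [400, 30]})

keys = dict(left_on="ga:landingPagePath", right_on="page")
inner = df_a.merge(df_b, how="inner", **keys)  # only /a (shared key)
outer = df_a.merge(df_b, how="outer", **keys)  # /a, /b and /c
left = df_a.merge(df_b, how="left", **keys)    # /a and /b
right = df_a.merge(df_b, how="right", **keys)  # /a and /c
```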

You can see that the outer, left, and right joins contain missing data (“NaN”) when no key is shared by the other side.

You can now divide transactions by clicks to get the conversion rate per landing page, and divide revenue by transactions to get the average order value.
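Those two ratios can be sketched on a toy blended frame (GSC clicks plus GA transactions and revenue per landing page):

```python
import pandas as pd

# Toy blended frame standing in for the inner-join result
blended = pd.DataFrame({"page": ["/a", "/b"],
                        "clicks": [400, 100],
                        "ga:transactions": [10, 5],
                        "ga:revenue": [1000.0, 250.0]})

blended["conversion_rate"] = blended["ga:transactions"] / blended["clicks"]
blended["avg_order_value"] = blended["ga:revenue"] / blended["ga:transactions"]
```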

Correlating links and domains over time with traffic increase

Are increasing backlinks responsible for an increase in traffic or not? You can export the latest links from Google Search Console (which include the last time Googlebot crawled them), then combine this data table with Google Analytics organic search traffic during the same time frame.

Similar to the first example, both data tables must use the same date range.

Here is the data you’ll need:

Google Search Console: Linking page, Last crawled

Google Analytics: ga:date, ga:newUsers

Common columns (for the merge function): left_on: Last crawled, right_on: ga:date

You can plot traffic and links over time. Optionally, you can add a calculated domain column to the Search Console data table. This will allow you to plot linking domains by traffic.

Correlating new user visits to content length

What is the optimal length of your content articles? Instead of offering rule-of-thumb answers, you can actually calculate this per client. We will combine a data table from your favorite crawler with performance data from Google Analytics or Google Search Console. The idea is to group pages by their word count, and check which groups get the most organic search visits.

Both data tables must use the same set of landing pages.

Screaming Frog crawl: Address, Word count

Google Analytics: ga:landingPagePath, ga:newUsers

Common columns: left_on:Address, right_on: ga:landingPagePath

You need to create word count bins, group by bin and then plot the traffic per bin.
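A sketch of the binning step with toy data (bin edges are an assumption; pick ranges that suit your content):

```python
import pandas as pd

# Toy blended frame: word counts from the crawl, new users from GA
df = pd.DataFrame({"Word count": [150, 480, 900, 1700, 2600],
                   "ga:newUsers": [10, 40, 120, 95, 30]})

# Bin the pages by content length, then total the organic visits per bin
bins = [0, 500, 1000, 2000, 5000]
df["length_bin"] = pd.cut(df["Word count"], bins=bins)
traffic_per_bin = df.groupby("length_bin", observed=True)["ga:newUsers"].sum()
# traffic_per_bin can then be plotted, e.g., traffic_per_bin.plot.bar()
```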

Narrowing down the pages that lost (or gained) traffic

Why did the traffic drop (or increase)? This is a common and sometimes painful question to answer. We can learn which specific pages lost (or gained) traffic by combining data tables from two separate time periods.

Both data tables must use the same number of days before and after the drop (or increase).

First period in Google Analytics: ga:landingPagePath, ga:newUsers

Second period in Google Analytics: ga:landingPagePath , ga:newUsers

Common columns: left_on:ga:landingPagePath, right_on: ga:landingPagePath

We first need to aggregate new users by page and subtract the first period from the second one. Let’s call this difference the delta. If the delta is greater than zero, the page gained traffic; if it is less than zero, it lost traffic; and if it is zero, traffic didn’t change.
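The delta calculation can be sketched with toy frames for the two periods:

```python
import pandas as pd

# New users per landing page, before and after the change in traffic
before = pd.DataFrame({"ga:landingPagePath": ["/a", "/b"],
                       "ga:newUsers": [100, 50]})
after = pd.DataFrame({"ga:landingPagePath": ["/a", "/c"],
                      "ga:newUsers": [60, 40]})

# The outer join keeps pages that appear in only one period;
# fillna(0) treats a missing page as zero traffic for that period
merged = before.merge(after, on="ga:landingPagePath", how="outer").fillna(0)
merged["delta"] = merged["ga:newUsers_y"] - merged["ga:newUsers_x"]
```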

Here is an example where we have grouped pages by the page type (Collections, Products, or N/A) and calculated the delta before and after a drop in traffic.

Finding high-converting paid search keywords with poor SEO rankings

Do you have high-converting keywords in AdWords that rank poorly in organic search? You can find out by combining Google AdWords data with Google Search Console data.

Both data tables must use the same date range.

Google Analytics: ga:adMatchedQuery, ga:transactions (filtered by transactions greater than zero)

Google Search Console: query, position, clicks (filtered by keywords with position greater than 10)

Common columns: left_on: ga:adMatchedQuery, right_on: query

The result will list low ranking organic keywords with transactions, positions and clicks columns.

The post 5 practical data blending examples for smarter SEO insights appeared first on Search Engine Land.