JR Oakes – Search Engine Land

Using the Apriori algorithm and BERT embeddings to visualize change in search console rankings
/using-the-apriori-algorithm-and-bert-embeddings-to-visualize-change-in-search-console-rankings-328702
Wed, 05 Feb 2020
By leveraging the Apriori algorithm, we can categorize queries from GSC, aggregate PoP click data by category and use BERT embeddings to find semantically related categories.

One of the biggest challenges an SEO faces is focus. We live in a world of disparate tools that do various things well, and others not so well. We have data coming out of our eyeballs, but few ways to refine all of it into something meaningful. In this post, I mix new with old to create a tool that has value for something we, as SEOs, do all the time: keyword grouping and change review. We will leverage a little-known algorithm, called the Apriori algorithm, along with BERT, to produce a useful workflow for understanding your organic visibility at thirty thousand feet.

What is the Apriori algorithm?

The Apriori algorithm was proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994. It was essentially designed as a fast algorithm for use on large databases, to find associations and commonalities between the component parts of rows of data, called transactions. A large e-commerce shop, for example, may use this algorithm to find products that are often purchased together, so that it can show associated products when another product in the set is purchased.

I discovered this algorithm a few years ago from this article and immediately saw a connection to finding unique pattern sets in large groups of keywords. We have since moved to more semantically driven matching technologies, as opposed to term-driven ones, but this is still an algorithm I often come back to as a first pass through large sets of query data.

Transactions

  1. technical, seo
  2. technical, seo, agency
  3. seo, agency
  4. technical, agency
  5. locomotive, seo, agency
  6. locomotive, agency

Below, I used the article by Annalyn Ng as inspiration to rewrite the definitions of the parameters that the Apriori algorithm supports, because her original explanations were intuitive. I pivoted the definitions to relate to queries instead of supermarket transactions.

Support

Support is a measurement of how popular a term or term set is. In the table above, we have six separate tokenized queries. The support for "technical" is 3 out of 6 queries, or 50%. Similarly, "technical, seo" has a support of 33%, appearing in 2 out of 6 queries.

Confidence

Confidence shows how likely terms are to appear together in a query. It is written as {X -> Y} and is calculated by dividing the support for {term 1, term 2} by the support for {term 1}. In the above example, the confidence of {technical -> seo} is 33% / 50%, or 66%.

Lift

Lift is similar to confidence, but it solves a problem: very common terms can artificially inflate confidence scores simply because they appear frequently alongside other terms. Lift is calculated by dividing the support for {term 1, term 2} by the support for {term 1} times the support for {term 2}. A value of 1 means no association. A value greater than 1 means the terms are likely to appear together, while a value less than 1 means they are unlikely to appear together.
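To make these three metrics concrete, here is a minimal Python sketch (not part of the querycat repo) that computes them by hand for the six tokenized queries in the table above:

# Support, confidence and lift computed by hand for the six example queries.
transactions = [
    {"technical", "seo"},
    {"technical", "seo", "agency"},
    {"seo", "agency"},
    {"technical", "agency"},
    {"locomotive", "seo", "agency"},
    {"locomotive", "agency"},
]

def support(itemset):
    # Fraction of queries that contain every term in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # How often y appears in queries that already contain x.
    return support(x | y) / support(x)

def lift(x, y):
    # Confidence corrected for how common y is on its own.
    return support(x | y) / (support(x) * support(y))

print(support({"technical"}))              # 0.50 -> 50%
print(support({"technical", "seo"}))       # 0.33 -> 33%
print(confidence({"technical"}, {"seo"}))  # 0.66 -> 66%
print(lift({"technical"}, {"seo"}))        # 1.0 -> no strong association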

Using Apriori for categorization

For the rest of the article, we will follow along with a Colab notebook and a companion Github repo that contains additional code supporting the notebook. The Colab notebook is found here. The Github repo is called QueryCat.

We start off with a standard CSV export from Google Search Console (GSC) of comparative, 28-day query data, period over period. Within the notebook, we load the Github repo and install some dependencies. Then we import querycat and load a CSV containing the exported data from GSC.


Now that we have the data, we can use the Categorize class in querycat to pass a few parameters and easily find relevant categories. The most meaningful parameter to look at is "alg," which specifies the algorithm to use. We included both Apriori and FP-Growth, which take the same inputs and produce similar outputs. FP-Growth is supposed to be the more efficient algorithm, but in our usage we preferred Apriori.

The other parameter to consider is "min_support." This essentially says how often a term has to appear in the dataset to be considered. The lower this value, the more categories you will have; higher values yield fewer categories and generally more queries with no category. In our code, we assign queries with no calculated category to the category "##other##."

The remaining parameters, "min_lift" and "min_probability," deal with the quality of the query groupings and impose a probability threshold on the terms appearing together. They are already set to the best general settings we have found, but they can be tweaked to personal preference on larger data sets.
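If you want to experiment with how min_support drives the number of categories outside of querycat, here is a rough sketch using the mlxtend library instead; the CSV name and the "query" column are placeholder assumptions.

# Illustration with mlxtend (a different library than querycat) of how
# min_support controls how many frequent term sets, and therefore candidate
# categories, you end up with. File and column names are assumptions.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

df = pd.read_csv("gsc_queries.csv")
transactions = [q.lower().split() for q in df["query"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Lower min_support -> more (and more specific) term sets.
for min_support in (0.05, 0.02, 0.01):
    itemsets = apriori(onehot, min_support=min_support, use_colnames=True)
    print(min_support, len(itemsets), "frequent term sets")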


You can see that in our dataset of 1,364 total queries, the algorithm was able to place the queries in 101 categories. Also notice that the algorithm is able to pick multi-word phrases as categories, which is the output we want.

After this runs, you can run the next cell, which will output the original data with the categories appended to each row. It is worth noting that this alone is enough to save the data to a CSV and pivot by category in Excel, aggregating the column data by category. We provide a comment in the notebook that describes how to do this. In our example, we distilled meaningful categories in only a few seconds of processing, with only 63 unmatched queries.


Now with the new (BERT)

One of the most frequent questions asked by clients and other stakeholders is "what happened last <insert time period here>?" With a bit of Pandas magic and the data we have processed to this point, we can easily compare the clicks for the two periods in our dataset by category and provide a column that shows the difference (or percent change, if you prefer) between the two periods.
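The Pandas step is roughly the following sketch; the column names are assumptions, since GSC comparison exports and the notebook's output may name columns differently.

# Sketch of the period-over-period comparison by category. Column names
# ("category", "clicks_current", "clicks_previous") are assumptions; adjust
# them to match your own export.
import pandas as pd

df = pd.read_csv("categorized_queries.csv")

pop = (df.groupby("category")[["clicks_current", "clicks_previous"]]
         .sum()
         .assign(click_diff=lambda d: d["clicks_current"] - d["clicks_previous"])
         .sort_values("click_diff", ascending=False))

print(pop.head(10))  # biggest category-level winners for the period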


Since we just launched a new domain, locomotive.agency, at the end of 2019, it is no wonder that most of the categories show click growth when comparing the two periods. It is also good to see that our new brand, "Locomotive," shows the most growth. We also see that an article we did on Google Analytics exports has 42 queries and a growth of 36 monthly clicks.

This is helpful, but it would be cool to see whether there are semantic relationships between the query categories where we did better or worse. Do we need to build more topical relevance around certain categories of topics?

In the shared code, we made it easy to access BERT, via the excellent Huggingface Transformers library, simply by including the querycat.BERTSim class in your code. We won't cover BERT in detail, because Dawn Anderson has done an excellent job of that already.


This class allows you to input any Pandas DataFrame with a terms (queries) column, and it will load DistilBERT and process the terms into their corresponding summed embeddings. The embeddings are essentially vectors of numbers that hold the meaning the model has "learned" about the various terms. After running the read_df method of querycat.BERTSim, the terms and embeddings are stored in the terms (bsim.terms) and embeddings (bsim.embeddings) properties, respectively.
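To show roughly what happens under the hood, here is a generic sketch of producing summed DistilBERT embeddings with the Transformers library directly; it illustrates the idea rather than reproducing querycat's exact implementation.

# Generic sketch: turn query terms into summed DistilBERT embeddings.
import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
model.eval()

terms = ["technical seo agency", "google analytics export", "locomotive"]

vectors = []
with torch.no_grad():
    for term in terms:
        inputs = tokenizer(term, return_tensors="pt")
        outputs = model(**inputs)
        # Sum the token vectors into a single 768-dimension vector per term.
        vectors.append(outputs.last_hidden_state.sum(dim=1).squeeze(0))

embeddings = torch.stack(vectors)  # shape: (number of terms, 768)
print(embeddings.shape)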

Similarity

Since we are operating in vector space with the embeddings, we can use cosine similarity (the cosine of the angle between two vectors) to measure how similar they are. We provided a simple function, get_similar_df, that is helpful for sites that may have hundreds to thousands of categories. It takes a string as its only parameter and returns the categories that are most similar to that term, with a similarity score from 0 to 1. You can see below that for the given term "train," locomotive, our brand, was the closest category, with a similarity of 85%.
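A bare-bones version of that kind of lookup, using scikit-learn's cosine similarity, might look like the sketch below; the variable names are placeholders for whatever your own pipeline produces.

# Rank stored category embeddings against a single term's embedding by
# cosine similarity. "category_names" and "category_embeddings" are
# placeholders for the output of the embedding step above.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def most_similar_categories(term_vector, category_names, category_embeddings, top_n=10):
    sims = cosine_similarity(term_vector.reshape(1, -1), category_embeddings)[0]
    order = np.argsort(sims)[::-1][:top_n]
    return [(category_names[i], float(sims[i])) for i in order]

# e.g. most_similar_categories(train_vector, names, embeddings) might return
# [("locomotive", 0.85), ...], matching the example described above.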


Plotting Change

Going back to our original dataset, we now have queries and their PoP change. We have run the queries through our BERTSim class, so it knows the terms and embeddings from our dataset. Now we can use the wonderful matplotlib to bring the data to life in an interesting way.

Calling the diff_plot method, we can plot a view of our categories in two-dimensional, semantic space, with click-change information encoded in the color (green is growth) and size (magnitude of change) of the bubbles.


We included three separate dimension-reduction strategies (algorithms) that take the 768 dimensions of the BERT embeddings down to two. The algorithms are "tsne," "pca" and "umap." We will leave it to the reader to investigate these algorithms, but "umap" offers a good mixture of quality and efficiency.
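As a sketch of what a diff_plot-style chart involves, here is a generic reimplementation with UMAP and matplotlib; the placeholder data stands in for the embeddings and click changes produced earlier, and this is not the library's exact code.

# Reduce 768-dimension embeddings to 2-D with UMAP and plot categories as
# bubbles colored by growth/decline and sized by magnitude of click change.
# Placeholder data is generated so the sketch runs standalone.
import numpy as np
import umap
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
categories = [f"category_{i}" for i in range(30)]
embeddings = rng.normal(size=(30, 768))      # would come from BERTSim
click_diff = rng.integers(-50, 80, size=30)  # would come from the PoP step

coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)

colors = np.where(click_diff >= 0, "green", "red")
sizes = 20 + 5 * np.abs(click_diff)

plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], c=colors, s=sizes, alpha=0.6)
for (x, y), label in zip(coords, categories):
    plt.annotate(label, (x, y), fontsize=7)
plt.title("Category click change in semantic space")
plt.show()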

It is difficult to glean much information from the plot (because ours is a relatively new site), other than an opportunity to cover the Google Analytics API in more depth. This would also be a more informative plot had we removed zero-change categories, but we wanted to show how it semantically clusters topic categories in a meaningful way.

Wrapping Up

In this article, we:

  • Introduced the Apriori algorithm.
  • Showed how you could use Apriori to quickly categorize a thousand queries from GSC.
  • Showed how to use the categories to aggregate PoP click data by category.
  • Provided a method for using BERT embeddings to find semantically related categories.
  • Finally, displayed a plot of the final data showing growth and decline by semantic category positioning.

We have provided all code as open source in the hope that others will play with and extend its capabilities, as well as write more articles showing how various algorithms, new and old, can help make sense of the data all around us.


Measuring the quality of popular keyword research tools
/measuring-the-quality-of-popular-keyword-research-tools-299582
Mon, 11 Jun 2018
Contributor JR Oakes measures the quality of popular keyword research tools against data found in Google search results and performing page data from Google Search Console.


Ever wondered how the results of some popular keyword research tools stack up against the information Google Search Console provides? This article compares data from Google Search Console (GSC) search analytics against notable keyword research tools, as well as against what you can extract from Google itself.

As a bonus, you can get "related searches" and "people also search for" data from Google search results by using the code at the end of this article.

This article is not meant to be a scientific analysis, as it only includes data from seven websites. To ensure we were gathering somewhat comprehensive data, we selected websites from the US and the UK across different verticals.

Procedure

1. Started by defining industries with respect to various website verticals

We used SimilarWeb’s top categories to define the groupings and selected the following categories:

  • Arts and entertainment.
  • Autos and vehicles.
  • Business and industry.
  • Home and garden.
  • Recreation and hobbies.
  • Shopping.
  • Reference.

We pulled anonymized data from a sample of our websites and were able to obtain unseen data from search engine optimization specialists (SEOs) Aaron Dicks and Daniel Dzhenev. Since this initial exploratory analysis involved quantitative and qualitative components, we wanted to spend time understanding the process and nuance rather than making the concessions required in scaling up an analysis. We do think this analysis can lead to a rough methodology for in-house SEOs to make a more informed decision on which tool may better fit their respective vertical.

2. Acquired GSC data from websites in each niche

Data was acquired from Google Search Console programmatically, using a Jupyter notebook.

Jupyter notebooks are an open-source web application that allows you to create and share documents containing live code, equations, visualizations and narrative text. We used one to extract website-level data from the Search Analytics API daily, providing much greater granularity than is currently available in Google's web interface.
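A rough sketch of that daily pull with the google-api-python-client library follows; the property URL and service-account credentials are placeholders, and a real script would page through results with startRow.

# Sketch: pull one day of query/page data from the GSC Search Analytics API.
# The site URL and credential file are placeholders.
from googleapiclient.discovery import build
from google.oauth2 import service_account

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service_account.json", scopes=SCOPES)
service = build("webmasters", "v3", credentials=creds)

body = {
    "startDate": "2018-05-01",
    "endDate": "2018-05-01",     # one day at a time for daily granularity
    "dimensions": ["query", "page"],
    "rowLimit": 25000,
    "startRow": 0,
}
response = service.searchanalytics().query(
    siteUrl="https://www.example.com/", body=body).execute()

for row in response.get("rows", []):
    print(row["keys"], row["clicks"], row["impressions"], row["position"])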

3. Gathered ranking keywords of a single internal page for each website

Since home pages tend to gather many keywords that may or may not be topically relevant to the actual content of the page, we selected an established and performing internal page so the rankings are more likely to be relevant to the content of the page. This is also more realistic, since users tend to do keyword research in the context of specific content ideas.

The image above is an example of the home page ranking for a variety of queries related to the business but not directly related to the content and intent of the page.

We removed brand terms and restricted the Google Search Console queries to first-page results.

Finally, we selected a head term for each page. The phrase “head term” is generally used to denote a popular keyword with high search volume. We chose terms with relatively high search volume, though not the absolute highest search volume. Of the queries with the most impressions, we selected the one that best represented the page.

4. Did keyword research in various keyword tools and looked for the head term

We then used the head term selected in the previous step to perform keyword research in three major tools: Ahrefs, Moz and SEMrush.

The “search suggestions” or “related searches” options were used, and all queries returned were kept, regardless of whether or not the tool specified a metric of how related the suggestions were to the head term.

Below we listed the number of results from each tool. In addition, we extracted the “people also search for” and “related searches” from Google searches for each head term (respective to country) and added the number of results to give a baseline of what Google gives for free.

**This result returned more than 5,000 results! It was truncated to 1,001, the maximum workable, and sorted by descending volume.

We compiled the average number of keywords returned per tool:

5.  Processed the data

We then processed the queries for each source and website using some language processing techniques: we transformed the words into their root forms (e.g., "running" to "run"), removed common words such as "a," "the" and "and," expanded contractions and then sorted the words.

For example, this process would transform "SEO agencies in Raleigh" to "agency Raleigh SEO." This generally keeps the important words and puts them in a consistent order so that we can compare and remove near-duplicate queries.
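One reasonable way to implement that normalization step with NLTK is sketched below; the study's exact pipeline may differ.

# Normalize query phrases: lemmatize, drop stopwords, expand a few
# contractions and sort tokens so near-duplicate queries collapse together.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP = set(stopwords.words("english"))
CONTRACTIONS = {"what's": "what is", "don't": "do not"}  # small sample only
lemmatizer = WordNetLemmatizer()

def normalize(query):
    q = query.lower()
    for contraction, expanded in CONTRACTIONS.items():
        q = q.replace(contraction, expanded)
    tokens = [lemmatizer.lemmatize(t) for t in q.split() if t not in STOP]
    return " ".join(sorted(tokens))

print(normalize("SEO agencies in Raleigh"))  # -> "agency raleigh seo"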

We then created a percentage by dividing the number of unique normalized terms by the total number of terms returned by the tool. This should tell us how much redundancy there is in each tool.

Unfortunately, it does not account for misspellings, which can also be problematic in keyword research tools because they add extra cruft (unnecessary, unwanted queries) to the results. Many years ago, it was possible to target common misspellings of terms on website pages. Today, search engines do a really good job of understanding what you typed, even if it’s misspelled.

In the table below, SEMrush had the highest percentage of unique queries in their search suggestions.

This is important because, if 1,000 keywords are only 70 percent unique, that means 300 keywords basically have no unique value for the task you are performing.

Next, we wanted to see how well the various tools found queries used to find these performing pages. We took the previously unique, normalized query phrases and looked at the percentage of GSC queries the tools had in their results.

In the chart below, note the average GSC coverage for each tool. Moz is higher here, most likely because it returned 1,000 results for most head terms. All tools performed better than related queries scraped from Google (use the code at the end of the article to do the same).
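The coverage figure itself is just a set intersection over normalized phrases; a toy sketch:

# GSC "coverage" = the share of normalized GSC query phrases that also appear
# in a tool's normalized suggestions. The sets below are toy examples.
gsc_queries = {"agency raleigh seo", "agency seo technical", "audit seo site"}
tool_queries = {"agency raleigh seo", "audit seo site", "checker free seo"}

coverage = len(gsc_queries & tool_queries) / len(gsc_queries)
print(f"GSC coverage: {coverage:.0%}")  # -> 67% for this toy example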

Getting into the vector space

After performing the previous analysis, we decided to convert the normalized query phrases into vector space to visually explore the variations in various tools.

Assigning phrases to vector space uses pre-trained word vectors, reduced in dimensionality (to x and y coordinates) using t-distributed Stochastic Neighbor Embedding (t-SNE), a technique available through Python libraries. Don't worry if you are unfamiliar with this; generally, word vectors are words converted into numbers in such a way that the numbers represent the inherent semantics of the keywords.

Converting the words to numbers helps us process, analyze and plot the words. When the semantic values are plotted on a coordinate plane, we get a clear understanding of how the various keywords are related. Points grouped together will be more semantically related, while points distant from one another will be less related.
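A minimal sketch of that pipeline, using publicly downloadable GloVe vectors via Gensim and scikit-learn's t-SNE, is below; it is a generic illustration of the approach, not the exact code behind the plots that follow.

# Average pre-trained word vectors per phrase, then squash them to 2-D with
# t-SNE for plotting. Generic illustration of the approach.
import numpy as np
import gensim.downloader as api
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

wv = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first run

phrases = ["seo agency", "technical seo audit", "shower curtains", "toilet seats"]

def phrase_vector(phrase):
    vectors = [wv[w] for w in phrase.split() if w in wv]
    return np.mean(vectors, axis=0)

X = np.vstack([phrase_vector(p) for p in phrases])
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), label in zip(coords, phrases):
    plt.annotate(label, (x, y))
plt.show()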

Shopping

This is an example where Moz returns 1,000 results, yet the search volume and searcher keyword variations are very low. This is likely caused by Moz semantically matching particular words instead of trying to match the meaning of the phrase. We asked Moz's Russ Jones to help us better understand how Moz finds related phrases:

“Moz uses many different methods to find related terms. We use one algorithm that finds keywords with similar pages ranking for them, we use another ML algorithm that breaks up the phrase into constituent words and finds combinations of related words producing related phrases, etc. Each of these can be useful for different purposes, depending on whether you want very close or tangential topics. Are you looking to improve your rankings for a keyword or find sufficiently distinct keywords to write about that are still related? The results returned by Moz Explorer is our attempt to strike that balance.”

Moz does include a nice relevancy measure, as well as a filter for fine-tuning the keyword matches. For this analysis, we just used the default settings:

In the image below, the plot of the queries shows what is returned by each keyword vendor converted into the coordinate plane. The position and groupings impart some understanding of how keywords are related.

In this example, Moz (orange) produces a significant volume of various keywords, while other tools picked far fewer (Ahrefs in green) but more related to the initial topic:

Autos and vehicles

This is a fun one. You can see that Moz and Ahrefs had pretty good coverage of this high-volume term. Moz won by matching 34 percent of the actual terms from Google Search Console. Moz had double the number of results (almost by default) that Ahrefs had.

SEMrush lagged here with 35 queries for a topic with a broad amount of useful variety.

The larger gray points represent more “ground truth” queries from Google Search Console. Other colors are the various tools used. Gray points with no overlaid color are queries that various tools did not match.

Internet and telecom

This plot is interesting in that SEMrush jumped to nearly 5,000 results, from the 50-200 range in other results. You can also see (toward the bottom) that there were many terms outside of what this page tended to rank for or that were superfluous to what would be needed to understand user queries for a new page:

Most tools grouped somewhat close to the head term, while you can see that SEMrush (in purplish-pink) produced a large number of potentially more unrelated points, even though Google People Also Search were found in certain groupings.

General merchandise   

Here is an example of a keyword tool finding an interesting grouping of terms (groupings indicated by black circles) that the page currently doesn’t rank for. In reviewing the data, we found the grouping to the right makes sense for this page:

The two black circles help to visualize the ability to find groupings of related queries when plotting the text in this manner.

Analysis

Search engine optimization specialists with experience in keyword research know there is no one tool to rule them all.  Depending on the data you need, you may need to consult a few tools to get what you are after.

Below are my general impressions from each tool after reviewing, qualitatively:

  • The query data and numbers from our analysis of the uniqueness of results.
  • The likelihood of finding terms that real users use to find performing pages.

Moz     

Moz seems to have impressive numbers in terms of raw results, but we found that the overall quality and relevance of results was lacking in several cases.

Even when playing with the relevancy scores, it quickly went off on tangents, providing queries that were in no way related to my head term (see Moz suggestions for “Nacho Libre” in image above).

With that said, Moz is very useful due to its comprehensive coverage, especially for SEOs working in smaller or newer verticals. In many cases, it is exceedingly difficult to find keywords for newer trending topics, so more keywords are definitely better here.

An average of 64 percent coverage of real user data from GSC for the selected domains was very impressive. This also tells you that while Moz's results can tend to go down rabbit holes, they get a lot right as well. They have traded off some fidelity for comprehensiveness.

Ahrefs

Ahrefs was my favorite in terms of quality due to their nice marriage of comprehensive results with the minimal amount of clearly unrelated queries.

It had the lowest average number of reported keywords per vendor, but this is actually misleading due to the large outlier from SEMrush. Across the various searches, it tended to return a nice array of terms without a lot of clutter to wade through.

Most impressive to me was a specific type of niche grill that shared a name with a popular location. The results from Ahrefs stayed right on point, while SEMrush returned nothing, and Moz went off on tangents with many keywords related to the popular location.

A representative of Ahrefs clarified for me that their tool's "search suggestions" use data from Google Autosuggest. They currently do not have a true recommendation engine the way Moz does. Using the "Also ranks for" and "Having same terms" data from Ahrefs would put them more on par with the number of keywords returned by other tools.

 SEMrush   

SEMrush overall offered great quality, with 90 percent of its keywords being unique. It was also on par with Ahrefs in terms of matching queries from GSC.

It was, however, the most inconsistent in terms of the number of results returned. It yielded 1,000+ keywords (actually 5,000) for Internet and Telecom > Telecommunications, yet only covered 22 percent of the queries in GSC. For another result, it was the only tool not to return related keywords. This is a very small dataset, so there is clearly an argument that these were anomalies.

Google: People Also Search For/Related Searches 

These results were extremely interesting because they tended to more closely match the types of searches users would make while in a particular buying state, as opposed to those specifically related to a particular phrase. 

For example, looking up “[term] shower curtains” returned “[term] toilet seats.”

These are unrelated from a semantic standpoint, but they are both relevant for someone redoing their bathroom, suggesting the similarities are based on user intent and not necessarily the keywords themselves.

Also, since data from “people also search” are tied to the individual results in Google search engine result pages (SERPs), it is hard to say whether the terms are related to the search query or operate more like site links, which are more relevant to the individual page.

Code used

When entered into the Javascript Console of Google Chrome on a Google search results page, the following will output the “People also search for” and “Related searches” data in the page, if they exist.

In addition, there is a Chrome add-on called Keywords Everywhere which will expose these terms in search results, as shown in several SERP screen shots throughout the article.

Conclusion

Especially for in-house marketers, it is important to understand which tools tend to have data most aligned with your vertical. In this analysis, we showed some benefits and drawbacks of a few popular tools across a small sample of topics. We hope to have provided an approach that could form the underpinnings of your own analysis, or be further improved upon, and to give SEOs a more practical way of choosing a research tool.

Keyword research tools are constantly evolving and adding newly found queries through the use of clickstream data and other data sources. The utility of these tools rests squarely on their ability to help us understand more succinctly how to better position our content to fit real user interest, not on the raw number of keywords returned. Don't just use what has always been used. Test various tools and gauge their usefulness for yourself.

Common Search: The open source project bringing back PageRank
/common-search-open-source-project-bringing-back-pagerank-262736
Mon, 14 Nov 2016
Columnist JR Oakes explains Common Search, a great open source tool for understanding how search engines work, which has a hidden gem for those of us who miss checking our PageRank score.


Over the last several years, Google has slowly reduced the amount of data available to SEO practitioners. First it was keyword data, then PageRank score. Now it is specific search volume from AdWords (unless you are spending some moola). You can read more about this in Russ Jones’s excellent article that details the impact of his company’s research and insights into clickstream data for volume disambiguation.

One item that we have gotten really involved in recently is Common Crawl data. There are several teams in our industry that have been using this data for some time, so I felt a little late to the game. Common Crawl is an open source project that scrapes the entire internet at regular intervals. Thankfully, Amazon, being the great company it is, pitched in to store the data and make it available to many without the high storage costs.

In addition to Common Crawl data, there is a non-profit called Common Search whose mission is to create an alternative open source and transparent search engine — the opposite, in many respects, of Google. This piqued my interest because it means that we all can play, tweak and mangle the signals to learn how search engines operate without the huge time investment of starting from ground zero.

Common Search data

Currently, Common Search uses the following data sources for calculating their search rankings (This is taken directly from their website):

  • Common Crawl: The largest open repository of web crawl data. This is currently our unique source of raw page data.
  • Wikidata: A free, linked database that acts as central storage for the structured data of many Wikimedia projects like Wikipedia, Wikivoyage and Wikisource.
  • UT1 Blacklist: Maintained by Fabrice Prigent from the Université Toulouse 1 Capitole, this blacklist categorizes domains and URLs into several categories, including “adult” and “phishing.”
  • DMOZ: Also known as the Open Directory Project, it is the oldest and largest web directory still alive. Though its data is not as reliable as it was in the past, we still use it as a signal and metadata source.
  • Web Data Commons Hyperlink Graphs: Graphs of all hyperlinks from a 2012 Common Crawl archive. We are currently using its Harmonic Centrality file as a temporary ranking signal on domains. We plan to perform our own analysis of the web graph in the near future.
  • Alexa top 1M sites: Alexa ranks websites based on a combined measure of page views and unique site users. It is known to be demographically biased. We are using it as a temporary ranking signal on domains.

Common Search ranking

In addition to these data sources, an investigation of the code shows that Common Search also uses URL length, path length and domain PageRank as ranking signals in its algorithm. Lo and behold, since July, Common Search has had its own data on host-level PageRank, and we all missed it.

I will get to the PageRank (PR) in a moment, but it is interesting to review the code of Common Search, especially the ranker.py portion located here, because you really can get into the driver's seat by tweaking the weights of the signals it uses to rank pages:

signal_weights = {
    "url_total_length": 0.01,
    "url_path_length": 0.01,
    "url_subdomain": 0.1,
    "alexa_top1m": 5,
    "wikidata_url": 3,
    "dmoz_domain": 1,
    "dmoz_url": 1,
    "webdatacommons_hc": 1,
    "commonsearch_host_pagerank": 1
}

Of particular note, as well, is that Common Search uses BM25 as the similarity measure between the keyword and the document body and metadata. BM25 is a better measure than TF-IDF because it takes document length into account, meaning a 200-word document that contains your keyword five times is probably more relevant than a 1,500-word document that contains it the same number of times.
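For the curious, here is a compact sketch of the textbook BM25 formula with the usual k1 and b defaults; it is the general scoring function, not Common Search's exact implementation.

# Textbook BM25 scoring of documents for a query (k1=1.5, b=0.75).
import math
from collections import Counter

docs = [
    ("seo " * 5 + "word " * 195).split(),   # ~200 words, keyword 5 times
    ("seo " * 5 + "word " * 1495).split(),  # ~1,500 words, keyword 5 times
]
avgdl = sum(len(d) for d in docs) / len(docs)
N = len(docs)

def bm25(query, doc, k1=1.5, b=0.75):
    score = 0.0
    tf = Counter(doc)
    for term in query.split():
        df = sum(term in d for d in docs)
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

for doc in docs:
    # The shorter document scores higher for the same term frequency.
    print(len(doc), "words ->", round(bm25("seo", doc), 3))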

It is also worthwhile to say that the number of signals here is very rudimentary and obviously missing many of the refinements (and data) that Google has integrated in their search ranking algorithm. One of the key things that we are working on is to use the data available in Common Crawl and the infrastructure of Common Search to do topic vector search for content that is relevant based on semantics, not just keyword matching.

On to PageRank

On the page here, you can find links to the host-level PageRank for the June 2016 Common Crawl. I am using the one entitled pagerank-top1m.txt.gz (top 1 million) because the other file is 3GB and over 112 million domains. Even in R, I do not have enough machine to load it without capping out.

After downloading, you will need to bring the file into your working directory in R. The PageRank data from Common Search is not normalized, and it is not in the clean 0-10 format we are all used to seeing. Common Search uses "max(0, min(1, float(rank) / 244660.58))" — basically, a domain's rank divided by Facebook's rank — to translate the data into a distribution between 0 and 1. But this leaves some definite gaps; for example, it would leave LinkedIn's PageRank as a 1.4 when scaled to 10.


The following code will load the dataset and append a PR column with a better approximated PR:

#Grab the data
df <- read.csv("pagerank-top1m.txt", header = F, sep = " ")

#Log Normalize
logNorm <- function(x){

    #Normalize
    x <- (x-min(x))/(max(x)-min(x))
    10 / (1 - (log10(x)*.25))
}

#Append a Column named PR to the dataset
df$pr <- (round(logNorm(df$V2),digits = 0))


We had to play around a bit with the numbers to get it somewhere close to the old Google PR (for several sample domains whose PR I remembered). Below are a few example PageRank results:

  • en.wikipedia.org (8)
  • searchengineland.com (6)
  • consultwebs.com (5)
  • youtube.com (9)
  • moz.com (6)

Here is a plot of 100,000 random samples. The calculated PageRank score is along the Y-axis, and the original Common Search score is along the X-axis.


To grab your own results, you can run the following command in R (Just substitute your own domain):

df[df$V1=="searchengineland.com",c('pr')]


Keep in mind that this dataset only has the top one million domains by PageRank, so out of 112 million domains that Common Search indexed, there is a good chance your site may not be there if it doesn’t have a pretty good link profile. Also, this metric includes no indication of the harmfulness of links, only an approximation of your site’s popularity with respect to links.

Common Search is a great tool and a great foundation. I am looking forward to getting more involved with the community there and hopefully learning to understand the nuts and bolts behind search engines better by actually working on one. With R and a little code, you can have a quick way to check PR for a million domains in a matter of seconds. Hope you enjoyed!

Using word vectors and applying them in SEO
/word-vectors-implication-seo-258599
Mon, 19 Sep 2016
Contributor JR Oakes takes a look at technology from the natural language processing and machine-learning community to see if it's useful for SEO.


Today, the SEO world is abuzz with the term “relevancy.” Google has gone well past keywords and their frequency to looking at the meaning imparted by the words and how they relate to the query at hand.

In fact, for years, the common term used for working with text and language had been natural language processing (NLP). The new focus, though, is natural language understanding (NLU). In the following paragraphs, we want to introduce you to a machine-learning product that has been very helpful in quantifying and enhancing relevancy of content.

Earlier this year, we started training models based on a code base called Char-rnn from Andrej Karpathy. The really interesting thing about this code base was that you could (after training) end up with a model that would generate content based on what it learned from the training documents. It wouldn’t just repeat the content, but it would generate new readable (although quite nonsensical) content.

It operates by using a neural network to learn which character to guess next. If you have the time, Karpathy’s write-up is a fascinating read that will help you understand a bit more about how this works.

In testing out various code bases, we came across one that, instead of predicting characters, attempted to predict which words would come next. The most interesting part of this was that it used something called GloVe embeddings that were basically words turned into numbers in such a way that the plot of the number coordinates imparted semantic relationships between the words. I know, that was a mouthful.

What is GloVe?

GloVe stands for “global vectors for word representation.” They are built from very large content corpuses and look at co-occurrence statistics of words to define relationships between those words. From their site:

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

Here is an example of the term “SEO” converted into a word vector:

Word vector representation for seo

To work with GloVe embeddings, you need familiarity with Python and Word2Vec, as well as a server with enough memory to hold vectors trained on a corpus of 6+ billion words. You have been warned.

Why are GloVe vectors important?

GloVe vectors are important because they can help us understand and measure relevancy. Using Word2Vec, you can do things like measure the similarity between words or documents, find most similar words to a word or phrase, add and subtract words from each other to find interesting results, and also visualize the relationship between words in a document.

Similarity

If you have an understanding of Python, Gensim is an excellent tool for running similarity analysis on words and documents. We updated a converter on Github, here, to make it easier to convert GloVe vectors into a format that Gensim can use.

To show the power of GloVe vectors to produce semantically similar words to a seed word or phrase, take a look at the following image. This was the result of finding the most similar words to “dui lawyer” using the Gensim library and GloVe vectors (geographical terms were removed).

Similarity terms for dui lawyer

Note how these are not word variations or synonyms, but rather concepts that you would expect to encounter when dealing with an attorney in this practice area.
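A small sketch of that kind of similarity lookup with Gensim, using the publicly downloadable GloVe vectors rather than the converted set mentioned above:

# Find terms most similar to a seed phrase by averaging its word vectors.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

phrase = ["dui", "lawyer"]
phrase_vector = np.mean([wv[w] for w in phrase if w in wv], axis=0)

for word, score in wv.similar_by_vector(phrase_vector, topn=15):
    print(f"{word:<15} {score:.3f}")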

Adding and subtracting vectors

One of the most frequently used examples of the power of these vectors is shown below. Since the words are converted into numerical vectors, and there are semantic relationships in the position of the vectors, this means that you can use simple arithmetic on the vectors to find additional meaning. In this example, the words “King,” “Man” and “Woman” are turned into GloVe vectors prior to addition and subtraction, and “Queen” is very close to the resulting vector.

Adding and subtracting vectors
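With Gensim, that arithmetic is a one-liner; here is a self-contained sketch using the same downloadable GloVe vectors.

# king - man + woman ~= queen, via most_similar with positive/negative lists.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is typically the top result.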

Visualization

Once we are able to turn a document of text into its resulting vectors, we can plot those words using a very cool technique called t-SNE, along with d3.js. We have put together a simple demo that allows you to enter a keyword phrase and two ranking URLs to view the difference in vector space using GloVe vectors.

Demo is here.

It’s important to point out a few things to look for when using the demo.

Look at the relationships between close words

Notice how groupings of words are not simply close variations or synonyms, but rather unique words that just belong together.

Keyword vector space grouping

Use pages with a good amount of content

The tool works by extracting the content on the page, so if there is not much to work with, the result will not be great. Be careful using home pages, pages that are listings of excerpts or mostly image-based content.

Small words don’t mean small value

The size of the resulting words is based on the frequency with which the word was encountered, not the importance of the word. If you enter a comparison URL that is ranking higher than you for the same term, take note of the color differences to see topics or topic areas that you may be missing on your page.

Wrapping it up

Obviously, from an SEO perspective, it is beneficial to create content that covers a topic as thoroughly as possible and ensures a good experience for your visitor. While we don't expect all SEOs to run out and learn Python, we think it is important to relay that there is amazing power to be leveraged toward that end. GloVe vectors are one of the many tools that can give you an edge on the competition.

Finally, for those who are fans of latent dirichlet allocation (LDA), Chris Moody released a project this year called LDA2Vec that uses LDA’s topic modeling, along with word vectors, to create an interesting way to assign and understand the various topics within a corpus of text.

Data exploration from: An experiment in trying to predict Google rankings
/data-exploration-experiment-trying-predict-google-rankings-253638
Fri, 22 Jul 2016
Contributor JR Oakes shares a deep dive into 500,000 search results for transactional service industry queries from 200 of the largest cities in America.


Introduction

Over the last few months, we have been working with a company named Statec (a data science company from Brazil) to engineer features for predictive algorithms. One of the initial considerations in working with predictive algorithms is picking relevant data to train them on.

We set out quite naively to put together a list of webpage features that we thought might offer some value. Our goal was simply to see if, from the available features, we could get close to predicting the rank of a webpage in Google. We learned early in this process that we had to put blinders on to data that was unreachable and hope for the best with what we had.

The following is an analysis of the data we collected, how we collected it and useful correlations derived from the data.

The data

One initial problem was that we needed to gain access to ranking data for enough search engine results page (SERP) results to provide a useful training set. Luckily, GetStat made this very easy. With GetStat, we simply loaded up keyword combinations of the top 25 service industries and the top 200 US cities (by size), localized to each city. This resulted in 5,000 unique search terms (e.g., "Charlotte Accountant" taken from Charlotte, NC).

Our company, Consultwebs, is focused on legal marketing, but we wanted the model to be more universal. After loading up the 5,000 terms and waiting a day, we then had roughly 500,000 search results we could use to construct our data set.

After finding this so easy, we collected the rest of the data. I had built several crawlers with Node.js, so I decided to build a feature extraction mechanism on top of pre-existing work. Luckily, Node.js is an excellent ecosystem for this type of job. Below I list several libraries that make Node wonderful for data collection:

  • Aylien TextAPI — This is a node API for a third-party service that does sentiment analysis, text extraction, summarization, concept/keyword extraction and Named-Entity Recognition (NER).
  • Natural — An awesome natural language processing toolkit for node. It doesn’t hold a candle to what is available on Python, but was surprisingly helpful for our needs.
  • Text Statistics — Helpful to get data on sentence length, reading level and so on.
  • Majestic — I started out crawling their API via a custom script, but they provided the data in one gulp, which was very nice. Thanks, Dixon!
  • Cheerio — An easy-to-use library for parsing DOM elements using jQuery-style markup.
  • IPInfo — Not really a library, but a great API to get server info.

The crawling process was very slow, due mainly to hit limits imposed by API providers and our proxy service. We would have created a cluster, but the expense limited us to hitting a couple of the APIs about once per second.

Slowly, we gained a complete crawl of the full 500,000 URLs. Below are a few notes on my experience with crawling URLs for data collection:

  • Use APIs where possible. Aylien was invaluable at doing tasks where node libraries would be inconsistent.
  • Find a good proxy service that will allow switching between consecutive calls.
  • Create logic for websites and content types that may cause errors. Craigslist, PDF and Word docs caused issues during the crawl.
  • Check the collected data diligently, especially during the first several thousand results, to make sure that errors in the crawl are not creating issues with the structure of the data collected.

The results

We have reported our results from the ranking predictions in a separate post, but I wanted to review some of the interesting insights in the data collected.

Most competitive niches

For this data, we reduced the entire data set to only include rankings in the top 20 and also removed the top four percent of observations based on referring domains. The goal in removing the top four percent of referring domains was to keep URLs such as Google, Yelp and other large websites from having undue influence on the averages. Since we were focusing on service industry results, we wanted to make sure that local business websites would likely be compared, and not major directories.
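That recurring filter, and the Pearson correlations reported in the sections below, can be sketched with Pandas as follows; the column names are assumptions about the collected data set.

# Keep top-20 rankings, drop the top 4% of observations by referring domains,
# then correlate a metric's per-rank average with rank position (Pearson r).
# Column names are assumptions about the collected data.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("serp_features.csv")

cutoff = df["referring_domains"].quantile(0.96)
filtered = df[(df["rank"] <= 20) & (df["referring_domains"] <= cutoff)]

by_rank = filtered.groupby("rank")["citation_flow"].mean()
r, _ = pearsonr(by_rank.index, by_rank.values)
print(f"Pearson correlation of average Citation Flow vs. rank: {r:.3f}")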

In the chart below, we assume that the web designer category is the largest due to the practice of footer links from website work. The next two highest are no surprise to those of us who work in the legal niche.

Referring domains by niche chart

Top city link competition

Again we filtered to the top 20 ranking results across all observations and also removed the top four percent of observations based on referring domains to remove URLs from Google, Yelp and other large websites. Feel free to use this in proposals when qualifying needs for clients in particular cities.

The top results here are no surprise to those of us who have had clients in these cities. New York, in particular, is a daunting task for many niches.

Average referring domains by city chart

Facebook shares

For this data, we kept the full rank data at 100 results per search term, but we removed observations with referring domains over the top four percent threshold and over 5,000 Facebook shares. This was a minimal reduction to the overall size, yet it made the data plot much cleaner.

The plot reminds me of when I go out to the shooting range, in that there is really no order to the shots. The Pearson correlation of average shares to rank is 0.016, and you can tell from the chart that it would be hard to draw a line between Facebook shares and any effect on ranking for these types of sites.

Average Facebook shares by rank chart

Majestic Citation Flow

For Citation Flow (CF), we stayed with the full 100 results per search term, but we again removed the top four percent of referring domains. Unsurprisingly to anyone who uses this metric, there was a very strong correlation of -0.872 between the average CF score and ranking position. The correlation is negative because the rank number gets lower (better) as the CF score gets higher. This is a good reason to use CF.

Average Majestic Citation Flow by rank chart

Majestic Trust Flow

For Trust Flow, we stayed with the full 100 results per search term, but we again removed the top four percent of referring domains. The correlation was not as strong as Citation Flow, but relatively strong at -0.695. An interesting note from the graph is the trajectory upward as you get into the top 20 results. Also notice that the 1 to 3 positions are probably skewed due to the impact of other metrics on local results.

Average Majestic Trust Flow by rank chart

Response time

Speed is on top of everyone’s mind today with Google’s focus on it and new projects like AMP. Due to crawling limitations, we were only able to measure the time it took for the hosting server to get us the contents of the page. We wanted to be careful not to call this load time, as that is often considered as the time it takes your browser to load and render the page. There is also a consideration of latency encountered between our server (AWS) and the host, but we think in aggregate any skew in results would be negligible.

Again, this is 100 search results for each search term, with the top four percent by referring domains removed. The Pearson correlation is 0.414, which suggests a relationship between response time and ranking.

Although this is similar to the correlation Backlinko found for HTTPS, it might be explained by the fact that better-run, better-optimized sites in general tend to be toward the top. In the Backlinko findings, I would question whether it is accurate to chalk HTTPS up to a Google ranking preference (I know what they said) or to the fact that in many verticals, the top results are dominated by brands that tend toward HTTPS.

Average server response time by rank chart

Text length

This one was a bit of a shock to me, but keep in mind that the keywords in this data set were more transactional in nature and not the usual Wikipedia-eliciting queries. The full 100 results were used, with the top four percent by referring domains removed.

The Pearson correlation to rank is 0.829, which suggests that it may not all be about longer content. Please note that, again, local results are clearly present, and that text length is measured in characters; it can be converted to words, on average, by dividing by 4.5.

Average text length by rank chart

Server type

One of the other features we collected is server type. This data was pulled from the "Server" response header and classified into one of 13 categories. We restricted the results to the top 20 for each search term, and no filter was placed on referring domains. Also, we omitted types that were not defined or were infrequent in the dataset. The type "GWS" is Google Web Server; its lower average rank can be attributed to Google video and Google local results typically appearing in prominent positions.

Average rank by server type chart

URL depth

For URL depth, we filtered to the top 20 ranking results across all observations and removed the top four percent of observations based on referring domains to exclude URLs from Google, Yelp and other large websites. This is an interesting one, because the common advice is that you want your most important pages as close to the root of the site as possible. Also notice the impact of local results in terms of preference for the home page of a website.

Average URL length by rank chart

Conclusion

I don’t think there was anything truly earth-shattering in the results of our data analysis, and this is only a small sampling of data from the 70+ features that we collected during our training.

The two most important takeaways for me are that links and speed are areas where one can make the most impact on a website. Content needs to be good (and there are indications all over the place that user behavior influences rank for some verticals), but you need to be seen to generate user behavior. The one thing that is most interesting in this data set is that it is more geared towards small business-type queries than other studies that sample a broad range of queries.

I have always been an advocate of testing rather than relying on what works for other people or what is reported on your favorite blogs. GetStat and a bit of JavaScript (Node) give you the ability to easily put together collection mechanisms and get a more nuanced view of results relevant to the niche you are working in. Being able to deliver these types of studies can also help when trying to justify to our non-SEO peers why we are recommending that things be done a certain way.

An experiment in trying to predict Google rankings
/experiment-trying-predict-google-rankings-253621
Wed, 20 Jul 2016
In late 2015, JR Oakes and his colleagues undertook an experiment to attempt to predict Google ranking for a given webpage using machine learning. What follows are their findings, which they wanted to share with the SEO community.


Machine learning is quickly becoming an indispensable tool for many large companies. Everyone has, for sure, heard about Google’s AI algorithm beating the World Champion in Go, as well as technologies like RankBrain, but machine learning does not have to be a mystical subject relegated to the domain of math researchers. There are many approachable libraries and technologies that show promise of being very useful to any industry that has data to play with.

Machine learning also has the ability to turn traditional website marketing and SEO on its head. Late last year, my colleagues and I (rather naively) began an experiment in which we threw several popular machine learning algorithms at the task of predicting ranking in Google. We ended up with an ensemble that achieved 41 percent true positives and 41 percent true negatives on our data set.

In the following paragraphs, I will take you through our experiment, and I will also discuss a few important libraries and technologies that are important for SEOs to begin understanding.

Our experiment

Toward the end of 2015, we started hearing more and more about machine learning and its promise to make use of large amounts of data. The more we dug in, the more technical it became, and it quickly became clear that it would be helpful to have someone help us navigate this world.

About that time, we came across a brilliant data scientist from Brazil named Alejandro Simkievich. The interesting thing to us about Simkievich was that he was working in the areas of search relevance and conversion rate optimization (CRO) and placing very well in important Kaggle competitions. (For those of you not familiar, Kaggle is a website that hosts machine learning competitions for data scientists and machine learning enthusiasts.)

Simkievich is the owner of Statec, a data science/machine learning consulting company, with clients in the consumer goods, automotive, marketing and internet sectors. Lots of Statec’s work had been focused on assessing the relevance of e-commerce search engines. Working together seemed a natural fit, since we are obsessed with using data to help with decision-making for SEO.

We like to set big, hairy goals, so we decided to see if we could use the data available from scraping, rank trackers, link tools and a few other sources to create features that would allow us to predict the rank of a webpage. While we knew going in that the likelihood of pulling it off was very low, we pushed ahead for the chance at an amazing win, as well as the opportunity to learn some really interesting technology.

The data

Fundamentally, machine learning is using computer programs to take data and transform it in a way that provides something valuable in return. “Transform” is a very loosely applied word, in that it doesn’t quite do justice to all that is involved, but it was selected for the ease of understanding. The point here is that all machine learning begins with some type of input data.

(Note: There are many tutorials and courses freely available that do a very good job of covering the basics of machine learning, so we will not do that here. If you are interested in learning more, Andrew Ng has an excellent free class on Coursera here.)

The bottom line is that we had to find data that we could use to train a machine learning model. At this point, we didn’t know exactly what would be useful, so we used a kitchen-sink approach and grabbed as many features as we could think of. GetStat and Majestic were invaluable in supplying much of the base data, and we built a crawler to capture everything else.

Image of data used for analysis

Our goal was to end up with enough data to successfully train a model (more on this later), and this meant a lot of data. For the first model, we had about 200,000 observations (rows) and 54 attributes (columns).

A little background

As I said before, I am not going to go into a lot of detail about machine learning, but it is important to grasp a few points to understand the next section. In general, much of the machine learning work today deals with regression, classification and clustering algorithms. I will define the first two here, as they were relevant to our project.

Image showing the difference between classification and regression algorithms

  • Regression algorithms are normally useful for predicting a single number. If you needed to create an algorithm that predicted a stock price based on features of stocks, you would select this type of model. These are called continuous variables.
  • Classification algorithms are used to predict a member of a class of possible answers. This could be a simple “yes or no” classification, or “red, green or blue.” If you needed to predict whether an unknown person was male or female from features, you would select this type of model. These are called discrete variables.

Machine learning is a very technical space right now, and much of the cutting-edge work requires familiarity with linear algebra, calculus, mathematical notation and programming languages like Python. One thing that helped me understand the overall flow at an approachable level, though, was to think of a machine learning model as applying weights to the features in the data you give it. The more important the feature, the stronger the weight.

When you read about “training models,” it is helpful to visualize a string connected through the model to each weight, and as the model makes a guess, a cost function is used to tell you how wrong the guess was and to gently, or sternly, pull the string in the direction of the right answer, correcting all the weights.
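As a rough illustration of that intuition (not our actual training code), here is a toy gradient descent loop in Python with a single weight and made-up numbers: the cost function measures how wrong the guess is, and the gradient is the “pull” that corrects the weight.

# A toy illustration of the "pull the string" intuition: gradient descent on a
# single weight, with a squared-error cost. Numbers are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # one toy feature
y = np.array([2.0, 4.0, 6.0, 8.0])   # target (true relationship: y = 2x)

weight = 0.0          # initial guess
learning_rate = 0.05  # how hard each "pull" corrects the weight

for step in range(100):
    prediction = weight * x
    error = prediction - y                 # how wrong the guess was
    cost = np.mean(error ** 2)             # the cost function
    gradient = np.mean(2 * error * x)      # direction and strength of the pull
    weight -= learning_rate * gradient     # correct the weight

print(round(weight, 3))  # approaches 2.0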

The part below gets a bit technical with terminology, so if it is too much for you, feel free to skip to the results and takeaways in the final section.

Tackling Google rankings

Now that we had the data, we tried several approaches to the problem of predicting the Google ranking of each webpage.

Initially, we used a regression algorithm. That is, we sought to predict the exact ranking of a site for a given search term (e.g., a site will rank X for search term Y), but after a few weeks we realized the task was too difficult. First, a ranking is by definition a characteristic of a site relative to other sites, not an intrinsic characteristic of the site (as word count is, for example). Since it was impossible for us to feed our algorithm all of the sites ranked for a given search term, we reformulated the problem.

We realized that, in terms of Google ranking, what matters most is whether a given site ends up on the first page for a given search term. Thus, we re-framed the problem: What if we try to predict whether a site will end up in the top 10 sites ranked by Google for a certain search term? We chose top 10 because, as they say, you can hide a dead body on page two!

From that standpoint, the problem turns into a binary (yes or no) classification problem, where we have only two classes: a) the site is a top 10 site, or b) the site is not a top 10 site. Furthermore, instead of making a binary prediction, we decided to predict the probability that a given site belongs to each class.

Later, to force ourselves to make a clear-cut decision, we set a threshold above which we predict that a site will be top 10. For example, if the threshold is 0.85 and we predict that the probability of a site being in the top 10 is higher than 0.85, we go ahead and predict that the site will be in the top 10.
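A minimal sketch of this reframing, with invented data and hypothetical column names, might look like the following: ranks become a binary top-10 label, and a probability threshold (0.85, as in the example above) turns probabilities into yes/no predictions.

# A sketch of reframing rank prediction as binary classification.
# Column names ("rank") and the 0.85 threshold mirror the example above;
# the data itself is made up.
import pandas as pd

df = pd.DataFrame({"rank": [3, 15, 8, 42, 1, 27]})

# Label: 1 if the page ranked in the top 10 for its query, else 0.
df["top10"] = (df["rank"] <= 10).astype(int)

# Suppose a model produced these probabilities of being top 10.
df["prob_top10"] = [0.91, 0.40, 0.87, 0.10, 0.95, 0.55]

# Apply a clear-cut decision threshold.
threshold = 0.85
df["predicted_top10"] = (df["prob_top10"] >= threshold).astype(int)
print(df)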

To measure the performance of the algorithm, we decided to use a confusion matrix.

The following chart provides an overview of the entire process.

Image visually showing our machine learning process

Cleaning the data

We used a data set of 200,000 records, including roughly 2,000 different keywords/search terms.

In general, we can group the attributes we used into three categories:

  • Numerical features
  • Categorical variables
  • Text features

Numerical features are those that can take on any number within an infinite or finite interval. Some of the numerical features we used are reading ease, grade level, text length, average number of words per sentence, URL length, website load time, number of domains referring to the website, number of .edu domains referring to the website, number of .gov domains referring to the website, Trust Flow for a number of topics, Citation Flow, Facebook shares, LinkedIn shares and Google shares. We applied a standard scaler to these features to center them around the mean and scale them to unit variance; other than that, they required no further preprocessing.
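For illustration, here is how that centering step might look in Python with scikit-learn’s StandardScaler; the column names are placeholders, not our actual schema.

# A minimal sketch of centering numerical features with StandardScaler;
# the column names are illustrative, not our actual feature set.
import pandas as pd
from sklearn.preprocessing import StandardScaler

numeric = pd.DataFrame({
    "text_length": [450, 1200, 3300, 780],
    "url_length": [32, 58, 21, 44],
    "referring_domains": [12, 540, 3, 89],
})

scaler = StandardScaler()                 # centers on the mean, unit variance
scaled = scaler.fit_transform(numeric)    # returns a NumPy array
print(scaled.mean(axis=0).round(3))       # each column is now centered near 0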

A categorical variable is one which can take on a limited number of values, with each value representing a different group or category. The categorical variables we used include most frequent keywords, as well as locations and organizations throughout the site, in addition to topics for which the website is trusted. Preprocessing for these features included turning them into numerical labels and subsequent one-hot encoding.
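A sketch of that two-step preprocessing, with an invented “trusted_topic” column standing in for our categorical attributes, might look like this.

# A sketch of the categorical preprocessing described above: map categories to
# integer labels, then one-hot encode them. Category values are invented.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

categories = pd.DataFrame({"trusted_topic": ["travel", "finance", "travel", "health"]})

# Step 1: turn categories into numerical labels.
labels = LabelEncoder().fit_transform(categories["trusted_topic"])

# Step 2: one-hot encode the labels (one binary column per category).
one_hot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()
print(one_hot)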

Text features are, as the name suggests, composed of text. They include the search term, website content, title, meta description, anchor text, headers (H1, H2, H3) and others.

It is important to highlight that there is not a clear-cut difference between some categorical attributes (e.g., organizations mentioned on the site) and text, and some attributes indeed switched from one category to the other in different models.

Feature engineering

We also engineered additional features that correlate with rank.

Most of these features are Boolean (true or false), but some are numerical. An example of a Boolean feature is whether the exact search term appears in the website text, whereas a numerical feature would be how many of the tokens in the search term appear in the website text.

Below are some of the features we engineered.

Image showing boolean and quantitative features that were engineered
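As a rough illustration of the two feature types, here is a small sketch with a hypothetical query and page text: an exact-match Boolean and a token-overlap count.

# A sketch of two of the engineered-feature types described above: a Boolean
# "exact query in page text" flag and a numerical token-overlap count.
# The example query and text are invented.
def exact_match(query: str, text: str) -> bool:
    """True if the exact search term appears in the page text."""
    return query.lower() in text.lower()

def token_overlap(query: str, text: str) -> int:
    """How many unique query tokens appear in the page text."""
    text_tokens = set(text.lower().split())
    return sum(1 for token in set(query.lower().split()) if token in text_tokens)

query = "technical seo agency"
page_text = "We are a technical SEO and digital marketing agency based in Raleigh."

print(exact_match(query, page_text))    # False: the exact phrase is not present
print(token_overlap(query, page_text))  # 3: "technical", "seo" and "agency" all appear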

Running TF-IDF

To preprocess the text features, we used the TF-IDF algorithm (term frequency-inverse document frequency). This algorithm treats every instance as a document and the entire set of instances as a corpus. It then assigns a score to each term: the more frequent the term is in the document and the less frequent it is in the corpus, the higher the score.

We tried two TF-IDF approaches, with slightly different results depending on the model. The first approach consisted of concatenating all the text features first and then applying the TF-IDF algorithm (i.e., the concatenation of all text columns of a single instance becomes the document, and the set of all such instances becomes the corpus). The second approach consisted of applying the TF-IDF algorithm separately to each feature (i.e., every individual column is a corpus), and then concatenating the resulting arrays.

The resulting array after TF-IDF is very sparse (most columns for a given instance are zero), so we applied dimensionality reduction (singular value decomposition) to reduce the number of attributes/columns.
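A minimal sketch of the second approach, using scikit-learn’s TfidfVectorizer and TruncatedSVD on a couple of toy text columns (not our data, and with far fewer components than we would use in practice), might look like this.

# A sketch of the per-column TF-IDF approach (vectorize each text column
# separately, then concatenate) followed by truncated SVD.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

titles = ["technical seo agency", "seo services for ecommerce", "local seo experts"]
headers = ["our seo services", "ecommerce seo", "local search help"]

# Vectorize each text column separately, then concatenate the results.
title_tfidf = TfidfVectorizer().fit_transform(titles)
header_tfidf = TfidfVectorizer().fit_transform(headers)
combined = np.hstack([title_tfidf.toarray(), header_tfidf.toarray()])

# The result is sparse and high-dimensional, so reduce it with truncated SVD.
svd = TruncatedSVD(n_components=2)   # far fewer components than in practice
reduced = svd.fit_transform(combined)
print(reduced.shape)                 # (3, 2)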

The final step was to concatenate the resulting columns from all feature categories into a single array. We did this after applying all the steps above (cleaning the features, turning the categorical features into labels and one-hot encoding those labels, applying TF-IDF to the text features and scaling the numerical features to center them around the mean).

Models and ensembles

Having obtained and concatenated all the features, we ran a number of different algorithms on them. The algorithms that showed the most promise were a gradient boosting classifier, a ridge classifier and a two-layer neural network.

Finally, we ensembled the model results using simple averages and saw some additional gains, since different models tend to have different biases.
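For illustration, here is a sketch of simple-average ensembling of predicted probabilities on synthetic data. The ridge classifier is left out of the sketch because it does not expose predicted probabilities directly, and the models shown are stand-ins rather than our exact configurations.

# A sketch of simple-average ensembling of class probabilities; the classifiers
# stand in for the models we used, and the data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

gbc = GradientBoostingClassifier().fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42).fit(X, y)

# Average the predicted probabilities of the positive ("top 10") class.
gbc_probs = gbc.predict_proba(X)[:, 1]
mlp_probs = mlp.predict_proba(X)[:, 1]
ensemble_probs = (gbc_probs + mlp_probs) / 2.0
print(ensemble_probs[:5].round(3))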

Optimizing the threshold

The last step was to decide on a threshold to turn probability estimations into binary predictions (“yes, we predict this site will be top 10 in Google” or “no, we predict this site will not be top 10 in Google”). For that, we optimized the threshold on a cross-validation set and then applied it to a held-out test set.
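A sketch of that threshold search, with invented validation labels and probabilities and F1 as the metric being optimized (one reasonable choice, not necessarily the one we used), might look like this.

# A sketch of picking a decision threshold on a cross-validation set by
# maximizing F1, then reusing it unchanged on the test set.
import numpy as np
from sklearn.metrics import f1_score

val_labels = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
val_probs  = np.array([0.92, 0.30, 0.78, 0.55, 0.10, 0.88, 0.45, 0.20, 0.67, 0.35])

candidate_thresholds = np.arange(0.05, 1.0, 0.05)
scores = [f1_score(val_labels, (val_probs >= t).astype(int)) for t in candidate_thresholds]
best_threshold = candidate_thresholds[int(np.argmax(scores))]
print(round(float(best_threshold), 2))

# The chosen threshold is then applied, unchanged, to the held-out test set.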

Results

The evaluation we thought would best represent the efficacy of the model is a confusion matrix. A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

I am sure you have heard the saying that “a broken clock is right twice a day.” With 100 results for every keyword, simply predicting “not in top 10” every time would be correct 90 percent of the time. The confusion matrix shows the accuracy of both positive and negative predictions, so the class imbalance can’t flatter the model. We obtained roughly 41 percent true positives and 41 percent true negatives in our best model.
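For reference, a confusion matrix can be produced in a couple of lines with scikit-learn; the labels and predictions below are toy values, not our results.

# A sketch of producing a confusion matrix; the values are illustrative.
from sklearn.metrics import confusion_matrix

true_labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
predictions = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(true_labels, predictions).ravel()
print(f"true negatives: {tn}, false positives: {fp}, false negatives: {fn}, true positives: {tp}")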

Image showing confusion matrix of our best model

Another way of visualizing the effectiveness of the model is with an ROC curve. An ROC curve is “a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.” The non-linear models used in the ensemble were XGBoost and a neural network; the linear model was logistic regression. The ensemble plot represents a combination of the linear and non-linear models.
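As a rough illustration (with made-up probabilities, not the curve from our model), an ROC curve can be plotted like this.

# A sketch of plotting an ROC curve from predicted probabilities.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

true_labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
probabilities = [0.92, 0.30, 0.78, 0.55, 0.10, 0.88, 0.45, 0.20, 0.67, 0.35]

fpr, tpr, thresholds = roc_curve(true_labels, probabilities)
auc = roc_auc_score(true_labels, probabilities)

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()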

Image of ROC curve generated by our model

XGBoost is short for “Extreme Gradient Boosting,” with gradient boosting being “a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.”

The chart below shows the relative contribution of each feature category to the accuracy of this model’s final prediction. Unlike neural networks, XGBoost, along with certain other models, allows you to easily peek into the model and see the relative predictive weight that particular features carry.
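For illustration, here is a sketch of peeking into an XGBoost model’s feature importances; the synthetic data and feature names are placeholders rather than our actual feature set.

# A sketch of inspecting feature importances from an XGBoost classifier.
import pandas as pd
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
feature_names = ["text_length", "referring_domains", "trust_flow", "load_time", "url_length"]

model = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))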

Graph of predictive importance by feature category

We were quite impressed that we were able to build a model with predictive power from the features we had given it. We had been nervous that our limited feature set would make the project fruitless. Ideally, we would have a way to crawl an entire site to gauge its overall relevance. Perhaps we could also gather data on the number of Google reviews a business had. We also understood that Google has much better data on links and citations than we could ever hope to gather.

What we learned

Machine learning is a very powerful tool that can be used even if you do not fully understand the complexity of how it works. I have read many articles about RankBrain and the inability of engineers to understand how it works. This is part of the magic and beauty of machine learning. Similar to the process of evolution, in which organisms gain different features and some survive while others do not, machine learning finds its way to the answer rather than being given it.

While we were happy with the results of our first models, it is important to understand that they were trained on a relatively small sample compared to the immense size of the internet. One of the key goals in building any kind of machine learning tool is generalization: operating effectively on data that has never been seen before. We are currently testing our model on new queries and will continue to refine it.

The largest takeaway for me in this project was just starting to get a grasp on the immense value that machine learning has for our industry. A few of the ways I see it impacting SEO are:

  • Text generation, summarization and categorization. Think about smart excerpts for content and websites that potentially self-organize based on classification.
  • Never having to write another ALT parameter (See below).
  • New ways of looking at user behavior and classification/scoring of visitors.
  • Integration of new ways of navigating websites using speech and smart Q&A style content/product/recommendation systems.
  • Entirely new ways of mining analytics and crawled data to give insights into visitors, sessions, trends and potentially visibility.
  • Much smarter tools in distribution of ad channels to relevant users.

This project was more about learning for us than about reaching a holy grail (of sorts). Much like the advice I give to new developers (“the best learning happens while doing”), it is important to get your hands dirty and start training. You will learn to gather, clean and organize data, and you’ll familiarize yourself with the ins and outs of various machine learning tools.

Much of this is familiar to more technical SEOs, but the industry is also developing tools to help those who are not as technically inclined. I have compiled a few resources below that are helpful for understanding this space.

Recent technologies of interest

It is important to understand that the vast majority of machine learning is not about building a human-level AI, but rather about using data to solve real problems. Below are a few examples of recent ways this is happening.

NeuralTalk2

NeuralTalk2 is a Torch model by Andrej Karpathy for generating natural language descriptions of given images. Imagine never having to write another ALT parameter again and having a machine do it for you. Facebook is already incorporating this technology.

Microsoft Bots and Alexa

Researchers are mastering speech processing and are starting to be able to understand the meaning behind words (given their context). This has deep implications for traditional websites and how their information is accessed. Instead of navigation and search, the website could have a conversation with your visitors. In the case of Alexa, there is no website at all, just the conversation.

Natural language processing

There is a tremendous amount of work going on right now in the realm of translation and content semantics. It goes far beyond traditional Markov chains and n-gram representations of text. Machines are showing the initial hints of abilities to summarize and generate text across domains. “The Unreasonable Effectiveness of Recurrent Neural Networks” is a great post from last year that gives a glimpse of what is possible here.

Home Depot search relevance competition

Home Depot recently sponsored an open competition on Kaggle to predict the relevance of their search results to the visitor’s query. You can see some of the process behind the winning entries on this thread.

How to get started with machine learning

Because we, as search marketers, live in a world of data, it is important for us to understand new technologies that allow us to make better decisions in our work. There are many places where machine learning can help our understanding, from better knowing the intent of our users to which site behaviors drive which actions.

For those of you who are interested in machine learning but are overwhelmed with the complexity, I would recommend Data Science Dojo. There are simple tutorials using Microsoft’s Machine Learning Studio that are very approachable to newbies. This also means that you do not have to learn to code prior to building your first models.

If you are interested in more powerful customized models and are not afraid of a bit of code, I would probably start with listening to this lecture by Justin Johnson at Stanford, as it goes through the four most common libraries. A good understanding of Python (and perhaps R) is necessary to do any work of merit. Christopher Olah has a pretty great blog that covers a lot of interesting topics involving data science.

Finally, GitHub is your friend. I find myself looking through recently added repos to see the incredibly interesting projects people are working on. In many cases, data is readily available, and there are pretrained models that perform certain tasks very well. Looking around and becoming familiar with the possibilities will give you some perspective on this amazing field.

The post An experiment in trying to predict Google rankings appeared first on Search Engine Land.

]]>