The post Leveraging data science to illuminate the modern consumer decision journey appeared first on Search Engine Land.

The advertiser analytics group at Microsoft Advertising (my employer) is taking deeper dives into internal search query data to help marketers get the visibility they need. What exactly does today’s CDJ look like? Well, like this:

What we’re looking at here is an actual representation of recent search queries on Bing related to enterprise cloud software. It’s a network composed of search queries and the relationships between them, with *relationship* defined as searches conducted by the same person in a close window of time. Let’s dive in and explore.
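
To make the definition of a relationship concrete, here is a minimal sketch of how such edges could be counted from a search log. The log rows, the `build_edges` helper and the 30-minute window are all assumptions for illustration; the article does not say how long a "close window of time" is or how the real pipeline works.

```python
from collections import Counter
from itertools import combinations

# Hypothetical log rows: (user_id, minutes_since_start, query). Not real Bing data.
LOG = [
    ("u1", 0, "cloud computing"),
    ("u1", 5, "azure pricing"),
    ("u1", 500, "vpn"),
    ("u2", 10, "cloud computing"),
    ("u2", 12, "azure pricing"),
    ("u2", 15, "iot"),
]

WINDOW_MINUTES = 30  # assumed meaning of "a close window of time"


def build_edges(log, window=WINDOW_MINUTES):
    """Count, per query pair, how many users searched both within the window."""
    by_user = {}
    for user, minute, query in log:
        by_user.setdefault(user, []).append((minute, query))

    edges = Counter()
    for searches in by_user.values():
        pairs = set()
        for (t1, q1), (t2, q2) in combinations(searches, 2):
            if q1 != q2 and abs(t1 - t2) <= window:
                pairs.add(tuple(sorted((q1, q2))))
        edges.update(pairs)  # each user adds at most 1 to any given pair
    return edges


edges = build_edges(LOG)
for pair, weight in edges.most_common():
    print(pair, weight)
```

In this toy log, "vpn" gets no edge to "cloud computing" because the same user searched it hours later, outside the assumed window.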

Messy, right? Well, understanding searcher behavior is a complicated problem. First off, let’s look at all the different communities within this network, which are visualized by color. It should become quickly apparent that these queries are clustered thematically; queries around VPN are given their own color, and big players in the space such as Azure and AWS have large communities. It is important to note that queries are not placed in communities based on the content of the query itself, but rather based on the regularity with which they are searched by the same user. This is an important distinction, and it gives us something that is hard to come by: a raw, unbiased look at where brands are positioned in their space.
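
The article does not name the community detection method behind the visualization. As a deliberately crude stand-in, the sketch below groups queries into connected components after dropping weak edges. The `communities` helper, the `min_weight` threshold and the edge counts are hypothetical. The point it illustrates is that grouping depends only on co-search counts, never on the words in the query.

```python
def communities(edges, min_weight=2):
    """Group queries linked by edges of at least min_weight (connected components)."""
    neighbors = {}
    for (a, b), weight in edges.items():
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if weight >= min_weight:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over strong edges only
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical co-search counts between query pairs.
EDGES = {
    ("azure", "azure pricing"): 5,
    ("aws", "aws pricing"): 4,
    ("azure", "aws"): 1,
    ("what is ai", "ibm"): 3,
}

groups = communities(EDGES)
for group in groups:
    print(group)
```

Here "what is ai" lands in the same group as "ibm" even though the two strings share no words, and the weak "azure" to "aws" edge is not enough to merge their groups.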

The size of a community is always an interesting factor, but it is the relationships between queries that can best unlock hidden insights. No matter what your product or brand space, there are queries that exist in one community, but have relationships with queries in other communities. For instance, we see below that the queries “cloud computing” and “IoT” have relationships with each other, and with the Azure and AWS communities. This is the connective tissue that drives deeper insights into your customers, your business and your competitive landscape.

The key takeaway with relationships is that the vast levels of interconnectivity between queries illustrate the true sophistication of searcher behavior. Searches are seldom linear, occurring more in clusters that don’t necessarily align with funnel-like behavior. Enduring convictions about consumer intent, loyalty and the different types of contributions made by brand and non-brand queries are challenged by the data. To help drive the point home, let’s extract some insights that are only accessible with this acknowledgement as a prerequisite.

We’ll start by isolating the query “what is AI?” We can instantly see in our network that this query has been searched by users who have also searched “what is artificial intelligence” and “AI.”

In turn, we see associations between these terms and brand nodes, such as “IBM” and “AWS.” However, ultimately, we are able to see that “what is AI” is a part of the same community as “IBM,” telling us that many people are searching for both. IBM is doing a superior job at positioning their brand closer to these types of consumer questions.

How about one more example of how embracing the intricacy of searcher behavior can open pathways to a more comprehensive understanding of industry dynamics? Like other big players in this space, Google has its own community within the network. The query “Google Cloud” is the central node in this community, and based on what we’ve seen in other communities throughout the network, we would likely presume that other queries within the community would also be related to Google’s cloud product. However, this community defies our expectations; it contains a mix of competitor and non-brand terms, many of which contain the term “cloud.” From this, we can deduce that Google has positioned their brand close to the term “cloud” – a nice mindshare win for them, and an opportunity for their competition.

How can today’s marketers manage such complex customer journeys? Firstly, it’s important to have a strong partnership with your publisher. Your account teams are champions for your business, and part of that is stewardship for relevant data. The second thing is to invest in data science within your advertising program. Unraveling intricate problems will often call for technical expertise, and consumer behavior grows more sophisticated with every technological advance. And finally, AI and machine learning are already being infused throughout the space to help marketers collect, analyze and leverage massive amounts of data to reach future customers in better ways.


The post Analyze data distribution more accurately with time series appeared first on Search Engine Land.

In part one of this series, we explained why taking averages at face value can be misleading and can leave us with an incomplete understanding of what’s going on in an account. We established that using data distributions is an effective way to control for that possibility, and then we covered how to analyze a data distribution using a histogram as a visual aid.

In part two of this series we examined the same set of data using a box and whisker plot.

And we left off with the declaration that a graduate of the first two parts of this series should be able to identify the relationship between these two visualizations of the same data.

With this baseline knowledge firmly tucked into our belts, we move into the realm of using data distributions as time series. While there are some excellent ways to incorporate histograms and time series, none are immediately available in Microsoft Excel.

First things first: to get the most granular understanding of our distribution possible, we’ve been segmenting our performance reports by keyword and by day, but now we’re going to add another layer to the time grain: month.

Before we get into the distribution views again, let’s visit an example of some conventional business intelligence about CPCs over a period of six months.

A likely analysis of a view like this would be something like, “There were relatively stable CPCs between November and February, before encountering pricing volatility in March and April.” That’s all fine and well, but we’re leaving a lot of information on the table by using averages instead of distributions.

So let’s turn these summaries into distributions.
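
For readers working outside Excel, the same step can be sketched in Python: group the daily keyword rows by month and compute the values a box and whisker plot draws. The rows, the `monthly_summary` helper and the choice of the "inclusive" quartile method are assumptions for illustration; other tools may use a different quartile convention and return slightly different numbers.

```python
from collections import defaultdict
from statistics import mean, quantiles

# Hypothetical rows from a keyword report segmented by day: (date, keyword, cpc).
ROWS = [
    ("2018-03-01", "kw a", 4.0),
    ("2018-03-02", "kw b", 6.0),
    ("2018-03-03", "kw c", 8.0),
    ("2018-03-04", "kw d", 10.0),
    ("2018-03-05", "kw e", 12.0),
    ("2018-04-01", "kw a", 4.0),
    ("2018-04-02", "kw b", 6.0),
    ("2018-04-03", "kw c", 8.0),
    ("2018-04-04", "kw d", 10.0),
    ("2018-04-05", "kw e", 62.0),
]


def monthly_summary(rows):
    """Return per-month min, quartiles, max and mean of CPC."""
    by_month = defaultdict(list)
    for date, _keyword, cpc in rows:
        by_month[date[:7]].append(cpc)  # "YYYY-MM" is the month key

    summary = {}
    for month, cpcs in sorted(by_month.items()):
        q1, median, q3 = quantiles(cpcs, n=4, method="inclusive")
        summary[month] = {
            "min": min(cpcs), "q1": q1, "median": median,
            "q3": q3, "max": max(cpcs), "mean": mean(cpcs),
        }
    return summary


summary = monthly_summary(ROWS)
for month, stats in summary.items():
    print(month, stats)
```

In this toy data a single expensive click in April moves the monthly mean from 8.0 to 18.0 while the quartiles stay put, which is the kind of detail a chart of monthly averages cannot show.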

At a glance, one thing jumps out immediately, and that’s the behavior of the outlier CPCs in April ’18. In the five months before that, outlier behavior was pretty consistent, with an upper threshold of around $50. In April this advertiser suddenly saw several instances of a keyword with CPCs over $60, and ranging up to $100, which is certainly an item of interest for optimization.

However, the presence of the outliers is skewing the y-axis and making trends within the quartiles difficult to ascertain. To see the quartiles more clearly, remove the outliers from the visualization. This is made easy in Excel. Right click your plot, select “format data series,” and then uncheck the “Show outlier points” box.

This is the same data, outliers removed. Note the top of the y-axis now caps out at 20, where before it ranged to 120.

We can immediately see that the fourth quartile range is the most sporadic from month to month, and the third quartile range is also more volatile than the first or second quartile ranges. Importantly, the median CPC is consistently lower than the mean CPC, which is attributable to the pull of the fourth quartile range and the outliers. Furthermore, remembering that the “x” represents average CPC, the top threshold of the fourth quartile range appears to have a distinct relationship with average CPC.

This is a good example of how looking at distributions provides the advertiser with more information that has true diagnostic value than the summary mean.

On behalf of the Bing Analytics Group, we hope you feel you’ve evolved your analyses with this series. Look how far you’ve come!


The post A closer look at Bing’s box and whisker plots to analyze CPC data appeared first on Search Engine Land.

If you’ve finished part one of this series, then the histogram on the left should look familiar. The plot on the right is a box and whisker plot, created from the very same set of CPCs that we used in part one. Hooray for continuity!

First, let’s ground ourselves in some basics. Because we are not segmenting our data in any way, and therefore using only one distribution, the CPC value will be expressed on the y-axis, and the x-axis will be null.

Now, let’s go through the components of the box and whisker plot. First off, the x.

This x represents the mean value of the distribution, which you’ll recognize as the simple average often associated with your search data. For the purposes of this exercise, the x is your average CPC. The line in the middle of the box, by contrast, represents the median.

While getting both the mean and median of the distribution in the visualization is a wonderful feature of the box and whisker plot, the four quartiles can help divine a lot of information that we can’t get at through a histogram.

The bottom threshold of the box (or left-most threshold for a horizontally justified plot) is the lower quartile, or first quartile, or Q1, and it represents the number such that 25 percent of observations are less than it and 75 percent are larger. In this context, think of an “observation” as a single data point.

The top threshold of the box (or right-most threshold for a horizontally justified plot) is the upper quartile, or third quartile, or Q3, and it represents the number such that 75 percent of observations are less than it, and 25 percent are larger.

Following this same notation, you can also infer that the median serves as the second quartile, given that 50 percent of observations are greater, and 50 percent are lesser.
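
The three definitions above can be checked numerically. This is a minimal sketch on made-up CPC values, using Python's `statistics.quantiles` with the "inclusive" method as an assumed convention; quartile conventions vary between tools, so exact values can differ slightly from what a given charting tool draws.

```python
from statistics import median, quantiles

# Hypothetical CPC observations, one value per data point.
cpcs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

# n=4 returns the three cut points: Q1, Q2 (the median) and Q3.
q1, q2, q3 = quantiles(cpcs, n=4, method="inclusive")

share_below_q1 = sum(c < q1 for c in cpcs) / len(cpcs)
share_below_q3 = sum(c < q3 for c in cpcs) / len(cpcs)

print("Q1:", q1, "Q2:", q2, "Q3:", q3)
print("share below Q1:", round(share_below_q1, 2))
print("share below Q3:", round(share_below_q3, 2))
```

With only nine observations the shares land near, not exactly on, 25 and 75 percent; they converge as the dataset grows.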

This can admittedly become a little confusing to keep track of. We’ve found that it helps intuition to think of the quartiles as possessing ranges, and to remember that each range contains roughly a quarter of the total data points in the data set. Perhaps this pursuit would be frowned upon by the statistician purists of the world, but we take a bright view of whatever helps you learn. Hopefully the visual below helps conceptualize.

Now we’re getting somewhere, right? We can observe that the first three quartile ranges of this distribution have a pretty comparable range of values. But the fourth quartile range is a much broader stroke. For this advertiser to lower their CPCs, a focused and precise tactic would be to isolate keywords that fall within that fourth quartile range, and modify the attendant bids.
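
As a rough sketch of that tactic, the snippet below measures the width of each quartile range and lists the keywords whose CPC sits above Q3. The keyword names and CPCs are invented, and the "inclusive" quartile method is an assumed convention.

```python
from statistics import quantiles

# Hypothetical keyword-level CPCs.
KEYWORD_CPCS = {
    "cloud backup": 2.0,
    "cloud storage": 3.0,
    "vpn service": 4.0,
    "cloud hosting": 5.0,
    "enterprise cloud": 6.0,
    "cloud migration": 7.0,
    "cloud erp": 8.0,
    "cloud security": 15.0,
    "managed cloud": 30.0,
}

values = sorted(KEYWORD_CPCS.values())
q1, median, q3 = quantiles(values, n=4, method="inclusive")

# Width of each quartile range, from the minimum up to the maximum.
widths = [q1 - values[0], median - q1, q3 - median, values[-1] - q3]

# Keywords in the fourth quartile range are the bid-review candidates.
fourth_range = sorted(k for k, cpc in KEYWORD_CPCS.items() if cpc > q3)

print("range widths:", widths)
print("review bids for:", fourth_range)
```

In this made-up portfolio the first three ranges are equally narrow and the fourth is far wider, mirroring the shape described above.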

Alright, but what about those dots?

Data points that render as individual dots can be considered statistical outliers in the context of a data distribution. In our hypothetical scenario, the advertiser is looking for tactics to mitigate CPC cost. In addition to the fourth quartile range, this advertiser should investigate the keywords responsible for these outlier values, and act accordingly.
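
The article does not state the rule that decides which points render as dots. A common convention for box and whisker plots, assumed here, flags values more than 1.5 times the interquartile range beyond Q1 or Q3. The `find_outliers` helper and the sample values are hypothetical.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed 1.5 x IQR rule)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical CPC observations.
cpcs = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 15.0, 30.0]
outliers = find_outliers(cpcs)
print("outlier CPCs:", outliers)
```

Mapping the flagged values back to their keywords gives the short list to investigate first.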

Hearken back to part one of this series for a moment, and recall that our distribution is right-tailed, meaning that the skew is towards values that are greater than the median. Knowing what you know now about both histograms and box and whisker plots, you should be able to intuit the relationship between these two visualizations of the same data.

In the final part of this series, we’ll explore using distributions to identify changes in your data over time.


The post Bing histograms reveal better business intelligence metrics with data distribution appeared first on Search Engine Land.

In the field of business intelligence, and specifically, the BI extracted from search performance, averages are ubiquitous. Cost-per-click, cost-per-acquisition, and average position are metrics that should immediately come to mind, but others such as average order value lie in the weeds as well.

There’s nothing inherently wrong with a simple average, but in many cases averages can be useless or misleading because of their susceptibility to extreme influence by outlier data points. To briefly illustrate the point, consider a portfolio of ten keywords. Nine of those keywords have one click each, all at the cost of $1. The tenth keyword also has one click, but this one came at a price of $6. This brings the average CPC of the portfolio to $1.50, which is an obfuscation of a lot of important information.
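
The arithmetic in this example is easy to verify. The short sketch below uses the ten click costs from the paragraph above and adds the median for comparison; the variable names are illustrative only.

```python
from statistics import mean, median

# Nine clicks at $1 and one click at $6, as in the example above.
cpcs = [1.0] * 9 + [6.0]

average_cpc = mean(cpcs)    # (9 * 1 + 6) / 10
median_cpc = median(cpcs)   # middle of the sorted values

print(f"mean CPC:   ${average_cpc:.2f}")
print(f"median CPC: ${median_cpc:.2f}")
```

Nine of the ten clicks cost exactly the median of $1.00, while no click at all cost the $1.50 average.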

Of course, portfolios are generally much larger than ten keywords, and with scale the opportunity for averages to muddy the waters of your analyses also grows. As such, the aim of this three-part series is to help you become comfortable thinking about your data in terms of *distributions*, which will help bring more information and context to your business intelligence metrics, and help you depend less on averages.

Let’s start by highlighting the difference between a summary view and a distribution view, moving forward with CPC as an example. Below is a standard method for visualizing CPC performance for a single month.

But we can immediately unlock a lot of information about this month by segmenting the keyword report we pull down from the Bing UI by day. Since we’re working with CPC data, we’ll want to remove any line items from the Excel file that have 0 clicks. Once we do that, select all your CPC data for the month, and create a histogram.

Our resulting plot is below:

The histogram is a common visualization for data distributions. It features a **binned x-axis**, which means that each tick on the axis represents a range of values. Each time a value is represented in the dataset, it is binned accordingly. The count of values that fall within a given range is called the frequency and is represented on the y-axis.
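
Binning itself is simple to reproduce. Here is a minimal sketch that assigns each CPC to a fixed-width bin and counts the frequency per bin; the `bin_counts` helper, the $2 bin width and the sample CPCs are assumptions, and Excel chooses its own bin widths by default.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Return {bin_start: frequency} for bins [start, start + bin_width)."""
    counts = Counter()
    for value in values:
        start = (value // bin_width) * bin_width  # lower edge of the value's bin
        counts[start] += 1
    return dict(sorted(counts.items()))


# Hypothetical daily keyword CPCs with zero-click rows already removed.
cpcs = [4.5, 5.1, 5.9, 6.2, 6.8, 7.4, 9.9, 14.2]
histogram = bin_counts(cpcs, bin_width=2.0)

for start, frequency in histogram.items():
    print(f"${start:.2f} to ${start + 2.0:.2f}: {'#' * frequency}")
```

Each printed row corresponds to one bar of the histogram, with the bar length standing in for the y-axis frequency.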

Next, calculate the mean and median of your CPC data. In Excel, achieve this using the =AVERAGE() function for mean and the =MEDIAN() function for the median.

Remember that our Average CPC for the month was $6.82. Our median CPC comes in at $6.01. That’s a whopping $0.81 difference and an absolutely valuable piece of information for this advertiser.

The gap between the mean and median CPC is caused by the right-skew of the distribution. The farther along the tail a value is, the more that value is capable of influencing the mean. All data points have an equal influence on the median.

Before we looked at this distribution of CPCs throughout one month, all we knew was that the average click cost was $6.82. Now we understand that the advertiser had a much higher probability of receiving a click in the $4.20 to $6.30 range than they did in the $6.40 to $6.90 range.

Histograms are just the tip of the iceberg when it comes to understanding data distributions. In the next part of this series, we’ll explore this same dataset using a box and whisker plot.

