This will be my final ranty post for a while (can’t promise anything, but I’ll do my best!).
With the recent launch of Google's Knowledge Graph, Google has gone a step too far. They are taking the mickey: forever banging on about creating good-quality, unique content rather than simply aggregating data for users, then on the other hand totally going against this whole ethos themselves.
What is more annoying is that they totally penalise anyone attempting to aggregate data in the same way that they do – double standards. Is this to squash competition, all in the guise of 'better for the user'? After all, Google wouldn't want to send people through to other websites that provide better information.
Let me try to explain a little more, but first, if you haven't already seen it, take a look at their latest change – Google Knowledge Graph.
So why is this such a big deal?
Think about how content is created and how money is earned from that content.
Website A creates some awesome content, with unique research, that their readers love. Website A then earns money from advertising on the website (generally priced by page views, on either a CPM or CPC basis – which ultimately means more page views = more revenue). Alternatively, Website A could earn money from newsletter subscriptions (if they sell products via the website/newsletter).
And so on… the point being that, looking at it in a rather over-simplified way, the website that created the content ultimately earns money based on how many people view it.
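To make the economics concrete, here is a minimal sketch (with entirely made-up numbers – the function and its figures are mine, not any real site's) of how page views translate into CPM revenue:

```python
def ad_revenue(page_views, cpm_rate=2.50, ads_per_page=3):
    """CPM revenue: the rate is per 1,000 ad impressions (hypothetical figures)."""
    impressions = page_views * ads_per_page
    return impressions / 1000 * cpm_rate

# A 20% drop in traffic is a straight 20% drop in revenue:
before = ad_revenue(100_000)  # 750.0
after = ad_revenue(80_000)    # 600.0
```

The numbers don't matter; the linear relationship does – lose the click, lose the revenue.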
What has this got to do with Google Knowledge Graph?
Everything! Let's take an example: if I search for "Tower of London" on Google, this is what is presented in the search results:
In the screenshot above you can see all of the additional information contained within the search results. So what if the information listed in this section answers the user's question – or at least answers it well enough that they don't bother clicking through to the website in question?
Well, the website that wrote the content loses traffic and, ultimately, revenue.
It is such a short-term view for Google to implement "improvements" like this, as it means the people creating the great content end up getting squeezed, their revenues suffering as traffic to their websites drops.
Screw publishers – Google increases revenues
So if users aren't clicking through to the websites that created the content, and they are spending more time within the search results directly, then guess where their next click is likely to be… either on an organic listing or on one of the paid adverts that earn Google revenue.
Good business plan? Steal content and increase revenues… well, good for one party at least!
If you are going to steal content, why not steal it from a website that has already collated all of this information into a nice, easily scrapable format… welcome Wikipedia to the room.
Google are even nice enough to tell us directly that this is what they have done! They have posted a white paper on their research blog titled From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas, which includes the following within the first couple of paragraphs:
“How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages.”
Absolutely no shame.
It has happened many times before in other industries
Witness Google Hotel Finder, Google Mortgage Comparison and Google Flight Search.
I am not going to rant about those again, as I have already covered them in the blog posts Is Google Turning into a Hotel Aggregator and The Google Monopoly Needs to be Broken Up.
Rich snippets tend to be the way Google begins to attack a certain industry, offering a special 'promotion' to those implementing them. While this is a short-lived 'benefit', what it enables Google to do is easily scrape the relevant content into their massive database, where they can use all of that structured information to do with as they please.
For example: first comes rich snippets markup for businesses/organisations, then along comes Google Places, listing all of the business information in one nice format that they control – coincidence? I think not.
How to Scrape Google Hotel Finder
Why not play Google at their own game? Let them do the hard work, then simply scrape all of the content back from them – hell, that is what they are doing to a large number of content producers out there, so I guess there is nothing wrong with it? One rule for one… and the same rule for everyone, in my books.
… a little time passes whilst I try to do this …
Outcome: the content is very difficult to scrape – no surprise there. I guess once Google have stolen everyone else's content, they don't want to allow people to easily steal it back off them.
Here is what I attempted, along with explanations of why it is difficult to scrape (likely not impossible – but I don't have the time or inclination to dig any deeper for the time being):
- Visit www.google.com/hotelfinder
- Enter a search query such as “Manchester”
- The URL that is generated is http://www.google.com/hotelfinder/#search;l=manchester;d=2012-06-17;n=1;h=6546758407389812236;ph=1;si=b820bfd6
- As you can see, the URL contains different parameters which can be easily edited:
- l = location
- d = date
- n, h, ph, si = no idea, but if you remove these the URL still works (see for yourself, http://www.google.com/hotelfinder/#search;l=manchester;d=2012-06-17; )
- Looking at the generated code on the page with a tool such as the Google Chrome XPath Helper, the XPath required to get all of the hotel names, for example, is "//div[@tooltip]"
- Now you could easily scrape all of this data manually by entering each location around the world – but who has time for that? Instead, we want a script to run through hundreds or thousands of destinations to scrape the content
- Bring on the SEO Tools plugin for Excel
- Below is a small diagram showing what was attempted in order to scrape the content at scale. The basic idea: enter a 'destination' in one column, the URL to scrape from Google is automatically generated, then the XPath is run on that URL (all of which has been described above already)
- Result… nothing returned, just a blank cell (cell C4).
- Shame. Best try something new….
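For anyone who would rather script this than drive it from Excel, the steps above can be sketched in a few lines of Python. The function names are mine, and the regex is just a crude stand-in for the `//div[@tooltip]` XPath:

```python
import re
from urllib.parse import quote

def build_hotelfinder_url(destination, date="2012-06-17"):
    # Only the l= (location) and d= (date) parameters are needed, as noted above.
    return ("http://www.google.com/hotelfinder/#search;"
            f"l={quote(destination)};d={date};")

def extract_hotel_names(page_source):
    # Crude equivalent of the XPath //div[@tooltip]: grab the text
    # content of every <div ... tooltip ...> element.
    return re.findall(r'<div[^>]*\btooltip\b[^>]*>([^<]+)</div>', page_source)

url = build_hotelfinder_url("manchester")
# Fetching this URL returns no hotel names: everything after the '#' is a
# fragment that is never sent to the server, and the results are rendered
# client-side by JavaScript -- which is why the Excel attempt came back blank.
```

So the blank cell makes sense: the XPath is fine, but the raw HTML the scraper sees never contains the hotels in the first place.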
OK, so the first attempt didn't end too successfully. Let's dig a little deeper.
As you would expect, Google are using POST requests. That could be for purely legitimate development reasons (there are many reasons to use them), although I suspect they have chosen this because they want to protect their data, since POST requests are a little trickier to scrape.
Well, with a little investigation, it turns out you can find the underlying request. What you can see in the image below is a request going to the URL http://www.google.com/hotelfinder/rpc, passing across certain data so that the correct results are brought back (in this case the search term 'Manchester', along with other data – note this part is in the second screenshot).
So now we are getting somewhere. We have the URL that Google is accessing to request a list of all hotels within a certain destination, and we also have a list of the actual ‘payload’ which is sent across (i.e. the location).
So we should be able to join the two together. Welcome into the room another useful Google Chrome plugin, Simple REST Client. What does it do? It simply sends a GET or POST request to a given URL and returns the data. Install it and have a play if you're interested; below is the screenshot from trying to access the /rpc URL and passing across the same payload data.
Damn! It still hasn't worked. The likely reason is that some of the original headers sent across (such as the cookie 'HDIS', among others) are uniquely generated security keys designed to stop people easily scraping this content – you can't easily get those cookies set when scraping via the methods outlined above.
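The same doomed request can be reproduced in a few lines of stdlib Python. The payload field name here is a guess standing in for what the network tab showed; Google's real field names and values may well differ:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical payload: the real request sends more fields than this.
payload = urlencode({"search": "manchester"}).encode()

req = Request(
    "http://www.google.com/hotelfinder/rpc",
    data=payload,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
# urllib.request.urlopen(req) comes back empty-handed here too: without the
# session cookies (e.g. 'HDIS') that a real browser would attach, the
# endpoint refuses to return results.
```

The request itself is trivial to construct; it's the per-session cookies that do the gatekeeping.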
It would be possible to scrape this content using a more sophisticated browser-based scraping tool (i.e. one that accesses URLs via the browser like a human would – a browser macro or similar), but time is ticking on.
Ironic, isn't it, that Google quite happily scrape content from the websites that put in all the effort to create it, yet they don't want you scraping their content and make it extremely difficult to do so.
End of rant