Imagine if Google could identify similar snippets of text when it comes across them in its index and recognize that they might be related in a meaningful way. For example, the search engine might see a headline on one news article that says, “Soviet troops pulled out of Afghanistan,” and a headline on another that reads, “Soviet troops withdrew from Afghanistan.” Is Google capable of understanding paraphrases like that?
Is paraphrase-based indexing influencing the search results below?
I’ve written about Phrase-Based Indexing in the past, with my latest post on the subject being Phrasification and Revisiting Google’s Phrase-Based Indexing. Phrase-based indexing involves a search engine going through web pages, associating “good phrases” with specific pages, and finding out how frequently those phrases tend to co-occur in a certain top number of search results for a specific query. The impetus behind finding paraphrases is somewhat different.
A couple of patents originally filed in 2005 and granted to Google this week provide details on how it might recognize paraphrases, index them, and use them. There are also some white papers from Google that explore the topic in more detail. I will write about the first of the patents and one of the papers in this first post on paraphrase identification to introduce the topic.
In a search for the two headlines I listed above, ideally, Google would return the same or a very similar set of search results in response to both. However, instead of just matching keywords to return documents, the search engine would have to recognize that “Soviet troops pulled out of Afghanistan” and “Soviet troops withdrew from Afghanistan” are somewhat equivalent to each other. While a search engine could rely upon people compiling potential paraphrases and mining text from documents to identify when those are actual paraphrases, that would be very time-consuming and require a lot of effort by many people to work well.
A better approach would be an automated one that does a reasonable job of identifying paraphrases. A Google Whitepaper, written by the inventors listed on the two granted patents that I referred to above, describes how this might work and many reasons why identifying paraphrases may be helpful. The paper is Aligning Needles in a Haystack: Paraphrase Acquisition Across the Web (pdf).
In the paper by Marius Pasca and Peter Dienes, we are told that their automated approach to identifying paraphrases is unique because it can use just about any document found on the web, regardless of that document’s quality. It doesn’t require any preprocessing to identify which documents are likely to contain paraphrases:
The method differs from previous approaches to paraphrase acquisition in that
- it removes the assumptions on the quality of the input data by using inherently noisy, unreliable Web documents rather than clean, trustworthy, properly formatted documents; and
- it does not require any explicit clue indicating which documents are likely to encode parallel paraphrases, as they report on the same events or describe the same stories.
Large sets of paraphrases are collected through exhaustive pairwise alignment of small needles, i.e., sentence fragments, across a haystack of Web document sentences.
While researching the process involved, Pasca and Dienes conducted an experiment in which they extracted paraphrases from roughly 972 million web pages.
Some reasons why Google might want to learn about paraphrases in the documents it indexes include:
- Giving better answers to Q&A (Question answering) type results
- Making sure that relevant documents aren’t missed by expanding queries to include words from possible paraphrases
- Possibly identifying documents that have duplicated the majority of the content of another document but used paraphrases to hide the source
The first patent goes into a description of how some paraphrases might be identified on the Web.
Methods and systems for identifying paraphrases from an index of information items and associated sentence fragments
Invented by Alexandru Marius Pasca and Peter Szabolcs Dienes
Assigned to Google
US Patent 7,937,396
Granted May 3, 2011
Filed: March 23, 2005
Abstract
Methods and systems for identifying paraphrases from an index of information items and associated sentence fragments are described. One method described comprises identifying a pair of sentence fragments, each having the same associated information item from an index. The index comprises a plurality of information items and associated sentence fragments, and identifies a paraphrase pair from the pair of sentence fragments.
One method involves looking for specific information while analyzing the content of a web page to see if there are dates, entity names, or concepts on that page and sentence fragments associated with those.
For example, the search engine finds many references to “1989” on a good number of web pages. Each has associated sentence fragments, and it compares the sentence fragments against each other to see if there are any similarities. For example, it might see the following two fragments associated with that date from several documents and consider them to be paraphrases of each other based upon patterns found in the sentence fragments:
“1989–Soviet troops pulled out of Afghanistan.”
“1989–Soviet troops withdrew from Afghanistan.”
Once the paraphrases have been identified, they may be associated with one another (along with their sources) in a paraphrase index.
When looking for paraphrases like this, some rules may be followed. For example, the search engine might require a certain alignment between the paraphrases. In this case, “Soviet troops” appears at the start of both fragments, and “Afghanistan” appears at the end of both. There may also be a required threshold on the types of words that align this way. For instance, the aligning words may be required to be non-stop words.
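The alignment rule above can be sketched in code. This is a minimal illustration, not the patent’s actual implementation: two sentence fragments sharing the same anchor are compared word by word, the longest common prefix and suffix are stripped, and the differing middles become a candidate paraphrase pair. The stop-word list and the requirement of a non-stop word on each side of the gap are simplifying assumptions.

```python
# Illustrative sketch: derive a candidate paraphrase pair from two
# sentence fragments by stripping their common prefix and suffix words.
STOP_WORDS = {"the", "a", "an", "of", "from", "in", "to"}

def candidate_paraphrase(frag_a, frag_b):
    a, b = frag_a.split(), frag_b.split()
    # Longest common prefix of words.
    pre = 0
    while pre < min(len(a), len(b)) and a[pre] == b[pre]:
        pre += 1
    # Longest common suffix of the remaining words.
    suf = 0
    while suf < min(len(a), len(b)) - pre and a[-1 - suf] == b[-1 - suf]:
        suf += 1
    # Require at least one aligned non-stop word on each side of the gap.
    prefix, suffix = a[:pre], a[len(a) - suf:]
    if not any(w.lower() not in STOP_WORDS for w in prefix):
        return None
    if not any(w.lower() not in STOP_WORDS for w in suffix):
        return None
    middle_a = " ".join(a[pre:len(a) - suf])
    middle_b = " ".join(b[pre:len(b) - suf])
    if middle_a and middle_b and middle_a != middle_b:
        return (middle_a, middle_b)
    return None

pair = candidate_paraphrase(
    "Soviet troops pulled out of Afghanistan",
    "Soviet troops withdrew from Afghanistan",
)
print(pair)  # ('pulled out of', 'withdrew from')
```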
In deciding whether a paraphrase is valid, the search engine might see how frequently each sentence fragment appears in other documents on the Web. If they don’t appear very frequently, it might not include them in its paraphrase index.
For example, if the paraphrase pair “pulled out of-withdrew from” has a frequency value of ten, meaning that it appeared in the list of potential paraphrase pairs ten times, a single entry for the paraphrase pair “pulled out of-withdrew from” may be included in the paraphrase index with the associated frequency value of ten.
This frequency value may be used to rank paraphrases, indicating how useful they might be for things like question answering results from the search engine or expanding a query to include paraphrases.
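The counting step described above amounts to aggregating candidate pairs into an index keyed by the pair, with the count as a ranking signal. Here is a hedged sketch of that idea; the candidate list is invented for illustration, and the normalization of pair order is my own simplification:

```python
# Illustrative sketch: build a paraphrase index with frequency values
# from a list of candidate paraphrase pairs.
from collections import Counter

candidates = [
    ("pulled out of", "withdrew from"),
    ("pulled out of", "withdrew from"),
    ("passed away", "died"),
]

# Normalize ordering so ("a", "b") and ("b", "a") count as the same pair.
paraphrase_index = Counter(tuple(sorted(pair)) for pair in candidates)

print(paraphrase_index[("pulled out of", "withdrew from")])  # 2
```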
As the patent tells us:
In information retrieval, the paraphrase index may be used to associate a paraphrase in the search request with matching paraphrases in the text of documents sought for retrieval. So, for example, if a web search query includes the phrase “withdrew from,” a search engine can access the paraphrase index and determine that “withdrew from” has an associated paraphrase “pulled out of.”
The search engine can use this information to search for documents that match both “withdrew from” and “pulled out of” and the rest of the search terms. In question answering, a question may be a natural language search query. It is helpful to identify any paraphrases of words or phrases in the question to identify the answer more fully.
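The retrieval use in the patent excerpt can be sketched as simple query expansion: when a query phrase has a known paraphrase in the index, the engine also searches for the variant with that phrase substituted. The index contents and expansion function here are illustrative assumptions, not Google’s actual mechanism:

```python
# Illustrative sketch: expand a query using a paraphrase index so that
# documents matching either phrase variant can be retrieved.
paraphrase_index = {"withdrew from": ["pulled out of"]}

def expand_query(query):
    variants = [query]
    for phrase, alternates in paraphrase_index.items():
        if phrase in query:
            for alt in alternates:
                variants.append(query.replace(phrase, alt))
    return variants

variants = expand_query("soviet troops withdrew from afghanistan")
print(variants)
# ['soviet troops withdrew from afghanistan',
#  'soviet troops pulled out of afghanistan']
```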
The paraphrasing process may also potentially be used to filter some pages out of search results when it sees paraphrases in the snippets intended to describe documents within those search results:
In summarizing a document or text, key sentences can be identified as useful in summarizing the document’s content or text. By identifying paraphrases, duplicative sentences that say the same thing but differently can be eliminated.
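The summarization use quoted above can be sketched as paraphrase-aware deduplication: map each candidate sentence to a canonical form by substituting known paraphrases, and drop sentences whose canonical form has already been kept. The tiny paraphrase table and sentences are illustrative assumptions:

```python
# Illustrative sketch: eliminate duplicative sentences that say the same
# thing in different words, using a paraphrase table.
PARAPHRASES = {"withdrew from": "pulled out of"}

def canonical(sentence):
    s = sentence.lower()
    for phrase, canon in PARAPHRASES.items():
        s = s.replace(phrase, canon)
    return s

def dedupe(sentences):
    seen, kept = set(), []
    for s in sentences:
        key = canonical(s)
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept

summary = dedupe([
    "Soviet troops pulled out of Afghanistan.",
    "Soviet troops withdrew from Afghanistan.",
])
print(summary)  # ['Soviet troops pulled out of Afghanistan.']
```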
I’ll dig more into paraphrase-based indexing in my next post. For now, it’s a good start to recognize that Google may be identifying paraphrases to use to expand queries and possibly avoid presenting somewhat duplicate content.
Sadly, I was doing a little Google dorking around and found there is still plenty of that chatter around to this day. It goes without saying that they use semantic analysis, but we have no idea which flavour, nor even which combinations, are in play. Then add in natural language processing elements… and well… you get the idea.
It just drives me nuts when peeps put the label on it simply to cash in or look all fancy pants… hehe… It does (still) need to be stamped out.
Alex J
CEO & Founder
Thanks for this Bill. Perfect timing as I’m going to add it to the class I’m teaching this week. I do believe this is where Google is headed, actually looks like they are already there. It fits right in with the Panda update, I think, in that this enables the robot to sniff out more highly relevant pages based on their skillful, informative, and relevant copywriting instead of keyword stuffing. Good stuff. Love your example!
It’s interesting that they chose to work with pairs rather than larger n-tuples (although I have not had time to read the documents so maybe they explain that).
@Alex, why stamp it out? It may challenge SEO efforts for sure, but I’m more into getting my customers who write great, relevant copy seen than seeing them wiped out by low quality, spam sites. I’d like to think this levels the playing field for them. Of course, that is wishful thinking, as I’m sure it will add another degree of difficulty, but as for delivering relevant results that matter, this is a step in the right direction.
Great insight.
This is really helpful for all translation efforts as well. Paraphrases can be same or similar translations of a phrase in a different language.
That might be also an easy way to learn: Check multi lingual company websites, compare different English versions, and compare with other languages. Learn about translation into different paraphrases.
Would be interesting to see if that relates to search parameters like ~ or (modified) broadmatch.
This looks like a great way to develop more sophisticated methods for combating article flipping.
After the Google Panda update, I was very eager to learn how Google detects duplicate content, and here I found it. Your explanation, with examples, is good.
Hi Kathy,
Thank you. At the very least, an approach like this enables Google to return relevant results that don’t necessarily have to match a query term for term, especially when some of the terms might not be all that meaningful, like the “withdrew from” vs. “pulled out of” in my example. In some ways, this is an extension of what Google is trying to do when returning synonyms for keywords in search results.
Hi Alex,
I don’t believe that semantic analysis from the search engines is sufficient to enable them to make the kinds of distinctions about paraphrases that this approach targets.
I’m not quite sure what you’re referring to in the rest of your post. Stamp what out? Maybe you can make that clearer for the rest of us. Thanks.
Hi Michael,
I was wondering about that myself.
Hi andreaswpv,
Google has been focusing upon using statistical machine translation tools, and using some of the approaches they’ve followed with those can help make a process like this one that much better.
Hi Kentaro,
That’s not really said explicitly in the patent or paper, but I drew the same conclusion.
Hi Tessa,
Thanks. Duplicate content is one of the signals that a lot of people are pointing towards when it comes to the Panda updates. It’s possible that this method of identifying paraphrases could potentially be involved as well.
Really interesting stuff Bill… but I have a hard time believing that Google is capable of paraphrasing in a meaningful manner. Ever since the panda update, it seems that all I see is the SAME content duplicated across the top ten listings. And, this is just my bitter-self on a tangent, but whilst I knew article marketing was dead, I didn’t think it would be so easily thieved. People just STEAL articles now, and post them on their blogs, and Google doesn’t have a clue who authored them. Craziness.
Good post though, very interesting…
Thanks again Bill for yet another great breakdown.
I think this might be aimed at all those auto-spun articles that do nothing but pollute the internet as we know it today. I’ve noticed that often I’ll have pages returned for similar searches but not matching what I’ve actually been searching for (tiny vs minute or huge vs massive). This is great news for folk like me that simply can’t spell or couldn’t be bothered to check many variations while searching.
I’ve also noticed that more specific searches can often return several (more than the 2 per page) pages from a single website in the SERPs. Perhaps this is also a method of reducing essentially duplicate pages from the same website?
I wonder if personalised search would in any way affect this. After all, Google really do want to understand what it is you’re searching for; could these paraphrases be somehow also linked to related past searches?
Loads of questions, but… test, test and test some more.
I think kentaro makes a good point here, the number one value I see to Google here would be using this as a very effective way to combat article spinning.
Hi Bill, here is a white hat/grey hat question regarding this article. Like other people have suggested, does this mean that Google’s patent shows that it can also easily identify spun articles? Currently most spinners (so the ads and forum chatter suggest) say that articles need to be 45% unique from each other to pass muster with Google.
I am in a quandary (not a small South American bird) because I know many competitors in my field have used scrape tools etc. to get the rankings they have. And without resources it is difficult to compete by writing 20 quality articles a day and submitting them manually.
So now I wonder if one were to use spinners, would there still be ranking value, or will everyone have to give up on this practice? Is the paraphrase algorithm pretty much saying that? Some people say that if you just swap paragraphs around and change the words in them (synonym substitution), you will be okay. I am wondering how you feel about this… because until it becomes a level playing field, the little guy is still being tempted. Not me, of course!
Those article spinners always were dreadful. Did they ever really work anyway?
Hi Brandon,
Google seems to be doing a decent job handling synonyms, and they highlight those when they show up in search results. One process that is probably being used to identify synonyms could also be used to identify paraphrases.
Article marketing is a very limited approach to marketing as it is, and there are many other ways of marketing that provide more value. Copyright infringement and theft of intellectual property is another matter entirely.
Hi Robert,
You’re welcome. There are a number of reasons for identifying paraphrases beyond combating article spinning. This approach to paraphrases may be useful in that area, but it isn’t the sole reason to do this.
It isn’t unusual for one site to have multiple pages that may be relevant for the same term, and I’m not sure that’s a bad thing in and of itself. If a site is about a specific topic, having multiple pages appear in search results for a certain query isn’t necessarily a problem.
Personalized search has the potential to impact most searches, and it’s possible that one way a search engine might look to see if specific queries might be related to other queries is to look at query sessions where it seems that people were searching for specific information about a particular topic and tried out different variations of a query.
Hi Erin,
Yes, that’s one possible use for understanding paraphrases better, but I’m not sure that it’s the number one reason. Being able to provide a wider range of relevant results to searchers is probably just as important.
Hi Bruce,
I really hope that methods like this make article spinning a method that no longer works well, and that it forces people to create interesting, engaging, original material. The ability to identify paraphrases as described in these two patents and the whitepapers from Google may not be enough by themselves to end the practice of article spinning, but given those and things like the Panda updates, it seems like it’s something that Google would like to see go away.
Hi Steve,
I’ve seen garbage show up in search results that appeared to have been content created by automated means, scraped from other pages, and altered in ways that produce something very similar to gibberish. I’ve seen other pages show up in search results that were probably written by actual people but were essentially valueless, with poor grammar, poor spelling, and information that had very little value. Hopefully we will start seeing less of both in the future.
Let’s just hope for the best. Google has done a lot over the past few years, and I believe nothing is impossible with very aggressive ideas.
Hi Andrew,
It’s interesting to see something like this paraphrase-based indexing come out in a patent or two and wonder if it’s been something that Google has been quietly doing for a few years. In this case, it’s possible.