This week, Google was awarded a patent that describes how it might score content based on how much gibberish it contains, a score that could then be used to demote pages in search results. In this context, gibberish content refers to content that is likely to be spam.
The patent defines gibberish content as web page content that may contain a number of high-value keywords but was generated through methods such as:
- Using low-cost untrained labor (from places like Mechanical Turk)
- Scraping content and modifying and splicing it randomly
- Translating from a different language
Gibberish content also tends to include text sequences that are unlikely to appear in natural language: strings that do not follow the conversational syntax typically found in resources such as web documents.
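To make that idea concrete, one way to flag "unlikely text sequences" is to check how often a page's word pairs also occur in a reference corpus of natural text. This is only an illustrative toy sketch, not the scoring method the patent actually describes; the function name and corpus are hypothetical.

```python
from collections import Counter

def bigram_score(text, corpus):
    """Toy 'naturalness' score: the fraction of the text's word bigrams
    that also appear in a reference corpus. Low scores suggest word
    sequences unlikely to occur in conversational language."""
    def bigrams(s):
        words = s.lower().split()
        return list(zip(words, words[1:]))

    known = Counter(bigrams(corpus))  # bigram counts from natural text
    pairs = bigrams(text)
    if not pairs:
        return 0.0
    seen = sum(1 for pair in pairs if pair in known)
    return seen / len(pairs)

# Hypothetical reference corpus of natural text
corpus = "the cat sat on the mat the dog sat on the rug"

print(bigram_score("the cat sat on the rug", corpus))   # natural word order
print(bigram_score("mat dog the rug cat on", corpus))   # scrambled words
```

A real system would use far larger language models and combine this with other signals, but the intuition is the same: scrambled or spliced text scores poorly because its word sequences rarely occur in natural writing.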