“All mushrooms are edible; but some only once.” ~ Croatian proverb
Google was granted a patent today that could be used to collect a seed set of data about features associated with different types of mushrooms, to “determine whether a specimen is poisonous based on predetermined features of the specimen.” The patent also describes how that process could be used to help filter email spam based upon the features found within the email, or to determine whether images on a page are advertisements, or to determine categories of pages on the Web on the basis of textual features within those pages. The image below, from the patent, shows how features of a picture, such as height, width, placement on a page, caption, and so on, might be examined to determine whether or not it is an advertisement:
This machine-learning approach can be trained on data with known outcomes, and then applied to very large data sets to classify that data according to patterns identified within the seed set. When Google published Finding more high quality sites in search in February of 2011, they introduced what would become known as the Big Panda update. Google’s Matt Cutts and Amit Singhal elaborated on the approach about a week later in an interview with Wired magazine, TED 2011: The ‘Panda’ That Hates Farms: A Q&A With Google’s Top Search Engineers.
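To make the idea of training on a seed set more concrete, here is a minimal sketch in Python. It is not the method claimed in the patent; it swaps in a simple naive Bayes classifier as a stand-in, and the feature names (`cap`, `odor`, `ring`), the six labeled specimens, and the function names `train` and `classify` are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical seed set: specimens with known outcomes (invented data).
SEED_SET = [
    ({"cap": "red", "odor": "foul", "ring": "yes"}, "poisonous"),
    ({"cap": "red", "odor": "foul", "ring": "no"}, "poisonous"),
    ({"cap": "white", "odor": "foul", "ring": "yes"}, "poisonous"),
    ({"cap": "brown", "odor": "none", "ring": "no"}, "edible"),
    ({"cap": "white", "odor": "none", "ring": "no"}, "edible"),
    ({"cap": "brown", "odor": "almond", "ring": "yes"}, "edible"),
]


def train(seed_set):
    """Count how often each feature value appears under each label."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, feature) -> value counts
    values_seen = defaultdict(set)         # feature -> distinct values
    for features, label in seed_set:
        label_counts[label] += 1
        for name, value in features.items():
            feature_counts[(label, name)][value] += 1
            values_seen[name].add(value)
    return label_counts, feature_counts, values_seen


def classify(model, features):
    """Return the label with the highest naive Bayes log-score."""
    label_counts, feature_counts, values_seen = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior for this label
        for name, value in features.items():
            seen = feature_counts[(label, name)][value]
            # Add-one smoothing so unseen values never give log(0).
            score += math.log((seen + 1) / (count + len(values_seen[name])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(SEED_SET)
print(classify(model, {"cap": "brown", "odor": "foul", "ring": "yes"}))
```

The same two steps would apply to the other uses mentioned above: swap the mushroom features for features of an email, an image, or a page, label a seed set by hand, and then run the trained model over the much larger unlabeled set.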