What would it take for Google to include 100 billion pages in its index?
Could they develop a way for people to search for and look at older versions of web pages, while also improving the quality of their search results? Would indexing words within conceptually related phrases make the search process better?
A recent patent application from Google estimates that the web contains around 200 billion pages, and guesses that the largest indexes from the major search engines hold around 6-8 billion pages. The document is Multiple index based information retrieval system, US Patent Application 20060106792, which was published May 18, 2006, and originally filed on January 25, 2005.
In addition to providing us with a rough estimate of the size of the web, and the number of pages indexed by search engines, it also tries to answer the questions I asked at the top of this post.
The inventor listed in the patent filing is Anna Patterson, who has already built a search engine that holds more than 55 billion pages (The Internet Archive). Part of the process described in the document was the subject of a blog post here back in February – Move over pagerank: Google’s looking at phrases?
Continue reading Google Aiming at 100 Billion Pages?
Apple adds something to those songs you’ve been listening to from them, but it’s not music.
Normally, at the time a digital media file is created, there’s information about the content included with the music. This data is embedded in the digital media file’s header section, including such things as copyright information and digital rights management information, as well as title, author, and publisher.
If you’ve been watching the shelves of your local music store, you’ve probably seen enhanced CDs and DVDs, which contain hyperlinks to additional media content, often available on web sites. Apple wants to be able to include additional information like that in digital tunes that are downloaded.
This isn’t a problem with streaming media, which could have that kind of information added to it, but many people prefer direct access to the songs so that they can listen or watch when they don’t have access to streaming media.
Continue reading Apple to Embed Ads and Marketing Information in iTunes?
I downloaded the Google Notebook browser extension about twenty minutes ago, and have been trying it out.
In case you haven’t heard about Google Notebook yet, it’s a new tool that Google announced last week during Google Press Day, though it wasn’t scheduled for release until this week.
The idea behind it is that you can use it to take notes about web pages, copy snippets from those pages, and keep them in notebooks, which you can keep private or make accessible to the public. A link to the page where you found the material makes it easy to return to the source of the information.
Notebooks can be organized into sections, and can contain images as well as text. The program can be accessed from more than one computer, which means that the information contained within it is stored by Google rather than on your own computer.
I really like the way that the mini notebook and the full page notebook work together. As a tool for tracking information on the web, it’s pretty useful. I could see some value in using it as a work tool when looking at a site and considering rewriting content on the pages of that site, or in writing notes for a blog post, article, or paper.
Continue reading Google Notebook Released
Trust is essential in our reliance on search engines. But we should understand some of the risks in placing too much trust in search results.
There’s the possibility of bias in what search engines show people, based upon the engines’ business practices and operating policies, limitations in indexing and ranking algorithms, and political and cultural pressures placed upon them.
When I think of conferences like the one to be held next week in Edinburgh, Scotland, during the 15th Annual World Wide Web Conference, I don’t expect to see presentations that are critical of search engines. But, during a workshop on Models of Trust for the Web, there’s a paper being presented that takes a close look at search engine bias, from a couple of researchers at Yuan Ze University in Taiwan.
Position Paper: A Study of Web Search Engine Bias and its Assessment (pdf) by Ing-Xiang Chen and Cheng-Zen Yang
The authors of this paper describe in more detail the three different sources of bias that I mentioned above. How could business practices shape the bias of search engines? Continue reading Trust and the Internet: Search Engine Bias
I’ve been using Google Alerts for the past year or so to stay on top of a handful of topics, and I decided this weekend that it might be worth expanding their use a little more.
So, I added about ten terms that I’m interested in tracking to my alerts list for Google.
And then, I decided that it might be fun to try out Yahoo Alerts also, and compare what the two services provide.
My experience with Google Alerts has been interesting so far. With some news articles, the alerts I’m sent have been fairly timely. But every so often, I see an alert pointing to a page that’s more than a year old. When I see that, I wonder if Google has just discovered the page, and noticed in some vast database that they hadn’t sent me a copy of it yet.
I haven’t searched to see if someone has tried this already, but it might be fun to keep track of what links I’m provided with, and compare the two alert systems over a period of a few weeks or months. How old are the pages that I receive an alert for? How many links am I provided per term over the length of time, and how many do I receive each day from both search engines?
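For anyone who wants to run the same comparison, here is a rough sketch of how such a log might be kept and summarized. Everything in it is an assumption made for illustration: the Alert record, the function names, the tracked term, and the sample entries are all made up, and the publication date would have to be a best guess taken from the page itself.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass(frozen=True)
class Alert:
    """One link received in an alert email (hypothetical record layout)."""
    service: str      # e.g. "google" or "yahoo"
    term: str         # the tracked search term
    received: date    # day the alert arrived
    published: date   # best guess at when the page was published


def links_per_term(alerts):
    """Count alerts for each (service, term) pair."""
    return Counter((a.service, a.term) for a in alerts)


def links_per_day(alerts):
    """Count alerts for each (service, day received) pair."""
    return Counter((a.service, a.received) for a in alerts)


def median_page_age_days(alerts, service):
    """Median age, in days, of pages at the time the service alerted on them."""
    ages = [(a.received - a.published).days for a in alerts if a.service == service]
    return median(ages) if ages else None


# Made-up sample records, only to show the shape of the log.
sample = [
    Alert("google", "search patents", date(2006, 5, 22), date(2006, 5, 21)),
    Alert("google", "search patents", date(2006, 5, 22), date(2005, 3, 1)),
    Alert("yahoo", "search patents", date(2006, 5, 23), date(2006, 5, 22)),
]

print(links_per_term(sample))
print(links_per_day(sample))
print(median_page_age_days(sample, "google"))
```

The median is used for page age rather than the average so that one very old page doesn’t swamp a batch of otherwise timely alerts.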
Continue reading Testing Google and Yahoo Alerts
Trust is a topic that has a profound effect upon the way search engines work on the web.
How easy or difficult is it to come up with methods that don’t rely (much) on human judgment to identify spam-free pages that can be trusted, and to locate pages that are intended solely to rank well in search engines without providing any value for visitors, except possibly ads on the topic of their search?
In a week, there will be a gathering in Edinburgh, Scotland, during the 15th Annual World Wide Web Conference, on the subject of Models of Trust for the Web. While I won’t be attending, it sounds like an interesting presentation, and I wanted to take a look at some of the papers written by presenters at the conference. In this post, I’ll be looking at one of the papers to be presented, and listing some of the other work by its authors.
Problems with Yahoo’s Trustrank Assumptions
Continue reading Trust and the Internet: Web Search Spam
Came across a lot of interesting stopping points on my travels around the web over the last few days, some fun stories, and some thoughtful musings…
Favorite title, and analogy, Please Stop With Your Chinese Math, reminded me of all the meetings I’ve been in where I’ve inadvertently rolled my eyes at some statistics, and hoped that no one noticed.
Book on the Science of Google Rankings – Probably has too much math for my tastes, but I’m going to have to get a copy after reading their Deeper Inside Pagerank to see where they pick up the storyline. I hope they don’t kill off any of the main characters.
LEGO’s Incredible Marketing Strategy (yes, legos and marketing are a great match)
Continue reading On a Hypertext Roadtrip
Some recently published patent applications from Go Daddy explore whether additional whois information might help reduce spam and phishing, and improve search engine results. Google noted in a patent application last year that they might be looking at whois information while presenting and ranking pages.
I don’t know how easy it would be to set up the processes described by Go Daddy, or verify the reputation information that they describe, and maintain the records the system would depend upon.
The purpose of whois information
But it might be a moot point to even wonder. A recent decision by the folks at ICANN to limit the use of whois information makes it seem unlikely that the scenarios envisioned by these documents will happen. ICANN’s Generic Names Supporting Organization held a vote in which they decided upon the sole purpose of whois information:
Continue reading Does Google use whois information?