Friam

Roll Your Own Google (with Wired online link)

Giles Bowkett

Technically, I rolled my own Google last month. It's offline right now,
but all it does is display search results without the ads. I have
another one that generates ridiculous "conversations" via Google over
IM. Google has had a web services API for years, since before their
IPO, so in a sense they already did this, and it didn't change
anything.

The real accomplishment with Google is its infrastructure, not the web
index: GFS (the Google File System) and the ability to hold an
extremely massive subset of the web in RAM. You'd still need a fairly
sophisticated infrastructure to do any really interesting statistical
work with the web, however you obtained the data. Google's founders
have written some pretty interesting papers on data mining, and that
perspective might be their real advantage.
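
To put rough numbers on "extremely massive", here is a back-of-envelope
sketch using only the figures quoted in the article below (1 terabyte
archived per day, 4 billion to 5 billion pages crawled per month). The
30-day month, the decimal terabyte, and the 4 GB of RAM per machine are
my own assumptions for illustration, not anything Alexa or Google has
stated, and archived volume is only a crude proxy for page size.

```python
# Back-of-envelope arithmetic from the figures in the quoted Wired article.
TB = 10**12            # assumption: decimal terabyte
DAYS_PER_MONTH = 30    # assumption: round month

# Article: "archives 1 terabyte of data a day"
archived_bytes_per_month = 1 * TB * DAYS_PER_MONTH

# Article: "spiders 4 billion to 5 billion pages a month"
pages_low, pages_high = 4 * 10**9, 5 * 10**9

# Hypothetical machine size, chosen only to illustrate the scale.
HYPOTHETICAL_RAM_PER_MACHINE = 4 * 10**9  # 4 GB


def bytes_per_page(total_bytes, pages):
    """Average archived bytes per crawled page."""
    return total_bytes / pages


def machines_to_hold_in_ram(total_bytes, ram_per_machine):
    """Machines needed to keep total_bytes in RAM (ceiling division)."""
    return -(-total_bytes // ram_per_machine)


smallest_avg = bytes_per_page(archived_bytes_per_month, pages_high)
largest_avg = bytes_per_page(archived_bytes_per_month, pages_low)
machines = machines_to_hold_in_ram(
    archived_bytes_per_month, HYPOTHETICAL_RAM_PER_MACHINE
)

print(f"average bytes per page: {smallest_avg:.0f} to {largest_avg:.0f}")
print(f"machines to hold one month in RAM: {machines}")
```

Under those assumptions, a single month of crawl data would already
need thousands of machines just to sit in memory, which is the sense in
which the infrastructure matters more than the data.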

This does sound cool, though. Definitely worth a closer look. I
haven't tried it yet, so knocking it would be premature.

On 12/13/05, Randy Burge <burge at proactive.to> wrote:

>
> For accessing the live links in the article:
> http://www.wired.com/news/technology/0,1282,69817,00.html?tw=wn_tophead_2
>
> Roll Your Own Google
>
> By Jeff MacIntyre
>
> 02:00 AM Dec. 13, 2005 PT
>
> In a move with potentially far-reaching implications for the search market,
> Alexa Internet is opening up its huge web crawler to any programmer who
> wants paid access to its rich trove of internet data.
>
> Alexa, a subsidiary of Amazon.com that is best known for its traffic
> rankings, on Monday unveiled Alexa Web Search Platform, a set of online
> tools for searching, indexing, computing, storing and publishing vast
> quantities of net data.
>
> Alexa claims it's the first time that developers, students and startups
> will be given inexpensive access to an industrial-scale web crawler -- the
> same technology used by industry giants like Yahoo (Yahoo Slurp) and Google
> (Googlebot).
>
> "It sounds innocuous but it's big," said Alexa CEO Bruce Gilliat. "We're
> giving access to billions of pages and computing resources.... Users have
> never had this opportunity before. Big industry has ruled search, because it
> was the only player with access to the tools."
>
> Alexa spiders 4 billion to 5 billion pages a month and archives 1 terabyte
> of data a day. The new platform will allow developers to build their own
> search engines.
>
> "If it is what they claim it is, it strikes me that this is nontrivial
> news," said search industry pundit and author John Battelle. "Anyone can
> crawl the web, but crawling and maintaining an index at scale is very
> difficult and very expensive. They are providing convenient access to
> something that was very dear."
>
> Battelle said the move, if it pans out as promised, could have a big impact
> on the search industry, and could possibly lessen Google's growing dominance
> in web search.
>
> Alexa's offering may help "create an ecosystem (in search) where something
> can occur outside the Googleverse," he said.
>
> To illustrate the new service's potential, Alexa developed a photo search
> engine that allows users to query photo metadata normally hidden from
> standard keyword searches, such as the date the photo was taken or the
> camera used.
>
> Musipedia, another Alexa prototype, provides users with the ability to
> search the web by melody. Give the engine a keyword or melodic contour, and
> it returns similar music. Musipedia allows users to input their own
> whistling as a query.
>
> From computer scientists to web hobbyists, Gilliat predicted Alexa's
> inexpensive services will spawn numerous creative results. Costs are priced
> at $1 per transaction, which range from a CPU hour of computing time to
> gigabytes of uploads and downloads. Gilliat said a complete web snapshot
> should cost a "couple thousand" dollars.
>
> Thanks to the company's history, Gilliat believes Alexa is well-positioned
> to democratize data search.
>
> It is an interesting return to the spotlight for Alexa, the commercial
> cousin of Internet Archive, a nonprofit founded by Brewster Kahle that is
> dedicated to preserving a public index of the web and its history. Alexa's
> crawler donates directly to the Internet Archive.
>
> Alexa has been archiving the web from its Presidio of San Francisco offices
> since it was founded in 1996. In 1997, Alexa unveiled its toolbar, one of
> the first such search-specific browser add-ons, which has since registered
> more than 10 million downloads. Amazon acquired Alexa in 1999.
>
> Alexa has more than a thousand machines involved in storage, access and
> computation, and the company expects high demand for the new service.
>
> "Using our crawler saves massive time, money and computational power,"
> Gilliat said. "There are lots of really smart people out there who don't
> work for a search engine, but they have good ideas, needs and desires for
> what they want from web search. They have an inkling, and we have the way."
>
> Amazon and Alexa representatives declined to speculate whether this move
> might compel other search engines to commercialize their crawlers.
>
> Battelle, however, characterized the news as "Amazon casting a stone in the
> lake of search."
>
> He said Alexa's announcement echoes other developments in recent years at
> Amazon, a company that prides itself on leveraging the strength of its user
> community.
>
> "I have been consistently impressed by the innovative thinking there,"
> Battelle said. "This is the type of news you might come to expect from
> Amazon.... We can now sift the web and do it cheaply and frequently. This
> feels very Web 2.0."
> ============================================================
> FRIAM Applied Complexity Group listserv
> Meets Fridays 9a-11:30 at Mission Cafe
> lectures, archives, unsubscribe, maps at http://www.friam.org
>
>

--
Giles Bowkett = Giles Goat Boy
http://www.gilesgoatboy.org/