Friam

Google & IBM giving students a distributed systems lab using Hadoop


Tom Johnson

Google & IBM giving students a distributed systems lab using Hadoop

FYI. Following on a brief discussion Tuesday at the data mining session....

Google & IBM giving students a distributed systems lab using
Hadoop<http://feeds.feedburner.com/%7Er/oreilly/radar/atom/%7E3/167584952/google_ibm_give.html>

Posted: 09 Oct 2007 04:07 PM CDT

By Jesse Robbins

Google
<http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html> & IBM
have partnered <http://www-03.ibm.com/press/us/en/pressrelease/22414.wss> to
give university students hands-on experience developing software for
large-scale distributed systems. This initiative focuses on parallel
processing for large data sets using Hadoop<http://lucene.apache.org/hadoop/>,
an open source implementation of Google's
MapReduce<http://labs.google.com/papers/mapreduce.html>.
(See Tim's earlier post about Yahoo &
Hadoop<http://radar.oreilly.com/archives/2007/08/yahoos_bet_on_h.html>)

"The goal of this initiative is to improve computer science students'
knowledge of highly parallel computing practices to better address the
emerging paradigm of large-scale distributed computing. IBM and Google are
teaming up to provide hardware, software and services to augment university
curricula and expand research horizons. With their combined resources, the
companies hope to lower the financial and logistical barriers for the
academic community to explore this emerging model of computing."

The project currently includes the University of Washington, Carnegie Mellon
University, MIT, Stanford, UC Berkeley and the University of Maryland.
Students in participating classes will have access to a dedicated cluster of
"several hundred computers" running Linux under Xen
virtualization<http://www.xensource.com/Pages/default.aspx>.
The project is expected to expand to thousands of processors and eventually
be open to researchers and students at other institutions.

As part of this effort, Google and the University of Washington have
released a Creative Commons licensed curriculum to help teach distributed
systems concepts and
techniques<http://code.google.com/edu/content/parallel.html>.
IBM is also providing Hadoop plug-ins for
Eclipse<http://www.alphaworks.ibm.com/tech/mapreducetools>.

*Note:* You can also build similar systems using Hadoop with Amazon
EC2<http://wiki.apache.org/lucene-hadoop/AmazonEC2>.
Tom White recently posted an excellent
guide<http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873&categoryID=112> and
Powerset has been using this in
production<http://www.royans.net/arch/2007/09/13/scaling-powerset-using-amazons-ec2-and-s3/> for
quite some time.

--tj
--
==========================================
J. T. Johnson
Institute for Analytic Journalism -- Santa Fe, NM USA
www.analyticjournalism.com
505.577.6482(c) 505.473.9646(h)
http://www.jtjohnson.com tom at jtjohnson.us

"You never change things by fighting the existing reality.
To change something, build a new model that makes the
existing model obsolete."
-- Buckminster Fuller
==========================================

Phil Henshaw-2

Google & IBM giving students a distributed systems lab using Hadoop

But doesn't most of the evidence suggest that a lack of computing power
isn't our real problem with natural systems?

Phil Henshaw
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
680 Ft. Washington Ave
NY NY 10040
tel: 212-795-4844
e-mail: pfh at synapse9.com
explorations: www.synapse9.com <http://www.synapse9.com/>

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On
Behalf Of Tom Johnson
Sent: Wednesday, October 10, 2007 11:49 AM
To: Friam at redfish. com
Cc: David Collins
Subject: [FRIAM] Google & IBM giving students a distributed systems
lab using Hadoop

FYI. Following on a brief discussion Tuesday at the data mining
session....


Owen Densmore

Google & IBM giving students a distributed systems lab using Hadoop


In reply to this post by Tom Johnson

"Super computing" is facing an interesting challenge with the advent
of multi-core, multi-memory, blade/cluster/grid systems.

The issue is which architecture to use for these powerful systems.
It's very difficult to build a generalized system that works well
across a range of application architectures, and the choices are
multiplying by the minute. The newer "blade" systems offer both
multi-processor and shared-memory configurations. They can be set up
as clusters, or as a many-processor system that looks like a single
memory system, which is far easier to program. Grid systems are
popular, and are still figuring out how to adapt to the latest
hardware advances.

My guess is any realistic solution will be hybrid, combining the
features of all these large scale architectures.

Here's the gotcha: how does it impact the programming language used?
One wants an "agile" multi-processor, multi-memory architecture that
can be reconfigured for advances in hardware and software. Thus far,
there's no silver bullet.

-- Owen owen at backspaces.net
Beer is proof that God loves us, and wants us to be happy.

On Oct 10, 2007, at 9:48 AM, Tom Johnson wrote:

> FYI. Following on a brief discussion Tuesday at the data mining
> session....
>
> ============================================================
> FRIAM Applied Complexity Group listserv
> Meets Fridays 9a-11:30 at cafe at St. John's College
> lectures, archives, unsubscribe, maps at http://www.friam.org

David Mirly

Google & IBM giving students a distributed systems lab using Hadoop

I am currently in an Agent Based Simulation class and I am going to
do a report comparing and contrasting ABS in parallel (distributed,
etc.) environments vs. running a simulation in a purely sequential
environment.

It seems obvious to me that you could get very different results from
one computational architecture vs. another.

Does anyone have any experience with truly parallel systems in this
regard they would like to share?

Thanks!

On Oct 10, 2007, at 7:43 PM, Owen Densmore wrote:

> "Super computing" is facing an interesting challenge with the advent
> of multi-core, multi-memory, blade/cluster/grid systems.
>

Marcus G. Daniels

Google & IBM giving students a distributed systems lab using Hadoop

David Mirly wrote:
> It seems obvious to me that you could get very different results from
> one computational architecture vs. another.
>
Swarm, for example, has a logical model of concurrency and options for
controlling it. Suppose two agents schedule two events in the future
that happen to be at the same time to the time resolution of the
model. When these events are run, they can either be iterated in
serial or in randomized order. Randomized order simulates the
non-determinism one would expect from a truly asynchronous (parallel)
realization of the model. You can indeed get artifacts / apparent
causation in models depending on the details of event ordering...
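
A rough sketch of the serial vs. randomized distinction described above, in
plain Python. This is not Swarm's API: run_schedule, make_model and winner
are names made up for illustration, and the "model" is just two agents
competing for one unit of a resource at the same timestamp.

```python
import random
from collections import defaultdict


def run_schedule(events, randomize, seed=None):
    """Run (time, name, action) events in time order.

    Events sharing a timestamp run in insertion order when randomize is
    False, and in shuffled order when it is True.
    """
    rng = random.Random(seed)
    by_time = defaultdict(list)
    for time, name, action in events:
        by_time[time].append((name, action))
    log = []
    for time in sorted(by_time):
        batch = list(by_time[time])
        if randomize:
            rng.shuffle(batch)  # stand-in for asynchronous, parallel execution
        for name, action in batch:
            log.append((time, name, action()))
    return log


def make_model():
    """Two agents both try to take the single unit of a resource at time 1."""
    resource = {"units": 1}

    def grab():
        if resource["units"] > 0:
            resource["units"] -= 1
            return "got it"
        return "missed"

    return [(1, "agent_a", grab), (1, "agent_b", grab)]


def winner(log):
    """Name of the agent whose grab succeeded."""
    return next(name for _, name, outcome in log if outcome == "got it")


serial = winner(run_schedule(make_model(), randomize=False))
shuffled = {winner(run_schedule(make_model(), randomize=True, seed=s))
            for s in range(200)}
print("serial winner:", serial)
print("winners seen with randomized order:", sorted(shuffled))
```

With serial iteration the first-scheduled agent always wins, which is exactly
the kind of ordering artifact that can look like causation; shuffling
same-time events removes that systematic advantage.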

Marcus

Marcus G. Daniels

Google & IBM giving students a distributed systems lab using Hadoop

In reply to this post by David Mirly

Owen wrote:
>> My guess is any realistic solution will be hybrid, combining the
>> features of all these large scale architectures.
>>
By designing circuits to do special purpose compute tasks, the lengths
of wires can be reduced (and thus their diameter), and this is ultimately
the limiting factor on serial performance and circuit density. In
practice, designing these circuits is hard to automate and optimize
(even with FPGAs), and even relatively general hybrid computing
approaches like the Cell Broadband Engine still require quite a bit of
programming finesse (e.g. prefetching and keen awareness of memory
access patterns).

Of course, at some point the wire diameters can get no smaller, and we
have to look to programming approaches to find parallelism. One
software technology that looks promising to me is software transactional
memory:

http://en.wikipedia.org/wiki/Software_transactional_memory

Apparently Sun is working on hardware support for it:

http://www.theregister.co.uk/2007/08/21/sun_transactional_memory_rock/
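
For anyone who hasn't seen the idea, here is a toy sketch of software
transactional memory in plain Python. It is only an illustration under
stated assumptions: TVar, Transaction, atomically and transfer are
hypothetical names, not any real library's API, and a single global lock
guards commits, which a serious implementation would avoid. The point is the
shape: transaction bodies run optimistically without holding locks, and a
commit only succeeds if nothing they read has changed in the meantime.

```python
import threading

_commit_lock = threading.Lock()  # guards snapshots and commits, not bodies


class TVar:
    """A transactional variable: a value plus a version bumped on each commit."""

    def __init__(self, value):
        self.value = value
        self.version = 0


class Transaction:
    def __init__(self):
        self.reads = {}   # TVar -> version seen at first read
        self.writes = {}  # TVar -> pending value, invisible until commit

    def read(self, tvar):
        if tvar in self.writes:
            return self.writes[tvar]
        with _commit_lock:  # consistent (value, version) snapshot
            value, version = tvar.value, tvar.version
        self.reads.setdefault(tvar, version)
        return value

    def write(self, tvar, value):
        self.writes[tvar] = value

    def commit(self):
        with _commit_lock:
            # Abort if any variable we read was committed to by someone else.
            if any(tvar.version != seen for tvar, seen in self.reads.items()):
                return False
            for tvar, value in self.writes.items():
                tvar.value = value
                tvar.version += 1
            return True


def atomically(body):
    """Run body(tx) repeatedly until its reads and writes commit cleanly."""
    while True:
        tx = Transaction()
        result = body(tx)
        if tx.commit():
            return result


def transfer(src, dst, amount):
    """Move amount from src to dst as one atomic step."""
    def body(tx):
        tx.write(src, tx.read(src) - amount)
        tx.write(dst, tx.read(dst) + amount)
    atomically(body)
```

Several threads can call transfer() on the same pair of TVars without
explicit per-variable locking in the caller; conflicting transactions are
simply retried.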

Douglas Roberts-2

Google & IBM giving students a distributed systems lab using Hadoop

In reply to this post by David Mirly

EpiSims (http://ndssl.vbi.vt.edu/episims.html) is a distributed discrete
event ABM that runs on clusters (and soon on clusters of clusters on the
TeraGrid: http://www.isdsjournal.org/article/view/1947). It is entirely
possible to get slightly different results from two subsequent EpiSims runs
using the same input data sets. As MGD points out in a previous message,
parallelization can randomize the order of execution of events that are
scheduled to run at the same future point in time. We have studied the
"noise" produced by randomized execution order of same-time events in
EpiSims, and found that it produces variations in results of less than
1% in most cases.
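
One way to quantify this kind of ordering noise is to rerun a model with
different event orderings and compare the outputs. The sketch below is a toy
in plain Python and has nothing to do with the EpiSims code: run_once and
relative_spread are made-up names, and the ring "epidemic" is deliberately
sensitive to event order, so its spread will be much larger than the figure
quoted above and says nothing about EpiSims itself.

```python
import random


def run_once(seed, n_agents=50, ticks=3):
    """Toy ring epidemic; returns the number of infected agents at the end.

    Every agent has one event per tick ("if infected, infect my right-hand
    neighbour"). Events within a tick run in shuffled order and update the
    shared state in place, so the outcome depends on that order.
    """
    rng = random.Random(seed)
    infected = [False] * n_agents
    infected[0] = True
    for _ in range(ticks):
        order = list(range(n_agents))
        rng.shuffle(order)  # randomized order of same-time events
        for agent in order:
            if infected[agent]:
                infected[(agent + 1) % n_agents] = True
    return sum(infected)


def relative_spread(results):
    """(max - min) / mean, a crude measure of run-to-run variation."""
    mean = sum(results) / len(results)
    return (max(results) - min(results)) / mean


results = [run_once(seed) for seed in range(100)]
print(f"infected counts range {min(results)}..{max(results)}, "
      f"relative spread {relative_spread(results):.2f}")
```

The same relative_spread() helper could be applied to any summary statistic
collected from repeated runs of a real model with identical inputs.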

--Doug

--
Doug Roberts, RTI International
droberts at rti.org
doug at parrot-farm.net
505-455-7333 - Office
505-670-8195 - Cell

On 10/11/07, David Mirly <mirly at comcast.net> wrote:

>
> I am currently in an Agent Based Simulation class and I am going to
> do a report comparing and contrasting ABS
> in parallel (distributed, etc.) environments vs. running a simulation
> in a purely sequential environment.
>
> It seems obvious to me that you could get very different results from
> one computational architecture vs. another.
>
> Does anyone have any experience with truly parallel systems in this
> regard they would like to share?
>
> Thanks!