Analytic methods for Big Data

Analytic methods for Big Data

Tom Johnson
All:
SFI hosted a terrific conference last week on the topic of "Big Data and Cities."

Today I came across this long, and generally rich, discussion on LinkedIn on the topic "What Analytic Methods Are Possible with Truly Big Data."

I wonder if list members have any thoughts on the topic?  We might first ask "What is your definition of Big Data," and then "How should we best analyze it for which audiences?"

-tom


--
==========================================
J. T. Johnson
Institute for Analytic Journalism   --   Santa Fe, NM USA
505.577.6482(c)                                    505.473.9646(h)
Twitter: jtjohnson
http://www.jtjohnson.com                  [hidden email]
==========================================

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
to unsubscribe http://redfish.com/mailman/listinfo/friam_redfish.com

Re: Analytic methods for Big Data

Marcus G. Daniels
On 9/22/13 5:07 PM, Tom Johnson wrote:
To scale a database to petabyte-scale storage and beyond, it needs to be partitioned.  Distributing data balances the I/O load across different hardware.  A given drive (in an array) can only read and write so fast, and, worse, if the path to a drive is congested it doesn't matter how fast it is.  It seems to me the appeal of non-RDBMS technologies is that they either 1) force a user to confront the location of data or 2) propose some gross simplification, such as that fields are not compared (column-oriented databases) -- an assumption that may or may not be wise in the long term.

In contrast, traditional databases (DB2, Oracle, Postgres) have a high-level and versatile means to query data (SQL), but don't necessarily perform well if used in a naive way at scale.  Postgres, for example, does not give a simple unified view of N distributed databases.  It gives N databases.  But if one develops a scheme to query N databases and has appropriate logic to merge/filter the results, it scales just fine.  The tradeoff for the user is whether they prefer a simple tool with a simple performance model, or whether they are prepared to invest in a more flexible tool with a more challenging performance model.  The runtime cost of an SQL query can vary by many orders of magnitude depending on how data is indexed, whether the query planner's cost data is accurate (e.g. disk head seek time), and whether the data is partitioned.
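
As a rough illustration of the "query N databases and merge the result" scheme, here is a minimal sketch in Python. It is only a sketch under stated assumptions: three in-memory SQLite databases stand in for separate Postgres servers, and the table name (readings), the columns, and the helper names make_shard and scatter_gather_avg are all hypothetical. It shows that the merge logic has to be designed with the query: each shard returns a partial SUM and COUNT, because averaging the per-shard averages would give the wrong answer.

```python
import sqlite3


def make_shard(rows):
    """Create one in-memory database standing in for a separate server."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (city TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return conn


def scatter_gather_avg(shards, min_value):
    """Run the same SQL on every shard, then merge the partial results."""
    totals = {}  # city -> [running sum, running count]
    for conn in shards:
        cursor = conn.execute(
            "SELECT city, SUM(value), COUNT(*) FROM readings "
            "WHERE value >= ? GROUP BY city",
            (min_value,),
        )
        for city, partial_sum, partial_count in cursor:
            acc = totals.setdefault(city, [0.0, 0])
            acc[0] += partial_sum
            acc[1] += partial_count
    # Only now, with global sums and counts, is it safe to divide.
    return {city: s / n for city, (s, n) in totals.items()}


# Hypothetical data spread over three "servers".
shards = [
    make_shard([("Santa Fe", 10.0), ("Taos", 4.0)]),
    make_shard([("Santa Fe", 20.0), ("Taos", 6.0)]),
    make_shard([("Santa Fe", 30.0)]),
]
averages = scatter_gather_avg(shards, min_value=5.0)
```

The filter (WHERE) is pushed down to each shard so that only small aggregates travel over the network. In a real deployment the loop would run in parallel against remote connections.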

Lisp has been blamed for decades for being slow: "Lisp programmers know the value of everything but the cost of nothing."  A more nuanced observation is that "Bad Lisp programmers write slow programs, whereas bad C++ programmers write no programs."  The same idea applies to RDBMS vs. Hadoop-style databases.  The problem is not that one is slow, it's that the user is either incapable or unwilling to get their head around the performance model.  Some people like crude tools because they lack the patience, opportunity, or literacy to learn about more capable ones.  If the use of a tool is silly, it is hard for these people to recognize their own ignorance -- it is easier to blame the tool.

Partitioning and analytics are related in ways that can be a pain.  As a simple example, consider what's needed to compute a median.  If the data is of moderate size (say tens of gigabytes), it can be pulled into memory, sorted, and then cut in half to find the median.  If it is a petabyte and on 1000 separate database servers, then there is no one place where it can be sorted.  Instead a merge sort is needed.  The point is that the algorithms chosen to realize various statistical methods may have to change in order to scale at all.  And the Fortran or C codes that implement statistical packages like SAS, SPSSX, or R don't inherently know where memory is (because the programming language does not explicitly represent that), so the compiler can't recognize that data tables live in different places.  So, even if the algorithms didn't need to be restructured for "big data", the legacy numerical codes don't necessarily lend themselves to re-use in distributed memory systems.
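
To make the median example concrete, here is a minimal sketch in Python of the merge-based approach. The assumptions are loud ones: plain lists stand in for the separate servers, each server is assumed to be able to sort its own data locally, and the function name distributed_median is hypothetical. The standard library's heapq.merge does a lazy k-way merge, so the coordinator only streams values in order and stops at the middle. It never holds the whole data set in memory.

```python
import heapq
from itertools import islice


def distributed_median(sorted_partitions):
    """Median of data split across partitions, each already sorted locally."""
    total = sum(len(p) for p in sorted_partitions)
    if total == 0:
        raise ValueError("no data")
    # Lazy k-way merge: yields values in global order, one at a time.
    merged = heapq.merge(*sorted_partitions)
    mid = (total - 1) // 2
    if total % 2 == 1:
        # Odd count: skip to the single middle element.
        return next(islice(merged, mid, None))
    # Even count: average the two middle elements.
    low, high = islice(merged, mid, mid + 2)
    return (low + high) / 2


# Hypothetical unsorted data on three "servers"; each sorts its own share.
raw_partitions = [[9, 1, 5], [3, 2], [12, 4, 10, 8]]
sorted_partitions = [sorted(p) for p in raw_partitions]
median = distributed_median(sorted_partitions)
```

Even this sketch still streams about half of all values through one coordinator. At true petabyte scale one would more likely use a selection scheme or an approximate quantile method, which is exactly the kind of algorithmic restructuring described above.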

Anyway, this just addresses the literal question you raised:  I'd say most analytic methods are suitable for Big Data, but the techniques and technology needed to make it so have not become prevalent yet.  It's another area where there is good, honest development work to be done, and it just needs to be done.

Marcus

