Strata Round Up Part 1: Overview and Takeaways

I spent a couple of days at the Strata Big Data conference in lovely Santa Clara, California the other week talking shop about massive datasets and what to do with them.  I had a wonderful time rubbing elbows with all the smart and interesting people out there with “data scientist” on their cards, but I was struck by the common theme that none of us really knew exactly what that, or “big data”, for that matter, really meant.  There’s been a lot of buzz around the Internet on exactly this topic, ranging from Quora questions about what one should do to become a data scientist, to blog posts outlining taxonomies for data science, and it is exactly this uncertainty that makes a conference like this so pivotal.  There are increasing numbers of people who consider themselves data people, yet the community of data scientists and big data hackers is still very much a community in search of a clear identity.  This conference finally got everyone into the same room so that we could start answering these questions, so I thought I’d give a quick review of some high-level ideas that I took away from the conference and what they say about the future of data and data science:

Big Data vs. Data Science: The common thread between everyone at Strata is that they all use data in some way, but with data being so readily available all of a sudden, everyone uses data in some way and such a broad definition is bound to encompass a lot of camps.  While there were individual tracks on data in health, data in journalism, and data in social media, the most obvious divide to me was between big data and data science.  I think of the big data people as your Clouderas or your Amazons:  These companies focus on managing massive amounts of data at the ground level, homing in on the hardware and software challenges in collecting, storing, and distributing vast amounts of data.  On the data science side, you have people trying to gain insight from data, whether that data is technically big or not.  The two disciplines obviously go hand-in-hand, but it feels a little bit like the difference between people who write compilers for a language and the people who use the language to write software.  Obviously one depends on the other, but their goals are pretty different.  Granted, the idea of data science is very young and this is exactly how things should look right now, but I found the difference fairly striking.

Big Data vs. Not-so-Big Data: A common theme brought up in a wonderful keynote from Mark Madsen was that it’s not really about “big” data at all.  His oft-retweeted line was that “using big data” is not about “big” or “data”, it’s about “using”.  This was particularly true as I passed by the startup showcase, where many a company was touting “big data” solutions, but their datasets only really comprised about a handful of terabytes.  In fact, the vast majority of companies I saw, while they could technically tap into large datasets, could find just as much meaning in small portions of that data.  For many, if not most, companies and projects, the focus here should be on gaining insight regardless of the size of the data, and there’s no need to crow about “big” unless a) your data is so big it requires a fundamental shift in computing or storage, or b) your data is so big it teaches you something that a smaller version or sample of the dataset couldn’t.

Using Data vs. Showing Data: The second part of Mark’s point about using big data came up a lot as well and resonated most with me during the conference:  Data without insight is meaningless.  As he put it, we shouldn’t be tabloid journalists with data, we should seek to bring insight.  Similar and equally inspiring notions were given in Hilary Mason’s keynote, where she implored data scientists to do something meaningful with their unprecedented access to data and tools.  Most well-trained scientists and statisticians probably already think this way, even if just internally, but there seems to be a real sea change occurring in the average person’s ability to collect and analyze data with very low barriers to entry.  That’s a wonderful thing but, at the same time, it enables a lot more people to produce a lot more data crunching and data visualization that doesn’t always lead to new insights.

This idea really struck me as I monitored the Twitter stream for the conference and saw no fewer than 10 visualizations of Strata-related data compiled by people sitting in the back of a conference room.  Let’s reflect on how amazing a world this is that someone at a conference, aided only by a computer, the Internet, and a little boredom, can scrape data from conversations people are having in real time, cleanse that data, put it into a usable format, and visualize it in a network structure, all within the time it takes to listen to a talk.  That signals to me a huge change in the accessibility and availability of data and tools today and shows that more people than ever are able to collect and analyze data.

That said, the insight here leaves something to be desired.  Visualizing or generating descriptions of data, e.g. in the form of histograms or network graphs, is a critical step on the path to understanding the data, but is too often the end point for a lot of people.  I saw vast expansive graphs of who tweeted about whom at the conference that updated in real time and could be searched over the past, but that showed little that we didn’t already know.  Big surprise, the O’Reilly editors tweet each other a lot, and people who work together tweet about each other.  There’s a nice “comfort of the familiar” that we see in visualizations like this, similar to the unexpected familiarity Andrew Gelman talks about in his great talk on info viz vs. statistical graphics, but it doesn’t teach us anything new.  Even the beautiful work that LinkedIn did with its network graphs shows us dazzling technicolor hairballs of our connections to other people in the world, but mostly confirms what we already suspect and know:  I’m connected primarily to people from my previous employers and they all pretty much do similar things to me.

That’s not to say there isn’t value here, because there is.  Getting more people involved in analysis and data scraping is a huge step toward increasing information literacy and ushering in a new age of data scientists with new skills that previous generations may not have had available.  The field is young, the tools are new and exciting, and there are now a lot more people getting involved and taking part in the conversation about handling data than ever before.  Now, however, we just need to guide those people toward truly finding meaning in what they’re doing, to train them in critical thinking and to give them the statistical knowledge (biased, I know) to say something deeper than “here’s a graph”.  More importantly, as data scientists with those skills on hand already, we should take it upon ourselves to keep a critical eye on the work we do and the conclusions we draw so that, at the end of the day, hopefully we’ve learned something new or helped make the world a little bit better, instead of just adding something neat to look at.

I don’t mean to sound down on Strata or any of the people involved.  Quite the contrary: I found it to be a wonderful experience, and it showcased what overwhelmingly good things are happening out there in the data world, how exciting a time it is to be in a field wrestling with data, and how creative and enterprising people are about getting involved and contributing to this community.  As Hilary put it, “The state of the data union is strong”, and I’m honored to consider myself even just a small part of that union.  For anyone who was enviously following #strataconf from the east coast, good news!  There will be another Strata this year in New York in September and, with the pace things are going, I’m sure it will be even bigger and better than the first.  So fire up your R scripts, line up your favorite APIs, and come join us for the next big data conference.  We’re already outnumbered by the data; let’s not be outwitted by them too.
