Posts Tagged ‘data analysis’

Out of curiosity, I ran a “randomized controlled” experiment using US Census 2000 data.

“Question of interest”: How do people answer differently when asked the “same” question in different ways?

The first group: ~20 students taking an applied statistics class in winter 2011. They were asked to analyze the census data, draw a plot of the percentage of men in each age group in the US, and comment on their findings.

The second group: ~20 students taking the same class in winter 2012. They were asked to draw a plot of the percentage of women in each age group in the US and comment on their findings.

Although the two plots use the same dataset and contain exactly the same information mathematically, the comments from the two groups were unsurprisingly different. As you can guess, the first group mainly reported that “Men die earlier” and the second group agreed on “Women live longer”.
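
To see why the two plots carry the same information, here is a minimal sketch. It assumes each age group is split only into men and women, so the two percentages always add up to 100. The counts below are made up for illustration and are not Census 2000 figures, and the function names are hypothetical.

```python
# Hypothetical counts per age group, invented for illustration only.
# These are NOT Census 2000 figures.
counts = {
    "under 20": {"men": 105, "women": 100},
    "20 to 64": {"men": 100, "women": 100},
    "65 and over": {"men": 70, "women": 100},
}


def percent_men(group):
    """Percentage of men in one age group."""
    total = group["men"] + group["women"]
    return 100.0 * group["men"] / total


def percent_women(group):
    """Percentage of women in one age group."""
    total = group["men"] + group["women"]
    return 100.0 * group["women"] / total


pct_men = {age: percent_men(g) for age, g in counts.items()}
pct_women = {age: percent_women(g) for age, g in counts.items()}

# Each plot can be rebuilt from the other: women% = 100 - men%.
for age in counts:
    print(age, round(pct_men[age], 1), round(pct_women[age], 1))
```

Whichever percentage is plotted, the other one is just its mirror image around 50%, so any difference in the comments comes from the framing and not from the data.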

What do you think? How about artificial intelligence?

It is funny that a computer algorithm can actually figure out the closeness of the following two questions: “Why men die earlier?” and “Why women live longer?”. The top lists from “google image” for these two searches both contain this set of images:

However, if one searches “youtube videos” for these two questions, the top videos for “why men die earlier” all look like this one:

Meanwhile, the top videos for “Why women live longer?” are much more diverse, and some can be very scientific 🙂

One might wonder how much better we are compared with computer algorithms 🙂

Read Full Post »

One year after taking an applied statistics course from David, I signed up for his second “class”, statistical consulting. It was a rare opportunity to learn from someone with so much experience in consulting.

David A. Freedman (copyright: George M. Bergman)

According to Wikipedia, “Freedman was a consulting or testifying expert on statistics in disputes involving employment discrimination, fair loan practices, voting rights, duplicate signatures on petitions, railroad taxation, ecological inference, flight patterns of golf balls, price scanner errors, Bovine Spongiform Encephalopathy (Mad Cow disease), and sampling. He consulted for the Bank of Canada, the Carnegie Commission, the City of San Francisco, the County of Los Angeles, and the Federal Reserve, as well as the U.S. departments of energy, treasury, justice, and commerce.”

Guess what the first task we were assigned for the class was? Wandering around the campus and putting up posters in every building. I know it sounds old school, but we needed clients :). When there is no problem, there is no need for data or analysis.

I remember that the first project my partner and I ran into involved money. We were excited to help predict which alumni have a high tendency to donate back to Berkeley. It was funny that zip code was such a powerful predictor, and it is almost universally powerful in most social studies (as David told us). “Location, location, and location!”

The most practical (and painful) lesson we learned from David in consulting is that he insisted that we not give any suggestions in the first meeting. See, “we are eager to help, but … ” We should not give answers or suggestions to any problem without understanding its background, even if all the client asked was how to calculate a p-value using a t-test. It is funny how our profession has been squeezed into this corner. In many situations, as soon as one gets a p-value < 0.05, one hypothesis is rejected and a (statistically) significant finding is claimed. Ironically, statisticians, who created p-values, are now the ones acting like policemen making the last stand. Critically investigating the problem and the data together may push us a few steps closer to “being useful”. Otherwise, we probably act no better than a software package (like SAS or R).

Meanwhile, we also learned to be creative when working with clients who have already collected their data according to a “flawed” design. We need to help them. In David’s words, “Treat them as your patients. If your patient is bleeding, it does not help to blame them on how they get into an accident. Stop the bleeding and save him/her first, then educate.” In many cases, helping them with an imperfect design or imperfect data involves researching and developing new statistical methods. It all goes back to why statistics exists as a field in the first place: to help people answer questions with data.

[To be continued ……]

Read Full Post »

The term “data scientist” has been showing up on the internet more and more often recently. For example: “Data scientist: The hot new gig in tech”, “Data Scientist: The Hottest Job You Haven’t Heard of”, “Hot career: Data scientist”, …… It seems that data scientists are in high demand these days.

In July this year, a Predictive Modeling and Data Mining Scientists/Analysts position with the Obama 2012 presidential campaign was posted on KDNuggets.com, a data mining website. Micah Sifry at CNN picked it up and wrote an article, “How Obama’s data-crunching prowess may get him re-elected”. The job description says “We are looking for Predictive Modeling/Data Mining Scientists and Analysts, at both the senior and junior level, to join our department through November 2012 at our Chicago Headquarters. We are a multi-disciplinary team of statisticians, predictive modelers, data mining experts, mathematicians, software developers, general analysts and organizers – all striving for a single goal: re-electing President Obama.”

So who are these data scientists? There are definitely many different opinions. My understanding is close to what is described in “Data Scientist – Will this be the dream job in the near future?” (pt1, pt2) by Deepak Ramanathan.

“Data Science is essentially a combination of:

  • Statistical Analysis
  • Data Mining
  • Data Visualization 

While each of the above disciplines form unique career options in itself, a data scientist is one who can do them all – and thereby stand out from the pack.” (I also like the picture he used, so I linked it here)

In my view, what makes someone a good data scientist is the ability to work with large amounts of data, to see through the data (with help from others, or while helping others), and to tell the story behind the data. In an age when we produce and collect data at a scale beyond our capacity to learn from it, the critical skill that gets everything started seems to be working comfortably with large amounts of data.

If you are lucky enough to be good at some of these skills already, picking up the missing parts would be a huge boost to your career (as a data scientist or in another profession, not to mention the ones being protested against on Wall Street).

Read Full Post »

Time to blog

Life is made of a series of random events. When someone asked about posting a few of my words on a blog, it struck me and became the straw that broke the camel’s back.

I have been thinking about how to document and share the fun stuff I have learned, read, and found as a data analyst. In the end, I realized that what makes analyzing data fun is the collection of applications, conversations, discoveries/false-discoveries, speeches, and (of course) stories that one personally experiences. The only way to enjoy the fun is to go through the experience.

To record, expand and share the experience, here comes the new journey.

Read Full Post »