2010/04/13

A New World of Numbers: "Super Crunchers"


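A casino owner cares about your finances and your wins and losses, and steps in to talk you out of staking everything on one more bet during a losing streak. You can predict the quality of a wine when it has only just been made and buy it early as a scarce commodity, instead of waiting 10 years and paying a fortune at auction for a "peerless vintage". Your Mr. Right may be only a few mouse clicks away, so you no longer need to worry about missing each other in a sea of people. For the very same product you may pay twice what someone else pays, because of the difference between the buyers rather than any difference in the product. Airlines are redefining their frequent flyers, rewarding not the passengers who fly the most miles but the ones who earn the company the most profit. With the magic of numbers you can uncover fraud in bidding. Employers can tell at the interview who will turn out to be a loyal employee who fits the company culture. Companies can respond quickly to consumer demand and still hold zero inventory to save costs. Don't believe it? Well, welcome to the brave new world of Super Crunchers.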

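The rapid progress of technology (the trends in computing performance and hard disk capacity described by Moore's Law and Kryder's Law), enormous databases (on the order of terabytes and petabytes) and intelligent search engines have given the century-old tools of statistics (regression, randomized trials) a new lease of life, and they are racing energetically toward this vision. In the book, Ian Ayres tells many entertaining stories to show how Super Crunchers let the numbers speak: they mine massive data sets for the underlying connections and causal relationships between seemingly unrelated things, uncover hidden patterns, and predict the future. Intriguingly, when traditional experts who rely on experience and intuition compete head to head with Super Crunchers, they lose again and again, and their predictions are often less accurate. Ian at one point wanted to title the book "The End of Intuition", precisely because in more and more fields rules of thumb and intuition are steadily losing ground to number crunching. In later chapters, however, he also points out that the rise of data-driven decision making does not spell the end of intuition. The two complement each other: sharp intuition guides us to find and pose questions, while data mining analyzes those questions and tests the intuition.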

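There is no denying that Super Crunchers are conquering territory in criminology, education, medicine, economics, political science and other fields. As they try to change the way decisions have always been made, they are also setting off a sweeping shift in power. Their rise threatens the power, status and respect enjoyed by many traditional professions, so it is not hard to understand why those with vested interests reject and resist them. Besides people's inertia in clinging to old ways, the erosion of citizens' privacy as everything goes digital is another major source of the spreading resistance and panic: all your data and information are in someone else's hands, and someone understands your behavior, your conscious mind and even your subconscious better than you do. Still, human history has shown again and again the rule that "those who go with the tide prosper, those who go against it perish". The future belongs to the Super Crunchers who can move freely between intuition and data, because they see farther and more accurately than either traditional experts or computers. What you need to do is overcome your fear of numbers and formulas, work to master the basic concepts and tools of statistics, change the way you think and decide, and become a surfer riding the crest of the wave rather than a swimmer swallowed by the tide.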

As I closed the last page of the book, I could not help thinking of the famous passage from Dickens: “It was the best of times, it was the worst of times; It was the age of wisdom, it was the age of foolishness; It was the epoch of belief, it was the epoch of incredulity; It was the season of Light, it was the season of Darkness; It was the spring of hope, it was the winter of despair; We had everything before us, we had nothing before us; We were all going direct to heaven, we were all going direct the other way”...

Some useful concepts in the book:

Super crunching is statistical analysis that impacts real-world decisions. Super Crunching predictions usually bring together data, speed and scale:
  • the large size of the dataset, in both the number of observations and the number of variables.
  • the increasing speed of analysis.
  • the huge scale of impact.

Collaborative filters are examples of "the wisdom of crowds":
  • the collective predictions are more accurate than the best estimate that any member of the group could achieve.
  • a kind of tailored audience polling.
  • preference databases are powerful ways to improve personal decision making.
But there is also a social cost to exploiting the long tail:
  • the more successful these personalized filters are, the more we as citizens are deprived of a common experience.
  • they expose citizens only to information that fits their narrowly preconceived preferences.
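
A minimal sketch of a collaborative filter, assuming a tiny made-up ratings table (the users, items, scores and function names below are hypothetical, not taken from the book). It guesses how one person would rate an unseen item by averaging other people's ratings of that item, weighted by how similar their past ratings are:

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale; every name here is invented.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "C": 2, "D": 5},
    "cat": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def cosine(u, v):
    """Cosine similarity of two users over the items both have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        weight = cosine(ratings[user], their_ratings)
        weighted_sum += weight * their_ratings[item]
        weight_total += weight
    # None means nobody comparable has rated this item.
    return weighted_sum / weight_total if weight_total else None

print(predict("ann", "D"))
```

In this toy table ann's predicted score for item D lands closer to bob's rating than to cat's, because ann's past ratings look more like bob's. Real systems use far larger preference databases and more careful similarity measures.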

The core of super crunching techniques:
  1. Regression:
    • a statistical procedure that takes raw historical data and estimates how various causal factors influence a single variable of interest.
    • it not only makes predictions but also tells you, at the same time, how precise each prediction is.
  2. Randomized trials:
    • having a computer flip a coin and treating prospects who come up heads differently than the ones who come up tails.
    • the sample size is the key: with a large enough sample, randomization makes the two groups nearly identical on every other dimension, so we can be confident that any difference in the two groups' outcomes was caused by their different treatment. That difference is the treatment effect.
    • the process of randomization creates matched distributions.
  3. Neural Network:
    • computers can be programmed to update their responses based on new or different information.
    • a neural network is a series of interconnected switches that receive, evaluate and transmit information. Each switch is a mathematical equation that takes and weighs multiple types of input information.
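
To make the regression item concrete, here is a minimal sketch of a one-variable least-squares fit in plain Python. The data (growing-season temperature against a quality score) are invented for illustration and the function name is hypothetical. It returns a prediction together with a standard error for that prediction, which is the "how precise" part:

```python
from math import sqrt

def fit_and_predict(xs, ys, x_new):
    """Fit y = intercept + slope * x by least squares and predict at x_new.

    Returns (slope, intercept, prediction, standard error of the prediction).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual standard deviation, using n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    resid_sd = sqrt(sse / (n - 2))
    prediction = intercept + slope * x_new
    # Standard error for predicting one new observation at x_new.
    pred_se = resid_sd * sqrt(1 + 1 / n + (x_new - mean_x) ** 2 / sxx)
    return slope, intercept, prediction, pred_se

# Made-up data: temperature (deg C) and a quality score.
temps = [15.0, 16.0, 17.0, 18.0, 19.0]
scores = [60.0, 66.0, 69.0, 76.0, 79.0]

slope, intercept, prediction, pred_se = fit_and_predict(temps, scores, 17.5)
print(f"predicted score {prediction:.1f}, standard error {pred_se:.2f}")
```

With only five points the standard error is itself shaky. The point of the sketch is only that the same fit yields both the estimate and a measure of its precision.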

Regression versus Randomized trials:
  • Regression lets the researcher sit back and decide what to test after the fact.
  • Randomized trials require the researcher to state a hypothesis before the test starts.
  • Regressions are used to identify the target group.
  • Randomized trials are used to test the impact of one specific treatment.
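
A minimal simulation of a randomized trial, assuming invented numbers throughout (the true effect, the age trait and the outcome formula are all assumptions made for this sketch). A coin flip assigns each prospect to treatment or control; afterwards the two groups should look alike on the background trait, and the gap in average outcomes estimates the treatment effect:

```python
import random
from statistics import mean

rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_EFFECT = 5.0         # assumed effect of the treatment in this simulation

treated_ages, treated_outcomes = [], []
control_ages, control_outcomes = [], []

for _ in range(10_000):
    age = rng.gauss(40, 10)                       # background trait we do not control
    outcome = 100 + 0.5 * age + rng.gauss(0, 15)  # outcome without treatment
    if rng.random() < 0.5:                        # the coin flip: heads gets the treatment
        treated_ages.append(age)
        treated_outcomes.append(outcome + TRUE_EFFECT)
    else:
        control_ages.append(age)
        control_outcomes.append(outcome)

# Randomization should leave the groups balanced on age...
age_gap = mean(treated_ages) - mean(control_ages)
# ...so the difference in mean outcomes estimates the treatment effect.
effect = mean(treated_outcomes) - mean(control_outcomes)

print(f"age gap between groups: {age_gap:.2f}")
print(f"estimated treatment effect: {effect:.2f} (true value {TRUE_EFFECT})")
```

Shrinking the loop to a few dozen prospects makes both the age gap and the estimate wander much more, which is why sample size matters.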

Regression versus Neural Network:
  • Regression requires the specific form of the equation to be specified in advance.
  • A neural network lets the data pick out the best functional form from a massively interconnected set of equations.
  • Compared to plain-old regression analysis, a neural network is more flexible and nuanced.
  • The subtle interplay of the weighting schemes in a neural network leads to its biggest drawback: it can't identify which single factor affects the prediction or how, and it can't give confidence intervals for its predictions.
  • Overfitting in a neural network may hurt its predictive power.
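
A small sketch of why a single factor is hard to read off a neural network. The weights below are hand-picked and hypothetical, not trained on any data, and the network has only two "switches". In the regression-style equation, raising x1 by one always changes the output by the same coefficient; in the network, the change depends on the value of x2:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def linear_model(x1, x2):
    # Regression-style equation: each input has one fixed coefficient.
    return 0.5 + 0.8 * x1 - 0.3 * x2

def tiny_network(x1, x2):
    # Two hidden "switches", each weighing both inputs, then summed.
    h1 = sigmoid(2.0 * x1 - 2.0 * x2)
    h2 = sigmoid(3.0 * x1 + 3.0 * x2 - 3.0)
    return h1 + h2

def effect_of_x1(model, x2):
    """Change in the model's output when x1 goes from 0 to 1, with x2 held fixed."""
    return model(1.0, x2) - model(0.0, x2)

for x2 in (0.0, 2.0):
    print(
        f"x2={x2}: linear effect {effect_of_x1(linear_model, x2):.3f}, "
        f"network effect {effect_of_x1(tiny_network, x2):.3f}"
    )
```

The linear effect is the same in both rows, while the network's effect shrinks sharply when x2 is larger. With many more switches and inputs, this interplay is what makes the contribution of any one factor hard to state.
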
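
P.S. Tracking down this book felt a bit like "searching for it a thousand times in the crowd". I even got hold of both the Chinese and the English editions so I could read carefully and take notes in earnest. While bent over the desk writing, I also had to push away your shadow and your voice. Reading it truly left me with mixed feelings......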
