The discipline of Big Data

Anne Kao, Boeing Senior Technical Fellow

Machine learning and artificial intelligence drive the data movement.

Data analytics is not just a new term for the relatively old discipline called statistics.

Data analytics methods are often statistically based, but the goal of data analytics is typically not to prove or disprove a hypothesis, but rather to use machine learning and heuristics to discover a model of the data in order to characterize and predict new data.

The success of a data analytics application depends on the following:

  • The quality of the data;
  • Preparation of the data;
  • Understanding of the data and the associated business rules and business problem to be solved, in conjunction with the method or algorithm used;
  • How well the results are presented; and
  • The business value it can bring.

What distinguishes modern-day data analytics from the data mining of 20 years ago is its focus on analyzing data from multiple sources, often in multiple formats and with different business or coding rules.

For a better understanding of manufacturing defects, for example, we have to know how information is coded in the system. If there are 10 holes drilled incorrectly in one part, is that 10 defects or one? If it affects multiple planes, how is that recorded? How do changes in processes and coding over time affect the analysis?
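To make the coding question concrete, here is a minimal sketch of how the same defect records yield different totals under different counting rules. The records, field names and the three rules are all invented for illustration; they are assumptions, not a description of any actual defect-tracking system.

```python
# Hypothetical records: one row per incorrectly drilled hole.
# Part IDs, airplane IDs and field names are made up for this sketch.
records = [
    {"part_id": "P-100", "airplane": "A1", "hole": h} for h in range(1, 11)
] + [
    {"part_id": "P-200", "airplane": "A1", "hole": 1},
    {"part_id": "P-200", "airplane": "A2", "hole": 1},
]


def count_per_hole(rows):
    """Rule A (assumed): every incorrectly drilled hole is its own defect."""
    return len(rows)


def count_per_part(rows):
    """Rule B (assumed): all bad holes in one part count as a single defect."""
    return len({r["part_id"] for r in rows})


def count_per_part_and_airplane(rows):
    """Rule C (assumed): one defect per affected part on each affected airplane."""
    return len({(r["part_id"], r["airplane"]) for r in rows})


for rule in (count_per_hole, count_per_part, count_per_part_and_airplane):
    print(rule.__name__, rule(records))
```

The underlying events are identical in all three cases; only the coding rule changes. An analysis that merges data recorded under different rules, or across a period in which the rule changed, would see a jump or drop in defects that has nothing to do with the manufacturing process itself.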

Big data is not all about volume. “Big data” also refers to variety, variability, veracity, velocity and the value it brings. Together with volume, these make up the six V’s.

Data scientists may spend 70-90 percent of their time transforming and preparing the data for analysis.

There is no single analytics method that is good for everything. Some approaches are more robust. Some algorithms are accurate but uninterpretable. If I could tell you that a particular program will have more mechanical issues but could not tell you what types of failures to expect or what activities correlate strongly with those failures, there would not be much value in that knowledge, even if I’m very accurate. Interpretability makes for actionable results.
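As a rough illustration of the difference between an accurate but opaque answer and an interpretable one, the sketch below contrasts a single overall failure rate with a per-activity breakdown. The log entries and activity names are invented placeholders, and the breakdown is a simple descriptive summary, not any particular analytics method.

```python
from collections import defaultdict

# Hypothetical log: (activity performed, whether a failure followed).
# All entries are made up for illustration.
log = [
    ("activity_a", True), ("activity_a", True), ("activity_a", False),
    ("activity_b", False), ("activity_b", False),
    ("activity_c", True), ("activity_c", False),
    ("activity_c", False), ("activity_c", False),
]


def overall_failure_rate(rows):
    """A single number: possibly accurate, but it says nothing about causes."""
    return sum(failed for _, failed in rows) / len(rows)


def failure_rate_by_activity(rows):
    """Break the same data down by activity so the result can be acted on."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for activity, failed in rows:
        totals[activity] += 1
        failures[activity] += failed  # True counts as 1, False as 0
    return {activity: failures[activity] / totals[activity] for activity in totals}


rates = failure_rate_by_activity(log)
print(round(overall_failure_rate(log), 2))
for activity, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(activity, round(rate, 2))
```

The overall rate only says that failures happen. The breakdown points to where to look first, which is what makes a result actionable. A correlation like this is a starting point for investigation, not proof of cause.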

The early artificial intelligence efforts of the 1980s had failed by the early 1990s, not for lack of computing power, but because the amount of knowledge required was massive and hard to manage.

Machine learning efforts in the late 1990s swung to the other side, relying on heuristics and statistically based methods to learn from sample data. But it still takes time to obtain sample data, and this “blind” way of learning can involve lots of trial and error—and cannot handle complex problems in a timely manner.

Many in the data mining community now promote a best-of-both-worlds approach: dividing large problems into more easily intelligible chunks by mixing smaller, simpler models; complementing machine learning with interactive science; and drawing on users’ own domain knowledge to interpret complex results.

The recent renewed hope for artificial intelligence has to be based on the right mix of machine learning and domain knowledge. This is a necessary foundation for the future of data analytics, as well as other areas such as autonomous systems (self-driving cars and the like).
