Ken Rudin, Director of Analytics at Facebook, spoke at HP’s Big Data Conference in Boston today. He attacked the four myths of big data:
- That you need Hadoop to attack big data
- That big data gives you better answers
- That data science is a science
- That the reason for big data is to get “actionable insights” from the data.
Of course, there is a kernel of truth in all of these, but there are many tools that are useful in big data, and the answers you get from it are only as good as the questions you ask. Perhaps the most important point he made is that data science is both a science and an art. Those of us who have been in some part of the information industry for longer than we care to admit agree with him. You certainly need the tools, and being a whiz in the “how” of finding and analyzing information is important. That’s the science.
But it’s only half the battle. Knowing how to ask a good question is an art. Good askers of questions must be good listeners. They are steeped in the background of the organization. They absorb the underlying reasons for why information is needed, and how it will be used. Information analysis is a way station toward an action. It’s part of the process of gathering evidence to support a decision. If you just gather information for the sake of having it, it may be interesting, but it’s not useful.
What Rudin said is that our approach to why we gather information is evolving. It has moved from “Tell me our status” to “Tell me why it’s happening” to today’s, “What should I do about it?” But, he says, that’s not enough. Because you also have to decide to act on that recommendation in order to change a process, change a metric, change a policy or change behavior. People who can ask the right questions, balance the science and the art, and act on the conclusions will redefine the role of the data scientist or the analyst in the organization. And change the organization in the process.
During the course of our research on search in the cloud, we’ve been collecting some case studies on uses of SaaS-delivered search. Two of the major reasons that companies give us for moving their search to a cloud-based model are that first, they need a scalable, flexible model that can vary with the demands of the business, and second, that search is not their core business so they prefer to rely on outside experts who can deliver a solid reliable foundation on which they can build specialized applications.
Businesses that are dynamic information exchanges require this kind of scalable reliability. They need to make their information available quickly, and cater to the dynamics of their users. Search is critical to their business. Prezi (prezi.com) is a good example of this kind of use. This cloud-based software company enables its customers to brainstorm and collaborate, create unusual presentations, and share the results, no matter their location or device. Their search needs at this stage are basic—good matching of queries to documents and quick updating of their index. They started with about 200 million documents, but they expect the volume to grow to 1 terabyte, doubling annually. Prezi did not want to hire or develop the expertise to build search from scratch, and they needed flexible, scalable search to match their growing business. Their customers need to find materials both they and others have developed, and they want to find images by topic without the time-consuming delays of creating and standardizing tags.
To make its materials searchable quickly and easily, Prezi developed a database of images that are associated with the text in the same slide. The contents change constantly, however, and they need to upload those images and make them searchable automatically using the related text. Furthermore, they anticipate adding and indexing new sources. For this purpose, they envisioned using search as “a materialized view over multiple sources.” In other words, a single gateway to all their information.
To accomplish this, they needed stable, reliable and expandable search. The materials had to be accessible to Prezi’s users no matter their device or location. Peter Neumark, a Prezi software engineer, told us that they were looking for search that they could “pay for, use and forget about.”
Selecting a Search Infrastructure
Prezi’s previous search solution was slow, and didn’t function well enough as a key-value store. They also required a solution that allowed them to relate an image to its neighboring text easily. They decided to look at Amazon’s CloudSearch to solve these problems and deliver relevant material to searchers quickly and reliably. In other words, they were looking for search that “just worked”. They didn’t want to maintain it themselves, and they wanted to continue using the AWS APIs, with which they were already familiar.
When they did head-to-head testing, they found that CloudSearch was cheaper, faster, more reliable and expandable, and easier to sync with their Amazon DynamoDB database. They liked its auto-scaling features that would grow with their data and their business.
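As a rough illustration of the pattern Prezi describes—pairing each image with its neighboring slide text and pushing the result into CloudSearch as a batch—here is a minimal Python sketch. The field names, document IDs and endpoint are hypothetical, and the actual upload call is commented out because it requires a live CloudSearch domain:

```python
import json
# import boto3  # uncomment to perform a real upload against a live domain

# Hypothetical records pairing an image with the text on the same slide.
slides = [
    {"id": "img-001", "image_url": "https://example.com/a.png",
     "slide_text": "Q3 revenue growth by region"},
    {"id": "img-002", "image_url": "https://example.com/b.png",
     "slide_text": "Team brainstorm: product roadmap"},
]

# CloudSearch accepts a JSON batch of "add" operations.
batch = [
    {"type": "add", "id": s["id"],
     "fields": {"image_url": s["image_url"], "text": s["slide_text"]}}
    for s in slides
]
payload = json.dumps(batch).encode("utf-8")

# client = boto3.client(
#     "cloudsearchdomain",
#     endpoint_url="https://doc-mydomain.us-east-1.cloudsearch.amazonaws.com")
# client.upload_documents(documents=payload, contentType="application/json")

print(len(batch), batch[0]["type"])
```

Because the index is built from the slide text, a query such as “roadmap” would match the second image without anyone having tagged it by hand—which is the point of Prezi’s approach.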
Rolling out CloudSearch and Future Plans
Prezi are “happy campers”. They deployed CloudSearch in 3 weeks, and are seeing lower cost, lower latencies, and virtually no need to pay attention to their basic search foundation. Their next step will be to roll out additional domains and sources. They like the idea of adding domains rather than changing the initial schema. They will also make the search function more visible on their site, now that they no longer need to worry about its reliability and speed.
With its acquisition last week of AlchemyAPI, IBM’s Watson Group added new tools and expertise to its already-rich and growing array. AlchemyAPI’s technology complements and expands the core IBM Watson features. It collects and organizes information with little preparation, making it a quick on-ramp for building a collection of information that is sorted and searchable. It works across subject domains, and doesn’t require the domain expertise that the original Watson required. Its unsupervised deep learning architecture is designed to extract order from large collections of information, including text and images, across domains.
In contrast, the original Watson tools used to understand, organize and analyze information demand some subject expertise. For best results, experts are required to build ontologies and rules for extracting facts, relationships and entities from text. The result is a mind-boggling capability to hypothesize, answer questions, and find relationships, but it takes time to build and is specific to a particular domain. That is both good and bad: expert-built models provide a depth of understanding, but at a significant cost in terms of time to get up and running. The Watson tools are also text-centered, although significant strides have been made to add structured information as well as images and other forms of rich media.
AlchemyAPI was designed to solve precisely these problems. It creates a graph of entities – and the relationships among them, with no prior expectations for how this graph will be structured. It is entirely dependent on what information is in the collection. Again, this is both good and bad. Without subject expertise, topics that are not strongly represented in the collection may be missing or get short shrift. Both approaches have their limits, as well as their advantages. Experts add a level of topic understanding—of expectations—of what might be required to round out a topic. Machines don’t. But machines often uncover relationships, causes and effects, or correlations that humans might not expect. Finding surprises is one of the strongest arguments for investing in big data and cognitive computing.
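To make the contrast concrete, here is a toy sketch of the kind of schema-free entity graph described above. Real entity extraction uses deep learning over raw text; this sketch starts from already-extracted entity lists and simply counts co-occurrences, which is enough to show how the graph’s shape depends entirely on the collection rather than on a predefined ontology:

```python
from collections import defaultdict
from itertools import combinations

# Toy corpus: each document reduced to the entities mentioned in it.
documents = [
    ["IBM", "Watson", "AlchemyAPI"],
    ["Watson", "Jeopardy"],
    ["IBM", "AlchemyAPI"],
]

def build_entity_graph(docs):
    """Weight each pair of entities by how often they co-occur in a document.

    There is no prior schema: edges exist only where the collection puts them.
    """
    graph = defaultdict(int)
    for entities in docs:
        for a, b in combinations(sorted(set(entities)), 2):
            graph[(a, b)] += 1
    return dict(graph)

graph = build_entity_graph(documents)
print(graph[("AlchemyAPI", "IBM")])  # edge weight grows with the evidence
```

Note the weakness the post points out: an entity that appears rarely in the corpus gets a thin, unreliable neighborhood in the graph, with no expert on hand to fill the gap.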
In this acquisition, Watson continues the path that helped it win Jeopardy!—by combining every possible tool and approach that might increase understanding. IBM can now incorporate multiple categorizers, multiple schemas, multiple sources, and multiple views and then compare the results by the strength of their evidence. This gives us more varied and rich results since each technology contributes something new and crucial. Like the best human analysts, the system collects evidence, sorts through it, weighs it, and comes to more nuanced conclusions.
The Watson platform adds a major piece to information systems that is often unsung. It orchestrates the contributions of the technologies so that they support, balance and inform each other. It feeds back answers, errors, and user interactions to the system so that Watson learns and evolves, as a human would. In this, it removes some of the maddening stodginess of traditional search systems that give us the same answers no matter what we have learned. In seeking answers to complex, human problems, we need to find right answers, perhaps some wrong answers to sharpen our understanding, and certainly the surprises that lurk within large collections. We want a system that evolves and learns, not one that rests on the laurels of a static, often outdated ontology.
Mirroring this technology architecture, IBM’s Watson Group similarly requires a group of closely knit, strong-minded people who are experts in their separate areas of language understanding, system architecture, voting algorithms, user interaction, probability, logic, game theory, etc. Alchemy contributes its staff of deep learning experts, who are expected to join the Watson Group. It also brings its 40,000 developers worldwide, who will broaden the reach and speed the adoption of cognitive computing.
Like human medicine, veterinary medicine has leaped into the digital age, embracing big data, telemedicine, online access for customers, online education for practitioners, digital marketing, and social media. Both sets of practitioners are also under increasing pressure to handle more patients in less time, and to keep up with a growing body of research that becomes outdated quickly.
However, there are some key differences between human and animal medical practitioners. Complex as human medicine is, it still targets only one species. Veterinarians, however, must be prepared to deal with everything from anacondas to zebras, and conditions that range from general wellness and internal medicine to cardiology, oncology and beyond. And, their patients can’t talk.
LifeLearn is a spin-off from the University of Guelph’s Ontario Veterinary College in Canada. It was founded 21 years ago with the goal of providing educational and support services, resources, technology and tools to veterinary practices. As the field has evolved, though, so has the company. LifeLearn’s Innovations Group is betting on new technologies like digital monitoring devices for animals to provide solid data on patients.
When the chance to partner with IBM’s Watson came along, it seemed to Jamie Carroll, LifeLearn’s CEO, and Dr. Adam Little that creating a better digital assistant could solve some of the problems that veterinarians face today. LifeLearn is one of the first partners selected by IBM Watson and is using the technology to develop a cognitive veterinary assistant, called LifeLearn Sofie™, that can ingest massive amounts of data, and forage in real time for clues and connections that will allow a veterinarian to diagnose an animal’s condition quickly and accurately. Like other Watson-based assistants being developed for physicians, LifeLearn’s Sofie is training a veterinary version of Watson that uses the information it has amassed and analyzed to generate evidence-based hypotheses and suggest the best treatment options.
Preparing the content for Watson has been a massive undertaking. Working with leading hospitals, LifeLearn has reduced that process from weeks to hours. The LifeLearn staff have also had to train Watson to answer nuanced, complex questions for which there is no single right answer. For each topic, their Watson trainers must create a set of questions that would be germane to a vet working through a case. They are now able to produce 25,000 question/answer pairs per month.
LifeLearn has not only built the underlying knowledge base, but also analyzed how veterinarians gather and use information. Based on their decades of experience, they have developed an interactive application that enables veterinarians to ask questions and receive the top answers, scored for confidence. The system learns from each interaction, and from feedback from users, who are asked to score the responses for relevance, quality of information, and appropriate length and depth of answers.
LifeLearn’s goal is to make Sofie a specialist in every corner of veterinary science. To succeed, they must uncover how veterinarians make decisions. But there is an additional challenge: to educate veterinarians to understand the promise and limitations of cognitive computing—that there is no right answer, only some that are more appropriate than others, given the patient, its owner, and the circumstances of the medical condition. Living with uncertainty and complexity, and providing guidance in how to do this as well as possible is the aim of applications like LifeLearn’s Sofie.
We can all use a good personal assistant, one that keeps our health in mind, not just our appointments. This assistant needs to understand who we are today: our current state of mind, our location and our preferences. Recommendations on how to keep fit in July won’t work in January if you are snowed in with the flu. Instead, we need a sympathetic advisor who urges chicken soup instead of cookies, and suggests a hot shower, a nap and perhaps some gentle stretches for the aches.
This post may seem a far cry from our normal focus on cognitive computing, but in fact, it showcases one of the major leaps forward that cognitive computing will promote: true individualized recommendations that are presented within the framework of who you are, where you are, how you’re feeling, and what you like to do. Over the last two years, healthcare in particular has moved into the world of big data in order to provide individualized recommendations that are backed up with sound evidence. From cancer diagnoses to congestive heart failure, vast amounts of data have been mined to uncover new treatments or prevent hospital readmissions.
Cognitive computing is also moving into disease prevention. Welltok®, rather than focusing on disease and diagnoses, has developed a Health Optimization Platform™, CaféWell®, to help healthcare plans, providers and employers keep consumers healthy and reward healthy behavior. The platform is a well-integrated combination of curated health and nutrition information and social and gaming technologies that drive consumer engagement.
To deliver more individualized health programs, Welltok partnered with IBM Watson in 2014 to add cognitive computing capabilities, thereby creating a personalized experience for consumers. The CaféWell® Concierge application powered by IBM Watson learns constantly from its users, so that it evolves to offer better, more appropriate suggestions as each individual uses the system. Jeff Cohen, Welltok’s co-founder and lead for their IBM Watson project, tells us that their goal is to make their existing platform more intelligent about each member’s health conditions and context. CaféWell strives to answer the question, “What can I do today to optimize my health?” for each of its members.
To accomplish this goal, Welltok starts with good information on health, exercise and nutrition—from healthcare systems and well-respected structured and unstructured data sources. It factors in individual information about health status, available benefits, demographics, interests and goals. The IBM Watson technology parses and processes this information to find facts, patterns and relationships across sources, using a proprietary Welltok approach. Welltok also adds its taxonomy of healthcare concepts and relationships. Then it creates question-answer pairs to train the system. These question-answer pairs are a key ingredient to help Watson enrich implicit queries.

Welltok also provides navigation so that users don’t get lost as they seek answers. Free-flowing dialog between the user and the system is one of the earmarks of a cognitive application, but users need hints and choices in order to avoid frustration. Welltok provides these, constantly updating and retraining the system as it learns to predict pathways through the information. The information is filtered for each member’s health plan coverage and individual profile. Cognitive computing also incorporates temporal and spatial facets, so that the recommendations are suitable for the user’s time and place. This eliminates information dead ends by preventing inapplicable information from being displayed.
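As a hedged illustration of that filtering step—not Welltok’s actual implementation—the sketch below shows how plan coverage, season and a personal preference might narrow and rank a catalog of suggestions. All names and data here are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    activity: str
    covered_by: set   # health plans that cover this activity
    seasons: set      # when the activity applies
    indoor: bool

# Hypothetical catalog and member profile.
catalog = [
    Recommendation("outdoor group run", {"planA"}, {"summer"}, False),
    Recommendation("indoor yoga class", {"planA", "planB"}, {"summer", "winter"}, True),
    Recommendation("ski clinic", {"planB"}, {"winter"}, False),
]

def personalize(catalog, plan, season, prefers_indoor):
    """Drop anything the member's plan doesn't cover or that doesn't fit
    the current season, then rank preferred settings first."""
    eligible = [r for r in catalog
                if plan in r.covered_by and season in r.seasons]
    # False sorts before True, so matches to the preference come first.
    return sorted(eligible, key=lambda r: r.indoor != prefers_indoor)

picks = personalize(catalog, plan="planB", season="winter", prefers_indoor=True)
print([r.activity for r in picks])  # → ['indoor yoga class', 'ski clinic']
```

The dead-end elimination the post describes falls out of the first filter: an activity the member’s plan doesn’t cover is never shown at all, rather than shown and then rejected.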
In addition to relevance, members are given incentives to participate and they are rewarded as they pass certain milestones. More importantly, the system learns their preferences and what motivates them to be healthy. For example, if you are only interested in exercising in groups, that’s what will be recommended, but if you prefer walks in the woods, you’ll instead get tips on places to walk, or mileage and terrain for common routes.
The Welltok use of cognitive computing has all the earmarks of a cognitive system. It’s dynamic and it learns. It parses both information sources and the user’s situation deeply, and matches the individual to the information and the recommendations. It is interactive, and it devours data—the more, the better.
One of the most fertile areas of development for cognitive applications is in this area of intelligent personal advisors. Suggestions for actions that are tailored to who you are make it more likely that you will try them. Now, where did I put the chicken soup?
Since 2001, we have periodically surveyed information workers to find out how much time they spend on a variety of tasks related to information work. These include time spent reading and answering email, searching, writing, creating presentations, entering data, managing information, and translating. We have also asked them to calculate the time they waste in reformatting information, wrestling with multiple versions of documents, duplicating other workers’ efforts or not finding the information they need. Up until now, our conclusions have remained the same: information work, with its disconnected tools and repetitive tasks, hinders information workers from accomplishing their core tasks. All too often, we have found, it is the small but time-consuming things that get in the way of accomplishing something important.
That may finally be changing as new applications come to market that are designed to reduce repetitive information tasks. Often, these are seemingly inconsequential jobs that, because of their sheer number, require thousands or even hundreds of thousands of actions, all of which are predictable and repeatable. Like, for instance, looking at patient records to extract the appropriate ICD-9 and ICD-10 codes from doctors’ notes. It turns out that properly trained content analytics applications, as IBM told us, can do this efficiently and more accurately than weary human coders who are subject to burnout.
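A toy sketch suggests why this kind of task is so automatable: the mapping from clinical language to codes is predictable and repeatable. Real systems use trained content analytics rather than a hand-written keyword table, so the lookup below is illustrative only:

```python
# Illustrative keyword-to-code table; a production system would use a
# trained classifier over full notes, not substring matching.
ICD10_KEYWORDS = {
    "hypertension": "I10",
    "type 2 diabetes": "E11",
    "asthma": "J45",
}

def suggest_codes(note):
    """Return sorted candidate ICD-10 codes found in a doctor's note."""
    note = note.lower()
    return sorted({code for term, code in ICD10_KEYWORDS.items() if term in note})

print(suggest_codes("Patient presents with hypertension and type 2 diabetes."))
# → ['E11', 'I10']
```

Even this crude version never tires at the end of a shift, which is the core of the argument IBM made to us about weary human coders.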
Here’s another example of a seemingly small task that can take on gigantic proportions. Infinote tells us that every time there is a regulatory change, pharmaceutical companies (or any regulated industry) must find and change every document that is affected. Just revising the wording to reflect the new regulation on use of safety goggles turns out to be a terrible time sink. At one pharmaceutical company, this regulatory change affected an estimated 900 documents—one per lab. Ferreting out these documents was no easy task, nor was changing them one at a time. It would have taken an estimated 6-8 weeks of elapsed time before all the changes were complete.
To address this problem, Infinote, founded by executives from Genentech, has created a specialized search, analysis, update and audit application that includes an add-on to Microsoft Word. Finding and changing all the documents that contained the now-obsolete wording took a matter of minutes with Infinote. This kind of application is part of the wave of the future: it is easy to use; it integrates multiple information sources and technologies in a single application; it supports a specific process within a familiar work environment, and eliminates repetitive tasks so that information workers can work more productively.
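The core pattern is straightforward to sketch, even though doing it at scale inside Word documents, with auditing, is what makes a product like Infinote’s valuable. The obsolete and replacement wording below is hypothetical:

```python
import re

# Hypothetical wording change driven by a regulatory update; a real
# deployment would scan Word documents in a repository, not strings.
OBSOLETE = re.compile(r"safety goggles are optional", re.IGNORECASE)
REPLACEMENT = "safety goggles are required"

def update_documents(docs):
    """Apply the wording change everywhere it occurs.

    Returns (updated_docs, ids_of_changed_docs) so the changes are auditable."""
    changed, updated = [], []
    for doc_id, text in docs:
        new_text, n = OBSOLETE.subn(REPLACEMENT, text)
        updated.append((doc_id, new_text))
        if n:
            changed.append(doc_id)
    return updated, changed

docs = [("lab-001", "In this lab, safety goggles are optional."),
        ("lab-002", "Gloves must be worn at all times.")]
updated, changed = update_documents(docs)
print(changed)  # → ['lab-001']
```

Against the 900 affected documents in the pharmaceutical example, the same loop would run in seconds; the six-to-eight-week figure comes from finding and editing the documents one at a time by hand.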
Making information accessible is hard work. Certainly, there are new tools that can analyze massive amounts of data in order to bootstrap information management. However, there’s a point at which human expertise is required.
I just read a report on how and why to model search behavior from Mark Sprague, Lexington eBusiness Consulting, http://msprague.com. Mark has been in the search business as long as I have. He helps organizations understand what their customers are looking for, and what impact their information access/search design will have on customers’ ability to find what they are seeking. The report I read discusses a consumer search behavior model he built for the dieting industry. In it, Sprague explains that a good search behavior model starts by gathering data on what users are searching for, but that’s just the beginning. Building a behavior model can affect your information architecture, the content you post on your site, how you incorporate the search terms customers use into that content, which featured topic pages will attract views, the SEO strategy the model drives, and changes to existing PPC strategies. Sprague finds top queries, then uses them to generate titles and tags that fit the terms users are searching for—particularly the phrases. He also categorizes queries into a set of high-level topics with subtopics. These categories can and should affect the organization of a Web site, enabling users to browse as well as search.
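The first two steps Sprague describes—finding the top queries, then bucketing them into high-level topics—can be sketched in a few lines. The query log and topic table here are invented for illustration; a real model would derive the categories from the data rather than hand-write them:

```python
from collections import Counter

# Hypothetical query log for a dieting site.
queries = [
    "low carb recipes", "low carb recipes", "keto meal plan",
    "how many calories in an apple", "keto meal plan", "low carb snacks",
]

# Step 1: find the top queries (and, importantly, the exact phrases used).
top = Counter(queries).most_common(3)

# Step 2: crude topic buckets of the kind the report describes.
TOPICS = {
    "low carb": "Low-carb diets",
    "keto": "Keto diets",
    "calories": "Calorie counting",
}

def categorize(query):
    for key, topic in TOPICS.items():
        if key in query:
            return topic
    return "Other"

print(top[0])                        # the phrase to echo in titles and tags
print(categorize("keto meal plan"))  # a candidate browse category
```

The top phrases feed page titles and tags, and the topic buckets suggest the featured topic pages and site sections Sprague recommends building around them.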
Sprague has observed that at each stage of the online buying process, from research to deciding to purchasing, the query terms differ. This difference can be thought of as an indication of intent, and it can be used to tailor results for an individual as the user moves from one part of the process to the next. Finally, Sprague uses the terms to perform a cost-benefit analysis to improve SEO.
This thoughtful approach starts with observing user behavior and models the information architecture and Web site to fit—not the other way around. That’s smart, and it’s good business.
Information has always been central to the functioning of an enterprise. Today, with the fast pace of business, access to the right information at the right time is critical. Enterprises need information to track the status of the organization; to answer questions; to alert it to changes, emergencies, trends, opportunities or risks; to predict, model and forecast their business.
To this, we must add one more information goal, one that is so valuable but so elusive that it has been little more than a dream: to find the unexpected: the unknown threat, the unknown opportunity. These so-called black swans lurk on the edge of our understanding, obscured by the over-abundance and scattered nature of information in the organization today.
Big data tools and technologies have been developed to help manage, access, analyze and use vast quantities of information. Big data is often defined by the three V’s: Volume (the amount of data); Velocity (the speed at which it arrives); and Variety (the number of data types or formats). But the value in big data is not really rooted in its abundance, but rather in how it is used. Big data tools enable us to understand trends and answer questions with a degree of certainty that was not possible before—because we did not have enough data to support our findings. Big data approaches to healthcare are starting to enable treatments that take into consideration the particular characteristics of a patient—their age, history, or genetic makeup. We use these characteristics as a filter or lens on the medical research literature, focusing what we know within the context of that patient. Given enough data, we can also find unexpected patterns. For instance, one project uncovered previously unknown markers for predicting hospital readmissions for congestive heart failure, saving a health organization millions of dollars. Big data techniques have helped predict the next holiday retail season, uncover patterns of insurance fraud, and spot emerging trends in the stock market. We use these tools to find out if customers are satisfied with our products, and if not, why not. Political campaigns use them and so do managers of baseball teams.
Briefly, then, big data gives us plenty of data to analyze overall trends and demands, but it also helps us understand individuals within the context of a solid set of information about people who are like them. Instead of aiming at a mythical “average”, it lets us treat customers, patients and voters as individuals.
With new technologies like big data, we are at the beginning of a very complex new relationship between man and machine. Machines can find patterns and make recommendations; but people need to test these patterns for reality, and they also need to be able to hypothesize and test results. Used wisely, big data could improve customer service, healthcare, or government by allowing us to dig more deeply. Used wisely, these tools will also help us to make our organizations more flexible and adaptable in a fast-changing world.
By their very nature, good mobile applications must be smarter. The physical limitations (a small screen, and input via one or two fingers or unpredictable voice recognition) mandate that a mobile app anticipate what you want to do and make it easy to get there. No drop-down boxes, no chains of queries when the first one is off the mark, not much scrolling, and very little clicking to get to a new screen. Forget cut and paste. For these reasons, mobile applications must be both smarter at understanding what you want and intelligently designed. That’s hard.
Enter cognitive computing. If an application can really understand what the user intends, if it can classify questions and predict the kind of answer or action needed, then there will be less burden placed on the user to adapt to the limitations of the app. But cognitive computing requires real natural language understanding (NLP) as well as machine learning and classification. It also requires a corpus of examples to learn from. This is a level of technical prowess that would be impossible for most startups to develop. Enter IBM’s Watson. Watson Foundations was released last quarter. And now IBM has announced the Watson Mobile Developer Challenge. This contest invites app developers to submit a proposal to develop an application on the IBM Watson platform. Developers must make a case for what they propose, demonstrating why it would be valuable. Winning apps will capitalize on Watson’s strengths:
- Have a question and answer interaction pattern, with questions posed in natural language
- Draw on mostly unstructured (text) information for answers
- Return answers that are ranked according to their pertinence to the question
- Benefit by better understanding (analysis) of the type of question being submitted
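The criteria above amount to a single interaction shape: a natural-language question in, candidate answers drawn from text, ranked by confidence, out. The toy word-overlap score below is a stand-in for illustration, not Watson’s actual scoring:

```python
def answer(question, candidates):
    """Rank candidate answer passages by a toy confidence score
    (fraction of question words found in the passage)."""
    q_words = set(question.lower().split())
    scored = []
    for text in candidates:
        overlap = len(q_words & set(text.lower().split()))
        confidence = overlap / max(len(q_words), 1)
        scored.append((text, round(confidence, 2)))
    # Highest-confidence answer first, as the contest criteria require.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = answer(
    "what is cognitive computing",
    ["Cognitive computing is a class of systems that learn.",
     "Mobile screens are small."],
)
print(ranked[0][0])
```

On a phone, this shape matters: the user asks once, in their own words, and the top-ranked answer appears without drop-downs, query chains, or scrolling.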
The catch is that applications are due by March 31st.
This contest brings cognitive computing within the reach of developers. Watson supplies the NLP tools, question analysis, machine learning, and confidence scoring that would otherwise place cognitive computing beyond the reach of most vendors. For more information, see IBMWatson.com. The application and rules can be found at:
Heuristics in Analytics
By Carlos Andre Reis Pinheiro and Fiona McNeill. Wiley, 2014
With all the hype about big data, it’s refreshing to find a book that discusses the practical aspects of analytics. Heuristics in Analytics makes a clear case for adding human experience and common sense to technology in order to solve real-world business problems. Written in clear, non-mathematical language, the book explains how using heuristics together with analytics is often the fastest way to deliver decisions that are suitable for a specific use case, and quickly enough to fit the fast pace of business. The descriptions of heuristics concepts and guidance for how to use a heuristic approach to analytics should make this book a valuable addition to the manager’s and practitioner’s libraries.
With its roots in statistics, analytics tends to be highly theoretical, and analyses are often misunderstood. One of the concepts that the marketplace has difficulty understanding is that analytics looks at trends, and that there will therefore be outliers and inconsistencies within any data set; we are examining a collection of data in the aggregate, and each data point cannot be expected to fit what many perceive to be a rule. Heuristics in Analytics makes this point adeptly:
“Unexpected events will always take place, and they will always impact predicted outcomes. Analytical models work well for the majority of cases, for most of the observations, and in most applications. Unexpected or uncontrolled events will always occur and typically affect a few observations within the entire modeling scenario.”
Many a sentiment analysis tool has been rejected because we expect precision and accuracy from it rather than broad trends. Marketing managers remain suspicious because they find errors in classification, not realizing that people are just as error-prone. The difference is that computers make dumb computer errors, while human errors in judgment may be due to differences in interpretation, in bias or pure exhaustion at the end of a long day.
I hope that Heuristics in Analytics will help to correct some of this misunderstanding. And like its authors, I also recommend the seminal works they note:
- Leonard Mlodinow: The Drunkard’s Walk: How Randomness Rules Our Lives
- Malcolm Gladwell: Blink: The Power of Thinking Without Thinking
To these I’d add a couple of my own favorites:
- Mitchell Waldrop: Complexity: The Emerging Science at the Edge of Order and Chaos. New York: Simon and Schuster. (1992)
- Nate Silver: The Signal and the Noise: Why So Many Predictions Fail — but Some Don’t
- Nassim Nicholas Taleb: The Black Swan
All of these address the role that probability plays (or should play) in how we make decisions.