Tackling big data means understanding its purpose and mastering its secrets. And because the field is only in its infancy, it's good for us to start directly with the right basics.
Some see the promise of a better world, one that is smart and efficient, where nothing will be left to chance anymore. Others see the rise of an apocalyptic, omnipresent Internet of Everything and consumers enslaved to data they no longer control. Between these two extreme visions of Big Data lies the reality: the discipline is still in its early years, and many real technical obstacles still stand in the way of its potential, and not only in anecdotal cases. Beyond the "Four Vs" challenge (Volume, Variety, Velocity, Veracity), research is proceeding apace, given how fast-changing and significant the issues are. Here is a look at where investigations will lead in the next few years.
Issue #1: Data health
It's the nightmare of every CIO or marketing director who wants to do Business Intelligence: realizing that no matter what tool they use for processing, their real problem lies with the poor quality of the data itself. Hence a first phase, cumbersome but essential: cleaning house by "fixing" the data. This is a real market, often neglected and always underestimated, yet tremendous given how rarely companies make data cleaning part of how they do business.
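As a rough illustration of what "fixing" the data can mean in practice, here is a minimal Python sketch that normalizes and deduplicates a handful of contact records. The field names, the sample records, and the cleaning rules (whitespace and case normalization, a deliberately loose email check, deduplication on email) are all assumptions made for the example; real cleaning rules depend entirely on the data at hand.

```python
import re

def clean_records(records):
    """Normalize and deduplicate contact records (illustrative rules only)."""
    seen = set()
    cleaned = []
    for rec in records:
        # Collapse stray whitespace and unify capitalization.
        name = " ".join(rec.get("name", "").split()).title()
        email = rec.get("email", "").strip().lower()
        # Treat an address without a plausible shape as missing.
        if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
            email = None
        # Deduplicate on the email when there is one, otherwise on the name.
        key = email or name.lower()
        if not key or key in seen:
            continue  # empty or duplicate record
        seen.add(key)
        cleaned.append({"name": name, "email": email})
    return cleaned

# Hypothetical raw input with typical defects.
raw = [
    {"name": "  alice   martin ", "email": "Alice.Martin@Example.com "},
    {"name": "Alice Martin", "email": "alice.martin@example.com"},
    {"name": "bob durand", "email": "not-an-email"},
    {"name": "", "email": ""},
]
cleaned = clean_records(raw)
for rec in cleaned:
    print(rec)
```

Even this toy version shows why the phase is cumbersome: every rule encodes a judgment about what counts as the same record or as a valid value.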
Issue #2: Automated processing of heterogeneous data
How can you cross-reference data from splintered databases or scraped from the web, in different formats, in a smart way? How can you index and aggregate open data automatically? The automated indexing of even a single document could still be improved. And though much experience has been gained in the matter of video indexing, the automated splitting of movies into chapters still leaves much to be desired. The other major challenge of data analytics is the dynamic configuration of scraping algorithms. For large quantities of data, such as in genetics, data processing can take months; being able to freeze a running process to spot errors in configuration and correct them without having to restart the algorithm completely is one of the biggest goals of current research. The very notion of real-time analysis depends on it: we are far from getting there, even though it is essential to some fields.
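The idea of freezing a long-running process, correcting its configuration, and resuming without starting over can be sketched with simple checkpointing. This is a toy Python example, not a description of any research system: the `process` function, the `scale` setting, the JSON checkpoint file, and the `stop_after` parameter (which simulates an operator pausing the run) are all hypothetical.

```python
import json
import os
import tempfile

def process(items, config, checkpoint_path, stop_after=None):
    """Process items in order, saving progress so a paused run can resume."""
    state = {"next_index": 0, "results": []}
    if os.path.exists(checkpoint_path):
        # Resume from the last saved position instead of restarting.
        with open(checkpoint_path) as f:
            state = json.load(f)
    done_this_run = 0
    for i in range(state["next_index"], len(items)):
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulate the operator freezing the run
        state["results"].append(items[i] * config["scale"])
        state["next_index"] = i + 1
        done_this_run += 1
        # Persist progress after every item.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state["results"]

items = [1, 2, 3, 4]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
partial = process(items, {"scale": 10}, path, stop_after=2)  # freeze after two items
final = process(items, {"scale": 100}, path)                 # resume with a corrected setting
print(partial, final)
```

Note that the first two results were computed with the old setting and are kept as-is on resume. Deciding which earlier results a configuration change invalidates is exactly the hard part in a real system.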
Issue #3: Transforming the Big Data experience
At the heart of Big Data's future is the seemingly critical matter of browsing and visualizing data. Given the exponential growth of data volumes (1), being able to browse the data with new devices and, most importantly, new interfaces is becoming a key issue for the quality of the results and for understanding them. How is it possible to remain within the limited framework of computer monitor and mouse when working with such quantities of data? Immersive devices, from Oculus headsets to Augmented Reality glasses to 3D touch screens, are now everywhere. But from interacting with the display to how the data is represented, everything must be consistent and optimized to suit the intended users (decision-makers, scientists, the general public, etc.). Researchers are therefore working on natural interfaces that allow users to "play" intuitively with the data, streamline browsing, and improve the effectiveness of scraping.
The visualization tools themselves are a challenge. As revealed in a North American study (2) in 2014, the adoption of Big Data practices in businesses is critical; however, that adoption will only take place naturally in organizations if the tools are "clear and well designed, with strong visualization qualities." Currently, besides a small number of tools reserved for data scientists (like the vital open-source Gephi) and software like MATLAB (a graphical toolbox for scientists and engineers) and Tableau (an improved version of spreadsheets devoted to Business Intelligence), the lack of creativity in data visualization tools and their very traditional nature limit the power of the images obtained and therefore their impact, particularly in decision support tools. Naturally, artists (the pioneering Mark Lombardi or all the projects that can be found at Visual.ly) and graphic designers have long been working on the subject enthusiastically.
This graphical processing, though it reveals all of the data's potential, remains small-scale, and limits the use of data visualization to key accounts. Whether to add a third dimension (often time, as in the incredible Chronozoom) is still a matter of debate. The dataviz community is mostly against it. However, being able to interact naturally with the data in immersion, without being limited to two dimensions, is the next frontier for Big Data: it is a natural path for our researchers, given how clear the benefits are. In everyday life, our interactions with the environment nearly all use three dimensions; furthermore, research into the brain has demonstrated that 3D visualization stimulates different areas than a 2D view, accelerating comprehension and uptake. The issue of abstract representations is also an obstacle, but these barriers might fall before 2020.
Issue #4: Security and anonymization
Big Data now makes it possible to aggregate minuscule crumbs of information spread out across the Internet and obtain an individual's picture and address without even using cookies. With the widespread adoption of the Internet of Things, data security is shaping up to be a major issue for the future of Big Data. With more and more scandals involving massive data breaches (particularly the famous Target case), businesses and consumers are increasingly coming to realize the extent of security flaws in all of these everyday objects. Yet trust is central to these new markets. E-health is probably the most striking example of these issues: whether used to improve medical knowledge or help with a diagnosis, cross-referencing and therefore exchanging data is essential, requiring reliable encryption and traceability. Anonymization techniques are just as critical: they make it easier for the user to accept the use of highly personal data, but the further anonymity is pushed, the more rich and relevant information is removed. Anonymizing without losing that level of detail has therefore become a major research challenge if medical Big Data is to make progress. Ironically, Big Data can also improve security by making it possible to anticipate cyberattacks based on isolated behaviors.
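The trade-off between anonymity and detail can be made concrete with a small Python sketch. It uses generalization in the spirit of k-anonymity, a technique chosen here for illustration and not named in the article: quasi-identifiers are coarsened until every record is indistinguishable from at least k-1 others. The patient records, the generalization levels, and the function names are all hypothetical.

```python
from collections import Counter

def generalize(record, level):
    """Coarsen quasi-identifiers; a higher level removes more detail."""
    age, postcode = record
    if level == 0:
        return (str(age), postcode)  # full detail
    if level == 1:
        low = age // 10 * 10
        return (f"{low}-{low + 9}", postcode[:3] + "**")  # decade and district
    return ("*", postcode[:2] + "***")  # age hidden, department only

def smallest_group(records, level):
    """Size of the smallest group of records that look identical after generalization."""
    counts = Counter(generalize(r, level) for r in records)
    return min(counts.values())

# Invented (age, postcode) pairs standing in for medical records.
patients = [
    (34, "44000"), (36, "44020"), (38, "44010"),
    (51, "44200"), (57, "44230"), (52, "44210"),
]
k_by_level = {level: smallest_group(patients, level) for level in (0, 1, 2)}
print(k_by_level)
```

With this sample, each step up in generalization enlarges the smallest group, so individuals are harder to single out, but the last level has erased the age entirely, which is precisely the kind of relevant detail medical research would want to keep.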
Issue #5: Datatainment, or how to put people at the center of Big Data
To conclude this R&D-focused vision of Big Data, there is one final major trend that will determine how much it succeeds: how can individuals benefit from all the data they generate, whether knowingly or not, and how can we put data to use in our everyday lives? This is the challenge of one popularized form of Big Data sometimes called datatainment, a more playful vision of Big Data, because it involves giving life to the data, adding personality to it, and creating new modes of representation to foster empathy, or even an emotional bond. The "Mes Infos" project at FING and more preliminary experiments like those of the Ecole de Design de Nantes (the dataquarium) (3) are clearing the way for this essential phase in the overall success of Big Data: blurring the line between those who are generating the data and those who are "using" it.
(1) 5 exabytes is the volume of digital information produced by humanity from the dawn of time to 2003. By the end of 2011, 5 exabytes was generated in two days. By the end of 2013, it took only 12 minutes to generate 5 exabytes...
(2) McKinsey Quarterly, March 2014, Brad Brown, David Court, Tim McGuire.
(3) Intranet with dynamic data visualization using avatars, by and for students of the Ecole de Design de Nantes in 2007. The experiment is now continuing with Crystal Campus.
DELETE Project of the company Acxiom that allows the consumer to access, edit, and limit the data that brands collect about him or her.