Data analysis involves using data to gain insight into the world around us. Weather forecasting and calculating the average body mass of a population are two examples of this. Academics and businesses alike use data to make predictions and gain a better understanding of their areas of expertise.
Data can be qualitative or quantitative, structured or unstructured, digital or manual, and can come in a variety of formats. Each new type of data presents a challenge: data scientists and analysts must work out how to collect it and prepare it for analysis.
Contrasting structured and unstructured data
Ellen and Charlie are both highly esteemed scientists in their respective fields. Ellen is meticulous in her record keeping, taking care to document her research in spreadsheets. In comparison, Charlie is prone to forgetting to record his findings straight away, often writing them down on any convenient surface he can find.
Imagine we asked each scientist to hand over their database. Ellen would upload a spreadsheet file, whereas Charlie would hand over a folder of paper sheets with numbers and dates written on them.
Ellen's spreadsheet is an example of structured data: it follows a recognisable pattern and can be described by a model. Charlie's notes, by contrast, are unstructured data. They are just as valuable as Ellen's, provided they contain no errors, but they take more effort to collate and assess.
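To give a feel for the extra effort involved, here is a minimal sketch of turning notes like Charlie's into structured rows. The note wording, the two date layouts and the names `parse_note` and `parse_date` are all assumptions made up for this illustration; real notes would need far more cases.

```python
import re
from datetime import datetime

# Date layouts we assume Charlie might have used (an assumption for this sketch).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

# A date in one of the layouts above, some non-digit filler, then a number.
NOTE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})"
    r"\D+?"
    r"(?P<value>-?\d+(?:\.\d+)?)"
)


def parse_date(text):
    """Try each known layout and return a date, or None if none fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def parse_note(line):
    """Turn one scribbled note into a structured row, or None if unreadable."""
    match = NOTE_PATTERN.search(line)
    if match is None:
        return None
    date = parse_date(match.group("date"))
    if date is None:
        return None
    return {"date": date, "value": float(match.group("value"))}


# Hypothetical notes: two usable, one that cannot be recovered.
notes = ["12/03/2021 temp was 4.5", "2021-03-14: 7", "coffee stain, unreadable"]
rows = [row for row in map(parse_note, notes) if row is not None]
print(rows)
```

Even in this tiny case, every new way of writing a date or a number needs its own rule, which is why unstructured data is costly to prepare by hand.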
Computerworld has reported that between 70 and 80 percent of data is unstructured. While this may look like a hindrance, it is common and far from insurmountable. Charlie's habits, though not ideal, are in line with this statistic.
Given the constraints of time and resources, cleaning and structuring data can be difficult. And when the data cannot sensibly be converted into a structured format at all, alternative solutions must be considered.
Data analysis with the assistance of AI
The advances in Artificial Intelligence (AI) over the last few decades have opened up a range of new opportunities in data processing. Machine learning and intelligent assistants have enabled us to quickly gather, refine and analyse amounts of data that were previously unimaginable.
Machine learning is a set of algorithms designed to analyse data, learn from it, and apply what they have learned to new situations. It is unfamiliar to many people, but it is a powerful tool when used correctly.
Streaming services use machine learning to identify patterns in users’ viewing habits in order to create a pool of suggestions tailored to users with similar likes. Similarly, online retailers use browsing and purchase patterns to make educated guesses about products that may be of interest to the customer.
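The idea of suggesting titles based on users with similar tastes can be sketched in a few lines. This is a toy illustration, not how any particular streaming service works: the ratings, the titles and the `recommend` function are invented, and cosine similarity is just one common way to compare users.

```python
import math


def cosine_similarity(a, b):
    """Similarity between two rating vectors (0.0 when either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target, ratings, top_n=2):
    """Suggest titles the most similar user rated but the target has not seen."""
    titles = sorted({title for user in ratings.values() for title in user})

    def as_vector(user):
        # Unrated titles count as 0 in this simplified model.
        return [ratings[user].get(title, 0) for title in titles]

    others = [user for user in ratings if user != target]
    nearest = max(
        others,
        key=lambda user: cosine_similarity(as_vector(target), as_vector(user)),
    )
    unseen = {
        title: score
        for title, score in ratings[nearest].items()
        if title not in ratings[target]
    }
    return sorted(unseen, key=unseen.get, reverse=True)[:top_n]


# Hypothetical viewing data: user -> {title: rating out of 5}.
ratings = {
    "ana": {"Space Saga": 5, "Robot Wars": 4},
    "ben": {"Space Saga": 5, "Robot Wars": 5, "Star Drift": 4},
    "cy": {"Baking Duel": 5, "Garden Hour": 4},
}
print(recommend("ana", ratings))
```

Ana's ratings overlap with Ben's and not with Cy's, so the suggestion comes from Ben's list.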
In this context, “learning” refers to the process by which an algorithm is optimised through exposure to increasing volumes of data. It can be likened to a skill that is refined with repetition.
Simple machine learning models, such as those based on linear regression, can be extended into more complex models designed to tackle the challenges posed by unstructured data.
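As a concrete example of the simple end of that range, the following sketch fits a straight line to a handful of points using ordinary least squares and then uses it on a new input. The data (hours of sunshine against ice creams sold) is made up, and `fit_line` and `predict` are names chosen for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned line to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept


# Made-up observations: hours of sunshine vs. ice creams sold.
hours = [1, 2, 3, 4, 5]
sales = [12, 19, 31, 42, 48]
model = fit_line(hours, sales)
print(predict(model, 6))
```

The "knowledge" here is just two numbers, the slope and the intercept. More complex models follow the same pattern with many more parameters.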
Deep learning has arrived
Deep learning is an area of machine learning that uses layered neural networks, loosely inspired by the human brain, to mimic aspects of human decision-making. It has been applied to a wide range of tasks, such as social media filtering, image recognition and speech recognition.
We can return to our scientist Charlie for an illustration. Shortly before he submitted his database, a can of Coke was spilled over the folder containing his records. Some of the data is now smudged, making the figures hard to read. A human can make an educated guess at what each number is supposed to be; a conventional automated algorithm is likely to struggle.
Deep learning stacks many simple processing units in layers to form a neural network. The network provides an answer, measures how far that answer is from the correct one, and adjusts its internal weights so that its next estimate is more accurate.
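The cycle of providing an answer, assessing it and adjusting can be sketched with a very small layered network written in plain Python. Everything here is an assumption for illustration: the `TinyNetwork` class, the toy XOR task, the three hidden units and the learning rate. Real deep learning systems use many more layers and dedicated libraries.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyNetwork:
    """Inputs -> one hidden layer -> a single output, all using sigmoid units."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                         for _ in range(n_hidden)]
        self.b_hidden = [0.0] * n_hidden
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.b_out = 0.0

    def forward(self, x):
        """Provide an answer: pass the input through both layers."""
        hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                  for ws, b in zip(self.w_hidden, self.b_hidden)]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out)
        return hidden, out

    def loss(self, samples):
        """Assess accuracy: mean squared error over all samples."""
        return sum((self.forward(x)[1] - y) ** 2 for x, y in samples) / len(samples)

    def train_step(self, samples, rate):
        """Adjust: one full-batch gradient descent step (backpropagation)."""
        n = len(samples)
        g_w_hidden = [[0.0] * len(ws) for ws in self.w_hidden]
        g_b_hidden = [0.0] * len(self.b_hidden)
        g_w_out = [0.0] * len(self.w_out)
        g_b_out = 0.0
        for x, y in samples:
            hidden, out = self.forward(x)
            # Gradient of the loss with respect to the output unit's input.
            delta_out = 2 * (out - y) * out * (1 - out) / n
            g_b_out += delta_out
            for j, h in enumerate(hidden):
                g_w_out[j] += delta_out * h
                # Pass the error back to hidden unit j.
                delta_h = delta_out * self.w_out[j] * h * (1 - h)
                g_b_hidden[j] += delta_h
                for i, xi in enumerate(x):
                    g_w_hidden[j][i] += delta_h * xi
        for j in range(len(self.w_out)):
            self.w_out[j] -= rate * g_w_out[j]
            self.b_hidden[j] -= rate * g_b_hidden[j]
            for i in range(len(self.w_hidden[j])):
                self.w_hidden[j][i] -= rate * g_w_hidden[j][i]
        self.b_out -= rate * g_b_out


# Toy task (XOR): the output should be 1 when exactly one input is 1.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
net = TinyNetwork(n_inputs=2, n_hidden=3, seed=1)
loss_before = net.loss(samples)
for _ in range(2000):
    net.train_step(samples, rate=0.5)
loss_after = net.loss(samples)
print(loss_before, loss_after)
```

The two printed numbers are the error before and after training; the second should be smaller, which is all that "learning" means in this setting.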
The ancient strategy game Go, which originated in China, provides an ideal illustration of how deep learning can be used. Google's DeepMind has invested heavily in developing an AI that can play Go. AlphaGo has been a considerable achievement, but building it proved much harder than building AIs that play chess.
The computing power required to calculate every possible outcome of a game of Go is far beyond what is practical. Consequently, an AI built to play Go must be able to make good decisions based on the current state of the board. Like human players, it can fall back on heuristics when it cannot carefully assess every available option.
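Go itself is far too large for a short example, so the sketch below uses a much simpler stand-in game to show the same idea: look a few moves ahead, and fall back on a heuristic when the search cannot go any deeper. The game (players alternately take one to three stones, and whoever takes the last stone wins), the `heuristic` rule of thumb and the function names are all assumptions for illustration. This is not how AlphaGo works internally.

```python
MOVES = (1, 2, 3)  # a player may take one, two or three stones per turn


def heuristic(stones):
    """Rough guess for the player about to move, used when search is cut off.

    Rule of thumb for this toy game: facing a multiple of four looks bad.
    The values are +/-0.5 to mark them as estimates, not certain results.
    """
    return -0.5 if stones % 4 == 0 else 0.5


def negamax(stones, depth):
    """Score a position from the point of view of the player about to move."""
    if stones == 0:
        return -1.0  # the opponent just took the last stone, so we have lost
    if depth == 0:
        return heuristic(stones)  # out of search budget: estimate instead
    return max(
        -negamax(stones - take, depth - 1)
        for take in MOVES
        if take <= stones
    )


def best_move(stones, depth):
    """Pick the move whose resulting position is worst for the opponent."""
    legal = [take for take in MOVES if take <= stones]
    return max(legal, key=lambda take: -negamax(stones - take, depth - 1))


print(best_move(21, depth=2))
```

With a search depth of only two moves, the program cannot see the end of a 21-stone game, yet the heuristic still lets it choose a sensible move. The deeper the search is allowed to go, the less it has to rely on the estimate.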
Working with unstructured data
The primary difficulty encountered when working with unstructured data is that it may exhibit an ambiguous pattern, or even lack any recognisable pattern. Natural language is a prime example of this, as it can be expressed in a vast array of ways, yet can still be understood by humans. Conversely, machines require sophisticated designs and models to process such data correctly.
Deep learning has become a popular answer to this problem. One example is Tesseract, an optical character recognition (OCR) programme whose recent versions use deep learning to recognise text in images. The task sounds simple, but it can be quite difficult.
Some of the photographs you receive may be of low quality: they may be cropped or blurred, or show only part of the text. Every image is different, so instead of designing an algorithm for each individual case, we train a deep learning model that can handle them all, which reduces development time.
As we do not live in a highly ordered world, and humans do not typically process information in an orderly fashion, deep learning has emerged as a powerful AI technique. Although it is still in its early stages, I anticipate that assistants such as Siri will soon become more than just advanced search engines.