We’re all familiar with user devices such as PCs, tablets, and phones. They will remain an important part of Intel’s business, and Intel will continue to invest to maximize returns in these areas. Moving forward, these devices are welcoming billions and billions of things to the Internet: by 2020, 50 billion devices and 212 billion sensors are expected to join the Internet.
At that point, 47 percent of total devices and connections will be machine-to-machine: truly, the rise of the machines. These things will generate tremendous amounts of data. Consider this: by 2020, the average Internet user is expected to generate approximately 1.5 gigabytes of traffic per day, up from 650 megabytes in 2015.
That is certainly a huge amount of data, until you consider the machines: a smart hospital will generate 3,000 gigabytes per day, a self-driving car over 4,000 gigabytes per day, a connected plane 40,000 gigabytes per day, and a connected factory 1 million gigabytes per day. This data needs to be analyzed and interpreted in real time. One of the best examples of Intel’s AI strategy is automated driving. The data produced by autonomous vehicles is immense, about 4 terabytes per car per day, every day. The compute required for autonomous vehicles is even more astounding: one car, in one hour of driving, will require approximately 5 exaflops of compute to safely keep itself on the road, and supporting just 20,000 automated vehicles for one day is estimated to require one exaflop of sustained compute, or a quintillion floating-point operations per second.
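To put the machine-generated volumes above in perspective, a few lines of arithmetic using only the per-day figures quoted in the text compare an average Internet user with one connected factory:

```python
# All figures are gigabytes per day, as quoted above (2020 projections).
avg_user_gb = 1.5
smart_hospital_gb = 3_000
self_driving_car_gb = 4_000
connected_plane_gb = 40_000
connected_factory_gb = 1_000_000

# How many average Internet users generate as much traffic as one factory?
users_per_factory = connected_factory_gb / avg_user_gb
print(f"One connected factory = {users_per_factory:,.0f} average users")  # prints 666,667
```

In other words, a single connected factory produces roughly as much data per day as two-thirds of a million average Internet users.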
So, as we can see, big data’s problem is that there is so much data: how are you going to process it? Do you have enough compute power, enough storage to store it, and the infrastructure to move it around between the compute and the storage devices? In short, HPC, or High-Performance Computing, is simply leveraging distributed compute resources to solve complex problems with large data sets. By large data sets, we mean terabytes, petabytes, or even zettabytes of data that need to be processed as close to real time as possible in many cases, and certainly in minutes to hours, not days, weeks, or months. Typically, a user submits a job to the cluster manager; the cluster manager then runs the workload on distributed resources such as CPUs, FPGAs, GPUs, and disk drives, all interconnected by a network. The user then gets the results back and can analyze them to make decisions. Common workloads today range across many vertical markets, such as life sciences, astrophysics, genomics, bioinformatics, molecular dynamics, and weather and climate prediction. Artificial intelligence crosses many of these industries and is one of the hottest topics today. Cybersecurity and financial analysis are also becoming more and more popular. The rest fall under the broad umbrella of big data analytics, which means a lot of things to a lot of people.
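The submit, distribute, and gather pattern described above can be sketched in miniature with Python’s standard library. Here a thread pool stands in for the cluster’s distributed compute resources, and the workload (a sum of squares) and the chunking scheme are illustrative assumptions; a real cluster manager schedules jobs across many machines, but the overall shape is the same:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_chunk(chunk):
    """Toy 'workload': sum of squares over one slice of the data set."""
    return sum(x * x for x in chunk)

def run_job(data, workers=4):
    """Split the data, fan it out to workers, then gather the partial results."""
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # The pool plays the role of the cluster's interconnected compute resources.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(analyze_chunk, chunks)
    return sum(partials)  # combine partial results into the final answer

print(run_job(list(range(1_000))))  # same answer as a serial sum of squares
```

The point of the sketch is the structure: the job is divided, each resource processes its own piece, and the results are combined for the user to analyze.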
The process of solving a big data analytics problem involves six steps.
Step one: define the question or problem being examined. For example, what products sell best at Christmas to men ages 18 to 25? What is the optimal traffic pattern of vehicles around the downtown area of a major city? Or, how does this chemical react with another?
Step two: ingest and store the data that will be used to answer that question. For example, the ImageNet data set can be used for training deep learning topologies, the KITTI Vision Benchmark Suite can be used for autonomous driving, and the USDA Food Composition Database is one of many more. Data is king with HPC: while there are many publicly available data sets for a variety of markets, many organizations spend years accumulating data for their private use.
Step three: clean the data, prepare it, and transform it into a format that can be used and processed by the HPC workload. This could mean resizing images to be processed or formatting tables to be queried. Data can also be reduced to make it more manageable and organized.
Step four: perform the analysis on the data.
Step five: review the results in their context.
Step six: make decisions or take actions based on those results. Data comes in many forms.
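The six steps above can be walked through end to end on a toy data set. The records, the cleaning rules, and the question itself (the Christmas-sales example from step one) are all invented for illustration:

```python
from collections import Counter

# Step 1: the question -- which product sells best to men ages 18 to 25?
raw_sales = [                       # Step 2: ingested raw records
    {"product": "hat",   "age": "19", "gender": "M"},
    {"product": "scarf", "age": "22", "gender": "M"},
    {"product": "hat",   "age": "24", "gender": "M"},
    {"product": "hat",   "age": "40", "gender": "F"},
    {"product": None,    "age": "21", "gender": "M"},   # incomplete record
]

# Step 3: clean and transform -- drop incomplete rows, cast ages to integers
clean = [
    {**r, "age": int(r["age"])}
    for r in raw_sales
    if r["product"] is not None
]

# Step 4: analyze -- count sales within the target segment
counts = Counter(
    r["product"] for r in clean
    if r["gender"] == "M" and 18 <= r["age"] <= 25
)

# Step 5: review the result in context
best, n = counts.most_common(1)[0]
print(f"Best seller for men 18-25: {best} ({n} sales)")  # prints: hat (2 sales)
# Step 6: act on the result (e.g., stock more of the best seller)
```

At HPC scale the same pipeline would run over terabytes rather than five records, but the shape of the work, from ingest through cleaning to analysis and action, is identical.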
These data sets could contain structured data, like spreadsheet data, that exists in a relational database as organized, easily addressable rows and columns. There is also semi-structured data, where the data is not formatted but still has associated information, such as key-value pairs, that makes it more amenable to processing than plain raw data. Finally, there is unstructured raw data that does not map well to mainstream relational databases, such as text documents, web pages, audio and video files, and many others.
There is also a little terminology about where data gets stored. The first concept is the data lake, which takes in raw data from a variety of sources in its native format; it uses a flat file architecture to store the data, and it is not limited to unstructured data. The data warehouse is a storage repository that stores structured data in a tabular format using files and other hierarchical folder structures. The raw data from the data lake may need to be extracted, transformed, and then loaded into a data warehouse. There are many open source tools available today that make processing the data easier, providing tools and infrastructure for processing both structured and unstructured data. Apache Hadoop is one of the most common HPC frameworks, with Apache Spark becoming equally popular. There are also systems like Apache Cassandra, which is a NoSQL database management system, while PostgreSQL is an object-relational database management system and SAP HANA is an in-memory, column-oriented relational database management system, with many other systems available.
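The extract, transform, and load path from a data lake to a data warehouse can be sketched with the standard library. The semi-structured JSON records and the table layout are invented for illustration, and an in-memory SQLite database stands in for the warehouse; a real pipeline would use tools like the Hadoop and Spark frameworks mentioned above:

```python
import json
import sqlite3

# Extract: semi-structured key-value records as they might sit in a data lake
lake_files = [
    '{"device": "sensor-1", "reading": "21.5"}',
    '{"device": "sensor-2", "reading": "19.0"}',
]

# Transform: parse the raw JSON and cast the readings to numbers
rows = [
    (rec["device"], float(rec["reading"]))
    for rec in (json.loads(f) for f in lake_files)
]

# Load: insert into a structured, queryable table (the "warehouse")
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (device TEXT, value REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?)", rows)

# The data is now structured: addressable rows and columns, queryable with SQL
avg = db.execute("SELECT AVG(value) FROM readings").fetchone()[0]
print(f"Average reading: {avg}")  # prints: Average reading: 20.25
```

Once loaded, the same records that were opaque key-value blobs in the lake can be filtered, joined, and aggregated like any other relational table.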