Don’t be stupid…refine your data

11.03.19 09:45 PM By Jordan

In the past, gold, oil and diamonds were commodities that countries literally went to war over. Such was their importance and value.

While the value of these commodities has not diminished over time, there is another commodity that is being referred to as the new oil. Big Data is poised to take the world by storm over the next few years.

Value trends

This should not be news to anyone who follows this blog, as we have covered it extensively in the past. What trends can we look forward to in the future?

  • Data Management Is Still Hard. The article points out that the big idea behind big data analytics is fairly clear-cut: find interesting patterns hidden in large amounts of data, train machine learning models to spot those patterns, and implement those models in production to act on them automatically. Rinse and repeat as necessary. However, the reality of putting that basic recipe into production is a lot harder than it looks. For starters, amassing data from different silos (see the next prediction) is difficult and requires ETL and database skills. Cleaning and labeling the data for machine learning training also takes a lot of time and money, particularly when deep learning techniques are used. And finally, putting such a system into production at scale in a secure and reliable fashion requires another set of skills entirely. For these reasons, data management remains a big challenge, and data engineers will continue to be among the most sought-after people on a big data team. (A minimal sketch of such a pipeline appears after this list.)

  • Data Silos Continue Proliferating. The article adds that this is not a difficult prediction to make. During the Hadoop boom five years ago, we were entranced with the idea that we could consolidate all of our data – for both analytical and transactional workloads – onto a single platform. That idea never really panned out, for a variety of reasons. The biggest challenge is that different data types have different storage requirements. Relational databases, graph databases, time-series databases, HDFS, and object stores all have their respective strengths and weaknesses. Developers can’t maximize those strengths if they’ve crammed all their data into a one-size-fits-all data lake. In some cases, amassing lots of data in a single place does make sense. Cloud data stores like S3, for instance, provide companies with flexible and cost-effective storage, and Hadoop continues to be a cost-effective store for unstructured data storage and analytics. But for most companies, these are simply additional silos that must be managed. They’re big and important silos, of course, but they’re not the only ones. In the absence of a strong centralizing force, data silos will continue to proliferate. Get used to it.
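
To make the ETL and data-cleaning points above concrete, here is a minimal sketch of pulling the same customer records out of two hypothetical silos (a CSV export and a SQLite database), cleaning them, and merging them with pandas. The file names, table name, and column names are all illustrative assumptions, not anything from the article.

```python
# A minimal ETL sketch: extract from two hypothetical silos, clean, merge.
# File, table, and column names below are illustrative assumptions.
import sqlite3
import pandas as pd

# Extract: one silo is a CSV export, the other a relational database.
crm = pd.read_csv("crm_export.csv")                     # hypothetical CSV silo
with sqlite3.connect("billing.db") as conn:             # hypothetical DB silo
    billing = pd.read_sql_query("SELECT * FROM invoices", conn)

# Transform: the cleaning step that eats most of the time and money.
crm["email"] = crm["email"].str.strip().str.lower()     # normalize join keys
crm = crm.dropna(subset=["email"]).drop_duplicates("email")
billing["amount"] = pd.to_numeric(billing["amount"], errors="coerce")
billing = billing.dropna(subset=["amount"])

# Load: join the silos into one analysis-ready table.
merged = crm.merge(billing, on="email", how="inner")
merged.to_parquet("customers_clean.parquet")            # needs pyarrow installed
```

Even this toy version shows why the work is hard: every silo has its own keys, types, and quirks, and the cleaning rules are specific to the data.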

The speed of light

Technology has enhanced efficiency: companies now take half the time to perform tasks that would have been considered tedious three years ago.

  • Streaming Analytics Has Breakout Year. The article points out that the quicker you can act on a new piece of data, the better off your organization will be. That’s the driving force behind real-time, or streaming, analytics. The challenge has always been that streaming is difficult to pull off, and expensive too, but that’s changing as organizations’ analytics teams mature and the technology gets better. NewSQL databases, in-memory data grids, and dedicated streaming analytics platforms are converging around a common capability: ultra-fast processing of incoming data, often using machine learning models to automate decision-making. The article adds that if companies combine that with the SQL capabilities in open source streaming frameworks like Kafka, Spark, and Flink, they have the recipe for real progress in 2019. (A small streaming sketch follows this list.)

  • Data Governance Builds Steam. Some people call data the “new oil.” It has also been called the “new currency.” Whichever analogy you prefer, we all agree that data has value, and that treating it carelessly carries a risk. The article points out that the European Union spelled out the financial consequences of poor data governance with last year’s enactment of the GDPR. While there’s no similar law in the United States yet, American companies must still abide by some 80 different data mandates created by various states, countries, and unions. Data breaches are bringing the issue to a head. According to an online survey by The Harris Poll, nearly 60 million Americans were affected by identity theft in 2018 – an increase of 300% from 2017, when just 15 million said they were affected. The article adds that most organizations have realized that the Wild West days of big data are coming to an end. While the US government won’t (yet) fine you for being reckless with data or abusing the privacy of American citizens, the writing is on the wall: this behavior is no longer tolerated. (A tiny pseudonymization sketch also follows this list.)
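
As a concrete illustration of the streaming bullet above, here is a minimal, hedged PySpark Structured Streaming sketch that reads events from a Kafka topic and runs a windowed, SQL-style aggregation over them. The broker address, topic name, and event schema are illustrative assumptions, and the job assumes the Spark-Kafka connector package is on the classpath.

```python
# A minimal streaming-analytics sketch with PySpark Structured Streaming.
# Broker address, topic name, and event schema are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

schema = (StructType()
          .add("user", StringType())
          .add("amount", DoubleType())
          .add("ts", TimestampType()))

# Read a live stream of JSON events from a Kafka topic.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*"))

# SQL-style aggregation over one-minute windows, updated continuously.
per_minute = (events
              .withWatermark("ts", "2 minutes")
              .groupBy(F.window("ts", "1 minute"), "user")
              .agg(F.sum("amount").alias("total")))

query = (per_minute.writeStream
         .outputMode("update")
         .format("console")          # stand-in sink for the sketch
         .start())
query.awaitTermination()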
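
On the governance side, one common first step is pseudonymizing personal identifiers before data reaches the analytics layer. The snippet below is a hedged sketch of salted hashing of email addresses; the salt handling and field names are illustrative assumptions, not a compliance recipe.

```python
# A tiny pseudonymization sketch: replace raw PII with keyed hashes so
# analysts can still join on a stable key without seeing the identifier.
# The salt source and field names are illustrative; real GDPR-grade
# controls need far more than this.
import hashlib
import hmac
import os

SALT = os.environ.get("PII_SALT", "change-me").encode()  # keep out of repos

def pseudonymize(value: str) -> str:
    """Deterministic keyed hash of a PII value (e.g. an email address)."""
    normalized = value.strip().lower().encode()
    return hmac.new(SALT, normalized, hashlib.sha256).hexdigest()

record = {"email": "Jane.Doe@example.com", "amount": 42.0}
record["email"] = pseudonymize(record["email"])
print(record)  # email is now an opaque 64-character hex digest
```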

Shifting skills

Because the technology industry keeps growing, the competition for skills is intense. This is evident in how often employees move around within the industry.

  • Skills Shift as Tech Evolves. The article points out that human resources are typically the biggest cost in a big data project, because people are ultimately the ones who build it, run it, and make it all work. Finding the right person with the right skills is absolutely critical to turning data into insight, no matter what technologies or techniques you’re using. But as technology advances, the skills mix does too. In 2019, you can expect to see continued huge demand for anybody who can put a neural network into production. Among mere data scientists (as opposed to legit AI experts), Python continues to dominate among languages, although there’s plenty of work for folks who know R, SAS, Matlab, Scala, Java, and C. As data governance programs kick into gear, demand for data stewards will go up. Data engineers who can work with the core tools (databases, Spark, Airflow, etc.) will continue to see their opportunities grow (a minimal Airflow sketch follows this list). You can also expect to see demand for machine learning engineers accelerate. The article adds that thanks to the advance of automated data science platforms, organizations will be able to accomplish quite a bit with mere data analysts, or “citizen data scientists,” as they’re commonly known. Knowledge of the data and the business – as opposed to expertise in statistics and coding – may get you further down the big data road than you imagined.

  • Deep Learning Gets Deeper. The article points out that the “Cambrian explosion” of deep learning, which has powered the AI summer we currently find ourselves in, shows no signs of letting up in 2019. Organizations will continue to experiment with deep learning frameworks like TensorFlow, Caffe, Keras, PyTorch, and MXNet as they seek to monetize vast data sets. Organizations will also expand deep learning beyond its initial use cases, like computer vision and natural language processing (NLP), and find new and creative ways of implementing the powerful technology. Large financial institutions have already found that neural network algorithms are better at spotting fraud than “traditional” machine learning approaches (a toy example follows this list), and the exploration of new use cases will continue in 2019. The article adds that this will also prop up demand for GPUs, which are the favored processors for training deep learning models. It’s unclear how much ground newer processor types, including ASICs, TPUs, and FPGAs, will gain, but there’s clearly demand for faster training and inference too. However, the deep learning ecosystem remains relatively young, and a lack of generalized platforms will keep this the realm of true experts.
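
To ground the data engineering point, here is a minimal, hedged sketch of the kind of tool a data engineer works with daily: an Airflow DAG wiring an extract step to a load step. The DAG id, schedule, and task bodies are illustrative assumptions, and the module paths follow the Airflow 2.x layout.

```python
# A minimal Airflow DAG sketch: the daily bread of the data engineer role
# described above. DAG id, schedule, and task bodies are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull yesterday's records from the source system")  # placeholder

def load():
    print("write cleaned records to the warehouse")  # placeholder

with DAG(
    dag_id="nightly_customer_sync",      # hypothetical pipeline name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task            # run extract, then load
```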
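
And to show how small the core of a neural fraud detector can look, here is a hedged Keras sketch of a binary classifier over tabular transaction features. The feature count, layer sizes, and stand-in data are illustrative assumptions; real fraud models involve heavy feature engineering and class-imbalance handling.

```python
# A toy neural fraud classifier in Keras: transaction features in,
# fraud probability out. Shapes and data here are illustrative only.
import numpy as np
from tensorflow import keras

n_features = 20                                          # hypothetical count
X = np.random.rand(1000, n_features).astype("float32")   # stand-in features
y = (np.random.rand(1000) < 0.05).astype("float32")      # ~5% "fraud" labels

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),         # P(fraud)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC()])             # AUC suits rare positives
model.fit(X, y, epochs=3, batch_size=32, validation_split=0.2)
```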

A special something

Software has been at the heart of technology’s growth over the years, and that will continue.

  • ‘Special K’ Expands Footprint. The article points out that software needs something to run on. Operating systems used to provide that common substrate, but now developers are targeting something a bit lower: Kubernetes. Developed by Google to manage and orchestrate virtualized Linux containers in the cloud, Kubernetes has become one of the hottest technologies in the big data ecosystem, if not the IT industry as a whole. As multi-cloud and hybrid deployments become more common, Kubernetes is the glue that holds it all together. The article adds that big data software vendors that used to write their software to run on Hadoop are now writing it to run on Kubernetes, which at least gets them in the front door (if not an invite to dinner). Supporting Kubernetes has become the number one requirement for software vendors – including the Hadoop vendors themselves. (A small API sketch follows this list.)

  • Clouds Hard to Ignore. The article points out that the cloud is big, and getting bigger. In 2018, the three biggest public cloud vendors grew at a rate approaching 50%. With an array of big data tools and technology – not to mention cheap storage for housing all that data (a small storage sketch also follows this list) – it will be hard to resist the allure of the cloud. In 2019, small businesses and startups will gravitate to the major public cloud providers, which are investing major sums in building ready-to-run big data platforms, replete with automated machine learning, analytical databases, and real-time streaming analytics. The article adds that bigger companies will also find the cloud hard to resist in 2019, even if the economics aren’t nearly as attractive. However, the looming threat of lock-in will keep bigger companies wary of putting all their eggs in a single cloud basket.
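
As a taste of what “targeting Kubernetes” looks like from code, here is a minimal, hedged sketch using the official Python client to ask a cluster what is running. It assumes a reachable cluster and a local kubeconfig; the namespace is an illustrative choice.

```python
# A minimal Kubernetes API sketch: list the pods running in a namespace.
# Assumes `pip install kubernetes` and a valid ~/.kube/config.
from kubernetes import client, config

config.load_kube_config()          # use local kubeconfig credentials
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="default").items:
    print(pod.metadata.name, pod.status.phase)
```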
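
And on the cheap-cloud-storage point, here is a hedged boto3 sketch of pushing a local dataset into S3. The bucket and key names are illustrative assumptions, and credentials are assumed to come from the usual AWS environment.

```python
# A minimal cloud-storage sketch with boto3: push a local file to S3.
# Bucket and key names are illustrative; credentials come from the
# standard AWS environment variables or config files.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="customers_clean.parquet",   # local artifact from the ETL sketch
    Bucket="example-data-lake",           # hypothetical bucket
    Key="curated/customers_clean.parquet",
)
```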

Rising star

One of the biggest challenges with technology is that it is always evolving. Just as a company gets used to a technology, it changes.

  • New Tech Will Emerge. The article points out that many of the major big data frameworks and databases driving innovation today were created by the Web giants in Silicon Valley and released as open source. The good news is there’s no sign the well is drying up. If anything, innovation may be accelerating. In 2019, big data practitioners would do well to retain as much flexibility as possible in their creations. While it may be tempting to cement your application to a certain technology for performance reasons, that could come back to haunt you when something better and faster comes along. The article adds that, as much as you can, you should keep your applications “loosely coupled but tightly integrated,” because you’ll eventually have to tear them apart and rebuild them. (A small decoupling sketch follows this list.)

  • Smart Things Everywhere. The article points out that it’s tempting to dismiss the smart toaster as a cute gizmo with no practical purpose in our lives. But perhaps it’s something bigger: a prelude to an always-on world where smart devices constantly collect data and adapt to our conditions. Driven by consumer demand, smart devices are proliferating at an astounding rate. Smart device ecosystems are springing up around the two leading platforms, Amazon Alexa and Google Assistant, giving consumers the opportunity to infuse remote access and AI smarts into everything from lighting and HVAC systems to locks and home appliances. The article adds that, buoyed by the rollout of super-fast 5G wireless networks, what’s happening in the home will soon happen in the world at large. Consumers will be able to interact with a multitude of devices, enjoying new levels of personalization everywhere they go. In 2019, progress will be made on a multitude of fronts. Yes, there are substantial technical, legal, and ethical hurdles presented by big data and AI, but the potential benefits are too great to ignore. (A tiny device-telemetry sketch also follows this list.)
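
To illustrate “loosely coupled but tightly integrated,” here is a hedged Python sketch: the application talks to a tiny storage interface, so the backend behind it (local disk today, an object store tomorrow) can be swapped without tearing up the calling code. The class and method names are illustrative assumptions.

```python
# A small decoupling sketch: code against an interface, not a vendor,
# so the storage backend can be replaced when something better arrives.
from typing import Protocol

class BlobStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class LocalStore:
    """Today's backend: plain files on disk."""
    def put(self, key: str, data: bytes) -> None:
        with open(key, "wb") as f:
            f.write(data)
    def get(self, key: str) -> bytes:
        with open(key, "rb") as f:
            return f.read()

def archive_report(store: BlobStore, name: str, body: bytes) -> None:
    """Application code sees only the interface, never the backend."""
    store.put(name, body)

archive_report(LocalStore(), "report.bin", b"quarterly numbers")
# Swapping in an S3-backed class later needs no change to archive_report.
```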
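
Finally, a hedged sketch of the device side of that always-on world: a sensor publishing readings over MQTT, the lightweight protocol many smart-home devices use. The broker address, topic, and payload are illustrative assumptions, and the snippet assumes the paho-mqtt 1.x API.

```python
# A tiny device-telemetry sketch: publish one sensor reading over MQTT.
# Broker, topic, and payload shape are illustrative assumptions.
# Requires `pip install paho-mqtt` (1.x API shown).
import json
import time

import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("broker.example.com", 1883)   # hypothetical MQTT broker

reading = {"device": "thermostat-42", "temp_c": 21.5, "ts": time.time()}
client.publish("home/livingroom/temperature", json.dumps(reading))
client.disconnect()
```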

“Change is undeniable. But it’s not like companies are looking through thick fog and cannot make out the woods from the trees. Big Data is a valuable yet simple resource that companies can take advantage of. All it takes is intuition and a good data analytics department to put you on the front foot. Without refinement, Big Data is like a Ferrari that is just parked in a garage and never driven…it is useless,” warns BGTconsult Co-Founder and CEO Bradley Geldenhuys.

Jordan
