20 core technologies on the radar screen
Hadoop was created in 2005 as a free implementation of Google’s MapReduce framework. Today it ranks as the de facto standard in the open-source big data environment. With the HDFS distributed file system and the YARN resource manager, the latest version of the Apache project forms the basis for cost-efficient, scalable distributed data management and processing. Numerous additional technologies build on it, from distributed databases like HBase to workflow engines such as Oozie. For real-time processing, Apache Storm or similar technologies can be integrated into the ecosystem. Currently, our experts have around 20 core technologies revolving around Hadoop on their radar screens.
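To make the MapReduce programming model mentioned above concrete, the following sketch runs the classic word-count example locally. It deliberately uses plain Java collections instead of the Hadoop API, so it only illustrates the map, shuffle and reduce phases; a real Hadoop job would implement the framework's Mapper and Reducer classes and read its input from HDFS.

```java
import java.util.*;
import java.util.stream.*;

/**
 * Minimal local simulation of the MapReduce programming model
 * (word count, the canonical example). Illustration only: a real
 * Hadoop job would use the Hadoop Mapper/Reducer API instead.
 */
public class WordCountSketch {

    // Map phase: split each input line into (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1));
    }

    // Shuffle + reduce phase: group the pairs by key and sum the values.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> input = List.of("to be or not to be", "to do is to be");
        Map<String, Integer> counts =
                reduce(input.stream().flatMap(WordCountSketch::map));
        System.out.println(counts.get("to")); // 4
        System.out.println(counts.get("be")); // 3
    }
}
```

The appeal of the model is that the map and reduce functions are side-effect free, which is what lets Hadoop distribute them across many nodes and re-run them on failure.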
Successfully deployed in many different projects
Hadoop and the related technologies have already been put to the test in many mgm projects. One salient example: in order to store millions of data sets in thousands of different original formats in a distributed data storage system, we implemented a solution with Hadoop and HBase. The absence of license costs thanks to open source is a general advantage for customers: scaling up is not obstructed by expensive software licenses.
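Storing millions of records in HBase raises a typical design question: how to lay out row keys. If keys increase monotonically (timestamps, sequence numbers), all writes hit the same region server. A common remedy is to prefix each key with a salt derived from its hash, spreading rows across pre-split regions. The sketch below shows the idea; the key format, bucket count and identifiers are illustrative assumptions, not details of the actual mgm project.

```java
/**
 * Hypothetical sketch of a salted row-key scheme for HBase.
 * Prefixing each key with a hash-derived salt distributes writes
 * evenly across a fixed number of pre-split regions. The bucket
 * count and key format here are illustrative only.
 */
public class SaltedRowKey {

    static final int BUCKETS = 16; // one salt prefix per pre-split region

    // Build the row key: two-digit salt + separator + original record id.
    static String rowKey(String recordId) {
        int salt = Math.floorMod(recordId.hashCode(), BUCKETS);
        return String.format("%02d|%s", salt, recordId);
    }

    public static void main(String[] args) {
        System.out.println(rowKey("doc-2014-000001"));
        System.out.println(rowKey("doc-2014-000002"));
    }
}
```

The salt is deterministic, so reads can reconstruct the full key from the record id alone; the trade-off is that range scans over the original key order must fan out across all buckets.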
Compiling sustainable software stacks
The Hadoop ecosystem is continually expanding. While existing technologies undergo further development and new ones are added, work on some projects has stopped. One challenge is to keep an eye on the maturity and development potential of the individual components, and to combine the right technologies as a perfect fit for the requirements of a given project. Is real-time or batch processing required? Is data consistency, i.e. transactionality, needed? Is the data structured, semi-structured or unstructured? In order to find the best solution in each case, we observe and evaluate many additional technologies besides the core components.