Heterogeneous data pools
Apart from volume and velocity, variety ranks as a central characteristic of big data. Typically, data arises from different sources and is partly structured and partly unstructured: sensor data and log data, for example, or aggregated customer feedback in the form of e-mails. Data can be historical (i.e. already stored) or arise in real time and enter the analysis as soon as it is generated. Besides the company's own data, external data sources, such as open data portals, can be used.
Mastering a myriad of formats
Particularly when data from many different sources accumulates in a big data solution, it must be converted into a common form. The challenge lies in handling the different original formats. For example, in one project our experts developed a solution for storing and processing 20,000 different formats.
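To illustrate what converting data into a common form can look like, here is a minimal sketch, not the solution from the project mentioned above. It assumes two hypothetical input formats (a JSON sensor message and a CSV export) and a hypothetical common record with the fields source, timestamp and value. All field names and the PARSERS registry are assumptions made for this example.

```python
import csv
import io
import json


def parse_json(raw):
    """Parse a hypothetical JSON sensor message into common records."""
    obj = json.loads(raw)
    return [{
        "source": obj["device"],
        "timestamp": obj["ts"],
        "value": float(obj["reading"]),
    }]


def parse_csv(raw):
    """Parse a hypothetical CSV export (header: sensor,time,val)."""
    reader = csv.DictReader(io.StringIO(raw))
    return [
        {
            "source": row["sensor"],
            "timestamp": row["time"],
            "value": float(row["val"]),
        }
        for row in reader
    ]


# Registry that maps a format name to its parser. Supporting a new
# format means adding one entry here, not changing downstream code.
PARSERS = {
    "json": parse_json,
    "csv": parse_csv,
}


def normalize(fmt, raw):
    """Convert raw input of the given format into common records."""
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise ValueError(f"no parser registered for format {fmt!r}")
    return parser(raw)
```

Keeping one parser per format behind a shared registry means the downstream analysis only ever sees the common record, however many original formats there are.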
Time dependence and legal requirements
The fact that data is time-dependent presents an additional difficulty. An application working with geodata, for example, must factor in changes: new roads are built and country borders shift, while historical vehicle data must still align with the old maps. Legal requirements may necessitate additional processing steps. Personal data, for example, must be deleted after a specific period of time.
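Both points can be sketched in a few lines. The example below is an illustration under stated assumptions, not a description of a specific product: the map versions, their validity dates, the stored_at field and the retention period are all hypothetical. It selects the map version that was valid on a given day and drops records that are older than a retention period.

```python
from bisect import bisect_right
from datetime import date, datetime, timedelta, timezone

# Hypothetical map versions, sorted by the day each became valid.
MAP_VERSIONS = [
    (date(2015, 1, 1), "map-v1"),
    (date(2019, 6, 1), "map-v2"),
    (date(2023, 3, 1), "map-v3"),
]


def map_version_for(day):
    """Return the map version that was valid on the given day."""
    starts = [start for start, _ in MAP_VERSIONS]
    # Index of the last version whose start date is <= day.
    idx = bisect_right(starts, day) - 1
    if idx < 0:
        raise ValueError(f"no map version covers {day}")
    return MAP_VERSIONS[idx][1]


def purge_expired(records, now, retention):
    """Keep only records stored within the retention period."""
    cutoff = now - retention
    return [r for r in records if r["stored_at"] >= cutoff]


# Usage: older vehicle data is matched against the older map.
version = map_version_for(date(2018, 5, 5))

# Usage: records older than the (assumed) one-year retention are dropped.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "stored_at": datetime(2022, 6, 1, tzinfo=timezone.utc)},
    {"id": 2, "stored_at": datetime(2023, 6, 1, tzinfo=timezone.utc)},
]
kept = purge_expired(records, now, timedelta(days=365))
```

The actual retention period depends on the applicable legal requirements. The one-year value here is only a placeholder.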