Big Data analytics involves many data types: structured, unstructured, geographic, real-time media, natural language, time series, event, network and linked. It is necessary here to distinguish between human-generated data and device-generated data, since human-generated data is often less trustworthy, noisier and less clean.
A brief description of each type is given below.
- Structured data: data stored in rows and columns, mostly numerical, where the meaning of each data item is defined. This type of data constitutes about 10% of today’s total data and is accessible through database management systems. Example sources of structured (or traditional) data include official registers created by governmental institutions to store data on individuals, enterprises and real estate, and industrial sensors that collect data about processes. Today, sensor data is one of the fastest-growing areas, particularly because sensors are installed in plants to monitor movement, temperature, location, light, vibration, pressure, liquid and flow.
- Unstructured data: data in various forms, such as text, images, video, and documents. It can also take the form of customer complaints, contracts, or internal emails. This type of data accounts for about 90% of the data created in this century. In fact, the explosive growth of social media (e.g. Facebook and Twitter) since the middle of the last decade is responsible for the major part of the unstructured data that we have today. Unstructured data does not fit well into traditional relational databases. Storing data of such variety and complexity requires adequate storage systems, commonly referred to as NoSQL databases, such as MongoDB and CouchDB. The importance of unstructured data lies in its embedded interrelationships, which may not be discovered if only other types of data are considered. What sets social media data apart from other types of data is its personal character.
- Geographic data: data related to roads, buildings, lakes, addresses, people, workplaces, and transportation routes, generated by geographic information systems. Such data links place, time, and attributes (i.e. descriptive information). Digital geographic data has huge benefits over traditional sources such as paper maps, written reports from explorers, and spoken accounts, in that it is easy to copy, store, and transmit. More importantly, it is easy to transform, process, and analyze. Such data is useful in urban planning and for monitoring environmental effects. The branch of statistics concerned with spatial or spatiotemporal data is called geostatistics.
- Real-time media: real-time streaming of live or stored media data. A special characteristic of real-time media is the amount of data being produced, which will become increasingly challenging to store and process in the future. One of the main sources of media data is services such as YouTube, Flickr, and Vimeo, which produce a huge amount of video, pictures, and audio. Another important source of real-time media is video conferencing (or visual collaboration), which allows two or more locations to communicate simultaneously through two-way video and audio transmission.
- Natural language data: human-generated data, particularly in verbal form. Such data differs in its level of abstraction and editorial quality. The sources of natural language data include speech capture devices, landline phones, mobile phones, and Internet of Things devices that generate large volumes of text-like communication between devices.
- Time series: a sequence of data points (or observations), typically consisting of successive measurements made over a time interval. The goal is to detect trends and anomalies, identify context and external influences, and compare an individual against the group or against itself at different times. There are two kinds of time series data: (i) continuous, where we have an observation at every instant of time, and (ii) discrete, where we have an observation at (usually regularly) spaced intervals. Examples of such data include ocean tides, counts of sunspots, the daily closing value of the Dow Jones Industrial Average, and the monthly level of unemployment.
- Event data: data generated by matching external events with time series. This requires distinguishing important events from unimportant ones. For example, information related to vehicle crashes or accidents can be collected and analyzed to help understand what the vehicles were doing before, during and after the event. The data in this example is generated by sensors mounted at different places on the vehicle body. Event data consists of three main pieces of information: (i) action, which is the event itself, (ii) timestamp, the time when the event happened, and (iii) state, which describes all other information relevant to the event. Event data is usually described as rich, denormalized, nested and schemaless.
- Network data: data concerning very large networks, such as social networks (e.g. Facebook and Twitter), information networks (e.g. the World Wide Web), biological networks (e.g. biochemical, ecological and neural networks), and technological networks (e.g. the Internet, telephone and transportation networks). Network data is represented as nodes connected via one or more types of relationship. In social networks, nodes typically represent people. In information networks, nodes represent data items (e.g. webpages). In technological networks, nodes may represent Internet devices (e.g. routers and hubs) or telephone switches. In biological networks, nodes may represent neural cells. Much of the interesting work here focuses on network structure and the connections between network nodes.
- Linked data: data built upon standard Web technologies such as HTTP, RDF, SPARQL and URIs to share information that can be semantically queried by computers (rather than serving only human readers). This allows data from different sources to be connected and read. The term was coined by Tim Berners-Lee, director of the World Wide Web Consortium, in a design note about the Semantic Web project. This project allowed the Web to connect related data that wasn’t linked in the past by providing the mechanisms for, and lowering the barriers to, linking data currently linked using other methods. Examples of repositories for linked data include (i) DBpedia, a dataset containing data extracted from Wikipedia, (ii) GeoNames, RDF descriptions of more than 7,500,000 geographical features worldwide, (iii) UMBEL, a lightweight reference structure of 20,000 subject concept classes and their relationships derived from OpenCyc, and (iv) FOAF (Friend of a Friend), a dataset describing persons, their properties and relationships. Linked open data is another project that targets linked data with open content.
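To make the three-part structure of event data described above (action, timestamp, state) more concrete, the following is a minimal Python sketch. The `Event` class, its field names, and the sample crash values are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class Event:
    """One event record: what happened, when, and the surrounding context."""
    action: str                                          # the event itself
    timestamp: datetime                                  # when the event happened
    state: Dict[str, Any] = field(default_factory=dict)  # nested, schemaless context


# Hypothetical sample: a crash event reported by vehicle sensors.
crash = Event(
    action="collision_detected",
    timestamp=datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    state={
        "speed_kmh": 62.5,
        "sensors": {
            "front_left": {"pressure": 3.1},
            "rear": {"vibration": 0.8},
        },
    },
)

# The nested state is read by key, without a fixed schema.
front_pressure = crash.state["sensors"]["front_left"]["pressure"]
```

Keeping `state` as a free-form dictionary reflects the "nested and schemaless" description: different events may carry different context without changing the record definition.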
Finally, each data type has different requirements for analysis and poses different challenges. In principle, the interpretation of data is known, but in practice nobody has the full picture.
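As one illustration of how representation differs between types, the node-and-relationship view of network data described above can be sketched as a typed adjacency list. The node names, the relationship labels, and the choice of directed edges are assumptions made for this example only.

```python
# Hypothetical edges: (source node, relationship type, target node).
edges = [
    ("alice", "friend", "bob"),
    ("bob", "friend", "carol"),
    ("alice", "follows", "carol"),
]

# Adjacency list keyed by node; each entry keeps the relationship type.
adjacency = {}
for source, relation, target in edges:
    adjacency.setdefault(source, []).append((relation, target))
    adjacency.setdefault(target, [])  # register nodes with no outgoing edges

# Number of outgoing connections per node, a basic structural measure.
out_degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
```

Storing the relationship label with each edge allows more than one type of relationship between the same pair of nodes, as with `alice` and `carol` here.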