Data Handling & Storage in Companies

Abhishek Arora
4 min read · Sep 17, 2020


So What is Big Data?

Big Data is any data so large that the companies dealing with it have trouble storing and retrieving it. Big Data is commonly characterized by three V’s: Volume (the sheer amount of data), Velocity, and Variety.

Velocity: This refers to the speed with which data is generated and uploaded by users from various geographical locations onto companies’ servers. Taking social media as an example, reports from mid-2017 indicate that every day 900 million photos are uploaded to Facebook, 500 million tweets are posted on Twitter, 0.4 million hours of video are uploaded to YouTube, and 3.5 billion searches are performed on Google.
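To get a feel for these daily totals, they can be converted into average per-second rates. The sketch below uses only the figures quoted above; the names `daily_counts` and `per_second` are illustrative, and the result is a simple average that ignores peaks and quiet periods.

```python
# Daily figures quoted in the paragraph above (mid-2017 reports).
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

daily_counts = {
    "Facebook photos uploaded": 900_000_000,
    "Tweets posted": 500_000_000,
    "Hours of YouTube video uploaded": 400_000,
    "Google searches": 3_500_000_000,
}


def per_second(daily_total: int) -> float:
    """Convert a per-day total into an average per-second rate."""
    return daily_total / SECONDS_PER_DAY


for label, total in daily_counts.items():
    print(f"{label}: about {per_second(total):,.0f} per second")
```

On average, that works out to roughly 10,000 photos and 40,000 searches every second.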

Variety: This refers to the types of data being generated. Data can be classified into three types:

Structured Data: Structured data is information that has been formatted and transformed into a well-defined data model. The raw data is mapped into predesigned fields that can later be extracted and read easily through SQL. SQL relational databases, consisting of tables with rows and columns, are the classic example of structured data. The relational model uses storage efficiently because it minimizes data redundancy. However, this also means that structured data is more interdependent and less flexible. Structured data is generated by both humans and machines. Machine-generated examples include point-of-sale (POS) data such as quantities and barcodes, as well as weblog statistics. Similarly, almost anyone who works with data has used a spreadsheet at some point, which is a classic case of structured data generated by humans. Because of its organization, structured data is easier to analyze than both semi-structured and unstructured data.
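As a small illustration of why predefined fields make analysis easy, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, and the sample rows are made up for this example and stand in for POS data.

```python
import sqlite3

# In-memory database, so the example leaves no files behind.
conn = sqlite3.connect(":memory:")

# Every row must fit this predefined set of fields.
conn.execute(
    "CREATE TABLE sales (barcode TEXT, product TEXT, quantity INTEGER)"
)

# Hypothetical POS records.
rows = [
    ("8901", "notebook", 3),
    ("8902", "pen", 10),
    ("8901", "notebook", 2),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Because every row has the same fields, aggregation is a single SQL query.
totals = conn.execute(
    "SELECT product, SUM(quantity) FROM sales "
    "GROUP BY product ORDER BY product"
).fetchall()
print(totals)

conn.close()
```

The query returns the total quantity sold per product without any custom parsing code, which is the main practical benefit of a fixed schema.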

Semi-Structured Data: Your data may not always be strictly structured or unstructured. Between these two lies another category that is partially structured, and such data is called semi-structured. Although this type of data has some consistent and definite characteristics, it does not conform to a rigid structure such as that needed for relational databases. Organizational properties like metadata or semantic tags are used with semi-structured data to make it more manageable; however, it still contains some variability and inconsistency. Delimited files are one example of semi-structured data: they contain elements that can break the data down into separate hierarchies. Similarly, in digital photographs, the image itself does not have a predefined structure. Still, if it is taken with a smartphone, it will have structured attributes like a geotag, device ID, and date-time stamp. After being stored, images can also be assigned tags such as ‘pet’ or ‘dog’ to provide structure. On some occasions, unstructured data is classified as semi-structured because it has one or more classifying attributes.
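The photo example can be sketched as a handful of JSON records in which some attributes are present on one record and missing on another. The file names, field names, and values below are invented for illustration.

```python
import json

# Hypothetical photo metadata: the records share some fields but not all.
raw = """
[
  {"file": "IMG_001.jpg", "device_id": "phone-A",
   "taken_at": "2020-09-17T10:15:00",
   "geotag": {"lat": 28.61, "lon": 77.21}, "tags": ["pet", "dog"]},
  {"file": "IMG_002.jpg", "device_id": "phone-B",
   "taken_at": "2020-09-17T11:02:00"},
  {"file": "scan_003.png", "tags": ["document"]}
]
"""
photos = json.loads(raw)

for photo in photos:
    # .get() handles attributes that are missing from a record.
    tags = photo.get("tags", [])
    location = "geotagged" if "geotag" in photo else "no geotag"
    print(photo["file"], tags, location)

# Tags give just enough structure to filter on.
dog_photos = [p["file"] for p in photos if "dog" in p.get("tags", [])]
print(dog_photos)
```

Unlike a relational table, nothing forces every record to carry the same fields, so the code has to tolerate missing attributes.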

Unstructured Data: Data present in its absolute raw form is termed unstructured. This data is difficult to process due to its complex arrangement and formatting. Unstructured data may take many forms, including social media posts, chats, satellite imagery, IoT sensor data, emails, and presentations. Unstructured data is largely qualitative rather than quantitative, so it is mostly categorical and descriptive in nature. For example, data from social media or websites can be used to figure out future buying trends or to determine the effectiveness of a marketing campaign. Moreover, unstructured data helps in detecting patterns in scam emails and chats, which can be useful to enterprises for monitoring policy compliance.

Solution to this problem

In today’s world, we have become very advanced technologically, so there must be a way to deal with Big Data. Let’s see how we can conquer the Big Data problem.

A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes. Whenever data is sent to the cluster for storage, the master node splits it up and sends the pieces to the slave nodes. With this technique we do not need a single very large storage device, which makes the approach cost-effective.
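The idea of a master node splitting data across other nodes can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular system is implemented: the block size, the node names, and the round-robin placement are all chosen for illustration, and the sketch omits the replication and failure handling that real clusters typically add.

```python
from itertools import cycle


def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def distribute(blocks: list, nodes: list) -> dict:
    """Assign blocks to nodes round-robin, remembering each block's index."""
    placement = {node: [] for node in nodes}
    for (index, block), node in zip(enumerate(blocks), cycle(nodes)):
        placement[node].append((index, block))
    return placement


def reassemble(placement: dict) -> bytes:
    """Collect blocks from every node and join them in their original order."""
    indexed = [pair for stored in placement.values() for pair in stored]
    indexed.sort(key=lambda pair: pair[0])
    return b"".join(block for _, block in indexed)


data = b"0123456789" * 10                  # 100 bytes of sample data
nodes = ["node-1", "node-2", "node-3"]     # hypothetical storage nodes

blocks = split_into_blocks(data, block_size=16)
placement = distribute(blocks, nodes)

for node, stored in placement.items():
    print(node, "holds blocks", [index for index, _ in stored])

assert reassemble(placement) == data
```

No single node holds the full 100 bytes, yet the original data can be rebuilt from the pieces, which is why many small machines can stand in for one very large storage device.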

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware (that is, inexpensive, widely available, off-the-shelf machines rather than specialized high-end servers). It provides massive storage for any kind of data, enormous processing power, and the ability to handle a very large number of concurrent tasks or jobs.

Some Facts

Facebook generates 4 petabytes of data per day, which is about 4 million gigabytes. All that data is stored in what is known as the Hive, which contains about 300 petabytes of data.
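As a quick check on the units, 1 petabyte is 1,000,000 gigabytes in decimal units, so 4 petabytes per day comes to about 4 million gigabytes per day. The short sketch below works this out from the two figures quoted above; the variable names are illustrative, and the "days" figure is only a rough comparison of the two numbers, not a claim about how Facebook actually retains data.

```python
PETABYTE_IN_GB = 1_000_000  # decimal units: 1 PB = 10**6 GB

daily_pb = 4      # data generated per day, from the text
stored_pb = 300   # data held in the Hive, from the text

daily_gb = daily_pb * PETABYTE_IN_GB
days_equivalent = stored_pb / daily_pb

print(f"{daily_pb} PB per day = {daily_gb:,} GB per day")
print(f"{stored_pb} PB is about {days_equivalent:.0f} days of data at that rate")
```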

Netflix is a subscription-based streaming service that allows its members to watch TV shows and movies without commercials on an internet-connected device. It is a popular American entertainment company specializing in online on-demand streaming video. Netflix uses Big Data to predict what its customers will enjoy watching.

91 percent of executives rate LinkedIn as their first choice for professionally relevant content. 280 billion feed updates are viewed annually, and there are 9 billion content impressions in LinkedIn feeds every week. 2 million posts, articles, and videos are published on LinkedIn every day.

95 million photos and videos are shared on Instagram per day. Over 40 billion photos and videos have been shared on the Instagram platform since its inception.
