
A Comparative Study on Cloudera, Amazon Web Services and Microsoft Azure


Photo by C Dustin on Unsplash


Abstract – This paper compares the cloud computing services provided by Cloudera, Amazon Web Services and Microsoft Azure. Big data refers to large volumes of structured, semi-structured and unstructured data that conventional technologies cannot store or process. The Hadoop framework enables the storage and processing of such complex data. Cloudera, Amazon Web Services and Microsoft Azure have each deployed the Hadoop framework to enable data storage and processing in the cloud. All three distributions provide cloud computing, cloud storage, databases and machine learning, and each has its own strengths and weaknesses. Users have to choose the distribution that best fits their requirements.

Keywords – Big data, cloud computing, Hadoop, AWS, Azure

I. Introduction (Brief background)

Big data is a term that refers to datasets of mammoth size (volume), high rate of growth (velocity) and assorted types (variety). Conventional technologies and tools, for example the Relational Database Management System (RDBMS), are neither sufficient nor suited to capture, manage, process or analyze big data and create meaningful insight from it [1][4].

An RDBMS is only suitable for structured data stored in table form, whereas big data comes in a wide variety of forms, not only tables. Big data includes unstructured data such as mobile-generated information, images, videos and RFID readings. In an RDBMS, data is analyzed based on relationships, which is a further limitation, because maintaining relationships among unstructured data is not yet practical. In addition, an RDBMS does not guarantee fast processing speed, which is one of the main concerns in analyzing big data. Hence, big data analytics is better served by NoSQL with a distributed file system approach. Last but not least, conventional technologies and tools are an expensive way to store and process big data [4].

Figure 1. Five Characteristics of Big Data. (Image by Author)

From the above statements, big data is characterized by huge volume, high variety and high velocity. According to [2] and [7], big data has two further characteristics: veracity and value.

Volume: Volume refers to the size of data. Data volumes have grown from megabytes and gigabytes to petabytes, as data is generated at large scale every second. According to one prediction, 40 zettabytes of data would be created by 2020, which is 300 times the amount in 2005 [1] [2].

Figure 2. Growth in Data Volume. (Image by Author)

Variety: Variety refers to the different types and sources of data. Big data does not consist only of structured data in rows and columns; in fact, only a small part of big data is structured. Most of the data generated is unstructured or semi-structured, for example music, videos, images, e-mail or social media data. Twitter alone has 200 million active users, who send 400 million tweets per day. All of this contributes to the growth in the variety of big data [1] [2].

Figure 3. Types of Data. (Image by Author)

Velocity: Velocity refers to the speed at which data is generated and the speed at which it must be processed. Some data is static and does not change, while other data changes very frequently. For data that changes frequently or is generated at high speed, for example social media posts, the processing speed must be high enough, because the data may no longer be useful after some time [1] [2].

Veracity: Veracity refers to the accuracy and reliability of data. Inconsistency and incompleteness in data collection lead to uncertainty in the data [2].

Value: Value refers to the benefit that can be gained from the data. According to [7], the big data ecosystem shows that data buyers and data users can extract value from the information gathered and combined by others in the data value chain.

I. History and Evolution of the Distributions / Services

As conventional technologies and tools are no longer suitable for big data storage and processing, various distributions and services have been released. Most of the distributions support the Hadoop framework, which is able to handle large and complex data.

Hadoop is an open source Apache framework, written in Java, and designed to support distributed parallel processing of large data sets across clusters of computers using simple programming models. The name is not an acronym: Hadoop is named after the toy elephant of its creator's son. Google motivated the creation of Hadoop by describing the Google File System in a paper published in 2003 and MapReduce in a paper published in 2004. Hadoop itself was started in 2005 by Doug Cutting and Mike Cafarella, initially to support an open source web crawler, Nutch, and it soon went into service at Yahoo. In 2008, Hadoop became a top-level Apache project, which is why it is now known as Apache Hadoop. Since then, Hadoop has become one of the most powerful data storage and processing frameworks for distributed applications [3] [4].

Hadoop helps to store, access and gain insight from big data in a distributed fashion at low cost, with high scalability and high availability. Because it can detect and handle failures at the application level, it is highly fault tolerant. Hadoop can handle not only large volumes of data but also a wide variety of data, such as images, videos, audio, files, folders, software and e-mail. In short, Hadoop can handle any kind of structured, semi-structured and unstructured data. Cloudera, Hortonworks and MapR are commercially supported distributions of Hadoop [4].

Figure 4. Hadoop Components. (Image by Author)

Hadoop offers a variety of services, which include documentation, location awareness, source code and work scheduling. The Hadoop package consists of two main parts and various other components. The two main parts are the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is mainly for data storage, and MapReduce is for data processing and analysis. Both HDFS and MapReduce follow a master-slave architecture. The other components include ZooKeeper, Sqoop, Pig, Oozie, Hive, HBase, Flume and Avro [3] [4].

HDFS is a portable file system written in Java that provides scalable, reliable, distributed storage within the Hadoop framework. An HDFS cluster contains a single Name Node and a cluster of Data Nodes. The Data Nodes store the file data, while the Name Node stores metadata such as file names, file attributes, and the number and location of the replicas of each block. The Name Node monitors the blocks held on the Data Nodes, and if a block replica is lost or a Data Node fails, it arranges for a new replica of the affected blocks to be created. The Name Node is therefore critical: only it knows where all the files are stored, and clients must go through it to locate data on the Data Nodes [3].
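
The division of labour between the Name Node and the Data Nodes can be illustrated with a toy in-memory model. This is a minimal sketch and not HDFS code: the class `ToyNameNode`, its method names, the node names and the replication factor of 3 are all assumptions chosen for illustration.

```python
class ToyNameNode:
    """Toy model of a Name Node: it holds metadata only (block -> Data Nodes)."""

    def __init__(self, data_nodes, replication=3):
        self.live_nodes = set(data_nodes)   # hypothetical Data Node names
        self.replication = replication      # assumed replication factor
        self.block_locations = {}           # block id -> set of Data Node names

    def _load(self, node):
        # Number of block replicas currently recorded on a Data Node.
        return sum(node in nodes for nodes in self.block_locations.values())

    def add_block(self, block_id):
        # Place replicas on the least-loaded live Data Nodes (ties broken by name).
        ranked = sorted(self.live_nodes, key=lambda n: (self._load(n), n))
        self.block_locations[block_id] = set(ranked[: self.replication])

    def handle_node_failure(self, failed_node):
        # Forget the failed node, then restore the replica count of every block.
        self.live_nodes.discard(failed_node)
        for nodes in self.block_locations.values():
            nodes.discard(failed_node)
            candidates = sorted(
                self.live_nodes - nodes, key=lambda n: (self._load(n), n)
            )
            while len(nodes) < self.replication and candidates:
                nodes.add(candidates.pop(0))


name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
for block in ["blk_1", "blk_2"]:
    name_node.add_block(block)

name_node.handle_node_failure("dn1")
for block, nodes in sorted(name_node.block_locations.items()):
    print(block, sorted(nodes))
```

In the sketch, the file data never passes through the toy Name Node; it only tracks where replicas live and decides where new ones should go, which mirrors the metadata role described above.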

MapReduce is a software framework for processing large datasets. Its name reflects its two main functions, Map and Reduce. The Map function splits the input data into key-value pairs, producing intermediate values, and the Reduce function combines the intermediate values for each key into the final output [4]. As mentioned, MapReduce applies the same master-slave architecture as HDFS. In MapReduce, the master node is the Job Tracker and the slave nodes are Task Trackers. When a user submits a job, the user communicates with the Job Tracker, which obtains the location of the data to process from the HDFS Name Node. The Job Tracker then passes tasks to the Task Trackers for processing. In both HDFS and MapReduce, the slave nodes periodically send a heartbeat signal to the master node so that the master knows they are still alive [5].
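
The Map and Reduce functions can be sketched with the classic word-count example. The code below is a single-process illustration in plain Python, not the Hadoop API: the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical, and in a real cluster the framework would run these steps in parallel across Task Trackers.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit one (word, 1) key-value pair per word in the input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the intermediate values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all intermediate values for one key into a final result."""
    return key, sum(values)


def word_count(lines):
    intermediate = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


counts = word_count(["big data big clusters", "data on clusters"])
print(counts)
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on the values for its own key, both phases can be spread over many machines, which is what makes the model suitable for large datasets.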

Hadoop has become indispensable, as its framework makes the storage and processing of big data possible. Although the Hadoop architecture is based on the Google File System and Google MapReduce, the software itself is free, and hence many distributions have deployed implementations of Hadoop.

A. Cloudera

As mentioned in the previous part, Cloudera is one of the commercially supported distributions of Hadoop. Cloudera was founded in 2008 by engineers from leading Silicon Valley companies, including Oracle, Facebook, Google and Yahoo!. Hortonworks was formed in 2011 by 24 Yahoo! engineers from the original Hadoop team. Both Hortonworks and Cloudera believe that open standards, open source and open markets are best for companies. In 2019, Cloudera merged with Hortonworks [8].

The founders of Cloudera maintained that the whole business market can gain advantages from Hadoop. For instance, an oil and gas firm can analyze its oil reservoir data in a new way that might provide different insight. This argument attracted the attention of Accel Partners, which was willing to invest funds to start Cloudera. In addition, Diane Greene, the co-founder of VMware, Marten Mickos, the former chief executive of MySQL, and Gideon Yu, the chief financial officer at Facebook, put money into the company as well. Cloudera planned to sell consulting and support services while Hadoop itself remains free [9].

B. Amazon Web Services

Amazon, the leader in global online sales, has its own cloud computing arm, Amazon Web Services (AWS). AWS was founded in 2006 and began to provide cloud computing to individuals and organizations [6]. It was introduced as a side business for Amazon.com [10]. Andy Jassy, the leader of the AWS business, has said that AWS was not the idea of any single person; it came about because the company was frustrated with its ability to launch new projects and support customers. Nobody knew that AWS would grow into what it is today, a trillion-dollar technology market [11].

According to Jassy, the team faced its first critical decision early in the building stage of AWS: whether to build just one service from among a storage solution, a compute solution and a database solution, or to build a platform with all three. They believed that all applications need a compute solution, most need a database and almost all need storage, so they concluded that most developers need a combination of these three services. AWS was then launched in March 2006 [11].

C. Microsoft Azure

Microsoft Azure was first released in 2010. It enables users to run services in the cloud, or to combine the cloud computing service with any infrastructure, data center or application [6]. The platform was originally named Windows Azure and was renamed Microsoft Azure on April 3, 2014. With the rename, Microsoft stated that it was focusing on Azure as a public cloud platform for customers. Microsoft Azure supports various operating systems, languages and services, for instance Hadoop [12]. Azure was first introduced at the Microsoft Professional Developers Conference 2008, together with Microsoft SQL Services, Microsoft .NET Services, Live Services, and the Microsoft SharePoint Services and Microsoft Dynamics CRM Services SaaS offerings. With this series of new products, Microsoft offered five key categories of cloud service, which were not limited to cloud computing. Windows Azure provided compute, storage and networking services, while Microsoft SQL Services provided database services [14].

Microsoft has long been a leader in IT markets. While Microsoft Windows software was and still is popular in people's lives, Microsoft decided to shift its focus from PCs to mobile and cloud computing. The decision proved to be a wise one, as sales doubled and the shares rose as much as 6.2 percent [13].

II. Highlights of the Distributions / Services and Their Components

Table 1. Summary of Comparison of the Three Distributions. (Image by Author)

A. Cloudera

Cloudera Enterprise is a good fit for organizations that want to set up their own Enterprise Big Data Hub and perform data analysis on it. Cloudera Enterprise leverages the open source Cloudera Distribution of Hadoop (CDH), which is one of the most deployed implementations of Hadoop in use today [17].

CDH provides all the tools needed to set up and operate a big data environment. Through integration with the Apache Kudu and Apache Impala projects, Cloudera is able to support real-time analytics and analytic queries with SQL. The four main features provided by Cloudera are an open source data platform, analytics, data management and predictive modelling. Cloudera Data Platform (CDP) combines the best technologies of Hortonworks and Cloudera to serve as the first enterprise data cloud. CDP includes not only the Data Hub service but also Data Warehouse and Machine Learning. It provides an integrated control plane to manage data, infrastructure and analytics, and it supports hybrid cloud and multi-cloud environments. Because CDP is a fully open source distribution, vendor lock-in can be avoided. For pricing, Cloudera offers various options, which include annual subscriptions starting at $4,000 and $0.08 per hour for services used [17].

B. Amazon Web Services

Amazon Web Services provides highly customizable services in terms of the storage and services used. Furthermore, the cost is calculated based on the services selected, which is beneficial for both small and large organizations. AWS also provides a 12-month free trial. However, AWS is reported to have some customer support issues [6].

The cloud computing services provided by AWS can be divided into 19 categories: Compute; Storage; Database; Migration; Networking and Content Delivery; Developer Tools; Management Tools; Artificial Intelligence; Analytics; Security, Identity and Compliance; Mobile Services; Application Services; Messaging; Business Productivity; Desktop and App Streaming; Software; Internet of Things; Contact Center; and Game Development [16].

Among all of these services, one of the most popular is Elastic Compute Cloud (EC2) for cloud computing. It offers a free tier, pay-per-use pricing and fast deployment, with prices starting from $0.0059 per hour. Next is Simple Storage Service (S3), which also has a free tier and is suitable for primary storage and backup, as AWS claims the service has almost 100% durability. Glacier is another storage service provided by AWS; it is suitable for archival storage and is integrated with S3. Another popular service is Relational Database Service (RDS), which offers a choice of six database engines, with prices starting at $0.017 per hour. Amazon has invested strongly in AI. It provides Amazon Machine Learning, which is capable of real-time predictions and includes visualization tools and wizards. Payment is per use: data analysis and model building cost $0.42 per hour, batch predictions cost $0.10 per 1,000 transactions and real-time predictions cost $0.0001 per prediction [16].

The biggest benefit AWS provides is pay-per-use pricing, which makes it suitable not only for large organizations but also for small and medium enterprises, and even individuals.
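
To make the pay-per-use model concrete, the rates quoted above from [16] can be combined into a rough monthly estimate. This is only a sketch: the workload figures (hours and prediction counts) are hypothetical, the function name `estimate_monthly_cost` is an assumption, and the rates are the starting prices cited in the text, not current AWS pricing.

```python
from decimal import Decimal

# Starting rates as quoted in the text from [16]; actual prices vary by
# instance type, region and date.
RATES = {
    "ec2_hour": Decimal("0.0059"),
    "rds_hour": Decimal("0.017"),
    "ml_build_hour": Decimal("0.42"),
    "ml_batch_per_1000": Decimal("0.10"),
    "ml_realtime_each": Decimal("0.0001"),
}


def estimate_monthly_cost(ec2_hours, rds_hours, ml_build_hours,
                          batch_predictions, realtime_predictions):
    """Sum the usage-based charges for one month of a hypothetical workload."""
    return (
        ec2_hours * RATES["ec2_hour"]
        + rds_hours * RATES["rds_hour"]
        + ml_build_hours * RATES["ml_build_hour"]
        + Decimal(batch_predictions) / 1000 * RATES["ml_batch_per_1000"]
        + realtime_predictions * RATES["ml_realtime_each"]
    )


# Hypothetical workload: one EC2 instance and one RDS instance running for
# 720 hours, 10 hours of model building, 50,000 batch predictions and
# 20,000 real-time predictions.
total = estimate_monthly_cost(720, 720, 10, 50_000, 20_000)
print(f"Estimated monthly cost: ${total:.2f}")
```

Under these assumptions the bill scales directly with usage, so a smaller workload simply produces a smaller total, which is the property that makes the model attractive to small organizations.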

C. Microsoft Azure

Microsoft Azure provides a wide variety of services for different kinds of industry. Microsoft considered all kinds of business needs and came up with various packages sufficient to fulfill the needs of different industries. Azure is compatible with both Windows and Linux and also provides a 12-month free trial. However, it is more expensive than AWS, because the services come in packages, which means that you still have to purchase a service in the package even if you do not need it [6].

The cloud services provided by Microsoft Azure can be divided into 14 categories, which are Compute, Networking, Storage, Web + Mobile, Containers, Databases, Data + Analytics, AI + Cognitive Services, Internet of Things, Enterprise Integration, Security + Identity, Developer Tools, Monitoring + Management, and Microsoft Azure Stack [14].

According to [14], of the wide variety of services provided by Azure, the most popular are Virtual Machines for compute, Blob Storage, SQL Database, Azure Active Directory for Security + Identity, and Visual Studio Team Services for Developer Tools. Azure's Virtual Machines offer pay-per-use pricing, which costs $0.018 per hour used. Furthermore, the Virtual Machines support various server software, including Windows Server, Linux, SQL Server, IBM, SAP and Oracle. For Blob Storage, which provides massively scalable object storage, prices depend on how the data is accessed. Both Virtual Machines and Blob Storage offer a free tier with which users can try the services. SQL Database also offers pay-per-use pricing, ranging from $0.10 to $3.23 per hour used. Azure Active Directory provides a single sign-on service for cloud and on-premises applications, and it is integrated with Microsoft cloud services like Office 365. A free tier is offered, and the cost is $1 per user per month for the Basic service and $9 per user per month for the Premium P2 service. Visual Studio Team Services is free for developers who subscribe to other Microsoft services. For those who are not Microsoft service subscribers, the service is free for the first five users and then costs $6 per user per month for users 6–10, $8 for users 11–100, $6 for users 101–1000 and $4 for additional users.

In short, Microsoft Azure benefits individuals and organizations that are already using Microsoft software, for example Windows and Office. Because Microsoft Azure has an interface that is familiar to users of Microsoft services, they can adopt it quickly, and some of the services provided by Microsoft Azure are free for Microsoft service subscribers [14].
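
The tiered Visual Studio Team Services pricing is easier to follow with a worked calculation. The sketch below assumes that the tiers are graduated, so each user is charged at the rate of the tier that user falls into, and that every price is per user per month; the text does not state this explicitly, and the function name `vsts_monthly_cost` is hypothetical.

```python
# (upper user limit of the tier, price per user per month in USD), as quoted
# in the text. None marks the open-ended last tier.
TIERS = [
    (5, 0),      # users 1-5: free
    (10, 6),     # users 6-10
    (100, 8),    # users 11-100
    (1000, 6),   # users 101-1000
    (None, 4),   # users beyond 1000
]


def vsts_monthly_cost(users):
    """Monthly cost for non-subscribers, assuming graduated tiers."""
    total, previous_limit = 0, 0
    for limit, price in TIERS:
        upper = users if limit is None else min(users, limit)
        if upper > previous_limit:
            # Charge only the users that fall inside this tier.
            total += (upper - previous_limit) * price
        if limit is None or users <= limit:
            break
        previous_limit = limit
    return total


for team_size in (5, 8, 12, 1001):
    print(team_size, "users:", vsts_monthly_cost(team_size), "USD per month")
```

Under this reading, a team of 12 pays nothing for the first five users, $6 each for the next five and $8 each for the remaining two.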

III. Comparisons among the Distributions / Services and Their Components (Analysis / Results)

All three distributions, Cloudera, Amazon Web Services and Microsoft Azure, support data storage, databases, cybersecurity, cloud computing and machine learning. Each distribution has its own pros and cons in the services it provides. This section compares the services provided by the three distributions.

Ratings and reviews of Cloudera, AWS and Microsoft Azure, based on product capabilities and customer experience, are available in [15]. Based on the ratings and reviews from the last 12 months in [15], Microsoft Azure has the highest overall peer rating, followed by Cloudera and then AWS. For product capabilities, Cloudera and Microsoft have average ratings higher than 4.0, while AWS has average ratings around 3.5. The ratings show that AWS performs poorly in loading data continuously and in queries on many data types/sources, while Cloudera is weak in administration and management. Microsoft Azure has the highest ratings for most aspects, except managing large volumes of data, queries on many data types/sources, system availability and user skill level support. AWS has the highest rating for system availability, while Cloudera has the highest rating for queries on many data types/sources. For customer experience, all three distributions have average ratings above 4.0. Microsoft Azure has the highest ratings for all criteria, which include pricing flexibility, ease of deployment, timeliness of vendor response and quality of technical support.

According to the ratings, Microsoft is the best in most of the aspects reviewed, including price and ease of deployment. However, [6] rates Microsoft Azure as expensive. A likely explanation for the difference between the two sources is that Microsoft Azure is well suited to organizations that are already subscribers to Windows services. They can get some Azure services for free, and the interface is more familiar to them, which makes it easier for them to deploy Microsoft Azure.

Based on the ratings in [15] and the review in [6], AWS has high ratings for system availability and is also recommended for its low price and pay-per-use payment method. However, compared to the other two distributions, AWS is weaker in loading data and performing queries on it, while Cloudera performs strongly in that area. Furthermore, AWS does not support a hybrid cloud environment as Cloudera and Microsoft Azure do, which limits its application in some respects.

IV. Discussion and Recommendation (Application Domain)

As different distributions have different strengths and weaknesses, their applications to real-life big data differ as well. Microsoft Azure, which strongly supports Windows services, is very suitable for sectors that widely use Windows services, such as education. Students, teachers, lecturers and even management staff use Windows services like Microsoft Office. Most universities subscribe to Microsoft Office 365 services for educational use. The applications of Microsoft Office 365 include students' assignments, lecturers' records of students' performance, and the detailed records of students and staff in the universities.

Cloudera, one of the most deployed implementations of Hadoop, has the strongest ability in loading data and running queries, and is suitable for processing complex data that is not structured, for example medical data. The characteristics and symptoms of a disease differ from patient to patient. Cloudera also supports scientific data visualization. Hence, Cloudera is suitable for the healthcare sector.

Amazon Web Services (AWS) is suitable for the public sector, as payment is charged per use. The public sector includes government and non-government organizations, which may use the services provided by AWS to store data collected from surveys and to perform data analytics right away on the AWS platform. Pay-per-use pricing may benefit these organizations in terms of cost. AWS provides machine learning services as well, which the organizations could use for predictions such as the electricity usage of a residential area.

From my point of view, all three distributions will grow and become more and more mature in the services they provide, in terms of product quality, security, ease of deployment and customer service. The area of application of the distributions will only get bigger. It is possible that at some point in the future, more and more services will become free as the distributions compete among themselves. It is also possible that cloud services will be innovated to the point where we can control everything with our smart devices instead of PCs or laptops. Big data storage, processing, data analytics and machine learning could then be done with a few touches of a finger.

If that becomes our reality, cybersecurity will be very important for the distributions. When everything in our lives is in the cloud, it will be a disaster if the cloud is disrupted. At that time, the security and durability of a distribution will be the first concern when people choose their services.

V. Conclusion

In conclusion, each distribution has its own unique features, pros and cons. Hence, users are advised to select the distribution that best fits their own situation. Cloudera is among the most widely deployed Hadoop implementations and hence has really strong performance in handling complex data. Amazon Web Services is best known for its price flexibility, which suits every user. Microsoft Azure is the best fit for Windows services subscribers.



References

[1] H. S. Bhosale and P. D. P. Gadekar, "A Review Paper on Big Data and Hadoop," International Journal of Scientific and Research Publications, vol. 4, no. 10, October 2014.

[2] R. Beakta, "Big Data and Hadoop: A Review Paper," International Journal of Computer Science and Information Technology, vol. 2, no. 2, 2015.

[3] B. Saraladevi, N. Pazhaniraja, P. V. Paul, M. S. Basha and P. Dhavachelvan, "Big Data and Hadoop-A Study in Security Perspective," in 2nd International Symposium on Big Data and Cloud Computing, Chennai, 2015.

[4] P. Vijay and B. Keshwani, "Emergence of Big Data with Hadoop : A Review," IOSR Journal of Engineering, vol. 6, no. 3, March 2016.

[5] M. R. Ghazi and D. Gangodkar, "Hadoop, MapReduce and HDFS: A Developers Perspective," in International Conference on Intelligent Computing, Communication & Convergence, India, 2015.

[6] N. Drake, "Best cloud computing services of 2019: for Digital Transformation," TechRadar, 04-Nov-2019. [Online]. Available: https://www.techradar.com/best/best-cloud-computing-services. [Accessed: 13-Oct-2019].

[7] S. Gnanasundaram and A. Shrivastava, Information Storage and Management: Storing, Managing, and Protecting Digital Information in Classic, 2nd ed., Crosspoint Boulevard: John Wiley & Sons, Inc., 2012. [E-Book].

[8] "About | Cloudera", Cloudera, 2019. [Online]. Available: https://www.cloudera.com/about.html. [Accessed: 18-Oct-2019].

[9] A. Vance, "Bottling the Magic Behind Google and Facebook", Bits, 16-Mar-2009. [Online]. Available: https://bits.blogs.nytimes.com/2009/03/16/bottling-the-magic-behind-google-and-facebook/. [Accessed: 18-Oct-2019].

[10] R. Miller, "How AWS came to be", TechCrunch, 02-July-2016. [Online]. Available: https://techcrunch.com/2016/07/02/andy-jassys-brief-history-of-the-genesis-of-aws/. [Accessed: 29-Oct-2019].

[11] J. Furrier, "Exclusive: The Story of AWS and Andy Jassy’s Trillion Dollar Baby", Medium, 30-Jan-2015. [Online]. Available: https://medium.com/@furrier/original-content-the-story-of-aws-and-andy-jassys-trillion-dollar-baby-4e8a35fd7ed. [Accessed: 29-Oct-2019].

[12] S. Martin, "Upcoming Name Change for Windows Azure", Microsoft Azure, 24-March-2014. [Online]. Available: https://azure.microsoft.com/en-us/blog/upcoming-name-change-for-windows-azure/. [Accessed: 29-Oct-2019].

[13] A. G. Tharakan and J. Dastin, "Microsoft shares hit high as cloud business flies above estimates", Reuters, 21-Oct-2016. [Online]. Available: https://uk.reuters.com/article/uk-microsoft-results-idUKKCN12K2JC. [Accessed: 29-Oct-2019].

[14] C. Harvey, "Microsoft Azure", Datamation, 23-May-2017. [Online]. Available: https://www.datamation.com/cloud-computing/microsoft-azure.html. [Accessed: 02-Nov-2019].

[15] "Comparing Cloudera, Microsoft, Amazon Web Services (AWS)", Gartner. [Online]. Available: https://www.gartner.com/reviews/market/data-warehouse-solutions/compare/cloudera-vs-amazon-web-services-vs-microsoft. [Accessed: 02-Nov-2019].

[16] C. Harvey, "Amazon Web Services (AWS)", Datamation, 11-May-2017. [Online]. Available: https://www.datamation.com/cloud-computing/amazon-web-services.html. [Accessed: 03-Nov-2019].

[17] S. M. Kerner, "Cloudera Enterprise: Service Overview and Insight", Datamation, 25-April-2019. [Online]. Available: https://www.datamation.com/big-data/cloudera-enterprise-data-analytics-tools.html. [Accessed: 05-Nov-2019].

