Apache Hadoop – A comparison among different components

Apache Hadoop is an open-source software framework written in Java. It is primarily used for storing and processing very large data sets, commonly known as big data. Hadoop comprises several components that together allow large data volumes to be stored and processed in a clustered environment. The two main components are the Hadoop Distributed File System and the MapReduce programming model.

In this article, we will first take a look at the various components that make up Apache Hadoop and then some of the integrated systems and databases.

Components of Apache Hadoop

Hadoop, as a whole, consists of the following parts.

Hadoop Distributed File System – Abbreviated as HDFS, it is a file system similar in many ways to existing ones. However, it is a virtual file system, layered over the local storage of the servers in a cluster.

There is one notable difference from other popular file systems: when a file is moved into HDFS, it is automatically split into smaller blocks. These blocks are then replicated on three different servers by default, so that a copy remains available if a server fails. The replication count is not fixed and can be set as per requirements.
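The splitting and replication described above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names are hypothetical, the 128 MB block size is an assumed setting, and the round-robin placement is only a stand-in for the rack-aware policy that real HDFS applies.

```python
# Toy model of HDFS-style block splitting and replica placement.
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: real HDFS uses a rack-aware policy; round-robin is used
    here only to show that the copies land on different servers.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


# A 300 MB file becomes two full blocks plus one partial block.
blocks = split_into_blocks(300 * MB)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
raw_storage = sum(blocks) * 3  # every block is stored three times
```

With a replication count of three, the cluster spends three times the file size in raw disk space, which is the price paid for surviving the loss of a server.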

Hadoop MapReduce – MapReduce is the programming model of Hadoop that allows such large volumes of data to be processed.

It breaks a request down into smaller requests, which are then sent to multiple servers. This allows the combined CPU power of the cluster to be used in a scalable way.
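To make the idea concrete, here is a minimal word-count sketch in plain Python that mimics the map, shuffle and reduce steps. It does not use the Hadoop API; all function names are illustrative, and the "splits" run one after another here, whereas on a cluster each would go to a different server.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(splits):
    # Each split stands for the share of input one mapper would receive.
    pairs = [
        pair
        for split in splits
        for line in split
        for pair in map_phase(line)
    ]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


splits = [["big data big cluster"], ["data node", "big job"]]
counts = word_count(splits)
```

Because every mapper works only on its own split and every reducer only on its own keys, adding servers adds capacity without changing the program.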

HBase – HBase is a layer that sits atop HDFS and is written in Java. HBase has the following characteristics –

  • Non relational
  • Highly scalable
  • Fault tolerant

Every row in HBase is identified by a key. The number of columns is not fixed; instead, columns are grouped into column families.
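The row-key and column-family model can be pictured as nested maps. The sketch below is a toy in-memory stand-in written in Python, not the HBase client API; the class `ToyTable` and the sample table are hypothetical and exist only to illustrate the layout.

```python
class ToyTable:
    """In-memory stand-in for an HBase-style table (hypothetical class)."""

    def __init__(self, column_families):
        # Column families are declared up front, when the table is created.
        self.column_families = set(column_families)
        # row key -> {family -> {column qualifier -> value}}
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {})
        row.setdefault(family, {})[qualifier] = value

    def get(self, row_key):
        """Look a row up directly by its key."""
        return self.rows.get(row_key, {})

    def scan(self):
        """Yield rows in sorted row-key order."""
        for key in sorted(self.rows):
            yield key, self.rows[key]


users = ToyTable(["info", "contact"])
users.put("user2", "info", "name", "Bob")
users.put("user1", "info", "name", "Alice")
users.put("user1", "contact", "email", "alice@example.com")
```

Note that "user1" has a column that "user2" lacks: the set of columns may differ from row to row, while the column families stay the same for the whole table.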

ZooKeeper – This is a centralized service that maintains –

  • Configuration information
  • Naming information
  • Synchronization information

Besides these, ZooKeeper also provides group services and is used by HBase. It can also be used by MapReduce programs.

Solr/Lucene – This is a search engine. Its libraries are developed by Apache and took more than 10 years to reach their present robust form.

Programming Languages – There are basically two programming languages that are identified as the original Hadoop programming languages. These are –

  • Hive
  • Pig

Besides these, a few other programming languages can be used for writing programs, namely C, JAQL and Java. SQL can also be used directly to interact with the database, although that requires standard JDBC or ODBC drivers.

Systems for integrated Hadoop operations

Most enterprise vendors have their own Hadoop products, which also include their database and analytical offerings. These offerings do not require you to source Hadoop from elsewhere; they provide it as a core part of their solutions.

Some of these are –

EMC Greenplum

Greenplum is a relatively new entrant in the enterprise business and has a reputation as a strong provider of analytics. It is offered as a Unified Analytics Platform, which consists of –

  • The Greenplum database, meant for use on structured data
  • Its Hadoop distribution, known as Greenplum HD
  • A productivity layer for data science teams, called Chorus

IBM

IBM’s enterprise distribution of Hadoop is known as InfoSphere BigInsights. It adds an array of features to Hadoop, such as –

  • Tools for management
  • Tools for administration
  • Textual data analysis tools that help in entity resolution, such as identifying people, phone numbers, addresses and more

Using the JAQL query language, Hadoop can be integrated with various IBM products such as DB2 or even Netezza. BigSheets, a spreadsheet-like application that works on big data, is also offered. At present, BigInsights can only be used over the cloud, through providers such as Amazon, Rackspace and RightScale.

Microsoft

Hadoop forms the core of Microsoft’s big data offering. Pursuing an integrated approach, Microsoft plans to make big data available across its analytics tool suite.

Microsoft’s big data solutions have been brought to the Windows Server platform and to the cloud-based Windows Azure platform. The company has laid out its own Hadoop distribution, integrated with Windows System Center and Active Directory. Further, it integrates Hadoop with SQL Server, Visual Studio and .NET.

Oracle

Oracle entered the world of big data with an appliance-based approach in the form of the Big Data Appliance. This ensures easy Hadoop integration and comes with the new Oracle NoSQL Database, which allows for analytics and has connectors to Oracle databases and the Exadata warehousing lineup. The NoSQL database is a scalable key-value database offering.

Oracle has also integrated the R analytical platform with Hadoop, which makes it easy to ship. Its R Enterprise product likewise allows easy integration with both the database and Hadoop.

Databases for analytics with Hadoop connectivity

Databases that support Massively Parallel Processing (MPP) are largely meant to process structured big data, in contrast to Hadoop’s specialization in unstructured data. Greenplum, along with the much older Aster Data and Vertica, are the best examples of early pioneers in this regard.

These MPP databases handle specialized workloads in analytics and data integration. They provide connectors to Hadoop and to other data warehousing platforms.

Of late, these database solutions have been acquired by other players in the industry –

  • Aster Data has been acquired by Teradata
  • HP has acquired Vertica
  • Greenplum is now under EMC

Hadoop-centered companies

To meet the developer-driven ideal of the big data world, Hadoop distributions are very often offered as community editions. Such editions lack enterprise management features, but include all of the functionality that may be required for development and evaluation.

Cloudera

Cloudera is the oldest company providing Hadoop distributions. It offers enterprise solutions along with training, services and support options. Cloudera has also made numerous open-source contributions to Hadoop.

Hortonworks

Hortonworks has a long history with Hadoop. It grew out of Yahoo, and as an originator of Hadoop, it aims to promote core Hadoop technology. It has also partnered with Microsoft to improve Microsoft’s Hadoop integration.

Conclusion

This article has outlined the various modules that make up Hadoop, along with the enterprise and community editions currently available. With Hadoop gaining more prominence, it is only a matter of time before more entrants are added to this growing list.
