Why is Python important for big data?

Big Data and Python


Overview:

Python is a general-purpose programming language that is open source, flexible, powerful and easy to use. One of its most important features is its rich set of utilities and libraries for data processing and analytics tasks. In the current era of big data, Python is gaining popularity because its easy-to-use features support big data processing.

In this article we will explore the features and packages of Python that are widely used in big data use cases. We will also walk through a real-life example that shows big data processing (unstructured data) with the help of Python packages and programming.







Some background of Python:

Python was conceived in the late 1980s, and its implementation was started in 1989 by Guido van Rossum. Python was developed as an open source project which can also be used in commercial environments. The basic philosophy of Python is to make code easy to use, more readable, and able to accomplish more tasks in fewer lines. The most attractive part of Python is its standard library, which contains ready-to-use tools for performing various tasks. As of January 2016, the Python Package Index (PyPI) contained more than 72,000 packages of third-party software.

Python features

Following are some of the important features of python which makes it a perfect fit for rapid application development.

  • Python is an interpreted language, so a program does not need to be compiled separately. The interpreter parses the program code and generates the output.
  • Python is dynamically typed, so variable types are determined automatically at run time.
  • Python is strongly typed, so values are not converted between unrelated types implicitly; developers need to cast the type explicitly.
  • Less code accomplishes more, which makes it more widely accepted.
  • Python is portable, extendable and scalable.
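
The difference between dynamic and strong typing in the list above can be illustrated with a short sketch. The variable names are chosen only for this illustration.

```python
# Dynamic typing: the same name can be bound to values of different types.
value = 42
first_type = type(value).__name__
value = "forty-two"
second_type = type(value).__name__

# Strong typing: unrelated types are not coerced implicitly.
try:
    "3" + 4
    mixed_ok = True
except TypeError:
    mixed_ok = False

# An explicit conversion (cast) makes the intent clear.
total = int("3") + 4

print(first_type, second_type, mixed_ok, total)  # int str False 7
```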

Why is Python important in big data and analytics?

Python is very popular for big data processing because it is simple to use and offers a wide set of data processing libraries. It is also preferred for building scalable applications. Another important aspect of Python is its ability to integrate with web applications. All of these features support big data processing and the quick generation of insights. Such quick, dynamic insight (which changes very frequently) is valuable to organizations, so they want a powerful language, platform or tool that delivers this value immediately and keeps them competitive in the market. Python plays an important role here and supports this business need.

How to download, install and set up Python?

On most Linux distributions, Python comes as a default installed package, so in those scenarios users do not need to install it separately. For Windows, the installer can be downloaded from the link below and then installed as per the instructions. Just remember to check ‘Add Python.exe to path’ so that Python is automatically added to the path.







Following is the download link:

https://www.python.org/downloads/

Image 1: Showing ‘Add Python.exe to path’ as checked

After the installation is complete, type ‘python’ in the command prompt as shown below. It will display details about the installation, such as the version. This also confirms that your Python installation was successful.

Image 2: Showing python command prompt

To start writing Python programs, developers need a good text editor. Notepad or any other good editor can be used.

How to run python applications? Python can be used in two ways as shown below.

  • Type python commands from command prompt
  • Write python code in script file (.py)
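
As a minimal sketch of the second option, the script below could be saved as hello.py (a hypothetical file name chosen for this illustration) and run with `python hello.py` from the command prompt. It only reports which interpreter version is running.

```python
# hello.py (hypothetical file name used for this sketch)
import sys


def describe_interpreter():
    """Return a short description of the running Python interpreter."""
    major, minor, micro = sys.version_info[:3]
    return "Running Python %d.%d.%d" % (major, minor, micro)


if __name__ == "__main__":
    # Executed only when the file is run as a script, not when imported.
    print(describe_interpreter())
```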

In our example we will use a script file to run the application.

How to process unstructured data by using python?

We have already discussed that Python is one of the favourite languages for big data processing. Big data comes from different sources, and one of the most important sources is social media such as Facebook and Twitter. Big data covers different types of data: unstructured, semi-structured or any other form. The most important part is to process it and make it useful.

In our sample application we will check how Twitter data (which is big data) can be processed by using Python.

Before we jump into the code, the following steps need to be performed:

  • Create an app by going to the Twitter development link (https://apps.twitter.com/). This will provide you with the app key, app secret, oauth_token and oauth_token_secret. All of these will be required in your application to access the Twitter data.
  • Install Twython and simplejson. The first one is a Python wrapper around the Twitter API and the second one is used for parsing JSON data.
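
Both packages are normally installed with pip (for example, `pip install twython simplejson`). A small sketch for checking that they can be imported before running the sample is shown below. The helper name `missing_packages` is hypothetical and not part of either library.

```python
import importlib.util


def missing_packages(names):
    """Return the package names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["twython", "simplejson"])
    if missing:
        print("Please install:", ", ".join(missing))
    else:
        print("All required packages are available.")
```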

Once this basic setup is complete, we are ready to look at the code.

First, we need to import the Python packages which will be used in our program.

Listing 1: Importing python packages

[Code]
import sys
import string
import simplejson
import datetime
from twython import Twython
[/Code]

In the second step we create some variables to be used in the program.

Listing 2: Setting up variables

[Code]
# Current date, used later to build the output file name
nowtm = datetime.datetime.now()
daytm = int(nowtm.day)
monthtm = int(nowtm.month)
yeartm = int(nowtm.year)
[/Code]

The third step is to create a Twython object with the OAuth tokens generated during app creation on the Twitter development site.

Listing 3: Creating authentication

[Code]
# Replace the placeholders with the credentials of your own Twitter app
t_auth = Twython(app_key='YOUR_APP_KEY',
                 app_secret='YOUR_APP_SECRET',
                 oauth_token='YOUR_OAUTH_TOKEN',
                 oauth_token_secret='YOUR_OAUTH_TOKEN_SECRET')
[/Code]
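
Hard-coding credentials in a script makes them easy to leak. One possible alternative, sketched below, reads the four values from environment variables. The variable names and the helper `load_credentials` are assumptions made for this illustration; they are not defined by Twython or Twitter.

```python
import os

# Hypothetical environment variable names, chosen for this sketch only.
CREDENTIAL_VARS = {
    "app_key": "TWITTER_APP_KEY",
    "app_secret": "TWITTER_APP_SECRET",
    "oauth_token": "TWITTER_OAUTH_TOKEN",
    "oauth_token_secret": "TWITTER_OAUTH_TOKEN_SECRET",
}


def load_credentials(env=None):
    """Build the keyword arguments for Twython from environment variables."""
    env = os.environ if env is None else env
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: env[var] for arg, var in CREDENTIAL_VARS.items()}
```

With the variables set, the authentication object from Listing 3 could then be created as `t_auth = Twython(**load_credentials())`.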

Next, assign the Twitter user ids to a variable as shown below and look up the users. After that, define the output file name and the header fields, open the output file and write the header row.

Listing 4: Creating output file and writing headers

[Code]
# Comma-separated Twitter user ids to look up
uids = "4516,9815312,132133,12343233,545334,9829867,2653636,2093829,28373663"
twusers = t_auth.lookup_user(user_id=uids)

# Output file name contains the current month, day and year
twoutfn = "twitter_user_data_%i.%i.%i.txt" % (nowtm.month, nowtm.day, nowtm.year)
usr_fields = "usr_id usr_screen_name usr_name usr_created_at usr_url".split()

# Open the output file and write a tab-separated header row
tw_outfp = open(twoutfn, "w")
tw_outfp.write("\t".join(usr_fields) + "\n")
[/Code]









The last step is to loop over the users, retrieve the relevant values from the JSON response and write them to the file.

Listing 5: Getting values and writing them to an output file

[Code]
for entry in twusers:
    # Start with an empty value for every output field
    dic_r = {}
    for tw_f in usr_fields:
        dic_r[tw_f] = ""

    # Copy the values of interest from the user record
    dic_r['usr_id'] = entry['id']
    dic_r['usr_screen_name'] = entry['screen_name']
    dic_r['usr_name'] = entry['name']
    dic_r['usr_created_at'] = entry['created_at']
    dic_r['usr_url'] = entry['url']

    # Convert each value to text and unescape JSON-style "\/" sequences
    final_lst = []
    for tw_f in usr_fields:
        final_lst.append(str(dic_r[tw_f]).replace("\\/", "/"))

    # Write one tab-separated row per user
    tw_outfp.write("\t".join(final_lst) + "\n")

tw_outfp.close()
[/Code]
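
The same row-writing logic can be tried without a Twitter account. The sketch below is an alternative formulation, not the article's listing: it uses the standard csv module and a made-up sample record in place of a real API response. The helper name `write_users` is hypothetical.

```python
import csv
import io

USER_FIELDS = ["usr_id", "usr_screen_name", "usr_name", "usr_created_at", "usr_url"]


def write_users(records, out_file):
    """Write a header row, then one tab-separated row per user record."""
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerow(USER_FIELDS)
    for entry in records:
        writer.writerow([
            entry.get("id", ""),
            entry.get("screen_name", ""),
            entry.get("name", ""),
            entry.get("created_at", ""),
            entry.get("url") or "",   # a missing URL becomes an empty field
        ])


# Made-up sample record standing in for a real API response.
sample = [{"id": 1, "screen_name": "example", "name": "Example User",
           "created_at": "Mon Nov 29 21:18:15 +0000 2010", "url": None}]
buffer = io.StringIO()
write_users(sample, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing a file opened with `open(..., "w", newline="")` would write the same rows to disk.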

The output data from Twitter is now ready for processing on a Hadoop platform. The data can be parsed by a MapReduce program to extract analytics value. The same technique can be applied to any unstructured data.
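
To give a feel for the MapReduce idea, here is a pure-Python sketch that counts users per account-creation year. It runs locally and only imitates the map and reduce phases; it is not a Hadoop job. The sample rows are made up, and the assumption that the year is the last token of the created_at value is based on the date format used in the sample.

```python
from collections import defaultdict


def map_user(line):
    """Map step: emit (creation_year, 1) for one tab-separated user row."""
    fields = line.rstrip("\n").split("\t")
    created_at = fields[3]           # usr_created_at column
    year = created_at.split()[-1]    # assumes the year is the last token
    return year, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Made-up rows in the same layout as the output file (header row skipped).
rows = [
    "1\talice\tAlice\tMon Nov 29 21:18:15 +0000 2010\t",
    "2\tbob\tBob\tTue Mar 02 10:00:00 +0000 2010\t",
    "3\tcarol\tCarol\tWed Jan 05 08:30:00 +0000 2011\t",
]
counts = reduce_counts(map_user(row) for row in rows)
print(counts)
```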

Limitations of Python:

Although Python has a lot of positive sides, it also has some limitations, as every language does. Let us have a brief look at those cons.

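  • Python does not have proper multi-processor support: the global interpreter lock (GIL) in the standard interpreter prevents threads in one process from running Python code on several cores at once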
  • Lack of commercial support
  • Does not have good pre-packaged solutions
  • Lack of good documentation
  • Database layer is a bit old fashioned, although work is going on in this area.
  • Lack of UI development framework

Success stories:

Python is growing rapidly and its practical implementations are also encouraging. Some of the success stories are mentioned below.

  • Python has been used to improve image processing from the Hubble Space Telescope
  • YouTube has used it to develop its massive scalable web applications
  • Google’s internal infrastructure is also powered by Python
  • Companies such as Sony, DreamWorks and Disney use Python for coordinating clusters of computers for image processing









Conclusion: Python is one of the most successful languages for big data and analytics applications, and its popularity is growing day by day. In this article, we have covered a brief background, the features and the installation of the software. We have also discussed specific features which are relevant to big data applications. In spite of some limitations, Python is a good choice for big data processing and analytics.

Kaushik (Author): Technical Architect by profession, having more than 20 years of experience in the IT industry. Passionate about the technology world. Interested in software design, open source technologies, Big Data, AI and technology consulting. Teaching and mentoring IT professionals for more than 12 years. Also involved in online/offline training, interviewing, consulting, mentoring and coaching.

LinkedIn Profile – https://www.linkedin.com/in/kaushik-pal-36b36915
