Friday, September 20, 2019

Programming Languages for Data Analysis

R and Python for Data Analysis

Abstract

This paper compares popular programming languages for data analysis. There are plenty of choices for data science, such as Java, R and Python, and a great deal of research has been carried out on their strengths; we discuss two of them here. Data analytics has become one of the most important and trusted tools for businesses and markets, and it increasingly makes use of SaaS (Software as a Service). For this literature review, two popular languages (R and Python) were studied and their characteristics evaluated to decide which is the right language for data analysis. Both languages show their own strengths and weaknesses, and these also help in understanding data processing environments built on distributed file systems.

Keywords: Programming language; Data analytics; R; Python; Big Data

Growing in a market is not an easy task for an industry. With the help of data analytics, it can grow bigger and better: analytics helps to deliver quick corporate results and value to the business. The major challenge with data is to process it and then make decisions that are worth something. Data crunching requires proper tools and powerful analysis. Out of all the languages available, we choose two popular ones, R and Python, for data analysis. We discuss the need for a programming language in data analysis and list some of the characteristics of these two languages. In the end, we conclude which language performs and delivers in the field of data analysis.

While carrying out research in data analytics, we came across several programming languages apart from R and Python, which are described below:

Julia: Not yet a well-recognized language, but hackers certainly talk about it. It is said to be faster than R and more scalable than Python. [5]

Java: In comparison to R and Python, Java seems less capable in terms of data visualization, but it can be the first choice for prototyping a statistical system. [6]

MATLAB: Became popular and was in use before the release of Python and R.

To judge whether a programming language is a good fit, we should consider different aspects of data analysis. For this review, we broadly classify them as follows:

Collection of Raw Data: Data is available in a variety of formats. Programming languages were evaluated in terms of their support for various data formats and their efficiency in handling them.

Data Processing: Once imported into a program, datasets may require cleansing of missing values, unrelated or redundant values and so on. Programming languages were evaluated on their capabilities for dealing with such data.

Data Exploration: Programming languages were evaluated on how simply they apply commonly used statistical methods such as grouping, pattern recognition, switching and sorting.

Data Analysis: The availability of special-purpose built-in functions and of various methods for machine learning and deep analysis is used as an evaluation measure.

Data Visualization: Visualization is an important aspect of data analytics. The visualization capabilities of programming languages were evaluated on the basis of ease of creation, simplicity and sharing in various formats.

In addition to these capabilities, we discuss a little of the history and accolades of each programming language. We also discuss popular choices of IDE (Integrated Development Environment) for these languages.

Introduced in 1995 by Ross Ihaka and Robert Gentleman, R is an implementation of the S programming language (Bell Labs). At the time of writing, the latest version is 3.1.3, released in March 2015. R's architectural design and evolution are maintained by the R Foundation and the R Core Group. [1] R's software environment is written primarily in C, Fortran and R. RStudio is a very popular IDE for data analysis in R.
Primarily used for academic research, R is rapidly expanding into the enterprise market. [1]

A. Collection of Raw Data

You can import data from a variety of formats, such as Excel, CSV and text files. Data frames, the primary data structure in R, can also be populated from SPSS or Minitab files. In short, R handles data from most common sources without a glitch. Where R is less strong is in collecting data from the web, and a lot of work is being carried out to address this limitation. To name a few examples, the rvest package performs basic web scraping, and magrittr provides the pipe operator commonly used to chain the scraping and parsing steps. [1][3]

B. Data Processing

It is very easy to reshape a data frame in R. Tasks such as adding new columns or populating missing values can be done with just one line of code. Packages such as reshape2 allow users to manipulate data frames to fit whatever criteria the requirements set. [3]

C. Data Exploration

R is built by statisticians. For exploratory work it is easy for beginners, and many models can be written in very few lines of code. With R, users can build probability distributions and apply statistical methods for machine learning. For advanced work in analytics, optimization and analysis, users may have to rely on third-party packages. [3] Popular packages such as zoo (for working with time series) and caret (machine learning) represent the strength of R.

D. Data Visualization

Visualization is R's forte. R was built to perform statistical analysis and to present the results. By default, R allows you to make basic charts and plot graphs, which can be saved in a variety of formats such as JPEG or PDF. With advanced packages such as ggvis, lattice and ggplot2, users can extend the data visualization capabilities of an R program. [1][3]

Python is a dynamically typed programming language with a very wide user base. Created by Guido van Rossum in 1991, Python is inspired by C, Modula-3 and, in particular, ABC. The Python Software Foundation (PSF) is the curator of the Python language.
At the time of writing, the current versions are 3.4.3 (released in February 2015) and 2.7.9 (released in December 2014). Python has been a popular choice for programmers building web and multitier applications. In the context of data analytics, Python is mostly used by programmers to apply statistical techniques. Coding in Python is easy because of its clean syntax. [4] IPython Notebook and the Anaconda distribution are popular environments for data analysis in Python.

A. Collection of Raw Data

In addition to Excel, CSV and text data, Python also supports JSON and semi-structured data formats such as XML and YAML. Using certain libraries, users can import SQL tables into a Python program. [4] The Python Requests library facilitates web scraping, letting users fetch data from websites to analyze in depth. [2]

B. Data Processing

To uncover underlying information, Python's pandas library comes in handy. As in R, data is held in DataFrames, which can be used and reused throughout a program without hampering performance. [2] Users can apply standard methods for cleaning data or for filling in incomplete information, just as in R.

C. Data Exploration

pandas is a very powerful library. Users can group by data values and sort them as time series. Complex grouping operations, such as time-series analysis at a resolution of seconds, can be performed on DataFrames in a Python program.

D. Data Visualization

Using the Matplotlib library [2], users can plot basic graphs and charts from the available data points. For advanced visualization, Plotly, another Python library, can be used. Users can work in powerful environments such as Anaconda or IPython Notebook to create rich visualizations and convert them into various formats such as HTML.

Beyond their differences, there are a few common positives about Python and R that make them so popular among data analysts and statisticians. Both are distributed under open licenses, which makes them free to download and modify as users need, in contrast to other tools such as SAS and SPSS, which come with a hefty price tag.
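As a rough illustration of the Python workflow described in sections A to C above (collecting data from CSV and JSON, filling in missing values, then grouping), here is a minimal sketch. It deliberately uses only the standard library so that it runs without installing anything; the sample data, column names and helper function names are hypothetical and chosen purely for illustration. In pandas, the same steps are usually expressed more compactly with `read_csv`, `fillna` and `groupby`.

```python
import csv
import io
import json
from collections import defaultdict
from statistics import mean

# Hypothetical sample data; the column names and values are made up.
CSV_TEXT = """city,temperature
Dublin,14.0
Dublin,
Cork,16.0
Cork,18.0
"""
JSON_TEXT = '[{"city": "Galway", "temperature": 15.0}]'


def load_records(csv_text, json_text):
    """Collect raw data: read rows from CSV and JSON into one list of dicts."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row["temperature"]
        # An empty CSV field is treated as a missing value (None).
        rows.append({"city": row["city"],
                     "temperature": float(raw) if raw else None})
    rows.extend(json.loads(json_text))
    return rows


def fill_missing(rows):
    """Process data: replace missing temperatures with the mean of known ones."""
    known = [r["temperature"] for r in rows if r["temperature"] is not None]
    fallback = mean(known)
    return [
        dict(r, temperature=fallback if r["temperature"] is None else r["temperature"])
        for r in rows
    ]


def mean_by_city(rows):
    """Explore data: group rows by city and average the temperature per group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["city"]].append(r["temperature"])
    return {city: mean(values) for city, values in sorted(groups.items())}


records = fill_missing(load_records(CSV_TEXT, JSON_TEXT))
summary = mean_by_city(records)
print(summary)
```

Filling gaps with the overall mean is only one of several possible cleaning strategies; dropping incomplete rows or filling per group may suit other datasets better.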
Being open source, Python and R tend to receive many advances in statistics first. [6] Both are widely loved and supported by a large community of statisticians and developers. [6]

An environment such as IPython Notebook consolidates your datasets in one file, thereby simplifying your workflow. [2] R has a rich ecosystem of cutting-edge packages for stringing your work together, which proves particularly useful for data analysis. [3] Python is more of a general-purpose language. It is easy and intuitive, and therefore has a gentler learning curve. Python's testing frameworks help ensure the reusability and reliability of code. In short, R is a language developed by statisticians for statisticians, while Python is an easier-to-learn general-purpose programming language. [3]

Research into programming languages for data analytics turns up many other options, listed below:

Julia: Though not yet widely recognized, data hackers talk fondly of Julia. It is regarded as faster than R and more scalable than Python. [5]

Java: Although Java is not as capable as Python and R in terms of visualization, it can be the primary choice for building a prototype of a statistical system. [6]

Kafka: Developed by LinkedIn, Kafka is highly regarded for its real-time analytics capabilities. [6]

Storm: A framework, written mainly in Clojure, that has seen a recent wave of popularity in Silicon Valley.

MATLAB and Excel: Used by many statisticians before the rise of Python and R.

Special thanks to Prof. Oisin Creaner for presenting this opportunity to dig into the various options available for programming in data analytics.

[1] Ihaka, R. and Gentleman, R., 1996. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3), pp.299-314.
[2] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. and Vanderplas, J., 2011. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, pp.2825-2830.
[3] Nasridinov, A. and Park, Y.H., 2013, September. Visual Analytics for Big Data Using R. In Cloud and Green Computing (CGC), 2013 Third International Conference on (pp. 564-565). IEEE.
[4] Sanner, M.F., 1999. Python: a programming language for software integration and development. J Mol Graph Model, 17(1), pp.57-61.
[5] Bezanson, J., Karpinski, S., Shah, V.B. and Edelman, A., 2012. Julia: A fast dynamic language for technical computing. arXiv preprint arXiv:1209.5145.
[6] Fan, W. and Bifet, A., 2013. Mining big data: current status, and forecast to the future. ACM SIGKDD Explorations Newsletter, 14(2), pp.1-5.
