(Apache Flume and Apache Spark)
About author: Dr. Poornima G. Naik has twenty-two years of teaching and research experience in the field of Computer Science. She has taught in thrust areas of computer science such as Big Data Analytics, Mobile Computing, Information System Security and Cryptography, and her current research areas are Cyber Security, Machine Learning, Soft Computing and Big Data Analytics. She has guided three M.Phil. students, published more than 50 research papers in national and international journals, and presented more than 30 papers at international and national conferences. She has authored 18 books on various cutting-edge technologies in information technology. She is the recipient of the prestigious Dr. APJ Abdul Kalam Lifetime Achievement National Award for remarkable achievements in the field of Teaching, Research and Publications, awarded by the International Institute for Social and Economic Reforms, Bangalore. She has immense experience in guiding academic projects with computer-aided tools and has guided many industrial projects in project management, core banking solutions and e-Learning solutions.
About book: Big data analytics has emerged as a revolution in the field of information technology. It is an organization's ability to stay agile that gives it a competitive edge over its competitors. Data harvesting and data analytics enable an organization to identify new opportunities, which in turn results in more efficient operations, smarter business moves and higher business turnover. These issues are addressed by big data analytics and its initiatives.

Chapter 7 deals with the Flume model and the configuration of an Apache Flume agent. The Flume topics that are more difficult to comprehend and assimilate, such as interceptors, channel selectors, event serializers and sink processors, are explored in greater detail with suitable working examples. A salient feature of Chapter 7 is its case study illustrating how to fetch Twitter data and store it in HDFS, and how to fetch data from the sequence generator and netcat sources. The chapter concludes with the installation of Flume and hands-on lab sessions with Apache Flume.

In the big data era, Apache Spark emerged to address the high latency associated with the MapReduce model, which it overcomes through its own specialized fundamental data structure, the Resilient Distributed Dataset (RDD). The RDD is key to the high performance of Apache Spark. The in-memory computation supported by Spark makes it a platform of choice for implementing machine learning algorithms, graph algorithms, etc., which would otherwise incur high latency in the MapReduce processing paradigm. Chapter 8 focuses on the different components of Apache Spark and Spark deployment architectures. Programming with RDDs is covered with special emphasis on Spark functional programming, RDD transformations and actions, chaining of RDD transformations, and Spark lazy evaluation. A step-by-step procedure for installing Spark over Hadoop is described.
The main highlight of the chapter is setting up a virtual multi-node cluster using VMware Workstation 15.0 and configuring the master and slave nodes.
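To give a flavour of the Flume agent configuration covered in Chapter 7, a minimal sketch of an agent definition is shown below. This is an illustrative fragment, not an excerpt from the book; the agent name a1, component names r1/c1/k1 and port 44444 are assumptions. It wires a netcat source through an in-memory channel to a logger sink:

```properties
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Netcat source: listens for lines of text on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Memory channel: buffers events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Logger sink: writes events to the agent's log output
a1.sinks.k1.type = logger

# Bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such a configuration would typically be started with the flume-ng launcher, e.g. `flume-ng agent --conf conf --conf-file example.conf --name a1`; a real Twitter-to-HDFS pipeline of the kind the chapter describes would swap in a Twitter source and an HDFS sink.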