GroupActivity_02 : ExampleMapReduceProgram MapReduce WordCount Program
---------
1. Install the required software :
   - Putty (for Windows); use Terminal (for Apple Macintosh)
   - WinSCP (for Windows), CyberDuck (for Apple Macintosh)
   - Oracle VirtualBox
   - Cloudera - by default it comes with Eclipse and the Hadoop packages installed, which can be used to program MapReduce programs. (You can find many tutorials for installing Cloudera.)
2. Read from the textbook "Hadoop: The Definitive Guide" http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1491901632 :
   - Chapter 1 Meet Hadoop
   - Chapter 2 MapReduce
   - Chapter 3 The Hadoop Distributed File System
   - Chapter 6 Developing a MapReduce Application
   - Chapter 7 How MapReduce Works
   - Chapter 8 MapReduce Types and Formats
   - Chapter 9 MapReduce Features
3. Read the 'MapReduce Tutorial' at https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
4. Copy the 'WordCount v1.0' program and run it on your local Hadoop cluster. To install a single-node cluster on your laptop, install Oracle VirtualBox and Cloudera. Use Eclipse, which comes with Cloudera.
   (NOTE: For the input file of the WordCount program, use the Mammals book from http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/03_MammalsBook_Text_34848.txt.utf8.txt)
   4.1. When you open Eclipse, you will see a training project. Locate the paths of its external libraries (the Hadoop jar files).
   4.2. Create a new Java project and import all of those jar files.
   4.3. Copy WordCount v1.0 into the new project. Import any required jar files that are still missing.
   4.4. Convert the project into a jar file.
   4.5. Since it is a single-node cluster, you can execute the HDFS commands directly in the terminal (a sketch of the commands is given after step 11 below).
   4.6. Execute the program and save the output into an output text file.
5. Take a screenshot of your program output.
6. Upload your source code (jar file), your output text file, and your screenshot to Canvas | Project link.
7. Each student should upload this part of the project individually.
8. Copy the 'WordCount v2.0' program and run it on the UNC Charlotte DSBA Hadoop cluster.
   8.1. Copy the 'WordCount v2.0' program and upload it to the dsba-hadoop.uncc.edu server; it will be copied into your user directory.
   8.2. Copy the program into your HDFS directory (see steps 5.1 - 5.5 below).
   8.3. Run 'WordCount v2.0' in the clustered environment. (This program will *not* run on your laptop, as it needs to use multiple machines.)
   8.4. One method is to write and compile the MapReduce code in Cloudera's Eclipse itself, zip the project, and transfer it with your FTP client using the following command :
        scp filename.zip username@dsba-hadoop.uncc.edu:/users/username
   8.5. Alternatively, you can get the Hadoop jar files for Eclipse from the internet and import them into your project (as you did with 'WordCount v1.0' in Step 4), convert your project into a jar file, and use the Putty software to run your MapReduce program. Get the output and store it on your local system.
   8.6. Execute the program and save the output into an output text file.
9. Take a screenshot of your program output.
10. Upload your source code (jar file), your output text file, and your screenshot to Canvas | Project link.
11. Each student should upload this part of the project individually.
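For steps 4.5 and 4.6, the terminal commands below are a minimal sketch of one way to run the job on the Cloudera single-node cluster. The jar name wordcount.jar, the main class org.myorg.WordCount (the package used in the tutorial's WordCount v1.0), the HDFS home directory /user/cloudera (the default user on the Cloudera VM), and the output file name wordcount_output.txt are all assumptions - adjust them to match your own project.

# Placeholder names: wordcount.jar, org.myorg.WordCount, /user/cloudera - adjust as needed.
hadoop fs -put ./03_MammalsBook_Text_34848.txt.utf8.txt /user/cloudera/wordcount_input
hadoop jar ./wordcount.jar org.myorg.WordCount /user/cloudera/wordcount_input /user/cloudera/wordcount_output
hadoop fs -cat /user/cloudera/wordcount_output/part-* > wordcount_output.txt

Once the jar has been copied to dsba-hadoop.uncc.edu with scp (step 8.4), the same hadoop fs -put and hadoop jar commands can be reused on the DSBA cluster for the WordCount v2.0 run in step 8.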
--------------------------
To log in to the Hadoop cluster, follow the instructions below :
1. Log in to Hadoop via an FTP client (in order to copy and paste data and to view files).
   Open your FTP client (WinSCP or CyberDuck). Choose Session | New Session, File protocol: SFTP, Host Name: dsba-hadoop.uncc.edu. Type your UserName and Password, click Save, and check the Save Password checkbox.
2. Copy the datasets (Car and MammographicMass) and ListOfInputActionRules onto Hadoop via FTP. The files will be copied into your home folder users/yourUserName.
3. Install Putty or an SSH client.
4. Log in to dsba-hadoop.uncc.edu via Putty or the SSH client (in order to run commands).
   - For Windows, use Putty; see instructions at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html
   - For Apple Macintosh, use Applications | Terminal; see instructions at https://support.rackspace.com/how-to/connecting-to-linux-from-mac-os-x-by-using-terminal/
5. Run sample text processing on the ListOfInputActionRules. ListOfInputActionRules is a text file containing one action rule per line. For example :
   (a, a1->a2) ^ (c = c2) -> (f, f1->f0) [2, 50%]
   (a, a1->a3) ^ (b, ->b1) -> (f, f1->f0) [3, 75%]
   (a, a1->a3) ^ (c = c2) -> (f, f1->f0) [1, 80%]
   (a, ->a3) ^ (b, b2->b1) -> (f, f1->f0) [3, 50%]
   5.1. Create a new input file called Input4Grep (on the Hadoop file system) :
        hadoop fs -put ./ListOfInputActionRules.txt /user/yourUserName/Input4Grep
   5.2. See if the file was created (on the Hadoop file system) :
        hadoop fs -ls /user/yourUserName
   5.3. Return all lines of text (action rules) which contain the word "a1" :
        hadoop org.apache.hadoop.examples.Grep /user/yourUserName/Input4Grep /user/yourUserName/Out4Grep01 ".*a1.*"
   5.4. Copy the file Out4Grep01 (from the Hadoop file system) into your home folder users/yourUserName :
        hadoop fs -get /user/yourUserName/Out4Grep01 /users/yourUserName
   5.5. Check via FTP whether Out4Grep01 is there.
6. Repeat Step 5 using the Mammals book text file below, and return all lines of text which contain the word 'mammal' (a sketch of the commands is given at the end of this document) :
   http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/03_MammalsBook_Text_34848.txt.utf8.txt
--------------------------
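For Step 6, the lines below are a minimal sketch that mirrors steps 5.1 - 5.4. It assumes the Mammals book text is first downloaded into your home directory on dsba-hadoop.uncc.edu (the wget step), and the HDFS paths Input4GrepMammals and Out4GrepMammals are placeholder names for directories that do not exist yet.

# Download the input text, load it into HDFS, grep for 'mammal', and fetch the result.
wget http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/03_MammalsBook_Text_34848.txt.utf8.txt
hadoop fs -put ./03_MammalsBook_Text_34848.txt.utf8.txt /user/yourUserName/Input4GrepMammals
hadoop org.apache.hadoop.examples.Grep /user/yourUserName/Input4GrepMammals /user/yourUserName/Out4GrepMammals ".*mammal.*"
hadoop fs -get /user/yourUserName/Out4GrepMammals /users/yourUserName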