GroupActivity_02 : ExampleMapReduceProgram MapReduce WordCount Program
---------
1. Install the required software :
   - Putty (for Windows); use Terminal (for Apple Macintosh)
   - WinSCP (for Windows), CyberDuck (for Apple Macintosh)
   - Oracle VirtualBox
   - Cloudera - by default it comes with Eclipse and the Hadoop packages installed, which can be used to program MapReduce programs. (You can find many tutorials for installing Cloudera.)
2. Read from the textbook "Hadoop: The Definitive Guide" http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1491901632 :
   - Chapter 1 Meet Hadoop
   - Chapter 2 MapReduce
   - Chapter 3 The Hadoop Distributed File System
   - Chapter 6 Developing a MapReduce Application
   - Chapter 7 How MapReduce Works
   - Chapter 8 MapReduce Types and Formats
   - Chapter 9 MapReduce Features
3. Read the 'MapReduce Tutorial' at https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
4. Copy the 'WordCount v1.0' program and run it on your local Hadoop cluster. To install a single-node cluster on your laptop, install Oracle VirtualBox and Cloudera. Use Eclipse, which comes with Cloudera.
   (NOTE: For the input file of the WordCount program, use the Mammals book from http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/03_MammalsBook_Text_34848.txt.utf8.txt)
   4.1. When you open Eclipse, you will see a training project. Locate the paths of its external libraries (the Hadoop jar files).
   4.2. Create a new Java project and import all of those jar files.
   4.3. Copy WordCount v1.0 into the new project. Import any required jar files that are still missing.
   4.4. Convert the project into a jar file.
   4.5. Since it is a single-node cluster, you can execute the HDFS commands directly in the terminal (a sketch of the commands is given after step 11 below).
   4.6. Execute the program and save the output into an output text file.
5. Take a screenshot of your program output.
6. Upload your source code (jar file), your output text file, and your screenshot to Canvas | Project link.
7. Each student should upload this part of the project individually.
8. Copy the 'WordCount v2.0' program and run it on the UNC Charlotte DSBA Hadoop cluster.
   8.1. Copy the 'WordCount v2.0' program and upload it to the dsba-hadoop.uncc.edu server; it will be copied into your user directory.
   8.2. Copy the program into your HDFS directory (see steps 5.1 - 5.5 below).
   8.3. Run 'WordCount v2.0' in the clustered environment. (This program will *not* run on your laptop, as it needs to use multiple machines.)
   8.4. One method is to write and compile the MapReduce code in Cloudera's Eclipse itself, zip the project, and transfer it with your FTP client using the following command :
        scp filename.zip username@dsba-hadoop.uncc.edu:/users/username
   8.5. Alternatively, you can get the Hadoop jar files for Eclipse from the internet and import them into your project (as you did with 'WordCount v1.0' in Step 4), convert your project into a jar file, and use the Putty software to run your MapReduce program. Get the output and store it on your local system.
   8.6. Execute the program and save the output into an output text file.
9. Take a screenshot of your program output.
10. Upload your source code (jar file), your output text file, and your screenshot to Canvas | Project link.
11. Each student should upload this part of the project individually.
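For steps 4.5 and 4.6, the terminal commands below are a minimal sketch of one way to run the job on the Cloudera single-node cluster. The jar name wordcount.jar, the main class org.myorg.WordCount (the package used in the tutorial's WordCount v1.0), the HDFS home directory /user/cloudera (the default user on the Cloudera VM), and the output file name wordcount_output.txt are all assumptions - adjust them to match your own project.

# Placeholder names: wordcount.jar, org.myorg.WordCount, /user/cloudera - adjust as needed.
hadoop fs -put ./03_MammalsBook_Text_34848.txt.utf8.txt /user/cloudera/wordcount_input
hadoop jar ./wordcount.jar org.myorg.WordCount /user/cloudera/wordcount_input /user/cloudera/wordcount_output
hadoop fs -cat /user/cloudera/wordcount_output/part-* > wordcount_output.txt

Once the jar has been copied to dsba-hadoop.uncc.edu with scp (step 8.4), the same hadoop fs -put and hadoop jar commands can be reused on the DSBA cluster for the WordCount v2.0 run in step 8.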
--------------------------
To log in to the Hadoop cluster, follow the instructions below :
1. Log in to Hadoop via an FTP client (in order to copy and paste data and to view files).
   Open your FTP client (WinSCP or CyberDuck). Choose Session | New Session, File protocol: SFTP, Host Name: dsba-hadoop.uncc.edu. Type your UserName and Password, click Save, and check the Save Password checkbox.
2. Copy the datasets (Car and MammographicMass) and ListOfInputActionRules onto Hadoop via FTP. The files will be copied into your home folder users/yourUserName.
3. Install Putty or an SSH client.
4. Log in to dsba-hadoop.uncc.edu via Putty or the SSH client (in order to run commands).
   - For Windows, use Putty; see instructions at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html
   - For Apple Macintosh, use Applications | Terminal; see instructions at https://support.rackspace.com/how-to/connecting-to-linux-from-mac-os-x-by-using-terminal/
5. Run sample text processing on the ListOfInputActionRules. ListOfInputActionRules is a text file containing one action rule per line. For example :
   (a, a1->a2) ^ (c = c2) -> (f, f1->f0) [2, 50%]
   (a, a1->a3) ^ (b, ->b1) -> (f, f1->f0) [3, 75%]
   (a, a1->a3) ^ (c = c2) -> (f, f1->f0) [1, 80%]
   (a, ->a3) ^ (b, b2->b1) -> (f, f1->f0) [3, 50%]
   5.1. Create a new input file called Input4Grep (on the Hadoop file system) :
        hadoop fs -put ./ListOfInputActionRules.txt /user/yourUserName/Input4Grep
   5.2. See if the file was created (on the Hadoop file system) :
        hadoop fs -ls /user/yourUserName
   5.3. Return all lines of text (action rules) which contain the word "a1" :
        hadoop org.apache.hadoop.examples.Grep /user/yourUserName/Input4Grep /user/yourUserName/Out4Grep01 ".*a1.*"
   5.4. Copy the file Out4Grep01 (from the Hadoop file system) into your home folder users/yourUserName :
        hadoop fs -get /user/yourUserName/Out4Grep01 /users/yourUserName
   5.5. Check via FTP whether Out4Grep01 is there.
6. Repeat Step 5 using the Mammals book text file below, and return all lines of text which contain the word 'mammal' (a sketch of the commands is given at the end of this document) :
   http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/03_MammalsBook_Text_34848.txt.utf8.txt
--------------------------
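For Step 6, the lines below are a minimal sketch that mirrors steps 5.1 - 5.4. It assumes the Mammals book text is first downloaded into your home directory on dsba-hadoop.uncc.edu (the wget step), and the HDFS paths Input4GrepMammals and Out4GrepMammals are placeholder names for directories that do not exist yet.

# Download the input text, load it into HDFS, grep for 'mammal', and fetch the result.
wget http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/03_MammalsBook_Text_34848.txt.utf8.txt
hadoop fs -put ./03_MammalsBook_Text_34848.txt.utf8.txt /user/yourUserName/Input4GrepMammals
hadoop org.apache.hadoop.examples.Grep /user/yourUserName/Input4GrepMammals /user/yourUserName/Out4GrepMammals ".*mammal.*"
hadoop fs -get /user/yourUserName/Out4GrepMammals /users/yourUserName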