#Welcome to the fifth of our Python tutorials, in this one we shall focus on fetching and parsing websites
#There are several modules in the Python 2 standard library that help with this
#urllib and urllib2 fetch a page from a URL, while HTMLParser parses the HTML once you have it

#In this tutorial we will look at urllib and urllib2 together with regular expressions, although you can look into HTMLParser also if you wish

#1. SETTING UP URLLIB AND URLLIB2
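#before we do anything else we need to say that we wish to use urllib and urllib2 with the following statements
#note that this tutorial is written for Python 2, in Python 3 urllib2 no longer exists and urlopen lives in urllib.request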
import urllib
import urllib2
#we will also import the re module, this allows us to use regular expressions
#a regular expression is a pattern that text must match, e.g. the format of a post or zip code
#since post and zip codes follow a fixed pattern, a regular expression can be used to find or check them
import re
#as you can see above, we use the word import to say we wish to use re, urllib and urllib2
#this gives us access to the functions that these modules contain
#now that the three modules have been imported we can begin
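
#a minimal sketch of the zip code idea mentioned above, written so it runs on Python 2 and Python 3
#the pattern is a simplified assumption (five digits, optionally followed by a dash and four digits), not a full validator
#zip_pattern, zip_samples and zip_results are names made up for this illustration
import re #already imported above, repeated so this snippet runs on its own
zip_pattern = re.compile(r'^\d{5}(-\d{4})?$')
zip_samples = ["90210", "90210-1234", "9021", "hello"]
#match returns a match object when the text fits the pattern and None when it does not
zip_results = [bool(zip_pattern.match(text)) for text in zip_samples]
print(zip_results)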

#2. SEARCHING THROUGH A FILE
#first we need to set up a file pointer, here that is a file-like object that lets us read the page at the address we specify
fp = urllib2.urlopen("http://www.google.com")
#as you can see above, our file pointer equals urllib2.urlopen followed by the address in quotes inside brackets
#the above code creates a file pointer which points to the homepage of http://www.google.com
#note that urlopen sends the request straight away and hands back the response, ready for us to read and parse
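
#urlopen can fail, for example when the address is malformed or there is no connection
#below is a hedged sketch of guarding against that, open_page is a hypothetical helper name and not part of urllib2
#the try/except import is only there so the sketch also runs on Python 3, where urlopen lives in urllib.request
try:
    import urllib2 as url_lib #Python 2
    url_error = url_lib.URLError
except ImportError:
    import urllib.request as url_lib #Python 3
    from urllib.error import URLError as url_error

def open_page(address):
    #return the open response, or None if the address could not be opened
    try:
        return url_lib.urlopen(address)
    except (url_error, ValueError) as error:
        #URLError covers network problems, ValueError covers an address with no valid scheme such as http://
        print("could not open " + address + ": " + str(error))
        return None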

#the next thing is to search through the file, we set up a while loop below to go through it line by line
#we start by saying while 1, which is always true, so the loop keeps running until we break out of it ourselves
#this suits reading a file because we do not know in advance how many lines there are, we simply stop when no more data comes back
while 1:
        #we tell a variable called s to equal fp.readline()
        #since fp is a file pointer, we can read the file one line at a time
        #alternatively you could use read() to read the whole file at once
        s = fp.readline()
        #now that s holds the current line of the file we can continue
        #we have some optional code here, this will only print out lines that are more than one character long
        #a blank line is usually just a newline character, so this check skips most blank lines
        if len(s) > 1:
            #so if the length of s is greater than 1 we print out the current line from the file
            #before printing we can do some operations on the line we are on
            #a. This commented out line will remove any complete HTML tags from the line, leaving just the text
            #uncomment it and give it a go on http://www.google.com
            #s = re.sub(r'<[^>]*?>', '', s)
            #notice above that we use a regular expression, re.sub replaces everything that matches the pattern with an empty string and we store the result back in s
            #so the re.sub shown above will remove any opening <tag> and closing </tag> tags that start and end on the same line...try it, you notice no complete tags show?
            #b. The commented code below will get the text from "href" attributes within HTML anchor tags
            #first we tell a variable called find to equal a regular expression, notice the use of compile here
            #compile turns the pattern into a reusable pattern object (ideally compiled once, before the loop), and re.IGNORECASE makes the match ignore casing
            #find = re.compile(r'<\s*a.*?href\s*=\s*"(.*?)".*?>', re.IGNORECASE)
            #link = find.findall(s)
            #print link
            #notice above we tell the link variable to equal a list of every href value found inside an anchor tag within s
            #also note that the code in section b will only work IF the code in section a is commented out, as it relies on the HTML tags being there
            #c. finally we are going to look at getting the image name from a src attribute within an img tag
            #again this assumes we are parsing an HTML file, and the code goes like this
            #we tell imageRe to equal a compiled regular expression, again ignoring casing
            #imageRe = re.compile(r'<\s*img.*?src\s*=\s*"(.*?)".*?>', re.IGNORECASE)
            #imagelink = imageRe.findall(s)
            #print imagelink
            #shown above we search each img tag for its src attribute and get the text from within it
            #finally we print out the value of imagelink...try it, you may wish to comment the other print outs out so you can see results
            #you may see few or no image links, depending on whether the page you fetch has img tags with double-quoted src attributes
            #NOTE - DO NOT RUN THIS CODE UNTIL YOU HAVE ENDED THE WHILE LOOP AND CLOSED FP, AS SHOWN FURTHER DOWN
            print s #here we print out the value of s, you may wish to comment this out when printing other variables

        #once each line has been handled we need a way out of the loop
        #readline() returns an empty string at the end of the file, so "if not s" is only true once there is nothing left to read
        #when that happens we break to stop searching through the file
        if not s:
            break
#lastly we close down the file pointer to close the connection with fp.close()
fp.close()
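
#the three operations from section 2 (a, b and c) can also be tried without a network connection on a made-up line of HTML
#sample_line, text_only, find_links and find_images are names invented for this illustration
#treat this as a rough sketch, patterns like these miss tags split over several lines and attributes in single quotes
import re #already imported at the top, repeated so this snippet runs on its own

sample_line = '<p>Hello <a href="http://www.example.com/page">a link</a> and <IMG alt="logo" src="logo.png"></p>'

#a. strip every complete tag, leaving only the text
text_only = re.sub(r'<[^>]*?>', '', sample_line)
print(text_only)

#b. collect the value of each href attribute inside an anchor tag
find_links = re.compile(r'<\s*a.*?href\s*=\s*"(.*?)".*?>', re.IGNORECASE)
print(find_links.findall(sample_line))

#c. collect the value of each src attribute inside an img tag, IGNORECASE lets img match IMG
find_images = re.compile(r'<\s*img.*?src\s*=\s*"(.*?)".*?>', re.IGNORECASE)
print(find_images.findall(sample_line))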

#THAT IS ALL, YOU NOW KNOW HOW TO PARSE A BASIC WEBSITE AND HOW TO USE BASIC REGULAR EXPRESSIONS IN PYTHON
#WE WILL PICK UP ON SOME ADVANCED PARSING AND GETTING FILES FROM WEBSITES IN THE NEXT TUTORIAL