#Welcome to the fifth of our Python tutorials. In this one we focus on fetching and parsing websites.
#Several standard modules help with this, among them urllib, urllib2 and HTMLParser.
#urllib and urllib2 fetch a page for us, while HTMLParser parses the HTML itself.
#In this tutorial we look at urllib and urllib2 (this is Python 2 code), although you can look into HTMLParser as well if you wish.

#1. SETTING UP URLLIB AND URLLIB2
#Before we do anything else, we need to say that we wish to use urllib and urllib2, with the following statements:
import urllib
import urllib2

#We also import the re module, which lets us use regular expressions.
#A regular expression is a pattern that text must match, e.g. the format of a post code or zip code.
#Since post and zip codes follow a fixed pattern, a regular expression can be used to match them.
import re

#As you can see above, we use the word import to say we wish to import re, urllib and urllib2.
#This lets us use the functions that these modules contain.
#Now that the three modules have been imported, we can begin.

#2. SEARCHING THROUGH A FILE
#First we need to set up a file pointer, which is an object that points to the file (here, a web page) we specify.
fp = urllib2.urlopen("http://www.google.com")

#As you can see above, our file pointer equals urllib2.urlopen followed by our address, in quotes, inside brackets.
#The above code creates a file pointer which points to the homepage of http://www.google.com
#Note that urlopen opens the URL and returns a file-like object that is ready to be read.
#The next thing is to search through the file, so we set up the while loop below to go through it line by line.
#This is one of the few cases where an endless loop is a good fit, so we start by saying while 1.
#while 1 loops forever; we leave the loop with break once no more data is coming from the file.
while 1:
    #We tell a variable called s to equal fp.readline().
    #Since fp is a file pointer, we can read the file one line at a time.
    #Alternatively you could use read() to read the whole file at once.
    s = fp.readline()
    #Now that s equals the current line of the file we are going through, we can continue.
    #The next check is optional: it means we only handle lines that are more than one character long.
    if len(s) > 1:
        #So if the length of s is greater than 1, we print out the current line from the file.
        #Before printing, we can perform some operations on the line we are on.

        #a. This commented-out line removes any complete HTML tags from the line, leaving just the text.
        #Uncomment it and give it a go on http://www.google.com
        #s = re.sub(r'<[^>]*?>', '', s)
        #Notice above that we use a regular expression: re.sub replaces every match of the pattern in s with an empty string, and the result is assigned back to s.
        #So the re.sub shown above removes every complete opening and closing tag. Try it: you should notice that no complete tags show.

        #b. The commented-out code below gets the text from "href" attributes within HTML anchor tags.
        #First we tell a variable called find to equal a regular expression; notice the use of compile here.
        #compile turns the pattern into a reusable regular expression object, and re.IGNORECASE makes it ignore casing.
        #find = re.compile(r'<\s*a.*?href\s*=\s*"(.*?)".*?>', re.IGNORECASE)
        #link = find.findall(s)
        #print link
        #Notice above that the link variable receives a list of every href value found inside an anchor tag within s.
        #Also note that the code in section b will only work IF the code in section a is commented out, as it relies on the HTML tags being there.

        #c. Finally, we look at getting the image name from a src attribute within an img tag.
        #Again this assumes we are parsing an HTML file, and the code goes like this.
        #We tell imageRe to equal a compiled regular expression, again ignoring casing.
        #imageRe = re.compile(r'<\s*img.*?src\s*=\s*"(.*?)".*?>', re.IGNORECASE)
        #imagelink = imageRe.findall(s)
        #print imagelink
        #Shown above, we search each img tag in the line for the contents of its src attribute and get the text from within it.
        #Finally we print out the value of imagelink. Try it; you may wish to comment out the other print statements so you can see the results.
        #When this tutorial was written, http://www.google.com had no img tags, so you should see no image links displayed; this may differ if the page has changed.

        #NOTE - DO NOT TEST THIS STATEMENT UNTIL YOU HAVE ENDED THE WHILE LOOP AND CLOSED FP
        print s
        #Here we print out the value of s; you may wish to comment this out when printing other variables.

    #After we have got the data from each line of the file, we need to stop reading.
    #We say if not s then break, which stops the loop from searching through the file.
    #This happens once we reach the end of the file, because readline() then returns an empty string.
    if not s:
        break

#Lastly, we close the file pointer, and with it the connection, using fp.close().
fp.close()

#THAT IS ALL, YOU NOW KNOW HOW TO PARSE A BASIC WEBSITE AND HOW TO USE BASIC REGULAR EXPRESSIONS IN PYTHON
#WE WILL PICK UP ON SOME ADVANCED PARSING AND GETTING FILES FROM WEBSITES IN THE NEXT TUTORIAL
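
#If you would like to try the readline loop and the three regular expressions without
#fetching a live page, the sketch below may help. It is only an illustration, not part of
#the tutorial code above: it is written in Python 3 syntax, it reads from a made-up HTML
#snippet held in memory (SAMPLE_PAGE) instead of a real URL, and the names parse_page,
#TAG_RE, HREF_RE and IMG_RE are hypothetical. The three patterns are the ones from
#sections a, b and c. Unlike the tutorial loop, it checks for the end of the file before
#handling the line and collects results in lists instead of printing them.

```python
import io
import re

# Made-up HTML used in place of a page fetched with urlopen (an assumption).
SAMPLE_PAGE = (
    '<html><body>\n'
    '\n'
    '<p>Visit <a class="x" href="http://example.com/a">A</a></p>\n'
    '<IMG alt="" SRC="logo.png">\n'
    '</body></html>\n'
)

# The three patterns from sections a, b and c of the tutorial.
TAG_RE = re.compile(r'<[^>]*?>')
HREF_RE = re.compile(r'<\s*a.*?href\s*=\s*"(.*?)".*?>', re.IGNORECASE)
IMG_RE = re.compile(r'<\s*img.*?src\s*=\s*"(.*?)".*?>', re.IGNORECASE)


def parse_page(fp):
    """Read fp line by line and collect text, href values and img src values."""
    texts, hrefs, images = [], [], []
    while True:
        s = fp.readline()
        if not s:
            # readline() returns an empty string at the end of the file
            break
        if len(s) > 1:
            # Extract attributes first, because they need the tags to be present
            hrefs.extend(HREF_RE.findall(s))
            images.extend(IMG_RE.findall(s))
            # Then remove complete tags and keep whatever text is left
            text = TAG_RE.sub('', s).strip()
            if text:
                texts.append(text)
    return texts, hrefs, images


if __name__ == "__main__":
    # io.StringIO stands in for the file pointer returned by urlopen
    texts, hrefs, images = parse_page(io.StringIO(SAMPLE_PAGE))
    print(texts)
    print(hrefs)
    print(images)
```

#Running the attribute searches before removing the tags reflects the note in section b:
#once the tags are gone there is nothing left for the href and src patterns to match.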