Web Client Programming with Perl
Automating Tasks on the Web
By Clinton Wong
1st Edition March 1997
This book is out of print, but it has been made available online through the O'Reilly Open Books Project.
The World Wide Web has been credited with bringing the Internet to the masses. The Internet was previously the stomping ground of academics and a small, elite group of computer professionals, mostly UNIX programmers and other oddball types, running obscure commands like ftp and finger, archie and telnet, and so on.
With the arrival of graphical browsers for the Web, the Internet suddenly exploded. Anyone could find things on the Web. You didn't need to be "in the know" anymore--you just needed to be properly networked. Equipped with Netscape Navigator or Internet Explorer or any other browser, everyone can now explore the Internet freely.
But graphical browsers can be limiting. The very interactivity that makes them the ideal interface for the Internet also makes them cumbersome when you want to automate a task. It's analogous to editing a document by hand when you'd like to write a script to do the work for you. Graphical browsers require you to navigate the Web manually. In an effort to diminish the amount of tedious pointing-and-clicking you do with your browser, this book shows you how to liberate yourself from the confines of your browser.
Web Client Programming with Perl is a behind-the-scenes look at how your web browser interacts with web servers. Readers of this book will learn how the Web works and how to write software that is more flexible, dynamic, and powerful than the typical web browser. The goal here is not to rewrite the browser, but to give you the ability to retrieve, manipulate, and redistribute web-based information in an automated fashion.
Who This Book Is For
I like to think that this book is for everyone. But since that's a bit of an exaggeration, let's try to identify who might really enjoy this book.
This book is for software developers who want to expand into a new market niche. It provides proof-of-concept examples and a compilation of web-related technical data.
This book is for web administrators who maintain large amounts of data. Administrators can replace manual maintenance tasks with web robots to detect and correct problems with web sites. Robots perform tasks more accurately and quickly than human hands.
But to be honest, the audience that's closest to my heart is that of computer enthusiasts, tinkerers, and motivated students, who can use this book to satisfy their curiosity about how the Web works and how to make it work for them. My editor often talks about when she first learned UNIX scripting and how it opened a world of automation for her. When you learn how to write scripts, you realize that there's very little that you can't do within that universe. With this book, you can extend that confidence to the Web. If this book is successful, then for almost any web-related task you'll find yourself thinking, "Hey, I could write a script to do that!"
Unfortunately, we can't teach you everything. We assume that you are already familiar with the following:
- The concept of client/server network applications and TCP/IP.
- How the Internet works, and how to access it.
- The Perl language. Perl was chosen as the language for examples in this book due to its ability to hide complexity. Instead of dealing with C's data structures and low-level system calls, Perl introduces higher-level functions and a straightforward way of defining and using data. If you aren't already familiar with Perl, I recommend Learning Perl by Randal Schwartz, and Programming Perl (popularly known as "The Camel Book") by Larry Wall, Tom Christiansen, and Randal Schwartz. Both of these books are published by O'Reilly & Associates, Inc. There are other fine Perl books as well. Check out http://www.perl.com for the latest book critiques.
Is This Book for You?
Some of you already know why you picked up this book. But others may just have a nagging feeling that it's something useful to know, though you may not be entirely sure why. At the risk of seeming self-serving, let me suggest some ways in which this book may be helpful:
- Some people just like to know how things tick. If you like to think the Web is magic, fine--but there are many who don't like to get into a car without knowing what's under the hood. For those of you who desire a better technical understanding of the Web, this book demystifies the web protocol and the browser/server interaction.
- Some people hate to waste even a minute of time. Given the choice between repeating an action over and over for an hour, or writing a script to automate it, these people will choose the script every time. Call it productivity or just stubbornness--the effect is the same. Through web automation, much time can be saved. Repetitive tasks, like tracking packages or stock prices, can be relegated to a web robot, leaving the user free to perform more fruitful activities (like eating lunch).
- If you understand your current web environment, you are more likely to recognize areas that can be improved. Instead of waiting for solutions to show up in the marketplace, you can take an active role in shaping the future direction of your own web technology. You can develop your own specialized solutions to fit specific problems.
- In today's frenzied high-tech world, knowledge isn't just power, it's money. A reasonable understanding of HTTP looks nice on the resume when you're competing for software contracts, consulting work, and jobs.
Organization of This Book
This book consists of seven chapters and three appendices, as follows:
- Chapter 1, Introduction
  Discusses basic terminology and potential uses for customized web clients.
- Chapter 2, Demystifying the Browser
  Translates common browser tasks into HTTP transactions. By the end of the chapter, the reader will understand how web clients and servers interact, and will be able to perform these interactions manually.
- Chapter 3, Learning HTTP
  Teaches the nuances of the HTTP protocol.
- Chapter 4, The Socket Library
  Introduces the socket library and shows some examples of how to write simple web clients with sockets.
- Chapter 5, The LWP Library
  Describes the LWP library that will be used for the examples in Chapters 6 and 7.
- Chapter 6, Example LWP Programs
  A cookbook-type demonstration of several example applications.
- Chapter 7, Graphical Examples with Perl/Tk
  A demonstration of how you can use the Tk extension to Perl to add a graphical interface to your programs.
- Appendix A, HTTP Headers
  Contains a comprehensive listing of the headers specified by HTTP.
- Appendix B, Reference Tables
  Lists URLs that you can use to learn more about HTTP and LWP.
- Appendix C, The Robot Exclusion Standard
  Describes the Robot Exclusion Standard, which every good web programmer should know intimately.
Source Code in This Book Is Online
In this book, we include many code examples. While the code is all contained within the text, many people will prefer to download examples rather than type them in by hand. You can find the complete set of source code used in this book on ftp.oreilly.com at /published/oreilly/nutshell/web-client.
To use FTP, you need a machine with direct access to the Internet. A sample session follows. What you should type appears after the %, Name, Password, and ftp> prompts; the remarks in parentheses are comments, not part of the input.
% ftp ftp.oreilly.com
Connected to ftp.oreilly.com.
220 FTP server (Version 6.21 Tue Mar 10 22:09:55 EST 1992) ready.
Name (ftp.oreilly.com:yourname): anonymous
331 Guest login ok, send domain style e-mail address as password.
Password: yourname@yourhost (use your user name and host here)
230 Guest login ok, access restrictions apply.
ftp> cd /published/oreilly/nutshell/web-client
250 CWD command successful.
ftp> binary (Very important! You must specify binary transfer for compressed files.)
200 Type set to I.
ftp> get examples.tar.gz
200 PORT command successful.
150 Opening BINARY mode data connection for examples.tar.gz.
226 Transfer complete.
The file is a gzipped tar archive; extract the files from the archive by typing:
% gunzip examples.tar.gz
% tar xvf examples.tar
System V systems require the following tar command instead:
% tar xof examples.tar
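The two-step gunzip and tar sequence above can also be written as a single pipeline, which leaves the downloaded .gz file in place. What follows is a minimal sketch, not part of the book's own examples: the function name unpack_archive and the target directory web-client are hypothetical names chosen for illustration, and the sketch assumes a POSIX shell with gunzip and tar on the path.

```shell
# unpack_archive FILE [DIR]
# Extract a gzipped tar archive into DIR (default: the current directory).
# The function name is an illustrative choice, not something from the book.
unpack_archive() {
    archive=$1
    dest=${2:-.}

    # Refuse to continue if the archive is missing.
    [ -f "$archive" ] || { echo "no such file: $archive" >&2; return 1; }

    # Create the destination directory if it does not exist yet.
    mkdir -p "$dest" || return 1

    # gunzip -c writes the decompressed tar stream to standard output;
    # "tar xf -" reads that stream from standard input inside DIR.
    gunzip -c "$archive" | (cd "$dest" && tar xf -)
}

# Typical use, once the download has finished:
#   unpack_archive examples.tar.gz web-client
```

On a System V system, the tar options inside the function would need the same adjustment described above (xof in place of xf).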
Conventions Used in This Book
We use the following formatting conventions in this book:
- Italic is used for command names, function names, variables, email addresses, URLs, directory and filenames, and newsgroup names. It is also used for emphasis and for the first use of a technical term.
- Courier is used for HTTP header names and for code.
- Courier Italic is used within code to show elements that should be replaced with real values.
- Courier Bold is used to show commands entered by the user.
Request for Comments
As a reader of this book, you can help us to improve the next edition. If you find errors, inaccuracies, or typos anywhere in the book, please let us know about them. Also, if you find any misleading statements or confusing explanations, let us know. Send your bug reports and comments to:
O'Reilly & Associates, Inc.
101 Morris St.
Sebastopol, CA 95472
1-800-998-9938 (in the US or Canada)
Please let us know what we can do to make the book more helpful to you. We take your comments seriously, and will do whatever we can to make this book as useful as it can be.
Acknowledgments
The idea for this book started in early 1995 when I was a student at Purdue University. It all started when I attended a class entitled Proficient Use of WWW taught by George Vanecek, Jr. and Buster Dunsmore. It was a wonderful class that went all over the map, from HTML to HTTP to CGI to Perl programming. Other ideas for the book started when I worked at Purdue's Online Writing Lab as a web developer.
I'd like to extend a warm "thank you" to everyone who helped review the book, especially on short notice: Tom Christiansen, Larry Wall, Sean McDermott, Kirsten Klinghammer, Ed Hill, Andy Grignon, Jeff Sedayao, Michael Pelz-Sherman, and Norman Walsh. Special thanks to Kirsten and Sean for the 24-hour turnaround time, and to Tom, Larry, and Ed for being critical when someone needed to be critical.
Thanks also to Nancy Walsh for writing the Perl/Tk chapter. And thanks to all the people at O'Reilly & Associates: production editor Jane Ellin, cover designer Edie Freedman, Chris Reilley (who cleaned up the figures), Mike Sierra for Tools support, Mary Anne Weeks Mayo and Sheryl Avruch for quality control, and my editor Linda Mui.
Thanks to my parents, Chun and Liang, my sister Ginger, and my girlfriend Cynthia for their support.
© 2000, O'Reilly & Associates, Inc.