Sunday, July 22, 2012

lynx Webmaster Tips

Hi Friends,

Today we are going to play with Lynx, a command-line browser, in Linux.

Fun in the Terminal With Lynx Browser

Get the text from a Web page as well as a list of links

lynx -dump ""

Get the source code from a Web page with Lynx

lynx -source ""

Get the response headers with Lynx

lynx -dump -head ""

The GNU/Linux command line gives you a lot of small tools that can be connected with each other by piping the output of one tool into another tool.
For example, you might see a page with a lot of links on it that you want to examine more closely. You could open up a terminal and type something like the following:
$ lynx -dump "" | grep -o "http:.*" >file.txt
That will give you a list of the outgoing links on that web page, nicely printed to a file called file.txt in your current directory.

Here's how it works:

Lynx is a Web browser that only reads text. This makes it great for extracting text from web pages. The option -dump tells Lynx to grab the web page and print it to the terminal instead of opening it interactively. That is followed by the URL you want to visit. So lynx -dump "" is just saying, "Lynx, dump the output of this URL to the screen".

You can try the first part by itself to see what it does, replacing the URL with one of your choice. In the following example I've used the Google home page.
$ lynx -dump ""
You will see the text of the page, followed by a numbered list of the links it contains.


Extracting the Links from Lynx

Now we can look at the next part of the URL extraction process:
$ lynx -dump "" | grep -o "http:.*" >file.txt
When you use a pipe (the | symbol), it tells the computer to take the output from the first tool and send it to the following tool. So we are taking the output of Lynx and sending it to grep.

Grep is a tool to search for text and display each line that contains a matching pattern. The -o option tells grep to return only the matching part of the line and not the entire line. We are searching for anything that matches "http:.*", which is a simple regular expression.
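To see what the pipe and the -o option do without fetching a real page, here is a minimal sketch that feeds two lines of made-up text to grep. The sample text and the example.com address are placeholders, not real Lynx output.

```shell
# Hypothetical text standing in for Lynx's output (example.com is a placeholder).
page='See the manual at http://example.com/manual
No link on this line'

# Without -o, grep prints every whole line that contains a match.
whole=$(printf '%s\n' "$page" | grep "http:.*")

# With -o, grep prints only the part of the line that matched.
only=$(printf '%s\n' "$page" | grep -o "http:.*")

printf 'whole line: %s\n' "$whole"
printf 'match only: %s\n' "$only"
```

The second line of the sample contains no "http:", so grep drops it in both cases.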
A regular expression is a pattern made up of symbols that tell the computer what to look for in order to make a match. We want to find anything that matches the pattern: http: [and anything that comes after that]. A period (.) in a regular expression stands for one character of any type. The asterisk (*) stands for zero or more of the preceding character. So "http:.*" means "match 'http:' and any number of characters that follow it, up to the end of the line". This will extract only the URLs from Lynx's output.
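Two side effects of this simple pattern are worth trying out: it keeps everything up to the end of the line, and it does not match links that start with https:. The sketch below shows both on made-up input, and then tries a stricter pattern. That stricter pattern is my own assumption of a reasonable variant, not part of the original command; the example.com addresses are placeholders.

```shell
# Hypothetical lines; example.com is a placeholder domain.
lines='1. http://example.com/a (old link)
2. https://example.com/b'

# "http:.*" runs to the end of the line and skips the https link.
simple=$(printf '%s\n' "$lines" | grep -o "http:.*")

# Assumed variant: optional "s", then "://", then a run of non-space characters.
# -E turns on extended regular expressions so that ? and + work unescaped.
strict=$(printf '%s\n' "$lines" | grep -oE "https?://[^[:space:]]+")

printf 'simple:\n%s\n' "$simple"
printf 'strict:\n%s\n' "$strict"
```

Whether the stricter pattern is worth the extra typing depends on the page; for a quick look at plain http links the simple one is usually enough.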
We could stop there and just run it like this, which will send the output to the screen:
$ lynx -dump "" | grep -o "http:.*"
But it would be nice to save the output for later. To save the output to a file, add the > symbol followed by a file name. In this case the output is being redirected to a file named file.txt, as shown below.
$ lynx -dump "" | grep -o "http:.*" >file.txt
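One thing to keep in mind is that > replaces the contents of the file each time you run the command. The following sketch, which writes placeholder links to a temporary file rather than to file.txt, shows the difference between > and >> (append).

```shell
# Temporary file so the sketch does not touch any real files.
out=$(mktemp)

# ">" creates the file, or replaces its contents if it already exists.
printf '%s\n' "http://example.com/one" >"$out"
printf '%s\n' "http://example.com/two" >"$out"
first=$(cat "$out")   # only the second link survives

# ">>" appends to the file instead of overwriting it.
printf '%s\n' "http://example.com/three" >>"$out"
both=$(cat "$out")

printf '%s\n' "$both"
rm -f "$out"
```

So if you want to collect links from several pages into one file, >> is probably the operator you want for every run after the first.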

Other Options

Here is an example of some other commands that you can add to the pipeline. The sort command sorts the results, and uniq removes any duplicate entries.
$ lynx -dump "" | grep -o "http:.*" | sort | uniq >file.txt
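If you run this often, the grep, sort and uniq steps could be wrapped in a small shell function. This is only a sketch: the name extract_links and the sample input are made up for illustration. Note that uniq only removes duplicates that sit on adjacent lines, which is why sort has to come first.

```shell
# Hypothetical helper: reads text on standard input, prints unique sorted links.
extract_links() {
    grep -o "http:.*" | sort | uniq
}

# Made-up input with one duplicate; example.com is a placeholder domain.
sample='1. http://example.com/b
2. http://example.com/a
3. http://example.com/b'

result=$(printf '%s\n' "$sample" | extract_links)
printf '%s\n' "$result"

# With Lynx installed, the same helper could be used like this
# ($url is a placeholder for the page you want to examine):
#   lynx -dump "$url" | extract_links >file.txt
```

The shorter sort -u gives the same result as sort | uniq here.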