How to get the webpage content & How to extract the text from html string in Python

Hi,

Today I found out how to get a web page as text using Python.


import urllib

# Python 2: open the URL and read the whole response body into one string
myurl = urllib.urlopen("http://tuxworld.wordpress.com")
source = myurl.read()

This simple code fetches the above web page as a string.
The string contains the HTML source of the web page.

If you print ‘source’, you will see the web page content along with its HTML tags.
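The snippet above uses the Python 2 urllib module. On Python 3, urlopen lives in urllib.request and read() returns bytes, so the result has to be decoded. Here is a minimal sketch of the same idea; the helper name fetch_html, the 10 second timeout and the UTF-8 fallback are my own assumptions, not part of the original code.

```python
from urllib.request import urlopen


def fetch_html(url, fallback_encoding="utf-8"):
    """Return the body of `url` as a text string (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        raw = response.read()  # bytes on Python 3
        # Use the charset from the Content-Type header when the server sends one;
        # otherwise fall back to UTF-8 (an assumption that suits most pages).
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="replace")


# Example call (needs network access):
# source = fetch_html("http://tuxworld.wordpress.com")
```

The returned string plays the same role as ‘source’ above: it is the HTML source of the page, tags included.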

How to strip the HTML tags from that string?

Yes, we can do that in the following way, using the stripogram package.

$ sudo apt-get install python-setuptools
$ sudo easy_install stripogram


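import urllib

from stripogram import html2text

# Fetch the page source (Python 2 urllib API)
myurl = urllib.urlopen("http://tuxworld.wordpress.com")
html_string = myurl.read()

# Convert the HTML source to plain text, dropping the tags
text = html2text(html_string)

print text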

Now the “text” variable holds the whole web page as normal text, not as an HTML string.
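If stripogram is not available (for example on Python 3), the same tag stripping can be approximated with the standard library's html.parser module. This is a different technique from stripogram's html2text, and only a rough sketch: the names TextExtractor and html_to_text are hypothetical, and the choice of which tags to skip or treat as word breaks is my own assumption.

```python
from html.parser import HTMLParser

# Tags whose contents are not readable text (assumed list).
SKIPPED_TAGS = {"script", "style"}
# Tags that should separate words in the output (assumed, incomplete list).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "title",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; into & and so on
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        # Keep text only when we are outside <script> and <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def html_to_text(html_string):
    parser = TextExtractor()
    parser.feed(html_string)
    parser.close()
    return parser.get_text()
```

Calling html_to_text(html_string) on the page source gives one long line of plain text. It does not try to keep the page layout, so the output will likely differ from what stripogram prints.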

Enjoy with Python 🙂

Regards,
Arulalan.T

About arulalant

Currently working as "Project Scientist – C" in National Centre for Medium Range Weather Forecasting (NCMRWF), MoES, Noida, India
This entry was posted in Python, Web.

3 Responses to How to get the webpage content & How to extract the text from html string in Python

  1. susi says:

    But how to remove those tags and all?

  2. kiske says:

    nice one 😉 , please how that code will be looks like, if we need first to login to https:\\page and than strip that page contents ?

  3. mohi says:

    Wow… thats awesome…
