Scraping Wikipedia With Python

Web scraping techniques are used to parse websites and extract information from them.

Python wikipedia module

The third-party wikipedia package (not part of the standard library) wraps the MediaWiki API and can be used to extract data from Wikipedia.

Installing the wikipedia Module

The wikipedia module can be installed with pip:

pip install wikipedia

or, where Python 3 has its own pip command, with pip3:

pip3 install wikipedia

Searching

Relevant pages/keywords for a query

To get the titles of articles related to a query string, use the search() function.

For example, searching for python returns a list of related Wikipedia article titles:

>>> import wikipedia
>>> wikipedia.search("python")
['Python (programming language)', 'Monty Python', 'Python', 'Reticulated python', 'Burmese python', 'PYTHON', 'Ball python', 'Python (missile)', 'Monty Python and the Holy Grail', 'History of Python']

Search suggestion

The suggest() function returns a suggested correction for a search query, or None if there is no suggestion:

>>> wikipedia.suggest("pythn")
'python'
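
When a query is misspelled, search() may return no titles at all. The two calls above can be combined: try the query as typed, and retry with the suggestion only if nothing came back. This is a minimal sketch, not part of the library. The helper name search_with_fallback is hypothetical, and the two functions are passed in as arguments only so the sketch can be run without network access.

```python
def search_with_fallback(query, search, suggest):
    """Return (titles, query_used).

    `search` and `suggest` are expected to behave like wikipedia.search
    and wikipedia.suggest: the first returns a list of titles, the
    second returns a string or None.
    """
    titles = search(query)
    if titles:
        return titles, query

    # Nothing found: ask for a spelling suggestion and retry once.
    suggestion = suggest(query)
    if suggestion:
        return search(suggestion), suggestion

    return [], query
```

With the real module this would be called as search_with_fallback("pythn", wikipedia.search, wikipedia.suggest).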

Page content

The page() function returns a WikipediaPage object with attributes such as content, images and references, and methods such as html().

It takes the title of an article (as a string) as its argument:

import wikipedia

page = wikipedia.page("python")
print(page.content)

Output (truncated)

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system with reference counting.
Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3
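
A title can be ambiguous or match no article. In the first case page() raises wikipedia.exceptions.DisambiguationError, whose options attribute lists candidate titles. In the second it raises wikipedia.exceptions.PageError. The sketch below shows one possible way to handle both. load_page is a hypothetical helper, taking the first option is an arbitrary choice made for illustration, and the module is passed in as wiki only so the function can be tested offline.

```python
def load_page(title, wiki):
    """Return a page object for `title`, or None if no article matches.

    `wiki` is expected to be the imported wikipedia module.
    """
    errors = wiki.exceptions
    try:
        return wiki.page(title)
    except errors.DisambiguationError as err:
        # err.options holds the candidate titles. Picking the first one
        # is an assumption and may not be the article you want.
        if not err.options:
            return None
        try:
            return wiki.page(err.options[0])
        except (errors.DisambiguationError, errors.PageError):
            return None
    except errors.PageError:
        return None
```

With the real module this would be called as load_page("python", wikipedia). In an interactive session it is usually better to print err.options and choose a title by hand.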

Summary of an Article

The summary() function returns the plain-text summary of an article. Its optional sentences argument limits the number of sentences returned.

>>> print(wikipedia.summary("python"))
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system with reference counting.
Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3.
The Python 2 language was officially discontinued in 2020 (first planned for 2015), and "Python 2.7.18 is the last Python 2.7 release and therefore the last Python 2 release." No more security patches or other improvements will be released for it. With Python 2's end-of-life, only Python 3.5.x and later are supported.
Python interpreters are available for many operating systems. A global community of programmers develops and maintains CPython, an open source reference implementation. A non-profit organization, the Python Software Foundation, manages and directs resources for Python and CPython development.


>>> print(wikipedia.summary("python", sentences=2))
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.

URL of a Page

The url attribute of the WikipediaPage object returned by page() holds the URL of the page:

import wikipedia

page = wikipedia.page("python")
print(page.url)

Extracting Images from a Page

The images attribute of a WikipediaPage object is a list of URLs of the images present on the page:

>>> page = wikipedia.page("Python snake")
>>> page.images
['https://upload.wikimedia.org/wikipedia/commons/6/6f/Large_Python_Ragunan_Zoo.jpg', 'https://upload.wikimedia.org/wikipedia/commons/d/df/Malayopython_reticulatus%2C_Reticulated_python_-_Kaeng_Krachan_District%2C_Phetchaburi_Province_%2847924282891%29.jpg', 'https://upload.wikimedia.org/wikipedia/commons/a/aa/Python_gab_fbi.png', 'https://upload.wikimedia.org/wikipedia/commons/d/d1/Python_reticulatus_feeding_in_TMII_Reptil_Park.jpg', 'https://upload.wikimedia.org/wikipedia/commons/b/b4/Python_reticulatus_%D1%81%D0%B5%D1%82%D1%87%D0%B0%D1%82%D1%8B%D0%B9_%D0%BF%D0%B8%D1%82%D0%BE%D0%BD-2.jpg', 'https://upload.wikimedia.org/wikipedia/commons/7/74/Red_Pencil_Icon.png', 'https://upload.wikimedia.org/wikipedia/commons/7/72/Retic2.jpg', 'https://upload.wikimedia.org/wikipedia/commons/8/83/Retic3.jpg', 'https://upload.wikimedia.org/wikipedia/commons/2/25/Reticulated-catch.jpg', 'https://upload.wikimedia.org/wikipedia/commons/b/bc/Reticulated-python.jpg', 'https://upload.wikimedia.org/wikipedia/commons/3/38/Reticulated_Python_at_Little_Rays_Reptile_Zoo.jpg', 'https://upload.wikimedia.org/wikipedia/commons/c/ca/Status_iucn2.3_LC.svg', 'https://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg']
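
The list above mixes photographs with icons and logos such as Commons-logo.svg. One simple way to narrow it down is to keep only URLs whose file extension is in a chosen set. This is a rough sketch under that assumption: filter_by_extension is a hypothetical helper, and the default set of extensions is an arbitrary choice that will not separate photos from icons perfectly.

```python
import posixpath
from urllib.parse import unquote, urlparse

# Assumed default: treat JPEG files as photographs.
DEFAULT_EXTENSIONS = frozenset({".jpg", ".jpeg"})


def filter_by_extension(image_urls, extensions=DEFAULT_EXTENSIONS):
    """Return the URLs whose path ends in one of `extensions`."""
    kept = []
    for url in image_urls:
        # Decode percent-escapes such as %29 before reading the extension.
        path = unquote(urlparse(url).path)
        ext = posixpath.splitext(path)[1].lower()
        if ext in extensions:
            kept.append(url)
    return kept
```

For a page object this would be called as filter_by_extension(page.images).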

Changing Language

The language for subsequent queries can be set with the set_lang() function, which takes a language prefix such as "es":

>>> import wikipedia
>>> wikipedia.set_lang("es")
>>> wikipedia.summary("spanish language", sentences=2)
'El español o castellano es una lengua romance procedente del latín hablado. Pertenece al grupo ibérico y es originaria de Castilla, reino medieval de la península ibérica.'
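
Because set_lang() changes the language for every later call, code that needs several languages has to switch back and forth. The sketch below collects one summary per language and then switches back to English. summaries_by_language is a hypothetical helper, resetting to "en" assumes the caller was using English before, and the module is passed in as wiki only so the function can be tested offline.

```python
def summaries_by_language(titles, wiki, sentences=2):
    """Return a dict mapping language prefix to article summary.

    `titles` maps a language prefix (for example "es") to an article
    title or query for that language edition. `wiki` is expected to be
    the imported wikipedia module.
    """
    result = {}
    try:
        for lang, title in titles.items():
            wiki.set_lang(lang)  # affects all later calls, not just this one
            result[lang] = wiki.summary(title, sentences=sentences)
    finally:
        # Assumption: English was the language in use before the call.
        wiki.set_lang("en")
    return result
```

With the real module this would be called as summaries_by_language({"es": "spanish language"}, wikipedia).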