Web scraping techniques are used to parse websites and gather information.
Python provides a Wikipedia API module that is used to extract Wikipedia data.
It supports operations such as extracting text, links, and images from Wikipedia.
wikipedia module
Python provides a module, wikipedia, which can be used to extract data from Wikipedia.
The wikipedia module can be installed via pip:
pip install wikipedia
or, using pip3:
pip3 install wikipedia
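After installation, a quick check can confirm that the package is importable from the interpreter in use. The sketch below relies only on the standard library; the helper name is_installed is an illustrative choice and is not part of the wikipedia module.

```python
import importlib.util


def is_installed(module_name="wikipedia"):
    """Return True if `module_name` can be found on the import path."""
    # find_spec returns None when no top-level module with that name is found
    return importlib.util.find_spec(module_name) is not None


print("wikipedia installed:", is_installed())
```

If this prints False, the package was probably installed for a different Python interpreter than the one running the script.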
To get article titles or keywords related to a query string, use the search() method.
For example, searching for "python" returns a list of related Wikipedia articles:
>>> import wikipedia
>>> wikipedia.search("python")
['Python (programming language)', 'Monty Python', 'Python', 'Reticulated python', 'Burmese python', 'PYTHON', 'Ball python', 'Python (missile)', 'Monty Python and the Holy Grail', 'History of Python']
The suggest() method can be used to get a suggested spelling for a search query:
>>> wikipedia.suggest("pythn")
'python'
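The two calls above can be combined so that a misspelled query falls back to the suggested spelling. This is a minimal sketch rather than a recommended pattern: the helper name search_with_fallback and its limit parameter are hypothetical, and it assumes that search() accepts a results argument and that suggest() returns None when there is no suggestion.

```python
def search_with_fallback(query, limit=5):
    """Search for `query`; if nothing matches, retry with the suggested spelling."""
    # Third-party package, imported here so the helper can be defined
    # even before the package is installed.
    import wikipedia

    titles = wikipedia.search(query, results=limit)
    if titles:
        return titles

    # Assumption: suggest() returns None when Wikipedia offers no suggestion.
    suggestion = wikipedia.suggest(query)
    if suggestion is None:
        return []
    return wikipedia.search(suggestion, results=limit)
```

Calling search_with_fallback("pythn", limit=3) would first search for the query as typed and only ask for a suggestion if that search comes back empty.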
The page() method returns a WikipediaPage object that has methods and attributes such as images, content, references, and html().
It takes the name of an article (as a string) as its argument.
import wikipedia
page=wikipedia.page("python")
print(page.content)
Output (truncated):
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system with reference counting.
Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3
The content property returns the text content of a page.
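Several attributes of a WikipediaPage object can be gathered into one small structure. The following is only a sketch: page_overview and preview_chars are hypothetical names, and it assumes the title, url, content, images, and references attributes behave as described in this article. Reading images and references may trigger additional requests.

```python
def page_overview(title, preview_chars=200):
    """Collect a few attributes of a WikipediaPage into a plain dict."""
    # Third-party package, imported here so the helper can be defined
    # even before the package is installed.
    import wikipedia

    page = wikipedia.page(title)
    return {
        "title": page.title,
        "url": page.url,
        # Only the first `preview_chars` characters of the article text
        "preview": page.content[:preview_chars],
        "image_count": len(page.images),
        "reference_count": len(page.references),
    }
```

A call such as page_overview("Python (programming language)") would return a dictionary that is easy to print or serialize.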
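The summary() method of the wikipedia module is used to extract the summary of an article.
It takes a string as a parameter, preferably an article name that doesn't lead to a disambiguation page. The optional sentences parameter limits the summary to the given number of sentences.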
>>> print(wikipedia.summary("python"))
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system with reference counting.
Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3.
The Python 2 language was officially discontinued in 2020 (first planned for 2015), and "Python 2.7.18 is the last Python 2.7 release and therefore the last Python 2 release." No more security patches or other improvements will be released for it. With Python 2's end-of-life, only Python 3.5.x and later are supported.
Python interpreters are available for many operating systems. A global community of programmers develops and maintains CPython, an open source reference implementation. A non-profit organization, the Python Software Foundation, manages and directs resources for Python and CPython development.
>>> print(wikipedia.summary("python", sentences=2))
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
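When a title does lead to disambiguation, the module raises an exception instead of returning text. The sketch below shows one possible way to handle that case. The helper name safe_summary is hypothetical, and picking the first listed option is an arbitrary policy chosen for illustration. It assumes the module exposes DisambiguationError (with an options list) and PageError.

```python
def safe_summary(title, sentences=2):
    """Return a short summary, or None if no page matches `title`."""
    # Third-party package, imported here so the helper can be defined
    # even before the package is installed.
    import wikipedia

    try:
        return wikipedia.summary(title, sentences=sentences)
    except wikipedia.DisambiguationError as err:
        # err.options lists candidate titles; taking the first one is an
        # arbitrary choice for this sketch, and the retry may itself fail.
        if not err.options:
            return None
        return wikipedia.summary(err.options[0], sentences=sentences)
    except wikipedia.PageError:
        # No article matched the given title
        return None
```

In an interactive tool it would usually be better to show err.options to the user and let them choose, rather than silently taking the first entry.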
The url attribute of the object returned by the page() method holds the URL of the page:
import wikipedia
page=wikipedia.page("python")
print(page.url)
The images attribute of a WikipediaPage object returns a list of URLs of the images present in a page:
>>> page = wikipedia.page("Python snake")
>>> page.images
['https://upload.wikimedia.org/wikipedia/commons/6/6f/Large_Python_Ragunan_Zoo.jpg', 'https://upload.wikimedia.org/wikipedia/commons/d/df/Malayopython_reticulatus%2C_Reticulated_python_-_Kaeng_Krachan_District%2C_Phetchaburi_Province_%2847924282891%29.jpg', 'https://upload.wikimedia.org/wikipedia/commons/a/aa/Python_gab_fbi.png', 'https://upload.wikimedia.org/wikipedia/commons/d/d1/Python_reticulatus_feeding_in_TMII_Reptil_Park.jpg', 'https://upload.wikimedia.org/wikipedia/commons/b/b4/Python_reticulatus_%D1%81%D0%B5%D1%82%D1%87%D0%B0%D1%82%D1%8B%D0%B9_%D0%BF%D0%B8%D1%82%D0%BE%D0%BD-2.jpg', 'https://upload.wikimedia.org/wikipedia/commons/7/74/Red_Pencil_Icon.png', 'https://upload.wikimedia.org/wikipedia/commons/7/72/Retic2.jpg', 'https://upload.wikimedia.org/wikipedia/commons/8/83/Retic3.jpg', 'https://upload.wikimedia.org/wikipedia/commons/2/25/Reticulated-catch.jpg', 'https://upload.wikimedia.org/wikipedia/commons/b/bc/Reticulated-python.jpg', 'https://upload.wikimedia.org/wikipedia/commons/3/38/Reticulated_Python_at_Little_Rays_Reptile_Zoo.jpg', 'https://upload.wikimedia.org/wikipedia/commons/c/ca/Status_iucn2.3_LC.svg', 'https://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg']
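As the output shows, the list mixes photographs with icons and logos such as .svg files. A small filter on the file extension can narrow it down. This sketch uses only the standard library; the helper name filter_images and its default extensions are illustrative assumptions, and the sample URLs are taken from the output above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filter_images(urls, extensions=(".jpg", ".jpeg", ".png")):
    """Keep only the URLs whose path ends with one of `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    return [
        url for url in urls
        # Compare the extension of the URL path, ignoring case
        if PurePosixPath(urlparse(url).path).suffix.lower() in wanted
    ]


sample = [
    "https://upload.wikimedia.org/wikipedia/commons/7/72/Retic2.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/c/ca/Status_iucn2.3_LC.svg",
    "https://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg",
]
# Prints a list containing only the Retic2.jpg URL
print(filter_images(sample))
```

The same function can be applied directly to page.images.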
The language for subsequent queries can be set with the set_lang() method:
>>> import wikipedia
>>> wikipedia.set_lang("es")
>>> wikipedia.summary("spanish language", sentences=2)
'El español o castellano es una lengua romance procedente del latín hablado. Pertenece al grupo ibérico y es originaria de Castilla, reino medieval de la península ibérica.'
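Because set_lang() affects all subsequent queries, it can be convenient to switch the language for a single call and then switch back. The following sketch shows one way to do that. The helper name summary_in and the restore_lang parameter are hypothetical, and restoring to "en" assumes English was the language in use before the call.

```python
def summary_in(lang, title, sentences=2, restore_lang="en"):
    """Fetch a summary in `lang`, then switch the module back to `restore_lang`."""
    # Third-party package, imported here so the helper can be defined
    # even before the package is installed.
    import wikipedia

    wikipedia.set_lang(lang)
    try:
        return wikipedia.summary(title, sentences=sentences)
    finally:
        # Runs even if summary() raises, so later queries are unaffected
        wikipedia.set_lang(restore_lang)
```

For example, summary_in("es", "spanish language") would return the Spanish summary while leaving later queries in English.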