Creating Malayalam Wikisource CD

With great pleasure, the Malayalam Wikimedia community announces the release of the first offline version of Malayalam Wikisource. This offline version, titled Malayalam wiki grandasala – Thiranjedutha krithikal – Pathippu 1.0, June 2011 (Malayalam Wikisource – Selected Books – Version 1.0), was released by Hisham Mundol, who handed over the Wikisource CD to the youngest Malayalam Wikimedian, 7-year-old Sai Shanmugham, at the 4th Malayalam Wikimedia Meetup in Kannur, Kerala, on 2011 June 11.

Malayalam Wikisource CD sticker. Image Courtesy: Rajesh Odayanchal

This is by far the largest digital collection of free books in the Malayalam language available on CD for offline use. It is also the first time a Wikimedia community from India has released an offline version of Wikisource. We are told that no other Wikimedia community in the world has released such an offline version of Wikisource, but we have not been able to verify this claim 🙂

Project Idea

The idea of releasing an offline version of Malayalam Wikisource had been on our minds for a long time, but the opportunity didn’t come until we finalized the dates for the 4th Malayalam Wikimedians’ meetup. The main challenge was the presentation of the Wikisource content: we knew that presenting the content as long wiki pages would make the Wikisource CD project fail.

Why did we choose Wikisource this year?

The reason is simple. Malayalam Wikisource is the most active wiki project after Malayalam Wikipedia. We also saw much potential in an offline version, since Malayalam Wikisource already has many notable Malayalam books. We want to show Malayalam-speaking people the great work done by the Malayalam Wikisource community during the past 5 years, and we know that this release will help us reach more people.

The Wikisource CD project

The project page (വിക്കിഗ്രന്ഥശാല:സിഡി പതിപ്പ് 1.0) to coordinate the Wikisource CD project was created on 2011 May 20, and the book selection process started the same day. The whole Wikisource community participated in the selection. Since Malayalam Wikisource also has a few Sanskrit source texts, it was decided to include only Malayalam books in this release.

Santhosh Thottingal agreed to take care of the technology part, as he did last year.

Malayalam source texts included in the CD

Following are the books included in this CD:

Selected Poems of

  • Kumaranasan
  • Cherusseri
  • Changampuzha Krishna Pillai
  • Kalakkaththu Kunchan Nambiar
  • Irayimman Thampi
  • Ramapurathu Warrier

Malayalam Grammar

  • Kerala Panineeyam by AR Rajaraja Varma

Legends/Folklore

  • Aithihyamala

Novels

  • Indulekha

Religious Texts

  • Bhagavad Gita
  • Adhyatma Ramayanam Kilippaatu
  • Harinama Keerthanam
  • Geetha Govindam
  • Sathyaveda Pusthakam (Malayalam Bible)
  • Quran
  • Works of Sree Narayana Guru
  • Devotional songs

Native Art Form

  • Parichamuttukali pattukal

Philosophy (Political)

  • Communist Manifesto
  • Principles of Communism (Friedrich Engels)

Commons images

Apart from the selected books from Malayalam Wikisource, we have also included images from Wikimedia Commons (Commons images related to Kerala) in this CD. Including Commons images was a test case to see how they can be used for public outreach. The maps of India and Kerala state from the Malayalam Wikipedia map project are also part of this image collection.

Hisham Mundol releasing the Malayalam Wikisource CD by handing it over to the youngest Malayalam Wikimedian, Sai Shanmugham. Image Courtesy: Fotokannan

Credits

The whole Malayalam wiki community worked behind this, so it is almost impossible to name everyone. But let me credit some of the major contributors:

  • Content creation: Malayalam Wikisource community. Over the past 5 years they have digitized and added many important Malayalam source texts to Wikisource. Thanks to all the contributors. I would like to specially mention two female editors of Malayalam Wikisource (User:Atma and User:Su) who made exceptional contributions to this CD collection. Special thanks to users Thachan Makan and Manoj, who coordinated many sub-projects of this CD release.
  • Proofreading: The selection process and the proofreading for this CD project happened at the same time. Apart from the Malayalam Wikisource community, Malayalam Buzzers and Malayalam bloggers also helped us. Special thanks to all of them. In fact, we could include Aithihyamala and Indulekha in this CD only because of their wonderful support.
  • Artwork: Rajesh Odayanchal. He designed the sticker and the cover for the CD. Thanks for all the wonderful designs that he submitted before we finalized the current ones.
  • CD production: Sunil Paravur. Thanks to Sunil Paravur for helping us get the CD replicated. For the wiki meetup we produced 500 CDs.

Unlike the Wikipedia CD that we released last year, we did not approach any agency to sponsor this CD. A few Malayalam Wikimedians sponsored the entire production cost. We are charging a nominal amount of Rs 20 to cover the production cost of the CD.

Distribution of CD

The 500 copies that we produced for the wiki meetup are almost gone. Malayalam Wikimedians do not have any plans to replicate this version of the CD further. We expect that people outside the wiki community will take care of further distribution. Zyxware Technologies, based in Trivandrum, has already added the Malayalam Wikisource CD to its CD distribution plan, and we hope others will follow soon. Last year IT@school, a few other government agencies, and a few computer magazines reused the Malayalam Wikipedia CD release. We expect many more social organizations, government agencies, magazines, and so on to do the same this year.

Mathrubhumi daily (a prominent Malayalam newspaper) covered the release of this CD. Tinu has provided a rough translation of the news report.

Related blogposts:

Creating Wikipedia CD

Malayalam Wikimedians released a CD version of 500 selected articles from Malayalam Wikipedia on 2010 April 17 as part of the 3rd Malayalam Wikimedians’ meetup. A mail about the release was sent to the Wikimedia India mailing list on the same day.

Malayalam Wikipedia CD Version 1.0 is available for download from the Malayalam Wikimedians’ community website, http://mlwiki.in/. The following versions of the Malayalam Wikipedia CD are available on the mlwiki.in website:

  • ISO image of the Malayalam Wikipedia CD version 1.0.
  • Online version of the Malayalam Wikipedia CD version 1.0.

Sticker used for the Malayalam Wikipedia CD Version 1.0. Designed by Hiran Venugopal (http://hiran.in/)

After we released the Malayalam Wikipedia CD, Version 1.0, many Indian language Wikimedians asked me about the effort and processes that went into creating the first Indian language Wikipedia on CD. This blog post is my reply.

Creating Malayalam Wikipedia CD

We are sharing the details of the processes that helped us create the Malayalam Wikipedia CD, in the hope that this information will help other Indian language Wikipedias when they try to create an offline version.

We developed a new application (Wiki2CD) for creating this CD. Initially we tried other solutions such as Kiwix. But we found that most of these solutions were not user friendly, required the support of the developer, or were designed with Latin scripts in mind. We were looking for a solution that is user friendly, that can be run with minimal (or no) developer support, and that supports Indian language scripts. At this point we sought the help of Santhosh Thottingal, a FOSS activist and Malayalam Wikimedian who has developed a number of other computing solutions for Indic languages. He explored the existing tools and decided to develop a solution that suits the requirements of the Malayalam Wikipedia community.

Creating a CD from wiki articles is not a simple task, especially for languages that use non-Latin scripts. For a language like Malayalam, which uses a complex script, the effort required was much greater. Now that the Malayalam wiki community has developed the software required for this process, other wiki communities can create their own wiki CDs without much effort. The steps involved in this process are described below.

Step 1 – Selecting the articles

Create a page for the CD project in the wiki and form a team of editors to select the articles. The page we created in Malayalam Wikipedia for this process can be seen at http://ml.wikipedia.org/wiki/WP:Version_1.0.

All wiki editors can help with the selection, but let the editorial team make the final decision about the articles that need to be included in the CD.

Step 2 – Peer review process

Once the articles that need to go into the CD are finalized, start the peer review of these articles. You can officially announce the review and invite everyone to take part in it.

Steps 1 and 2 can proceed simultaneously.

Step 3 – Verifying the license of the images

Form a team of editors to verify the image licenses. This is very important, since many of the Indian language wikis have many images uploaded from unknown sources. It is better to delete all such images from the wiki. More importantly, non-free or copyrighted images should not go into the CD.

Steps 1, 2, and 3 can proceed simultaneously.

Step 4 – Running the Wiki2CD application

Once the content is finalized, run the Wiki2CD application. You just need to create a topic list of the selected articles and run the Wiki2CD.sh file. For the documentation of the Wiki2CD software, see http://wiki.github.com/santhoshtr/wiki2cd/

For example, the topic list we used for Malayalam Wikipedia is at http://thottingal.in/projects/mlwikioncd/wiki2cd/topicslist.txt. The topic list of a sample English wiki CD is at http://github.com/santhoshtr/wiki2cd/blob/master/topicslist.txt
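
As a rough illustration of preparing the topic list, the Python sketch below writes the selected article titles to a UTF-8 text file. It assumes that the topic list is a plain text file with one article title per line; please check the sample topic lists linked above for the exact format that Wiki2CD expects. The function name write_topic_list and the sample titles are made up for this example.

```python
from pathlib import Path


def write_topic_list(titles, path="topicslist.txt"):
    """Write article titles to a UTF-8 text file, one title per line.

    Assumption: one title per line is the layout Wiki2CD expects; verify
    this against the sample topic lists before relying on it.
    """
    seen = set()
    cleaned = []
    for title in titles:
        title = title.strip()
        # Skip blank entries and duplicates, keeping the first occurrence.
        if title and title not in seen:
            seen.add(title)
            cleaned.append(title)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


if __name__ == "__main__":
    # Hypothetical titles, for illustration only.
    write_topic_list(["കേരളം", "India", "കേരളം", "  "])
```

Writing the file explicitly as UTF-8 matters for Indian language titles, since the default encoding differs between systems.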

Running Wiki2CD will take some time, depending on the number of selected articles and the number of images in them. Some error messages about missing images might appear during the run, but you can ignore them. Once the run is complete, the content is separated from the wiki, and any further edits to the articles need to be done locally.

Step 5 – Editing the html pages

If you want to edit the html pages to adapt the content to your requirements, upload them to some location and give read-write access to the selected members. Let them fix the content errors, hide the missing images, and so on, and finalize the articles.

You must be extremely careful while editing the html pages.

Editing html pages is feasible only if the number of articles is small. If you are building a CD with 10,000 or 20,000 articles, Steps 2, 3, and 5 might not be feasible. In that case you just need to select the articles using some process (maybe Step 1), prepare a topic list, and run the Wiki2CD program.

Step 6 – Creating content for copyright, disclaimer, and other supporting pages

Create content for the Copyright page, Disclaimer page, About this CD page, Credits page, and other supporting pages. When you run the Wiki2CD program, it creates dummy html pages for these, but you need to edit them according to your requirements.

You also need to edit content.html, toc.html, banner.html, and index.html so that the look and feel suits your requirements. We definitely do not want all the other language Wikipedia CDs to look like the Malayalam Wikipedia CD. You can also add more supporting pages as needed.

Important:

You must get permission from the Wikimedia Foundation to use the Wikipedia logos and the names of the wiki projects in the CD. This process may take some time, so it is better to start it at the initial stage itself.

Step 7 – Creating the ISO image

Create the ISO image and send the CD for replication. We recommend using a GNU/Linux system to generate the ISO image.
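
A minimal sketch of this step in Python, assuming that the genisoimage tool is installed on a GNU/Linux system (this post does not prescribe a particular tool). The directory name, image name, and volume label used below are placeholders.

```python
import subprocess


def build_iso_command(source_dir, iso_path, volume_label):
    """Return a genisoimage command line as a list of arguments.

    -o  output image file
    -V  volume label
    -J  add Joliet records (read by Windows)
    -R  add Rock Ridge records (read by GNU/Linux)
    """
    return [
        "genisoimage",
        "-o", iso_path,
        "-V", volume_label,
        "-J",
        "-R",
        source_dir,
    ]


def create_iso(source_dir, iso_path, volume_label):
    """Run genisoimage; raises CalledProcessError if the tool fails."""
    subprocess.run(build_iso_command(source_dir, iso_path, volume_label), check=True)


if __name__ == "__main__":
    # Placeholder names; printing the command does not require genisoimage.
    print(" ".join(build_iso_command("cd_content", "wiki_cd.iso", "WIKI_CD")))
```

Keeping the command construction separate from the call that runs it makes it easy to review the exact options before writing the image.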

Challenges

Even though the CD creation process may look simple now, it was not that simple when we started this CD project. Following are some of the challenges we faced.

  • It is really difficult to select articles from the various categories that need to be included in the CD. In most cases there is every chance that we will go after popular articles. In Malayalam Wikipedia, we have plenty of articles on biographies, geography, astronomy, and so on, but fewer articles in areas like politics, biology, and mathematics. While building the Wikipedia CD we must make sure that articles from all the important categories are included. (This does not apply when you create a CD based on a particular topic.)
  • The peer review of the selected articles was not up to our expectations. One of the main reasons was that the number of articles that needed to be peer reviewed in a short span of time was very high. As the number of articles increases, peer review of the articles for the CD may not be possible.
  • Copyright of the images is a major issue. We ran a clean-up of the images as part of the CD project. Do not include non-free or copyrighted images in the CD.
  • Unicode version 5.1 has introduced many issues for Malayalam in Malayalam Wikipedia. There are at least 6 Malayalam letters that can be represented in two different ways in Unicode, and this affected the CD project as well. To avoid issues while accessing the CD, the entire content of the CD was normalized to Unicode version 5.0. We really do not know how the dual encoding issues are going to affect the Malayalam wiki projects. It is really bad that end users of Unicode like us (Wikipedia users, blog users, and so on) need to understand Unicode technology in order to write in Malayalam.
  • The ISO 9660 file system for CD/DVD has a lot of limitations when it comes to non-Latin languages. Initially we used the Malayalam article titles themselves as the names of the html files. But writing them to the CD failed, since there are a lot of restrictions on file names in the CDFS file system. So finally we changed the names of the files to numbers. Later we found that the English Wikipedia CD follows the same approach, which might be for some other reason.
  • We found a lot of issues with the ISO images created on Mac and Windows systems. For Indian languages we suggest creating the ISO image on a GNU/Linux system.
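
To illustrate the dual encoding issue mentioned in the list above: Unicode 5.1 added single code points for six Malayalam chillu letters, which earlier text wrote as a consonant followed by the virama (U+0D4D) and a zero width joiner (U+200D). The Python sketch below rewrites the newer single code points back to the older sequences. It is only an illustration under that assumption and is not the code that was used for the CD; the function name is hypothetical.

```python
VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200d"     # ZERO WIDTH JOINER

# Chillu code points added in Unicode 5.1, mapped to their base consonants.
ATOMIC_CHILLU_TO_BASE = {
    "\u0d7a": "\u0d23",  # CHILLU NN -> NNA
    "\u0d7b": "\u0d28",  # CHILLU N  -> NA
    "\u0d7c": "\u0d30",  # CHILLU RR -> RA
    "\u0d7d": "\u0d32",  # CHILLU L  -> LA
    "\u0d7e": "\u0d33",  # CHILLU LL -> LLA
    "\u0d7f": "\u0d15",  # CHILLU K  -> KA
}

# Translation table: atomic chillu -> base consonant + virama + ZWJ.
_TABLE = str.maketrans(
    {atomic: base + VIRAMA + ZWJ for atomic, base in ATOMIC_CHILLU_TO_BASE.items()}
)


def to_pre_5_1_chillus(text):
    """Replace atomic chillu code points with the pre-5.1 sequences."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    sample = "\u0d05\u0d35\u0d7b"  # a word ending in the atomic CHILLU N
    converted = to_pre_5_1_chillus(sample)
    # The converted text is two code points longer than the original.
    print(len(sample), len(converted))
```

Applying one such mapping to all pages and to the search index keeps the text on the CD consistent, whichever encoding the original editors typed.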

For Indian language Wikipedia CDs, remember to embed the font required for the articles in the CD. It is also better to include other helpful goodies like fonts, language computing pages, links to various websites related to that language, and so on.

We recommend limiting the number of articles included in the CD to 500 or so for the first version. Let the first version be a test release to gather experience for future releases.

Known issue of Wiki2CD software

The Wiki2CD software currently has the following known issue.

  • If the name of an image file contains special characters such as a single quote, space, &, !, or (, or if the name contains non-Latin characters in URL-encoded format, the image is not downloaded. This issue needs to be fixed at the earliest.
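
The Python sketch below illustrates one way such file names could be handled, using only the standard library: decode a name that may already be percent-encoded, then encode it again so that quotes, ampersands, brackets, and non-Latin characters are safe to place in a download URL. This is an illustration of the problem and not a patch for Wiki2CD; the function name and the sample file names are made up.

```python
from urllib.parse import quote, unquote


def safe_image_url_name(name):
    """Return a percent-encoded form of an image file name.

    The name is decoded first, so input that is already percent-encoded
    is not encoded a second time.
    """
    decoded = unquote(name)
    # MediaWiki URLs use underscores in place of spaces in file names.
    decoded = decoded.replace(" ", "_")
    # safe="" makes quote() encode every reserved character, including "/".
    return quote(decoded, safe="")


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for sample in ["Tom's photo (1).jpg", "A & B!.png", "കേരളം.jpg"]:
        print(safe_image_url_name(sample))
```

Decoding before encoding means the function returns the same result whether the name arrives raw or already percent-encoded.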

Feature requests for Wiki2CD software

Here is my wish list for the Wiki2CD software.

  • Currently, in the html pages of the wiki articles (in the CD), the program hides all the unwanted wiki code by commenting it out. There should be some provision in the program to remove this unwanted wiki code at the time of page generation itself. The page should use only simple html/wiki code.
  • Currently, most of the wiki templates that do not need to go into the CD are removed by the program, but a few templates slip through. There should be some feature in the program to manage these templates in a better way. For example, the program could display all the templates used in the selected articles and let the user select the templates that need to be displayed in the articles in the CD. The other templates would then be removed.
  • If a wiki community does not want to use images from the wiki in its CD, the program should not download the images from the wiki.
  • We now have a provision for article title search in the CD. But as the number of articles grows, full text search will be required.
  • A better user interface for the Linux version.
  • A Windows version of this application. Windows developers can help.

Santhosh Thottingal, who developed the Wiki2CD application, is seeking the support of developers who have good knowledge of Python and JavaScript to fix the known issues and to enhance the Wiki2CD software. Kindly contact him at santhosh dot thottingal at gmail dot com for further guidance.

Credits

Many Wikimedians (and some others outside the Wikimedia community as well) helped us in this CD project. I would like to specially thank the following members for their involvement in the CD project.

1. All Malayalam Wikipedians, especially the editorial team, who helped in the selection of the articles.

2. Praveenp for creating content for most of the supporting pages like Copyright, Disclaimer, Malayalam Software page, and so on. I also remember the pain he took to travel all the way from Trivandrum to Kochi to hand over the master CD for replication.

3. Sunil VS for verifying the license of the images that need to be included in the CD.

4. Santhosh Thottingal for developing the Wiki2CD software.

5. Hiran Venugopal for designing the CD sticker, cover, banner, and other artwork used in the CD.

6. Jyothis Edathoot for helping us create the ISO image and set up the http://mlwiki.in/ website where the CD version resides now. He also helped in getting the permission (from the Wikimedia Foundation) to use the Wikipedia logo and other copyrighted materials in the CD.

7. IT@school for sponsoring the production cost of the CD.

8. All Malayalam Wikimedians who participated in the 3rd Malayalam Wikimeetup and made it a huge success.

References

  1. http://ml.wikipedia.org/wiki/WP:Version_1.0
  2. http://ml.wikipedia.org/wiki/Meetup-2010_April
  3. http://meta.wikimedia.org/wiki/Static_version_tools
  4. http://wikipediaondvd.com/
  5. http://github.com/santhoshtr/wiki2cd
  6. http://thottingal.in/blog/2010/04/17/mlwikioncd/
  7. http://www.kiwix.org/index.php/Main_Page