Creating Wikipedia CD

Malayalam Wikimedians released a CD version of 500 selected articles from Malayalam Wikipedia on 2010 April 17, as part of the 3rd Malayalam Wikimedians’ meetup. A mail about the release was sent to the Wikimedia India mailing list on the same day.

Malayalam Wikipedia CD Version 1.0 is available for download from the Malayalam Wikimedians’ community website, http://mlwiki.in/. The following versions of the Malayalam Wikipedia CD are available on the mlwiki.in website:

  • ISO image of the Malayalam Wikipedia CD version 1.0.
  • Online version of the Malayalam Wikipedia CD version 1.0.
Malayalam Wikipedia CD sticker
Sticker used for the Malayalam Wikipedia CD Version 1.0. Designed by Hiran Venugopal (http://hiran.in/)

After we released the Malayalam Wikipedia CD Version 1.0, many Indian language Wikimedians asked me about the effort and the processes that went into creating the first Indian language Wikipedia on CD. This blog post is my reply to them.

Creating Malayalam Wikipedia CD

We are sharing the details of the processes that helped us create the Malayalam Wikipedia CD, in the hope that this information will help other Indian language Wikipedias when they create their own offline versions.

We developed a new application, Wiki2CD, to create this CD. Initially we tried other solutions such as Kiwix. But most of these solutions are not user friendly, require the support of a developer, or are designed with Latin scripts in mind. We were looking for a solution that is user friendly, that can be run with minimal (or no) developer support, and that supports Indian language scripts. At this point we sought the help of Santhosh Thottingal, a FOSS activist and Malayalam Wikimedian who has developed a number of other computing solutions for Indic languages. He explored the existing tools and decided to develop a solution suited to the requirements of the Malayalam Wikipedia community.

Creating a CD from wiki articles is not a simple task, especially for languages that use non-Latin scripts. For a language like Malayalam, which uses a complex script, the effort required was much greater. Now that the Malayalam wiki community has developed the software required for this process, other wiki communities can create their own wiki CDs without much effort. The following are the steps involved in this process.

STEP 1 – Selecting the articles

Create a page for the CD project in the wiki and form a team of editors to select the articles. The page we created in Malayalam Wikipedia for this purpose is at http://ml.wikipedia.org/wiki/WP:Version_1.0.

Although all wiki editors can help the editorial team select articles, let the editorial team make the final decision on which articles are included in the CD.

STEP 2 – Peer review process

Once the articles that need to go into the CD are finalized, start the peer review of these articles. You can officially announce the review and invite everyone to take part in it.

Steps 1 & 2 can proceed simultaneously.

STEP 3 – Verifying the license of the images

Create a team of editors to verify the image licenses. This is very important, since many Indian language wikis have a large number of images uploaded from unknown sources. It is better to delete all such images from the wiki. More importantly, non-free or copyrighted images must not go into the CD.
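The license check was done by hand for our CD, but part of it can be scripted. The following is only a sketch, not what we used: it assumes the wiki exposes license metadata through the MediaWiki API (`prop=imageinfo` with `iiprop=extmetadata`), and the function names and the API address are illustrative. Images that return no license still need a human check.

```python
from urllib.parse import urlencode

# Assumption: the API endpoint of the wiki whose images are being checked.
API_URL = "https://ml.wikipedia.org/w/api.php"


def license_query_url(file_titles, api_url=API_URL):
    """Build a MediaWiki API URL that asks for license metadata of the given files."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "titles": "|".join(file_titles),
    }
    return api_url + "?" + urlencode(params)


def licenses_from_response(response):
    """Map each file title in a decoded JSON response to its short license name.

    Files without license metadata map to None so that editors can review them.
    """
    result = {}
    for page in response.get("query", {}).get("pages", {}).values():
        info = page.get("imageinfo") or [{}]
        meta = info[0].get("extmetadata", {})
        result[page.get("title")] = meta.get("LicenseShortName", {}).get("value")
    return result
```

The URL can be fetched with any HTTP client, and the decoded JSON passed to `licenses_from_response` to get a list for the image team to go through.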

Step 1, Step 2, and Step 3 can proceed simultaneously.

STEP 4 – Running the Wiki2CD application

Once the content is finalized, run the Wiki2CD application. You just need to create a topic list of the selected articles and run the Wiki2CD.sh file. For the documentation of the Wiki2CD software, see http://wiki.github.com/santhoshtr/wiki2cd/

For example, see the topic list we used for Malayalam Wikipedia at http://thottingal.in/projects/mlwikioncd/wiki2cd/topicslist.txt. The topic list of a sample English wiki CD is at http://github.com/santhoshtr/wiki2cd/blob/master/topicslist.txt

This process takes some time, depending on the number of selected articles and the number of images in them. Some error messages about missing images might appear during the process, but you can ignore them. Once this is done, the content is separate from the wiki, so any further edits to the articles need to be made locally.
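If the selected titles are already available as a list (for example, copied from the project page), the topic list can be written by a small script. This is a sketch that assumes the topic list is a UTF-8 text file with one article title per line; check the Wiki2CD documentation for the exact format. The function name is illustrative.

```python
from pathlib import Path


def write_topic_list(titles, path="topicslist.txt"):
    """Write article titles to a topic list file, one title per line, in UTF-8.

    Blank entries and repeated titles are dropped; the original order is kept.
    Assumption: Wiki2CD reads one title per line.
    """
    seen = set()
    cleaned = []
    for title in titles:
        title = title.strip()
        if title and title not in seen:
            seen.add(title)
            cleaned.append(title)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```

After writing the file, run Wiki2CD.sh as described in its documentation.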

STEP 5 – Editing the html pages

If you want to edit the html pages to adapt the content to your requirements, upload them to some location and give read-write access to selected members. Let them fix the content errors, hide the missing images, and so on, and finalize the articles.

You must be  extremely careful while editing the html pages.

Editing the html pages is feasible only if the number of articles is small. If you are building a CD with 10,000 or 20,000 articles, Steps 2, 3, and 5 might not be feasible. In that case you just need to select the articles using some process (maybe Step 1), prepare a topic list, and run the Wiki2CD program.

STEP 6 – Creating content for copyright, disclaimer, and other supporting pages

Create content for the Copyright page, Disclaimer page, About this CD page, Credits page, and other supporting pages. When you run the Wiki2CD program, placeholder html pages are created for these pages, but you need to edit them according to your requirements.

You also need to edit content.html, toc.html, banner.html, and index.html so that the look and feel is good and meets your requirements. We definitely do not want all the other language Wikipedia CDs to look like the Malayalam Wikipedia CD. You can also add more supporting pages according to your requirements.

Important:

You must get permission from the Wikimedia Foundation to use Wikipedia logos and the names of the wiki projects in the CD. This process may take some time, so it is better to start it at the initial stage itself.

STEP 7 – Creating the ISO image

Create the ISO image and send the CD for replication. We recommend using a GNU/Linux system to generate the ISO image.
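As an illustration of this step on GNU/Linux, the sketch below builds a command line for genisoimage with the Joliet (-J) and Rock Ridge (-r) extensions enabled. The choice of genisoimage, the volume label, and the function names are assumptions made for the example; this post does not prescribe a particular tool.

```python
import subprocess


def iso_command(source_dir, iso_path, volume_id="MLWIKI_1_0"):
    """Return a genisoimage command line as a list of arguments.

    -o  output image file
    -V  volume label (the default here is only an example)
    -J  add Joliet directory records
    -r  add Rock Ridge records with sensible default permissions
    """
    return ["genisoimage", "-o", iso_path, "-V", volume_id, "-J", "-r", source_dir]


def build_iso(source_dir, iso_path):
    """Run the command; raises CalledProcessError if genisoimage fails."""
    subprocess.run(iso_command(source_dir, iso_path), check=True)
```

Calling `build_iso("cd_root", "mlwiki.iso")` would write the image from the folder `cd_root`, provided genisoimage is installed. Test the image on several operating systems before sending it for replication.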

Challenges

Even though the CD creation process may look simple now, it was not that simple when we started this CD project. The following are some of the challenges we faced.

  • It is really difficult to select articles from various categories for inclusion in the CD. There is every chance that we will simply go after popular articles. In Malayalam Wikipedia we have plenty of articles on biographies, geography, astronomy, and so on, but fewer articles in other areas such as politics, biology, and mathematics. While building the Wikipedia CD we must make sure that articles from all the important categories are included. (This does not apply when you create a CD on a particular topic.)
  • The peer review of the selected articles was not up to our expectations. One of the main reasons was that a very large number of articles needed to be reviewed in a short span of time. As the number of articles increases, peer review of the articles for the CD may not be possible.
  • Copyright of the images is a major issue. We ran a clean-up of the images as part of the CD project. Do not include non-free or copyrighted images in the CD.
  • Unicode version 5.1 has introduced many issues for Malayalam in Malayalam Wikipedia. At least 6 Malayalam letters can now be encoded in two different ways in Unicode, and this affected the CD project as well. To avoid problems when accessing the CD, the entire content of the CD was normalized to Unicode version 5.0. We really do not know how the dual encoding issue is going to affect the Malayalam wiki projects. It is really unfortunate that end users of Unicode like us (Wikipedia users, blog users, and so on) need to understand Unicode technology in order to write in Malayalam Unicode.
  • The ISO 9660 file system for CD/DVD has many limitations when it comes to non-Latin languages. Initially we used the Malayalam article titles themselves as the names of the html files. But writing them to the CD failed, since CDFS places many restrictions on file names. So we finally changed the file names to numbers. Later we found that the English Wikipedia CD follows the same approach, though possibly for some other reason.
  • We found a lot of issues with ISO images created on Mac and Windows. For Indian languages we suggest creating the ISO image on a GNU/Linux system.
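Two of the workarounds in the list above can be sketched in Python. This is an illustration, not the Wiki2CD code. It assumes that the six dual-encoded letters are the chillus, which Unicode 5.1 added as single code points (U+0D7A to U+0D7F) alongside the older consonant + virama + zero width joiner sequences. The standard Unicode normalization forms do not convert between the two, so an explicit table is needed. The function names are illustrative.

```python
ZWJ = "\u200d"      # zero width joiner
VIRAMA = "\u0d4d"   # Malayalam sign virama (chandrakkala)

# Atomic chillu (Unicode 5.1) -> base consonant of the older sequence.
CHILLU_BASE = {
    "\u0d7a": "\u0d23",  # chillu NN -> NNA
    "\u0d7b": "\u0d28",  # chillu N  -> NA
    "\u0d7c": "\u0d30",  # chillu RR -> RA
    "\u0d7d": "\u0d32",  # chillu L  -> LA
    "\u0d7e": "\u0d33",  # chillu LL -> LLA
    "\u0d7f": "\u0d15",  # chillu K  -> KA
}

_TABLE = str.maketrans({c: base + VIRAMA + ZWJ for c, base in CHILLU_BASE.items()})


def to_pre_51_chillus(text):
    """Replace atomic chillus with the consonant + virama + ZWJ sequences."""
    return text.translate(_TABLE)


def numbered_file_names(titles, extension=".html"):
    """Map each article title to a digits-only file name such as 0001.html."""
    width = max(4, len(str(len(titles))))
    return {
        title: f"{number:0{width}d}{extension}"
        for number, title in enumerate(titles, start=1)
    }
```

The links between the generated pages have to be rewritten with the same title-to-number mapping, and the search index has to keep the original titles so that readers can still find articles by name.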

For Indian language Wikipedia CDs, remember to embed the fonts required for the articles in the CD. It is also better to include other helpful extras such as fonts, pages on language computing, links to various websites related to the language, and so on.

We recommend limiting the number of articles included in the CD to 500 or so for the first version. Let the first version be a test release to gather experience for future releases.

Known issue of Wiki2CD software

The following is the known issue in the Wiki2CD software at present.

  • If the name of an image file contains special characters such as a single quote, space, &, !, or (, or if the name contains non-Latin characters in URL-encoded form, the image is not downloaded. This issue needs to be fixed at the earliest.
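One possible direction for a fix is to decode the image name once, and then encode it again consistently when building the download URL. The sketch below only illustrates that idea with the Python standard library; it has not been verified against the Wiki2CD code, and the base URL and function name are hypothetical.

```python
from urllib.parse import quote, unquote


def image_url_and_local_name(base_url, raw_name):
    """Return (download URL, local file name) for an image name.

    raw_name may be plain ("Tom's cat.jpg") or already URL-encoded
    ("%E0%B4%95.png"). It is decoded once, spaces become underscores as in
    wiki file URLs, and every reserved character is percent-encoded again.
    """
    decoded = unquote(raw_name).replace(" ", "_")
    url = base_url.rstrip("/") + "/" + quote(decoded, safe="")
    return url, decoded
```

If the URL is then passed to a shell command for downloading, it should also be wrapped with `shlex.quote` so that characters such as & and ( are not interpreted by the shell.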

Feature requests for Wiki2CD software

Here is my wish list for the Wiki2CD software.

  • Currently, in the html pages of the wiki articles (in the CD), the program hides all the unwanted wiki code by commenting it out. There should be some provision in the program to remove this unwanted wiki code at the time of page generation itself. The pages should use only simple html/wiki code.
  • Currently the program removes most of the wiki templates that do not need to go into the CD, but a few templates slip through. There should be a feature in the program to manage these templates better. For example, the program could display all the templates used in the selected articles and let the user select the templates that should be displayed in the articles in the CD. The other templates should be removed.
  • If a wiki community does not want to use images from the wiki in their CD, the program should not download the images.
  • We now have article title search in the CD. But as the number of articles grows, full text search will be required.
  • A better user interface for the Linux version.
  • A Windows version of this application. Windows developers can help.
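As a starting point for the template request above, the sketch below lists the templates used in a set of articles so that a user could choose which ones to keep. It is an assumption-laden example, not part of Wiki2CD: it works on the raw wikitext of the articles, uses a simple regular expression rather than a real wikitext parser, and skips parser functions such as {{#if:...}} and template parameters such as {{{1}}}.

```python
import re
from collections import Counter

# "{{Name|...}}" or "{{Name}}"; not preceded by "{" so "{{{param}}}" is ignored.
TEMPLATE_RE = re.compile(r"(?<!\{)\{\{\s*([^{}|#:]+?)\s*(?=\||\}\})")


def template_counts(wikitexts):
    """Count how often each template name occurs in the given wikitext strings."""
    counts = Counter()
    for text in wikitexts:
        for name in TEMPLATE_RE.findall(text):
            counts[name.replace("_", " ")] += 1
    return counts
```

Printing `template_counts(...).most_common()` gives a list from which the templates to keep can be picked.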

Santhosh Thottingal, who developed the Wiki2CD application, is seeking the support of developers with good knowledge of Python and JavaScript to fix the known issues and to enhance the Wiki2CD software. Kindly contact him at santhosh dot thottingal at gmail dot com for further guidance.

Credits

Many Wikimedians (and some people outside Wikimedia as well) helped us in this CD project. I would especially like to thank the following for their involvement in the CD project.

1.  All Malayalam Wikipedians, especially the following users (editorial team), who helped in the selection of the articles,

2.  Praveenp for creating the content of most of the supporting pages such as Copyright, Disclaimer, the Malayalam Software page, and so on. I also remember the trouble he took to travel all the way from Trivandrum to Kochi to hand over the master CD for replication.

3.  Sunil VS for verifying the licenses of the images that needed to be included in the CD.

4.  Santhosh Thottingal for developing the Wiki2CD software.

5.  Hiran Venugopal for designing the CD sticker, cover, banner, and other artworks used in the CD.

6.  Jyothis Edathoot for helping us create the ISO image and set up the http://mlwiki.in/ website, where the CD version now resides. He also helped in getting permission from the Wikimedia Foundation to use the Wikipedia logo and other copyrighted materials in the CD.

7.  IT@school for sponsoring the production cost of the CD.

8.  All Malayalam Wikimedians who participated in the 3rd Malayalam Wikimeetup and made it a huge success.

References

  1. http://ml.wikipedia.org/wiki/WP:Version_1.0
  2. http://ml.wikipedia.org/wiki/Meetup-2010_April
  3. http://meta.wikimedia.org/wiki/Static_version_tools
  4. http://wikipediaondvd.com/
  5. http://github.com/santhoshtr/wiki2cd
  6. http://thottingal.in/blog/2010/04/17/mlwikioncd/
  7. http://www.kiwix.org/index.php/Main_Page