Introduction
------------

ConfluenceConverter is a distribution of software that converts exported data
from Confluence wiki instances, provided in the form of an XML file, to a
collection of wiki pages and resources that can be imported into a MoinMoin
instance as a page package.

Migration Activities
--------------------

The following activities are involved in a migration from Confluence to
MoinMoin. First, the activities that can be performed from any location:

  * Export of Confluence content
  * Conversion of Confluence content to MoinMoin content
  * Confluence page identifier extraction and mapping to MoinMoin identifiers
  * Acquisition of Confluence user profile details

Then, the activities that are performed on the server:

  * Installation of MoinMoin
  * Initialisation of a MoinMoin wiki instance
  * Import of MoinMoin content into the new wiki instance
  * Installation of MoinMoin extensions
  * Initialisation of user profiles in MoinMoin
  * Installation of scripts and identifier mappings
  * Filesystem permission adjustments

Prerequisites
-------------

ConfluenceConverter requires a library called xmlread that can be found at the
following location:

http://hgweb.boddie.org.uk/xmlread

The xmlread.py file from the xmlread distribution can be copied into the
ConfluenceConverter directory.

ConfluenceConverter also requires access to the MoinMoin.wikiutil module found
in the MoinMoin distribution.

The moinsetup program is highly recommended for the installation of page
packages and the management of MoinMoin wiki instances:

http://moinmo.in/ScriptMarket/moinsetup

If moinsetup is not being used, the page package installer documentation
should be consulted:

http://moinmo.in/HelpOnPackageInstaller

To read Confluence user profiles on live Confluence sites using the
get_profiles.py program, the libxml2dom library is required:

http://hgweb.boddie.org.uk/libxml2dom

MoinMoin Prerequisites
----------------------

The page package installer does not preserve user information or the last
modified time when installing page revisions. This behaviour can be changed by
applying a patch to MoinMoin as follows while at the top level of the MoinMoin
source distribution:

patch -p1 < $CCDIR/patches/patch-moin-1.9-MoinMoin-packages.diff

Here, CCDIR is the path to the top level of this source distribution where
this README.txt file is found.

When importing users, MoinMoin may be unable to handle user information
containing non-ASCII characters. Another patch to solve such problems can be
applied to MoinMoin as follows:

patch -p1 < $CCDIR/patches/patch-moin-1.9-MoinMoin-user.diff

Wiki Content Prerequisites
--------------------------

For the output of the converter, the following MoinMoin extensions are
required:

http://moinmo.in/ParserMarket/ImprovedTableParser
http://moinmo.in/ActionMarket/SubpageComments
http://moinmo.in/MacroMarket/Color2

A common dependency of various extensions is provided by MoinSupport:

http://hgweb.boddie.org.uk/MoinSupport

Additional Software
-------------------

PDF export support requires the ExportPDF action:

http://moinmo.in/ActionMarket/ExportPDF

This in turn requires Apache FOP for PDF production using XSL-FO:

http://xmlgraphics.apache.org/fop/

(On Debian systems, the fop package provides this tool.)

To produce XSL-FO from DocBook output, xsltproc is required from the libxslt
distribution:

http://xmlsoft.org/XSLT/

(On Debian systems, the xsltproc package provides this tool.)

And DocBook output requires the DocBook resources to be installed, described
in the following guide:

http://www.sagehill.net/docbookxsl/ToolsSetup.html

(On Debian systems, the docbook-xsl package provides these resources.)

Quick Start
-----------

(!) The acquisition of Confluence wiki content and its conversion can be
performed from any location, not necessarily on the server.

To obtain XML export archives from a Confluence wiki instance, visit the
exportspacexml.action resource and select the "Export" button. For example,
for the Mailman Wiki, the appropriate resource (with the COM namespace
selected) is as follows:

http://wiki.list.org/spaces/exportspacexml.action?key=COM

For your own instance, adjust the above URL accordingly. Alternatively, you
can find your way to the export page by selecting a namespace, then choosing
"Advanced" from the "Browse" menu, and then choosing "XML Export" from the
"Export" sidebar.
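
As an illustration only, the export page addresses for several spaces can be
generated with a short Python 3 sketch. The export_url function is a
hypothetical helper, not part of this distribution, and the path is taken from
the Mailman Wiki example above; other Confluence sites may be rooted
elsewhere.

```python
from urllib.parse import urlencode, urljoin

def export_url(site_url, space_key):
    """Return the XML export page address for one Confluence space."""
    # Ensure a trailing slash so that urljoin appends to the site root.
    base = site_url if site_url.endswith("/") else site_url + "/"
    page = urljoin(base, "spaces/exportspacexml.action")
    return page + "?" + urlencode({"key": space_key})

# Space keys used as examples elsewhere in this document.
for key in ["COM", "DEV", "DOC", "SEC"]:
    print(export_url("http://wiki.list.org/", key))
```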

Given an XML export archive file for a Confluence wiki instance (in the
example below, the file is called COM-123456-789012.zip), the following
command can be used to prepare a page package for MoinMoin:

python convert.py COM-123456-789012.zip COM

In addition to the filename, a workspace name is required. Confluence appears
to require a workspace as a container for collections of pages, but this also
permits us to selectively import parts of a wiki into MoinMoin. If attachments
were included in the export from Confluence, these will be included in the
page package.

The result of the above command will be a directory having the same name as
the chosen workspace, together with a zip archive for that directory's
contents. Thus, the above command would produce a directory called COM and an
archive called COM.zip.
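
Where several archives are to be converted, one convert.py command is needed
per archive. The following Python 3 sketch only builds the commands and the
expected output names without running anything. The convert_command function
is a hypothetical helper, and deriving the workspace name from the start of
the filename is an assumption based on the example filename above.

```python
import os

def convert_command(archive, space=None):
    """Return (argv, output directory, output archive) for one export file."""
    if space is None:
        # Assumption: the export filename starts with the space key, as in
        # COM-123456-789012.zip.
        space = os.path.basename(archive).split("-", 1)[0]
    argv = ["python", "convert.py", archive, space]
    return argv, space, space + ".zip"

argv, directory, package = convert_command("COM-123456-789012.zip")
print(" ".join(argv), "->", directory, "and", package)
```

The workspace name can always be given explicitly as the second argument when
the filename does not follow this pattern.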

(!) The following step is performed on the server.

To import the result (although you may wish to process other namespaces
first), use moinsetup as follows:

python moinsetup.py -m install_page_package COM.zip

This requires a suitable moinsetup.cfg file in the working directory.

Importing Many Workspaces/Namespaces
------------------------------------

Where more than one namespace is to be imported, the page packages should be
merged so that the resulting history information is ordered correctly.

(!) This process can be performed from any location and the result uploaded to
the server for eventual import.

To merge packages, use a command of the following form:

python merge.py OUT COM.zip DEV.zip DOC.zip SEC.zip

A directory called OUT and a page package called OUT.zip will be produced. The
latter can then be imported into MoinMoin as described above.
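
The ordering concern can be pictured with a small Python 3 sketch. It does not
show how merge.py is implemented: the (timestamp, page, file) records are
hypothetical and merely illustrate revisions from two spaces being interleaved
by time, instead of one space's history following the other's.

```python
import heapq

# Hypothetical revision records: (timestamp, page name, revision file).
com = [(100, "COM/Home", "versions/1"), (300, "COM/Home", "versions/3")]
dev = [(200, "DEV/Home", "versions/2"), (400, "DEV/Guide", "versions/4")]

def merge_histories(*histories):
    """Merge revision lists, each already sorted by time, into one list."""
    return list(heapq.merge(*histories))

for timestamp, page, filename in merge_histories(com, dev):
    print(timestamp, page, filename)
```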

Mappings from Identifiers to Pages
----------------------------------

Confluence uses numbers to label content revisions, and links to Confluence
sites sometimes use these numbers instead of a readable page name. MoinMoin,
meanwhile, only uses page names and has no external numeric identifier scheme.
Consequently, it is necessary to produce a mapping from Confluence identifiers
to MoinMoin page names. In addition to numeric identifiers, Confluence also
provides "tiny URLs" which are an alphanumeric encoding of the numeric
identifiers.

(!) This process can be performed with the converted content from any
location, with the generated files uploaded to the server for eventual
deployment.

To generate mappings for the Confluence content, use the mappings script as
follows:

tools/mappings.sh COM

Here, COM is the name of a directory containing converted Confluence content,
corresponding to a space name in the original Confluence wiki. More than one
space name can be used to generate a complete mapping for a site. For example:

tools/mappings.sh COM DEV DOC SEC

The following files are generated:

  * mapping-id-to-page.txt
  * mapping-tiny-to-id.txt
  * mapping-tiny-to-page.txt

The most useful of these is the first as it includes all the necessary
information provided by the arbitrary mapping from identifiers to page names.
The second mapping merely converts the "tiny URLs" to identifiers, which can
be done by applying an algorithm without any external knowledge of the wiki
structure. The third mapping is provided as a convenience, combining the "tiny
URL" conversion and the arbitrary mapping to page names.
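
For reference, the "tiny URL" conversion can be sketched in Python 3 as
follows. This assumes the commonly described Confluence scheme, which is not
taken from this distribution: the identifier is packed as a little-endian
integer, base64-encoded, stripped of trailing "A" and "=" characters, and has
"/" and "+" replaced by "-" and "_". The function names are illustrative.

```python
import base64
import struct

def tiny_to_id(tiny):
    """Decode a "tiny URL" code into a numeric identifier."""
    # Assumption: "-" and "_" stand in for the base64 characters "/" and "+".
    standard = tiny.replace("-", "/").replace("_", "+")
    # Restore the stripped trailing "A" characters (zero bits) and the padding
    # so that the code describes a full 8-byte value.
    padded = standard.ljust(11, "A") + "="
    return struct.unpack("<Q", base64.b64decode(padded))[0]

def id_to_tiny(identifier):
    """Encode a numeric identifier as a "tiny URL" code."""
    encoded = base64.b64encode(struct.pack("<Q", identifier)).decode("ascii")
    shortened = encoded.rstrip("=").rstrip("A")
    return shortened.replace("/", "-").replace("+", "_")

print(tiny_to_id("D4AB"), id_to_tiny(98319))
```

Under these assumptions, the code D4AB and the identifier 98319 correspond to
each other. The generated mapping-tiny-to-id.txt file remains the
authoritative source for a converted site.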

Translating Requests Using the Mappings
---------------------------------------

Where Web server facilities such as RewriteMap are available for use, the
first and third mapping files can be used directly. See the Apache
documentation for details of RewriteMap:

http://httpd.apache.org/docs/2.4/rewrite/rewritemap.html

Otherwise, it is more likely that the first file is used by a program that can
perform a redirect to the appropriate wiki page, and the "tiny URL" decoding
is also done by this program when deployed in a suitable location to receive
such requests. To support this, the following resources are provided:

  * scripts/redirect.py
  * config/mailmanwiki-redirect

The latter configuration file should be combined with the Web server
configuration file such that the appropriate aliases are able to capture
requests and invoke the redirect.py script before the main wiki aliases are
consulted. The script itself should be placed in a suitable filesystem
location. The mapping-id-to-page.txt file should either be placed alongside
the script, or be placed elsewhere with the MAPPING_ID_TO_PAGE variable in the
script changed to refer to that location.
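
The lookup performed by such a program can be sketched in Python 3 as follows.
This is not the code of redirect.py: the function names are hypothetical, and
the file is assumed to hold one whitespace-separated "identifier page" pair
per line, which is the layout that RewriteMap text files use.

```python
from urllib.parse import quote

def load_mapping(lines):
    """Parse "key value" lines into a dictionary, skipping comments."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split(None, 1)
        mapping[key] = value
    return mapping

def redirect_target(mapping, page_id, wiki_root="/"):
    """Return the address to redirect to, or None for an unknown identifier."""
    page = mapping.get(page_id)
    if page is None:
        return None
    return wiki_root.rstrip("/") + "/" + quote(page)

# Hypothetical mapping content; a real program would read
# mapping-id-to-page.txt instead.
sample = ["# identifier page", "123 COM/Home", "456 DEV/Guide"]
print(redirect_target(load_mapping(sample), "123"))
```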

Supporting Confluence Action URLs
---------------------------------

Besides the "viewpage" action mapping identifiers to pages (covered by the
mapping described above), some other action URLs may be used in wiki content
and must either be translated or supported using redirects. Since external
sites may also employ such actions, a redirect strategy perhaps makes more
sense. To support this, the following resources are involved:

  * scripts/dashboard.py
  * scripts/redirect.py
  * scripts/search.py
  * config/mailmanwiki-redirect

The last of these, the configuration file, is also involved in
identifier-to-page mapping, but in this case it causes requests to the
"dashboard", "doexportpage" and "dosearchsite" actions to be directed to the
dashboard.py, redirect.py and search.py scripts respectively.

The dashboard.py script merely redirects requests to the root of the site,
thus assuming that the front page is configured to show dashboard-like
information.

The redirect.py script, apart from supporting identifier-to-page redirects,
also supports attachment downloads and PDF page exports, since both kinds of
resource employ identifiers to indicate which page is involved. In an
environment that uses .htaccess and mod_rewrite, the redirect.py script should
also be deployed under separate names (such as export.py and exportpdf.py) so
that it can discover whether it should be exporting a page instead of just
showing it.
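
The idea of one script behaving differently under different names can be
sketched in Python 3 as follows. The mode names and the mode_for function are
hypothetical and do not describe how redirect.py makes this decision.

```python
import os

# Hypothetical deployment names mapped to the behaviour each should select.
MODES = {
    "redirect.py": "view",
    "export.py": "export",
    "exportpdf.py": "exportpdf",
}

def mode_for(script_path):
    """Choose a behaviour from the name under which the script was invoked."""
    return MODES.get(os.path.basename(script_path), "view")

print(mode_for("/cgi-bin/exportpdf.py"))
```

In a CGI environment, the path might come from the SCRIPT_NAME variable or
from sys.argv[0].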

The search.py script redirects search requests in a suitable form to the
MoinMoin "fullsearch" action.
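
Such a translation can be sketched in Python 3 as follows. This is not the
code of search.py. Two parameter names are assumptions here: "queryString" for
the search terms in the Confluence request and "value" for the terms given to
the MoinMoin action. The search_redirect function is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode

def search_redirect(query_string, wiki_root="/"):
    """Translate a Confluence search query string into a MoinMoin address."""
    params = parse_qs(query_string)
    # Assumption: Confluence passes the search terms as "queryString".
    terms = params.get("queryString", [""])[0]
    # Assumption: the MoinMoin action reads the search terms from "value".
    return wiki_root + "?" + urlencode({"action": "fullsearch", "value": terms})

print(search_redirect("queryString=mailing+list"))
```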

Identifying and Migrating Users
-------------------------------

Confluence export archives do not contain user profile information, but page
versions are marked with user identifiers. Therefore, a list of user
identifiers can be obtained by running a script extracting these identifiers.
The following command writes to standard output the users involved with
editing the wiki in four different spaces (exported to four directories):

tools/users.sh COM DEV DOC SEC

This output can be edited and then passed to a program which fetches other
profile details as follows:

tools/users.sh COM DEV DOC SEC > users.txt

After editing...

  cat users.txt \
| tools/get_profiles.py http://wiki.list.org/ \
> profiles.txt

If no users are to be removed in migration, the following command could be
issued:

  tools/users.sh COM DEV DOC SEC \
| tools/get_profiles.py http://wiki.list.org/ \
> profiles.txt

The get_profiles.py program needs to be told the URL of the original
Confluence site. Note that it accesses the site at a default rate of around
one request per second; a different delay between requests can be specified
using an additional argument.
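
The effect of the delay can be pictured with a Python 3 sketch. It is not the
code of get_profiles.py: the fetch_profiles function and its parameters are
hypothetical, and the fetch function stands in for whatever retrieves a
profile from the site.

```python
import time

def fetch_profiles(usernames, fetch, delay=1.0, sleep=time.sleep):
    """Yield (username, profile) pairs, pausing between successive requests."""
    for index, username in enumerate(usernames):
        if index:
            # No pause is needed before the very first request.
            sleep(delay)
        yield username, fetch(username)

# A stand-in fetch function and a short delay keep this example offline
# and quick.
for name, profile in fetch_profiles(["alice", "bob"], str.upper, delay=0.1):
    print(name, profile)
```

With the default delay of one second, a list of several hundred users would
take several minutes to process.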

(!) The above steps can be performed from any location, but the command
pipelines below need to be run on the server due to the use of a program that
updates the deployed wiki.

The output of the get_profiles.py program can be passed to another program
which adds users to MoinMoin, and so the following commands can be used:

  cat profiles.txt \
| tools/addusers.py wiki

Alternatively, the users can be converted to profiles and immediately added
without creating a profiles file:

  cat users.txt \
| tools/get_profiles.py http://wiki.list.org/ \
| tools/addusers.py wiki

Or, using a single command without inspecting the users or profiles at all:

  tools/users.sh COM DEV DOC SEC \
| tools/get_profiles.py http://wiki.list.org/ \
| tools/addusers.py wiki

The addusers.py program needs to be told the directory containing the wiki
configuration.

Output Structure
----------------

The structure of a converted workspace is a directory hierarchy containing the
following directories:

  * pages     (a collection of directories, each defining a page or content
               item, corresponding to Page, Comment and BlogPost elements in
               the XML exported from Confluence)

  * versions  (a collection of files, each defining a revision or version of
               some content, corresponding to BodyContent elements in the XML
               exported from Confluence)

Each page directory contains the following things:

  * pagetype    (either "Page", "Comment" or "BlogPost")

  * manifest    (a list of version entries in a format similar to the MoinMoin
                 page package manifest format)

  * attachments (a list of attachment version entries in a format similar to
                 the MoinMoin page package manifest format)

  * pagetitle   (an optional page title imposed on the page by another content
                 item)

  * children    (a list of child page names defined for the page)

  * comments    (a list of creation date plus comment page identifier pairs)

In the output structure, content items such as comments are represented as
pages, and each references a content version. Since comments will ultimately
be represented as subpages of some parent page, they will have a pagetitle
file in their directory with an appropriate subpage name written according to
the parent page's name and comment details.
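
The layout can be inspected with a short Python 3 sketch that counts content
items by type. The count_page_types function is a hypothetical helper, and it
assumes that each pagetype file holds just the type name, possibly followed by
a newline.

```python
import os
from collections import Counter

def count_page_types(space_dir):
    """Count the content items in a converted workspace by their pagetype."""
    counts = Counter()
    pages_dir = os.path.join(space_dir, "pages")
    for name in sorted(os.listdir(pages_dir)):
        pagetype_file = os.path.join(pages_dir, name, "pagetype")
        if os.path.isfile(pagetype_file):
            with open(pagetype_file) as f:
                counts[f.read().strip()] += 1
    return counts

# Only report on the example workspace if it has actually been converted.
if os.path.isdir(os.path.join("COM", "pages")):
    print(count_page_types("COM"))
```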

Troubleshooting
---------------

The page package import activity in particular can be a source of problems.
Generally, any error occurring when attempting to import a package is likely
to be due to insufficient privileges when writing to the pages directory of a
wiki or to its edit-log file.

The moinsetup software can generate scripts that set the ownership of wiki
files or apply ACLs (access control lists) to those files in order to make
access to wiki data more convenient. Where the ownership of the files must be
set (to www-data or nobody), the import step can be run as that user given
sufficient privileges. However, the easiest solution is to apply ACLs, thus
allowing the user who created the wiki to retain write access to it.
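
Before importing, write access can be checked with a Python 3 sketch such as
the following. The unwritable_paths function is a hypothetical helper, and it
assumes that the pages directory and the edit-log file both reside directly
inside the wiki's data directory.

```python
import os

def unwritable_paths(data_dir):
    """Return the paths needed for an import that the current user cannot write."""
    candidates = [
        os.path.join(data_dir, "pages"),
        os.path.join(data_dir, "edit-log"),
    ]
    # os.access also reports False for paths that do not exist.
    return [path for path in candidates if not os.access(path, os.W_OK)]

# The path below is a placeholder for the data directory of the wiki.
for path in unwritable_paths("wiki/data"):
    print("Cannot write to", path)
```

This should be run as the user who will perform the import.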

Contact, Copyright and Licence Information
------------------------------------------

The current Web page for ConfluenceConverter at the time of release is:

http://hgweb.boddie.org.uk/ConfluenceConverter

Copyright and licence information can be found in the docs directory - see
docs/COPYING.txt and docs/LICENCE.txt for more information.