ConfluenceConverter (file README.txt at 72d0d89748c7)

Introduction
============

ConfluenceConverter is a distribution of software that converts exported data
from Confluence wiki instances, provided in the form of an XML file, to a
collection of wiki pages and resources that can be imported into a MoinMoin
instance as a page package.

Migration Activities
--------------------

The following activities are involved in a migration from Confluence to
MoinMoin. First, the activities that can be performed from any location:

  * Export of Confluence content
  * Conversion of Confluence content to MoinMoin content
  * Confluence page identifier extraction and mapping to MoinMoin identifiers
  * Acquisition of Confluence user profile details

Then, the activities that are performed on the server:

  * Installation of MoinMoin
  * Initialisation of a MoinMoin wiki instance
  * Import of MoinMoin content into the new wiki instance
  * Installation of MoinMoin extensions
  * Initialisation of user profiles in MoinMoin
  * Installation of scripts and identifier mappings
  * Filesystem permission adjustments



Prerequisites
=============

ConfluenceConverter requires a library called xmlread that can be found at the
following location:

http://hgweb.boddie.org.uk/xmlread

The xmlread.py file from the xmlread distribution can be copied into the
ConfluenceConverter directory.

ConfluenceConverter also requires access to the MoinMoin.wikiutil module found
in the MoinMoin distribution. Setting the PYTHONPATH environment variable to
the location of the MoinMoin package should be sufficient for access to this
module.
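
Before running the converter, it may help to confirm that both modules can
actually be found. The following is a minimal sketch, not part of this
distribution: the missing_modules function is a hypothetical helper, and any
failure to import (including errors raised by running an unsuitable Python
version) is simply treated as "not available".

```python
import importlib


def missing_modules(names):
    """Return the module names from 'names' that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except Exception:
            # Any failure is treated as "not available" in this sketch.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # xmlread and MoinMoin.wikiutil are the modules named in this README.
    for name in missing_modules(["xmlread", "MoinMoin.wikiutil"]):
        print("Cannot import %s: check PYTHONPATH" % name)
```

Run it with the same interpreter and PYTHONPATH setting that will be used for
convert.py, since the result depends on both.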

The moinsetup program is highly recommended for the installation of page
packages and the management of MoinMoin wiki instances:

http://moinmo.in/ScriptMarket/moinsetup

If moinsetup is not being used, the page package installer documentation
should be consulted:

http://moinmo.in/HelpOnPackageInstaller

To read Confluence user profiles on live Confluence sites using the
get_profiles.py program, the libxml2dom library is required:

http://hgweb.boddie.org.uk/libxml2dom

MoinMoin Prerequisites
----------------------

The page package installer does not preserve user information or the last
modified time when installing page revisions. This behaviour can be changed by
applying a patch to MoinMoin as follows while at the top level of the MoinMoin
source distribution:

patch -p1 < $CCDIR/patches/patch-moin-1.9-MoinMoin-packages.diff

Here, CCDIR is the path to the top level of this source distribution where
this README.txt file is found.

When importing users, MoinMoin may be unable to handle user information
containing non-ASCII characters. Another patch to solve such problems can be
applied to MoinMoin as follows:

patch -p1 < $CCDIR/patches/patch-moin-1.9-MoinMoin-user.diff

Wiki Content Prerequisites
--------------------------

For the output of the converter, the following MoinMoin extensions are
required:

http://moinmo.in/ParserMarket/ImprovedTableParser
http://moinmo.in/ActionMarket/SubpageComments
http://moinmo.in/MacroMarket/Color2

A common dependency of various extensions is provided by MoinSupport:

http://hgweb.boddie.org.uk/MoinSupport



Additional Software
===================

PDF export support requires the ExportPDF action:

http://moinmo.in/ActionMarket/ExportPDF

This in turn requires Apache FOP for PDF production using XSL-FO:

http://xmlgraphics.apache.org/fop/

(On Debian systems, the fop package provides this tool.)

To produce XSL-FO from DocBook output, xsltproc is required from the libxslt
distribution:

http://xmlsoft.org/XSLT/

(On Debian systems, the xsltproc package provides this tool.)

And DocBook output requires the DocBook resources to be installed, described
in the following guide:

http://www.sagehill.net/docbookxsl/ToolsSetup.html

(On Debian systems, the docbook-xsl package provides these resources.)
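
To check the toolchain by hand, the two stages (DocBook to XSL-FO, then XSL-FO
to PDF) can be expressed as a pair of commands. The following sketch only
builds the command lines; the pdf_commands function and the file names are
illustrative, the stylesheet path is an assumption about where the Debian
docbook-xsl package installs its FO stylesheet, and how ExportPDF itself
invokes these tools is not described in this document.

```python
# Assumed location of the DocBook FO stylesheet on a Debian system.
DEFAULT_STYLESHEET = "/usr/share/xml/docbook/stylesheet/docbook-xsl/fo/docbook.xsl"


def pdf_commands(docbook_file, fo_file, pdf_file, stylesheet=DEFAULT_STYLESHEET):
    """Return the two commands turning DocBook XML into a PDF file."""
    return [
        # Stage 1: apply the DocBook FO stylesheet to produce XSL-FO.
        ["xsltproc", "--output", fo_file, stylesheet, docbook_file],
        # Stage 2: render the XSL-FO document as PDF using Apache FOP.
        ["fop", "-fo", fo_file, "-pdf", pdf_file],
    ]


if __name__ == "__main__":
    for command in pdf_commands("page.xml", "page.fo", "page.pdf"):
        print(" ".join(command))
```

Each command list can be passed to subprocess.check_call once the file names
have been adjusted.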



Quick Start
===========

(!) The acquisition of Confluence wiki content and its conversion can be
performed from any location, not necessarily on the server.

To obtain XML export archives from a Confluence wiki instance, the
exportspacexml.action resource is visited and the "Export" button selected.
For example, for the Mailman Wiki, the appropriate resource (with the COM
namespace selected) is as follows:

http://wiki.list.org/spaces/exportspacexml.action?key=COM

For your own instance, adjust the above URL accordingly. Alternatively, you
can find your way to the export page by selecting a namespace, then choosing
"Advanced" from the "Browse" menu, and then choosing "XML Export" from the
"Export" sidebar.

Given an XML export archive file for a Confluence wiki instance (in the
example below, the file is called COM-123456-789012.zip), the following
command can be used to prepare a page package for MoinMoin:

python convert.py COM-123456-789012.zip COM

In addition to the filename, a workspace name is required. Confluence appears
to require a workspace as a container for collections of pages, but this also
permits us to selectively import parts of a wiki into MoinMoin. If attachments
were included in the export from Confluence, these will be imported into the
page package.

The result of the above command will be a directory having the same name as
the chosen workspace, together with a zip archive for that directory's
contents. Thus, the above command would produce a directory called COM and an
archive called COM.zip.
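
When several archives have been exported, the conversion commands can be
generated rather than typed. The following sketch assumes that each archive is
named after its space in the form SPACE-digits-digits.zip, as in the example
above; that naming rule and the helper functions are assumptions, not
something convert.py relies upon.

```python
import os


def workspace_name(archive):
    """Guess the workspace name from an archive such as COM-123456-789012.zip."""
    stem = os.path.splitext(os.path.basename(archive))[0]
    # Assumption: the text before the first hyphen is the space key.
    return stem.split("-", 1)[0]


def convert_commands(archives):
    """Return one convert.py command line (as a list) per archive."""
    return [["python", "convert.py", archive, workspace_name(archive)]
            for archive in archives]


if __name__ == "__main__":
    for command in convert_commands(["COM-123456-789012.zip"]):
        print(" ".join(command))
```

Where an archive does not follow this naming pattern, the workspace name
should be given explicitly instead.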

(!) The following step is performed on the server.

To import the result (although you may wish to process other namespaces
first), use moinsetup as follows:

python moinsetup.py -m install_page_package COM.zip

This requires a suitable moinsetup.cfg file in the working directory.

Importing Many Workspaces/Namespaces
------------------------------------

Where more than one namespace is to be imported, the page packages should be
merged so that the resulting history information is ordered correctly.

(!) This process can be performed from any location and the result uploaded to
the server for eventual import.

To merge packages, use a command of the following form:

python merge.py OUT COM.zip DEV.zip DOC.zip SEC.zip

A directory called OUT and a page package called OUT.zip will be produced. The
latter can then be imported into MoinMoin as described above.
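
The reason for merging can be illustrated with a small sketch. If each package
lists its revisions in time order, importing the packages one after another
would record all of one space's edits before any of the next space's, whereas
a merge interleaves them by timestamp. The tuples below are hypothetical data,
and this is not a description of how merge.py is implemented.

```python
import heapq


def merge_histories(*histories):
    """Merge revision lists, each already sorted by timestamp, into one list."""
    # heapq.merge performs an ordered merge of sorted inputs.
    return list(heapq.merge(*histories))


if __name__ == "__main__":
    # Hypothetical (timestamp, space, page) entries for two spaces.
    com = [(100, "COM", "Home"), (300, "COM", "FAQ")]
    dev = [(200, "DEV", "Home")]
    for entry in merge_histories(com, dev):
        print(entry)
```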

Mappings from Identifiers to Pages
----------------------------------

Confluence uses numbers to label content revisions, and links to Confluence
sites sometimes use these numbers instead of a readable page name. MoinMoin,
meanwhile, only uses page names and has no external numeric identifier scheme.
Consequently, it is necessary to produce a mapping from Confluence identifiers
to MoinMoin page names. In addition to numeric identifiers, Confluence also
provides "tiny URLs" which are an alphanumeric encoding of the numeric
identifiers.

(!) This process can be performed with the converted content from any
location, with the generated files uploaded to the server for eventual
deployment.

To generate mappings for the Confluence content, use the mappings script as
follows:

tools/mappings.sh COM

Here, COM is the name of a directory containing converted Confluence content,
corresponding to a space name in the original Confluence wiki. More than one
space name can be given to generate a complete mapping for a site. For
example:

tools/mappings.sh COM DEV DOC SEC

The following files are generated:

  * mapping-id-to-page.txt
  * mapping-tiny-to-id.txt
  * mapping-tiny-to-page.txt

The most useful of these is the first as it includes all the necessary
information provided by the arbitrary mapping from identifiers to page names.
The second mapping merely converts the "tiny URLs" to identifiers, which can
be done by applying an algorithm without any external knowledge of the wiki
structure. The third mapping is provided as a convenience, combining the "tiny
URL" conversion and the arbitrary mapping to page names.
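
The "tiny URL" algorithm is not spelled out in this document. The following
sketch assumes the scheme commonly described for Confluence: the identifier is
packed as little-endian bytes, base64-encoded, stripped of padding and of
trailing "A" characters (which encode zero bits), with "/" replaced by "-" and
"+" replaced by "_". The function names are illustrative, and the code used by
tools/mappings.sh may differ.

```python
import base64
import struct


def id_to_tiny(page_id):
    """Encode a numeric identifier as a tiny URL code (assumed scheme)."""
    encoded = base64.b64encode(struct.pack("<Q", page_id)).decode("ascii")
    # Drop the padding, then the trailing characters that encode only zeros.
    encoded = encoded.rstrip("=").rstrip("A")
    return encoded.replace("/", "-").replace("+", "_")


def tiny_to_id(tiny):
    """Decode a tiny URL code back to the numeric identifier."""
    encoded = tiny.replace("-", "/").replace("_", "+")
    # Restore the stripped zero characters: 8 bytes need 11 characters
    # followed by a single "=" of padding.
    encoded = encoded.ljust(11, "A") + "="
    return struct.unpack("<Q", base64.b64decode(encoded))[0]


if __name__ == "__main__":
    code = id_to_tiny(98319)
    print(code, tiny_to_id(code))
```

Packing into eight bytes rather than four gives the same codes for small
identifiers, since the extra zero bytes are stripped again, while still
accepting identifiers too large for 32 bits.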

Translating Requests Using the Mappings
---------------------------------------

Where Web server facilities such as RewriteMap are available for use, the
first and third mapping files can be used directly. See the Apache
documentation for details of RewriteMap:

http://httpd.apache.org/docs/2.4/rewrite/rewritemap.html

Otherwise, it is more likely that the first file is used by a program that can
perform a redirect to the appropriate wiki page, and the "tiny URL" decoding
is also done by this program when deployed in a suitable location to receive
such requests. To support this, the following resources are provided:

  * scripts/redirect.py
  * config/mailmanwiki-redirect

The latter configuration file should be combined with the Web server
configuration file such that the appropriate aliases are able to capture
requests and invoke the redirect.py script before the main wiki aliases are
consulted. The script itself should be placed in a suitable filesystem
location, and the mapping-id-to-page.txt file should be placed alongside it,
or it should be placed in a different location and the MAPPING_ID_TO_PAGE
variable changed in the script to refer to this different location.
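
The lookup performed by such a program can be sketched as follows. Since the
mapping files are usable with RewriteMap, the sketch assumes the plain text
map format: one key and one value per line separated by whitespace, with "#"
starting a comment line. The function names and the wiki_base parameter are
hypothetical, and this is not the code of scripts/redirect.py.

```python
def load_mapping(lines):
    """Build a dictionary from lines of the form 'identifier page-name'."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping


def redirect_target(mapping, page_id, wiki_base="/"):
    """Return the URL path to redirect to, or None for unknown identifiers."""
    page = mapping.get(page_id)
    if page is None:
        return None
    return wiki_base.rstrip("/") + "/" + page


if __name__ == "__main__":
    mapping = load_mapping(["# identifier page", "123 COM/Home"])
    print(redirect_target(mapping, "123", "/wiki/"))
```

A deployed script would read mapping-id-to-page.txt, take the identifier from
the request and emit the result in a Location header, returning a "not found"
response when no mapping exists.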

Supporting Confluence Action URLs
---------------------------------

Besides the "viewpage" action mapping identifiers to pages (covered by the
mapping described above), some other action URLs may be used in wiki content
and must either be translated or supported using redirects. Since external
sites may also employ such actions, a redirect strategy perhaps makes more
sense. To support this, the following resources are involved:

  * scripts/dashboard.py
  * scripts/redirect.py
  * scripts/search.py
  * config/mailmanwiki-redirect

The last of these, the configuration file, is also involved in
identifier-to-page mapping, but in this case it causes requests to the
"dashboard", "doexportpage" and "dosearchsite" actions to be directed to the
dashboard.py, redirect.py and search.py scripts respectively.

The dashboard.py script merely redirects requests to the root of the site,
thus assuming that the front page is configured to show dashboard-like
information.

The redirect.py script, apart from supporting identifier-to-page redirects,
also supports attachment downloads and PDF page exports, since both kinds of
resource employ identifiers to indicate which page is involved. In an
environment that uses .htaccess and mod_rewrite, the redirect.py script should
also be deployed under separate names (such as export.py and exportpdf.py) so
that it can discover whether it should be exporting a page instead of just
showing it.

The search.py script redirects search requests in a suitable form to the
MoinMoin "fullsearch" action.
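
The kind of translation involved can be sketched as follows. The sketch
assumes that Confluence search requests carry their terms in a parameter
called queryString and that the MoinMoin "fullsearch" action reads them from a
parameter called value; both names are assumptions to be checked against the
sites involved, and this is not the code of scripts/search.py.

```python
from urllib.parse import parse_qs, urlencode


def fullsearch_url(query_string, wiki_base="/"):
    """Translate a Confluence search query string into a MoinMoin URL."""
    params = parse_qs(query_string)
    # Assumed Confluence parameter name; an absent value becomes empty.
    terms = params.get("queryString", [""])[0]
    query = urlencode([("action", "fullsearch"), ("value", terms)])
    return wiki_base + "?" + query


if __name__ == "__main__":
    print(fullsearch_url("queryString=mailing+list"))
```

The sketch is written for Python 3; under Python 2 the same functions are
found in the urlparse and urllib modules.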

Identifying and Migrating Users
-------------------------------

Confluence export archives do not contain user profile information, but page
versions are marked with user identifiers. Therefore, a list of user
identifiers can be obtained by running a script extracting these identifiers.
The following command writes to standard output the users involved with
editing the wiki in four different spaces (exported to four directories):

tools/users.sh COM DEV DOC SEC

This output can be edited and then passed to a program which fetches other
profile details as follows:

tools/users.sh COM DEV DOC SEC > users.txt

After editing...

  cat users.txt \
| tools/get_profiles.py http://wiki.list.org/ \
> profiles.txt

If no users are to be removed in migration, the following command could be
issued:

  tools/users.sh COM DEV DOC SEC \
| tools/get_profiles.py http://wiki.list.org/ \
> profiles.txt

The get_profiles.py program needs to be told the URL of the original
Confluence site. Note that it accesses the site at a default rate of around
one request per second; a different delay between requests can be specified
using an additional argument.
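
The effect of the delay can be illustrated with a short sketch of throttled
fetching. The fetch_throttled function and its parameters are hypothetical and
do not reproduce get_profiles.py; the fetch argument stands for whatever
function retrieves one profile from the site.

```python
import time


def fetch_throttled(usernames, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(name) for each name, pausing 'delay' seconds in between."""
    results = []
    for index, name in enumerate(usernames):
        if index:
            # Pause between requests, but not before the first one.
            sleep(delay)
        results.append((name, fetch(name)))
    return results


if __name__ == "__main__":
    # A stand-in fetch function; a real one would request a profile page.
    print(fetch_throttled(["alice", "bob"], lambda name: name.upper(), delay=0.1))
```

With a few hundred users and the default rate, fetching all profiles takes a
few minutes, which is worth bearing in mind before reducing the delay on a
site that others are still using.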

(!) The above steps can be performed from any location, but the command
pipelines below need to be run on the server due to the use of a program that
updates the deployed wiki.

The output of the get_profiles.py program can be passed to another program
which adds users to MoinMoin, and so the following commands can be used:

  cat profiles.txt \
| tools/addusers.py wiki

Alternatively, the users can be converted to profiles and immediately added
without creating a profiles file:

  cat users.txt \
| tools/get_profiles.py http://wiki.list.org/ \
| tools/addusers.py wiki

Or, using a single command without inspecting the users or profiles at all:

  tools/users.sh COM DEV DOC SEC \
| tools/get_profiles.py http://wiki.list.org/ \
| tools/addusers.py wiki

The addusers.py program needs to be told the directory containing the wiki
configuration.

Output Structure
----------------

The structure of a converted workspace is a directory hierarchy containing the
following directories:

  * pages     (a collection of directories defining each page or content item,
               corresponding to Page, Comment and BlogPost elements in the XML
               exported from Confluence)

  * versions  (a collection of files, each defining a revision or version of
               some content, corresponding to BodyContent elements in the XML
               exported from Confluence)

Each page directory contains the following things:

  * pagetype    (either "Page", "Comment" or "BlogPost")

  * manifest    (a list of version entries in a format similar to the MoinMoin
                 page package manifest format)

  * attachments (a list of attachment version entries in a format similar to
                 the MoinMoin page package manifest format)

  * pagetitle   (an optional page title imposed on the page by another content
                 item)

  * children    (a list of child page names defined for the page)

  * comments    (a list of creation date plus comment page identifier pairs)

In the output structure, content items such as comments are represented as
pages, and each references a content version. Since comments will ultimately
be represented as subpages of some parent page, they will have a pagetitle
file in their directory with an appropriate subpage name written according to
the parent page's name and comment details.
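
As a quick check on a conversion, the structure above can be inspected with a
few lines of code. The following sketch counts the content items of each type
in a converted workspace; it assumes that each pagetype entry is a small text
file holding one of the three names, and count_page_types is a hypothetical
helper rather than part of this distribution.

```python
import os
from collections import Counter


def count_page_types(workspace):
    """Count the content items of each type in a converted workspace."""
    pages_dir = os.path.join(workspace, "pages")
    counts = Counter()
    for name in sorted(os.listdir(pages_dir)):
        path = os.path.join(pages_dir, name, "pagetype")
        # Skip anything that lacks a pagetype file.
        if os.path.isfile(path):
            with open(path) as f:
                counts[f.read().strip()] += 1
    return dict(counts)


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(count_page_types(sys.argv[1]))
```

Given the name of a workspace directory such as COM, the script prints the
number of items found for each page type.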

Troubleshooting
---------------

The page package import activity in particular can be a source of problems.
Generally, any error occurring when attempting to import a package is likely
to be due to insufficient privileges when writing to the pages directory of a
wiki or to its edit-log file.

The moinsetup software can generate scripts that set the ownership of wiki
files or apply ACLs (access control lists) to those files in order to make
access to wiki data more convenient. Where the ownership of the files must be
set (to www-data or nobody), the import step can be run as that user given
sufficient privileges. However, the easiest solution is to apply ACLs, thus
allowing the user who created the wiki to retain write access to it.
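
Whether the current user can write to the relevant locations can be tested
before attempting an import. The following sketch assumes that the pages
directory and the edit-log file both reside directly inside the wiki's data
directory; unwritable_paths is a hypothetical helper and the path shown is
only an example.

```python
import os


def unwritable_paths(data_dir):
    """Return the import-related paths that the current user cannot write to."""
    targets = [
        os.path.join(data_dir, "pages"),
        os.path.join(data_dir, "edit-log"),
    ]
    # Paths that do not exist are also reported as unwritable.
    return [path for path in targets if not os.access(path, os.W_OK)]


if __name__ == "__main__":
    # Example location only: substitute the data directory of the wiki.
    for path in unwritable_paths("/srv/wiki/data"):
        print("Not writable: %s" % path)
```

An empty result suggests that permissions are not the cause of an import
failure, at least for these two locations.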



Contact, Copyright and Licence Information
==========================================

The current Web page for ConfluenceConverter at the time of release is:

http://hgweb.boddie.org.uk/ConfluenceConverter

Copyright and licence information can be found in the docs directory - see
docs/COPYING.txt and docs/LICENCE.txt for more information.



Resources
=========

"Confluence Data Model"
https://confluence.atlassian.com/doc/confluence-data-model-127369837.html

"Confluence Storage Format"
https://confluence.atlassian.com/doc/confluence-storage-format-790796544.html

"Confluence Wiki Markup"
https://confluence.atlassian.com/doc/confluence-wiki-markup-251003035.html

"Macros"
https://confluence.atlassian.com/doc/macros-139387.html
   424 "Macros"
   425 https://confluence.atlassian.com/doc/macros-139387.html