ConfluenceConverter (file README.txt at c3d772d8cbad)

Introduction
------------

ConfluenceConverter is a distribution of software that converts exported data
from Confluence wiki instances, provided in the form of an XML file, to a
collection of wiki pages and resources that can be imported into a MoinMoin
instance as a page package.

Prerequisites
-------------

ConfluenceConverter requires a library called xmlread that can be found at the
following location:

http://hgweb.boddie.org.uk/xmlread

The xmlread.py file from the xmlread distribution can be copied into the
ConfluenceConverter directory.

ConfluenceConverter also requires access to the MoinMoin.wikiutil module found
in the MoinMoin distribution.

The moinsetup program is highly recommended for the installation of page
packages and the management of MoinMoin wiki instances:

http://moinmo.in/ScriptMarket/moinsetup

If moinsetup is not being used, the page package installer documentation
should be consulted:

http://moinmo.in/HelpOnPackageInstaller

To read Confluence user profiles on live Confluence sites using the
get_profiles.py program, the libxml2dom library is required:

http://hgweb.boddie.org.uk/libxml2dom

MoinMoin Prerequisites
----------------------

The page package installer does not preserve user information or the last
modified time when installing page revisions. This behaviour can be changed by
applying a patch to MoinMoin as follows while at the top level of the MoinMoin
source distribution:

patch -p1 < $CCDIR/patches/patch-moin-1.9-MoinMoin-packages.diff

Here, CCDIR is the path to the top level of this source distribution where
this README.txt file is found.

Wiki Content Prerequisites
--------------------------

For the output of the converter, the following MoinMoin extensions are
required:

http://moinmo.in/ParserMarket/ImprovedTableParser
http://hgweb.boddie.org.uk/MoinSupport
http://moinmo.in/MacroMarket/Color2

In addition, extensions are provided in this distribution to support various
Confluence features, notably comments on pages. These extensions are installed
as follows:

python moinsetup.py -m install_actions $CCDIR/actions
python moinsetup.py -m install_macros $CCDIR/macros
python moinsetup.py -m install_theme_resources $CCDIR
python moinsetup.py -m edit_theme_stylesheet screen.css includecomments.css
python moinsetup.py -m edit_theme_stylesheet print.css includecomments.css

Additional Software
-------------------

PDF export support requires the ExportPDF action:

http://moinmo.in/ActionMarket/ExportPDF

This in turn requires Apache FOP for PDF production using XSL-FO:

http://xmlgraphics.apache.org/fop/

(On Debian systems, the fop package provides this tool.)

To produce XSL-FO from DocBook output, xsltproc is required from the libxslt
distribution:

http://xmlsoft.org/XSLT/

(On Debian systems, the xsltproc package provides this tool.)

Finally, DocBook output requires the DocBook resources to be installed, as
described in the following guide:

http://www.sagehill.net/docbookxsl/ToolsSetup.html

(On Debian systems, the docbook-xsl package provides these resources.)

Quick Start
-----------

Given an XML export archive file for a Confluence wiki instance (in the
example below, the file is called COM-123456-789012.zip), the following
command can be used to prepare a page package for MoinMoin:

python convert.py COM-123456-789012.zip COM

In addition to the filename, a workspace name is required. Confluence appears
to require a workspace as a container for collections of pages, but this also
permits us to selectively import parts of a wiki into MoinMoin. If attachments
were included in the export from Confluence, these will be imported into the
page package.

The result of the above command will be a directory having the same name as
the chosen workspace, together with a zip archive for that directory's
contents. Thus, the above command would produce a directory called COM and an
archive called COM.zip.

To import the result, use moinsetup as follows:

python moinsetup.py -m install_page_package COM.zip

This requires a suitable moinsetup.cfg file in the working directory.

Importing Many Workspaces
-------------------------

Where more than one workspace is to be imported, the page packages should be
merged so that the resulting history information is ordered correctly.

To merge packages, use a command of the following form:

python merge.py OUT COM.zip DEV.zip DOC.zip SEC.zip

A directory called OUT and a page package called OUT.zip will be produced. The
latter can then be imported into MoinMoin as described above.
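
The convert, merge and import steps can be strung together in a small driver
script. What follows is only a sketch, written for Python 3 for illustration:
the build_commands and run functions are hypothetical names and are not part
of this distribution, whereas the command lines themselves are the ones shown
in this document. The archive filenames passed in are example data.

```python
import subprocess

def build_commands(exports, merged="OUT"):
    """Return the command lines for converting, merging and importing.

    exports maps workspace names to export archive filenames (example
    data supplied by the caller); merged names the merged page package.
    """
    if not exports:
        raise ValueError("at least one workspace is required")
    workspaces = sorted(exports)
    # One conversion per workspace: python convert.py ARCHIVE WORKSPACE
    commands = [["python", "convert.py", exports[w], w] for w in workspaces]
    packages = [w + ".zip" for w in workspaces]
    if len(packages) > 1:
        # Merge the packages so that history information is ordered correctly.
        commands.append(["python", "merge.py", merged] + packages)
        package = merged + ".zip"
    else:
        package = packages[0]
    # Import the single resulting package using moinsetup.
    commands.append(
        ["python", "moinsetup.py", "-m", "install_page_package", package])
    return commands

def run(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.check_call(command)
```

Building the list separately from running it makes it possible to print and
inspect the commands before anything is converted or imported.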

Mappings from Identifiers to Pages
----------------------------------

Confluence uses numbers to label content revisions, and links to Confluence
sites sometimes use these numbers instead of a readable page name. MoinMoin,
meanwhile, only uses page names and has no external numeric identifier scheme.
Consequently, it is necessary to produce a mapping from Confluence identifiers
to MoinMoin page names. In addition to numeric identifiers, Confluence also
provides "tiny URLs" which are an alphanumeric encoding of the numeric
identifiers.

To generate mappings for the Confluence content, use the mappings script as
follows:

tools/mappings.sh COM

Here, COM is the name of a directory containing converted Confluence content,
corresponding to a space name in the original Confluence wiki. More than one
space name can be given in order to generate a complete mapping for a site.

The following files are generated:

  * mapping-id-to-page.txt
  * mapping-tiny-to-id.txt
  * mapping-tiny-to-page.txt

The most useful of these is the first as it includes all the necessary
information provided by the arbitrary mapping from identifiers to page names.
The second mapping merely converts the "tiny URLs" to identifiers, which can
be done by applying an algorithm without any external knowledge of the wiki
structure. The third mapping is provided as a convenience, combining the "tiny
URL" conversion and the arbitrary mapping to page names.
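
To illustrate why the "tiny URL" conversion needs no knowledge of the wiki,
here is a sketch of the encoding as it is commonly described for Confluence:
the identifier is written as little-endian bytes, Base64-encoded, stripped of
trailing "A" and "=" characters, with "/" and "+" replaced by "-" and "_". It
is an assumption that the mappings script follows exactly this scheme, and the
function names are hypothetical. The code is written for Python 3.

```python
import base64
import struct

def id_to_tiny(identifier):
    """Encode a numeric identifier as a "tiny URL" code."""
    # Eight little-endian bytes give 11 Base64 characters plus one "=".
    text = base64.b64encode(struct.pack("<Q", identifier)).decode("ascii")
    # Trailing "A" characters only encode zero bits, so they are dropped
    # along with the padding; "/" and "+" are made URL-friendly.
    return text.rstrip("A=").replace("/", "-").replace("+", "_")

def tiny_to_id(tiny):
    """Decode a "tiny URL" code into a numeric identifier."""
    # Undo the URL-friendly substitutions.
    text = tiny.replace("-", "/").replace("_", "+")
    # Restore the dropped zero bits and the padding before decoding.
    text = text.ljust(11, "A") + "="
    return struct.unpack("<Q", base64.b64decode(text))[0]
```

Under these assumptions, identifier 1 is encoded as "AQ", and decoding "AQ"
gives 1 again.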

Translating Requests Using the Mappings
---------------------------------------

Where Web server facilities such as RewriteMap are available for use, the
first and third mapping files can be used directly. See the Apache
documentation for details of RewriteMap:

http://httpd.apache.org/docs/2.4/rewrite/rewritemap.html

Otherwise, it is more likely that the first file is used by a program that can
perform a redirect to the appropriate wiki page, and the "tiny URL" decoding
is also done by this program when deployed in a suitable location to receive
such requests. To support this, the following resources are provided:

  * scripts/redirect.py
  * config/mailmanwiki-redirect

The configuration file should be combined with the Web server configuration
file such that the appropriate aliases are able to capture requests and invoke
the redirect.py script before the main wiki aliases are consulted. The script
itself should be placed in a suitable filesystem location. The
mapping-id-to-page.txt file should either be placed alongside the script, or
be placed in a different location with the MAPPING_ID_TO_PAGE variable in the
script changed to refer to that location.
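
The lookup performed by such a program can be sketched as follows. This is not
the code of redirect.py: the function names are hypothetical, the mapping file
is assumed to hold one whitespace-separated "identifier page-name" pair per
line (the layout of a RewriteMap text file), and the request is assumed to
carry the identifier in a pageId parameter. The page names in any usage are
example data. The code is written for Python 3.

```python
from urllib.parse import parse_qs, urlsplit

def load_mapping(lines):
    """Read "key value" lines, as in a RewriteMap text file, into a dict."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping

def redirect_target(url, mapping, wiki_root="/"):
    """Return the wiki URL for an identifier-based request, or None."""
    query = parse_qs(urlsplit(url).query)
    identifiers = query.get("pageId")  # assumed parameter name
    if not identifiers:
        return None
    page = mapping.get(identifiers[0])
    if page is None:
        return None  # unknown identifier: the caller decides what to do
    return wiki_root.rstrip("/") + "/" + page
```

A caller would load the mapping once, for instance with
load_mapping(open("mapping-id-to-page.txt")), and then issue a redirect to the
returned location for each request.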

Supporting Confluence Action URLs
---------------------------------

Besides the "viewpage" action mapping identifiers to pages (covered by the
mapping described above), some other action URLs may be used in wiki content
and must either be translated or supported using redirects. Since external
sites may also employ such actions, a redirect strategy perhaps makes more
sense. To support this, the following resources are involved:

  * scripts/dashboard.py
  * scripts/redirect.py
  * scripts/search.py
  * config/mailmanwiki-redirect

The configuration file is also involved in identifier-to-page mapping, but in
this case it causes requests to the "dashboard", "doexportpage" and
"dosearchsite" actions to be directed to the dashboard.py, redirect.py and
search.py scripts respectively.

The dashboard.py script merely redirects requests to the root of the site,
thus assuming that the front page is configured to show dashboard-like
information.

The redirect.py script, apart from supporting identifier-to-page redirects,
also supports PDF page exports since the "doexportpage" action uses
identifiers to indicate which page is to be exported.

The search.py script redirects search requests in a suitable form to the
MoinMoin "fullsearch" action.
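
The routing of actions to scripts can be pictured as a simple lookup table. In
a real deployment this job is done by the Web server using the supplied
configuration file, so the following Python 3 sketch is purely illustrative.
The handler_for function is a hypothetical name, and it is an assumption that
action URLs end in a path component of the form NAME.action.

```python
import posixpath
from urllib.parse import urlsplit

# Action names and scripts as listed above; "viewpage" is included because
# redirect.py also performs the identifier-to-page redirects.
HANDLERS = {
    "dashboard": "scripts/dashboard.py",
    "doexportpage": "scripts/redirect.py",
    "dosearchsite": "scripts/search.py",
    "viewpage": "scripts/redirect.py",
}

def handler_for(url):
    """Return the script responsible for a Confluence action URL, or None."""
    name = posixpath.basename(urlsplit(url).path)
    suffix = ".action"  # assumed form of action URLs
    if not name.endswith(suffix):
        return None
    return HANDLERS.get(name[:-len(suffix)])
```

Requests that match no entry return None and would be left for the main wiki
aliases to handle.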

Identifying and Migrating Users
-------------------------------

Confluence export archives do not contain user profile information, but page
versions are marked with user identifiers. Therefore, a list of user
identifiers can be obtained by running a script that extracts these
identifiers. The following command writes to standard output the users
involved with editing the wiki in four different spaces (exported to four
directories):

tools/users.sh COM DEV DOC SEC

This output can be edited and then passed to a program which fetches other
profile details as follows:

tools/users.sh COM DEV DOC SEC > users.txt # for editing
cat users.txt | tools/get_profiles.py http://wiki.list.org/

If no users are to be removed in migration, the following command could be
issued:

tools/users.sh COM DEV DOC SEC | tools/get_profiles.py http://wiki.list.org/

The get_profiles.py program needs to be told the URL of the original
Confluence site. Note that it accesses the site at a default rate of around
one request per second; a different delay between requests can be specified
using an additional argument.

The output of the get_profiles.py program can be passed to another program
which adds users to MoinMoin, and so the following commands can be used:

  cat users.txt \
| tools/get_profiles.py http://wiki.list.org/ \
| tools/addusers.py wiki

Or, as a single command without the intermediate file:

  tools/users.sh COM DEV DOC SEC \
| tools/get_profiles.py http://wiki.list.org/ \
| tools/addusers.py wiki

The addusers.py program needs to be told the directory containing the wiki
configuration.

Output Structure
----------------

The structure of a converted workspace is a directory hierarchy containing the
following directories:

  * pages     (a collection of directories defining each page or content item,
               corresponding to Page, Comment and BlogPost elements in the XML
               exported from Confluence)

  * versions  (a collection of files, each defining a revision or version of
               some content, corresponding to BodyContent elements in the XML
               exported from Confluence)

Each page directory contains the following things:

  * pagetype    (either "Page", "Comment" or "BlogPost")

  * manifest    (a list of version entries in a format similar to the MoinMoin
                 page package manifest format)

  * attachments (a list of attachment version entries in a format similar to
                 the MoinMoin page package manifest format)

  * pagetitle   (an optional page title imposed on the page by another content
                 item)

  * children    (a list of child page names defined for the page)

  * comments    (a list of creation date plus comment page identifier pairs)

In the output structure, content items such as comments are represented as
pages, and each references a content version. Since comments will ultimately
be represented as subpages of some parent page, they will have a pagetitle
file in their directory with an appropriate subpage name written according to
the parent page's name and comment details.
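
As a way of inspecting this structure, the following Python 3 sketch counts
the content items of each type in a converted workspace. It relies only on the
layout described above, plus the assumption that each pagetype file holds just
the type name, possibly followed by whitespace. The count_page_types function
is a hypothetical name and not part of this distribution.

```python
import os
from collections import Counter

def count_page_types(workspace_dir):
    """Count content items by type in a converted workspace directory."""
    pages_dir = os.path.join(workspace_dir, "pages")
    counts = Counter()
    for name in sorted(os.listdir(pages_dir)):
        pagetype_file = os.path.join(pages_dir, name, "pagetype")
        if not os.path.isfile(pagetype_file):
            continue  # not a page directory, or no type recorded
        with open(pagetype_file) as f:
            counts[f.read().strip()] += 1
    return counts
```

Calling count_page_types("COM") after a conversion gives a quick indication of
how many pages, comments and blog posts were found in the export.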

Troubleshooting
---------------

The page package import activity in particular can be a source of problems.
Generally, any error occurring when attempting to import a package is likely
to be due to insufficient privileges when writing to the pages directory of a
wiki or to its edit-log file.

The moinsetup software can generate scripts that set the ownership of wiki
files or apply ACLs (access control lists) to those files in order to make
access to wiki data more convenient. Where the ownership of the files must be
set (to www-data or nobody), the import step can be run as that user given
sufficient privileges. However, the easiest solution is to apply ACLs, thus
allowing the user who created the wiki to retain write access to it.
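
Before importing, it can be worth checking whether the current user is able to
write to the locations mentioned above. The following Python 3 sketch does
this using os.access. It assumes that the pages directory and the edit-log
file sit directly inside the wiki's data directory, and the unwritable_paths
function is a hypothetical name.

```python
import os

def unwritable_paths(data_dir):
    """Return the wiki data locations the current user cannot write to."""
    candidates = [
        os.path.join(data_dir, "pages"),     # page storage
        os.path.join(data_dir, "edit-log"),  # the wiki's edit log
    ]
    # os.access also returns False for paths that do not exist, so a
    # wrongly specified data directory is reported in the same way.
    return [path for path in candidates if not os.access(path, os.W_OK)]
```

An empty result suggests that the import can be attempted as the current user;
otherwise, the listed paths are the ones needing a change of ownership or ACLs.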

Contact, Copyright and Licence Information
------------------------------------------

The current Web page for ConfluenceConverter at the time of release is:

http://hgweb.boddie.org.uk/ConfluenceConverter

Copyright and licence information can be found in the docs directory - see
docs/COPYING.txt and docs/LICENCE.txt for more information.
   328 docs/COPYING.txt and docs/LICENCE.txt for more information.