More Rob Ford tweets on a map

Another example of how global the Rob Ford scandal has become, mapped from harvested tweets with geographic coordinates. This harvest covers #rofo, #robford, #topoli, and #ShirtlessHorde.

The harvest took place on July 6, 2014, and should cover the discussion around Rob Ford's return, from June 30, 2014 to July 6, 2014. The tweets with available geo-information represent less than 10% of all tweets harvested. If you would like the raw tweet data (not the GeoJSON - you can grab that if you view the source), you can get it from here. If you would like to see all the tweets harvested, you can view them here. (Warning! This might blow up your browser. There is a fair bit of data here.)

Tweets were harvested with Ed Summers' twarc.

#rofo OR #robford OR #topoli OR #ShirtlessHorde
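For anyone who wants to reproduce a harvest like this, here is a rough sketch of the harvest-and-filter step. It assumes a recent release of twarc (the command-line flags have changed across versions, so treat this as illustrative rather than the exact invocation used for this post) and jq; the filenames are placeholders.

# Harvest tweets matching the hashtags (twarc needs Twitter API credentials configured first).
twarc search "#rofo OR #robford OR #topoli OR #ShirtlessHorde" > robford-tweets.json

# Keep only the tweets that carry coordinates, and emit one GeoJSON Feature per line.
jq -c 'select(.coordinates != null) | {type: "Feature", geometry: .coordinates, properties: {user: .user.screen_name, text: .text}}' robford-tweets.json > robford-points.geojson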

IIPC Curator Tools Fair: Islandora Web ARChive solution pack

The following is the text for a video that I was asked to record for the 2014 International Internet Preservation Consortium General Assembly Curator Tools Fair, on the Islandora Web ARChive solution pack.


My name is Nick Ruest. I am a librarian at York University, in Toronto, Ontario. I’m going to give a quick presentation on the Islandora Web ARChive solution pack. I only have a few minutes, so I’ll quickly cover what the module does, what areas of the web archiving life cycle it covers, and provide a quick demonstration.

So, what is the Islandora Web ARChive solution pack?

I’ll step back and quickly answer what Islandora is first. "Islandora is an open source digital asset management system based on Fedora Commons, Drupal and a host of additional applications." A solution pack, in Islandora parlance, is a Drupal module that integrates with the main Islandora module and the Tuque library, thereby allowing users to deposit, create derivatives for, and interact with a given type of object. We have solution packs for Audio, Video, Large Images, Images, PDFs, paged content, and now web archives.

The Web ARChive solution pack allows users to ingest and retrieve web archives through the Islandora interface. If we think about it in terms of OAIS, we give the repository a SIP, which in the case of this solution pack can be a single warc file and some descriptive metadata, and, if available, a screenshot and/or a PDF. From there, the solution pack will create an AIP and DIP.

The AIP will contain: the original warc, MODS descriptive metadata, FITS output (file characterization/technical metadata), web dissemination versions of the screenshots (jpg & thumbnail), the PDF, and derivatives of the warc created via warctools. Those derivatives are a csv and a filtered warc. The csv -- WARC_CSV -- is a listing of all the files in a given warc, which lets a user/researcher take a quick glance at the contents of the warc. The filtered warc -- WARC_FILTERED -- is a warc file stripped down as much as possible to the text, and it is used only for search indexing/keyword searching.

The DIP is a JPG/TN of the captured website (if supplied) and download links to the WARC, PDF, WARC_CSV, screenshot, and descriptive metadata. Here, a link to the ‘archived site’ can be supplied in the default MODS form. The suggested usage is to provide a link to the object in a local instance of Wayback, if it exists.

I’ve also been asked to address the following questions:

1) What aspects of the web archiving life cycle model does the tool cover? What aspects of the model would you like to/do intend to build into the tool? What functionality does the tool provide that isn’t reflected in the model?

I’ll address what it does not cover first: appraisal and selection, scoping, and data capture. We let users bring their own appraisal and selection, scoping, and data capture processes. So, for example, locally we use Heritrix for cron-based crawls, and our own bash script for one-off crawls.
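To give a concrete (and hypothetical -- the script name, path, and schedule below are made up) picture of that local setup: the one-off script is just a bash wrapper around wget's WARC support plus screenshot/PDF generation, and cron fires it once a day.

# Hypothetical crontab entry: run the one-off crawl script every day at 01:30, logging its output.
# The script itself combines wget's WARC output with wkhtmltopdf/wkhtmltoimage screenshots.
30 1 * * * /usr/local/bin/crawl-yfile.sh >> /var/log/crawl-yfile.log 2>&1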

What does it cover? All of the rest of the steps!

  • Storage and organization: via Fedora Commons & Islandora
  • QA and analysis: via display/DIP -- visualization exposes it!
  • Metadata/description: every web archive object has a MODS descriptive datastream
  • Access/use/reuse: each web archive object has a URI, along with its derivatives. By default warcs are available to download.
  • Preservation: preservation depends on the policies of the repository/institution, but in our case we have a preservation action plan for web archives, and a suite of Islandora preservation modules running (checksum, checksum checker, FITS, and PREMIS) that cover the basics.
  • Risk management: see above.

2) What resources are committed to the tool’s ongoing development? What are major features in the roadmap? Is the code open source?

I developed the original version, and transferred it to the Islandora Foundation, allowing for community stewardship of the project.

Currently, there is no official roadmap for the project. If anybody has ideas, comments, suggestions, or roadmap-ish thoughts, feel free to send a message to the Islandora mailing list.

...and yes, the code is totally open source. It is available under a GPLv3 license, and the canonical version of the code can be found under the Islandora organization on GitHub.

3) What is the user base for the tool? How environment-specific is the tool as opposed to readily reusable by other organizations?

Not entirely sure. It was recently released as part of the 7.x-1.3 version of Islandora.

Given that it is an Islandora module, it is tied to Islandora. So, you’ll have to have at least a 7.x-1.3 instance of Islandora running, along with the solution pack’s dependencies.
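For those who want to try it out, here is a hedged installation sketch. The repository and module machine names below are my assumptions -- check the project's README and .info file under the Islandora GitHub organization for the canonical names and the full dependency list.

# Fetch the solution pack into your Drupal modules directory and enable it with Drush.
cd /var/www/drupal/sites/all/modules
git clone https://github.com/Islandora/islandora_solution_pack_web_archive.git
# The machine name passed to Drush is an assumption; confirm it against the module's .info file.
drush -y pm-enable islandora_solution_pack_web_archive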

4) What are the tool’s unique features? What are its shortcomings?

I think some unique features are that it is a part of a digital asset management system (it is the first of its kind that I am aware of), and the utilization of warctools for keyword searching and file inventories.

Shortcomings? That it is a part of a digital asset management system.

Very quick demo time!

Rob Ford tweets on a map

Examples of how global the Rob Ford scandal has become, mapped from harvested tweets with geographic coordinates.

If you would like the raw tweet data (not the GeoJSON - you can grab that if you view the source), you can get it from here and here. Tweets were harvested with Ed Summers' twarc.

robford OR rob ford OR rofo OR topoli OR toronto OR FordNation May 3, 2014

TOpoli OR Toronto OR RobFord November 25, 2013

TOpoli OR Toronto OR RobFord November 14, 2013

Digital Preservation Tools and Islandora

Incorporating a suite of digital preservation tools into various Islandora workflows has been a long-term goal of mine and of a few other members of the community, and I'm really happy to see that it is now becoming more and more of a priority.

A couple of years ago, I cut my teeth on contributing to Islandora by creating a FITS plugin for the Drupal 6 version of Islandora. Later, this tool was expanded into a standalone module during the restructuring of the Drupal 7 Islandora code base. The Drupal 7 version of Islandora, along with Tuque, has really opened up the door for community contributions over the last year or so. Below is a list and description of Islandora modules with a particular focus on the preservation side of the repository platform.

Islandora Checksum

Islandora Checksum is a module that I developed with Adam Vessey (with special thanks to Jonathan Green and Jordan Dukart, who helped me grok Tuque), that allows repository managers to enable the creation of checksums for all datastreams on objects. If enabled, the repository administrator can choose from the default Fedora Commons checksum algorithms: MD5, SHA-1, SHA-256, SHA-384, SHA-512.

This module is licensed under a GPLv3 license, and is currently going through the Islandora Foundation's Licensed Software Acceptance Procedure. If successful, the module will be part of the next Islandora release.

Islandora Checksum admin

Islandora Checksum Checker

Islandora Checksum Checker is a module by Mark Jordan that extends Islandora Checksum by verifying, "the checksums derived from Islandora object datastreams and adds a PREMIS 'fixity check' entry to the object's audit log for each datastream checked."

This module is also licensed under a GPLv3 license.


Islandora PREMIS

Islandora PREMIS is a module by Mark Jordan, Donald Moses, Paul Pound, and myself. The module produces XML and HTML representations of PREMIS metadata for objects in an Islandora repository on the fly. The module currently documents all fixity checks performed on an object's datastreams, includes configurable 'agent' entries for an institution as well as for the Fedora Commons software, and maps the contents of each object's "rights" elements in the Dublin Core datastream to equivalent PREMIS "rightsExtension" elements. You can view an example here, along with the XML representation.

What we have implemented so far is just the basics, and we are always seeking feedback to make it better. If you're interested in the discussion or would like to provide feedback, feel free to follow along in the Islandora Google Group thread, and the GitHub issue queue for the project.

This module is also licensed under a GPLv3 license.

Islandora PREMIS admin

Islandora BagIt

Islandora BagIt is also a module by Mark Jordan (actually a fork of his Drupal module) that utilizes Scholars' Lab's BagItPHP, allowing repository administrators to create Bags of selected content. It currently provides a wide variety of configuration options for exporting content as Bags, as well as for creating Bags on ingest and/or when objects are modified. The way Mark has structured this module also allows developers to easily extend it by creating additional plugins, as well as providing Drush integration.

This module is also licensed under a GPLv3 license.

Islandora BagIt admin

Islandora Preservation Documentation

Documentation! One of the most important aspects of digital preservation.

This is not a full-blown module yet. What it currently is, is the beginnings of a generic set of documentation that can be used by repository administrators. Eventually we hope to use a combination of Default Content/UUID Features and Features to provide a default bundle of preservation documentation in an Islandora installation.

The content in this GitHub repo comes from the documentation and policies we are creating at York University Library, which is derived from the wonderful documentation created by Scholars Portal during their successful ISO 16363 audit.

Islandora Web ARChive SP updates

Community

Some pretty exciting stuff has been happening lately in the Islandora community. Earlier this year, Islandora began the transformation to a federally incorporated, community-driven soliciting non-profit, making it, in my opinion, a much more sustainable project. Thanks to my organization joining on as a member, I've been provided the opportunity to take part in the Roadmap Committee. Since I joined, we have been hard at work creating transparent policies and processes for software contributions, licenses, and resources. Big thanks to the Hydra community for providing great examples to work from!

I signed my first contributor licence agreement, and initiated the process for making the Web ARChive Solution Pack a canonical Islandora project, subject to the same release management and documentation processes as other Islandora modules. After working through the process, I'm happy to see that the Web ARChive Solution Pack is now a canonical Islandora project.

Project updates

I've been slowly picking off items from my initial todo list for the project, and have solved two big issues: indexing the warcs in Solr for full-text/keyword searching, and creating an index of each warc.

Solr indexing was very problematic at first. I ended up having a lot of trouble getting an XSLT to take the warc datastream and hand it to FedoraGSearch, and in turn to Solr. Frustrated, I began experimenting with newer versions of Solr, which thankfully have Apache Tika bundled, thereby allowing Solr to index basically whatever you throw at it.
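As an aside, this is not how the solution pack is actually wired up (that still goes through FedoraGSearch), but it illustrates what the bundled Tika buys you: Solr's extracting request handler (Solr Cell) will take more or less whatever file you post to it and pull out the text. The URL, document id, and filename here are illustrative; the filtered warc is mostly plain text after warcfilter, which is what makes this workable.

# Post a file to Solr's Tika-backed extracting handler; Tika extracts the text for indexing.
curl "http://localhost:8983/solr/update/extract?literal.id=yul-113521&commit=true" -F "myfile=@yul-113521_OBJ_FILTERED.warc"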

I didn't think our users wanted to be searching the full markup of a warc file. Just the actual text. So, using the Internet Archive's warctools and @tef's wonderful assistance, I was able to incorporate warcfilter into the derivative creation.

$ warcfilter -H text warc_file > filtered_file

You can view an example of the full-text searching of warcs in action here.

In addition to the full-text searching, I wanted to provide users with a quick overview of what is in a given capture, and was able to do so by also incorporating warcindex into the derivative creation.

$ warcindex warc_file > csv_file

#WARC filename offset warc-type warc-subject-uri warc-record-id content-type content-length
/extra/tmp/yul-113521_OBJ.warc 0 warcinfo None <urn:uuid:588604aa-4ade-4e94-b19a-291c6afa905e> application/warc-fields 514
/extra/tmp/yul-113521_OBJ.warc 797 response dns:yfile.news.yorku.ca <urn:uuid:cbeefcb0-dcd1-466e-9c07-5cd45eb84abb> text/dns 61
/extra/tmp/yul-113521_OBJ.warc 1110 response http://yfile.news.yorku.ca/robots.txt <urn:uuid:6a5d84d1-b548-41e4-a504-c9cf9acfcde7> application/http; msgtype=response 902
/extra/tmp/yul-113521_OBJ.warc 2366 request http://yfile.news.yorku.ca/robots.txt <urn:uuid:363da425-594e-4365-94fc-64c4bb24c897> application/http; msgtype=request 257
/extra/tmp/yul-113521_OBJ.warc 2952 metadata http://yfile.news.yorku.ca/robots.txt <urn:uuid:62ed261e-549d-45e8-9868-0da50c1e92c4> application/warc-fields 149

The updated Web ARChive SP datastreams now look like so:

Warc SP datastreams

One of my major goals with this project has been integration with a local running instance of Wayback, and it looks like we are pretty close. This solution might not be the cleanest, but at least it is a start, and hopefully it will get better over time. I've updated the default MODS form for the module so that it better reflects this Library of Congress example. The key item here is the 'url' element with the 'Archived site' attribute.

<location>
  <url displayLabel="Active site">http://yfile.news.yorku.ca/</url>
  <url displayLabel="Archived site">http://digital.library.yorku.ca/wayback/20131226/http://yfile.news.yorku.ca/</url>
</location> 

Wayback accounts for a date in its URL structure ('http://digital.library.yorku.ca/wayback/20131226/http://yfile.news.yorku.ca/'), and we can use that to link a given capture to its dissemination point in Wayback. Using some Islandora Solr magic, I should be able to give that link to a user on a given capture page.
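A minimal sketch of that link construction, following the URL structure above (the date would come from the capture's MODS/Solr date field; the values here are taken from the example):

# Build the 'Archived site' link for a capture from its date and seed URL,
# matching the local Wayback's /wayback/<date>/<url> layout.
CAPTURE_DATE="20131226"
SEED_URL="http://yfile.news.yorku.ca/"
echo "http://digital.library.yorku.ca/wayback/${CAPTURE_DATE}/${SEED_URL}"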

We have automated this in our capture and preserve process: capturing warcs with Heritrix, and creating MODS datastreams and screenshots. This allows us to batch import our crawls quickly and efficiently.

Hopefully in the new year we'll have a much more elegant solution!

The Islandora Web ARChive Solution Pack - Open Repositories 2013

Below is the text and slides of my presentation on the Web ARChive solution pack at Open Repositories 2013.


http://ruebot.net/files/OR2013-0.png

I have a really short amount of time to talk here. So, I am going to focus on the how and why for this solution pack and kinda put it in context of the Web Archiving Life Cycle Model proposed by the Internet Archive earlier this year. Maybe I shouldn't have proposed a 7 minute talk!

http://ruebot.net/files/OR2013-1.png

Context! Almost a year ago, I was in a meeting and was presented with this problem. YFile, a daily university newspaper -- it was previously a paper, now a website -- had been taken over by marketing a while back, and they deleted all their back content. It is an official university publication, so an official university record, and it will eventually end up in the archives, so it will eventually be our problem; the library's problem. Plainly put, we live in a reality where official records are born and disseminated via the Internet. Many institutions have a strategy in place for transferring official university records that are print or tactile to university archives, but not much exists strategy-wise for websites. So, I naively decided to tackle it.

http://ruebot.net/files/OR2013-2.png

I tend to just do things. I don't ask permission. I apologize later if I have to. Like maybe taking down the YFile server during the first few initial crawls. If I make mistakes, that is good, I am learning something! What I am doing isn't new, but then again it kinda is. It is a really weird place. I need to crawl a website every day. The Internet Archive crawler comes around whenever it does. There is no way to give the Internet Archive/Wayback Machine a whole bunch of warc files, and I'm not ready to pay for Archive-It.

http://ruebot.net/files/OR2013-3.jpg

That won't work for me at all when I have some idea how to do it all myself. So, what is the problem? I need to capture and preserve a website every day. I want to provide the best material to a researcher. I want to keep a fine eye on preservation, but not be a digital pack rat, and I need to constantly keep the librarian and archivist in me pleased, which always seems to come down to the item vs. collection debate and which of those gets the most attention.

http://ruebot.net/files/OR2013-4.jpg

How easy is it to grab a website? Pretty damn easy if you're using at least wget 1.14, which has warc support.
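Something like this, for example (the target URL and filename are just illustrative):

# wget 1.14+: mirror a site and write the whole capture to a WARC (plus a CDX index) as it goes.
wget --mirror --page-requisites --warc-file="yfile-$(date +%Y%m%d)" --warc-cdx http://yfile.news.yorku.ca/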

http://ruebot.net/files/OR2013-5.jpg

How many people here know what a warc is? WARC stands for Web ARChive. It is an ISO standard. It is basically a file -- one that can get massive very quickly -- that aggregates the raw resources you request into a single file, along with crawl metadata and checksums. PROVENANCE!

This is what the beginning of a warc file looks like.

http://ruebot.net/files/OR2013-6.jpg

This is what the beginning of a warc file looks like.

http://ruebot.net/files/OR2013-7.jpg

And here is a selection from the actual archive portion. That is my brief crash course on warc. We can talk about it more later if you have questions. I need to keep moving along.

http://ruebot.net/files/OR2013-8.jpg

So, warcs are a little weird to deal with on their own. You can disseminate them with the Wayback Machine, and I assume nobody but a few people on this planet wants to see a page full of just warc files. Building something browsable takes a little bit more work. So, I decided to snag a PDF and a screenshot of the front page of the site that I am grabbing, with wkhtmltopdf and wkhtmltoimage. Then I toss this all in a single bash script, and give it to cron.

http://ruebot.net/files/OR2013-9.jpg

So this is what I have come up with. This is how I capture and preserve a website. The PDF/image + xvfb came from Peter Binkley. X virtual framebuffer is an X11 server that performs all graphical operations in memory, without showing any screen output.
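Roughly, the screenshot/PDF half of that script looks something like the following sketch (not the actual script; URLs and filenames are placeholders). xvfb-run gives wkhtmltopdf and wkhtmltoimage a virtual display, so the whole thing can run headless from cron.

# Render the front page to a PDF and an image inside a virtual framebuffer.
xvfb-run --auto-servernum wkhtmltopdf http://yfile.news.yorku.ca/ yfile-$(date +%Y%m%d).pdf
xvfb-run --auto-servernum wkhtmltoimage http://yfile.news.yorku.ca/ yfile-$(date +%Y%m%d).png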

http://ruebot.net/files/OR2013-10.jpg

I've been running that script on cron since last October. Now what? Like I said before, nobody wants to see a page full of warc files. So, I started working with the tools and platforms that I know. In this case, Drupal, Islandora, and Fedora Commons, and created a solution pack. A solution pack, in Islandora parlance, is a Drupal module that integrates with the main Islandora module and the Tuque API to deposit, create derivatives, and interact with a given type of object. So, we have solution packs for Audio, Video, Large Images, Images, PDFs, and paged content.

http://ruebot.net/files/OR2013-11.jpg

What does it do? It adds all the required Fedora objects to allow users to ingest, create derivatives for, and retrieve web archives through the Islandora interface. So we have Content Models, Data Stream Composite Models, forms, and collection policies. The current iteration of the module allows one to batch ingest a bunch of objects for a given collection, and it will create all of the derivatives (thumbnail and display image), and index any provided descriptive metadata in Solr, as well as the actual WARC file since it is mostly text. The WARC indexing is still pretty experimental; it works, but I don't know how useful it is.

http://ruebot.net/files/OR2013-12.jpg

If you want to check out a live demo, and poke around while I am rambling on here, check this site out.

http://ruebot.net/files/OR2013-13.jpg

Collection (in terms of the web archiving life-cycle model). This is an object from the Islandora basic collection solution pack.

http://ruebot.net/files/OR2013-14.jpg

Seed (in terms of the web archiving life-cycle model). This is an object from the Islandora basic collection solution pack.

http://ruebot.net/files/OR2013-15.jpg

Document (in terms of the web archiving life-cycle model). This is an object from the Islandora Web ARChive solution pack.

http://ruebot.net/files/OR2013-16.jpg

Here is what my object looks like. The primary archival object is the WARC file, then we have our associated datastreams: PDF (from the crawl), MODS/DC (descriptive metadata), screenshot (from the crawl), FITS (technical metadata), and thumbnail & medium JPG (derivative display images).

http://ruebot.net/files/OR2013-17.jpg

Todo! What I am still working on when I have time.

http://ruebot.net/files/OR2013-18.jpg

I want to tie in the Internet Archive's Wayback Machine for playback/dissemination of WARCs. I haven't quite wrapped my head around how best to do the Wayback integration, but I am thinking of using the date field value in the MODS record for an individual crawl.

http://ruebot.net/files/OR2013-19.jpg

I'm also thinking of incorporating warctools into this solution pack. This would be for quick summaries and maybe a little analysis. Think of how R is incorporated into Dataverse, if you are familiar with that.

http://ruebot.net/files/OR2013-20.jpg

I am also working on integrating my silly little bash scripts into the solution pack. That way one could do the crawling, dissemination, and preservation in one fell swoop, with a single click when ingesting an object in Islandora.

http://ruebot.net/files/OR2013-21.jpg

Finally, there is a hell of a lot of metadata in each of these warc files begging for something to be done with them. I haven't figured out a way to parse them in an abstract repeatable way, but if I or somebody else does, it will be great!

http://ruebot.net/files/OR2013-22.jpg

http://ruebot.net/files/OR2013-23.jpg

Calling out nonsense - John Degen

This post by John Degen looks like F.U.D., Fear, Uncertainty, and Doubt. If it doesn’t, please tell me why. The thing with F.U.D. is that there are generally misconceptions that lead to false conclusions, and that is what I am seeing in the post by Mr. Degen.

Mr. Degen, I respect the position you are in. Like me, you are standing up for a set of values, ethics, and rights for your profession. This is not a black and white issue. There are grey areas where we overlap, and that is where agreement or conflict can exist. In this case, we have a lot of conflict. But we have some stark lines drawn for us with Bill C-11 and the recent Supreme Court rulings. Simply put, rights around fair dealing and educational use have expanded.

Now, the misconceptions.

Misconception one. I am not the great dread pirate black beard of librarianship. In no way have I, nor the Ontario Library and Information Technology Association, said that creators should not be compensated. Yes, resolution language is ugly Robert’s Rules of Order legalese. That is what it has to be for the setting of an annual general meeting. Do I wish it was plain, simple, beautiful prose? Yes.

WHEREAS there exists model license agreements between Access Copyright and the Association of Universities and Colleges of Canada (AUCC) and between Access Copyright and the Association of Canadian Community Colleges (ACCC), and

WHEREAS there exist agreements between Access Copyright and the University of Toronto and between Access Copyright and the University of Western Ontario, and

WHEREAS the Canadian Association of University Teachers (CAUT), the British Columbia Library Association (BCLA), the Atlantic Provinces Library Association (APLA), the Manitoba Library Association (MLA), the Newfoundland Labrador Library Association (NLLA), the Progressive Librarians’ Guild (PLG) as well as many leading copyright scholars in Canada have taken strong positions against the Access Copyright licenses, and

WHEREAS the addition of “education” to the fair dealing categories, and the broad support for fair dealing in the Supreme Court’s pentalogy rulings of July 2012 provide further support for the position that the Access Copyright license does not provide any additional value to institutions beyond their existing rights, and

WHEREAS the fee structure is inequitable to students on whom the costs are imposed, and

WHEREAS several provisions in the license agreements limit the use of emerging technologies and increase the potential for monitoring and surveillance,

BE IT RESOLVED THAT the Ontario Library and Information Technology Association (OLITA):

  1. Stands opposed to the Access Copyright license agreements as they currently stand, including the AUCC and ACCC Model Licenses and the separate licenses with the University of Toronto and the University of Western Ontario,
  2. Urges Canadian post-secondary institutions not to enter into this licensing agreement,
  3. Encourages those who have already signed to exercise their termination options as soon as possible, and
  4. Recommends that institutions move toward the construction of systems of knowledge creation and sharing based on fair dealing, open access, site licensing as well as transactional licenses where they are needed.

The WHEREAS clauses provide the context, setting, or a lens with respect to the resolution. The resolution, I believe, is fairly explicit. OLITA, “stands opposed to the Access Copyright license agreements as they currently stand...” OLITA did not say, “Access Copyright is Cthulhu. It should be banished from this dimension, and no creator should ever be compensated.” We have an issue with those specific model licenses and agreements. We are not the first to raise this issue. The Canadian Association of University Teachers, the Atlantic Provinces Library Association, the Newfoundland and Labrador Library Association, the Manitoba Library Association, the BC Library Association, the McMaster University Academic Librarians’ Association, the Progressive Librarians Guild Toronto Area Chapter, and many leading copyright scholars in Canada have all spoken out in opposition to these model agreements and licenses. OLITA isn’t even the first association to oppose or condemn the model licenses and agreements by way of a resolution. CAUT did so last spring, as did the BC Library Association and the McMaster University Academic Librarians’ Association, and believe it or not, the Ontario College and University Library Association. Furthermore, OCULA passed the same exact resolution as OLITA one day prior. Neither Mr. Degen nor Access Copyright seemed to notice this at all from what I can tell via recent public communications by both parties.

Misconception two. The “dialogue”. I have tried my best to be as transparent as possible. Mr. Degen and Access Copyright seem to be misrepresenting the narrative (“a strategic attempt to influence perception by disseminating negative and dubious or false information”). Mr. Degen and Access Copyright both refer to the letter I referenced in the previous post, and seemingly lead one to believe that the transmission of that letter is the end of the story. That Access Copyright tried to engage in an open dialogue with myself and OLITA, and both I and OLITA refused a dialogue. As I showed in my previous post, I welcomed participation in the process at the AGM for those Access Copyright board members, directors, employees, etc., who are OLITA members. From what I understand, we do have members of OLITA that are affiliated with Access Copyright, so there was every opportunity to participate. Moreover, the Access Copyright Executive Director followed up to my response saying, "We don't see how this can properly take place at your AGM. Would you consider delaying the motion until we have the opportunity to meet and begin a dialogue?" That, in my opinion, is attempting to circumvent a democratic process. The executive director asked that I pull a resolution. There is no right or standing to ask such a thing. As for the following statement, “We don’t see how this can properly take place at your AGM.” Really? Resolutions are a normal part of AGMs. A member has every right to submit a resolution. If it is moved and seconded, then it moves to the agenda for the meeting. So, yes, it can properly take place. To think otherwise is silly.

Finally, if you want to talk let’s talk. Yesterday wasn’t an example of a constructive dialogue. In fact it got really unconstructive. Mr. Degen, I would like to personally apologise if there was any offense taken from any of my actions, and would also like to apologise on behalf of my colleagues. As I ended my previous post, if Access Copyright or Mr. Degen would like to open a dialogue about why these resolutions were unanimously passed, now is the time to do so.
 

Calling out nonsense - Access Copyright

Inspired by some fearless leaders in our community, this is my Access Copyright story.

This past week, a very interesting series of events unfolded with Access Copyright, or maybe better said, what unfolded was a lesson in how not to engage in open dialogue. I will not be speaking to the text of the resolutions mentioned below, just the events surrounding them.

Last week was the annual Ontario Library Association Super Conference. During Super Conference, each OLA Division has their Annual General Meeting. Among other things, AGMs provide opportunities for resolutions to be put forward by the membership. At this particular AGM, we had two resolutions put forward: 1) A Memorial Resolution Honouring Aaron Swartz (Thanks ALA/LITA!) and 2) OLITA Resolution on Opposition to Access Copyright License Agreements. Standard procedures were followed, the resolutions were moved and seconded, and sent out to the membership in advance.

This past Monday (2013-01-28) something happened. I received an email from Robert Gilbert, New Media and Communication Services, at Access Copyright. This is what they had to say. I will explain why I am making this letter public later.

I addressed the incorrect information in the letter in a reply to the sender, and cc'd recipients:

Dear Mr. Gilbert,

I'd like to clear up some confusion with the resolution. The posted resolution[1] which I assume you have seen or been directed to is a proposed resolution for the Ontario Library and Information Technology Association's (OLITA) Annual General Meeting[2]. It was sent out in advance to membership.

The resolution has been moved and seconded, and will be put before the membership at the meeting for a vote. Prior to the vote, an opportunity will be provided to speak to the motion, ask questions, and propose amendments. If you are or your colleagues are OLITA members, you are more than welcome to participate.

cheers!

-nruest

[1] http://www.accessola2.com/olita/insideolita/wordpress/?p=58235 [2] http://www.accessola2.com/olita/insideolita/wordpress/?p=58053

A day went by, and other than an out-of-office reply, I didn't hear anything in response. I figured we were done.

Nope.

On Wednesday evening (2013-01-30), I received an email from the Executive Director of Access Copyright. I will not publish the entire email, but I was asked to delay the motion, "We don't see how this can properly take place at your AGM. Would you consider delaying the motion until we have the opportunity to meet and begin a dialogue?"

I responded:

Hi XXXXXXX,

I due[sic] hope you understand the weight and merit of what you are asking. You are asking that I forgo a democratic process. This is a resolution that was put forward by a member of our association, and will be discussed as [sic] voted on at our AGM.

As I stated previously, if you or any of your colleagues are OLITA members, you are more then welcome to come and part take in this democratic process. You will be provided every opportunity to speak to the resolution on the table.

Other than that, I will in no way interfere this process as you have suggested.

Regards,

-nruest

I had hoped this was the end of the exchange.

Nope.

The OLITA AGM was Friday evening (2013-02-01). Access Copyright was present at the conference as they had a booth in the exhibitors' hall. During the day, a colleague of mine showed me the letter I mentioned earlier. Somewhat (well really a lot) flabbergasted, I asked where and how they got a copy, assuming the only people to see the aforementioned letter were those that sent it, and those that received it. Nope. Access Copyright decided the best way to engage in an open "dialogue" with me, our association and/or our community was to print off a stack of these letters (in a very classy paper stock!) to hand out at their exhibitor booth.

I fully appreciate, and can understand the rationale behind trying to open up a dialogue. However, Access Copyright tried to circumvent a democratic process, refused to engage in a public dialogue, and tried to misrepresent and embarrass OLITA on the exhibitors’ floor. I find these intimidation tactics unacceptable.

We played fair. We brought no mention of Access Copyright's behaviour to the assembly floor. The resolution went forward with a single friendly amendment, and was passed unanimously. The OLITA membership has spoken. If Access Copyright would like to open a dialogue about why these resolutions were unanimously passed, now is the time to do so.

Islandora Web ARChive Solution Pack

What is it?

The Islandora Web ARChive Solution Pack is yet another Islandora solution pack. This particular solution pack provides the necessary Fedora objects for persisting and disseminating web archive objects: warc files.

What does it do?

Currently, the SP allows a user to upload a warc with an associated MODS form. Once the object is deposited, the associated metadata is displayed along with a download link to the warc file.

You can check out an example here.

Can I get the code?

Of course!

Todo?

If I am doing something obviously wrong, please let me know!

Immediate term:

  1. Incorporate Wayback integration for the DIP. I think this is the best disseminator for the warc files. However, I haven't wrapped my head around how to programmatically provide access to the warc files in the Wayback. I know that I will have two warc objects, an AIP warc and a DIP warc (big thank you to @sbmarks for being a soundboard today!). Fedora will manage the AIP, and Wayback will manage the DIP. Do I iframe the Wayback URI for the object, or link out to it?

  2. Drupal 7 module. Drupal 7 versions of Islandora Solution Packs should be on their way shortly -- next release, I believe. The caveat to using the Drupal 6 version of this module is the mimetype support. It looks like the Drupal 6 API (file_get_mimetype) doesn't pull the correct mimetype for warc files. I should get 'application/warc' but I am getting 'application/octet-stream' -- the fallback default for the API.

Long term:

  1. Incorporate Islandora microservices. What I would really like to do is allow users to automate this entire process. Basically, just say this is a site I would like to archive. This is the frequency at which I would like it archived, with necessary wget options. This is the default metadata profile for it. Then grab the site, ingest it into Fedora, drop the DIP warc into Wayback, and make it all available.

  2. If you have any idea on how to do the above, or how to do it in a better manner, please let me know!

DPLA Appfest Drupal integration

Below is the output of the little project I worked on today at the DPLA Appfest. It definitely isn't a perfect solution to the problem. It is not a drop-in module to just grab a collection from the DPLA API and "curate" it in your library's Drupal site. I hate reinventing the wheel, especially if there are existing modules that can solve the problem for you. Moreover, as one of the few people that still respects what OAI-PMH does, it would be worth considering using DPLA as an OAI-PMH provider. But I'm not sure if that is technically legal in OAI-PMH terms, given that they are most likely harvesting it via OAI-PMH. Don't want to get into an infinite regress of metadata providers. S'up dawg? All jokes aside, I think OAI-PMH would be a better solution than what I tossed together, because it would make harvesting a "set" a hell of a lot easier. My 2¢.

I also have a live demo of it living on my EC2 instance. I've ingested 2000 items from the API, and decided to throw them into a Solr index just to demonstrate the possibilities of what you can do with the ingested content.

Finally, a big giant thank you to DPLA and the Chattanooga Public Library for putting this on, and for the wonderful hospitality. This was absolutely fantastic!

Idea

Drupal module or distribution

Your Name: Nate Hill

Type of app: Drupal CMS

Description of App: Many, many libraries choose to use Drupal as their content management system or as their application development framework. A contrib Drupal module that creates a simple interface for admin users to curate collections of DPLA content for display on a library website would be useful.

Workflow

Preamble:

I don't like recreating the wheel. So, let's see what contrib modules already exist, and see if we can just create a workflow to do this to start with. It would be really nice if DPLA had an OAI-PMH provider; then you could just use CCK + Feeds + Feeds OAI-PMH.

Example: bitly.com/VXMvMr

Requirements:

  • CCK

    drush pm-download cck

  • Feeds

    drush pm-download feeds

  • Feeds - JSON Parser

    drush pm-download feeds_jsonpath_parser
    cd sites/all/modules/feeds_jsonpath_parser && wget http://jsonpath.googlecode.com/files/jsonpath-0.8.1.php

Setup:

  • Create a Content Type for the DPLA content you would like to pull in (admin/content/types/add)
  • Create DPLA metadata fields for the Content Type (admin/content/node-type/YOURCONTENTYPE/fields)
  • Create a new feed importer (admin/build/feeds/create)
  • Configure the settings for your new feed importer
    • Basic settings:
      • Select the Content Type you would like to import into
      • Select a frequency at which you would like Feeds to ingest
    • Fetcher
      • HTTP Fetcher
    • Processor
      • Node processor
      • Select the Content Type you created
      • Mappings (create a mapping for each metadata field you created)
        • Source : jsonpath_parser:0
        • Target : Title
    • Parser
      • JSONPath Parser
      • Settings for JSONPath parser
        • Context: $.docs.*
  • Construct a search you would like to ingest using the DPLA API (you can sanity-check it with the quick curl example after this list)
    • ex: http://api.dp.la/v1/items?dplaContributor=%22Minnesota%20Digital%20Library%22
  • Start the import! (node/add/YOURCONTENTTYPE)
  • Give the import a title... whatever your heart desires.
  • Add a feed url
  • Click on JSONPath Parser settings, and start adding all of the JSONPaths
  • Click save, and watch the import go.
  • Check out your results
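Before wiring a query into the importer, it can help to hit the DPLA API directly and confirm the response has the docs array that the $.docs.* context expects. A quick, hedged check using the example query from the list above (depending on your access, the API may also want an api_key parameter):

# Fetch one page of results and pretty-print it so you can eyeball the 'docs' structure.
curl -s "http://api.dp.la/v1/items?dplaContributor=%22Minnesota%20Digital%20Library%22" | python -m json.tool | head -40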