RedRogueXIII's Development Journal: Application Development


Friday, May 9, 2014

Automated HTML Format E-Book Concordance Generator

I started a new programming project: an automatic concordance generator for HTML-based E-Books. Given a list of terms, it searches an HTML document and inserts local hyperlinks wherever it finds each term. Once it is done, it returns a list of all the local hyperlinks as another HTML document, to be used as the starting point for a hand-edited index. I created it to speed up the process of building an index for an E-Book; I found a free program that does this for .PDF files, but not one for .HTML formats.
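To make the idea concrete, here is a minimal Python sketch of that flow. It is not the actual program's code: the function name `build_concordance`, the `idx-term-n` anchor naming, and the regex-based tag splitting are all assumptions made for illustration, and a real tool would probably want a proper HTML parser.

```python
import html
import re


def build_concordance(document, terms):
    """Wrap every term occurrence in a named anchor; return (marked_html, index_html)."""
    # Longest terms first, so "machine gun" would win over "gun".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    hits = {}  # lower-cased term -> list of anchor ids, in document order

    def mark(match):
        key = match.group(1).lower()
        anchors = hits.setdefault(key, [])
        anchor = "idx-{}-{}".format(re.sub(r"\W+", "-", key), len(anchors) + 1)
        anchors.append(anchor)
        return '<a id="{}">{}</a>'.format(anchor, match.group(1))

    # Split into tags and text so that only the text between tags is searched.
    parts = re.split(r"(<[^>]+>)", document)
    marked = "".join(
        part if part.startswith("<") else pattern.sub(mark, part)
        for part in parts
    )

    # Build the index as a plain list of local links, one entry per term.
    lines = ["<ul>"]
    for key in sorted(hits):
        links = ", ".join(
            '<a href="#{}">{}</a>'.format(anchor, number)
            for number, anchor in enumerate(hits[key], 1)
        )
        lines.append("<li>{}: {}</li>".format(html.escape(key), links))
    lines.append("</ul>")
    return marked, "\n".join(lines)
```

Calling `build_concordance(page_html, ["gun", "Thompson"])` returns the marked-up page plus a small `<ul>` index that can then be edited by hand.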

It works as intended at its most basic, but the design needs more refinement. For example, when two consecutive paragraphs contain the same term, the program finds and marks both paragraphs, which is an unintended result. An index entry does not need multiple links to the same page, as that bloats the index and confuses users.

Noted Issues and areas for improvement:

Unnecessary index entries from close proximity terms.
No method of specifying areas to skip concordance generation.
Duplicate search terms with different punctuation get treated as different terms, leading to duplicated results.
Performance is terrible because of the recursive nature of the current search algorithm. I noted a 5 minute average for my 150 page test document. Stripping the header and title HTML (which leaves the body content untouched) brought that down to a 01:40 average, which is 33.3% of the original run time, or a 66.7% reduction. I am confident the performance can be improved further.
Still have not implemented multi-threading in my applications, leading them to appear unresponsive while working on an operation for long periods.
Has no functionality for opening documents with a preexisting index in order to edit, add to, or merge with the existing results.
Ideally would have a graphical user interface where editors could quickly browse the document and manually add index entries alongside those generated by the concordance, and be able to dynamically manage the data.
Ideally would like to have dictionary support, being able to categorize terms as nouns, verbs, singular or plural etc, and then being able to compare them against words of similar meaning and make smarter recognitions.
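Two of the issues above (punctuation duplicates and close-proximity hits) look fixable with a little preprocessing. The Python sketch below is only one way to approach them, not something the current program does: the helper names and the `min_gap` threshold of 2 paragraphs are assumptions picked for illustration.

```python
import string


def normalize_term(term):
    """Lower-case a term, strip surrounding punctuation, collapse inner spaces."""
    stripped = term.strip(string.punctuation + string.whitespace)
    return " ".join(stripped.lower().split())


def unique_terms(terms):
    """Return the normalized terms with duplicates removed, in first-seen order."""
    return list(dict.fromkeys(normalize_term(term) for term in terms))


def drop_nearby_hits(paragraph_numbers, min_gap=2):
    """Keep a hit only if it is at least min_gap paragraphs after the last kept hit."""
    kept = []
    for number in sorted(paragraph_numbers):
        if not kept or number - kept[-1] >= min_gap:
            kept.append(number)
    return kept
```

With `min_gap=2`, a term found in paragraphs 3 and 4 produces one index link instead of two. The right gap probably depends on how long the paragraphs are.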

In short, I am trying to add smarter levels of automation, to save editors time and effort.

A pipe dream would be to create an artificial intelligence capable of doing more than concordance: recognizing the ideas conveyed in the text and creating a 100% machine-generated index rather than a concordance. Gathering the massive raw data needed for word identification and networking, and writing the pattern recognition algorithm, would be huge endeavors in themselves, but not impossible. Perhaps worth some pursuit when time and money are in more abundance.

Monday, March 25, 2013

Dezoomify Downloader Progress

Well, for the last 5 days or so I've been diligently working on the Dezoomify downloading program. It is finally working with all the features I planned for it: batch downloading, user-specified file formats, a user-specified download folder and renaming mask, and most recently the ability to find Zoomify folder links on any given webpage.

However, it doesn't have an updating progress bar or any other visual indicator of download progress, because the application is not multithreaded. Just starting to read the documentation on threading is disorienting, let alone implementing it. I would really like to implement threading so that users who aren't running the program as a debug build in Visual Studio can see the progress of the download and what it is currently doing. But as it stands, I was extremely busy even before starting the project, and I don't know if I have the time or resources to keep developing a feature when I have no idea how long it will take to implement.
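For reference, the general pattern is small: do the downloads on a worker thread and hand progress updates back to the thread that owns the UI. The sketch below shows it in Python rather than C#, and it is not code from the downloader. `fetch` and `on_progress` are hypothetical callbacks standing in for the real download and progress bar code. In .NET the closest ready-made analogue is probably the `BackgroundWorker` class with its `ReportProgress` method.

```python
import queue
import threading


def download_all(urls, fetch, progress):
    """Worker thread body: fetch each URL, then report (done, total)."""
    total = len(urls)
    for done, url in enumerate(urls, 1):
        fetch(url)
        progress.put((done, total))
    progress.put(None)  # sentinel: all work is finished


def run_with_progress(urls, fetch, on_progress):
    """Run the downloads in the background and relay progress to the caller."""
    progress = queue.Queue()
    worker = threading.Thread(
        target=download_all, args=(urls, fetch, progress), daemon=True
    )
    worker.start()
    # This loop stands in for the UI thread. A real GUI would poll the
    # queue from a timer instead of blocking on it.
    while True:
        update = progress.get()
        if update is None:
            break
        on_progress(*update)
    worker.join()
```

The point of the queue is that only the main thread ever touches the progress bar, which is the usual rule in GUI toolkits.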

For now I'm calling it done, since it has served the purpose I created it for, although it's not in the best of shape. Anyways, the source code is freely available: https://github.com/RedRogueXIII/DeZoomify_NET I would like to eventually return to it and add threading, but for now I don't know.

Anyways, at the moment I'm looking for a way to batch compress PNGs losslessly, since I now have 2 GB of high-res images that could be much smaller.

Friday, March 22, 2013

De-Zoomify Image Downloader Origins

As for the story behind why I started this program in the first place, here is some background: I'm an addict for collecting image references of rare guns. One of my hobbies is to create 3D models of guns for use in video games, so it's good to have references on hand for when I want to start a new project.

A 3D model I made of an M1A1 Thompson, a submachine gun used by allied forces in WW2

Cowan's Auction house is currently selling the Richard L. Wray Collection, a collection of old, extremely rare antique machine guns, shown in beautiful high-resolution photography. The trend is that after an auction is done the images disappear into the void, which meant I couldn't just bookmark the page and come back later. So I tried to download them one by one, finding and figuring out how to use the original web-based implementation of Lovasoa's Dezoomify.

Over two hours I got the images for four lots. There are one hundred lots in the auction. So my options are to either take forever doing it manually, or actually use my brain and put together something that will save me tons of time now and any other time I need it later. I'm also on the lookout for a programming job, so having experience and a useful program to show definitely helps.

Anyways for the TL;DR : OM NOM NOM NOM MACHINE GUN REFS.

De-Zoomify Image Downloader

For the last three days I've been writing a program to download images that are served through the web service Zoomify. Zoomify splits high resolution images into tiles and only loads them as needed, so it's a pain to get the whole high resolution image by downloading the tiles manually.
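To give a sense of how many tiles that means, here is a small Python sketch that works out the tile grid for an image and where each tile would be pasted when stitching the full picture back together. It is generic tiling arithmetic, not the downloader's code, and the default tile size of 256 pixels is an assumption that should be checked against the image's own metadata.

```python
import math


def tile_grid(width, height, tile_size=256):
    """Return (columns, rows) of tiles needed to cover the image."""
    return math.ceil(width / tile_size), math.ceil(height / tile_size)


def tile_boxes(width, height, tile_size=256):
    """Yield (col, row, left, top, right, bottom) for every tile.

    Tiles on the right and bottom edges are clipped to the image size.
    """
    cols, rows = tile_grid(width, height, tile_size)
    for row in range(rows):
        for col in range(cols):
            left, top = col * tile_size, row * tile_size
            right = min(left + tile_size, width)
            bottom = min(top + tile_size, height)
            yield col, row, left, top, right, bottom
```

A 1000 by 600 pixel image needs a 4 by 3 grid, so 12 tiles. A multi-thousand-pixel auction photo quickly runs into the hundreds, which is why doing it by hand is so slow.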

The program, which I'm calling DeZoomify Downloader, is being written in C# using Visual Studio Express and is also up on Github: https://github.com/RedRogueXIII/DeZoomify_NET . I've done a smaller .NET application before, but I'm trying to go all out on this one, with preference saving, batch handling, file format options, and a lot of fancy usability options.

At the moment it probably isn't too stable: I know I haven't been sanitizing or checking my inputs, and it's not extensively bug tested. Oh well, it's still not close to having all the features I want it to have, but at least its basic functionality works at the moment, so there is still plenty of time to go back and fix any potential crap-outs it can have. (Major credit goes to Lovasoa and his simpler web-based version, https://gist.github.com/lovasoa/770310 , whose work helped me understand how to get the images in the first place.)

Just today I got batch download working, so there are two major features I would still like to add: finding all zoomify image links on a given webpage, and multithreading the application so the damn progress bar updates while it's downloading and it doesn't do the "program is not responding, uh oh better kill the app" business while it's busy downloading megabytes of information.

I don't know if it's worth going into detail on any of the mechanics of the program so far, seeing as the dezoomify process isn't something I came up with, and almost everything else is handled by the extensive .NET libraries. Also, the entire source code is free to view, download, modify, and run from Github. Perhaps the process of identifying and pulling the zoomify links out of the webpages that are displayed to users would be a nice topic for a blog post. That, or a horror story of me trying to learn multithreading with 50 tabs open of different "First Must Read Articles About Multithreading".
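As a rough idea of what the link-finding step could look like, here is a Python sketch that scans a page's HTML for image folder references and resolves them against the page URL. It is not how the program does it: the assumption here is that the page embeds the viewer with a `zoomifyImagePath=` parameter, and both that parameter name and the function name `find_zoomify_folders` are illustrative. Pages that set up the viewer some other way would need a different pattern.

```python
import re
from urllib.parse import unquote, urljoin

# Assumed embed style: ...zoomifyImagePath=<folder>&otherParam=...
PATH_PATTERN = re.compile(r"""zoomifyImagePath=([^"'&\s<>]+)""")


def find_zoomify_folders(page_html, page_url):
    """Return absolute image folder URLs found in the page, without duplicates."""
    folders = []
    for raw in PATH_PATTERN.findall(page_html):
        # Paths may be URL-encoded and relative to the page they sit on.
        folder = urljoin(page_url, unquote(raw))
        if folder not in folders:
            folders.append(folder)
    return folders
```

Each folder it returns could then be fed to the batch downloader as if the user had typed it in.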

Geez, so much headache just to make an updating progress bar.

I was entertaining the idea of a separate blog just for non game development stuff, but eh, I wouldn't post frequently in either one, so it all goes in one blog.