Searchable portal for the Everquest Archives


Dolalin
12-08-2020, 04:16 PM
Some may be aware that I do a lot of research.

Over the course of the past two years I've collected over 4 million files while dumping Everquest-related websites, deprecated Yahoo Groups mailing lists, and more.

Making these easily searchable has been a constant puzzle, but I've now been able to leverage Azure Cognitive Services and Azure Cognitive Search to create a searchable portal to the EQ archives.

Presenting the Everquest Archives search portal:

https://eqarchives.azurewebsites.net/

So far I have been able to index 335,000 of the 4 million files, including all the Yahoo Groups mailing lists. They are context-searchable, and the metadata fields link back to the original Wayback Machine capture.

Indexing the archives is proving expensive, so I will be soliciting donations at a future time to continue the process. For now, you can enjoy the 335,000 indexed documents, which include the Sony boards, EQ Stratics, Casters Realm, and a few others (alphabetically, the contents of /websites/ in the Git archive up to the letter E).

Let me know what you think!

loramin
12-08-2020, 04:23 PM
https://i.imgur.com/ZeLI8p3.gif

Danth
12-08-2020, 04:28 PM
Funny enough I tried to PM you a couple weeks ago as to whether you could search for some specific things, except I couldn't because you seem to have forum PMs turned off. What a nice looking resource you've built. There are a few items I intend to search for as time and motivation permit.

Danth

Dolalin
12-08-2020, 04:35 PM
I noticed there's a bug on mobile where if you search for something and press the enter button instead of the magnifying glass, your query gets reset to * on the results page. Really annoying; I will try to fix it when I'm able.

For now, if you're on mobile, press the magnifying glass to start your search.

mcoy
12-08-2020, 04:35 PM
This is awesome! Any chance you have Graffe's on your list of sites to index/archive?

-Mcoy

Dolalin
12-08-2020, 04:38 PM
It's in the Git archive but I haven't gotten that far yet with indexing (G is after E, hehe).

It is quite expensive to index this. I need to firm up the pricing from Azure, then I will ask for donations to continue the process.

Izmael
12-08-2020, 04:43 PM
Great initiative.

I can provide mirroring of this if needed, just let me know. Never know what will happen to the main site in a few years.

Dolalin
12-08-2020, 04:55 PM
The hosting is pretty easy, since the site is just a tiny .NET Core app, but it's backed by an Azure Search Service which is... harder to mirror. I'm actually not sure if the data can be exported. :confused:

Dildy
12-08-2020, 04:59 PM
HNGHHHH great work! Can't wait to dive in.

Artelius

Jibartik
12-08-2020, 05:08 PM
https://i.imgur.com/ZeLI8p3.gif

Izmael
12-08-2020, 05:55 PM
Exporting the data would be great though.

Think of the future, 20 or 40 years from now, when we might be gone or simply have lost interest. A new generation of elves would be so grateful if that data was preserved.

If exporting is an option, I'll donate the resources to make it happen.

Dolalin
12-08-2020, 06:02 PM
Agreed, I would like to do it.

Let me see how it could be done and I'll return here with an update.

Dolalin
12-08-2020, 06:16 PM
It looks like they've recently added a way to do this:

https://docs.microsoft.com/en-us/samples/azure-samples/azure-search-dotnet-samples/azure-search-backup-restore-index/

Dolalin
12-09-2020, 06:18 AM
I've open-sourced the search UI. PRs with contributions and improvements are welcome.

https://github.com/dbsanfte/eqarchives-searchui

It's based on a Cognitive Search UI template from Azure. I added some custom UI tweaks to fit it better to the data and to fix some UI bugs.

Izmael
12-09-2020, 07:19 AM
I think there's a misunderstanding.

I was under the impression that you built a mirror / archive of all those sites. If I understand correctly now, you have not - you made a searchable index of those sites, but the original content is actually mirrored nowhere.

Do I understand correctly?

Dolalin
12-09-2020, 08:02 AM
Nope.

The captured content is all archived on Github here, mostly under /websites/:

https://github.com/dbsanfte/eq-archives

What I've done is sync it to an Azure storage account and run the Cognitive Services search indexer against the contents to create a searchable index, then publish a web frontend for that index.

So all the file captures are already saved somewhere public, i.e. in that Github repo. Only the index built so far is locked up in Azure.

Izmael
12-09-2020, 08:44 AM
Got it.

Using Github for that sounds like a pretty smart move.

I don't always agree with your EQ-related opinions but have to admit that you're doing God's work here.

Going to set up a periodic clone job so if Github is blown up by a nuclear warhead, our precious elf-info is preserved.
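
For anyone else who wants to run a similar periodic clone job, here is a minimal sketch of one way to do it. It is not what anyone in this thread actually runs. It assumes `git` and the `git-lfs` extension are installed, that the repo uses Git LFS (the LFS fetch step is only useful in that case), and the destination path and function names are made up for illustration.

```python
import subprocess
from pathlib import Path

# Repo URL as posted in this thread.
REPO_URL = "https://github.com/dbsanfte/eq-archives"


def mirror_commands(repo_url: str, dest: Path) -> list[list[str]]:
    """Return the git commands that create or refresh a bare mirror at dest."""
    if dest.exists():
        # Mirror already cloned: fetch new commits and prune deleted refs.
        cmds = [["git", "-C", str(dest), "remote", "update", "--prune"]]
    else:
        # First run: make a bare mirror containing every branch and tag.
        cmds = [["git", "clone", "--mirror", repo_url, str(dest)]]
    # Assumption: the repo stores large files in Git LFS, so fetch those too.
    cmds.append(["git", "-C", str(dest), "lfs", "fetch", "--all"])
    return cmds


def run_mirror(repo_url: str, dest: Path) -> None:
    """Run each command in order, stopping on the first failure."""
    for cmd in mirror_commands(repo_url, dest):
        subprocess.run(cmd, check=True)


# Example (hypothetical destination path), e.g. called from a daily cron job:
# run_mirror(REPO_URL, Path("/srv/backups/eq-archives.git"))
```

Scheduling is left to cron or whatever job runner you already have; the script only decides between a first clone and an update.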

Dolalin
12-09-2020, 11:57 AM
Funnily enough my laptop is currently chugging away converting it to use Git LFS because the repo has grown so big that GitHub will no longer let me push to it.

Scalem
12-09-2020, 12:44 PM
ELI5 I dumb you use big words.

Izmael
12-09-2020, 03:09 PM
FYI Github LFS has a 1 gigabyte limitation (overall, not per file), then you have to buy more storage.

Dolalin
12-09-2020, 03:16 PM
It's $5/mo for 50GB, so I think I can swing it :D

Unlike the £13,000 overage bill I originally got from my index job before they waived it.
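
For a rough idea of how the LFS cost scales, here is a back-of-the-envelope sketch using only the figures quoted in this thread (1 GB free, then $5/mo per 50 GB). It assumes storage is sold in whole 50 GB packs on top of the free quota and ignores bandwidth; the repo sizes in the example are made up, so check GitHub's current pricing before relying on it.

```python
import math

FREE_GB = 1    # free LFS quota, as quoted in the thread
PACK_GB = 50   # storage per paid pack, as quoted in the thread
PACK_USD = 5   # monthly price per pack, as quoted in the thread


def lfs_monthly_cost(repo_gb: float) -> int:
    """Estimate the monthly LFS storage cost in USD for a repo of repo_gb."""
    # Only the part above the free quota needs paid packs.
    paid_gb = max(0.0, repo_gb - FREE_GB)
    # Assumption: packs are bought whole, so round up.
    packs = math.ceil(paid_gb / PACK_GB)
    return packs * PACK_USD


# Hypothetical repo sizes in GB, for illustration only.
for size in (0.5, 40, 120):
    print(size, "GB ->", lfs_monthly_cost(size), "USD/month")
```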

Dolalin
12-13-2020, 11:44 AM
I've had to remove the search service as part of getting the (now) £15k Azure bill waived, lol. Long story, but they count a 'document' not as a file but as a 1,000-character chunk.
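
To show why that matters, here is a small sketch of the arithmetic, assuming billing works as described above: every started 1,000-character chunk counts as one document. Treating an empty file as one document is a guess, the function name is hypothetical, and the file sizes are invented, so this illustrates the effect rather than Azure's actual billing formula.

```python
import math

CHUNK_CHARS = 1000  # chunk size as described in this thread


def billable_documents(char_counts):
    """Count billable 'documents' if each started 1,000-char chunk is one."""
    # Assumption: a file is never billed as fewer than one document.
    return sum(max(1, math.ceil(n / CHUNK_CHARS)) for n in char_counts)


# Invented character counts for four files, for illustration only.
sizes = [500, 1000, 2500, 12000]
print(len(sizes), "files ->", billable_documents(sizes), "billable documents")
```

With these made-up sizes, four files turn into 17 billable documents, which shows how a file count can badly understate the bill.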

Don't worry though, it will be back. I'm indexing locally with Open Semantic Search, and when it's done (a few weeks at this rate, haha) I will have it back up and the Apache Solr index will be publicly available. Watch this space.